Induction to Airflow
This README is part of a broader tutorial about workflow engines. It details how to set up Apache Airflow in an LXC container on a Proxmox host, secured by an SSH gateway that also acts as an Nginx-based reverse proxy.
For the installation of the Proxmox host and LXC containers themselves, refer to the dedicated tutorial on GitHub, itself a full tutorial on Kubernetes (k8s). Only a summary is given here, focusing on Apache Airflow.
In this section, it is assumed that we are logged on the Proxmox host as `root`.
The following parameters are used in the remainder of the guide, and may be adapted to your configuration:
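* `HST_GTW_IP`: IP address of the routing gateway of the Proxmox host (for example, ending in `.254`)
* `GTW_MAC`: MAC address of the public network interface of the gateway container
* `GTW_IP`: public IP address of the gateway container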
103
VM ID | Private IP | Host name (full) | Short name |
---|---|---|---|
104 | 10.30.2.4 | proxy8.example.com | proxy8 |
200 | 10.30.2.200 | arfl-int.example.com | arfl-int |
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode active-backup

auto vmbr0
iface vmbr0 inet static
    address ${HST_IP}
    netmask 255.255.255.0
    gateway ${HST_GTW_IP}
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

auto vmbr2
iface vmbr2 inet static
    address 10.30.2.2
    netmask 255.255.255.0
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up iptables -t nat -A POSTROUTING -s '10.30.2.0/24' -o vmbr0 -j MASQUERADE
    post-down iptables -t nat -D POSTROUTING -s '10.30.2.0/24' -o vmbr0 -j MASQUERADE
root@proxmox:~$ cat /etc/systemd/network/50-default.network
[Match]
Name=vmbr0

[Network]
Description=network interface on public network, with default route
DHCP=no
Address=${HST_IP}/24
Gateway=${HST_GTW_IP}
IPv6AcceptRA=no
NTP=ntp.ovh.net
DNS=127.0.0.1
DNS=8.8.8.8

[Address]
Address=${HST_IPv6}

[Route]
Destination=2001:0000:0000:34ff:ff:ff:ff:ff
Scope=link
root@proxmox:~$ cat /etc/systemd/network/50-public-interface.link
[Match]
Name=vmbr0

[Link]
Description=network interface on public network, with default route
MACAddressPolicy=persistent
NamePolicy=kernel database onboard slot path mac
#Name=eth0 # name under which this interface is known under OVH rescue system
#Name=eno1 # name under which this interface is probably known by systemd
* The maximum number of memory map areas per process (`vm.max_map_count`) needs to be increased on the host:
```bash
root@proxmox:~$ sysctl -w vm.max_map_count=262144
root@proxmox:~$ cat >> /etc/sysctl.conf << _EOF
###########################
# Elasticsearch in VM
vm.max_map_count = 262144
_EOF
```
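To double-check that the setting was persisted, the value recorded in the configuration file can be read back. The sketch below is only an illustration: the helper name `persisted_sysctl` is hypothetical (it is not part of Proxmox or of `sysctl`), and it assumes the simple `key = value` layout used in `/etc/sysctl.conf`.

```shell
# Hypothetical helper: print the last value assigned to a sysctl key in a
# sysctl.conf-style file (defaults to /etc/sysctl.conf).
persisted_sysctl() {
  key="$1"
  file="${2:-/etc/sysctl.conf}"
  # Strip blanks so that "key = value" and "key=value" are handled alike,
  # then keep only what follows "key=". Comment lines never match.
  # Note: dots in the key act as regex wildcards, which is good enough here.
  sed -n -e 's/[[:space:]]//g' -e "s/^${key}=//p" "$file" | tail -n 1
}

# Usage: compare the persisted value with the one currently in the kernel
#   persisted_sysctl vm.max_map_count
#   cat /proc/sys/vm/max_map_count
```

If the two values differ, the `sysctl -w` command was most likely not run, or the file was edited afterwards.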
root@proxmox:~$ pveam update
root@proxmox:~$ pveam available
root@proxmox:~$ pveam download local centos-8-default_20191016_amd64.tar.xz
root@proxmox:~$ pveam list local
root@proxmox:~$ # Which should give the same result as:
root@proxmox:~$ ls -lFh /var/lib/vz/template/cache
root@proxmox:~$ wget https://us.images.linuxcontainers.org/images/centos/8/amd64/default/20200411_07:08/rootfs.tar.xz -O /var/lib/vz/template/cache/centos-8-default_20200411_amd64.tar.xz
root@proxmox:~$ modprobe overlay && \
cat > /etc/modules-load.d/docker-overlay.conf << _EOF
overlay
_EOF
The `hashsize` parameter of the `nf_conntrack` module should be set to at least 32,768. We can set it to 65,536. However, the Proxmox firewall resets the value every so often to 16,384. The following shows how that parameter may be set when the module is loaded:
root@proxmox:~$ modprobe nf_conntrack hashsize=65536 && \
cat > /etc/modprobe.d/nf_conntrack.conf << _EOF
options nf_conntrack hashsize=65536
_EOF
root@proxmox:~$ # echo "65536" > /sys/module/nf_conntrack/parameters/hashsize
root@proxmox:~$ cat /sys/module/nf_conntrack/parameters/hashsize
65536
root@proxmox:~$ sed -i -e 's|my $hashsize = int($max/4);|my $hashsize = $max;|g' /usr/share/perl5/PVE/Firewall.pm
root@proxmox:~$ systemctl restart pve-firewall.service
root@proxmox:~$ cat /sys/module/nf_conntrack/parameters/hashsize
65536
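Since the firewall may reset the value at any time, it can help to check it periodically (for instance from a cron job). A minimal sketch, assuming a hypothetical helper named `check_hashsize`; the default path and the 32,768 threshold come from the text above, everything else is illustrative:

```shell
# Hypothetical helper: compare the nf_conntrack hash size with a minimum.
# Returns 0 when the value is large enough, 1 when it is too low,
# and 2 when the parameter file cannot be read.
check_hashsize() {
  file="${1:-/sys/module/nf_conntrack/parameters/hashsize}"
  min="${2:-32768}"
  if [ ! -r "$file" ]; then
    echo "UNKNOWN: cannot read $file" >&2
    return 2
  fi
  current="$(cat "$file")"
  if [ "$current" -ge "$min" ]; then
    echo "OK: hashsize=$current"
  else
    echo "LOW: hashsize=$current (expected at least $min)"
    return 1
  fi
}

# Usage on the Proxmox host:
#   check_hashsize
```

The file path is a parameter only so that the function can be tried on a scratch file before being pointed at the real module parameter.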
* The goal is both to set up an SSH gateway and to create an end-point for Airflow (https://airflow.example.com).
* All the traffic from the internet to Airflow is then forced through the gateway (SSH) and/or the reverse proxy (HTTP/HTTPS).
* The IP address of the web end-point is that of the gateway (`GTW_IP`).
* `A` records in the `example.com` domain:
  * `arfl-int.example.com`: `10.30.2.200`
  * `airflow.example.com`: `GTW_IP` (public IP of `proxy8.example.com`, needed for SSL)

root@proxmox:~$ pct create 104 local:vztmpl/centos-8-default_20191016_amd64.tar.xz --arch amd64 --cores 1 --hostname proxy8.example.com --memory 16134 --swap 32268 --net0 name=eth0,bridge=vmbr0,firewall=1,gw=${HST_GTW_IP},hwaddr=${GTW_MAC},ip=${GTW_IP}/32,type=veth --net1 name=eth1,bridge=vmbr2,ip=10.30.2.4/24,type=veth --onboot 1 --ostype centos
root@proxmox:~$ pct resize 104 rootfs 10G
root@proxmox:~$ ls -laFh /var/lib/vz/images/104/vm-104-disk-0.raw
-rw-r----- 1 root root 10G Apr 12 00:00 /var/lib/vz/images/104/vm-104-disk-0.raw
root@proxmox:~$ cat /etc/pve/lxc/104.conf
arch: amd64
cores: 2
hostname: proxy8.example.com
memory: 16134
net0: name=eth0,bridge=vmbr0,firewall=1,gw=${HST_GTW_IP},hwaddr=${GTW_MAC},ip=${GTW_IP}/32,type=veth
net1: name=eth1,bridge=vmbr2,hwaddr=<some-mac-addr>,ip=10.30.2.4/24,type=veth
onboot: 1
ostype: centos
rootfs: local:104/vm-104-disk-0.raw,size=10G
swap: 32268
As of April 2020, the LXC templates for (Fedora and) CentOS 8 do not come with NetworkManager (more specifically, `nmcli` for the tool and `NetworkManager-tui` for the RPM package). To install it, the network first needs to be set up manually. Once NetworkManager has been installed, the network is set up automatically at startup.
ip addr add ${GTW_IP}/5 dev eth0
ip link set eth0 up
ip route add default via ${HST_GTW_IP} dev eth0

ip addr add 10.30.2.4/24 dev eth1
ip link set eth1 up
_EOF
[root@proxy8]# chmod 755 ~/bin/netup.sh
[root@proxy8]# ~/bin/netup.sh
[root@proxy8]# dnf -y upgrade
[root@proxy8]# dnf -y install epel-release
[root@proxy8]# dnf -y install NetworkManager-tui
[root@proxy8]# systemctl start NetworkManager.service \
  && systemctl status NetworkManager.service \
  && systemctl enable NetworkManager.service
[root@proxy8]# nmcli con # to check the name of the connection
[root@proxy8]# nmcli con up "System eth0"
[root@proxy8]# exit
* Complement the installation on the SSH gateway/reverse proxy container.
  For security reasons, it may be a good idea to change the SSH port
  from `22` to, say, `7022`:
```bash
root@proxmox:~$ pct enter 104
[root@proxy8]# dnf -y install hostname rpmconf dnf-utils wget curl net-tools tar
[root@proxy8]# hostnamectl set-hostname proxy8.example.com
[root@proxy8]# dnf -y install htop less screen bzip2 dos2unix man man-pages
[root@proxy8]# dnf -y install sudo whois ftp rsync vim git-all patch mutt
[root@proxy8]# dnf -y install java-11-openjdk-headless
[root@proxy8]# dnf -y install nginx python3-pip
[root@proxy8]# pip-3 install certbot-nginx
[root@proxy8]# rpmconf -a
[root@proxy8]# ln -sf /usr/share/zoneinfo/Europe/Paris /etc/localtime
[root@proxy8]# setenforce 0
[root@proxy8]# dnf -y install openssh-server
[root@proxy8]# sed -i -e 's/#Port 22/Port 7022/g' /etc/ssh/sshd_config
[root@proxy8]# systemctl start sshd.service \
  && systemctl status sshd.service \
  && systemctl enable sshd.service
[root@proxy8]# mkdir ~/.ssh && chmod 700 ~/.ssh
[root@proxy8]# cat > ~/.ssh/authorized_keys << _EOF
ssh-rsa AAAA<Add-Your-own-SSH-public-key>BLAgU first.last@example.com
_EOF
[root@proxy8]# chmod 600 ~/.ssh/authorized_keys
[root@proxy8]# passwd -d root
[root@proxy8]# rpm --import http://wiki.psychotic.ninja/RPM-GPG-KEY-psychotic
[root@proxy8]# rpm -ivh http://packages.psychotic.ninja/7/base/x86_64/RPMS/keychain-2.8.0-3.el7.psychotic.noarch.rpm
```
* If `certbot` is not available in `/usr/local/bin` (from the installation by `pip-3` above), it can be installed by following https://certbot.eff.org/lets-encrypt/centosrhel8-nginx.html
[root@proxy8]# /usr/local/bin/certbot --nginx # and then certbot renew
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.30.2.200 arfl-int.example.com arfl-int
_EOF
* A few handy aliases:
```bash
root@proxy8:~# cat >> ~/.bashrc << _EOF
# Source aliases
if [ -f ~/.bash_aliases ]
then
  . ~/.bash_aliases
fi
_EOF
root@proxy8:~# cat > ~/.bash_aliases << _EOF
# User specific aliases and functions
alias dir='ls -laFh --color'
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
_EOF
root@proxy8:~# . ~/.bashrc
root@proxy8:~# exit
```
* Configure Nginx as a reverse proxy for Airflow:
```bash
root@proxmox:~$ pct enter 104
[root@proxy8]# cat > /etc/nginx/conf.d/tiairflow.conf << _EOF
server {
  server_name airflow.example.com;
  access_log /var/log/nginx/tiairflow.access.log;

  auth_basic "Restricted Access";
  auth_basic_user_file /etc/nginx/.airflow-user;

  location / {
    proxy_set_header Host \$host;
    proxy_set_header X-Real-IP \$remote_addr;
    proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto \$scheme;

    # Fix the "It appears that your reverse proxy set up is broken" error.
    proxy_pass http://10.30.2.200:8080;
    proxy_read_timeout 90;
    proxy_redirect http://10.30.2.200 https://\$host;
  }
}
_EOF
[root@proxy8]# htpasswd -c /etc/nginx/.airflow-user <user-name>
```
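Because the Nginx configuration is written through a here-document, any Nginx variable such as `$host` has to survive shell expansion to reach the file. The following sketch, which uses throwaway files and a made-up shell variable value, contrasts the three possible spellings. The file names and the `expanded-by-shell` value are illustrative assumptions only.

```shell
# A shell variable that happens to share its name with an Nginx variable
host="expanded-by-shell"
tmpdir="$(mktemp -d)"

# 1. Unquoted delimiter, bare dollar sign: the shell substitutes $host
cat > "$tmpdir/unquoted.conf" << _EOF
proxy_set_header Host $host;
_EOF

# 2. Unquoted delimiter, escaped dollar sign: written literally
cat > "$tmpdir/escaped.conf" << _EOF
proxy_set_header Host \$host;
_EOF

# 3. Quoted delimiter: nothing in the body is expanded
cat > "$tmpdir/quoted.conf" << '_EOF'
proxy_set_header Host $host;
_EOF

cat "$tmpdir/unquoted.conf" "$tmpdir/escaped.conf" "$tmpdir/quoted.conf"
```

Only the second and third files contain the literal `$host` that Nginx expects. Quoting the delimiter avoids escaping every variable, at the cost of not being able to inject shell values into the file.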
# Airflow node
* Create the LXC container:
```bash
root@proxmox:~$ pct create 200 local:vztmpl/centos-8-default_20191016_amd64.tar.xz --arch amd64 --cores 2 --hostname arfl-int.example.com --memory 16134 --swap 32268 --net0 name=eth0,bridge=vmbr2,gw=10.30.2.2,ip=10.30.2.200/24,type=veth --onboot 1 --ostype centos
root@proxmox:~$ pct resize 200 rootfs 50G
root@proxmox:~$ ls -laFh /var/lib/vz/images/200/vm-200-disk-0.raw
-rw-r----- 1 root root 50G Dec 19 22:27 /var/lib/vz/images/200/vm-200-disk-0.raw
root@proxmox:~$ cat /etc/pve/lxc/200.conf
arch: amd64
cores: 2
hostname: arfl-int.example.com
memory: 16134
net0: name=eth0,bridge=vmbr2,gw=10.30.2.2,hwaddr=1A:EC:7F:9E:90:34,ip=10.30.2.200/24,type=veth
onboot: 1
ostype: centos
rootfs: local:200/vm-200-disk-0.raw,size=50G
swap: 32268
```
ip addr add 10.30.2.200/24 dev eth0
ip link set eth0 up
ip route add default via 10.30.2.2 dev eth0
_EOF
[root@arfl-int]# chmod 755 ~/bin/netup.sh
[root@arfl-int]# ~/bin/netup.sh # may not be needed
[root@arfl-int]# dnf -y upgrade
[root@arfl-int]# dnf -y install epel-release
[root@arfl-int]# dnf -y install NetworkManager-tui
[root@arfl-int]# systemctl start NetworkManager.service \
  && systemctl status NetworkManager.service \
  && systemctl enable NetworkManager.service
[root@arfl-int]# nmcli con # to check the name of the connection
[root@arfl-int]# nmcli con up "System eth0"
[root@arfl-int]# exit
* Complement the installation:
```bash
root@proxmox:~$ pct enter 200
[root@arfl-int]# dnf -y install hostname rpmconf dnf-utils wget curl net-tools tar
[root@arfl-int]# hostnamectl set-hostname arfl-int.example.com
[root@arfl-int]# dnf -y install htop less screen bzip2 dos2unix man man-pages
[root@arfl-int]# dnf -y install sudo ftp rsync vim git-all patch mutt
[root@arfl-int]# dnf -y install autoconf libtool make gcc gcc-c++ m4
[root@arfl-int]# dnf -y install python2-devel python3-devel
[root@arfl-int]# dnf -y install postgresql
[root@arfl-int]# dnf -y install python3-pip
[root@arfl-int]# rpmconf -a
[root@arfl-int]# ln -sf /usr/share/zoneinfo/Europe/Paris /etc/localtime
[root@arfl-int]# setenforce 0
[root@arfl-int]# dnf -y install openssh-server
[root@arfl-int]# systemctl start sshd.service \
  && systemctl status sshd.service \
  && systemctl enable sshd.service
[root@arfl-int]# mkdir ~/.ssh && chmod 700 ~/.ssh
[root@arfl-int]# cat > ~/.ssh/authorized_keys << _EOF
ssh-rsa AAAA<Add-Your-own-SSH-public-key>BLAgU first.last@example.com
_EOF
[root@arfl-int]# chmod 600 ~/.ssh/authorized_keys
[root@arfl-int]# passwd -d root
[root@arfl-int]# rpm --import http://wiki.psychotic.ninja/RPM-GPG-KEY-psychotic
[root@arfl-int]# rpm -ivh http://packages.psychotic.ninja/7/base/x86_64/RPMS/keychain-2.8.0-3.el7.psychotic.noarch.rpm
[root@arfl-int]# cat > ~/.screenrc << _EOF
hardstatus alwayslastline "%{.kW}%-w%{.B}%n %t%{-}%{=b kw}%?%+w%? %=%c %d/%m/%Y" #B&W & date&time
startup_message off
defscrollback 1024
_EOF
[root@arfl-int]# exit
```
* Create the `airflow` user:
[root@arfl-int]# adduser -m airflow
[root@arfl-int]# cp -a ~/.ssh ~/.bashrc ~/.bash_aliases ~airflow/ && \
  chown -R airflow:airflow ~airflow/
user@laptop$ cat >> ~/.ssh/config << _EOF
# Airflow
Host proxy8
HostName proxy8.example.com
Port 7022
ForwardAgent yes
Host tiafl
HostName arfl-int.example.com
ProxyCommand ssh -W %h:22 root@proxy8
_EOF
Reference: https://www.tecmint.com/install-postgressql-and-pgadmin-in-centos-8/
Reference: https://linuxconfig.org/how-to-install-docker-in-rhel-8
root@proxmox:~$ pct stop 200
root@proxmox:~$ cat >> /etc/pve/lxc/200.conf << _EOF
lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
_EOF
root@proxmox:~$ pct start 200 && pct enter 200
[root@arfl-int]# dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
[root@arfl-int]# dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
[root@arfl-int]# dnf install docker-ce --nobest -y
[root@arfl-int]# systemctl start docker && \
systemctl enable docker && systemctl status docker
[root@arfl-int]# usermod -aG docker airflow
* Install `docker-compose`:
[root@arfl-int]# curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
[root@arfl-int]# chmod +x /usr/local/bin/docker-compose
[root@arfl-int]# docker-compose version
docker-compose version 1.25.5, build 8a1c60f6
docker-py version: 4.1.0
CPython version: 3.7.5
OpenSSL version: OpenSSL 1.1.0l 10 Sep 2019
[airflow@arfl-int]# pip3 install docker-compose --user
[root@arfl-int]# dnf -y install redis
[root@arfl-int]# systemctl start redis && systemctl enable redis && \
systemctl status redis
[root@arfl-int]# python3 -m pip install -U celery[redis]
[root@arfl-int]# psql -h localhost -U postgres -d postgres -c "create database airflow;"
CREATE DATABASE
[root@arfl-int]# psql -h localhost -U postgres -d postgres \
-c "create user airflow with encrypted password '<airflow-pass>'; \
grant all privileges on database airflow to airflow;"
GRANT
[root@arfl-int]# echo "localhost:5432:airflow:airflow:<airflow-pass>" > ~/.pgpass && chmod 600 ~/.pgpass
[root@arfl-int ~]# psql -h localhost -U airflow -d airflow
airflow=> \q
* As the `root` user:
[root@arfl-int]# python3 -m pip install -U pip
[root@arfl-int]# python3 -m pip install -U wheel
* As the `airflow` user:
[root@arfl-int]# su - airflow
[airflow@arfl-int]# python3 -m pip install --user -U pip
[airflow@arfl-int]# python3 -m pip install --user -U wheel
[airflow@arfl-int]# python3 -m pip install -U pyjq
[airflow@arfl-int]# python3 -m pip install -U awsume
[airflow@arfl-int]# python3 -m pip install -U setuptools tox pytest twine sphinx
[airflow@arfl-int]# python3 -m pip install -U cx_oracle
[airflow@arfl-int]# python3 -m pip install -U elasticsearch
[airflow@arfl-int]# python3 -m pip install -U apache-airflow
[airflow@arfl-int]# python3 -m pip install -U typing_extensions pyamqp
[airflow@arfl-int]# python3 -m pip install -U apache-airflow[postgres,celery,redis,pyamqp,sqlalchemy,elasticsearch]
airflow@airflow$ psql -h localhost -U airflow -d airflow -c "select 1 as test"
Password for user airflow:
test
------
1
(1 row)
* Initialize the Airflow database; the first run also generates the configuration file (`~/airflow/airflow.cfg`):
airflow@arfl-int:~# airflow initdb
airflow@arfl-int:~# sed -i -E 's|sqlite:////home/airflow/airflow/airflow.db|postgresql+psycopg2://airflow:<ask-admin>@localhost/airflow|g' ~/airflow/airflow.cfg
airflow@arfl-int:~# sed -i -E 's|rbac = False|rbac = True|g' ~/airflow/airflow.cfg
airflow@arfl-int:~# airflow initdb
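Before editing the real configuration file, the two `sed` substitutions can be rehearsed on a scratch copy. The sketch below assumes a two-key excerpt standing in for `~/airflow/airflow.cfg`; the excerpt and the scratch file are illustrative, and `<ask-admin>` remains a placeholder for the actual password.

```shell
# Scratch file standing in for ~/airflow/airflow.cfg (illustrative excerpt)
cfg="$(mktemp)"
cat > "$cfg" << _EOF
[core]
sql_alchemy_conn = sqlite:////home/airflow/airflow/airflow.db
[webserver]
rbac = False
_EOF

# Same substitutions as for the real file, applied to the scratch copy
sed -i -E 's|sqlite:////home/airflow/airflow/airflow.db|postgresql+psycopg2://airflow:<ask-admin>@localhost/airflow|g' "$cfg"
sed -i -E 's|rbac = False|rbac = True|g' "$cfg"

# Show the two keys after the change
grep -E '^(sql_alchemy_conn|rbac) ' "$cfg"
```

If either key is missing from the `grep` output on the real file, the pattern did not match, for instance because the home directory of the `airflow` user differs from `/home/airflow`.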
airflow@airflow$ psql -h localhost -U airflow -d airflow -c "\dt"
            List of relations
 Schema |      Name       | Type  |  Owner
--------+-----------------+-------+---------
 public | alembic_version | table | airflow
 public | chart           | table | airflow
 public | connection      | table | airflow
 public | dag             | table | airflow
 …      | …               | …     |
 public | users           | table | airflow
 public | variable        | table | airflow
 public | xcom            | table | airflow
airflow@airflow$ airflow create_user --username admin --role Admin \
--email firstname.lastname@example.com \
--firstname FirstName --lastname LastName \
--password <ask-admin>
/etc/systemd/
.
Source: https://github.com/apache/airflow/tree/master/scripts/systemd
root@airflow$ cd /usr/lib/systemd/system
root@airflow$ wget https://github.com/apache/airflow/raw/master/scripts/systemd/airflow-webserver.service
root@airflow$ wget https://github.com/apache/airflow/raw/master/scripts/systemd/airflow-flower.service
root@airflow$ wget https://github.com/apache/airflow/raw/master/scripts/systemd/airflow-scheduler.service
root@airflow$ wget https://github.com/apache/airflow/raw/master/scripts/systemd/airflow-worker.service
root@airflow$ cd /etc/sysconfig
root@airflow$ wget https://github.com/apache/airflow/raw/master/scripts/systemd/airflow
* Create the runtime directory (it does not persist across reboots, as `/run` is dynamically created):
root@airflow$ mkdir -p /run/airflow && chown airflow:airflow /run/airflow
root@airflow$ sed -i -E 's|# AIRFLOW_HOME=|AIRFLOW_HOME=/home/airflow/airflow|g' /etc/sysconfig/airflow
root@airflow$ sed -i -E 's|ExecStart=/bin/airflow|ExecStart=/home/airflow/.local/bin/airflow|g' /usr/lib/systemd/system/airflow-webserver.service
root@airflow$ sed -i -E 's|mysql.service ||g' /usr/lib/systemd/system/airflow-webserver.service
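# Add the following two lines just after Group=airflow (appending them at the end of the file would put them in the [Install] section)
root@airflow$ sed -i -e '/^Group=airflow/a RuntimeDirectory=airflow' -e '/^Group=airflow/a RuntimeDirectoryMode=0775' /usr/lib/systemd/system/airflow-webserver.service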
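The `RuntimeDirectory` settings are service-level options, so they need to sit in the `[Service]` section, which is why they belong right after `Group=airflow`. A minimal sketch, on a scratch file standing in for the real unit, of inserting them at that position with GNU `sed`; the helper name `add_runtime_dir` and the abridged unit content are illustrative assumptions:

```shell
# Hypothetical helper: insert the two RuntimeDirectory lines right after
# the Group=airflow line of the given unit file (GNU sed syntax).
add_runtime_dir() {
  unit="$1"
  # Each "a" command appends one line after the matching line; the
  # appended lines come out in the order the commands are given.
  sed -i \
    -e '/^Group=airflow/a RuntimeDirectory=airflow' \
    -e '/^Group=airflow/a RuntimeDirectoryMode=0775' \
    "$unit"
}

# Scratch unit file (abridged, illustrative content only)
unit_file="$(mktemp)"
cat > "$unit_file" << _EOF
[Service]
User=airflow
Group=airflow
Type=simple

[Install]
WantedBy=multi-user.target
_EOF

add_runtime_dir "$unit_file"
cat "$unit_file"
```

Running the helper twice would insert the lines twice, so it is meant to be applied once to a freshly downloaded unit file.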
root@airflow$ systemctl enable airflow-webserver.service && sudo systemctl start airflow-webserver.service && systemctl status airflow-webserver.service
airflow@airflow$ sudo sed -i -E 's|ExecStart=/bin/airflow|ExecStart=/home/airflow/.local/bin/airflow|g' /usr/lib/systemd/system/airflow-scheduler.service
airflow@airflow$ sudo sed -i -E 's|mysql.service ||g' /usr/lib/systemd/system/airflow-scheduler.service
airflow@airflow$ sudo systemctl enable airflow-scheduler.service && sudo systemctl start airflow-scheduler.service && systemctl status airflow-scheduler.service
airflow@airflow$ sudo sed -i -E 's|ExecStart=/bin/airflow celery worker|ExecStart=/home/airflow/.local/bin/airflow worker|g' /usr/lib/systemd/system/airflow-worker.service
airflow@airflow$ sudo sed -i -E 's|mysql.service ||g' /usr/lib/systemd/system/airflow-worker.service
airflow@airflow$ sudo systemctl enable airflow-worker.service && sudo systemctl start airflow-worker.service && systemctl status airflow-worker.service
airflow@airflow$ sudo sed -i -E 's|ExecStart=/bin/airflow celery flower|ExecStart=/home/airflow/.local/bin/airflow flower|g' /usr/lib/systemd/system/airflow-flower.service
airflow@airflow$ sudo sed -i -E 's|mysql.service ||g' /usr/lib/systemd/system/airflow-flower.service
airflow@airflow$ sudo systemctl enable airflow-flower.service && sudo systemctl start airflow-flower.service && systemctl status airflow-flower.service
[root@arfl-int]# alias getstatusairflow='for svc in airflow-flower airflow-worker airflow-webserver airflow-scheduler; do systemctl status $svc; done'