Skip to main content

Airflow

  • docs: https://airflow.apache.org/docs/apache-airflow/stable/index.html

    Apache Airflow is a platform used to programmatically author, schedule, and monitor workflows. It allows you to define complex workflows as directed acyclic graphs (DAGs) and execute them on a scheduled basis or triggered by events.

  • A DAG is the core concept in Airflow. It's a collection of tasks and their dependencies, organized as a directed acyclic graph. Example:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def build_task():
pass

def test_task():
pass

def dynamic_workflow():
dag = DAG(
f'dynamic_workflow',
start_date=datetime.now(),
schedule_interval=None,
catchup=False,
)

build = PythonOperator(
task_id=f'build',
python_callable=build_task,
op_args=[],
dag=dag,
)

test = PythonOperator(
task_id=f'test',
python_callable=test_task,
op_args=[],
dag=dag,
)

# Set task dependencies
build >> test

return dag

dynamic_workflow()

  • Airflow uses schedulers and executors to manage task execution. The scheduler determines when to execute tasks, while the executor defines how tasks are run.
  • allows dynamic pipelines generation
  • sensors: type of operators designed to wait for something to occur:
  • provides UI UI
  • from Why not in docs:

    Airflow™ was built for finite batch workflows. While the CLI and REST API do allow triggering workflows, Airflow was not built for infinitely running event-based workflows. Airflow is not a streaming solution. However, a streaming system such as Apache Kafka is often seen working together with Apache Airflow. Kafka can be used for ingestion and processing in real-time, event data is written to a storage location, and Airflow periodically starts a workflow processing a batch of data.

CeleryExecutor

Details: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/celery.html

  • it is also possible to use Flower UI

CeleryKubernetesExecutor

allows users to run simultaneously a CeleryExecutor and a KubernetesExecutor. An executor is chosen to run a task based on the task’s queue.

Details: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/celery_kubernetes.html (also see the section when to use)

KubernetesExecutor

  • each task run in its own pod
  • scheduler itself does not necessarily need to be running on Kubernetes, but does need access to a Kubernetes cluster

Details: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html

Deployment