Airflow


1. What is Airflow?

Airflow is a platform to programmatically author, schedule, and monitor workflows (DAGs).

2. Key Concepts

  • DAG (Directed Acyclic Graph): Defines the workflow structure.

  • Task: A unit of work (e.g., running a Python function).

  • Operator: Defines what kind of work a task does (PythonOperator, BashOperator, etc.).

  • Scheduler: Triggers task execution based on schedule.

  • Web UI: Lets you monitor and manage DAGs.

1️⃣ What Exactly Is Airflow?

Airflow is a workflow orchestrator.

It:

  • Schedules tasks

  • Tracks dependencies

  • Retries on failure

  • Shows status in UI

2️⃣ Core Airflow Concepts (VERY IMPORTANT)

🔹 DAG (Directed Acyclic Graph)

A DAG = workflow.

  • Nodes → tasks

  • Edges → dependencies

  • A DAG never loops

Example:

extract → transform → load
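To make "nodes, edges, never loops" concrete, here is a minimal sketch in plain Python (not Airflow code). It models the example pipeline as a graph and derives a valid run order with the standard library's graphlib. The task names come from the example above; the dictionary layout is only an assumption for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid execution order exists only when the graph has no cycle.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Making extract depend on load closes a loop, so no valid order exists.
cyclic = {**dependencies, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The second graph is rejected because a cycle leaves no task that can safely run first.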

🔹 Task

A task = one unit of work

Examples:

  • Run a Python function

  • Run a shell command

  • Trigger another DAG


🔹 Operator

Operator = template for a task

Examples:

  • BashOperator

  • PythonOperator

  • KubernetesPodOperator

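👉 You usually don't write tasks directly; you instantiate operators, and each operator instance inside a DAG becomes a task. (The @task decorator lets you write a plain Python function, but Airflow still wraps it in an operator for you.)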


🔹 DAG Run vs Task Instance

This confuses everyone initially.

Term          | Meaning
DAG           | Workflow definition
DAG Run       | One execution of the DAG
Task Instance | One task in one DAG run
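A rough way to picture the three terms, sketched in plain Python rather than Airflow's own classes. The TaskInstance dataclass, the "etl" DAG id, and the run ids below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskInstance:
    """One task in one DAG run (illustrative model, not Airflow's class)."""
    dag_id: str
    run_id: str
    task_id: str


dag_id = "etl"                                # the DAG: one workflow definition
task_ids = ["extract", "transform", "load"]   # tasks defined in that DAG
run_ids = ["2024-01-01", "2024-01-02"]        # two DAG runs (hypothetical ids)

# Every DAG run gets its own instance of every task.
task_instances = [
    TaskInstance(dag_id, run_id, task_id)
    for run_id, task_id in product(run_ids, task_ids)
]
print(len(task_instances))
```

Three tasks across two runs give six task instances, each with its own state and logs.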

3️⃣ Airflow Architecture (Understand This Once)

Airflow has 4 main components:

🧠 Scheduler

  • Reads DAGs

  • Decides when tasks should run

  • Puts tasks in queue

If scheduler dies → no new tasks get scheduled (tasks already running may still finish)


👷 Worker

  • Executes the task

  • Runs bash/python/etc.

If worker dies → its running tasks fail and are retried, provided retries are configured


🌐 Webserver

  • UI only

  • Does NOT run tasks

If webserver dies → UI down, jobs still run


🗄️ Metadata DB

  • Stores:

    • DAG state

    • Task status

    • Execution history

If DB dies → Airflow is blind: nothing can read or record state, so scheduling stops
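The division of labour between the four components can be sketched as a toy model in plain Python. This is not how Airflow is implemented; the dict, the deque, and the function names are stand-ins chosen for illustration.

```python
from collections import deque

metadata_db = {}      # stands in for the metadata DB: task_id -> state
task_queue = deque()  # stands in for the queue between scheduler and worker


def scheduler(ready_tasks):
    """Decide what should run and put it on the queue."""
    for task_id in ready_tasks:
        metadata_db[task_id] = "queued"
        task_queue.append(task_id)


def worker(callables):
    """Take tasks off the queue, execute them, and record the outcome."""
    while task_queue:
        task_id = task_queue.popleft()
        metadata_db[task_id] = "running"
        try:
            callables[task_id]()
            metadata_db[task_id] = "success"
        except Exception:
            metadata_db[task_id] = "failed"


def webserver():
    """Read-only view of the recorded state; it never executes anything."""
    return dict(metadata_db)


callables = {
    "hello_task": lambda: print("Hello Airflow"),
    "broken_task": lambda: 1 / 0,  # raises ZeroDivisionError
}
scheduler(["hello_task", "broken_task"])
worker(callables)
status = webserver()
print(status)
```

Note that the webserver function only reads state, which is why the UI going down does not stop jobs.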


4️⃣ Your First Clean DAG (Simple but Correct)

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Defaults applied to every task in this DAG
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="airflow_basics_dag",
    # Fixed start date (example value); dynamic dates such as days_ago() are deprecated
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # renamed to `schedule` in Airflow 2.4+
    catchup=False,
    default_args=default_args,
) as dag:

    # A single task needs no dependency statement
    hello = BashOperator(
        task_id="hello_task",
        bash_command="echo 'Hello Airflow'",
    )

Key things to notice:

  • catchup=False → prevents backfill storm

  • retries → reliability


5️⃣ Scheduling (VERY IMPORTANT)

Schedule  | Meaning
@daily    | Once per day (at midnight)
0 2 * * * | Every day at 2 AM
None      | Manual only

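⚠️ Airflow runs for the previous time window, not “now”. A run starts only after the window it covers has ended, so the @daily run for January 1 starts just after midnight on January 2.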

This is why people say:

“Why is my DAG not running today?”
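The arithmetic behind that question, as a small sketch in plain Python. It assumes a daily schedule and naive datetimes; the helper names are made up for illustration and are not Airflow functions.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)  # what "@daily" amounts to


def earliest_start(window_start):
    """A run covers [window_start, window_start + INTERVAL) and can only
    begin once that window has closed."""
    return window_start + INTERVAL


def latest_completed_window(now):
    """Window covered by the most recent daily run that can have started by `now`."""
    today_midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return today_midnight - INTERVAL, today_midnight


first_start = earliest_start(datetime(2024, 1, 1))
start, end = latest_completed_window(datetime(2024, 1, 2, 9, 30))
print(first_start)
print(start, end)
```

So at 09:30 on January 2 the newest run is the one for January 1. The run covering "today" will not appear until tomorrow.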


6️⃣ Retries, Timeouts & Failure

Retries

retries=3
retry_delay=timedelta(minutes=10)

Timeout

execution_timeout=timedelta(minutes=30)

Failure behavior

  • Task fails → retries

  • Retries exhausted → task fails

  • DAG run marked failed
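The retry sequence can be simulated in plain Python to see how many attempts a given retries value allows. This is a sketch of the behavior, not Airflow's implementation; run_with_retries and the two sample tasks are hypothetical.

```python
def run_with_retries(task, retries):
    """Call `task` up to retries + 1 times; return (final_state, attempts)."""
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        try:
            task()
            return "success", attempts
        except Exception:
            # Airflow would wait retry_delay here before trying again.
            continue
    return "failed", attempts


def always_fails():
    raise RuntimeError("boom")


calls = {"n": 0}


def fails_twice():
    """Fail on the first two calls, then succeed."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("transient error")


result_exhausted = run_with_retries(always_fails, retries=3)
result_recovered = run_with_retries(fails_twice, retries=3)
print(result_exhausted)  # ('failed', 4)
print(result_recovered)  # ('success', 3)
```

With retries=3 a task gets four attempts in total: the first try plus three retries.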


7️⃣ Logs (Where to Look Always)

Every task has logs:

DAG → Task → Logs

Logs show:

  • Command run

  • Error

  • Exit code

👉 80% of Airflow debugging = reading logs.


8️⃣ Catchup & Backfill (Beginner Trap 🚨)

catchup=True

Airflow creates a run for every missed interval since start_date

catchup=False

Airflow runs only the most recent interval

👉 In production:
Always catchup=False unless you really know why


9️⃣ XCom (Data Passing)

XCom = small data sharing between tasks.

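def task1_callable():
    # The return value of a PythonOperator callable is pushed to XCom automatically
    return "hello"

Then, in a templated field of a downstream task:

{{ ti.xcom_pull(task_ids='task1') }}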

⚠️ Not for large data.


10️⃣ Common Beginner Mistakes (Avoid These)

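❌ start_date = datetime.now() (the date moves on every parse, so the first interval never closes)
❌ missing retries
❌ long-running, heavy work inside PythonOperator (it occupies a worker slot the whole time)
❌ assuming the webserver runs tasks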


🧠 Pause & Absorb

Before moving forward, you should be able to answer:
1️⃣ What does the scheduler do?
2️⃣ What is the difference between a DAG run and a task instance?
3️⃣ Why is catchup=False important?