Airflow
1. What is Airflow?
Airflow is a platform to programmatically author, schedule, and monitor workflows (DAGs).
2. Key Concepts
DAG (Directed Acyclic Graph): Defines the workflow structure.
Task: A unit of work (e.g., running a Python function).
Operator: Defines what kind of work a task does (PythonOperator, BashOperator, etc.).
Scheduler: Triggers task execution based on schedule.
Web UI: Lets you monitor and manage DAGs.
What Exactly Is Airflow?
Airflow is a workflow orchestrator.
It:
Schedules tasks
Tracks dependencies
Retries on failure
Shows status in UI
2️⃣ Core Airflow Concepts (VERY IMPORTANT)
🔹 DAG (Directed Acyclic Graph)
A DAG = workflow.
Nodes → tasks
Edges → dependencies
A DAG never loops
Example:
extract → transform → load
🔹 Task
A task = one unit of work
Examples:
Run a Python function
Run a shell command
Trigger another DAG
🔹 Operator
Operator = template for a task
Examples:
BashOperatorPythonOperatorKubernetesPodOperator
👉 You never write tasks directly, you instantiate operators.
🔹 DAG Run vs Task Instance
This confuses everyone initially.
| Term | Meaning |
| DAG | Workflow definition |
| DAG Run | One execution of the DAG |
| Task Instance | One task in one DAG run |
3️⃣ Airflow Architecture (Understand This Once)
Airflow has 4 main components:
🧠 Scheduler
Reads DAGs
Decides when tasks should run
Puts tasks in queue
If scheduler dies → nothing runs
👷 Worker
Executes the task
Runs bash/python/etc.
If worker dies → task retries
🌐 Webserver
UI only
Does NOT run tasks
If webserver dies → UI down, jobs still run
🗄️ Metadata DB
Stores:
DAG state
Task status
Execution history
If DB dies → Airflow is blind
4️⃣ Your First Clean DAG (Simple but Correct)
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
default_args = {
"retries": 2,
"retry_delay": timedelta(minutes=5),
}
with DAG(
dag_id="airflow_basics_dag",
start_date=days_ago(1),
schedule_interval="@daily",
catchup=False,
default_args=default_args,
) as dag:
hello = BashOperator(
task_id="hello_task",
bash_command="echo 'Hello Airflow'"
)
hello
Key things to notice:
catchup=False→ prevents backfill stormretries→ reliability
5️⃣ Scheduling (VERY IMPORTANT)
| Schedule | Meaning |
@daily | Once per day |
0 2 * * * | At 2 AM |
None | Manual only |
⚠️ Airflow runs for previous time window, not “now”.
This is why people say:
“Why is my DAG not running today?”
6️⃣ Retries, Timeouts & Failure
Retries
retries=3
retry_delay=timedelta(minutes=10)
Timeout
execution_timeout=timedelta(minutes=30)
Failure behavior
Task fails → retries
Retries exhausted → task fails
DAG marked failed
7️⃣ Logs (Where to Look Always)
Every task has logs:
DAG → Task → Logs
Logs show:
Command run
Error
Exit code
👉 80% of Airflow debugging = reading logs.
8️⃣ Catchup & Backfill (Beginner Trap 🚨)
catchup=True
Airflow runs all past dates since start_date
catchup=False
Only current execution
👉 In production:
Always catchup=False unless you really know why
9️⃣ XCom (Data Passing)
XCom = small data sharing between tasks.
return "hello"
Then:
{{ ti.xcom_pull(task_ids='task1') }}
⚠️ Not for large data.
10️⃣ Common Beginner Mistakes (Avoid These)
❌ start_date = datetime.now()
❌ missing retries
❌ long tasks in PythonOperator
❌ assuming webserver runs tasks
🧠 Pause & Absorb
Before moving forward, you should be able to answer:
1️⃣ What does scheduler do?
2️⃣ Difference between DAG run & task instance?
3️⃣ Why catchup=False is important?
