A deep dive into Apache Airflow's architecture and the way it orchestrates workflows
Apache Airflow is among the most commonly used frameworks when it comes to pipeline scheduling and execution, and it has been gaining enormous traction across the Data Engineering community over the past few years.
The technology itself consists of many different components that work together in order to perform certain operations. In today's article we will be discussing the overall architecture of Apache Airflow and understand how the different components interact with one another in order to bring data pipelines to life.
Airflow lets you build, schedule and execute workflows. Every workflow is represented as a Directed Acyclic Graph (DAG) that consists of tasks. These tasks perform certain operations and may also have dependencies between them.
For example, let's suppose we want to ingest data from an external database into Google Cloud BigQuery (which is a managed data warehouse service) and then perform a couple of transformation steps. In that case, we would construct a DAG consisting of two tasks; the first task would be responsible for copying data from the external database into BigQuery and the second would perform the transformation steps. Additionally, the second task would depend on the successful execution of the first task.
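As a rough illustration, a minimal DAG for this scenario could look like the sketch below. The DAG name, task IDs and operator choices are illustrative assumptions rather than a complete, production-ready pipeline; in practice you would use the appropriate transfer and BigQuery operators instead of plain bash placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch: two tasks, where the transformation step only runs
# once the ingestion step has completed successfully.
with DAG(
    dag_id="external_db_to_bigquery",  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder for the task that copies data into BigQuery
    ingest = BashOperator(
        task_id="ingest_to_bigquery",
        bash_command="echo 'copy data from the external database into BigQuery'",
    )

    # Placeholder for the downstream transformation step
    transform = BashOperator(
        task_id="transform_in_bigquery",
        bash_command="echo 'run transformation queries'",
    )

    # The dependency: transform runs only after ingest succeeds
    ingest >> transform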
Now, Airflow itself consists of several different components that perform certain operations and work together in order to let users design, build, schedule and execute workflows. In the next few sections we will go through Airflow's architecture and discuss some of its most important components.
Scheduler
The scheduler primarily performs two specific tasks; it schedules and triggers workflows, and it also submits all of the scheduled tasks to the executor.
In order to figure out whether any tasks can be triggered, the scheduler needs to run a subprocess that is responsible for monitoring the DAG folder (which is the folder that all Python files containing the DAGs are supposed to live in). By default, the scheduler will perform this lookup once per minute, but this can be adjusted in the Airflow configuration file, as shown below.
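For instance, the interval at which the DAG folder is re-scanned for new files can be tuned in the scheduler section of airflow.cfg. The value below is just an illustrative choice, and the exact option defaults depend on your Airflow version:

[scheduler]
# how often (in seconds) to scan the DAG folder for new files
dag_dir_list_interval = 60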
The scheduler then uses the executor to run the tasks that are ready.
Executor
The executor is responsible for running tasks. An Airflow installation can have only one executor at any given time. The executor is defined in the [core] section of the Airflow configuration file (airflow.cfg). For example,
[core]
executor = KubernetesExecutor
Note that there are essentially two types of executors; local and remote. The local executors include the DebugExecutor, LocalExecutor and SequentialExecutor. Some of the remote executors are the CeleryExecutor and the KubernetesExecutor.
Local executors run tasks locally, inside the scheduler's process. On the other hand, remote executors execute their tasks remotely (e.g. in a pod within a Kubernetes cluster), usually with the use of a pool of workers.
Queue
Once the scheduler identifies tasks that can be triggered, it will push them into a Task Queue in the exact order in which they are supposed to be executed.
The Airflow workers will then pull the tasks from the queue in order to execute them.
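When a remote executor such as the CeleryExecutor is used, this task queue is backed by a message broker. As a purely illustrative sketch, the broker and result backend could be configured in airflow.cfg along these lines (the host names and credentials below are made-up placeholders):

[celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres/airflow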
Workers
Airflow workers are used to execute the tasks assigned to them.
Metadata Database
This is a database used by the executor, the scheduler and the web server to store their state. By default, a SQLite database will be spun up, but Airflow can use any database supported by SQLAlchemy as its metadata database. In general though, users tend to have a strong preference towards Postgres.
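For example, pointing Airflow at a Postgres metadata database essentially boils down to setting the SQLAlchemy connection string in the configuration file. The connection details below are placeholder values, and depending on your Airflow version this option lives under the [core] or the [database] section:

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow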
Web Server
This is a Flask web server that exposes a user interface which lets users manage, debug and inspect a workflow and its tasks.
The diagram below illustrates how the different components we discussed earlier interact with one another and exchange data and messages in order to perform certain tasks.
The Web Server mainly needs to communicate with the workers, the DAG folder and the Metadata Database in order to fetch task execution logs, the DAG structure and the status of tasks respectively.
The workers need to communicate with the DAG folder in order to infer the structure of the DAGs and execute their tasks, and also with the Metadata Database in order to read and store information about connections, variables and XComs.
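To give a quick, hedged illustration of that last point, tasks typically exchange small pieces of state through XComs, which are persisted in the Metadata Database. The sketch below uses the TaskFlow API with made-up task names and values:

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def xcom_example():
    @task
    def produce():
        # The return value is stored as an XCom in the metadata database
        return {"rows_copied": 42}

    @task
    def consume(payload):
        # The worker pulls the value back from the metadata database
        print(f"Previous task copied {payload['rows_copied']} rows")

    consume(produce())

xcom_example()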
The scheduler has to communicate with the DAG folder to infer the structure of the DAGs and schedule their tasks, and with the Metadata Database in order to write information regarding DAG runs and related tasks. Additionally, it needs to communicate with the Task Queue, into which it will be pushing the tasks that are ready to be triggered.
Airflow is a powerful tool that lets users (mostly engineers) design, build, debug, schedule and execute various workflows. In today's tutorial we went through some of the most important components of Airflow that work together in order to perform certain operations.
It is always important to take the time to understand the basic architecture of a framework or tool before starting to use it. Most of the time, this will help you write more meaningful and efficient code.