
Setting Up Apache Airflow with Docker-Compose in 5 Minutes | by Marvin Lanhenke | May 2022


Create a development environment and start building DAGs

Photo by Fabio Ballasina on Unsplash

Although being fairly late to the get together (Airflow grew to become an Apache High-Degree Challenge in 2019), I nonetheless had hassle discovering an easy-to-understand, up-to-date, and light-weight resolution to putting in Airflow.

Today, we're about to change all that.

In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.

Docker-Compose will be our close companion here, allowing us to create a smooth development workflow with quick iteration cycles. Simply spin up a few Docker containers and we can start building our own workflows.

Note: The following setup is not suitable for production purposes and is intended for use in a development environment only.

Apache Airflow is a batch-oriented framework that lets us easily build scheduled data pipelines in Python. Think of it as "workflow as code", capable of executing any operation we can implement in Python.

Airflow is not a data processing tool itself but a piece of orchestration software. We can picture Airflow as a spider in a web: sitting in the middle, pulling all the strings, and coordinating the workload of our data pipelines.

A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a pipeline as a DAG (directed acyclic graph): a graph of tasks with directed edges and no loops or cycles.

A simple example DAG [Image by Author]

This approach allows us to run independent tasks in parallel, saving time and money. Moreover, we can split a data pipeline into several smaller tasks. If a task fails, we only rerun the failed task and its downstream tasks, instead of executing the complete workflow all over again.
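To make this concrete, here is a minimal sketch of how such a branching pipeline is expressed in code. It is not taken from the original article — the dag_id and task names are made up — and it uses the EmptyOperator, which is available from Airflow 2.3 onwards (older versions use the DummyOperator instead). One upstream task fans out into two independent tasks that can run in parallel before a final join:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A tiny branching DAG: fetch -> [clean, validate] -> report
with DAG(
    dag_id="example_branching_pipeline",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = EmptyOperator(task_id="fetch_data")
    clean = EmptyOperator(task_id="clean_data")
    validate = EmptyOperator(task_id="validate_data")
    report = EmptyOperator(task_id="create_report")

    # clean_data and validate_data are independent and can run in parallel
    fetch >> [clean, validate] >> report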

Airflow consists of three main components:

  1. Airflow Scheduler — the "heart" of Airflow, which parses the DAGs, checks the scheduled intervals, and passes the tasks over to the workers.
  2. Airflow Worker — picks up the tasks and actually performs the work.
  3. Airflow Webserver — provides the main user interface to visualize and monitor the DAGs and their results.
A high-level overview of Airflow components [Image by Author]

Now that we have briefly introduced Apache Airflow, it's time to get started.

Step 0: Prerequisites

Since we will use docker-compose to get Airflow up and running, we have to install Docker first. Simply head over to the official Docker website and download the appropriate installation file for your OS.
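As a quick sanity check (not part of the original article), we can confirm that both Docker and the Compose plugin are installed:

docker --version
docker compose version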

Step 1: Create a new folder

We start nice and slow by simply creating a new folder for Airflow.

Just navigate via your preferred terminal to a directory, create a new folder, and change into it by running:

mkdir airflow
cd airflow

Step 2: Create a docker-compose file

Next, we need to get our hands on a docker-compose file that specifies the required services or Docker containers.

Via the terminal, we can run the following command inside the newly created Airflow folder

curl https://raw.githubusercontent.com/marvinlanhenke/Airflow/main/01GettingStarted/docker-compose.yml -o docker-compose.yml

or simply create a new file named docker-compose.yml and copy the content below.
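Since the embedded gist is not reproduced here, the following is a minimal sketch of what such a docker-compose.yml can look like — the Airflow image tag, container names, and volume layout are assumptions and may differ from the file linked above:

version: '3.8'

# Common settings shared by all Airflow services (see the note on local variables below)
x-airflow-common: &airflow-common
  image: apache/airflow:2.3.0   # assumed version; any recent 2.x image should work
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  depends_on:
    postgres:
      condition: service_healthy

services:
  # Metadata database
  postgres:
    image: postgres:13
    container_name: postgres
    env_file:
      - .env
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5

  # One-off job that migrates the database and creates the web UI user
  airflow-init:
    <<: *airflow-common
    container_name: airflow-init
    environment:
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
    command: version

  airflow-webserver:
    <<: *airflow-common
    container_name: airflow-webserver
    command: webserver
    ports:
      - "8080:8080"
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    container_name: airflow-scheduler
    command: scheduler
    restart: always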

The above docker-compose file simply specifies the services we need to get Airflow up and running: most importantly the scheduler, the webserver, the metadata database (PostgreSQL), and the airflow-init job that initializes the database.

At the top of the file, we make use of some local variables that are commonly used in every Docker container or service.

Step 3: Environment variables

We have successfully created a docker-compose file with the mandatory services inside. However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.

Still inside your Airflow folder, create a .env file with the following content:
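The original file is again embedded as a gist; a minimal sketch of what the .env can contain, with values chosen to match the docker-compose sketch above (the exact variable names and values are assumptions, not the author's file):

# Meta-database credentials
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow core configuration
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__LOAD_EXAMPLES=False

# Credentials for the default web UI user (created by the airflow-init job)
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow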

The above variables set the database credentials, the airflow user, and some further configurations.

Most importantly, they define the kind of executor Airflow will utilize. In our case, we make use of the LocalExecutor.

Note: More information on the different kinds of executors can be found here.

Step 4: Run docker-compose

And that is already it!

Just head over to the terminal and spin up all the necessary containers by running

docker compose up -d

After a short period of time, we can check the results and visit the Airflow Web UI at http://localhost:8080. Once we sign in with our credentials (airflow/airflow), we gain access to the user interface.

Airflow 2 Web UI [Screenshot by Author]
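If the UI is not reachable right away, an additional check (not part of the original article) is to verify that all containers are running and that the init job has completed:

docker compose ps
docker compose logs airflow-init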

With a working Airflow environment, we can now create a simple DAG for testing purposes.

First of all, make sure to run pip install apache-airflow to install the required Python modules.

Now, inside your Airflow folder, navigate to dags and create a new file called sample_dag.py.

We define a new DAG and some pretty simple tasks.
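The original sample_dag.py is embedded as a gist; the following is a hedged reconstruction based on the description below — the dag_id, task names, and start date are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sample_dag",
    start_date=datetime(2022, 5, 1),
    schedule_interval=None,  # run only when triggered manually
    catchup=False,
) as dag:
    # Placeholder task that only shows up in the Web UI
    start = EmptyOperator(task_id="start")

    # Echo "HelloWorld!"; the last line of stdout is pushed to XCom by default
    say_hello = BashOperator(
        task_id="say_hello_world",
        bash_command='echo "HelloWorld!"',
    )

    start >> say_hello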

The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of "HelloWorld!". This allows us to visually confirm a properly running Airflow setup.

Save the file and head over to the Web UI. We can now start the DAG by triggering it manually.

Manually triggering a DAG [Screenshot by Author]
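Alternatively, we can trigger the DAG from the command line — an extra option not shown in the original article, assuming the scheduler container is named airflow-scheduler as in the compose sketch above and the dag_id is sample_dag:

docker exec -it --user airflow airflow-scheduler bash -c "airflow dags trigger sample_dag"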

Note: It may take a while before your DAG appears in the UI. We can speed things up by running the following command in our terminal: docker exec -it --user airflow airflow-scheduler bash -c "airflow dags list"

Running the DAG shouldn't take longer than a couple of seconds.

Once it has finished, we can navigate to XComs and inspect the output.

Navigating to Airflow XComs [Screenshot by Author]
Inspecting the output [Screenshot by Author]

And that is it!

We successfully installed Airflow with docker-compose and gave it a quick test ride.

Note: We can stop the running containers by simply executing docker compose down.

Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.

In this article, we created a simple and easy-to-use environment in which to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose, we can get straight to work and code new workflows.

However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.

You can find the full code here on my GitHub.
