As companies grow, their workflows become more complex, comprising many processes with intricate dependencies that require increased monitoring, troubleshooting, and maintenance. Without a clear sense of data lineage, accountability issues can arise and operational metadata can be lost. This is where directed acyclic graphs (DAGs), data pipelines, and workflow managers come into play.
Complex workflows can be represented as DAGs: directed graphs in which information travels along edges in one direction only, with no path that loops back to a vertex already visited. The building blocks of DAGs are data pipelines, sequential processes in which the output of one step becomes the input of the next.
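As a minimal illustration of the idea (plain Python, with hypothetical task names), the valid execution order of a small DAG can be computed with the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
deps = {
    "clean": {"extract"},          # clean runs after extract
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields every task after all of its dependencies;
# it raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

This is exactly the guarantee a workflow manager provides at scale: no task runs before its inputs exist, and a cyclic dependency is rejected outright.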
Building these pipelines can be tricky, but fortunately there are several open-source workflow managers that help, allowing programmers to focus on individual tasks and their dependencies. To help you choose among them, we discuss several below.
Luigi is a Python package that was developed by Spotify in 2011 to help build the complex pipelines needed for tasks like generating recommendations and top lists. It is also used by Foursquare, Stripe, the Wall Street Journal, Groupon, and other prominent businesses. It comes with Hadoop support built in, but unlike similar workflow managers Oozie and Azkaban, which were built specifically for Hadoop, Luigi’s philosophy is to make everything as general as possible. This allows it to be extended with other kinds of tasks, such as Hive queries, Spark jobs in Scala or Python, and so on. Luigi is code-based rather than GUI-based or declarative, with everything (including the dependency graph) written in Python. The user interface (UI) allows you to search, filter, and monitor the status of each task. You can also visualize the workflow there to see which tasks on the dependency graph are complete and which have yet to be run.
- Allows you to parallelize workflows as needed
- Toolbox with common task templates
- Supports Python MapReduce jobs in Hadoop, Hive, and Pig
- Includes a file system abstraction for the Hadoop Distributed File System (HDFS) and local files that ensures operations are atomic, so a pipeline cannot crash in a state containing partial data
Azkaban is another open-source workflow manager, created at LinkedIn for time-based dependency scheduling of Hadoop batch jobs. Unlike Luigi, it is written in Java, and scheduling is done through a GUI in a web browser. It consists of three components: the AzkabanWebServer, which serves as the UI and handles project management, authentication, scheduling, and monitoring of executions; a MySQL database for metadata; and the AzkabanExecutorServer. (Earlier versions combined the web server and executor server in a single process, but as Azkaban grew, the two were split so that upgrades could be rolled out to users more easily.) The current version, 3.0, is available in three modes: a trial mode for a single server, a two-server mode for a production environment, and a distributed multiple-executor mode. Azkaban was designed with usability as a primary goal; as such, it offers a particularly easy-to-use UI with excellent visualizations.
- Compatible with any version of Hadoop
- Simple web and HTTP workflow uploads
- Modular and pluggable for each part of the Hadoop ecosystem
- Tracks user actions, authentication, and authorization
- Provides a separate workspace for each new project
- Provides email alerts on SLAs, failures, and successes
- Allows users to retry failed jobs
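For flavor, an Azkaban flow is assembled from simple Java-style property files, one per job, which are zipped and uploaded through the web UI. The job names and commands below are hypothetical:

```properties
# extract.job
type=command
command=python extract.py

# transform.job -- the dependencies key wires up the DAG
type=command
command=python transform.py
dependencies=extract
```

The last `.job` file in a dependency chain implicitly names the flow, so no separate graph definition is needed beyond the `dependencies` keys.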
Like Azkaban, Oozie is an open-source workflow scheduling system written in Java for Hadoop systems. However, Oozie is less focused on usability and more on flexibility and building complex workflows. Whereas Azkaban supports only time-based scheduling through its browser GUI, Oozie’s Coordinator can trigger jobs by time, by event, or by data availability, which handles situations where data arrives unpredictably; jobs can also be scheduled from the command line or the Java API as well as through the web GUI. Oozie’s definitions are written in XML property files, whereas Azkaban’s are Java properties files. Finally, Azkaban keeps the state of all running workflows in memory, whereas Oozie stores it in a SQL database, using memory only for state transactions.
Oozie workflows are arranged as DAGs, with control nodes that define where a workflow starts and stops (plus decision, fork, and join nodes along the execution path) and action nodes that trigger a task’s execution. Each task is assigned a unique callback HTTP URL, which the task notifies when it completes. If the URL is not notified, Oozie polls the task to determine whether it has finished.
- Provides out-of-the-box support for MapReduce, Pig, Hive, Sqoop, and DistCp, as well as system-specific jobs
- Scalable, reliable, and extensible
- Identical workflows can be parameterized to run concurrently
- Allows jobs to be killed, suspended, or resumed in bulk
- High availability
- Multiple coordinator and workflow jobs can be packaged and managed together through Oozie Bundle
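The control-node structure described above can be sketched in a trimmed (and hypothetical) workflow XML; a real definition would also include the action bodies and cluster configuration:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="extract"/>

    <action name="extract">
        <!-- concrete action element (e.g. map-reduce, pig, hive) omitted -->
        <ok to="transform"/>
        <error to="fail"/>
    </action>

    <action name="transform">
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each `<ok>`/`<error>` pair is an edge in the DAG, which is how Oozie expresses complex branching that Azkaban’s flat `dependencies` lists cannot.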
Airflow was created by Airbnb in 2015 for authoring, scheduling, and monitoring workflows as DAGs. It was developed for a programmatic environment with a focus on authoring. Like Luigi, it is Python-based, with workflow DAGs defined as code to make them as collaborative as possible and to ensure they can be easily maintained, versioned, and tested. The architecture consists of job definitions kept in source control; a command line interface for testing, running, and describing the various parts of a DAG; a web application for exploring dependencies, progress, metadata, and logs; a metadata repository; workers that run distributed task instances; and scheduler processes that trigger task instances when they are ready to run.
- Rich CLI and UI that let users visualize dependencies, progress, logs, and related code, and see when tasks complete during the day
- Modular, scalable, and highly extensible
- Built-in support for parameterizing scripts using the Jinja templating engine
- Supports analytics jobs such as search ranking and sessionization, tracking users’ clickstreams and time spent
- Can interact with Hive, Presto, MySQL, HDFS, Postgres, or S3