As companies grow, their workflows become more complex, comprising many processes with intricate dependencies that require increased monitoring, troubleshooting, and maintenance. Without a clear sense of data lineage, accountability issues can arise and operational metadata can be lost. This is where directed acyclic graphs (DAGs), data pipelines, and workflow managers come into play.
Complex workflows can be represented as DAGs: directed graphs in which information travels along edges in one direction only, with no path that loops back to a vertex already visited. The building blocks of DAGs are data pipelines, sequential processes in which the output of one step becomes the input of the next.
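As a minimal illustration of the idea (plain Python, with hypothetical task names), the valid execution order of a small DAG can be computed with the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
deps = {
    "clean": {"extract"},          # clean runs after extract
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields every task after all of its dependencies;
# it raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

This is exactly the guarantee a workflow manager provides at scale: no task runs before its inputs exist, and a cyclic dependency is rejected outright.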
Building these pipelines can be tricky, but fortunately there are several open-source workflow managers that help, allowing programmers to focus on individual tasks and their dependencies. To help you choose among them, we discuss several below.
Luigi is a Python package that was developed by Spotify in 2011 to help build the complex pipelines needed for tasks like generating recommendations and top lists. It is also used by Foursquare, Stripe, the Wall Street Journal, Groupon, and other prominent businesses. It comes with Hadoop support built in, but unlike similar workflow managers Oozie and Azkaban, which were built specifically for Hadoop, Luigi’s philosophy is to make everything as general as possible. This allows it to be extended with other kinds of tasks, such as Hive queries, Spark jobs in Scala or Python, and so on. Luigi is code-based rather than GUI-based or declarative, with everything (including the dependency graph) written in Python. The user interface (UI) allows you to search, filter, and monitor the status of each task. You can also visualize the workflow there to see which tasks on the dependency graph are complete and which have yet to be run.
- Allows you to parallelize workflows as needed
- Toolbox with common task templates
- Supports Python MapReduce jobs in Hadoop, Hive, and Pig
- Includes a file system abstraction for the Hadoop Distributed File System (HDFS) and local files that ensures operations are atomic, so a pipeline cannot crash in a state containing partial data
Azkaban is another open-source workflow manager, created at LinkedIn for time-based dependency scheduling of Hadoop batch jobs. Unlike Luigi, it is written in Java, and scheduling is done through a GUI in a web browser. It consists of three components: the AzkabanWebServer, which serves as the UI and handles project management, authentication, scheduling, and monitoring of executions; a MySQL database for metadata; and the AzkabanExecutorServer. (Earlier versions combined the web server and executor server in a single process, but as Azkaban grew, the two were split so that upgrades could be rolled out to users more easily.) The current version, 3.0, is available in three modes: a trial mode for a single server, a two-server mode for a production environment, and a distributed multiple-executor mode. Azkaban was designed with usability as a primary goal; as such, it offers a particularly easy-to-use UI with excellent visualizations.
- Compatible with any version of Hadoop
- Simple web and HTTP workflow uploads
- Modular and pluggable for each part of the Hadoop ecosystem
- Tracks user actions, authentication, and authorization
- Provides a separate workspace for each new project
- Provides email alerts on SLAs, failures, and successes
- Allows users to retry failed jobs
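For flavor, an Azkaban flow is assembled from simple Java-style property files, one per job, which are zipped and uploaded through the web UI. The job names and commands below are hypothetical:

```properties
# extract.job
type=command
command=python extract.py

# transform.job -- the dependencies key wires up the DAG
type=command
command=python transform.py
dependencies=extract
```

The last `.job` file in a dependency chain implicitly names the flow, so no separate graph definition is needed beyond the `dependencies` keys.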
Like Azkaban, Oozie is an open-source workflow scheduling system written in Java for Hadoop systems. However, Oozie is less focused on usability and more on flexibility and building complex workflows. Whereas Azkaban supports only time-based scheduling through its browser GUI, Oozie’s Coordinator can trigger jobs by time, by event, or by data availability, which handles situations where data arrives unpredictably; jobs can also be scheduled from the command line or the Java API as well as through the web GUI. Oozie’s definitions are written in XML property files, whereas Azkaban’s are Java properties files. Finally, Azkaban keeps the state of all running workflows in memory, whereas Oozie stores it in a SQL database, using memory only for state transactions.
Oozie workflows are arranged as DAGs, with control nodes that define where a workflow starts and stops (plus decision, fork, and join nodes along the execution path) and action nodes that trigger a task’s execution. Each task is assigned a unique callback HTTP URL, which the task notifies when it completes. If the URL is not notified, Oozie polls the task to determine whether it has finished.
- Provides out-of-the-box support for MapReduce, Pig, Hive, Sqoop, and DistCp, as well as system-specific jobs
- Scalable, reliable, and extensible
- Identical workflows can be parameterized to run concurrently
- Allows jobs to be killed, suspended, or resumed in bulk
- High availability
- Multiple coordinator and workflow jobs can be packaged and managed together through Oozie Bundle
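The control-node structure described above can be sketched in a trimmed (and hypothetical) workflow XML; a real definition would also include the action bodies and cluster configuration:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="extract"/>

    <action name="extract">
        <!-- concrete action element (e.g. map-reduce, pig, hive) omitted -->
        <ok to="transform"/>
        <error to="fail"/>
    </action>

    <action name="transform">
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each `<ok>`/`<error>` pair is an edge in the DAG, which is how Oozie expresses complex branching that Azkaban’s flat `dependencies` lists cannot.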
Airflow was created by Airbnb in 2015 for authoring, scheduling, and monitoring workflows as DAGs. It was developed for a programmatic environment with a focus on authoring. Like Luigi, it is Python-based, with workflow DAGs defined as code to make them as collaborative as possible and to ensure they can be easily maintained, versioned, and tested. The architecture consists of job definitions kept in source control; a command line interface for testing, running, and describing the various parts of a DAG; a web application for exploring dependencies, progress, metadata, and logs; a metadata repository; workers that run distributed task instances; and scheduler processes that trigger task instances when they are ready to run.
- Rich CLI and UI that let users visualize dependencies, progress, logs, and related code, and see when tasks complete during the day
- Modular, scalable, and highly extensible
- Built-in support for parameterizing scripts using the Jinja templating engine
- Supports analytics jobs such as search ranking and sessionization, tracking users’ clickstreams and time spent
- Can interact with Hive, Presto, MySQL, HDFS, Postgres, or S3