Data engineers are often responsible for the planning, building and maintaining of data pipelines. During that process, they may face the challenging decision to either create a custom system or use an existing framework. In this post, I’ll address some of the common pitfalls one may encounter when evaluating such an important decision, as well as the ongoing consequences of each choice.
Workflow Framework vs Completely Custom
The decision to adopt a workflow management system or build one from scratch is daunting. The data engineering ecosystem has many frameworks and services available for building pipelines. You can expect these systems to offer a wide range of features, including robustness to occasional outages, real-time handling of large amounts of data, meaningful metrics, and much more. Some popular choices include Apache Airflow, Luigi, and my company’s new platform, Prefect.
The advantage of using an existing framework is that the effort required to create new workflows becomes dramatically lower. Most of the code is already written! You can incur enormous technical debt trying to replicate that in a custom system, especially because you’ll need to educate your colleagues (and we’ll talk about documentation in a minute!). However, if the framework doesn’t do exactly what you need, then you’ll probably end up spending a lot of time integrating it with your infrastructure. This tradeoff — would you rather code now, or code later? — is a critical part of the framework versus custom decision.
Sometimes, a relatively minor detail can be the difference between a framework working for you or not. For example, you might be really excited about a product’s headline features, like a UI or user permissions, but even the most beautiful dashboard won’t help you if the framework simply doesn’t support the type of distributed computations you need to run, or doesn’t know how to talk to your data warehouse.
An immediate question, therefore, is what infrastructure and resources do you require — and how hard is it for a framework to address them? Are you using Kubernetes, Mesos, Swarm, or another container orchestration system? Does your pipeline run on AWS, Google Cloud, Azure, or another cloud provider? Many of the popular data engineering frameworks were originally created by a single company to solve that company’s own infrastructure challenges, and may not be immediately applicable to the data challenges you need to solve.
Flexibility in a data pipeline is crucial. The pipeline needs to be built so that both major and minor changes can be made efficiently.
A common requirement is adding a step to the pipeline that performs a new transformation: a seemingly simple request, but one that many systems can’t easily accommodate. Some tightly couple tasks in a way that makes it impossible to add new tasks. Others have no way to migrate to a new pipeline “version” in a way that keeps history — and API endpoints — intact. Yet others are “too” flexible, allowing changes at any time and providing no guarantee that the same pipeline is running on all workers. All of this means that users must often resort to deleting the old pipeline and uploading a completely new one, just to add one new task. If you think your pipelines will evolve over time, then this is an important consideration!
Another, more drastic, example would be changing the data warehouse. In some cases, this could require a total pipeline redesign. However, a flexible pipeline would be designed to abstract the database logic into a separate set of tasks from the core data transformations. This would minimize the impact of changing those tasks. Typically, this is only possible in frameworks that have first-class support for dataflow.
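As a minimal sketch of that separation (not tied to any particular framework), the transformation below knows nothing about storage, and a single `load` task is the only code that touches the warehouse. The `Warehouse` interface, the `InMemoryWarehouse` stand-in, the `payments` table, and the record fields are all hypothetical names chosen for illustration. The type hints assume Python 3.9 or newer.

```python
from typing import Iterable, Protocol


class Warehouse(Protocol):
    """Hypothetical interface: anything that can persist rows to a table."""

    def write(self, table: str, rows: list[dict]) -> None: ...


class InMemoryWarehouse:
    """Illustrative stand-in for a real warehouse client."""

    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}

    def write(self, table: str, rows: list[dict]) -> None:
        self.tables.setdefault(table, []).extend(rows)


def transform(records: Iterable[dict]) -> list[dict]:
    """Core transformation: has no knowledge of where the data is stored."""
    return [
        {"user": r["user"].strip().lower(), "amount_cents": round(r["amount"] * 100)}
        for r in records
    ]


def load(rows: list[dict], warehouse: Warehouse, table: str) -> None:
    """Storage task: the only place that talks to the warehouse."""
    warehouse.write(table, rows)


def run_pipeline(records: Iterable[dict], warehouse: Warehouse) -> None:
    """Wire the tasks together; swapping warehouses only changes the argument."""
    load(transform(records), warehouse, table="payments")
```

Under these assumptions, moving to a different warehouse means writing one new class with a `write` method; `transform` stays untouched.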
It helps to think of “flexibility” as a system of building blocks that are connected with common methodologies. This allows each chunk of the pipeline to easily change as development progresses without requiring much (or any) change in any other module.
When designing a pipeline with flexibility in mind it helps to refer to the SOLID principles of software design: https://en.wikipedia.org/wiki/SOLID
Imagine a scenario where you plan out a massive pipeline, choose the latest trendy tech to build it with, and though you encounter a few small hiccups along the way, you’re finally there: your awesome pipeline is ingesting real-time data and running smoothly in production.
What happens now? You’re stuck maintaining it!
Even if your job description states that you build workflows, you’ll be spending a large chunk of your time maintaining them. It sure would be nice if you could just provide the infrastructure, define your pipeline, and let someone else take over the maintenance… but if you build a custom system, you need to be prepared for this reality.
I dealt with this problem over and over in my previous jobs. No software engineer should be surprised that they’re responsible for their work, but debugging distributed data pipelines can be a full-time job in and of itself — and that’s assuming you bothered to add debug hooks to your custom system! (You did, right?) Pipelines are never fire-and-forget projects, so you need to remember this when you’re planning your approach.
Infrastructure is increasingly available from large cloud providers, but I see many people focusing on code and ignoring maintenance. Unfortunately, ignoring it doesn’t make it go away.
In a perfect world, pipelines and workflows would be built with perfect knowledge about how much data they will consume. That would be great because you could plan all your resources ahead of time. However, it’s often not the case.
Most pipelines you build will probably consume and output a fluctuating amount of data, so you had better be prepared to handle it. Let’s say one of your tasks is slower than the others, and along comes a large amount of data. All the steps before the slow one execute quickly, and the data gets clogged up at the slow step, creating a bottleneck in the pipeline.
What do you do now? Obviously, it would be irresponsible and poor design to just wait for that step to complete its backlog of computation. This is where a scaling strategy needs to be planned out and, preferably, automated (so as to avoid yet another maintenance burden). One way to do this is to add checks that monitor how much data is queued for that step and scale its resources accordingly.
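One way such a check might look is sketched below. It is a simplified illustration, not a production autoscaler: the function name, the parameters, and the default worker limits are assumptions, and a real system would also need to measure throughput and apply the result through its orchestrator.

```python
import math


def desired_workers(
    queued_items: int,
    items_per_worker_per_min: float,
    target_drain_minutes: float,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Return how many workers the slow step needs to drain its backlog in time.

    All parameter names and defaults are illustrative assumptions.
    """
    if items_per_worker_per_min <= 0 or target_drain_minutes <= 0:
        raise ValueError("throughput and drain target must be positive")
    # How many items one worker can clear within the target window.
    capacity_per_worker = items_per_worker_per_min * target_drain_minutes
    needed = math.ceil(queued_items / capacity_per_worker)
    # Clamp the result: a spike cannot request unbounded resources, and a
    # quiet period never scales the step below the minimum.
    return max(min_workers, min(max_workers, needed))
```

A scheduler could call this periodically with the current queue depth and resize the worker pool for the slow step whenever the result changes.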
Cluster auto-scaling and web server load balancing are life savers: use them. A delay in data delivery is a critical error in pipeline development, especially when that pipeline is used in a production environment.
If you build a custom pipeline, then you’re going to need to support it. There are no online documentation examples or previously answered Stack Overflow questions for other developers on your team to take advantage of. This becomes an even bigger roadblock when new engineers join your team. Fortunately, there’s a simple remedy: just write extensive documentation, tutorials, and examples when building your custom framework!
I’m joking, but I’m serious.
You’re promising your team that your framework will support their needs. How can you do that if they have no way to learn how to use it effectively?
The best time to write documentation was yesterday. Failing that, do it contemporaneously. Keeping a running repository of up-to-date docstrings, best practices, and examples will pay off significantly in the long run.
Here’s a scenario I’m sorry to say I’ve seen before:
New engineer: What do all these lambda functions in the workflow initialization do?
Don’t be definitely not me.
This point is an offshoot of the “flexibility” section. When building a custom data pipeline you should hope for the best, but be prepared for the worst. Of course this is true of all production software, but I believe it is especially important in data engineering, where data delivery is crucial.
Let’s say you have a step in a pipeline that hits an outside service to get some data and that service happens to be down, or maybe the format of the data changes. Your pipeline needs to know how to handle cases like this to prevent errors from taking down the system. You can accomplish this with mechanisms such as retries, error handling, queueing, and whatever else you can imagine that could prevent the pipeline from stopping due to semi-random events.
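To make the retry idea concrete, here is a minimal sketch of retrying a flaky call with exponential backoff, using only the standard library. The helper name `call_with_retries`, its defaults, and the choice of which exceptions count as transient are assumptions for illustration; workflow frameworks typically ship their own retry settings.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with exponential backoff.

    Hypothetical helper: only exceptions listed in retry_on are retried,
    so unexpected errors (for example, a changed data format) fail fast.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without actually waiting.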
While attempting to guard against every possible mishap, it is also important to accept that things will sometimes break or go down. Some forces are simply outside your control, and the number of things you cannot control increases substantially when the pipeline is distributed. This is why it is important to store relevant metrics, metadata, and logs.
Keeping a record of what happens in your pipeline through a combination of metadata and logging is crucial. This will be a huge help, both for keeping track of pipeline metrics and for quickly diagnosing issues that you were previously unaware of. You want to be prepared for when the unexpected happens.
Many tools that provide this functionality are easy to plug into your pipeline. In the past I have used Grafana for metrics and Logstash for logging. You can also take it a step further and use something like Sentry to monitor for errors.
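Independent of which tools you pick, the basic idea can be sketched with only the standard library: wrap each task so that every run is logged and leaves behind a small metadata record. The `tracked` decorator and the in-memory `RUN_RECORDS` list are hypothetical stand-ins; a real pipeline would ship these records to a metrics or logging backend instead. The type hints assume Python 3.9 or newer.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")

# In-memory list standing in for a real metadata store (an assumption
# made for illustration only).
RUN_RECORDS: list[dict] = []


def tracked(task):
    """Log each run of a task and record its status and duration."""

    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        record = {"task": task.__name__, "status": "running"}
        start = time.monotonic()
        try:
            result = task(*args, **kwargs)
            record["status"] = "success"
            return result
        except Exception as exc:
            record["status"] = "failed"
            record["error"] = repr(exc)
            # logger.exception also captures the traceback.
            logger.exception("task %s failed", task.__name__)
            raise
        finally:
            # Runs on both success and failure, so every run is recorded.
            record["duration_s"] = time.monotonic() - start
            RUN_RECORDS.append(record)
            logger.info("task %s finished: %s", task.__name__, record["status"])

    return wrapper
```

Decorating a task with `@tracked` leaves its behavior unchanged; failures are still raised to the caller, but each one now has a log entry and a record to diagnose it with.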
Three words: directed acyclic graphs.
Loops in pipelines are bad. There is a reason most (if not all) frameworks available enforce DAGs for pipeline building: cycles introduce unnecessary complexity and make it very difficult to reason about your system.
Consider: if you had a “real” pipeline that processes oil, it should never refine the oil and then throw it back into a pipe with unrefined product. (…unless that’s actually part of the process; I must admit that I am not well versed in petroleum engineering.)
With that said, there actually are ways to build interesting control-flow mechanisms that can mimic loop-like behavior, but they require incredible care to get right.
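If you are building a custom system, a cycle check is cheap to add. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (available since Python 3.9) to compute a run order and reject any dependency graph that contains a loop. The function name `execution_order` and the task names in the usage note are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid run order, or raise ValueError if the graph has a cycle.

    dependencies maps each task name to the set of tasks that must run
    before it.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of exc.args is the list of nodes in the cycle.
        raise ValueError(f"pipeline is not a DAG, cycle: {exc.args[1]}") from exc
```

For example, `execution_order({"extract": set(), "transform": {"extract"}, "load": {"transform"}})` yields the tasks in dependency order, while a graph in which `load` feeds back into `extract` raises a `ValueError` before anything runs.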
When building a pipeline there is no “one true way.” It is important to make an informed decision about the total work involved with a custom solution versus the ability of that solution to better support your internal practices.
The point of this post isn’t that you shouldn’t write a custom pipeline; it’s that there are many second-order effects to be aware of if you do. All pipelines require some custom effort; the real question is how much you actually need to do, and what framework will support you best.
For the record, we think ours is Prefect.