Successfully Deploying a Task Queue
Firing off a background task is easy. But what if it fails? What if it never got fired at all? How do you distribute a large number of async tasks across your computing infrastructure - and keep tabs on all of them?
Every system needs to run background tasks. But managing them isn’t as simple as telling a cron job or Lambda function to “just do it.”
In this article, we’ll dive into what successfully deploying a task queue for background tasks requires, including orchestrating and monitoring tasks at scale.
The utility of background tasks and task queues
💡 A background task is any computational task that runs asynchronously without delivering an immediate response to a user. It may be a discrete task or part of a larger workflow of related tasks.
Examples of background tasks include parsing PDF documents, sending multi-cast e-mails, generating image thumbnails, crawling a Web site, fulfilling an order, or kicking off an Extract-Load-Transform (ELT) job. In each of these cases, we expect the action will be:
- Long-running. Image transformation, number-crunching, and data loads are compute-intensive tasks that may take minutes - or even hours - to complete.
- Asynchronous. For example, if a task needs to make multiple HTTP requests to external sites, we can’t guarantee how long each server will take to respond.
A user’s connection will likely time out if we force them to wait for a response from such tasks. Even if it doesn’t, the wait may still be too long: one recent survey found that 54% of e-commerce site users expect Web pages to load in three seconds or less. Tasks like these should not run inside the application’s request path. That makes it critical to move any long-running work into background tasks, which require a separate architecture to scale successfully (more on that later).
Background tasks can be either schedule-driven or event-driven. An example of a schedule-driven background task is a long-running ELT job or other data migration task that runs once a day. Use cases like image thumbnail processing would fall under event-driven tasks, as they’re activated by a user action (in this case, uploading an image).
A task queue may run a single task or it may run multiple tasks, either sequentially or concurrently, as part of a larger workflow. Traditional task queue architecture gets complex when considering these dependencies.
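To make the event-driven case concrete, here is a minimal single-process sketch using only Python's standard library. The names `make_thumbnail` and `handle_upload` are hypothetical stand-ins, and a real deployment would use a durable queue and separate worker processes rather than a thread in the same process.

```python
import queue
import threading


def make_thumbnail(image_name: str) -> str:
    # Hypothetical stand-in for slow, compute-intensive work.
    return f"thumb_{image_name}"


def worker(tasks: queue.Queue, results: list) -> None:
    # Pull tasks off the queue until the None sentinel arrives.
    while True:
        image_name = tasks.get()
        if image_name is None:
            tasks.task_done()
            break
        results.append(make_thumbnail(image_name))
        tasks.task_done()


def handle_upload(tasks: queue.Queue, image_name: str) -> str:
    # The request handler only enqueues the work and returns immediately.
    tasks.put(image_name)
    return "accepted"


tasks: queue.Queue = queue.Queue()
results: list = []
thread = threading.Thread(target=worker, args=(tasks, results), daemon=True)
thread.start()

response = handle_upload(tasks, "cat.png")  # the user gets this right away
tasks.join()       # wait for the background work (for demonstration only)
tasks.put(None)    # tell the worker to stop
thread.join()
```

The user-facing call returns as soon as the task is enqueued; the thumbnail is produced later by the worker.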
Challenges with deploying a robust task queue
Many teams that are new to background tasks begin with the simplest approach they can manage. This usually means running a task as a cron job, a serverless cloud function (such as AWS Lambda), or a Docker container in Kubernetes.
As a team’s needs grow, these simple solutions become quite complex. If you’re running dependent workloads with different architectural requirements, it becomes increasingly challenging to orchestrate when, where, and how you’ll run them. Monitoring becomes more complicated as you may have tasks running in different locations and using different technologies (virtual machines, containers, serverless functions, third-party tools, etc.). This all comes to a head when failure occurs: the more intertwined the system, the longer it takes to debug the failure.
There are challenges with managing even simple tasks in a highly scalable architecture. For example, how does your task queue handle notifications and retry semantics? What happens if a container faults or an HTTP request returns a 500 server error?
Teams also find they need greater visibility into the tasks themselves. Suppose you support dozens of webhooks wired to serverless functions. Do you know when they last ran - and for how long? How do you detect anomalies when managing dozens or hundreds of background tasks?
Teams in this position sometimes turn to a distributed task queue like Celery to create a basic task queue and orchestration system. But tools like Celery still leave teams with a lot to build in terms of infrastructure versatility and monitoring logic.
Tips for deploying a successful task queue
How do you go beyond cron and deploy a more scalable and versatile task queue? Here are the core elements you should include in any background task manager architecture:
- Ensure scalability
- Account for failure
- Turn tasks into workflows
- Create a debugging infrastructure for tasks
Ensure scalability
You can have some level of monitoring and observability if you’re running a single task. But what about dozens? Or hundreds? As the complexity of your architecture grows, you need more than a single cron job or an unmonitored callback to a serverless function.
You might think, why can’t I run multiple tasks just like I run one? A better question is what happens when you need to run a function on different infrastructure, or debug a dependency between multiple functions. This gets quite time-consuming at scale.
To ensure scalability, your task queue should support:
- Managing multiple task queues that run tasks on specific target infrastructure and distributing them at scale
- Monitoring all of your tasks through a single pane of glass
- Centralized logging, metrics, multiple trigger types, and retry semantics at scale
Supporting these capabilities means building out infrastructure. For example, a typical scalable Celery deployment involves standing up and maintaining clusters of multiple worker nodes. (Remember: you’ll need to monitor both your task scripts and that infrastructure itself to ensure reliability and uptime.)
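As a rough illustration of the first requirement, the sketch below routes tasks to named queues that stand in for different target infrastructure. The `QueueRouter` class, the task names, and the queue names are all hypothetical; in a real system each queue would be backed by a message broker and consumed by its own pool of workers.

```python
from collections import defaultdict, deque

# Hypothetical mapping of task names to the queue (infrastructure) they need.
ROUTES = {"resize_image": "gpu", "send_email": "default"}


class QueueRouter:
    """Send each task to the queue that serves its target infrastructure."""

    def __init__(self, routes: dict, default: str = "default") -> None:
        self.routes = routes
        self.default = default
        self.queues = defaultdict(deque)

    def submit(self, task_name: str, payload: dict) -> str:
        # Unknown tasks fall back to the default queue.
        queue_name = self.routes.get(task_name, self.default)
        self.queues[queue_name].append((task_name, payload))
        return queue_name

    def depth(self) -> dict:
        # Queue depth per target is a simple metric to watch centrally.
        return {name: len(q) for name, q in self.queues.items()}
```

Exposing `depth()` from one place is a small example of the central view the list above asks for.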
Account for failure
It’s inevitable. Your background tasks are going to fail. The question isn’t whether or not they fail - it’s how well they handle the failure.
A robust task queue needs multiple mechanisms to account for failure, including:
- A variety of retry semantics. For example, you will want to use a retry with exponential backoff for an apparent transient error, such as a Web server being unreachable or returning a 500 server error.
- Self-healing logic. If a virtual machine fails to respond, the task queue can restart it via its cloud provider before re-attempting the task.
- Graceful failure. For permanent errors - a database item can’t be found, a page returns a 404, credentials are rejected - you’ll need support for reasonable fallback behaviors.
- Notification of failure. When failure happens, your team needs to know so they can remediate it ASAP.
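The mechanisms above can be sketched in a few lines of Python. This is an illustration under assumptions, not a production implementation: the `TransientError` and `PermanentError` classes, the `run_with_retries` helper, and its parameters are hypothetical names, and real systems usually add random jitter to the backoff delay.

```python
import time


class TransientError(Exception):
    """An error worth retrying, e.g. a 500 response or an unreachable server."""


class PermanentError(Exception):
    """An error retrying will not fix, e.g. a 404 or rejected credentials."""


def run_with_retries(task, *, max_attempts=4, base_delay=1.0,
                     fallback=None, sleep=time.sleep, notify=print):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError as exc:
            # Graceful failure: notify the team and return a fallback value.
            notify(f"permanent failure: {exc}")
            return fallback
        except TransientError as exc:
            if attempt == max_attempts:
                notify(f"giving up after {attempt} attempts: {exc}")
                raise
            # Exponential backoff: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` and `notify` as parameters keeps the helper easy to test and lets you swap in a real alerting channel later.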
Additionally, think about accounting for the hardest-to-detect failure case: what if your task fails to run? If you expect a task to run every hour and it hasn’t run for three days, you have a problem. Without the proper monitoring, that problem may go unnoticed for weeks or months.
Build in mechanisms, such as pings or logs sent to a centralized monitoring service, to keep tabs on your task’s runs. Once these are in place, you can define alerts that generate notifications when the number of actual runs in a time period falls below the expected threshold.
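One way to picture such an alert is to count the pings each task recorded in a recent window and flag any task that falls short. The function names and data shapes below are assumptions made for illustration; a monitoring service would store the pings and evaluate the rule for you.

```python
from datetime import datetime, timedelta


def runs_in_window(run_times, now: datetime, window: timedelta) -> int:
    # Count the pings that landed inside the window ending at `now`.
    return sum(1 for t in run_times if now - window <= t <= now)


def under_threshold(pings: dict, expected: dict,
                    now: datetime, window: timedelta) -> list:
    """Return the tasks that ran fewer times than expected in the window.

    `pings` maps task name -> list of run timestamps.
    `expected` maps task name -> minimum number of runs in the window.
    A task with no recorded pings at all counts as zero runs.
    """
    return sorted(
        name
        for name, minimum in expected.items()
        if runs_in_window(pings.get(name, []), now, window) < minimum
    )
```

Because tasks with no pings count as zero runs, a task that silently stopped being scheduled is flagged just like one that runs too rarely.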
Turn tasks into workflows
Smaller units of software that do one thing - and do it well - are easier to maintain, monitor, and debug. That means that tasks should be small and discrete. If your “task” is complex - fulfilling an order, deploying a CI/CD pipeline, processing data - you need a task queue that supports workflows. Additionally, if your discrete tasks are interdependent, that interdependency itself is a workflow.
💡 Workflows are logical units of execution composed of two or more tasks. A workflow system can create, deploy, and monitor complex tasks, which is particularly important when tasks depend on each other for proper execution. It can also coordinate tasks across infrastructure, systems, triggers, and teams.
Workflows and workflow orchestration support the following basic functions:
- Scheduling. Enables tasks to run reliably at a set time, not when an individual engineer remembers to run a script.
- Ordering operations. Runs tasks in the proper order, and scales out resources as needed to prevent failure.
- Observability. Provides visibility into the status of a workflow and its individual tasks, along with detailed logs for debugging when failures inevitably occur.
- Versatile triggers. Enables running a workflow via scheduled or event-based triggers across a variety of systems, including virtual machines, Kubernetes clusters, and cloud providers.
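To show what ordering operations means in practice, here is a small sketch that runs interdependent tasks in dependency order using `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The `run_workflow` helper and the order-fulfillment task names are hypothetical, and the sketch runs everything sequentially in one process, which a real orchestrator would not be limited to.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks: dict, depends_on: dict):
    """Run each task after the tasks it depends on.

    `tasks` maps task name -> callable taking a dict of upstream results.
    `depends_on` maps task name -> set of task names that must finish first.
    """
    # static_order() yields every task after all of its dependencies;
    # it raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in depends_on.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# Hypothetical order-fulfillment workflow made of three small tasks.
tasks = {
    "charge_card": lambda up: "charged",
    "reserve_stock": lambda up: f"reserved after {up['charge_card']}",
    "ship_order": lambda up: f"shipped after {up['reserve_stock']}",
}
depends_on = {
    "reserve_stock": {"charge_card"},
    "ship_order": {"reserve_stock"},
}
order, results = run_workflow(tasks, depends_on)
```

Each task stays small and discrete, while the dependency map is what turns them into a workflow.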
Debugging infrastructure for tasks
It’s impossible to recreate certain failures you’ve seen in production. Rich, centralized logging and metrics are critical to debugging issues you can’t reproduce easily. An important part of this is a rich UI that helps you find the failure and its downstream dependencies quickly.
All logging and metrics should be part of a centralized logging interface easily accessible by any engineer or SRE. Team members should be able to find and drill down into the logs for a specific task run to discover error messages and pinpoint root causes.
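A simple way to make logs drillable per task run is to stamp every record with a run ID. The sketch below uses Python's standard `logging` module; the in-memory `ListHandler` is a hypothetical stand-in for whatever centralized log store you actually ship records to, and `run_logged` and `logs_for_run` are illustrative names.

```python
import logging
import uuid


class ListHandler(logging.Handler):
    """Collects records in memory; a stand-in for a central log store."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


central = ListHandler()
logger = logging.getLogger("task_queue")
logger.setLevel(logging.INFO)
logger.addHandler(central)
logger.propagate = False  # keep the demo output out of the root logger


def run_logged(task_name: str, func) -> str:
    # Every record for this run carries the same task_name and run_id.
    run_id = uuid.uuid4().hex
    log = logging.LoggerAdapter(logger, {"task_name": task_name, "run_id": run_id})
    log.info("started")
    try:
        func()
        log.info("succeeded")
    except Exception:
        log.exception("failed")  # also records the traceback
    return run_id


def logs_for_run(run_id: str) -> list:
    # Drill down into the messages for one specific task run.
    return [r.getMessage() for r in central.records
            if getattr(r, "run_id", None) == run_id]
```

With a run ID on every record, an engineer can filter the central store down to a single failed run and read its error message and traceback in context.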
Get started with Prefect workflows
When tasks require diverse infrastructure, grow increasingly dependent on each other, and have users that need to understand their state, they function more like workflows. Scaling a complex workflow architecture is not dissimilar from scaling an application - you might need an API, run background tasks, answer to stakeholders, and more. That requires a workflow orchestrator that’s up to par.
Deploying a scalable task queue isn’t a stroll through the daisies. There are a lot of moving parts. And at the end of the day, you’re signing up to create yet another piece of infrastructure that must be deployed, monitored, and maintained.
That’s why Prefect provides full support for deploying complex workflows. Using Prefect, you can define even the most involved workflow in Python as a series of tasks, each marked with a simple decorator.
Prefect’s workflow orchestration solution works across diverse infrastructure and provides full observability, retry semantics, and a wide range of triggers. You can run workflow tasks in all major cloud providers, trigger workflows from any third-party system, and integrate with dozens of external systems via our pre-provided integrations packages.
Prefect makes complex workflows simpler, not harder. Try Prefect Cloud for free, download our open source package, join our Slack community, or talk to one of our engineers to learn more.