Why Not Cron?
I was showing off Prefect, the open source dataflow coordination platform, the other day and someone watching asked, “Why should I use Prefect instead of Cron?”
That’s such a great question that I decided to write this blog post to answer it! 🚀
Cron is a great option to take care of scheduling. But scheduling is just one piece of orchestration. And orchestration, together with observability, makes up dataflow coordination.
TL;DR: If you’re only using cron, you’re missing out on a bunch of dataflow orchestration and observation benefits that can save you time, money, and headaches.
In this post we’ll explore cron and other scheduling options available in Prefect. Then, you’ll see how easy it is to add other orchestration and observation functionality.
If you work with data, whether as a data scientist, software engineer, data engineer, machine learning engineer, statistician, or analytics engineer, you’ll almost certainly need to move data on a schedule sooner or later. 📆
Cron-speak is the language of five asterisks that you can replace with numbers to create a schedule.
8 5 * 3 * is not the most intuitive code most of us have seen, but it is pretty straightforward.
Here’s how it works. The five values represent the following (in order):
- minute
- hour
- day of month
- month
- day of week
Each value is either an integer or an asterisk. An asterisk means “every”.
A cron scheduler can read the code and then schedule your script to run. 🎉
In the example above, your script would be scheduled for 5:08AM every day of March.
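To make the field order concrete, here is a minimal Python sketch that checks whether a given moment matches a five-field cron string. It is an illustration only, not how a real cron daemon or Prefect parses schedules: the name cron_matches is made up for this post, and it handles just the integer-or-asterisk form described above (no ranges, lists, or steps).

```python
from datetime import datetime


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on the schedule described by `expr`.

    Hypothetical helper: supports only integers and asterisks.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("a cron expression needs exactly five fields")
    minute, hour, day_of_month, month, day_of_week = fields

    def field_ok(field: str, value: int) -> bool:
        return field == "*" or int(field) == value

    # cron counts Sunday as 0 (7 is also accepted); Python counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7

    def weekday_ok(field: str) -> bool:
        return field == "*" or int(field) % 7 == cron_weekday

    if day_of_month != "*" and day_of_week != "*":
        # Standard cron quirk: when both day fields are set, either may match.
        day_ok = field_ok(day_of_month, when.day) or weekday_ok(day_of_week)
    else:
        day_ok = field_ok(day_of_month, when.day) and weekday_ok(day_of_week)

    return (
        field_ok(minute, when.minute)
        and field_ok(hour, when.hour)
        and field_ok(month, when.month)
        and day_ok
    )


# 5:08 AM on a day in March matches the example expression.
print(cron_matches("8 5 * 3 *", datetime(2023, 3, 14, 5, 8)))  # True
```

A real scheduler works the other way around, computing the next matching time instead of testing one, but the field-by-field comparison is the same idea.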
Cron is fine for many use cases, but it’s nice to have a quick, human-readable option for a repeating schedule, and cron can’t express some more complex schedules. Fortunately, Prefect supports cron along with options that cover both of those cases.
Scheduling in Prefect
Prefect supports three formats for schedule creation: interval, cron, and rrule.
Interval is the simplest scheduler. Just specify a time interval for the deployment to run, such as every 10 minutes.
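Conceptually, an interval schedule is simple arithmetic: start from an anchor time and keep adding the interval. The sketch below is a plain-Python illustration of that idea, not Prefect’s implementation; next_run and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def next_run(anchor: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first scheduled time strictly after `now`.

    Hypothetical helper: runs happen at anchor, anchor + interval, and so on.
    """
    if now < anchor:
        return anchor
    # Number of whole intervals that have already elapsed since the anchor.
    elapsed_intervals = (now - anchor) // interval
    return anchor + (elapsed_intervals + 1) * interval


# Every 10 minutes, anchored at midnight: at 00:25 the next run is 00:30.
print(next_run(datetime(2023, 1, 1), timedelta(minutes=10), datetime(2023, 1, 1, 0, 25)))
```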
RRule stands for recurrence rule. Many of us have interacted with recurrence rules when creating a recurring meeting in Google Calendar. RRule allows you to do more advanced scheduling than cron. For example, you can schedule a flow to run every month on the second-to-last Friday, for 7 occurrences. No problem. You just need the string RRULE:FREQ=MONTHLY;BYDAY=-2FR;COUNT=7. 😎
RRule is a bit trickier than cron to remember. Luckily there are handy websites to help.
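To see what the second-to-last-Friday rule above actually asks for, here is a standard-library Python sketch that computes the same kind of dates by hand. It is not an RRULE parser and not what Prefect runs; the function name and the start date are assumptions chosen for illustration.

```python
import calendar
from datetime import date


def second_to_last_fridays(start: date, count: int) -> list:
    """Return `count` second-to-last Fridays of the month, on or after `start`."""
    results = []
    year, month = start.year, start.month
    while len(results) < count:
        days_in_month = calendar.monthrange(year, month)[1]
        fridays = [
            date(year, month, day)
            for day in range(1, days_in_month + 1)
            if date(year, month, day).weekday() == calendar.FRIDAY
        ]
        candidate = fridays[-2]  # every month has at least four Fridays
        if candidate >= start:
            results.append(candidate)
        # Advance to the next month, rolling over the year when needed.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return results


# Seven occurrences starting from an assumed start date of 1 January 2023.
for run_date in second_to_last_fridays(date(2023, 1, 1), 7):
    print(run_date.isoformat())
```

Expressing this in cron is awkward because cron has no notion of counting weekdays from the end of a month or of stopping after a fixed number of runs.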
You can create a schedule with Prefect in a number of ways.
First, you can use the Prefect Cloud GUI. You can edit a Prefect deployment to add an Interval or Cron schedule. A deployment is a server-side concept that holds a Prefect flow, allowing it to be scheduled and run through an API. A Prefect flow is just a Python function decorated with @flow.
The GUI is an especially great way for team members to schedule deployment runs, regardless of their coding skills.
The second way to create a schedule is from the command line when building a deployment. Just add --interval 20 and your code deployment will be set to run every 20 seconds. 🎉
Third, if you create a Python deployment file that uses Prefect’s Deployment class, you can specify a schedule by passing an instance of IntervalSchedule, RRuleSchedule, or CronSchedule as the schedule argument to Deployment.build_from_flow.
Finally, you can edit the schedule section of the deployment definition YAML file directly.
With any creation option you can specify what time zone you want to use for your schedule. 🌍
As you can see, if you want scheduling options, Prefect’s got you covered beyond cron! But scheduling is just one aspect of orchestration.
Is a scheduler by itself enough to run your data engineering code in a reliable, robust manner?
Unfortunately not. 😔
It’s estimated that up to 80% of a data engineer’s time is spent doing defensive coding — all the stuff to deal with failure. Negative engineering, we sometimes call it. Luckily Prefect makes it easy to reduce the need for negative engineering.
Automatic retries are one example: Prefect can rerun a failed task for you. Retries can be especially handy when fetching data from a website or third party API server. 🙌
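Without an orchestrator, retry logic is something you write and maintain yourself. The sketch below shows roughly what a hand-rolled version looks like in plain Python; with_retries is a hypothetical helper, not a Prefect function. In Prefect 2 the equivalent is a pair of arguments on the task decorator, retries and retry_delay_seconds.

```python
import functools
import time


def with_retries(retries: int = 3, delay_seconds: float = 0.0):
    """Hypothetical decorator: re-run a function up to `retries` extra times."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    # Out of attempts: let the last error propagate.
                    if attempt == retries:
                        raise
                    time.sleep(delay_seconds)

        return wrapper

    return decorator
```

Even this small version leaves out things you would soon want, such as logging each attempt or retrying only on certain error types, which is the kind of negative engineering that adds up.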
Want some cache? 💰
Running code when you don’t need to takes time and compute resources. Save time. ⏳ Save the earth. 🌍 Save a buck. 💵
Prefect’s caching is free, and you turn it on in the same @task decorator that configures retries.
The built-in task_input_hash reruns your tasks only when the input to the function changes. Or use any custom caching function that returns a unique value. 👍
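The idea behind an input-based cache key can be sketched with the standard library: hash the arguments, and only call the function when that hash has not been seen before. This is a simplified illustration under stated assumptions (in-memory storage, arguments that serialize sensibly to JSON), not the logic of Prefect’s task_input_hash; input_hash and cached_by_input are hypothetical names.

```python
import functools
import hashlib
import json


def input_hash(*args, **kwargs) -> str:
    """Hypothetical cache key: a stable hash of the call's arguments."""
    payload = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_by_input(fn):
    """Hypothetical decorator: skip the call when the same inputs were seen before."""
    store = {}  # in-memory only; lost when the process exits

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = input_hash(*args, **kwargs)
        if key not in store:
            store[key] = fn(*args, **kwargs)
        return store[key]

    return wrapper
```

A cron job gets none of this by default: every scheduled run starts from scratch unless you build and persist a cache like this yourself.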
You can also use Prefect’s tasks to wait for the results of other upstream tasks they depend upon. If you only have cron, you need to cross your fingers and hope that those upstream functions all complete successfully before your downstream job is scheduled to start. 🤞
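The difference from clock-based chaining can be sketched in plain Python with concurrent.futures: the downstream step asks for the upstream result and blocks until it exists, and an upstream failure surfaces as an exception instead of letting the downstream step run on missing data. The function names here are hypothetical, and this is an analogy for the dependency behavior, not Prefect code.

```python
from concurrent.futures import ThreadPoolExecutor


def extract():
    """Hypothetical upstream step."""
    return [1, 2, 3]


def transform(rows):
    """Hypothetical downstream step that needs the upstream output."""
    return [row * 10 for row in rows]


def run_pipeline():
    with ThreadPoolExecutor() as pool:
        upstream = pool.submit(extract)
        # result() waits for extract to finish and re-raises any error it hit,
        # so transform never starts on data that is not there yet.
        rows = upstream.result()
        downstream = pool.submit(transform, rows)
        return downstream.result()


print(run_pipeline())  # [10, 20, 30]
```

With two independent cron entries, the only link between the steps is the gap you leave between their start times.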
What about Observation?
Dataflow coordination is made up of both orchestration and observation.
Decorating flow and task functions gives you automatic logging in the UI and wherever your Prefect agent is running. 🪵
Although many tools give you some observability, no tool on the market today can provide you with easy insights across all the many parts of your organization’s data stack. That’s about to change with Prefect’s observability API. Check out Prefect 2’s intro post to see what you’ll be able to do soon.
In this post we’ve seen why cron alone is not a full coordination plane, an orchestration plane, or even a scheduling plane.
If you want to use cron alone to schedule workflows, that’s fine. Then if you want to better coordinate your dataflows, you might decide to build a whole lot more (often brittle) functionality from scratch.
Alternatively, you could use the open source Prefect library for easy, incremental adoption that requires few additions to your code and few new concepts to learn. 🙂
If solving dataflow coordination problems to save you time, money, and headaches sounds good to you, check out Prefect. Feel free to join the Prefect Slack Community with over 20,000 users if you have any questions.
Happy coordinating! 🪄