How to Cut Data Pipeline Costs by 75% with Kubernetes Spot Instances

June 29, 2025
Chris White
CTO

Data teams have embraced Kubernetes for many reasons. It provides dynamic resource allocation for workloads that swing from lightweight data ingestion to GPU-hungry ML training. Its native support for auto-scaling matches infrastructure to demand. Declarative configuration codifies requirements alongside pipelines, and workload isolation lets multiple teams share infrastructure safely.

But there's a fundamental economics problem lurking beneath this technical elegance: the same dynamic allocation that makes Kubernetes perfect for data workloads also makes it expensive. Your ETL pipeline needs 16 cores for 2 hours daily, but you pay for them 24/7. Your ML training job scales to 100 nodes for a few hours weekly, burning budget on unused capacity the other 160+ hours.

This is precisely why spot instances exist—and why they're uniquely suited to data workloads running on Kubernetes.

The Spot Instance Opportunity

Spot instances represent cloud providers' excess capacity, offered at steep discounts when demand is low. The numbers are compelling: using Amazon EC2 Spot Instances can result in up to a 90% discount compared to on-demand prices, with typical savings of 60-75% according to AWS's Spot Instance Advisor. Google Cloud's Spot VMs offer similar discounts of 60-91% off standard on-demand prices for most machine types.

For data teams running substantial compute workloads—ML training, ETL pipelines, analytical queries—these savings add up fast. A team spending $50,000 monthly on compute could cut that bill to $5,000-$15,000 by migrating appropriate workloads to spot instances.

The economics work because cloud providers need spare capacity for demand spikes but don't want it sitting idle. Instead of wasting this capacity, they offer it at reduced rates with the caveat that it can be reclaimed with minimal notice when needed elsewhere. AWS provides a two-minute interruption notice, while GCP typically provides a one-minute notice.
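
To make the interruption mechanics concrete, here is a minimal sketch of how a process on an EC2 spot node can watch for that notice through the instance metadata service (IMDSv2). The endpoint returns 404 until a reclaim is scheduled; the polling cadence and error handling here are simplified.

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 requires a session token before reading metadata
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def interruption_scheduled() -> bool:
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True   # 200: a terminate/stop action is scheduled
    except urllib.error.HTTPError:
        return False      # 404: no interruption pending

while not interruption_scheduled():
    time.sleep(5)
print("Spot interruption notice received; begin draining work.")
```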

The State Management Problem

Spot instances create a tension for data workflows that doesn't exist for typical application workloads. When your web server gets terminated, users might see a brief error, but the request can be retried elsewhere, as there is no piece of state that ties an individual request to a specific container instance. When a batch processing job gets terminated halfway through processing 10TB of customer transaction data, the consequences are very different.

Unlike stateless web applications that can be killed and restarted elsewhere with minimal impact, data processing jobs are inherently stateful and often tied to specific business outcomes. Consider a typical ETL pipeline that processes customer transactions (sketched in code after this list):

  • Kicking off with precise, parametrized inputs
  • Loading data from multiple sources, possibly using costly queries
  • Transforming and validating the data through several interdependent stages
  • Storing results to a data warehouse that downstream systems depend on
  • Triggering downstream reporting processes that have SLA requirements
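
In Prefect terms, such a pipeline might look like the following sketch. The flow structure, names, and stand-in task bodies are illustrative, not a prescribed design:

```python
from prefect import flow, task

@task
def extract(source: str) -> list[dict]:
    # Stand-in for potentially costly queries against one source
    return [{"source": source, "amount": 100}]

@task
def transform(batch: list[dict]) -> list[dict]:
    # Stand-in for several interdependent validation stages
    return [{**row, "validated": True} for row in batch]

@task
def load(rows: list[dict]) -> None:
    # Stand-in for a write to the warehouse downstream systems read
    print(f"loaded {len(rows)} rows")

@task
def trigger_reporting(run_date: str) -> None:
    # Stand-in for kicking off SLA-bound reporting
    print(f"reporting triggered for {run_date}")

@flow
def customer_transactions_etl(run_date: str, sources: list[str]):
    # Kickoff with precise, parametrized inputs
    rows = [row for s in sources for row in transform(extract(s))]
    load(rows)
    trigger_reporting(run_date)
```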

If this pipeline is interrupted midway through processing, the naive approach of "just restart everything" can lead to duplicate processing if your pipeline is not idempotent. This results in inconsistent states and the kind of data quality issues that keep data engineers awake at night.

Traditional Kubernetes approaches handle spot instance interruptions through basic restart mechanisms, but these aren't designed for the complex state management that data workflows require. A pod that gets terminated during a critical transformation step may restart from the beginning, losing hours of computation and risking data inconsistencies that propagate through your entire data ecosystem.

This is where workflow orchestrators come into the picture. Workflow orchestration is fundamentally about managing workflow state, from scheduling and triggering rules to intermediate checkpoints that turn fragile workflows into durable multi-step processes.

How Prefect Enables Spot Instances

Prefect's architecture makes it uniquely suited for running data-intensive workloads on spot instances because it was designed from the ground up with failure recovery and dynamic infrastructure as core principles.

Expressive Task Caching for Idempotent Pipelines

When your pipeline gets interrupted halfway through processing—whether from spot instance termination or any other failure—you need to resume from the last successful task rather than starting over from scratch. Authoring truly idempotent code is difficult, but appropriate caching can give your workflows functional idempotency.

The sophistication here lies in configurability: Prefect offers several dimensions for expressing how, and when, workflow tasks should cache their results (illustrated in the sketch after this list):

  • Cache Policies allow you to define precisely what inputs (if any) determine whether tasks use cached results or execute fresh computation.
  • Result Storage controls how and where task results are serialized and stored. This is critical for spot instance deployments because your cache needs to survive infrastructure failures.
  • Transactions provide the ability to group tasks as atomic units. This prevents the scenario where some tasks in a dependent chain use cached results while others don't, creating inconsistent data states.
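
A minimal sketch of all three knobs together, assuming Prefect 3.x with the prefect-aws integration installed and a pre-registered S3Bucket block named "etl-results" (a hypothetical name) so cached results live off-node and survive reclaimed instances:

```python
from datetime import timedelta

from prefect import flow, task
from prefect.cache_policies import INPUTS, TASK_SOURCE
from prefect.transactions import transaction
from prefect_aws import S3Bucket  # assumes prefect-aws is installed

# Hypothetical, pre-registered storage block; any remote filesystem works
remote_cache = S3Bucket.load("etl-results")

@task(
    cache_policy=INPUTS + TASK_SOURCE,   # recompute if inputs or code change
    cache_expiration=timedelta(days=1),  # don't trust stale caches forever
    result_storage=remote_cache,         # results survive any single node
    persist_result=True,
)
def transform(batch: list[dict]) -> list[dict]:
    return [{**row, "validated": True} for row in batch]

@task(cache_policy=INPUTS, result_storage=remote_cache, persist_result=True)
def load(rows: list[dict]) -> int:
    return len(rows)  # stand-in for a warehouse write

@flow
def etl(batch: list[dict]):
    # The transaction commits both steps as a unit, so a restart never
    # pairs a cached transform with a warehouse load that never happened
    with transaction():
        rows = transform(batch)
        load(rows)
```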

This level of caching sophistication is what enables workflow restarts to occur from meaningful checkpoints rather than complete do-overs. Instead of a 3-hour pipeline restarting from zero at hour 2.5, it resumes from the last checkpoint and finishes in 30 minutes.

Infrastructure-Aware Orchestration

Most workflow orchestrators treat infrastructure as a black box—they submit jobs and at best retry them wholesale when they fail. Prefect's Kubernetes worker, by contrast, understands the infrastructure it's running on and can react intelligently to the specific failure modes that spot instances introduce.

When a pod receives SIGTERM (imminent termination), Prefect handles the signal and reschedules the workflow run with full context preservation. The worker is aware of each job's backoff limit and either reschedules the work through Prefect's backend, allowing any other Kubernetes worker to pick it up, or lets Kubernetes reschedule the pod and automatically monitors the new pod. Both patterns ensure no work is lost and the workflow continues seamlessly, using task caches where necessary.
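
As a sketch of the second pattern, a deployment can opt into Kubernetes-level rescheduling through the work pool's job variables. This assumes a work pool named "k8s-spot" whose base job template exposes a backoff_limit variable (as recent prefect-kubernetes worker templates do); the pool and image names are hypothetical:

```python
from prefect import flow

@flow
def nightly_etl(run_date: str):
    ...

if __name__ == "__main__":
    nightly_etl.deploy(
        name="nightly-etl-spot",
        work_pool_name="k8s-spot",                # hypothetical work pool
        image="registry.example.com/etl:latest",  # hypothetical image
        job_variables={
            # Assumed template variable: let Kubernetes restart the pod
            # after an eviction instead of failing the job outright
            "backoff_limit": 2,
        },
    )
```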

Stateless Worker Design for Maximum Resilience

The Prefect Kubernetes worker is stateless by design. Rather than storing run state in memory (where it can be lost during worker restarts or spot interruptions), Prefect stores all necessary run metadata in job and pod labels within the Kubernetes cluster itself. This metadata propagates to the underlying run, where the Prefect client can access task caches and the Prefect API for more granular state information.

Workers can be interrupted and restarted without losing track of running workflows. A terminated worker can restart on different infrastructure and immediately resume orchestrating flows based on the state recorded in the cluster. This resilience is crucial for spot instance deployments, where even the worker’s infrastructure can be reclaimed with minimal notice.

Dynamic Infrastructure Configuration

Unlike traditional orchestrators that require infrastructure to be defined statically at deployment time, Prefect allows for per-run infrastructure configuration.

This capability is useful for spot instance strategies because, as the sketch after this list shows, it enables your data engineers to:

  • Specify different instance types for different stages of a pipeline based on current spot pricing
  • Automatically fall back to on-demand instances for time-sensitive workloads when spot capacity is limited
  • Dynamically adjust resource requirements based on data volume and available spot instance types
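
A sketch of the last two patterns, passing per-run overrides when triggering a deployment. The deployment name continues the earlier example; the memory and node_selector variables, and the EKS capacity-type label, assume a base job template customized to expose them:

```python
from prefect.deployments import run_deployment

def launch_etl(run_date: str, volume_gb: int, spot_capacity_ok: bool):
    job_variables: dict = {
        # Size memory to today's data volume (assumed template variable)
        "memory": f"{max(4, volume_gb // 10)}Gi",
    }
    if not spot_capacity_ok:
        # Pin time-sensitive runs to on-demand nodes (EKS capacity label)
        job_variables["node_selector"] = {
            "eks.amazonaws.com/capacityType": "ON_DEMAND"
        }

    return run_deployment(
        name="nightly-etl/nightly-etl-spot",
        parameters={"run_date": run_date},
        job_variables=job_variables,
    )
```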

This flexibility matters in production spot instance deployments, where available instance types and pricing can vary significantly across availability zones and over time. Recent analysis shows that spot instance prices have increased by 21% on AWS and decreased by 26% on GCP over the past year, with significant regional variations. A static infrastructure configuration can't adapt to this pricing volatility, but Prefect's dynamic approach can.

The Economics of Reliability

Combining Prefect's robust failure handling with spot instance cost savings creates a compelling economic argument beyond simple cost reduction. Consider a typical data team scenario using D-type EC2 instances (arithmetic sketched after this comparison):

  • Traditional approach: $50,000/month on on-demand instances, no rescheduling of work
  • Prefect + Spot instances: $12,500/month, some rescheduling of work
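
The arithmetic behind the comparison, assuming a steady 75% discount on the same usage profile:

```python
on_demand_monthly = 50_000  # $/month on on-demand capacity
spot_discount = 0.75        # typical discount from the earlier section

spot_monthly = on_demand_monthly * (1 - spot_discount)
savings = on_demand_monthly - spot_monthly
print(f"${spot_monthly:,.0f}/month on spot; ${savings:,.0f}/month freed up")
# -> $12,500/month on spot; $37,500/month freed up
```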

The $37,500 monthly savings enables teams to invest in additional tooling, hire more engineers, or expand their data infrastructure significantly, going well beyond simple cost minimization. The slight reliability trade-off is acceptable for most data workloads, especially when Prefect's durability mechanisms ensure that jobs ultimately complete successfully.

The traditional approach to data infrastructure often involves significant over-provisioning to handle peak workloads, leading to resource utilization rates of 30-40%. Spot instances with intelligent orchestration can achieve utilization rates of 80-90% while maintaining reliability, materially changing the economics of data processing at scale.

The Broader Implications

The combination of Kubernetes spot instances and intelligent orchestration can drive a significant shift in your team’s data infrastructure economics. When compute costs drop by 80-90%, new possibilities emerge—experimental workloads, deeper analytics, and more frequent model training all become possible within existing budget constraints.

The technical sophistication required to make spot instances work reliably for data workloads is considerable, but the right orchestration platform can abstract this complexity while providing the control and visibility that data engineers need.

While over 20% of Prefect Cloud users currently run workflows in Kubernetes, this percentage has been slowly declining as teams migrate toward even more dynamic infrastructure patterns. Fully serverless compute platforms like AWS Fargate and Google Cloud Run are becoming increasingly common, along with emerging players like Modal and Coiled. Even as the ecosystem evolves toward serverless architectures, the cost optimization and durability strategies explored here remain relevant across compute paradigms. For teams running substantial workloads, this approach remains one of the most effective ways to get the most value out of their compute.

Further Reading

For more information, check out Prefect's Kubernetes documentation or start deploying Prefect flows for free.