Built to Fail: Design Patterns for Resilient Data Pipelines

November 05, 2024
Chris White
CTO

In the ever-evolving world of data engineering, one truth stands out: change is constant and failure is inevitable. But what if we could build systems that adapt to unexpected failures gracefully, and embrace these situations as a pathway to resilience? Let's explore how principles from resilience engineering can improve how we build data pipelines.

The True Nature of Software Design

Before diving into resilience patterns, we need to challenge a common misconception about software design. Many believe that design is about accomplishing tasks or meeting requirements. However, consider this thought experiment: if you had to build an application with completely known, unchanging requirements that would never need modification, would you need design patterns at all? You could simply brute force the solution without concern for structure, architecture or readability.

This reveals a crucial insight: software design isn't about the present—it's about the future. As Sandi Metz eloquently puts it, "practical design does not anticipate what will happen. It merely accepts that something will, and that in the present you cannot know what."

The Unique Challenges of Data Engineering

In data engineering, change isn't just common—it's relentless. Consider the typical scenarios:

  • Data schemas evolve, both upstream and downstream
  • Data sources appear and disappear, sometimes without warning
  • Data volumes fluctuate unpredictably
  • External system dependencies shift beneath our feet
  • Stakeholder expectations can change daily

As Heraclitus might say if he were a data engineer: "No person ever steps in the same data stream twice, for it's not the same stream, and they're not the same person."

Learning from Resilience Engineering

Resilience engineering emerged from studying major industrial disasters like Three Mile Island and the Challenger explosion. Two fundamental principles from Charles Perrow's Normal Accident Theory (1984) are particularly relevant:

  1. In complex, tightly coupled systems, failure is inevitable
  2. There's rarely a single root cause—failures emerge from multiple, interacting factors

This field continued to evolve through the study of high-reliability organizations in the 1990s and entered software engineering in the 2000s, pioneered by Netflix's chaos engineering and Google's site reliability engineering practices.

Core Principles for Resilient Data Pipelines

1. Reliability is Emergent

A system's reliability isn't just the sum of its parts. Two reliable components can create an unreliable system, while unreliable components can sometimes combine to create reliability. Think of a memory-leaking application that becomes reliable through scheduled Kubernetes restarts.

2. Design for the Unknown

When building data pipelines, we can't predict every failure mode. However, we can implement three key strategies to handle unexpected issues:

  • Functional Observability: Monitor at the business logic level, not just infrastructure. Your Datadog metrics might show perfect system health even while your pipeline is failing because they're monitoring the wrong abstraction layer.
  • Shift Left Practices: Don't wait until data reaches your warehouse to validate it. Integrate quality checks throughout your pipeline, starting with data ingestion.
  • Adaptive Capacity: Build systems that scale and adjust automatically based on changing demands through auto-scaling and serverless architectures.
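The first two strategies can be sketched together: validate records at ingestion instead of at the warehouse, and emit a business-level metric alongside the result. This is a minimal illustration, not Prefect's implementation; the field names, the negative-amount rule, and the `print`-based metric are all hypothetical stand-ins for whatever checks and metrics backend your pipeline uses.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Outcome of a shift-left validation pass at ingestion time."""
    valid: list = field(default_factory=list)
    invalid: list = field(default_factory=list)


def validate_at_ingestion(records, required_fields=("id", "amount")):
    """Check each record against business-level expectations before it
    enters the pipeline, rather than waiting for the warehouse."""
    result = ValidationResult()
    for record in records:
        missing = [f for f in required_fields if f not in record]
        if missing:
            result.invalid.append({"record": record, "errors": missing})
        elif record["amount"] < 0:
            result.invalid.append({"record": record, "errors": ["negative amount"]})
        else:
            result.valid.append(record)
    # Functional observability: a metric at the business-logic level,
    # not just host CPU or memory. Infrastructure dashboards can look
    # perfectly healthy while this counter climbs.
    print(f"ingest.valid={len(result.valid)} ingest.invalid={len(result.invalid)}")
    return result
```

The point of the design is that bad records are caught (and counted) at the boundary, so downstream stages only ever see data that has already passed the business-level checks.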

3. Graceful Extensibility

Systems should be designed to bend rather than break when reaching their limits. This means balancing two principles:

  • Graceful Degradation: Continue processing valid records even when some fail. Use patterns like dead letter queues to handle problematic data without halting the entire pipeline.
  • Software Extensibility: Design configuration-driven pipelines that can be modified without code changes. Keep processing logic independent of data schemas to accommodate change.
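The dead letter queue pattern mentioned above can be sketched in a few lines. This is a simplified in-memory version under stated assumptions: the `transform` callable and the dict-shaped dead letters are illustrative, and a production pipeline would route failures to a durable queue or table for later replay.

```python
def process_with_dead_letter_queue(records, transform):
    """Apply `transform` to each record; route failures to a dead-letter
    list instead of halting the whole batch (graceful degradation)."""
    processed, dead_letters = [], []
    for record in records:
        try:
            processed.append(transform(record))
        except Exception as exc:
            # Keep the failed record and the reason so it can be
            # inspected and replayed without blocking valid records.
            dead_letters.append({"record": record, "error": str(exc)})
    return processed, dead_letters
```

A malformed record no longer halts the run; it lands in the dead-letter list with its error, while every valid record is still processed.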

4. The Human Element

The "irony of automation" is the observation that adding automation increases system complexity and creates new failure modes, while eroding the manual skills operators need precisely when the automation fails. To counter this:

  • Maintain manual operational knowledge
  • Ensure pipelines can run locally for debugging
  • Establish a strong incident response culture
  • Conduct blameless postmortems
  • Document learnings for future reference

Embracing Change

As Grace Hopper famously warned, "The most dangerous phrase in the English language is 'we've always done it this way.'" Building resilient data pipelines isn't about preventing all failures—it's about creating systems that can adapt, recover, and learn from failures.

The key is to embrace change as a constant companion rather than an unwelcome guest. By applying principles from resilience engineering, we can build data pipelines that don't just survive change—they thrive on it.

View the recording of the talk below.
