Data Quality Design
Our data quality process is heavily influenced by the DataOps framework, which combines software engineering and manufacturing best practices to treat all aspects of the pipeline — software, data, and subsamples used in tests — as part of a CI/CD process.
What is DataOps?
DataOps is not simply "DevOps for data." It extends DevOps principles with manufacturing-inspired process control, treating the analytic pipeline as a production system where quality is built in, not bolted on.


DataOps Manifesto
The DataOps Manifesto defines 18 principles. The following table summarizes each and how we apply them at Emory.
| # | Principle | Summary |
|---|---|---|
| 1 | Continually satisfy your customer | Deliver valuable analytic insights early and continuously |
| 2 | Value working analytics | Measure performance by delivery of accurate, insightful analytics |
| 3 | Embrace change | Welcome evolving requirements to maintain competitive advantage |
| 4 | It's a team sport | Encourage diverse roles and skills within analytic teams |
| 5 | Daily interactions | Ensure daily collaboration among researchers, engineers, and operations |
| 6 | Self-organize | Allow teams to self-organize for optimal results |
| 7 | Reduce heroism | Build sustainable processes; minimize reliance on individual effort |
| 8 | Reflect | Regularly self-reflect to improve operational performance |
| 9 | Analytics is code | Version all aspects of the analytics process |
| 10 | Orchestrate | Coordinate data, tools, code, environments, and team efforts |
| 11 | Make it reproducible | Ensure results are reproducible by versioning everything |
| 12 | Disposable environments | Provide easy-to-create, isolated environments for experimentation |
| 13 | Simplicity | Focus on simplicity to enhance agility and efficiency |
| 14 | Analytics is manufacturing | Apply process-thinking to achieve continuous efficiencies |
| 15 | Quality is paramount | Build pipelines capable of automated anomaly and security detection |
| 16 | Monitor quality and performance | Continuously monitor performance, security, and quality measures |
| 17 | Reuse | Avoid repetition by reusing previous work |
| 18 | Improve cycle times | Minimize the time from customer need to analytic insight |
Implementation Steps
Step 1 — Add Data and Logic Tests
Inspired by Statistical Process Control (SPC) from manufacturing: data must stay within an acceptable statistical range. Tests validate data values at the inputs and outputs of each processing stage in the pipeline.
- Tests that fail trigger a notification-and-fix feedback loop
- Failed records are "quarantined" until resolved
- The loop continues until the agreed-upon measure of success is met
Step 2 — Use Version Control
Version code, documentation, tests, and meeting notes — all aligned within the context of releases. Understanding all aspects of version changes up and down the stack is critical.
Step 3 — Branch and Merge
Apply branching and merging not just to code, but across the entirety of the infrastructure — including data models, tests, and documentation.
Step 4 — Use Multiple Environments
Implement "test kitchens" where individual developers can experiment in isolated environments without affecting production data or pipelines.
Step 5 — Reuse and Containerize
Package reusable components so that engineers and analysts can utilize them without touching the internals — set up the container locally and use as needed.
Step 6 — Parameterize Processing
Design pipelines with flexibility for different run-time circumstances:
- Which version of raw data should be used?
- Is this a production or testing run?
- Should specific processing steps be included or skipped?
Atomizing code into discrete steps supports this naturally — each step can be independently included or excluded.
Step 7 — Work Without Fear
When tests, version control, and isolated environments are in place, team members can experiment and iterate without risking production data.
Related Pages
- Data Quality Results — OHDSI DQD summary and failure analysis
- DBT Pipeline Tests — column-level test definitions for every table
- Known Issues — table-by-table mapping gaps and workarounds
- DataOps Manifesto — the full 18 principles