Subsampling ("Canaries")
Subsampling selects a small, fixed set of patients — typically 40 — who are tracked longitudinally across every pipeline run. The same individuals appear in every snapshot, every reload, every environment test. This is not test data. It is a variance detection mechanism.
Like canaries in a coal mine, these patients serve as early-warning sentinels. When the pipeline transforms the same 40 patients and produces different output than the last run, something changed. That change must be found and explained. Subsampling turns invisible pipeline drift into a visible, auditable signal.
Design principles
Immutability. A subsample is frozen at creation. The same 40 person_ids are used indefinitely. Regenerating a subsample breaks longitudinal comparability and is treated as a major event requiring documented justification.
Representativeness. Patients are selected for data richness, not randomness. Each subsample patient must have clinical records across multiple domains (conditions, medications, measurements, visits, procedures) in both CDW and Epic source systems. This ensures the subsample exercises the full pipeline — every join, every transformation, every mapping.
Determinism. Patient selection uses xxhash64 ordering on person_id for reproducibility. Given the same input cohort, the same 40 patients are always selected.
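A minimal sketch of hash-ordered selection, to show why the result does not depend on input order. The pipeline itself uses xxhash64; Python's standard library has no xxhash64, so this sketch swaps in an 8-byte BLAKE2b digest as a stand-in, and the IDs it picks will not match the real subsample. The names `stable_hash64` and `select_subsample` are hypothetical.

```python
import hashlib


def stable_hash64(person_id: int) -> int:
    """64-bit hash of a person_id (BLAKE2b stand-in, NOT xxhash64)."""
    digest = hashlib.blake2b(str(person_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_subsample(eligible_ids, size=40):
    """Return the `size` eligible ids with the smallest hash values."""
    # dict.fromkeys drops duplicate ids while accepting any iterable.
    unique_ids = list(dict.fromkeys(eligible_ids))
    # Sorting on (hash, id) gives a total order, so ties cannot reorder results.
    ranked = sorted(unique_ids, key=lambda pid: (stable_hash64(pid), pid))
    return ranked[:size]


if __name__ == "__main__":
    cohort = list(range(1000, 2000))  # made-up person_ids
    first = select_subsample(cohort)
    second = select_subsample(list(reversed(cohort)))
    print(first == second)  # the same cohort in any order yields the same ids
```

Ordering by a hash rather than by person_id itself avoids favoring low or recently issued IDs while staying fully reproducible.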
Variance as signal. If a subsample run produces output that differs from the prior run for the same date range, that variance is not noise — it is a defect signal. Possible causes include:
- Upstream source data changed (new records, corrections, deletes)
- Pipeline logic changed (new transforms, bug fixes, mapping updates)
- Vocabulary updates (concept mappings shifted between loads)
- Identity resolution changes (person_id reassignment, merges)
Each cause has different implications. The subsample makes these visible before they reach production.
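The comparison between two runs could be sketched as follows. This is an illustration under stated assumptions, not the project's actual tooling: each run is represented as a hypothetical dictionary mapping table names to lists of row dictionaries, and the function names are invented for this example.

```python
import hashlib
import json
from collections import Counter


def row_fingerprint(row: dict) -> str:
    """Stable digest of one row; keys are sorted so column order is irrelevant."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def table_variance(previous_rows, current_rows) -> dict:
    """Count rows present in one run but not the other (duplicates respected)."""
    prev = Counter(row_fingerprint(r) for r in previous_rows)
    curr = Counter(row_fingerprint(r) for r in current_rows)
    return {
        "only_in_previous": sum((prev - curr).values()),
        "only_in_current": sum((curr - prev).values()),
    }


def diff_runs(previous: dict, current: dict) -> dict:
    """Return only the tables whose contents differ between two runs."""
    report = {}
    for table in sorted(set(previous) | set(current)):
        variance = table_variance(previous.get(table, []), current.get(table, []))
        if any(variance.values()):
            report[table] = variance
    return report


if __name__ == "__main__":
    # Made-up rows: the concept id changes between runs for the same patient.
    last_run = {"condition_occurrence": [{"person_id": 1, "condition_concept_id": 111}]}
    this_run = {"condition_occurrence": [{"person_id": 1, "condition_concept_id": 222}]}
    print(diff_runs(last_run, this_run))
```

An empty report means the two runs agree. A non-empty report says where variance appeared but not which of the causes above produced it; that still requires investigation.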
Architecture
The subsampling system spans two layers:
EmoryOmopSubsampling generates the frozen patient lists:
- person_data_representation: flags indicating which clinical domains each patient has data in, across CDW and Epic
- person_subsample: the core 40-patient set (≥3 domains in both CDW and Epic)
- Disease-specific subsamples: brain_health_subsample, winship_subsample, nursing_cohort
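The richness rule for the core set can be illustrated with a small eligibility check. This sketch assumes a hypothetical flag layout (one boolean per source and domain, such as `cdw_conditions`) and reads "≥3 domains in both CDW and Epic" as "at least three domains in each source"; the real person_data_representation columns may differ.

```python
DOMAINS = ("conditions", "medications", "measurements", "visits", "procedures")
SOURCES = ("cdw", "epic")


def domain_count(flags: dict, source: str) -> int:
    """Number of clinical domains in which this patient has data for one source."""
    return sum(bool(flags.get(f"{source}_{domain}", False)) for domain in DOMAINS)


def is_eligible(flags: dict, min_domains: int = 3) -> bool:
    """True when every source covers at least `min_domains` domains."""
    return all(domain_count(flags, source) >= min_domains for source in SOURCES)


if __name__ == "__main__":
    # Made-up patient: three CDW domains but only two Epic domains.
    patient = {
        "cdw_conditions": True, "cdw_medications": True, "cdw_visits": True,
        "epic_conditions": True, "epic_visits": True,
    }
    print(is_eligible(patient))
```

Missing flags are treated as "no data", so a patient is never made eligible by an absent column.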
Downstream projects (CDW, Epic, Enterprise, Identity, Ingest, BrainHealth, Winship, Nursing) consume the subsample via --target subsample --vars '{"subsample": "omop_subsampling.person_subsample"}'. The cohort_person_filter or equivalent model filters all clinical data to just the 40 patients. Every table in the pipeline is rebuilt for only those individuals.
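As a sketch of how such an invocation might be assembled programmatically: the `--target` and `--vars` values come from the text above, while the `dbt build` subcommand and the helper name `subsample_command` are assumptions (a project may use `dbt run` or a wrapper script instead).

```python
import json
import shlex


def subsample_command(subsample_table: str = "omop_subsampling.person_subsample") -> list:
    """Build the argument list for a subsample run of one dbt project."""
    # json.dumps handles the quoting inside --vars; JSON is also valid YAML.
    dbt_vars = json.dumps({"subsample": subsample_table})
    # "build" is an assumed subcommand; the flags are the ones quoted above.
    return ["dbt", "build", "--target", "subsample", "--vars", dbt_vars]


if __name__ == "__main__":
    # shlex.join adds the shell quoting needed to paste the command into a terminal.
    print(shlex.join(subsample_command()))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether; switching to a disease-specific subsample only changes the `subsample_table` argument.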
Target-based routing
Each dbt project supports multiple targets that control where output lands:
| Target | Schema | Purpose |
|---|---|---|
| prod | project-specific (e.g., omop_etl_cdw) | Production pipeline |
| dev | project-specific dev schema | Development |
| subsample | omop_subsampling | Longitudinal variance testing |
| mock_prod | omop_subsampling | Structure testing with empty tables |
| unit_test | omop_subsampling | Deterministic tests with seed data |
| network_study | omop_network | Athena-only vocab resolution |
The subsample target:
- Collapses all output into the omop_subsampling schema
- Applies an alias prefix (e.g., dbt__cdw__subsample__20260320__) to avoid collisions
- Materializes reference tables (vocab, provider, care_site, location) as views instead of tables
- Filters clinical data through the subsample patient list
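The alias convention can be illustrated with a small helper. This is a sketch that infers the layout from the example prefix above; the `YYYYMMDD` date format and the function name `subsample_alias` are assumptions rather than confirmed project code.

```python
from datetime import date


def subsample_alias(project: str, model: str, run_date: date) -> str:
    """Compose the table name a model receives under the subsample target."""
    # Double underscores separate the fields, as in dbt__cdw__subsample__20260320__.
    return f"dbt__{project}__subsample__{run_date:%Y%m%d}__{model}"


if __name__ == "__main__":
    print(subsample_alias("cdw", "person", date(2026, 3, 20)))
```

Because the project and run date are both part of the name, several projects and several dated runs can coexist in the single omop_subsampling schema without overwriting each other.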
What NOT to do
- Never regenerate a subsample without explicit direction and documented reason
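- Never delete the permanent subsample tables (patient lists and data representation tables); they are permanent reference data. Only dated build artifacts may be dropped, and only after verification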
- Never assume variance is acceptable — every difference between subsample runs must be investigated
- Never use subsamples as disposable test data — they are longitudinal tracking cohorts
Subsample tables in omop_subsampling
Permanent (never delete):
- person_subsample: core 40-patient OMOP subsample
- person_data_representation: data richness flags
- brain_health_subsample: BrainHealth 40-patient subsample
- nursing_cohort: frozen 1M nursing cohort
- Disease-specific *_subsample and *_data_representation tables
Temporary (clean up after runs):
- dbt__<project>__subsample__<date>__<model>: downstream build artifacts
- These are rebuilt on every run and can be safely dropped after verification
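A cleanup script needs to tell the two groups apart before dropping anything. The sketch below classifies table names by the naming rules listed above; the regular expression, the eight-digit date assumption, and the function names are illustrative, and anything the rules do not recognize is reported as unknown rather than treated as droppable.

```python
import re

PERMANENT_NAMES = {
    "person_subsample",
    "person_data_representation",
    "brain_health_subsample",
    "nursing_cohort",
}
PERMANENT_SUFFIXES = ("_subsample", "_data_representation")
# dbt__<project>__subsample__<date>__<model>, assuming an 8-digit date.
ARTIFACT_PATTERN = re.compile(r"^dbt__[a-z0-9_]+?__subsample__\d{8}__.+$")


def classify_table(name: str) -> str:
    """Return 'temporary', 'permanent', or 'unknown' for one table name."""
    if ARTIFACT_PATTERN.match(name):
        return "temporary"
    if name in PERMANENT_NAMES or name.endswith(PERMANENT_SUFFIXES):
        return "permanent"
    return "unknown"


def droppable(names) -> list:
    """Keep only names that are positively identified as build artifacts."""
    return [name for name in names if classify_table(name) == "temporary"]


if __name__ == "__main__":
    tables = [
        "person_subsample",
        "winship_subsample",
        "dbt__cdw__subsample__20260320__person",
        "scratch_table",
    ]
    for table in tables:
        print(table, classify_table(table))
```

Selecting by a positive match on the artifact pattern, instead of "everything that is not permanent", keeps an unexpected table from being dropped by accident.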