
Subsampling ("Canaries")

Subsampling selects a small, fixed set of patients — typically 40 — who are tracked longitudinally across every pipeline run. The same individuals appear in every snapshot, every reload, every environment test. This is not test data. It is a variance detection mechanism.

Like canaries in a coal mine, these patients serve as early-warning sentinels. When the pipeline transforms the same 40 patients and produces different output than the last run, something changed. That change must be found and explained. Subsampling turns invisible pipeline drift into a visible, auditable signal.

Design principles

Immutability. A subsample is frozen at creation. The same 40 person_ids are used indefinitely. Regenerating a subsample breaks longitudinal comparability and is treated as a major event requiring documented justification.

Representativeness. Patients are selected for data richness, not randomness. Each subsample patient must have clinical records across multiple domains (conditions, medications, measurements, visits, procedures) in both the CDW and Epic source systems; for the core person_subsample the threshold is at least three domains in each system. This ensures the subsample exercises the full pipeline: every join, every transformation, every mapping.

Determinism. Patient selection uses xxhash64 ordering on person_id for reproducibility. Given the same input cohort, the same 40 patients are always selected.
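
The determinism property can be illustrated with a minimal sketch. The pipeline itself orders by xxhash64; the Python standard library has no xxhash64, so this sketch substitutes an 8-byte BLAKE2b digest as a stand-in 64-bit hash. The hash values will differ from xxhash64, but the property being shown is the same: the selection depends only on the set of input person_ids, not on their order. The function names and the example cohort are hypothetical.

```python
import hashlib


def hash64(person_id: int) -> int:
    """Stand-in 64-bit hash (BLAKE2b truncated to 8 bytes). NOT xxhash64."""
    digest = hashlib.blake2b(str(person_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_subsample(eligible_person_ids, size=40):
    """Order eligible patients by the hash of person_id and keep the first `size`."""
    unique_ids = set(eligible_person_ids)
    # Tie-break on person_id so the ordering stays total even on a hash collision.
    ranked = sorted(unique_ids, key=lambda pid: (hash64(pid), pid))
    return ranked[:size]


# Hypothetical cohort of 500 eligible person_ids.
cohort = list(range(1000, 1500))
first = select_subsample(cohort)
second = select_subsample(reversed(cohort))  # same cohort, different input order
print(first == second, len(first))
```

Because the ranking key is a pure function of person_id, adding or removing patients from the input cohort is the only way the selected set can change, which is why the input cohort must itself be frozen.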

Variance as signal. If a subsample run produces output that differs from the prior run for the same date range, that variance is not noise — it is a defect signal. Possible causes include:

  • Upstream source data changed (new records, corrections, deletes)
  • Pipeline logic changed (new transforms, bug fixes, mapping updates)
  • Vocabulary updates (concept mappings shifted between loads)
  • Identity resolution changes (person_id reassignment, merges)

Each cause has different implications. The subsample makes these visible before they reach production.
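
One way to surface variance between two runs is to fingerprint each row and compare by primary key. The sketch below is an assumption about how such a comparison could be done, not a description of the project's actual tooling; the function names, the key column, and the example rows (including the concept IDs) are made up for illustration.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Hash a row's canonical JSON form so any column change alters the fingerprint."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_runs(prior_rows, current_rows, key):
    """Report which primary keys were added, removed, or changed between two runs."""
    prior = {r[key]: row_fingerprint(r) for r in prior_rows}
    current = {r[key]: row_fingerprint(r) for r in current_rows}
    shared = prior.keys() & current.keys()
    return {
        "added": sorted(current.keys() - prior.keys()),
        "removed": sorted(prior.keys() - current.keys()),
        "changed": sorted(k for k in shared if prior[k] != current[k]),
    }


# Illustrative rows only; the IDs and concept values are invented.
prior_run = [
    {"condition_occurrence_id": 1, "person_id": 10, "condition_concept_id": 111},
    {"condition_occurrence_id": 2, "person_id": 10, "condition_concept_id": 222},
]
current_run = [
    {"condition_occurrence_id": 1, "person_id": 10, "condition_concept_id": 333},
    {"condition_occurrence_id": 3, "person_id": 11, "condition_concept_id": 222},
]
result = diff_runs(prior_run, current_run, key="condition_occurrence_id")
print(result)
```

A diff like this only says that something changed. Attributing the change to one of the causes listed above (source data, pipeline logic, vocabulary, identity resolution) still requires investigation.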

Architecture

The subsampling system spans two layers:

EmoryOmopSubsampling generates the frozen patient lists:

  • person_data_representation — flags indicating which clinical domains each patient has data in, across CDW and Epic
  • person_subsample — the core 40-patient set (≥3 domains in both CDW and Epic)
  • Disease-specific subsamples: brain_health_subsample, winship_subsample, nursing_cohort
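
The eligibility rule for person_subsample (at least three domains in both CDW and Epic) can be sketched as a check over the representation flags. The nested-dictionary layout below is a hypothetical stand-in for person_data_representation, whose actual column names are not given here; the domain list is taken from the design principles above.

```python
DOMAINS = ("conditions", "medications", "measurements", "visits", "procedures")
SOURCES = ("cdw", "epic")


def domain_count(flags: dict, source: str) -> int:
    """Count the domains in which a patient has data for one source system."""
    source_flags = flags.get(source, {})
    return sum(1 for domain in DOMAINS if source_flags.get(domain, False))


def is_eligible(flags: dict, min_domains: int = 3) -> bool:
    """A patient qualifies only if every source system meets the domain threshold."""
    return all(domain_count(flags, source) >= min_domains for source in SOURCES)


# Hypothetical flag sets for two patients.
rich_patient = {
    "cdw": {"conditions": True, "medications": True, "visits": True},
    "epic": {"conditions": True, "measurements": True, "procedures": True, "visits": True},
}
sparse_patient = {
    "cdw": {"conditions": True, "medications": True, "visits": True},
    "epic": {"conditions": True, "visits": True},  # only two Epic domains
}
print(is_eligible(rich_patient), is_eligible(sparse_patient))
```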

Downstream projects (CDW, Epic, Enterprise, Identity, Ingest, BrainHealth, Winship, Nursing) consume the subsample by passing --target subsample --vars '{"subsample": "omop_subsampling.person_subsample"}' to dbt. The cohort_person_filter model, or its equivalent in each project, filters all clinical data to just the 40 patients. Every table in the pipeline is rebuilt for only those individuals.
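
When the invocation is scripted, building the --vars payload with a JSON encoder avoids hand-quoting mistakes. The sketch below only assembles and prints the command; it does not run dbt. The choice of the build subcommand is an assumption (the text above specifies only the --target and --vars flags), and the function name is hypothetical.

```python
import json
import shlex


def subsample_command(subsample_table: str, subcommand: str = "build") -> list:
    """Assemble a dbt invocation for the subsample target.

    `subcommand` is an assumption; use whichever dbt subcommand the project uses.
    """
    dbt_vars = json.dumps({"subsample": subsample_table})
    return ["dbt", subcommand, "--target", "subsample", "--vars", dbt_vars]


cmd = subsample_command("omop_subsampling.person_subsample")
# shlex.join renders the argument list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

Passing the list form to a process runner rather than a shell string keeps the JSON payload intact regardless of shell quoting rules.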

Target-based routing

Each dbt project supports multiple targets that control where output lands:

Target        | Schema                                     | Purpose
--------------|--------------------------------------------|-------------------------------------
prod          | project-specific (e.g., omop_etl_cdw)      | Production pipeline
dev           | project-specific dev schema                | Development
subsample     | omop_subsampling                           | Longitudinal variance testing
mock_prod     | omop_subsampling                           | Structure testing with empty tables
unit_test     | omop_subsampling                           | Deterministic tests with seed data
network_study | omop_network                               | Athena-only vocab resolution

The subsample target:

  • Collapses all output into the omop_subsampling schema
  • Applies an alias prefix (e.g., dbt__cdw__subsample__20260320__) to avoid collisions
  • Materializes reference tables (vocab, provider, care_site, location) as views instead of tables
  • Filters clinical data through the subsample patient list
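
The alias prefix follows the pattern dbt__<project>__subsample__<date>__<model>. A minimal sketch of how such a name is composed is shown below. In a dbt project this logic would normally live in a Jinja macro rather than Python; Python is used here only to illustrate the naming scheme. The YYYYMMDD date format is inferred from the example prefix above, and the function and model names are hypothetical.

```python
from datetime import date


def subsample_alias(project: str, run_date: date, model: str) -> str:
    """Compose the collision-avoiding table name used under the subsample target."""
    # Date format assumed to be YYYYMMDD, matching the 20260320 example.
    return f"dbt__{project}__subsample__{run_date:%Y%m%d}__{model}"


# Hypothetical model name; the project and date match the example prefix.
alias = subsample_alias("cdw", date(2026, 3, 20), "condition_occurrence")
print(alias)
```

Encoding the project and run date in the name is what lets several projects and several runs share the single omop_subsampling schema without overwriting each other.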

What NOT to do

  • Never regenerate a subsample without explicit direction and documented reason
  • Never delete subsample tables — they are permanent reference data
  • Never assume variance is acceptable — every difference between subsample runs must be investigated
  • Never use subsamples as disposable test data — they are longitudinal tracking cohorts

Subsample tables in omop_subsampling

Permanent (never delete):

  • person_subsample — core 40-patient OMOP subsample
  • person_data_representation — data richness flags
  • brain_health_subsample — BrainHealth 40-patient subsample
  • nursing_cohort — frozen 1M nursing cohort
  • Disease-specific *_subsample and *_data_representation tables

Temporary (clean up after runs):

  • dbt__<project>__subsample__<date>__<model> — downstream build artifacts
  • These are rebuilt on every run and can be safely dropped after verification
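
A cleanup script can protect the permanent tables by dropping only names that match the temporary artifact pattern, and leaving everything else alone. The sketch below is one possible safeguard under that assumption; the regular expression is inferred from the dbt__<project>__subsample__<date>__<model> pattern, and the function name and the example artifact name are hypothetical.

```python
import re

# Matches only build artifacts; an 8-digit date is assumed (e.g. 20260320).
TEMP_ARTIFACT = re.compile(
    r"dbt__(?P<project>\w+?)__subsample__(?P<date>\d{8})__(?P<model>\w+)"
)


def is_droppable(table_name: str) -> bool:
    """Allow-list approach: a table may be dropped only if it is a build artifact."""
    return TEMP_ARTIFACT.fullmatch(table_name) is not None


tables = [
    "person_subsample",
    "person_data_representation",
    "brain_health_subsample",
    "nursing_cohort",
    "dbt__cdw__subsample__20260320__condition_occurrence",
]
droppable = [name for name in tables if is_droppable(name)]
print(droppable)
```

Using an allow-list of droppable names, rather than a deny-list of permanent ones, means a newly added permanent table is protected by default.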