v1.0.0 — March 2026
271 commits | 3 contributors | CDM v5.4 | Vocabulary v5.0 (Feb 2025)
What's New for Researchers
Stable Patient Identities
Patient identifiers (person_id) are now stable across data loads. Previously, person IDs could shift between ETL runs as new data arrived. The new identity stabilization system uses graph-based clustering to resolve patients across CDW and Epic sources, ensuring that a person_id assigned today remains the same in future releases.
What this means for you:
- Longitudinal studies can safely reference
person_idacross data refreshes - Patient cohorts built in ATLAS or SQL will not silently break between loads
- Cross-source patient matching (CDW + Epic) is now systematically resolved rather than heuristic
HIPAA-Compliant De-identification
Location and care site data now follows Safe Harbor de-identification standards:
- Patient addresses are reduced to ZIP3 (first 3 digits only)
- Care site names and source values are masked
- Facility addresses are preserved at full granularity (publicly available information)
- Latitude/longitude values are removed for person-linked locations
- International addresses receive additional suppression
Improved Data Quality Infrastructure
- Cross-project subsampling — a new testing paradigm lets the team validate ETL changes against representative patient subsets before full production runs
- Safety guards prevent accidental subsample execution against production data
- Mock production mode enables zero-row validation of pipeline logic across all dbt projects
Vocabulary Updates
- Vocabulary tables reloaded with v5.0 (February 2025)
- CPT4 vocabulary jar processing completed
source_to_concept_maptable structure updated with additional fields- Athena partitioning switched from date to year/month for improved query performance
ETL Fixes
- Corrected Epic identity type mapping (14 → 914)
- Provider model refactored with COALESCE-based merging (Epic priority)
- Fixed race and gender concept ID test conditions across projects
- Resolved duplicate person records linked to location assignments
- Increased
address_1field to 255 characters
Technical Changelog
Patient Identity Stabilization (~80 commits)
New dbt project: EmoryPatientIdentityStabilization
A ground-up identity resolution system built on graph-based clustering. Resolves patient identities across CDW and Epic sources using connected components analysis.
Architecture:
- Bronze layer — raw intake of identity evidence from CDW and Epic
- Silver layer — change detection, edge construction, delta event tracking
- Gold layer — resolved person identifiers, stable across loads
- Clustering layer — Python connected components algorithm against Redshift/Athena
Key components:
identity_id_delta_events— tracks identity changes over timeepic_pat_merge_events/epic_pat_identity_events— Epic-specific identity event modelsrun_connected_components.py— graph clustering pipelineidentity_graph_clustering.py— core clustering algorithm- Fellegi-Sunter match weight scoring for linkage quality assessment
- SSID tracer diagnostic tool for step-by-step identity tracing
- Evidence intake pipeline with QC, conflict detection, and clustering integration
- Merge/unmerge tracking and cluster stability monitoring
Testing:
- Comprehensive seed-based unit testing with
get_sourcemacro for test vs. production routing - Test cases for AUTHORITY_WINS, re-attach events, merge evidence scenarios
run_unit_test_pipeline.pyorchestrator
Documentation: 20-section project documentation covering design principles, architecture, glossary, operational procedures, and known limitations.
Versioned internally: v0.3.0 → v0.4.0 → v0.5.0 (no git tags)
Patient Ingest Pipeline (~15 commits)
New dbt project: EmoryPatientIngest
Centralizes patient demographic ingestion, replacing the legacy deident_driver approach.
identity_person_mapmodel backed by identity stabilization gold layerpatient_exclusionsmodel filtering new Epic IDs that would reuse existing person_ids- Deterministic person_id assignment using
GREATEST(MAX(person_id))for incremental runs - Deduplication and quarantine handling for duplicate identity_person_map rows
lkp_master_patient_rebuiltmodel in CDW, integrated across Epic, PatientIngest, Identity Stabilization, and Enterprise projects
HIPAA De-identification (~35 commits)
Stored procedures for location and care site de-identification.
Location de-identification (sp_location_jz.sql):
- ZIP3 suppression for patient (non-facility) addresses
- PII field removal (address lines, city, county)
- Facility-level vs. non-facility differentiation based on care_site match
- Latitude/longitude set to -88 for person-linked locations where original values were not null
- Non-USA zip suppression with special handling
- Schema refactoring to
adminschema with parameterizedcdmDatabaseSchema
Care site masking (sp_care_site_jz.sql):
care_site_name,care_site_source_value,place_of_service_source_valuemasked with**Emory Masked**
QC (jz_qc_explore.sql):
- 12-validation-query test suite covering zip suppression, facility matching, anomaly detection
Cross-Project Subsampling (~20 commits)
New dbt project: EmoryOmopSubsampling
subsample_filtermacro added to all dbt projects for conditional data filteringassert_subsample_targetsafety guard preventing accidental subsample execution against productiongenerate_schema_namemacro for dynamic schema naming (subsample vs. production)mock_prodvalidation mode added to all projects for zero-row production verificationempty_artifactsprofile target added to CDW, Identity Stabilization, and PatientIngest- Cross-project subsample pipeline orchestrator script
CI/CD & Infrastructure (~20 commits)
- dbt docs deployment — GitHub Actions workflow for building and deploying dbt documentation to GitHub Pages (multi-project support)
- Airflow sync — GitHub Actions workflow to sync dbt projects to the Airflow server
- Export-to-parquet pipeline — cross-account S3 and Redshift transfer for data export
prod_ec2_roleconfiguration added to dbt profiles for production deployment via EC2 IAM roles- Dependabot security fix upgrading transitive dependencies (11 vulnerabilities resolved)
Vocabulary Ingestion (~5 commits)
New dbt project: EmoryOMOPVocabulariesIngest
- dbt project for vocabulary table loading from S3
- Stored procedure for CSV → Redshift loading
drug_strength.sql— datatype conversion (BIGINT), amount/numerator/denominator handlingsource_to_concept_map— updated table structure with additional fields- Partitioning changed from date to year/month for Athena scale
ETL Bug Fixes & Data Quality (~15 commits)
- Epic identity type corrected from 14 to 914; excluded CDW facility MRNs from clustering
- Provider model refactored with COALESCE-based merging (Epic priority)
race_concept_idandgender_concept_idtest conditions fixed across multiple projects- CDW stage model: added
DISTINCTforqualifier_newinarray_agg - Discrepancy/impact models:
WHERE EXISTSinstead ofINNER JOIN, dedup withDISTINCT, scoped to rebuilt patient set location.address_1increased to 255 characters; reordered state/zip assignments
Documentation & Project Management (~20 commits)
- DevOps philosophy documentation: hybrid DataOps framework, observability, SPC, team workflow guide
- Identity Stabilization: 20-section project documentation
- Glossary expansion with comprehensive internal terminology
- LLM usage disclosure (Anthropic Claude transparency)
- GitHub project board tooling:
set_pr_fields.sh,project_field_ids.csv - Repository labels export (
repo_labels.csv)
QA Investigation Scripts
- Issue #320 — diagnostic queries for duplicate person_ids in Epic, filtering merged-away PAT records
- Issue #347 — Fellegi-Sunter scoring scripts, subsample validation, evidence pipeline match weights, identity_person_map CDW demographics join fix
- Issue #262 — lkp_master_patient model descriptions, duplicate patient_id fixes, discrepancy/impact model scoping
- Issue #264 — EmoryOmopCDW test infrastructure and pipeline configuration updates
Contributors
| Contributor | Commits | Focus Areas |
|---|---|---|
| Daniel Smith | 145 | Identity stabilization, infrastructure, subsampling, CI/CD |
| Jorge Marquez | 75 | ETL pipelines, stored procedures, vocabulary |
| Xueqiong Zhang | 51 | Location/care site de-identification, QC |