Custom Vocabulary Strategy
Emory's approach to custom vocabulary goes beyond the minimal OHDSI guidance for handling unmapped source codes. This page documents what we do, why each piece exists, and how the parts fit together.
The Two Standard Approaches to Unmapped Source Codes
OHDSI provides two mechanisms for handling source codes that don't resolve to a standard concept in Athena:
Source-to-Concept Map (STCM)
The source_to_concept_map table maps source codes directly to existing standard concepts. It's the lightest-weight option — no new concepts are created, and downstream tools don't need to know the mapping exists.
| Strength | Limitation |
|---|---|
| Zero vocabulary modification | Only works when an appropriate standard concept already exists in Athena |
| Easy stakeholder adoption — familiar CSV-based workflow | Not visible to OHDSI tools that navigate concept and concept_relationship |
| Useful for local code system aliasing | Cannot express hierarchical relationships or multi-target mappings |
2-Billionaire Custom Concepts
OHDSI reserves concept IDs ≥ 2,000,000,000 for site-local ("2-billionaire") concepts. Institutions create new rows in the concept table with IDs in this range to represent data elements that have no Athena equivalent.
The standard OHDSI guidance for 2-billionaire concepts is conservative:
- Custom concepts should be non-standard (
standard_concept = NULL) - Custom concepts should not appear in
concept_ancestor - Custom concepts should be mapped to existing standard concepts via
concept_relationship, not used as resolution targets themselves
Why Emory Goes Further
We follow the 2-billionaire protocol for ID assignment but diverge from the conservative guidance on standard status and hierarchy participation. The reason is straightforward: OHDSI tooling operates against the concept/concept_relationship model, not STCM.
OHDSI tools don't see STCM
ATLAS cohort definitions, CohortDiagnostics, CohortExplorer, PheValuator, and most OHDSI R packages build inclusion criteria from concept and concept_relationship. They navigate hierarchies through concept_ancestor. They do not query source_to_concept_map.
If a custom concept exists only as a non-standard entry in concept — mapped to a standard concept but never as a standard concept — it cannot be:
- Selected as a cohort entry event in ATLAS
- Found via descendant navigation in
concept_ancestor - Used as a target in phenotype definitions
- Resolved as a standard concept during ETL concept mapping
For Emory researchers using OHDSI tools locally, non-standard custom concepts are effectively invisible in the places that matter.
What CVB does differently
The Custom Vocabulary Builder promotes 2-billionaire concepts to Standard (S) status and inserts them into concept_ancestor with appropriate parentage. This means:
- Custom concepts are first-class citizens in ATLAS and all OHDSI tooling
- Hierarchy navigation works — a custom flowsheet concept descends from the correct SNOMED parent
- Cohort definitions can include custom concepts directly
- ETL concept resolution can target custom concepts as standard endpoints
This is an intentional, traceable divergence
Promoting custom concepts to Standard status is not an OHDSI-endorsed pattern. We mitigate this through cvb_provenance traceability and network study bifurcation — custom concepts never leak into multi-site study outputs.
CVB destandardization
CVB can also destandardize existing Athena concepts — changing standard_concept from 'S' to NULL — when a more clinically specific custom concept should replace a broad Athena concept as the resolution target.
This is tracked via cvb_provenance = 'override:<VOCAB_ID>' and is recoverable through the vocab_staging schema, which preserves unmodified Athena state. See Network Study Bifurcation for how this interacts with multi-site studies.
STCM as a Compatibility Layer
CVB populates source_to_concept_map in addition to the concept/concept_relationship model. This is not an alternative resolution path — it's a compatibility layer for downstream consumers whose pipelines resolve via STCM rather than C/CR.
Not every consumer has adopted the concept/concept_relationship resolution pattern. Some downstream ETL pipelines, research workflows, and legacy tooling resolve source codes by joining against source_to_concept_map directly. If we only wrote to C/CR, those consumers would get no resolution for CVB-mapped source codes — they'd be left out in the cold during their pipeline runs.
By populating both:
- C/CR consumers get full resolution with Standard concepts, hierarchy navigation, and OHDSI tool visibility
- STCM consumers get working resolution without needing to change their pipeline patterns
- We encourage STCM consumers to adopt C/CR over time, since OHDSI tooling (ATLAS, CohortDiagnostics, phenotype libraries) is built against the C/CR model and doesn't typically query STCM
The lift for populating STCM is low — CVB already knows the source-to-target mappings — so there's no reason not to serve both consumer patterns.
flowchart TD
CVB["CVB pipeline"]
subgraph writes ["CVB writes to both"]
direction LR
CR["concept / concept_relationship<br/><em>Standard concepts + hierarchy</em>"]
STCM["source_to_concept_map<br/><em>Compatibility layer</em>"]
end
subgraph consumers ["Downstream consumers"]
direction LR
OHDSI["OHDSI tools (ATLAS, etc.)<br/>C/CR resolution"]
LEGACY["Legacy pipelines<br/>STCM resolution"]
end
CVB --> CR
CVB --> STCM
CR --> OHDSI
STCM --> LEGACY
LEGACY -.->|"Encouraged migration path"| OHDSI
style CR fill:#e8f5e9,stroke:#2e7d32
style STCM fill:#fff3e0,stroke:#e65100
style OHDSI fill:#e8f5e9,stroke:#2e7d32
style LEGACY fill:#fff3e0,stroke:#e65100
CVB populates both C/CR and STCM. OHDSI tools consume C/CR; legacy pipelines consume STCM. Over time, consumers are encouraged to migrate to C/CR for full tooling compatibility.
Vocabulary Metadata Tables
CVB writes to two metadata tables that are not part of the standard OMOP CDM v5.4 schema:
mapping_metadata— provenance, authorship, and review status for individual concept mappingsconcept_relationship_metadata— audit trail for relationship modifications (additions, destandardizations)
These tables are used by the OHDSI Vocabulary Working Group internally but are not typically adopted by CDM implementers. We adopt them because:
- Traceability: Every CVB modification has a documented author, review date, justification, and confidence score
- Auditability: Regulatory and compliance workflows can trace any custom concept back to its origin
- Reversibility: Metadata records enable targeted rollback of specific vocabulary changes without rebuilding from scratch
CDM v5.5 and metadata tables
Parts of the vocabulary metadata model are expected to be adopted into CDM v5.5. Our early adoption means we will be positioned for that migration, but the timeline for v5.5 is uncertain and we are not waiting on it. The cvb_provenance column on core vocab tables provides immediate value independent of CDM version.
How the Pieces Fit Together
| Component | Role | Who sees it |
|---|---|---|
| STCM | Compatibility layer for consumers resolving via source_to_concept_map |
Legacy ETL pipelines |
| 2-billionaire concepts | Standard concepts in concept + concept_ancestor (primary resolution path) |
ETL, ATLAS, all OHDSI tools |
cvb_provenance |
Tracks origin of every vocab modification | Auditors, vocabulary team |
| Metadata tables | Full provenance for mappings and relationships | Vocabulary team, compliance |
vocab_staging |
Pristine Athena for network studies | Network study ETL builds |
| Compound targets | Separate materializations per resolution mode | dbt pipeline |
For details on how production and network study builds coexist, see Network Study Bifurcation.
Guidance for Stakeholders
If your pipeline resolves via STCM → CVB-mapped concepts are already available in source_to_concept_map. No changes needed. However, we encourage migrating to C/CR resolution over time — OHDSI tooling is built against that model.
If you need new concepts for local research → contribute a vocabulary through CVB using the contributing vocabularies workflow. Your concepts will be Standard, hierarchically placed, and fully integrated into the local OHDSI toolchain. CVB will populate both C/CR and STCM automatically.
If you have unmapped source codes → submit a mapping request and the vocabulary team will triage and build it into the next release.
If you're running a network study → your builds will use the vocab_staging schema automatically via compound targets. No action needed — CVB modifications are excluded by design.