v1.1.0 — March 2026
NLP Infrastructure | Brain Health Pilot | CDM v5.4 | Vocabulary v5.0 (Feb 2025)
What's New for Researchers
Clinical Notes in OMOP
For the first time, Emory Enterprise OMOP includes structured NLP output from clinical notes. The flat design of the standard OMOP note_nlp table was insufficient for modern NLP workflows, so we developed a 4-layer span-based architecture that captures full pipeline provenance and assertion context and keeps NLP output clearly separated from discrete EHR data.
What this means for you:
- NLP-extracted conditions, observations, and drug mentions are available in _derived tables (condition_occurrence_derived, observation_derived, drug_exposure_derived, etc.)
- Every extracted finding traces back to the exact text span, source note, NLP pipeline version, and git commit that produced it
- Negated findings ("no evidence of hemorrhage") are filtered out; only affirmed, possible, and historical findings reach the _derived tables
- An _assertion_value column indicates whether a finding is affirmed (NULL), possible, or historical
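A minimal sketch of how the columns in the list above might be queried, using an in-memory SQLite table as a stand-in for condition_occurrence_derived. The rows and the condition_concept_id values are invented for illustration; only the column names _assertion_value and _note_span_id and the type concept 32858 come from this release.

```python
import sqlite3

# In-memory stand-in for condition_occurrence_derived; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE condition_occurrence_derived (
        person_id INTEGER,
        condition_concept_id INTEGER,
        condition_type_concept_id INTEGER,
        _note_span_id INTEGER,
        _assertion_value TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO condition_occurrence_derived VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1001, 32858, 10, None),          # affirmed is stored as NULL
        (1, 1002, 32858, 11, "possible"),
        (2, 1001, 32858, 12, "historical"),
    ],
)


def affirmed_only(connection):
    """Return (person_id, _note_span_id) for affirmed findings."""
    return connection.execute(
        """
        SELECT person_id, _note_span_id
        FROM condition_occurrence_derived
        WHERE _assertion_value IS NULL
        ORDER BY _note_span_id
        """
    ).fetchall()


def counts_by_assertion(connection):
    """Count rows per assertion value, reporting NULL as 'affirmed'."""
    rows = connection.execute(
        """
        SELECT COALESCE(_assertion_value, 'affirmed'), COUNT(*)
        FROM condition_occurrence_derived
        GROUP BY COALESCE(_assertion_value, 'affirmed')
        ORDER BY 1
        """
    ).fetchall()
    return dict(rows)
```

Because affirmed findings are stored as NULL, an equality filter on the string 'affirmed' would match nothing; the sketch uses IS NULL instead.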
Brain Health Pilot
The first NLP pipeline processes 2,488 imaging notes (CT Head, MR Brain) from 850 Brain Health (GINDR) patients, extracting clinical findings across 79 entity categories including:
| Domain | Examples | Count |
|---|---|---|
| Conditions | cerebral atrophy, hydrocephalus, small vessel disease, chronic infarct | 4,928 |
| Observations | mass effect, midline shift, focal slowing, calcification | 535 |
| Drugs | lecanemab, donanemab, donepezil, memantine | 31 |
| Measurements | MoCA | 1 |
Assertion Detection
The pipeline classifies extracted entities by clinical context using the ConText algorithm (Harkema et al., 2009), aligned with the i2b2 2010 assertion categories:
| Assertion | Count | Meaning |
|---|---|---|
| Negated | 3,916 | Finding is absent; excluded from _derived tables |
| Affirmed | 3,648 | Finding is present; _assertion_value = NULL |
| Possible | 994 | Uncertain ("cannot exclude", "suggestive of"); _assertion_value = 'possible' |
| Historical | 853 | Past finding ("history of", "prior"); _assertion_value = 'historical' |
| Hypothetical | 68 | Future/conditional ("monitor for", "evaluate for"); excluded from _derived tables |
| Family | 6 | Attributed to family member; excluded from _derived tables |
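The routing implied by the table above can be sketched as a small lookup. This is an illustration, not the pipeline's actual code: the function name and lowercase labels are hypothetical, and treating hypothetical findings as excluded is inferred from the statement that only affirmed, possible, and historical findings reach the _derived tables.

```python
# Assertion categories that reach the _derived tables, mapped to the value
# written to _assertion_value. Lowercase keys are a choice made for this sketch.
DERIVED_ASSERTION_VALUE = {
    "affirmed": None,            # stored as NULL
    "possible": "possible",
    "historical": "historical",
}

# Categories that do not reach the _derived tables. "hypothetical" is inferred
# from the release notes rather than stated in the table.
EXCLUDED = {"negated", "hypothetical", "family"}


def route_assertion(assertion):
    """Return (include_in_derived, assertion_value) for one assertion label."""
    key = assertion.lower()
    if key in DERIVED_ASSERTION_VALUE:
        return True, DERIVED_ASSERTION_VALUE[key]
    if key in EXCLUDED:
        return False, None
    raise ValueError(f"unknown assertion category: {assertion!r}")
```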
Context Snippets
The note_span_snippet table provides a 50-character context window around each extracted entity, enabling rapid validation of NLP output without accessing full note text. This is the only NLP table containing PHI — all other NLP infrastructure tables are PHI-free.
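As a rough illustration of how such a snippet could be cut from note text, the sketch below assumes the window means up to 50 characters on each side of the entity; the release notes do not say whether 50 is per side or the total width. The function and parameter names are hypothetical.

```python
def context_snippet(text, start, end, window=50):
    """Return the entity plus up to `window` characters on each side.

    `start` and `end` are character offsets of the entity (end exclusive).
    The per-side interpretation of the window is an assumption.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("span offsets fall outside the note text")
    left = max(0, start - window)          # clamp at the start of the note
    right = min(len(text), end + window)   # clamp at the end of the note
    return text[left:right]
```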
Annotation Review
The nlp_annotation_review table captures human judgments on NLP-extracted spans. Initial reviews identified 6 false-positive family assertions: the ConText algorithm misclassified patient seizure activity that occurred in the presence of a family member during EEG monitoring as "family history."
Architecture
4-Layer NLP Infrastructure
Layer 1: NLP Process Metadata (nlp_system, pipeline, component, nlp_execution)
Layer 2: NLP Output (note_span, note_span_concept, note_span_assertion, note_span_snippet)
Layer 3: Intermediate Translation (note_nlp_modifier)
Layer 4: _derived Tables (condition_occurrence_derived, observation_derived, etc.)
Full architecture specification: NLP Infrastructure
Key Design Decisions
- note_span_assertion replaces the original note_span_relationship: named after the i2b2 assertion classification standard. See Glossary.
- concept_code + vocabulary_id on note_span_concept instead of concept_id: NLP implementers provide vocabulary codes they know (SNOMED, RxNorm, LOINC); dbt resolves to OMOP concept_id downstream via vocabulary join
- domain_hint for unmapped concepts: when no vocabulary mapping exists, NLP implementers provide a domain hint so findings still route to the correct _derived table
- Source provenance (source, source_uri, source_version) on nlp_system, pipeline, and component: supports GitHub, HuggingFace, PyPI, Docker, and S3 artifact references
- _derived tables use _ prefix for non-standard columns (_note_span_id, _execution_id, _source_primary_key, _assertion_value) to distinguish from OMOP CDM standard columns
- condition_type_concept_id = 32858 (NLP) marks all derived rows as NLP-sourced
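The concept_code + vocabulary_id resolution and the domain_hint fallback can be sketched in plain Python. In the release this happens in dbt through a vocabulary join, so the code below is only a stand-in for that logic: the lookup entries are made-up placeholders rather than real vocabulary rows, the function name is hypothetical, and returning concept_id 0 for unmapped concepts follows the usual OMOP convention rather than anything stated here.

```python
# Stand-in for the OMOP concept table:
# (vocabulary_id, concept_code) -> (concept_id, domain_id).
# Placeholder entries only; these are not real vocabulary rows.
CONCEPT_LOOKUP = {
    ("SNOMED", "CODE-A"): (900001, "Condition"),
    ("RxNorm", "CODE-B"): (900002, "Drug"),
}

# Domain -> _derived table, using the table names from this release.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence_derived",
    "Observation": "observation_derived",
    "Drug": "drug_exposure_derived",
    "Measurement": "measurement_derived",
    "Procedure": "procedure_occurrence_derived",
}


def resolve(vocabulary_id, concept_code, domain_hint=None):
    """Return (concept_id, target_table), using domain_hint when unmapped."""
    hit = CONCEPT_LOOKUP.get((vocabulary_id, concept_code))
    if hit is not None:
        concept_id, domain_id = hit
        return concept_id, DOMAIN_TO_TABLE[domain_id]
    if domain_hint in DOMAIN_TO_TABLE:
        # No vocabulary mapping: concept_id 0 (assumed), routed by the hint.
        return 0, DOMAIN_TO_TABLE[domain_hint]
    return 0, None  # unmapped and no usable hint: nowhere to route
```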
FastAPI Ingestion Service
The NLP ingestion boundary is a FastAPI service (Docker, port 8006) that accepts NLP output from any tool (medspaCy, MedTagger, Claude API, custom models) and writes to a DuckDB bronze layer. This decouples "how you do NLP" from "how NLP results enter OMOP."
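To make the idea of a tool-agnostic boundary concrete, the sketch below shows what a per-span record might look like before it is posted to the service. The request schema is not documented in these notes, so the class, its field names (beyond concept_code, vocabulary_id, and domain_hint), the validation rules, and the "spans" envelope are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SpanPayload:
    """Hypothetical tool-agnostic record for one extracted span."""
    note_id: int
    start_char: int
    end_char: int
    text: str
    concept_code: Optional[str] = None
    vocabulary_id: Optional[str] = None
    domain_hint: Optional[str] = None
    assertion: Optional[str] = None

    def validate(self):
        """Raise ValueError if the record is internally inconsistent."""
        if self.start_char < 0 or self.end_char <= self.start_char:
            raise ValueError("offsets must satisfy 0 <= start_char < end_char")
        if self.end_char - self.start_char != len(self.text):
            raise ValueError("text length must match the span offsets")
        if (self.concept_code is None) != (self.vocabulary_id is None):
            raise ValueError("concept_code and vocabulary_id go together")
        if self.concept_code is None and self.domain_hint is None:
            raise ValueError("unmapped spans need a domain_hint")


def to_request_body(spans):
    """Validate every span and serialize the batch to a JSON string."""
    for span in spans:
        span.validate()
    return json.dumps({"spans": [asdict(span) for span in spans]})
```

Any NLP tool that can emit character offsets, the matched text, and either a vocabulary code or a domain hint could produce records of this shape, which is the decoupling the service is meant to provide.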
Pipeline Details
medspaCy Brain Health v1
| Field | Value |
|---|---|
| NLP System | medspacy v1 |
| Pipeline | brain_health_v1 |
| Git repo | github.com/EmoryDataSolutions/emory_omop_internal_nlp |
| Git commit | 3052d147910938b73bb5ed3b7f9a481ff359c42f |
| medspaCy | 1.3.1 (PyPI) |
| spaCy | 3.8.13 (PyPI) |
| Rules spec | 180+ target rules, 117 context rules, 24 section rules |
| Literature references | 21 (ConText, i2b2, Fazekas, ARIA, RadLex, ACR/RSNA) |
Note Extraction
- Source: Emory Healthcare EHAP (Clarity) via Athena federated query
- Cohort: 5,000 random patients from GINDR cohort_person_filter
- Note types: CT Head, MR Brain, PET Brain/Amyloid, EEG (via V_IMG_STUDY)
- Result: 2,488 notes from 850 patients (17% hit rate: 850 of the 5,000 sampled patients)
- Extraction method: Batched IN (...) clauses (50 per batch, 100 batches) due to SQL Server 2,100 parameter limit on federated queries
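The batching described above can be sketched as follows. The helper name and the "?" placeholder style are assumptions; the actual extraction code and its parameter syntax are not shown in these notes.

```python
def batched_in_clauses(ids, batch_size=50):
    """Yield (sql_fragment, params) pairs, one per batch of IDs.

    Each fragment is an "IN (?, ?, ...)" clause with one placeholder per ID,
    so the number of bound parameters per query never exceeds batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(ids), batch_size):
        chunk = ids[i:i + batch_size]
        placeholders = ", ".join("?" for _ in chunk)
        yield f"IN ({placeholders})", list(chunk)


# 5,000 patient IDs in batches of 50 gives the 100 batches noted above.
batches = list(batched_in_clauses(list(range(5000)), batch_size=50))
```

With 50 parameters per query, each batch stays far below the 2,100 parameter limit.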
Data Summary
| Table | Schema | Rows |
|---|---|---|
| note | omop_brain_health_ent | 2,473 |
| note_span | omop_brain_health_ent | 9,411 |
| note_span_concept | omop_brain_health_ent | 9,411 |
| note_span_assertion | omop_brain_health_ent | 5,837 |
| note_span_snippet | omop_brain_health_ent | 9,411 |
| note_nlp_modifier | omop_brain_health_ent | 5,495 |
| condition_occurrence_derived | omop_brain_health_ent | 4,928 |
| observation_derived | omop_brain_health_ent | 535 |
| drug_exposure_derived | omop_brain_health_ent | 31 |
| measurement_derived | omop_brain_health_ent | 1 |
| procedure_occurrence_derived | omop_brain_health_ent | 0 |
| nlp_annotation_review | omop_brain_health_ent | 6 |
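The row counts above can be cross-checked against the assertion counts reported under Assertion Detection. The sketch below only restates numbers from this page and adds them up. Reading the note_span_assertion count as "one row per non-affirmed assertion" is an inference from the arithmetic, not something the release states.

```python
# Assertion counts from the Assertion Detection table.
ASSERTION_COUNTS = {
    "negated": 3916,
    "affirmed": 3648,
    "possible": 994,
    "historical": 853,
    "hypothetical": 68,
    "family": 6,
}

# Row counts from the Data Summary table.
ROW_COUNTS = {
    "note_span_assertion": 5837,
    "note_nlp_modifier": 5495,
    "condition_occurrence_derived": 4928,
    "observation_derived": 535,
    "drug_exposure_derived": 31,
    "measurement_derived": 1,
    "procedure_occurrence_derived": 0,
}

# Findings that reach the _derived tables: affirmed, possible, historical.
reaching_derived = sum(
    ASSERTION_COUNTS[k] for k in ("affirmed", "possible", "historical")
)

# Total rows across all _derived tables.
derived_total = sum(
    count for table, count in ROW_COUNTS.items() if table.endswith("_derived")
)

# Every assertion other than affirmed.
non_affirmed = sum(ASSERTION_COUNTS.values()) - ASSERTION_COUNTS["affirmed"]
```

Both reaching_derived and derived_total equal the note_nlp_modifier row count, and non_affirmed equals the note_span_assertion row count, so the published figures are internally consistent.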
Known Limitations
- Pilot scope: 5,000 random patients from GINDR cohort, not the full population
- Imaging notes only: extraction uses V_IMG_STUDY join; primary care, EEG procedure notes, and lab notes not yet included
- 58% of notes lack dates: V_IMG_STUDY.begin_exam_dttm is sparse; HNO_INFO enrichment yielded only 3% additional coverage
- No visit/provider linkage: visit_occurrence_id and provider_id are NULL; HNO_INFO.pat_enc_csn_id is 3% populated for imaging notes
- Rule-based NLP: medspaCy uses literal string matching; no ML-based entity recognition
- Assertion concept_ids are placeholder (0): custom OMOP concepts for assertion types pending CVB work
- EHAP federated query constraints: no predicate pushdown for JOINs, Lambda 900s timeout, SQL Server 2,100 parameter limit
- Duplicate note content: some notes with different note_csn_id values contain identical text
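Researchers who need to account for the duplicate note content could group notes by a hash of their text. This is one possible approach and not part of the release. The function name is hypothetical, and collapsing whitespace before hashing is an added assumption, since the limitation above only describes identical text.

```python
import hashlib
from collections import defaultdict


def find_duplicate_notes(notes):
    """Group note_csn_id values whose text is identical after whitespace cleanup.

    `notes` is an iterable of (note_csn_id, text) pairs. Returns a list of
    sorted ID groups, one per set of notes sharing the same content.
    """
    groups = defaultdict(list)
    for note_csn_id, text in notes:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(note_csn_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)
```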
Repositories
| Repository | Purpose |
|---|---|
| emory_omop_enterprise | dbt projects: EmoryOmopBrainHealthEnterprise (NLP wrapper models), EmoryOmopNoteIngestionPilot (pilot transforms) |
| emory_omop_internal_nlp | FastAPI ingestion service, medspaCy pipeline, rules specification, provenance capture |
| emory-omop | NLP architecture spec, glossary, validation framework, this documentation |
Related Pages
- NLP Architecture — 4-layer schema specification
- NLP Glossary — Clinical NLP terminology reference
- Entity Relationship Diagram — Visual schema reference