T2.1: Analyzes Demographic Data

Knowledge Review - InterSystems Enterprise Master Patient Index Technical Specialist

1. Common Data Quality Issues that Impact Linkage

Key Points

  • Missing or incomplete critical demographic fields reduce matching accuracy
  • Name variations and inconsistencies create false non-matches
  • MRN conflicts and reuse cause serious patient safety risks
  • Address inconsistencies affect geographic-based matching
  • Invalid or dummy values corrupt linkage algorithms

Detailed Notes

Data quality is the foundation of accurate patient matching in InterSystems EMPI. Poor data quality directly impacts the ability to correctly identify when two patient records represent the same individual (true matches) and when they represent different individuals (true non-matches).

Missing Critical Demographic Data

The absence of key demographic fields significantly reduces the EMPI's ability to match records accurately. Critical fields include:

Social Security Number (SSN): When SSN is missing, one of the strongest unique identifiers is unavailable. Some facilities may not collect SSN due to patient privacy concerns or collection policies. The impact is reduced matching confidence, requiring reliance on weaker identifiers like name and date of birth.

Date of Birth: Missing or defaulted dates of birth (such as 01/01/1900) prevent accurate age-based matching. Newborns may have temporary dates of birth that need correction later.

Gender: While seemingly simple, missing or incorrect gender codes can cause matching failures, especially when combined with unisex names.

Contact Information: Missing phone numbers, addresses, or email reduces the number of parameters available for comparison and verification.

Name Variations and Inconsistencies

Names present multiple data quality challenges that affect matching:

Spelling Variations: "Smith" vs "Smyth", "Johnson" vs "Johnsen" - different data entry personnel may spell names differently.

Nicknames vs Legal Names: "Bob" vs "Robert", "Liz" vs "Elizabeth" - patients may provide different name forms at different facilities.

Hyphenated Names: "Mary Smith-Jones" may appear as "Mary Smith", "Mary Jones", or "Mary Smith Jones" in different systems.

Suffix Handling: "John Smith Jr." vs "John Smith" - suffixes may be entered inconsistently or in different fields.

Special Characters: Apostrophes, hyphens, and accented characters may be stripped or handled differently: "O'Brien" vs "OBrien".

Name Order Reversals: First and last names accidentally swapped during data entry.
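Most of these variations are tamed by normalization before comparison. A minimal sketch of the kinds of transformations involved (the nickname and suffix tables here are invented examples, not the product's normalization functions):

```python
import unicodedata

# Hypothetical lookup tables for illustration only
NICKNAMES = {"bob": "robert", "liz": "elizabeth", "bill": "william"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize_name(name: str) -> str:
    """Fold case, strip accents and punctuation, expand nicknames, drop suffixes."""
    # Decompose accented characters, then drop the combining marks
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    # Remove apostrophes/periods and treat hyphens as spaces,
    # so "O'Brien" vs "OBrien" and "Smith-Jones" vs "Smith Jones" compare equal
    folded = folded.lower().replace("'", "").replace(".", "").replace("-", " ")
    tokens = [t for t in folded.split() if t not in SUFFIXES]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

print(normalize_name("O'Brien"))        # obrien
print(normalize_name("Bob Smith Jr."))  # robert smith
```

With this kind of folding, "Bob Smith Jr." and "Robert Smith" normalize to the same string, turning a false non-match into a candidate match.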

Medical Record Number (MRN) Conflicts

MRN issues are particularly serious because they can cause patient safety incidents:

MRN Reuse: When facilities reuse medical record numbers for different patients, it creates deterministic linkage conflicts. The system expects MRN to uniquely identify a patient within a facility.

MRN Format Changes: Facilities may change MRN numbering systems (e.g., adding prefixes), creating duplicate records for the same patient.

Temporary MRN Assignment: Emergency departments may assign temporary MRNs that later need to be merged with permanent numbers.

Cross-Facility MRN Conflicts: Different facilities may independently use the same MRN for different patients.

Address Data Inconsistencies

Address data presents standardization challenges:

Abbreviation Variations: "Street" vs "St.", "Avenue" vs "Ave", "Road" vs "Rd" - inconsistent abbreviations reduce matching.

Apartment/Unit Numbers: May be placed in different fields or formats: "Apt 5B", "Unit 5B", "#5B".

ZIP Code Errors: Transposed digits or missing ZIP+4 extensions.

PO Boxes vs Street Addresses: Patients may provide different addresses at different times.

Address Updates: Patients move but old addresses remain in source systems.

Invalid and Dummy Values

Many data quality issues stem from placeholder or default values:

Dummy Phone Numbers: "999-999-9999", "000-000-0000", "123-456-7890" - test data or placeholders entered when real number is unavailable.

Dummy Names: "Baby Boy", "Baby Girl", "Unknown", "Test Patient" - temporary values that should be updated but remain in the system.

Default Dates: "01/01/1900", "01/01/1800" - system defaults when real date is unknown.

Generic Addresses: "123 Main Street", "PO Box 0" - placeholder values.

Test Data: Records created during system testing that were not properly purged.
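Detecting these placeholders usually reduces to lookups against known-bad value lists. A minimal illustrative sketch (the value sets are taken from the examples above and are not exhaustive):

```python
# Illustrative placeholder-value check; lists are examples, not product-defined
DUMMY_PHONES = {"999-999-9999", "000-000-0000", "123-456-7890"}
DUMMY_NAMES = {"baby boy", "baby girl", "unknown", "test patient"}
DUMMY_DATES = {"1900-01-01", "1800-01-01"}

def is_dummy(record: dict) -> bool:
    """Return True if any field carries a known placeholder value."""
    name = f"{record.get('GivenName', '')} {record.get('FamilyName', '')}".strip().lower()
    return (
        record.get("PhoneNumber") in DUMMY_PHONES
        or record.get("BirthDate") in DUMMY_DATES
        or name in DUMMY_NAMES
    )

print(is_dummy({"GivenName": "Baby", "FamilyName": "Boy"}))  # True
```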

---

Documentation References

2. Using the Data Quality Tool to Discover Data Problems

Key Points

  • Three analytics cubes provide comprehensive data quality visibility
  • Business rules classify data as valid, invalid, or blank
  • Dashboards show data quality metrics at a glance
  • Trending pivots track data quality changes over time
  • Scheduled synchronization enables ongoing monitoring

Detailed Notes

The InterSystems EMPI Data Quality Tool is a comprehensive analytics solution that helps identify and track data quality issues. It uses analytics cubes to analyze patient demographic data and provides dashboards and pivot tables for visualization.

Data Quality Tool Components

The Data Quality tool consists of three main analytics cubes:

PatientIndexDQCube (Patient Cube): Built from the HSPI.Data.Patient table, this cube contains data quality information for all patient records as they exist in the original patient table before normalization. This cube shows the "raw" data quality from source systems.

PatientIndexNormalizedDQCube (Normalized Cube): Built from the normalized table associated with your linkage definition, this cube shows data quality after normalization functions have been applied. Comparing this cube to the patient cube reveals how effectively your normalization functions are cleaning the data.

DQTrend (Trend Cube): Built from the patient cube over time, this cube tracks trends in data quality. It allows you to identify when data quality deteriorated (e.g., when a specific facility started sending poor quality data or when a system integration change introduced errors).

Enabling and Configuring Data Quality

The Data Quality tool is disabled by default. To enable it:

1. Navigate to InterSystems EMPI menu > Data Quality Manager in the Management Portal.

2. Configure the Data Quality Definition:


  • Package Name: Specify a custom package (avoid HealthShare product packages like HSDQ, HSMOD, HSPI to prevent upgrade conflicts)
  • Data Quality definition class description: Optional description
  • Base Linkage Definition: Name of your linkage definition (determines which normalized table to use)

3. Select "Enable Data Quality" to generate cubes and configure default synchronization.

4. Select "Rebuild Cube Data" to populate the cubes with data.

Important: You must perform a linkage data build before rebuilding Data Quality cubes.

Business Rules for Data Classification

The Data Quality tool uses business rules to classify each field value as valid, invalid, or blank. Each property (demographic field) has an associated classifier rule.

Default rules are provided for standard fields:

  • SSN: Validates format (9 digits), checks for common dummy values (000-00-0000, 123-45-6789)
  • Phone Number: Validates format, identifies dummy numbers (999-999-9999, 000-000-0000)
  • Birth Date: Checks for valid dates, identifies dummy dates (01/01/1900)
  • Name Fields: Identifies placeholder values ("Baby", "Unknown", "Test")
  • Gender: Validates against standard codes (M, F, U, O)
  • ZIP Code: Validates 5-digit or ZIP+4 format
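The three-way valid/invalid/blank classification these rules perform can be sketched in a few lines. This is an illustrative stand-in, not the HSDQ rule engine, and the dummy-value list is only an example:

```python
import re

def classify_ssn(value) -> str:
    """Classify an SSN as 'blank', 'invalid', or 'valid' (illustrative rule)."""
    if value is None or str(value).strip() == "":
        return "blank"
    digits = re.sub(r"\D", "", str(value))
    # Nine digits required; known dummy values count as invalid
    if len(digits) != 9 or digits in {"000000000", "123456789"}:
        return "invalid"
    return "valid"

print(classify_ssn(""))             # blank
print(classify_ssn("123-45-6789"))  # invalid (dummy value)
print(classify_ssn("452-98-1437"))  # valid
```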

Creating Custom Business Rules

To address site-specific data quality issues, you should create custom business rules:

1. Navigate to Interoperability > List > Business Rules.
2. Select "New" to open the Business Rule Wizard.
3. Enter a custom package name (critical for upgrade safety).
4. Provide a rule name, optional alias, and description.
5. In the general tab, select HSDQ.Classifier.Rule.Assist as the Rule Assist Class.
6. In the ruleSet tab, define the rule logic using existing rules as templates.
7. Save the rule.
8. Return to Data Quality Manager, select the property, and choose your custom rule from the Classifier Rule dropdown.

For example, if a facility uses "VET" prefix for veterinary patients, create a rule that classifies any MRN starting with "VET" as invalid.

Data Quality Dashboards

Three main dashboards provide data quality visibility:

InterSystems EMPI Data Quality Summary: Displays valid, invalid, and blank counts for each property in the patient cube. Shows data quality from source systems before normalization.

InterSystems EMPI Data Quality Summary - Normalized: Displays valid, invalid, and blank counts after normalization. Compare it to the regular summary to measure normalization effectiveness. Ideally, the normalized cube should have zero invalid values.

Trend Dashboard: Displays long-term trends for selected pivots, helping identify when data quality changed.

Dashboard features:

  • Filter by Facility or Assigning Authority to identify problem sources
  • Export to PDF or Excel for reporting
  • Drill down into specific properties for detailed analysis

Data Quality Pivots

Pivots provide detailed counts and percentages for individual property values:

  • Count of each unique value (e.g., how many patients have SSN "000-00-0000")
  • Percentage of total records
  • Filter by Assigning Authority or Facility to isolate problem sources
  • Identify patterns (e.g., one facility consistently sends dummy phone numbers)

Data Quality Properties Table

The Data Quality Manager displays a properties table showing:

  • Property Name: Demographic field (SSN, FamilyName, etc.)
  • Classifier Rule: Business rule applied to this property
  • Cubes: Whether property exists in Patient cube only (O) or both Patient and Normalized cubes (O+N)

For each property, you can view:

  • Top Count: Maximum number of distinct values tallied (others grouped as "other values")
  • Classifier Type: Expected value type (String, Date, etc.)
  • Classifier Rule: Specific business rule applied

---

Documentation References

3. Analyzing Data Using SQL Queries

Key Points

  • Direct SQL access to HSPI tables for custom analysis
  • Operational queries provide pre-built data quality reports
  • Terminal, ODBC, JDBC, or Management Portal SQL interface
  • Query patient demographics, linkage data, and audit history
  • Custom queries enable site-specific analysis needs

Detailed Notes

SQL queries provide powerful, flexible analysis capabilities for examining demographic data quality in InterSystems EMPI. SQL access allows you to create custom reports, investigate specific data quality issues, and perform ad-hoc analysis that may not be covered by standard Data Quality Tool dashboards.

Key EMPI Database Tables

Understanding EMPI table structure is essential for SQL analysis:

HSPI.Data.Patient: Contains all patient demographic records from source systems. Each row represents one patient record from one facility. Key fields include:

  • ID: Internal record identifier
  • MPIID: Master Patient Index ID (enterprise identifier)
  • AssigningAuthority: Facility or system that assigned the MRN
  • MRN: Medical Record Number from source facility
  • FamilyName, GivenName, MiddleName: Name components
  • SSN: Social Security Number
  • BirthDate, Gender: Core demographics
  • StreetLine, City, State, PostalCode: Address fields
  • PhoneNumber, TelecomUse: Contact information
  • UpdatedOn, CreatedOn: Record timestamps

Normalized Table (name varies by linkage definition, e.g., HSPI.Data.Normalized): Contains normalized versions of patient data after normalization functions have been applied. Same fields as Patient table but with cleaned/standardized values.

HSPI.Data.Classified: Contains linkage classification for record pairs:

  • ID1, ID2: Internal IDs of the two records in the pair
  • LinkWeight: Calculated similarity score
  • LinkStatus: Link, Potential Link, or Non-Link
  • LinkReason: Why records were linked (Threshold, Manual, Deterministic, etc.)
  • AgreementPattern: String showing which parameters agreed/disagreed

HSPI.Data.Worklist: Contains record pairs requiring human review:

  • Category: Worklist category (Review, Validate, Duplicate, Overlay, Overlap, etc.)
  • LinkWeight, LinkStatus, LinkReason: Same as Classified table
  • TimeCreated, TimeUpdated: When pair appeared on worklist

HSPI.Audit: Audit trail of all EMPI actions:

  • Action: What was done (Link, Unlink, Merge, etc.)
  • Username: Who performed the action
  • TimeStamp: When action occurred
  • Comment: User-entered comment

Operational Queries (Pre-Built SQL Procedures)

InterSystems EMPI provides a comprehensive set of operational queries - stored SQL procedures that answer common data quality questions. These queries are located in the HSPI_Report_Queries package.

To access operational queries:

1. Navigate to System Explorer > SQL in the Management Portal.
2. Expand "Procedures" in the left pane.
3. Locate queries starting with HSPI_Report_Queries.
4. Click a query to view its Catalog Details and parameters.
5. Use the Execute Query tab to run the query.

Patient Record Queries:

  • `Patient_Records(AssigningAuthority)`: Total number of patient records
  • `Patient_RecordsByAssigningAuthority()`: Count by facility
  • `Patient_RecordsForMPIID(MPIID)`: All records linked under an MPIID
  • `Patient_Unique(AssigningAuthority)`: Count of distinct patients (by MPIID)

Facilities Queries:

  • `Facilities_Count()`: Total number of unique facilities
  • `Facilities_List()`: List of all facilities
  • `Facilities_AACount()`: Count of unique assigning authorities
  • `Facilities_AAList()`: List of assigning authorities

Worklist Category Queries:

  • `Review_Count()`, `Review_Detail()`: Review category items
  • `Validate_Count()`, `Validate_Detail()`: Validate category items
  • `Duplicate_Count()`, `Duplicate_Detail()`: Same-domain duplicates
  • `Overlay_Count()`, `Overlay_Detail()`: Different records, same MPIID
  • `Overlap_Count()`, `Overlap_Detail()`: Linked records, different MPIIDs
  • `Deterministic_Count()`, `Deterministic_Detail()`: Deterministic conflicts

Example Operational Query Usage:

```sql
-- Count duplicates for a specific facility
CALL HSPI_Report_Queries.Duplicates_ByDomain('FACILITYNAME')

-- Count merges in a date range
CALL HSPI_Report_Queries.Merges_Count('FACILITYNAME', '2025-01-01', '2025-12-31')

-- Detailed audit activity for a user
CALL HSPI_Report_Queries.Audit_Activity('jsmith', '2025-01-01', '2025-12-31')
```

Custom SQL Queries for Data Quality Analysis

Beyond operational queries, you can write custom SQL to investigate specific data quality issues:

Finding Missing SSN by Facility:

```sql
SELECT AssigningAuthority, COUNT(*) AS MissingSSN
FROM HSPI_Data.Patient
WHERE SSN IS NULL OR SSN = ''
GROUP BY AssigningAuthority
ORDER BY MissingSSN DESC
```

Identifying Dummy Phone Numbers:

```sql
SELECT PhoneNumber, COUNT(*) AS RecordCount
FROM HSPI_Data.Patient
WHERE PhoneNumber IN ('999-999-9999', '000-000-0000', '123-456-7890')
GROUP BY PhoneNumber
```

Finding Records with Default Birth Dates:

```sql
SELECT AssigningAuthority, COUNT(*) AS DefaultBirthDates
FROM HSPI_Data.Patient
WHERE BirthDate = '1900-01-01' OR BirthDate = '1800-01-01'
GROUP BY AssigningAuthority
```

Detecting MRN Reuse:

```sql
SELECT AssigningAuthority, MRN, COUNT(DISTINCT MPIID) AS UniqueMPIIDs
FROM HSPI_Data.Patient
GROUP BY AssigningAuthority, MRN
HAVING COUNT(DISTINCT MPIID) > 1
```

Finding Name Variations:

```sql
SELECT FamilyName, COUNT(*) AS Variations
FROM HSPI_Data.Patient
WHERE UPPER(FamilyName) IN (
    SELECT UPPER(FamilyName)
    FROM HSPI_Data.Patient
    GROUP BY UPPER(FamilyName)
    HAVING COUNT(DISTINCT FamilyName) > 1
)
GROUP BY FamilyName
ORDER BY Variations DESC
```

Comparing Original vs Normalized Data:

```sql
SELECT P.AssigningAuthority,
       COUNT(*) AS TotalRecords,
       SUM(CASE WHEN P.SSN IS NULL THEN 1 ELSE 0 END) AS OriginalNullSSN,
       SUM(CASE WHEN N.SSN IS NULL THEN 1 ELSE 0 END) AS NormalizedNullSSN
FROM HSPI_Data.Patient P
JOIN HSPI_Data.Normalized N ON P.ID = N.ID
GROUP BY P.AssigningAuthority
```
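Any of these queries can be dry-run against a lightweight engine before pointing them at EMPI. A self-contained sketch using SQLite, with a mock table standing in for HSPI_Data.Patient (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (AssigningAuthority TEXT, MRN TEXT, MPIID INTEGER)")
conn.executemany(
    "INSERT INTO Patient VALUES (?, ?, ?)",
    [
        ("MEMORIAL", "1001", 1),
        ("MEMORIAL", "1001", 2),   # same MRN, different enterprise IDs: reuse
        ("MEMORIAL", "1002", 3),
        ("COUNTY",   "1001", 4),   # same MRN at another facility is not a conflict
    ],
)

# Same shape as the MRN-reuse query above, run against the mock table
rows = conn.execute(
    """
    SELECT AssigningAuthority, MRN, COUNT(DISTINCT MPIID) AS UniqueMPIIDs
    FROM Patient
    GROUP BY AssigningAuthority, MRN
    HAVING COUNT(DISTINCT MPIID) > 1
    """
).fetchall()
print(rows)  # [('MEMORIAL', '1001', 2)]
```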

SQL Query Tools and Interfaces

Several interfaces are available for executing SQL queries:

Management Portal SQL Interface:

  • Navigate to System Explorer > SQL
  • Select namespace (HSPI or Registry namespace)
  • Use Execute Query tab
  • Results can be exported or printed

Terminal:

  • Launch a Terminal session on the server hosting EMPI
  • Execute SQL directly in Terminal window
  • Useful for scripting and automation

ODBC/JDBC:

  • Connect external tools (Excel, Tableau, custom applications)
  • Use EMPI as data source for reporting tools
  • Enables integration with enterprise reporting platforms

SQL Shell:

  • Command-line interface for SQL
  • Useful for batch processing and automation

---

Documentation References

4. Analyzing Pairs Using Worklist

Key Points

  • Worklist shows record pairs requiring human review
  • Seven categories identify different data quality issues
  • Filters enable targeted analysis of specific problems
  • Comparison Detail provides side-by-side record analysis
  • Agreement patterns reveal which fields matched or conflicted

Detailed Notes

The Worklist is a critical tool for identifying and understanding data quality problems through the lens of patient matching. Record pairs appear on the Worklist when the EMPI linkage engine cannot confidently determine whether they represent the same patient or when conflicts exist.

Worklist Categories and Data Quality Implications

Each Worklist category indicates a specific type of data quality or linkage issue:

Review Category: Link weight falls between autolink and review thresholds (e.g., 20-30). These pairs have moderate similarity - enough commonality to warrant review but not enough for automatic linking.

*Data Quality Implications*:

  • Partial demographic matches due to incomplete data
  • Name variations that reduce matching confidence
  • Missing key identifiers forcing reliance on weaker fields
  • Inconsistent address or contact information

Validate Category: Link weight falls between validate and autolink thresholds (e.g., 30-50). These pairs have high similarity and were automatically linked but should be validated.

*Data Quality Implications*:

  • Very similar but not identical demographics
  • Common names reducing uniqueness
  • Potential for false matches if not reviewed
  • May reveal systematic data entry differences between facilities

Duplicate Category: Two records whose link weight is above the autolink threshold belong to the same data domain (same facility). This violates the expectation that each patient has only one record per facility.

*Data Quality Implications*:

  • MRN reuse at the facility (serious patient safety issue)
  • Temporary MRNs not properly merged
  • Data migration creating duplicate records
  • Multiple visits creating separate registrations

Overlay Category: Two records that are NOT linked but share the same MPIID. This is a serious data quality error indicating records for different patients were incorrectly merged.

*Data Quality Implications*:

  • Patient safety risk - one patient may see another patient's data
  • May result from MRN reuse combined with deterministic matching
  • Incorrect manual linking decisions
  • System migration errors

Overlap Category: Two records that ARE linked but have different MPIIDs. This conflict indicates the linking engine wants to link records but they're associated with different enterprise identifiers.

*Data Quality Implications*:

  • Historical manual unlinking that should be re-evaluated
  • Deterministic identifiers preventing linkage
  • Transitivity conflicts in complex linkage chains
  • May require expert review to resolve

Open-Chaining Category: Two records are both linked to a third record but not linked to each other. For example, Record A is linked to Record C, Record B is linked to Record C, but Record A is not linked to Record B.

*Data Quality Implications*:

  • Inconsistent demographic data across records
  • Partial matches that create ambiguous linkage patterns
  • May indicate gradual data quality degradation over time
  • Suggests need for linkage definition tuning

Deterministic Category: A deterministic identifier conflict prevents honoring expected behavior. For example, two records have the same SSN (should force link) but also have same facility/different MRN (should force non-link).

*Data Quality Implications*:

  • SSN data quality issues (duplicates, errors)
  • MRN reuse at facilities
  • Data entry errors in deterministic fields
  • Conflicting business rules

Worklist Search and Filter Capabilities

The Worklist search panel provides powerful filtering to isolate specific data quality issues:

Category Filter: Select one or more categories to focus on specific problem types. Most data quality analysis starts with Duplicate, Overlay, and Deterministic categories as these indicate serious issues.

Data Source Filters: Compare or filter by facility/data source:

  • Find all problems from a specific facility
  • Compare two facilities to identify systematic differences
  • Identify which source systems have poorest data quality
  • Filter to "same data source" to find within-facility duplicates
  • Filter to "different data source" to focus on cross-facility matching

Time Created/Updated: Identify when problems appeared:

  • New problems since last review session
  • Problems created after a system change
  • Historical problems not yet addressed
  • Trending analysis (problems increasing over time?)

Link Weight Range: Focus on pairs near threshold boundaries:

  • Pairs just below autolink threshold (near-misses)
  • Pairs just above review threshold (borderline matches)
  • Very high weight conflicts (unexpected problems in strong matches)

Link Reason: Filter by why records were linked:

  • Threshold: Linked by similarity score
  • Manual: Linked by user action
  • Deterministic: Linked by SSN, etc.
  • Rule: Linked by custom rule
  • Transitivity: Linked through common third record
  • Domain Conflict: Same facility, same MRN

ID Search: Investigate specific patients:

  • Search by MPIID to find all pairs involving a patient
  • Search by SSN or MRN using Record Search window
  • Follow up on reported data quality issues

Worklist Fields for Data Quality Analysis

Each Worklist row displays critical information for assessing data quality:

Link Status: Visual indicator of relationship:

  • Green: Linked
  • Yellow: Potential Link
  • Red: Non-Link with conflict

Link Weight: Numeric similarity score. Understanding weight distribution helps identify:

  • Threshold tuning needs
  • Systematic data quality issues (many pairs cluster near thresholds)
  • Parameter weight adjustments needed

Agreement Pattern: String showing which parameters agreed (H), disagreed (X), partially matched (L), or were neutral (N). This is extremely valuable for data quality analysis.

Example pattern: HHLNX for linkage definition with 5 weighted parameters:

  • Parameter 1 (e.g., Name): High agreement
  • Parameter 2 (e.g., DOB): High agreement
  • Parameter 3 (e.g., SSN): Low agreement (partial match)
  • Parameter 4 (e.g., Gender): Neutral (one or both null)
  • Parameter 5 (e.g., Address): Negative (strong disagreement)

Frequent patterns reveal data quality issues:

  • Many "N" (neutral) = missing data
  • Many "L" (low) = inconsistent data entry
  • Many "X" (negative) = conflicting data or bad normalization
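One way to surface recurring patterns is to tally letter frequencies per parameter position across exported worklist pairs. A minimal sketch with invented patterns:

```python
from collections import Counter

# Hypothetical agreement patterns from a worklist export; each position is one
# weighted parameter (H=high, L=low/partial, N=neutral, X=disagreement)
patterns = ["HHLNX", "HHNNH", "HLNNX", "HHNNN"]

# Count letters at each position: a position dominated by N suggests missing
# data for that parameter; many L or X suggests inconsistent or conflicting data
per_position = [Counter(p[i] for p in patterns) for i in range(len(patterns[0]))]
for i, tally in enumerate(per_position, start=1):
    print(f"Parameter {i}: {dict(tally)}")
```

Here parameter 4 is neutral in every pair, pointing at a field that is frequently blank at the source.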

Summary Field: Displays key demographics for quick assessment. Click summary to open Record History.

Unique Identifier: Shows AssigningAuthority:MRN for each record in the pair. Quickly identifies the facilities involved.

Data Source/Data Domain: Identifies which facilities or systems provided the records.

Time Created/Updated: When the pair appeared on worklist and when last modified.

Using Comparison Detail for Deep Analysis

Click the Comparison Detail icon (table icon) to open side-by-side record comparison. This view is essential for understanding specific data quality issues:

Original Properties Section: Shows raw demographic data from source systems:

  • Exact values as received from facilities
  • Identify data entry inconsistencies
  • Spot missing fields
  • See dummy/placeholder values

Normalized Properties Section: Shows data after normalization functions applied:

  • Assess normalization effectiveness
  • Identify normalization errors
  • Verify null value lists working correctly
  • Check exclusion conditions

Linkage Section: Shows matching algorithm details:

  • Current link weight and thresholds
  • Agreement pattern explained
  • Link status and reason
  • Secondary reason (if applicable)

Link Weight by Parameters: Breaks down how each parameter contributed to total link weight:

  • Positive contributions (agreement)
  • Negative contributions (disagreement)
  • Zero contributions (neutral)
  • Identify which parameters are most/least reliable
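The per-parameter breakdown is just a signed sum: agreement adds weight, disagreement subtracts it, neutral contributes nothing. A toy illustration with invented contribution values (real weights come from the linkage definition):

```python
# Hypothetical per-parameter contributions to one pair's link weight;
# positive = agreement, negative = disagreement, zero = neutral (nulls)
contributions = {
    "Name":      9.5,
    "BirthDate": 8.0,
    "SSN":       3.2,   # partial match contributes less than full agreement
    "Gender":    0.0,   # one or both values null: neutral
    "Address":  -4.1,   # strong disagreement subtracts weight
}

link_weight = sum(contributions.values())
print(round(link_weight, 1))  # 16.6
```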

Other Links: Shows other records linked to each record in the pair:

  • Understand complete linkage chain
  • Identify open-chaining situations
  • See if either record has other conflicts

Color Coding: Comparison Detail uses colors consistently:

  • Green: Values match or strongly agree
  • Yellow: Values partially match
  • Red: Values strongly disagree
  • Gray: One or both values null

Worklist Data Quality Analysis Workflow

Systematic approach to using Worklist for data quality analysis:

1. Start with High-Priority Categories: Begin with Overlay and Duplicate categories as these have highest patient safety impact.

2. Filter by Facility: Analyze one facility at a time to identify facility-specific patterns.

3. Sort by Link Weight: Review highest-weight conflicts first (most unexpected).

4. Examine Agreement Patterns: Look for recurring patterns indicating systematic issues.

5. Open Comparison Detail: Investigate representative pairs in detail.

6. Document Findings: Note common data quality issues for each facility.

7. Identify Root Causes: Determine whether issues are:

  • Source system data entry problems
  • Interface/transformation errors
  • Normalization function gaps
  • Business rule conflicts

8. Quantify Impact: Use Worklist counts to measure scope of each issue.

9. Prioritize Remediation: Address issues with highest volume or safety impact first.

---

Documentation References

5. Reporting Data Quality Issues

Key Points

  • Document findings with specific examples and metrics
  • Classify issues by source facility and root cause
  • Quantify impact using counts and percentages
  • Provide actionable recommendations for remediation
  • Track issue resolution and measure improvement

Detailed Notes

Effective data quality reporting transforms analysis findings into actionable information that drives improvement. Reports must clearly communicate issues to technical teams, facility stakeholders, and management.

Report Structure and Components

A comprehensive data quality report should include:

Executive Summary:

  • Overall data quality assessment
  • Critical issues requiring immediate attention
  • Impact on patient matching accuracy
  • Recommended priorities for remediation

Methodology Section:

  • Analysis period and scope
  • Tools used (Data Quality Tool, SQL queries, Worklist analysis)
  • Data sources examined
  • Queries executed

Findings by Category:

*Missing Data Issues*:

  • Which fields are frequently missing
  • Percentage of records affected
  • Facilities with highest missing data rates
  • Impact on matching (e.g., "23% of records lack SSN, reducing match confidence by average of 12 points")

*Invalid Data Issues*:

  • Dummy values identified (with counts)
  • Invalid formats or values
  • Default values not yet updated
  • Test data in production

*Inconsistent Data Issues*:

  • Name variations and patterns
  • Address standardization gaps
  • Date format inconsistencies
  • Value encoding differences between facilities

*Conflict Issues*:

  • MRN reuse incidents
  • Deterministic conflicts
  • Overlay/Overlap situations
  • Duplicate records

Facility-Specific Reporting

Break down findings by facility/data source:

Facility Data Quality Scorecard:

```
Facility: Memorial Hospital
Total Records: 45,230
Analysis Period: 2025-01-01 to 2025-12-31

Missing Data Rates:
  • SSN Missing: 18% (8,141 records)
  • Phone Missing: 34% (15,378 records)
  • Middle Name Missing: 67% (30,304 records)

Invalid Data Detected:
  • Dummy Phone "999-999-9999": 2,341 records
  • Default DOB "01/01/1900": 156 records
  • Name "Baby Boy/Girl": 89 records

Data Quality Issues:
  • MRN Reuse: 3 incidents affecting 6 patients
  • Duplicate Records: 127 same-patient duplicates
  • Name Variations: High ("Smith" vs "Smyth" pattern in 234 cases)
```

Comparative Analysis: Compare facilities to identify outliers and best practices:

```
SSN Completeness by Facility:
  • City General: 95% complete (best practice)
  • Memorial Hospital: 82% complete
  • County Clinic: 67% complete (requires attention)
  • Regional Medical: 45% complete (critical issue)
```

Data Quality Metrics and KPIs

Establish quantitative measures:

Completeness Metrics:

  • Percentage of records with complete core demographics
  • Field-level completeness rates
  • Trend over time (improving or degrading?)
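Field-level completeness is a simple ratio of non-blank values to total records. A minimal sketch over mock records (values invented):

```python
# Illustrative completeness computation over a handful of mock records
records = [
    {"SSN": "452-98-1437", "PhoneNumber": "617-555-0142"},
    {"SSN": "",            "PhoneNumber": "617-555-0199"},
    {"SSN": None,          "PhoneNumber": ""},
    {"SSN": "521-09-8833", "PhoneNumber": None},
]

def completeness(field: str) -> float:
    """Percentage of records with a non-blank value for the field."""
    filled = sum(1 for r in records if r.get(field))
    return 100.0 * filled / len(records)

print(f"SSN completeness: {completeness('SSN'):.0f}%")            # 50%
print(f"Phone completeness: {completeness('PhoneNumber'):.0f}%")  # 50%
```

In practice the same ratio would be computed per facility and per field, then trended month over month.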

Validity Metrics:

  • Percentage of records with valid values
  • Invalid value counts by type
  • Normalization effectiveness (invalid % before vs after normalization)

Consistency Metrics:

  • Name standardization rate
  • Address parse success rate
  • Date format standardization rate

Conflict Metrics:

  • MRN reuse rate per 10,000 records
  • Overlay/Overlap incidents
  • Deterministic conflicts per month

Matching Impact Metrics:

  • Average link weight for matched pairs (higher = more confidence)
  • Percentage of matches above autolink threshold
  • Review workload (pairs requiring human review)
  • False match indicators (overlays discovered)
  • False non-match indicators (duplicates discovered)

Root Cause Analysis

For each major issue, identify root cause:

Source System Issues:

  • Field not collected by source application
  • Field optional in source, frequently skipped
  • Source system accepts invalid values
  • Legacy data migration introduced errors

Interface Issues:

  • HL7 transformation errors
  • Field mapping mistakes
  • Character encoding problems
  • Date format conversion errors

Process Issues:

  • Staff training gaps
  • Registration workflow allows skipping required fields
  • Lack of data validation at entry point
  • No process for updating temporary values

EMPI Configuration Issues:

  • Normalization functions not addressing facility-specific patterns
  • Null value lists incomplete
  • Exclusion conditions not defined for known dummy values
  • Linkage parameters not weighted appropriately for data characteristics

Actionable Recommendations

Provide specific, prioritized recommendations:

High Priority (Patient Safety Impact):

1. Address MRN reuse at Facility X immediately - implement MRN validation to prevent reuse.
2. Resolve the 3 overlay incidents - a patient may see another patient's data.
3. Add the "VET" prefix to the MRN exclusion condition to prevent veterinary patient matching.

Medium Priority (Matching Accuracy Impact):

1. Add common dummy phone numbers to the null value list for Facility Y.
2. Create a normalization function to standardize name suffixes (Jr., Sr., III).
3. Update the HL7 transformation for Facility Z to correctly map the middle name field.

Low Priority (Data Quality Improvement):

1. Request that Facility A improve its SSN collection rate.
2. Standardize address abbreviations across all facilities.
3. Implement data quality monitoring dashboards for ongoing visibility.

Issue Tracking and Follow-Up

Establish tracking mechanism:

Issue Log Format:

```
Issue ID: DQ-2025-001
Category: Invalid Data
Severity: High
Facility: County Clinic
Description: 2,341 records contain dummy phone number "999-999-9999"
Root Cause: Registration system accepts any phone format; staff enter dummy value when patient declines
Recommendation: Add "999-999-9999" to null value list in EMPI linkage definition
Assigned To: EMPI Administrator
Status: Resolved
Resolution Date: 2025-02-15
Resolution: Added to null value list, rebuilt linkage data
Verification: Confirmed normalized cube shows 0 instances of "999-999-9999"
```

Trend Reporting

Track data quality over time:

Monthly Data Quality Dashboard:

  • Total records processed
  • New data quality issues discovered
  • Issues resolved
  • Outstanding issue count
  • Data quality scores by facility
  • Matching accuracy metrics

Quarterly Executive Report:

  • Data quality trend (improving/stable/degrading)
  • Major issues addressed
  • Remaining challenges
  • Resource needs for further improvement
  • ROI from data quality initiatives (e.g., reduced manual worklist review time)

Stakeholder Communication

Tailor reports for different audiences:

Technical Teams (EMPI admins, interface developers):

  • Detailed technical findings
  • SQL queries used
  • Specific configuration changes needed
  • HL7 transformation corrections required

Facility Data Quality Coordinators:

  • Facility-specific scorecards
  • Examples of problematic records (de-identified)
  • Registration process improvements needed
  • Training recommendations

Management/Leadership:

  • Executive summary
  • Patient safety impact
  • Matching accuracy metrics
  • Resource requirements
  • Timeline for improvements

---

Documentation References
