T2.1: Analyzes Demographic Data

Knowledge Review - InterSystems Enterprise Master Patient Index Technical Specialist

1. Common Data Quality Issues that Impact Linkage

Key Points

  • Missing or incomplete critical demographic fields reduce matching accuracy
  • Name variations and inconsistencies create false non-matches
  • MRN conflicts and reuse cause serious patient safety risks
  • Address inconsistencies affect geographic-based matching
  • Invalid or dummy values corrupt linkage algorithms

Detailed Notes

Data quality is the foundation of accurate patient matching in InterSystems EMPI. Poor data quality directly impacts the ability to correctly identify when two patient records represent the same individual (true matches) and when they represent different individuals (true non-matches).

Missing Critical Demographic Data

The absence of key demographic fields significantly reduces the EMPI's ability to match records accurately. Critical fields include:

Social Security Number (SSN): When SSN is missing, one of the strongest unique identifiers is unavailable. Some facilities may not collect SSN due to patient privacy concerns or collection policies. The impact is reduced matching confidence, requiring reliance on weaker identifiers like name and date of birth.

Date of Birth: Missing or defaulted dates of birth (such as 01/01/1900) prevent accurate age-based matching. Newborns may have temporary dates of birth that need correction later.

Gender: While seemingly simple, missing or incorrect gender codes can cause matching failures, especially when combined with unisex names.

Contact Information: Missing phone numbers, addresses, or email reduces the number of parameters available for comparison and verification.

Name Variations and Inconsistencies

Names present multiple data quality challenges that affect matching:

Spelling Variations: "Smith" vs "Smyth", "Johnson" vs "Johnsen" - different data entry personnel may spell names differently.

Nicknames vs Legal Names: "Bob" vs "Robert", "Liz" vs "Elizabeth" - patients may provide different name forms at different facilities.

Hyphenated Names: "Mary Smith-Jones" may appear as "Mary Smith", "Mary Jones", or "Mary Smith Jones" in different systems.

Suffix Handling: "John Smith Jr." vs "John Smith" - suffixes may be entered inconsistently or in different fields.

Special Characters: Apostrophes, hyphens, and accented characters may be stripped or handled differently: "O'Brien" vs "OBrien".

Name Order Reversals: First and last names accidentally swapped during data entry.
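Most of these variations are tamed by normalization before comparison. A minimal sketch of the kinds of transformations involved (the nickname and suffix tables here are invented examples, not the product's normalization functions):

```python
import unicodedata

# Hypothetical lookup tables for illustration only
NICKNAMES = {"bob": "robert", "liz": "elizabeth", "bill": "william"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize_name(name: str) -> str:
    """Fold case, strip accents and punctuation, expand nicknames, drop suffixes."""
    # Decompose accented characters, then drop the combining marks
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    # Remove apostrophes/periods and treat hyphens as spaces,
    # so "O'Brien" vs "OBrien" and "Smith-Jones" vs "Smith Jones" compare equal
    folded = folded.lower().replace("'", "").replace(".", "").replace("-", " ")
    tokens = [t for t in folded.split() if t not in SUFFIXES]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

print(normalize_name("O'Brien"))        # obrien
print(normalize_name("Bob Smith Jr."))  # robert smith
```

With this kind of folding, "Bob Smith Jr." and "Robert Smith" normalize to the same string, turning a false non-match into a candidate match.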

Medical Record Number (MRN) Conflicts

MRN issues are particularly serious because they can cause patient safety incidents:

MRN Reuse: When facilities reuse medical record numbers for different patients, it creates deterministic linkage conflicts. The system expects MRN to uniquely identify a patient within a facility.

MRN Format Changes: Facilities may change MRN numbering systems (e.g., adding prefixes), creating duplicate records for the same patient.

Temporary MRN Assignment: Emergency departments may assign temporary MRNs that later need to be merged with permanent numbers.

Cross-Facility MRN Conflicts: Different facilities may independently use the same MRN for different patients.

Address Data Inconsistencies

Address data presents standardization challenges:

Abbreviation Variations: "Street" vs "St.", "Avenue" vs "Ave", "Road" vs "Rd" - inconsistent abbreviations reduce matching.

Apartment/Unit Numbers: May be placed in different fields or formats: "Apt 5B", "Unit 5B", "#5B".

ZIP Code Errors: Transposed digits or missing ZIP+4 extensions.

PO Boxes vs Street Addresses: Patients may provide different addresses at different times.

Address Updates: Patients move but old addresses remain in source systems.

Invalid and Dummy Values

Many data quality issues stem from placeholder or default values:

Dummy Phone Numbers: "999-999-9999", "000-000-0000", "123-456-7890" - test data or placeholders entered when real number is unavailable.

Dummy Names: "Baby Boy", "Baby Girl", "Unknown", "Test Patient" - temporary values that should be updated but remain in the system.

Default Dates: "01/01/1900", "01/01/1800" - system defaults when real date is unknown.

Generic Addresses: "123 Main Street", "PO Box 0" - placeholder values.

Test Data: Records created during system testing that were not properly purged.
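Detecting these placeholders usually reduces to lookups against known-bad value lists. A minimal illustrative sketch (the value sets are taken from the examples above and are not exhaustive):

```python
# Illustrative placeholder-value check; lists are examples, not product-defined
DUMMY_PHONES = {"999-999-9999", "000-000-0000", "123-456-7890"}
DUMMY_NAMES = {"baby boy", "baby girl", "unknown", "test patient"}
DUMMY_DATES = {"1900-01-01", "1800-01-01"}

def is_dummy(record: dict) -> bool:
    """Return True if any field carries a known placeholder value."""
    name = f"{record.get('GivenName', '')} {record.get('FamilyName', '')}".strip().lower()
    return (
        record.get("PhoneNumber") in DUMMY_PHONES
        or record.get("BirthDate") in DUMMY_DATES
        or name in DUMMY_NAMES
    )

print(is_dummy({"GivenName": "Baby", "FamilyName": "Boy"}))  # True
```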

---

Documentation References

2. Using the Data Quality Tool to Discover Data Problems

Key Points

  • Three analytics cubes provide comprehensive data quality visibility
  • Business rules classify data as valid, invalid, or blank
  • Dashboards show data quality metrics at a glance
  • Trending pivots track data quality changes over time
  • Scheduled synchronization enables ongoing monitoring

Detailed Notes

The InterSystems EMPI Data Quality Tool is a comprehensive analytics solution that helps identify and track data quality issues. It uses analytics cubes to analyze patient demographic data and provides dashboards and pivot tables for visualization.

Data Quality Tool Components

The Data Quality tool consists of three main analytics cubes:

PatientIndexDQCube (Patient Cube): Built from the HSPI.Data.Patient table, this cube contains data quality information for all patient records as they exist in the original patient table before normalization. This cube shows the "raw" data quality from source systems.

PatientIndexNormalizedDQCube (Normalized Cube): Built from the normalized table associated with your linkage definition, this cube shows data quality after normalization functions have been applied. Comparing this cube to the patient cube reveals how effectively your normalization functions are cleaning the data.

DQTrend (Trend Cube): Built from the patient cube over time, this cube tracks trends in data quality. It allows you to identify when data quality deteriorated (e.g., when a specific facility started sending poor quality data or when a system integration change introduced errors).

Enabling and Configuring Data Quality

The Data Quality tool is disabled by default. To enable it:

1. Navigate to InterSystems EMPI menu > Data Quality Manager in the Management Portal.

2. Configure the Data Quality Definition:


  • Package Name: Specify a custom package (avoid HealthShare product packages like HSDQ, HSMOD, HSPI to prevent upgrade conflicts)
  • Data Quality definition class description: Optional description
  • Base Linkage Definition: Name of your linkage definition (determines which normalized table to use)

3. Select "Enable Data Quality" to generate cubes and configure default synchronization.

4. Select "Rebuild Cube Data" to populate the cubes with data.

Important: You must perform a linkage data build before rebuilding Data Quality cubes.

Business Rules for Data Classification

The Data Quality tool uses business rules to classify each field value as valid, invalid, or blank. Each property (demographic field) has an associated classifier rule.

Default rules are provided for standard fields:

  • SSN: Validates format (9 digits), checks for common dummy values (000-00-0000, 123-45-6789)
  • Phone Number: Validates format, identifies dummy numbers (999-999-9999, 000-000-0000)
  • Birth Date: Checks for valid dates, identifies dummy dates (01/01/1900)
  • Name Fields: Identifies placeholder values ("Baby", "Unknown", "Test")
  • Gender: Validates against standard codes (M, F, U, O)
  • ZIP Code: Validates 5-digit or ZIP+4 format
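The three-way valid/invalid/blank classification these rules perform can be sketched in a few lines. This is an illustrative stand-in, not the HSDQ rule engine, and the dummy-value list is only an example:

```python
import re

def classify_ssn(value) -> str:
    """Classify an SSN as 'blank', 'invalid', or 'valid' (illustrative rule)."""
    if value is None or str(value).strip() == "":
        return "blank"
    digits = re.sub(r"\D", "", str(value))
    # Nine digits required; known dummy values count as invalid
    if len(digits) != 9 or digits in {"000000000", "123456789"}:
        return "invalid"
    return "valid"

print(classify_ssn(""))             # blank
print(classify_ssn("123-45-6789"))  # invalid (dummy value)
print(classify_ssn("452-98-1437"))  # valid
```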

Creating Custom Business Rules

To address site-specific data quality issues, you should create custom business rules:

1. Navigate to Interoperability > List > Business Rules.
2. Select "New" to open the Business Rule Wizard.
3. Enter a custom package name (critical for upgrade safety).
4. Provide a rule name, optional alias, and description.
5. In the general tab, select HSDQ.Classifier.Rule.Assist as the Rule Assist Class.
6. In the ruleSet tab, define the rule logic using existing rules as templates.
7. Save the rule.
8. Return to Data Quality Manager, select the property, and choose your custom rule from the Classifier Rule dropdown.

For example, if a facility uses "VET" prefix for veterinary patients, create a rule that classifies any MRN starting with "VET" as invalid.

Data Quality Dashboards

Three main dashboards provide data quality visibility:

InterSystems EMPI Data Quality Summary: Displays valid, invalid, and blank counts for each property in the patient cube. Shows data quality from source systems before normalization.

InterSystems EMPI Data Quality Summary - Normalized: Displays valid, invalid, and blank counts after normalization. Compare it to the regular summary to measure normalization effectiveness. Ideally, the normalized cube should have zero invalid values.

Trend Dashboard: Displays long-term trends for selected pivots, helping identify when data quality changed.

Dashboard features:

  • Filter by Facility or Assigning Authority to identify problem sources
  • Export to PDF or Excel for reporting
  • Drill down into specific properties for detailed analysis

Data Quality Pivots

Pivots provide detailed counts and percentages for individual property values:

  • Count of each unique value (e.g., how many patients have SSN "000-00-0000")
  • Percentage of total records
  • Filter by Assigning Authority or Facility to isolate problem sources
  • Identify patterns (e.g., one facility consistently sends dummy phone numbers)

Data Quality Properties Table

The Data Quality Manager displays a properties table showing:

  • Property Name: Demographic field (SSN, FamilyName, etc.)
  • Classifier Rule: Business rule applied to this property
  • Cubes: Whether property exists in Patient cube only (O) or both Patient and Normalized cubes (O+N)

For each property, you can view:

  • Top Count: Maximum number of distinct values tallied (others grouped as "other values")
  • Classifier Type: Expected value type (String, Date, etc.)
  • Classifier Rule: Specific business rule applied

---

Documentation References

3. Analyzing Data Using SQL Queries

Key Points

  • Direct SQL access to HSPI tables for custom analysis
  • Operational queries provide pre-built data quality reports
  • Terminal, ODBC, JDBC, or Management Portal SQL interface
  • Query patient demographics, linkage data, and audit history
  • Custom queries enable site-specific analysis needs

Detailed Notes

SQL queries provide powerful, flexible analysis capabilities for examining demographic data quality in InterSystems EMPI. SQL access allows you to create custom reports, investigate specific data quality issues, and perform ad-hoc analysis that may not be covered by standard Data Quality Tool dashboards.

Key EMPI Database Tables

Understanding EMPI table structure is essential for SQL analysis:

HSPI.Data.Patient: Contains all patient demographic records from source systems. Each row represents one patient record from one facility. Key fields include:

  • ID: Internal record identifier
  • MPIID: Master Patient Index ID (enterprise identifier)
  • AssigningAuthority: Facility or system that assigned the MRN
  • MRN: Medical Record Number from source facility
  • FamilyName, GivenName, MiddleName: Name components
  • SSN: Social Security Number
  • BirthDate, Gender: Core demographics
  • StreetLine, City, State, PostalCode: Address fields
  • PhoneNumber, TelecomUse: Contact information
  • UpdatedOn, CreatedOn: Record timestamps

Normalized Table (name varies by linkage definition, e.g., HSPI.Data.Normalized): Contains normalized versions of patient data after normalization functions have been applied. Same fields as Patient table but with cleaned/standardized values.

HSPI.Data.Classified: Contains linkage classification for record pairs:

  • ID1, ID2: Internal IDs of the two records in the pair
  • LinkWeight: Calculated similarity score
  • LinkStatus: Link, Potential Link, or Non-Link
  • LinkReason: Why records were linked (Threshold, Manual, Deterministic, etc.)
  • AgreementPattern: String showing which parameters agreed/disagreed

HSPI.Data.Worklist: Contains record pairs requiring human review:

  • Category: Worklist category (Review, Validate, Duplicate, Overlay, Overlap, etc.)
  • LinkWeight, LinkStatus, LinkReason: Same as Classified table
  • TimeCreated, TimeUpdated: When pair appeared on worklist

HSPI.Audit: Audit trail of all EMPI actions:

  • Action: What was done (Link, Unlink, Merge, etc.)
  • Username: Who performed the action
  • TimeStamp: When action occurred
  • Comment: User-entered comment

Operational Queries (Pre-Built SQL Procedures)

InterSystems EMPI provides a comprehensive set of operational queries - stored SQL procedures that answer common data quality questions. These queries are located in the HSPI_Report_Queries package.

To access operational queries:

1. Navigate to System Explorer > SQL in the Management Portal.
2. Expand "Procedures" in the left pane.
3. Locate queries starting with HSPI_Report_Queries.
4. Click a query to view its Catalog Details and parameters.
5. Use the Execute Query tab to run the query.

Patient Record Queries:

  • `Patient_Records(AssigningAuthority)`: Total number of patient records
  • `Patient_RecordsByAssigningAuthority()`: Count by facility
  • `Patient_RecordsForMPIID(MPIID)`: All records linked under an MPIID
  • `Patient_Unique(AssigningAuthority)`: Count of distinct patients (by MPIID)

Facilities Queries:

  • `Facilities_Count()`: Total number of unique facilities
  • `Facilities_List()`: List of all facilities
  • `Facilities_AACount()`: Count of unique assigning authorities
  • `Facilities_AAList()`: List of assigning authorities

Worklist Category Queries:

  • `Review_Count()`, `Review_Detail()`: Review category items
  • `Validate_Count()`, `Validate_Detail()`: Validate category items
  • `Duplicate_Count()`, `Duplicate_Detail()`: Same-domain duplicates
  • `Overlay_Count()`, `Overlay_Detail()`: Different records, same MPIID
  • `Overlap_Count()`, `Overlap_Detail()`: Linked records, different MPIIDs
  • `Deterministic_Count()`, `Deterministic_Detail()`: Deterministic conflicts

Example Operational Query Usage:

```sql
-- Count duplicates for a specific facility
CALL HSPI_Report_Queries.Duplicates_ByDomain('FACILITYNAME')

-- Count merges in a date range
CALL HSPI_Report_Queries.Merges_Count('FACILITYNAME', '2025-01-01', '2025-12-31')

-- Detailed audit activity for a user
CALL HSPI_Report_Queries.Audit_Activity('jsmith', '2025-01-01', '2025-12-31')
```

Custom SQL Queries for Data Quality Analysis

Beyond operational queries, you can write custom SQL to investigate specific data quality issues:

Finding Missing SSN by Facility:

```sql
SELECT AssigningAuthority, COUNT(*) AS MissingSSN
FROM HSPI_Data.Patient
WHERE SSN IS NULL OR SSN = ''
GROUP BY AssigningAuthority
ORDER BY MissingSSN DESC
```

Identifying Dummy Phone Numbers:

```sql
SELECT PhoneNumber, COUNT(*) AS RecordCount
FROM HSPI_Data.Patient
WHERE PhoneNumber IN ('999-999-9999', '000-000-0000', '123-456-7890')
GROUP BY PhoneNumber
```

Finding Records with Default Birth Dates:

```sql
SELECT AssigningAuthority, COUNT(*) AS DefaultBirthDates
FROM HSPI_Data.Patient
WHERE BirthDate = '1900-01-01' OR BirthDate = '1800-01-01'
GROUP BY AssigningAuthority
```

Detecting MRN Reuse:

```sql
SELECT AssigningAuthority, MRN, COUNT(DISTINCT MPIID) AS UniqueMPIIDs
FROM HSPI_Data.Patient
GROUP BY AssigningAuthority, MRN
HAVING COUNT(DISTINCT MPIID) > 1
```

Finding Name Variations:

```sql
SELECT FamilyName, COUNT(*) AS Variations
FROM HSPI_Data.Patient
WHERE UPPER(FamilyName) IN (
    SELECT UPPER(FamilyName)
    FROM HSPI_Data.Patient
    GROUP BY UPPER(FamilyName)
    HAVING COUNT(DISTINCT FamilyName) > 1
)
GROUP BY FamilyName
ORDER BY Variations DESC
```

Comparing Original vs Normalized Data:

```sql
SELECT P.AssigningAuthority,
       COUNT(*) AS TotalRecords,
       SUM(CASE WHEN P.SSN IS NULL THEN 1 ELSE 0 END) AS OriginalNullSSN,
       SUM(CASE WHEN N.SSN IS NULL THEN 1 ELSE 0 END) AS NormalizedNullSSN
FROM HSPI_Data.Patient P
JOIN HSPI_Data.Normalized N ON P.ID = N.ID
GROUP BY P.AssigningAuthority
```
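Any of these queries can be dry-run against a lightweight engine before pointing them at EMPI. A self-contained sketch using SQLite, with a mock table standing in for HSPI_Data.Patient (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (AssigningAuthority TEXT, MRN TEXT, MPIID INTEGER)")
conn.executemany(
    "INSERT INTO Patient VALUES (?, ?, ?)",
    [
        ("MEMORIAL", "1001", 1),
        ("MEMORIAL", "1001", 2),   # same MRN, different enterprise IDs: reuse
        ("MEMORIAL", "1002", 3),
        ("COUNTY",   "1001", 4),   # same MRN at another facility is not a conflict
    ],
)

# Same shape as the MRN-reuse query above, run against the mock table
rows = conn.execute(
    """
    SELECT AssigningAuthority, MRN, COUNT(DISTINCT MPIID) AS UniqueMPIIDs
    FROM Patient
    GROUP BY AssigningAuthority, MRN
    HAVING COUNT(DISTINCT MPIID) > 1
    """
).fetchall()
print(rows)  # [('MEMORIAL', '1001', 2)]
```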

SQL Query Tools and Interfaces

Several interfaces are available for executing SQL queries:

Management Portal SQL Interface:

  • Navigate to System Explorer > SQL
  • Select namespace (HSPI or Registry namespace)
  • Use Execute Query tab
  • Results can be exported or printed

Terminal:

  • Launch a Terminal session on the server hosting EMPI
  • Execute SQL directly in Terminal window
  • Useful for scripting and automation

ODBC/JDBC:

  • Connect external tools (Excel, Tableau, custom applications)
  • Use EMPI as data source for reporting tools
  • Enables integration with enterprise reporting platforms

SQL Shell:

  • Command-line interface for SQL
  • Useful for batch processing and automation

---

Documentation References

4. Analyzing Pairs Using Worklist

Key Points

  • Worklist shows record pairs requiring human review
  • Seven categories identify different data quality issues
  • Filters enable targeted analysis of specific problems
  • Comparison Detail provides side-by-side record analysis
  • Agreement patterns reveal which fields matched or conflicted

Detailed Notes

The Worklist is a critical tool for identifying and understanding data quality problems through the lens of patient matching. Record pairs appear on the Worklist when the EMPI linkage engine cannot confidently determine whether they represent the same patient or when conflicts exist.

Worklist Categories and Data Quality Implications

Each Worklist category indicates a specific type of data quality or linkage issue:

Review Category: Link weight falls between autolink and review thresholds (e.g., 20-30). These pairs have moderate similarity - enough commonality to warrant review but not enough for automatic linking.

*Data Quality Implications*:

  • Partial demographic matches due to incomplete data
  • Name variations that reduce matching confidence
  • Missing key identifiers forcing reliance on weaker fields
  • Inconsistent address or contact information

Validate Category: Link weight falls between validate and autolink thresholds (e.g., 30-50). These pairs have high similarity and were automatically linked but should be validated.

*Data Quality Implications*:

  • Very similar but not identical demographics
  • Common names reducing uniqueness
  • Potential for false matches if not reviewed
  • May reveal systematic data entry differences between facilities

Duplicate Category: Two records whose link weight is above the autolink threshold belong to the same data domain (same facility). This violates the expectation that each patient has only one record per facility.

*Data Quality Implications*:

  • MRN reuse at the facility (serious patient safety issue)
  • Temporary MRNs not properly merged
  • Data migration creating duplicate records
  • Multiple visits creating separate registrations

Overlay Category: Two records that are NOT linked but share the same MPIID. This is a serious data quality error indicating records for different patients were incorrectly merged.

*Data Quality Implications*:

  • Patient safety risk - one patient may see another patient's data
  • May result from MRN reuse combined with deterministic matching
  • Incorrect manual linking decisions
  • System migration errors

Overlap Category: Two records that ARE linked but have different MPIIDs. This conflict indicates the linking engine wants to link records but they're associated with different enterprise identifiers.

*Data Quality Implications*:

  • Historical manual unlinking that should be re-evaluated
  • Deterministic identifiers preventing linkage
  • Transitivity conflicts in complex linkage chains
  • May require expert review to resolve

Open-Chaining Category: Two records are both linked to a third record but not linked to each other. For example, Record A is linked to Record C, Record B is linked to Record C, but Record A is not linked to Record B.

*Data Quality Implications*:

  • Inconsistent demographic data across records
  • Partial matches that create ambiguous linkage patterns
  • May indicate gradual data quality degradation over time
  • Suggests need for linkage definition tuning

Deterministic Category: A deterministic identifier conflict prevents honoring expected behavior. For example, two records have the same SSN (should force link) but also have same facility/different MRN (should force non-link).

*Data Quality Implications*:

  • SSN data quality issues (duplicates, errors)
  • MRN reuse at facilities
  • Data entry errors in deterministic fields
  • Conflicting business rules

Worklist Search and Filter Capabilities

The Worklist search panel provides powerful filtering to isolate specific data quality issues:

Category Filter: Select one or more categories to focus on specific problem types. Most data quality analysis starts with Duplicate, Overlay, and Deterministic categories as these indicate serious issues.

Data Source Filters: Compare or filter by facility/data source:

  • Find all problems from a specific facility
  • Compare two facilities to identify systematic differences
  • Identify which source systems have poorest data quality
  • Filter to "same data source" to find within-facility duplicates
  • Filter to "different data source" to focus on cross-facility matching

Time Created/Updated: Identify when problems appeared:

  • New problems since last review session
  • Problems created after a system change
  • Historical problems not yet addressed
  • Trending analysis (problems increasing over time?)

Link Weight Range: Focus on pairs near threshold boundaries:

  • Pairs just below autolink threshold (near-misses)
  • Pairs just above review threshold (borderline matches)
  • Very high weight conflicts (unexpected problems in strong matches)

Link Reason: Filter by why records were linked:

  • Threshold: Linked by similarity score
  • Manual: Linked by user action
  • Deterministic: Linked by SSN, etc.
  • Rule: Linked by custom rule
  • Transitivity: Linked through common third record
  • Domain Conflict: Same facility, same MRN

ID Search: Investigate specific patients:

  • Search by MPIID to find all pairs involving a patient
  • Search by SSN or MRN using Record Search window
  • Follow up on reported data quality issues

Worklist Fields for Data Quality Analysis

Each Worklist row displays critical information for assessing data quality:

Link Status: Visual indicator of relationship:

  • Green: Linked
  • Yellow: Potential Link
  • Red: Non-Link with conflict

Link Weight: Numeric similarity score. Understanding weight distribution helps identify:

  • Threshold tuning needs
  • Systematic data quality issues (many pairs cluster near thresholds)
  • Parameter weight adjustments needed

Agreement Pattern: String showing which parameters agreed (H), disagreed (X), partially matched (L), or were neutral (N). This is extremely valuable for data quality analysis.

Example pattern: HHLNX for linkage definition with 5 weighted parameters:

  • Parameter 1 (e.g., Name): High agreement
  • Parameter 2 (e.g., DOB): High agreement
  • Parameter 3 (e.g., SSN): Low agreement (partial match)
  • Parameter 4 (e.g., Gender): Neutral (one or both null)
  • Parameter 5 (e.g., Address): Negative (strong disagreement)

Frequent patterns reveal data quality issues:

  • Many "N" (neutral) = missing data
  • Many "L" (low) = inconsistent data entry
  • Many "X" (negative) = conflicting data or bad normalization
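One way to surface recurring patterns is to tally letter frequencies per parameter position across exported worklist pairs. A minimal sketch with invented patterns:

```python
from collections import Counter

# Hypothetical agreement patterns from a worklist export; each position is one
# weighted parameter (H=high, L=low/partial, N=neutral, X=disagreement)
patterns = ["HHLNX", "HHNNH", "HLNNX", "HHNNN"]

# Count letters at each position: a position dominated by N suggests missing
# data for that parameter; many L or X suggests inconsistent or conflicting data
per_position = [Counter(p[i] for p in patterns) for i in range(len(patterns[0]))]
for i, tally in enumerate(per_position, start=1):
    print(f"Parameter {i}: {dict(tally)}")
```

Here parameter 4 is neutral in every pair, pointing at a field that is frequently blank at the source.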

Summary Field: Displays key demographics for quick assessment. Click summary to open Record History.

Unique Identifier: Shows AssigningAuthority:MRN for each record in the pair. Quickly identifies the facilities involved.

Data Source/Data Domain: Identifies which facilities or systems provided the records.

Time Created/Updated: When the pair appeared on worklist and when last modified.

Using Comparison Detail for Deep Analysis

Click the Comparison Detail icon (table icon) to open side-by-side record comparison. This view is essential for understanding specific data quality issues:

Original Properties Section: Shows raw demographic data from source systems:

  • Exact values as received from facilities
  • Identify data entry inconsistencies
  • Spot missing fields
  • See dummy/placeholder values

Normalized Properties Section: Shows data after normalization functions applied:

  • Assess normalization effectiveness
  • Identify normalization errors
  • Verify null value lists working correctly
  • Check exclusion conditions

Linkage Section: Shows matching algorithm details:

  • Current link weight and thresholds
  • Agreement pattern explained
  • Link status and reason
  • Secondary reason (if applicable)

Link Weight by Parameters: Breaks down how each parameter contributed to total link weight:

  • Positive contributions (agreement)
  • Negative contributions (disagreement)
  • Zero contributions (neutral)
  • Identify which parameters are most/least reliable
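The per-parameter breakdown is just a signed sum: agreement adds weight, disagreement subtracts it, neutral contributes nothing. A toy illustration with invented contribution values (real weights come from the linkage definition):

```python
# Hypothetical per-parameter contributions to one pair's link weight;
# positive = agreement, negative = disagreement, zero = neutral (nulls)
contributions = {
    "Name":      9.5,
    "BirthDate": 8.0,
    "SSN":       3.2,   # partial match contributes less than full agreement
    "Gender":    0.0,   # one or both values null: neutral
    "Address":  -4.1,   # strong disagreement subtracts weight
}

link_weight = sum(contributions.values())
print(round(link_weight, 1))  # 16.6
```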

Other Links: Shows other records linked to each record in the pair:

  • Understand complete linkage chain
  • Identify open-chaining situations
  • See if either record has other conflicts

Color Coding: Comparison Detail uses colors consistently:

  • Green: Values match or strongly agree
  • Yellow: Values partially match
  • Red: Values strongly disagree
  • Gray: One or both values null

Worklist Data Quality Analysis Workflow

Systematic approach to using Worklist for data quality analysis:

1. Start with High-Priority Categories: Begin with Overlay and Duplicate categories as these have highest patient safety impact.

2. Filter by Facility: Analyze one facility at a time to identify facility-specific patterns.

3. Sort by Link Weight: Review highest-weight conflicts first (most unexpected).

4. Examine Agreement Patterns: Look for recurring patterns indicating systematic issues.

5. Open Comparison Detail: Investigate representative pairs in detail.

6. Document Findings: Note common data quality issues for each facility.

7. Identify Root Causes: Determine whether issues are:

  • Source system data entry problems
  • Interface/transformation errors
  • Normalization function gaps
  • Business rule conflicts

8. Quantify Impact: Use Worklist counts to measure scope of each issue.

9. Prioritize Remediation: Address issues with highest volume or safety impact first.

---

Documentation References

5. Reporting Data Quality Issues

Key Points

  • Document findings with specific examples and metrics
  • Classify issues by source facility and root cause
  • Quantify impact using counts and percentages
  • Provide actionable recommendations for remediation
  • Track issue resolution and measure improvement

Detailed Notes

Effective data quality reporting transforms analysis findings into actionable information that drives improvement. Reports must clearly communicate issues to technical teams, facility stakeholders, and management.

Report Structure and Components

A comprehensive data quality report should include:

Executive Summary:

  • Overall data quality assessment
  • Critical issues requiring immediate attention
  • Impact on patient matching accuracy
  • Recommended priorities for remediation

Methodology Section:

  • Analysis period and scope
  • Tools used (Data Quality Tool, SQL queries, Worklist analysis)
  • Data sources examined
  • Queries executed

Findings by Category:

*Missing Data Issues*:

  • Which fields are frequently missing
  • Percentage of records affected
  • Facilities with highest missing data rates
  • Impact on matching (e.g., "23% of records lack SSN, reducing match confidence by average of 12 points")

*Invalid Data Issues*:

  • Dummy values identified (with counts)
  • Invalid formats or values
  • Default values not yet updated
  • Test data in production

*Inconsistent Data Issues*:

  • Name variations and patterns
  • Address standardization gaps
  • Date format inconsistencies
  • Value encoding differences between facilities

*Conflict Issues*:

  • MRN reuse incidents
  • Deterministic conflicts
  • Overlay/Overlap situations
  • Duplicate records

Facility-Specific Reporting

Break down findings by facility/data source:

Facility Data Quality Scorecard:

```
Facility: Memorial Hospital
Total Records: 45,230
Analysis Period: 2025-01-01 to 2025-12-31

Missing Data Rates:
  • SSN Missing: 18% (8,141 records)
  • Phone Missing: 34% (15,378 records)
  • Middle Name Missing: 67% (30,304 records)

Invalid Data Detected:
  • Dummy Phone "999-999-9999": 2,341 records
  • Default DOB "01/01/1900": 156 records
  • Name "Baby Boy/Girl": 89 records

Data Quality Issues:
  • MRN Reuse: 3 incidents affecting 6 patients
  • Duplicate Records: 127 same-patient duplicates
  • Name Variations: High ("Smith" vs "Smyth" pattern in 234 cases)
```

Comparative Analysis: Compare facilities to identify outliers and best practices:

```
SSN Completeness by Facility:
  • City General: 95% complete (best practice)
  • Memorial Hospital: 82% complete
  • County Clinic: 67% complete (requires attention)
  • Regional Medical: 45% complete (critical issue)
```

Data Quality Metrics and KPIs

Establish quantitative measures:

Completeness Metrics:

  • Percentage of records with complete core demographics
  • Field-level completeness rates
  • Trend over time (improving or degrading?)
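Field-level completeness is a simple ratio of non-blank values to total records. A minimal sketch over mock records (values invented):

```python
# Illustrative completeness computation over a handful of mock records
records = [
    {"SSN": "452-98-1437", "PhoneNumber": "617-555-0142"},
    {"SSN": "",            "PhoneNumber": "617-555-0199"},
    {"SSN": None,          "PhoneNumber": ""},
    {"SSN": "521-09-8833", "PhoneNumber": None},
]

def completeness(field: str) -> float:
    """Percentage of records with a non-blank value for the field."""
    filled = sum(1 for r in records if r.get(field))
    return 100.0 * filled / len(records)

print(f"SSN completeness: {completeness('SSN'):.0f}%")            # 50%
print(f"Phone completeness: {completeness('PhoneNumber'):.0f}%")  # 50%
```

In practice the same ratio would be computed per facility and per field, then trended month over month.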

Validity Metrics:

  • Percentage of records with valid values
  • Invalid value counts by type
  • Normalization effectiveness (invalid % before vs after normalization)

Consistency Metrics:

  • Name standardization rate
  • Address parse success rate
  • Date format standardization rate

Conflict Metrics:

  • MRN reuse rate per 10,000 records
  • Overlay/Overlap incidents
  • Deterministic conflicts per month

Matching Impact Metrics:

  • Average link weight for matched pairs (higher = more confidence)
  • Percentage of matches above autolink threshold
  • Review workload (pairs requiring human review)
  • False match indicators (overlays discovered)
  • False non-match indicators (duplicates discovered)

Root Cause Analysis

For each major issue, identify root cause:

Source System Issues:

  • Field not collected by source application
  • Field optional in source, frequently skipped
  • Source system accepts invalid values
  • Legacy data migration introduced errors

Interface Issues:

  • HL7 transformation errors
  • Field mapping mistakes
  • Character encoding problems
  • Date format conversion errors

Process Issues:

  • Staff training gaps
  • Registration workflow allows skipping required fields
  • Lack of data validation at entry point
  • No process for updating temporary values

EMPI Configuration Issues:

  • Normalization functions not addressing facility-specific patterns
  • Null value lists incomplete
  • Exclusion conditions not defined for known dummy values
  • Linkage parameters not weighted appropriately for data characteristics

Actionable Recommendations

Provide specific, prioritized recommendations:

High Priority (Patient Safety Impact):

1. Address MRN reuse at Facility X immediately - implement MRN validation to prevent reuse.
2. Resolve the 3 overlay incidents - a patient may see another patient's data.
3. Add the "VET" prefix to the MRN exclusion condition to prevent veterinary patient matching.

Medium Priority (Matching Accuracy Impact):

1. Add common dummy phone numbers to the null value list for Facility Y.
2. Create a normalization function to standardize name suffixes (Jr., Sr., III).
3. Update the HL7 transformation for Facility Z to correctly map the middle name field.

Low Priority (Data Quality Improvement):

1. Request that Facility A improve its SSN collection rate.
2. Standardize address abbreviations across all facilities.
3. Implement data quality monitoring dashboards for ongoing visibility.

Issue Tracking and Follow-Up

Establish tracking mechanism:

Issue Log Format:

```
Issue ID: DQ-2025-001
Category: Invalid Data
Severity: High
Facility: County Clinic
Description: 2,341 records contain dummy phone number "999-999-9999"
Root Cause: Registration system accepts any phone format; staff enter dummy value when patient declines
Recommendation: Add "999-999-9999" to null value list in EMPI linkage definition
Assigned To: EMPI Administrator
Status: Resolved
Resolution Date: 2025-02-15
Resolution: Added to null value list, rebuilt linkage data
Verification: Confirmed normalized cube shows 0 instances of "999-999-9999"
```

Trend Reporting

Track data quality over time:

Monthly Data Quality Dashboard:

  • Total records processed
  • New data quality issues discovered
  • Issues resolved
  • Outstanding issue count
  • Data quality scores by facility
  • Matching accuracy metrics

Quarterly Executive Report:

  • Data quality trend (improving/stable/degrading)
  • Major issues addressed
  • Remaining challenges
  • Resource needs for further improvement
  • ROI from data quality initiatives (e.g., reduced manual worklist review time)

Stakeholder Communication

Tailor reports for different audiences:

Technical Teams (EMPI admins, interface developers):

  • Detailed technical findings
  • SQL queries used
  • Specific configuration changes needed
  • HL7 transformation corrections required

Facility Data Quality Coordinators:

  • Facility-specific scorecards
  • Examples of problematic records (de-identified)
  • Registration process improvements needed
  • Training recommendations

Management/Leadership:

  • Executive summary
  • Patient safety impact
  • Matching accuracy metrics
  • Resource requirements
  • Timeline for improvements

---

Documentation References
