1. Understanding Demographic Data Collection and Assessment
Key Points
- Collect comprehensive demographic data inventory from all participating sites
- Identify available patient identifiers: Names, SSN, Gender, Birth Date, Addresses, Telecoms
- Document data formats, completeness, and quality by facility/assigning authority
- Assess consistency across data domains (facilities)
- Identify dummy values, default values, and missing data patterns
- Use Data Quality tool to analyze valid, invalid, and blank values by source
Detailed Notes
When implementing or updating an EMPI system, the first critical step is collecting and appraising demographic data from each participating site. This assessment drives all subsequent configuration decisions in the linkage definition.
Collecting Site Information
Technical specialists must gather detailed information about the demographic data available from each facility or data domain. The default EMPI parameters include:
- Names (First, Last, Middle)
- Social Security Number (SSN)
- Gender
- Birth Date
- Identifiers (MRN, facility-specific IDs)
- Addresses (Street, City, State, ZIP)
- Telecoms (Phone numbers, emails)
Additionally, Facility and MRN parameters are used in preliminary matching. Records matching on the facility/assigning authority/MRN triplet are automatically updated without requiring full linkage comparison.
Appraising Data Quality
The InterSystems EMPI Data Quality tool provides analytics cubes to evaluate data completeness and validity:
- PatientIndexDQCube: Built from HSPI.Data.Patient table, shows original data quality
- PatientIndexNormalizedDQCube: Built from normalized table, shows data after normalization
- DQTrend: Tracks long-term trends in data quality over time
The Data Quality tool uses business rules to classify values as valid, invalid, or blank. Dashboards display counts by property, facility, and assigning authority. This analysis reveals:
- Which facilities provide complete SSN data versus those that don't collect it
- Sources using dummy telephone numbers (e.g., 000-000-0000)
- Default values that should be treated as null
- Properties with high percentages of blank or invalid values
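For a quick ad-hoc check of one of these patterns outside the dashboards, a direct query against the source table can confirm a suspicion. A minimal ObjectScript sketch, assuming the HSPI_Data.Patient SQL projection exposes Facility and SSN columns (the column names are assumptions about the schema; the Data Quality dashboards report this out of the box):
```objectscript
// Hypothetical spot check: count blank SSNs by facility.
// Column names (Facility, SSN) are assumptions; IRIS SQL stores '' as NULL.
set tSQL = "SELECT Facility, COUNT(*) AS Total, SUM(CASE WHEN SSN IS NULL THEN 1 ELSE 0 END) AS BlankSSN FROM HSPI_Data.Patient GROUP BY Facility"
set tRS = ##class(%SQL.Statement).%ExecDirect(, tSQL)
while tRS.%Next() {
    write tRS.%Get("Facility"), ": ", tRS.%Get("BlankSSN"), " blank of ", tRS.%Get("Total"), !
}
```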
Sites should create custom Data Quality dashboards and trending pivots to monitor specific concerns relevant to their implementation. Finding and remediating data quality issues significantly improves and accelerates the tuning process.
2. Data Normalization Strategies
Key Points
- Normalization standardizes data for consistent comparison
- Default normalization functions provided for each linkage type
- Custom normalization functions address site-specific data problems
- Common normalizations: case conversion, punctuation removal, date standardization
- Null value lists identify dummy values to exclude from matching
- Normalization function overrides null value list unless coded to check it
- Use %object, %property, and %parameters variables in custom functions
Detailed Notes
Normalization is the process of standardizing demographic data into a consistent format before comparison. Each linkage type has an associated default normalization function that describes how to normalize that particular type of data.
Default Normalization Functions
Default normalization includes:
- Converting all characters to lowercase
- Removing punctuation and special characters
- Standardizing presentation of addresses, phone numbers, and dates
- Stripping whitespace
For example, "SMITH, JOHN" and "Smith,John" both normalize to "smithjohn" for comparison purposes.
Selecting Normalization Approaches for Problem Data
When data quality issues are identified, technical specialists must select appropriate normalization approaches:
List of Values to be Treated as Null: For known dummy values (e.g., 000-00-0000 for SSN, 999-999-9999 for phone), add these to the null value list for the parameter. These values will be excluded from linkage weight calculations.
Custom Normalization Functions: Override default normalization by entering custom ObjectScript code in the Normalization Function field. Custom functions should return the normalized value.
Custom Normalization Function Structure
Custom functions have access to:
- %object: Access other normalized values for the record (e.g., %object.getDataSource())
- %property: The original value being normalized
- %parameters: List of linkage type parameter name/value pairs
Typical approaches:
1. Call default function then manipulate result
2. Manipulate original value then pass to default function
Example: Gender Normalization
```objectscript
set gender=##class(%MPRL.LinkageType.Gender).getNormalized(%object,%property,.%parameters)
if gender'="M",gender'="F" set gender="U"
quit gender
```
This normalizes any gender values other than M or F to U (Unknown).
Example: Date Normalization for Invalid Dates
```objectscript
if +$zdt(%property,3)<1900 quit ""
quit ##class(%MPRL.LinkageType.Date).getNormalized(%object,%property,.%parameters)
```
This treats dates before January 1, 1900, as missing values.
Important Consideration
If both a null value list and normalization function are defined, the normalization function overrides the null value list by default. To preserve null value functionality, add this code at the beginning of custom functions:
```objectscript
if (%property="")||$data(%parameters("_MissingValueArray",$zcvt(%property,"l"))) quit ""
```
3. Modifying Parameters in Definition Designer
Key Points
- Access Definition Designer from EMPI menu
- Parameters tab displays default parameters (Names, SSN, Gender, Birth Date, etc.)
- Modify parameter values: linkage type, agreement/disagreement weights, normalization
- Settings tab: Locale, Enable Domain Conflict, threshold values
- Data Classes tab: Must use default HSPI.Data.Patient class (not customizable)
- Weighted parameters contribute to agreement pattern displayed in Worklist
- Save changes and rebuild linkage data to apply modifications
Detailed Notes
The Definition Designer is the primary interface for configuring the EMPI linkage definition. It provides a comprehensive UI for managing all aspects of how records are compared and linked.
Accessing Definition Designer
Navigate to Definition Designer from the InterSystems EMPI menu. The interface includes multiple tabs: Settings, Data Classes, Parameters, Link Keys, and Calibration.
Settings Tab
Key settings include:
Locale: Choose the geographical location that best represents your data. As of version 2020.1, EMPI supports: France, Germany, Italy, South Africa, Spain, and USA. The locale affects name matching algorithms, alias dictionaries, and address normalization.
Enable Domain Conflict: Controls how EMPI handles similar records from the same data domain (facility):
- Enabled: Records from same domain with different MRNs are NOT linked (duplicates should be resolved by source system)
  - Above autolink threshold: Category = Duplicate, Status = non-link, Reason = domain conflict
  - Below autolink threshold: Category = Review, Status = potential link, Reason = threshold
- Disabled: Records from same domain are linked and assigned same MPIID
Data Classes Tab
The Data Classes tab identifies locations and kinds of data to be analyzed. Settings are NOT customizable.
Critical: You must use the default HSPI.Data.Patient class for patient data. Do not modify default data classes or create additional data classes.
Parameters Tab
The Parameters tab is where most linkage definition work occurs. Default parameters are listed on the left:
- Names
- SSN
- Gender
- Birth Date
- Identifiers
- Addresses
- Telecoms
For each parameter, you can modify:
Parameter Name: Descriptive name for the parameter
Normalized Property Name: Field name in normalized class (e.g., stdSSN for SSN parameter)
Weighted Checkbox: If checked, parameter contributes to linkage weight calculation. Unweighted parameters can still be used in link keys.
Linkage Type: Class describing field contents (e.g., %MPRL.LinkageType.GivenName). Determines:
- How data is normalized
- Default agreement function
- Default agreement/disagreement weights
Linkage Type Parameters: Each linkage type has associated parameters affecting weight calculation. For example, HSPI.LinkageType.Name includes:
- FrequencyAdjusted
- FrequencyAdjustmentMaxFactor
- CheckTranspositions
- ScrubAffix
- AgreementWeightPercentage
Original Property Name: The field name as it appears in data class (case sensitive). Multiple properties can be comma-separated.
List of Values to be Treated as Null: Dummy values to exclude from comparison (e.g., 000-00-0000)
Agreement Weight: Amount added to link weight when values match
Disagreement Weight: Amount subtracted from link weight when values don't match (typically negative)
Normalization Function: Custom ObjectScript code to override default normalization
Agreement Function: Custom ObjectScript code to override default comparison logic
Saving and Applying Changes
After modifying parameters:
1. Click "Save All" to persist changes
2. Navigate to the Linkage Data tab
3. Build or rebuild linkage data to apply changes to existing records
4. Threshold Values and Adjustment Process
Key Points
- Three thresholds control linkage decisions: Review, Autolink, Validate
- Review threshold: Minimum link weight for Worklist inclusion
- Autolink threshold: Cutoff between automatic links and potential links
- Validate threshold: Maximum link weight for Worklist appearance
- Adjust thresholds to balance automation vs. manual review
- Early implementation: Lower thresholds (more manual review)
- Mature implementation: Higher thresholds (more automation)
- MLE calibration does NOT adjust thresholds (manual adjustment required)
Detailed Notes
Threshold values are critical configuration parameters that determine how record pairs are classified and whether they require manual review.
The Three Thresholds
Review Threshold: Defines the lowest link weight at which a record pair is included in the Worklist. Any record pair with a link weight below the review threshold never undergoes further consideration as a match. These pairs are classified as "strong non-links."
Autolink Threshold: Defines the cutoff between record pairs that are automatically linked and those requiring review. Record pairs with link weight between review and autolink thresholds are "potential links" that appear in the Review category of the Worklist.
Validate Threshold: Defines the highest link weight at which pairs appear on the Worklist. All records above the autolink threshold are automatically linked. Those between autolink and validate thresholds are links that appear in the Validate category for quality assurance review.
Example Threshold Scenario
Consider these threshold values:
- Review: 10
- Autolink: 20
- Validate: 25
Record pair classifications:
- Link weight < 10: Strong non-link (not on Worklist)
- Link weight 10-19: Potential link (Review category)
- Link weight 20-24: Automatic link (Validate category)
- Link weight ≥ 25: Automatic link (not on Worklist)
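The decision logic reduces to a comparison chain against the three thresholds. A hedged sketch in ObjectScript (a hypothetical helper using the example values above, not product code):
```objectscript
ClassMethod ClassifyPair(pWeight As %Numeric) As %String
{
    // Example thresholds: Review=10, Autolink=20, Validate=25
    if pWeight<10 quit "Strong non-link (not on Worklist)"
    if pWeight<20 quit "Potential link (Review category)"
    if pWeight<25 quit "Automatic link (Validate category)"
    quit "Automatic link (not on Worklist)"
}
```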
Tuning Philosophy
Threshold adjustment is an iterative process aligned with implementation maturity:
Early Stage: Set thresholds to create larger Worklists:
- Lower autolink threshold: More pairs require manual review
- Higher validate threshold: More automatic links appear for validation
- Purpose: Allow manual review of borderline cases to understand data patterns
Mature Stage: Adjust thresholds for greater automation:
- Higher autolink threshold: Fewer pairs require review
- Lower validate threshold: High-confidence links skip validation
- Purpose: Only uncertain cases require manual intervention
Threshold Adjustment Process
1. Review Worklist after building linkage data
2. Analyze Review category pairs: Are they true matches?
3. Analyze Validate category pairs: Are they true links?
4. Identify patterns in link weights for true matches vs. non-matches
5. Adjust thresholds to optimize classification
6. Rebuild linkage data (Batch mode, Weights option)
7. Review results and iterate
Critical Note: Running MLE calibration adjusts parameter weights but does NOT adjust thresholds. Threshold tuning must be performed manually based on Worklist review and organizational capacity for manual review.
Tools for Threshold Analysis
Use the Worklist filtering and analysis features:
- Filter by link weight ranges
- Review agreement patterns
- Examine secondary reasons and comments
- Identify systematic misclassifications
The goal is to position thresholds so automatic decisions are highly accurate while manageable volumes appear for manual review.
5. Maximum Likelihood Estimation (MLE) Process
Key Points
- MLE uses probabilistic algorithms to calculate optimal parameter weights
- Analyzes actual data to determine agreement/disagreement weights
- Requires sizable dataset (minimum 100,000 records recommended)
- Accessed via "MLE Calibration" button in Definition Designer
- Monitor page shows real-time weight adjustments (aWeight, dWeight columns)
- Run until weights stabilize, then click "Stop Calibration"
- Apply Results to update parameter weights in linkage definition
- Multiple MLE iterations improve accuracy as data and thresholds evolve
Detailed Notes
Maximum Likelihood Estimation (MLE) is a powerful tool for determining appropriate parameter weights based on statistical analysis of actual data rather than generic defaults.
Understanding MLE
EMPI comes with default parameter weights based on generic patient data. These defaults typically produce large volumes of potential links on the Worklist. MLE analyzes your specific database to calculate what agreement and disagreement weights should be for each comparison field based on statistical calculations drawn from actual data patterns.
MLE uses probabilistic record linkage algorithms to determine:
- How discriminating each parameter is in your dataset
- Optimal weights that maximize linkage accuracy
- Statistical confidence in parameter comparisons
Running MLE Calibration
Step 1: Initiate Calibration
1. Navigate to Definition Designer
2. Click "MLE Calibration" in the banner
3. Click OK to confirm (or Cancel to return)
4. If calibration was run recently, you may be prompted to view the previous results
Step 2: Monitor Calibration Process
The calibration runs in the background. Click OK when prompted to open the Calibration Monitor page.
The monitor displays:
- Calibration Index: Current iteration number
- Linkage Definition: Which definition is being calibrated
- Sample Size: Dataset size being analyzed
- Status: Running status and timing
- Parameter table: Shows weight adjustments in real-time
  - Name: Parameter name
  - aWeight: Agreement weight (positive value)
  - dWeight: Disagreement weight (negative value)
  - maProb, uaProb, mdProb, udProb: Probability calculations
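The column names suggest classic Fellegi-Sunter probabilistic linkage, in which each parameter's weights derive from its match (m) and non-match (u) agreement probabilities. That interpretation is an assumption, since the product's exact formulas are not documented here, but the standard calculation gives useful intuition:
```objectscript
// Fellegi-Sunter style weights (assumed interpretation of maProb/uaProb):
// m = P(field agrees | records match), u = P(field agrees | records do not match)
set m = 0.95, u = 0.01
set aWeight = $zln(m/u)/$zln(2)            // log2(m/u) ≈ +6.6
set dWeight = $zln((1-m)/(1-u))/$zln(2)    // log2((1-m)/(1-u)) ≈ -4.3
write "aWeight=", aWeight, " dWeight=", dWeight, !
```
A field that agrees often among true matches but rarely among non-matches earns a large positive aWeight and a strongly negative dWeight.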
Step 3: Evaluate and Stop
Watch the aWeight and dWeight columns. Allow calibration to run until:
- Weights stop changing significantly
- Weighting scores are fairly stable
- Statistical convergence is achieved
Click "Stop Calibration" when satisfied with convergence.
Step 4: Apply Results
Two options when calibration completes:
Apply Results: Updates parameter weights in the linkage definition based on the calibration
OK without applying: Closes the monitor without updating weights (you can note the values for manual entry later)
Once applied, new weights appear in Parameters tab of Definition Designer.
Dataset Requirements
For useful MLE results:
- Minimum 100,000 records recommended
- Representative sample of data diversity
- Include all facilities/data domains
- Sufficient true matches and non-matches for statistical validity
Iterative MLE Process
InterSystems recommends running MLE in multiple iterations:
1. Initial run with default weights
2. Review results, adjust thresholds
3. Second MLE run with adjusted thresholds
4. Further threshold refinement
5. Continue iterations until satisfied with accuracy and Worklist volume
Each iteration provides new estimates based on current threshold values and accumulated data patterns.
After MLE Calibration
Critical: After applying MLE results, you must rebuild linkage data:
1. Navigate to the Linkage Data tab
2. Select Batch mode
3. Select the "Weights" option (recalculates linkages with new weights)
4. Run the build process
5. Review Worklist pairs with new weights
Note that data is unavailable during the batch rebuild process.
6. Agreement and Disagreement Weights Revision
Key Points
- Agreement weight: Added to link weight when parameters match
- Disagreement weight: Subtracted when parameters don't match (negative value)
- Overall link weight = sum of all parameter agreement/disagreement weights
- Higher agreement weight = parameter contributes more to match decisions
- Adjust weights based on parameter reliability in your dataset
- Unreliable data (e.g., phone numbers): Lower agreement, raise disagreement
- Critical identifiers (e.g., SSN): Higher agreement weight
- Manual adjustment complements MLE calibration results
Detailed Notes
Agreement and disagreement weights are fundamental to how EMPI calculates the overall link weight for each record pair. Understanding and properly tuning these weights is essential for accurate patient matching.
Weight Fundamentals
For each weighted parameter, EMPI compares values between two records and applies either:
- Agreement weight: If values match or are similar (positive value added to link weight)
- Disagreement weight: If values differ (negative value subtracted from link weight)
The sum of all parameter weights produces the overall link weight, which is compared against thresholds to determine link status.
How Weights Affect Linkage
Example Scenario:
Parameters and weights:
- SSN: Agreement +12, Disagreement -8
- Birth Date: Agreement +8, Disagreement -6
- Last Name: Agreement +6, Disagreement -4
- First Name: Agreement +5, Disagreement -3
- Gender: Agreement +2, Disagreement -1
Record Pair A vs. B:
- SSN: Match (+12)
- Birth Date: Match (+8)
- Last Name: Match (+6)
- First Name: Different (-3)
- Gender: Match (+2)
Link weight = 12 + 8 + 6 - 3 + 2 = 25
If autolink threshold is 20, this pair would be automatically linked.
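The link weight is literally this sum. A minimal sketch (a hypothetical helper; the product computes this internally during linkage builds):
```objectscript
ClassMethod LinkWeight(ByRef pAgree, ByRef pWeights) As %Numeric
{
    // pAgree(param)=1 if the values match, 0 if they differ
    // pWeights(param,"a") / pWeights(param,"d") hold the agreement/disagreement weights
    set tTotal = 0, tParam = ""
    for {
        set tParam = $order(pAgree(tParam)) quit:tParam=""
        set tTotal = tTotal + $select(pAgree(tParam):pWeights(tParam,"a"), 1:pWeights(tParam,"d"))
    }
    quit tTotal
}
```
With the values above (SSN, birth date, last name, and gender agreeing; first name disagreeing), the helper returns 25.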
Strategic Weight Adjustment
Increasing Parameter Importance: Raise agreement weight and/or make disagreement weight more negative (e.g., -4 to -6). This makes the parameter contribute more to both positive and negative match decisions.
Decreasing Parameter Importance: Lower agreement weight and/or make disagreement weight less negative (e.g., -4 to -2). Use this for unreliable data elements.
Data Quality Considerations:
If Data Quality tool reveals that telephone numbers are frequently missing or contain dummy values:
- Original: Agreement +5, Disagreement -4
- Adjusted: Agreement +2, Disagreement -1
- Result: Telecoms have less influence on link decisions
If SSN data is highly reliable and complete:
- Original: Agreement +10, Disagreement -6
- Adjusted: Agreement +15, Disagreement -10
- Result: SSN becomes stronger discriminator
Linkage Type Parameters
Beyond basic weights, linkage type parameters affect weight calculations:
FrequencyAdjusted: For name parameters, common surnames (Smith, Jones) contribute less weight than rare surnames, reflecting their lower discriminating value.
AgreementWeightPercentage / DisagreementWeightPercentage: Modify base weights by percentage for fine-tuning.
These parameters are set in the Linkage Type Parameter Name/Value fields in Definition Designer.
Weight Revision Workflow
1. Build linkage data with current weights
2. Review Worklist for misclassifications
3. Identify parameters causing false positives or false negatives
4. Adjust weights strategically
5. Rebuild linkage data (Batch mode, Weights option)
6. Evaluate results
7. Iterate until optimal accuracy achieved
Combine manual weight adjustment with MLE calibration for best results. MLE provides data-driven baseline; manual tuning addresses specific organizational priorities and data quality realities.
7. Evaluating Tuning Process Effectiveness
Key Points
- Use Worklist to review link and non-link pair results
- Analyze agreement patterns to understand match decisions
- Track Worklist volume trends over tuning iterations
- Measure false positive and false negative rates
- Review pairs by category: Review, Validate, Duplicate
- Filter by secondary reason, comment, facility, date range
- Monitor impact of each tuning change on classification accuracy
- Goal: High accuracy with manageable Worklist volume
Detailed Notes
Evaluating the effectiveness of tuning changes is critical to achieving optimal EMPI performance. The tuning process is iterative, requiring systematic measurement and validation after each modification.
Worklist Analysis
The Worklist is the primary tool for evaluating tuning effectiveness. It displays record pairs requiring review or validation, organized into categories:
Review Category: Pairs with link weight between review and autolink thresholds (potential links requiring decision)
Validate Category: Pairs with link weight between autolink and validate thresholds (automatic links recommended for quality assurance)
Duplicate Category: Same-domain pairs with different MRNs (when Domain Conflict enabled)
Key Metrics to Track
Worklist Volume: Total number of pairs requiring manual review
- Initial implementation: Expect high volumes
- After tuning: Should decrease to manageable levels
- Trend: Declining volume indicates improving configuration
Classification Accuracy:
- Review true matches in Review category: What percentage are actual matches?
- Review true non-matches: What percentage correctly identified?
- Validate links: What percentage of automatic links are accurate?
False Positives: Pairs incorrectly classified as links
- Check Validate category for non-matches
- Indicates thresholds too low or weights too generous
False Negatives: Pairs incorrectly classified as non-links
- Manually search for known matches not in Worklist
- Indicates thresholds too high or weights too conservative
Agreement Pattern Analysis
The agreement pattern is a string of characters showing parameter agreement for each pair. Example: XLXHN
Pattern characters:
- H: High agreement (exact match)
- X: Approximate agreement (similar but not exact)
- L: Low agreement (some similarity)
- N: No agreement (different)
- M: Missing data (one or both values null)
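A pattern is easier to read when each position is mapped back to its parameter. A hedged sketch, assuming each character corresponds to a weighted parameter in definition order (an assumption, and the helper is hypothetical):
```objectscript
ClassMethod DecodePattern(pPattern As %String, pParams As %String) As %String
{
    // e.g. DecodePattern("XLXHN", "Names,SSN,Gender,BirthDate,Address")
    set tOut = ""
    for i=1:1:$length(pPattern) {
        set tDesc = $case($extract(pPattern,i), "H":"high", "X":"approximate", "L":"low", "N":"none", "M":"missing", :"?")
        set tOut = tOut_$listbuild($piece(pParams,",",i)_"="_tDesc)
    }
    quit $listtostring(tOut, "; ")
}
```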
Analyze patterns to understand:
- Which parameter combinations produce accurate links
- Which combinations produce false positives
- Whether certain parameters are over/under-weighted
Filtering and Reporting
Use Worklist filters to focus analysis:
By Link Weight Range: Examine pairs near threshold boundaries to validate cutoff decisions
By Secondary Reason: Review pairs affected by specific rules (e.g., "Gender" rule, "Roommates-A" rule)
By Comment: Filter for specific automated or manual annotations
By Facility/Assigning Authority: Identify data quality issues by source
By Date Range: Track performance over time or for specific data loads
Effectiveness Evaluation Process
After each tuning change (weights, thresholds, normalization, rules):
1. Rebuild linkage data with new configuration
2. Snapshot Worklist volume before and after
3. Sample Review category pairs (e.g., 100 pairs): Count true matches vs. non-matches
4. Sample Validate category pairs: Verify automatic links are accurate
5. Calculate accuracy metrics: % correct in each category
6. Identify patterns in misclassifications
7. Document findings and determine next tuning actions
Success Criteria
Effective tuning achieves:
- High accuracy: >95% of automatic links are true matches
- Manageable volume: Worklist size matches organizational review capacity
- Clear decisions: Pairs near thresholds have ambiguous data justifying manual review
- Stable performance: Metrics remain consistent as new data arrives
When these criteria are met, the linkage definition is well-tuned for production use.
8. Recommending Data Quality Corrective Actions
Key Points
- Use Data Quality tool dashboards to identify systematic issues
- Recommend source system corrections for upstream data problems
- Implement null value lists for known dummy values
- Apply custom normalization for consistent formatting issues
- Configure custom data quality rules for organization-specific validation
- Engage data stewards at source facilities for long-term improvements
- Document data quality issues and remediation plans
- Balance EMPI configuration vs. source system fixes
Detailed Notes
Data quality issues significantly impact EMPI linkage accuracy. Technical specialists must identify these issues and recommend appropriate corrective actions at both source and EMPI levels.
Data Quality Tool Analysis
The Data Quality Manager provides three dashboards:
InterSystems EMPI Data Quality Summary: Shows valid, invalid, and blank value counts for each property in original patient data
InterSystems EMPI Data Quality Summary - Normalized: Shows data quality after normalization processing
Trend Dashboard: Displays long-term patterns in data quality metrics
Use these dashboards to identify:
- Properties with high percentages of blank values by facility
- Facilities using dummy/default values (e.g., all patients with same phone number)
- Invalid data formats (e.g., dates outside reasonable ranges)
- Inconsistent data entry practices across sources
Categorizing Data Quality Issues
Missing Data: Properties consistently blank from specific facilities
- Recommendation: Engage facility to improve data collection
- EMPI Action: Reduce weight of unreliable parameters for that source
Dummy Values: Placeholder values used when data unavailable (000-00-0000, 999-999-9999)
- Recommendation: Source system should use NULL instead of dummy values
- EMPI Action: Add dummy values to "List of values to be treated as null"
Format Inconsistencies: Same data represented differently (Jr., Jr, Junior)
- Recommendation: Source system standardization
- EMPI Action: Custom normalization function to standardize formats
Invalid Values: Data outside valid ranges (birth dates in future, impossible SSNs)
- Recommendation: Source system validation rules
- EMPI Action: Custom normalization to treat invalid values as null
Transposition Errors: Fields frequently swapped (first/last name, city/state)
- Recommendation: Source system UI/validation improvements
- EMPI Action: Configure "List of Possible Transpositions" parameter
Corrective Action Strategies
Immediate EMPI Configuration:
1. Add null value lists for identified dummy values
2. Implement custom normalization for format standardization
3. Adjust parameter weights for unreliable data sources
4. Create custom data quality rules for validation
Source System Engagement:
1. Document specific data quality issues with evidence
2. Quantify impact on EMPI linkage accuracy
3. Provide recommendations for source system improvements
4. Establish ongoing data quality monitoring
Long-term Improvements:
1. Establish data governance policies across organization
2. Implement validation at point of data entry
3. Provide training to data entry staff
4. Schedule regular data quality audits
Custom Data Quality Rules
Create custom rules in Data Quality Manager to validate organization-specific requirements:
1. Navigate to Data Quality Manager
2. Create custom rule in the Validation Rule field
3. Define validation logic (e.g., SSN must be 9 digits, no repeating patterns)
4. Save changes
5. Select "Rebuild Cube Data" to apply the custom rule
6. View results in dashboards
Custom rules appear in data quality reports alongside built-in validations.
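As a hedged sketch of the validation logic such a rule might express (a hypothetical helper; the actual rule is entered in the Validation Rule field):
```objectscript
ClassMethod IsValidSSN(pSSN As %String) As %Boolean
{
    // Strip separators, then require exactly nine digits
    set tSSN = $translate(pSSN, "- ")
    if tSSN'?9N quit 0
    // Reject a single repeated digit such as 000000000 or 111111111
    if $translate(tSSN, $extract(tSSN))="" quit 0
    quit 1
}
```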
Balancing Configuration vs. Correction
Determine whether issues should be addressed through:
EMPI Configuration: When source system changes are impractical or delayed
Source System Fixes: For sustainable, long-term improvements
Both: Immediate EMPI workaround while pursuing source corrections
The goal is clean, standardized data at the source with EMPI configuration providing resilience against remaining variability.
9. Composite Record Trust Tiers and Aging Factors
Key Points
- Composite Record aggregates demographic data from all linked records
- Trust tiers rank data sources by reliability (e.g., registration > billing)
- Manual overrides allow selecting specific property group as default
- Aging factors determine how long manual overrides remain effective
- Five aging options: change in record, days, change OR days, change AND days, no expiration
- Composite override expires automatically when linkage group changes
- Configure aging in Settings > Composites tab
- Changes apply to all subsequent overrides (not retroactive)
Detailed Notes
The Composite Record is an EMPI feature that creates a single, best-available view of patient demographics by aggregating data from all linked records. Configuring trust tiers and aging factors ensures the Composite Record reflects the most reliable and current information.
Understanding Composite Records
When multiple records are linked to the same patient (same MPIID), EMPI creates a Composite Record that selects the best value for each property group (address, phone, name, etc.). This selection is based on:
1. Data source trust rankings: Which facilities provide most reliable data
2. Data completeness: Records with more complete information rank higher
3. Data recency: More recent updates may be preferred
4. Manual overrides: User-specified selections that override automatic ranking
Trust Tier Configuration
Trust tiers establish a hierarchy of data source reliability. For example:
Tier 1 (Highest Trust): Registration/ADT systems
Tier 2: Billing systems
Tier 3: Laboratory systems
Tier 4: External data sources
When selecting which address to use in the Composite Record, an address from a Tier 1 source outranks an address from a Tier 2 source, all else being equal.
Technical specialists should work with organizational stakeholders to:
1. Identify all data sources feeding EMPI
2. Assess reliability and completeness of each source
3. Establish consensus on trust rankings
4. Document trust tier rationale
5. Configure rankings in EMPI
Manual Overrides
Despite automated ranking, sometimes users identify the correct demographic value through manual review. For example, if a patient has multiple addresses but staff confirms the correct one through phone contact, they may manually select that property group to be used as default in the Composite Record.
This manual selection is called a "manual override." It supersedes automated ranking for that property group.
Aging Factor Configuration
Manual overrides should not persist indefinitely, as patient information changes over time. The Composites tab on the Settings page controls how long manual overrides last.
Access: Settings > Composites tab
Five aging options:
When there is a change in a record (default):
- Override expires when Composite Record is edited
- Ensures overrides are reconsidered when new information arrives
- Most conservative approach
After a number of days:
- Override expires after specified number of days (enter positive integer)
- Example: 90 days for addresses, assuming patients move
- Time-based expiration regardless of data changes
After a change in a record or a number of days:
- Override expires when record changes OR days elapse, whichever comes first
- Combines both triggers
- More aggressive expiration policy
After a change in a record and a number of days:
- Override expires only after record changes AND days elapse
- Both conditions must be met
- More conservative, allows longer override persistence
No additional expiration:
- Override persists until linkage group changes
- No time-based or edit-based expiration
- Use cautiously, as overrides may become stale
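The five options boil down to a small decision table. A hedged sketch of the expiration logic (the helper and mode names are hypothetical; the real behavior is configured in the UI, not coded):
```objectscript
ClassMethod OverrideExpired(pChanged As %Boolean, pDays As %Integer, pMode As %String, pLimit As %Integer) As %Boolean
{
    // pChanged: the Composite Record was edited; pDays: age of the override in days
    if pMode="change" quit pChanged
    if pMode="days" quit (pDays>=pLimit)
    if pMode="changeOrDays" quit (pChanged)||(pDays>=pLimit)
    if pMode="changeAndDays" quit (pChanged)&&(pDays>=pLimit)
    quit 0  // "no additional expiration": only a linkage group change clears it
}
```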
Important Considerations
Linkage Group Changes: All manual overrides expire automatically whenever the linkage group changes (records are merged, unlinked, or re-linked). This ensures Composite Record is rebuilt from scratch when fundamental linkage structure changes.
Retroactive Application: Changes to aging settings apply only to subsequent overrides created after the change. Existing overrides follow the aging policy in effect when they were created.
Property Group Granularity: Aging applies at the property group level (entire address, not individual city field). Override of one address doesn't affect overrides of other property groups.
Recommendations by Property Type
Addresses: 90-180 day expiration - patients move, addresses become outdated
Phone Numbers: 90-120 day expiration - phone numbers change, patients switch carriers
Names: Change-based expiration only - names rarely change except for marriage/divorce events
Clinical Identifiers: No expiration - SSN and other identifiers are permanent
Technical specialists should recommend aging policies based on:
- How frequently each data type changes in real-world patient populations
- Organizational capacity to review and update overrides
- Regulatory requirements for data currency
- Downstream system dependencies on Composite Record accuracy
10. Recommending Changes to Rules
Key Points
- Built-in rules handle common scenarios: Twins, Siblings, Roommates
- Custom rules created via onCreateClassifiedPair() method in linkage definition class
- Rules modify linkStatus, secondaryReason, and comment properties
- Link status values: 0=Strong Non-Link, 1=Non-Link to Review, 2=Link to Validate, 3=Link
- Common rule patterns: gender mismatch, DOB discrepancies, suspicious address sharing
- Rules filter Worklist by Secondary Reason for targeted review
- Do NOT modify other classified pair properties (causes unexpected behavior)
- Test rules thoroughly before production deployment
Detailed Notes
Rules provide fine-grained control over linkage decisions by overriding weight-based classifications in specific scenarios. Technical specialists must understand when to recommend rules and how to implement them safely.
Understanding Rules
After EMPI calculates link weight and applies thresholds, rules provide a final opportunity to adjust classification. Rules examine the specific pattern of agreement and disagreement across parameters and modify link status based on organizational business logic.
Built-in Rules
EMPI includes several built-in rules for common scenarios:
Twins Rule: Two records from same facility, same birth date, same last name, different first names, link weight above threshold → Downgrade to potential link (manual review)
Siblings Rule: Similar to Twins but with different birth dates → Non-link
Roommates Rules (A, B, C variants): Records that match on address but disagree on names/identifiers → Downgrade to non-link or potential link for review
These built-in rules prevent common false positive scenarios.
When to Recommend Custom Rules
Consider custom rules when:
1. Specific data quality patterns cause systematic misclassifications 2. Organizational policies require manual review of certain scenarios 3. Regulatory requirements mandate human verification for specific cases 4. Built-in rules don't cover observed false positive/negative patterns
Custom Rule Implementation
Custom rules are implemented by adding an onCreateClassifiedPair() method to the linkage definition class. This method is called immediately before the record pair is saved to the database.
Method Signature:
```objectscript
ClassMethod onCreateClassifiedPair(pClassifiedPair As %MPRL.Linkage.Classified, isModified As %Boolean) As %Status
```
Editable Properties (ONLY these should be modified):
linkStatus - Numerical value:
- 0 = Strong Non-Link (below Review threshold)
- 1 = Non-Link to Review (Potential Link)
- 2 = Link to Validate
- 3 = Link
secondaryReason - String identifying which rule was applied (used for Worklist filtering)
comment - Comment associated with action (displayed in Worklist)
Critical Warning: Do NOT edit any other properties of the classified pair. Modifying other properties may cause unexpected behavior and database corruption.
Example Custom Rules
Gender Mismatch Rule:
```objectscript
ClassMethod onCreateClassifiedPair(pClassifiedPair As %MPRL.Linkage.Classified, isModified As %Boolean) As %Status
{
    // Set appropriate variables (tRecNormalizedA, tRecNormalizedB)
    if ((tRecNormalizedA.stdGender'="")&&(tRecNormalizedB.stdGender'=""))&&($zcvt(tRecNormalizedA.stdGender,"u")'=$zcvt(tRecNormalizedB.stdGender,"u")) {
        set pClassifiedPair.linkStatus = 1
        set pClassifiedPair.secondaryReason = "Gender"
        set pClassifiedPair.comment = "AutoNonLink rule: Genders are different"
    }
    quit $$$OK
}
```
This rule downgrades to potential link any pair where genders are both present and different.
Birth Date Component Mismatch Rule: If at least one portion of birth date (day, month, year) doesn't match → Downgrade to potential link
SSN/Name/Address Triple Disagreement Rule: If all three critical parameters disagree → Force to non-link regardless of link weight from other parameters
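As a hedged sketch of that last rule (tRecNormalizedA/B follow the gender example; property names other than stdSSN are hypothetical, and production code should also skip blank values):
```objectscript
// Force to strong non-link when SSN, last name, and address all disagree
if (tRecNormalizedA.stdSSN'=tRecNormalizedB.stdSSN)&&(tRecNormalizedA.stdLastName'=tRecNormalizedB.stdLastName)&&(tRecNormalizedA.stdAddress'=tRecNormalizedB.stdAddress) {
    set pClassifiedPair.linkStatus = 0
    set pClassifiedPair.secondaryReason = "TripleDisagree"
    set pClassifiedPair.comment = "AutoNonLink rule: SSN, name, and address all differ"
}
```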
Rule Development Best Practices
1. Analyze Worklist First: Identify specific patterns causing misclassifications before writing rules
2. Start Conservative: Begin with rules that downgrade to "Non-Link to Review" rather than forcing final decisions
3. Use Descriptive secondaryReason: Make it easy to filter and analyze rule impact in Worklist
4. Document Thoroughly: Comment code with business justification for each rule
5. Test on Sample Data: Evaluate rule impact before applying to full dataset
6. Monitor Ongoing: Track Worklist volume and accuracy after rule deployment
Rule Grouping
Individual rules can be grouped together in a single onCreateClassifiedPair() method:
```objectscript
ClassMethod onCreateClassifiedPair(pClassifiedPair As %MPRL.Linkage.Classified, isModified As %Boolean) As %Status
{
    // Gender rule
    if (gender mismatch logic) {
        set pClassifiedPair.linkStatus = 1
        set pClassifiedPair.secondaryReason = "Gender"
    }
    // DOB rule
    if (birthdate discrepancy logic) {
        set pClassifiedPair.linkStatus = 1
        set pClassifiedPair.secondaryReason = "BirthDate"
    }
    // Address sharing rule
    if (suspicious address pattern) {
        set pClassifiedPair.linkStatus = 1
        set pClassifiedPair.secondaryReason = "Roommates"
    }
    quit $$$OK
}
```
Recommended Rules by Scenario
High-Stakes Environments (e.g., oncology, transplant):
- Force manual review for any gender mismatch
- Force manual review for SSN disagreement when names match
- Require validation for all automatic links above certain weight
High-Volume Environments (e.g., large hospital systems):
- Aggressive automatic linking for high-weight pairs
- Rules to prevent only most obvious false positives
- Minimize manual review requirements
Multi-Facility Environments:
- Domain conflict handling rules
- Facility-specific trust rankings
- Rules accounting for data quality variations by source
Technical specialists should recommend rules aligned with organizational risk tolerance, data quality reality, and operational capacity for manual review.
11. Link-Key Indices and Preliminary Matching
Key Points
- Link-key index identifies feasible linkage candidates before detailed comparison
- Preliminary comparison based on hash of several fields
- Eliminates obviously unrelated records from intensive parameter comparison
- Reduces time required to build linkage data
- Default link keys provided (accept during initial setup)
- Link Keys tab in Definition Designer for customization
- Non-weighted parameters can still be used in link keys
- Advanced tuning may adjust link-key composition for performance optimization
Detailed Notes
Link-key indices are performance optimization features that reduce the computational cost of linkage by pre-filtering record pairs before detailed parameter comparison.
Purpose of Link-Key Index
Without link-key indices, EMPI would compare every record against every other record using all weighted parameters—an O(n²) operation that becomes computationally prohibitive as datasets grow.
Link-key indices use a preliminary comparison based on a hash of several fields to identify all record pairs that are similar in some way. Only pairs identified by the link-key index undergo detailed parameter comparison and weight calculation.
How Link-Key Indices Work
1. Link Key Creation: Combine several parameters (e.g., first 3 letters of last name + birth year) into a hash value
2. Index Construction: Build index of hash values for all records
3. Candidate Identification: Records sharing any link-key hash value become candidates for detailed comparison
4. Parameter Comparison: Only candidate pairs undergo full weighted parameter evaluation
This approach eliminates obviously unrelated records (different birth years, completely different names) from intensive comparison, dramatically reducing build time.
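A hedged sketch of the kind of key described in step 1 (a hypothetical helper; real link keys are configured in the Link Keys tab, not hand-coded):
```objectscript
ClassMethod BuildLinkKey(pLastName As %String, pBirthDate As %String) As %String
{
    // First 3 letters of the normalized surname + birth year (assumes YYYY-MM-DD input)
    set tName = $zcvt($translate(pLastName, " ,.-'"), "l")
    quit $extract(tName, 1, 3)_"|"_$extract(pBirthDate, 1, 4)
}
```
Records sharing the same key value ("smi|1980", say) become candidates for full comparison; all other pairs are skipped.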
Default Link Keys
EMPI provides default link keys optimized for patient matching:
- Name-based keys (phonetic and substring variations)
- Birth date keys (exact and partial matches)
- Identifier keys (SSN, MRN)
- Address keys (geographic proximity)
During initial setup, accept these defaults. They are designed to cast a wide net while maintaining performance.
Link Keys Tab
Access link-key configuration in Definition Designer > Link Keys tab.
For each link key, you can specify:
- Which parameters contribute to the hash
- Transformation functions applied before hashing (e.g., soundex, substring)
- Combinations of parameters to create multiple keys
Parameters in Link Keys vs. Weighted Parameters
A parameter can be:
- Weighted: Contributes to link weight calculation
- In link key: Used to identify candidate pairs
- Both: Most common scenario
- Neither: Displayed but not used in linkage logic
Non-weighted parameters can still be useful in link keys if they help identify candidates efficiently.
Advanced Link-Key Tuning
For performance optimization in large datasets:
Too Few Candidates: If valid matches are being missed, link keys may be too restrictive
- Solution: Add more link keys or loosen transformation functions
Too Many Candidates: If build time is excessive, link keys may be too permissive
- Solution: Make link keys more selective (requires careful validation that true matches aren't excluded)
This is advanced tuning typically performed after initial implementation stabilizes.
12. Exam Preparation Tips
Key Points
- Understand complete workflow: data collection → assessment → normalization → tuning
- Master Definition Designer navigation and all tabs
- Know three thresholds and their ranges (Review < Autolink < Validate)
- Understand MLE process: when to run, how to interpret, how to apply
- Differentiate agreement weights (positive) vs. disagreement weights (negative)
- Know when to use normalization functions vs. null value lists
- Understand Composite Record aging options and appropriate use cases
- Know onCreateClassifiedPair() method structure and editable properties
- Practice scenario-based problem solving for data quality issues
Detailed Notes
Key Concepts to Master
1. Data Collection and Assessment (KSAs 1-2)
- What demographic data to collect from sites
- How to use Data Quality tool for assessment
- Identifying patterns in valid/invalid/blank values
- Facility-specific data quality issues
2. Normalization (KSA 3)
- Default normalization functions by linkage type
- Custom normalization function structure
- Variables available: %object, %property, %parameters
- When normalization overrides null value list
- Common normalization patterns (gender, dates, names)
3. Definition Designer (KSA 4)
- Navigate Settings, Parameters, Link Keys tabs
- Modify parameter values in UI
- Understand Locale setting impact
- Enable Domain Conflict behavior
4. Thresholds (KSA 5)
- Review threshold: Lowest weight for Worklist inclusion
- Autolink threshold: Cutoff for automatic linking
- Validate threshold: Highest weight for Worklist
- How threshold positioning affects Worklist categories
- Tuning philosophy: early vs. mature implementation
5. MLE Process (KSA 6)
- Purpose of Maximum Likelihood Estimation
- Dataset size requirements (100,000+ records)
- How to run MLE Calibration
- Interpreting monitor page (aWeight, dWeight columns)
- When to stop calibration (weights stabilize)
- Applying results to update weights
- Iterative MLE approach
6. Evaluating Effectiveness (KSA 7)
- Worklist analysis techniques
- Metrics to track (volume, accuracy, false positive/negative rates)
- Agreement pattern interpretation
- Filtering by secondary reason, link weight, facility
- Success criteria for well-tuned definition
7. Data Quality Corrective Actions (KSA 8)
- Categorizing issues: missing, dummy, format, invalid, transposition
- EMPI configuration vs. source system fixes
- Null value lists for dummy data
- Custom normalization for format issues
- Engaging data stewards for long-term improvements
8. Composite Records (KSA 9)
- Purpose of Composite Record
- Trust tier ranking concept
- Manual override scenarios
- Five aging options and appropriate use cases
- When overrides expire (linkage group changes)
- Retroactive vs. prospective application
9. Rules (KSA 10)
- Built-in rules: Twins, Siblings, Roommates
- onCreateClassifiedPair() method structure
- Editable properties: linkStatus, secondaryReason, comment
- Link status values (0-3)
- Gender mismatch example
- When to recommend custom rules
- Testing and monitoring rule impact
Common Exam Scenarios
Scenario 1: Given data quality issues (high percentage of dummy phone numbers from Facility A), what corrective actions would you recommend?
- Answer: Add dummy values to null value list for Telecoms parameter, reduce Telecoms agreement weight for that facility, engage Facility A to improve data collection
Scenario 2: After running MLE calibration, you see SSN aWeight is 15 and dWeight is -12. What does this mean and what should you do next?
- Answer: SSN is highly discriminating in your dataset. Agreement strongly indicates match, disagreement strongly indicates non-match. Apply results, rebuild linkage data with Weights option, review Worklist to validate improved accuracy.
Scenario 3: Worklist shows many gender mismatch pairs being automatically linked. What actions would you take?
- Answer: Implement custom onCreateClassifiedPair() rule to downgrade pairs with gender disagreement to potential link (linkStatus = 1, secondaryReason = "Gender"), rebuild linkage data, review filtered Worklist by secondaryReason.
Scenario 4: Organization wants manual overrides for addresses to persist for 6 months. How would you configure this?
- Answer: Navigate to Settings > Composites tab, select "After a number of days" option, enter 180 in days field, apply changes. Note this applies only to future overrides.
Scenario 5: Which threshold values would produce the largest Worklist volume: Review=5/Autolink=15/Validate=25 or Review=15/Autolink=25/Validate=30?
- Answer: First option (5/15/25) produces larger Worklist because lower thresholds capture more pairs in Review and Validate categories.
Study Approach
1. Hands-on Practice: Use Definition Designer in test environment
2. Understand Relationships: How weights, thresholds, and rules interact
3. Scenario-Based Thinking: Practice diagnosing issues and recommending solutions
4. Terminology Mastery: Know precise definitions of technical terms
5. Process Flows: Understand sequences (MLE → Rebuild → Review → Adjust → Iterate)
6. Tool Navigation: Be able to quickly locate specific settings in UI
7. Code Examples: Study normalization and rule examples, understand variables
8. Data Quality: Connect quality issues to appropriate remediation strategies
Critical Distinctions
- Normalization (standardizing format) vs. Agreement Function (comparing normalized values)
- Agreement Weight (positive, added) vs. Disagreement Weight (negative, subtracted)
- Review Threshold (minimum to consider) vs. Autolink Threshold (minimum to link automatically)
- MLE Calibration (calculates weights) vs. Threshold Adjustment (manual tuning)
- Trust Tiers (source reliability) vs. Aging Factors (override duration)
- Built-in Rules (predefined) vs. Custom Rules (onCreateClassifiedPair method)
- Link Keys (identify candidates) vs. Parameters (detailed comparison)
Final Preparation
Review all 10 KSAs systematically:
1. Collect information on demographic data
2. Appraise demographic data quality
3. Select normalization approaches
4. Modify parameters in Definition Designer
5. Review and revise thresholds
6. Review and revise weights using MLE
7. Evaluate tuning effectiveness
8. Recommend data quality corrective actions
9. Recommend Composite Record trust tiers and aging
10. Recommend changes to rules
Master these skills through practical application, scenario analysis, and thorough understanding of underlying concepts. The exam tests not just knowledge but ability to apply these concepts to real-world EMPI implementation challenges.