T3.2: Explains Fundamental EMPI Concepts

Knowledge Review - InterSystems Enterprise Master Patient Index Technical Specialist

1. Overview of EMPI Matching Fundamentals

Key Points

  • EMPI links patient records from multiple facilities using probabilistic matching
  • Link weights are calculated by summing agreement/disagreement weights
  • Three thresholds (Review, Autolink, Validate) determine linkage outcomes
  • Deterministic identifiers can override probabilistic scores
  • Normalization standardizes data before comparison
  • Rules provide fine-tuning for specific matching scenarios

Detailed Notes

InterSystems EMPI (Enterprise Master Patient Index), formerly HealthShare Patient Index (HSPI), employs sophisticated matching algorithms to determine when patient records from different healthcare facilities represent the same individual. Understanding these fundamental concepts is essential for configuring, tuning, and maintaining an EMPI system.

The core challenge EMPI solves is identity resolution: Given two patient records with potentially different spellings of names, variations in addresses, typos in demographic data, and incomplete information, how can the system determine with sufficient confidence whether they represent the same person or different people?

EMPI addresses this challenge through several complementary mechanisms:

1. Probabilistic Matching: Calculates likelihood scores based on demographic similarity
2. Deterministic Matching: Uses exact identifier matches to force linkage decisions
3. Normalization: Standardizes data formats to improve comparison accuracy
4. Agreement Functions: Measure similarity between field values
5. Weighted Parameters: Assign importance to different demographic fields
6. Thresholds: Define decision boundaries for linking, review, and validation
7. Rules: Provide business logic to handle special cases

Together, these mechanisms create a flexible, tunable system that can adapt to the unique characteristics of different healthcare organizations' data.

---

2. Linkage Categories and Their Impact on Data Display

Key Points

  • Auto-Link: Automatically linked, above Autolink threshold
  • Potential Duplicate: Between Review and Autolink thresholds, requires review
  • Potential Overlay: Incorrectly linked records (different individuals, same MPIID)
  • Potential Overlap: Linked records with different MPIIDs
  • Open-Chaining: Complex linkage chains requiring investigation
  • Validate: Linked but low confidence, between Autolink and Validate thresholds

Detailed Notes

InterSystems EMPI classifies record pairs into categories based on their linkage status, link weight, and the presence of conflicts. Understanding these categories is critical because they determine how patient data appears in clinical systems and which records require human review.

The Six Primary Linkage Categories

3. Probabilistic Matching Fundamentals

Key Points

  • Probabilistic matching uses demographic similarity to calculate link weights
  • Agreement weights (positive) added when fields match
  • Disagreement weights (negative) added when fields don't match
  • Total link weight compared against three thresholds
  • Based on Maximum Likelihood Estimation (MLE) statistical methods
  • Handles incomplete data, typos, and variations better than exact matching

Detailed Notes

Probabilistic matching is the core algorithm that powers InterSystems EMPI. Unlike deterministic matching (which requires exact matches on specific identifiers), probabilistic matching assesses the overall similarity of two patient records by comparing multiple demographic fields and calculating a composite likelihood score.

The Probabilistic Matching Process

When EMPI compares two patient records, it follows this process:

Step 1: Normalization

  • Raw data is normalized (standardized) according to field type
  • Example: "SMITH" and "Smith" are normalized to "smith"
  • Example: "123-45-6789" and "123456789" (SSN) are normalized to "123456789"

Step 2: Field-by-Field Comparison

  • Each weighted parameter is compared using its agreement function
  • The agreement function returns a weight indicating similarity:
      • Agreement weight (positive value) when fields match or are similar
      • Partial agreement weight when fields are somewhat similar
      • Disagreement weight (negative value) when fields don't match
      • Zero weight when one or both values are null/missing

Step 3: Weight Summation

  • Individual weights from all parameters are summed to produce a total link weight for the record pair

Step 4: Threshold Comparison

  • The link weight is compared against three thresholds to determine the linkage outcome:
      • Below Review threshold: Strong non-link (records are not related)
      • Between Review and Autolink: Potential link (requires human review)
      • Between Autolink and Validate: Auto-link (but appears on worklist for validation)
      • Above Validate threshold: Auto-link (high confidence, no validation needed)

Example: Calculating a Link Weight

Consider two patient records being compared with the following linkage definition parameters and weights:

Record A: John Michael Smith, DOB 5/15/1980, SSN 123-45-6789, Gender M, Address: 123 Main St, Boston MA

Record B: John M. Smith, DOB 5/15/1980, SSN 123-45-6789, Gender M, Address: 456 Oak Ave, Boston MA

| Parameter | Record A Value | Record B Value | Match Quality | Weight Returned | Configured Weight |
|-----------|----------------|----------------|---------------|-----------------|-------------------|
| Last Name | Smith | Smith | Exact match | +8.0 | Agreement Weight: 8.0 |
| First Name | John | John | Exact match | +7.0 | Agreement Weight: 7.0 |
| Middle Name | Michael | M | Partial match (initial) | +2.0 | Agreement Weight: 5.0 (partial) |
| Date of Birth | 5/15/1980 | 5/15/1980 | Exact match | +10.0 | Agreement Weight: 10.0 |
| SSN | 123-45-6789 | 123-45-6789 | Exact match | +12.0 | Agreement Weight: 12.0 |
| Gender | M | M | Exact match | +2.0 | Agreement Weight: 2.0 |
| Address | 123 Main St, Boston MA | 456 Oak Ave, Boston MA | City match, street disagrees | -1.0 | Disagreement Weight: -3.0 (partial match on city) |

Total Link Weight = 8.0 + 7.0 + 2.0 + 10.0 + 12.0 + 2.0 + (-1.0) = 40.0

Threshold Comparison:

  • Review Threshold: 14
  • Autolink Threshold: 24
  • Validate Threshold: 34

Since 40.0 > 34 (Validate threshold), the records are automatically linked and do NOT appear on the worklist. The system is highly confident these represent the same person.
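The arithmetic above can be reproduced in a few lines. The following Python sketch is illustrative only: the dictionary layout and the `link_weight` function are assumptions made for this example, not part of the EMPI API. The weights and thresholds are the ones from the worked example.

```python
# Weights returned by each parameter's agreement function in the worked example.
returned_weights = {
    "Last Name": 8.0,
    "First Name": 7.0,
    "Middle Name": 2.0,    # partial match on the initial
    "Date of Birth": 10.0,
    "SSN": 12.0,
    "Gender": 2.0,
    "Address": -1.0,       # city agrees, street disagrees
}

REVIEW, AUTOLINK, VALIDATE = 14, 24, 34  # example thresholds from the text


def link_weight(weights):
    """Sum the per-parameter weights into a single link weight."""
    return sum(weights.values())


total = link_weight(returned_weights)
print(total)             # 40.0
print(total > VALIDATE)  # True: auto-linked, not on the worklist
```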

Statistical Foundation: Maximum Likelihood Estimation

InterSystems EMPI's probabilistic matching is based on statistical record-linkage methods developed in the 1960s-1970s, with weights estimated using Maximum Likelihood Estimation (MLE). While EMPI administrators don't need to understand the complex mathematics, it's useful to know that agreement and disagreement weights are derived from:

  • m-probability: The probability that a field matches GIVEN that the records represent the same person
  • u-probability: The probability that a field matches GIVEN that the records represent different people

The log-likelihood ratio of these probabilities produces the agreement weight:

  • High agreement weight: The field is a strong predictor of a match (e.g., rare last names)
  • Low agreement weight: The field is a weak predictor (e.g., common last names)

Similarly, disagreement weights reflect how unlikely a mismatch is for true matches.
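In the standard textbook formulation of probabilistic record linkage, these weights are written as log ratios of the m- and u-probabilities. The form below is that general formulation, offered as a sketch; the log base and the exact estimation procedure EMPI uses are not stated in these notes and should be treated as assumptions.

```latex
w_{\text{agree}} = \log_2\!\left(\frac{m}{u}\right),
\qquad
w_{\text{disagree}} = \log_2\!\left(\frac{1-m}{1-u}\right)
```

For example, with illustrative values m = 0.95 and u = 0.01, the agreement weight is log2(95) ≈ 6.57 and the disagreement weight is log2(0.05/0.99) ≈ -4.31. A field that rarely agrees by chance (small u) earns a large positive weight when it agrees.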

Advantages of Probabilistic Matching

Compared to deterministic (exact match) approaches, probabilistic matching offers several advantages:

1. Handles data quality issues: Typos, abbreviations, nicknames, and formatting variations don't prevent matching
2. Compensates for missing data: If SSN is missing, strong matches on name and DOB can still produce high link weights
3. Tunable: Weights and thresholds can be adjusted based on the characteristics of your data
4. Transparent: Link weights and agreement patterns show why records were linked or not linked
5. Handles uncertainty: Records near thresholds can be flagged for human review rather than forcing a binary decision

---

4. Deterministic Matching and Identifiers

Key Points

  • Deterministic matching uses exact identifier matches (SSN, MRN, etc.)
  • Deterministic identifiers override probabilistic scores
  • Configured via DeterministicIdentifierTypes parameter
  • If two records have same deterministic ID, they MUST be linked
  • Creates "Deterministic" worklist category when conflicts occur
  • Higher in precedence than threshold-based matching

Detailed Notes

While probabilistic matching handles the majority of linkage decisions, deterministic matching provides a mechanism to force specific linkage outcomes based on trusted identifiers. This is particularly useful when you have high-quality, unique identifiers from reliable data sources.

What Are Deterministic Identifiers?

A deterministic identifier is a patient identifier type (such as Social Security Number, Medical Record Number, or a corporate patient ID) that you configure EMPI to treat as definitive proof of identity. When two records have the same value for a deterministic identifier, EMPI will:

  • Automatically link them (assign the same MPIID), OR
  • Flag them for review if they conflict with a higher-precedence decision (manual link/unlink or rule)

Configuring Deterministic Identifiers

Deterministic identifiers are configured in the linkage definition class using the `DeterministicIdentifierTypes` parameter. This parameter contains a comma-separated list of identifier types.

Example:

```objectscript
Parameter DeterministicIdentifierTypes = "SSN,MRN,CORPID";
```

This configuration tells EMPI:

  • If two records have the same SSN, link them
  • If two records have the same MRN from the same facility, link them
  • If two records have the same CORPID (corporate identifier from a trusted source), link them

Where Identifier Types Are Defined

Before an identifier can be used as a deterministic identifier in EMPI, it must be defined in the Registry or standalone EMPI namespace:

Navigate to: Management Portal > [Your Namespace] > Other Management > Identifier Types

The Identifier Types page shows:

  • Name: The identifier name (e.g., "Social Security Number", "Medical Record Number")
  • Type: The uppercase code (e.g., "SSN", "MR")
  • Exact Match: Whether this identifier requires exact matching
  • Additional Patient Identifier: Must be checked for use in EMPI
  • Active: Must be active

Important: An identifier does NOT need to have "Exact Match" checked in the Registry to be used as a deterministic identifier in EMPI. The deterministic behavior comes from the EMPI configuration, not the Registry setting.

How Deterministic Matching Works

When EMPI processes a record pair:

1. Check for deterministic identifier matches: If both records have a value for a deterministic identifier type, compare them
2. Exact match found: If the deterministic identifier values are identical:

  • Check for conflicts with higher-precedence decisions (manual or rule-based)
  • If no conflicts, automatically link the records
  • Set Link Reason to "Deterministic"

3. No match or missing values: Proceed with probabilistic matching
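The decision flow above can be sketched as follows. This is an illustrative Python outline, not EMPI code: the record layout (a dict mapping identifier type to value), the `blocked_by_higher_precedence` flag, and both function names are hypothetical. Facility scoping of MRNs is deliberately left out to keep the sketch short.

```python
DETERMINISTIC_TYPES = ["SSN", "MRN", "CORPID"]  # mirrors the example parameter value


def deterministic_match(ids_a, ids_b, types=DETERMINISTIC_TYPES):
    """Return the first identifier type on which both records carry the same value."""
    for id_type in types:
        a, b = ids_a.get(id_type), ids_b.get(id_type)
        if a and b and a == b:
            return id_type
    return None


def decide(ids_a, ids_b, blocked_by_higher_precedence=False):
    """Apply the deterministic check first, then fall back to probabilistic matching."""
    matched_type = deterministic_match(ids_a, ids_b)
    if matched_type is None:
        return "probabilistic"          # step 3: no match or missing values
    if blocked_by_higher_precedence:
        return "worklist"               # conflict with a manual or rule-based decision
    return "link (Deterministic)"       # step 2: link and record the reason


print(decide({"SSN": "123456789"}, {"SSN": "123456789"}))  # link (Deterministic)
print(decide({"SSN": "123456789"}, {"MRN": "A100"}))       # probabilistic
```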

The Deterministic Worklist Category

According to the EMPI certification exam (sample question Q11, answer D), a Worklist entry in the Deterministic category occurs when there are conflicting deterministic IDs.

This happens when:

  • Two records have the same SSN (which should force linking)
  • BUT they also have different MRNs from the same facility (which indicates they're different people)
  • OR a user has manually unlinked them (manual decisions take precedence over deterministic)
  • OR a rule forces them to be unlinked

The Deterministic category alerts administrators to investigate these conflicts, which often indicate:

  • Data quality issues at the source (incorrect SSN entry, duplicate SSN assignment)
  • Identity fraud or data contamination
  • MRN re-use at facilities
  • System errors during data entry

Linkage Precedence Hierarchy

When multiple linkage reasons apply to a record pair, EMPI uses a precedence hierarchy to determine which reason is displayed and which takes effect:

1. Manual (highest precedence)
2. Rule
3. Deterministic
4. Domain Conflict
5. Transitivity
6. Threshold (lowest precedence)

Example: If a deterministic identifier says two records should be linked (same SSN), but a user has manually unlinked them (Manual), the manual decision prevails. The records will NOT be linked, and they will appear on the worklist in the Deterministic category with a warning.
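The hierarchy amounts to picking the highest-ranked reason among those that apply to a pair. The Python sketch below illustrates that idea only; the `PRECEDENCE` list restates the order given above, and `prevailing_reason` is a hypothetical helper, not an EMPI function.

```python
# Linkage reasons ordered from highest to lowest precedence, as listed above.
PRECEDENCE = ["Manual", "Rule", "Deterministic", "Domain Conflict", "Transitivity", "Threshold"]


def prevailing_reason(applicable):
    """Return the highest-precedence reason among those that apply to a record pair."""
    ranked = [reason for reason in PRECEDENCE if reason in applicable]
    if not ranked:
        raise ValueError("no known linkage reason supplied")
    return ranked[0]


print(prevailing_reason({"Deterministic", "Manual"}))    # Manual
print(prevailing_reason({"Threshold", "Transitivity"}))  # Transitivity
```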

Use Cases for Deterministic Identifiers

High-Quality Corporate Identifiers: If your organization has a corporate MPI (Master Patient Index) that's been carefully maintained and you trust its identifiers, you can configure those corporate IDs as deterministic identifiers. Any records with matching corporate IDs will automatically link.

Social Security Numbers (with caution): While SSN can be a strong identifier, it should be used cautiously as a deterministic identifier because:

  • SSNs are sometimes entered incorrectly
  • Patients might provide someone else's SSN
  • Facilities might reuse SSN fields for other purposes
  • Not all patients have SSNs

Facility-Specific MRNs: MRNs within a single facility are typically unique and can be deterministic identifiers. However, MRNs should not be used deterministically across facilities, as different facilities use different numbering schemes.

---

5. Normalization: Standardizing Data for Comparison

Key Points

  • Normalization converts data to standardized format before comparison
  • Examples: case conversion, whitespace removal, hyphen removal
  • Each linkage type has a default normalization function
  • Custom normalization functions can be defined
  • Null values (unknown data) can be excluded from matching
  • Normalization improves match accuracy despite format variations

Detailed Notes

Normalization is the preprocessing step that standardizes patient demographic data before it's compared during the matching process. Without normalization, minor formatting differences would cause records to fail to match even when they represent the same individual.

Why Normalization Is Necessary

Patient data arrives at EMPI from diverse sources with different data entry conventions, system formats, and human input variations. Examples of these variations include:

Name Variations:

  • "SMITH" vs. "Smith" vs. "smith"
  • "Mary Ann" vs. "Mary-Ann" vs. "MaryAnn"
  • "O'Brien" vs. "OBrien" vs. "O BRIEN"

Date Format Variations:

  • "5/15/1980" vs. "05/15/1980" vs. "1980-05-15"
  • "May 15, 1980" vs. "15-May-80"

SSN Format Variations:

  • "123-45-6789" vs. "123 45 6789" vs. "123456789"

Address Variations:

  • "123 Main Street" vs. "123 Main St" vs. "123 MAIN ST."
  • "Apartment 5B" vs. "Apt 5B" vs. "#5B"

Without normalization, these formatting differences would prevent matches. By converting all values to a standard format, normalization allows the agreement function to focus on actual semantic differences rather than formatting artifacts.

How Normalization Works

Each linkage type in EMPI has an associated default normalization function that defines how to standardize that particular type of data. The normalization function is called automatically during the matching process.

Process Flow:

1. Raw data arrives from source system
2. Normalization function is applied
3. Normalized value is stored in the normalized property
4. Agreement function compares the normalized values (not the original values)

Default Normalization Functions by Linkage Type

Different field types require different normalization approaches:

Names (First Name, Last Name, Middle Name):

  • Convert to lowercase
  • Remove extra whitespace
  • Remove punctuation (hyphens, apostrophes, periods)
  • Handle prefixes and suffixes (Jr., Sr., III, etc.)
  • Example: "O'BRIEN " → "obrien"

Social Security Number:

  • Remove hyphens, spaces, other separators
  • Validate format (9 digits)
  • Example: "123-45-6789" → "123456789"

Date of Birth:

  • Convert to standard format (YYYY-MM-DD)
  • Validate as real date
  • Example: "5/15/1980" → "1980-05-15"

Gender:

  • Standardize codes (M/F/U)
  • Handle variations (Male→M, Female→F, Unknown→U)
  • Example: "Male" → "M"

Address Components:

  • Standardize street suffixes (Street→ST, Avenue→AVE)
  • Convert to uppercase or lowercase consistently
  • Remove punctuation and extra whitespace
  • Example: "123 Main Street, Apt. 5B" → "123 MAIN ST APT 5B"

Phone Numbers:

  • Remove formatting (parentheses, hyphens, spaces)
  • Extract digits only
  • Example: "(617) 555-1234" → "6175551234"
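To make the transformations above concrete, here is a small Python sketch that reproduces several of the listed examples. It is an illustration of the general technique, not the EMPI default normalization functions: the function names, the regular expressions, and the assumed input date format (month/day/year) are all choices made for this example.

```python
import re
from datetime import datetime


def normalize_name(value):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(cleaned.split())


def normalize_digits(value):
    """Keep digits only (used here for SSNs and phone numbers)."""
    return re.sub(r"\D", "", value)


def normalize_dob(value, input_format="%m/%d/%Y"):
    """Parse a date in the given input format and emit YYYY-MM-DD."""
    return datetime.strptime(value, input_format).strftime("%Y-%m-%d")


print(normalize_name("O'BRIEN "))          # obrien
print(normalize_digits("123-45-6789"))     # 123456789
print(normalize_digits("(617) 555-1234"))  # 6175551234
print(normalize_dob("5/15/1980"))          # 1980-05-15
```

Because `normalize_dob` raises `ValueError` on an impossible date, it also covers the "validate as real date" step in a basic way.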

Custom Normalization Functions

While default normalization functions handle most scenarios, you can override them with custom normalization functions to address specific data quality issues or business requirements.

Common Use Cases for Custom Normalization:

1. Treating specific values as null/missing:

  • Dates of birth before 1900 should be treated as unknown
  • Names like "BABY", "UNKNOWN", "TEST" should be treated as null
  • Gender values other than M or F normalized to U (Unknown)

2. Handling data source-specific formatting:

  • One facility always appends " JR" to last names; remove it during normalization
  • SSNs from a specific source use a different format

3. Business logic:

  • Normalize nicknames to formal names (Jim→James, Beth→Elizabeth)
  • Standardize apartment/unit formats

Example Custom Normalization Function (Gender):

```objectscript
set gender=##class(%MPRL.LinkageType.Gender).getNormalized(%object,%property,.%parameters)
if gender'="M",gender'="F" set gender="U"
quit gender
```

This custom function:

1. Calls the default gender normalization function
2. Checks if the result is anything other than M or F
3. If so, converts it to U (Unknown)
4. Returns the normalized gender value

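Example Custom Normalization Function (Date of Birth):

```objectscript
If +%property<1900 quit ""
quit ##class(%MPRL.LinkageType.DateStamp).getNormalized(%object,%property,.%parameters)
```

This custom function:

1. Checks if the year is before 1900
2. If so, returns an empty string (treating it as missing/null)
3. Otherwise, calls the default date normalization function

Note: The unary `+` operator converts only the leading numeric portion of `%property` to a number, so the year check behaves as described only when the incoming value begins with a four-digit year (for example, "1980-05-15").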

Null Values and Missing Data

A null value indicates that a data value is unknown or unavailable. EMPI allows you to define lists of values that should be treated as null for each parameter.

Configuration: In the Parameters tab of the Definition Designer, each parameter has a field labeled "List of values to be treated as null". You can enter a comma-separated list of values.

Example:

  • For Date of Birth: "01/01/1900,1/1/1900,00/00/0000"
  • For First Name: "BABY,UNKNOWN,TEST,PATIENT"
  • For Gender: "U,UNKNOWN,UNK"

Behavior: When a field value matches an entry in the null value list:

  • It is treated as missing/unknown
  • It does NOT contribute to the link weight (neither agreement nor disagreement)
  • Matches on null values are ignored during searches
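The effect of a null value list on the link weight can be sketched in Python as follows. This is an illustration, not EMPI behavior verbatim: the `NULL_VALUES` mapping, the case-insensitive lookup, and the exact-match comparison are assumptions made for the example, and the weights are the sample values used elsewhere in these notes.

```python
NULL_VALUES = {
    "First Name": {"BABY", "UNKNOWN", "TEST", "PATIENT"},
    "Gender": {"U", "UNKNOWN", "UNK"},
}


def effective_value(parameter, value):
    """Return None when the value is empty or on the parameter's null list."""
    if value is None or value.strip() == "":
        return None
    if value.strip().upper() in NULL_VALUES.get(parameter, set()):
        return None
    return value.strip()


def contribution(parameter, a, b, agreement, disagreement):
    """Zero when either side is null; otherwise an exact-match comparison."""
    a, b = effective_value(parameter, a), effective_value(parameter, b)
    if a is None or b is None:
        return 0.0
    return agreement if a.upper() == b.upper() else disagreement


print(contribution("First Name", "BABY", "BABY", 7.0, -1.5))  # 0.0
print(contribution("First Name", "John", "john", 7.0, -1.5))  # 7.0
print(contribution("Gender", "M", "F", 2.0, -4.0))            # -4.0
```

Note that two records both carrying "BABY" contribute nothing, even though the raw strings are identical.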

Interaction with Normalization Functions: If you define both a null value list AND a custom normalization function, the normalization function overrides the null value list by default. To make the normalization function respect the null value list, add this code at the beginning of the function:

```objectscript
if (%property="")||$data(%parameters("_MissingValueArray",$zcvt(%property,"l"))) quit ""
```

Normalization Best Practices

1. Test normalization results: Use the Data Quality Manager to see how normalization affects your actual data
2. Balance standardization with data loss: Overly aggressive normalization can remove meaningful distinctions
3. Document custom functions: Maintain clear documentation of any custom normalization logic
4. Consider locale: Normalization rules may need to vary by country/language (names, dates, addresses have different conventions)
5. Handle null values consistently: Decide on organization-wide standards for what constitutes "unknown" data

---

6. Agreement Functions: Measuring Similarity

Key Points

  • Agreement functions compare normalized values and return weights
  • Return full agreement weight for exact matches
  • Return partial agreement weight for similar-but-not-exact matches
  • Can account for uniqueness (rare names weighted higher than common names)
  • Frequency adjustment increases weights for uncommon values
  • Different field types use different comparison algorithms

Detailed Notes

After data has been normalized, the agreement function compares the normalized values from two records and returns a weight indicating how similar they are. This weight is then added to (or subtracted from) the overall link weight.

What Agreement Functions Do

An agreement function takes two normalized values (one from each record being compared) and returns a numerical weight:

  • Full agreement weight (positive): The values match exactly
  • Partial agreement weight (positive, less than full): The values are similar but not identical
  • Zero weight: One or both values are null/missing
  • Disagreement weight (negative): The values are different

Default Agreement Functions by Field Type

Different types of data require different similarity comparison algorithms:

Exact Match Fields (SSN, Date of Birth):

  • Return full agreement weight if values match exactly
  • Return disagreement weight if values differ
  • No partial credit for "almost matching"

Example (SSN):

  • "123456789" = "123456789" → +12.0 (agreement weight)
  • "123456789" ≠ "987654321" → -3.0 (disagreement weight)

Name Fields (First Name, Last Name):

  • Use string similarity algorithms (such as Jaro-Winkler or Levenshtein distance)
  • Return partial agreement for similar strings
  • Account for common typos, transpositions, missing characters

Example (Last Name):

  • "SMITH" = "SMITH" → +8.0 (full agreement)
  • "SMITH" ≈ "SMYTH" → +5.0 (partial agreement, similar spelling)
  • "SMITH" ≠ "JONES" → -2.5 (disagreement)

Date Fields (Date of Birth), where partial agreement is allowed:

  • Exact match returns full agreement
  • Day/month transposition might return partial agreement
  • Off-by-one errors might return partial agreement

Address Fields:

  • Compare components (street number, street name, city, state, ZIP)
  • Partial agreement if some components match
  • Handle abbreviations (Street vs. St, Avenue vs. Ave)
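The four possible outcomes of an agreement function (full, partial, zero, disagreement) can be sketched for a name field as follows. This Python example is an illustration only: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in similarity measure, which is neither Jaro-Winkler nor Levenshtein and is not claimed to be what EMPI uses. The `cutoff` value and the fixed `partial` weight are likewise assumptions; the weights echo the Last Name example above.

```python
from difflib import SequenceMatcher


def name_agreement(a, b, agreement=8.0, partial=5.0, disagreement=-2.5, cutoff=0.75):
    """Return a weight for two normalized name values."""
    if not a or not b:
        return 0.0                      # null/missing: no contribution
    if a == b:
        return agreement                # exact match
    similarity = SequenceMatcher(None, a, b).ratio()
    return partial if similarity >= cutoff else disagreement


print(name_agreement("SMITH", "SMITH"))  # 8.0
print(name_agreement("SMITH", "SMYTH"))  # 5.0
print(name_agreement("SMITH", "JONES"))  # -2.5
print(name_agreement("SMITH", ""))       # 0.0
```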

Frequency Adjustment: Weighting Based on Uniqueness

One of the most sophisticated features of EMPI agreement functions is frequency adjustment, which adjusts the weight returned based on how common or rare a value is in your data population.

Concept: A match on an uncommon name is more significant than a match on a common name.

Example:

  • "PATEL" is a very common last name in certain populations. If two records both have last name "PATEL", it's less predictive of a match than you might think, because many different individuals share this name.
  • "KOWALCZYK" is a less common last name. If two records both have last name "KOWALCZYK", it's highly predictive of a match.

How It Works: The linkage type parameter FrequencyAdjusted controls this behavior:

  • FrequencyAdjusted = Yes: The agreement function returns a percentage of the full agreement weight based on the rarity of the value:
      • Rare values → Higher percentage (can exceed 100%, so weight > full agreement weight)
      • Common values → Lower percentage (weight < full agreement weight)
  • FrequencyAdjusted = No: The agreement function returns the full agreement weight for exact matches, regardless of frequency

Example with Frequency Adjustment:

Assume the Last Name parameter has:

  • Agreement Weight: 8.0
  • FrequencyAdjusted: Yes

Scenario 1 (Rare name):

  • Both records have last name "KOWALCZYK"
  • Frequency adjustment factor: 150% (rare name, highly predictive)
  • Weight returned: 8.0 × 1.5 = 12.0 (exceeds full agreement weight)

Scenario 2 (Common name):

  • Both records have last name "PATEL"
  • Frequency adjustment factor: 40% (common name, less predictive)
  • Weight returned: 8.0 × 0.4 = 3.2 (less than full agreement weight)

Note: According to the EMPI Configuration Guide, when FrequencyAdjusted is set to Yes, the overall weight of a parameter might exceed the agreement weight or be less than the disagreement weight due to the nature of probabilistic algorithms and frequency adjustments.
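The two scenarios above reduce to multiplying the full agreement weight by a rarity factor. The Python sketch below shows that arithmetic under stated assumptions: the factor is taken as given (how EMPI derives it from value frequencies is not covered here), and clamping it between a minimum and maximum is an assumed reading of how such bounds would apply. `frequency_adjusted_weight` is a hypothetical helper name.

```python
def frequency_adjusted_weight(agreement, factor, min_factor=0.3, max_factor=2.0):
    """Scale the full agreement weight by a rarity factor, clamped to [min, max]."""
    clamped = max(min_factor, min(max_factor, factor))
    return agreement * clamped


print(frequency_adjusted_weight(8.0, 1.5))            # 12.0 (rare name)
print(round(frequency_adjusted_weight(8.0, 0.4), 2))  # 3.2 (common name)
print(frequency_adjusted_weight(8.0, 5.0))            # 16.0 (capped at max_factor)
```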

Linkage Type Parameters Affecting Agreement Functions

Each linkage type has associated parameters that control how the agreement function behaves:

Name Linkage Type Parameters:

  • FrequencyAdjusted: Yes/No - Use frequency-based weighting
  • FrequencyAdjustmentMaxFactor: Maximum multiplier for rare names (e.g., 2.0 = 200%)
  • FrequencyAdjustmentMinFactor: Minimum multiplier for common names (e.g., 0.3 = 30%)
  • CheckTranspositions: Automatically check for first/last name transpositions
  • ScrubAffix: Remove suffixes like Jr., Sr., III before comparison
  • AgreementWeightPercentage: Percentage of full agreement weight to return
  • DisagreementWeightPercentage: Percentage of full disagreement weight to return

SSN Linkage Type Parameters:

  • Typically no frequency adjustment (SSNs should be unique)
  • Exact match or disagreement only

Date Linkage Type Parameters:

  • May allow partial credit for day/month transposition
  • May allow partial credit for adjacent dates (off-by-one errors)

Custom Agreement Functions

You can override the default agreement function for a parameter by entering custom code in the Agreement Function field.

Use Cases:

  • Implement custom string similarity algorithms
  • Handle business-specific comparison logic
  • Adjust weights based on other field values

Variables Available in Custom Agreement Functions:

  • `%object1` - The normalized data object for record 1
  • `%object2` - The normalized data object for record 2
  • `%property1` - The normalized value of this parameter for record 1
  • `%property2` - The normalized value of this parameter for record 2
  • `%agreement` - The full agreement weight for this parameter
  • `%disagreement` - The full disagreement weight for this parameter
  • `%parameters` - Linkage type parameters

Example: You could create a custom agreement function that gives extra weight if two records match on rare names AND also match on date of birth, indicating very high confidence.

---

7. Agreement and Disagreement Weights: Building the Link Weight

Key Points

  • Agreement weight: positive value added when fields match
  • Disagreement weight: negative value added when fields don't match
  • Individual weights summed to produce total link weight for record pair
  • Link weight compared against three thresholds
  • Weights can be tuned based on predictive value of each field
  • MLE (Maximum Likelihood Estimation) provides statistical foundation

Detailed Notes

Agreement and disagreement weights are the numerical values assigned to each linkage parameter that indicate how much that field should contribute to the overall link weight when two records are compared.

Understanding Agreement Weights

An agreement weight is a positive numerical value that specifies how much should be added to the overall link weight when two records have matching or similar values for a parameter.

Configuration: In the Definition Designer > Parameters tab, each weighted parameter has fields for:

  • Agreement Weight: The positive value to add for a match
  • Disagreement Weight: The negative value to add for a mismatch

Example Parameter Weights:

| Parameter | Agreement Weight | Disagreement Weight | Rationale |
|-----------|------------------|---------------------|-----------|
| SSN | +12.0 | -3.0 | SSN is highly unique, strong match indicator |
| Date of Birth | +10.0 | -2.5 | DOB is fairly unique, strong indicator |
| Last Name | +8.0 | -2.0 | Last names vary in uniqueness |
| First Name | +7.0 | -1.5 | First names are somewhat common |
| Gender | +2.0 | -4.0 | Low agreement (many people share gender), high disagreement (mismatch suggests different people) |
| Middle Name | +5.0 | -1.0 | Often missing, moderate indicator |
| Address | +6.0 | -1.5 | People move, addresses change |

Understanding Disagreement Weights

A disagreement weight is typically a negative numerical value that is added to the overall link weight, lowering it, when two records have different values for a parameter.

Important Distinction: Disagreement is NOT simply the absence of agreement. It indicates an active mismatch between values.

Three Scenarios:

1. Agreement: Values match → Add agreement weight (+)
2. Disagreement: Values don't match → Add disagreement weight (-)
3. Null/Missing: One or both values are null → Add zero (no contribution)

Why Disagreement Weights Matter: Some fields are more significant when they disagree than when they agree.

Example (Gender):

  • Agreement weight: +2.0 (match on gender is weak evidence—many people share the same gender)
  • Disagreement weight: -4.0 (mismatch on gender is strong evidence against a match—same person shouldn't have different genders)

If two records match on gender (both Male), add +2.0 to link weight. If two records disagree on gender (one Male, one Female), add -4.0 to link weight. If one or both records have null/unknown gender, add 0.

How Individual Weights Are Summed

When EMPI compares two records, it:

1. Compares each weighted parameter
2. For each parameter, the agreement function returns a weight (positive, negative, or zero)
3. All individual weights are summed to produce the link weight

Example Link Weight Calculation:

Comparing Record A and Record B (the same records as in the Section 3 example, here with slightly different illustrative weights for the partial Middle Name match and the Address comparison, so the total differs from the 40.0 shown there):

| Parameter | Record A | Record B | Weight Returned | Cumulative Link Weight |
|-----------|----------|----------|-----------------|------------------------|
| Last Name | Smith | Smith | +8.0 (exact match) | 8.0 |
| First Name | John | John | +7.0 (exact match) | 15.0 |
| Middle Name | Michael | M | +2.5 (partial match - initial) | 17.5 |
| DOB | 5/15/1980 | 5/15/1980 | +10.0 (exact match) | 27.5 |
| SSN | 123-45-6789 | 123-45-6789 | +12.0 (exact match) | 39.5 |
| Gender | M | M | +2.0 (exact match) | 41.5 |
| Address | 123 Main St, Boston | 456 Oak Ave, Boston | -0.5 (city match, street mismatch) | 41.0 |

Final Link Weight: 41.0

Comparing Link Weight Against Thresholds

Once the link weight is calculated, it's compared against three threshold values:

Three Thresholds:

1. Review Threshold (e.g., 14): Minimum weight for worklist inclusion
2. Autolink Threshold (e.g., 24): Minimum weight for automatic linking
3. Validate Threshold (e.g., 34): Minimum weight for high-confidence links

Decision Logic:

| Link Weight | Outcome | Link Status | Worklist Category |
|-------------|---------|-------------|-------------------|
| Below 14 (below Review) | Strong non-link, no further consideration | Non-link | Not on worklist |
| 14 to below 24 (Review to Autolink) | Potential link, requires review | Potential link | Review |
| 24 to below 34 (Autolink to Validate) | Automatically linked, requires validation | Link | Validate |
| 34 and above (Validate) | Automatically linked, high confidence | Link | Not on worklist (unless other conflict) |

Example Outcomes:

  • Link weight = 10 → Strong non-link (below Review threshold)
  • Link weight = 18 → Potential link (between Review and Autolink) → Appears on worklist for Review
  • Link weight = 28 → Auto-linked (above Autolink) but appears on worklist for Validation (below Validate threshold)
  • Link weight = 41 → Auto-linked, high confidence, does not appear on worklist
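The decision table and the four example outcomes can be expressed as a small classification function. This Python sketch is illustrative: `classify` is a hypothetical helper, the default thresholds are the example values from the text, and treating a weight equal to a threshold as having reached it follows the "minimum weight" wording above.

```python
def classify(weight, review=14, autolink=24, validate=34):
    """Map a link weight to (link status, worklist category or None)."""
    if weight < review:
        return ("Non-link", None)
    if weight < autolink:
        return ("Potential link", "Review")
    if weight < validate:
        return ("Link", "Validate")
    return ("Link", None)


for w in (10, 18, 28, 41):
    print(w, classify(w))
# 10 ('Non-link', None)
# 18 ('Potential link', 'Review')
# 28 ('Link', 'Validate')
# 41 ('Link', None)
```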

Tuning Weights Based on Data Characteristics

The agreement and disagreement weights should be tuned based on the predictive value of each field in your specific data population.

Tuning Process:

1. Initial weights: Start with default MLE-derived weights or industry standards
2. Test matching: Run linkage build on sample data
3. Analyze results: Review record pairs on the worklist: are true matches being linked? Are false matches being rejected?
4. Adjust weights: Increase agreement weights for fields that are strong predictors of matches in your data; decrease weights for weak predictors
5. Adjust thresholds: Move thresholds up or down to control the volume of worklist items
6. Iterate: Repeat the process until matching accuracy is acceptable

Example Tuning Scenario:

Your organization discovers that many patients don't have SSN in their records (field is frequently null). Currently, SSN has an agreement weight of +12.0, but because it's so often missing, it rarely contributes to link weights. Meanwhile, you notice that Date of Birth is highly accurate and rarely missing in your data.

Tuning Action: You might reduce the SSN agreement weight to +8.0 (acknowledging that it's less useful in your dataset) and increase the DOB agreement weight to +14.0 (reflecting its reliability and completeness).
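To see what this adjustment does, the pair from the earlier example table can be rescored with and without an SSN. The sketch below makes two assumptions that are not stated in the source: a missing SSN contributes 0 to the link weight, and all other weights stay as in that table. The class and method names are hypothetical.

```objectscript
/// Hypothetical demo class; not part of the EMPI product.
Class Demo.EMPI.Tuning
{

/// Rescore the example pair for a given DOB and SSN agreement weight.
ClassMethod PairWeight(dobWeight As %Numeric, ssnWeight As %Numeric, hasSSN As %Boolean) As %Numeric
{
    // Last Name + First Name + Middle Name + Gender + Address (example table values)
    set other = 8.0 + 7.0 + 2.5 + 2.0 - 0.5
    // Assumption: a missing SSN adds nothing, rather than a disagreement weight
    return other + dobWeight + $select(hasSSN: ssnWeight, 1: 0)
}

}
```

Under these assumptions, the pair with an SSN scores 41 both before (`PairWeight(10, 12, 1)`) and after (`PairWeight(14, 8, 1)`) the change, while the same pair without an SSN rises from 29 (`PairWeight(10, 12, 0)`) to 33 (`PairWeight(14, 8, 0)`). The adjustment shifts weight toward the field that is actually populated.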

Threshold Tuning Strategy

Early Implementation Phase:

  • Set thresholds lower to create more potential links and validation items
  • This causes more record pairs to appear on the worklist for manual review
  • Allows organization to learn about data quality and matching behavior
  • Builds confidence in the system

Mature Implementation Phase:

  • Adjust thresholds higher as confidence in the linkage definition improves
  • Reduces worklist volume to only the most ambiguous cases
  • System handles more decisions automatically
  • Staff focuses on genuine edge cases

Typical Threshold Evolution:

| Phase | Review | Autolink | Validate | Rationale |
|-------|--------|----------|----------|-----------|
| Initial | 10 | 18 | 25 | Conservative, many worklist items |
| Intermediate | 12 | 22 | 30 | Moderate confidence |
| Mature | 14 | 24 | 34 | High confidence, fewer worklist items |

---


8. Rules: Fine-Tuning the Matching Process

Key Points

  • Rules provide business logic to handle special matching scenarios
  • Implemented via onCreateClassifiedPair() method in linkage definition class
  • Can change link status, secondary reason, and comment
  • Rules are second in linkage precedence (after Manual, before Deterministic)
  • Both built-in rules and custom rules available
  • Rules fine-tune matching for edge cases not handled well by probabilistic scoring

Detailed Notes

While probabilistic matching handles the vast majority of linkage decisions effectively, there are always edge cases and special scenarios that require business logic beyond statistical scoring. Rules provide a mechanism to encode these business requirements and override or modify linkage decisions.

What Rules Do

A rule is custom code (typically ObjectScript) that examines a record pair and can change its linkage outcome. Rules can:

  • Change the link status from link to non-link, or vice versa
  • Move a record pair to a different category (e.g., from Auto-Link to Review)
  • Add a secondary reason to explain why the rule applied
  • Add a comment for documentation

Rules are evaluated AFTER the probabilistic matching calculates a link weight, but BEFORE the final linkage decision is saved to the database.

When to Use Rules

Common scenarios where rules are valuable:

1. Gender Disagreement Rule

  • Problem: Two records have high link weight based on name and DOB, but different genders
  • Rule: If genders disagree, set link status to "potential link" (requires review)
  • Rationale: Same person shouldn't have different genders; likely data error or false match

2. Date of Birth Component Disagreement Rule

  • Problem: Two records have same name and SSN, but DOB day/month/year components differ
  • Rule: If any component of DOB disagrees, set to "potential link" for review
  • Rationale: Even if other fields match, DOB mismatch suggests false match or data quality issue

3. Roommate Rule

  • Problem: Two records have high link weight because they share the same address, but different names
  • Rule: If address matches but names disagree, set to "non-link"
  • Rationale: Likely roommates or family members, not same person

4. Same MRN, Same Facility, Different Patients Rule

  • Problem: Two records from same facility have identical MRN but are clearly different people
  • Rule: Flag for review as potential MRN re-use
  • Rationale: Facility has assigned same MRN to two patients (serious data quality issue)

5. Transient Patient Rule

  • Problem: Records from homeless shelters often have incomplete or unreliable demographic data
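  • Rule: For records from specific facilities, demote automatic links to "potential link" so they are reviewed (a rule changes the link status; it cannot change the thresholds themselves)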
  • Rationale: Prevent false matches in populations with unstable demographics

How Rules Are Implemented

Rules are implemented by adding a method called onCreateClassifiedPair() to your linkage definition class. This method is called automatically for every record pair immediately before the linkage decision is saved.

Method Signature:

```objectscript
ClassMethod onCreateClassifiedPair(pClassifiedPair as %MPRL.Linkage.Classified, isModified As %Boolean) As %Status
```

Important Limitations: While the method CAN modify any property of the classified pair object, you should ONLY modify:

  • linkStatus: Change the linkage outcome
  • secondaryReason: Add explanation for the rule
  • comment: Add descriptive comment

Do NOT modify other properties (link weight, MPIIDs, etc.), as this can cause unexpected behavior.

Link Status Values

When implementing rules, you set the `linkStatus` property to one of these numerical values:

| Value | Link Status | Meaning |
|-------|-------------|---------|
| 0 | Strong Non-Link | Non-link with link weight below Review threshold |
| 1 | Non-Link to Review | Non-link with Worklist category of Review (Potential Link) |
| 2 | Link to Validate | Link with Worklist category of Validate |
| 3 | Link | Standard link |

Built-In Rules Examples

InterSystems EMPI includes several pre-built rule examples that organizations commonly implement:

1. Different Gender Rule

  • Trigger: Genders of two records are different
  • Action: Set link status to 1 (Non-Link to Review / Potential Link)
  • Secondary Reason: "Gender"
  • Comment: "AutoNonLink rule: Genders are different"

Example Code:

```objectscript
if ((tRecNormalizedA.stdGender'="")&(tRecNormalizedB.stdGender'=""))&&($zcvt(tRecNormalizedA.stdGender,"u")'=$zcvt(tRecNormalizedB.stdGender,"u")) {
    set pClassifiedPair.linkStatus = 1
    set pClassifiedPair.secondaryReason = "Gender"
    set pClassifiedPair.comment = "AutoNonLink rule: Genders are different"
}
```

2. Date of Birth Component Mismatch Rule

  • Trigger: At least one component of DOB (day, month, or year) doesn't match
  • Action: Set link status to 1 (Potential Link)
  • Secondary Reason: "DOB"
  • Comment: Explains which component disagreed
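The component comparison behind this rule can be sketched as a small helper. This is not the built-in implementation: the class `Demo.EMPI.DOBRule` and the method `DOBMismatch` are hypothetical, and the helper assumes both dates arrive as strings in YYYY-MM-DD form, which may not be how your normalized data class stores them.

```objectscript
/// Hypothetical demo class; not part of the EMPI product.
Class Demo.EMPI.DOBRule
{

/// Return the names of the DOB components that disagree, or "" if all agree.
/// Assumes both arguments are strings in YYYY-MM-DD form.
ClassMethod DOBMismatch(dobA As %String, dobB As %String) As %String
{
    set names = $listbuild("year", "month", "day")
    set mismatched = ""
    for i=1:1:3 {
        // Compare the i-th hyphen-delimited piece of each date
        if ($piece(dobA, "-", i) '= $piece(dobB, "-", i)) {
            set mismatched = mismatched _ $select(mismatched = "": "", 1: ", ") _ $list(names, i)
        }
    }
    return mismatched
}

}
```

For example, `##class(Demo.EMPI.DOBRule).DOBMismatch("1980-05-15", "1980-05-16")` returns "day". Inside `onCreateClassifiedPair()`, a non-empty result could be used to set `linkStatus` to 1, `secondaryReason` to "DOB", and `comment` to a message that names the disagreeing components.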

3. Roommates Rule (Version A)

  • Trigger: For linked pair, if SSN, Given Name, and Family Name all disagree
  • Action: Set to Non-Link to Review
  • Secondary Reason: "Roommates-A"

4. Roommates Rule (Version B)

  • Trigger: For linked pair, if pattern for Name is X (negative) and Address or Telecom is H (high), and Given Name or Family Name disagree
  • Action: Set to Non-Link to Review
  • Secondary Reason: "Roommates-B"
  • Rationale: High match on address but poor match on names suggests people living together, not same person

5. Roommates Rule (Version C)

  • Trigger: For linked pair, if pattern for Name is not H, Family Name and DOB disagree, and SSN does not agree
  • Action: Set to Non-Link to Review
  • Secondary Reason: "Roommates-C"
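As an illustration of how a roommates trigger reduces to a boolean test, the Roommates-A condition above can be sketched as follows. The class, method, and boolean parameters are hypothetical: how field-level agreement is obtained from a classified pair depends on your linkage definition and is not shown here.

```objectscript
/// Hypothetical demo class; not part of the EMPI product.
Class Demo.EMPI.Roommates
{

/// Return the link status after applying the Roommates-A condition.
/// The agreement flags are assumed to be computed elsewhere.
ClassMethod RoommatesA(linkStatus As %Integer, ssnAgree As %Boolean, givenAgree As %Boolean, familyAgree As %Boolean) As %Integer
{
    // Only linked pairs (status 2 or 3) are candidates for demotion
    if ((linkStatus >= 2) && 'ssnAgree && 'givenAgree && 'familyAgree) {
        // 1 = Non-Link to Review
        return 1
    }
    // Otherwise leave the status unchanged
    return linkStatus
}

}
```

A linked pair (status 3) in which SSN, Given Name, and Family Name all disagree comes back as 1; if any of the three agrees, the original status is returned unchanged.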

Custom Rule Example: Different Gender

Here's a complete example of implementing the "Different Gender" rule:

```objectscript
ClassMethod onCreateClassifiedPair(pClassifiedPair as %MPRL.Linkage.Classified, isModified As %Boolean) As %Status
{
    // Get the normalized data objects for both records
    set tRecNormalizedA = ##class(MyLinkage.Normalized.Data).%OpenId(pClassifiedPair.normalizedId1)
    set tRecNormalizedB = ##class(MyLinkage.Normalized.Data).%OpenId(pClassifiedPair.normalizedId2)

    // Check if genders are both present and different
    if ((tRecNormalizedA.stdGender'="")&(tRecNormalizedB.stdGender'=""))&&($zcvt(tRecNormalizedA.stdGender,"u")'=$zcvt(tRecNormalizedB.stdGender,"u")) {
        // Set link status to Potential Link (requires review)
        set pClassifiedPair.linkStatus = 1

        // Set secondary reason for filtering in Worklist
        set pClassifiedPair.secondaryReason = "Gender"

        // Add comment explaining the rule
        set pClassifiedPair.comment = "AutoNonLink rule: Genders are different"
    }

    // More rules could go here...

    quit $$$OK
}
```

Variables Used:

  • `pClassifiedPair`: The record pair object being evaluated
  • `tRecNormalizedA`: Normalized data object for record A
  • `tRecNormalizedB`: Normalized data object for record B
  • `isModified`: Boolean indicating if the pair has been modified

Linkage Precedence: Where Rules Fit

When multiple linkage reasons apply to a record pair, EMPI uses this precedence hierarchy:

1. Manual (highest)
2. Rule
3. Deterministic
4. Domain Conflict
5. Transitivity
6. Threshold (lowest)

Example Precedence Scenario:

  • Probabilistic matching gives a link weight of 28 (above Autolink threshold of 24) → Would auto-link
  • A rule examines the pair and finds different genders → Sets link status to "Potential Link"
  • The rule decision overrides the threshold-based decision
  • Final outcome: Records are NOT linked; they appear on worklist for review
  • Link Reason displayed: "Rule"
  • Secondary Reason displayed: "Gender"
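The precedence lookup in this scenario can be sketched as a scan of the hierarchy from highest to lowest. The class `Demo.EMPI.Precedence` and the method `WinningReason` are hypothetical; EMPI resolves precedence internally, and this only mirrors the ordering listed above.

```objectscript
/// Hypothetical demo class; not part of the EMPI product.
Class Demo.EMPI.Precedence
{

/// Given a $list of link reasons that apply to a pair, return the one that wins.
ClassMethod WinningReason(reasons As %List) As %String
{
    // Precedence hierarchy, highest first
    set order = $listbuild("Manual", "Rule", "Deterministic", "Domain Conflict", "Transitivity", "Threshold")
    for i=1:1:$listlength(order) {
        set candidate = $list(order, i)
        // The first reason found while scanning from the top wins
        if $listfind(reasons, candidate) { return candidate }
    }
    // None of the known reasons applied
    return ""
}

}
```

For the scenario above, `##class(Demo.EMPI.Precedence).WinningReason($listbuild("Threshold", "Rule"))` returns "Rule", which is why the rule's decision replaces the threshold-based auto-link.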

Best Practices for Rules

1. Document Rules Thoroughly:

  • Maintain a catalog of all rules implemented
  • Document the business rationale for each rule
  • Include examples of record pairs that trigger each rule

2. Test Rules Extensively:

  • Use sample data to verify rules behave as expected
  • Check for unintended consequences (rules triggering too frequently or not frequently enough)
  • Monitor worklist volumes after implementing new rules

3. Use Secondary Reason for Filtering:

  • Set meaningful secondary reason values
  • Allows worklist users to filter by rule type
  • Helps in analyzing rule effectiveness

4. Keep Rules Simple:

  • Each rule should address one specific scenario
  • Complex rules are difficult to debug and maintain
  • Multiple simple rules are better than one complex rule

5. Monitor Rule Impact:

  • Track how often each rule fires
  • Analyze whether rule-flagged pairs are true matches or true non-matches
  • Adjust or remove rules that don't improve matching accuracy

6. Coordinate with Threshold Tuning:

  • Rules and thresholds work together
  • If too many record pairs trigger rules, consider adjusting thresholds instead
  • Rules should handle exceptions, not common cases

---


9. Exam Preparation Tips

Key Points

  • Review section content

Detailed Notes

Review documentation for detailed information.

