T1.4: Manages Linkage Builds

Knowledge Review - InterSystems Enterprise Master Patient Index Technical Specialist

1. Understanding Linkage and the Need for Builds

Key Points

  • Linkage identifies records representing the same person
  • Links connect records from different source systems
  • Linkage algorithm uses demographic matching rules
  • Thresholds determine automatic vs. manual review
  • MPIID (Master Person Index ID) unifies linked records

Detailed Notes

Linkage is the core function of InterSystems EMPI (Enterprise Master Patient Index). As patient records arrive from multiple source systems and facilities, the linkage process determines which records represent the same individual and should be connected. Understanding linkage is essential for managing linkage builds effectively.

The Linkage Concept

In a healthcare organization with multiple facilities or systems, the same patient may have different medical record numbers (MRNs) at each location. For example:

  • Patient John Smith has MRN 123456 at Community General Hospital
  • The same John Smith has MRN 789012 at Metro Medical Center
  • He also has MRN 555555 at Regional Clinic

Without a master patient index, these three records appear to be three different patients. The linkage process in InterSystems EMPI analyzes demographic data (name, date of birth, gender, address, etc.) to determine that these three records represent the same person. Once linked, all three records share a common MPIID (Master Person Index ID), enabling clinicians to see a complete picture of John Smith's care across all facilities.

How Linkage Works

The linkage algorithm compares pairs of patient records using configurable matching rules defined in the Linkage Definition. The matching process:

1. Normalization: Demographic data is normalized to standard formats (e.g., removing punctuation from names, standardizing addresses)
2. Link Key Matching: Records must match on required "link keys" to be considered for linkage (e.g., same last name phonetic code)
3. Weight Calculation: Matching and non-matching attributes contribute positive or negative weight
4. Threshold Comparison: The calculated weight is compared to three thresholds: Review, Autolink, and Validate
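The first three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the EMPI implementation: the field names, the weight values, and the simplified link key (first four letters of the last name instead of a phonetic code) are all assumptions made for the example.

```python
import re

# Hypothetical attribute weights; real values come from the Linkage Definition.
AGREE_WEIGHTS = {"first_name": 4.0, "dob": 6.0, "gender": 1.0}
DISAGREE_WEIGHTS = {"first_name": -2.0, "dob": -4.0, "gender": -1.0}


def normalize(record):
    """Step 1: uppercase every value and strip punctuation and extra spaces."""
    cleaned = {}
    for field, value in record.items():
        text = re.sub(r"[^\w\s]", "", str(value)).upper()
        cleaned[field] = " ".join(text.split())
    return cleaned


def link_key(record):
    """Step 2: a stand-in link key (first 4 letters of the last name).

    A real definition might use a phonetic code instead.
    """
    return record["last_name"][:4]


def pair_weight(a, b):
    """Steps 2-3: return None if link keys differ, else the summed weight."""
    a, b = normalize(a), normalize(b)
    if link_key(a) != link_key(b):
        return None  # pair is never compared, so it can never be linked
    total = 0.0
    for field in AGREE_WEIGHTS:
        if a[field] == b[field]:
            total += AGREE_WEIGHTS[field]
        else:
            total += DISAGREE_WEIGHTS[field]
    return total
```

With these assumed weights, two "Smith" records that agree on first name, date of birth, and gender after normalization score 11.0, while a "Smith" and a "Jones" record return None because they never share a link key.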

Linkage Thresholds

Three thresholds control linkage decisions:

Review Threshold: The minimum weight for a record pair to be considered a potential match. Pairs below this threshold are classified as non-links (different people) and are not linked.

Autolink Threshold: Pairs exceeding this weight are automatically linked because they very likely represent the same person.

Validate Threshold: Pairs with weights between the Autolink and Validate thresholds are automatically linked but flagged for expert review to confirm the link is correct; pairs above the Validate threshold are linked without that flag. Pairs between the Review and Autolink thresholds are not linked but appear on the Worklist for manual review.
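The threshold bands can be summarized as a small decision function. This is a sketch under assumptions: it presumes Review <= Autolink <= Validate, and it treats each threshold as an inclusive lower bound, which is a choice made for the example rather than documented product behavior. The function name and return labels are hypothetical.

```python
def classify_pair(weight, review, autolink, validate):
    """Map a pair weight to one of the four outcomes described above.

    Assumes review <= autolink <= validate. Treating each threshold as an
    inclusive lower bound is an assumption made for this sketch.
    """
    if not review <= autolink <= validate:
        raise ValueError("expected review <= autolink <= validate")
    if weight < review:
        return "non-link"
    if weight < autolink:
        return "worklist review, not linked"
    if weight < validate:
        return "linked, flagged for validation"
    return "linked"
```

For example, with illustrative thresholds of 8, 12, and 16, a weight of 10 lands on the Worklist unlinked, while a weight of 14 is linked and flagged for validation.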

When Linkage Builds Are Necessary

Linkage builds are required in several scenarios:

Initial Data Load: After loading patient records via batch import, a linkage build must be run to identify potential duplicates and establish links between records.

Linkage Definition Changes: Any modification to the linkage definition (changing thresholds, adding/removing matching parameters, modifying weights) requires rebuilding linkage data to apply the changes to existing records.

Data Quality Corrections: After correcting data quality issues (e.g., fixing invalid dates, standardizing name formats), a linkage rebuild may improve matching accuracy.

Periodic Maintenance: Even without definition changes, periodic linkage rebuilds can be valuable in large, complex deployments to ensure linkage data remains consistent and up-to-date.

---

2. Types of Linkage Builds

Key Points

  • Full rebuild reprocesses all linkage data from scratch
  • Incremental/partial builds process only changed records
  • Build modes: batch vs. multi-process vs. normalized-only
  • Development/test environments may use different approaches than production
  • Build selection depends on what changed and system size

Detailed Notes

InterSystems EMPI provides different build modes to accommodate various scenarios and performance requirements. Understanding when to use each build type is critical for efficient system management.

Full Linkage Rebuild

A full linkage rebuild reprocesses all patient records in the system, recalculating linkage weights and reestablishing links based on the current linkage definition. This is the most comprehensive build type.

When to Use Full Rebuild:

  • After modifying the linkage definition (parameters, weights, thresholds)
  • After changing normalization functions that affect how data is compared
  • After correcting widespread data quality issues
  • When migrating to a new version of InterSystems EMPI
  • When linkage data integrity is questionable

Full Rebuild Process:

The full rebuild process includes multiple stages:

1. Purging: Existing linkage data tables are cleared (normalized data, link key indices, classified data, transitive links)
2. Normalized Database Build: Patient demographic data is normalized according to current rules
3. Warning Database Build: Data quality warnings are generated for problematic records
4. Classified Database Build: Records are classified according to link key requirements
5. Link Key Index Build: Link key indices are constructed for efficient pair comparison
6. Transitive Links Build: Link relationships are calculated based on matching weights
7. EID Synchronization: MPIIDs are synchronized so linked records share the same MPIID
8. Consistency Check: The system identifies linkage conflicts (overlaps and overlays) for the Worklist

Full Rebuild Duration:

Full rebuilds are resource-intensive and time-consuming. For large patient populations:

  • 100,000 records: 15-30 minutes
  • 1,000,000 records: 2-6 hours
  • 10,000,000 records: 12-24+ hours

Actual duration depends on server capacity, record complexity, and linkage definition complexity.

Partial/Incremental Build

Partial builds process only records that have changed since the last build, making them much faster than full rebuilds. However, partial builds are only available in specific scenarios.

When Partial Build Is Appropriate:

  • No changes to linkage definition
  • Only new records added since last build
  • No normalization function changes
  • No widespread data corrections

Limitations of Partial Builds:

  • Cannot be used after linkage definition modifications
  • May not detect all linkage conflicts
  • Not suitable for data quality remediation scenarios
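The criteria above amount to a simple decision rule, sketched here in Python. The `ChangesSinceLastBuild` structure and its field names are hypothetical; they only stand in for whatever change tracking a site actually keeps. The sketch also takes the conservative route of requiring a full rebuild for any normalization change.

```python
from dataclasses import dataclass


@dataclass
class ChangesSinceLastBuild:
    """Hypothetical summary of what changed since the last completed build."""
    definition_modified: bool = False
    normalization_changed: bool = False
    widespread_data_corrections: bool = False
    new_records_added: bool = False


def choose_build(changes):
    """Return 'full', 'partial', or 'none' following the criteria above."""
    if (changes.definition_modified
            or changes.normalization_changed
            or changes.widespread_data_corrections):
        return "full"  # any of these rules out a partial build
    if changes.new_records_added:
        return "partial"
    return "none"
```

For instance, `choose_build(ChangesSinceLastBuild(new_records_added=True))` returns "partial", but adding `definition_modified=True` to the same call returns "full".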

Batch Mode Build

Batch mode is the standard build approach that processes all records or all changed records in a single continuous operation. The build runs in the background and logs progress to the Build Log.

Batch Mode Characteristics:

  • Processes all linkage data in predefined sequence
  • Provides detailed logging of each stage
  • Can be monitored via the Build Log or the Show Progress dialog
  • Can be stopped (though this leaves linkage data incomplete)

Multi-Process Mode:

Some versions of InterSystems EMPI support multi-process builds that can leverage multiple CPU cores for faster processing. This is particularly valuable for very large datasets.

Normalized-Only Build

In some cases, you may only need to rebuild the normalized database without recalculating all linkage relationships. This is much faster than a full rebuild but only appropriate when normalization functions changed without affecting matching logic.

---

3. Executing a Linkage Build from the Settings Menu

Key Points

  • Build is initiated from Settings menu in Person Index
  • Definition Designer also provides build access
  • Select build mode (batch, multi-process, etc.)
  • Process runs in background
  • Progress banner appears in Management Portal

Detailed Notes

Linkage builds are executed through the Management Portal interface. The process is straightforward but must be performed by users with appropriate privileges (typically the %HSPI_Master role).

Navigating to Build Options

To start a linkage build:

1. Open the Management Portal and navigate to the Person Index menu
2. Select Settings from the menu options
3. The Settings page contains the linkage build controls

Alternative Access Point:

Builds can also be initiated from the Definition Designer:

1. Navigate to Person Index > Definition Designer
2. The Definition Designer page includes build buttons
3. This approach is common when you've just modified the linkage definition and want to build immediately

Sample question Q4 from the EMPI exam tests this point: candidates are expected to know that builds are started from the Settings menu option.

Selecting Build Mode

On the Settings page, you'll see options to start different types of builds:

Build All Linkage Data: Initiates a full rebuild of all linkage data, including all stages (purge, normalized, warning, classified, link key index, transitive links, synchronization, consistency check).

Build Options Menu: Depending on the version, there may be additional options for:

  • Build mode selection (batch vs. multi-process)
  • Partial build (if conditions allow)
  • Specific database builds (normalized only, link key index only, etc.)

Starting the Build

After selecting the appropriate build type and options:

1. Click the Start Build or Build Linkage Data button
2. A confirmation dialog may appear warning that the build can take significant time
3. Confirm to start the build process
4. The build begins running in the background

Important Note: Users must have the %HSPI_Master role to initiate linkage builds. Additionally, to compile the linkage definition (which is often required before building), users need "USE" permissions on the %Ens_ProductionRun resource.

Build Progress Banner

Once the build starts, a banner appears in the Management Portal indicating that a linkage data build is currently running. The banner displays:

  • Build mode (e.g., "batch - normalized/eid")
  • Processing mode (e.g., "multi-process")
  • Linkage definition name
  • Log index number

The banner includes a Show Progress button that opens a detailed progress window.

Viewing Build Progress

Click the Show Progress button in the banner to open the Linkage Data Builder Output window. This window displays real-time progress information:

  • Current processing stage
  • Number of records processed in current stage
  • Estimated time remaining (in some versions)
  • Any errors or warnings encountered

The output window updates continuously as the build progresses. You can close this window without affecting the build—it continues running in the background.

Stopping a Build

The progress banner includes a Stop Build button. However, stopping a build is strongly discouraged:

Warning: Stopping the build linkage data process while it is running can cause your linkage data to be out-of-date. Use discretion when stopping a build.

If you must stop a build (e.g., due to an emergency or critical error), be aware that:

  • Linkage data will be incomplete and inconsistent
  • The system may not function correctly until a complete build finishes
  • You'll need to restart and complete a full build as soon as possible

Only stop builds in genuine emergencies or when you've discovered a critical error that must be corrected before proceeding.

---

4. Monitoring Build Progress and Status

Key Points

  • Real-time progress shown in output window
  • Build stages clearly identified in log
  • Processing time varies by system size and complexity
  • Build runs in background - other work can continue
  • Progress stored in global for persistent access

Detailed Notes

While linkage builds run in the background, several mechanisms provide visibility into build progress and status. Understanding how to monitor builds is essential for managing the process effectively.

Real-Time Progress Monitoring

The primary mechanism for monitoring build progress is the Linkage Data Builder Output window, accessed via the Show Progress button. This window displays detailed information about each stage of the build:

Build Stages Displayed:

1. Building started... - Initial message confirming build has begun
2. Purging [Database] Database started... - For each database being purged (Normalized, Warning, Classified, Link Key Index, Transitive Links)
3. Purging [Database] Database finished... - With record counts and duration
4. Building [Database] Database started... - For each database being built
5. Building [Database] Database finished... - With record counts saved and duration
6. Synchronizing EID started... - Beginning of MPIID synchronization
7. Synchronizing EID finished... - With count of records changed
8. Checking for EID consistency started... - Beginning of consistency validation
9. Checking for EID consistency finished... - With count of issues discovered
10. Building finished. Total duration is [X] seconds - Final completion message

Example Build Log Output:

```
Building All Linkage Data for Linkage Definition 'Local.Linkage.Definition'
Building mode is 'batch - normalized/eid', processing mode is 'multi-process', log index is 2
Building all linkage data (preserving all manual linkages)
2018-08-24 16:58:43.842: Building started ...
2018-08-24 16:58:43.920: Purging Normalized Database started ...
2018-08-24 16:58:44.017: Purging Normalized Database finished, 5,012 records deleted, in 0.097 seconds
2018-08-24 16:58:44.017: Building Normalized Database started ...
2018-08-24 16:58:49.389: Building Normalized Database finished, 5,012 records saved, in 5.373 seconds
...
2018-08-24 16:58:56.335: Building finished. Total duration is 12.493 seconds
```
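If build output is saved as text, the "finished" lines can be pulled apart for later comparison. The following is a minimal Python sketch that assumes one message per line and handles only the "<stage> finished, N records <verb>, in S seconds" shape shown in the sample above; other message shapes are not covered, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2018-08-24 16:58:44.017: Purging Normalized Database finished, 5,012 records deleted, in 0.097 seconds
FINISHED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"(?P<stage>.+?) finished, "
    r"(?P<count>[\d,]+) records (?P<verb>\w+), "
    r"in (?P<secs>[\d.]+) seconds$"
)


def parse_finished_line(line):
    """Return a dict for a matching 'finished' line, or None otherwise."""
    m = FINISHED.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        "stage": m["stage"],
        "records": int(m["count"].replace(",", "")),  # drop thousands separators
        "verb": m["verb"],
        "seconds": float(m["secs"]),
    }
```

Applied to the purge line in the sample, this yields the stage "Purging Normalized Database" with 5012 records and 0.097 seconds; "started ..." lines return None.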

Understanding Processing Metrics

The build log provides important metrics for each stage:

Records Deleted: During purge stages, shows how many existing records were removed from each database.

Records Saved: During build stages, shows how many records were processed and saved.

Duration: Each stage reports completion time, helping identify bottlenecks in the build process.

Records Changed: During EID synchronization, shows how many records had their MPIID updated to match linked records.

Issues Discovered: During consistency checking, shows how many linkage conflicts (overlaps/overlays) were identified.

Background Processing

The linkage build process runs in the background, meaning:

  • You can close the progress window without stopping the build
  • You can navigate to other pages in the Management Portal
  • Other users can continue working (though system performance may be slower)
  • The production continues processing incoming messages

However, for large builds, system resources are heavily consumed, so it's best to schedule builds during off-peak hours when possible.

Persistent Progress Storage

Build progress is stored in a global variable:

```
^CacheTemp.Output($J,"output",lineNumber)
```

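If you close the progress dialog and the process is still running, you can review the current status by examining this global. Because the first subscript is $J (the job number of the process that wrote the output), you need that job number to find the right node. This is particularly useful if: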

  • Your browser session is disconnected during a build
  • You need to check build status from Terminal
  • You want to script build monitoring

---

5. Using the Build Log to Review Build History

Key Points

  • Build Log accessible from Person Index menu
  • Records all past builds with timestamps and details
  • Search and filter by build type, date range, status
  • Expand entries to see full build output
  • Purge old entries to manage log size

Detailed Notes

The Build Log provides a permanent record of all linkage builds executed on the system. This historical data is valuable for troubleshooting, auditing, and understanding system behavior over time.

Accessing the Build Log

To access the Build Log:

1. Navigate to the Person Index menu in the Management Portal
2. Select Build Log from the menu options
3. The Build Log page displays a list of all recorded builds

The Build Log page shows a summary of recent builds by default, with the most recent builds listed first.

Build Log Entry Information

Each entry in the Build Log displays:

Timestamp: Date and time when the build started (format: YYYY-MM-DD HH:MM:SS.mmm)

Build Status: Visual indicator and text showing:

  • Last Completed (green) - Build finished successfully
  • In Progress (yellow) - Build currently running
  • Error (red) - Build encountered errors
  • Stopped (orange) - Build was manually stopped before completion

Build Description: Brief summary including:

  • Build type (e.g., "Building All Linkage Data")
  • Linkage definition name
  • Build mode and processing mode
  • Preservation of manual linkages (yes/no)

Duration: Total time the build took to complete (shown only for completed builds)

Expanding Build Log Entries

By default, build entries are shown in collapsed form. Click on a build entry to expand it and see the full build output, including:

  • All stages (purging, building, synchronizing, checking)
  • Record counts for each stage
  • Timing information for each stage
  • Any errors or warnings generated
  • Final completion status and total duration

This detailed view is identical to what was shown in the real-time progress window during build execution.

Searching and Filtering Build Log

The Build Log page provides a search panel to filter builds:

Build Log Types: Filter by build status:

  • In Progress (currently running builds)
  • Error (builds that failed)
  • Stopped (builds manually interrupted)
  • Last Completed (successful completions)
  • All Completed (all finished builds, successful or not)
  • All Types (show everything)

Build Start Time Range: Filter by when builds started:

  • From: Start date/time
  • To: End date/time

Build End Time Range: Filter by when builds completed:

  • From: Completion date/time
  • To: Completion date/time

Display Options:

  • Logs per Page: Choose how many builds to display (e.g., 20, 50, 100)
  • Newest First / Oldest First: Sort order
  • Expand Logs / Contract Logs: Show all builds expanded or collapsed

After setting filters, click Refresh Table to reload the Build Log with the specified criteria.

Purging Old Build Logs

Over time, the Build Log accumulates many entries. To manage log size and improve performance:

1. The top of the Build Log page shows the total number of entries
2. Click the Purge button to delete old entries
3. In the "Do not purge most recent n days" field, specify how many days of history to retain (default: 7)
4. Confirm the purge operation

Purge Recommendations:

  • For development/test systems: Purge frequently (retain 7-14 days)
  • For production systems: Retain longer history for audit purposes (30-90 days)
  • Before major changes: Export or document recent build logs if needed for reference
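To preview what a given retention setting would remove, the "Do not purge most recent n days" rule can be mimicked outside the product. This Python sketch assumes you have a list of build start times (for example, copied from the Build Log page); the function name and the exact boundary handling are assumptions, not product behavior.

```python
from datetime import datetime, timedelta


def entries_to_purge(entry_times, keep_days=7, now=None):
    """Return the start times that fall outside the retention window.

    An entry exactly keep_days old is kept; that boundary rule is an
    assumption made for this sketch.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=keep_days)
    return [t for t in entry_times if t < cutoff]
```

Passing `keep_days=30` or `keep_days=90` gives a quick estimate of how many entries a production-style retention period would leave in place.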

---

6. Interpreting Build Results and Messages

Key Points

  • Synchronizing EID messages indicate MPIID assignments
  • Consistency check identifies overlaps and overlays
  • Record counts should match expected population
  • Errors indicate configuration or data problems
  • Build duration helps capacity planning

Detailed Notes

The output from a linkage build contains important information about the state of the patient index. Learning to interpret build messages helps identify issues and validate that the build completed correctly.

Normal Build Messages

A successful build will show consistent patterns in the output:

Purge and Build Balance: The number of records deleted during purging should match (approximately) the number of records saved during building for each database. If the normalized database purge deletes 5,012 records, the normalized database build should save approximately 5,012 records (the exact count may differ slightly if records were added during the build).

Processing Time Patterns: Different stages have characteristic durations:

  • Purging: Very fast (seconds)
  • Building Normalized: Moderate (minutes for large datasets)
  • Building Classified: Fast (seconds to minutes)
  • Building Link Key Index: Moderate to slow (depends on linkage complexity)
  • Building Transitive Links: Can be slow for large, highly linked populations
  • Synchronizing EID: Fast to moderate
  • Checking Consistency: Fast to moderate

Record Counts: Should align with known patient population size. If you expect 500,000 patients and the normalized database only builds 250,000 records, investigate why half the population is missing.
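The purge/build balance and record count checks can be automated once the counts are extracted from the log. The sketch below is illustrative only: the function name is hypothetical, and the 1% tolerance is an arbitrary assumption that each site would tune.

```python
def check_build_counts(deleted, saved, expected_population, tolerance=0.01):
    """Return a list of warnings; an empty list means the counts look plausible.

    tolerance is a hypothetical allowed relative difference (1% by default).
    """
    warnings = []
    # Skip the balance check on a first build, when nothing was purged.
    if deleted and abs(saved - deleted) / deleted > tolerance:
        warnings.append(f"purge/build mismatch: {deleted} deleted, {saved} saved")
    if expected_population and abs(saved - expected_population) / expected_population > tolerance:
        warnings.append(f"population mismatch: expected {expected_population}, saved {saved}")
    return warnings
```

With the figures from the text, `check_build_counts(5012, 5012, 5012)` returns an empty list, while `check_build_counts(500000, 250000, 500000)` returns two warnings.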

EID Synchronization Messages

The EID (Enterprise ID, also called MPIID) synchronization stage is particularly important:

Purpose: InterSystems EMPI tries to assign the same MPIID to linked records and different MPIIDs to unlinked records.

"Synchronizing EID finished, X records changed":

  • Large number of changed records after definition changes: Normal, indicates thresholds were adjusted and links changed
  • Large number of changed records with no definition changes: May indicate data quality issues or incorrect previous builds
  • Zero records changed: Normal if no new linkages were established

Understanding Synchronization:

When two records are determined to be linked, they should share the same MPIID. During synchronization, one record's MPIID is chosen as the "master" and the other record(s) in the link group are updated to match. The "records changed" count shows how many records had their MPIID updated.
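The "records changed" count can be illustrated with a small union-find sketch in Python. It groups records that are linked directly or transitively, picks one MPIID per group, and counts the records whose MPIID had to change. Choosing the lowest MPIID as the "master" is an assumption made for the example; the source does not state how EMPI picks it.

```python
def synchronize(mpiids, links):
    """Return (new_mpiids, records_changed).

    mpiids: dict of record id -> current MPIID
    links:  iterable of (record_a, record_b) pairs that are linked
    """
    parent = {r: r for r in mpiids}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two link groups

    groups = {}
    for r in mpiids:
        groups.setdefault(find(r), []).append(r)

    new_mpiids = {}
    for members in groups.values():
        master = min(mpiids[r] for r in members)  # assumed selection rule
        for r in members:
            new_mpiids[r] = master
    changed = sum(1 for r in mpiids if new_mpiids[r] != mpiids[r])
    return new_mpiids, changed
```

For records A, B, C, D with MPIIDs 100, 200, 300, 400 and links A-B and B-C, the first three records end up sharing MPIID 100 and the changed count is 2.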

Consistency Check Messages

The consistency check stage identifies linkage conflicts:

Overlaps: Linked records that have different MPIIDs. This shouldn't happen after synchronization, so overlaps found during consistency checking indicate a problem that needs investigation.

Overlays: Two records with the same MPIID that are not linked. This represents a serious data integrity issue where unrelated patients share an identifier.
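Both conflict types can be expressed as simple set checks. The Python sketch below follows the two definitions above, considers only direct links (transitive closure is left out for brevity), and compares every pair of records, so it is meant for illustration on small samples rather than for production-sized data. The function name is hypothetical.

```python
from itertools import combinations


def find_conflicts(mpiids, links):
    """Return (overlaps, overlays) as sorted lists of record-id pairs.

    overlaps: linked pairs whose MPIIDs differ
    overlays: unlinked pairs that share an MPIID
    """
    linked = {frozenset(pair) for pair in links}
    overlaps = sorted(
        tuple(sorted(p)) for p in linked
        if len({mpiids[r] for r in p}) > 1
    )
    overlays = sorted(
        (a, b) for a, b in combinations(sorted(mpiids), 2)
        if mpiids[a] == mpiids[b] and frozenset((a, b)) not in linked
    )
    return overlaps, overlays
```

Given MPIIDs `{"A": 1, "B": 2, "C": 3, "D": 3}` and a single link A-B, the function reports A-B as an overlap and C-D as an overlay.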

"Checking for EID consistency finished, X issues discovered":

  • Zero issues: Ideal result, indicates clean linkage data
  • Small number of issues: Normal in complex environments; the issues appear on the Worklist for review
  • Large number of issues: Investigate whether linkage definition is configured correctly

Worklist Impact:

Issues discovered during consistency checking are automatically added to the Worklist for manual review by data stewards with %HSPI_Operator or %HSPI_Master roles.

Error Messages in Build Log

If a build encounters errors, the Build Log will contain error messages:

Common Error Types:

  • Database errors (disk full, permissions issues)
  • Memory allocation errors (insufficient system memory)
  • Linkage definition errors (invalid configuration)
  • Data errors (corrupt records, invalid data types)

Responding to Errors:

When errors occur:

1. Review the full error message in the expanded Build Log entry
2. Address the root cause (free disk space, fix permissions, correct definition, repair data)
3. Restart the build after resolving the issue

Most build errors require a full rebuild after correction—partial builds may not be possible after an error.

---

7. Development/Test vs. Production Build Strategies

Key Points

  • Test environments: Frequent builds during definition tuning
  • Production: Scheduled builds during maintenance windows
  • Test with representative data subset
  • Document build duration for capacity planning
  • Coordinate builds with data loading activities

Detailed Notes

Linkage build strategies differ between development/test environments and production systems. Understanding these differences helps manage builds effectively across the system lifecycle.

Development and Test Environment Builds

In development and test environments, linkage builds are frequent and iterative:

Frequent Building: During linkage definition development, you may run dozens of builds as you:

  • Adjust matching parameters
  • Modify thresholds
  • Add or remove link keys
  • Test normalization functions
  • Evaluate data quality rules

Small Datasets: Test environments often use a subset of production data (10,000-100,000 records rather than millions). This allows:

  • Faster build cycles (minutes instead of hours)
  • Rapid iteration on linkage definition changes
  • Quick validation of configuration changes

Representative Data: Even though test datasets are smaller, they should be representative:

  • Include records from all facilities
  • Contain various data quality levels (clean and problematic records)
  • Represent diverse patient demographics (name variations, multiple addresses, etc.)
  • Include known duplicate pairs for validation

Build Validation: After each test build:

  • Review known duplicate pairs to verify they linked correctly
  • Check Build Log for unexpected record counts or errors
  • Inspect sample records to confirm normalization working as expected
  • Use Threshold Adjuster to visualize weight distribution

Production Environment Builds

Production systems require more careful build management:

Infrequent Scheduled Builds: Production builds should be infrequent and carefully planned:

  • Schedule during maintenance windows (overnight, weekends)
  • Coordinate with source system administrators
  • Notify users of potential performance impact
  • Plan for extended duration (hours to complete)

Build Triggers in Production:

Production builds are only needed when:

  • Linkage definition is updated (after thorough testing)
  • Large batch data loads are completed
  • Data quality remediation is performed
  • System upgrades require rebuild
  • Periodic maintenance (quarterly or semi-annually)

Change Control: Production linkage definition changes should go through formal change control:

  • Test thoroughly in development environment
  • Document expected impact on linkage
  • Get approval from data governance committee
  • Schedule implementation during maintenance window
  • Have rollback plan if issues occur

Performance Impact: Production builds consume significant resources:

  • CPU utilization may reach 80-100%
  • Memory usage increases substantially
  • Disk I/O is intensive
  • Concurrent user activity may slow
  • Message processing throughput may decrease

Therefore, schedule builds when system usage is minimal.

Coordinating Builds with Data Loading

Linkage builds and data loading activities must be coordinated:

After Batch Loads: When loading large numbers of new records via batch import:

1. Complete the batch load
2. Verify data loaded correctly
3. Schedule linkage build to process new records
4. Review Build Log to confirm appropriate processing
5. Check Worklist for potential duplicates requiring review

Ongoing HL7 Feeds: For systems with continuous HL7 data feeds:

  • Real-time linkage occurs for each incoming message
  • Periodic builds are less frequent (quarterly/semi-annually)
  • Builds primarily needed for definition changes or data cleanup

Temporary Feed Suspension: For very large batch loads or major linkage definition changes, consider:

  • Temporarily suspending HL7 feeds during build
  • Queuing incoming messages for processing after build completes
  • Coordinating with source systems to minimize traffic during build window

---

8. Exam Preparation Tips

Key Points

  • Review section content

Detailed Notes

Review documentation for detailed information.
