1. Message Storage and Retention Optimization
Key Points
- Message Storage Components: Message headers and message bodies are stored separately in the database
- Purge Task: Ens.Util.Tasks.Purge is the built-in scheduled task that removes old messages
- Retention Policies: Define how long message headers and bodies are kept before purging
- Archival Strategies: Archive important messages before purging for compliance and audit purposes
- Storage Impact: Unpurged messages accumulate and degrade system performance over time
Detailed Notes
Overview
HealthShare UCR productions generate large volumes of messages as clinical data flows through Edge Gateways and the Hub. Each message consists of a message header (metadata about the message: source, target, timestamps, status) and a message body (the actual content of the message). Without active management, these messages accumulate in the database, consuming disk space and degrading query performance. The built-in purge task provides automated message cleanup based on configurable retention policies.
Message Header and Body Storage
- Message Headers (Ens.MessageHeader): Stored in the Ensemble message header table; contain metadata such as session ID, source component, target component, message type, creation time, and completion status
- Message Bodies: Stored in class-specific tables or as streams; contain the actual message content (HL7 messages, SDA containers, request/response objects)
- Headers and bodies are linked by message ID but stored separately, allowing different retention periods
Configuring Ens.Util.Tasks.Purge
1. Open the Management Portal and navigate to System Operation > Task Manager > Task Schedule
2. Locate or create the Ens.Util.Tasks.Purge task
3. Configure the following parameters:
- NumberOfDaysToKeep: How many days of messages to retain (e.g., 30, 60, 90)
- BodiesToo: Whether to purge message bodies along with headers (recommended: Yes)
- TypesToPurge: Which message types to include in the purge (default: all)
- KeepIntegrity: Whether to preserve referential integrity during purge
4. Set the schedule (recommended: daily during off-peak hours)
5. Save and activate the task
Retention Policy Considerations
- Balance storage costs against troubleshooting and audit needs
- Regulatory requirements may mandate minimum retention periods for clinical data messages
- Consider longer retention for error messages (useful for trend analysis)
- Production environments typically retain 30-90 days; development environments may use shorter periods
- Monitor database size trends to validate that purge policies are effective
Archival Strategies
- Export messages to external storage before purging for long-term retention
- Use the Message Viewer export function to save specific message sessions
- Implement automated archival using a custom Business Operation that writes to archive storage
- Maintain an archive index for retrieval of archived messages when needed
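The archive-index idea above can be sketched in Python. This is a minimal illustration, not a HealthShare API: the class name, JSON layout, and field names are all assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class ArchiveIndex:
    """Minimal archive index sketch: maps message IDs to archive file
    locations so purged messages can still be located for audit requests."""

    def __init__(self, index_path):
        self.index_path = Path(index_path)
        self.entries = {}
        if self.index_path.exists():
            self.entries = json.loads(self.index_path.read_text())

    def record(self, message_id, archive_file, session_id=None):
        # Store just enough metadata to answer "where did message X go?"
        self.entries[str(message_id)] = {
            "archive_file": archive_file,
            "session_id": session_id,
            "archived_at": datetime.now(timezone.utc).isoformat(),
        }

    def lookup(self, message_id):
        return self.entries.get(str(message_id))

    def save(self):
        self.index_path.write_text(json.dumps(self.entries, indent=2))
```

In practice the index would be written by whatever process exports messages (e.g., a custom Business Operation) immediately before the purge task runs.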
---
2. Data Volume and System Integrity
Key Points
- ^%GSIZE Utility: Reports the size of globals (database tables) to identify large and growing data stores
- Journal Management: Journals record all database changes for recovery; manage journal file retention and space
- ^Integrity Utility: Verifies database structural integrity by checking block-level consistency
- Database Compaction: Reclaims unused space after data deletion or purging
- Proactive Monitoring: Regular checks prevent storage exhaustion and data corruption
Detailed Notes
^%GSIZE for Global Sizing
The ^%GSIZE utility reports the size of globals (the underlying storage structures for database tables) in a namespace. This is essential for identifying which data stores are consuming the most space and tracking growth over time.
Running ^%GSIZE:
1. Open a Terminal session to the target namespace and execute: do ^%GSIZE
2. The utility reports each global's allocated and used blocks, along with total size
3. Review the output to identify the largest globals and compare results over time to track growth trends
Key globals to monitor in UCR:
- Ens.MessageHeaderD: Message header storage (grows with message volume)
- HS.SDA3.*: SDA streamlet storage (grows with clinical data volume)
- IRIS.Temp.*: Temporary data that should be cleaned periodically
- HS.IHE.XDS.*: Document registry metadata
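Since ^%GSIZE reports only current sizes, growth tracking means comparing snapshots over time. A minimal Python sketch, assuming sizes have already been collected into dated samples (the global names and MB units are illustrative):

```python
def growth_trend(samples):
    """Compute per-global growth between the first and last snapshot.

    samples: list of (date_label, {global_name: size_mb}) tuples,
    oldest first, e.g. captured from periodic ^%GSIZE runs.
    Returns {global_name: delta_mb}.
    """
    first, last = samples[0][1], samples[-1][1]
    # Include globals that appear in either snapshot (new or deleted globals).
    return {g: last.get(g, 0) - first.get(g, 0) for g in set(first) | set(last)}

samples = [
    ("2024-01-01", {"Ens.MessageHeaderD": 1000, "HS.SDA3.Streamlet": 500}),
    ("2024-02-01", {"Ens.MessageHeaderD": 1400, "HS.SDA3.Streamlet": 520}),
]
delta = growth_trend(samples)
# Ens.MessageHeaderD grew 400 MB: a purge-policy review candidate.
```

Sorting the deltas descending gives a quick "fastest-growing globals" list to validate that purge policies are actually effective.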
Journal Management
Journals record every database modification operation (inserts, updates, deletes) and are essential for database recovery. However, journal files consume significant disk space.
- Journal File Location: Configured during system installation; should be on a separate disk from databases
- Retention Settings: Configure how many days of journal files to retain (System Administration > Configuration > Journal Settings)
- Purge Schedule: Set up automatic journal file purging to remove files older than the retention period
- Space Monitoring: Monitor the journal directory for available disk space; journal disk exhaustion causes system halts
- Backup Coordination: Journal purging should be coordinated with backup schedules to ensure recovery point coverage
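Because journal disk exhaustion halts the system, free space on the journal volume is worth checking from an external monitor as well. A Python sketch (the 10 GB threshold is an illustrative assumption; set it from your journal growth rate):

```python
import shutil

def journal_space_alert(journal_dir, min_free_gb=10.0):
    """Return (free_gb, alert) for the journal directory.

    Alert well before the volume fills: journal disk exhaustion
    causes a system halt, not a graceful degradation.
    """
    usage = shutil.disk_usage(journal_dir)
    free_gb = usage.free / 1024**3
    return free_gb, free_gb < min_free_gb
```

A cron job calling this and raising a ticket or page when `alert` is true is a common pattern alongside the built-in monitors.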
^Integrity for Database Checks
The ^Integrity utility verifies the structural integrity of database files by checking block-level consistency.
Running ^Integrity:
1. Open a Terminal session
2. Execute: do ^Integrity
3. Select the database(s) to check
4. The utility reports any block-level inconsistencies or corruption
5. Schedule integrity checks during maintenance windows (the check can be resource-intensive)
When to run integrity checks:
- After unexpected system shutdowns or power failures
- After disk hardware events or storage migrations
- As part of regular preventive maintenance (monthly or quarterly)
- When data corruption is suspected based on application errors
Database Compaction
After large purge operations or data deletions, databases may have unused space that is not automatically returned to the operating system. Use database compaction (accessible via System Operation > Databases) to reclaim unused space. Compaction can be run online but may impact performance; schedule it during maintenance windows. Monitor database file sizes before and after compaction to verify space reclamation.
---
3. Performance Troubleshooting
Key Points
- OpenMetrics Endpoint: `/api/monitor/metrics` exposes real-time system metrics in Prometheus format
- ^SystemPerformance: Collects comprehensive system performance data over a configurable period
- Bottleneck Identification: Use metrics to identify CPU, memory, disk I/O, and cache bottlenecks
- Cache Efficiency: Monitor global buffer cache hit ratio to assess database cache effectiveness
- Trend Analysis: Compare performance data over time to detect degradation
Detailed Notes
OpenMetrics (/api/monitor/metrics)
The OpenMetrics endpoint provides real-time system metrics in a format compatible with Prometheus and other monitoring systems.
Accessing OpenMetrics:
- URL: http://<server>:<port>/api/monitor/metrics
- Returns metrics in Prometheus text format
- Metrics include: CPU utilization, memory usage, process counts, license usage, database cache statistics, and Ensemble-specific metrics
- Can be scraped by external monitoring tools (Prometheus, Grafana) for dashboards and alerting
Key metrics to monitor:
- iris_cpu_usage: CPU utilization percentage
- iris_cache_efficiency: Global buffer cache hit ratio (should be above 95%)
- iris_process_count: Number of active processes
- iris_disk_reads_per_sec: Physical disk reads (high values indicate cache inefficiency)
- iris_journal_space: Available journal disk space
- iris_ens_queue_count: Ensemble production queue depths
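Metrics from the endpoint arrive as Prometheus text lines (name, optional labels, value). A minimal Python parser sketch for ad hoc checks; labels and HELP/TYPE metadata are ignored for brevity, and exact metric names vary by IRIS version:

```python
import re

def parse_prometheus_text(text):
    """Parse Prometheus text-format metrics into {name: value}.

    Simplification: label sets are discarded, so a metric repeated
    with different labels keeps only the last value seen.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        m = re.match(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(3))
    return metrics
```

In a real check you would fetch the text with any HTTP client from http://<server>:<port>/api/monitor/metrics and then compare values such as the cache hit ratio against your thresholds; scraping with Prometheus itself avoids hand-rolled parsing entirely.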
^SystemPerformance Reports
The ^SystemPerformance utility collects comprehensive performance data over a configurable sampling period and generates a detailed report.
Running ^SystemPerformance:
1. Open a Terminal session
2. Execute: do ^SystemPerformance
3. Specify the output file path
4. Specify the sampling duration (e.g., 300 seconds) and interval (e.g., 5 seconds)
5. The utility collects system metrics at each interval
6. After completion, review the generated report
Report contents:
- CPU utilization breakdown (user, system, idle)
- Memory usage and paging statistics
- Disk I/O rates (reads/writes per second, latency)
- Global buffer cache performance (hit ratio, evictions)
- Lock table utilization
- Process activity and wait states
- Database journal write rates
Identifying Bottlenecks
- CPU Bottleneck: Sustained high CPU utilization (>80%) with slow response times
- Memory Bottleneck: High paging rates, low available memory, cache evictions
- Disk I/O Bottleneck: High disk latency (>10ms), high read rates with low cache hit ratio
- Cache Inefficiency: Global buffer cache hit ratio below 95% indicates the cache is too small or working set is too large
- Queue Buildup: Growing Ensemble queue depths indicate that Business Operations cannot keep up with inbound message volume
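The rules of thumb above can be expressed as a simple triage function; a Python sketch with illustrative metric keys and the thresholds quoted in these notes (tune both for your environment):

```python
def classify_bottlenecks(m):
    """Apply rule-of-thumb thresholds to a dict of sampled metrics.

    m: e.g. {"cpu_pct": 90, "cache_hit_ratio": 92.0,
             "disk_latency_ms": 4, "queue_depth": 120, "queue_trend": "growing"}
    Returns a list of human-readable findings (empty list = no flags).
    """
    findings = []
    if m.get("cpu_pct", 0) > 80:
        findings.append("CPU bottleneck: sustained utilization above 80%")
    if m.get("cache_hit_ratio", 100) < 95:
        findings.append("Cache inefficiency: hit ratio below 95%")
    if m.get("disk_latency_ms", 0) > 10:
        findings.append("Disk I/O bottleneck: latency above 10 ms")
    if m.get("queue_depth", 0) > 0 and m.get("queue_trend") == "growing":
        findings.append("Queue buildup: operations not keeping up with inbound volume")
    return findings
```

Note that findings often cascade: a small cache forces physical reads, which shows up as disk I/O latency, so treat the output as starting points, not verdicts.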
---
4. Event Log Interpretation
Key Points
- Event Log Access: Available at Ensemble > Event Log in the Management Portal
- Event Types: Error, Warning, Info, Trace, and Assert entries
- Filtering and Searching: Filter by date range, component, severity, and text content
- Common Error Patterns: Connection failures, transformation errors, validation failures, timeout errors
- Correlation: Correlate Event Log entries with Visual Trace sessions for complete diagnosis
Detailed Notes
Accessing the Event Log
1. Open the Management Portal for the target namespace (Edge Gateway or Hub)
2. Navigate to Ensemble > Event Log (or Interoperability > Event Log in newer versions)
3. The Event Log displays entries in reverse chronological order
4. Use the filter controls to narrow the view
Event Types
- Error: Indicates a failure that prevented normal operation (e.g., connection refused, DTL compilation error, validation failure). Requires investigation and resolution.
- Warning: Indicates a potential problem that did not prevent operation but may lead to issues (e.g., slow response, retry succeeded after initial failure). Should be monitored.
- Info: Informational entries about normal operations (e.g., production started, component connected, configuration loaded). Useful for understanding system activity.
- Trace: Detailed diagnostic entries generated when trace logging is enabled. Used for in-depth troubleshooting. Not normally enabled in production due to volume.
- Assert: Internal consistency check entries. Rare; indicate unexpected internal conditions.
Filtering and Searching
- Date Range: Limit entries to a specific time window
- Component: Filter by source component name (e.g., "HL7FileService", "DTLProcess", "MPIOperation")
- Severity: Show only errors, or errors and warnings
- Text Search: Search entry text for specific keywords (e.g., "timeout", "connection", "DTL", patient MRN)
- Session ID: Find all entries related to a specific message session
Common Error Patterns in UCR
- "Connection refused": The target system is not listening on the configured port; check network connectivity and target service status
- "DTL Transform error": A Data Transformation Language transform encountered an error; check the DTL class for compilation errors or data mismatches
- "Validation error": An incoming message failed schema validation; check the message structure against the expected schema
- "Timeout": A component did not respond within the configured timeout period; check the target system's responsiveness and adjust timeout settings
- "MPI match error": The MPI encountered an error during patient matching; check the matching algorithm configuration and input data quality
- "Consent denied": A data access request was denied by consent policy; verify the consent configuration and patient's consent status
---
5. Alert Message Interpretation
Key Points
- Alert Sources: Alerts are generated by production components, system monitors, and scheduled tasks
- Severity Levels: Alerts range from informational notifications to critical system failures
- AlertOnError Setting: Production components can be configured to generate alerts when errors occur
- Alert Routing: Alerts can be routed to email, the management console, or custom handlers
- Escalation: Unresolved alerts may escalate based on configured escalation policies
Detailed Notes
Alert Sources
- Production Components: Business Services, Processes, and Operations generate alerts when configured with AlertOnError or AlertGracePeriod settings
- System Monitors: Built-in monitors for disk space, license usage, database growth, and process limits
- Scheduled Tasks: Tasks that fail or produce warnings generate task-level alerts
- Custom Alert Generators: Business rules or custom code can generate alerts based on application-specific conditions
Alert Severity Levels
- Info: Notification of normal but noteworthy events (e.g., scheduled maintenance completed)
- Warning: Conditions that may require attention but are not immediately critical (e.g., disk space below 20%)
- Error: Component failures or processing errors that affect data flow (e.g., connection lost to source system)
- Critical: System-level failures that require immediate intervention (e.g., database corruption, license exhausted)
AlertOnError Configuration
- Each production component has an AlertOnError setting (true/false)
- When enabled, the component generates an alert whenever it encounters an error
- AlertGracePeriod defines the minimum interval between repeated alerts for the same error (prevents alert flooding)
- Configure AlertOnError on critical components (data feed services, MPI operations) and leave it off for non-critical components
- Access the setting through the component configuration page in the Production Configuration
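The AlertGracePeriod behavior (suppressing repeats of the same error within an interval) can be sketched in Python; the class and method names here are hypothetical, purely to illustrate the suppression logic:

```python
class GraceAlerter:
    """Suppress repeated alerts for the same error within a grace period,
    mimicking the AlertGracePeriod idea described above."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.last_sent = {}  # error key -> timestamp of the last alert sent

    def should_alert(self, error_key, now):
        """now: monotonic timestamp in seconds."""
        last = self.last_sent.get(error_key)
        if last is not None and now - last < self.grace:
            return False  # still inside the grace period: suppress the repeat
        self.last_sent[error_key] = now
        return True
```

Note the design choice: suppressed repeats do not reset the timer, so a continuously failing component still re-alerts once per grace period rather than going silent forever.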
Responding to Alerts
1. Review the alert message for the component name, error description, and timestamp
2. Check the Event Log for additional context around the alert time
3. Open Visual Trace for the affected message session (if applicable)
4. Diagnose and resolve the root cause
5. Verify that the component has recovered (check the Production Monitor for green status)
6. Document the alert, root cause, and resolution for operational knowledge
Alert Escalation
- Configure escalation policies that promote alerts to higher severity if not acknowledged within a defined period
- Escalation can trigger additional notification channels (e.g., from email to SMS/pager)
- Tune alert thresholds and grace periods to reduce noise while ensuring critical issues are not missed
---
6. Production Monitor
Key Points
- Accessing Production Monitor: Available at Ensemble > Production Monitor in the Management Portal
- Component Status Indicators: Green (running), yellow (warning/disabled), red (error/stopped)
- Queue Depth Monitoring: View the number of queued messages for each component to detect backlogs
- Throughput Analysis: Monitor messages processed per time period for each component
- Identifying Failing Components: Quickly locate components in error state and access their Event Log entries
Detailed Notes
Accessing the Production Monitor
1. Open the Management Portal for the target namespace
2. Navigate to Ensemble > Production Monitor (or Interoperability > Monitor)
3. The Production Monitor displays all production components organized by type (Services, Processes, Operations)
4. The monitor refreshes automatically at a configurable interval
Component Status Indicators
- Green: The component is running normally and processing messages
- Yellow: The component is disabled (intentionally stopped) or experiencing warnings
- Red: The component has encountered an error and is not processing messages
- Gray: The component is not configured or has never been started
- Click on any component to access its detailed status, configuration, and recent Event Log entries
Queue Depth Monitoring
- Each component displays its current queue depth (number of messages waiting to be processed)
- A growing queue indicates that the component is falling behind the incoming message rate
- Normal queue depth is zero or near-zero for most components
- Consistently high queue depths may indicate: slow downstream systems, resource constraints, configuration issues, or increased inbound volume
- Monitor queue trends over time to distinguish between temporary spikes and sustained backlogs
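Distinguishing a temporary spike from a sustained backlog is easy to express over sampled queue depths. A Python sketch; the threshold and window values are illustrative assumptions:

```python
def sustained_backlog(queue_samples, threshold=10, window=5):
    """True only if the last `window` queue-depth samples all exceed
    `threshold` -- a lone spike that drains again does not trigger."""
    recent = queue_samples[-window:]
    return len(recent) == window and all(q > threshold for q in recent)
```

With one sample per minute, `window=5` means "backed up for five straight minutes", which filters out the momentary bursts that follow any large inbound batch.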
Throughput Analysis
- The Production Monitor shows message counts and processing rates for each component
- Compare throughput across components to identify bottlenecks in the processing chain
- A component with high input throughput but low output throughput is a processing bottleneck
- Use throughput data to plan capacity and identify components that need performance tuning
- Track throughput trends to detect changes in data volume from source systems
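The "high input throughput, low output throughput" test can be sketched over per-component rate samples. The data shape below is an assumption for illustration, not a HealthShare API:

```python
def find_bottlenecks(throughput):
    """Flag components whose output rate lags their input rate.

    throughput: {component_name: (msgs_in_per_min, msgs_out_per_min)}
    Returns component names sorted by the in/out gap, worst first.
    """
    gaps = {c: inp - out for c, (inp, out) in throughput.items() if inp > out}
    return sorted(gaps, key=gaps.get, reverse=True)

rates = {
    "HL7FileService": (100, 100),
    "DTLProcess": (100, 60),   # processing bottleneck
    "MPIOperation": (60, 55),
}
```

Here `find_bottlenecks(rates)` would put DTLProcess first: it accepts 100 messages per minute but emits only 60, so its queue must be growing.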
Identifying and Diagnosing Failing Components
1. Look for red status indicators in the Production Monitor
2. Click on the failing component to view its error details
3. Check the component's Event Log entries for error messages
4. Verify the component's configuration (endpoint URLs, credentials, timeout settings)
5. Test connectivity to external systems if the component communicates with external endpoints
6. Restart the component after resolving the issue
7. Monitor the component to confirm it returns to green status and processes queued messages
---
7. Business Rule Log
Key Points
- Purpose: The Business Rule Log records every decision made by business rules during message processing
- Accessing the Log: Available at Ensemble > Business Rule Log in the Management Portal
- Rule Evaluation Records: Each entry shows which rule was evaluated, the conditions tested, and the action taken
- Debugging Routing Logic: Use the log to understand why a message was routed to a specific path
- Filtering: Filter by rule name, date range, session ID, and result
Detailed Notes
Purpose of the Business Rule Log
The Business Rule Log provides an audit trail of every business rule evaluation that occurs during production message processing. Business rules control message routing, transformation selection, and conditional processing logic. When troubleshooting unexpected routing behavior or trying to understand why a message was handled in a particular way, the Business Rule Log is the definitive source of information.
Accessing the Business Rule Log
1. Open the Management Portal for the target namespace
2. Navigate to Ensemble > Business Rule Log (or Interoperability > Business Rule Log)
3. The log displays rule evaluation entries in reverse chronological order
4. Each entry includes: timestamp, rule name, session ID, conditions evaluated, and action taken
Interpreting Rule Evaluation Results
- Rule Name: The business rule class that was evaluated
- Session ID: Links the rule evaluation to a specific message session (use this to cross-reference with Visual Trace)
- Conditions Evaluated: Shows each condition in the rule and whether it evaluated to true or false
- Action Taken: The action that was executed based on the rule evaluation (route to target, transform, discard, etc.)
- Return Value: The final result of the rule evaluation (accept, reject, continue)
Debugging Routing Rule Logic
When a message is not being routed as expected:
1. Find the message's session ID from the Message Viewer
2. Search the Business Rule Log for entries with that session ID
3. Review the conditions that were evaluated and their true/false results
4. Identify which condition caused the unexpected routing decision
5. Check the condition logic: are the field references correct? Are the comparison values accurate?
6. Modify the business rule to correct the logic
7. Reprocess a test message and verify the corrected routing
Common Business Rule Issues in UCR
- Incorrect Field Reference: The rule references a message field that does not exist or has a different path than expected
- Value Mismatch: The rule compares against a hardcoded value that does not match the actual message content (case sensitivity, whitespace)
- Missing Default Rule: No default action is defined for messages that do not match any specific condition
- Rule Priority: Rules are evaluated in order; a higher-priority rule may match before the intended rule
- Stale Conditions: Rule conditions reference values that have changed (e.g., facility codes, message types)
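The rule-priority and missing-default pitfalls are easy to demonstrate with a first-match-wins evaluator, which mirrors how rules are tried in order. A Python sketch (rule names, predicates, and actions are illustrative):

```python
def evaluate_rules(rules, message):
    """First-match-wins evaluation: a broad rule placed earlier can
    shadow a more specific rule below it.

    rules: list of (name, predicate, action), evaluated in order.
    Returns (matched_rule_name, action).
    """
    for name, predicate, action in rules:
        if predicate(message):
            return name, action
    # Missing-default pitfall: make the fallback explicit and visible.
    return "default", "discard"

rules = [
    ("AllADT",  lambda m: m["type"].startswith("ADT"), "route:Hub"),
    ("ADT_A01", lambda m: m["type"] == "ADT_A01",      "route:MPI"),
]
```

With this ordering an ADT_A01 message matches the broad "AllADT" rule first and never reaches the more specific "ADT_A01" rule, which is exactly the shadowing behavior to look for in the Business Rule Log.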
Filtering the Business Rule Log
- Date Range: Limit entries to a specific time period
- Rule Name: Filter by the specific business rule class
- Session ID: Show all rule evaluations for a specific message session
- Result: Filter by action taken (useful for finding all "discard" or "reject" decisions)
- Component: Filter by the Business Process that executed the rule
---
Exam Preparation Summary
Critical Concepts to Master
- Message Purge Configuration: Know how to configure Ens.Util.Tasks.Purge with retention periods and understand the impact of unpurged messages
- ^%GSIZE and ^Integrity: Understand when and how to use these utilities for data volume management and integrity verification
- OpenMetrics and ^SystemPerformance: Know the key metrics to monitor and how to interpret performance reports
- Event Log Event Types: Distinguish between Error, Warning, Info, Trace, and Assert entries and know their significance
- AlertOnError and AlertGracePeriod: Understand how component alerts are configured and how escalation works
- Production Monitor Status Colors: Know what green, yellow, red, and gray indicate and how to respond to each
- Business Rule Log: Be able to find rule evaluation entries and interpret conditions and actions to debug routing logic
Common Exam Scenarios
- Configuring a message purge task with appropriate retention periods for a production UCR environment
- Running ^%GSIZE to identify which globals are consuming the most space and recommending actions
- Interpreting a ^SystemPerformance report to identify a disk I/O bottleneck
- Analyzing Event Log entries to diagnose a data feed failure at an Edge Gateway
- Configuring AlertOnError on critical production components and setting appropriate grace periods
- Using the Production Monitor to identify a component with growing queue depth and diagnosing the cause
- Debugging a routing rule using the Business Rule Log to determine why messages are being sent to the wrong target
- Diagnosing a cache efficiency problem using OpenMetrics data and recommending solutions
Hands-On Practice Recommendations
- Configure and run Ens.Util.Tasks.Purge in a development environment; verify messages are removed after the retention period
- Run ^%GSIZE in a UCR namespace and identify the largest globals; correlate them with production components
- Execute ^SystemPerformance and review the report, focusing on cache efficiency and disk I/O
- Generate errors intentionally and practice finding and interpreting Event Log entries
- Configure AlertOnError on a Business Service, trigger an alert, and observe it in the Management Console
- Monitor a running production using the Production Monitor; stop a component and observe the status change
- Create a business rule, process test messages, and use the Business Rule Log to trace rule evaluation decisions
- Access the OpenMetrics endpoint and identify key metrics for system health monitoring