T3.1: Monitors and Troubleshoots UCR Components

Knowledge Review - HealthShare UCR Deployment Specialist

1. Configuring System Alerts

Key Points

  • System alerts: Platform-level alerts from InterSystems IRIS (license limits, database issues, mirror failover)
  • Production alerts: Application-level alerts from business hosts via AlertOnError setting
  • Custom alerts: Alerts generated by custom business logic or business rules
  • Alert recipients: Configured via email, SMS, or integration with enterprise monitoring systems
  • Alerting Manager: Central configuration point in the Management Portal for defining alert rules

Detailed Notes

Alert Types in InterSystems IRIS

InterSystems IRIS supports multiple categories of alerts that a UCR Deployment Specialist must understand and configure:

  • System Alerts: Generated by the IRIS platform itself. These include license capacity warnings, database space alerts, journal file growth notifications, and mirror/failover status changes. System alerts are configured through the Management Portal under System Administration > Configuration > Additional Settings > Alerts.
  • Production Alerts: Generated by production business hosts when AlertOnError is enabled. When a business service, process, or operation encounters an error, it sends an alert message to the configured alert operation. These are the most common alerts in a UCR deployment.
  • Custom Alerts: Generated programmatically within custom business logic. Developers can call the SendAlert() method from within any business host to raise an alert based on application-specific conditions (e.g., patient matching confidence below threshold, document count exceeding expected volume).

Configuring Alert Recipients

Alert delivery requires configuring an alert operation in the production:

  • Email Alerts: Configure an alert operation using EnsLib.EMail.AlertOperation or a similar class. This requires SMTP server settings (host, port, credentials) and recipient email addresses. Multiple recipients can be specified for distribution to on-call teams.
  • SMS Alerts: Route alerts through an SMS gateway by configuring an HTTP-based alert operation that calls an SMS provider's API. This is typically used for critical, after-hours alerts.
  • Enterprise Monitoring Integration: Forward alerts to platforms such as Nagios, PagerDuty, or ServiceNow by configuring an alert operation that calls the monitoring system's API.

Alert Escalation

Alert escalation ensures that unacknowledged alerts reach progressively higher levels of support:

  • Configure AlertGracePeriod on each business host to prevent alert flooding. This setting specifies the minimum number of seconds between consecutive alerts from the same component.
  • AlertRetryGracePeriod separately controls alerting during retry cycles, preventing an alert storm when a business operation is retrying a failed connection.
  • Escalation can be implemented through the monitoring system (e.g., PagerDuty escalation policies) or by configuring multiple alert operations with different recipients and grace periods.
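The grace-period throttling described above can be sketched in Python. This is a hypothetical illustration of the behavior, not the IRIS implementation — IRIS applies AlertGracePeriod internally per business host:

```python
import time

class AlertThrottle:
    """Suppress repeat alerts from the same component within a grace period,
    mimicking the effect of AlertGracePeriod (sketch only, not an IRIS API)."""

    def __init__(self, grace_period_seconds):
        self.grace = grace_period_seconds
        self.last_sent = {}  # component name -> timestamp of last alert sent

    def should_send(self, component, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(component)
        if last is not None and (now - last) < self.grace:
            return False  # still inside the grace period: suppress
        self.last_sent[component] = now
        return True
```

With a 300-second grace period, a second alert from the same component 100 seconds later is suppressed, while a different component alerts independently.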

Configuring Alert Rules for UCR Components

In a UCR federation, alert configuration should cover the following critical components:

  • Hub: Alert on MPI processing errors, failed patient registrations, and ECR storage failures
  • Edge Gateways: Alert on inbound data feed failures, SDA transformation errors, and Hub connectivity loss
  • Access Gateways: Alert on Clinical Viewer query failures and failed document retrieval
  • Bus: Alert on routing failures and message delivery errors between components
  • Audit Edge: Alert on audit event processing failures, which can represent compliance violations

---

2. Diagnosing Common Production Issues

Key Points

  • Production Monitor: Real-time dashboard showing component states, queue depths, and error counts
  • Component states: Enabled (running), Disabled (gray), Error (red indicator), Retry (attempting reconnection)
  • Connection failures: Distinguish timeout vs connection refused vs DNS resolution errors
  • Throughput monitoring: Track messages processed per unit time; compare against baseline
  • Suspended messages: Messages that exhausted retries and require manual intervention

Detailed Notes

Production Monitor Overview

The Production Monitor is accessed from Management Portal > Interoperability > Monitor. It provides a real-time view of every business host in the production, displaying:

  • Component name and class
  • Status indicator (running, stopped, disabled, error)
  • Queue depth (messages waiting to be processed)
  • Error count since last restart
  • Last activity timestamp

For a UCR deployment, the Deployment Specialist should monitor productions across multiple namespaces: the Hub namespace, each Edge Gateway namespace, each Access Gateway namespace, and the Bus namespace.

Identifying Component States

Understanding component states is essential for rapid diagnosis:

  • Enabled/Running (green): The component is active and processing messages normally
  • Disabled (gray): The component is disabled in the production configuration and will not start when the production starts. This may be intentional (maintenance) or accidental (misconfiguration)
  • Error (red with error indicator): The component encountered an error that prevented normal processing. Check the Event Log for details
  • Retry (yellow/orange): The component is in a retry cycle, attempting to re-establish a connection or re-process a failed operation. The RetryInterval and FailureTimeout settings control retry behavior

Connection Failure Diagnosis

Connection failures between UCR components are among the most common production issues. Differentiating the type of failure is critical:

  • Connection Timeout: The target system did not respond within the configured timeout period. This may indicate network latency, firewall blocking, or the target system being under heavy load. Check network connectivity and firewall rules.
  • Connection Refused: The target system actively rejected the connection. This typically means the target service is not running on the expected port, or the port has changed. Verify the target system's service status and port configuration.
  • DNS Resolution Failure: The hostname of the target system could not be resolved to an IP address. Check DNS configuration, /etc/hosts entries, and network DNS server availability.
  • SSL/TLS Handshake Failure: The secure connection could not be established. Check certificate validity, trust chain configuration, and TLS version compatibility.
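A small triage helper can map low-level exceptions to the four failure categories above. This is an illustrative Python sketch of the classification logic, not an IRIS API:

```python
import socket
import ssl

def classify_connection_error(exc):
    """Map a low-level exception to a diagnosis category, mirroring the
    connection failure types listed above (illustrative sketch)."""
    if isinstance(exc, socket.gaierror):
        return "dns-resolution-failure"   # hostname could not be resolved
    if isinstance(exc, ConnectionRefusedError):
        return "connection-refused"       # nothing listening on the port
    if isinstance(exc, socket.timeout):
        return "connection-timeout"       # no response within the timeout
    if isinstance(exc, ssl.SSLError):
        return "tls-handshake-failure"    # certificate or protocol problem
    return "unknown"
```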

Throughput Monitoring and Bottleneck Identification

Throughput issues manifest as growing queue depths and delayed data processing:

  • Monitor queue depths in the Production Monitor. A consistently growing queue indicates the component cannot keep up with incoming message volume.
  • Check PoolSize settings. Increasing PoolSize allows a component to process multiple messages concurrently, but only if the downstream system can handle the increased load.
  • Identify bottlenecks by tracing the message flow. If an Edge Gateway's queue is growing, check whether the issue is at the Edge itself or at the Hub it is feeding.
  • Use the Message Viewer to compare message creation timestamps with completion timestamps to measure processing latency.
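The timestamp comparison in the last step amounts to the following (a sketch; the timestamp format is an assumption about how the values are exported from the Message Viewer):

```python
from datetime import datetime

def processing_latency_seconds(created, completed, fmt="%Y-%m-%d %H:%M:%S"):
    """Latency between message creation and completion, as read off the
    Message Viewer's two timestamp columns (format assumed)."""
    t0 = datetime.strptime(created, fmt)
    t1 = datetime.strptime(completed, fmt)
    return (t1 - t0).total_seconds()
```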

Queue Depth Analysis and Suspended Messages

Queue depth is a key health indicator for each component:

  • Normal: Queue depth fluctuates but stays low. The component processes messages as fast as they arrive.
  • Growing queue + recent activity: The component is processing but too slowly. Consider increasing PoolSize or optimizing processing logic.
  • Growing queue + no activity: The component may be hung or waiting on an external resource. Restart the component or investigate the external dependency.
  • Suspended messages: When a business operation keeps failing for longer than its FailureTimeout (retrying every RetryInterval seconds), the message is suspended. Suspended messages appear in the Message Viewer with Status = Suspended and must be addressed via the Resend Editor after the root cause is resolved.
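The triage rules above can be expressed as a small classifier. The function and thresholds are hypothetical — the point is the decision logic, not an IRIS interface:

```python
def classify_queue(depth_samples, last_activity_age_seconds, active_threshold=60):
    """Rough triage of a component's queue health, following the rules above.
    depth_samples: recent queue-depth readings, oldest first (sketch only)."""
    growing = len(depth_samples) >= 2 and depth_samples[-1] > depth_samples[0]
    active = last_activity_age_seconds < active_threshold
    if not growing:
        return "normal"
    if active:
        return "slow-processing"   # consider raising PoolSize or optimizing logic
    return "possibly-hung"         # investigate the external dependency or restart
```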

---

3. Troubleshooting Performance Issues

Key Points

  • OpenMetrics APIs: RESTful endpoints exposing IRIS performance metrics in Prometheus-compatible format
  • ^SystemPerformance: Utility that collects system-wide performance data over a defined period
  • Key performance indicators: CPU utilization, memory consumption, disk I/O throughput, network latency
  • Slow business operations: Identified by high average processing time in the Production Monitor
  • Baseline establishment: Capture performance metrics during normal operation for comparison during incidents

Detailed Notes

OpenMetrics APIs for Monitoring

InterSystems IRIS provides OpenMetrics-compatible API endpoints that expose performance metrics:

  • Accessible via HTTP at the /api/monitor/metrics endpoint on the IRIS web server
  • Metrics are formatted in Prometheus exposition format, allowing integration with Prometheus, Grafana, and other monitoring tools
  • Key metrics include: process count, global references per second, routine commands per second, journal write operations, lock table usage, and cache efficiency
  • For UCR deployments, configure each component (Hub, Edge Gateways, Access Gateways) to expose its metrics endpoint, and aggregate them in a central monitoring dashboard
  • The OpenMetrics endpoint supports continuous scraping, complementing the fixed-duration collection runs of ^SystemPerformance
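A minimal parser for the Prometheus exposition format returned by /api/monitor/metrics might look like this. The metric names in the sample are illustrative, and the parser ignores edge cases (such as spaces inside label values):

```python
def parse_prometheus_text(text):
    """Minimal parser for the Prometheus exposition format: one
    'name{labels} value' pair per line, '#' lines are comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics

sample = """\
# HELP iris_glo_ref_per_sec Global references per second
iris_glo_ref_per_sec 12875
iris_process_count{id="IRIS"} 142
"""
parsed = parse_prometheus_text(sample)
```

A scraper would fetch the endpoint over HTTP and feed the response body to the same function; tools such as Prometheus do this natively.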

^SystemPerformance Utility

The ^SystemPerformance utility is a built-in tool for collecting comprehensive performance data:

  • Running the utility: From the Terminal in the %SYS namespace, enter do ^SystemPerformance. The interactive menu lets you define collection profiles (output directory, sampling interval in seconds, number of intervals) and run them.
  • Example: set rc=$$addprofile^SystemPerformance("5min","Test profile",5,60) creates a profile that samples every 5 seconds for 60 intervals (5 minutes total); set runid=$$run^SystemPerformance("5min") then starts a collection using that profile
  • Output: Generates an HTML report containing CPU, memory, disk I/O, process activity, global buffer utilization, journal statistics, and lock contention data
  • Interpretation: The report includes time-series charts and summary tables. Look for spikes in CPU, sustained high disk I/O, low cache hit ratios, or lock contention as indicators of performance problems
  • Best practice: Run ^SystemPerformance during both normal operation (to establish a baseline) and during an incident (to capture the problem state), then compare the two reports

Key Performance Indicators for UCR

When troubleshooting UCR performance, focus on these indicators:

  • CPU Utilization: High CPU may indicate inefficient transformations, excessive XPath evaluations in SDA processing, or resource contention among UCR components sharing the same server
  • Memory Consumption: Monitor shared memory (global buffers, routine buffers) and per-process memory. Memory pressure can cause swapping and severe performance degradation
  • Disk I/O: Journal writes, database reads/writes, and temporary file operations all contribute to disk I/O. High disk wait times indicate storage bottlenecks
  • Network Latency: UCR components communicate over the network (SOAP/HTTP between Edge Gateways and Hub). Network latency directly affects message round-trip time
  • Cache Efficiency: The global buffer hit ratio indicates how often data is found in memory vs requiring a disk read. A ratio below 95% suggests the buffer pool may be too small
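The hit-ratio arithmetic behind that last indicator is straightforward; the 95% threshold is the rule of thumb stated above, not a hard limit:

```python
def buffer_hit_ratio(global_refs, disk_reads):
    """Global buffer hit ratio: the fraction of global references satisfied
    from memory rather than requiring a physical disk read."""
    if global_refs == 0:
        return 1.0  # no references, nothing missed
    return (global_refs - disk_reads) / global_refs

def buffer_pool_too_small(global_refs, disk_reads, threshold=0.95):
    return buffer_hit_ratio(global_refs, disk_reads) < threshold
```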

Identifying Slow Business Operations

To find slow-performing business operations in a UCR production:

  • In the Production Monitor, examine the average processing time displayed for each component
  • Use the Message Viewer to filter by a specific business operation and examine the time difference between message creation and completion
  • Check Visual Trace for individual messages to see where processing time is spent across multiple components
  • Common causes of slow operations: large message payloads (e.g., CDA documents with embedded images), complex DTL transformations, external system response time, insufficient PoolSize

Performance Baseline Establishment

A performance baseline provides a reference point for detecting degradation:

  • Run ^SystemPerformance during representative normal workload conditions
  • Record typical queue depths, processing times, and throughput rates for each component
  • Document expected daily and weekly traffic patterns (e.g., higher volume during business hours, batch feeds overnight)
  • Store baseline data for comparison when performance issues are reported
  • Re-establish the baseline after significant configuration changes, upgrades, or workload changes

---

4. Audit Logging Configuration

Key Points

  • ATNA: Audit Trail and Node Authentication; IHE profile for healthcare audit logging
  • Audit Edge Gateway: Central audit repository that collects audit events from all UCR components
  • Audit event types: Patient record access, document creation, user authentication, consent changes
  • Per-component enablement: Each UCR component must be configured to send audit events to the Audit Edge
  • Retention policies: Audit logs must meet regulatory retention requirements (HIPAA requires a minimum of 6 years; organizational policy may require longer)

Detailed Notes

ATNA Overview

ATNA (Audit Trail and Node Authentication) is an IHE (Integrating the Healthcare Enterprise) integration profile that defines:

  • A standardized format for recording audit events in healthcare information systems
  • Requirements for secure node-to-node communication using TLS certificates
  • The concept of an Audit Record Repository (ARR) that centrally stores audit events from multiple systems

In a UCR deployment, the ATNA profile ensures that all access to patient data is recorded in a tamper-evident audit trail.

Configuring the Audit Edge Gateway

The Audit Edge Gateway serves as the central Audit Record Repository for the UCR federation:

  • The Audit Edge is deployed as a separate HealthShare component (typically in its own namespace)
  • It receives audit events from all UCR components (Hub, Edge Gateways, Access Gateways, Clinical Viewer)
  • Configuration involves registering the Audit Edge in the UCR Hub's registry so all components know where to send audit events
  • The Audit Edge stores events in a structured database that supports querying and reporting
  • Redundancy should be considered: if the Audit Edge is unavailable, components should queue audit events for later delivery

Audit Event Types in UCR

The following types of events are typically audited in a UCR deployment:

  • Patient Record Access: Any query or retrieval of patient data through the Clinical Viewer or programmatic access
  • Document Creation and Update: When new clinical documents (CDA, SDA) are stored in the ECR or registered at the Hub
  • User Authentication: Login and logout events, failed authentication attempts
  • Consent Changes: Modifications to patient consent policies that affect data visibility
  • Export and Print: When patient data is exported, downloaded, or printed from the Clinical Viewer
  • Administrative Actions: Changes to user roles, access policies, or system configuration

Enabling Audit Logging on Each Component

Each UCR component must be individually configured to generate and send audit events:

  • In the Management Portal, navigate to HealthShare > Registry > Configuration for the component
  • Enable audit event generation in the component's production configuration
  • Configure the Audit Edge endpoint (host, port, credentials) for audit event delivery
  • Verify audit event flow by performing a test action (e.g., querying a patient record) and confirming the event appears in the Audit Edge
  • Monitor the AuditAlertOperations setting to ensure alerts are raised if audit event delivery fails

Audit Log Retention Policies

Audit log retention must comply with applicable regulations:

  • HIPAA: Requires retention of audit logs for a minimum of 6 years
  • Organizational policies: May require longer retention based on institutional requirements
  • Configure the Audit Edge's purge schedule to respect the required retention period
  • Consider archiving older audit data to long-term storage rather than purging
  • Ensure that archived audit data remains searchable and retrievable for compliance audits
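The retention arithmetic can be sketched as follows. The 365-day year and the helper names are simplifying assumptions; a production policy should use calendar-aware arithmetic and err on the side of keeping more:

```python
from datetime import date, timedelta

def purge_cutoff(today, retention_years=6):
    """Oldest date whose audit records must still be retained, assuming a
    HIPAA-style 6-year minimum (the retention period is configurable)."""
    # Approximates a year as 365 days; real policies should round up.
    return today - timedelta(days=365 * retention_years)

def may_purge(record_date, today, retention_years=6):
    """True only if the record is older than the retention window."""
    return record_date < purge_cutoff(today, retention_years)
```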

---

5. Log Review and Interpretation

Key Points

  • InterSystems IRIS Audit Log: System-level audit of database operations, accessed via Management Portal
  • HS.Util.Installer log: Records installation and upgrade steps; used to verify successful deployment
  • Production Event Log: Application-level events from business hosts (errors, warnings, info)
  • Console log: Platform-level events (license, database, startup/shutdown); named messages.log in InterSystems IRIS (cconsole.log on older Caché-based versions)
  • Common error patterns: Timeout errors, SSL failures, namespace errors, class compilation errors

Detailed Notes

InterSystems IRIS Audit Log

The IRIS Audit Log is distinct from the production Event Log and the ATNA audit trail:

  • Accessed via Management Portal > System Administration > Security > Auditing > View Audit Database
  • Records system-level security events: login/logout, database access, privilege escalation, configuration changes
  • Each entry includes: event type, user, timestamp, source (IP address/process), and event-specific details
  • Use the IRIS Audit Log to investigate unauthorized access attempts, track configuration changes, and support security incident investigations
  • Audit events can be configured under System Administration > Security > Auditing > Configure System Events

HS.Util.Installer Log

The HS.Util.Installer log records the steps performed during HealthShare installation and upgrade:

  • Captures each step of the installation process: namespace creation, database creation, class compilation, production configuration, and component registration
  • Location: Accessible through the Management Portal or in the installation directory
  • Use during installation verification: After installing or upgrading a UCR component, review this log to confirm all steps completed successfully
  • Use during troubleshooting: If a component is not functioning correctly after installation or upgrade, this log may reveal steps that failed or completed with warnings
  • Look for entries marked as errors or warnings; successful steps are typically marked as informational

Production Event Log

The Production Event Log is the primary troubleshooting tool for production-level issues:

  • Access: Management Portal > Interoperability > View > Event Log
  • Contains events generated by business hosts: errors, warnings, informational messages, alerts, and trace entries
  • Filter by component name, event type, time range, or text search
  • Each event includes: timestamp, component name, event type, and full error text with optional stack trace
  • The Event Log is namespace-specific; check the correct namespace for the UCR component under investigation

Alert Log and Console Log

Additional logs complement the Event Log:

  • Alert Log: Records all alerts generated by business hosts with AlertOnError enabled. Accessible alongside the Event Log in the Management Portal.
  • Console Log: Located on the server file system in the IRIS installation directory (typically <install-dir>/mgr/messages.log; older Caché-based HealthShare versions use cconsole.log). Contains platform-level events: instance startup/shutdown, license events, database errors, journal operations, and mirror status changes. Use tail -f messages.log for real-time monitoring.

Common Error Patterns and Their Meaning

Recognizing common error patterns accelerates troubleshooting:

  • "ERROR #6084: Timed out waiting for response": A business operation's call to an external system exceeded the configured timeout. Check network connectivity and the target system's responsiveness.
  • "ERROR #6023: Connection refused": The target system is not accepting connections on the expected port. Verify the target service is running and the port configuration is correct.
  • "SSL/TLS error": Certificate issues (expired, untrusted CA, hostname mismatch). Check certificate configuration on both sides of the connection.
  • "ERROR #5001: Class does not exist": A referenced class is missing, possibly due to incomplete installation or a failed compilation. Recompile the namespace or reinstall the component.
  • "Namespace does not exist": The target namespace is missing or was not created during installation. Review the HS.Util.Installer log for namespace creation errors.
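A pattern table keyed on these signatures can automate first-pass triage of exported Event Log entries. This is a sketch: the patterns and hints simply paraphrase the list above and are not exhaustive:

```python
import re

# Each entry pairs an error signature from the list above with a short
# diagnosis hint (illustrative only).
ERROR_PATTERNS = [
    (re.compile(r"ERROR #6084"), "timeout: check network and target responsiveness"),
    (re.compile(r"ERROR #6023"), "connection refused: verify target service and port"),
    (re.compile(r"SSL/TLS", re.IGNORECASE), "tls: check certificates and trust chain"),
    (re.compile(r"ERROR #5001"), "missing class: recompile or reinstall"),
    (re.compile(r"Namespace does not exist"), "missing namespace: review installer log"),
]

def triage(event_text):
    """Return a diagnosis hint for the first matching error signature."""
    for pattern, hint in ERROR_PATTERNS:
        if pattern.search(event_text):
            return hint
    return "unclassified: read the full Event Log entry"
```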

---

6. Data Volume and System Integrity Management

Key Points

  • ^%GSIZE: Utility for analyzing global (database table) sizes to identify large or growing data stores
  • Journal management: Journal files record all database changes; manage growth with journal profiles and purging
  • ^Integrity: Database integrity check utility that validates database block-level consistency
  • Database compaction: Reclaims unused space within databases after data deletion or purging
  • Purge utilities: Built-in tools for purging old messages, Event Log entries, and production logs

Detailed Notes

^%GSIZE Utility for Global Size Analysis

The ^%GSIZE utility provides detailed information about the size of globals (database tables) in each namespace:

  • Running the utility: From the Terminal, switch to the namespace whose data you want to analyze and enter do ^%GSIZE; the utility prompts for the database directory and which globals to include
  • Output: Lists each global, its total size in MB, the number of blocks allocated, and the percentage of allocated space in use
  • Use in UCR: Identify which globals are consuming the most space. In a UCR deployment, the largest globals are typically those storing message headers/bodies (Ens.MessageHeaderD, Ens.MessageBodyD), the ECR data (clinical documents), and the MPI indices
  • Growth monitoring: Run ^%GSIZE periodically and compare results to track database growth trends. Rapid growth may indicate a purge task is not running or a data feed is sending unexpectedly high volumes
  • Action: If a specific global is disproportionately large, investigate whether purge tasks are configured correctly or whether the data retention policy needs adjustment
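Comparing two ^%GSIZE snapshots for growth can be scripted as below. The global names in the test data and the growth threshold are illustrative; in practice the snapshots would be exported from the utility's output:

```python
def growth_report(previous, current, threshold_mb=1024):
    """Compare two size snapshots ({global name: size in MB}) and flag
    globals that grew by more than threshold_mb between runs (sketch)."""
    flagged = {}
    for name, size in current.items():
        delta = size - previous.get(name, 0)  # new globals count fully
        if delta > threshold_mb:
            flagged[name] = delta
    return flagged
```

Flagged entries then feed the action in the last bullet: check whether the relevant purge task is running, or whether the retention policy needs adjustment.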

Journal File Management and Profiles

Journal files record all database write operations and are essential for recovery:

  • Purpose: Journals enable point-in-time recovery after a system failure. Every write to the database is recorded in the current journal file.
  • Journal growth: In a high-volume UCR deployment, journal files can grow rapidly (gigabytes per day). Insufficient disk space for journals can cause the system to halt.
  • Journal profiles: Configure journal file location, maximum size before rollover, and the number of historical journal files to retain. Journals should be stored on a separate physical disk from the database for performance and resilience.
  • Journal purging: Old journal files are purged based on the configured retention policy. Ensure the retention period covers at least the time needed for a full backup cycle.
  • Monitoring: Monitor journal disk space as part of routine system health checks. Configure system alerts for low disk space on the journal volume.
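A routine free-space check for the journal volume might look like the following. The 20 GB default threshold is an assumption; size it to at least a day or two of your deployment's journal volume:

```python
import shutil

def journal_space_ok(journal_dir, min_free_gb=20):
    """Return True if the volume holding journal_dir has at least
    min_free_gb of free space (threshold is an assumption)."""
    usage = shutil.disk_usage(journal_dir)
    free_gb = usage.free / (1024 ** 3)
    return free_gb >= min_free_gb
```

A cron job or monitoring agent would run this against the journal directory and raise an alert when it returns False.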

Database Integrity Checks with ^Integrity

The ^Integrity utility validates the structural integrity of InterSystems IRIS databases:

  • Running the utility: From the Terminal in the %SYS namespace, enter do ^Integrity. Select the database(s) to check.
  • What it checks: Block-level consistency of the database structure, pointer chain validity, and data block integrity. It does not check application-level data correctness.
  • When to run: After an unexpected system shutdown (power failure, crash), after restoring a database from backup, or as part of a scheduled maintenance routine
  • Output: Reports any integrity violations found, including the specific block numbers and global names affected
  • Resolution: If integrity violations are found, consult InterSystems support (WRC). Minor issues may be repaired with the ^REPAIR utility, but significant corruption may require restoring from backup.

Database Compaction

After deleting or purging large amounts of data, databases may contain unused allocated space:

  • Database files do not automatically shrink when data is deleted. The space is marked as available for reuse within the database but is not returned to the operating system.
  • Database compaction consolidates used blocks and gathers free space at the end of the database file; a separate truncation operation then returns that freed space to the operating system, physically shrinking the file
  • Access compaction through Management Portal > System Operation > Databases, then select the database and use the Compact option
  • Schedule compaction during maintenance windows, as it can temporarily impact performance
  • Monitor the difference between allocated size (file size on disk) and used size (^%GSIZE output) to determine if compaction is needed

Purge Utilities for Message History and Production Logs

UCR productions generate large volumes of operational data that must be periodically purged:

  • Message purge: Removes old message headers, bodies, and search table entries. Configure via the Ens.Util.Tasks.Purge scheduled task. Specify the number of days to retain and the types of messages to purge.
  • Event Log purge: Removes old Event Log entries. Can be configured as part of the same purge task or separately.
  • Business Rule Log purge: Removes old business rule execution records.
  • Managed Alert purge: Removes old managed alert records.
  • Scheduling: Configure purge tasks through System Operation > Task Manager. Schedule during off-peak hours to minimize performance impact.
  • Retention guidelines: Balance operational needs (troubleshooting capability) against storage constraints. A typical retention period is 30-90 days for messages and Event Log entries, but regulatory requirements may mandate longer retention for audit data.
  • Verify purge execution: After configuring purge tasks, monitor the Task Manager to confirm they execute successfully. Failed purge tasks allow unbounded database growth.

---

Exam Preparation Summary

Critical Concepts to Master:

  1. System alerts vs production alerts: System alerts are platform-level (license, database); production alerts come from business hosts via AlertOnError. Both require separate configuration.
  2. Production Monitor: The first tool for assessing production health. Check component status, queue depths, error counts, and last activity timestamps.
  3. Connection failure types: Distinguish timeout (network/load), connection refused (service down/wrong port), DNS failure (name resolution), and SSL/TLS errors (certificate issues).
  4. ^SystemPerformance: Run from the %SYS namespace Terminal. Produces an HTML report covering CPU, memory, disk I/O, cache efficiency, and lock contention. Always compare incident data against a baseline.
  5. OpenMetrics APIs: RESTful endpoints at `/api/monitor/metrics` providing Prometheus-compatible metrics for continuous monitoring and Grafana dashboard integration.
  6. ATNA and Audit Edge: ATNA is the IHE audit profile. The Audit Edge Gateway is the central Audit Record Repository. Each UCR component must be configured to send audit events to the Audit Edge.
  7. HS.Util.Installer log: Essential for verifying installation and upgrade success. Review after every deployment for errors or warnings.
  8. Event Log vs console log: Event Log for production/application issues; the console log (messages.log; cconsole.log on older Caché-based versions) for platform-level issues (license, database, startup). Know which log to check for each symptom.
  9. ^%GSIZE: Analyzes global sizes to identify storage consumption. Run periodically to track growth and verify purge task effectiveness.
  10. ^Integrity: Validates database structural integrity at the block level. Run after unexpected shutdowns and as part of scheduled maintenance.
  11. Journal management: Journals enable recovery. Store on a separate disk, monitor space, and configure retention policies aligned with backup cycles.
  12. Purge tasks: Configure via Ens.Util.Tasks.Purge in the Task Manager. Purged data is permanently lost; balance retention against storage needs.

Common Exam Scenarios:

  • Configuring AlertOnError and AlertGracePeriod to balance timely notification against alert fatigue
  • Choosing the correct troubleshooting tool: Production Monitor for status overview, Event Log for errors, Visual Trace for message flow, ^SystemPerformance for system-level performance
  • Diagnosing a growing queue: check component status, PoolSize, downstream dependencies, and Event Log for errors
  • Interpreting ^SystemPerformance report indicators: high CPU, low cache hit ratio, elevated disk wait times
  • Configuring ATNA audit logging across a UCR federation and verifying audit event delivery to the Audit Edge
  • Reviewing the HS.Util.Installer log after an upgrade to confirm all steps completed without error
  • Using ^%GSIZE to identify which globals are consuming the most space and correlating with purge task configuration
  • Running ^Integrity after an unexpected system shutdown and interpreting the results
  • Determining the appropriate purge retention period given organizational and regulatory requirements
  • Differentiating between the IRIS Audit Log (system security events), the ATNA audit trail (clinical data access events), and the Production Event Log (business host processing events)
