1. Message Storage and Retention Optimization
Key Points
- Message Storage Components: Message headers and message bodies are stored separately in the database
- Purge Task: Ens.Util.Tasks.Purge is the built-in scheduled task that removes old messages
- Retention Policies: Define how long message headers and bodies are kept before purging
- Archival Strategies: Archive important messages before purging for compliance and audit purposes
- Storage Impact: Unpurged messages accumulate and degrade system performance over time
Detailed Notes
Overview
HealthShare UCR productions generate large volumes of messages as clinical data flows through Edge Gateways and the Hub. Each message consists of a message header (metadata about the message: source, target, timestamps, status) and a message body (the actual content of the message). Without active management, these messages accumulate in the database, consuming disk space and degrading query performance. The built-in purge task provides automated message cleanup based on configurable retention policies.
Message Header and Body Storage
- Message Headers (Ens.MessageHeader): Stored in the Ensemble message header table; contain metadata such as session ID, source component, target component, message type, creation time, and completion status
- Message Bodies: Stored in class-specific tables or as streams; contain the actual message content (HL7 messages, SDA containers, request/response objects)
- Headers and bodies are linked by message ID but stored separately, allowing different retention periods
Configuring Ens.Util.Tasks.Purge
1. Open the Management Portal and navigate to System Operation > Task Manager > Task Schedule
2. Locate or create the Ens.Util.Tasks.Purge task
3. Configure the following parameters:
- NumberOfDaysToKeep: How many days of messages to retain (e.g., 30, 60, 90)
- BodiesToo: Whether to purge message bodies along with headers (recommended: Yes)
- TypesToPurge: Which message types to include in the purge (default: all)
- KeepIntegrity: Whether to preserve referential integrity during purge
4. Set the schedule (recommended: daily during off-peak hours)
5. Save and activate the task
Retention Policy Considerations
- Balance storage costs against troubleshooting and audit needs
- Regulatory requirements may mandate minimum retention periods for clinical data messages
- Consider longer retention for error messages (useful for trend analysis)
- Production environments typically retain 30-90 days; development environments may use shorter periods
- Monitor database size trends to validate that purge policies are effective
Archival Strategies
- Export messages to external storage before purging for long-term retention
- Use the Message Viewer export function to save specific message sessions
- Implement automated archival using a custom Business Operation that writes to archive storage
- Maintain an archive index for retrieval of archived messages when needed
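The archive-index idea above can be sketched in Python. This is a minimal illustration, not a HealthShare API: the class name, JSON layout, and field names are all assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class ArchiveIndex:
    """Minimal archive index sketch: maps message IDs to archive file
    locations so purged messages can still be located for audit requests."""

    def __init__(self, index_path):
        self.index_path = Path(index_path)
        self.entries = {}
        if self.index_path.exists():
            self.entries = json.loads(self.index_path.read_text())

    def record(self, message_id, archive_file, session_id=None):
        # Store just enough metadata to answer "where did message X go?"
        self.entries[str(message_id)] = {
            "archive_file": archive_file,
            "session_id": session_id,
            "archived_at": datetime.now(timezone.utc).isoformat(),
        }

    def lookup(self, message_id):
        return self.entries.get(str(message_id))

    def save(self):
        self.index_path.write_text(json.dumps(self.entries, indent=2))
```

In practice the index would be written by whatever process exports messages (e.g., a custom Business Operation) immediately before the purge task runs.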
---
2. Data Volume and System Integrity
Key Points
- ^%GSIZE Utility: Reports the size of globals (database tables) to identify large and growing data stores
- Journal Management: Journals record all database changes for recovery; manage journal file retention and space
- ^Integrity Utility: Verifies database structural integrity by checking block-level consistency
- Database Compaction: Reclaims unused space after data deletion or purging
- Proactive Monitoring: Regular checks prevent storage exhaustion and data corruption
Detailed Notes
^%GSIZE for Global Sizing
The ^%GSIZE utility reports the size of globals (the underlying storage structures for database tables) in a namespace. This is essential for identifying which data stores are consuming the most space and tracking growth over time.
Running ^%GSIZE:
1. Open a Terminal session to the target namespace and execute: do ^%GSIZE
2. The utility reports each global's allocated and used blocks, along with total size
3. Review the output to identify the largest globals and compare results over time to track growth trends
Key globals to monitor in UCR:
- Ens.MessageHeaderD: Message header storage (grows with message volume)
- HS.SDA3.*: SDA streamlet storage (grows with clinical data volume)
- IRIS.Temp.*: Temporary data that should be cleaned periodically
- HS.IHE.XDS.*: Document registry metadata
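Since ^%GSIZE reports only current sizes, growth tracking means comparing snapshots over time. A minimal Python sketch, assuming sizes have already been collected into dated samples (the global names and MB units are illustrative):

```python
def growth_trend(samples):
    """Compute per-global growth between the first and last snapshot.

    samples: list of (date_label, {global_name: size_mb}) tuples,
    oldest first, e.g. captured from periodic ^%GSIZE runs.
    Returns {global_name: delta_mb}.
    """
    first, last = samples[0][1], samples[-1][1]
    # Include globals that appear in either snapshot (new or deleted globals).
    return {g: last.get(g, 0) - first.get(g, 0) for g in set(first) | set(last)}

samples = [
    ("2024-01-01", {"Ens.MessageHeaderD": 1000, "HS.SDA3.Streamlet": 500}),
    ("2024-02-01", {"Ens.MessageHeaderD": 1400, "HS.SDA3.Streamlet": 520}),
]
delta = growth_trend(samples)
# Ens.MessageHeaderD grew 400 MB: a purge-policy review candidate.
```

Sorting the deltas descending gives a quick "fastest-growing globals" list to validate that purge policies are actually effective.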
Journal Management
Journals record every database modification operation (inserts, updates, deletes) and are essential for database recovery. However, journal files consume significant disk space.
- Journal File Location: Configured during system installation; should be on a separate disk from databases
- Retention Settings: Configure how many days of journal files to retain (System Administration > Configuration > Journal Settings)
- Purge Schedule: Set up automatic journal file purging to remove files older than the retention period
- Space Monitoring: Monitor the journal directory for available disk space; journal disk exhaustion causes system halts
- Backup Coordination: Journal purging should be coordinated with backup schedules to ensure recovery point coverage
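Because journal disk exhaustion halts the system, free space on the journal volume is worth checking from an external monitor as well. A Python sketch (the 10 GB threshold is an illustrative assumption; set it from your journal growth rate):

```python
import shutil

def journal_space_alert(journal_dir, min_free_gb=10.0):
    """Return (free_gb, alert) for the journal directory.

    Alert well before the volume fills: journal disk exhaustion
    causes a system halt, not a graceful degradation.
    """
    usage = shutil.disk_usage(journal_dir)
    free_gb = usage.free / 1024**3
    return free_gb, free_gb < min_free_gb
```

A cron job calling this and raising a ticket or page when `alert` is true is a common pattern alongside the built-in monitors.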
^Integrity for Database Checks
The ^Integrity utility verifies the structural integrity of database files by checking block-level consistency.
Running ^Integrity:
1. Open a Terminal session
2. Execute: do ^Integrity
3. Select the database(s) to check
4. The utility reports any block-level inconsistencies or corruption
5. Schedule integrity checks during maintenance windows (the check can be resource-intensive)
When to run integrity checks:
- After unexpected system shutdowns or power failures
- After disk hardware events or storage migrations
- As part of regular preventive maintenance (monthly or quarterly)
- When data corruption is suspected based on application errors
Database Compaction
After large purge operations or data deletions, databases may have unused space that is not automatically returned to the operating system. Use database compaction (accessible via System Operation > Databases) to reclaim unused space. Compaction can be run online but may impact performance; schedule it during maintenance windows. Monitor database file sizes before and after compaction to verify space reclamation.
---
3. Performance Troubleshooting
Key Points
- OpenMetrics Endpoint: `/api/monitor/metrics` exposes real-time system metrics in Prometheus format
- ^SystemPerformance: Collects comprehensive system performance data over a configurable period
- Bottleneck Identification: Use metrics to identify CPU, memory, disk I/O, and cache bottlenecks
- Cache Efficiency: Monitor global buffer cache hit ratio to assess database cache effectiveness
- Trend Analysis: Compare performance data over time to detect degradation
Detailed Notes
OpenMetrics (/api/monitor/metrics)
The OpenMetrics endpoint provides real-time system metrics in a format compatible with Prometheus and other monitoring systems.
Accessing OpenMetrics:
- URL: http://<server>:<port>/api/monitor/metrics
- Returns metrics in Prometheus text format
- Metrics include: CPU utilization, memory usage, process counts, license usage, database cache statistics, and Ensemble-specific metrics
- Can be scraped by external monitoring tools (Prometheus, Grafana) for dashboards and alerting
Key metrics to monitor:
- iris_cpu_usage: CPU utilization percentage
- iris_cache_efficiency: Global buffer cache hit ratio (should be above 95%)
- iris_process_count: Number of active processes
- iris_disk_reads_per_sec: Physical disk reads (high values indicate cache inefficiency)
- iris_journal_space: Available journal disk space
- iris_ens_queue_count: Ensemble production queue depths
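Metrics from the endpoint arrive as Prometheus text lines (name, optional labels, value). A minimal Python parser sketch for ad hoc checks; labels and HELP/TYPE metadata are ignored for brevity, and exact metric names vary by IRIS version:

```python
import re

def parse_prometheus_text(text):
    """Parse Prometheus text-format metrics into {name: value}.

    Simplification: label sets are discarded, so a metric repeated
    with different labels keeps only the last value seen.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        m = re.match(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(3))
    return metrics
```

In a real check you would fetch the text with any HTTP client from http://<server>:<port>/api/monitor/metrics and then compare values such as the cache hit ratio against your thresholds; scraping with Prometheus itself avoids hand-rolled parsing entirely.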
^SystemPerformance Reports
The ^SystemPerformance utility collects comprehensive performance data over a configurable sampling period and generates a detailed report.
Running ^SystemPerformance:
1. Open a Terminal session
2. Execute: do ^SystemPerformance
3. Specify the output file path
4. Specify the sampling duration (e.g., 300 seconds) and interval (e.g., 5 seconds)
5. The utility collects system metrics at each interval
6. After completion, review the generated report
Report contents:
- CPU utilization breakdown (user, system, idle)
- Memory usage and paging statistics
- Disk I/O rates (reads/writes per second, latency)
- Global buffer cache performance (hit ratio, evictions)
- Lock table utilization
- Process activity and wait states
- Database journal write rates
Identifying Bottlenecks
- CPU Bottleneck: Sustained high CPU utilization (>80%) with slow response times
- Memory Bottleneck: High paging rates, low available memory, cache evictions
- Disk I/O Bottleneck: High disk latency (>10ms), high read rates with low cache hit ratio
- Cache Inefficiency: Global buffer cache hit ratio below 95% indicates the cache is too small or working set is too large
- Queue Buildup: Growing Ensemble queue depths indicate that Business Operations cannot keep up with inbound message volume
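The rules of thumb above can be expressed as a simple triage function; a Python sketch with illustrative metric keys and the thresholds quoted in these notes (tune both for your environment):

```python
def classify_bottlenecks(m):
    """Apply rule-of-thumb thresholds to a dict of sampled metrics.

    m: e.g. {"cpu_pct": 90, "cache_hit_ratio": 92.0,
             "disk_latency_ms": 4, "queue_depth": 120, "queue_trend": "growing"}
    Returns a list of human-readable findings (empty list = no flags).
    """
    findings = []
    if m.get("cpu_pct", 0) > 80:
        findings.append("CPU bottleneck: sustained utilization above 80%")
    if m.get("cache_hit_ratio", 100) < 95:
        findings.append("Cache inefficiency: hit ratio below 95%")
    if m.get("disk_latency_ms", 0) > 10:
        findings.append("Disk I/O bottleneck: latency above 10 ms")
    if m.get("queue_depth", 0) > 0 and m.get("queue_trend") == "growing":
        findings.append("Queue buildup: operations not keeping up with inbound volume")
    return findings
```

Note that findings often cascade: a small cache forces physical reads, which shows up as disk I/O latency, so treat the output as starting points, not verdicts.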
---
4. Event Log Interpretation
Key Points
- Event Log Access: Available at Ensemble > Event Log in the Management Portal
- Event Types: Error, Warning, Info, Trace, and Assert entries
- Filtering and Searching: Filter by date range, component, severity, and text content
- Common Error Patterns: Connection failures, transformation errors, validation failures, timeout errors
- Correlation: Correlate Event Log entries with Visual Trace sessions for complete diagnosis
Detailed Notes
Accessing the Event Log
1. Open the Management Portal for the target namespace (Edge Gateway or Hub)
2. Navigate to Ensemble > Event Log (or Interoperability > Event Log in newer versions)
3. The Event Log displays entries in reverse chronological order
4. Use the filter controls to narrow the view
Event Types
- Error: Indicates a failure that prevented normal operation (e.g., connection refused, DTL compilation error, validation failure). Requires investigation and resolution.
- Warning: Indicates a potential problem that did not prevent operation but may lead to issues (e.g., slow response, retry succeeded after initial failure). Should be monitored.
- Info: Informational entries about normal operations (e.g., production started, component connected, configuration loaded). Useful for understanding system activity.
- Trace: Detailed diagnostic entries generated when trace logging is enabled. Used for in-depth troubleshooting. Not normally enabled in production due to volume.
- Assert: Internal consistency check entries. Rare; indicate unexpected internal conditions.
Filtering and Searching
- Date Range: Limit entries to a specific time window
- Component: Filter by source component name (e.g., "HL7FileService", "DTLProcess", "MPIOperation")
- Severity: Show only errors, or errors and warnings
- Text Search: Search entry text for specific keywords (e.g., "timeout", "connection", "DTL", patient MRN)
- Session ID: Find all entries related to a specific message session
Common Error Patterns in UCR
- "Connection refused": The target system is not listening on the configured port; check network connectivity and target service status
- "DTL Transform error": A Data Transformation Language transform encountered an error; check the DTL class for compilation errors or data mismatches
- "Validation error": An incoming message failed schema validation; check the message structure against the expected schema
- "Timeout": A component did not respond within the configured timeout period; check the target system's responsiveness and adjust timeout settings
- "MPI match error": The MPI encountered an error during patient matching; check the matching algorithm configuration and input data quality
- "Consent denied": A data access request was denied by consent policy; verify the consent configuration and patient's consent status
---
5. Alert Message Interpretation
Key Points
- Alert Sources: Alerts are generated by production components, system monitors, and scheduled tasks
- Severity Levels: Alerts range from informational notifications to critical system failures
- AlertOnError Setting: Production components can be configured to generate alerts when errors occur
- Alert Routing: Alerts can be routed to email, the management console, or custom handlers
- Escalation: Unresolved alerts may escalate based on configured escalation policies
Detailed Notes
Alert Sources
- Production Components: Business Services, Processes, and Operations generate alerts when configured with AlertOnError or AlertGracePeriod settings
- System Monitors: Built-in monitors for disk space, license usage, database growth, and process limits
- Scheduled Tasks: Tasks that fail or produce warnings generate task-level alerts
- Custom Alert Generators: Business rules or custom code can generate alerts based on application-specific conditions
Alert Severity Levels
- Info: Notification of normal but noteworthy events (e.g., scheduled maintenance completed)
- Warning: Conditions that may require attention but are not immediately critical (e.g., disk space below 20%)
- Error: Component failures or processing errors that affect data flow (e.g., connection lost to source system)
- Critical: System-level failures that require immediate intervention (e.g., database corruption, license exhausted)
AlertOnError Configuration
- Each production component has an AlertOnError setting (true/false)
- When enabled, the component generates an alert whenever it encounters an error
- AlertGracePeriod defines the minimum interval between repeated alerts for the same error (prevents alert flooding)
- Configure AlertOnError on critical components (data feed services, MPI operations) and leave it off for non-critical components
- Access the setting through the component configuration page in the Production Configuration
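The AlertGracePeriod behavior (suppressing repeats of the same error within an interval) can be sketched in Python; the class and method names here are hypothetical, purely to illustrate the suppression logic:

```python
class GraceAlerter:
    """Suppress repeated alerts for the same error within a grace period,
    mimicking the AlertGracePeriod idea described above."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.last_sent = {}  # error key -> timestamp of the last alert sent

    def should_alert(self, error_key, now):
        """now: monotonic timestamp in seconds."""
        last = self.last_sent.get(error_key)
        if last is not None and now - last < self.grace:
            return False  # still inside the grace period: suppress the repeat
        self.last_sent[error_key] = now
        return True
```

Note the design choice: suppressed repeats do not reset the timer, so a continuously failing component still re-alerts once per grace period rather than going silent forever.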
Responding to Alerts
1. Review the alert message for the component name, error description, and timestamp
2. Check the Event Log for additional context around the alert time
3. Open Visual Trace for the affected message session (if applicable)
4. Diagnose and resolve the root cause
5. Verify that the component has recovered (check the Production Monitor for green status)
6. Document the alert, root cause, and resolution for operational knowledge
Alert Escalation
- Configure escalation policies that promote alerts to higher severity if not acknowledged within a defined period
- Escalation can trigger additional notification channels (e.g., from email to SMS/pager)
- Tune alert thresholds and grace periods to reduce noise while ensuring critical issues are not missed
---
6. Production Monitor
Key Points
- Accessing Production Monitor: Available at Ensemble > Production Monitor in the Management Portal
- Component Status Indicators: Green (running), yellow (warning/disabled), red (error/stopped)
- Queue Depth Monitoring: View the number of queued messages for each component to detect backlogs
- Throughput Analysis: Monitor messages processed per time period for each component
- Identifying Failing Components: Quickly locate components in error state and access their Event Log entries
Detailed Notes
Accessing the Production Monitor
1. Open the Management Portal for the target namespace
2. Navigate to Ensemble > Production Monitor (or Interoperability > Monitor)
3. The Production Monitor displays all production components organized by type (Services, Processes, Operations)
4. The monitor refreshes automatically at a configurable interval
Component Status Indicators
- Green: The component is running normally and processing messages
- Yellow: The component is disabled (intentionally stopped) or experiencing warnings
- Red: The component has encountered an error and is not processing messages
- Gray: The component is not configured or has never been started
- Click on any component to access its detailed status, configuration, and recent Event Log entries
Queue Depth Monitoring
- Each component displays its current queue depth (number of messages waiting to be processed)
- A growing queue indicates that the component is falling behind the incoming message rate
- Normal queue depth is zero or near-zero for most components
- Consistently high queue depths may indicate: slow downstream systems, resource constraints, configuration issues, or increased inbound volume
- Monitor queue trends over time to distinguish between temporary spikes and sustained backlogs
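Distinguishing a temporary spike from a sustained backlog is easy to express over sampled queue depths. A Python sketch; the threshold and window values are illustrative assumptions:

```python
def sustained_backlog(queue_samples, threshold=10, window=5):
    """True only if the last `window` queue-depth samples all exceed
    `threshold` -- a lone spike that drains again does not trigger."""
    recent = queue_samples[-window:]
    return len(recent) == window and all(q > threshold for q in recent)
```

With one sample per minute, `window=5` means "backed up for five straight minutes", which filters out the momentary bursts that follow any large inbound batch.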
Throughput Analysis
- The Production Monitor shows message counts and processing rates for each component
- Compare throughput across components to identify bottlenecks in the processing chain
- A component with high input throughput but low output throughput is a processing bottleneck
- Use throughput data to plan capacity and identify components that need performance tuning
- Track throughput trends to detect changes in data volume from source systems
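The "high input throughput, low output throughput" test can be sketched over per-component rate samples. The data shape below is an assumption for illustration, not a HealthShare API:

```python
def find_bottlenecks(throughput):
    """Flag components whose output rate lags their input rate.

    throughput: {component_name: (msgs_in_per_min, msgs_out_per_min)}
    Returns component names sorted by the in/out gap, worst first.
    """
    gaps = {c: inp - out for c, (inp, out) in throughput.items() if inp > out}
    return sorted(gaps, key=gaps.get, reverse=True)

rates = {
    "HL7FileService": (100, 100),
    "DTLProcess": (100, 60),   # processing bottleneck
    "MPIOperation": (60, 55),
}
```

Here `find_bottlenecks(rates)` would put DTLProcess first: it accepts 100 messages per minute but emits only 60, so its queue must be growing.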
Identifying and Diagnosing Failing Components
1. Look for red status indicators in the Production Monitor
2. Click on the failing component to view its error details
3. Check the component's Event Log entries for error messages
4. Verify the component's configuration (endpoint URLs, credentials, timeout settings)
5. Test connectivity to external systems if the component communicates with external endpoints
6. Restart the component after resolving the issue
7. Monitor the component to confirm it returns to green status and processes queued messages
---
7. Business Rule Log
Key Points
- Purpose: The Business Rule Log records every decision made by business rules during message processing
- Accessing the Log: Available at Ensemble > Business Rule Log in the Management Portal
- Rule Evaluation Records: Each entry shows which rule was evaluated, the conditions tested, and the action taken
- Debugging Routing Logic: Use the log to understand why a message was routed to a specific path
- Filtering: Filter by rule name, date range, session ID, and result
Detailed Notes
Purpose of the Business Rule Log
The Business Rule Log provides an audit trail of every business rule evaluation that occurs during production message processing. Business rules control message routing, transformation selection, and conditional processing logic. When troubleshooting unexpected routing behavior or trying to understand why a message was handled in a particular way, the Business Rule Log is the definitive source of information.
Accessing the Business Rule Log
1. Open the Management Portal for the target namespace
2. Navigate to Ensemble > Business Rule Log (or Interoperability > Business Rule Log)
3. The log displays rule evaluation entries in reverse chronological order
4. Each entry includes: timestamp, rule name, session ID, conditions evaluated, and action taken
Interpreting Rule Evaluation Results
- Rule Name: The business rule class that was evaluated
- Session ID: Links the rule evaluation to a specific message session (use this to cross-reference with Visual Trace)
- Conditions Evaluated: Shows each condition in the rule and whether it evaluated to true or false
- Action Taken: The action that was executed based on the rule evaluation (route to target, transform, discard, etc.)
- Return Value: The final result of the rule evaluation (accept, reject, continue)
Debugging Routing Rule Logic
When a message is not being routed as expected:
1. Find the message's session ID from the Message Viewer
2. Search the Business Rule Log for entries with that session ID
3. Review the conditions that were evaluated and their true/false results
4. Identify which condition caused the unexpected routing decision
5. Check the condition logic: are the field references correct? Are the comparison values accurate?
6. Modify the business rule to correct the logic
7. Reprocess a test message and verify the corrected routing
Common Business Rule Issues in UCR
- Incorrect Field Reference: The rule references a message field that does not exist or has a different path than expected
- Value Mismatch: The rule compares against a hardcoded value that does not match the actual message content (case sensitivity, whitespace)
- Missing Default Rule: No default action is defined for messages that do not match any specific condition
- Rule Priority: Rules are evaluated in order; a higher-priority rule may match before the intended rule
- Stale Conditions: Rule conditions reference values that have changed (e.g., facility codes, message types)
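The rule-priority and missing-default pitfalls are easy to demonstrate with a first-match-wins evaluator, which mirrors how rules are tried in order. A Python sketch (rule names, predicates, and actions are illustrative):

```python
def evaluate_rules(rules, message):
    """First-match-wins evaluation: a broad rule placed earlier can
    shadow a more specific rule below it.

    rules: list of (name, predicate, action), evaluated in order.
    Returns (matched_rule_name, action).
    """
    for name, predicate, action in rules:
        if predicate(message):
            return name, action
    # Missing-default pitfall: make the fallback explicit and visible.
    return "default", "discard"

rules = [
    ("AllADT",  lambda m: m["type"].startswith("ADT"), "route:Hub"),
    ("ADT_A01", lambda m: m["type"] == "ADT_A01",      "route:MPI"),
]
```

With this ordering an ADT_A01 message matches the broad "AllADT" rule first and never reaches the more specific "ADT_A01" rule, which is exactly the shadowing behavior to look for in the Business Rule Log.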
Filtering the Business Rule Log
- Date Range: Limit entries to a specific time period
- Rule Name: Filter by the specific business rule class
- Session ID: Show all rule evaluations for a specific message session
- Result: Filter by action taken (useful for finding all "discard" or "reject" decisions)
- Component: Filter by the Business Process that executed the rule
---
Exam Preparation Summary
Critical Concepts to Master
- Message Purge Configuration: Know how to configure Ens.Util.Tasks.Purge with retention periods and understand the impact of unpurged messages
- ^%GSIZE and ^Integrity: Understand when and how to use these utilities for data volume management and integrity verification
- OpenMetrics and ^SystemPerformance: Know the key metrics to monitor and how to interpret performance reports
- Event Log Event Types: Distinguish between Error, Warning, Info, Trace, and Assert entries and know their significance
- AlertOnError and AlertGracePeriod: Understand how component alerts are configured and how escalation works
- Production Monitor Status Colors: Know what green, yellow, red, and gray indicate and how to respond to each
- Business Rule Log: Be able to find rule evaluation entries and interpret conditions and actions to debug routing logic
Common Exam Scenarios
- Configuring a message purge task with appropriate retention periods for a production UCR environment
- Running ^%GSIZE to identify which globals are consuming the most space and recommending actions
- Interpreting a ^SystemPerformance report to identify a disk I/O bottleneck
- Analyzing Event Log entries to diagnose a data feed failure at an Edge Gateway
- Configuring AlertOnError on critical production components and setting appropriate grace periods
- Using the Production Monitor to identify a component with growing queue depth and diagnosing the cause
- Debugging a routing rule using the Business Rule Log to determine why messages are being sent to the wrong target
- Diagnosing a cache efficiency problem using OpenMetrics data and recommending solutions
Hands-On Practice Recommendations
- Configure and run Ens.Util.Tasks.Purge in a development environment; verify messages are removed after the retention period
- Run ^%GSIZE in a UCR namespace and identify the largest globals; correlate them with production components
- Execute ^SystemPerformance and review the report, focusing on cache efficiency and disk I/O
- Generate errors intentionally and practice finding and interpreting Event Log entries
- Configure AlertOnError on a Business Service, trigger an alert, and observe it in the Management Console
- Monitor a running production using the Production Monitor; stop a component and observe the status change
- Create a business rule, process test messages, and use the Business Rule Log to trace rule evaluation decisions
- Access the OpenMetrics endpoint and identify key metrics for system health monitoring