T42.4: Diagnoses and Troubleshoots InterSystems IRIS

Knowledge Review - InterSystems IRIS System Administration Specialist

1. Interpret cconsole.log entries

Key Points

  • cconsole.log records instance startup, shutdown, and critical system events
  • Located in mgr directory of installation
  • Shows system initialization sequence and error conditions
  • Critical for diagnosing startup failures and configuration issues
  • Review after any system restart or failure

Detailed Notes

Purpose and Location

The cconsole.log file is one of the most important diagnostic resources for InterSystems IRIS system administrators. (In InterSystems IRIS itself the console log file is named messages.log; cconsole.log is the name inherited from Caché, and study materials often use the two interchangeably.) Located in the mgr directory (e.g., C:\InterSystems\IRIS\mgr\cconsole.log), this file records the complete sequence of events during instance startup, shutdown, and certain critical runtime conditions.

What Gets Captured

During startup, cconsole.log captures the initialization of all system components including license validation, database mounting, service startup, configuration parameter loading, and system-level error conditions. The log uses a timestamp format showing the exact date and time of each event, making it invaluable for correlating system behavior with operational timelines. Common entries include license key loading and validation, database mount operations and failures, configuration parameter parsing and errors, service initialization (web server, superserver, etc.), memory allocation and system resource configuration, and critical system errors that prevent startup.

Troubleshooting with cconsole.log

When troubleshooting startup problems, cconsole.log is typically the first resource to examine. For example, if the instance fails to start, cconsole.log will show exactly which initialization step failed and often provide error codes or messages explaining why. The log captures both informational messages (normal startup progression) and error messages (failures requiring attention). Key scenarios where cconsole.log provides critical diagnostic information include license expiration or corruption (system won't start), database mount failures (corrupted or missing database files), configuration file errors (iris.cpf syntax or parameter value problems), insufficient system resources (memory allocation failures), port conflicts (services unable to bind to configured ports), and upgrade failures (compatibility or migration issues).

Best Practices

Best practices for using cconsole.log include reviewing it immediately after any failed startup attempt, comparing successful versus failed startup logs to identify differences, archiving logs before major system changes for rollback reference, and incorporating log review into standard operational procedures. The log file persists across restarts and grows continuously, so periodic archiving may be necessary to manage size.

2. Recognize and troubleshoot common startup problems

Key Points

  • License issues: expired, invalid, or missing license keys
  • Database problems: corrupted, unmounted, or inaccessible databases
  • Port conflicts: services unable to bind to configured ports
  • Resource constraints: insufficient memory or disk space
  • Configuration errors: invalid parameters in iris.cpf

Detailed Notes

License-Related Failures

InterSystems IRIS startup failures generally fall into several common categories, each with characteristic symptoms and resolution approaches. License-related failures are among the most common: if the license key is expired, corrupted, or missing, the instance fails to start, with specific license error messages in cconsole.log. Resolution involves installing a valid license key (iris.key in the mgr directory), either through the Management Portal's license activation page or by placing the new key file and activating it with $SYSTEM.License.Upgrade().
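
As a quick Terminal check, the %SYSTEM.License class exposes key details and can activate a replacement key; a minimal sketch (method names from %SYSTEM.License; output depends on the active key):

    USER> write $SYSTEM.License.KeyCustomerName()     ; licensed customer on the active key
    USER> write $SYSTEM.License.KeyExpirationDate()   ; expiration of the active key
    ; after copying a new iris.key into the mgr directory, activate it without a restart:
    USER> do $SYSTEM.License.Upgrade()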

Database Mounting Failures

Database mounting failures occur when the system cannot access or mount one or more configured databases. This might result from corrupted database files, incorrect file permissions, missing database files, or file system issues. The cconsole.log will identify which specific databases failed to mount. Resolution typically involves verifying file existence and permissions, running integrity checks on suspect databases, or restoring from backup if corruption is detected.

Port Conflicts and Resource Constraints

Port conflict errors prevent network services from starting when other applications are already using configured ports. The web server port (typically 52773), superserver port (typically 1972), or other service ports may conflict. Resolution requires either stopping the conflicting application or reconfiguring IRIS to use different ports. Resource constraint failures occur when insufficient system resources prevent initialization - common scenarios include inadequate memory (RAM) for configured buffer sizes, exhausted disk space preventing journal or database expansion, or operating system limitations on shared memory or file handles. The cconsole.log will show memory allocation failures or resource limit errors. Resolution involves freeing resources, reducing configured resource usage, or increasing system limits.

Configuration File Errors

Configuration file errors result from invalid syntax or parameter values in iris.cpf - this might include referencing non-existent directories, specifying invalid parameter values, or conflicting settings. Careful review of cconsole.log error messages identifies the specific problematic parameter. Resolution involves editing iris.cpf directly (when instance is stopped) to correct the errors.
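
For orientation, an illustrative iris.cpf excerpt showing the [Startup] parameters most often involved in port conflicts (parameter names from the standard CPF; values are the common defaults):

    [Startup]
    DefaultPort=1972
    WebServerPort=52773

A mistyped parameter name, a reference to a non-existent directory, or an invalid value in a section like this is reported in cconsole.log during the next startup attempt.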

Systematic Troubleshooting Methodology

Systematic troubleshooting methodology includes examining cconsole.log for specific error messages, checking messages.log for additional context, verifying system resources and prerequisites, validating configuration file syntax and values, and testing fixes in isolated environments before applying to production.

3. Use Terminal to examine system state

Key Points

  • Direct access via Terminal when Management Portal unavailable
  • Key utilities: ^SYSLOG, ^JOURNAL, ^DATABASE, ^PERFMON
  • Examine system state, configuration, and runtime information
  • Review error global ^ERRORS and system globals such as ^rOBJ
  • Execute diagnostic commands for troubleshooting

Detailed Notes

When to Use Terminal Access

The Terminal provides direct access to InterSystems IRIS internals when web-based tools are unavailable or when detailed system examination is required. Terminal access is particularly valuable during troubleshooting scenarios where the Management Portal is inaccessible due to web server issues, authentication problems, or network connectivity failures.

Key Diagnostic Utilities

Several key utilities provide diagnostic capabilities from Terminal. The ^SYSLOG utility allows review of the messages.log file directly from Terminal, displaying recent system messages, errors, and informational entries without needing file system access. The ^JOURNAL utility provides detailed journal information including current journal status, history, statistics, and management capabilities. The ^DATABASE utility enables database examination and management including viewing free space, running integrity checks, and managing database configurations. The ^PERFMON utility offers real-time performance monitoring showing activity rates, resource utilization, and system health metrics.
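
These utilities run from the %SYS namespace; a minimal Terminal sketch (each presents a menu-driven interface):

    USER> zn "%SYS"        ; system utilities live in the %SYS namespace
    %SYS> do ^JOURNAL      ; journal status, history, and management menu
    %SYS> do ^DATABASE     ; database list, free space, and maintenance options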

Examining System Globals

Beyond utilities, Terminal allows direct examination of system globals. The ^ERRORS global contains error messages and can be queried to review recent error conditions. The ^rOBJ global contains the compiled object code for routines and classes, which can help confirm what is actually deployed. Direct global examination using ObjectScript commands enables inspection of application data, system configuration globals, and internal state information.

Common Diagnostic Commands

Common diagnostic commands include "write $zversion" to display the full version and platform string, "do ^SYSLOG" to view system log entries, "do $system.Status.DisplayError(status)" to decode error status codes, "write $namespace" (or "write $system.SYS.NameSpace()") to confirm the current namespace, and "do $system.OBJ.ShowFlags()" to display compilation flags. Terminal also enables executing ObjectScript code snippets for custom diagnostics, examining process-specific variables, and performing system queries not available through other interfaces. For production troubleshooting, Terminal provides emergency access to critical system functions when other interfaces fail. System administrators should maintain familiarity with essential Terminal commands and utilities for effective incident response.
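
Put together, a short diagnostic pass in Terminal might look like the following sketch (output lines are illustrative, and the sc variable is assumed to hold a %Status value):

    USER> write $zversion
    IRIS for UNIX (Ubuntu Server LTS for x86-64) 2024.1 ...
    USER> write $namespace
    USER
    USER> do $SYSTEM.Status.DisplayError(sc)    ; decode a %Status value held in sc
    USER> zwrite ^ERRORS                        ; dump the application error global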

4. Analyze lock contention

Key Points

  • View locks via Process Details page: Lock field shows held locks
  • Identify blocking processes and waiting processes
  • Lock modes: Shared (S) and Exclusive (X)
  • Use Lock Manager to view system-wide lock status
  • Resolve via process termination or application redesign

Detailed Notes

Understanding Lock Contention

Lock contention occurs when multiple processes compete for access to the same global references, with one or more processes waiting for locks held by others. InterSystems IRIS uses locks to coordinate concurrent access and prevent data corruption, but excessive lock contention degrades performance and can cause application slowdowns or hangs.

Identifying Lock Contention

Analyzing lock contention begins with identifying affected processes through the Management Portal's Processes page (System Operations > Processes). The Process Details page displays lock information for each process - the Lock field shows what references the process currently holds locks on, including the lock mode (Shared or Exclusive) and reference name. Shared locks allow multiple readers; Exclusive locks prevent any other access. When diagnosing performance issues or hung processes, examine the Lock field to identify blocking relationships. A process waiting for a lock will show the desired lock in its information, while the blocking process will show that same reference as a held lock. The ^LOCKTAB utility provides system-wide lock analysis, displaying all active locks and which processes hold them.
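
The same lock-table view is available from Terminal when the Portal is unreachable; a minimal sketch:

    USER> zn "%SYS"
    %SYS> do ^LOCKTAB      ; lists each held lock, its mode, and the owning process;
                           ; waiting processes show up against the contested reference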

Symptoms and Causes

Lock contention manifests in several ways: application slowdowns when processes wait for lock release, complete application hangs when circular lock dependencies (deadlocks) occur, or timeout errors when lock wait times exceed configured limits. Common causes of lock contention include poorly designed application logic that holds locks too long, high transaction volumes competing for popular global nodes, inadequate lock granularity (locking too broadly), or bugs causing locks to not be released.

Resolution and Best Practices

Resolving lock contention may involve immediate remediation (terminating blocking processes to release locks) or longer-term solutions (redesigning application logic to reduce lock duration, implementing more granular locking strategies, or optimizing transaction patterns). Best practices for lock management include keeping lock duration as short as possible, implementing consistent lock ordering to prevent deadlocks, using appropriate lock granularity (lock only what's necessary), and including lock timeout handling in application code. Regular monitoring of lock statistics helps identify developing contention problems before they impact users.
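
As an illustration of the timeout-handling practice, a minimal ObjectScript sketch (the ^Orders global, its subscript, and the 5-second timeout are hypothetical):

    Set id = 12345
    Lock +^Orders(id):5                      ; request an exclusive lock, wait up to 5 seconds
    If '$Test {
        Write "Timed out waiting for lock on ^Orders(",id,")",!
        Quit                                 ; fail gracefully instead of hanging
    }
    Set ^Orders(id,"status") = "processed"   ; do the protected work quickly
    Lock -^Orders(id)                        ; release as soon as the update completes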

5. Review database integrity issues

Key Points

  • Run integrity check to detect structural corruption
  • Review integrity check output for error indicators
  • Common issues: pointer corruption, block errors, index inconsistencies
  • May result from hardware failures, unexpected shutdowns, or software bugs
  • Resolution often requires database restore from backup

Detailed Notes

Running Integrity Checks

Database integrity issues represent structural corruption within database files that can lead to data loss, application errors, or system instability. Integrity checking is the primary diagnostic tool for detecting such issues. Execute integrity checks via the Management Portal (System Operations > Databases, then Integrity Check button) or through the Task Manager as a scheduled task. The integrity check performs comprehensive examination of database structures including global node linkage, pointer validity, block structure consistency, index correctness, and free space management structures.
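
Integrity checks can also be started from Terminal in the %SYS namespace. A sketch using the Integrity routine's documented entry points (the database directory is illustrative; confirm the entry-point signatures against your version's documentation):

    %SYS> set dirs = $listbuild("/intersystems/iris/mgr/user/")   ; databases to check
    %SYS> set sc = $$CheckList^Integrity(, dirs)                  ; run the check
    %SYS> do Display^Integrity()                                  ; display the results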

Interpreting Results

The output report indicates overall database health and identifies specific problems if found. Clean output shows all structures verified successfully; problematic output displays error messages identifying specific globals or blocks with corruption. Common integrity error types include pointer errors (references to invalid or incorrect blocks), block header corruption (damaged block metadata), index inconsistencies (index entries not matching data), and free space map errors (incorrect free space tracking).

Root Causes

Integrity issues typically result from several root causes: hardware failures including disk controller errors, bad sectors, or memory corruption writing blocks incorrectly; unexpected system shutdowns from power failures or operating system crashes interrupting database writes; software bugs (rare) in IRIS itself or third-party tools accessing database files directly; and file system corruption from underlying storage problems.

Response and Prevention

When integrity errors are detected, response depends on severity. Minor errors might be isolated to specific globals that can be rebuilt from source data. Major corruption affecting critical structures typically requires restoring the database from backup and replaying journals to restore recent transactions. The "freeze on error" configuration option prevents continued operation when serious database or journal errors occur, protecting against propagating corruption. Prevention strategies include using reliable hardware with ECC memory and enterprise-grade storage, implementing comprehensive backup strategies with tested restore procedures, enabling journaling for all critical databases to support recovery, monitoring disk and file system health proactively, and performing regular scheduled integrity checks to detect problems early before they worsen.

6. Identify and resolve journal problems

Key Points

  • Journal space exhaustion causes system freeze if "freeze on error" enabled
  • Monitor journal directory space continuously
  • Journal write failures prevent transaction commits
  • Alternate journal directory provides failover capability
  • Recovery requires journal integrity for replay

Detailed Notes

Journal Space Exhaustion

Journal-related problems can cause severe system impact ranging from performance degradation to complete system freeze. The most critical journal issue is journal space exhaustion - when the journal directory runs out of disk space, IRIS cannot create new journal files or continue writing to current files. With "freeze on error" enabled (recommended for production), the system halts all operations rather than risk data loss, requiring manual intervention to free space and resume operations. Preventive monitoring of journal directory space is essential - automated alerts should trigger when free space drops below 20%.
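
One way to automate the space check is a small ObjectScript task; a minimal sketch using %Library.File (the journal directory path and the 20% threshold are illustrative, and a production version would read the configured journal directory rather than hard-coding it):

    Set dir = "/iris/journal/"               ; illustrative journal directory
    Set sc = ##class(%Library.File).GetDirectorySpace(dir, .free, .total, 1)  ; 1 = sizes in MB
    If $SYSTEM.Status.IsOK(sc) && (total > 0) {
        Set pct = free / total * 100
        Write "Journal directory ",dir,": ",$FNUMBER(pct,"",1),"% free",!
        If pct < 20 { Write "ALERT: journal free space below 20%",! }
    }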

Performance and Corruption Issues

Journal write performance problems manifest as transaction slowdowns, since transactions cannot commit until journal writes complete. Diagnosis involves monitoring journal write latency through ^PERFMON or system monitoring tools. Resolution might include moving the journal directory to faster storage, increasing the journal buffer size, or investigating underlying storage performance issues. Journal corruption, while rare, prevents journal replay for recovery; it might result from hardware failures while writing journal files, file system corruption, or software bugs. Journal integrity can be verified using the ^JOURNAL utility before attempting recovery operations. If corruption affects journal files needed for the required recovery window, recovery may only be possible up to the last known good journal file, resulting in the loss of subsequent transactions.

Alternate Directory and Switch Failures

The alternate journal directory configuration provides protection against journal write failures: if the primary directory becomes unavailable (disk full, hardware failure, permission issues), IRIS automatically switches to writing to the alternate directory, preventing a system freeze. For this protection to be meaningful, the alternate directory must be on separate physical storage. Journal switch failures prevent normal journal file rotation, potentially leading to oversized journal files or space exhaustion; causes include file system permission problems, an unavailable directory, or configuration errors. Diagnosis involves reviewing cconsole.log and messages.log for journal switch error messages. Resolution addresses the underlying cause: permission fixes, directory creation, or configuration correction.

Best Practices

Best practices for journal system health include monitoring journal directory space with automated alerting, configuring alternate journal directory on separate storage, performing regular journal file backups and purging, validating journal integrity periodically, and maintaining documentation of journal configuration and procedures.

7. Address performance bottlenecks

Key Points

  • Identify bottleneck type: CPU, memory, disk I/O, or lock contention
  • Use ^PERFMON for real-time monitoring and statistics
  • Examine process details to identify resource-consuming processes
  • Review global and routine buffer efficiency
  • Analyze query execution plans for SQL performance issues

Detailed Notes

Types of Performance Bottlenecks

Performance troubleshooting requires systematic identification of resource bottlenecks and their root causes. Performance problems generally fall into categories: CPU saturation (processes waiting for CPU time), memory pressure (insufficient buffers causing excessive disk I/O), disk I/O bottlenecks (slow storage limiting throughput), network latency (for distributed or client-server applications), and lock contention (processes waiting for locks).

Using ^PERFMON for Diagnosis

The ^PERFMON utility provides real-time performance monitoring showing metrics across all categories including commands per second, global references per second, disk reads and writes, buffer hit ratios, lock waits, and process activity. Begin diagnosis by running ^PERFMON to identify which resources show concerning patterns.
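
A minimal Terminal sketch (menu wording varies by version):

    USER> zn "%SYS"
    %SYS> do ^PERFMON      ; menu-driven: start collection, reproduce or observe the
                           ; problem window, then generate a report and stop collection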

Diagnosing Specific Bottlenecks

  • CPU bottlenecks show high CPU utilization percentages with processes in running or runnable states. Resolution might involve optimizing application code to reduce computational work, adding CPU capacity, or redistributing workload.
  • Memory bottlenecks manifest as low buffer hit ratios (global buffer or routine buffer) with high physical disk reads: the system is repeatedly reading from disk rather than finding data in memory buffers. Resolution involves increasing global buffer or routine buffer sizes in configuration (requires restart) or optimizing application access patterns to improve locality of reference.
  • Disk I/O bottlenecks show high disk queue depths, slow response times, and processes waiting in disk I/O wait states. Resolution might include moving to faster storage (SSD instead of HDD), distributing databases across multiple spindles, or optimizing application I/O patterns.
  • Lock contention shows processes in lock wait states, with lock statistics showing high lock wait times. Resolution requires application design changes to reduce lock duration or granularity.

SQL and Process-Level Analysis

For SQL performance issues, examine query execution plans using EXPLAIN or SQL query statistics to identify inefficient queries, missing indexes, or poor optimization. The Process Details page helps identify specific processes consuming resources - sort by CPU time, memory usage, or global references to find problematic processes. Once identified, examine what routine the process is executing and its activity pattern.
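
For example, from Terminal you can open the SQL shell and prefix a query with EXPLAIN to view its plan (the ">>" prompt belongs to the SQL shell; the table and query are illustrative):

    USER> do $SYSTEM.SQL.Shell()
    >> EXPLAIN SELECT Name FROM Sample.Person WHERE Name %STARTSWITH 'Sm'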

Iterative Troubleshooting Approach

Performance troubleshooting is iterative: identify bottleneck, form hypothesis about cause, test hypothesis through monitoring or controlled changes, implement resolution, and verify improvement. Documentation of performance baselines helps recognize when performance degrades from normal levels.

8. Run diagnostic tools (IRISHung/Diagnostic Report)

Key Points

  • IRISHung script collects diagnostic data when system is unresponsive
  • Diagnostic Report task generates comprehensive system analysis
  • irisstat provides detailed runtime statistics and snapshots
  • Configure Diagnostic Report via Management Portal
  • Use for WRC (Worldwide Response Center) support cases

Detailed Notes

IRISHung Script

InterSystems IRIS provides several diagnostic tools for collecting system information during troubleshooting, particularly when the system is hung or unresponsive. The IRISHung script (irishung.sh on UNIX/Linux, IRISHung.bat on Windows) is specifically designed to gather diagnostic information when the system appears to be hung or unresponsive. This script collects irisstat snapshots, process information, operating system statistics, and other critical diagnostic data without requiring the Management Portal or normal system interfaces to be functional. The collected data is essential for the InterSystems Worldwide Response Center (WRC) when diagnosing complex system issues.

Diagnostic Report Task

The Diagnostic Report task provides automated generation of comprehensive system diagnostic information. Access this through the Management Portal at System Operations > Diagnostic Reports. Configuration options include specifying the archive directory for reports, configuring email notification for report availability, and selecting which information categories to include. The report includes basic information (system status, license details, error logs) and advanced information (multiple irisstat snapshots, network diagnostics, configuration details).

The irisstat Utility

The irisstat utility is a low-level diagnostic tool that provides detailed statistics about the running IRIS instance. Run irisstat from the operating system command line using platform-specific procedures: on Windows, navigate to the installation's bin directory and run irisstat; on UNIX/Linux, run "iris stat <instance>" or run irisstat directly from the bin directory. irisstat options control output format and detail level; output can be directed to text files for later analysis or automatically included in Diagnostic Reports.

Best Practices for Diagnostic Tools

For urgent support situations, the WRC typically requests irisstat output along with log files and Diagnostic Reports. Having these diagnostic tools readily available and understanding how to run them ensures rapid response to system problems. Best practices include running periodic Diagnostic Reports for baseline documentation, testing IRISHung script execution procedure before emergencies, and documenting local procedures for collecting diagnostic information.

9. Manage access restrictions and emergency access

Key Points

  • OS-based authentication enables emergency access when locked out
  • Terminal access provides direct system access bypassing web interfaces
  • ^SECURITY utility manages users and passwords from command line
  • Services can be enabled/disabled to control access points
  • Document emergency access procedures before they're needed

Detailed Notes

Emergency Access Scenarios

Managing access restrictions and emergency access procedures is critical for maintaining system availability while preserving security. Emergency access scenarios typically occur when administrators are locked out due to password issues, authentication system failures, or security misconfigurations.

OS-Based Authentication

OS-based authentication provides one of the primary emergency access mechanisms. When enabled for the Terminal service, users who are authenticated at the operating system level can access IRIS without additional password prompts. This allows system administrators with local OS access to reach the Terminal even when other authentication mechanisms fail. To enable this, navigate to System Administration > Security > Services and allow operating-system authentication on the Terminal service (%Service_Terminal on UNIX/Linux; %Service_Console on Windows).

The ^SECURITY Utility

The ^SECURITY utility provides command-line access to security management functions, enabling administrators to create or modify user accounts, reset passwords, and adjust roles directly from the Terminal. This is particularly valuable when the Management Portal is inaccessible. Run "do ^SECURITY" from a Terminal prompt to access the menu-driven interface for user management, role assignment, service configuration, and other security functions.
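
A minimal sketch of reaching it from an OS-authenticated Terminal session:

    USER> zn "%SYS"        ; security utilities run from the %SYS namespace
    %SYS> do ^SECURITY     ; menu-driven: user setup, role setup, service setup,
                           ; and related security configuration options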

Service Management and Emergency Procedures

Service management allows controlling which access points are available. Services can be enabled or disabled through the Management Portal (System Administration > Security > Services) or using the ^SECURITY utility. Temporarily disabling non-essential services can isolate access during security incidents, while ensuring critical services remain available for legitimate administrative access. For emergency situations, prepare documented procedures covering:

  • OS credentials for local system access
  • procedures for accessing Terminal with OS authentication
  • ^SECURITY utility commands for password reset
  • steps for re-enabling disabled services
  • escalation contacts for InterSystems WRC support

Access restrictions should be applied carefully: overly restrictive configurations can prevent emergency recovery. Best practice is to always maintain at least one path for authorized administrative access (typically Terminal with OS authentication) while securing other access points appropriately.

Exam Preparation Summary

Critical Concepts to Master:

  1. Log Files: Understand cconsole.log (startup/shutdown), messages.log (runtime), and their locations
  2. Startup Sequence: Know common startup failure points (license, database mount, port binding)
  3. Lock Analysis: Understand how to identify lock contention using Process Details
  4. Integrity Checks: Recognize when and how to run integrity checks
  5. Journal Issues: Understand impact of journal space exhaustion and "freeze on error"
  6. Performance Tools: Know how to use ^PERFMON, Process Details, and Terminal utilities
  7. Error Interpretation: Ability to read and interpret error messages from logs

Common Exam Scenarios:

  • Diagnosing why an instance won't start using cconsole.log
  • Identifying and resolving lock contention between processes
  • Responding to journal space exhaustion
  • Interpreting database integrity check failures
  • Using Terminal utilities when Management Portal unavailable
  • Identifying performance bottlenecks from symptoms

Hands-On Practice Recommendations:

  • Review cconsole.log and messages.log during normal and failed startups
  • Practice using ^PERFMON to monitor system performance
  • Use Terminal utilities (^JOURNAL, ^DATABASE, ^SYSLOG) for diagnostics
  • Examine Process Details to analyze lock holdings
  • Run integrity checks and interpret output
  • Simulate and resolve journal space issues in test environment
  • Practice troubleshooting startup failures
  • Use EXPLAIN to analyze SQL query performance