What is the CHM resource name and how to monitor CHM status?
- ora.crf is the Cluster Health Monitor resource name that ohasd manages.
- Issue “$GRID_HOME/bin/crsctl status res ora.crf -init” to check the current status of the Cluster Health Monitor
- Use “oclumon manage -get MASTER” to display the master node (the node running the master ologgerd)
- CHM is already installed with 11.2.0.2, 11.2.0.3, 11.2.0.4 and 12.1.0.1
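The status check above can be scripted. The following is a minimal sketch, assuming the crsctl output uses the NAME=/TYPE=/TARGET=/STATE= layout shown in the session output elsewhere in this article. The function name chm_state is a hypothetical helper and not part of Grid Infrastructure.

```shell
#!/bin/sh
# chm_state: hypothetical helper (not an Oracle tool).
# Reads "crsctl status res ora.crf -init" output on stdin and prints the
# first word of the STATE= line, for example ONLINE or OFFLINE.
chm_state() {
    awk -F= '$1 == "STATE" { split($2, parts, " "); print parts[1]; exit }'
}

# Example (needs a Grid Infrastructure node, so it is left commented out):
#   "$GRID_HOME/bin/crsctl" status res ora.crf -init | chm_state
```

Parsing only the STATE= line keeps the helper independent of the node name that crsctl appends after the state.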
CHM metrics ( 12.1 )
- SYSTEM View Metric Descriptions
- PROCESSES View Metric Descriptions
- DEVICES View Metric Descriptions
- NICS View Metric Descriptions
- FILESYSTEMS View Metric Descriptions
- PROTOCOL ERRORS View Metric Descriptions
- CPUS View Metric Descriptions
What are the processes and components for the Cluster Health Monitor?
Cluster Logger Service (Ologgerd) – there is a master ologgerd that receives the data from the other nodes and saves it in the repository (Berkeley database). It compresses the data before persisting it to save disk space. In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd is not running. The master ologgerd keeps the replica in sync by sending it the data. The replica ologgerd takes over if the master ologgerd dies. A new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster.

System Monitor Service (Sysmond) – the sysmond process (listed as osysmond.bin in ps output) collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects system statistics including CPU, memory usage, platform info, disk info, NIC info, process info, and filesystem info.
Locate CHM log directory
Check the CHM resource status and locate the master node:
    [grid@grac41 ~]$ $GRID_HOME/bin/crsctl status res ora.crf -init
    NAME=ora.crf
    TYPE=ora.crf.type
    TARGET=ONLINE
    STATE=ONLINE on grac41
    [grid@grac41 ~]$ oclumon manage -get MASTER
    Master = grac43

Log in to grac43 and locate the CHM log directory (the -d argument of the ologgerd process):
    [root@grac43 ~]# ps -elf | grep ologgerd | grep -v grep
    .... /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac43
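The two lookups above (master node, then the directory passed to ologgerd) can be sketched as small text filters. This is only an illustration: master_node and ologgerd_dir are hypothetical helper names, and the parsing assumes the "Master = <node>" and "ologgerd -M -d <dir>" output formats shown in this article.

```shell
#!/bin/sh
# master_node: hypothetical helper.
# Reads "oclumon manage -get MASTER" output on stdin and prints the node
# name taken from the "Master = <node>" line.
master_node() {
    sed -n 's/^Master *= *\([^ ]*\).*$/\1/p'
}

# ologgerd_dir: hypothetical helper.
# Reads ps output on stdin and prints the argument that follows -d on the
# first ologgerd command line it finds.
ologgerd_dir() {
    awk '/ologgerd/ {
        for (i = 1; i < NF; i++)
            if ($i == "-d") { print $(i + 1); exit }
    }'
}

# Example (left commented out; needs a running cluster):
#   oclumon manage -get MASTER | master_node
#   ps -elf | ologgerd_dir
```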
Comparison of OSWatcher and CHM
- CHM's CPU overhead for a single run is lower than OSWatcher's because CHM does not use iostat or vmstat to collect data
- OSWatcher runs with user priority, whereas CHM runs with RT (real-time) priority ( CHM should be able to collect data even under CPU starvation )
- OSWatcher additionally collects top, traceroute, and netstat output, so it does a better job of tracing network-related stats
- TFA can reduce the number of uploaded files
How to start and stop CHM that is installed as a part of GI in 11.2 and higher
Starting and stopping the ora.crf resource starts and stops CHM.

Check status:
    $GRID_HOME/bin/crsctl status res ora.crf -init

To stop CHM (the ora.crf resource managed by ohasd):
    $GRID_HOME/bin/crsctl stop res ora.crf -init

To start CHM (the ora.crf resource managed by ohasd):
    $GRID_HOME/bin/crsctl start res ora.crf -init

Check status on a specific node:
    $ ssh grac42 $GRID_HOME/bin/crsctl status res ora.crf -init | grep STATE
    STATE=ONLINE on grac42
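The per-node status check can be repeated for several nodes in a loop. The sketch below is one possible way to do that, under these assumptions: passwordless ssh is configured, GRID_HOME has the same value on every node, and the node names are passed as arguments. chm_state_all and the RUN_REMOTE hook are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# RUN_REMOTE: hypothetical hook naming the command used to reach a node.
# It defaults to ssh and can be replaced, for example for a dry run.
RUN_REMOTE=${RUN_REMOTE:-ssh}

# chm_state_all: hypothetical helper.
# For each node given as an argument, runs the crsctl status command on
# that node and prints "<node>: <text after STATE=>".
# Prints UNKNOWN when no STATE= line comes back.
chm_state_all() {
    for node in "$@"; do
        state=$("$RUN_REMOTE" "$node" \
            "$GRID_HOME/bin/crsctl status res ora.crf -init" |
            sed -n 's/^STATE=//p')
        printf '%s: %s\n' "$node" "${state:-UNKNOWN}"
    done
}

# Example (left commented out; needs ssh access to the cluster nodes):
#   chm_state_all grac41 grac42 grac43
```

As in the ssh command above, $GRID_HOME is expanded on the local node before the command is sent to the remote node.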
Error CRS-9011 running oclumon – ologgerd daemon not started
    $ oclumon dumpnodeview -n grac41 -last "00:15:00"
    CRS-9011-Error dumpnodeview: Failed to initialize connection to the Cluster Logger Service
    $ ps -ef | egrep "sysmond|loggerd"
    root 3820 1 2 Feb20 ? 00:26:15 /u01/app/11204/grid/bin/osysmond.bin
    --> The ologgerd daemon is not running

Fix
1. Stop ora.crf as the root user on all nodes
    # /u01/app/11204/grid/bin/crsctl stop res ora.crf -init
    CRS-2673: Attempting to stop 'ora.crf' on 'grac41'
    CRS-2677: Stop of 'ora.crf' on 'grac41' succeeded
2. Comment out the "BDBSIZE" entry and save the changes ( file $GRID_HOME/crf/admin/crfgrac41.ora )
3. Start the ora.crf resource on all nodes
    # /u01/app/11204/grid/bin/crsctl start res ora.crf -init
    CRS-2672: Attempting to start 'ora.crf' on 'grac41'
    CRS-2676: Start of 'ora.crf' on 'grac41' succeeded
4. Verify that the ologgerd daemon is running
    # ps -ef | egrep "sysmond|loggerd"
    root 27213 1 4 11:22 ? 00:00:00 /u01/app/11204/grid/bin/osysmond.bin
    root 27227 1 4 11:22 ? 00:00:00 /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac41
    root 27243 20061 0 11:22 pts/7 00:00:00 egrep sysmond|loggerd
    --> The ologgerd daemon is now running
5. Verify that oclumon is working now
    $ oclumon manage -get MASTER
    Master = grac41

Done
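The process check in step 4 can be turned into a pass/fail test. The following is a minimal sketch and not an Oracle utility: chm_daemons_ok is a hypothetical helper that reads ps output and reports which of the two daemons is missing. Since ologgerd runs only on the master and replica nodes, the check is meaningful only there (grac41 in the session above).

```shell
#!/bin/sh
# chm_daemons_ok: hypothetical helper.
# Reads "ps -ef" output on stdin. Returns 0 only if both an osysmond and
# an ologgerd process are listed; otherwise prints what is missing and
# returns 1. Lines that belong to grep/egrep itself are skipped.
chm_daemons_ok() {
    awk '
        /grep/     { next }
        /osysmond/ { sysmond = 1 }
        /ologgerd/ { loggerd = 1 }
        END {
            if (!sysmond) print "osysmond not running"
            if (!loggerd) print "ologgerd not running"
            exit !(sysmond && loggerd)
        }'
}

# Example (left commented out; run on the master node):
#   ps -ef | chm_daemons_ok && echo "CHM daemons are up"
```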
References:
- Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)
- Displaying CHM data using oclumon