Cluster Health Monitor

What is the CHM resource name and how do you monitor CHM status?

  • ora.crf is the Cluster Health Monitor resource name that ohasd manages.
  • Run "$GRID_HOME/bin/crsctl status res ora.crf -init" to check the current status of the Cluster Health Monitor.
  • Run "oclumon manage -get MASTER" to display the master node.
  • CHM is installed as part of 11.2.0.2, 11.2.0.3, 11.2.0.4 and 12.1.0.1.
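
The status check above can be scripted. The following is a minimal sketch, assuming the crsctl output has the NAME=/TARGET=/STATE= form shown later in this post; chm_is_healthy is a made-up helper name, not part of any Oracle tool.

```shell
#!/bin/bash
# Sketch: decide whether CHM is healthy from the output of
# "crsctl status res ora.crf -init" read on stdin.
# chm_is_healthy is a hypothetical helper name.
# It succeeds (exit 0) only if both TARGET and STATE report ONLINE.
chm_is_healthy() {
    awk -F= '
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  { split($2, s, " "); state = s[1] }  # "ONLINE on grac41" -> "ONLINE"
        END { exit !(target == "ONLINE" && state == "ONLINE") }
    '
}

# Usage on a cluster node (not executed here):
#   "$GRID_HOME/bin/crsctl" status res ora.crf -init | chm_is_healthy && echo "CHM is online"
```

Reading from stdin keeps the helper independent of where crsctl lives, so it can also be fed saved output.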

 

CHM metrics (12.1)

 

 

What are the processes and components for the Cluster Health Monitor?

Cluster Logger Service (ologgerd) – a master ologgerd receives the data from the other nodes and saves
it in the repository (Berkeley DB). It compresses the data before persisting it to save disk space.
In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd
is not running. The master ologgerd keeps the replica in sync by sending it the data.
The replica ologgerd takes over if the master ologgerd dies, and a new replica ologgerd starts when the
replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster.

System Monitor Service (osysmond) – the osysmond process collects the system statistics of the local node and sends
the data to the master ologgerd. An osysmond process runs on every node and collects system statistics including
CPU, memory usage, platform info, disk info, NIC info, process info, and filesystem info.
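
A quick way to see which of these daemons run on a node is to filter the process list. The following is a minimal sketch, assuming the binaries are named osysmond.bin and ologgerd as in the ps output later in this post; chm_daemons is a made-up helper name.

```shell
#!/bin/bash
# Sketch: report which CHM daemons appear in "ps -ef" output read on stdin.
# chm_daemons is a hypothetical helper name. It prints two lines:
#   osysmond=running|missing
#   ologgerd=running|missing
chm_daemons() {
    awk '
        /\/osysmond(\.bin)?( |$)/ { sysmond = 1 }   # e.g. .../grid/bin/osysmond.bin
        /\/ologgerd( |$)/         { loggerd = 1 }   # e.g. .../grid/bin/ologgerd -M -d ...
        END {
            print "osysmond=" (sysmond ? "running" : "missing")
            print "ologgerd=" (loggerd ? "running" : "missing")
        }
    '
}

# Usage on a cluster node (not executed here):
#   ps -ef | chm_daemons
```

Because there is only one master and one replica ologgerd per cluster, "ologgerd=missing" is expected on nodes that host neither of them.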

 

Locate CHM log directory

Check the CHM resource status and locate the master node:
[grid@grac41 ~]$ $GRID_HOME/bin/crsctl status res ora.crf -init
NAME=ora.crf
TYPE=ora.crf.type
TARGET=ONLINE
STATE=ONLINE on grac41
[grid@grac41 ~]$ oclumon manage -get MASTER
Master = grac43

Log in to grac43 and locate the CHM log directory (the -d argument of the ologgerd process):
[root@grac43 ~]# ps -elf | grep ologgerd | grep -v grep
 .... /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac43
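
The two lookups above can be wrapped in small helpers. This is only a sketch, assuming the output formats shown above ("Master = <node>" and an ologgerd command line with a -d argument); chm_master and ologgerd_dir are made-up names.

```shell
#!/bin/bash
# Sketch: helpers for the two lookups above. Both function names are hypothetical.

# Print the node name from "oclumon manage -get MASTER" output ("Master = grac43").
chm_master() {
    sed -n 's/^Master *= *\([^ ]*\).*$/\1/p'
}

# Print the directory passed to ologgerd with -d, given "ps -elf" output on stdin.
ologgerd_dir() {
    awk '/\/ologgerd( |$)/ {
        for (i = 1; i < NF; i++)
            if ($i == "-d") { print $(i + 1); exit }
    }'
}

# Usage on a cluster node (not executed here):
#   master=$(oclumon manage -get MASTER | chm_master)
#   ssh "$master" ps -elf | ologgerd_dir
```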

Comparison of OSWatcher and CHM

  • CHM CPU overhead for a single run is lower than OSWatcher's because CHM does not use iostat or vmstat to collect data
  • OSWatcher runs with user priority, whereas CHM runs with real-time (RT) priority, so CHM should be able to collect data even under CPU starvation
  • OSWatcher does a better job of tracing network-related statistics with tools such as traceroute and netstat, and it also captures top output
  • TFA (Trace File Analyzer) can reduce the number of files that need to be uploaded

 

How to start and stop CHM that is installed as a part of GI in 11.2 and higher

Starting and stopping the ora.crf resource starts and stops CHM.
Check status:
$GRID_HOME/bin/crsctl status res ora.crf -init

To stop CHM (the ora.crf resource managed by ohasd):
$GRID_HOME/bin/crsctl stop res ora.crf -init

To start CHM (the ora.crf resource managed by ohasd):
$GRID_HOME/bin/crsctl start res ora.crf -init

Check status on a specific node:
$ ssh grac42 $GRID_HOME/bin/crsctl status res ora.crf -init | grep STATE
STATE=ONLINE on grac42
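
To check several nodes in one pass, the per-node command can be put in a loop. The following is a sketch under stated assumptions: the node names are passed by the caller, $GRID_HOME is set locally and is the same path on every node, and chm_state_all and CHM_RUN are made-up names.

```shell
#!/bin/bash
# Sketch: print the ora.crf STATE line for each node given as an argument.
# chm_state_all and CHM_RUN are hypothetical names. CHM_RUN is the command
# used to reach a node; it defaults to ssh, as in the example above.
CHM_RUN=${CHM_RUN:-ssh}

chm_state_all() {
    local node
    for node in "$@"; do
        # $GRID_HOME is expanded locally, so it is assumed to be identical
        # on all nodes. </dev/null keeps ssh from consuming the caller's stdin.
        "$CHM_RUN" "$node" "$GRID_HOME/bin/crsctl status res ora.crf -init" </dev/null \
            | grep '^STATE=' \
            || echo "STATE=UNKNOWN on $node"   # no STATE line (node unreachable, etc.)
    done
}

# Usage (not executed here):
#   chm_state_all grac41 grac42 grac43
```

A node that returns no STATE line is reported as UNKNOWN instead of being skipped silently.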

Error CRS-9011 running oclumon – ologgerd daemon not started

$ oclumon dumpnodeview -n grac41 -last "00:15:00"
CRS-9011-Error dumpnodeview: Failed to initialize connection to the Cluster Logger Service
$ ps -ef | egrep "sysmond|loggerd"
root      3820     1  2 Feb20 ?        00:26:15 /u01/app/11204/grid/bin/osysmond.bin
--> The ologgerd daemon is not running

Fix
1. Stop ora.crf as root user on all nodes
# /u01/app/11204/grid/bin/crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'grac41'
CRS-2677: Stop of 'ora.crf' on 'grac41' succeeded

2. Comment out the "BDBSIZE" entry in $GRID_HOME/crf/admin/crfgrac41.ora and save the changes.
3. Start the ora.crf resource on all nodes
# /u01/app/11204/grid/bin/crsctl  start res ora.crf -init
CRS-2672: Attempting to start 'ora.crf' on 'grac41'
CRS-2676: Start of 'ora.crf' on 'grac41' succeeded
4. Verify that ologgerd daemon is running 
#  ps -ef | egrep "sysmond|loggerd"
root     27213     1  4 11:22 ?        00:00:00 /u01/app/11204/grid/bin/osysmond.bin
root     27227     1  4 11:22 ?        00:00:00 /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac41
root     27243 20061  0 11:22 pts/7    00:00:00 egrep sysmond|loggerd
--> The ologgerd daemon is now running
5. Verify oclumon is working now
$ oclumon manage -get MASTER
Master = grac41
 Done
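
Step 2 of the fix can be scripted instead of edited by hand. The following is a rough sketch only. It assumes the entry is a line that starts with BDBSIZE and that "#" is accepted as the comment marker in this file; neither detail is stated above, so check the file before using it. comment_out_bdbsize is a made-up helper name.

```shell
#!/bin/bash
# Sketch of step 2: comment out the BDBSIZE entry in a CHM configuration file.
# comment_out_bdbsize is a hypothetical helper name.
# ASSUMPTIONS (not confirmed by the text above): the entry is a line that
# starts with "BDBSIZE", and "#" is accepted as the comment marker.
comment_out_bdbsize() {
    local file=$1
    cp -p "$file" "$file.bak" || return 1   # keep a backup of the original
    # Prefix every line starting with BDBSIZE with "#"; other lines pass through.
    # Redirecting into the existing file keeps its owner and permissions.
    sed 's/^BDBSIZE/#&/' "$file.bak" > "$file"
}

# Usage as root, with ora.crf stopped (not executed here):
#   comment_out_bdbsize "$GRID_HOME/crf/admin/crfgrac41.ora"
```

The backup copy (suffix .bak) allows the change to be reverted if ologgerd still does not start.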
