Used Software
- GRID: 11.2.0.3.4
- OEL 6.3
- VirtualBox 4.2.14
Table of Contents
- Using oclumon to detect potential root causes for node evictions ( CPU starvation )
- Using oclumon to detect potential root causes for node evictions ( low swap space )
- Using oclumon to detect potential root causes for node evictions ( Network problem )
- Summary Node Eviction
Using oclumon to detect potential root causes for node evictions ( CPU starvation )
Using oclumon to monitor a CPU-intensive application

Monitor command:
$ oclumon dumpnodeview -n grac2 -last "00:15:00"

----------------------------------------
Node: grac2 Clock: '08-19-13 20.22.26' SerialNo:356
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 100.0;';3:Time=08-19-13 20.22.26, CPU usage on node grac2 (100.0%) is Very High (> 90%). 131 processes are waiting for only 1 CPUs.'
cpuq: 131 physmemfree: 915244 physmemtotal: 4055440 mcache: 1858816 swapfree: 6373372 swaptotal: 6373372
ior: 44 iow: 116 ios: 26 swpin: 0 swpout: 0 pgin: 44 pgout: 68 netr: 16.456 netw: 22.932
procs: 273 rtprocs: 11 #fds: 17184 #sysfdlimit: 6815744 #disks: 11 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'mp_cpu(6822) 69.95' topprivmem: 'ologgerd(5945) 86196' topshm: 'oracle(5201) 105600' topfd: 'ohasd.bin(2852) 720' topthread: 'mp_cpu(6822) 129'
Summary from the above oclumon report:
- The system runs with a single CPU at 100% CPU load
- The program mp_cpu is taking about 70% of the CPU and runs 129 threads
- Even after running mp_cpu for multiple hours, the RAC system does not crash with a node eviction
- If the sample program mp_cpu is written as a realtime program ( sched_setscheduler(0, SCHED_FIFO, ..) ), oclumon may stop working with the following errors: Waiting upto 300 secs for backend… CRS-9103-No data available ( see the sketch after this list )
- If a realtime (RT) process takes all of the CPU, oclumon may skip most of the records ( use top in that case to monitor your system )
- For CPU problems oclumon reports: 'CPU usage on node grac2 (100.0%) is Very High (> 90%)'
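For reference, mp_cpu is just a small private load generator used for this test; its source is not part of this post. The following is a minimal C sketch (an assumption, not the original program) that spawns a number of busy-looping threads and can optionally switch the process to the SCHED_FIFO realtime class via sched_setscheduler(), which reproduces the case where oclumon stops delivering data on a single-CPU box. The name cpu_burn and its command-line arguments are made up for illustration.

/*
 * cpu_burn.c - hypothetical sketch of a CPU-hogging test program similar in
 * spirit to mp_cpu (the original mp_cpu source is not shown in this post).
 * Build: gcc -O2 -pthread cpu_burn.c -o cpu_burn
 * Run:   ./cpu_burn 128          # 128 busy-looping threads
 *        ./cpu_burn 128 rt       # same, but under SCHED_FIFO (needs root)
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void *burn(void *arg)
{
    volatile unsigned long x = 0;
    for (;;)                      /* spin forever to keep one CPU busy */
        x++;
    return NULL;
}

int main(int argc, char **argv)
{
    int i;
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;

    /* Optional: switch the whole process to the realtime FIFO class.
     * This is the sched_setscheduler(0, SCHED_FIFO, ...) case mentioned
     * above that can starve oclumon/CHM on a single-CPU box. */
    if (argc > 2 && strcmp(argv[2], "rt") == 0) {
        struct sched_param sp;
        sp.sched_priority = 1;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");
    }

    for (i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, burn, NULL);
    }
    pause();                      /* keep the main thread alive */
    return 0;
}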
Using oclumon to detect potential root causes for node evictions ( low swap space )
Using oclumon to monitor a memory-leaking application

Stage 1: Normal running system - stable swapfree value of about 4.6 GByte
----------------------------------------
Node: grac1 Clock: '08-20-13 09.27.33' SerialNo:743
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 18.27 cpuq: 4 physmemfree: 2981396 physmemtotal: 4055440 mcache: 341364 swapfree: 4641208 swaptotal: 6373372
ior: 8862 iow: 567 ios: 733 swpin: 1092 swpout: 0 pgin: 4447 pgout: 293 netr: 46.093 netw: 64.767
procs: 285 rtprocs: 11 #fds: 17184 #sysfdlimit: 6815744 #disks: 12 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'oracle(4622) 3.59' topprivmem: 'ologgerd(3415) 88000' topshm: 'ologgerd(3415) 59392' topfd: 'ohasd.bin(2901) 713' topthread: 'console-kit-dae(2639) 64'

Stage 2: A process ( mp_mem ) starts to eat up our memory - swapfree is decreasing ( current value 4.1 GByte )
----------------------------------------
Node: grac1 Clock: '08-20-13 09.29.38' SerialNo:768
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 19.7 cpuq: 4 physmemfree: 91252 physmemtotal: 4055440 mcache: 157784 swapfree: 4103960 swaptotal: 6373372
ior: 17994 iow: 38462 ios: 590 swpin: 486 swpout: 19465 pgin: 9112 pgout: 19619 netr: 26.962 netw: 13.023
procs: 285 rtprocs: 11 #fds: 16928 #sysfdlimit: 6815744 #disks: 12 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'oracle(4622) 3.99' topprivmem: 'mp_mem(8530) 3084812' topshm: 'ologgerd(3415) 57168' topfd: 'ohasd.bin(2901) 714' topthread: 'mp_mem(8530) 85'

Stage 3: The OS is running out of swap space - the kernel may pick a process and kill that application

Monitor command:
$ oclumon dumpnodeview -n grac1 -last "00:15:00"

----------------------------------------
Node: grac1 Clock: '08-20-13 09.37.28' SerialNo:862
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 18.67 cpuq: 2 physmemfree: 125904;';3:Time=08-20-13 09.37.28, Available memory (physmemfree 125904 KB + swapfree 34652 KB) on node grac1 is Too Low (< 10% of total-mem + total-swap)'
physmemtotal: 4055440 mcache: 171264 swapfree: 34652 swaptotal: 6373372
ior: 17330 iow: 3462 ios: 602 swpin: 704 swpout: 1880 pgin: 8662 pgout: 1957 netr: 30.729 netw: 26.343
procs: 286 rtprocs: 11 #fds: 16640 #sysfdlimit: 6815744 #disks: 12 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'oracle(3879) 3.59' topprivmem: 'mp_mem(8530) 2966308' topshm: 'ologgerd(3415) 59212' topfd: 'ohasd.bin(2901) 714' topthread: 'mp_mem(8530) 129'
In the above case the program mp_mem was killed by the Linux OOM killer. You can read the following article on how Linux selects a process for termination due to memory shortage:
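For reference, mp_mem is again a small private test program whose source is not part of this post. A minimal C sketch of such a memory leaker (an assumption, not the original mp_mem) could look like the following: it allocates and touches memory in steps until physical memory and swap are exhausted and the OOM killer terminates it. The name mem_leak and its arguments are made up for illustration.

/*
 * mem_leak.c - hypothetical sketch of a memory-leaking test program similar
 * in spirit to mp_mem (the original mp_mem source is not shown in this post).
 * Build: gcc -O2 mem_leak.c -o mem_leak
 * Run:   ./mem_leak 64           # allocate and touch 64 MB per second
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t mb_per_step = (argc > 1) ? (size_t)atol(argv[1]) : 64;
    size_t step = mb_per_step * 1024 * 1024;
    size_t total = 0;
    char *p;

    for (;;) {
        p = malloc(step);
        if (p == NULL) {
            /* allocation failed before the OOM killer hit us - keep waiting */
            fprintf(stderr, "malloc failed after %lu MB\n",
                    (unsigned long)(total / (1024 * 1024)));
            sleep(5);
            continue;
        }
        memset(p, 0xab, step);    /* touch every page so real memory/swap is used */
        total += step;
        printf("allocated %lu MB\n", (unsigned long)(total / (1024 * 1024)));
        sleep(1);                 /* leak gradually so oclumon samples show the trend */
    }
    return 0;
}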
Using oclumon to detect potential root causes for node evictions ( Network problem )
netrr: Average network receive rate within the current sample interval (KB per second)
neteff: Average effective bandwidth within the current sample interval (KB per second)
nicerrors: Average error rate within the current sample interval (errors per second)

eth2 netrr: 21.005 netwr: 17.449 neteff: 38.454 nicerrors: 0 pktsin: 40 pktsout: 37 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 39 innonunicast: 1 type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.04' SerialNo:30728
eth2 netrr: 14.823 netwr: 16.298 neteff: 31.121 nicerrors: 0 pktsin: 32 pktsout: 34 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 30 innonunicast: 2 type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.14' SerialNo:30730
eth2 netrr: 0.000 netwr: 0.000 neteff: 0.000 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.19' SerialNo:30731
eth2 netrr: 0.000 netwr: 0.000 neteff: 0.000 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PRIVATE latency: <1
Node: grac42 Clock: '03-05-14 16.01.24' SerialNo:30732
- At 03-05-14 16.01.14 the network activity on eth2 drops to zero
- As eth2 is our cluster interconnect, we saw an instance eviction later on ( a simple way to cross-check the interface traffic is sketched below )
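When oclumon itself is affected, or when you want a second opinion on the interconnect traffic, the NIC byte counters under /sys/class/net can be sampled directly. The following C sketch only illustrates that idea and is not how oclumon/CHM collects its data; the interface name eth2 and the 5-second interval are assumptions matching the example above.

/*
 * nic_rate.c - sketch that samples the receive/transmit byte counters of an
 * interface from /sys/class/net/<if>/statistics, as a rough cross-check of
 * the netrr/netwr values oclumon reports for the private interconnect.
 * Build: gcc -O2 nic_rate.c -o nic_rate
 * Run:   ./nic_rate eth2
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_counter(const char *ifname, const char *counter)
{
    char path[256];
    unsigned long long value = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s", ifname, counter);
    f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        exit(1);
    }
    fscanf(f, "%llu", &value);
    fclose(f);
    return value;
}

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth2";
    unsigned long long rx_prev = read_counter(ifname, "rx_bytes");
    unsigned long long tx_prev = read_counter(ifname, "tx_bytes");
    unsigned long long rx, tx;

    for (;;) {
        sleep(5);                               /* 5 s interval, like the oclumon samples above */
        rx = read_counter(ifname, "rx_bytes");
        tx = read_counter(ifname, "tx_bytes");
        printf("%s rx: %.3f KB/s  tx: %.3f KB/s\n", ifname,
               (rx - rx_prev) / 1024.0 / 5.0,
               (tx - tx_prev) / 1024.0 / 5.0);
        rx_prev = rx;
        tx_prev = tx;
    }
    return 0;
}

A sustained drop of both rates to zero on the private interface, as seen in the oclumon records above, is the kind of condition that can lead to an instance eviction.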
Summary Node Eviction
- 11.2.0.3 seems to be quite stable under CPU and memory starvation ( reduced number of node evictions )
- Out-of-memory scenarios may be handled by the Linux kernel by killing certain processes
- Both CPU and memory starvation should be resolved as soon as possible, as cluster performance may drop dramatically
- For memory problems oclumon reports: 'Available memory (physmemfree 125904 KB + swapfree 34652 KB) on node grac1 is Too Low (< 10% of total-mem + total-swap)'