Overview
- The LMHB process reports heartbeat problems of the LMON, LMD, and LMSn processes
- LMHB appears to track whether the LMON, LMD, and LMSn processes have reported any wait states
- If not, LMHB assumes there is a problem and reports it in the LMHB trace file
- The LMHB trace reports low memory and low swap space problems
- LMHB reports system load average data
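The loadavg, Memory, and Swap lines that LMHB writes can be pulled out of a trace file mechanically. Below is a minimal sketch of such a scan; `scan_lmhb_trace` and the 200 MB threshold are hypothetical helpers for illustration, not part of any Oracle tooling.

```python
import re

# Hypothetical helper (not Oracle tooling): scan LMHB trace text for
# "loadavg : ..." and "Memory/Swap (Avail / Total) = ..." lines and flag
# entries where the available amount falls below a threshold (200 MB is
# an arbitrary choice for illustration).
def scan_lmhb_trace(text, low_threshold_mb=200.0):
    findings = []
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"loadavg\s*:\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)", line)
        if m:
            # 1-, 5-, and 15-minute load averages
            findings.append(("loadavg", tuple(float(x) for x in m.groups())))
            continue
        m = re.match(r"(Memory|Swap) \(Avail / Total\) = ([\d.]+)M / ([\d.]+)M", line)
        if m:
            kind, avail, total = m.group(1), float(m.group(2)), float(m.group(3))
            findings.append((kind, avail, total, avail < low_threshold_mb))
    return findings

# Sample lines taken from the trace excerpt discussed in this article
sample = """\
*** 2014-03-22 10:04:33.237
loadavg : 96.87 81.60 42.80
*** 2014-03-22 10:04:36.619
Memory (Avail / Total) = 85.77M / 4252.24M
Swap (Avail / Total) = 105.59M / 5088.00M
"""
for finding in scan_lmhb_trace(sample):
    print(finding)
# -> ('loadavg', (96.87, 81.6, 42.8))
# -> ('Memory', 85.77, 4252.24, True)   True = below threshold
# -> ('Swap', 105.59, 5088.0, True)
```

Both Memory and Swap available values are far below the 200 MB threshold here, which matches the memory-starvation diagnosis below.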
Typical LMHB trace file leading to a Node Eviction due to Memory/CPU starvation
*** 2014-03-22 10:03:47.024
==============================
LMD0 (ospid: 27060) has not moved for 333 sec (1395479026.1395478693)
kjfmGCR_HBCheckAll: LMD0 (ospid: 27060) has status 2 : Not in wait; last wait ended 325 secs ago.
 : last wait_id 811409 at 'ges remote message'.
==============================
Dumping PROCESS LMD0 (ospid: 27060) States
==============================
===[ Callstack ]===
*** 2014-03-22 10:03:47.025
Process diagnostic dump for oracle@grac41.example.com (LMD0), OS id=27060,
 pid: 12, proc_ser: 1, sid: 6, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 10:03:46 ]
    NOTE: scheduling delay has not been sampled for 0.294992 secs
  0.000000 secs from [ 10:03:42 - 10:03:47 ], 5 sec avg
  0.054111 secs from [ 10:03:13 - 10:03:47 ], 1 min avg
  0.314816 secs from [ 09:58:47 - 10:03:47 ], 5 min avg
*** 2014-03-22 10:04:33.237
loadavg : 96.87 81.60 42.80
*** 2014-03-22 10:04:36.619
Memory (Avail / Total) = 85.77M / 4252.24M
Swap (Avail / Total) = 105.59M / 5088.00M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 27060' | /bin/grep -v grep timed out after 2.500 seconds
Skipping stack dump because max dump time exceeded.
-------------------------------------------------------------------------------
*** 2014-03-22 10:04:38.783
==============================
LMS0 (ospid: 27131) has not moved for 420 sec (1395479077.1395478657)
kjfmGCR_HBCheckAll: LMS0 (ospid: 27131) has status 2 : Not in wait; last wait ended 375 secs ago.
 : last wait_id 2601414 at 'gcs remote message'.
==============================
Dumping PROCESS LMS0 (ospid: 27131) States
==============================
===[ Callstack ]===
*** 2014-03-22 10:04:39.220
Process diagnostic dump for oracle@grac41.example.com (LMS0), OS id=27131,
*** 2014-03-22 10:04:41.227
 pid: 13, proc_ser: 1, sid: 131, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
*** 2014-03-22 10:04:43.375
  0.000000 secs at [ 10:04:41 ]
    NOTE: scheduling delay has not been sampled for 1.798665 secs
  0.000000 secs from [ 10:04:38 - 10:04:43 ], 5 sec avg
  0.002454 secs from [ 10:03:43 - 10:04:43 ], 1 min avg
  0.166610 secs from [ 10:00:17 - 10:04:43 ], 5 min avg
loadavg : 100.09 85.92 46.55
Memory (Avail / Total) = 3161.91M / 4252.24M
Swap (Avail / Total) = 2518.66M / 5088.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 D oracle   27131     1  0  58   - - 392355 sleep_ Mar21 ?        00:07:41 ora_lms0_grac41
*** 2014-03-22 10:04:52.964
Short stack dump:
ORA-32516: cannot wait for process 'Unix process pid: 27131, image: oracle@grac41.example.com (LMS0)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 2820 ms
-------------------------------------------------------------------------------
Process diagnostic dump actual duration=11.990000 sec (max dump time=5.000000 sec)

--> LMD0 ( 333 seconds ) and LMS0 ( 375 seconds ) processes have not been progressing for a very long time
    LMD0 and LMS0 are in disk wait status ( very likely due to paging/swapping activity )
    LMHB reports low memory and low swap space around 2014-03-22 10:04:36
      *** 2014-03-22 10:04:36.619
      Memory (Avail / Total) = 85.77M / 4252.24M
      Swap (Avail / Total) = 105.59M / 5088.00M

Interpreting the load stats
loadavg : 100.09 85.92 46.55
    During the last minute about 100.09 threads were running or runnable ( on the run queue or waiting on disk I/O )
    During the last 5 minutes about 85.92 threads were running or runnable ( on the run queue or waiting on disk I/O )
    During the last 15 minutes about 46.55 threads were running or runnable ( on the run queue or waiting on disk I/O )
--> Load increased more than 2x over the last 15 minutes !
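The figures in the trace can be cross-checked by hand. The pair in parentheses after "has not moved for 333 sec" appears to be (current epoch seconds, last-heartbeat epoch seconds), and the loadavg triple is the usual 1-, 5-, and 15-minute averages. A quick sanity check (the epoch-pair interpretation is an assumption based on the numbers matching, not documented behaviour):

```python
# "LMD0 ... has not moved for 333 sec (1395479026.1395478693)"
# Assumption: the pair is (current epoch, last heartbeat epoch) in seconds.
current, last_beat = 1395479026, 1395478693
stall = current - last_beat
print(stall)  # 333 - matches the reported stall duration

# loadavg : 100.09 85.92 46.55  -> 1-, 5-, and 15-minute averages
one_min, fifteen_min = 100.09, 46.55
growth = round(one_min / fifteen_min, 2)
print(growth)  # 2.15 - load more than doubled within 15 minutes
```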