Debugging Node Eviction triggered by low CPU and/or Paging/Swapping by using TFA, CHM, OSWatcher

Key Facts

  • OSWatcher may not provide accurate data, as the OSWatcher commands run at a lower priority than CHM
  • LMHB monitors the LMON, LMD and LMSn processes to ensure they are running normally without blocking or spinning - review its trace files
  • It is very likely that we get LMHB traces, because the LMON and LMS RAC processes will not get enough CPU when running this test case
  • An overloaded RAC instance ( either CPU starvation or Paging/Swapping ) can cause a Node Eviction with an IPC Send timeout message
  • When you get a Node Eviction with an IPC Send timeout message, review the OSWatcher and CHM files carefully
  • Significant Errors:
    IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
    LMS0 (ospid: 10410) has detected no messaging activity from instance 1
    LMS0 (ospid: 10410) issues an IMR to resolve the situation
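
A quick check, before collecting anything with TFA, is a plain grep against the instance alert log for the key message. The path below assumes the default ADR layout and uses <dbname> as a placeholder:
# grep -i "IPC Send timeout" $ORACLE_BASE/diag/rdbms/<dbname>/grac41/trace/alert_grac41.log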

 

Program mp_stress to trigger node evictions by creating CPU/Memory starvation

 Compile :
 cc -D_REENTRANT -std=gnu99 -o mp_stress mp_stress.c -lpthread -lm
 Run it :
 # ./mp_stress -t 4  -m 20 -p 50 -c 200
    -t   4 : use 4 threads - the more RT threads you run, the more quickly your system gets overloaded.
             For a 2-CPU system use 4 threads. Running 2 threads per CPU with a higher priority than
             our LMS and LMD0 processes will create a node eviction.
    -m  20 : allocate 20 MByte of memory for each run ( about 1 second ) - paging and swapping will
             occur sooner or later on your system.
    -p  50 : run this process in the SCHED_RR class with priority 50 - you can increase this value up to 99.
    -c 200 : controls processing time - if you increase this value the mp_stress program will run longer.
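
For reference, here is a minimal, hedged sketch of how such a stress program can be written ( SCHED_RR via sched_setscheduler(), worker threads that burn CPU and keep allocating memory ). The original mp_stress.c is not listed here, so all names and details below are illustrative only:

/*
 * Hedged sketch only - shows the idea behind an mp_stress-like tool.
 * Compile: cc -D_REENTRANT -std=gnu99 -o mp_stress_sketch mp_stress_sketch.c -lpthread -lm
 * Run as root ( RT scheduling needs CAP_SYS_NICE ).
 */
#include <math.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int  mb_per_loop = 20;    /* -m : MB allocated per loop iteration  */
static long work_factor = 200;   /* -c : rough knob for CPU time per loop */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* burn CPU without ever blocking or yielding */
        volatile double x = 0.0;
        for (long i = 0; i < work_factor * 1000000L; i++)
            x += sqrt((double)i);

        /* touch fresh memory and never free it, so the resident set
           grows until the system starts paging/swapping */
        size_t sz = (size_t)mb_per_loop * 1024 * 1024;
        char *p = malloc(sz);
        if (p)
            memset(p, 1, sz);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = 4;    /* -t */
    int rtprio   = 50;   /* -p */
    int opt;

    while ((opt = getopt(argc, argv, "t:m:p:c:")) != -1) {
        switch (opt) {
        case 't': nthreads    = atoi(optarg); break;
        case 'm': mb_per_loop = atoi(optarg); break;
        case 'p': rtprio      = atoi(optarg); break;
        case 'c': work_factor = atol(optarg); break;
        }
    }

    /* move the whole process into the SCHED_RR class; worker threads
       created afterwards inherit this policy and priority by default */
    struct sched_param sp = { .sched_priority = rtprio };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    pthread_t tid[128];
    if (nthreads > 128)
        nthreads = 128;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);   /* never returns: workers loop forever */
    return 0;
}

Because the workers never block, one such process at RR priority 10 on a 2-CPU box is enough to starve ora_lms0 ( RR priority 1 ), as the priority listing below shows.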

Check RT process priorities:
#       ps -e -o pid,class,rtprio,pri,nlwp,cmd | grep RR | sort -k 3 -n  | grep -v TS
11827 RR       1  41    1 asm_vktm_+ASM1
11843 RR       1  41    1 asm_lms0_+ASM1
 6547 RR       1  41    1 ora_vktm_grac41
 6567 RR       1  41    1 ora_lms0_grac41
11579 RR      10  50    5 ./mp_stress -t 4 -m 1 -p 10 -c 2000
 1734 RR      99 139    6 /sbin/multipathd
 4536 RR      99 139   12 /u01/app/11204/grid/bin/osysmond.bin
 4550 RR      99 139   19 /u01/app/11204/grid/bin/cssdmonitor
 4568 RR      99 139   19 /u01/app/11204/grid/bin/cssdagent
 4582 RR      99 139   24 /u01/app/11204/grid/bin/ocssd.bin 
 4951 RR      99 139    8 /u01/app/11204/grid/bin/ologgerd -m grac43 -r -d /u01/app/11204/grid/crf/db/grac
 -->  ./mp_stress runs at a much higher priority than the Oracle shadow processes, as those processes
      run in the TS scheduling class.
      mp_stress ( RTPRIO=10 ) will prevent ora_lms0_grac41 ( RTPRIO=1 ) from running, but ocssd.bin, cssdmonitor
      and cssdagent ( RTPRIO=99 ) should still get enough CPU.
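
To confirm the scheduling policy and priority of an individual process, chrt ( part of util-linux ) can also be used; the output below is what you would expect for ora_lms0_grac41 from the listing above:
# chrt -p 6567
pid 6567's current scheduling policy: SCHED_RR
pid 6567's current scheduling priority: 1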

Collecting diagnostic data for a specific day by using TFA

Collecting diagnostic data for a specific day
# $GRID_HOME/bin/tfactl diagcollect -all -for "Mar/22/2014"
Collecting data for all components using above parameters...
Collecting data for all nodes
Scanning files for Mar/22/2014
Repository Location in grac41 : /u01/app/grid/tfa/repository
2014/03/22 14:03:35 CET : Running an inventory clusterwide ...
2014/03/22 14:03:36 CET : Collection Name : tfa_Sat_Mar_22_14_03_29_CET_2014.zip
2014/03/22 14:03:43 CET : Sending diagcollect request to host : grac42
2014/03/22 14:03:43 CET : Sending diagcollect request to host : grac43 
....
Logs are collected to:
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac41.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac42.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac43.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
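
If your TFA version supports time-window collection, the same collection can be narrowed to the hour around the eviction instead of the whole day ( hedged example - check the tfactl diagcollect help for the exact -from/-to syntax ):
# $GRID_HOME/bin/tfactl diagcollect -all -from "Mar/22/2014 09:30:00" -to "Mar/22/2014 11:00:00"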

Review significant events like shutdown, startup and Instance Eviction

 2014-03-22 10:08:48
                    alert_grac41.log : Starting ORACLE instance (normal)
 2014-03-22 10:07:41
                    alert_grac41.log : Instance terminated by USER, pid = 18757
 2014-03-22 10:06:35
                    alert_grac41.log : Instance terminated by LMD0, pid = 27060
                    ==> LMD0 terminates the instance on grac41
 2014-03-22 10:06:12
                    alert_grac41.log : LMD0 (ospid: 27060): terminating the instance due to error 481
                    alert_grac41.log : LMD0 (ospid: 27060): terminating the instance due to error 481
 2014-03-22 10:05:56
                    alert_grac42.log : Remote instance kill is issued with system inc 34
                    alert_grac42.log : LMON received an instance eviction notification from instance 3 The instance eviction map is 1 
                    alert_grac42.log : LMON received an instance eviction notification from instance 2 The instance eviction map is 1 
 2014-03-22 10:05:52
                    alert_grac43.log : LMON received an instance eviction notification from instance 2 The instance eviction map is 1 
                    alert_grac43.log : LMON received an instance eviction notification from instance 3 The instance eviction map is 1 
                    alert_grac43.log : Remote instance kill is issued with system inc 34

 2014-03-22 10:04:29
                    alert_grac43.log : IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
                    alert_grac43.log : IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
                    alert_grac43.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac43.log : IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
                    alert_grac43.log : IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
                    alert_grac43.log : Receiver: inst 1 binc 439148604 ospid 27131
                    ==> grac43 detects an IPC Send timeout
 2014-03-22 10:03:43
                    alert_grac42.log : IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
                    alert_grac42.log : IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
                    alert_grac42.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac42.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac42.log : IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
                    alert_grac42.log : IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
                    ==> grac42 detects an IPC Send timeout 
  • The eviction started around 10:03
  • The eviction error message "IPC Send timeout detected" is the key message
  • Instance grac41 was started again at 10:08
  • Nodes grac42 and grac43 report communication problems with grac41
  • Very likely grac41 is the source of the problem
  • Instance grac41 was terminated by LMD0, pid = 27060
  • The LMS0 processes have a communication problem, potentially related to
    • not getting enough CPU due to paging/swapping
    • not getting enough CPU due to scheduling problems caused by other RT processes

Summary of all Eviction Events

2014-03-22 10:03:43    trace/alert_grac42.log    IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
2014-03-22 10:03:43    trace/alert_grac42.log    IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
2014-03-22 10:03:43    trace/alert_grac42.log    Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:03:43    trace/alert_grac42.log    IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
2014-03-22 10:03:43    trace/alert_grac42.log    Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:03:43    trace/alert_grac42.log    IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
2014-03-22 10:04:29    trace/alert_grac43.log    IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
2014-03-22 10:04:29    trace/alert_grac43.log    IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
2014-03-22 10:04:29    trace/alert_grac43.log    Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:04:29    trace/alert_grac43.log    Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:04:29    trace/alert_grac43.log    IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
2014-03-22 10:04:29    trace/alert_grac43.log    IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
2014-03-22 10:05:52    trace/alert_grac43.log    LMON received an instance eviction notification from instance 3 The instance eviction map is 1
2014-03-22 10:05:52    trace/alert_grac43.log    LMON received an instance eviction notification from instance 2 The instance eviction map is 1
2014-03-22 10:05:56    trace/alert_grac42.log    LMON received an instance eviction notification from instance 2 The instance eviction map is 1
2014-03-22 10:05:56    trace/alert_grac42.log    LMON received an instance eviction notification from instance 3 The instance eviction map is 1

Reviewing RAC alert.log and ocssd logfiles

Check for CRS-16XX errors 
% ~/fn.sh CRS-16 -B 1
Search String:  CRS-16

alertgrac41.log
2014-03-22 10:05:51.446: 
[cssd(5283)]CRS-1662:Member kill requested by node grac42 for member number 0, group DBGRAC4
2014-03-22 10:05:53.685: 
[cssd(5283)]CRS-1662:Member kill requested by node grac43 for member number 0, group DBGRAC4
 -------------------------------------------------------------------- 

alertgrac42.log
2014-03-22 10:05:56.582: 
[cssd(5275)]CRS-1663:Member kill issued by PID 10406 for 1 members, group DBGRAC4. 
  Details at (:CSSGM00044:) in /u01/app/11204/grid/log/grac42/cssd/ocssd.log.
 -------------------------------------------------------------------- 

alertgrac43.log
2014-03-22 10:05:52.781: 
[cssd(5024)]CRS-1663:Member kill issued by PID 9112 for 1 members, group DBGRAC4. Details at (:CSSGM00044:) in 
  /u01/app/11204/grid/log/grac43/cssd/ocssd.log.
 -------------------------------------------------------------------- 
--> The CRS alert logs report a Member Kill request
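
fn.sh is used here as a small local search helper ( essentially grep with -B/-A options across the collected logs ). Without it, the same check works with plain grep over the unzipped TFA collection, e.g.:
# cd /u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all
# unzip -q grac41.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
# grep -r -B 1 "CRS-16" .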



Locate the right ocssd.log file and check for Kill actions 
% grep Kill    ocssd.log [ grac41 ]
2014-03-22 10:05:51.446: [    CSSD][2195117824]clssgmmkLocalKillThread: Local kill requested: id 4 mbr map 0x00000001 Group name DBGRAC4 flags 0x00000000 st time 174335374 end time 174365874 time out 30500 req node 2
2014-03-22 10:05:51.446: [    CSSD][2195117824]clssgmmkLocalKillThread: Kill requested for member 0 group (0x7f4464081010/DBGRAC4)
2014-03-22 10:05:53.685: [    CSSD][2196694784]clssgmmkLocalKillThread: Local kill requested: id 6 mbr map 0x00000001 Group name DBGRAC4 flags 0x00000000 st time 174337614 end time 174368114 time out 30500 req node 3
2014-03-22 10:05:53.685: [    CSSD][2196694784]clssgmmkLocalKillThread: Kill requested for member 0 group (0x7f4464081010/DBGRAC4)
2014-03-22 10:05:53.764: [    CSSD][2204579584]clssgmKillMbrGone: Posting mbrgone event Req node 2 id 4 member 0 grock name DBGRAC4
2014-03-22 10:05:53.764: [    CSSD][2204579584]clssgmKillMbrGone: Posting mbrgone event Req node 3 id 6 member 0 grock name DBGRAC4
2014-03-22 10:05:53.764: [    CSSD][2196694784]clssgmmkLocalKillDoneChk: Work item results. Success map 0x00000001 Fail map 0x00000000 Req node 3 Responding node 0 Num left 0 Kill count 1 Number of answers 1
2014-03-22 10:05:53.764: [    CSSD][2196694784]clssgmmkLocalKillResults: Replying to kill request from remote node 3 kill id 6 Success map 0x00000001 Fail map 0x00000000
2014-03-22 10:05:53.764: [    CSSD][2195117824]clssgmmkLocalKillDoneChk: Work item results. Success map 0x00000001 Fail map 0x00000000 Req node 2 Responding node 0 Num left 0 Kill count 1 Number of answers 1
2014-03-22 10:05:53.764: [    CSSD][2195117824]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 4 Success map 0x00000001 Fail map 0x00000000
--> CSSD on grac41 is still working - it received a kill request for member 0 ( == instance 1 ) from instances grac42 and grac43

Check for the IPC Send timeout message
# ~/fn.sh '[Ii][Pp][Cc] [Ss]end [Tt]imeout' -B 1 -A 2 
grac41_diag_27033_1.trc
*** 2014-03-22 10:04:48.725
IPC Send timeout detected. Receiver ospid 27131 [

alert_grac42.log
Sat Mar 22 10:03:43 2014 IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
Receiver: inst 1 binc 439148604 ospid 27131
IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13

grac43_lms0_9116.trc
*** 2014-03-22 10:04:28.487
IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
alert_grac43.log
Sat Mar 22 10:04:29 2014
IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
Receiver: inst 1 binc 439148604 ospid 27131
IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 1
--> Lots of IPC Send timeout errors
    Note this error is one of the 5 top-most error messages seen for Node Evictions
    The receiver instance is instance 1, which is grac41

Review LMON traces

Review LMON traces on the evicted instance grac41


*** 2014-03-22 10:04:41.225
kjxgmpoll: stalled for 382 seconds (threshold 42 sec)
  ===[ Session State Object ]===
*** 2014-03-22 10:04:43.531
    ----------------------------------------
    SO: 0xaf8a9100, type: 4, owner: 0xaf515428, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
*** 2014-03-22 10:04:44.675
     proc=0xaf515428, name=session, file=ksu.h LINE:12729, pg=0
    (session) sid: 130 ser: 1 trans: (nil), creator: 0xaf515428
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
--> The LMON process is not getting enough CPU ( kjxgmpoll stalled for 382 seconds against a 42 second threshold )

LMON on grac43 reports
** 2014-03-22 10:04:35.123
kjxgrf_rr_lock: done - ret = 0  hist 0x13d
2014-03-22 10:04:35.123141 : kjxgrDiskVote: RR lock-get succeeded w/ status 0
2014-03-22 10:04:35.123172 : kjxgrDiskVote: Obtained RR update lock for sequence 33, RR seq 32
kjxgr_votesready: inst 1 vote is found
 vote: Inst 1 thread 1 seqno 32 state 0x10000000 mapsz 2624
 vote: map: 1 2 3
kjxgr_votesready: inst 2 vote is found
 vote: Inst 2 thread 2 seqno 33 state 0x10000007 mapsz 2624
 vote: map: 2 3
kjxgr_votesready: inst 3 vote is found
 vote: Inst 3 thread 3 seqno 33 state 0x10000007 mapsz 2624
 vote: map: 2 3
kjxgrDiskVote: still waiting for votes, time left: 55436 ms, wait_count 1
*** 2014-03-22 10:04:36.042
kjxgr_votesready: inst 1 vote is found
 vote: Inst 1 thread 1 seqno 32 state 0x10000000 mapsz 2624
 vote: map: 1 2 3
kjxgrDiskVote: still waiting for votes, time left: 54653 ms, wait_count 17
--> The vote from instance 1 still carries the old RAC incarnation number: 32
..
*** 2014-03-22 10:05:31.555
kjxgr_votesready: inst 1 vote is found
 vote: Inst 1 thread 1 seqno 33 state 0x10000004 mapsz 2624
 vote: map: 1 2 3
kjxgrDiskVote: still waiting for votes, time left: 442 ms, wait_count 1009

*** 2014-03-22 10:05:32.034
kjxgrDiskVote: voting members: 2 3
 Inst (2) st 0x1137, es 0x0000, vote: 2 3
 Inst (3) st 0x0107, es 0x0000, vote: 2 3
2014-03-22 10:05:32.087368 : kjxgrDiskVote: decidemem completes, seq 33
2014-03-22 10:05:32.087417 : proposed membership: 2 3

*** 2014-03-22 10:05:32.581
kjxgrmsghndlr: Queue msg (0x7f3c5e61a000->0x7f3c5e4ae510) type 10 for later
=====================================================
kjxgmpoll: CGS state (32 1) start 0x532d51f3 cur 0x532d525c rcfgtm 105 sec
2014-03-22 10:05:32.634325 : kjxgrDiskVote: new membership is updated by inst 3, seq 34
2014-03-22 10:05:32.634398 : kjxgrDiskVote: bitmap: 2 3
Evicting inst 1, stat 0x1017 err 0x0002
  --> The new seq number is 34 : instance 1 is evicted - the new cluster now consists of instances 2 and 3
IMR state information
  Inst 3, thread 3, state 0x4:124c, flags 0x12ca9:0x0001
  RR seq commit 32  cur 34
  Propstate 4  prv 3  pending 0
  rcfg rsn 3, rcfg time 1267854488, mem ct 3
  master 3, master rcfg time 1267854488
  evicted memcnt 0, starttm 0 chkcnt 0
  system load 345 (normal)
  nonblocking disk voting method
  Disk Vote State:
     dvstate 0xfe  state 0x6  start seq 33
     rr holder is inst 3
     start time 0x4b929fbe  optime 0x532d525c  dmptm 0x532d525c
     state machine current 6  next 15  history 0x7a1234567a123456

  Member information:
    Inst 1, incarn 32, version 0x1a2ce112, thrd 1, cap 0x0
      prev thrd 1, status 0x1017 (JRM.), err 0x0012
      valct 0
    Inst 2, incarn 4, version 0x19996157, thrd 2, cap 0x3
      prev thrd -1, status 0x1137 (JRM.), err 0x0000
      valct 3
    Inst 3, incarn 2, version 0x1998f4e4, thrd 3, cap 0x3
      prev thrd -1, status 0x0107 (JRM.), err 0x0000
      valct 3
..
*** 2014-03-22 10:05:54.877
kjxgmpoll reconfig instance map: 2 3

*** 2014-03-22 10:05:54.877
kjxgmrcfg: Reconfiguration started, type 1
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 34 0.

*** 2014-03-22 10:05:55.053
     Name Service frozen
kjxgmcs: Setting state to 34 1.
kjxgrdecidever: No old version members in the cluster
kjxgrssvote: reconfig bitmap chksum 0x3332563b cnt 2 master 2 ret 0
kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting
2014-03-22 10:05:55.231847 : kjxgrDiskVote: nonblocking method is chosen
kjxgrpropmsg: SSVOTE not this rcfg pstate 4 from inst 2 mtm(0x4b93e3f9 0x4b93e3f9)

*** 2014-03-22 10:05:56.399
2014-03-22 10:05:56.399274 : kjxgrDiskVote: new membership is updated by inst 2, seq 36
2014-03-22 10:05:56.399347 : kjxgrDiskVote: bitmap: 2 3
--> LMON on grac43 does not get a current vote from grac41
    and issues an IMR ( Instance Membership Recovery ) that drops instance grac41 from the RAC cluster

Debugging Eviction by using OSWatcher
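
A good starting point is the vmstat archive for grac41 around 10:03. The example below is hedged - it assumes the default OSWatcher ( oswbb ) archive layout and file naming, which may differ on your system:
# cd /u01/app/oswbb/archive/oswvmstat
# grep -A 3 "10:03" grac41.example.com_vmstat_14.03.22.1000.dat
Look at the r column ( run queue ), free and si/so ( swap in/out ): a run queue far above the number of CPUs together with non-zero si/so values confirms CPU starvation combined with paging/swapping during the eviction window.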

 

Debugging Eviction by using CHM data
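
CHM keeps historical OS metrics in its repository, and oclumon can replay the window around the eviction. A hedged example ( see the oclumon documentation for the exact timestamp format ):
# oclumon dumpnodeview -n grac41 -s "2014-03-22 10:00:00" -e "2014-03-22 10:10:00"
In the SYSTEM section of each node view, check cpuq ( CPU run queue ), swapfree and rtprocs on grac41: a growing run queue and shrinking swapfree during this window match the CPU/memory starvation created by mp_stress.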

 

Debugging Eviction by using LMHB trace file
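
The LMHB trace of the affected instance is the quickest confirmation that LMON, LMD0 or LMS0 were not making progress. A hedged example, assuming the default ADR trace location and a database name of grac4:
# cd $ORACLE_BASE/diag/rdbms/grac4/grac41/trace
# ls -ltr grac41_lmhb_*.trc
# grep -i -E "lmon|lmd0|lms0" grac41_lmhb_*.trc | tail
Look for entries around 10:03 showing that these processes have not run or not acquired CPU for an extended period.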

 

Understanding the root cause of the problem

Check process priority of involved processes 
# ps -e -o pid,class,rtprio,pri,nlwp,cmd |  egrep 'PID|mp_stress|ocssd.bin|ora_lms0_grac41'
  PID CLS RTPRIO PRI NLWP CMD
 4582 RR      99 139   24 /u01/app/11204/grid/bin/ocssd.bin 
 7894 RR      10  50    5 ./mp_stress -t 4 -m 1 -p 10 -c 2000
13508 RR       1  41    1 ora_lms0_grac41

ocssd.bin, ora_lms0_grac41 and mp_stress all belong to the real-time scheduling class
( SCHED_RR - round-robin real-time processes ). This means ocssd.bin runs with the highest RT priority, 99, and will get
CPU whenever it needs it.

The ./mp_stress process also runs as an RT process, with priority 10. Whenever ocssd.bin releases the CPU, ./mp_stress will
be scheduled next. If mp_stress never releases the CPU ( i.e. it does not block in system calls or otherwise wait ), ora_lms0_grac41 will never get
scheduled, and this can lead to an Instance Eviction because ora_lms0_grac41 runs with an RR priority of only 1.
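
Note that the kernel's real-time throttling does not help here. On most Linux kernels the defaults below limit the RT class to 95% of each period, leaving only 5% for non-RT ( TS ) tasks, and within the RT class the higher priority always wins - so ora_lms0_grac41 ( RR priority 1 ) gets nothing as long as mp_stress ( RR priority 10 ) is runnable:
# cat /proc/sys/kernel/sched_rt_period_us /proc/sys/kernel/sched_rt_runtime_us
1000000
950000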

Note that ./mp_stress runs with 5 threads ( NLWP=5 ). Assuming 4 worker threads, this program can keep your system quite
busy if you have a low number of CPUs and ./mp_stress doesn't release the CPU.
  •  If your system runs multiple RT processes, or even a single RT process with multiple threads, double check that the LMS0, LMON and LMD processes get enough CPU
  • Always check the LMHB trace files as a first step
