Key Facts
- OSWatcher may not provide accurate data because its commands run at a lower priority than CHM
- LMHB monitors the LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning
- It is very likely that we get LMHB traces, because the LMON and LMS RAC processes will not get enough CPU when running this test case
- An overloaded RAC instance (due to CPU starvation or paging/swapping) can cause a node eviction with an IPC Send timeout message
- When investigating a node eviction with an IPC Send timeout message, review the OSWatcher and CHM files carefully
- Significant Errors:
IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
LMS0 (ospid: 10410) has detected no messaging activity from instance 1
LMS0 (ospid: 10410) issues an IMR to resolve the situation
Using the mp_stress program to trigger node evictions by creating CPU/memory starvation
- Source: mp_stress.c
  Compile:
  # cc -D_REENTRANT -std=gnu99 -o mp_stress mp_stress.c -lpthread -lm
  Run it:
  # ./mp_stress -t 4 -m 20 -p 50 -c 200
    -t 4   : use 4 threads - the more RT threads you run, the more quickly your system gets overloaded.
             For a 2-CPU system use 4 threads. Running 2 threads per CPU with a higher priority than our
             LMS and LMD0 processes will create a node eviction.
    -m 20  : allocate 20 MByte of memory per run (about 1 second) - paging and swapping will occur
             sooner or later on your system
    -p 50  : run this process in the SCHED_RR class with priority 50 - you can increase this value up to 99
    -c 200 : controls processing time - if you increase this value the mp_stress program will run longer
  Check RT process priorities:
  # ps -e -o pid,class,rtprio,pri,nlwp,cmd | grep RR | sort -k 3 -n | grep -v TS
  11827 RR  1  41  1 asm_vktm_+ASM1
  11843 RR  1  41  1 asm_lms0_+ASM1
   6547 RR  1  41  1 ora_vktm_grac41
   6567 RR  1  41  1 ora_lms0_grac41
  11579 RR 10  50  5 ./mp_stress -t 4 -m 1 -p 10 -c 2000
   1734 RR 99 139  6 /sbin/multipathd
   4536 RR 99 139 12 /u01/app/11204/grid/bin/osysmond.bin
   4550 RR 99 139 19 /u01/app/11204/grid/bin/cssdmonitor
   4568 RR 99 139 19 /u01/app/11204/grid/bin/cssdagent
   4582 RR 99 139 24 /u01/app/11204/grid/bin/ocssd.bin
   4951 RR 99 139  8 /u01/app/11204/grid/bin/ologgerd -m grac43 -r -d /u01/app/11204/grid/crf/db/grac
  --> ./mp_stress runs at a much higher priority than the Oracle shadow processes, as those run in the
      TS scheduling class. mp_stress (RTPRIO=10) will prevent ora_lms0_grac41 (RTPRIO=1) from running,
      but ocssd.bin, cssdmonitor, and cssdagent (RTPRIO=99) should still get enough CPU.
- Please read the following post for details about Linux scheduling classes (check the ps chapter)
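The scheduling classes and RT priority ranges discussed above can be inspected directly with standard Linux tools; a minimal sketch (assumes chrt from util-linux is installed), not part of the original post:

```shell
# Show the priority range the kernel supports for each scheduling policy
# (SCHED_RR/SCHED_FIFO typically allow 1-99, SCHED_OTHER is 0)
chrt -m

# Show the scheduling policy and priority of the current shell
# (a normal process runs in SCHED_OTHER, which ps shows as class TS)
chrt -p $$

# List all processes in a real-time class (RR = round robin, FF = FIFO),
# sorted by rtprio - this is where lms0, ocssd.bin, and mp_stress appear
ps -e -o pid,class,rtprio,pri,cmd | awk '$2 == "RR" || $2 == "FF"' | sort -k3 -n
```

On a plain workstation the last command may print nothing; on a RAC node it should show the processes listed above.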
Collecting diagnostic data for a specific day by using TFA
# $GRID_HOME/bin/tfactl diagcollect -all -for "Mar/22/2014"
Collecting data for all components using above parameters...
Collecting data for all nodes
Scanning files for Mar/22/2014
Repository Location in grac41 : /u01/app/grid/tfa/repository
2014/03/22 14:03:35 CET : Running an inventory clusterwide ...
2014/03/22 14:03:36 CET : Collection Name : tfa_Sat_Mar_22_14_03_29_CET_2014.zip
2014/03/22 14:03:43 CET : Sending diagcollect request to host : grac42
2014/03/22 14:03:43 CET : Sending diagcollect request to host : grac43
....
Logs are collected to:
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac41.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac42.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
/u01/app/grid/tfa/repository/collection_Sat_Mar_22_14_03_29_CET_2014_node_all/grac43.tfa_Sat_Mar_22_14_03_29_CET_2014.zip
Review significant events like shutdown, startup, and instance eviction
2014-03-22 10:08:48 alert_grac41.log : Starting ORACLE instance (normal)
2014-03-22 10:07:41 alert_grac41.log : Instance terminated by USER, pid = 18757
2014-03-22 10:06:35 alert_grac41.log : Instance terminated by LMD0, pid = 27060
                    ==> LMD0 terminates the instance on grac41
2014-03-22 10:06:12 alert_grac41.log : LMD0 (ospid: 27060): terminating the instance due to error 481
                    alert_grac41.log : LMD0 (ospid: 27060): terminating the instance due to error 481
2014-03-22 10:05:56 alert_grac42.log : Remote instance kill is issued with system inc 34
                    alert_grac42.log : LMON received an instance eviction notification from instance 3
                                       The instance eviction map is 1
                    alert_grac42.log : LMON received an instance eviction notification from instance 2
                                       The instance eviction map is 1
2014-03-22 10:05:52 alert_grac43.log : LMON received an instance eviction notification from instance 2
                                       The instance eviction map is 1
                    alert_grac43.log : LMON received an instance eviction notification from instance 3
                                       The instance eviction map is 1
                    alert_grac43.log : Remote instance kill is issued with system inc 34
2014-03-22 10:04:29 alert_grac43.log : IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
                    alert_grac43.log : IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
                    alert_grac43.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac43.log : IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
                    alert_grac43.log : IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
                    alert_grac43.log : Receiver: inst 1 binc 439148604 ospid 27131
                    ==> grac43 detects an IPC Send timeout
2014-03-22 10:03:43 alert_grac42.log : IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
                    alert_grac42.log : IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
                    alert_grac42.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac42.log : Receiver: inst 1 binc 439148604 ospid 27131
                    alert_grac42.log : IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
                    alert_grac42.log : IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
                    ==> grac42 detects an IPC Send timeout
- Eviction started around 10:03
- The eviction error message "IPC Send timeout detected" is the key message
- The node reboot of grac41 finished at 10:08
- Nodes grac42 and grac43 report communication problems with grac41
- Very likely grac41 is the source of the problem
- Instance grac41 was terminated by LMD0, pid = 27060
- The LMS0 processes have a communication problem, potentially related to
  - not getting enough CPU due to paging/swapping
  - not getting enough CPU due to scheduling problems caused by other RT processes
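One quick way to test the paging/swapping hypothesis without waiting for OSWatcher archives is to sample the kernel's cumulative swap counters; a minimal Linux-specific sketch (not part of the original post):

```shell
# pswpin/pswpout in /proc/vmstat are cumulative counts of pages swapped
# in and out since boot; a growing delta between two samples means the
# box is actively paging to swap (OSWatcher's vmstat si/so columns
# report the same activity per interval)
s1=$(awk '/^pswp(in|out) /{sum += $2} END{print sum}' /proc/vmstat)
sleep 2
s2=$(awk '/^pswp(in|out) /{sum += $2} END{print sum}' /proc/vmstat)
echo "pages swapped during sample: $((s2 - s1))"
```

A persistently non-zero delta while LMS/LMON stalls are reported strongly supports the memory-starvation scenario.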
Summary of all Eviction Events
2014-03-22 10:03:43 trace/alert_grac42.log IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
2014-03-22 10:03:43 trace/alert_grac42.log IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
2014-03-22 10:03:43 trace/alert_grac42.log Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:03:43 trace/alert_grac42.log IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
2014-03-22 10:03:43 trace/alert_grac42.log Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:03:43 trace/alert_grac42.log IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
2014-03-22 10:04:29 trace/alert_grac43.log IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
2014-03-22 10:04:29 trace/alert_grac43.log IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 13
2014-03-22 10:04:29 trace/alert_grac43.log Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:04:29 trace/alert_grac43.log Receiver: inst 1 binc 439148604 ospid 27131
2014-03-22 10:04:29 trace/alert_grac43.log IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
2014-03-22 10:04:29 trace/alert_grac43.log IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
2014-03-22 10:05:52 trace/alert_grac43.log LMON received an instance eviction notification from instance 3 The instance eviction map is 1
2014-03-22 10:05:52 trace/alert_grac43.log LMON received an instance eviction notification from instance 2 The instance eviction map is 1
2014-03-22 10:05:56 trace/alert_grac42.log LMON received an instance eviction notification from instance 2 The instance eviction map is 1
2014-03-22 10:05:56 trace/alert_grac42.log LMON received an instance eviction notification from instance 3 The instance eviction map is 1
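A summary like the one above can be produced with a small grep pipeline over the collected alert logs; the sketch below uses sample files in place of the real trace/alert_grac4*.log data:

```shell
# create sample alert logs standing in for the real trace/alert_grac4*.log
mkdir -p trace
printf '2014-03-22 10:04:29 IPC Send timeout detected. Sender: ospid 9116 (LMS0)\n' \
  > trace/alert_grac43.log
printf '2014-03-22 10:03:43 IPC Send timeout detected. Sender: ospid 10410 (LMS0)\n' \
  > trace/alert_grac42.log

# grep -H prefixes each hit with its file name; sorting on the text after
# the first colon merges all nodes' messages into one timestamp-ordered list
grep -E -H 'IPC Send timeout|instance eviction' trace/alert_grac4*.log | sort -t: -k2
```

With real alert logs, the event text usually sits on the line after a bare timestamp line, so an awk pass that carries the last-seen timestamp forward would be needed; the one-liner above assumes pre-stamped lines as in this summary.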
Reviewing RAC alert.log and ocssd logfiles
Check for CRS-16XX errors
% ~/fn.sh CRS-16 -B 1
Search String: CRS-16
alertgrac41.log
2014-03-22 10:05:51.446: [cssd(5283)]CRS-1662:Member kill requested by node grac42 for member number 0, group DBGRAC4
2014-03-22 10:05:53.685: [cssd(5283)]CRS-1662:Member kill requested by node grac43 for member number 0, group DBGRAC4
--------------------------------------------------------------------
alertgrac42.log
2014-03-22 10:05:56.582: [cssd(5275)]CRS-1663:Member kill issued by PID 10406 for 1 members, group DBGRAC4. Details at (:CSSGM00044:) in /u01/app/11204/grid/log/grac42/cssd/ocssd.log.
--------------------------------------------------------------------
alertgrac43.log
2014-03-22 10:05:52.781: [cssd(5024)]CRS-1663:Member kill issued by PID 9112 for 1 members, group DBGRAC4. Details at (:CSSGM00044:) in /u01/app/11204/grid/log/grac43/cssd/ocssd.log.
--------------------------------------------------------------------
--> The CRS alert logs report a member kill request

Locate the right ocssd.log file and check for kill actions
% grep Kill ocssd.log      [ grac41 ]
2014-03-22 10:05:51.446: [ CSSD][2195117824]clssgmmkLocalKillThread: Local kill requested: id 4 mbr map 0x00000001 Group name DBGRAC4 flags 0x00000000 st time 174335374 end time 174365874 time out 30500 req node 2
2014-03-22 10:05:51.446: [ CSSD][2195117824]clssgmmkLocalKillThread: Kill requested for member 0 group (0x7f4464081010/DBGRAC4)
2014-03-22 10:05:53.685: [ CSSD][2196694784]clssgmmkLocalKillThread: Local kill requested: id 6 mbr map 0x00000001 Group name DBGRAC4 flags 0x00000000 st time 174337614 end time 174368114 time out 30500 req node 3
2014-03-22 10:05:53.685: [ CSSD][2196694784]clssgmmkLocalKillThread: Kill requested for member 0 group (0x7f4464081010/DBGRAC4)
2014-03-22 10:05:53.764: [ CSSD][2204579584]clssgmKillMbrGone: Posting mbrgone event Req node 2 id 4 member 0 grock name DBGRAC4
2014-03-22 10:05:53.764: [ CSSD][2204579584]clssgmKillMbrGone: Posting mbrgone event Req node 3 id 6 member 0 grock name DBGRAC4
2014-03-22 10:05:53.764: [ CSSD][2196694784]clssgmmkLocalKillDoneChk: Work item results. Success map 0x00000001 Fail map 0x00000000 Req node 3 Responding node 0 Num left 0 Kill count 1 Number of answers 1
2014-03-22 10:05:53.764: [ CSSD][2196694784]clssgmmkLocalKillResults: Replying to kill request from remote node 3 kill id 6 Success map 0x00000001 Fail map 0x00000000
2014-03-22 10:05:53.764: [ CSSD][2195117824]clssgmmkLocalKillDoneChk: Work item results. Success map 0x00000001 Fail map 0x00000000 Req node 2 Responding node 0 Num left 0 Kill count 1 Number of answers 1
2014-03-22 10:05:53.764: [ CSSD][2195117824]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 4 Success map 0x00000001 Fail map 0x00000000
--> CSSD on grac41 is still working - CSSD on grac41 got a kill request for member 0 (== instance 1) from instances grac42 and grac43

Check for the message: IPC Send timeout
# ~/fn.sh '[Ii][Pp][Cc] [Ss]end [Tt]imeout' -B 1 -A 2
grac41_diag_27033_1.trc
*** 2014-03-22 10:04:48.725
IPC Send timeout detected. Receiver ospid 27131 [
alert_grac42.log
Sat Mar 22 10:03:43 2014
IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
Receiver: inst 1 binc 439148604 ospid 27131
IPC Send timeout to 1.1 inc 32 for msg type 34 from opid 13
grac43_lms0_9116.trc
*** 2014-03-22 10:04:28.487
IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
alert_grac43.log
Sat Mar 22 10:04:29 2014
IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]
Receiver: inst 1 binc 439148604 ospid 27131
IPC Send timeout to 1.1 inc 32 for msg type 44 from opid 1
--> Lots of IPC Send timeout errors. Note this error is one of the 5 top-most error messages seen for node evictions.
    The receiver instance is instance 1, which is grac41.
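The ~/fn.sh helper used above is the author's own script and is not reproduced in the post. As an assumption, it behaves like a recursive grep that echoes the search string first and passes extra options straight through; a minimal hypothetical sketch:

```shell
# fn - hypothetical stand-in for the author's ~/fn.sh helper:
# echoes the search string, then greps recursively through the log
# directory, forwarding extra options (-B/-A context, etc.) to grep
fn() {
  pattern=$1; shift
  echo "Search String: $pattern"
  grep -r -E "$pattern" "$@" "${SEARCH_DIR:-.}"
}

# demo against a sample alert log
mkdir -p demo
printf 'IPC Send timeout detected.\n' > demo/alert_grac41.log
SEARCH_DIR=demo fn '[Ii][Pp][Cc] [Ss]end [Tt]imeout'
```

The demo prints the search string followed by the matching line prefixed with its file name, matching the output format seen in the transcripts above.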
Review LMON traces
Review LMON traces in the evicted instance grac41:
*** 2014-03-22 10:04:41.225
kjxgmpoll: stalled for 382 seconds (threshold 42 sec)
===[ Session State Object ]===
*** 2014-03-22 10:04:43.531
----------------------------------------
SO: 0xaf8a9100, type: 4, owner: 0xaf515428, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
*** 2014-03-22 10:04:44.675
 proc=0xaf515428, name=session, file=ksu.h LINE:12729, pg=0
(session) sid: 130 ser: 1 trans: (nil), creator: 0xaf515428
          flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
--> The LMON process is not getting CPU

LMON on grac43 reports:
*** 2014-03-22 10:04:35.123
kjxgrf_rr_lock: done - ret = 0 hist 0x13d
2014-03-22 10:04:35.123141 : kjxgrDiskVote: RR lock-get succeeded w/ status 0
2014-03-22 10:04:35.123172 : kjxgrDiskVote: Obtained RR update lock for sequence 33, RR seq 32
kjxgr_votesready: inst 1 vote is found
  vote: Inst 1 thread 1 seqno 32 state 0x10000000 mapsz 2624
  vote: map: 1 2 3
kjxgr_votesready: inst 2 vote is found
  vote: Inst 2 thread 2 seqno 33 state 0x10000007 mapsz 2624
  vote: map: 2 3
kjxgr_votesready: inst 3 vote is found
  vote: Inst 3 thread 3 seqno 33 state 0x10000007 mapsz 2624
  vote: map: 2 3
kjxgrDiskVote: still waiting for votes, time left: 55436 ms, wait_count 1
*** 2014-03-22 10:04:36.042
kjxgr_votesready: inst 1 vote is found
  vote: Inst 1 thread 1 seqno 32 state 0x10000000 mapsz 2624
  vote: map: 1 2 3
kjxgrDiskVote: still waiting for votes, time left: 54653 ms, wait_count 17
--> The vote from instance 1 still has the old RAC incarnation number: 32
..
*** 2014-03-22 10:05:31.555
kjxgr_votesready: inst 1 vote is found
  vote: Inst 1 thread 1 seqno 33 state 0x10000004 mapsz 2624
  vote: map: 1 2 3
kjxgrDiskVote: still waiting for votes, time left: 442 ms, wait_count 1009
*** 2014-03-22 10:05:32.034
kjxgrDiskVote: voting members: 2 3
  Inst (2) st 0x1137, es 0x0000, vote: 2 3
  Inst (3) st 0x0107, es 0x0000, vote: 2 3
2014-03-22 10:05:32.087368 : kjxgrDiskVote: decidemem completes, seq 33
2014-03-22 10:05:32.087417 : proposed membership: 2 3
*** 2014-03-22 10:05:32.581
kjxgrmsghndlr: Queue msg (0x7f3c5e61a000->0x7f3c5e4ae510) type 10 for later
=====================================================
kjxgmpoll: CGS state (32 1) start 0x532d51f3 cur 0x532d525c rcfgtm 105 sec
2014-03-22 10:05:32.634325 : kjxgrDiskVote: new membership is updated by inst 3, seq 34
2014-03-22 10:05:32.634398 : kjxgrDiskVote: bitmap: 2 3
Evicting inst 1, stat 0x1017 err 0x0002
--> New seq number is 34: instance 1 is evicted - the new cluster now consists of instances 2 and 3

IMR state information
  Inst 3, thread 3, state 0x4:124c, flags 0x12ca9:0x0001
  RR seq commit 32 cur 34
  Propstate 4 prv 3 pending 0
  rcfg rsn 3, rcfg time 1267854488, mem ct 3
  master 3, master rcfg time 1267854488
  evicted memcnt 0, starttm 0 chkcnt 0
  system load 345 (normal)
  nonblocking disk voting method
  Disk Vote State: dvstate 0xfe state 0x6 start seq 33
  rr holder is inst 3
  start time 0x4b929fbe optime 0x532d525c dmptm 0x532d525c
  state machine current 6 next 15 history 0x7a1234567a123456
Member information:
  Inst 1, incarn 32, version 0x1a2ce112, thrd 1, cap 0x0
    prev thrd 1, status 0x1017 (JRM.), err 0x0012 valct 0
  Inst 2, incarn 4, version 0x19996157, thrd 2, cap 0x3
    prev thrd -1, status 0x1137 (JRM.), err 0x0000 valct 3
  Inst 3, incarn 2, version 0x1998f4e4, thrd 3, cap 0x3
    prev thrd -1, status 0x0107 (JRM.), err 0x0000 valct 3
..
*** 2014-03-22 10:05:54.877
kjxgmpoll reconfig instance map: 2 3
*** 2014-03-22 10:05:54.877
kjxgmrcfg: Reconfiguration started, type 1
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 34 0.
*** 2014-03-22 10:05:55.053
Name Service frozen
kjxgmcs: Setting state to 34 1.
kjxgrdecidever: No old version members in the cluster
kjxgrssvote: reconfig bitmap chksum 0x3332563b cnt 2 master 2 ret 0
kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting
2014-03-22 10:05:55.231847 : kjxgrDiskVote: nonblocking method is chosen
kjxgrpropmsg: SSVOTE not this rcfg pstate 4 from inst 2 mtm(0x4b93e3f9 0x4b93e3f9)
*** 2014-03-22 10:05:56.399
2014-03-22 10:05:56.399274 : kjxgrDiskVote: new membership is updated by inst 2, seq 36
2014-03-22 10:05:56.399347 : kjxgrDiskVote: bitmap: 2 3
--> LMON on grac43 doesn't get a current vote from grac41 and issues an IMR, dropping instance grac41 from the RAC cluster
Debugging Eviction by using OSWatcher
Debugging Eviction by using CHM data
Debugging Eviction by using LMHB trace file
Understanding the root cause of the problem
Check the process priority of the involved processes:
# ps -e -o pid,class,rtprio,pri,nlwp,cmd | egrep 'PID|mp_stress|ocssd.bin|ora_lms0_grac41'
  PID CLS RTPRIO PRI NLWP CMD
 4582 RR      99 139   24 /u01/app/11204/grid/bin/ocssd.bin
 7894 RR      10  50    5 ./mp_stress -t 4 -m 1 -p 10 -c 2000
13508 RR       1  41    1 ora_lms0_grac41
ocssd.bin, ora_lms0_grac41, and mp_stress belong to the real-time scheduling class (round-robin real-time processes). This means ocssd.bin runs with the highest RT priority (99) and will get CPU as needed. The ./mp_stress process runs as an RT process too, with priority 10. If ocssd.bin releases the CPU, ./mp_stress will be scheduled. If mp_stress does not release the CPU (due to system call waits, ...), ora_lms0_grac41 will never get scheduled, and this can lead to an instance eviction because ora_lms0_grac41 runs only with RR priority 1. Note that ./mp_stress runs with 5 threads. Assuming 4 worker threads, this program can keep your system quite busy if you have a low number of CPUs and ./mp_stress doesn't release the CPU.
- If your system runs multiple RT processes, or even a single RT process with multiple threads, double-check that the LMS0, LMON, and LMD processes get enough CPU
- Always check the LMHB trace files as a first action
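The priority comparison above can be scripted. This sketch parses a captured ps listing (the sample data below mirrors the listing shown above; in practice you would feed it live output of ps -e -o pid,class,rtprio,cmd) and flags every RT process that outranks lms0:

```shell
# captured output of: ps -e -o pid,class,rtprio,cmd (RT rows only) - sample data
cat > ps.out <<'EOF'
4582 RR 99 /u01/app/11204/grid/bin/ocssd.bin
7894 RR 10 ./mp_stress -t 4 -m 1 -p 10 -c 2000
13508 RR 1 ora_lms0_grac41
EOF

# find lms0's rtprio, then print every RT process with a higher one -
# each of these can starve lms0 of CPU if it never yields
lms=$(awk '/ora_lms0/ {print $3}' ps.out)
awk -v lms="$lms" '$3 + 0 > lms + 0 {print "outranks lms0:", $1, $4}' ps.out
```

With the sample data this flags ocssd.bin (expected, it must run at 99) and mp_stress (the culprit); any unexpected third-party process appearing here deserves scrutiny.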