NODE EVICTION DUE TO MEMBER KILL ESCALATION
A node eviction will be initiated if:
- a cluster member cannot communicate via the network heartbeat (default timeout: 30 seconds)
- a cluster member cannot communicate via the disk heartbeat with the majority of voting files (long I/O timeout: 200 sec.; short I/O timeout: 27 sec.)
- the server is hung or suffers CPU starvation
- core Oracle Clusterware processes (e.g. OCSSD) run into problems
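The short disk I/O timeout above is not an independent setting: it is derived as misscount minus reboottime. A minimal sketch of that arithmetic, assuming the 11.2 defaults (misscount 30 sec., reboottime 3 sec.); on a live cluster, query the actual values with crsctl:

```shell
# Assumed 11.2 defaults -- verify on your own cluster with:
#   crsctl get css misscount ; crsctl get css disktimeout ; crsctl get css reboottime
MISSCOUNT=30     # network heartbeat timeout (seconds)
REBOOTTIME=3     # time allowed for a node to complete a reboot
DISKTIMEOUT=200  # long disk I/O timeout (seconds)

# Short disk I/O timeout = misscount - reboottime
echo "short disk I/O timeout: $((MISSCOUNT - REBOOTTIME))s"
echo "long disk I/O timeout:  ${DISKTIMEOUT}s"
```

With the defaults this yields the 27 seconds quoted above; if misscount has been tuned, the short timeout shifts with it.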
Tested environment: Oracle 11.2.0.3 on OEL 6.3

If a RAC instance hangs, the LMON process of a healthy instance will request a member kill escalation and ask the CSS process to remove the hanging database instance from the cluster.

Connect to the GRACE22 instance and verify the connection:

SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME    HOST_NAME
---------------- ------------------------------
GRACE22          grac2.example.com

Stop the execution of all RDBMS processes on grac2 (by sending the STOP signal):

# ps -ef | grep ora_ | grep GRACE22 | awk '{ print $2 }' | while read PID ; do kill -STOP $PID; done

Verify that the related database processes are waiting due to the STOP signal (process state "T"):

# ps -elf | grep ora_ | grep GRACE22
0 T oracle 8084 1 0 80 0 - 470034 signal 09:19 ? 00:00:00 ora_pmon_GRACE22
0 T oracle 8086 1 0 80 0 - 468339 signal 09:19 ? 00:00:00 ora_psp0_GRACE22
0 T oracle 8088 1 3 58 - - 468339 signal 09:19 ? 00:02:06 ora_vktm_GRACE22
0 T oracle 8092 1 0 80 0 - 468690 signal 09:19 ? 00:00:00 ora_gen0_GRACE22

Verify the ASM alert log on grac2: alert_+ASM2.log (note: pid 8134 is the ASMB process)

WARNING: client [GRACE22:GRACE2] not responsive for 217s; state=0x1. killing pid 8134
Sat Aug 03 10:25:49 2013
WARNING: client [GRACE22:GRACE2] cleanup delayed; waited 277s, pid 8134 mbr 0x1
Sat Aug 03 10:26:49 2013
WARNING: client [GRACE22:GRACE2] cleanup delayed; waited 337s, pid 8134 mbr 0x1

Check the Clusterware logfile on grac2: alertgrac2.log

2013-08-03 10:27:39.868
[cssd(2839)]CRS-1662:Member kill requested by node grac1 for member number 1, group DBGRACE2

Due to missing heartbeats, the healthy RAC instance on node grac1 will remove the hanging RAC instance by requesting a member kill escalation: alert_GRACE21.log

Sat Aug 03 10:35:04 2013
IPC Send timeout detected. Sender: ospid 4113 [oracle@grac1.example.com (LGWR)]
Receiver: inst 2 binc 430330249 ospid 8108
IPC Send timeout to 2.1 inc 16 for msg type 73 from opid 18
Sat Aug 03 10:35:10 2013
IPC Send timeout detected.
Sender: ospid 4101 [oracle@grac1.example.com (LMS0)]
Receiver: inst 2 binc 430330249 ospid 8108
IPC Send timeout to 2.1 inc 16 for msg type 32 from opid 13
Sat Aug 03 10:35:37 2013
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Sat Aug 03 10:35:54 2013
IPC Send timeout detected. Sender: ospid 4117 [oracle@grac1.example.com (SMON)]
Receiver: inst 2 binc 430330245 ospid 8106
Sat Aug 03 10:35:57 2013
Remote instance kill is issued with system inc 18
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 16, new inc 20)

Allow the database processes on grac2 to continue (by sending the CONT signal) and restart instance GRACE22 on grac2:

# ps -ef | grep ora_ | grep GRACE22 | awk '{ print $2 }' | while read PID ; do kill -CONT $PID; done

grac2: alert_GRACE22.log

Sat Aug 03 10:55:09 2013
NOTE: ASMB process is now terminating
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_8130.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 33 Serial number: 17

The Clusterware logfiles on grac2 are updated very late: alertgrac2.log

2013-08-03 10:55:10.423
[/u01/app/11203/grid/bin/oraagent.bin(3558)]CRS-5011:Check of resource "GRACE2" failed: details at "(:CLSN00007:)" in "/u01/app/11203/grid/log/grac2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-08-03 10:55:10.426
[crsd(3080)]CRS-2765:Resource 'ora.grace2.db' has failed on server 'grac2'.
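The ps | grep | awk pipeline used for the STOP and CONT signals can be dry-run against canned ps output to confirm it selects only the GRACE22 background processes. A sketch (the sample lines are illustrative, not from the test system):

```shell
# Canned ps -ef style lines: two GRACE22 background processes and one
# from a different instance that must NOT be matched.
ps_output='oracle 8084 1 0 09:19 ? 00:00:00 ora_pmon_GRACE22
oracle 8086 1 0 09:19 ? 00:00:00 ora_psp0_GRACE22
oracle 9999 1 0 09:19 ? 00:00:00 ora_pmon_OTHERDB'

# Same filter chain as the kill loop, minus the kill itself;
# awk field 2 is the PID column.
echo "$ps_output" | grep ora_ | grep GRACE22 | awk '{ print $2 }'
# -> prints 8084 and 8086
```

On the real system the output of this chain is fed into `while read PID ; do kill -STOP $PID; done` (or `kill -CONT`) exactly as shown above.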
grac2: /u01/app/11203/grid/log/grac2/agent/crsd/oraagent_oracle/oraagent_oracle.log

2013-08-03 10:55:10.430: [ USRTHRD][2483025664] event type is CRS_RESOURCE_STATE_CHANGE
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] bodylen = 742
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} -----------BodyBlock----------
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} CLS_TINT='{0:6:795}'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} COMPOSITE_RESOURCE_STATUS=''
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} CURRENT_STATE='OFFLINE'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DATABASE_TYPE='RAC'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DB_UNIQUE_NAME='grace2'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DEGREE='1'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DERIVED_STATE='OFFLINE'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} ID='39247926'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} INSTANCE_NAME='GRACE22'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} INTERMEDIATE_INSTANCE_COUNT='0'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} NAME='ora.grace2.db'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} ONLINE_INSTANCE_COUNT='1'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} PREVIOUS_STATE='ONLINE'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} PREV_STATE_DETAILS='Open'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} REASON='FAILURE'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_CLASS='database'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_INCARNATION_NUMBER='3'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_INSTANCE_ID='ora.grace2.db 2 1'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_LOCATION='grac2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} SEQUENCE_NUMBER='1000050'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} STATE_DETAILS='Abnormal Termination'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TIMESTAMP='2013-08-03 10:55:10'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TOTAL_INSTANCE_COUNT='2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TYPE='ora.database.type'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} UNKNOWN_INSTANCE_COUNT='0'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} USER='SYSTEM'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} Version='11.2.0.3.0'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} CLUSTER_NAME='GRACE2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} DB_UNIQUE_NAME='grace2'
..
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Agent sending reply for: RESOURCE_START[ora.grace2.db 2 1] ID 4098:4383
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} ora.grace2.db 2 1 state changed from: STARTING to: ONLINE
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Started implicit monitor for [ora.grace2.db 2 1] interval=1000 delay=1000
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} ora.grace2.db 2 1 state details has changed from: to: Open
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Agent sending last reply for: RESOURCE_START[ora.grace2.db 2 1] ID 4098:4383
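A quick way to pull the state transition out of such a verbose oraagent trace is to grep for the state fields of the body block. A sketch against a canned excerpt (on a live system, feed oraagent_oracle.log itself through the same filter):

```shell
# Canned excerpt of the oraagent body block, trimmed to its state fields.
excerpt="CURRENT_STATE='OFFLINE'
PREVIOUS_STATE='ONLINE'
STATE_DETAILS='Abnormal Termination'
REASON='FAILURE'"

# Show only the fields describing the failure transition:
echo "$excerpt" | grep -E "^(CURRENT_STATE|PREVIOUS_STATE|STATE_DETAILS|REASON)="
```

For the trace above this surfaces the ONLINE -> OFFLINE transition with reason FAILURE and details "Abnormal Termination" at a glance, without scrolling through the full body block.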