NODE EVICTION DUE TO MEMBER KILL ESCALATION
A node eviction will be initiated if:
- a cluster member cannot communicate via the network heartbeat (default timeout: 30 seconds)
- a cluster member cannot communicate via the disk heartbeat with the majority of voting files (long I/O timeout: 200 sec.; short I/O timeout: 27 sec.)
- the server is hung or suffers CPU starvation
- core Oracle Clusterware processes (e.g. OCSSD) run into problems
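The short disk I/O timeout above is not an independent setting: it is derived as misscount minus reboottime. A minimal sketch of that arithmetic, assuming the 11.2 defaults (misscount 30 sec., reboottime 3 sec.); on a live cluster, query the actual values with crsctl:

```shell
# Assumed 11.2 defaults -- verify on your own cluster with:
#   crsctl get css misscount ; crsctl get css disktimeout ; crsctl get css reboottime
MISSCOUNT=30     # network heartbeat timeout (seconds)
REBOOTTIME=3     # time allowed for a node to complete a reboot
DISKTIMEOUT=200  # long disk I/O timeout (seconds)

# Short disk I/O timeout = misscount - reboottime
echo "short disk I/O timeout: $((MISSCOUNT - REBOOTTIME))s"
echo "long disk I/O timeout:  ${DISKTIMEOUT}s"
```

With the defaults this yields the 27 seconds quoted above; if misscount has been tuned, the short timeout shifts with it.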
Tested environment: Oracle 11.2.0.3 on OEL 6.3

If a RAC instance hangs, the LMON process of a healthy instance will request a member kill escalation and ask the CSS process to remove the hanging database instance from the cluster.

Connect to the GRACE22 instance and verify the connection:

SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME    HOST_NAME
---------------- ------------------------------
GRACE22          grac2.example.com

Stop the execution of all RDBMS processes on grac2 (by sending the STOP signal):

# ps -ef | grep ora_ | grep GRACE22 | awk '{ print $2 }' | while read PID ; do kill -STOP $PID; done

Verify that the related database processes are waiting due to the STOP signal (process state "T"):

# ps -elf | grep ora_ | grep GRACE22
0 T oracle 8084 1 0 80 0 - 470034 signal 09:19 ? 00:00:00 ora_pmon_GRACE22
0 T oracle 8086 1 0 80 0 - 468339 signal 09:19 ? 00:00:00 ora_psp0_GRACE22
0 T oracle 8088 1 3 58 - - 468339 signal 09:19 ? 00:02:06 ora_vktm_GRACE22
0 T oracle 8092 1 0 80 0 - 468690 signal 09:19 ? 00:00:00 ora_gen0_GRACE22

Verify the ASM alert log on grac2: alert_+ASM2.log (note: pid 8134 is the ASMB process)

WARNING: client [GRACE22:GRACE2] not responsive for 217s; state=0x1. killing pid 8134
Sat Aug 03 10:25:49 2013
WARNING: client [GRACE22:GRACE2] cleanup delayed; waited 277s, pid 8134 mbr 0x1
Sat Aug 03 10:26:49 2013
WARNING: client [GRACE22:GRACE2] cleanup delayed; waited 337s, pid 8134 mbr 0x1

Check the Clusterware logfile on grac2: alertgrac2.log

2013-08-03 10:27:39.868
[cssd(2839)]CRS-1662:Member kill requested by node grac1 for member number 1, group DBGRACE2

Due to missing heartbeats, the healthy RAC instance on node grac1 will remove the hanging RAC instance by requesting a member kill escalation: alert_GRACE21.log

Sat Aug 03 10:35:04 2013
IPC Send timeout detected. Sender: ospid 4113 [oracle@grac1.example.com (LGWR)]
Receiver: inst 2 binc 430330249 ospid 8108
IPC Send timeout to 2.1 inc 16 for msg type 73 from opid 18
Sat Aug 03 10:35:10 2013
IPC Send timeout detected.
Sender: ospid 4101 [oracle@grac1.example.com (LMS0)]
Receiver: inst 2 binc 430330249 ospid 8108
IPC Send timeout to 2.1 inc 16 for msg type 32 from opid 13
Sat Aug 03 10:35:37 2013
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Sat Aug 03 10:35:54 2013
IPC Send timeout detected. Sender: ospid 4117 [oracle@grac1.example.com (SMON)]
Receiver: inst 2 binc 430330245 ospid 8106
Sat Aug 03 10:35:57 2013
Remote instance kill is issued with system inc 18
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 16, new inc 20)

Allow the database processes on grac2 to continue (by sending the CONT signal) and restart instance GRACE22 on grac2:

# ps -ef | grep ora_ | grep GRACE22 | awk '{ print $2 }' | while read PID ; do kill -CONT $PID; done

grac2: alert_GRACE22.log

Sat Aug 03 10:55:09 2013
NOTE: ASMB process is now terminating
Errors in file /u01/app/oracle/diag/rdbms/grace2/GRACE22/trace/GRACE22_asmb_8130.trc:
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact
Process ID:
Session ID: 33 Serial number: 17

The Clusterware logfiles on grac2 are updated very late: alertgrac2.log

2013-08-03 10:55:10.423
[/u01/app/11203/grid/bin/oraagent.bin(3558)]CRS-5011:Check of resource "GRACE2" failed: details at "(:CLSN00007:)" in "/u01/app/11203/grid/log/grac2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-08-03 10:55:10.426
[crsd(3080)]CRS-2765:Resource 'ora.grace2.db' has failed on server 'grac2'.
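The ps | grep | awk pipeline used for the STOP and CONT signals can be dry-run against canned ps output to confirm it selects only the GRACE22 background processes. A sketch (the sample lines are illustrative, not from the test system):

```shell
# Canned ps -ef style lines: two GRACE22 background processes and one
# from a different instance that must NOT be matched.
ps_output='oracle 8084 1 0 09:19 ? 00:00:00 ora_pmon_GRACE22
oracle 8086 1 0 09:19 ? 00:00:00 ora_psp0_GRACE22
oracle 9999 1 0 09:19 ? 00:00:00 ora_pmon_OTHERDB'

# Same filter chain as the kill loop, minus the kill itself;
# awk field 2 is the PID column.
echo "$ps_output" | grep ora_ | grep GRACE22 | awk '{ print $2 }'
# -> prints 8084 and 8086
```

On the real system the output of this chain is fed into `while read PID ; do kill -STOP $PID; done` (or `kill -CONT`) exactly as shown above.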
grac2: /u01/app/11203/grid/log/grac2/agent/crsd/oraagent_oracle/oraagent_oracle.log

2013-08-03 10:55:10.430: [ USRTHRD][2483025664] event type is CRS_RESOURCE_STATE_CHANGE
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] bodylen = 742
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} -----------BodyBlock----------
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} CLS_TINT='{0:6:795}'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} COMPOSITE_RESOURCE_STATUS=''
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} CURRENT_STATE='OFFLINE'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DATABASE_TYPE='RAC'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DB_UNIQUE_NAME='grace2'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DEGREE='1'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} DERIVED_STATE='OFFLINE'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} ID='39247926'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} INSTANCE_NAME='GRACE22'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} INTERMEDIATE_INSTANCE_COUNT='0'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} NAME='ora.grace2.db'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} ONLINE_INSTANCE_COUNT='1'
2013-08-03 10:55:10.430: [ USRTHRD][2483025664] {0:6:795} PREVIOUS_STATE='ONLINE'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} PREV_STATE_DETAILS='Open'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} REASON='FAILURE'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_CLASS='database'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_INCARNATION_NUMBER='3'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_INSTANCE_ID='ora.grace2.db 2 1'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} RESOURCE_LOCATION='grac2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} SEQUENCE_NUMBER='1000050'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} STATE_DETAILS='Abnormal Termination'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TIMESTAMP='2013-08-03 10:55:10'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TOTAL_INSTANCE_COUNT='2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} TYPE='ora.database.type'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} UNKNOWN_INSTANCE_COUNT='0'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} USER='SYSTEM'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} Version='11.2.0.3.0'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} CLUSTER_NAME='GRACE2'
2013-08-03 10:55:10.431: [ USRTHRD][2483025664] {0:6:795} DB_UNIQUE_NAME='grace2'
..
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Agent sending reply for: RESOURCE_START[ora.grace2.db 2 1] ID 4098:4383
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} ora.grace2.db 2 1 state changed from: STARTING to: ONLINE
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Started implicit monitor for [ora.grace2.db 2 1] interval=1000 delay=1000
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} ora.grace2.db 2 1 state details has changed from: to: Open
2013-08-03 10:55:39.131: [ AGFW][771737344] {0:6:795} Agent sending last reply for: RESOURCE_START[ora.grace2.db 2 1] ID 4098:4383
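A quick way to pull the state transition out of such a verbose oraagent trace is to grep for the state fields of the body block. A sketch against a canned excerpt (on a live system, feed oraagent_oracle.log itself through the same filter):

```shell
# Canned excerpt of the oraagent body block, trimmed to its state fields.
excerpt="CURRENT_STATE='OFFLINE'
PREVIOUS_STATE='ONLINE'
STATE_DETAILS='Abnormal Termination'
REASON='FAILURE'"

# Show only the fields describing the failure transition:
echo "$excerpt" | grep -E "^(CURRENT_STATE|PREVIOUS_STATE|STATE_DETAILS|REASON)="
```

For the trace above this surfaces the ONLINE -> OFFLINE transition with reason FAILURE and details "Abnormal Termination" at a glance, without scrolling through the full body block.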