Table of Contents
Overview Node Eviction
in 11gR2, we have many new background processes including two cssd related processes cssdmonitor and cssdagent. Cssdagent process is charge of respawn cssd.bin. Oracle has improved the node fencing in Oracle 11g Release 2 (11.2.0.2) by killing the processes on the failed node that are capable of performing IO and then stopping the Clusterware on the failed node rather than simply rebooting the failed node ( == Instance Eviction ). This works for scenarios like missing Disk and Network Heartbeats but if CSSD process hangs we got a system reboot ( == Node eviction ).
Setup Node Eviction
Find ocsd.bin process ID # ps -ef|grep cssd root 2849 1 0 13:54 ? 00:00:00 /u01/app/11203/grid/bin/cssdmonitor root 2868 1 0 13:54 ? 00:00:00 /u01/app/11203/grid/bin/cssdagent grid 2882 1 0 13:54 ? 00:00:05 /u01/app/11203/grid/bin/ocssd.bin root 4310 4291 0 14:06 pts/2 00:00:00 grep cssd Send STOP signal to ocssd.bin process # kill -SIGSTOP 2882 After a few seconds the system was rebooted. Check lastgasp file after reboot # cd /etc/oracle/lastgasp # more cssmonit_grac2.lgl OLG01,0301,0000,GRACE2,2ae3c3415014ef2abf2ff662c5bf8512,grac2,cssmonit,L-2013-08 -03-13:53:56.464,Rebooting after limit 28200 exceeded; disk timeout 28200, netwo rk timeout 27920, last heartbeat from CSSD at epoch seconds 1375530808.212, 2825 1 milliseconds ago based on invariant clock value of 22202483 --> Reboot at 2013-08-03-13:53:56.464 Check cssdagent logfile: $GRID_HOME/log/grac2/agent/ohasd/oracssdagent_root 2013-08-03 13:53:53.608: [ USRTHRD][4160747264] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28200; disk timeout 28200, network timeout 27920, last heartbeat from CSSD at epoch seconds 1375530808.212, 25392 milliseconds ago based on invariant clock 22202492; now polling at 100 ms 2013-08-03 13:53:53.608: [ USRTHRD][4160747264] clsnwork_queue: posting worker thread 2013-08-03 13:53:53.608: [ USRTHRD][4160747264] clsnwork_queue: posting worker thread 2013-08-03 13:53:53.608: [ USRTHRD][88942336] clsnwork_process_work: calling sync 2013-08-03 13:53:53.687: [ USRTHRD][88942336] clsnwork_process_work: sync completed 2013-08-03 13:53:53.687: [ USRTHRD][88942336] clsnSyncComplete: posting omon 2013-08-03 13:53:53.688: [ USRTHRD][88942336] clsnwork_process_work: Unknown work type Check csssmonitor logfile: $GRID_HOME/log/grac2/agent/ohasd/oracssdmonitor_root 2013-08-03 13:53:53.655: [ USRTHRD][878024448] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28200; disk timeout 28200, network timeout 27920, last heartbeat from CSSD at epoch seconds 1375530808.212, 25441 milliseconds ago based on invariant clock 22202483; now polling at 100 ms 2013-08-03 13:53:53.655: [ USRTHRD][878024448] clsnwork_queue: posting worker thread 2013-08-03 13:53:53.656: [ USRTHRD][878024448] clsnwork_queue: posting worker thread 2013-08-03 13:53:53.656: [ USRTHRD][874870528] clsnwork_process_work: calling sync 2013-08-03 13:54:31.472: [ AGFW][1279174432] Starting the agent: /u01/app/11203/grid/log/grac2/agent/ohasd/oracssdmonitor_root/ Check clusterware alertlog file: $GRID_HOME/log/grac2/alertgrac2.log 2013-08-03 10:55:10.426 [crsd(3080)]CRS-2765:Resource 'ora.grace2.db' has failed on server 'grac2'. <-- reboot occurs here - note there is entry logged for the system reboot in clusterware alertlog 2013-08-03 13:54:30.831 [ohasd(2500)]CRS-2112:The OLR service started on node grac2. 2013-08-03 13:54:30.892 [ohasd(2500)]CRS-1301:Oracle High Availability Service started on node grac2. 2013-08-03 13:54:30.937 [ohasd(2500)]CRS-8011:reboot advisory message from host: grac2, component: cssmonit, with time stamp: L-2013-08-03-13:53:56.464 [ohasd(2500)]CRS-8013:reboot advisory message text: Rebooting after limit 28200 exceeded; disk timeout 28200, network timeout 27920, last heartbeat from CSSD at epoch seconds 1375530808.212, 28251 milliseconds ago based on invariant clock value of 22202483 2013-08-03 13:54:30.937 [ohasd(2500)]CRS-8017:location: /etc/oracle/lastgasp has 4 reboot advisory log files, 1 were announced and 0 errors occurred 2013-08-03 13:54:32.379 [/u01/app/11203/grid/bin/oraagent.bin(2589)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11203/grid/log/grac2/agent/ohasd/oraagent_grid/oraagent_grid.log" 2013-08-03 13:54:46.400 [gpnpd(2810)]CRS-2328:GPNPD started on node grac2. 2013-08-03 13:54:49.119 [cssd(2882)]CRS-1713:CSSD daemon is started in clustered mode /var/log/messages Aug 3 13:54:27 grac2 abrtd: Init complete, entering main loop Aug 3 13:54:27 grac2 logger: Oracle HA daemon is enabled for autostart. Aug 3 13:54:27 grac2 logger: exec /u01/app/11203/grid/perl/bin/perl -I/u01/app/11203/grid/perl/lib /u01/app/11203/grid/bin/crswrapexece.pl /u01/app/11203/grid/crs/install/s_crsconfig_grac2_env.txt /u01/app/11203/grid/bin/ohasd.bin "reboot"