Generic Info
- Clusterware (CW) key process: ocssd.bin
  - The process is multithreaded
  - It tracks and monitors both the Disk Heartbeat (DHB) and the Network Heartbeat (NHB)
- RDBMS key processes
  - LMS, LMD: if they can't communicate with their counterparts, they initiate a reconfiguration
  - LMON: the most important process for Instance Eviction; it implements IMR and drives the reconfiguration

Network related Instance Eviction
- If there is no Network Heartbeat for more than 30 seconds, the clssnmvKillBlockThread thread kills the local CRS
- Best practice is to use the private interface for the Cluster Interconnect, but this can be overridden
Root Cause: Cluster interconnect down
Significant Errors:
  clssnmPollingThread: node grac43 (3) at 90% heartbeat fatal, removal in 2.740 seconds
  CRS-1610:Network communication with node grac42 (2) missing for 90% of timeout interval

CPU/swapping driven Instance Eviction
- LMHB monitors the LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning
- lmhb traces are very likely to appear when the LMON and LMS RAC processes do not get enough CPU
- An overloaded RAC instance (either CPU bound or paging/swapping) can cause a Node Eviction
- The LMS and LMD0 traces and the RDBMS alert logs may report an IPC Send Timeout message
- When a Node Eviction comes with an IPC Send Timeout message, review the OSWatcher and CHM files carefully
Root Cause: CPU starvation, paging/swapping
Significant Errors:
  IPC Send timeout detected.
  Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
  LMS0 (ospid: 10410) has detected no messaging activity from instance 1
  LMS0 (ospid: 10410) issues an IMR to resolve the situation

Node Evictions
- Top reasons
  - Communication errors (network related)
  - Memory starvation (paging/swapping)
  - CPU problems (scheduler problems / CPU load)
- Other reasons
  - Node membership change due to a split-brain issue
- Instance Eviction related bugs:
  Bug 16876500 - GI HAIP AGENT DROPS A ROUTE FREQUENTLY AND THAT LEADS TO THE INSTANCE EVICTION
  Bug 14385860 - SOL.SPARC64 : CLSRSC-257: CLUSTER TIME SYNCHRONIZATION SERVICE START IN EXCLUSIV

Cluster Reconfiguration - CGS Cluster Group Service
- CGS (Cluster Group Service) tracks which instances are members of a cluster
- CGS validates all members and updates the control file periodically
- A failure leads to an Instance Membership Reconfiguration (IMR)
- CGS is responsible for GMS (Group Membership Synchronization layer) and IMR

Important Logs and Trace files
- Review the traces in the order listed here
- Alert logs from all instances (Cluster alert.log, RDBMS alert.log, ASM alert.log)
- ocssd.log files for all instances (note: older files are named ocssd.l01, ..)
- LMON, LMSn, LMD0 traces from all instances
- Any other traces mentioned in any alert.log
- lmhb traces (LMHB monitors the LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning)
- CHM and OSWatcher logs from the eviction time
- OS message logs from all nodes (/var/log/messages for Linux)
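
When working through the logs listed above, it can help to pre-filter them for the significant errors quoted in this note. The following is a minimal Python sketch of such a filter, not an Oracle-provided tool. The pattern table `EVICTION_PATTERNS`, the kind labels, and the helpers `scan_lines` and `scan_file` are hypothetical names introduced for illustration, and the regular expressions are modeled only on the sample messages shown here. Real log lines carry timestamps and prefixes that vary between releases, so the patterns may need adjusting.

```python
import re

# Hypothetical pattern table, modeled only on the sample messages quoted in
# this note. Message wording in other releases may differ (assumption).
EVICTION_PATTERNS = {
    "nhb_missing": re.compile(
        r"clssnmPollingThread: node (?P<node>\S+) \((?P<node_no>\d+)\) "
        r"at (?P<pct>\d+)% heartbeat fatal, removal in (?P<secs>[\d.]+) seconds"
    ),
    "crs_1610": re.compile(
        r"CRS-1610:\s*Network communication with node (?P<node>\S+) "
        r"\((?P<node_no>\d+)\) missing for (?P<pct>\d+)% of timeout interval"
    ),
    "ipc_send_timeout": re.compile(r"IPC Send timeout detected"),
    "ipc_sender": re.compile(
        r"Sender: ospid (?P<ospid>\d+) \[(?P<user_host>\S+) \((?P<process>\w+)\)\]"
    ),
    "imr_issued": re.compile(
        r"(?P<process>\w+) \(ospid: (?P<ospid>\d+)\) issues an IMR"
    ),
}


def scan_lines(lines):
    """Yield (line_no, kind, fields) for each line matching a known pattern."""
    for line_no, line in enumerate(lines, start=1):
        for kind, pattern in EVICTION_PATTERNS.items():
            match = pattern.search(line)
            if match:
                yield line_no, kind, match.groupdict()
                break  # report at most one finding per line


def scan_file(path):
    """Scan one log or trace file and return all findings as a list."""
    with open(path, errors="replace") as handle:
        return list(scan_lines(handle))


if __name__ == "__main__":
    sample = [
        "clssnmPollingThread: node grac43 (3) at 90% heartbeat fatal, removal in 2.740 seconds",
        "IPC Send timeout detected.",
        "Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]",
    ]
    for line_no, kind, fields in scan_lines(sample):
        print(line_no, kind, fields)
```

Running `scan_file` over each file in the review order given above yields a rough timeline of heartbeat and IPC findings per node. The hits only point to where to start reading; the surrounding trace content still has to be reviewed by hand.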