Eviction Overview | Helmut's RAC / JEE Blog

Generic Info 
- Clusterware ( CW ) key process: the ocssd.bin process is multithreaded - it tracks and monitors both the Disk Heartbeat ( DHB ) and the Network Heartbeat ( NHB )
- RDBMS key processes
    - LMS, LMD - if they can't communicate with their counterparts they will initiate a reconfiguration
    - LMON - most important for Instance Eviction - implements IMR and drives the reconfiguration

Network related Instance Eviction 
- If there is no Network Heartbeat for more than 30 seconds ( the default CSS misscount ), the clssnmvKillBlockThread thread kills the local CRS
- Best practice is to use the private interface for the Cluster Interconnect, but this can be overridden
  Root Cause: Cluster interconnect down

  Significant Errors:
   clssnmPollingThread: node grac43 (3) at 90% heartbeat fatal, removal in 2.740 seconds
   CRS-1610:Network communication with node grac42 (2) missing for 90% of timeout interval
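
To make the countdown in the clssnmPollingThread warning easier to track, here is a minimal Python sketch that extracts the node name, the percentage and the remaining time from such a line. The regular expression is inferred only from the single sample line above, so it may need adjusting for other Clusterware versions. The function names and the `timeout_seconds` default of 30 (taken from the 30 second limit mentioned above) are assumptions made for illustration.

```python
import re

# Pattern derived from the sample warning quoted above. This is an assumption
# based on one line, not a documented log format.
POLLING_RE = re.compile(
    r"clssnmPollingThread: node (?P<node>\S+) \((?P<num>\d+)\) "
    r"at (?P<pct>\d+)% heartbeat fatal, removal in (?P<secs>[\d.]+) seconds"
)


def parse_polling_warning(line):
    """Return the fields of a heartbeat warning, or None if the line differs."""
    m = POLLING_RE.search(line)
    if m is None:
        return None
    return {
        "node": m.group("node"),
        "node_number": int(m.group("num")),
        "percent": int(m.group("pct")),
        "removal_in_seconds": float(m.group("secs")),
    }


def elapsed_without_heartbeat(removal_in_seconds, timeout_seconds=30.0):
    """Seconds already spent without a network heartbeat.

    timeout_seconds=30.0 is an assumed value matching the 30 second limit
    described above; adjust it if your cluster uses a different setting.
    """
    return timeout_seconds - removal_in_seconds
```

For the sample line, "removal in 2.740 seconds" with a 30 second limit means the node has already been silent for about 27.26 seconds, which is consistent with the reported 90% threshold ( 27 seconds ).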

CPU/swapping driven Instance Eviction    
- LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning. 
- It's very likely that we get lmhb traces because the LMON and LMS RAC processes do not get enough CPU
- An overloaded RAC instance ( either CPU or due to Paging/Swapping ) can cause a Node Eviction
  LMS and LMD0 traces and the RDBMS alert logs may report: IPC Send Timeout message
- When getting a Node Eviction with an IPC Send Timeout message, review the OSWatcher and CHM files carefully
  Root Cause: CPU starvation, Paging/Swapping

  Significant Errors:
   IPC Send timeout detected. Sender: ospid 10410 [oracle@grac42.example.com (LMS0)]
   LMS0 (ospid: 10410) has detected no messaging activity from instance 1
   LMS0 (ospid: 10410) issues an IMR to resolve the situation
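
When many of these messages pile up, it helps to see which sender reports them most often. The following Python sketch parses the "IPC Send timeout detected" line and counts hits per host and process. The pattern is inferred from the single sample line above and the function names are hypothetical, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Pattern derived from the sample alert log line quoted above (an assumption,
# not a documented format).
IPC_TIMEOUT_RE = re.compile(
    r"IPC Send timeout detected\. Sender: ospid (?P<ospid>\d+) "
    r"\[(?P<user>[^@\s]+)@(?P<host>\S+) \((?P<proc>\w+)\)\]"
)


def parse_ipc_timeout(line):
    """Return sender details of an IPC Send timeout line, or None."""
    m = IPC_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "ospid": int(m.group("ospid")),
        "os_user": m.group("user"),
        "host": m.group("host"),
        "process": m.group("proc"),
    }


def count_timeouts_by_sender(lines):
    """Count IPC Send timeouts per (host, process) pair."""
    counts = Counter()
    for line in lines:
        hit = parse_ipc_timeout(line)
        if hit is not None:
            counts[(hit["host"], hit["process"])] += 1
    return counts
```

Feeding the alert logs of all instances through `count_timeouts_by_sender` gives a quick hint which node to check first in the OSWatcher and CHM data.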

Top Reasons for Node Evictions
- Communication Errors ( Network related )
- Memory starvation ( Paging/Swapping )
- CPU problems ( Scheduler problems / CPU load )
- Other reasons
    - Node membership change due to Split brain issue 
    - Instance Eviction related Bugs : 
      Bug 16876500 - GI HAIP AGENT DROPS A ROUTE FREQUENTLY AND THAT LEADS TO THE INSTANCE EVICTION 
      Bug 14385860 - SOL.SPARC64 : CLSRSC-257: CLUSTER TIME SYNCHRONIZATION SERVICE START IN EXCLUSIV 

Cluster Reconfiguration - CGS Cluster Group Service
- CGS ( Cluster Group Service ) tracks which instances are members of the cluster
- CGS validates all members and updates the control file periodically
- A validation failure leads to an Instance Membership Reconfiguration ( IMR )
- CGS is responsible for GMS ( Group Membership Synchronization layer ) and IMR

Important Logs and Trace files - Review the traces in the order listed
- Alert logs from all instances ( Cluster alert.log, RDBMS alert.log, ASM alert.log )
- ocssd.log files from all nodes ( note older files are named ocssd.l01, .. )
- LMON, LMSn, LMD0 traces from all instances  
- Any other traces mentioned in any alert.log
- lmhb traces ( LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning )  
- CHM and OSWatcher logs from the eviction time
- OS message logs from all nodes ( /var/log/messages for Linux )