Case I: CRSD daemon doesn’t start due to permission problem ( crsd/crsd.bin)
Force that error and monitor Clusterware Resource status after startup:
The command below prints the status of the lower CRS stack every 2 seconds
This will allow you to track which compomet is starting again and finally failing
# watch crsi
***** Local Resources: *****
Resource NAME INST TARGET STATE SERVER STATE_DETAILS
--------------------------- ---- ---------- ------------ --------------- -----------------------------------------
ora.asm 1 ONLINE ONLINE hract21 STABLE
ora.cluster_interconnect.haip 1 ONLINE ONLINE hract21 STABLE
ora.crf 1 ONLINE ONLINE hract21 STABLE
ora.crsd 1 ONLINE OFFLINE - STABLE
ora.cssd 1 ONLINE ONLINE hract21 STABLE
ora.cssdmonitor 1 ONLINE ONLINE hract21 STABLE
ora.ctssd 1 ONLINE ONLINE hract21 OBSERVER,STABLE
ora.diskmon 1 OFFLINE OFFLINE - STABLE
ora.drivers.acfs 1 ONLINE ONLINE hract21 STABLE
ora.evmd 1 ONLINE INTERMEDIATE hract21 STABLE
ora.gipcd 1 ONLINE ONLINE hract21 STABLE
ora.gpnpd 1 ONLINE ONLINE hract21 STABLE
ora.mdnsd 1 ONLINE ONLINE hract21 STABLE
ora.storage 1 ONLINE ONLINE hract21 STABLE
--> CRSD doesn't start
CaseI-1: Shell script /u01/app/121/grid/bin/crsd with wrong protection
# chmod 000 /u01/app/121/grid/bin/crsd
CLUVFY:
Note as our CRS stack is not fully running on hract21 we have to start cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22 ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
/u01/app/121/grid/bin/crsd..."Permissions" did not match reference
3162 files verified
Software check failed
TRACFILE review :
Use Grep command:
# grep "2015-02-16 09:2" * | egrep "no exe permission|failed to start process"
alert.log:
2015-02-16 09:25:05.681 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
2015-02-16 09:25:06.706 [ORAROOTAGENT(16807)]CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd"
for action "start": details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc:
2015-02-16 09:25:06.704671 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-02-16 09:25:06.704859 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2,
OS error: 0, other: no exe permission, file [/u01/app/121/grid/bin/crsd]
2015-02-16 09:25:06.705358 :CLSDYNAM:3169421056: [ora.crsd]{0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/121/grid/bin/crsd" for action "start":
details at "(:CLSN00008:)" in "/u01/app/grid/diag/crs/hract21/crs/trace/ohasd_orarootagent_root.trc"
DTRACE :
DTRACE is not helpful here !
--> Using dtrace will not help as OS error: 0
CW software reads the protections of crsd shell script and stops clusterware startup if protections are not sufficient.
FIX :
Change permsission and restart CRS
[root@hract21 Desktop]# ls -l /u01/app/121/grid/bin/crsd
---------- 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# chmod 755 /u01/app/121/grid/bin/crsd
[root@hract21 Desktop]# ls -l /u01/app/121/grid/bin/crsd
rwxr-xr-x 1 root oinstall 9419 Jan 30 12:26 /u01/app/121/grid/bin/crsd
CaseI-2: Executable /u01/app/121/grid/bin/crsd.bin with wrong protection
# ls -l /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30 2014 /u01/app/121/grid/bin/crsd.bin
# chmod 000 /u01/app/121/grid/bin/crsd.bin
CLUVFY :
Note as our CRS stack is not fully running on hract21 we have to start cluvfy from hract22
[grid@hract21 CLUVFY]$ ssh hract22 ~/CLUVFY/bin/cluvfy comp software -n hract21
Verifying software
Check: Software
Component: crs
Node Name: hract21
/u01/app/121/grid/bin/crsd.bin..."Permissions" did not match reference
3162 files verified
Software check failed
TRACFILE review :
Grep command:
$ grep "2015-02-16 10:3" * | egrep " fatal | terminating "
alert.log
2015-02-16 10:30:04.075 [OCSSD(29033)]CRS-1713: CSSD daemon is started in hub mode
2015-02-16 10:30:04.271 [OCSSD(29033)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-16 10:30:05.297 [OCSSD(29033)]CRS-1652: Starting clean up of CRSD resources.
2015-02-16 10:30:05.307 [OCSSD(29033)]CRS-1653: The clean up of the CRSD resources failed.
Mon Feb 16 10:30:05 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc (incident=2897):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2897/ocssd_i2897.trc
2015-02-16 10:30:05.475 [OCSSD(29033)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 29033 experienced fatal signal or exception code 6
Sweep [inc][2897]: completed
--> The above mentioned tracefiles doesn't help - so what is the real problem here ?
DTRACE SCRIPT :
/*
Tracing all failed execve calls and stops tracing after oradism was started by calling exit()
*/
syscall::execve:entry
{
self->path = copyinstr(arg0);
}
/*
Trace all failed execve call on the system
This allows us to track failed calls for OS utils like awk, sed .....
*/
syscall::execve:return
/arg0<0 && execname != "crsctl.bin"/
{
printf("- Exec: %s - execve() %s failed with error: %d", execname, self->path, arg0 );
self->path = 0;
}
/*
Track successful CW start - stop tracing after oradism was started by calling exit()
*/
syscall::execve:return
/execname == "oradism"/
{
printf("- Exec: %s - execve() %s - Lower CLUSTERWARE stack successfully started - ret code: %d - EXITING !", execname, self->path, arg0 );
self->path = 0;
exit(0);
}
DTRACE OUTPUT :
Using our dtrace script will quickly identify the root cause [ crsd.bin failed due to Permission error ]
[root@hract21 DTRACE]# dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU ID FUNCTION:NAME
0 1 :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin - Temp Loc: /var/tmp/.oracle - PIDFILE: hract21.pid - Port for bind: 53
0 9 open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir: /var/tmp/.oracle
0 89 connect:return - Exec: mdnsd.bin - PID: 17355 connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0 89 connect:return - Exec: gipcd.bin - PID: 17400 connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0 89 connect:return - Exec: gipcd.bin - PID: 17412 connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0 89 connect:return - Exec: gipcd.bin - PID: 17429 connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0 123 execve:return - Exec: crsd - execve() /u01/app/121/grid/bin/crsd.bin failed with error: -13
FIX :
Change permsission and restart CRS
[root@hract21 Desktop]# ls -l /u01/app/121/grid/bin/crsd.bin
---------- 1 root oinstall 369207571 Jun 30 2014 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# chmod 741 /u01/app/121/grid/bin/crsd.bin
[root@hract21 Desktop]# ls -l /u01/app/121/grid/bin/crsd.bin
[root@hract21 network-scripts]# ls -l /u01/app/121/grid/bin/crsd.bin
-rwxr----x 1 root oinstall 369207571 Jun 30 2014 /u01/app/121/grid/bin/crsd.bin
Pages: Page 1, Page 2, Page 3, Page 4, Page 5, Page 6, Page 7, Page 8, Page 9, Page 10, Page 11, Page 12, Page 13, Page 14, Page 15, Page 16
Many thx
This is very helpful
I really like looking through an article that can make people think.
Also, thank you for permitting me to comment!