Troubleshooting Clusterware startup problems with DTRACE

Case III: GPnP fails due to unreadable CW “special” socket files located in var/tmp/.oracle/

Typically this directory contains a number of "special" socket files that are used by local clients to
connect via the IPC protocol (sqlnet) to various Oracle processes including the TNS listener, the CSS,
CRS & EVM daemons or even  database or ASM instances
Force that  error and monitor Clusterware Resource status after startup:

[root@hract21 DTRACE]# chmod 000  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]#  ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
---------- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

*****  Local Resources: *****
Resource NAME               INST   TARGET       STATE        SERVER          STATE_DETAILS
--------------------------- ----   ------------ ------------ --------------- -----------------------------------------
ora.asm                        1   ONLINE       OFFLINE      -               STABLE
ora.cluster_interconnect.haip  1   ONLINE       OFFLINE      -               STABLE
ora.crf                        1   ONLINE       OFFLINE      -               STABLE
ora.crsd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssd                       1   ONLINE       OFFLINE      -               STABLE
ora.cssdmonitor                1   OFFLINE      OFFLINE      -               STABLE
ora.ctssd                      1   ONLINE       OFFLINE      -               STABLE
ora.diskmon                    1   OFFLINE      OFFLINE      -               STABLE
ora.drivers.acfs               1   ONLINE       ONLINE       hract21         STABLE
ora.evmd                       1   ONLINE       INTERMEDIATE hract21         STABLE
ora.gipcd                      1   ONLINE       OFFLINE      -               STABLE
ora.gpnpd                      1   ONLINE       OFFLINE      hract21         STARTING
ora.mdnsd                      1   ONLINE       ONLINE       hract21         STABLE
ora.storage                    1   ONLINE       OFFLINE      -               STABLE
--> GPnPD doesn't start !

CLUVFY:
Found no cluvfy command to detect this error

TRACFILE review :
alert.log :
2015-02-16 12:37:45.345 [GPNPD(21849)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21849
2015-02-16 12:37:45.388 [GPNPD(21849)]CRS-8501: Oracle Clusterware GPNPD process with operating system process ID 21849 is ending with return value 48
2015-02-16 12:47:45.281 [ORAAGENT(21818)]CRS-5818: Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2}
in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_oraagent_grid.trc.

ohasd_oraagent_grid.trc:
2015-02-16 12:47:45.837353 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:45.838926 :CLSDYNAM:310683392: [ora.gpnpd]{0:0:2} [start] (:CLSN00107:) clsn_agent::start }
2015-02-16 12:47:49.288144 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] got lock
2015-02-16 12:47:49.288165 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] tryActionLock }
2015-02-16 12:47:49.288187 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] abort  }
2015-02-16 12:47:49.288205 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort }
2015-02-16 12:47:49.288230 :    AGFW:304379648: {0:0:2} Command: start for resource: ora.gpnpd 1 1 completed with status: TIMEDOUT
2015-02-16 12:47:49.288313 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288361 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Utils::getOracleHomeAttrib oracle_home:/u01/app/121/grid
2015-02-16 12:47:49.288672 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] PID 20189 from /u01/app/121/grid/gpnp/init/hract21.pid
2015-02-16 12:47:49.288708 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.289427 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.290221 :    AGFW:308582144: {0:0:2} Agent sending reply for: RESOURCE_START[ora.gpnpd 1 1] ID 4098:270
2015-02-16 12:47:49.292126 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.292208 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Calling PID check for daemon
2015-02-16 12:47:49.292269 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] Process id 20189 translated to
2015-02-16 12:47:49.292384 :  CLSDMC:304379648: Connecting to ipc://hract21_DBG_GPNPD
2015-02-16 12:47:49.292604 :  CLSDMC:304379648: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2015-02-16 12:47:49.292658 :CLSDYNAM:304379648: [ora.gpnpd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2015-02-16 12:47:49.293759 :    AGFW:308582144: {0:0:2} ora.gpnpd 1 1 state changed from: STARTING to: OFFLINE
2015-02-16 12:47:49.293871 :    AGFW:308582144: {0:0:2} Agent sending last reply for: RESOURCE_START[ora.gpnpd 1 1] ID 40
--> Got no further indications that file permissons on file /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock  is the root cause !

DTRACE SCRIPT :
syscall::open:entry
{
self->path = copyinstr(arg0);
}

syscall::open:return
/arg0<0 && substr( self->path,0,var_tmp_len) == var_tmp_loc &&  substr(execname,0,4)  != "asm_"  /
{
printf("- Exec: %s - open() %s failed with error: %d - scan_dir:  %s ", execname, self->path, arg0, substr( self->path,0,var_tmp_len)   );
}

DTRACE OUTPUT :
Again DTrace helps us to find that problem very quickly :
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin  - Temp Loc: /var/tmp/.oracle -  PIDFILE: hract21.pid - Port for bind: 53
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0      9                      open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir:  /var/tmp/.oracle
0     89                   connect:return - Exec: mdnsd.bin - PID: 20575  connect() failed with error : -101 - fd : 39 - IP: 17.17.17.17 - Port: 256
0      9                      open:return - Exec: gpnpd.bin - open() /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock failed with error: -13 - scan_dir:  /var/tmp/.oracle

FIX :
Change permission and reboot Clusterware :
[root@hract21 DTRACE]# chmod 644  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
[root@hract21 DTRACE]# ls -l  /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock
-rw-r--r-- 1 grid oinstall 0 Feb 15 08:18 /var/tmp/.oracle/ora_gipc_GPNPD_hract21_lock

2 thoughts on “Troubleshooting Clusterware startup problems with DTRACE”

Leave a Reply

Your email address will not be published. Required fields are marked *