Case VIII: GIPCD, GPNPD, CSSD not starting as the nameserver is not reachable
- OS Error [ ECONNREFUSED 111 -> Connection refused ]
Force the error by stopping the nameserver, then monitor the Clusterware resource status after startup.
On the nameserver run:
[root@ns1 ~]# service named stop
Stopping named: . [ OK ]
***** Local Resources: *****
Resource NAME                  INST TARGET STATE        SERVER  STATE_DETAILS
------------------------------ ---- ------ ------------ ------- -------------
ora.asm                        1    ONLINE OFFLINE      -       STABLE
ora.cluster_interconnect.haip  1    ONLINE OFFLINE      -       STABLE
ora.crf                        1    ONLINE ONLINE       hract21 STABLE
ora.crsd                       1    ONLINE OFFLINE      -       STABLE
ora.cssd                       1    ONLINE OFFLINE      hract21 STARTING
ora.cssdmonitor                1    ONLINE ONLINE       hract21 STABLE
ora.ctssd                      1    ONLINE OFFLINE      -       STABLE
ora.diskmon                    1    ONLINE OFFLINE      -       STABLE
ora.drivers.acfs               1    ONLINE ONLINE       hract21 STABLE
ora.evmd                       1    ONLINE INTERMEDIATE hract21 STABLE
ora.gipcd                      1    ONLINE OFFLINE      -       STABLE
ora.gpnpd                      1    ONLINE INTERMEDIATE hract21 STABLE
ora.mdnsd                      1    ONLINE ONLINE       hract21 STABLE
ora.storage                    1    ONLINE OFFLINE      -       STABLE
--> GIPCD, GPNPD, CSSD not starting
CLUVFY :
[grid@hract21 trace]$ ~/CLUVFY/bin/cluvfy -version
PRVF-0002 : Could not retrieve local nodename
TRACEFILE review :
Grep trace files for any Resolve errors [ OS function: getaddrinfo() ]
[grid@hract21 trace]$ grep "2015-02-17 14:1" * | grep gipcmodNetworkResolve
gipcd.trc:
2015-02-17 14:13:36.137197 :GIPCXCPT:2309576448: gipcInternalEndpoint: failed to bind address to endpoint name 'tcp://hract21.example.com', ret gipcretFail (1)
2015-02-17 14:13:41.141266 :GIPCXCPT:2309576448: gipcmodNetworkResolve: failed to create new address for osName 'hract21.example.com', name 'tcp://hract21.example.com'
2015-02-17 14:13:41.141285 :GIPCXCPT:2309576448: gipcmodNetworkResolve: slos op : sgipcnPopulateAddrInfo
2015-02-17 14:13:41.141289 :GIPCXCPT:2309576448: gipcmodNetworkResolve: slos dep : Connection refused (111)
2015-02-17 14:13:41.141293 :GIPCXCPT:2309576448: gipcmodNetworkResolve: slos loc : getaddrinfo(
2015-02-17 14:13:41.141297 :GIPCXCPT:2309576448: gipcmodNetworkResolve: slos info: server not available,try again
2015-02-17 14:13:41.141342 :GIPCXCPT:2309576448: gipcResolveF [gipcInternalBind : gipcInternal.c : 537]: EXCEPTION[ ret gipcretFail (1) ]
failed to resolve address 0x7fd764033be0 [0000000000000310]
{ gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x4000
2015-02-17 14:13:41.141365 :GIPCXCPT:2309576448: gipcBindF [gipcInternalEndpoint : gipcInternal.c : 468]: EXCEPTION[ ret gipcretFail (1) ] failed to bind endp 0x7fd764033070 [000000000000030e] { gipcEndpoint : localAddr 'tcp://hract21.example.com', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj (nil), sendp (nil) status 13flags 0x40008000, flags-2 0x0, usrFlags 0x240a0 }, addr 0x7fd764034890 [0000000000000315] { gipcAddress : name 'tcp://hract21.example.com', objFlags 0x0, addrFlags 0x8 }, flags 0x200a0
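The trace shows the lookup failing inside getaddrinfo(). The same lookup can be repeated outside of Clusterware with a few lines of Python. This is only a minimal sketch: `check_resolve` and its `resolver` parameter are hypothetical names chosen for illustration; the only library call used is `socket.getaddrinfo`.

```python
import socket


def check_resolve(hostname, resolver=socket.getaddrinfo):
    """Try to resolve hostname and return (ok, detail) instead of raising."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        # gaierror carries the getaddrinfo() error code and message
        return False, "getaddrinfo failed: %s" % exc
    except OSError as exc:
        return False, "OS error: %s" % exc
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)
```

Running `check_resolve("hract21.example.com")` on the affected node should return a failure tuple for as long as the nameserver is down, assuming the name is not served from /etc/hosts.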
DTRACE SCRIPT helper:
Use strace first to see which system calls are involved; this shows what a working DTRACE script has to trace
22752 connect(27, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.5.50")}, 16 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth2", ifr_broadaddr={AF_INET, inet_addr("192.168.2.255")}}) = 0
22752 <... connect resumed> ) = 0
22750 ioctl(28, SIOCGIFFLAGS <unfinished ...>
22752 poll([{fd=27, events=POLLOUT}], 1, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
22752 <... poll resumed> ) = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFADDR <unfinished ...>
22752 sendto(27, "\320X\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_addr={AF_INET, inet_addr("192.168.3.121")}}) = 0
22752 <... sendto resumed> ) = 37
22750 ioctl(28, SIOCGIFNETMASK <unfinished ...>
22752 poll([{fd=27, events=POLLIN|POLLOUT}], 1, 5000 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_netmask={AF_INET, inet_addr("255.255.255.0")}}) = 0
22752 <... poll resumed> ) = 1 ([{fd=27, revents=POLLOUT}])
22750 ioctl(28, SIOCGIFBRDADDR <unfinished ...>
22752 sendto(27, "\16\227\1\0\0\1\0\0\0\0\0\0\7hract21\7example\3com"..., 37, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
22750 <... ioctl resumed> , {ifr_name="eth3", ifr_broadaddr={AF_INET, inet_addr("192.168.3.255")}}) = 0
22752 <... sendto resumed> ) = -1 ECONNREFUSED (Connection refused)
--> The connect() call on fd=27 works - parameter 2 of our connect call holds the IP address and port of the nameserver
The first sendto(27, .. ) call returns 37, but the following sendto(27, .. ) call fails with error ECONNREFUSED
To select the right sendto() call you need to use the PID ( 22752 ) and the file descriptor fd=27 ( sendto(27, .. ) )
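The matching by PID and file descriptor described above can be sketched in Python against saved strace output. This is an illustration only: `join_calls` and `failed_sends` are hypothetical helper names, and the parser assumes the `strace -f` line format shown above (PID prefix, `<unfinished ...>` and `<... resumed>` markers).

```python
import re

CALL = re.compile(r"^(\d+)\s+(\w+)\((.*)$")
RESUMED = re.compile(r"^(\d+)\s+<\.\.\. (\w+) resumed>\s*(.*)$")
UNFINISHED = " <unfinished ...>"
PEER = re.compile(r'sin_port=htons\((\d+)\), sin_addr=inet_addr\("([\d.]+)"\)')
FAILED = re.compile(r"\) = -1 (\w+)")


def join_calls(lines):
    """Yield (pid, syscall, text), merging unfinished/resumed halves per PID."""
    pending = {}
    for line in lines:
        line = line.strip()
        m = RESUMED.match(line)
        if m:
            pid, name, tail = m.groups()
            head = pending.pop((pid, name), "")
            yield int(pid), name, head + tail
            continue
        m = CALL.match(line)
        if not m:
            continue
        pid, name, rest = m.groups()
        if rest.endswith(UNFINISHED):
            # Keep the first half until the "resumed" line arrives
            pending[(pid, name)] = rest[: -len(UNFINISHED)]
        else:
            yield int(pid), name, rest


def failed_sends(lines):
    """Return (pid, fd, errno_name, peer) for every failed sendto()."""
    peers = {}      # (pid, fd) -> (ip, port) taken from connect()
    failures = []
    for pid, name, text in join_calls(lines):
        fd_match = re.match(r"(\d+),", text)
        if not fd_match:
            continue  # first argument is not a plain fd (e.g. poll)
        fd = int(fd_match.group(1))
        if name == "connect":
            m = PEER.search(text)
            if m:
                # Recorded regardless of the connect() result
                peers[(pid, fd)] = (m.group(2), int(m.group(1)))
        elif name == "sendto":
            m = FAILED.search(text)
            if m:
                failures.append((pid, fd, m.group(1), peers.get((pid, fd))))
    return failures
```

Fed with the strace lines above, `failed_sends` reports one entry: PID 22752, fd 27, ECONNREFUSED, with the peer 192.168.5.50 port 53 taken from the earlier connect() call.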
Requirements for DTRACE script details
- Collect info about the IP address from a former connect() call ( we need to trace all connect calls )
- Trace the sendto call for errors like ( ECONNREFUSED )
- Use the file descriptor fd ( fd=27 ) to tie up the connect call and the sendto call
- Always attach strace to the gipcd process to get an idea whether your Oracle version
executes the same system calls in the same order
DTRACE SCRIPT :
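/* Excerpt only. ns_ip_port, self->fd, self->ip1..ip4 and self->port are
   assumed to be set by clauses of the script that are not shown here
   (for example BEGIN and syscall::connect:entry). */

/* Report every connect() that targets the nameserver port. */
syscall::connect:return
/self->port == ns_ip_port && execname != "crsctl.bin" /
{
    printf("- Exec: %s - PID: %d connect() - fd : %d - IP: %d.%d.%d.%d - Port: %d " , execname, pid, self->fd,
        self->ip1, self->ip2, self->ip3, self->ip4, self->port );
}

/* Remember the file descriptor so that the return probe can report it. */
syscall::sendto:entry
/execname != "crsctl.bin" /
{
    self->fds = arg0;
}

/* A negative return value signals a failed sendto(); in the output below
   it shows up as the negated errno ( -111 ). */
syscall::sendto:return
/arg0<0 && execname != "crsctl.bin" /
{
    printf("- Exec: %s - PID: %d sendto() failed with error : %d - fd : %d " , execname, pid, arg0, self->fds );
}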
DTRACE output:
[root@hract21 DTRACE]# !dt
dtrace -s check_rac.d
dtrace: script 'check_rac.d' matched 21 probes
CPU ID FUNCTION:NAME
0 1 :BEGIN GRIDHOME: /u01/app/121/grid - GRIDHOME/bin: /u01/app/121/grid/bin - Temp Loc: /var/tmp/.oracle - PIDFILE: hract21.pid - Port for bind: 53
0 9 open:return - Exec: ohasd.bin - open() /var/tmp/.oracle/npohasd failed with error: -6 - scan_dir: /var/tmp/.oracle
0 93 sendto:return - Exec: orarootagent.bi - PID: 29204 sendto() failed with error : -111 - fd : 15
0 93 sendto:return - Exec: oraagent.bin - PID: 29308 sendto() failed with error : -111 - fd : 15
0 93 sendto:return - Exec: oraagent.bin - PID: 29308 sendto() failed with error : -111 - fd : 15
0 89 connect:return - Exec: gipcd.bin - PID: 29363 connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0 93 sendto:return - Exec: gipcd.bin - PID: 29363 sendto() failed with error : -111 - fd : 27
0 89 connect:return - Exec: gipcd.bin - PID: 29363 connect() to Nameserver - fd : 27 - IP: 192.168.5.50 - Port: 53
0 93 sendto:return - Exec: mdnsd.bin - PID: 29320 sendto() failed with error : -111 - fd : 7
0 93 sendto:return - Exec: gpnpd.bin - PID: 29343 sendto() failed with error : -111 - fd : 15
--> In this sample gipcd.bin is failing to communicate with the nameserver
The failed system call is sendto() - Error ECONNREFUSED 111 - Connection refused following a
successful connect() system call.
Note: The file descriptor fd=27 signals that the connect() and sendto() system calls operate on the same socket/file descriptor
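The DTRACE output above can be post-processed the same way: translate the negated errno into its symbolic name and attach the nameserver address from a connect() line with the same PID and fd. A minimal sketch, assuming the message layout shown in the output above; `summarize` is a hypothetical helper name, and errno numbers are platform specific (111 is ECONNREFUSED on Linux).

```python
import errno
import re

CONNECT = re.compile(
    r"Exec: (\S+) - PID: (\d+) connect\(\) to Nameserver"
    r" - fd : (\d+) - IP: ([\d.]+) - Port: (\d+)"
)
SENDTO = re.compile(
    r"Exec: (\S+) - PID: (\d+) sendto\(\) failed with error : (-?\d+) - fd : (\d+)"
)


def summarize(lines):
    """Return (exec, pid, fd, errno_name, peer) for each failed sendto() line."""
    peers = {}   # (pid, fd) -> (ip, port) from connect() lines
    report = []
    for line in lines:
        line = " ".join(line.split())   # normalize column spacing
        m = CONNECT.search(line)
        if m:
            _exe, pid, fd, ip, port = m.groups()
            peers[(int(pid), int(fd))] = (ip, int(port))
            continue
        m = SENDTO.search(line)
        if m:
            exe, pid, err, fd = m.groups()
            # The script prints the negated errno, e.g. -111
            name = errno.errorcode.get(abs(int(err)), "UNKNOWN")
            key = (int(pid), int(fd))
            report.append((exe, key[0], key[1], name, peers.get(key)))
    return report
```

A failed sendto() whose PID and fd were seen in an earlier connect() line gets the nameserver address attached; all other failures are reported with `None` as peer.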
Investigate & Fix
[root@hract21 network-scripts]# ping ns1.example.com
ping: unknown host ns1.example.com
--> Fix : Restart your nameserver and check the nameserver IP address and port
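After restarting named, the nameserver port can be probed directly, without going through the resolver library. The sketch below mimics what strace showed: a UDP connect() followed by a 37-byte DNS query for hract21.example.com. `build_dns_query` and `probe_nameserver` are hypothetical names; the query layout follows the standard DNS message format (12-byte header, QNAME, QTYPE, QCLASS), and the default transaction ID is arbitrary.

```python
import socket
import struct


def build_dns_query(hostname, txid):
    """Build a minimal DNS query: one A-record question, recursion desired."""
    # Header: ID, flags=0x0100 (RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe_nameserver(ip, hostname, port=53, timeout=5.0, txid=0x1234):
    """Send one UDP query on a connected socket and return a status string."""
    query = build_dns_query(hostname, txid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((ip, port))   # UDP connect() only records the peer
            sock.send(query)
            sock.recv(512)
        except ConnectionRefusedError:
            return "refused"           # ECONNREFUSED: nothing listens on the port
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return "error: %s" % exc
    return "answered"
```

`probe_nameserver("192.168.5.50", "hract21.example.com")` should return "answered" once named is running again. On Linux a stopped named on a reachable host typically yields "refused", because connect() on a UDP socket succeeds locally and the refusal only surfaces on a later send or receive, which matches the strace output above.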