Table of Contents
Reference for CHM statistics – you may review this first
Displaying and interpreting CHM data causing a Node Eviction
- Try to find two time windows so that we can easily compare a good and a bad system state (see the sketch below for pulling the raw CHM data for these windows)
- 03-22-14 09.55.00 – 09.55.59 : [ WORKING SCENARIO ]
- 03-22-14 10.03.00 – 10.03.59 : [ PROBLEM SCENARIO ] – system freeze, heavy swapping, no CPU available
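The print_*.sh helper scripts used below work on the CHM data for node grac41 (grac41_CHMOS). If you want to pull the raw node views for the two windows yourself, oclumon can dump them straight from the CHM repository. A minimal sketch, assuming an 11.2/12.1 Grid Infrastructure oclumon (the output file names are illustrative, and the exact flags and timestamp format may vary with the version):

% oclumon dumpnodeview -n grac41 -s "2014-03-22 09:55:00" -e "2014-03-22 09:56:00" -v > chm_good.txt
% oclumon dumpnodeview -n grac41 -s "2014-03-22 10:03:00" -e "2014-03-22 10:04:00" -v > chm_bad.txt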
System Metrics:

WORKING SCENARIO:
% ~/print_sys.sh grac41_CHMOS pcpus | egrep '#pcpus|cpuq:|03-22-14 09.55'
#pcpus: 2 #vcpus: 2 cpuht: N chipname: Intel(R) swaptotal: 5210108 physmemtotal: 4354292 #sysfdlimit: 6815744 #disks: 27 #nics: 6
cpu: cpuq: memfree: mcache: swapfree: ior: iow: ios: swpin: swpout: pgin: pgout: netr: netw: procs: rtprocs: #fds: nicErrors:
03-22-14 09.55.00 71.57 4 1851948 931104 3524948 546 599 582 135 0 390 473 754 797 365 14 19488 0
03-22-14 09.55.05 72.62 85 1831188 932568 3525120 208 692 244 42 0 115 404 357 379 364 14 19488 0
03-22-14 09.55.10 99.51 71 1765596 932800 3525144 101 127 37 12 0 55 72 58 18 364 15 19520 0 CPU > 90%
03-22-14 09.55.15 99.75 85 1696272 933380 3525172 151 381 48 15 0 83 196 18 14 369 15 19552 0 CPU > 90%
03-22-14 09.55.20 100.0 89 1616644 933664 3525196 39 79 26 1 0 21 44 32 18 376 15 19520 0 CPU > 90%
03-22-14 09.55.25 99.39 89 1557244 933736 3525228 122 99 40 25 0 70 60 30 15 366 15 19552 0 CPU > 90%
03-22-14 09.55.30 100.0 75 1498456 934120 3525236 164 872 63 6 0 90 442 14 10 369 15 19648 0 CPU > 90%
03-22-14 09.55.35 100.0 80 1437248 934300 3525288 145 441 108 38 0 80 227 69 35 378 15 19648 0 CPU > 90%
03-22-14 09.55.40 99.87 68 1381740 934524 3525376 220 291 59 55 0 146 157 30 22 367 15 19648 0 CPU > 90%
03-22-14 09.55.45 99.75 67 1315868 935176 3525424 161 418 56 43 0 82 215 38 16 367 15 19584 0 CPU > 90%
03-22-14 09.55.50 99.51 77 1268612 935332 3525460 126 170 51 38 0 74 94 24 16 367 15 19648 0 CPU > 90%
03-22-14 09.55.55 99.87 111 1207984 935820 3525508 209 921 148 25 0 106 493 31 23 373 15 19584 0 CPU > 90%

Interpreting the data:
- cpuq increases from 4 to 111 by the end of this monitor window and stays high after 09.55.05
- CPU usage increases from 71% to about 100% by the end of this monitor window and stays high after 09.55.10
- Free memory decreases from 1851948 KByte to 1207984 KByte in consistent chunks
- Paging and swapping are not yet a problem

PROBLEM SCENARIO:
% ~/print_sys.sh grac41_CHMOS pcpus | egrep '#pcpus|cpuq:|03-22-14 10.03'
#pcpus: 2 #vcpus: 2 cpuht: N chipname: Intel(R) swaptotal: 5210108 physmemtotal: 4354292 #sysfdlimit: 6815744 #disks: 27 #nics: 6
cpu: cpuq: memfree: mcache: swapfree: ior: iow: ios: swpin: swpout: pgin: pgout: netr: netw: procs: rtprocs: #fds: nicErrors:
03-22-14 10.03.00 11.74 1 93960 171796 172100 6049 3475 286 1899 1593 3033 1650 14 10 379 15 19584 0 MEMLOW < 10%
03-22-14 10.03.05 7.52 1 103496 172052 167384 6020 3678 129 1292 1524 3006 1580 24 10 384 15 19584 0 MEMLOW < 10%
03-22-14 10.03.10 6.60 2 113480 173020 188364 4170 1933 669 1078 960 2030 996 18 9 388 15 19488 0 MEMLOW < 10%
03-22-14 10.03.15 7.24 1 109800 173584 181808 6279 3576 190 1495 1655 3189 1679 27 17 381 15 19520 0 MEMLOW < 10%
03-22-14 10.03.20 24.20 61 85236 173572 195528 4258 2247 789 1174 1135 2088 1172 48 11 394 15 19456 0 MEMLOW < 10%
03-22-14 10.03.25 19.96 64 87344 166056 138244 255 11165 113 69 5647 98 5650 19 4 394 15 19456 0 MEMLOW < 10%
03-22-14 10.03.30 100.0 17 89932 164992 129100 35716 475205 126771 31478 472380 35716 472380 823 885 394 15 19456 0 CPU > 90% MEMLOW < 10%
03-22-14 10.03.35 12.18 1 87380 168188 120572 4599 5959 278 1435 2051 2234 2175 49 9 379 15 19392 0 MEMLOW < 10%
03-22-14 10.03.40 6.85 1 86364 172672 113828 4973 3378 129 1288 1649 2490 1791 15 4 390 15 19456 0 MEMLOW < 10%
03-22-14 10.03.45 5.22 1 86804 175592 112524 2504 1439 352 523 364 1314 452 15 14 392 15 19456 0 MEMLOW < 10%
03-22-14 10.03.50 6.89 1 86768 175572 105128 4016 3705 93 836 1697 1908 1791 8 3 391 15 19488 0 MEMLOW < 10%
03-22-14 10.03.55 51.5 124 69692 171176 18184 956 35388 311 213 17409 416 17555 13 8 393 15 19488 0 MEMLOW < 10%

Interpreting the data:
- memfree (down to 69 MByte) and swapfree (down to 18 MByte at 10.03.55) are reaching critical values
- swpout activity reaches a high value of 472 MByte at 10.03.30
- Heavy pgin activity (35716) at 10.03.30
- runq shows only 1 process, but very likely we have a high count of blocked processes
- CPU usage is far below 100%, but again very likely because processes are waiting on swapping/paging activity
- This system is ready to freeze even though runq and CPU usage look healthy (10.03.40, 10.03.45, 10.03.50)
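When scanning longer periods than a one-minute window, it helps to let awk pre-filter the snapshots that breach a threshold instead of reading every row. A minimal sketch against the print_sys.sh output lines shown above (the column positions $3 = cpu, $4 = cpuq, $5 = memfree and the threshold values are assumptions for illustration):

% ~/print_sys.sh grac41_CHMOS pcpus | awk '/^03-22-14/ && ($3 > 90 || $4 > 20) { print $1, $2, "cpu=" $3, "cpuq=" $4, "memfree=" $5 }'

The thresholds are arbitrary and only meant to narrow down which snapshots deserve a closer look.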
Let's investigate our swap device.

WORKING SCENARIO:
% ~/print_disk.sh grac41_CHMOS dm-1 pcpus dm-1 | egrep 'ior:|Device|03-22-14 09.55'
Device_name: dm-1 ior: iow: ios: qlen: wait: type: Disk/Part
'03-22-14 09.55.00' 135.175 0.000 33 0 18 SWAP dm-1
'03-22-14 09.55.05' 42.441 0.000 10 0 18 SWAP dm-1
'03-22-14 09.55.10' 12.808 0.000 3 0 20 SWAP dm-1
'03-22-14 09.55.15' 15.168 0.000 3 0 68 SWAP dm-1
'03-22-14 09.55.20' 1.600 0.000 0 0 83 SWAP dm-1
'03-22-14 09.55.25' 25.629 0.000 6 0 20 SWAP dm-1
'03-22-14 09.55.30' 6.394 0.000 1 0 22 SWAP dm-1
'03-22-14 09.55.35' 38.419 0.000 9 0 98 SWAP dm-1
'03-22-14 09.55.40' 55.234 0.000 13 0 23 SWAP dm-1
'03-22-14 09.55.45' 43.253 0.000 10 0 35 SWAP dm-1
'03-22-14 09.55.50' 38.357 0.000 9 0 33 SWAP dm-1

- At 03-22-14 09.55 the swap device looks good
- Low qlen
- Low ios (about 15 for the complete monitor window)
- No writes to the swap device, only some reads

PROBLEM SCENARIO:
% ~/print_disk.sh grac41_CHMOS dm-1 pcpus dm-1 | egrep 'ior:|Device|03-22-14 10.03'
Device_name: dm-1 ior: iow: ios: qlen: wait: type: Disk/Part
'03-22-14 10.03.00' 1899.896 1593.513 873 1188 4885 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 4885 ms
'03-22-14 10.03.05' 1292.842 1524.049 704 1268 2217 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 2217 ms
'03-22-14 10.03.10' 1078.232 960.563 509 988 1827 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 1827 ms
'03-22-14 10.03.15' 1495.491 1655.267 787 534 780 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 780 ms
'03-22-14 10.03.20' 1174.470 1135.348 577 418 779 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 779 ms
'03-22-14 10.03.25' 69.639 5658.866 1432 801 448 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 448 ms
'03-22-14 10.03.30' 31478.585 475205.568 126671 1831 0 SWAP dm-1
'03-22-14 10.03.35' 1435.188 2016.783 862 2002 2494 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 2494 ms
'03-22-14 10.03.40' 1288.316 1649.205 734 1111 1282 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 1282 ms
'03-22-14 10.03.45' 523.714 364.601 222 701 3597 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 3597 ms
'03-22-14 10.03.50' 836.427 1697.619 633 601 1001 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 1001 ms
'03-22-14 10.03.55' 213.961 17409.152 4405 225 101 SWAP dm-1 : Disk spent too much time waiting for I/O (> 100 msecs) : 101 ms

- The swap device is very busy: at 10.03.30 we see 126671 IOs, and at 10.03.00 the service time was 4.8 seconds!
- At 10.03.30 about 475 MByte of swap was written - this is really not an acceptable value
- CHM reports for almost all snaps: Disk spent too much time waiting for I/O (> 100 msecs)
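The same pre-filtering idea works for the disk view. A minimal sketch against the print_disk.sh output lines shown above (assuming $6 = qlen and $7 = wait in ms, as in that layout) that prints only the snapshots where the swap device waited longer than 100 ms for I/O:

% ~/print_disk.sh grac41_CHMOS dm-1 pcpus dm-1 | awk '/SWAP/ && $7 > 100 { print $1, $2, "qlen=" $6, "wait=" $7 " ms" }'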
Let's now investigate our top CPU consumers.

WORKING SCENARIO:
% ~/print_top.sh grac41_CHMOS topcpu: | egrep 'topcpu|03-22-14 09.55'
-> Search Str 1 : topcpu:
topcpu_processs CPU_usage topprivmem_process PRIVMEM topshmmem_process SHMMEM topfd_processs FD_usage topthread_process THERAD_NO
03-22-14 09.55.00 java(12693) 17.60 java(2566) 233508 ologgerd(5661) 64536 java(2566) 357 console-kit-dae(4484) 64
03-22-14 09.55.05 java(12693) 15.59 java(2566) 233508 oraclegrac41(12762) 65056 java(2566) 357 console-kit-dae(4484) 64
03-22-14 09.55.10 ocssd.bin(5283) 5.99 java(2566) 233508 oraclegrac41(12762) 66112 java(2566) 357 console-kit-dae(4484) 64
03-22-14 09.55.15 mp_stress(14181) 180.80 java(2566) 233508 oraclegrac41(12762) 66112 java(2566) 357 console-kit-dae(4484) 64
03-22-14 09.55.20 mp_stress(14181) 179.19 java(2566) 233520 oraclegrac41(12762) 66112 java(2566) 363 console-kit-dae(4484) 64
03-22-14 09.55.25 mp_stress(14181) 177.91 java(2566) 233508 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.30 mp_stress(14181) 179.90 mp_stress(14181) 273036 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.35 mp_stress(14181) 178.80 mp_stress(14181) 331544 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.40 mp_stress(14181) 178.58 mp_stress(14181) 390128 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.45 mp_stress(14181) 174.20 mp_stress(14181) 448892 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.50 mp_stress(14181) 183.35 mp_stress(14181) 507468 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64
03-22-14 09.55.55 mp_stress(14181) 179.77 mp_stress(14181) 566240 oraclegrac41(12762) 66112 java(2566) 359 console-kit-dae(4484) 64

- Since 09.55.15 the mp_stress program is eating up our CPU!
- mp_stress remains our top CPU consumer for the remaining 10 snapshots
- Not a good situation!
PROBLEM SCENARIO:
% ~/print_top.sh grac41_CHMOS topcpu: | egrep 'topcpu|03-22-14 10.03'
-> Search Str 1 : topcpu:
topcpu_processs CPU_usage topprivmem_process PRIVMEM topshmmem_process SHMMEM topfd_processs FD_usage topthread_process THERAD_NO
03-22-14 10.03.00 mp_stress(16867) 166.39 mp_stress(16867) 3072956 ologgerd(5661) 64536 java(17355) 376 console-kit-dae(4484) 64
03-22-14 10.03.05 asm_vktm_+ASM1(7540) 0.95 mp_stress(16867) 3069048 ologgerd(5661) 64536 java(17355) 376 console-kit-dae(4484) 64
03-22-14 10.03.10 ora_vktm_grac41(27027) 4.80 mp_stress(16867) 3064628 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.15 ora_vktm_grac41(27027) 4.59 mp_stress(16867) 3060724 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.20 asm_vktm_+ASM1(7540) 5.39 mp_stress(16867) 3055136 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.25 asm_vktm_+ASM1(7540) 5.39 mp_stress(16867) 3055136 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.30 asm_vktm_+ASM1(7540) 5.39 mp_stress(16867) 3055136 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.35 mp_stress(16867) 157.25 mp_stress(16867) 3126540 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.40 mp_stress(16867) 5.14 mp_stress(16867) 3139152 ologgerd(5661) 64536 java(2566) 373 console-kit-dae(4484) 64
03-22-14 10.03.45 asm_vktm_+ASM1(7540) 5.39 mp_stress(16867) 3134516 ologgerd(5661) 64536 java(17527) 378 console-kit-dae(4484) 64
03-22-14 10.03.50 ora_vktm_grac41(27027) 4.79 mp_stress(16867) 3132116 ologgerd(5661) 64536 java(2566) 377 console-kit-dae(4484) 64
03-22-14 10.03.55 ora_vktm_grac41(27027) 5.19 mp_stress(16867) 3128736 ologgerd(5661) 64536 java(2566) 377 console-kit-dae(4484) 64

- After 10.03.00, mp_stress shows up as the top CPU consumer only two more times (10.03.35 and 10.03.40)
- mp_stress now holds about 3.12 GByte of memory
- This system is ready to freeze!
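To see the hand-over of the top CPU slot at a glance, it can help to cut the timestamp and the first process/usage pair out of the print_top.sh output. A minimal sketch assuming the column layout shown above ($3 = top CPU process, $4 = its CPU usage):

% ~/print_top.sh grac41_CHMOS topcpu: | awk '/^03-22-14/ { print $1, $2, $3, $4 "%" }'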
Process details for the mp_stress program:

WORKING SCENARIO:
% ~/print_proc.sh grac41_CHMOS mp_stress | egrep 'name:|03-22-14 09.55'
name: pid: #procfdlimit: cpuusage: privmem: shm: #fd: #threads: priority: nice:
'03-22-14 09.55.10' 'mp_stress' 14181 1024 0.0 38740 556 3 5 -51 0
'03-22-14 09.55.15' 'mp_stress' 14181 1024 180.8 97248 556 3 5 -51 0 CPU usage is abnormally high (> 90): 180.8
'03-22-14 09.55.20' 'mp_stress' 14181 1024 179.19 156216 556 3 5 -51 0 CPU usage is abnormally high (> 90): 179.19
'03-22-14 09.55.25' 'mp_stress' 14181 1024 177.91 214264 556 3 5 -51 0 CPU usage is abnormally high (> 90): 177.91
'03-22-14 09.55.30' 'mp_stress' 14181 1024 179.90 273036 556 3 5 -51 0 CPU usage is abnormally high (> 90): 179.90
'03-22-14 09.55.35' 'mp_stress' 14181 1024 178.80 331544 556 3 5 -51 0 CPU usage is abnormally high (> 90): 178.80
'03-22-14 09.55.40' 'mp_stress' 14181 1024 178.58 390128 556 3 5 -51 0 CPU usage is abnormally high (> 90): 178.58
'03-22-14 09.55.45' 'mp_stress' 14181 1024 174.20 448892 556 3 5 -51 0 CPU usage is abnormally high (> 90): 174.20
'03-22-14 09.55.50' 'mp_stress' 14181 1024 183.35 507468 556 3 5 -51 0 CPU usage is abnormally high (> 90): 183.35
'03-22-14 09.55.55' 'mp_stress' 14181 1024 179.77 566240 556 3 5 -51 0 CPU usage is abnormally high (> 90): 179.77

- From 09.55.15 on mp_stress takes a lot of CPU (about 180% of the maximum of 200% on this 2-CPU system)
- Memory usage increases from 38 MByte to 566 MByte by the end of our monitor window
- The mp_stress program runs at a real-time priority of -51, which is quite high
- 5 threads (== 4 worker threads) will lead to heavy CPU contention, as we have 2 RT threads per CPU

PROBLEM SCENARIO:
% ~/print_proc.sh grac41_CHMOS mp_stress | egrep 'name:|03-22-14 10.03'
name: pid: #procfdlimit: cpuusage: privmem: shm: #fd: #threads: priority: nice:
'03-22-14 10.03.00' 'mp_stress' 16867 1024 166.39 3072956 556 3 5 -51 0 CPU usage is abnormally high (> 90): 166.39
'03-22-14 10.03.05' 'mp_stress' 16867 1024 0.15 3069048 556 3 5 -51 0
'03-22-14 10.03.10' 'mp_stress' 16867 1024 0.0 3064628 556 3 5 -51 0
'03-22-14 10.03.15' 'mp_stress' 16867 1024 0.0 3060724 556 3 5 -51 0
'03-22-14 10.03.20' 'mp_stress' 16867 1024 0.0 3055136 556 3 5 -51 0
'03-22-14 10.03.25' 'mp_stress' 16867 1024 0.0 3055136 556 3 5 -51 0
'03-22-14 10.03.30' 'mp_stress' 16867 1024 0.0 3055136 556 3 5 -51 0
'03-22-14 10.03.35' 'mp_stress' 16867 1024 157.25 3126540 556 3 5 -51 0 CPU usage is abnormally high (> 90): 157.25
'03-22-14 10.03.40' 'mp_stress' 16867 1024 5.14 3139152 556 3 5 -51 0
'03-22-14 10.03.45' 'mp_stress' 16867 1024 0.19 3134516 556 3 5 -51 0
'03-22-14 10.03.50' 'mp_stress' 16867 1024 0.0 3132116 556 3 5 -51 0
'03-22-14 10.03.55' 'mp_stress' 16867 1024 0.0 3128736 556 3 5 -51 0

- Memory usage has now increased to about 3 GByte (e.g. 3132116 KByte at 10.03.50)!
- Now only 2 of the 12 snapshots report: CPU usage is abnormally high
- mp_stress does not get much CPU in most snapshots, very likely because the system is struggling with swapping after its memory grew from 38 MByte to 3 GByte in less than 8 minutes!
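The CHM priority of -51 reported for mp_stress points to real-time scheduling, while LMS0 below is reported with -2. On a live Linux system the scheduling class and real-time priority can be cross-checked with standard OS tools; a minimal sketch (procps ps and util-linux chrt assumed, PID taken from the snapshots above):

% ps -eo pid,cls,rtprio,ni,pri,comm | egrep 'mp_stress|lms0'
% chrt -p 14181     # scheduling policy and priority of the mp_stress PID from the working window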
Process details for the LMS0 process:

As the RAC alert.log tells us, the LMS0 process is in trouble:
  IPC Send timeout detected. Sender: ospid 9116 [oracle@grac43.example.com (LMS0)]

Let's monitor the LMS0 process too.

WORKING SCENARIO:
% ~/print_proc.sh grac41_CHMOS ora_lms0 | egrep 'name:|03-22-14 09.55'
name: pid: #procfdlimit: cpuusage: privmem: shm: #fd: #threads: priority: nice:
'03-22-14 09.55.00' 'ora_lms0_grac41' 27131 65536 13.80 9984 47660 19 1 -2 0
'03-22-14 09.55.05' 'ora_lms0_grac41' 27131 65536 12.59 9984 49768 19 1 -2 0
'03-22-14 09.55.10' 'ora_lms0_grac41' 27131 65536 4.59 9988 50504 19 1 -2 0
'03-22-14 09.55.15' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.20' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.25' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.30' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.35' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.40' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.45' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.50' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0
'03-22-14 09.55.55' 'ora_lms0_grac41' 27131 65536 0.0 9988 50504 19 1 -2 0

--> Only until 09.55.15 does the LMS0 process get any CPU

PROBLEM SCENARIO:
% ~/print_proc.sh grac41_CHMOS ora_lms0 | egrep 'name:|03-22-14 10.03'
name: pid: #procfdlimit: cpuusage: privmem: shm: #fd: #threads: priority: nice:
'03-22-14 10.03.00' 'ora_lms0_grac41' 27131 65536 0.25 3556 13040 19 1 -2 0
'03-22-14 10.03.05' 'ora_lms0_grac41' 27131 65536 0.0 3700 12820 19 1 -2 0
'03-22-14 10.03.10' 'ora_lms0_grac41' 27131 65536 0.0 3664 12720 19 1 -2 0
'03-22-14 10.03.15' 'ora_lms0_grac41' 27131 65536 0.0 3664 12652 19 1 -2 0
'03-22-14 10.03.20' 'ora_lms0_grac41' 27131 65536 0.0 3664 12640 19 1 -2 0
'03-22-14 10.03.25' 'ora_lms0_grac41' 27131 65536 0.0 3664 12640 19 1 -2 0
'03-22-14 10.03.30' 'ora_lms0_grac41' 27131 65536 0.0 3664 12640 19 1 -2 0
'03-22-14 10.03.35' 'ora_lms0_grac41' 27131 65536 0.0 3752 12120 19 1 -2 0
'03-22-14 10.03.40' 'ora_lms0_grac41' 27131 65536 0.0 3752 12112 19 1 -2 0
'03-22-14 10.03.45' 'ora_lms0_grac41' 27131 65536 0.0 3752 12104 19 1 -2 0
'03-22-14 10.03.50' 'ora_lms0_grac41' 27131 65536 0.0 3920 12072 19 1 -2 0
'03-22-14 10.03.55' 'ora_lms0_grac41' 27131 65536 0.0 3916 12000 19 1 -2 0

- ora_lms0_grac41 does not get any CPU in most of the snapshots
- ora_lms0_grac41 is not scheduled onto a CPU (cpuusage = 0.0), as it runs with a priority of -2 whereas mp_stress runs at priority -51
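To line these CHM windows up with the database side, the RAC alert.logs can be searched for the IPC send timeout messages quoted above. A minimal sketch (the diagnostic path and database name grac4 are illustrative; adjust them to your own diagnostic_dest and instance names):

% grep -n "IPC Send timeout" $ORACLE_BASE/diag/rdbms/grac4/grac43/trace/alert_grac43.log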
Root cause
- CPU starvation and heavy paging/swapping were the root cause of the IPC send timeout problem, which led to the eviction of instance 1
- Of course, the bad I/O distribution also contributed to the above instance eviction