Log File Sync wait event due to CPU starvation based on Linux RT processes sharing the same CPU

Table of Contents

Test system

8 CPUs OEL 6.4 Oracle 11.2.0.4 RAC

Compare the 2 sessions above running on the same hardware and same CPU load

Session 1: 
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
log file sync                         5,007 127.      23   93.9 Commit
DB CPU                                       6.6            5.3
gc current block busy                    34    1      29     .8 Cluster
db file sequential read                  27   .2       9     .2 User I/O
gc current block 2-way                  166   .2       1     .1 Cluster
db file scattered read                   19   .1       6     .1 User I/O
gc buffer busy release                   15   .1       6     .1 Cluster
gc current block 3-way                   69   .1       1     .0 Cluster
SQL*Net message to client            10,011    0       0     .0 Network
control file sequential read            112    0       0     .0 System I/O

Session2: 
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
log file sync                         5,008 41.1       8   81.2 Commit
DB CPU                                       6.2           12.2
reliable message                          1    1    1031    2.0 Other
gc current block busy                    52   .9      17    1.8 Cluster
gc buffer busy release                   41   .4      10     .8 Cluster
control file sequential read            112   .1       1     .3 System I/O
db file sequential read                  15   .1       8     .2 User I/O
db file scattered read                   11   .1       7     .1 User I/O
gc cr block busy                          4   .1      14     .1 Cluster
SQL*Net message to client            10,011    0       0     .1 Network

AWR reports tells us we are running the same number of COMMITs ( 5000 ) but session2 is about 3x faster than session1.
Note also the overall CPU load is the same for both tests – in Session1 we only run a CPU intensive process on the same CPU as our lgwr process whereas in Session2 the CPU intensive process is running on a different CPU.

Running both sq load process and lgwr on CPU 5

Compile and run sq.c 
#include <math.h>
 double sqrt();
 main ()
 {
 int i; double z;
 printf("PID: %d\n", getpid());
 for (;;)
 {
 for (i = 0; i < 1000000; i++)
 {
 z = sqrt (i);
 }
 poll (0, 0, 1);
 }
 }
$ cc   sq.c   -o sq -lm

Attach sq process to CPU 5 and increase priority

$  taskset -c 5 sq
 PID: 12657
# renice -15 12657

Now reschedule the LGWR to CPU 5 and monitor both process with top
 $ ps -elf | egrep 'NI|ora_lgwr' |  grep -v grep
 F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
 0 S oracle    6181     1  0  80   0 - 394456 semtim 18:31 ?       00:00:01 ora_lgwr_grac41

$ taskset -pc 5 6181
 pid 6181's current affinity list: 0-7
 pid 6181's new affinity list: 5

Run the JDBC test case
 $ java UCPDemo_si 1 5000 -noseq
 Started at:  Fri Nov 15 19:12:43 CET 2013 - Number of Threads: 1 - tx_cnt:5000 - mode: 2 Debug mode: false
 -> SQL: insert into rac_perftest values (?, ?, ?, sysdate)
 Connected at: 19:12:45 to instance: grac41 at host: grac41.example.com
 Released  connection to pool - Instance: grac41 - Records inserted: 5000
 Runtime : 130 [s]


Monitor both processes using top
# top -H -p 12657,6181 ( press f and j to add CPU affinity flag P )
 Cpu(s): 15.6%us,  6.3%sy,  0.0%ni, 77.2%id,  0.4%wa,  0.0%hi,  0.5%si,  0.0%st
 PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 12657 oracle     5 -15  6544  308  244 R 80.0  0.0   8:47.77 5 sq
 6181  oracle    20   0 1540m  32m  28m D  7.6  0.8   0:10.13 5 oracle
 --> P = 5 means both processes are running on CPU 5
     lgwr only gets around 7 % of CPU
     There is  a lot of CPU available ( 77% ) but lgwr still runs into CPU starvation
     sq process takes about 80% CPUtime from CPU 5 and runs with a higher priority as lgwr process

Running both sq load process and lgwr on different CPUs

Reschedule lgwr process from CPU 5 to CPU 7 
$ taskset -pc 7 6181 
 pid 6181's current affinity list: 5 
 pid 6181's new affinity list: 7 

Run test case:
$ java UCPDemo_si 1 5000 -noseq
 Started at:  Fri Nov 15 19:23:17 CET 2013 - Number of Threads: 1 - tx_cnt:5000 - mode: 2 Debug mode: false
 -> SQL: insert into rac_perftest values (?, ?, ?, sysdate)
 Connected at: 19:23:18 to instance: grac41 at host: grac41.example.com
 Released  connection to pool - Instance: grac41 - Records inserted: 5000
 Runtime : 45 [s]

Releated Top Output
 Cpu(s): 19.0%us,  9.0%sy,  0.0%ni, 70.6%id,  0.0%wa,  0.0%hi,  1.5%si,  0.0%st
 PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 12657 oracle     5 -15  6544  308  244 S 81.5  0.0  17:50.60 5 sq
 6181  oracle    20   0 1540m  32m  28m D 12.0  0.8   0:26.33 7 oracle
 --> sq process and lgwr runs on different CPUs - lgwr process now takes 12% CPU
     sq process still takes about 81 % of CPU 7 but lgwr process now running on CPU 5 get enough CPU time 
     Application runtime drops from 130 s to 45 s without changing anything

Summary:

Always check that lgwr process gets enough CPU time – total CPU shown in top output is not always the limiting factor
RT processes or lot of high priority process together with low number of CPUs can worthen the problem
First decrease priority for NON-Oracle RT processes ( resetting the priority of our sq load process fixes this problem : $ renice 0 12657 )
If this doesn’t help try to renice LGWR oracle shadow processes ( increase priority )

Reference

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

One thought on “Log File Sync wait event due to CPU starvation based on Linux RT processes sharing the same CPU”

Intel Bail Tray News says:

April 9, 2014 at 9:21 pm

Attractive section of content. I just stumbled upon your blog
and in accession capital to assert that I acquire in fact enjoyed account your blog posts.
Any way I’ll be subscribing to your feeds and even I achievement you
access consistently fast.

Test system

Compare the 2 sessions above running on the same hardware and same CPU load

Running both sq load process and lgwr on CPU 5

Running both sq load process and lgwr on different CPUs

Reference

One thought on “Log File Sync wait event due to CPU starvation based on Linux RT processes sharing the same CPU”

Leave a Reply Cancel reply