ASH LFS BACKGROUND PROCESS WAITS DURING WORST MINUTE ( from lfsdiag.sql Output )
APPROACH: What is LGWR doing when 'log file sync' waits are happening? LMS info may be relevant for broadcast-on-commit and LNS data may be relevant for Data Guard. If more details are needed, see the ASH DETAILS FOR WORST MINUTES section at the bottom of the report.

MINUTE       INST PROGRAM                           EVENT                    TOTAL_WAIT_TIME  WAITS AVG_TIME_WAITED
------------ ---- --------------------------------- ------------------------ --------------- ------ ---------------
Dec02_1728      1 JDBC Thin Client                  log file sync                   2945.003     82          35.915
Dec02_1728      1 oracle@grac41.example.com (LGWR)  log file parallel write             .040      1            .040
Dec02_1728      1 oracle@grac41.example.com (LGWR)                                      .000     43            .000
Dec02_1728      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1729      1 JDBC Thin Client                  log file sync                   3819.957    105          36.381
Dec02_1729      1 oracle@grac41.example.com (LGWR)  log file parallel write             .108      2            .054
Dec02_1729      1 oracle@grac41.example.com (LGWR)                                      .000     54            .000
Dec02_1729      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1729      1 oracle@grac41.example.com (LMS1)                                      .000      3            .000
Dec02_1731      1 JDBC Thin Client                  log file sync                    995.457     19          52.392
Dec02_1731      1 oracle@grac41.example.com (LGWR)                                      .000     10            .000
Dec02_1731      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1732      1 JDBC Thin Client                  log file sync                   3429.242     97          35.353
Dec02_1732      1 oracle@grac41.example.com (LGWR)  log file parallel write           28.543                 28.543

--> In minute 17:29 the JDBC clients wait 105 times, with an average wait time of about 36 ms, on 'log file sync'.
    In minute 17:32 LGWR itself waits about 28 ms for a single redo write I/O ('log file parallel write').
- Redo log I/O does not look healthy; check your redo log write I/O performance
- The JDBC clients are all contending for LGWR's redo log writes
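The per-minute breakdown above comes from lfsdiag.sql. If you only want this part of the picture, a minimal ASH query sketch along the following lines gives a similar breakdown; the one-hour window, the 'log file%'/LGWR/LMS filter and the column aliases are assumptions for this example, and the counts are ASH samples rather than exact wait counts:

  -- Approximate per-minute breakdown of log-file-related activity from ASH.
  -- TIME_WAITED is in microseconds and is only filled in on the last sample
  -- of a wait, so the figures are approximations.
  select to_char(sample_time, 'MonDD_HH24MI')      as sample_minute,
         inst_id                                   as inst,
         program,
         event,
         round(sum(time_waited) / 1000, 3)         as total_wait_time_ms,
         count(*)                                  as ash_samples,
         round(avg(time_waited) / 1000, 3)         as avg_time_waited_ms
  from   gv$active_session_history
  where  sample_time > sysdate - 1/24              -- last hour, adjust as needed
  and    (event like 'log file%'
          or program like '%(LGWR)%'
          or program like '%(LMS%')
  group  by to_char(sample_time, 'MonDD_HH24MI'), inst_id, program, event
  order  by sample_minute, total_wait_time_ms desc;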
HISTOGRAM DATA FOR LFS AND OTHER RELATED WAITS ( from lfsdiag.sql Output )
APPROACH: Look at the wait distribution for 'log file sync' waits by looking at "wait_time_milli". Look at the high wait times, then see if you can correlate those with other related wait events.

   INST_ID EVENT                                    WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
         1 log file sync                                          1        427
         1 log file sync                                          2      50841
         1 log file sync                                          4      69587
         1 log file sync                                          8      13165
         1 log file sync                                         16       4267
         1 log file sync                                         32       2811
         1 log file sync                                         64       1278
         1 log file sync                                        128        602
         1 log file sync                                        256        187
         1 log file sync                                        512         39
         1 log file sync                                       1024          6
         1 log file sync                                       2048          1
         1 log file parallel write                               1     112200
         1 log file parallel write                               2        130
         1 log file parallel write                               4         87
         1 log file parallel write                               8         75
         1 log file parallel write                              16         48
         1 log file parallel write                              32         15
         1 log file parallel write                              64         16
         1 log file parallel write                             128          6
         1 log file parallel write                             256          3
         1 gcs log flush sync                                    2       1204
         1 gcs log flush sync                                    4        182
         1 gcs log flush sync                                    8         56
         1 gcs log flush sync                                   16         30
         1 gcs log flush sync                                   32         25
         1 gcs log flush sync                                   64         15
         1 gcs log flush sync                                  128          8
         1 gcs log flush sync                                  256          3

--> 'gcs log flush sync' and 'log file sync' heavily trigger LGWR to write redo data to disk.
    The wait distribution does not look healthy: thousands of 'log file sync' waits exceed 8 ms.
- 'gcs log flush sync' times are not acceptable; a noticeable share of these waits exceed 8 ms
- Note: only one process per instance (LGWR) writes to the redo logs; all other sessions wait on the 'log file sync' event for that write to complete
- Excessive COMMITs and dirty GCS buffers can stress your redo log performance
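The histogram columns above match GV$EVENT_HISTOGRAM, so a minimal query sketch like the following returns the same kind of distribution; the event list is an assumption for this example, and the counters are cumulative since instance startup:

  -- Cumulative wait-time histogram for the log-file-related events shown above.
  select inst_id, event, wait_time_milli, wait_count
  from   gv$event_histogram
  where  event in ('log file sync', 'log file parallel write', 'gcs log flush sync')
  order  by inst_id, event, wait_time_milli;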
AWR LFS AND OTHER RELATED WAITS FOR WORST LFS AWRs: ( from lfsdiag.sql Output )
APPROACH: These are the AWR snapshots where the average 'log file sync' times were the highest. Look at related waits at those times.

   SNAP_ID INST NAME                                     TOTAL_WAITS   TOTAL_WAIT_TIME   AVG_TIME_WAITED
---------- ---- ---------------------------------------- ----------- ----------------- -----------------
       303    3 log file sync                                      7          2003.532           286.219
       303    1 log file sync                                     18          3072.108           170.673
       303    2 log file sync                                      9           271.443            30.160
       303    1 gcs log flush sync                              1956         18537.067             9.477
       303    3 gcs log flush sync                              2460         18970.100             7.711
       303    2 gcs log flush sync                                93           498.418             5.359
       303    1 LGWR wait for redo copy                            5             7.147             1.429
       303    3 log file parallel write                          278           193.133              .695
       303    2 gc current block 2-way                          3026          1789.197              .591
       303    1 gc cr grant 2-way                               2092          1192.904              .570
       303    1 gc current block 2-way                          1730           936.028              .541
       303    3 gc current block 2-way                          1949           998.319              .512
       303    3 gc cr grant 2-way                                695           305.589              .440
       303    2 gc cr grant 2-way                                 46            19.798              .430
       303    1 log file parallel write                          365            34.946              .096
       303    3 LGWR wait for redo copy                            1              .037              .037
       303    2 log file parallel write                          146             4.058              .028

--> 'gcs log flush sync' is the top related wait event in this snapshot. This event triggers LGWR to write the redo log ('log file parallel write').
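A minimal AWR query sketch that returns comparable per-snapshot numbers; DBA_HIST_SYSTEM_EVENT is a standard AWR view (querying it requires a Diagnostics Pack license), but the event list and the delta calculation here are assumptions of this example, and lfsdiag.sql has its own logic for picking the worst snapshots:

  -- Per-snapshot waits for log-file-related events; DBA_HIST_SYSTEM_EVENT
  -- stores cumulative values, so deltas against the previous snapshot are taken.
  select snap_id,
         instance_number                               as inst,
         event_name,
         total_waits
           - lag(total_waits) over (partition by dbid, instance_number, event_id
                                    order by snap_id)  as waits,
         round((time_waited_micro
           - lag(time_waited_micro) over (partition by dbid, instance_number, event_id
                                          order by snap_id)) / 1000, 3) as total_wait_time_ms
  from   dba_hist_system_event
  where  event_name in ('log file sync', 'log file parallel write',
                        'gcs log flush sync', 'LGWR wait for redo copy')
  order  by snap_id, inst, total_wait_time_ms desc nulls last;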
Session details ( from lfsdiag.sql Output )
APPROACH: If you cannot determine the problem from the data above, you may need to look at the details of what each session is doing during each 'bad' snap. Most likely you will want to note the times of the high 'log file sync' waits, look at what LGWR is doing at those times, and go from there...

SAMPLE_TIME                INST SESSION_ID PROGRAM                            EVENT                     TIME_WAITED  P1 / P2 / P3
------------------------- ----- ---------- ---------------------------------- ------------------------- -----------  --------------------------------------
02-DEC-13 05.13.09.680 PM     1          5 JDBC Thin Client                   log file sync                   9.284  buffer#: 1456  sync scn: 20387728  : 0
02-DEC-13 05.13.09.680 PM     1         99 JDBC Thin Client                   log file sync                   6.774  buffer#: 1452  sync scn: 20387725  : 0
02-DEC-13 05.13.09.680 PM     1        193 JDBC Thin Client                   log file sync                   7.175  buffer#: 1452  sync scn: 20387726  : 0
02-DEC-13 05.13.10.680 PM     1          5 JDBC Thin Client                   log file sync                   4.070  buffer#: 1588  sync scn: 20388120  : 0
02-DEC-13 05.13.10.680 PM     1         96 oracle@grac41.example.com (LGWR)   log file parallel write          .905  files: 1  blocks: 1  requests: 1
02-DEC-13 05.13.10.680 PM     1        193 JDBC Thin Client                   log file sync                   4.149  buffer#: 1589  sync scn: 20388122
..
02-DEC-13 05.13.12.780 PM     1          5 JDBC Thin Client                   log file sync                 282.466  buffer#: 2523  sync scn: 20389585  : 0
02-DEC-13 05.13.12.780 PM     1         99 JDBC Thin Client                   log file sync                 281.079  buffer#: 2523  sync scn: 20389586  : 0
02-DEC-13 05.13.12.780 PM     1        193 JDBC Thin Client                   log file sync                 288.317  buffer#: 2520  sync scn: 20389583  : 0

--> Even while LGWR is writing redo records on behalf of one of the JDBC clients, at least four other clients are waiting on LGWR. This wait time is accounted to the 'log file sync' wait event.
--> The samples at 05.13.12 show very high 'log file sync' wait times (about 280 ms). Again, either LGWR does not get enough CPU or the I/O bandwidth is too low.
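A minimal ASH query sketch that returns the same kind of session-level detail, including the wait parameters; the sample-time window (17:13 on 02-DEC-13) and the event filter are assumptions chosen to match the excerpt above:

  -- Raw ASH samples around the problem window, including the wait parameters.
  -- For 'log file sync', P1 is the log buffer block (buffer#) and P2 the sync SCN;
  -- for 'log file parallel write', P1/P2/P3 are files, blocks and requests.
  select sample_time,
         inst_id                          as inst,
         session_id,
         program,
         event,
         round(time_waited / 1000, 3)     as time_waited_ms,
         p1text, p1, p2text, p2, p3text, p3
  from   gv$active_session_history
  where  sample_time between timestamp '2013-12-02 17:13:00'
                         and timestamp '2013-12-02 17:14:00'
  and    (event like 'log file%' or program like '%(LGWR)%')
  order  by sample_time, inst_id, session_id;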
Check your iostat output during heavy load timeframes
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.68    0.00   29.55    0.72    0.00   62.04

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdc               1.20         0.00        14.34          0         72
sda              18.92         4.78       270.92         24       1360
sde               1.39        19.12        12.75         96         64
sdh               3.39         1.99         2.59         10         13
sdb               3.19        63.75        14.34        320         72
sdg               3.39         1.99         2.59         10         13
sdd               0.60         3.19         1.59         16          8
sdf               3.39         1.99         2.59         10         13
sdi             481.67         0.00       579.08          0       2907
sdj             513.55         0.00       620.12          0       3113
sdk             311.16         0.00       381.67          0       1916
dm-0             22.11         4.78       270.92         24       1360

--> Even though sdi, sdj and sdk show an acceptable redo log write performance, all of these disks are emulated via VirtualBox and share a single physical SSD that delivers about 2000 tps. ASM spreads the I/O across all available disks, but the number of physical disks is the limiting factor.
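To cross-check the OS-level iostat numbers from inside the database, a minimal sketch based on the cumulative redo write statistics in GV$SYSSTAT can be used; the average is computed since instance startup, so compare values taken before and after the heavy-load window:

  -- Average redo write latency per instance since startup.
  -- 'redo write time' is recorded in units of 10 ms (centiseconds).
  select inst_id,
         max(decode(name, 'redo writes', value))                        as redo_writes,
         max(decode(name, 'redo write time', value))                    as redo_write_time_cs,
         round(max(decode(name, 'redo write time', value)) * 10
               / nullif(max(decode(name, 'redo writes', value)), 0), 3) as avg_redo_write_ms
  from   gv$sysstat
  where  name in ('redo writes', 'redo write time')
  group  by inst_id
  order  by inst_id;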
Summary
- Check that the I/O path for the redo logs is not used by other components (only LGWR should read from and write to those disks); a query sketch to display the current redo log layout follows this list
- Move the redo logs to SSD disks and provide a dedicated redo log I/O path for each instance
- Check the LGWR trace files for warnings like: Warning: log write elapsed time 628ms, size 1KB
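A minimal query sketch (standard V$LOG / V$LOGFILE views, no lfsdiag.sql involvement) to display the current redo log layout, so you can verify which disk groups or devices the redo log members actually live on:

  -- Redo log groups, sizes and member paths per thread; verify that the
  -- members sit on the dedicated devices / disk groups reserved for redo.
  select l.thread#,
         l.group#,
         l.bytes / 1024 / 1024   as size_mb,
         l.members,
         l.status,
         f.member
  from   v$log     l
  join   v$logfile f on f.group# = l.group#
  order  by l.thread#, l.group#, f.member;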
Reference
- Intermittent Long ‘log file sync’ Waits, LGWR Posting Long Write Times, I/O Portion of Wait Minimal (Doc ID 1278149.1)
- Bug:10318123 – lgwr regularly stalls for 3 seconds at a time
- Event 10928