ASH LFS BACKGROUND PROCESS WAITS DURING WORST MINUTE ( from lfsdiag.sql Output )
APPROACH: What is LGWR doing when 'log file sync' waits are happening? LMS info may be relevant for broadcast-on-commit and LNS data may be relevant for Data Guard. If more details are needed, see the ASH DETAILS FOR WORST MINUTES section at the bottom of the report.

MINUTE       INST PROGRAM                           EVENT                    TOTAL_WAIT_TIME  WAITS AVG_TIME_WAITED
------------ ---- --------------------------------- ------------------------ --------------- ------ ---------------
Dec02_1728      1 JDBC Thin Client                  log file sync                   2945.003     82          35.915
Dec02_1728      1 oracle@grac41.example.com (LGWR)  log file parallel write             .040      1            .040
Dec02_1728      1 oracle@grac41.example.com (LGWR)                                      .000     43            .000
Dec02_1728      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1729      1 JDBC Thin Client                  log file sync                   3819.957    105          36.381
Dec02_1729      1 oracle@grac41.example.com (LGWR)  log file parallel write             .108      2            .054
Dec02_1729      1 oracle@grac41.example.com (LGWR)                                      .000     54            .000
Dec02_1729      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1729      1 oracle@grac41.example.com (LMS1)                                      .000      3            .000
Dec02_1731      1 JDBC Thin Client                  log file sync                    995.457     19          52.392
Dec02_1731      1 oracle@grac41.example.com (LGWR)                                      .000     10            .000
Dec02_1731      1 oracle@grac41.example.com (LMS0)                                      .000      2            .000
Dec02_1732      1 JDBC Thin Client                  log file sync                   3429.242     97          35.353
Dec02_1732      1 oracle@grac41.example.com (LGWR)  log file parallel write           28.543                 28.543

--> In minute 17:29 the JDBC clients wait 105 times, with an average wait time of about 36 ms, on 'log file sync'.
    In minute 17:32 LGWR itself waits about 28 ms for a single redo write I/O ('log file parallel write').
- Redo log I/O does not look healthy; check your redo log write I/O performance
- The JDBC clients are all contending for LGWR's redo log writes
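The per-minute breakdown above comes from lfsdiag.sql. If you only want this part of the picture, a minimal ASH query sketch along the following lines gives a similar breakdown; the one-hour window, the 'log file%'/LGWR/LMS filter and the column aliases are assumptions for this example, and the counts are ASH samples rather than exact wait counts:

  -- Approximate per-minute breakdown of log-file-related activity from ASH.
  -- TIME_WAITED is in microseconds and is only filled in on the last sample
  -- of a wait, so the figures are approximations.
  select to_char(sample_time, 'MonDD_HH24MI')      as sample_minute,
         inst_id                                   as inst,
         program,
         event,
         round(sum(time_waited) / 1000, 3)         as total_wait_time_ms,
         count(*)                                  as ash_samples,
         round(avg(time_waited) / 1000, 3)         as avg_time_waited_ms
  from   gv$active_session_history
  where  sample_time > sysdate - 1/24              -- last hour, adjust as needed
  and    (event like 'log file%'
          or program like '%(LGWR)%'
          or program like '%(LMS%')
  group  by to_char(sample_time, 'MonDD_HH24MI'), inst_id, program, event
  order  by sample_minute, total_wait_time_ms desc;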
HISTOGRAM DATA FOR LFS AND OTHER RELATED WAITS ( from lfsdiag.sql Output )
APPROACH: Look at the wait distribution for 'log file sync' waits by looking at "wait_time_milli". Look at the high wait times, then see if you can correlate those with other related wait events.

   INST_ID EVENT                                    WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
         1 log file sync                                          1        427
         1 log file sync                                          2      50841
         1 log file sync                                          4      69587
         1 log file sync                                          8      13165
         1 log file sync                                         16       4267
         1 log file sync                                         32       2811
         1 log file sync                                         64       1278
         1 log file sync                                        128        602
         1 log file sync                                        256        187
         1 log file sync                                        512         39
         1 log file sync                                       1024          6
         1 log file sync                                       2048          1
         1 log file parallel write                               1     112200
         1 log file parallel write                               2        130
         1 log file parallel write                               4         87
         1 log file parallel write                               8         75
         1 log file parallel write                              16         48
         1 log file parallel write                              32         15
         1 log file parallel write                              64         16
         1 log file parallel write                             128          6
         1 log file parallel write                             256          3
         1 gcs log flush sync                                    2       1204
         1 gcs log flush sync                                    4        182
         1 gcs log flush sync                                    8         56
         1 gcs log flush sync                                   16         30
         1 gcs log flush sync                                   32         25
         1 gcs log flush sync                                   64         15
         1 gcs log flush sync                                  128          8
         1 gcs log flush sync                                  256          3

--> 'gcs log flush sync' and 'log file sync' heavily trigger LGWR to write redo data to disk.
    The wait distribution does not look healthy: thousands of 'log file sync' waits exceed 8 ms.
- 'gcs log flush sync' times are not acceptable; a noticeable share of these waits exceed 8 ms
- Note: only one process per instance (LGWR) writes to the redo logs; all other sessions wait on the 'log file sync' event for that write to complete
- Excessive COMMITs and dirty GCS buffers can stress your redo log performance
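The histogram columns above match GV$EVENT_HISTOGRAM, so a minimal query sketch like the following returns the same kind of distribution; the event list is an assumption for this example, and the counters are cumulative since instance startup:

  -- Cumulative wait-time histogram for the log-file-related events shown above.
  select inst_id, event, wait_time_milli, wait_count
  from   gv$event_histogram
  where  event in ('log file sync', 'log file parallel write', 'gcs log flush sync')
  order  by inst_id, event, wait_time_milli;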
AWR LFS AND OTHER RELATED WAITS FOR WORST LFS AWRs: ( from lfsdiag.sql Output )
APPROACH: These are the AWR snapshots where the average 'log file sync' times were the highest. Look at related waits at those times.

   SNAP_ID INST NAME                                     TOTAL_WAITS   TOTAL_WAIT_TIME   AVG_TIME_WAITED
---------- ---- ---------------------------------------- ----------- ----------------- -----------------
       303    3 log file sync                                      7          2003.532           286.219
       303    1 log file sync                                     18          3072.108           170.673
       303    2 log file sync                                      9           271.443            30.160
       303    1 gcs log flush sync                              1956         18537.067             9.477
       303    3 gcs log flush sync                              2460         18970.100             7.711
       303    2 gcs log flush sync                                93           498.418             5.359
       303    1 LGWR wait for redo copy                            5             7.147             1.429
       303    3 log file parallel write                          278           193.133              .695
       303    2 gc current block 2-way                          3026          1789.197              .591
       303    1 gc cr grant 2-way                               2092          1192.904              .570
       303    1 gc current block 2-way                          1730           936.028              .541
       303    3 gc current block 2-way                          1949           998.319              .512
       303    3 gc cr grant 2-way                                695           305.589              .440
       303    2 gc cr grant 2-way                                 46            19.798              .430
       303    1 log file parallel write                          365            34.946              .096
       303    3 LGWR wait for redo copy                            1              .037              .037
       303    2 log file parallel write                          146             4.058              .028

--> 'gcs log flush sync' is the top related wait event in this snapshot. This event triggers LGWR to write the redo log ('log file parallel write').
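A minimal AWR query sketch that returns comparable per-snapshot numbers; DBA_HIST_SYSTEM_EVENT is a standard AWR view (querying it requires a Diagnostics Pack license), but the event list and the delta calculation here are assumptions of this example, and lfsdiag.sql has its own logic for picking the worst snapshots:

  -- Per-snapshot waits for log-file-related events; DBA_HIST_SYSTEM_EVENT
  -- stores cumulative values, so deltas against the previous snapshot are taken.
  select snap_id,
         instance_number                               as inst,
         event_name,
         total_waits
           - lag(total_waits) over (partition by dbid, instance_number, event_id
                                    order by snap_id)  as waits,
         round((time_waited_micro
           - lag(time_waited_micro) over (partition by dbid, instance_number, event_id
                                          order by snap_id)) / 1000, 3) as total_wait_time_ms
  from   dba_hist_system_event
  where  event_name in ('log file sync', 'log file parallel write',
                        'gcs log flush sync', 'LGWR wait for redo copy')
  order  by snap_id, inst, total_wait_time_ms desc nulls last;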
Session details ( from lfsdiag.sql Output )
APPROACH: If you cannot determine the problem from the data above, you may need to look at the details of what each session is doing during each 'bad' snap. Most likely you will want to note the times of the high 'log file sync' waits, look at what LGWR is doing at those times, and go from there...

SAMPLE_TIME                INST SESSION_ID PROGRAM                            EVENT                     TIME_WAITED  P1 / P2 / P3
------------------------- ----- ---------- ---------------------------------- ------------------------- -----------  --------------------------------------
02-DEC-13 05.13.09.680 PM     1          5 JDBC Thin Client                   log file sync                   9.284  buffer#: 1456  sync scn: 20387728  : 0
02-DEC-13 05.13.09.680 PM     1         99 JDBC Thin Client                   log file sync                   6.774  buffer#: 1452  sync scn: 20387725  : 0
02-DEC-13 05.13.09.680 PM     1        193 JDBC Thin Client                   log file sync                   7.175  buffer#: 1452  sync scn: 20387726  : 0
02-DEC-13 05.13.10.680 PM     1          5 JDBC Thin Client                   log file sync                   4.070  buffer#: 1588  sync scn: 20388120  : 0
02-DEC-13 05.13.10.680 PM     1         96 oracle@grac41.example.com (LGWR)   log file parallel write          .905  files: 1  blocks: 1  requests: 1
02-DEC-13 05.13.10.680 PM     1        193 JDBC Thin Client                   log file sync                   4.149  buffer#: 1589  sync scn: 20388122
..
02-DEC-13 05.13.12.780 PM     1          5 JDBC Thin Client                   log file sync                 282.466  buffer#: 2523  sync scn: 20389585  : 0
02-DEC-13 05.13.12.780 PM     1         99 JDBC Thin Client                   log file sync                 281.079  buffer#: 2523  sync scn: 20389586  : 0
02-DEC-13 05.13.12.780 PM     1        193 JDBC Thin Client                   log file sync                 288.317  buffer#: 2520  sync scn: 20389583  : 0

--> Even while LGWR is writing redo records on behalf of one of the JDBC clients, at least four other clients are waiting on LGWR. This wait time is accounted to the 'log file sync' wait event.
--> The samples at 05.13.12 show very high 'log file sync' wait times (about 280 ms). Again, either LGWR does not get enough CPU or the I/O bandwidth is too low.
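A minimal ASH query sketch that returns the same kind of session-level detail, including the wait parameters; the sample-time window (17:13 on 02-DEC-13) and the event filter are assumptions chosen to match the excerpt above:

  -- Raw ASH samples around the problem window, including the wait parameters.
  -- For 'log file sync', P1 is the log buffer block (buffer#) and P2 the sync SCN;
  -- for 'log file parallel write', P1/P2/P3 are files, blocks and requests.
  select sample_time,
         inst_id                          as inst,
         session_id,
         program,
         event,
         round(time_waited / 1000, 3)     as time_waited_ms,
         p1text, p1, p2text, p2, p3text, p3
  from   gv$active_session_history
  where  sample_time between timestamp '2013-12-02 17:13:00'
                         and timestamp '2013-12-02 17:14:00'
  and    (event like 'log file%' or program like '%(LGWR)%')
  order  by sample_time, inst_id, session_id;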
Check your iostat output during heavy load timeframes
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.68    0.00   29.55    0.72    0.00   62.04

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdc               1.20         0.00        14.34          0         72
sda              18.92         4.78       270.92         24       1360
sde               1.39        19.12        12.75         96         64
sdh               3.39         1.99         2.59         10         13
sdb               3.19        63.75        14.34        320         72
sdg               3.39         1.99         2.59         10         13
sdd               0.60         3.19         1.59         16          8
sdf               3.39         1.99         2.59         10         13
sdi             481.67         0.00       579.08          0       2907
sdj             513.55         0.00       620.12          0       3113
sdk             311.16         0.00       381.67          0       1916
dm-0             22.11         4.78       270.92         24       1360

--> Even though sdi, sdj and sdk show an acceptable redo log write performance, all of these disks are emulated via VirtualBox and share a single physical SSD that delivers about 2000 tps. ASM spreads the I/O across all available disks, but the number of physical disks is the limiting factor.
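To cross-check the OS-level iostat numbers from inside the database, a minimal sketch based on the cumulative redo write statistics in GV$SYSSTAT can be used; the average is computed since instance startup, so compare values taken before and after the heavy-load window:

  -- Average redo write latency per instance since startup.
  -- 'redo write time' is recorded in units of 10 ms (centiseconds).
  select inst_id,
         max(decode(name, 'redo writes', value))                        as redo_writes,
         max(decode(name, 'redo write time', value))                    as redo_write_time_cs,
         round(max(decode(name, 'redo write time', value)) * 10
               / nullif(max(decode(name, 'redo writes', value)), 0), 3) as avg_redo_write_ms
  from   gv$sysstat
  where  name in ('redo writes', 'redo write time')
  group  by inst_id
  order  by inst_id;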
Summary
- Check that the I/O path for the redo logs is not used by other components (only LGWR should read from and write to those disks); a query sketch to display the current redo log layout follows this list
- Move the redo logs to SSD disks and provide a dedicated redo log I/O path for each instance
- Check the LGWR trace files for warnings like: Warning: log write elapsed time 628ms, size 1KB
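A minimal query sketch (standard V$LOG / V$LOGFILE views, no lfsdiag.sql involvement) to display the current redo log layout, so you can verify which disk groups or devices the redo log members actually live on:

  -- Redo log groups, sizes and member paths per thread; verify that the
  -- members sit on the dedicated devices / disk groups reserved for redo.
  select l.thread#,
         l.group#,
         l.bytes / 1024 / 1024   as size_mb,
         l.members,
         l.status,
         f.member
  from   v$log     l
  join   v$logfile f on f.group# = l.group#
  order  by l.thread#, l.group#, f.member;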
Reference
- Intermittent Long ‘log file sync’ Waits, LGWR Posting Long Write Times, I/O Portion of Wait Minimal (Doc ID 1278149.1)
- Bug:10318123 – lgwr regularly stalls for 3 seconds at a time
- Event 10928