Table of Contents
ps command ( status codes, display thread count and RT priority )
Using ps to display Realtime priority and Number of Threads
Some discrepancies in ps output are caused by the fact that each system may use different values to represent process priority, and the values have changed with the introduction of RT priorities. The scheduling priority depends on the scheduling class.
Scheduling classes
- SCHED_FIFO: a First-In, First-Out real-time process
- SCHED_RR: a Round Robin real-time process
- SCHED_NORMAL: a conventional, time-shared process
Scheduling priorities
- Real-time processes (SCHED_FIFO/SCHED_RR): real-time priority, ranging from 1 (lowest priority) to 99 (highest priority)
- Conventional processes (SCHED_NORMAL): static priority, ranging from 100 (highest priority) to 139 (lowest priority)
Note: conventional, time-shared processes are only scheduled if there are no real-time processes ready to run.
Details of the ps CLS flag
TS SCHED_OTHER (SCHED_NORMAL)
FF SCHED_FIFO
RR SCHED_RR
Sample:
# ps -e -o pid,class,rtprio,pri,nlwp,cmd | egrep 'PID|mp_stress|ocssd.bin|ora_lms0_grac41'
  PID CLS RTPRIO PRI NLWP CMD
 4582 RR      99 139   24 /u01/app/11204/grid/bin/ocssd.bin
 7894 RR      10  50    5 ./mp_stress -t 4 -m 1 -p 10 -c 2000
13508 RR       1  41    1 ora_lms0_grac41
--> ocssd.bin runs at the lowest priority ( PRI=139 )? Is that really true? Of course not - all three processes ocssd.bin, ora_lms0_grac41 and mp_stress belong to the real-time scheduling class ( Round Robin real-time process ). This means ocssd.bin runs with the highest RT priority ( RTPRIO=99 ) and will get CPU as needed. The ./mp_stress process runs as an RT process too, with priority 10. If ocssd.bin releases the CPU, ./mp_stress will be scheduled. If mp_stress does not release the CPU ( due to system call waits, .. ), ora_lms0_grac41 will never get scheduled, and this can lead to an Instance Eviction, as ora_lms0_grac41 runs with an RR priority of only 1. Note ./mp_stress runs with 5 threads. If we assume 4 worker threads, this program can keep your system quite busy if you have a low number of CPUs and ./mp_stress doesn't release the CPU.
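The scheduling class and RT priority of any single process can also be checked (and changed) interactively with chrt from util-linux. A minimal sketch, using the current shell ( $$ ) only as an example target PID:

```shell
# Query policy and RT priority of a process; $$ (the current shell) is
# just an example target. A normal shell reports SCHED_OTHER / priority 0.
chrt -p $$
# Cross-check with ps: the CLS column should match the chrt policy
# (TS = SCHED_OTHER, FF = SCHED_FIFO, RR = SCHED_RR).
ps -o pid,class,rtprio,pri -p $$
```

To move a process into the round-robin class you would run e.g. chrt -r -p 10 <pid> (this needs root or CAP_SYS_NICE, so it is not done here).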
Further details: The kernel stores the priority value in /proc/<pid>/stat (let's call it p->prio), and ps reads that value and displays it to the user in various ways. There are 8 (partly undocumented) priority output specifiers that can be passed to the -o option:
priority   p->prio
intpri     60 + p->prio
opri       60 + p->prio
pri_foo    p->prio - 20
pri_bar    p->prio + 1
pri_baz    p->prio + 100
pri        39 - p->prio
pri_api    -1 - p->prio
They were introduced to fit the values into certain intervals and for compatibility with POSIX and other systems.
PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped, either by a job control signal or because it is being traced
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by its parent
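The mapping can be verified by hand by reading p->prio straight out of /proc/<pid>/stat. A minimal sketch against the current shell; field 18 of the stat line is the priority and field 20 the thread count, counted after the parenthesised command name (which may itself contain spaces, hence the stripping):

```shell
# Read the raw kernel priority (p->prio) and thread count for this shell.
stat=$(cat /proc/self/stat)
rest=${stat##*) }          # drop "pid (comm) " so spaces in comm can't shift fields
set -- $rest               # $1=state ... ${16}=priority ${17}=nice ${18}=num_threads
echo "p->prio=${16} nice=${17} nlwp=${18}"
# ps's "pri" column is then 39 - p->prio (20 for a nice-0 SCHED_NORMAL process -> pri 19)
echo "ps pri column = $((39 - ${16}))"
```

For real-time processes the kernel stores a negative value here, which is why the various +60/+100 offsets above exist.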
Reference
- http://honglus.blogspot.de/2011/03/understanding-cpu-scheduling-priority.html
- http://www.redhat.com/archives/rhelv5-list/2009-June/msg00107.html
Using and interpreting /proc/loadavg
The first three columns show the load average: the number of threads that were runnable or in uninterruptible (disk-wait) sleep, averaged over the last one, five, and 15 minute periods.
The fourth column shows the number of currently running threads and the total number of schedulable threads/processes.
The last column displays the last process ID assigned.
# cat /proc/loadavg
2.87 4.00 4.01 1/712 25279
- during the last 60 seconds, on average about 2.87 threads were concurrently running ( either on the run queue or waiting on disk I/O )
- during the last 5 minutes, on average about 4.00 threads were concurrently running ( either on the run queue or waiting on disk I/O )
- during the last 15 minutes, on average about 4.01 threads were concurrently running ( either on the run queue or waiting on disk I/O )
- only a single thread/process is currently scheduled on a CPU ( only a single CPU active ? )
- 712 schedulable threads/processes exist in total
- the latest PID assigned was 25279
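The five fields above can be split apart in a one-liner; a minimal sketch reading /proc/loadavg directly:

```shell
# Split /proc/loadavg into its five fields.
read load1 load5 load15 sched lastpid < /proc/loadavg
running=${sched%/*}   # threads currently scheduled on a CPU
total=${sched#*/}     # total schedulable threads/processes
echo "1min=$load1 5min=$load5 15min=$load15 running=$running total=$total last_pid=$lastpid"
```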
Check errors and dropped frames with the netstat monitor command
# netstat -Ieth2 1
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth2 1500 0 5433104 0 0 0 4730272 0 0 0 BMRU
eth2 1500 0 5433116 0 0 0 4730286 0 0 0 BMRU
eth2 1500 0 5433122 0 0 0 4730294 0 0 0 BMRU
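The same per-second counters can be derived without netstat by sampling /proc/net/dev twice, one second apart. A sketch; IFACE=lo is only an example, substitute eth2 or whatever interface you want to watch:

```shell
# Approximate `netstat -Ieth2 1` with two samples of /proc/net/dev.
IFACE=lo   # assumption: replace with the interface under test, e.g. eth2
sample() { sed -n "s/^ *$IFACE://p" /proc/net/dev; }
set -- $(sample)             # $1=RX-bytes $2=RX-pkts $3=RX-errs $4=RX-drop ...
rx1=$2 err1=$3 drp1=$4
sleep 1
set -- $(sample)
echo "RX-OK/s=$(($2 - rx1)) RX-ERR/s=$(($3 - err1)) RX-DRP/s=$(($4 - drp1))"
```

The sed strips "IFACE:" rather than relying on whitespace, because on some kernels a large byte counter is printed fused to the interface name.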
Using dstat ( replacement for vmstat, iostat and ifstat )
Check network bandwidth for specific interfaces together with CPU/disk/paging activity:
# dstat -Neth2,eth3
----total-cpu-usage---- -dsk/total- --net/eth2----net/eth3- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send: recv  send|  in   out | int   csw
  5   5  59  31   0   0| 180k 2574k|   0     0 :   0     0 | 478B 2450B|2093  4957
  8   5  85   3   0   0|  16k   98k|3420B 7794B:3244B 4252B|   0     0 |2471  5774
  4   5  91   0   0   0|  48k 1536B|1501B 1086B: 965B 1276B|   0     0 |2023  4923
  2   4  94   0   0   0|  32k 1536B|2418B 1371B:1904B 3225B|   0     0 |2060  5179
  5   4  91   0   0   0|  16k   50k|  72k   34k: 114k   18k|   0     0 |2115  4695
  6   7  86   0   0   1|  32k 1536B|  94k   36k:  12k   43k|   0     0 |2113  4961
  8   5  85   3   0   0|  32k  146k|2640B 7284B:1614B 2472B|   0     0 |2374  5716
  5   5  90   0   0   0|  96k   50k|3554B 3881B:  20k 5483B|   0     0 |2046  5064
  4   5  92   0   0   0|  32k 1536B|2094B 1146B:1406B 1824B|   0     0 |1996  4965