I have a large system: 128 GB of RAM, a couple of RAID0 filesystems (6 TB and 2 TB) with an SSD cache, and 8 cores (16 with hyperthreading), running 64-bit Ubuntu 12.04. When I try to write a large file I get very poor performance, and iotop shows processes spending over 99% of their time in IO wait:
dd if=/dev/zero of=lezz bs=1024 count=$((1024*50))
51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 3.74852 s, 14.0 MB/s
From iotop:
Total DISK READ: 185.92 K/s | Total DISK WRITE: 84.06 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
24481 be/4 arris292 0.00 B/s 0.00 B/s 0.00 % 99.99 % dd if=/dev/zero of=lezz bs=1024 count=512000
22668 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [flush-252:0]
21532 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [kworker/1:2]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
8196 be/4 arris292 0.00 B/s 0.00 B/s 0.00 % 0.00 % sshd: arris292@pts/22
5 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/u:0]
6 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
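With both dd and the flush kthread stuck at 99% IO, one thing I thought to look at (a sketch; iostat needs the sysstat package) is whether dirty pages are accumulating faster than writeback can flush them to the backing device:

```shell
# Snapshot of dirty/writeback pages: if Dirty keeps growing while
# Writeback stays high, the backing device can't keep up with writes.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Per-device utilisation and latency (from the sysstat package):
# %util near 100 with a high await points at a saturated or slow device.
iostat -x 1 5
```

Repeating the meminfo grep every few seconds while the dd runs should show whether the kernel is throttling writers because dirty memory has hit its limit.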
On a very similar system (same memory, same model, similar filesystems) I get the expected performance, and no processes spend 99% of their time waiting for IO:
dd if=/dev/zero of=lezz bs=1024 count=$((1024*50))
51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 0.111191 s, 472 MB/s
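I realise 472 MB/s with a 1 KB block size probably just measures the page cache rather than the disks, so the comparison may be flattering the healthy system. A fairer test (my own variant, using standard GNU dd flags) forces the data to disk and includes the sync in the timing:

```shell
# Write 512 MiB with a sane block size and include the final fdatasync
# in the timing, so the figure reflects real disk throughput, not cache.
dd if=/dev/zero of=ddtest bs=1M count=512 conv=fdatasync
rm -f ddtest
```

Even so, the slow system's 14 MB/s for a 52 MB write that should fit entirely in cache seems pathological either way.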
I've seen this before, but I've never been able to get to the bottom of it, and as the day goes on and more engineers start using this system for builds, overall performance slows to a crawl.
So what could be causing the incredibly high IO wait times? How can I troubleshoot this further? Is it possibly an SSD or disk problem, and if so, what tools can I use to diagnose it?
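For the disk-health angle, this is what I was planning to try (device names are placeholders for my layout; smartctl comes from the smartmontools package):

```shell
# SMART health verdict and logged ATA errors for each RAID0 member.
sudo smartctl -H /dev/sda
sudo smartctl -l error /dev/sda

# A failing or resetting disk usually leaves traces in kernel messages.
dmesg | grep -iE 'ata|error|reset' | tail -20
```

Is this the right direction, or are there better tools for pinning down a slow member in a RAID0 set behind an SSD cache?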