DTrace, MySQL, Ganglia, and Digging for Solutions
I stay generic when blogging, but I had an interesting problem today that I thought I'd share. I've got a Ganglia rollout that's constantly growing. Some time ago gmetad started struggling to keep up and would report nodes down (oddly, if you refreshed the page over and over you'd see the number of down nodes roll up and down: 8 down, 7, 6, 4, 2, 0, 8…). The temporary fix was to simply reduce the number of monitored clusters. But this irritated me because gmetad really is pretty light on CPU and memory usage… Ganglia is synonymous with "scale", so why won't it scale?
# iostat -xn sd3 1
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   92.1  158.2  5460.8 2075.6  0.0  9.8    0.0   39.0   0  71 sd3
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  207.0  244.8 12990.2  509.0  0.0 15.2    0.0   33.5   0 100 sd3
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   64.0  330.8  4076.2 1083.7  0.0 19.2    0.0   48.7   0 100 sd3
I had noticed via iostat that disk IO was excessive for what the system was doing, so that's where I started. I could have written some D scripts to explore, but Brendan Gregg's DTrace Toolkit already contains useful scripts like iotop and iosnoop, so I decided to save some time and use those. Here's a snapshot from iotop:
2008 Apr 23 21:48:47,  load: 0.32,  disk_r:  34909 KB,  disk_w:  20108 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D            BYTES
    0      3      0 fsflush          sd3      31 192 W           744448
  100  24277  24180 mysqld           sd3      31 196 W          3801088
  100  24277  24180 mysqld           sd3      31 196 R          4358144
    0      0      0 sched            sd3      31 196 W         16045056
    0      0      0 sched            sd3      31 196 R         31389188
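(If you haven't used the toolkit before: the scripts are simple to run. The invocations below are roughly how I run them; the install path is an assumption on my part, so adjust for wherever you unpacked the toolkit.)

cd /opt/DTT                  # assumed toolkit location
./iotop 5                    # summarize disk IO by process every 5 seconds
./iosnoop                    # print individual disk IO events as they happen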
MySQL is doing a lot more IO than it ought to be. I know that Zabbix is using MySQL, but I've got several other apps that may be using it as well, so I'll keep that on my hit list. So I need to look at what MySQL is doing, and then at file IO again, but this time at the POSIX layer so that I can see which files are being touched. The "sched" IO, by the way, is actually cache flushing, typically from the ZFS ARC (ZFS's buffers).
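As an aside, if you want to see how much memory the ARC is actually holding while this is going on, kstat will show you. A quick check (stat names as I recall them from Solaris 10, so verify on your release):

kstat -p zfs:0:arcstats:size   # current ARC size, in bytes
kstat -p zfs:0:arcstats:c      # ARC target size, in bytes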
MySQL first… I like my my_qcache.d script because I can see queries, execution time, and whether or not each query is being served from the query cache (Qcache). Here is the script:
#!/usr/sbin/dtrace -s
/*
 * Query Cache Viewer; benr@joyent.com
 */

#pragma D option quiet

BEGIN
{
        printf("Waiting for queries... hit ^c to quit.\n");
}

pid$target:mysqld:*dispatch_command*:entry
{
        self->query = copyinstr(arg2);
        self->start = timestamp;
}

pid$target:mysqld:*send_result_to_client*:entry
{
        self->cache = "Yes";
}

pid$target:mysqld:*do_select*:entry
{
        self->cache = "No";
}

pid$target:mysqld:*dispatch_command*:return
{
        this->end = timestamp;
        this->elapsed = (this->end - self->start)/1000000;
        printf("From Query Cache?: %s in %d ms \t| Query: %s\n",
            self->cache, this->elapsed, self->query);
        self->start = 0;
        self->query = 0;
}
Here is a sample of what I got…
$ ./my_qcache.d -p `pgrep -x mysqld`
Waiting for queries... hit ^c to quit.
From Query Cache?: Yes in 0 ms  | Query: begin;
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=1 and httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=0 and httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=3 and httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=3 and t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=1 and t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: update items set nextcheck=1208988127, lastclock=1208987527 where itemid=20837
From Query Cache?: Yes in 10 ms | Query: commit;
After watching queries fly by, it's clear that all of this activity is coming from Zabbix. A lot of data is being written, and that's gonna happen, but there is also a lot of disk read that calls for some tuning. I'll set that aside for now; given that Zabbix is an auxiliary monitoring solution, I'll just shut it down to concentrate on Ganglia. After shutting down Zabbix I check the MySQL query activity again:
root@ev2-admin tmp$ ./my_qcache.d -p `pgrep -x mysqld`
Waiting for queries... hit ^c to quit.
^C
Ok, MySQL is quiet. Now I can dig into Ganglia without auxiliary noise.
I looked back at iostat and things didn't look drastically different. How can this be? For a second I thought that perhaps Ganglia might simply be running out of file descriptors; perhaps it's opening lots of files (RRDs in this case) simultaneously and that's my cap? So I'll check the resource controls (rctls) with a special eye on the FD limit:
# prctl -i process `pgrep gmetad`
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
...
process.max-file-descriptor
        basic             256       -   deny                             12528
        privileged      65.5K       -   deny                                 -
        system          2.15G     max   deny                                 -
...
So the "basic" limit is still in place, capping the process at the POSIX-standard 256 file descriptors. (The "recipient" column is the PID, not a counter.) Now, I could increase that threshold or simply drop the "basic" limit to allow the full "privileged" limit, but do I need to? Using pfiles I'll check the number of open file descriptors… I run it a couple of times to see how much the number changes over time:
# pfiles `pgrep gmetad`
12528:  /opt/ganglia/sbin/gmetad
  Current rlimit: 256 file descriptors
   0: S_IFREG mode:0444 dev:278,4 ino:5705 uid:60001 gid:0 size:336
      O_RDONLY
      /proc/12528/psinfo
   1: S_IFSOCK mode:0666 dev:283,0 ino:8706 uid:0 gid:0 size:0
      O_RDWR
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152)
        sockname: AF_INET 0.0.0.0  port: 8651
   2: S_IFSOCK mode:0666 dev:283,0 ino:57377 uid:0 gid:0 size:0
      O_RDWR
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152)
        sockname: AF_INET 0.0.0.0  port: 8652
   3: S_IFDOOR mode:0444 dev:286,0 ino:35 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[24142]
      /var/run/name_service_door
Well, ok, so clearly the number of file descriptors isn't the issue… but just in case, I wanted to rule it out completely. While I could use DTrace, this isn't a production-impacting application, so I did a quick and naughty look with truss:
# truss -p `pgrep gmetad`
...
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx", 0755)              Err#17 EEXIST
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy", 0755)  Err#17 EEXIST
/125:   xstat(2, "/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/load_fifteen.rrd", 0xF6F3B874) = 0
/125:   open64("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/load_fifteen.rrd", O_RDWR) = 7
/125:   fstat64(7, 0xF6F3B200)                          = 0
/125:   fstat64(7, 0xF6F3B130)                          = 0
/125:   ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
/125:   read(7, " R R D? 0 0 0 3????".., 4096)          = 4096
/125:   llseek(7, 0, SEEK_CUR)                          = 4096
/125:   lseek(7, 0, SEEK_END)                           = 12112
/125:   llseek(7, 0, SEEK_CUR)                          = 12112
/125:   llseek(7, 0, SEEK_CUR)                          = 12112
/125:   lseek(7, 1312, SEEK_SET)                        = 1312
/125:   fcntl(7, F_SETLK64, 0xF6F3B314)                 = 0
/125:   mmap64(0x00000000, 12112, PROT_READ|PROT_WRITE, MAP_SHARED, 7, 0) = 0xF6B30000
/125:   munmap(0xF6B30000, 12112)                       = 0
/125:   llseek(7, 0, SEEK_CUR)                          = 1312
/125:   lseek(7, 772, SEEK_SET)                         = 772
/125:   write(7, "9DCC0F H???? 1 . 5 5".., 540)         = 540
/125:   close(7)                                        = 0
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx", 0755)              Err#17 EEXIST
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy", 0755)  Err#17 EEXIST
/125:   xstat(2, "/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/proc_total.rrd", 0xF6F3B874) = 0
/125:   open64("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/proc_total.rrd", O_RDWR) = 7
/125:   fstat64(7, 0xF6F3B200)                          = 0
/125:   fstat64(7, 0xF6F3B130)                          = 0
/125:   ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
/125:   read(7, " R R D? 0 0 0 3????".., 4096)          = 4096
/125:   llseek(7, 0, SEEK_CUR)                          = 4096
/125:   lseek(7, 0, SEEK_END)                           = 12112
^C
/125:   ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
...
So it's doing a lot of IO, and it's happening sequentially: a file is opened, worked on, closed, then on to the next. This clearly shows that file descriptor limits are not the issue here.
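For the record, if the basic limit ever did become the cap, it could be raised or dropped on the live process with prctl rather than restarting gmetad. Roughly like this (flags and values from memory, so double-check prctl(1) before trusting them):

prctl -n process.max-file-descriptor -t basic -v 4096 -r -i process `pgrep gmetad`   # raise the basic limit
prctl -n process.max-file-descriptor -t basic -x -i process `pgrep gmetad`           # or delete the basic limit entirely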
Truss is a quick and dirty tool, but I want a little more precision. Let's try something out of Brendan's bag of tricks:
$ ./fsrw.d
Event            Device RW     Size Offset Path
sc-read               .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
fop_read              .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
sc-write              .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
fop_write             .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
sc-read               .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
fop_read              .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
sc-write              .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
fop_write             .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
So fsrw.d from the DTrace Toolkit sends lines like these flying by… we're doing a lot more IO than I realized, and we're doing it in small chunks. Looking back at my iostat output from earlier, I notice that the number of ops shown was low even though we're doing a ton of IO; ZFS is clearly doing a good job of grouping and caching IO for me. Even so, the disk just can't keep up, and I've still got more than half of my Ganglia nodes unmonitored, meaning I need to at least double the IO load to solve my problem.
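If you want to put a number on "small chunks", a one-off DTrace aggregation of write sizes would do it. This is just a sketch I'd reach for (syscall arguments as I remember them), not something from the toolkit:

dtrace -n 'syscall::write:entry /pid == $target/ { @["write size (bytes)"] = quantize(arg2); }' -p `pgrep gmetad`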
Let's look at that iostat output again:
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
...
   92.1  158.2  5460.8 2075.6  0.0  9.8    0.0   39.0   0  71 sd3
  207.0  244.8 12990.2  509.0  0.0 15.2    0.0   33.5   0 100 sd3
   64.0  330.8  4076.2 1083.7  0.0 19.2    0.0   48.7   0 100 sd3
These are 1-second intervals. The throughput is low and the IO size (kw/s divided by w/s) isn't terrible, but my asvc_t (average service time, in ms) is higher than I'd expect on local 10K SAS drives, and the disk is pegged at 100% busy. Put those two facts together, 100% busy plus an asvc_t significantly higher than the drive's average latency (~10ms), and I know the drive is moving its head far more than it should; we're clearly dealing with a lot of scattered blocks on disk. I could use DTrace to look at the LBA of each IO, but I don't think I need to go that far. The problem is clear: the disk just can't keep up with the flood of IO from Ganglia. Normally this is where I'd go looking for a new RAID controller or start tuning any kernel parameter I could find… but we're not talking about a lot of data here. These are RRD files; the total size of my Ganglia data/ directory is less than 50MB!
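(If I did want to see how scattered the IO is, the io provider hands you the block number on every disk operation. A rough one-liner along these lines would show the spread; I didn't bother running it, so treat it as a sketch:)

dtrace -n 'io:::start /args[1]->dev_statname == "sd3"/ { @["block offset"] = quantize(args[0]->b_blkno); }'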
One way to solve this problem would be to get a really big RAID controller, but that's expensive. Besides, how does a controller really speed things up? Caches! Well, hell, I've already got a lot of free memory on this system and I'm not dealing with gigs of data… why not just store the Ganglia data files in system memory? If I did that I'd provide more IO speed than Ganglia could ever use and free up my disks to do something more useful.
Enter tmpfs. Rather than code some nifty caching mechanism into gmetad, I could just mount a tmpfs filesystem as its data directory. It seems silly at first, but it would be fast, and looking at Web 2.0 scalability tools like Varnish, memcached, etc., they'll store anything in memory… why not this? The only trick is syncing the in-memory data to disk periodically in case of a crash.
So I wrote up a startup wrapper for gmetad (which is invoked as an SMF method) that looks like this:
#!/usr/bin/bash
#
# SMF Start method for Gmetad
#
# -benr

if ( mount | grep "/opt/ganglia/data on swap" >/dev/null )
then
        echo "Tmpfs already mounted."
else
        echo "Mounting tmpfs..."
        chown nobody /opt/ganglia/data
        mount -F tmpfs -o size=100m,noxattr swap /opt/ganglia/data
fi

## Now sync the data in if its empty:
if [ -d /opt/ganglia/data/__SummaryInfo__/ ]
then
        echo "Cache primed, ready to start."
        chown -R nobody /opt/ganglia/data
else
        echo "Priming the cache..."
        /opt/csw/bin/rsync -at /opt/ganglia/data-disk/ /opt/ganglia/data/
        chown -R nobody /opt/ganglia/data
fi

## Finally, start ganglia:
/opt/ganglia/sbin/gmetad
Now I just add a cron job to rsync the tmpfs data back to disk every 5 minutes:
# Sync gmetad ram-disk to physical-disk
#
5,10,15,20,25,30,35,40,45,50,55,0 * * * * /opt/csw/bin/rsync -at /opt/ganglia/data/ /opt/ganglia/data-disk/
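One thing the cron job doesn't cover is a clean shutdown: the SMF stop method ought to flush the tmpfs copy back to disk before gmetad goes away, otherwise you can lose up to 5 minutes of data on a reboot. A rough sketch of what such a stop method could look like (illustrative only, paths matching the start method above):

#!/usr/bin/bash
#
# SMF Stop method sketch for Gmetad: flush the ram-disk back to
# physical disk, then stop the daemon.

echo "Flushing ram-disk to physical-disk..."
/opt/csw/bin/rsync -at /opt/ganglia/data/ /opt/ganglia/data-disk/

echo "Stopping gmetad..."
pkill -x gmetad

exit 0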
After about 20 minutes of tinkering I got things right, then stopped and restarted Ganglia, this time with all the clusters enabled. To my delight the disk got a much-needed vacation, overall system performance smoothed out significantly because the IO subsystem was no longer busy all the time, Ganglia was working perfectly and faster than ever, and on top of it all I've extended the life of my disk because it's not pounding the heads 24 hours a day.
After sharing my solution a bit I found that a lot of people are using tmpfs for RRD storage for the same reason I am. Wish they'd told me before. :)
We've seen here how powerful the observability capabilities of Solaris are: how DTrace helped us understand IO activity and even peek inside a MySQL database, how iostat helped us understand what the hard drives were doing, and how truss, rctls, and the Solaris ptools (pfiles in this case) helped us investigate and rule out some hunches along the way. Furthermore, we've seen how we can exploit tmpfs to trade a little memory for some big wins in both disk longevity and performance, without having to write a line of code, just a couple of commands in a shell script combined with a cron job.
Solaris for the win. ;)
Post written by jason