Memory Capping on SmartOS
Marco Spadoni of Libero asks a couple of questions.
"I have a SmartVM that goes in overcap, and consequently I can see some MB that have been paged out. The question is why I do not find any trace of this in vmstat output?"
And:
"could you please, at your best convenience, also explain why, even if in the swap command man page it is stated that the "-l" option does not include physical memory (whilst the "-s" option does), it seems that each invocation returns the physical memory contained…"
A "SmartVM" is a virtualized OS instance running on SmartOS (i.e., a zone). The short answer to the first question is that you won't see anything regarding pages of an individual zone being paged out in vmstat(1M) output unless the entire machine is memory stressed. And even then, vmstat will only show paging that is being done by the pageout() daemon (see vm_pageout.c for the code and a long description of how paging works on SmartOS). I wrote a bit about how memory capping works on SmartOS here.
Marco includes the following output from kstat(1M) showing the zone in question:
# kstat memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5
module: memory_cap                      instance: 6
name:   03e65382-7f67-4791-a0e2-0c5fc5  class:    zone_memory_cap
        anon_alloc_fail                 0
        anonpgin                        90678
        crtime                          416.87107576
        execpgin                        97
        fspgin                          11757
        n_pf_throttle                   55308
        n_pf_throttle_usec              7926500
        nover                           14
        pagedout                        2806816768
        pgpgin                          102532
        physcap                         2147483648
        rss                             2091122688
        snaptime                        8501413.44077987
        swap                            3413041152
        swapcap                         4294967296
        zonename                        03e65382-7f67-4791-a0e2-0c5fc5cdb8d6

#
Note that the zone has gone over its memory cap 14 times (nover), and has paged out a total of 2806816768 bytes (pagedout). Some time later, it has gone over the cap 4 more times:
# kstat memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5
module: memory_cap                      instance: 6
name:   03e65382-7f67-4791-a0e2-0c5fc5  class:    zone_memory_cap
        anon_alloc_fail                 0
        anonpgin                        109303
        crtime                          416.87107576
        execpgin                        97
        fspgin                          11757
        n_pf_throttle                   310670
        n_pf_throttle_usec              52254000
        nover                           18
        pagedout                        3218251776
        pgpgin                          121157
        physcap                         2147483648
        rss                             2116661248
        snaptime                        8506139.7862525
        swap                            3434885120
        swapcap                         4294967296
        zonename                        03e65382-7f67-4791-a0e2-0c5fc5cdb8d6

#
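The counters in this kstat are cumulative, so what matters is the change between two snapshots. As a minimal sketch (not part of any SmartOS tool), the C below subtracts the two snapshots shown above. The struct and function names are made up for illustration, and it assumes n_pf_throttle_usec is in microseconds, as its name suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Subset of the memory_cap kstat fields shown above (names are illustrative). */
struct mcap_snap {
    double   snaptime;            /* kstat snapshot time, in seconds */
    uint64_t nover;               /* times the zone went over its cap */
    uint64_t pagedout;            /* cumulative bytes paged out */
    uint64_t n_pf_throttle_usec;  /* cumulative throttle time (assumed usec) */
};

/* Change between two snapshots. */
struct mcap_delta {
    double   seconds;
    uint64_t nover;
    uint64_t pagedout_bytes;
    double   pagedout_mib;
    double   throttle_seconds;
};

/* Values copied from the two kstat outputs above. */
const struct mcap_snap snap_first  = { 8501413.44077987, 14, 2806816768ULL, 7926500ULL };
const struct mcap_snap snap_second = { 8506139.7862525,  18, 3218251776ULL, 52254000ULL };

/* Subtract an earlier snapshot from a later one. */
struct mcap_delta
mcap_diff(const struct mcap_snap *earlier, const struct mcap_snap *later)
{
    struct mcap_delta d;

    /* The counters only grow, so the later snapshot must not be smaller. */
    assert(later->snaptime >= earlier->snaptime);
    assert(later->nover >= earlier->nover);
    assert(later->pagedout >= earlier->pagedout);
    assert(later->n_pf_throttle_usec >= earlier->n_pf_throttle_usec);

    d.seconds = later->snaptime - earlier->snaptime;
    d.nover = later->nover - earlier->nover;
    d.pagedout_bytes = later->pagedout - earlier->pagedout;
    d.pagedout_mib = (double)d.pagedout_bytes / (1024.0 * 1024.0);
    d.throttle_seconds =
        (double)(later->n_pf_throttle_usec - earlier->n_pf_throttle_usec) / 1e6;
    return d;
}
```

Calling mcap_diff(&snap_first, &snap_second) gives 4 more cap overruns and 411435008 bytes (392.375 MiB) paged out over about 79 minutes, with roughly 44 seconds of additional page-fault throttling in the same window.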
The difference in pagedout is ~392MB. Marco then shows the following vmstat(1M) output:
# vmstat -p
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 103635632 5777488 417 3117 0 0 0    0    0    0    0    0    0    0    0    0
# vmstat -S
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  si  so pi po fr de sr rm s0 -- --   in   sy   cs us sy id
 0 0 0 103635632 5777484 0 0 0 0 0 0 0 -1159 55 0 0 3544 13732 1904 3 1 96
# vmstat
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr rm s0 -- --   in   sy   cs us sy id
 0 0 0 103635632 5777484 417 3117 0 0 0 0 0 -1159 55 0 0 3544 13732 1904 3 1 96
Marco says he does not find anything about the zone that went over its memory cap in the vmstat(1M) output. However, like most of the *stat commands, the first line of output is an average since boot. To get multiple lines of output, you need to specify an interval in seconds, for instance:
# vmstat 2
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr lf rm s0 s1   in   sy   cs us sy id
 0 0 0 2532416 332324 4  32  3  0  0  0  2  0 -139 0 537 319 174737 338 22 4 74
 1 0 0 2406184 199776 4  25 10  0  0  0  0 97  3  1  0  328  176  440 99  1  0
 1 0 0 2406104 199696 0   1  0  0  0  0  0  0  0  0  0  304  156  220 100 0  0
...
This gives output every 2 seconds. The first line (after the header) is the average since boot. Each subsequent line covers the previous 2 seconds.
But even when specifying an interval, it is very likely that no evidence of the zone going over its memory cap will show up in this output. Unless the system as a whole is running short of free memory, vmstat will not show paging activity. Memory capping for a zone is done by a thread in zoneadmd. This thread uses the memcntl(2) system call to page out pages belonging to processes within the over-capped zone. It is better to use zonememstat(1M) to see per-zone memory usage.
As to Marco's second question, regarding swap space, let's take a look at the output of swap -l in the global zone, then in a non-global zone. (This was not done on the machine Marco was using, so the sizes differ from his.)
# swap -lh
swapfile                  dev    swaplo   blocks     free
/dev/zvol/dsk/zones/swap  90,1       4K     2.0G     2.0G
#
In the global zone, there is 2.0GB of swap space on 1 swap device. In a non-global zone:
# swap -lh
swapfile                  dev    swaplo   blocks     free
swap                      -          4K     512M     491M
#
And the amount of memory in this zone is:
# prtconf | head
prtconf: devinfo facility not available
System Configuration: Joyent i86pc
Memory size: 512 Megabytes
System Peripherals (Software Nodes):

#
So, the amount of swap space in the non-global zone is equal to the size of the physical memory of that zone. But according to swap(1M) for the "-l" option:
"The list does not include swap space in the form of physical memory because this space is not associated with a particular swap area."
The output in the non-global zone looks like the amount of swap space for the zone is equivalent to the memory cap on the zone. To understand what is going on, we'll look at the source code for the swap(1M) command. This is in swap.c. For the "-l" option, the list() function is called (line 366 in swap.c). This function calls the swapctl(2) system call twice: the first time to get the number of swap devices/files, and the second time to get the sizes of those devices. The code for swapctl(2) is in vm_swap.c. In that file, a comment for the swapctl(2) call that retrieves the number of swap devices/files on the system says:
/*
 * When running in a zone we want to hide the details of the swap
 * devices: we report there only being one swap device named "swap"
 * having a size equal to the sum of the sizes of all real swap devices
 * on the system.
 */
So, in a non-global zone, the swap(1M) command reports only 1 swap device, regardless of the number of swap devices/files configured on the system. In the vm_swap.c file, starting at line 605, is the following (note that line numbers may change over time):
if (zp->zone_max_swap_ctl != UINT64_MAX) {
	rctl_qty_t cap, used;

	mutex_enter(&zp->zone_mem_lock);
	cap = zp->zone_max_swap_ctl;
	used = zp->zone_max_swap;
	mutex_exit(&zp->zone_mem_lock);

	st.ste_length = MIN(cap, st.ste_length);
	st.ste_pages = MIN(btop(cap), st.ste_pages);
	st.ste_free = MIN(st.ste_pages - btop(used), st.ste_free);
}
This is part of the code that retrieves the size of the swap device. If zone_max_swap_ctl is UINT64_MAX, the size comes from the data structures that the kernel uses to manage swap space. If zone_max_swap_ctl is not equal to UINT64_MAX, the reported size is limited by the zone_max_swap_ctl variable: the code takes the smaller of the cap and the real size. In the case of a non-global zone with a memory cap (as in the case Marco is asking about), the zone_max_swap_ctl variable is not equal to UINT64_MAX, as the following output from DTrace shows. (You didn't think I would go through a technical post without using DTrace, did you?)
# dtrace -n 'swapctl:entry/execname == "swap" \
  && stringof(((proc_t *)curpsinfo->pr_addr)->p_zone->zone_name) != "global"/ \
  {printf("zone_max_swap_ctl = %d\n", \
  ((proc_t *)curpsinfo->pr_addr)->p_zone->zone_max_swap_ctl);\
  exit(0);}'
dtrace: description 'swapctl:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  20432                    swapctl:entry zone_max_swap_ctl = 536870912

#
The value of UINT64_MAX (18446744073709551615) is much larger than 536870912 (512MB), so the swap cap for the zone is reported. The zone's configuration shows the same 512MB swap cap:
# zonecfg -z 003f53ff-600b-44be-bae3-ca3f84aa5a8a info capped-memory
capped-memory:
	[physical: 512M]
	[swap: 512M]
	[locked: 512M]
#
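To see how the clamping in vm_swap.c turns a 2.0G system swap device into the 512M/491M line that swap -lh printed inside the zone, here is a simplified user-space sketch of that logic. It is not the kernel code: the sketch_ names are hypothetical, it assumes 4K pages, it leaves out ste_length and the locking, and the 21MB "used" figure is only inferred from 512M minus 491M.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGESIZE 4096ULL               /* assumption: 4K pages */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Bytes to pages, rounding down, in the spirit of the kernel's btop(). */
uint64_t
sketch_btop(uint64_t bytes)
{
    return bytes / SKETCH_PAGESIZE;
}

/* Just the two swapent fields that matter here, both in pages. */
struct sketch_swapent {
    uint64_t pages;   /* total size (ste_pages) */
    uint64_t free;    /* free space (ste_free) */
};

/*
 * Return the view a zone gets of the system-wide swap totals.
 * cap_bytes plays the role of zone_max_swap_ctl and used_bytes the
 * role of zone_max_swap; UINT64_MAX means the zone has no swap cap.
 */
struct sketch_swapent
sketch_zone_view(struct sketch_swapent sys, uint64_t cap_bytes,
    uint64_t used_bytes)
{
    struct sketch_swapent z = sys;

    if (cap_bytes != UINT64_MAX) {
        z.pages = SKETCH_MIN(sketch_btop(cap_bytes), sys.pages);
        /* Guard the subtraction below against unsigned wraparound. */
        assert(sketch_btop(used_bytes) <= z.pages);
        z.free = SKETCH_MIN(z.pages - sketch_btop(used_bytes), sys.free);
    }
    return z;
}
```

With a 2GB system swap device that is entirely free (524288 pages), a cap of 536870912 bytes, and 21MB in use, sketch_zone_view() returns 131072 pages (512MB) total and 125696 pages (491MB) free, which matches the swap -lh output from the non-global zone. With a cap of UINT64_MAX the system totals come back unchanged.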
I wish to thank Marco for the questions. Maybe sometime soon I'll take a look at the output of swap -sh.
Post written by Mr. Max Bruning