Walmart Node.js Memory Leak

A few weeks ago, Eran Hammer of Walmart labs came to the Node.js core team complaining of a memory leak he had been tracking down for months. By expending a lot of effort over those few months he had taken the memory leak from a couple hundred megabytes a day, down to a mere eight megabytes a day. That in and of itself is an impressive feat, however, the source and final portion of the leak was elusive.

So he spent more effort and found a test case that reproduced with only Node APIs, which placed the fault right at Node.js' doorstep. He found that the more HTTP client requests he did, the more memory his Node process would consume, but it was really slow. Here's the graph from his initial data:

The top line is barely trending up, but it's enough to manifest in frustrating ways for Walmart. The good news is that Node.js is getting mature, the bad news is that a more mature Node.js means bugs are harder to find. So with the test client and server in hand, I started on my 3 week journey to track down this memory leak. Lucky for you, you're getting the abbreviated version here.

$ node server.js &$ UMEM_DEBUG="default,audit=200" node --expose-gc --always-compact standalone.js

First things first, I started the server and put it in the background since we were only interested in the client. Then I ran Node with UMEM_DEBUG set to record various important information about the memory allocations. This added a certain amount of overhead to the application, but since we were just using a test script it was completely safe. Also, we were exposing the GC so we could run it from inside the script, at the same time telling the GC to stay as compact as possible.

Since it took a while to manifest, I had a little script to grab interesting information periodically from the client process. Every hour, it grabbed the output of pmap -x and a core file and stored those in Joyent Manta so we didn’t have to worry about storage space.

bash#!/bin/bashset -xeDEST=~~/public/walmart.graphsmmkdir -p $DESTfor i in $(seq 1 100); do  pmap -x $1 | mput $DEST/pmap.$i  gcore $1  mput -f core.$1 $DEST/core.$i  mput -f symboltable.json $DEST/  sleep 3600done

Then I let this run for 15 hours. To start with, let's compare the output between the first pmap and the 15th pmap.

diffdiff -u pmap.1 pmap.15--- pmap.1      2013-11-21 01:42:27.929994800 +0000+++ pmap.15     2013-11-21 01:42:09.622017684 +0000@@ -4,7 +4,7 @@ 08040000      32      32      32       - rw---    [ stack ] 08050000   14944   14944       -       - r-x--  node 08EF7000      56      56      16       - rwx--  node-08F05000   15204   15104   15104       - rwx--    [ heap ]+08F05000  181288  180784  180784       - rwx--    [ heap ] 80199000       4       4       4       - rw---    [ anon ] 8019A000       4       4       4       - rw---    [ anon ] 8019B000       8       8       8       - rw---    [ anon ]@@ -15,12 +15,13 @@ 801D9000     256     256     256       - rw---    [ anon ] 80219000     512     512     512       - rw---    [ anon ] 86300000     100     100     100       - rw---    [ anon ]-86B00000    1024    1024    1024       - rw---    [ anon ]+86700000    1024     636     636       - rw---    [ anon ]+8A500000      36      36      36       - rw---    [ anon ]+8A50A000     980     484     484       - rwx--    [ anon ] 8F140000      64      64      64       - rw---    [ anon ] 92A00000      36      36      36       - rw---    [ anon ] 92A0A000     164     164     164       - rwx--    [ anon ] 95B00000     228     228     228       - rw---    [ anon ]-9A300000    1024     664     664       - rw---    [ anon ] 9C000000    4096    4096    4096       - rw---    [ anon ] 9C400000    2048    2048    2048       - rw---    [ anon ] 9C600000    1024    1024    1024       - rw---    [ anon ]@@ -29,12 +30,11 @@ 9CC00000    2048    2048    2048       - rw---    [ anon ] 9CE00000    1024    1024    1024       - rw---    [ anon ] 9CF00000    1024    1024    1024       - rw---    [ anon ]-A0600000      36      36      36       - rw---    [ anon ]-A060A000     980     484     484       - rwx--    [ anon ] ABE00000      36      36      36       - rw---    [ anon ] ABE0A000     384     384     384       - rwx--    [ anon ] AC300000     100     100     100       - rw---    [ anon ]-B0B00000    1024    1016    1016       - rw---    [ anon ]+AC400000    1024    1024    1024       - rw---    [ anon ]+B6200000    1024     712     712       - rw---    [ anon ] B8700000     292     292     292       - rw---    [ anon ] BE700000    1024    1024    1024       - rw---    [ anon ] FE2F0000      64      64      64       - rwx--    [ anon ]@@ -101,4 +101,4 @@ FEFFB000       8       8       8       - rwx-- FEFFD000       4       4       4       - rwx-- -------- ------- ------- ------- --------total Kb   58728   57764   39156       -+total Kb  224812  223112  204504       -

The lines marked [ anon ] are mmap()'d regions of memory. Most of these are regions that V8 has allocated for your JavaScript objects. The lines marked [ heap ] are what you might consider the "traditional" heap, memory that is allocated with malloc() or new in C++ parlance. What you can see is that the traditional heap is growing rapidly over the 15 hours, but the V8 heap isn't growing, instead it's just shifting where its memory is stored.

When I saw that I was leaking memory in the heap, I was excited, anticipating this was a classic memory leak. A leak that would happen because Node allocated memory, but neglected to free it. In MDB there's a particularly helpful command ::findleaks that will show you the memory addresses and the stack traces for leaked memory, not unlike using valgrind, but without all the performance penalty.

> ::findleaksBYTES             LEAKED VMEM_SEG CALLER4096                   4 8019a000 MMAP32768                  1 801a1000 MMAP16384                  1 8019d000 MMAP8192                   1 8019b000 MMAP------------------------------------------------------------------------           Total       4 oversized leaks, 61440 bytes

Unfortunately for me, 61440 bytes is drastically less than the 160 megabytes actually leaking in the process as indicated by pmap. Another useful advantage of running with UMEM_DEBUG enabled, means we can tap into the cache information that libumem stores to let you know what sizes of memory are being allocated. You can access a lot of this information by simply using the ::umastat command.

> ::umastatcache                        buf     buf     buf     buf  memory     alloc allocname                        size  in use  in ptc   total  in use   succeed  fail------------------------- ------ ------- ------- ------- ------- --------- -----...umem_alloc_2304             2304       3       -       5     12K         9     0umem_alloc_2688             2688       0       -       7     20K         4     0umem_alloc_4096             4096   16987       -   16987    133M     16998     0umem_alloc_4544             4544      14       -     192    864K   8136132     0umem_alloc_8192             8192       3       -       3     36K         5     0...

So at hour 15 we can see that there's 133MB allocated in the 4K cache. That's a significant amount of the memory we're trying to track down. Another feature of UMEM_DEBUG is that it keeps stack traces for those allocations, so I extended the venerable stackvis which can be used to generate flamegraphs for performance, but instead I used it to generate flamegraphs for the stack traces most responsible for the memory allocations in the 4k cache. You can see the full flamegraph, but below I've attached the interesting bit.

The majority of this memory is being allocated by a method called HandleScope::Extend deep inside V8. Basically a HandleScope is used to keep track of the JavaScript objects that were allocated in this stack frame (and above) such that when this scope is deallocated it marks JavaScript objects as available for garbage collection. When we inspect this data we see we're collecting a lot of objects, but they're all the Undefined representative object.

So what we knew at this point was that we were generating a lot of JavaScript objects, they were getting stored by some underlying HandleScope but that was never being deallocated such that those objects could be collected. We aren't seeing the V8 heap grow, because these are all singleton objects of Undefined.

I had been doing all this work on our v0.10 branch, so Trevor Norris decided to look into it on the master branch of Node. Trevor was able to find a much smaller reproducible case, using that test case, I could reuse the same UMEM_DEBUG techniques, which revealed the predominate stack trace in the 4k cache:

> ::umem_cache !grep 409609138010 umem_alloc_4096           020f 80000000     4096    148350914c010 umem_alloc_40960          020f 80000000    40960        3> 09138010::walk bufctl | ::bufctl -v...         c008810          c009000   47e1be7da09f7a                1                          9138010                0                0       `umem_cache_alloc_debug+0x144       `umem_cache_alloc+0x19a       `umem_alloc+0xcd       `malloc+0x2a       `operator new+0x29       `operator new[]+0x1d                 v8::internal::HandleScope::Extend+0x127                 v8::Isolate::GetCurrentContext+0x78                 node::NeedImmediateCallbackGetter+0x1e                 0xa0a660a3                 0xa0a62068                 0x90d1186c                 0xa0a64c54                 0xa0a28679                 0xa0a18c2a                 _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_Pb+0xe4                 v8::internal::Execution::Call+0x77                 v8::Function::Call+0xc9                 node::MakeCallback+0xc5                 _ZN4nodeL14CheckImmediateEP10uv_check_si+0xc7                 uv__run_check+0x31                 uv_run+0x10e                 node::Start+0x118                 main+0x1b                 _start+0x83...

This stack had very little surface area to weed through, and when I looked at the node::CheckImmediate function, I noticed it was missing a HandleScope. By adding a HandleScope the memory leak went away. Magic.

At this point we knew that we were looking for something in v0.10 that called MakeCallback but that didn't first have a HandleScope on the stack. I then worked up this simple DTrace script.

awk#!/usr/sbin/dtrace -spid$1::*HandleScopeC1*:entry{        self->cone++;}pid$1::*HandleScopeD1*:entry/self->cone/{        self->cone--;}pid$1::*HandleScopeC2*:entry{        self->ctwo++;}pid$1::*HandleScopeD2*:entry/self->ctwo/{        self->ctwo--;}pid$1::*node*MakeCallback*Handle*Object*String*Value*:entry/self->cone < 1 || self->ctwo < 1/{        @[jstack(200, 8000), probefunc] = count();}tick-10s{        printa(@);}

Here we trace all the constructors and destructors for HandleScope as well as MakeCallback, if we see that MakeCallback has been called but we don't have any HandleScopes on the stack, grab the stack trace and keep track of how many times we see that stack trace. Finally, every 10 seconds print out the stack traces we've seen. The result was pretty damning.

bash./handlescope.d $(pgrep -f standalone)dtrace: script './handlescope.d' matched 12 probesCPU     ID                    FUNCTION:NAME 18   4362                        :tick-10s              node`_ZN4node12MakeCallbackEN2v86HandleINS0_6ObjectEEENS1_INS0_6StringEEEiPNS1_INS0_5ValueEEE              node`_ZN4node10HandleWrap7OnCloseEP11uv_handle_s+0x83              node`uv_run+0x164              node`_ZN4node5StartEiPPc+0x17b              node`main+0x1b              node`_start+0x83  _ZN4node12MakeCallbackEN2v86HandleINS0_6ObjectEEENS1_INS0_6StringEEEiPNS1_INS0_5ValueEEE             3587

In 10 seconds, there was only one stack trace, and we saw that trace 3587 times. HandleWrap::OnClose was the culprit, it was calling MakeCallback without first adding a HandleScope to the stack. HandleWrap is an internal class used in Node to wrap file descriptors. That means that for every file descriptor that was closing, Node was potentially leaking 4 bytes of memory. So for someone like Walmart with lots of connections and lots of disconnects this was adding up to quite a lot of memory.

There's only good news from here, the fix landed in the v0.10 branch, and is available in version v0.10.22. We've also made a change in the master branch that will allow us to detect this type of bug much sooner, even if the error is coming from a misbehaving binary add-on.

I want to thank my fellow engineers here at Joyent: Josh Clulow, Dave Pacheco, Robert Mustacchi, and Bryan Cantrill all of whom were instrumental in the process of finding this leak. Also, thanks to Trevor Norris for finding the reduced test case for master which helped us cross the finish line. This was a team effort, and demonstrates what an incredible team we have at our disposal. So it may have taken 3 weeks to hunt this 4 byte memory leak down, but in the end, it was incredibly satisfying to finally nail it down.

You can experience the saga yourself as Eran and I walk through the set up, identification, and resolution of this memory leak.

Post written by TJ Fontaine