TL;DR

Catastrophe! Your app is leaking memory. When it runs in production it crashes and starts raising Errno::ENOMEM exceptions. So you babysit it, restarting it regularly so that it keeps responding.

As hard as you try you don’t see any memory leaks. You use the available tools, but you can’t find the leak. Understanding your full stack, knowing your tools, and good ol’ debugging will help you find that memory leak.

Memory leaks are good?

Yes! Depending on your definition. A memory leak is any memory that is allocated, but never freed. This is the basis of anything global in your programs. 

In a Ruby program, global variables are allocated but will never be freed. The same goes for constants: any constant you define will be allocated and never freed. Without these things we couldn’t be very productive Ruby programmers.

But there’s a bad kind

The bad kind of memory leak involves some memory being allocated and never freed, over and over again. For example, if a constant is appended to each time a web request is made to a Rails app, that's a memory leak: the constant will never be freed, so its memory consumption will only grow and grow.
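Here is a minimal sketch of that pattern in plain Ruby. The REQUEST_LOG constant and the handle_request method are made up for illustration; they stand in for a Rails action and are not code from our app.

```ruby
# Hypothetical stand-in for a Rails app; all names here are illustrative.
REQUEST_LOG = [] # a constant: allocated once, never freed

def handle_request(path)
  # Appending to the constant on every request is the "bad" kind of leak:
  # each entry stays reachable forever, so the GC can never reclaim it.
  REQUEST_LOG << "#{Time.now.to_f} #{path}"
  "200 OK"
end

1_000.times { |i| handle_request("/products/#{i}") }
puts REQUEST_LOG.size # one more entry per request, and it never shrinks
```

Nothing here is wrong from the computer's point of view. The array is still referenced by the constant, so as far as the garbage collector is concerned every entry is still in use.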

Separating the good and the bad

Unfortunately, there’s no easy way to separate the good memory leaks from the bad ones. The computer can see that you’re allocating memory, but, as always, it doesn’t understand what you’re trying to do, so it doesn’t know which memory leaks are unintentional.

To make matters muddier, the computer can’t differentiate between a memory leak in Ruby-land and a memory leak in C-land. It’s all just memory.

If you’re using a C extension that’s leaking memory there are tools specific to the C language that can help you find memory leaks (Valgrind). If you have Ruby code that is leaking memory there are tools specific to the Ruby language that can help you (memprof). Unfortunately, if you have a memory leak in your app and have no idea where it’s coming from, selecting a tool can be really tough.

How bad can memory leaks get?

This begins the story of a rampant memory leak we experienced at Shopify at the beginning of this year. Here’s a graph showing the memory usage of one of our app servers during that time.


You can see that memory consumption continues to grow unhindered as time goes on! Those first two spikes which break the 16G mark show that memory consumption climbed above the limit of physical memory on the app server, so we had to rely on the swap. With that large spike the app actually crashed, raising Errno::ENOMEM errors for our users.

After that you can see many smaller spikes. We wrote a script to periodically reboot the app, which releases all of the memory it was using. This was obviously not a sustainable solution. Case in point: the last spike on the graph shows that we had an increase in traffic which resulted in memory usage growing beyond the limits of physical memory again.

So, while all this was going on we were searching high and low to find this memory leak.

Where to begin?

The golden rule is to make the leak reproducible. Like any bug, once you can reproduce it you can surely fix it. For us, that meant a couple of things:

  1. When testing, reproduce your production environment as closely as possible. Run your app in production mode on localhost, set up the same stack that you have on production. Ensure that you are running the same exact versions of the software that is running on production.

  2. Be aware of any issues happening on production. Are there any known issues with the production environment? Losing connections to the database? Firewall routing traffic properly? Be aware of any weird stuff that’s happening and how it may be affecting your problem.

Memprof

Now that we’ve laid out the basics at a high level, we’ll dive into a tool that can help you find memory leaks.

Memprof is a memory profiling tool built by ice799 and tmm1. Memprof does some crazy stuff like rewriting the current Ruby binary at runtime to hot patch features like object allocation tracking. Memprof can do stuff like tell you how many objects are currently alive in the Ruby VM, where they were allocated, what their internal state is, etc.

VM Dump

The first thing that we did when we knew there was a problem was to reach into the toolbox and try out memprof. This was my first experience with the tool. My only exposure to the tool had been a presentation by @tmm1 that detailed some heavy duty profiling by dumping every live object in the Ruby VM in JSON format and using MongoDB to perform analysis.

Without any other leads we decided to try this method. After hitting our staging server with some fake traffic we used memprof to dump the VM to a JSON file. An important note is that we did not reproduce the memory leak on our staging server; we just took a look at the dump file anyway.

Our dump of the VM came out at about 450MB of JSON. We loaded it into MongoDB and did some analysis. We were surprised by what we found. There were well over 2 million live objects in the VM, and it was very difficult to tell at a glance which should be there and which should not.
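To give a feel for this kind of analysis, here is a rough sketch that groups the objects in a dump by class and allocation site. It uses plain Ruby rather than MongoDB, and it assumes the dump is one JSON object per line with "class_name", "type", "file" and "line" fields. Those field names, the summarize_dump helper, and the sample data are assumptions for illustration, so check them against your own dump.

```ruby
require "json"

# Count live objects per [class, allocation site] from a line-delimited
# JSON dump. The field names used below are assumptions about the dump
# format; adjust them to match what your dump actually contains.
def summarize_dump(lines)
  counts = Hash.new(0)
  lines.each do |raw|
    raw = raw.strip
    next if raw.empty?

    obj = JSON.parse(raw)
    kind = obj["class_name"] || obj["type"] || "unknown"
    site = obj["file"] ? "#{obj["file"]}:#{obj["line"]}" : "(no allocation site)"
    counts[[kind, site]] += 1
  end
  counts.sort_by { |_key, n| -n } # most numerous first
end

# Made-up sample records, not real dump output.
sample = [
  '{"type":"proc","class_name":"Proc","file":"callbacks.rb","line":10}',
  '{"type":"proc","class_name":"Proc","file":"callbacks.rb","line":10}',
  '{"type":"string","class_name":"String","file":"app.rb","line":3}'
]

summarize_dump(sample).first(10).each do |(kind, site), n|
  puts "#{n}\t#{kind}\t#{site}"
end
```

For a real dump you could pass File.foreach("dump.json") (a hypothetical path) instead of the sample array. A report like this tells you what there is a lot of, but not which of it is supposed to be there.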

As mentioned earlier, there are some objects that you want to ‘leak’, and this is especially true when it comes to Rails. For instance, Rails uses ActiveSupport::Callbacks in many key places, such as ActiveRecord callbacks or ActionController filters. We had tons of Proc objects created by ActiveSupport::Callbacks in our VM, but these were all things that needed to stick around in order for Shopify to function properly.

This was too much information, with not enough context, for us to do anything meaningful with.

Memprof stats

More useful, in terms of context, is having a look at Memprof.stats and the middleware that ships with Memprof. Using these you can get an idea of what is being allocated during the course of a single web request, and ultimately how that changes over time. It’s all about noticing a pattern of live objects growing over time without stopping.
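The idea of comparing live objects before and after a unit of work can be sketched with Ruby's built-in ObjectSpace. To be clear, this is a stand-in and not memprof's API: live_object_delta and RETAINED are hypothetical names, and the block plays the role of a single web request.

```ruby
# Report how the number of live objects of each internal type changed
# while the block ran. Uses only ObjectSpace and GC from core Ruby.
def live_object_delta
  GC.start # collect garbage first so we mostly count retained objects
  before = ObjectSpace.count_objects
  yield
  GC.start
  after = ObjectSpace.count_objects

  after.each_with_object({}) do |(type, count), delta|
    next if type == :TOTAL || type == :FREE # heap bookkeeping, not objects

    diff = count - before.fetch(type, 0)
    delta[type] = diff unless diff.zero?
  end
end

RETAINED = [] # hypothetical leak: strings that stay reachable forever

delta = live_object_delta do
  10_000.times { |i| RETAINED << "request #{i}" }
end
puts delta.inspect
```

Run around every request and logged, a delta like this makes the pattern visible: a count that keeps climbing request after request and never comes back down is worth a closer look.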

memprof.com

The other useful tool we used was memprof.com. It allows you to upload a JSON VM dump (via the memprof gem) and analyse it using a slick web interface that picks up on patterns in the data and shows relevant reports. It has since been taken offline and open sourced by tmm1 at https://github.com/tmm1/memprof.com.

Unable to reproduce our memory leak on development or staging, we decided to run memprof on one of our production app servers. We were only able to put it in rotation for a few minutes because it increased response time by 1000% due to the modifications made by memprof. The memory leak that we were experiencing would typically take a few hours to show itself, so we weren’t sure if a few minutes of data would be enough to notice the pattern we were looking for.

We uploaded the JSON dump to memprof.com and started using the web UI to look for our problem. Different people on the team got involved and, as I mentioned earlier, this data can be confusing. After seeing the huge number of Proc objects from ActiveSupport::Callbacks, some claimed that “ActiveSupport::Callbacks is obviously leaking objects on every request”. Unfortunately it wasn’t that simple and we weren’t able to find any patterns using memprof.com.

Good ol’ debugging: Hunches & Teamwork

Unable to make progress using these approaches, we were back to square one. I began testing locally again and, through spying on Activity Monitor, thought that I noticed a pattern emerging. So I double-checked that I had the same software stack running as our production environment, and then the pattern disappeared.

It was odd, but I had a hunch that it had something to do with a bad connection to memcached. I shared my hunch with @wisqnet and he started doing some testing of his own. We left our chat window open as we were testing and shared all of our findings.

This was immensely helpful because we could both begin tracing patterns between each other's results. Eventually we tracked down a pattern. If we consistently hit a URL we could see the memory usage climb and never stop. We eventually boiled it down to a single line of code:

loop { Rails.cache.write(rand(10**10).to_s, rand(10**10).to_s) }

If we ran that code in a console and then shut down the memcached instance it was using, memory usage immediately spiked.
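We watched memory by eye, but a small harness can make the same observation repeatable. The sketch below is an assumption-laden example rather than what we actually ran: rss_kb and watch_memory are hypothetical helpers, and the leaky_store array stands in for the suspect call so that the example runs without Rails or memcached.

```ruby
# Resident set size of the current process, in kilobytes.
# Reads /proc on Linux and falls back to `ps` elsewhere (e.g. macOS).
def rss_kb
  status = "/proc/#{Process.pid}/status"
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

# Call the block `iterations` times, recording memory usage before the
# first call and again after every `sample_every` calls.
def watch_memory(iterations:, sample_every:)
  samples = [rss_kb]
  iterations.times do |i|
    yield i
    samples << rss_kb if ((i + 1) % sample_every).zero?
  end
  samples
end

# Hypothetical stand-in for the code under suspicion. In a Rails console
# you would put the suspect call (such as the cache write) in the block.
leaky_store = []
samples = watch_memory(iterations: 50_000, sample_every: 10_000) do
  leaky_store << ("x" * 1024)
end

puts samples.map { |kb| "#{kb / 1024} MB" }.join(" -> ")
```

A series of samples that only ever goes up is the signal to look for. It also gives you a quick way to confirm a fix: run the same harness again and check that the numbers level off.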

Now What?

Now that it was reproducible we were able to experiment with fixing it. We tracked the issue down to our memcached client library. We immediately switched libraries and the problem disappeared in production. We let the library author know about the issue and he had it fixed in hours. We switched back to our original library and all was well!


Finally

It turned out that the memory leak was happening in a C extension, so the Ruby tools would not have been able to find the problem.

Three pieces of advice to anyone looking for a memory leak:

  1. Make it reproducible!
  2. Trust your hunches, even if they don’t make sense.
  3. Work with somebody else. Bouncing your theories off of someone else is the most helpful thing you can do.