I was recently profiling a production Shopify application server using
perf and noticed a fair amount of time being spent in a particular function,
st_lookup, which is used by Ruby’s MRI implementation for hash table lookups:
Hash tables are used all over MRI, and not just for the
Hash object; global variables, instance variables, classes, and the garbage collector all use MRI’s internal hash table implementation,
st_table. Unfortunately, what this profile did not show were the callers of
st_lookup. Is this some application code that has gone wild? Is this an inefficiency in the VM?
perf is a great sampling profiler for Linux — it’s low overhead and can safely be used in production. However, up until a few years ago, in order to use the call-graph feature of
perf you had to recompile an application with
-fno-omit-frame-pointer to get usable stack traces. This gives the compiler one less register to work with, but I believe the trade-off is worth it in most cases. As of Linux 3.7,
perf now supports recording call graphs even when code is compiled without
-fno-omit-frame-pointer. This works by recording a snapshot of stack memory on each sample. When analyzing the profile later with
perf report, the stack data from each sample is combined with DWARF debugging information in order to build a call graph. This increases the amount of data included in the profile, but is a reasonable compromise when compared with having to recompile everything with
Now when running perf and collecting call graphs, I was able to see the callers of
From this profile, we can see that a large percentage of time is being spent in
rb_method_entry_get_without_cache. At first, I suspected this was being caused by global method cache invalidation or clearing. After using SystemTap on the
method__cache__clear probe and seeing a relatively low count, it was clear that this was not the case. At this point, I started digging into the MRI source, trying to understand what exactly happens when a method is called and how method caching actually works. Looking through the MRI source shows that effectively the only place that
rb_method_entry_get_without_cache is called is via
The idea here is that the global method cache will be checked first (via the
GLOBAL_METHOD_CACHE macro), and if the method is found, the method entry will be returned from the cache. If the method is not in the cache, a more expensive method lookup needs to be performed. This involves walking the class hierarchy to find the requested method, which can cause multiple invocations of
st_lookup to look up the method from a class method table.
GLOBAL_METHOD_CACHE macro performs a basic hash on the provided key to locate the entry in the global method cache. Looking at this code, I was surprised to see that the global method cache only had room for 2048 entries. Shopify is a large Rails application with millions of lines of Ruby (when you include gems), and defines hundreds of thousands of methods. A 2048-entry global method cache is nowhere near adequate for an application of our size.
Analyzing method cache hit rates
One of the first things I wanted to do was determine the existing method cache hit-to-miss ratio. One way of doing this would be to add tracing code to our MRI to log this data. This would require us to deploy a custom Ruby build to our app servers, which is not an ideal solution. There is a better way:
ftrace user probes.
ftrace is a lightweight tracing framework that is built into the Linux kernel. One of the best features of
ftrace is the ability to create dynamic user-mode event probes (uprobes). By adding a probe,
ftrace will fire an event whenever a specified function is called in a process. Better yet, you can even fire an event on the execution of a specific instruction. This was all I needed to collect method cache hit-to-miss numbers.
To configure a uprobe, you need to specify the path to the binary and the address of the instruction you want to trace. How do you figure out the address of the instruction? Using
The first field
objdump prints is the address of the function. Using this, I can now create a probe:
Now when I run Ruby, I’ll get an event every time
rb_method_entry is called:
What this tells me is that running Ruby with a simple “hello world” caused
rb_method_entry to be called 4441 times.
This is really cool! But what I originally wanted to figure out was the global method cache hit to miss ratio. By disassembling
rb_method_entry, we can figure out what code paths are executed in both hit and miss cases, which we can use to set the appropriate probes:
If you look at the C implementation of the function, you can see how it maps to the generated assembly. The relevant bits here are:
0x00000000001aa16a is the address that jumps to
rb_method_entry_get_without_cache. This is the path we take on a cache miss.
0x00000000001aa194 is the address of the return instruction that we’ll execute on a cache hit. Because cache misses perform an unconditional jump to
rb_method_entry_get_without_cache, we can infer that this instruction will only ever be executed on a hit.
With this information, we can set up our two probes, one for global method cache misses and one for hits:
And here are the results running the simple “hello world” script:
This is what we wanted to see! So for a simple hello world script, there were 727 method cache misses and 3714 hits, giving us a hit ratio of about 84%. Now that we have a method of calculating the hit rate, let’s see what it looks like in production!
Rather than have to execute these commands every time I wanted to take a sample, I wrote a small shell script to do it for me:
The script takes a binary of the
ruby you want to profile and the number of seconds to collect data. After the number of seconds has elapsed, it will display the number of cache hits and misses, along with the hit rate.
So, how does this look in production? With the default method cache size of 2048 entries, we get about a 90% hit rate. This is pretty good, but according to the results I saw in
perf, it could definitely be better.
The global method cache size is configured as a compile-time
GLOBAL_METHOD_CACHE_SIZE. I thought it would be useful to be able to configure the method cache size at runtime, like you can do with garbage collector parameters (
RUBY_GC_* environment variables). To that end, I added a new environment variable,
RUBY_GLOBAL_METHOD_CACHE_SIZE, that can be used to configure the global method cache size at process start time. This code has been committed upstream and is available in Ruby 2.2.
Now that I had a way to dynamically configure the method cache size, I could run some tests to see how Shopify performed using various cache sizes. I ran the
trace_rb_method_cache.sh script on one of our production servers for 60 seconds with a few different cache sizes and collected the results:
From these numbers, once we get around 64K it looks like the hit rates begin to level off.
Now, if we do a
perf before and after tuning, we can see some respectable results:
2K method cache
128K method cache
This change gives us a cycle savings of about 3%. Not bad for changing one configuration value!
Using the system-level profiling tools,
, I was able to find a performance issue that would have never been visible with Ruby-level tooling. These tools are available on any modern Linux distribution, and I encourage you to experiment with them on your own application to see what kind of benefits can be had!
ftrace user probes