"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan
Debugging is always challenging, and as programmers we can easily spend a good chunk of every day just trying to figure out what is going on with our code. Where exactly has a method been overwritten or defined in the first place? What does the inheritance chain look like for this object? Which methods are available to call from this context?
This article will take you through some under-utilized convenience methods in Ruby which will make answering these questions a little easier.
As we have for the past 3 years, Shopify released a Year in Review to highlight some of the exciting growth and change we’ve observed over the past year. Designers James and Veronica had ambitious ideas for this year’s review, including strong, bold typographic treatments and interactive data visualizations. We’ve gotten some great feedback on the final product, as well as some curious developers wondering how we pulled it off, so we’re going to review the development process for Year in Review and talk about some of the technologies we leveraged to make it all happen.
Black Friday and Cyber Monday are the biggest days of the year at Shopify with respect to every metric. As the Infrastructure team started preparing for the upcoming seasonal traffic in the late summer of 2014, we were confident that we could cope, and determined resiliency to be the top priority. A resilient system is one that functions with one or more components being unavailable or unacceptably slow. Applications quickly become intertwined with their external services if not carefully monitored, leading to minor dependencies becoming single points of failure.
For example, the only part of Shopify that relies on the session store is user sign-in - if the session store is unavailable, customers can still purchase products as guests. Any other behaviour would be an unfortunate coupling of components. This post is an overview of the tools and techniques we used to make Shopify more resilient in preparation for the holiday season.
I was recently profiling a production Shopify application server using perf and noticed a fair amount of time being spent in a particular function, st_lookup, which is used by Ruby's MRI implementation for hash table lookups:
Hash tables are used all over MRI, and not just for the Hash object; global variables, instance variables, classes, and the garbage collector all use MRI's internal hash table implementation, st_table. Unfortunately, what this profile did not show was the callers of st_lookup. Is this some application code that has gone wild? Is this an inefficiency in the VM?
perf is a great sampling profiler for Linux: it's low overhead and can safely be used in production. However, up until a few years ago, in order to use the call-graph feature of perf you had to recompile an application with -fno-omit-frame-pointer to get usable stack traces. This gives the compiler one less register to work with, but I believe the trade-off is worth it in most cases. As of Linux 3.7, perf supports recording call graphs even when code is compiled without -fno-omit-frame-pointer. This works by recording a snapshot of stack memory on each sample. When analyzing the profile later with perf report, the stack data from each sample is combined with DWARF debugging information in order to build a call graph. This increases the amount of data included in the profile, but is a reasonable compromise when compared with having to recompile everything with -fno-omit-frame-pointer.
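In practice, the DWARF-based mode is selected with the --call-graph flag. A typical session might look like the following (the pid is a placeholder for the application server process being profiled):

```shell
# Sample the target process for 30 seconds, recording a stack memory
# snapshot with each sample so call graphs can be reconstructed later
# from DWARF debug info (no -fno-omit-frame-pointer rebuild needed).
perf record --call-graph dwarf -p <pid> -- sleep 30

# Unwind the recorded stacks and browse the resulting call graph.
perf report
```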
Now when running perf and collecting call graphs, I was able to see the callers of st_lookup:
From this profile, we can see that a large percentage of time is being spent in rb_method_entry_get_without_cache. At first, I suspected this was being caused by global method cache invalidation or clearing. After using SystemTap on the method__cache__clear probe and seeing a relatively low count, it was clear that this was not the case. At this point, I started digging into the MRI source, trying to understand what exactly happens when a method is called and how method caching actually works. Looking through the MRI source shows that effectively the only place rb_method_entry_get_without_cache is called is via rb_method_entry:
The idea here is that the global method cache will be checked first (via the GLOBAL_METHOD_CACHE macro), and if the method is found, the method entry will be returned from the cache. If the method is not in the cache, a more expensive lookup needs to be performed. This involves walking the class hierarchy to find the requested method, which can cause multiple invocations of st_lookup to look up the method from a class method table. The GLOBAL_METHOD_CACHE macro performs a basic hash lookup on the provided key to locate the entry in the global method cache. Looking at this code, I was surprised to see that the global method cache only had room for 2048 entries. Shopify is a large Rails application with millions of lines of Ruby (when you include gems), and it defines hundreds of thousands of methods. A 2048-entry global method cache is nowhere near adequate for an application of our size.
Analyzing method cache hit rates
One of the first things I wanted to do was determine the existing method cache hit-to-miss ratio. One way of doing this would be to add tracing code to our MRI to log this data, but that would require deploying a custom Ruby build to our app servers, which is not an ideal solution. There is a better way: ftrace user probes.
ftrace is a lightweight tracing framework that is built into the Linux kernel. One of its best features is the ability to create dynamic user-mode event probes (uprobes). By adding a probe, ftrace will fire an event whenever a specified function is called in a process. Better yet, you can even fire an event on the execution of a specific instruction. This was all I needed to collect method cache hit-to-miss numbers.
To configure a uprobe, you need to specify the path to the binary and the address of the instruction you want to trace. How do you figure out the address of the instruction? Using objdump:
The first field objdump prints is the address of the function. Using this, I can now create a probe:
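Roughly, creating and enabling such a probe through the ftrace uprobe interface looks like this (the binary path and address are illustrative; the address must come from your own objdump output):

```shell
# Register a uprobe named ruby_mentry on the rb_method_entry function
# (offset taken from objdump output; example value only).
echo 'p:ruby_mentry /usr/bin/ruby:0x1aa0b0' >> /sys/kernel/debug/tracing/uprobe_events

# Enable the event and watch it fire live.
echo 1 > /sys/kernel/debug/tracing/events/uprobes/ruby_mentry/enable
cat /sys/kernel/debug/tracing/trace_pipe
```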
Now when I run Ruby, I'll get an event every time rb_method_entry is called:
What this tells me is that running Ruby with a simple "hello world" caused rb_method_entry to be called 4441 times.
This is really cool! But what I originally wanted to figure out was the global method cache hit-to-miss ratio. By disassembling rb_method_entry, we can figure out which code paths are executed in the hit and miss cases, and use that to set the appropriate probes:
If you look at the C implementation of the function, you can see how it maps to the generated assembly. The relevant bits here are:
- 0x00000000001aa16a is the address of the jump to rb_method_entry_get_without_cache. This is the path we take on a cache miss.
- 0x00000000001aa194 is the address of the return instruction that we'll execute on a cache hit. Because cache misses perform an unconditional jump to rb_method_entry_get_without_cache, we can infer that this instruction will only ever be executed on a hit.
With this information, we can set up our two probes, one for global method cache misses and one for hits:
And here are the results running the simple “hello world” script:
This is what we wanted to see! So for a simple hello world script, there were 727 method cache misses and 3714 hits, giving us a hit ratio of about 84%. Now that we have a method of calculating the hit rate, let’s see what it looks like in production!
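The ratio falls straight out of the two counters; as a quick check in Ruby:

```ruby
# Counters from the "hello world" trace above.
hits = 3714
misses = 727

# Hit rate is hits over total lookups.
hit_rate = hits.fdiv(hits + misses)
puts format("hit rate: %.1f%%", hit_rate * 100) # => hit rate: 83.6%
```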
Rather than have to execute these commands every time I wanted to take a sample, I wrote a small shell script to do it for me:
The script takes the ruby binary you want to profile and the number of seconds to collect data for. After that time has elapsed, it will display the number of cache hits and misses, along with the hit rate.
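A minimal sketch of what such a script might look like (illustrative only: the two instruction offsets are placeholders that must be recomputed with objdump for the ruby binary being traced):

```shell
#!/bin/sh
# Usage: trace_rb_method_cache.sh /path/to/ruby SECONDS
RUBY_BIN=$1
DURATION=$2
TRACE_DIR=/sys/kernel/debug/tracing

# One probe on the cache-miss jump, one on the cache-hit return
# (example offsets; recompute these for your binary).
echo "p:rb_mc_miss ${RUBY_BIN}:0x1aa16a" >> "$TRACE_DIR/uprobe_events"
echo "p:rb_mc_hit ${RUBY_BIN}:0x1aa194" >> "$TRACE_DIR/uprobe_events"
echo 1 > "$TRACE_DIR/events/uprobes/enable"

sleep "$DURATION"

echo 0 > "$TRACE_DIR/events/uprobes/enable"
HITS=$(grep -c rb_mc_hit "$TRACE_DIR/trace")
MISSES=$(grep -c rb_mc_miss "$TRACE_DIR/trace")
echo "hits: $HITS  misses: $MISSES"
awk -v h="$HITS" -v m="$MISSES" 'BEGIN { printf "hit rate: %.1f%%\n", 100 * h / (h + m) }'

# Remove the probes when done.
echo > "$TRACE_DIR/uprobe_events"
```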
So, how does this look in production? With the default method cache size of 2048 entries, we get about a 90% hit rate. This is pretty good, but according to the results I saw in perf, it could definitely be better.
The global method cache size is configured at compile time via the GLOBAL_METHOD_CACHE_SIZE define. I thought it would be useful to be able to configure the method cache size at runtime, like you can with the garbage collector parameters (the RUBY_GC_* environment variables). To that end, I added a new environment variable, RUBY_GLOBAL_METHOD_CACHE_SIZE, that can be used to configure the global method cache size at process start time. This code has been committed upstream and is available in Ruby 2.2.
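With the variable in place, raising the cache size is just a matter of exporting it before the process starts; for example (the server command here is illustrative):

```shell
# Start the app with a 128K-entry global method cache instead of the
# default 2048 entries (Ruby 2.2+).
RUBY_GLOBAL_METHOD_CACHE_SIZE=131072 bundle exec unicorn -c config/unicorn.rb
```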
Now that I had a way to dynamically configure the method cache size, I could run some tests to see how Shopify performed with various cache sizes. I ran the trace_rb_method_cache.sh script on one of our production servers for 60 seconds with a few different cache sizes and collected the results:
| Cache Size | Hits | Misses | Hit Rate |
|------------|------|--------|----------|
From these numbers, it looks like the hit rate begins to level off once the cache reaches around 64K entries.
Now, if we run perf before and after tuning, we can see some respectable results:
2K method cache
128K method cache
This change gives us a cycle savings of about 3%. Not bad for changing one configuration value!
Using the system-level profiling tools perf and ftrace, I was able to find a performance issue that would never have been visible with Ruby-level tooling. These tools are available on any modern Linux distribution, and I encourage you to experiment with them on your own application to see what kind of benefits can be had!
This is the second in a series of blog posts describing our evolution of Shopify toward a Docker-powered, containerized data center. This instalment will focus on the creation of the container used in our production environment when you visit a Shopify storefront.
Read the first post in this series here.
Before we dive into the mechanics of building containers, let's discuss motivation. Containers have the potential to do for the datacenter what consoles did for gaming. In the early days of PC gaming, each game typically required video or sound driver massaging before you got to play. Gaming consoles however, offered a different experience:
- predictable: cartridges were self-contained fun: always ready to run, with no downloads or updates.
- fast: cartridges used read-only memory for lightning-fast speeds.
- easy: cartridges were robust and largely child-proof - they were quite literally plug-and-play.
Predictable, fast, and easy are all good things at scale. Docker containers provide the building blocks to make our data centers easier to run and more adaptable by placing applications into self-contained, ready-to-run units much like cartridges did for console games.
This September, we quietly launched a new version of the Shopify admin. Unlike the launch of the previous major iteration of our admin, this version did not include a major overhaul of the visual design and, for the most part, went largely unnoticed by users.
Why would we rebuild our admin without providing any noticeable differences to our users? At Shopify, we strongly believe that any decision should be open to questioning at any time. In late 2012, we started to question whether our framework was still working for us. This post discusses the problems with the previous version of our admin, and how we decided that it was time to switch frameworks.
This is the first in a series of posts about adding containers to our server farm to make it easier to scale, manage, and keep pace with our business.
The key ingredients are:
- Docker: container technology for making applications portable and predictable
- CoreOS: provides a minimal operating system, systemd for orchestration, and Docker to run containers
Shopify is a large Ruby on Rails application that has undergone massive scaling in recent years. Our production servers are able to scale to over 8,000 requests per second by spreading the load across 1700 cores and 6 TB RAM.
In the early fall our infrastructure team was considering Kafka, a highly available message bus. We were looking to solve several infrastructure problems that had come up around that time.
- We were looking for a reliable way to collect event data and send it to our data warehouse.
- We were considering a more service-oriented architecture, and needed a standardized way of passing messages between components.
- We were starting to evaluate containerization of Shopify, and were searching for a way to get logs out of containers.
We were intrigued by Kafka due to its highly available design. However, Kafka runs on the JVM, and its primary user, LinkedIn, runs a full JVM stack. Shopify is mainly Ruby on Rails and Go, so we had to figure out how to integrate Kafka into our infrastructure.
A recent phenomenon has taken the tech world by storm: Dogecoin. Though goofy and grammatically unique, Dogecoin has proven to be an incredible force for good in the world through initiatives like The Dogecoin Foundation.
For Shopify Hackdays, then, the development team at Shopify took it upon themselves to make a gentlepeople's wager against the Business Development and Talent Acquisition teams at Shopify that the Dev team could raise more money in Dogecoin than the so-called hustlers could by starting a Shopify business. Nothing like a good old-fashioned competition to raise some money for charity.
With all this said, Hackers vs Hustlers 2014 has started, and we could use your help getting all the doge possible in the hands of our charity Doge wallet! The hackers at Shopify have got every server we can find mining doge: the whole hadoop cluster, every beefy box with GPUs, a bunch of mac minis, and even the Raspberry Pis which power our office dashboards. We're mining lots, but maybe not enough to overtake the hustlers by the end. We'd like your help!
The trick is, the rules strictly prohibit donations of any sort, so we can't just ask for doge directly. We can, however, just so happen to leave these mining pool credentials lying around, and really it is definitely okay with us if anyone out there wanted, out of the goodness of their own heart, to contribute to our mining efforts.
Pool URL: stratum+tcp://pool.teamdoge.com:3333
Worker Username: DataEng.TechBlog
Worker Password: iRAKDHJksM77Mf
All proceeds from both the hustlers' Shopify store and the hackers' mining efforts will be donated to the CompuCorps TECHYOUTH program, which gives children in low-income families the opportunity to learn technology skills and eventually get jobs in the technology field!
Doge donations (which won't count for the competition, but will still go to CompuCorps) can be sent to this Dogecoin address: DM6xAdYmjMZd8eBNqZbse9cbGDRGb1ivfP. Much thanks, many wow, very generous.
I'm Chris Saunders, one of Shopify's developers. I like to keep journal entries about the problems I run into while working on the various codebases within the company.
Recently we ran into an issue with authentication in one of our applications, and as a result I ended up learning a bit about Rack middleware. I feel the experience was worth sharing with the world at large, so here is a rough transcription of my entry. Enjoy!
I'm looking at invalid form submissions for users who were trying to log in via their Shopify stores. The issue was actually at a middleware level, since we were passing invalid data off to OmniAuth which would then choke because it was dealing with invalid URIs.
The bug in particular was that we were generating the shop URL based on the data the user submitted. Normally we'd expect something like mystore.myshopify.com or simply mystore, but of course forms can be confusing, and people put stuff in there like http://mystore.myshopify.com or, even worse, my store. We'd build up a URL and end up passing something like https://http::/mystore.myshopify.com.myshopify.com, causing an exception to be raised.
Another caveat is that we aren't able to even sanitize the input before passing it off to OmniAuth, unless we were to add more code to the lambda that we pass into the setup initializer.
Adding more code to an initializer is definitely less than optimal, so we figured that we could implement this in a better way: adding a middleware to run before OmniAuth such that we could attempt to recover the bad form data, or simply kill the request before we get too deep.
We took a bit of time to learn about how Rack middlewares work, and looked to the OmniAuth code for inspiration since it provides a lot of pluggability and is what I'd call a good example of how to build out easily extendable code.
We decided that our middleware would be initialized with a series of routes to run a bunch of sanitization strategies on. Based on how OmniAuth works, I gleaned that the arguments after config.use MyMiddleWare would be passed into the middleware during the initialization phase - perfect! We whiteboarded a solution that would work as follows:
Now that we had a goal we just had to implement it. We started off by building out the strategies since that was extremely easy to test. The interface we decided upon was the following:
We decided that the actions would be destructive, so instead of creating a new Rack::Request at the end of our strategies call, we'd change values on the object directly. This simplifies things a little, but we need to be aware that the order of operations might set some of our keys to nil, and we have to anticipate that.
The simplest of sanitizers we'd need is one that cleans up our whitespace. Because we are building these for .myshopify.com domains we know the convention they follow: dashes are used as separators between words if the shop was created with spaces. For example, if I signed up with my super awesome store when creating a shop, that would be converted into my-super-awesome-store. So if a user accidentally put in my super awesome store we can totally recover that!
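A sketch of that whitespace sanitizer might look like the following (the class name and parameter key are illustrative, not the production code): a strategy receives the request and rewrites the shop parameter in place.

```ruby
# Illustrative sanitization strategy: collapses whitespace in the shop
# name into the dash separators the .myshopify.com convention uses,
# e.g. "my super awesome store" => "my-super-awesome-store".
class WhitespaceSanitizer
  def self.call(request)
    shop = request.params["shop"]
    return if shop.nil?

    # Destructively update the request, per our design decision above.
    request.update_param("shop", shop.strip.gsub(/\s+/, "-"))
  end
end
```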
Now that we have a sanitization strategy written up, let's work on our actual middleware implementation.
According to the Rack spec, all we really need to do is ensure that we return the expected result: an array consisting of three things - a response code, a hash of headers, and an iterable that represents the content body. An example of the most basic Rack response is:
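Something like this (the header and body values are arbitrary):

```ruby
# Status, headers, and an enumerable body: the three-element array
# every Rack-compliant endpoint returns.
response = [200, { "Content-Type" => "text/plain" }, ["Hello, Rack!"]]
```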
Per the Rack spec, middlewares are always initialized with the Rack app as the first argument, followed by whatever else was passed to use. So let's get to the actual implementation:
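A minimal sketch of such a middleware (all names here are illustrative; this sketch passes the raw env hash to each strategy, where the real code handed strategies a Rack::Request):

```ruby
# Illustrative middleware: `routes` maps path prefixes to the
# sanitization strategies that should run before the rest of the stack.
class SanitizerMiddleware
  def initialize(app, routes = {})
    @app = app
    @routes = routes
  end

  def call(env)
    # Mutate the request in place, then hand it down the stack.
    strategies_for(env["PATH_INFO"]).each { |strategy| strategy.call(env) }
    @app.call(env)
  end

  private

  # All strategies registered under a prefix of the requested path.
  def strategies_for(path)
    @routes.select { |prefix, _| path.start_with?(prefix) }.values.flatten
  end
end
```

Registered with something like use SanitizerMiddleware, "/auth" => [WhitespaceSanitizer], any request under /auth would get its parameters cleaned up before OmniAuth ever sees them.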
That's pretty much it! We've written up a really simple middleware that takes care of cleaning up bad user input - which isn't necessarily a sign of bad intent. People make mistakes, and we should try as much as possible to react to that data in a way that isn't jarring to the users of our software.