StatsD at Shopify

Here at Shopify, we like data. One of the many tools in our data toolbox is StatsD. We've been using StatsD in production at Shopify for many months now, consistently sending multiple events to our StatsD instance on every request.

What is StatsD good for?

In my experience, there are two things that StatsD really excels at. The first is giving a high-level overview of some custom piece of data. We use NewRelic to tell us about the performance of our apps. It provides a great overview of our performance as a whole, even down to which of our controller actions are slowest. It also has an API for custom instrumentation, but I've never used it; for custom metrics we use StatsD.

We use lots of memcached, and one metric we track with StatsD is cache hits vs. cache misses on our frontend. On every request that hits a cacheable action we send an event to StatsD to record a hit or miss. 

Caching Baseline (Green: cache hits, Blue: cache misses)



Note: The graphs in this article were generated by Graphite, the real-time graphing system that StatsD runs on top of.

As an example of how this is useful, we recently added some data to a cache key that wasn't properly converted to a string, so that piece of the key was appearing to be unique far more often than it was. The net result was more cache misses than usual. Looking at our NewRelic data we could see that performance was affected, but it was difficult to see exactly where. The response time from our memcached servers was still good, the response time from the app was still good, but our number of cache misses had doubled, our number of cache hits had halved, and overall user-facing performance was down.
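To make that class of bug concrete, here is a hypothetical sketch. The `Locale` class and `cache_key` helper are invented for illustration and are not Shopify's actual code; the point is only that an object interpolated into a key without an explicit string conversion can make every key look unique.

```ruby
# Hypothetical value object; not from Shopify's codebase.
class Locale
  attr_reader :code

  def initialize(code)
    @code = code
  end
end

# Hypothetical key builder: joins the parts with "/".
# Array#join falls back to calling to_s on each part.
def cache_key(*parts)
  parts.join("/")
end

first  = Locale.new("en")
second = Locale.new("en")

# Buggy: the default Object#to_s embeds the object's identity
# (something like "#<Locale:0x0000...>"), so two logically identical
# requests produce different keys and both miss the cache.
buggy_a = cache_key("products", 42, first)
buggy_b = cache_key("products", 42, second)

# Fixed: convert the value to a stable string before it goes into the key.
fixed_a = cache_key("products", 42, first.code)
fixed_b = cache_key("products", 42, second.code)
```

With the buggy keys, `buggy_a` and `buggy_b` differ even though they describe the same content; the fixed keys are identical, so the second request can be served from the cache.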

A problem



It wasn't until we looked at our StatsD graphs that we fully understood the problem. Looking at our caching trends over time we could clearly see that on a specific date something was introduced that was affecting caching negatively. With a specific date we were able to track down the git commit and fix the issue. Keeping an eye on our StatsD graphs we immediately saw the behaviour return to the normal trend.

Return to Baseline


The second thing that StatsD excels at is proving assumptions. When we're writing code we're constantly making assumptions. Assumptions about how our web app may be used, assumptions about how often an interaction will be performed, assumptions about how fast a particular operation may be, assumptions about how successful a particular operation may be. Using StatsD it becomes trivial to get real data about this stuff.

For instance, we push a lot of products to Google Product Search on behalf of our customers. There was a point where I was seeing an abnormally high number of failures returned from Google when we were posting these products via their API. My first assumption was that something was wrong at the protocol level and most of our API requests were failing. I could have done some digging around in the database to get an idea of how many failures we were getting, cross-referenced with how many products we were trying to publish and how frequently, etc. But using our StatsD client (see below) I was able to add a simple success/failure metric to give me a high-level overview of the issue. Looking at the graph from StatsD I could see that my assumption was wrong, so I was able to eliminate that line of thinking.

statsd-instrument

We were excited about StatsD as soon as we read Etsy's announcement. We wrote our own client and began using it immediately. It has been in production ever since, stalwartly collecting data for us, and today we're releasing it. On an average request we're sending ~5 events to StatsD and we don't see a performance hit. We're actually using StatsD to record the raw number of requests we handle over time.
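One plausible reason the overhead stays low is that StatsD events are small plain-text UDP datagrams that are sent without waiting for a reply. The sketch below shows roughly what a counter event looks like on the wire (`<name>:<value>|c`, with `|@<rate>` appended when sampling). The helper names are hypothetical and are not statsd-instrument's API; this is an illustration of the protocol, not the library's implementation.

```ruby
require "socket"

# Build a StatsD counter datagram: "<key>:<delta>|c",
# plus "|@<rate>" when the event is sampled.
def counter_datagram(key, delta = 1, sample_rate = 1.0)
  datagram = "#{key}:#{delta}|c"
  datagram += "|@#{sample_rate}" if sample_rate < 1
  datagram
end

# Fire-and-forget send over UDP. With a sample rate below 1, only that
# fraction of calls sends anything; the rate in the datagram lets the
# server scale the count back up.
def send_counter(socket, host, port, key, delta = 1, sample_rate = 1.0)
  return if sample_rate < 1 && rand >= sample_rate

  socket.send(counter_datagram(key, delta, sample_rate), 0, host, port)
end
```

Assuming a StatsD daemon listening on its default port on the local machine, usage would look like `send_counter(UDPSocket.new, "127.0.0.1", 8125, "Storefront.requests", 1, 0.1)`.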

statsd-instrument provides some basic helpers for sending data to StatsD, but we don't typically use those directly. We definitely didn't want to litter our application with instrumentation details so we wrote metaprogramming methods that allow us to inject that instrumentation where it's needed. Using those methods we have managed to keep all of our instrumentation contained to one file in our config/initializers folder. Check out the README for the full API or pull down the statsd-instrument rubygem to use it.

A sample of our instrumentation shows how to use the library and the metaprogramming methods:

# Liquid
Liquid::Template.extend StatsD::Instrument
Liquid::Template.statsd_measure :parse, 'Liquid.Template.parse'
Liquid::Template.statsd_measure :render, 'Liquid.Template.render'

# Google Base
GoogleBase.extend StatsD::Instrument
GoogleBase.statsd_count_success :update_products!, 'GoogleBase.update_products'

# Webhooks
WebhookJob.extend StatsD::Instrument
WebhookJob.statsd_count_success :perform, 'Webhook.perform'
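For readers curious how this kind of injection can work, here is a rough sketch of the general technique: a class-level macro that redefines an existing method so the original is timed on every call. It is not statsd-instrument's source. `TimingSketch`, `measure_sketch`, and `GreetingTemplate` are hypothetical names, and an in-memory array stands in for the StatsD client.

```ruby
module TimingSketch
  # Stand-in for a StatsD client: collects [metric_name, milliseconds] pairs.
  RECORDED = []

  # Wrap an existing instance method so every call is timed.
  def measure_sketch(method_name, metric_name)
    original = instance_method(method_name)

    define_method(method_name) do |*args, &block|
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      begin
        # Call the original implementation on the current object.
        original.bind(self).call(*args, &block)
      ensure
        # Record the timing even if the original method raises.
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
        RECORDED << [metric_name, elapsed * 1000]
      end
    end
  end
end

# Hypothetical class to instrument; it knows nothing about metrics.
class GreetingTemplate
  def render(name)
    "Hello, #{name}"
  end
end

# The instrumentation lives outside the class body, as in an initializer.
GreetingTemplate.extend TimingSketch
GreetingTemplate.measure_sketch :render, 'GreetingTemplate.render'
```

The instrumented class keeps its original behaviour and return values, which is what makes it possible to keep all of the metric wiring in a single file.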

That being said, there are a few places where we do make use of the helpers directly (sans metaprogramming), still within the confines of our instrumentation initializer:

ShopAreaController.after_filter do
  # Count every storefront request; the third argument is the sample rate (10%).
  StatsD.increment 'Storefront.requests', 1, 0.1

  # Only cacheable actions report a hit or miss. `next` exits the block early
  # regardless of how the framework invokes it.
  next unless request.env['cacheable.cache']

  if request.env['cacheable.miss']
    StatsD.increment 'Storefront.cache.miss'
  elsif request.env['cacheable.store'] == 'client'
    StatsD.increment 'Storefront.cache.hit_client'
  elsif request.env['cacheable.store'] == 'server'
    StatsD.increment 'Storefront.cache.hit_server'
  end
end

Today we're recording metrics on everything from the time it takes to parse and render Liquid templates to how often our Webhooks are succeeding, the performance of our search server, average response times from the many payment gateways we support, success/failure of user logins, and more.

As I mentioned, we have many tools in our data toolbox, and StatsD is a low-friction way to easily collect and inspect metrics. Check out statsd-instrument on GitHub.

5 comments

  • Pete Forde
    July 29 2011, 12:08AM

    Thanks for the engaging tech breakdown, Jesse!

    Building BuzzData, we really wanted to get metrics visibility right from the beginning. In addition to a move from Google Analytics to GoSquared (which is real-time to the degree that I can watch a specific user click through our site) for page view tracking, I wanted to capture app-specific business metrics such as signup conversions and dataset uploads.

    We actively looked at StatsD (and haven’t dismissed it!) but opted to use a 3rd party service called MixPanel. Their API accepts arbitrary JSON strings, and throws them into a bucket. So far, it’s been an amazing tool to work with — but we’re probably watching 12 specific actions. Plus we’re running at beta scale and you’re running at Shopify scale… so it’s possible that we’ll be back in-house at some point.

    The primary reasons MixPanel is kicking ass right now include:

    - this was a piece of infrastructure that we didn’t have to build or manage internally, it’s just another Resque job

    - we wanted it to be consumable by our GeckoBoard

    - the graphing on MixPanel is a bit sexier, a bit less “longhaired sysadmin”

    - MixPanel does some pretty amazing funneling and segmenting that is pretty much the exact amount of Business Intelligence software I want in my life

    None of this is an argument against StatsD, and I suspect that we’re best to learn from your experience and start gathering stats on EVERYTHING sooner than later.

    There’s no reason we can’t do both — we’re using MixPanel for pretty specific, high level functions that serve the executive team well. What I like about your approach is that you’re really using it as a developer tool.

  • @Shopify Jesse Storimer
    July 29 2011, 10:46AM

    Thanks Pete. Your last comment about it being a developer tool is spot on. Our executive team has a lot of reporting tools at their fingertips, and StatsD could definitely be put to their use, but it’s purely a developer tool at the moment.

    MixPanel looks pretty great, and it speaks to a weakness of StatsD. For all the power and scalability, it’s very DIY. There is a web interface for building and viewing metrics, but it wasn’t built with usability in mind.

    It looks like BuzzData is currently a small team and still in beta. In my experience, StatsD has a hard time showing trends given small amounts of data. AFAIK all the graphing is done with points on the graph, so if you have sparse data it will not be able to join them up and draw lines for you. (If someone knows I’m wrong about this please let me know!).

    That being said, StatsD is pretty low-friction, and if someone is willing to record and look at the metrics it’s worth a try.

  • Joakim Kolsjö
    July 31 2011, 07:35PM

    You can join points that are a bit apart using keepLastValue, like this: /render?target=keepLastValue(ci.projects.specs.build_time)&…

    I could not find a list of these functions in the docs, so I extracted them into a gist: https://gist.github.com/d5f9ff6a88ccc4547888

    It might be better to report rare events directly to graphite instead of using statsd?

  • Cem Hurturk
    March 30 2012, 05:10PM

    There’s also Cockpito.com which is easier to install compared to Statsd/Graphite.

    http://cockpito.com/

    It’s less featured but uses the same UDP technique for metric monitoring.

    View your metrics on a beautiful visualization tool…

  • Fred van den Boch
    May 10 2012, 02:28PM

    There now is a Statsd connector for Librato Metrics (https://metrics.librato.com/ and https://github.com/librato/statsd). Metrics is time series data management and visualization SaaS. Full disclosure: I work at Librato.
