Data

New Feature - Orders Export Includes Payment Reference

The Export Orders feature in your Shop's Administration page will now include the Payment Reference used to cross-reference transactions to the gateway.

The Payment Reference has been added to the end of each line of the Order export file.


This change will be put into production on December 19, 2012 at 12:00 EST. If you have any custom processing of the Orders CSV file that Shopify generates, please ensure that you have updated it to handle this new field.
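
For example, if your script reads columns by position, switching to header-based lookup means a new trailing column won't shift your existing fields. A minimal Ruby sketch, with illustrative header names (check your actual export):

    require 'csv'

    # Look fields up by header name rather than by column position, so a new
    # trailing column does not break or shift the existing ones.
    CSV.foreach("orders_export.csv", headers: true) do |row|
      order_name        = row["Name"]
      payment_reference = row["Payment Reference"]  # new field; nil on older exports
      puts "#{order_name}: #{payment_reference}"
    end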

RESTful thinking considered harmful - followup

My previous post RESTful thinking considered harmful caused quite a bit of discussion yesterday. Unfortunately, many people seem to have missed the point I was trying to make. This is likely my own fault for focusing too much on the implementation, instead of the thinking process of developers that I was actually trying to discuss. For this reason, I would like to clarify some points.

  • My post was not intended as an argument against REST. I don't claim to be a REST expert, and I don't really care about REST semantics.
  • I am also not claiming that it is impossible to get the design right using REST principles in Rails.

So what was the point I was trying to make?

  • Rails actively encourages the REST = CRUD design pattern, and all the tutorials, screencasts, and documentation out there focus on designing RESTful applications this way.
  • However, REST requires developers to realize that stuff like "publishing a blog post" is a resource, which is far from intuitive. This causes many new Rails developers to abuse the update action.
  • Abusing update makes your application lose valuable data. This is irrevocable damage.
  • Getting REST wrong may make your API less intuitive to use, but this can always be fixed in v2.
  • Getting a working application that properly supports your process should be your end goal; having it adhere to REST principles is just a means to get there.
  • All the focus on RESTful design and discussion about REST semantics makes new developers think this is what actually matters, and keeps them from getting their priorities straight.

In the end, having a properly working application that doesn't lose data is more important than getting a proper RESTful API. Preferably, you want to have both, but you should always start with the former.

Improving the status quo

In the end, what I want to achieve is educating developers, not changing the way Rails implements REST. Rails conventions, generators, screencasts, and tutorials are all part of how we educate new Rails developers.

  • Rails should ship with a state machine implementation, and a generator to create a model based on it. Thinking of "publishing a blog post" as a transaction in a state machine is a lot more intuitive (see the sketch after this list).
  • Tutorials, screencasts, and documentation should focus on using it to design your application. This would lead to better designed applications with fewer bugs and security issues.
  • You can always wrap your state machine in a RESTful API if you wish. But this should always come as step 2.
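
To make that concrete, here is a rough, hand-rolled sketch of what such a model could look like (this is not a proposal for the actual API, just an illustration of the thinking):

    # A minimal, hand-rolled state machine; a real implementation would ship
    # with Rails or come from a library.
    class Post < ActiveRecord::Base
      STATES = %w[draft published retracted]

      validates_inclusion_of :state, in: STATES

      # "Publishing" is an explicit transaction, not a generic update.
      def publish!
        raise InvalidTransition unless state == "draft"
        update_attribute(:state, "published")
      end

      def retract!
        raise InvalidTransition unless state == "published"
        update_attribute(:state, "retracted")
      end

      class InvalidTransition < StandardError; end
    end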

Hopefully this clarifies a bit better what I was trying to bring across.

RESTful thinking considered harmful

It has been interesting and at times amusing to watch the last couple of intense debates in the Rails community. Of particular interest to me are the two topics that relate to RESTful design that ended up on the Rails blog itself: using the PATCH HTTP method for updates and protecting attribute mass-assignment in the controller vs. in the model.

REST and CRUD

These discussions are interesting because they are both about the update part of the CRUD model. PATCH deals with updates directly, and most problems with mass-assignment occur with updates, not with creation of resources.

In the Rails world, RESTful design and the CRUD interface are closely intertwined: the best illustration for this is that the resource generator generates a controller with all the CRUD actions in place (read is renamed to show, and delete is renamed to destroy). Also, there is the DHH RailsConf '06 keynote linking CRUD to RESTful design.

Why do we link those two concepts? Certainly not because this link was included in the original Roy Fielding dissertation on the RESTful paradigm. It is probably related to the fact that the CRUD actions match so nicely on the SQL statements in relational databases that most web applications are built on (SELECT, INSERT, UPDATE and DELETE) on the one hand, and on the HTTP methods that are used to access the web application on the other hand. So CRUD seems a logical link between the two.

But do the CRUD actions match nicely on the HTTP methods? DELETE is obvious, and the link between GET and read is also straightforward. Linking POST and create already takes a bit more imagination, but the link between PUT and update is not that clear at all. This is why PATCH was added to the HTTP spec and where the whole PUT/PATCH debate came from.

Updates are not created equal

In the relational world of the database, UPDATE is just an operator that is part of set theory. In the world of publishing hypermedia resources that is HTTP, PUT is just a way to replace a resource on a given URL; PATCH was added later to patch up an existing resource in an application-specific way.

But what is an update in the web application world? It turns out that it is not so clear cut. Most web applications are built to support processes: they are OLTP systems. A clear example of an OLTP system supporting a process is an ecommerce application. In an OLTP system, there are two kinds of data: master data for the objects that play a role within the context of your application (e.g. customer and product), and process-describing data, the raison d'être of your application (e.g. an order in the ecommerce example).

For master data, the semantics of an update are clear: the customer has a new address, or a product's description gets rewritten [1]. For process-related data it is not so clear cut: the process isn't so much updated, the state of the process is changed due to an event: a transaction. An example would be the customer paying the order.

In this case, a database UPDATE is used to make the data reflect the new reality due to this transaction. The usage of an UPDATE statement actually is an implementation detail, and you can easily do it otherwise. For instance, the event of paying for an order could just as well be stored as a new record INSERTed into the order_payments table. Even better would be to implement the process as a state machine (processes and state machines are closely linked concepts), and to store the transactions so you can later analyze the process.
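
As a sketch of that alternative (the model and attribute names here are hypothetical):

    # Hypothetical models; paying the order INSERTs a new order_payments row
    # instead of UPDATE-ing a flag, so the transaction is kept for analysis.
    class Order < ActiveRecord::Base
      has_many :order_payments
    end

    class OrderPayment < ActiveRecord::Base
      belongs_to :order
    end

    order = Order.find(42)
    order.order_payments.create!(amount: 49.99, paid_at: Time.now)  # illustrative attributes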

Transactional design in a RESTful world

RESTful thinking for processes therefore causes more harm than it does good. The RESTful thinker may design both the payment of an order and the shipping of an order as updates, using the HTTP PATCH method:

    PATCH /orders/42 # with { order: { paid: true  } }
    PATCH /orders/42 # with { order: { shipped: true } }

Isn't that a nice DRY design? Only one controller action is needed, just one code path to handle both cases!

But should your application in the first place be true to RESTful design principles, or true to the principles of the process it supports? I think the latter, so giving the different transactions different URIs is better:

    POST /orders/42/pay
    POST /orders/42/ship

This is not only clearer, it also allows you to authorize and validate those transactions separately. Both transactions affect the data differently, and potentially the person that is allowed to administer the payment of the order may not be the same as the person shipping it.
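
In Rails routing terms this could look something like the following sketch (using the pay and ship actions from the URIs above):

    # config/routes.rb -- each transaction gets its own URI and controller action,
    # so each one can be authorized and validated on its own.
    resources :orders do
      member do
        post :pay   # POST /orders/:id/pay
        post :ship  # POST /orders/:id/ship
      end
    end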

Some notes on implementation and security

When implementing a process, every possible transaction should have a corresponding method in the process model. This method can specify exactly what data is going to be updated, and can easily make sure that no other data will be updated unintentionally.

In turn, the controller should call this method on the model. Calling update_attributes from your controller directly should be avoided: it is too easy to forget appropriate protection for mass-assignment, especially if multiple transactions in the process update different fields of the model. This also sheds some light on the mass-assignment protection debate: protection is not so much part of the controller or the model; it should be part of the transaction.
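
A rough sketch of that structure, with illustrative attribute and parameter names:

    class Order < ActiveRecord::Base
      # The transaction method decides exactly which attributes change;
      # nothing is mass-assigned from the request parameters.
      def pay!(reference)
        self.paid = true
        self.payment_reference = reference
        save!
      end
    end

    class OrdersController < ApplicationController
      def pay
        order = Order.find(params[:id])
        order.pay!(params[:payment_reference])
        redirect_to order
      end
    end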

Again, using a state machine to model the process makes following these principles almost a given, making your code more secure and bug free.

Improving Rails

Finally, can we improve Rails to reflect these ideas and make it more secure? Here are my proposals:

  • Do not generate an update action that relies on calling update_attributes when running the resource generator. This way it won't be there if it doesn't need to be, reducing the possibility of a security problem.
  • Ship with a state machine implementation by default, and a generator for a state machine-backed process model. Be opinionated!

These changes would point Rails developers into the right direction when designing their application, resulting in better, more secure applications.


[1] You may even want to model changes to master data as transactions, to make your system fully auditable and to make it easy to return to a previous value, e.g. to roll back a malicious update to the ssh_key field in the users table.

A big thanks to Camilo Lopez, Jesse Storimer, John Duff and Aaron Olson for reading and commenting on drafts of this article.


Update: apparently many people missed the point I was trying to make. Please read the followup post in which I try to clarify my point.

Defining Churn Rate (no really, this actually requires an entire blog post)

[Image: old-style woodcut of a woman with a butter churn]

If you go to three different analysts looking for a definition of "churn rate," they will all agree that it's an important metric and that the definition is self evident. Then they will go ahead and give you three different definitions. And as they share their definitions with each other they all have the same response: why is everyone else making this so complicated?

How can it be so confusing? All I want to know is how quickly my users are cancelling their service.

Unfortunately, churn rate is actually an extremely important metric. Why? To put it in modern startup terminology, churn rate is an ideal actionable (non-vanity) metric. As Eric Ries describes in his classic blog post, understanding your company metrics at a customer level is key to understanding the effects of your actions. It is your customer level metrics that will inform you if you are currently on a path to profitability, or if your current profits are sustainable. And at the heart of customer level metrics is churn rate: the measurement of the likelihood of your customer to become an ex-customer.

Let me take you through the recent stumbling process we went through at Shopify and share with you the definition we ended up with. (Or you can jump straight to the answer.)

The accountants' dream

As I say, a churn rate seems simple. Since it's just a rate all you need is a numerator and a denominator, right? So why not pick the obvious numerator and denominator.

\text{churn rate} = \frac{\text{number of churns during period}}{\text{number of customers at beginning of period}}

The problem here is that the [number of churns over period] value is affected by the entire period but the [number of customers at beginning of period] value is a snapshot from the beginning of the period. This might not have much impact if new customers only make up a small percentage of your user base but for a company that's growing this can lead to some major misinterpretations.

Consider you are calculating your churn rate for July and August. Let's say in July you started with 10,000 customers, lost 500 of them (5% of 10,000), gained 5,000, but lost 125 of those (you only lose 2.5% of the 5,000 because you gain those 5,000 over the course of the month). Now in August you start with 14,375 customers, lose a similar proportion, 719 (5% of 14,375), gain another 5,000 and again lose 125 of them. So July's churn rate would be 6.25% and August's would be 5.87%. That's a shift of 38 basis points for the exact same behaviour in both months (here the standard deviation is about 20~25 basis points depending on how you calculate it). You would have told yourself that your churn rate is improving when nothing has changed. Misreporting your performance is a bad problem to have.
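
To make the arithmetic concrete, here is that naive calculation as a quick sketch:

    # Naive churn rate: churns during the period over customers at the start of it.
    def naive_churn_rate(customers_at_start, churns_during_period)
      churns_during_period.to_f / customers_at_start
    end

    naive_churn_rate(10_000, 500 + 125)  # July   => 0.0625  (6.25%)
    naive_churn_rate(14_375, 719 + 125)  # August => ~0.0587 (5.87%)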

And beyond that, what if July had been a dead month with only 100 new customers, 2 of which churned, and then August picked up again.  Then July would have a churn rate of 5.02% and August would have a churn rate of 6.3%. This is because this definition of churn rate is directly influenced by the number of new customers you acquire. But the whole point of a churn rate is to understand churn behaviour normalized for growth and size.

The accountants' adjusted dream

Since we had a classic problem in financial analysis in our previous rate, let us try the classic solution.


\text{churn rate} = \frac{\text{number of churns during period}}{\left(\text{customers at beginning of period} + \text{customers at end of period}\right)/2}

Great, this seems to solve the problem of the number of new customers affecting our measurement of churn: according to this definition, in either of the scenarios above we always have a churn rate of about 5.1% (that is, whether July was dead or not). Our fluctuations due to artifacting have disappeared.

So let's go on and consider the next natural question: what is the churn rate for the quarter? Well, if we say September is like August (it gains another 5,000 customers, 125 of which it loses, and loses 927 of the existing customers), you end up with 22,480 customers at the end of the month. So if you use this definition you end up with a churn rate for the quarter of 15.5%?! That's a pretty radical shift from 5.1%.

But of course I should calm down. My quarter has three months in it so I just have to divide my 15.5% by three, right? And I have a reasonable number of 5.2%. However, let's consider that dead July again where I only gained 100 customers. I still had a July, August, September churn rate of 5.1% but now my quarterly churn is 4.6%?!

This is because my formula has a baked-in assumption that the churns are evenly spread out. If your data breaks this assumption you will get results that no longer make sense. Unfortunately you can't insist your customers follow your assumptions so that your equations are satisfied (I'm just imagining telling a customer "I'm sorry, we're going to have to churn you today because we haven't had enough churns to satisfy our linear distribution assumption").

A good ratio metric needs to be able to expand and shrink in the length of period that it measures. You want to be able to see how your current month churn rate compares to the churn rate of the quarter or the YTD churn rate. You also should be able to change your view to be able to see a weekly churn rate that provides better resolution but still with comparable values. And even a daily churn rate. (Of course the more narrow your window the more volatility you should expect.)

Even going from 28 days in February to 31 days in March is enough of an increase in days to create results that will suggest churn is increasing even though customer behaviour hasn't changed.

(As an aside I once suggested that we stop reporting figures for actual months and instead chop up the year into 30 day periods and only report figures for these 30 day Shopify months. I didn't really get an answer as much as I got a look that said did we really just hire this guy? I needed to remind myself that analysis needs to work within the world that we are in if it is to be useful, not an idealized version of the world. The globe isn't a sphere. The market isn't efficient. And months aren't all 30 days.)

We also tried:


which basically has the same problems but to different degrees. Instead we had to let this dream die and consider alternatives.

The predictive modelers' fancy

From a predictive modeling background, my priority was: how can I define a churn rate so that it might be useful for making predictions? The most straightforward way to do this is to find a churn rate r so that, if you have the number of customers today, n, then r*n is a prediction for the number of customers that will have churned sometime in the next 30 days (this is actually not a very good way to make this sort of prediction). To do this you might take the weighted average of the rate of people who churn within 30 days, for every day in your period. That is:

\text{churn rate} = \sum_{i \in \text{period}} w_i \, \frac{c^{30}_i}{n_i}

or

\text{churn rate} = \frac{\sum_{i \in \text{period}} c^{30}_i}{\sum_{i \in \text{period}} n_i}

where n_i is the number of customers on day i, c^{30}_i is the number of those customers who churn within the 30 days following day i, and the weights are

w_i = \frac{n_i}{\sum_{j \in \text{period}} n_j}

This seems to solve all of our past problems. If customer behaviour remains unchanged then this churn rate will remain consistent. And it can happily return a churn rate for a month, a week, or a day all in nice comparable numbers.

However, this number is hindered by the fact that it is neither current nor timely. By timely I mean that at the end of August the most recent churn rates you can report are for periods that end August 1st. You have to wait until the end of September before you can report August's churn rate. At worst you want to be able to report a month's rate only two or three days after the month ends. Maybe this is just a perception problem that can be solved by reporting July's churn rate for August, i.e. changing the recognition date. This opens other problems and doesn't resolve the issue of not being current.

By current I mean that there are churn events that have occurred that can't be reflected in your most recent churn rate. It is possible for a surge in cancellations to not be captured in the churn rate until weeks later. This is a major problem. If your main measure for churn isn't able to notify you of a major change in churn behaviour shortly after the change then it is not performing one of its primary functions.

But really, these are all symptoms of the fact that this metric makes it too hard to understand what is going on. That is, when you say this is the churn rate for August 24th to the 31st, one expects that number to be an aggregate of churn behaviour during that week. Instead, it reflects churn behaviour during that week and most of the rest of August, but only some of August's churn behaviour. Which means it is very hard to relate this number to anything else that is happening in your business. You get funky effects like the churn rate dropping just before you put a new retention strategy into practice.

So while it is nice that this version of churn rate has some predictive utility it fails on so many other accounts that it no longer seems so fancy.

Where we ended up

Between these two ideas is where we ended up:

\text{churn rate} = 30 \times \sum_{i \in \text{period}} w_i \, \frac{c_i}{n_i}

or

\text{churn rate} = 30 \times \frac{\sum_{i \in \text{period}} c_i}{\sum_{i \in \text{period}} n_i}

where n_i is the number of customers on day i, c_i is the number of cancellations on day i, and the weights are again

w_i = \frac{n_i}{\sum_{j \in \text{period}} n_j}

We've found that this resolves all the issues we had above: it's current, it's timely, it produces comparable results for different period lengths, and an increase or decrease in this churn rate reflects an actual change in churn behaviour for your measured period.

This isn't the best number to use to multiply your current customer count by to get an estimate of how many of your current customers will churn in the next 30 days. However, as demonstrated in a post over at Custora, this isn't a good practice anyhow.

What this metric is useful for is keeping track of changes in customer churn behaviour while giving a rough estimate of what percentage of your customers will leave in the next 30 days.

And finally, what makes this calculation actually something we can use is the fact that the components are all reconcilable numbers that we were already recording. While I am clearly a fan of rigor, I believe that your final result should follow the 37signals advice of "make it easy." All you have to know is how many customers you have each day and the number of cancellations for each day. Of course "customers" and "cancellations" are definitions that also need to be sorted out.
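
As a minimal sketch of that calculation, assuming the inputs are simply the daily customer counts and daily cancellation counts for the period, scaled to a 30-day window as above:

    # customers[i]     -- number of customers on day i of the period
    # cancellations[i] -- number of cancellations on day i of the period
    # Weighted average of the daily churn rates (weights proportional to
    # customer counts), scaled to a 30-day window.
    def churn_rate(customers, cancellations)
      30.0 * cancellations.inject(:+) / customers.inject(:+)
    end

    churn_rate([10_000, 10_050, 10_120], [17, 15, 21])  # => ~0.0527 (about 5.3% per 30 days)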

Epilogue

One thing that I did not mention but that needs to be kept in mind: for virtually all businesses, new customers will have a higher churn rate than mature customers. What this means is that some form of segmentation is necessary to have a useful churn rate. For example you may want to only report the churn rate for customers who have been around for at least 90 days. Or you may want separate churn rates for all sorts of demographics and tenures. The aforementioned post by Custora has a great discussion of this that goes into greater detail.

If you don't apply some form of segmentation in your reporting you will find that your churn rate increases whenever your ratio of new customers to mature customers increases, even though it may be that the churn rates of both new and mature customers are dropping.
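
For example, here is a sketch of how daily counts for a mature-customer segment might be built before feeding them into the calculation above (the per-customer fields are hypothetical):

    # Hypothetical per-customer records; in practice these come from your database.
    Customer = Struct.new(:tenure_in_days, :cancelled_today)

    customers = [Customer.new(400, false), Customer.new(12, true), Customer.new(95, true)]

    # Segment: only customers who have been around for at least 90 days.
    mature = customers.select { |c| c.tenure_in_days >= 90 }
    mature.size                             # daily customer count for the segment   => 2
    mature.count { |c| c.cancelled_today }  # daily cancellations for the segment    => 1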

No aggregate is ever going to perfectly communicate a particular customer behaviour: an aggregate is lossy compression after all. But what you want to avoid are aggregates that hide big news, tell you something has changed when everything is still the same, or leave you with the opposite impression of what is actually happening. And remember, a change in aggregate will almost never tell you the entire story. What it tells you is there is a story to be told and now you need to find it.

Most Memory Leaks are Good

TL;DR

Catastrophe! Your app is leaking memory. When it runs in production it crashes and starts raising Errno::ENOMEM exceptions. So you babysit it and restart it constantly so that your app keeps responding.

As hard as you try you don’t see any memory leaks. You use the available tools, but you can’t find the leak. Understanding your full stack, knowing your tools, and good ol’ debugging will help you find that memory leak.

Memory leaks are good?

Yes! Depending on your definition. A memory leak is any memory that is allocated, but never freed. This is the basis of anything global in your programs. 

In a Ruby program global variables are allocated but will never be freed. Same goes with constants, any constant you define will be allocated and never freed. Without these things we couldn’t be very productive Ruby programmers.

But there’s a bad kind

The bad kind of memory leak involves some memory being allocated and never freed, over and over again. For example, if a constant is appended to each time a web request is made to a Rails app, that's a memory leak, since that constant will never be freed and its memory consumption will only grow and grow.
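
A contrived sketch of that bad kind (hypothetical code, not the actual leak described later in this post):

    # A constant that grows on every request is never garbage collected,
    # so the process's memory usage only ever goes up.
    SEEN_USER_AGENTS = []

    class ApplicationController < ActionController::Base
      before_filter :remember_user_agent

      private

      def remember_user_agent
        SEEN_USER_AGENTS << request.user_agent  # appends on every request, never cleared
      end
    end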

Separating the good and the bad

Unfortunately, there’s no easy way to separate the good memory leaks from the bad ones. The computer can see that you’re allocating memory, but, as always, it doesn’t understand what you’re trying to do, so it doesn’t know which memory leaks are unintentional.

To make matters more muddy, the computer can’t differentiate between a memory leak in Ruby-land and a memory leak in C-land. It’s all just memory.

If you’re using a C extension that’s leaking memory there are tools specific to the C language that can help you find memory leaks (Valgrind). If you have Ruby code that is leaking memory there are tools specific to the Ruby language that can help you (memprof). Unfortunately, if you have a memory leak in your app and have no idea where it’s coming from, selecting a tool can be really tough.

How bad can memory leaks get?

This begins the story of a rampant memory leak we experienced at Shopify at the beginning of this year. Here’s a graph showing the memory usage of one of our app servers during that time.


You can see that memory consumption continues to grow unhindered as time goes on! Those first two spikes which break the 16G mark show that memory consumption climbed above the limit of physical memory on the app server, so we had to rely on the swap. With that large spike the app actually crashed, raising Errno::ENOMEM errors for our users.

After that you can see many smaller spikes. We wrote a script to periodically reboot the app, which releases all of the memory it was using. This was obviously not a sustainable solution. Case in point: the last spike on the graph shows that we had an increase in traffic which resulted in memory usage growing beyond the limits of physical memory again.

So, while all this was going on we were searching high and low to find this memory leak.

Where to begin?

The golden rule is to make the leak reproducible. Like any bug, once you can reproduce it you can surely fix it. For us, that meant a couple of things:

  1. When testing, reproduce your production environment as closely as possible. Run your app in production mode on localhost, set up the same stack that you have on production. Ensure that you are running the same exact versions of the software that is running on production.

  2. Be aware of any issues happening on production. Are there any known issues with the production environment? Losing connections to the database? Firewall routing traffic properly? Be aware of any weird stuff that’s happening and how it may be affecting your problem.

Memprof

Now that we’ve laid out the basics at a high level, we’ll dive into a tool that can help you find memory leaks.

Memprof is a memory profiling tool built by ice799 and tmm1. Memprof does some crazy stuff like rewriting the current Ruby binary at runtime to hot patch features like object allocation tracking. Memprof can do stuff like tell you how many objects are currently alive in the Ruby VM, where they were allocated, what their internal state is, etc.

VM Dump

The first thing that we did when we knew there was a problem was to reach into the toolbox and try out memprof. This was my first experience with the tool. My only exposure to the tool had been a presentation by @tmm1 that detailed some heavy duty profiling by dumping every live object in the Ruby VM in JSON format and using MongoDB to perform analysis.

Without any other leads we decided to try this method. After hitting our staging server with some fake traffic we used memprof to dump the VM to a JSON file. An important note is that we did not reproduce the memory leak on our staging server; we just took a look at the dump file anyway.

Our dump of the VM came out at about 450MB of JSON. We loaded it into MongoDB and did some analysis. We were surprised by what we found. There were well over 2 million live objects in the VM, and it was very difficult to tell at a glance which should be there and which should not.

As mentioned earlier there are some objects that you want to ‘leak’, especially true when it comes to Rails. For instance, Rails uses ActiveSupport::Callbacks in many key places, such as ActiveRecord callbacks or ActionController filters. We had tons of Proc objects created by ActiveSupport::Callbacks in our VM, but these were all things that needed to stick around in order for Shopify to function properly.

This was too much information, with not enough context, for us to do anything meaningful with.

Memprof stats

More useful, in terms of context, is having a look at Memprof.stats and the middleware that ships with Memprof. Using these you can get an idea of what is being allocated during the course of a single web request, and ultimately how that changes over time. It’s all about noticing a pattern of live objects growing over time without stopping.

memprof.com

The other useful tool we used was memprof.com. It allows you to upload a JSON VM dump (via the memprof gem) and analyse it using a slick web interface that picks up on patterns in the data and shows relevant reports. It has since been taken offline and open sourced by tmm1 at https://github.com/tmm1/memprof.com.

Unable to reproduce our memory leak on development or staging we decided to run memprof on one of our production app servers. We were only able to put it in rotation for a few minutes because it increased response time by 1000% due to the modifications made by memprof. The memory leak that we were experiencing would typically take a few hours to show itself, so we weren’t sure if a few minutes of data would be enough to notice the pattern we were looking for.

We uploaded the JSON dump to memprof.com and started using the web UI to look for our problem. Different people on the team got involved and, as I mentioned earlier, this data can be confusing. After seeing the huge number of Proc objects from ActiveSupport::Callbacks, some claimed that “ActiveSupport::Callbacks is obviously leaking objects on every request”. Unfortunately it wasn’t that simple and we weren’t able to find any patterns using memprof.com.

Good ol’ debugging: Hunches & Teamwork

Unable to make progress using these approaches we were back to square one. I began testing locally again and, through spying on Activity Monitor, thought that I noticed a pattern emerging. So I double-checked that I had all the same software stack running that our production environment has, and then the pattern disappeared.

It was odd, but I had a hunch that it had something to do with a bad connection to memcached. I shared my hunch with @wisqnet and he started doing some testing of his own. We left our chat window open as we were testing and shared all of our findings.

This was immensely helpful because we could both trace patterns between each other's results. Eventually we tracked down a pattern: if we consistently hit a URL we could see the memory usage climb and never stop. We eventually boiled it down to a single line of code:

loop { Rails.cache.write(rand(10**10).to_s, rand(10**10).to_s) }

If we ran that code in a console and then shut down the memcached instance it was using, memory usage immediately spiked.

Now What?

Now that it was reproducible we were able to experiment with fixing it. We tracked the issue down to our memcached client library. We immediately switched libraries and the problem disappeared in production. We let the library author know about the issue and he had it fixed in hours. We switched back to our original library and all was well!


Finally

It turned out that the memory leak was happening in a C extension, so the Ruby tools would not have been able to find the problem.

Three pieces of advice to anyone looking for a memory leak:

  1. Make it reproducible!
  2. Trust your hunches, even if they don’t make sense.
  3. Work with somebody else. Bouncing your theories off of someone else is the most helpful thing you can do.

Hack/Reduce Ottawa

Last Saturday Shopify attended Hack/Reduce Ottawa, the latest in a series of data hacking days run by the fine folks at Hopper. The interesting thing about Hack/Reduce is that the focus is on data, not apps or APIs. The hosts provided some crazy processing resources (3.75TB of RAM and 375TB of storage) on Amazon's EC2 platform and various parties came armed with large data sets to hack on. Our data team came up with 1.1 million rows of anonymized US order data, and that was one of the smallest sets available(!).


There is a great write-up of the event over at the Hack/Reduce site. I'd like to thank Jean-Claude Batista, Taswar Bhati, Joel Sachs, Andrew Clunis, and Petro Verkhogliad for putting our data to work as well as the organizers and sponsors: Everyone from Hopper, David from InfoGlutton, the staff of Moca Loca (who provided an excellent lunch), and of course UQO for providing the event space.

Hack/Reduce is a killer event so if one shows up in your city, sign up! You won't regret it.
