Rally Health℠ began the new year with a bang. On January 1 we added more than 350 new clients to Rally, which meant we suddenly had 5 million new users eligible to register. Add that to our current users, and it's a bit like trying to expand your airplane in midair.
How did we handle it? By all accounts, it was smooth sailing. Despite this massive influx of new users, our engineers report that we set up all these clients with no errors whatsoever, which surpassed even our most optimistic predictions.
How does a company that is barely six years old handle such a huge increase in traffic? Chris Brown, Rally’s vice president of engineering, discusses the challenges his team encountered and how they triumphed over them.
Chris Brown, Rally Health VP of Engineering:
The challenge we faced is that we had to add a large number of new users to the site. On a daily basis, Rally℠ is actually a pretty predictable platform. We know that on a given Wednesday, the traffic will be pretty much the same as the Wednesday before that. Once people establish the habit of using the system, they tend to stick with it.
Our user growth doesn't come on a continuous daily basis, but instead comes in spurts tied to key dates around the start of an employer's plan year. This has some advantages and disadvantages. The advantage is that our site is predictable and stable. The disadvantage is that we’ve never really had to be ready for massive bursts of traffic.
This year we had a large onboarding event. We brought on a couple million additional users into the system. And even though we knew about this in advance, we still had to make sure that our systems were ready to handle that increase in traffic.
So we followed a multi-step process. First, we worked with our sales and support teams to understand how many more users we should anticipate, and roughly what the profiles of those users would be. For example, we know that users who have incentives to use the platform are much more likely to use it.
Our data team tracks key details of site usage, like which users join Challenges or Missions, which ones use incentives, etc. And we used that data to get a sense of how many new people would be joining Challenges, how many would be joining Missions, and so on.
So we had a rough sense of the number of new users we had to take into the system. We wrote a program that pretends to be a user who does the things a user does when they register for the site. They choose a name, they choose a password, they go through the survey, and then a certain percentage of the time they join a mission, and a certain percentage of the time they join a challenge, and so on.
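As a rough illustration of the kind of program described above, here is a minimal Python sketch of a pretend user. It is not Rally's actual script: the step names, the 30% and 20% join rates, and the function names are all assumptions made up for this example.

```python
import random

# Hypothetical step names; the real registration flow is not shown in the article.
REGISTRATION_STEPS = ["choose_name", "choose_password", "complete_survey"]


def simulate_user(rng, mission_rate=0.3, challenge_rate=0.2):
    """Return the ordered actions one pretend user performs.

    mission_rate and challenge_rate are assumed probabilities, standing in
    for the percentages the data team derived from real usage.
    """
    steps = list(REGISTRATION_STEPS)
    if rng.random() < mission_rate:
        steps.append("join_mission")
    if rng.random() < challenge_rate:
        steps.append("join_challenge")
    return steps


def simulate_population(n_users, seed=0, **rates):
    """Count how often each action occurs across n_users pretend users."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    counts = {}
    for _ in range(n_users):
        for step in simulate_user(rng, **rates):
            counts[step] = counts.get(step, 0) + 1
    return counts
```

Every pretend user completes the three registration steps, and only a fraction go on to join a Mission or a Challenge, which is what lets the simulated traffic mix resemble the expected one.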
And then we took that little program, put it on an external platform called Flood.io, which is a tool that basically runs that program from servers across the United States, but runs it as many times per second, or per minute, or per hour, as we want. We were able to simulate 10,000 or 20,000 pretend users coming at our systems, and then observe how our systems reacted.
We saw things like, ‘Here’s how long it takes a page to render on our site. When we’re sending no traffic through, it takes X seconds, and when we send 10,000 at once, it takes Y seconds.’
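To make that kind of measurement concrete, here is a small Python sketch that fires many requests concurrently and reports latency. It does not use Flood.io or any Rally system: `fake_page_request` is a stand-in that just sleeps, and the request counts are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_page_request():
    """Stand-in for a real page load; sleeps briefly and reports HTTP 200."""
    time.sleep(0.001)
    return 200


def timed(call):
    """Run call() and return (status, elapsed seconds)."""
    start = time.perf_counter()
    status = call()
    return status, time.perf_counter() - start


def run_load(call, n_requests, concurrency):
    """Issue n_requests calls using up to `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed(call), range(n_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": len(results),
        "errors": sum(1 for status, _ in results if status != 200),
        "median_seconds": latencies[len(latencies) // 2],
        "max_seconds": latencies[-1],
    }


if __name__ == "__main__":
    # Compare a quiet run with a busier one, as in the X versus Y comparison.
    print(run_load(fake_page_request, n_requests=10, concurrency=1))
    print(run_load(fake_page_request, n_requests=200, concurrency=50))
```

Comparing the report from a low-concurrency run with one from a high-concurrency run gives the "X seconds versus Y seconds" picture described above.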
We did those kinds of measurements and observed where the system was under stress. Then we gradually ramped up the number of users and said, ‘Ah, you know what, when we send this number of users through, our Health Survey starts to slow down, so let’s look at the Survey code as it’s running and see where it’s “hot.”’ And then we addressed that hot spot, rolled a new version of the software out, ran the test again, saw that it was OK now, and kind of pushed the bottleneck further down the pipe.
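The ramp-up loop can be sketched in a few lines of Python. This is an illustration only: the latency model is a toy, the one-second threshold and the load levels are assumptions, and the 700 and 9,000 capacities are borrowed from the figures quoted later in this interview.

```python
def find_breaking_point(measure_latency, levels, max_seconds):
    """Return the first load level whose latency exceeds max_seconds, else None."""
    for users in levels:
        if measure_latency(users) > max_seconds:
            return users
    return None


def toy_latency(capacity):
    """Build a hypothetical latency model for a system with the given capacity.

    Latency stays flat at 0.5 s up to `capacity` simultaneous users, then
    grows linearly. A real system would be measured, not modelled like this.
    """
    def measure(users):
        if users <= capacity:
            return 0.5
        return 0.5 + (users - capacity) / capacity
    return measure


levels = [500, 1000, 2000, 5000, 10000]  # assumed ramp steps

# Before tuning: the toy system first breaches 1 second at the 2,000-user step.
before = find_breaking_point(toy_latency(700), levels, max_seconds=1.0)

# After tuning: no step in the ramp breaches the threshold.
after = find_breaking_point(toy_latency(9000), levels, max_seconds=1.0)
```

Each fix moves the breaking point to a higher load level, which is the "pushing the bottleneck further down the pipe" effect.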
Honestly, we’re pretty proud of how we did this. The team kicked off in late October, and in less than three weeks we were able to increase our capacity in that test environment from being able to handle around 700 simultaneous users to around 9,000, so we improved by more than a factor of 10. And once we were confident in what we were doing, we ran similar tests in production and increased capacity there as well.
And when we actually gave access to the roughly 5.5 million new users on January 1, the process went almost exactly as our predictions told us it would. There were two major places where we found differences.
As I mentioned, the estimation of load and traffic was based on usage models, and those usage models turned out to be a little more conservative than real life, so we had fewer actual users at any one time than the models indicated. Not by much, but we definitely had less traffic than the models indicated. So as it turned out, we were able to handle more traffic than we had experienced thus far. What this basically means is that people were able to get through the registration process more quickly than we had estimated, which is a good thing, of course.
That said, we’re still in the midst of the busy part, and so we’re still likely to see higher loads in the next weeks, for example, than we did in the first weeks since January 1. So while it might have looked like the usage models were a little overstated, we’re not at the point yet where we can say they definitely were overstated.
And there was another surprise. As I mentioned, we were using a little script to validate and verify what the site was actually doing, and we would run that same program, you know, 10,000 times or whatever the specific number was, to see what the traffic would be like. And there were places in the system that the script didn’t explore or test. Specifically, it didn't test where users would go from Rally to outside servers.
And sure enough, when we were in production and actually started sending real traffic to outside servers, their systems couldn’t keep up with the sudden increase in demand, and as a result our systems backed up, so that’s one thing we didn’t anticipate.
So where the test had to interact for real with the outside world, instead of just staying within the test environment, that’s where we had scale issues. I think if there’s a lesson learned, it’s that we need to be more thoughtful about where those boundaries are, and maybe collaborate with our partners to make sure that they’re able to handle the test.
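A toy model shows why a slow outside server makes the calling system back up. The Python sketch below is an assumption-laden illustration, not a description of Rally's or any partner's systems: the arrival numbers and the partner capacity are invented.

```python
def simulate_backlog(arrivals_per_tick, partner_capacity_per_tick):
    """Track how many outbound requests are still waiting after each tick.

    Each tick, new requests arrive and the outside server clears at most
    partner_capacity_per_tick of them; the remainder waits on our side.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        backlog = max(0, backlog + arrivals - partner_capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical traffic: quiet, then a burst, then quiet again.
burst = [50, 50, 200, 200, 50]
print(simulate_backlog(burst, partner_capacity_per_tick=100))
# prints [0, 0, 100, 200, 150]
```

A load test that never crosses the boundary to the outside server only ever sees the zero-backlog case, which is why the problem stayed hidden until production.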
One thing that stands out for me is that in order to do this testing, we had a couple of team members who started shifting their schedules so they could start their day as our real site traffic was ramping down in the evenings or late at night, so that they could go in and sort of “experiment on the patient” while it was still alive, you know? We realized that to be effective, a lot of this work would have to happen overnight. We had to give people permission to do that, make sure we were covering for the night owls during the day, and then have explicit handoffs where we transitioned from night to day.
Of course, that’s not the kind of thing you want to make habitual, but we had people with expertise who cared enough to make this happen overnight, and most of them happened to have babies, so it’s not that big a deal to be up late at night anyway! That really helped us drive it through. And we were able to handle that in a way that was positive instead of negative.
And it’s going to sound sort of corny, but even though our team on this was kind of small, you know, it was 11 people, I think it’s the first big project where we had representation on the team from every Rally office. We had DC, San Francisco, Chicago, Denver, Minneapolis. We’ve never had that much inter-office collaboration on something like this. We had the guy in DC describing how things are designed, the guy in Chicago actually applying fixes, the guy in Minneapolis making sure it’s correct, the coordination happening out of San Francisco. And I know we do a lot of cross-office work here at Rally, but this was the most cross-office collaboration that I have seen.
It also shows why I love working at Rally. This company offers a great combination of “startup” mentality with stability. It’s an awesome feeling. We have the market momentum and capabilities of a big company, but we’re still hustling and staying hungry and are ready for anything. We’re scrappy, but stable. You know, I just can’t get enough of that.