The Knight Capital Disaster

Posted on Tuesday, Feb 17, 2026
How much damage can one bad deployment do? In the case of Knight Capital, enough to destroy the company. Join us as we dig into this notorious software failure and the lessons that we can learn from it.

Transcript

Mandi Walls (00:09): Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I'm your host, Mandi Walls. Find me at LNXCHK on Twitter.

All right. Hi, welcome back to Page It to the Limit. It's just me this week. I'm still looking for guests for this season, so if you'd like to be on the show, drop us a line. We're community-team@pagerduty.com, and we'd love to hear your story on the show.

So this week, since we have a bit of a lull in the guests, I wanted to cover what I think of as a disaster from history. Other industries have seminal events that they go back and study. There's still a lot of fascination around things like the Titanic, just in the general population, but also the Challenger disaster, things like that, events that become pivotal moments in some industries for how they practice responding to incidents, how they handle emergency response, and how they handle other components of their culture.

(01:24): So looking back through the great technical disasters of the past 20 years or so, one of the most notorious is the Knight Capital disaster. You may have heard this one mentioned a few times. It's kind of the boogeyman of technical incidents because it basically destroyed a company, so it does come up in discussions of worst practices or things we never want to do at our company, those kinds of things. Maybe you've heard parts of this, and if you haven't, I'll give you the overview today, and there will be some resources in the show notes if you'd like to learn more. This is sort of a benchmark incident. A lot of things went wrong, and there's a lot that we have learned from it in the subsequent years. Hopefully there are practices that you currently use that are reinforced by what folks learned from Knight Capital.

(02:23): So who was Knight Capital? Knight Capital was what's called a market maker: a firm that stands ready to buy and sell stocks on a regular and continuous basis, essentially all the time, at whatever publicly quoted price is available. They were doing high speed algorithmic trading. They covered more than 19,000 different US securities, plus some options, plus some US securities in European markets, but for the most part they were working on the New York Stock Exchange. They had a trading volume of about $21 billion daily. That did not mean they had that much money themselves; they were basically making these trades on behalf of retail brokerages, like E-Trade or Ameritrade, bidding to handle their customers' trades. By 2011, when the story kind of starts, Knight Capital was worth about $1.5 billion and employed just under 1,500 people.

(03:29): So not a huge company, but they were in a good space. They were known for having very good algorithms and being very good at high speed trading, which, foreshadowing, was maybe part of their downfall. Prior to 2011, there were what were called dark pools: private markets where the large brokerage houses could trade with each other, sometimes at a bit of a premium, sometimes at a bit of a discount, without those trades hitting the public market. So it wouldn't necessarily show up on the public market that some institutional investor had placed a large position in some particular security. These existed among the different trading partners, and the New York Stock Exchange wanted to launch one of these; they wanted to make one available to their partners. So they put that forth at the end of 2011, proposing to create what they called the Retail Liquidity Program.

(04:36): These are controversial because there's no public information about the trades being made; it's dark money moving around, so they do feel kind of spooky. They weren't even sure the thing would get approved, but it did get approved by the SEC, the Securities and Exchange Commission, which in the US is our federal oversight for the securities markets. The RLP was approved in June of 2012. Now, rather than give everybody time to implement the new systems that would be required, they decided to launch this thing on August 1st. So firms ended up having five or six weeks to implement any new code required to access this new part of the market, which seems insane, right? That is not enough time to do this well. And we will find out that for Knight Capital, it was not enough time to do this well. They felt it was an excellent opportunity, but obviously the timeline was very rushed.

(05:38): Knight obviously wanted to participate on launch day, so they needed their code to be ready in just a few weeks, ready for the August 1st launch. Joining the RLP program meant making changes to their existing system, called SMARS, the Smart Market Access Routing System. SMARS was their algorithmic high speed order router. It could execute thousands of orders a second, lots of volume, and also compare prices among dozens of trading venues. So it was very fast, very efficient, and very high volume. Part of what SMARS would do was take large orders, orders for many thousands of shares, treat each one as a parent order, split it into tranches, and route those out to external venues for execution.

(06:40): Those tranches were called the child orders, so every parent order could potentially have a number of different child orders. SMARS had to keep an eye on how many of those child orders were executed to make sure the parent order had been fulfilled, and then mark the parent order as closed and completed. The SMARS system lived on eight production servers, and unfortunately, for some reason, there was still a lot of old code sitting around in SMARS. You can kind of imagine working in one of these really complex but also kind of monolithic systems, like many of us remember working on in 2012. There were still a lot of large components with lots of things going on, and SMARS was no different. It had an old component called Power Peg that was still included in the production code. Knight Capital hadn't been using Power Peg; they had turned it off in 2003, but the code was never removed.
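To make that parent/child bookkeeping concrete, here's a minimal sketch in Python. All of the names are hypothetical, this is not Knight's actual code, but it shows the core idea: the only thing telling the router to stop sending child orders is a counter of executed shares checked against the parent order's size.

```python
# Hypothetical sketch of parent/child order bookkeeping; not Knight's actual code.
from dataclasses import dataclass


@dataclass
class ParentOrder:
    symbol: str
    total_shares: int          # size of the original (parent) order
    executed_shares: int = 0   # the counter tracking shares filled by child orders
    closed: bool = False


def send_to_venue(symbol: str, shares: int) -> int:
    """Stand-in for routing one child order to an external venue; returns shares filled."""
    print(f"child order: buy {shares} shares of {symbol}")
    return shares


def route_child_orders(parent: ParentOrder, tranche_size: int) -> None:
    """Send tranches until the counter shows the parent order is fulfilled."""
    while not parent.closed:
        remaining = parent.total_shares - parent.executed_shares
        filled = send_to_venue(parent.symbol, min(tranche_size, remaining))
        parent.executed_shares += filled          # update the counter
        if parent.executed_shares >= parent.total_shares:
            parent.closed = True                  # the only thing that stops the loop


route_child_orders(ParentOrder(symbol="XYZ", total_shares=10_000), tranche_size=2_500)
```

Keep that counter in mind, because everything about when the routing stops depends on the code being able to see it.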

(07:42): So the Power Peg code was nine years out of date, but still deployed, still available in the production code base. The other problem was that Power Peg was a test system. It was meant to buy and sell things and then monitor other outputs, and it was never meant to be deployed to the production servers. So there's some question as to why it was out on the SMARS servers in the first place; it probably should never have been there. There was just a lot of unclean code hanging around in these particular deployments. Part of what SMARS was doing, obviously, was taking these parent orders and creating child orders, and there was a counter involved in keeping track of the child orders and making sure the parent order was fulfilled. When Power Peg had been in play, that counter code had lived in one place, and then sometime after they had turned Power Peg off in 2003, they moved the counter.

(08:43): The counter was moved in 2005, so it now lived in a different place where the Power Peg code wouldn't have been able to find it, but that was fine because they had turned off Power Peg, they thought, right? So the counter was living in a different part of the SMARS architecture at that time. Now, the new RLP code for this August launch was meant to replace some of the Power Peg components, and they were doing this via a feature flag. Hopefully you're familiar with feature flags: you have code that you deploy, but you don't turn it on until a setting is pushed out to the server environment, and only then does that code get executed. So you can do a sort of dark deployment. You can deploy things in advance, when the code is finished, and then turn the feature on at some launch date.

(09:36): You get all your marketing and your PR aligned, and that's great. So RLP was placed behind a feature flag. Sometime at the end of July, the servers were deployed and the RLP code was included in that deployment, behind a feature flag, which is great, except that they reused the same feature flag for RLP that had been used for Power Peg. That's not great practice when you think about what your feature flags are meant to do, because you don't necessarily know who's going to be in charge of turning them on or off. We use a lot of feature flags; sometimes it's support that turns them on, sometimes product managers turn them on for testing. You want them to have semantic meaning, in that "Enable Power Peg" and "Enable RLP" would have been two obviously different flags with different meanings, and that is not what happened here.
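As a rough illustration of what single-purpose, semantically named flags look like, here's a small sketch. The flag names, the config shape, and the routing functions are invented for this example; a commercial flag service would give you something richer, but the principle is the same.

```python
# Hypothetical sketch: distinct, semantically named flags instead of one reused flag.
FEATURE_FLAGS = {
    "enable_power_peg": False,    # retired feature; ideally deleted along with its code
    "enable_rlp_routing": False,  # new RLP code path, to be flipped on launch day
}


def route_via_rlp(order: dict) -> str:
    return f"RLP route for {order['symbol']}"       # stand-in for the new code path


def route_via_default(order: dict) -> str:
    return f"default route for {order['symbol']}"   # stand-in for existing behavior


def route_order(order: dict) -> str:
    # Flipping "enable_rlp_routing" can only ever turn on the RLP path; it cannot
    # accidentally resurrect Power Peg, which sits behind its own clearly named flag.
    if FEATURE_FLAGS["enable_rlp_routing"]:
        return route_via_rlp(order)
    return route_via_default(order)


print(route_order({"symbol": "XYZ"}))
```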

(10:36): So the deployment happens; an engineer pushes the code out. There were eight servers in the SMARS pool. Unfortunately, due to some error or oversight, only seven of the SMARS servers got the new code. No one reviewed it. No one noticed. There was no monitoring. There were no alerts to say, "Hey, that server over there doesn't have the right code version on it." There were no checks for making sure the production servers were actually running the thing they were expected to run. And until the feature flag is flipped, no one has any reason to notice, right? That's the whole point of feature flags. So we have one server that's out of sync with the rest of the farm, still running the old code with Power Peg wired to that reused flag. We get to launch day, August 1st, 2012. Early in the morning, eight o'clock or so, an internal system starts reporting error messages.

(11:36): They're coming from the SMARS servers with the description "Power Peg disabled". That's a cryptic message if the folks receiving it hadn't been there nine years earlier: what's Power Peg? Unfortunately, these messages were sent over email. What do we know about real-time alerts? You don't send them over email. That's why we encourage all of our users to use app pushes or SMS or something that's going to get to them in real time. Email is super easy to ignore; you get a lot of them, and you don't get notified of all of them. So no one at Knight noticed the emails before the market open. The market opens at 9:30. RLP orders start to flow to the SMARS servers. Seven of the servers respond correctly, and deals proceed as expected. On the eighth server, however, Power Peg rises from the murky depths of the old code and causes a disaster, essentially.

(12:35): Because Power Peg no longer had access to the parent order counter, it never knew when to stop executing trades. It had no stopgaps or emergency brakes. It just kept trading. The trading volume from Knight was so high that it was noticed by the stock exchange analysts by 9:34 AM; it was noticed almost immediately by other folks watching the market. By 9:34 the stock exchange analysts were trying to get hold of people at Knight Capital to figure out what was going on. Four minutes, right? They couldn't get hold of the CEO; he was out on medical leave. They finally got hold of the CIO, and the CIO gathered all the staff together. They had no documented procedures for incident response. They didn't know what was going on, who was in charge, or how to handle it, anything like that. And there was no emergency brake and no kill switch on the trading systems.

(13:31): They couldn't just turn them off or disconnect them from the trading network or block them at the firewall or anything like that. So they dug around in the systems for another 20 minutes before they determined that the problem was in the new code. You just did a deployment, you just turned the thing on; wouldn't that be the first thing you suspect? But it took them 20 minutes to get to that part of the problem. And unfortunately, what they did, instead of turning off the feature flag, was revert to the last version of the code across the SMARS servers. So now all eight servers were running the old version of the software with the Power Peg code in it, and that caused even more trades to spiral out of control. There were alarms and alerts and other things in the system, but they were set on things like price rather than volume.

(14:28): And that's kind of fair; we'll talk in a minute about other things the rest of the market was watching as well. So yeah, inappropriate monitors and alerts, another issue, right? They figured out the problem by about 9:58, so they had 28 minutes of damage, and it took them until around the 45-minute mark to get everything cleaned back out. So how bad could it possibly be? It wasn't even an hour: 28 minutes to figure out what was going on and a little while longer to turn it off. Unfortunately, and this is why this became notorious, it turns out it was really, really bad. Knight Capital had executed over four million trades across 154 stocks, involving more than 397 million shares. Remember, this is high speed, high frequency trading. The total was more than $6 billion in trades. Crazy, right? Then they had to figure out what they were going to do about it.

(15:33): So when a trade goes through, and if you've done any individual stock trading you'll know this, you put a buy or sell order in, and it takes a little time to get that email back saying your order has been processed and closed. The market makers like Knight had three days from the time an order was executed to actually settle up with the cash. So they could spend the rest of the day trying to unwind all the things they had done. Since this had been a mistake, couldn't they just say, "Oh, takesies backsies, can we get some of these unwound?" Unfortunately, there had been a flash crash in 2010, and the SEC had come back with a new set of rules around what kinds of trades could be canceled.

(16:20): And again, these were based on price, not volume. The trades they could cancel were trades that had changed the price of a stock by more than 30%. They had six stocks in this batch of 150-plus that met that threshold, so they could unwind those, but everything else had to be covered. Fortunately, Goldman Sachs came to the rescue. They stepped in and bought out all of the positions, and Knight ended up covering $440 million of the difference between the market price and the price they had on these positions they had mistakenly traded into. So from a total of $6 billion in trades, it ended up costing the company $440 million. Now, I mentioned earlier that Knight was only worth $1.5 billion at the time, so this is nearly a third of their total market cap. They were not in great shape.

(17:18): A group of investors stepped in to try to save them and infused them with $400 million, but they still ended up selling to GetCo LLC later in the year, and that sale closed in 2013. So they ended up disappearing completely after that. You've probably picked out at least half a dozen different lessons from this one particular, we'll call it a disaster, right? It is a disaster. So a couple of things. One of my favorite things ever is deleting old code, right? You don't need it anymore, we don't want to talk about it anymore, it's just in there and it's cruft and it's junk, and can't we please delete this? It's hard, right? If you've been in the industry long enough to have worked on the decomposition of a monolith into microservices, you know this is hard, right?

(18:07): You don't know which parts of that code are load bearing, right? What's in there that's got some secret purpose, that you're not sure what it's doing anymore, and you don't want to touch it? That's totally understandable, but at some point you do need to go back and really take a good architectural look at the code you're running. Remember, this happened in 2012 and they had been running this old code from 2003. It had been in the code base for nine years, unused.

(18:45): It wasn't commented out. It wasn't deleted. It was still active and still governed by a feature flag that evidently wasn't documented. That's a hard length of time to justify. It's hard to imagine now, too, because a lot of our code is shorter lived in a lot of parts of the industry. Especially if you've moved to containerized solutions, it's much easier to take a hard look at microservices and clean out code. Most of the code base ends up being much smaller when you do that and take a harsher look at that kind of stuff. You don't necessarily have stuff living on the same servers for nine years. Then there's version control. It's easy to say, okay, we're doing CI/CD, we're just going to push things out, but how do you know?

(19:39): Hopefully you're using tools that have mechanisms to verify for you that once you've deployed the code, it has gone everywhere you expected. And if you are not using automated deployments, why aren't you? You should have some kind of monitoring and notification for the case where a server doesn't get the right code, so that you know exactly what is in the runtime versus what is on the disk. It is totally possible to push the code out, not restart the service, or have the service fail to restart, and still be running the old code that's in memory. Having a mechanism to verify that this service is running this particular version of this code, and that it is what I expect, is a better practice than what we saw here. Some people also suggest having some kind of second-person review.
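Here's one way that verification can look, sketched with made-up hostnames and a hypothetical /version endpoint exposed by the running service; your deploy tooling or orchestrator may already give you an equivalent check out of the box.

```python
# Hypothetical sketch: verify every server in a pool reports the version we just deployed.
# Hostnames and the /version endpoint are made up; substitute whatever your fleet exposes.
import json
import urllib.request

EXPECTED_VERSION = "2012.08.01-rlp"
SERVERS = [f"smars-{i:02d}.example.internal" for i in range(1, 9)]  # eight-server pool


def running_version(host: str) -> str:
    """Ask the running process (not the files on disk) what version it is serving."""
    with urllib.request.urlopen(f"http://{host}:8080/version", timeout=5) as resp:
        return json.load(resp)["version"]


def verify_fleet() -> list[str]:
    """Return the hosts that are NOT running the expected version."""
    out_of_sync = []
    for host in SERVERS:
        try:
            version = running_version(host)
        except OSError as err:
            out_of_sync.append(f"{host}: unreachable ({err})")
            continue
        if version != EXPECTED_VERSION:
            out_of_sync.append(f"{host}: running {version}, expected {EXPECTED_VERSION}")
    return out_of_sync


if __name__ == "__main__":
    problems = verify_fleet()
    if problems:
        # In practice this should page someone, not just print.
        raise SystemExit("deployment verification failed:\n" + "\n".join(problems))
    print("all servers report", EXPECTED_VERSION)
```

The important part is that you're asking the running process what it's serving, not just checking which files landed on disk.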

(20:31): I'm more in the camp of automating deployments as much as possible, right? We want to reduce our chances for error by having processes that are well understood and repeatable just be done by the machines. We also don't want to be reusing feature flags. There are definitely commercial tools available that will help you with this. And if you're not using a commercial tool, hopefully you've got some process in your code scanners or in your CI/CD that says, "Oh, you've introduced a new feature flag. It does not appear in this documentation file." Honestly, you should fail the build. You need some kind of mechanism for tracking all of your feature flags and making sure they're not reused. And also make sure they're semantic and have meaning, so that no one mistakenly uses the flag for feature X when they mean to enable feature Y.
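A sketch of that kind of build check might look like the following, assuming, purely for the sake of the example, that flags show up in code as FEATURE_FLAGS["flag_name"] and that known flags are listed one per line in a feature_flags.txt registry; both conventions are made up here.

```python
# Hypothetical CI check: fail the build if code references a flag that isn't in the registry.
import pathlib
import re
import sys

FLAG_PATTERN = re.compile(r'FEATURE_FLAGS\["([a-z0-9_]+)"\]')


def flags_in_code(root: str = "src") -> set[str]:
    """Collect every flag name referenced anywhere in the source tree."""
    found = set()
    for path in pathlib.Path(root).rglob("*.py"):
        found.update(FLAG_PATTERN.findall(path.read_text(encoding="utf-8")))
    return found


def registered_flags(registry: str = "feature_flags.txt") -> set[str]:
    """Read the documented flags, one per line, ignoring blanks and comments."""
    lines = pathlib.Path(registry).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip() and not line.startswith("#")}


if __name__ == "__main__":
    unknown = flags_in_code() - registered_flags()
    if unknown:
        print("undocumented feature flags:", ", ".join(sorted(unknown)))
        sys.exit(1)  # fail the build until the flag is added to the registry
    print("all feature flags are documented")
```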

(21:25): Lots of things to figure out there, but there are many tools available now that weren't available in 2012, or were maybe in their infancy in 2012. We want you to have an incident response plan. Now, most of the folks at Knight Capital, from what I understand, were probably in the office that day, so you could pull everybody into a conference room and everyone was in the same place and able to freak out all together. If you have mixed teams, remote teams, folks who work from home, you want to have an explicit plan for how you're going to handle an incident. Are you going to use incident commanders? Do you need a scribe? Are you meeting in Slack? Are you meeting on a conference call? How are you handling all of this? Everyone should know this in advance. It's like having a fire drill.

(22:17): You had fire drills at school. Everybody lines up, you go down the hall, you go outside, across the parking lot, across the playground, and everybody lines up far away from the building. And maybe, if you still work in an office building, you've had one of those for your office building. You need to know where to go when something goes wrong, how to get in touch with everybody else who is needed for that particular incident, and to have those plans in place. Also think about your high risk systems: what can you do to stop losses? You have something that is high volume, it's very fast, it does a lot of things, it's very clever, but what happens if something goes wrong? Having a red button, a sort of stop button, for expensive features, or for things that you want to be able to turn off due to special events or anything else that might happen in your marketplace, is also a really good idea.
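Here's a minimal sketch of that kind of red button: a kill switch paired with an automatic volume trip wire. The class, the threshold, and the behavior are all invented for illustration; a real trading system would enforce something like this at several layers and page a human the moment it trips.

```python
# Hypothetical sketch of a kill switch plus an automatic volume-based trip wire.
class KillSwitch:
    def __init__(self, max_orders_per_minute: int):
        self.max_orders_per_minute = max_orders_per_minute
        self.orders_this_minute = 0   # resetting the per-minute window is omitted for brevity
        self.tripped = False

    def trip(self, reason: str) -> None:
        """Can be called automatically, or manually: the human red button."""
        self.tripped = True
        print(f"KILL SWITCH TRIPPED: {reason}")   # in practice: page a human, halt routing

    def allow_order(self) -> bool:
        """Check before every order; trips automatically if volume looks wrong."""
        if self.tripped:
            return False
        self.orders_this_minute += 1
        if self.orders_this_minute > self.max_orders_per_minute:
            self.trip("order volume exceeded configured threshold")
            return False
        return True


switch = KillSwitch(max_orders_per_minute=5)
for i in range(8):
    if switch.allow_order():
        print(f"order {i} sent")
    else:
        print(f"order {i} blocked")
```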

(23:16): These are kind of feature flags anyway, a way to turn these things off, and it needs to become part of your practice to be able to say, "Oh, when we reach this threshold, we should turn off this very expensive subprocess or this other component that we don't need right now." Really think about deconstructing your systems to the point where you can preserve the main components and the main feature set while turning off things that maybe aren't as necessary, so that you have the ability to recover more gracefully from problems and issues and things that go bad. So there's a lot in this one particular case. It's really interesting from that perspective. And like I mentioned at the beginning, it's kind of trotted out as a boogeyman because so many things did go wrong all at once. So many different aspects of the technical processes were not part of the practice, we'll say, at Knight Capital.

(24:16): So a lot of things to learn, a lot of interesting stuff there. I will post a bunch of different articles about it in the show notes, because there are a whole bunch of them. And if you're really into investing or trading, especially in US markets and the rules that came out of this and subsequent events, there's an actual SEC report on this one, and there's a bunch of interesting things to read there. It's definitely one of those incidents that haunts the industry, and it serves as the worst case scenario for the kinds of incidents we see every day. We don't expect to see $6 billion in trades gone wrong in most of the incidents that our customers cover, but there are real costs to a lot of the incidents that do get reported through our platform, and we want folks to have as many tools as possible for dealing with those.

(25:08): So hopefully this was interesting. Like I mentioned at the beginning, if you’d like to be on the show, if you’d like to share with our audience, we’d love to hear from you. We are community-team@pagerduty.com. We put these out as often as we can and yeah, we’ll be back in a couple of weeks with a new episode. So in the meantime, I will wish you an uneventful day and hopefully your feature flags will keep you safe and we’ll talk to you later.

(25:41): That does it for another installment of Page It to the Limit. We'd like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you've heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at Page It 2 the Limit, using the number two. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Hosts

Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.