Hacking DNA sequencing algorithms for fun, profit, and performance analysis

Transcript

Kat Gaines: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Kat Gaines, and you can find me on Twitter at Strawberry Field using the number one for the letter I. All right, folks, welcome back to Page It to the Limit. Thanks so much for joining us today. Again, I’m Kat Gaines, your host, and today we’re going to talk about a really fun topic, Hacking DNA Sequence Algorithms For Fun, Profit, and Latency Analysis. So it’s a little bit of a mouthful, but it’s the name of an internal conference talk that our guest today gave a few months ago and we thought it was really great, so we wanted to have him on the podcast. So I’m going to go ahead and welcome to the podcast Dylan Lingelbach. And Dylan, do you want to go ahead and just tell our listeners a little bit about yourself?

Dylan Lingelbach: Yeah, nice to meet everyone. As Kat said, my name’s Dylan. I’ve been in software development for a long time and primarily focused on product development. I did take a little detour into the financial world, which is where I worked on this topic. We worked on this problem, did that for a couple of years, but realized I really miss building products for customers instead of your internal tools for traders.

Kat Gaines: Yes, and just for folks’ reference, I mentioned this a moment ago, so this was a talk from an internal conference at PagerDuty. Dylan is a Dutonian, what we call a PagerDuty employee, so works with PagerDuty with me, not an external guest today. And we just thought this was a really interesting topic because really it revolves around taking a problem from one solution, relaying it to something that is completely unrelated, but just making it work. It’s really just creative problem-solving. We thought it was a really cool topic, and so that’s why we wanted to have Dylan on the podcast. So Dylan, just to get us started, can you level set for us a little bit, just how this topic came up for you. Just set the stage, and for anyone who might not really understand the problem at hand here, just get into describing that for us a little bit.

Dylan Lingelbach: Yeah, absolutely. So in trading, latency is really, really important. And latency, both in how you send data out, but also how you receive data in. So we had a problem where one of our trading systems was consistently losing money. And based on what we saw, we thought it had to do with latency somewhere in the trading system, but we weren’t really sure where. And there was a long-standing myth or belief in the trading world that if you receive data over your private data feed that told you about the trades that you did, you would receive those faster than what you would receive over a public data feed that showed you all the trades. So being a computer scientist, this myth and belief didn’t make sense to me because you’re making two network calls. Those network calls should arrive at the exact same time. There’s no reason why an exchange would artificially limit one or the other, but that was a belief. So we were looking at a very large project to listen to the private data feed instead of the public data feed to make some trading decisions. Before we embarked on that project, I was like, “I really want to make sure that this is true. I don’t want to take a couple of months building something that could be really complex and really error-prone before really understanding if it’s true or not.” So we started looking at ways to analyze the two data feeds that were coming in and figure out if there was latency between one or the other. And it was actually fairly complex because if we were talking about a handful of trades that were fairly unique, we’d be able to go and find those pretty easily in the public data feed. But the public data feed would have millions of trades a day, and we would have thousands of our own trades, and a lot of times the sequences were really difficult to line up. So it would be I bought one at 10, I bought two at 12, and there would be 10 or 15 different places where similar sequences happened in the public data feed.
And it was really, really hard to figure out how to line them up. That’s the problem that we were trying to solve when we started to do this.

Kat Gaines: Yeah, I think that’s really interesting too because there’s clearly a bit at stake if the data isn’t showing up in a timely manner; if something doesn’t line up or isn’t accurate, I’m sure it throws things off a bit. So how did DNA sequencing come into the picture? Obviously we’re describing something that doesn’t seem to have anything to do with DNA, and I’m just really curious for us to talk about how that happened, where that came up as a possible solution to this problem, and how did it end up helping you solve the problem? What did that look like?

Dylan Lingelbach: Yeah, absolutely. So it really came up randomly. So before we even got to DNA sequencing, I did a lot of googling, did a lot of research. Had anyone solved anything similar to this in the past? Now, this was almost 10 years ago, so there was less information on the web then. But also the trading industry is fairly secretive. Where if people do solve a problem like this, they don’t go and write a blog post about it because every edge you can get over your competitors is something that you want to try to keep internal. So I looked at other ways of comparing two sets of data, Levenshtein distance between two strings, looked at that. But nothing was really matching it. So I started talking to other people in the company being like, “Hey, I’m trying to solve this problem. Do you have any ideas? What do you think about this? Have you solved a similar problem?” And just brainstorming with other people. In trading, they’re called financial engineers. In tech, they’re called data scientists. But was talking to one of them and they said, “That looks like DNA sequencing. That looks like a DNA sequencing problem.” I was like, “Huh, okay. Maybe. Let me go do some research.” So I started to research different DNA sequencing algorithms and found that there’s two types. So the first is a global alignment, and what that means is you’re taking two sets of DNA, and you think they’re about the same length, and you’re just trying to figure out how they best match up. Are there some base pairs that are deleted? Some are inserted, some flipped. In DNA sequencing you would have all these different errors in your data and coding, and you’re trying to match it up given all of these errors. And then doing some more research, there was also something called local alignment where you would have a small strand of DNA, and you would try to figure out where it fit best in a much larger sequence of DNA.
As I started looking at the diagrams of the algorithms and walking through the algorithm’s flow, this really looked like the problem because we have some of our trades that have a unique sequence, a unique fingerprint, and they’re in a much larger stream of other trades. That looked really promising. So started implementing that algorithm against our two data feeds and found the algorithm at the start wouldn’t really work for a couple of reasons. So one is, in DNA sequencing, there’s a set number of base pairs. Here, we could have lots of different trades that would have a different signature. So for example, trading one at 10 is different than trading two at 10 and trading one at 10 is different than trading one at 11. So many, many combinations. But looking at the algorithm, I found that the algorithm didn’t depend on the number of base pairs. It could work even without. So we’re like, “Okay, great, we’re just going to encode each different trade as a different base pair.” That was solved. And then the other one was there were some cases in DNA sequencing where there’d be an error introduced in the sequencing. So an A would become a G or something like that. And that… In the trade data feeds, that can’t really happen. The exchange guarantees that every trade that happens is represented the same way in both data feeds. So you couldn’t have a one at 10 get flipped in a data feed to two at 11. That just wouldn’t happen. So we had to adjust some scoring parameters of the algorithm to account for that. What I did is changed scores from negative one to be negative infinity, essentially penalizing that path as much as possible so it would never be an actual valid path. So did that, and tested it out, and it looked really good. So I started finding like, “Okay, yep, I can actually line this up. This looks really promising.” So then I expanded it out and ran it on more and more data.
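
To make the description above concrete, here is a minimal sketch of a local alignment in the style of the Smith-Waterman algorithm, with each distinct trade treated as its own symbol and the mismatch score set to negative infinity. It is an illustration only, not the code from the talk: the function name `local_align`, the `(quantity, price)` tuple encoding, the match score of 2, and the gap penalty of -1 are all assumptions chosen for the example.

```python
import math

def local_align(ours, public, match=2.0, mismatch=-math.inf, gap=-1.0):
    """Smith-Waterman-style local alignment of our trades against the public feed.

    Each trade is a hashable value such as (quantity, price), so the "alphabet"
    is unbounded. mismatch=-inf means one trade can never be substituted for
    another; gaps (skipped trades) are still allowed. All scores are illustrative.
    """
    n, m = len(ours), len(public)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ours[i - 1] == public[j - 1]
            score[i][j] = max(
                0.0,                                                  # start a new local alignment
                score[i - 1][j - 1] + (match if same else mismatch),  # align the two trades
                score[i - 1][j] + gap,                                # skip one of our trades
                score[i][j - 1] + gap,                                # skip a public trade
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell to recover (our_index, public_index) pairs.
    pairs = []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        same = ours[i - 1] == public[j - 1]
        if same and score[i][j] == score[i - 1][j - 1] + match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i][j - 1] + gap:
            j -= 1
        else:
            i -= 1
    pairs.reverse()
    return best, pairs

# Made-up data: three of our trades hidden inside a longer public feed.
ours = [(1, 10), (2, 12), (1, 11)]
public = [(5, 9), (1, 10), (3, 10), (2, 12), (1, 11), (4, 13)]
print(local_align(ours, public))
```

In this sketch the public feed is allowed to contain other participants' trades between ours, which is why skipping a public trade costs only a small gap penalty, while a mismatched trade can never be accepted.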

Kat Gaines: That is really cool. As a sidebar, I feel like funnily enough, this is something that’s been on my mind a little bit lately. So, Dylan, I don’t know if you watch the Great British Baking Show at all?

Dylan Lingelbach: I do not, sorry.

Kat Gaines: Total sidebar. So, listeners, I promise this has nothing to do with the actual problem. So I’m a huge fan of that show, and on an episode recently there was someone who did their bake shaped like a strand of DNA. So the fact that we were recording this episode too, it’s just been like DNA has been present in the front of my mind, which it is not on a day-to-day, which is a really interesting thing. So I think that’s really interesting, just the way that that came up and the way that you took this thing, again, that really doesn’t have anything to do with the problem you were solving. It doesn’t have a lot to do with trading. The two don’t necessarily cross day to day. And the way it just came up from looking at different ways that problems are solved, as you said, and that someone just said, “Hey, this looks like that.” What’s the thing that you feel like you really want people to learn from this? Because I think that what I heard in your talk was that it was a really big learning moment for you and formative in how you approach problems. What do you want people to see in that and take away from that piece?

Dylan Lingelbach: Yeah, absolutely. And I should mention before I get into that, so what we found is there was actually significant latency in the public data feed. We went back to the exchange and said, “Hey, we’re seeing 45 milliseconds of latency.” And 45 milliseconds is not a long period of time, but in trading even 10 years ago, it was a lot. And the exchange said, “Oh, are you using our slow data feed? You probably want to use our fast one.” It was one of those where we’d even asked the exchange ahead of time, “Hey, should we be seeing this latency?” They’re like, “Oh, no. No, no, you shouldn’t be seeing the latency. You shouldn’t see any latency difference at all,” except you do when you use their slow data feed, which was very poorly documented. So we flipped that over and then there was no latency. So the takeaway that I had is not to be afraid of experimentation, of trying things, of squinting your eyes a little bit and being like, “Well, I don’t get exactly how this is going to work or exactly how this fits, but if I squint my eyes, it looks like the same thing. So let me just play around, and try, and iterate, and see if I can tweak and hack things around so that I can actually use a different technique to solve my problem.” That was really the biggest takeaway for me is you just got to try things sometimes, iterate and see if you can get them to work.

Kat Gaines: So I think this is really interesting and really, I want to know too, what happened at the end of this? What was the outcome of taking this creative approach and doing this work?

Dylan Lingelbach: That’s a great question. So we found that there was 45 milliseconds of latency by using this approach. And 45 milliseconds of latency even 10 years ago was a very long time in trading. So we went to the exchange and it was funny because before we even did this analysis, we went to the exchange and said, “Hey, should we be seeing a latency in your public data feed relative to your private data feed?” And their answer was, “Absolutely not. You should not.” So we took this to the exchange and said, “Hey, we’re seeing a very consistent 45 milliseconds of latency in your public data feed.” And they said, “Oh, are you using our slow data feed?”

Kat Gaines: Oh.

Dylan Lingelbach: Yeah, right. “You should be using our fast one.” We’re like, “Oh, yes, we would like to use your fast one. Thank you.”

Kat Gaines: Good to know!

Dylan Lingelbach: That was an interesting moment to learn that they actually had a fast data feed that wasn’t well documented. So we switched to that and saw no latency; essentially the same time as we reran the analysis. We ended up achieving what we were trying to do with much less engineering effort because of this analysis. So it saved us months of engineering time and got our trading system making money when it was losing money before.

Kat Gaines: And it sounds like that’s maybe a detail that might not have even come up in conversation with them without that data to prove that, “Hey, there is something wrong here.”

Dylan Lingelbach: Exactly. That’s exactly right. Yeah.

Kat Gaines: For you, it was a really cool moment to see just how you were applying this solution from a very different area that doesn’t have anything to do with trading and helping solve that problem and surface that information, again, that probably wouldn’t have come up. And I know that in your talk and when you were doing this internally, you were saying that that was a really important learning for you. What do you want people to learn from this and take away from this that you feel like was valuable for you in that area?

Dylan Lingelbach: Yeah, so I think the biggest takeaway for me is that sometimes when you’re trying to solve a problem, it’s useful to just try some things to iterate on some things. I think of it as squinting a little bit and seeing if a different solution looks the same way. I mean, the reason I started looking into DNA sequencing algorithms more deeply after it being mentioned in an offhand comment was visually when you looked at the algorithm how it flowed was very, very similar to what we were trying to do. So zooming out and saying, “These kind of look like the same thing,” even though when you look at the DNA, you’re like, “A trade? A base pair?” They’re not the same thing. They’re not the same thing at all. But when you zoom out and squint a little bit, it’s like, “Well, I’m trying to take this small piece and put it into this bigger thing. That’s exactly what this algorithm is doing.” So that, and not being afraid to hack, and experiment, and take some dead ends. Where I had to iterate on the scoring a couple of times to get it dialed in to, “Oh, this is exactly what works the best.” I was doing things and I’d be like, “Oh, this is aligning. Or this is saying that there’s six minutes of latency because the best match is spread out over this long period of time.” That doesn’t make any sense at all, and I looked a little closer at it. I’m like, “Oh, I am scoring something that is spread out six minutes apart the same as something that’s within a second of itself. I should probably score those differently.” So just that iteration and tweaking and retrying was another huge takeaway for me.
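
The scoring adjustment Dylan mentions, treating a match spread over six minutes differently from one that falls within a second, could be sketched as follows. This is a hypothetical illustration, not the scoring used in the original analysis: the helper names `latencies_ms` and `time_spread_penalty`, the `per_ms` weight, and all timestamps are assumptions invented for the example.

```python
from statistics import median

def latencies_ms(pairs, private_ts, public_ts):
    """Per-trade latency: public timestamp minus private timestamp, in milliseconds."""
    return [public_ts[j] - private_ts[i] for i, j in pairs]

def time_spread_penalty(pairs, private_ts, public_ts, per_ms=0.001):
    """Penalty that grows when the matched trades imply inconsistent latencies.

    A tight alignment implies roughly the same offset for every matched trade;
    one stretched over minutes does not. per_ms is an illustrative weight.
    """
    lat = latencies_ms(pairs, private_ts, public_ts)
    return per_ms * (max(lat) - min(lat))

# Made-up timestamps in milliseconds for three of our trades and six public trades.
private_ts = [1000, 1200, 1500]
public_ts = [900, 1045, 1100, 1245, 1545, 361500]

tight = [(0, 1), (1, 3), (2, 4)]    # every trade shows up 45 ms later
spread = [(0, 1), (1, 3), (2, 5)]   # last trade matched six minutes later

print(median(latencies_ms(tight, private_ts, public_ts)))
print(time_spread_penalty(tight, private_ts, public_ts))
print(time_spread_penalty(spread, private_ts, public_ts))
```

Subtracting a penalty like this from an alignment's score would make the tight candidate win over the stretched one even when both contain the same trades, and the median of the per-trade latencies gives a single figure to report.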

Kat Gaines: Yeah, I think that’s a great takeaway because it’s really easy to get fixated on having a playbook to fix the problem or having the right answer the first try, and in reality, that’s just not always going to be the case. You’re going to run into problems that are the unknown, but that’s part of what makes these types of things fun to dive into. And you’re going to have to try and try again sometimes to get something right. I think too just the encouragement to look into other solutions that you wouldn’t have thought of and hit on that nugget that somebody mentioned in passing and dive into it deeper because we all have different backgrounds in different areas where we might know things about something that doesn’t necessarily directly apply, but could help with a problem. And being able to apply those types of things in the moment can lead to really cool discoveries like this one.

Dylan Lingelbach: Yeah, yeah, absolutely. Absolutely.

Kat Gaines: I think I’m curious too, there are a couple of questions that we ask every guest on this show. And I am really curious about your answer to this one actually, about one thing that you wish you would’ve maybe known sooner when approaching this problem.

Dylan Lingelbach: Yeah, I mean, well, obviously what I would love to know sooner is that the exchange had a fast data feed we should be using. But I think the other thing that I wish I had known sooner is to not get discouraged when the first attempt doesn’t work. So I know I implemented the algorithm. I ran it, and I’m not getting the results I want at all. And I think I went away from the problem for a day and worked on something else because I was like, “Ah, this isn’t right. I got to do some other research, find another solution.” And then I thought about it a little bit more. I’m like, “Well, let me try tweaking it a little bit.” I think just that belief that even if there’s not a belief that this will solve it, a willingness to fail or a willingness to take a wrong path, to just experiment and try and see if you can get a little further along, I think if I had had that sooner while doing this, I would’ve been able to complete the analysis sooner.

Kat Gaines: That makes sense. The next thing that I’m curious about, is there anything about this problem that you’re glad I didn’t ask you about?

Dylan Lingelbach: Yeah, so I mean, this was 10 years ago that I was doing this, so I’m glad you didn’t go too deep into the details. It was fun putting this presentation together because I remembered it at a high level and I had to go back and research the algorithm. I think we’ll be able to share the slide deck that I presented and people can walk through that, but I didn’t remember any of that stuff off the top of my head. I had to go back and research that when I was putting it together.

Kat Gaines: So that’s very, very fair. Yes, folks. So as Dylan mentioned, we are going to drop some visual references in the resources, so please go check those out after you listen to the episode if you want to match the visuals to some of what Dylan was talking about, it’s really interesting stuff and really cool to see how the two things match together. I think that does it for our conversation, Dylan. Thanks so much for joining me on the podcast today. This is just a really fun conversation and topic, and I’m really glad that you could take some time to explain that to our listeners.

Dylan Lingelbach: Thank you. I really enjoy talking about this stuff. I really appreciate you having me on.

Kat Gaines: Awesome. Well, thank you so much, folks. Again, this is Kat Gaines wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making the podcast possible. Remember to subscribe in your favorite pod catcher if you like what you’ve heard. You can find our show notes at PageIttotheLimit.com, and you can reach us on Twitter at Page It 2 the Limit using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.

Guests

Dylan Lingelbach (he/him)

Hosts

Kat Gaines (she/her/hers)