It's Always BGP: Networking and Other Disasters with Stuart Clark

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @LNXCHK on Twitter. All right, welcome back folks. This week I have with me Stuart Clark. Stuart, tell us a little bit about yourself.

Stuart Clark: Hi, thank you for having me. So I’m Stuart Clark, I live in the United Kingdom, I’m a senior developer advocate. I’m also an author, I’ve published my first book this year under the Cisco Press banner for one of their certifications. I develop code and generally try to keep out of trouble as best as I can. But yeah, love being a developer advocate.

Mandi Walls: Awesome. So you spent a long time at Cisco working there with their program, right?

Stuart Clark: I did, yeah. So I did 10 years in total at Cisco. It was only meant to be 12 weeks.

Mandi Walls: Oh wow.

Stuart Clark: And ended up staying 10 years. So started as a contractor just as what they call a red badge, doing some stuff for IPv6. The contract rolled over a couple of times and then they asked me to move to full-time employment. And ended up doing a lot of stuff for network engineering, architecture, which then led me to do a lot of network automation. And then I became a developer advocate five years ago. And then more recently, I moved to AWS.

Mandi Walls: Well, that’s exciting. So you had a lot of experience in the networking space, and that’s why we want to talk to you. We haven’t had any networking folks on the show before, so this is new territory for us. And excited to learn everything that you’ve experienced about production networks and what happens and why is it always BGP that’s the problem?

Stuart Clark: Yeah, why is it always BGP? And when the network goes down, people always say, “Well, it must be DNS.” But a lot of the outages recently that we’ve seen have been BGP. BGP, I want to use the term, it’s like Marmite. So in the UK we have this spread and there’s a saying with Marmite, and you tend to put it on toast and it’s made from yeast. And there’s a saying around Marmite, and the reason I say Marmite is because the saying with Marmite is you either love it or you hate it. It is just one of those things.

And BGP is exactly like that. One small mistake will have catastrophic outcomes and issues. The recent one, which made the news about a year ago, exactly about a year ago, was the one which occurred for Facebook. And it was so severe that, according to their RCA and their outage report, it locked them out of meeting rooms, it locked them out of the data center where they needed to get in to send an engineer in with a console cable to actually fix it, because it was tied to a device which had something to do with their DNS. That was just a simple one-line mistake that caused that, and as we’ll probably discover more as we keep talking, how networks are incredibly fragile.

Mandi Walls: Yes.

Stuart Clark: … when it comes to mistakes like that. The cascading domino effect can have huge ramifications globally, completely globally.

Mandi Walls: Yeah, definitely that happens. And you sit back and you wonder, “How does the internet work at all some days.” It’s like weather patterns, it’s so crazy. The interconnectedness and how dependent everything is, like you say on a one line change has this snowball effect across so many layers of the systems. And then you look back and you’re like, “Oh, well we decided we’re going to distribute even more stuff and let’s go for it man.” So oh my God. I love following the BGP Twitter feed that Cisco has. And every now and then, some weird thing will pop up that somebody’s stolen a subnet or some other weird thing has happened and you’re like, “Okay, is that good, bad or indifferent? And is it the start of something horrible?”

Stuart Clark: Yeah. And a lot of these go back a long way. One of the bigger ones happened in 2008, when a state-owned telecommunications company managed to cut YouTube completely off the web. Completely off the web. And you can replay a lot of these things with tools like BGPlay. And you can actually see, like you said, and the weather map was a really good way of describing it. You can see, by using BGPlay, like a wipeout or something like that, the traffic just switching direction, changing like a storm as it all starts to converge or head into this particular subnet or this ASN, and then it just becomes unreachable. So yeah, it’s incredibly amazing. BGPmon, which is now owned by Cisco, was built by a gentleman called Andree Toonk, a Dutch guy. He was one of the original engineers at OpenDNS. Andree’s a fantastic engineer. To be honest, I think he’s one of the leading experts in BGP in the world. Whenever you see any of these BGP hijacks, he’s one of the people who’s commenting on it, and one of many of the experts in the field, certainly. And he created BGPmon, it was one of his many side projects, which was acquired by OpenDNS, which was then acquired by Cisco.

Mandi Walls: I’m going to make some notes. Well, I’ll put links to a bunch of this stuff in the show notes for folks. So if you’re interested in looking some of this up, some of it is absolutely fascinating stuff that’s gone on over the years. So you mentioned you worked in network automation and that feels like a potentially scary place to automate network stuff. When you feel like, okay, you’re maybe one line from a potential catastrophic failure. Yeah. How do folks go about that? How do you think about automating networks in safe ways?

Stuart Clark: Carefully, I think is the first. It’s kind of like that Spiderman saying, and I’m guilty of using this in many, many speaking opportunities that I’ve had is that, “With great power comes great responsibility.” Like we were saying about how one line can take down entire data centers or can cause big outages. As soon as you start doing that with automation, you are no longer just inputting this onto a single command line into one box and going kind of box by box. The going box by box method, if you are making a change and say you are doing it on 10 devices and they’re all in your data centers and you’re going east to west, just making the changes, you might start to get some alerts. If you’ve done something incorrect, you’re going box by box.

And this has certainly happened to me that I’m midway through making the changes, I’ve got 20 devices I need to add additional configuration to. And then by the time I get to kind of box five or six, I start getting alerts or somebody kind of messages you or calls you and says, “Hey, we’ve got reachability issues, this isn’t working.” Something happens like that. And you go, okay. And you start to roll back that change.

So you kind of go back to box five and back to box four and so on, as part of your change procedure. Obviously with automation, you are scripting this out. Now there’s a number of ways of doing this. Simple batch scripts, which we’ve used for decades now to be able to do this. And then there are other, more programmatic tools; generally a lot of this stuff is run over SSH. And you might be doing changes with things like Ansible or Python, with Python libraries such as Netmiko, or NAPALM, another good one. And pyATS, that’s a powerful library.

You’re making all these changes, like I said, typically over SSH into the devices and you’re altering the state of the running configuration immediately. You are then potentially going to break all 20 devices at once because you’ll take that single change from your local machine and you will push it straight out to all of those devices. And if that change is missing something, could be missing a VLAN could be missing a subnet, it could be missing one line of code or one line of configuration. You’ve just pushed it out to 20 devices. If you haven’t got any out of band on those, you could cut yourself off from devices. And then all those devices become isolated. I can certainly attest to that. That’s happened to me a number of times when changing things like security policies, access lists, anything that will cut you off from the box or anything to do with the management IP.
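One way to picture the difference between pushing to all 20 devices at once and keeping the safety of the box-by-box method is a staged rollout that checks each device before moving on and walks back in reverse order on the first failure. The following is only a minimal sketch under stated assumptions: `staged_rollout`, `apply_change`, `is_healthy` and `rollback` are hypothetical names invented for illustration, and the device names are simulated. A real version would wrap whatever tool actually talks to the devices.

```python
from typing import Callable, List


def staged_rollout(
    devices: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change one device at a time; undo everything on the first failure.

    Returns the devices that still carry the change when the function exits.
    """
    changed: List[str] = []
    for device in devices:
        apply_change(device)
        changed.append(device)
        if not is_healthy(device):
            # Walk back in reverse order: box five, then box four, and so on.
            for done in reversed(changed):
                rollback(done)
            return []
    return changed


if __name__ == "__main__":
    # Simulated fleet (hypothetical names): the change breaks the fifth box.
    fleet = [f"switch-{n:02d}" for n in range(1, 21)]
    log: List[str] = []

    result = staged_rollout(
        fleet,
        apply_change=lambda d: log.append(f"apply {d}"),
        is_healthy=lambda d: d != "switch-05",
        rollback=lambda d: log.append(f"rollback {d}"),
    )
    print(result)
    print(log)
```

In the simulated run only five devices are ever touched, and all five are rolled back, instead of twenty devices receiving a bad change at the same moment.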

So yeah, it’s one of those things I think where people kind of steered away from it. Certainly in my experience, one of the teams I was working on wanted to do things via automation because doing things by hand, going box by box, and the constant firefighting was causing outages. And we know that the majority of outages are just caused by human error, which is 98%, especially in networks. We wanted to start using automation. But with that comes this kind of really steep learning curve where you start experiencing outages because you’re pushing these at scale to a lot of devices, you are automatically pushing it straight into the running configuration, so the changes happen immediately. There’s no kind of save it and there’s no validation on the device. And sometimes with automation, you can be really caught out by things like white spaces or gaps or something like that, and only a part of the configuration is applied. And then the other parts aren’t.

So if you are putting in a string of numbers or something like that, and then you’ve got a comma or colon in the wrong place, half of that’s cut off and then all of a sudden 50% of your change has taken and the other 50% has gone. The box has kind of thrown up a question mark or what we call a caret, the little pointy up thing.

Mandi Walls: Oh yeah.

Stuart Clark: And it basically says, “Yeah, I’ve got no idea what you’re talking about here.” And then you’re left with partially configured devices and something that isn’t working. So yeah, that’s the hardest part.
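A partially applied change like this can often be caught by scanning the session output for the device's rejection messages before moving on. The sketch below is illustrative only: the error strings are modelled on Cisco IOS-style messages and will differ on other platforms, and the prompt pattern, the function name `find_rejected_commands` and the sample session are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Assumed rejection markers, modelled on Cisco IOS-style messages.
ERROR_MARKERS = ("% Invalid input", "% Incomplete command", "% Ambiguous command")

# Assumed config-mode prompt shape, e.g. "sw1(config)#" or "sw1(config-vlan)#".
PROMPT_RE = re.compile(r"^\S+\(config[^)]*\)#\s*(.*)$")


def find_rejected_commands(session_output: str) -> List[Tuple[str, str]]:
    """Return (command, error line) pairs for commands the device rejected."""
    rejected: List[Tuple[str, str]] = []
    last_command = ""
    for line in session_output.splitlines():
        stripped = line.strip()
        match = PROMPT_RE.match(stripped)
        if match:
            # Remember the most recent command typed at a config prompt.
            last_command = match.group(1)
        elif stripped.startswith(ERROR_MARKERS):
            rejected.append((last_command, stripped))
    return rejected


if __name__ == "__main__":
    # Hypothetical session: a stray semicolon makes the device reject one line.
    sample = "\n".join(
        [
            "sw1(config)#vlan 10",
            "sw1(config-vlan)#name users",
            "sw1(config-vlan)#vlan 20,30;40",
            "                          ^",
            "% Invalid input detected at '^' marker.",
            "sw1(config)#end",
        ]
    )
    for command, error in find_rejected_commands(sample):
        print(f"rejected: {command!r} -> {error}")
```

A non-empty result is a signal to stop and roll back rather than leave a device half configured.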

Mandi Walls: Yeah. And when you think about how many organizations out there really need to be running their own very large networks. We talked about Facebook earlier, obviously the big players have their own network configurations and they’re doing that stuff. Do mere mortals really need to be working on this stuff or should we leave this to our transit providers?

Stuart Clark: So it depends where you are within the organization. Certainly this is why, when you are dealing with things like BGP, service providers now put in guidelines and rules and security to stop some of the things that we’d spoken about. However, if you are in a big service provider or Facebook, one that runs essentially a big part of the internet, those mistakes can still happen. But when you are doing this now with automation, some of the great tools that are available will do a lot of validation for you before you push that configuration to that device.
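As a rough illustration of what validation before the push can mean, the sketch below checks a proposed change held as structured data before anything is sent to a device. It is not the behaviour of any particular tool: the dictionary layout, the field names and the function `validate_change` are assumptions for the example. It only uses Python's standard `ipaddress` module.

```python
import ipaddress
from typing import Any, Dict, List


def validate_change(change: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems: List[str] = []
    defined_vlans = set()

    for vlan in change.get("vlans", []):
        vlan_id = vlan.get("id")
        if not isinstance(vlan_id, int) or not 1 <= vlan_id <= 4094:
            problems.append(f"VLAN id {vlan_id!r} is outside 1-4094")
        else:
            defined_vlans.add(vlan_id)

    for svi in change.get("interfaces", []):
        # Catch the "missing a VLAN" case before it reaches a device.
        if svi.get("vlan") not in defined_vlans:
            problems.append(f"interface references undefined VLAN {svi.get('vlan')!r}")
        try:
            # strict parsing rejects typos such as host bits left in a subnet.
            ipaddress.ip_network(svi.get("subnet", ""))
        except ValueError:
            problems.append(f"subnet {svi.get('subnet')!r} is not a valid network")

    return problems


if __name__ == "__main__":
    proposed = {
        "vlans": [{"id": 5000}],
        "interfaces": [{"vlan": 20, "subnet": "10.0.20.1/24"}],
    }
    for problem in validate_change(proposed):
        print("refusing to push:", problem)
```

The point of the design is that the change is rejected on the engineer's machine or in a pipeline, where a mistake costs nothing, rather than on the running configuration.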

This is why so many network teams probably, how long ago did I start doing this? Probably seven years ago, started really adopting a DevOps methodology to the way that we did things. And we learned a lot of this from DevOps teams within the organization I worked in. And the SRE teams that I worked with, started to look at how they did a lot of their workflows, a lot of their taught, a lot of deployments and to see what kind of things and lessons learned that we could use within networks to make the networking team, their lives easier. Our lives easier.

Mandi Walls: That’s super interesting. So when we talked about DevOps, bringing that into just machine operations. We were looking at things like code reviews and using git to store configs and those sorts of practices as well as testing and other stuff depending on your platform. Is that what you saw in networking as well?

Stuart Clark: Exactly. Yeah. So network devices were made for human consumption, that’s what the CLI was made for. Underneath every router CLI, essentially, Linux is running in the background. A lot of vendors, including Cisco, have Linux running underneath and there’s some kind of translation taking place. A router has over, I think it’s 8 million lines of code running on it in the background. And it’s kind of translated into the command line interface. When I started in networking, which was probably now 15 years ago, when we saved configurations, we used a tool called RANCID. And RANCID just went in and it was just essentially just TFTP, just the device-

Mandi Walls: Oh sure, yeah.

Stuart Clark: … Straight to the running configuration. And essentially what it did was it just issued a bunch of show commands and stored them in a text file and then you could have RANCID log into your devices and kind of screen scrape these commands however often you wanted to do them. That could be anything from every minute to every 15 minutes, depending on how frequently you have changes being made. And then these files would build up a history, and then what you’d do is jump into your RANCID box and do a diff on the configuration files to see what changes were made and when they were made. And then you would just undo those changes. And that’s great. And it did work and it worked for many, many years. But then when the teams I started working on were looking at automation, we started to adopt the DevOps thing.
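The diff step described here can be sketched with Python's standard `difflib` module. This is not how RANCID itself is implemented; it only illustrates comparing two saved snapshots of a device's configuration. The function name `config_diff` and the sample configurations are made up for the example.

```python
import difflib
from typing import List


def config_diff(old: str, new: str, old_label: str = "previous", new_label: str = "current") -> List[str]:
    """Return a unified diff between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical snapshots taken fifteen minutes apart.
    before = "hostname sw1\nvlan 10\n name users\n"
    after = "hostname sw1\nvlan 10\n name users\nvlan 20\n name voice\n"
    for line in config_diff(before, after):
        print(line)
```

Lines prefixed with `+` or `-` show what was added or removed between snapshots, which is the information needed to undo a change by hand.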

We did exactly what you said. We started to look at storing configurations as machine readable code instead, because device to device and vendor to vendor, the configurations are different. They’re different across different vendor platforms and they’re different across different platforms, say firewalls and switches, and load balancers have a different sort of syntax, different command line structures. You kind of have to be an expert or have knowledge of that command line to be able to operate it. But how cool is it to store something in machine readable code, which everybody understands, including non-network engineers, SRE or DevOps. They understand JSON, they understand YAML. This is really cool. Or XML if you’re feeling really brave. And I’ve just never been an XML reader. To me it’s like looking at Mandarin.

Mandi Walls: Really is, yeah.

Stuart Clark: I know what I want to look for, just a couple of minutes of looking at it and my mind just scrambles. I’ve got to have some strong coffee in me before I read XML. So we did exactly that. We started using machine readable files as well, and then instead of storing these in RANCID on a server or having these on someone’s local machine, we had all of our configurations in machine readable code on GitHub. And GitHub then became our source of truth. So anytime changes were made, we did exactly what you said. A branch was made, a PR was made, changes were made to that specific file, and we kept everything in files in machine readable format, but we separated it almost by protocol or by layer.

Mandi Walls: Oh, interesting. Okay.

Stuart Clark: So we had blocks for security and then blocks for routing and then blocks for different devices as well. So this meant that you were only changing small portions of text files. You weren’t having to sift through lines and lines of YAML to find a specific area to add additional configuration. You could just go to the file that was for VLANs, for your switches, and then you would just alter the information that you needed in this. And then when the automation ran, it kind of glued everything together as it ran. And as someone would eyeball that change, they would look at it and they would then merge it. And then all of those changes are then pushed from the main branch into production. And then you’ve got this kind of whole pipeline of the change that’s been made and all of the metadata that comes with this to be able to identify who made the change, when the change was made, what change was made, what tests were performed, who did the peer review on this.

And you’ve kind of built this small sort of pipeline. And then like you said, with this being in machine readable things, it gives the ability for other teams to say, “But we want to put a PR against this, we want to add additional services into the load balancer. We want to put additional services into the firewall, these rules. Because all we’re having to do is just update JSON file, update a YAML file, simple.” So then that PR can go off. And then you’re releasing the network team from being that sort of 800 pound blocker, which has copious amounts of JIRA tickets to sift through for just the minor simple everyday changes, which should be just able to be done on the fly and doesn’t backlog a team for the best part of two weeks while someone gets around to making the change.
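The idea of keeping small per-layer files and gluing them together at run time can be sketched as follows. This is a toy illustration, not the team's actual tooling: the file names, the JSON layout, and the functions `load_intent` and `render_cli` are all hypothetical, the fragments are inline strings standing in for files in a repository, and the rendered lines are IOS-style syntax that will differ between vendors.

```python
import ipaddress
import json
from typing import Dict, List

# Hypothetical per-layer fragments; in a repository each would be its own file.
FRAGMENTS: Dict[str, str] = {
    "vlans.json": '{"vlans": [{"id": 10, "name": "users"}, {"id": 20, "name": "voice"}]}',
    "routing.json": '{"static_routes": [{"prefix": "0.0.0.0/0", "next_hop": "192.0.2.1"}]}',
}


def load_intent(fragments: Dict[str, str]) -> Dict[str, list]:
    """Glue the fragments into one dictionary, refusing duplicate top-level keys."""
    intent: Dict[str, list] = {}
    for name, text in sorted(fragments.items()):
        for key, value in json.loads(text).items():
            if key in intent:
                raise ValueError(f"{key!r} is defined twice (second definition in {name})")
            intent[key] = value
    return intent


def render_cli(intent: Dict[str, list]) -> List[str]:
    """Render the merged intent into IOS-style lines (syntax is illustrative)."""
    lines: List[str] = []
    for vlan in intent.get("vlans", []):
        lines += [f"vlan {vlan['id']}", f" name {vlan['name']}"]
    for route in intent.get("static_routes", []):
        net = ipaddress.ip_network(route["prefix"])
        lines.append(f"ip route {net.network_address} {net.netmask} {route['next_hop']}")
    return lines


if __name__ == "__main__":
    for line in render_cli(load_intent(FRAGMENTS)):
        print(line)
```

Because each fragment is small and written in a format other teams already read, a pull request that adds one VLAN or one route touches only a line or two, which keeps the peer review quick.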

Mandi Walls: I have definitely been there waiting for firewall requests, waiting for load balancer requests and having that stuff more democratized as it becomes more, like you said, approachable for everybody versus you know, must have the magic wand to read the CLI and figure out which command you need for X, Y, and Z. That speeds up everybody’s workflow.

Stuart Clark: It does.

Mandi Walls: Without it, you reach the end of your scalability for the business that you’re trying to run. Everybody’s sitting around waiting for a handful of poor abused network people to get through all these tickets and everybody is stopped. Yeah.

Stuart Clark: I don’t know of any company at all who they expand their global footprint or they expand their services and then they hire another 10 engineers to be able to do this.

Mandi Walls: Yeah.

Stuart Clark: And all of a sudden, it’s not unheard of to start managing a hundred devices with your small team of three or four, and then in two years’ time you’re still three or four people and now you’ve got 300 devices. That’s not uncommon, certainly within this industry.

Mandi Walls: Yeah. Definitely have been seeing plenty of teams where there’s a whole lot more work to be done and not a whole lot more people to be hired to help out on any of that stuff. So as you’ve been doing this for… So what’s the coolest part of this? Some of it seems so neat. I am on the outside looking in, have never really touched big networks. My first job out of college was at UUNet, but in DSL. So yeah, some of it seems so neat but also kind of scary.

Stuart Clark: It is kind of scary doing network changes and things, even when you start as a network engineer. There was a tweet that came out the other day from somebody that I follow, and they said that they configured their first BGP peering, and that really took me back to when I configured mine. I was sat there and even though I’d done this thousands of times in a lab environment, doing it on the real device in the real world, I remember my hands physically shaking while I was doing that. And then as soon as you’ve made that change, you keep issuing a flurry of show commands to make sure that you are sending the right prefixes and receiving the right prefixes and everything’s working. And there’s this kind of a slight delay in doing this because it’s like me giving you a book with 200 pages and then how long it takes for you to read that and how long it takes for me to read the book that you give me in return with 500 pages or something like that.

And how quickly your device can scan this and load it into the routing information table, or the RIB, how quickly it will actually load it in there, depending on how big or powerful your device is or what you are wanting to receive, et cetera. So there are certain changes within the network world, aside from BGP, like adding VLANs to switch ports and things, and spanning tree, which are the really nerve-racking commands. When you are running some of those commands, there are such great outages which have been attributed to these, and almost every engineer within that timeline has experienced a mistake or an outage from something like that. There comes that kind of, when you start to begin to automate these and build these up, you start with the low hanging fruit kind of changes and then you become a lot braver as you become more sort of confident with your automation, confident with your validation and confident with your process.

It’s certainly not one of those things where you set out to do all of the changes in the first week. You kind of build it up gradually over a period of time. And sometimes this can go into years before getting even close to having a fully automated network. But there’s a great feeling which comes off the back of that to have automated systems. Self-healing networks as well, where you can deploy changes at any point in time and you can do it in your production network and your business and your users don’t see it. It becomes absolutely seamless and it is a great feeling to be able to do that. And I’ve done that sat in coffee shops, I’ve done it on trains, I’ve done it from my hotel rooms. Wherever I’ve got a wifi connection, it became one of those things, just like, yep, let it run.

It’s connecting, I can see it, I know. And you’ve got the confidence in there to know that it’s going to deploy. If it doesn’t deploy, you’ve got a great rollback plan as well, you are able to pull that back and seeing that, removing that kind of box by box, having to stay late every night to keep doing this out of hours, being able to run this successfully is a great thing to be able to do, especially for years of firefighting and 2:00 AM changes. Being able to make those changes kind of sat in your shorts or your pajamas with a cup of coffee. Or optional, if it’s weekends, no pants at all.

Mandi Walls: Hey, no judgment. Right. Yeah. So one of the recurring questions we have on the show is if there are say myths about things that you’d want to take a chance of busting if you have a favorite myth about networking or network gadgets or any of that stuff that you can share with us.

Stuart Clark: Yeah, let me think. Especially around automation is that within networking, you can’t automate everything. The other big one is for a lot of network engineers is certainly when I started advocating network automation, people thought that there was a lot of engineering or a lot of rumors within the industry that automation would see a reduction in headcount.

Mandi Walls: If only.

Stuart Clark: The engineers would become redundant or surplus to requirements. But that really, really wasn’t the case. Automation didn’t take over the jobs at all. If anything, it increased them, and it made network engineering jobs and automation even more valuable within the industry. And those network engineers who had automation skills or some coding skills, Python skills, Linux skills, certainly did see a huge demand within industry from companies wanting to start with automation or who had already started with automation. So I think that was one of the big kind of myths that we saw within network automation: what the potential impact would be on the industry. A lot of people saw it as a negative thing and it really, really wasn’t. It was positive.

As we know, if you are able to automate something, it kind of frees you up to do the more interesting things. Because time is money as they say, and businesses want to make the most out of their staff. And if they have two or three staff just doing simple changes constantly being on a Ferris wheel, if you can automate those, it frees their staff up to do such more cooler things. Start to work on better tooling for the network, better visibility, better orchestration, better monitoring or something along the board so it improves the business thing. I think for me, that’s the biggest myth that I’ve come across.

Mandi Walls: Yeah, I feel like we had that same thing on just machine operations side of the house. People saw it as a threat to the jobs they were currently doing. It’s like, you’re not going to have the same job, you’re going to have a different job, but you’re not going anywhere. You’ve never gotten through your backlog while you’ve been slogging through imaging machines by hand and some of this other stuff people love to do.

Stuart Clark: Yeah, exactly. And exactly the same with upgrading devices in the same room.

Mandi Walls: Yeah, absolutely.

Stuart Clark: Time consuming things, which should be automated.

Mandi Walls: Awesome. Well I’m glad to hear that our networking pals are benefiting from all of this stuff as well. Do you have any parting thoughts for folks out there? Do you have any advice for folks who want to get started in network engineering?

Stuart Clark: I think network engineering has really changed in the last five years. Five years ago you used to see jobs saying network engineer required, and it had this sort of typical network engineering skill set and protocols and platforms, and under this sort of highly desirable, it wasn’t even under the desirable, it was under the highly desirable, the little small print at the bottom, Ansible would be at the bottom, or Bash. And now when you look at network engineer jobs, it’s almost mandatory now to have those skills. That doesn’t mean that network engineers have to become coders.

The way that I always described how much I use Python and how much I use Ansible or even Go was, as long as I could do my network engineering job with those tools, that was going to be good enough to be able to do those deployments, to understand those. As time went by, I became more interested in those languages to do other things, to perform other things, to build other tools, to do things in the cloud, to do other things for orchestration and telemetry and all of those other things. For me it was just a case of automation became a great tool to be able to do my job.

And I think about it like my father, who was a carpenter, and if you think about how my father would’ve started with his tools when he started his trade in the sort of 60s, he didn’t have an electric drill or an electric hammer or an electric saw or anything like that, but as his career progressed, he got tools to help him do the job. Python and Ansible are those tools. You don’t have to understand how they work underneath. It’s kind of like driving a car. You get in the car, turn the key, press the accelerator, you don’t have to understand the combustion engine. If at a later date you want to take apart the combustion engine and understand how it works, great. That’s a fun experience. Is it necessary? No.

Mandi Walls: That is fantastic advice.

Stuart Clark: Yeah. Good. And the other thing that people ask me, they say, “Should I use Ansible or should I use Python?” I’ll say, “Yes.” And they’ll say, “Yes? Which one?” It doesn’t matter which one, you just start doing it. Start doing it. If you find that you get halfway down the road with Ansible and you think this isn’t the tool for me, great, switch to Python. If you get halfway down the road with Python and you think, well, this is too hard for me, maybe I ought to be using Ansible or maybe I ought to be using Terraform, great, switch to those. Just start doing it. Start doing it.

Mandi Walls: Awesome. Yeah, get those concepts in. Well fantastic. This has been so much fun.

Stuart Clark: Thank you. This is awesome.

Mandi Walls: It’s been great to talk to you. Thank you so much for joining us today. Where can people find you? Are you online?

Stuart Clark: I am. I’m online. I’m on Twitter as, @bigevilbeard for my two foot beard and that’s also my GitHub ID as well. They’re the two sort of main places you’ll find me. Or other than that, anywhere there’s coffee, I’m normally there too.

Mandi Walls: That sounds fantastic. Me too. All right, well thanks everybody for joining us this week. We’ll sign off here and we’ll wish you an uneventful day.

Stuart Clark: Thank you.

Mandi Walls: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Guests

Stuart Clark

Hosts

Mandi Walls (she/her)