Testing in Production: A Key Strategy for Resilience at Netflix
Netflix is a huge fan of Testing in Production, and we've made it a core part of our strategy for improving availability. We do this through chaos engineering, and we recently renamed our team to resilience engineering because while we still do chaos engineering, it's just one means to an end on the way to that overall resilience story.
Our goal as a team is to improve availability by proactively finding vulnerabilities in services. We do this by experimenting on the production system. Our team actively believes there are certain classes of vulnerabilities and issues that you can only find with live production traffic. With that in mind, let's dive into how we approach Testing in Production.
Safety and Monitoring: The Foundation of Testing in Production
First and foremost, our main focuses with Testing in Production are safety and monitoring. You really can't have great Testing in Production unless you have these things in place. Testing in Production can seem really scary, and if it does seem scary at your company, you should listen to that voice and figure out why. It might be because you don't have a good safety story, or it might be because you don't have a good observability story. We really focus on these two areas within Netflix and within our tools.
Defining Chaos Engineering: Experimenting on Production to Find Vulnerabilities
To define chaos engineering in simple terms, it's the discipline of experimenting on production to find vulnerabilities in the system before they render it unusable for your customers. We do this at Netflix through a tool we call ChAP, which stands for Chaos Automation Platform. ChAP catches vulnerabilities by letting users inject failures into services in production to validate their assumptions about those services before those vulnerabilities become full-blown outages.
How ChAP Works: A Hypothetical Example
Let's take a hypothetical set of microservice dependencies as an example. There's a proxy, which sends requests to service A, which fans out to services B, C, and D. There's also a persistence layer: service D talks to Cassandra, and service B talks to a cache. To see whether service B is resilient to the failure of that cache, we go into the ChAP interface and select service B as the service that will observe the failures and the cache as the service that fails.
ChAP then clones service B into two replicas, which we refer to as the control and experiment clusters; it works much like A/B testing or a sticky canary. These clusters are much smaller than service B, and we route only a very small percentage of customers into them because we want to contain the blast radius. We calculate that percentage based on the number of users currently streaming.
ChAP then instructs our failure injection testing to tag requests that match our criteria by adding information to the headers of those requests. It creates two sets of tags: one set carries instructions to both fail and be routed to the experiment cluster (the canary), and the other carries instructions to be routed to the control. When the RPC client in service A sees routing instructions, it sends the request to the control or the experiment cluster. Once failure injection testing in the RPC layer of the experiment cluster sees that a request has been tagged for failure, it returns a failed response. The experiment cluster sees that as a failed response from the cache and executes its failure-handling code. We do this on the assumption that the service is resilient to the failure, but what we sometimes see is that this isn't the case. From the point of view of service A, everything appears to behave normally.
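To make the mechanics concrete, here's a minimal sketch of header-based routing and failure injection. It's purely illustrative: the class, header, and cluster names (ExperimentRoutingFilter, x-chap-experiment, serviceB-experiment) are hypothetical and not ChAP's actual implementation.

```java
// Hypothetical sketch of header-based experiment routing.
// Names are illustrative, not ChAP's real implementation.
import java.util.Map;

public class ExperimentRoutingFilter {

    // Header assumed to be added when a request is sampled into an experiment.
    static final String EXPERIMENT_HEADER = "x-chap-experiment";

    public Response handle(Request request, RpcClient client) {
        String instruction = request.headers().getOrDefault(EXPERIMENT_HEADER, "");

        if (instruction.contains("route=experiment")) {
            // Requests tagged to fail are sent to the experiment (canary) cluster,
            // where the failure-injection layer returns a failed response
            // for the targeted dependency (the cache in our example).
            return client.call("serviceB-experiment", request);
        } else if (instruction.contains("route=control")) {
            // Control requests take the same path with no injected failure,
            // giving a baseline to compare against.
            return client.call("serviceB-control", request);
        }
        // Untagged traffic flows to the normal production cluster.
        return client.call("serviceB", request);
    }

    // Minimal interfaces so the sketch is self-contained.
    interface Request { Map<String, String> headers(); }
    interface Response {}
    interface RpcClient { Response call(String cluster, Request request); }
}
```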
Monitoring Chaos Experiments: Ensuring Safety
As we run these chaos experiments, we need to monitor them carefully because they have the potential to go very poorly. When Netflix first started its chaos engineering story, we didn't have good gates in place. We would run a failure experiment, cross our fingers, and all sit in a war room watching the graphs to make sure nothing actually went wrong.
Now, we have much more of a safety focus. We look at a lot of our key business metrics at Netflix. One of our key business metrics is what we call SPS or Stream Starts Per Second. If you think about what's the most important thing to the business of Netflix, it's that a customer can watch and stream their favorite content without any issues.
During an experiment, we compare SPS between the experiment and control clusters. Because the same percentage of traffic is routed to each, the two should track one another closely. In one actual experiment, the SPS graphs for the experiment and control deviated significantly from each other, which shouldn't happen. When Automated Canary Analysis sees the two deviate that far, it shorts the experiment: we stop failing requests, and customers go back to a normal experience. From the customer's perspective, the whole thing registers as little more than a blip. This focus on safety and monitoring lets us run chaos experiments in a controlled manner without compromising the stability or reliability of our services.
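The canary analysis can be thought of as a comparison with a kill switch. The sketch below is a heavy simplification under stated assumptions (the 5% threshold and method names are made up, and real canary analysis uses far more sophisticated statistics), but it captures the shape of the check:

```java
// Hypothetical sketch of a canary check that shorts an experiment
// when SPS deviates between control and experiment; threshold is invented.
import java.util.List;

public class SpsCanaryCheck {

    private static final double MAX_RELATIVE_DEVIATION = 0.05; // 5%, illustrative

    /** Returns true if the experiment should be shorted (stopped immediately). */
    public static boolean shouldShortExperiment(List<Double> controlSps,
                                                List<Double> experimentSps) {
        double control = mean(controlSps);
        double experiment = mean(experimentSps);
        if (control <= 0) {
            return true; // no healthy baseline signal; fail safe and stop
        }
        // Both clusters receive the same share of traffic, so SPS should track
        // closely; a large relative gap suggests the injected failure is hurting
        // real customers and the experiment must stop.
        double relativeDeviation = Math.abs(control - experiment) / control;
        return relativeDeviation > MAX_RELATIVE_DEVIATION;
    }

    private static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }
}
```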
"WEBVTTKind: captionsLanguage: en(rock music)- Yeah, I'm superexcited to be here today.Netflix is a huge fan ofTesting in Production.We do it through chaos engineering,and we recently renamed ourteam to resilience engineeringbecause while we do chaos engineering toostill,chaos engineering is one means to an endto get you to thatoverall resilience story.So I'm gonna talk a littlebit about that today.Our goal as a team is to improve vulner...or improve availabilityby proactively findingvulnerabilities in services.And we do that by experimentingon the production system.Our team has an active beliefthat there are certainclass of vulnerabilitiesand issues that you can only findwith live production traffic.So I'm gonna talkto you a little bit abouthow we do that today.First and foremost, our main focuseswith Testing in Productionare safety and monitoring.You really can't havegreat Testing in Productionunless you have these thingsin place andTesting in Productioncan seem really scary.And if it does seem scary inyour company, you should listento that voiceand figure out why it seems scary.You know, it might bebecause you don't havea good safety story.It might bebecause you don't have agood observability story.So we really focus onthese two worlds withinNetflix and within our tools.So to define chaos engineering,just in a simple sentenceit's the discipline of experimentingon production to findvulnerabilities in the systembefore they render itunusable for your customers.So we do this at Netflix througha tool that we call ChAP,which stands for ChaosAutomation Platform.So ChAP can catch vulnerabilitiesand it allows users to inject failuresinto services and prod thatvalidate their assumptionsabout those servicesbefore they become full blown outages.So I'm gonna take you throughhow it works at a high level.This is a hypothe...a hypothetical set ofmicroservice dependencies.There's a proxy,it sends requests to service A which fansout to service B, C and D.And then there's also a persistence layer.So service D talks to cassandra,and then service B talks to a cache.So I went ahead and condensed thisbecause it's about toget busy in a second.So we wanna see ifservice D is resilient tothe failure of a cache.So the user goes into the ChAP interfaceand they select service Das a service that willobserve the failuresand the cache as a service that fails.So ChAP will actually go aheadand clone service B into two replicas.We refer to them as the controland the experiment clustersand it kind of works, you knowlike AB Testing or like a sticky canary.So these are much smallerin size in service Bwe only route a very verysmall percentage ofcustomers into these clustersbecause obviously we wannacontain the blast radius.We calculate that percentagebased on the current numberof users currently streaming,currently using the service.So it will then instructour failure injection testingto tag these requeststhat match our criteria.It does this by adding informationto the header of that request.So it creates two sets of tags.One set will have instructions toboth fail and be routed to the canary.And then the other willhave instructions justto be routed to the control.So when the RPC client in service Asees in the instructions thatit needs to route a requestit will actually send themto the control or the experiment cluster.And then once failure injection testingin the RPC layer of theexperiment cluster seesthat the request hasbeen tagged for failure.It will then return the failed response.As before the experiment cluster we'll 
Vulnerabilities ChAP Has Found: Two Examples
ChAP has found a lot of vulnerabilities. One of my favorites came from a user who reported: "We ran a ChAP experiment which verifies the service's fallback path works, which was crucial for our availability. It successfully caught an issue in the fallback path, and the issue was resolved before it resulted in an availability incident." This was a really interesting one because the fallback path wasn't being executed very often, so the team didn't actually know whether it worked properly. We were able to force the failure, watch the request go to the fallback path, and confirm whether the fallback behaved correctly. The team had thought of their service as non-critical, tier two, or however you want to label it, but it turned out to be a critical service.
Here's another example: "We ran an experiment to reproduce a Signup flow fallback issue that happened with certain deploys and intermittently at night. We were able to reproduce the issue by injecting 500 milliseconds of latency. By doing the experiment, we were able to find the issues in the log file that was uploaded to the Big Data portal. This helped build context into why the signup fallback experience is served during certain pushes." Something odd kept happening with this service: the fallback experience kept being served, and the team didn't know why. Running a ChAP experiment let them see when, and why, it was happening.
Setting Up Experiments: Injection Points and ACA Configurations
To set up a ChAP experiment, there's a lot the user needs to work through. They have to figure out which injection points to use and decide whether they want failure or latency. Our injection points let you fail Cassandra, Hystrix (our fallback layer), the RPC service, the RPC client, S3, SQS, or the cache; add latency; or do both, and you can come up with combos of different experiments. In practice, we would meet with service teams, sit in a room together, and try to come up with a good experiment, and it would take a really long time.
When setting up an experiment, you also have to decide on your ACA configurations, or Automatic Canary Analysis configurations. We had some canned ACAs to aid setup: a ChAP SPS one, one that looked at system metrics, one that looked at RPS successes and failures, and one that verified the experiment itself was working properly and actually injecting failures. What we learned is that experiment creation can be really, really time consuming. As a result, not many experiments were getting created, and it was hard for a human to hold in their head everything that makes a good experiment.
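To picture what a user is assembling, here's a hypothetical sketch of an experiment definition that picks an injection point, a treatment, and the canned ACA configs; the record and config names are illustrative, not ChAP's real model:

```java
// Illustrative shape of an experiment definition: an injection point, a failure
// or latency treatment, and the ACA configs attached by default.
import java.util.List;

public class ChapExperimentSketch {

    enum InjectionPoint { CASSANDRA, HYSTRIX, RPC_SERVICE, RPC_CLIENT, S3, SQS, CACHE }
    enum Treatment { FAILURE, LATENCY, FAILURE_AND_LATENCY }

    record Experiment(String cluster,
                      InjectionPoint injectionPoint,
                      Treatment treatment,
                      long latencyMillis,
                      double trafficPercent,
                      List<String> acaConfigs) {}

    // The cache-failure experiment from the running example, expressed as data.
    public static Experiment cacheFailureForServiceB() {
        return new Experiment(
                "serviceB",
                InjectionPoint.CACHE,
                Treatment.FAILURE,
                0L,
                0.5,                              // small slice of traffic
                List.of("chap-sps",               // SPS deviation
                        "system-metrics",         // CPU, memory, load
                        "request-statistics",     // request successes and failures
                        "injection-verification") // confirms failures were injected
        );
    }
}
```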
Monocle: Crucial Optics in One Place
So we decided to automate some of this in ChAP. We looked at things like who was calling whom, timeout files, and retries, and we found that all of that information lived in a lot of different places, so we decided to aggregate it. We zoomed in on ChAP, got a little cute, and gave it a Monocle; Monocle provides crucial optics on services.
In Monocle, someone can look up their app and cluster and see all of this information in one place. Each row represents a dependency, and these dependencies are what feed into chaos experiments. We built it to help us come up with experiments, but what we didn't realize was that simply having the information in one place was useful on its own; that was an interesting side effect. Users can come here and see anti-patterns associated with their service: a dependency that wasn't supposed to be critical but has no fallback (so it's obviously critical), timeout discrepancies, and retry discrepancies. We use this information to score how critical a given experiment is and feed that into an algorithm that determines prioritization.
Each row can also be expanded. In one interesting example, the configured timeout for a dependency was very, very far above the amount of time the call actually took most of the time, and that information hadn't been readily accessible before. So what would happen if we ran a chaos experiment just under the timeout? Would it pass? The call never executes that high, so it's an interesting question. We're trying to provide this level of detail to users before these chaos experiments run, to give them the opportunity to say, "Wait, this doesn't look right."
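Here's a minimal sketch of the kind of per-dependency check that becomes possible once this information is aggregated; the Dependency record, the 4x heuristic, and the findings format are assumptions for illustration, not Monocle's actual rules:

```java
// Hypothetical per-dependency anti-pattern check over aggregated
// timeout, retry, and fallback information.
import java.util.ArrayList;
import java.util.List;

public class DependencyInsights {

    record Dependency(String name,
                      boolean markedCritical,
                      boolean hasFallback,
                      long configuredTimeoutMillis,
                      long observedP99Millis) {}

    public static List<String> findAntiPatterns(Dependency dep) {
        List<String> findings = new ArrayList<>();

        // A dependency with no fallback is effectively critical, whatever its label.
        if (!dep.markedCritical() && !dep.hasFallback()) {
            findings.add(dep.name() + ": marked non-critical but has no fallback");
        }
        // A timeout far above observed latency hides problems; 4x is an arbitrary cutoff.
        if (dep.configuredTimeoutMillis() > 4 * dep.observedP99Millis()) {
            findings.add(dep.name() + ": timeout " + dep.configuredTimeoutMillis()
                    + "ms is far above observed p99 of " + dep.observedP99Millis() + "ms");
        }
        return findings;
    }
}
```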
Spotting a Vulnerability: Timeouts, Retries, and Hystrix
Here's a little game. Even without much context on the Netflix ecosystem, there's a vulnerability in the following configuration; see if you can spot it. A sample remote Hystrix command wraps both a sample REST client and that client's get call, and the Hystrix timeout is set to 500 milliseconds. The REST client's get call has a timeout of 200 milliseconds with one retry, which is fine: that's a total of 400 milliseconds with exponential backoff, within the Hystrix limit. A sample retry client, however, has timeouts of 100 and 600 milliseconds with one retry. In that case the retry might not have a chance to complete given the surrounding Hystrix wrapper timeout, which means Hystrix abandons the request before the RPC has a chance to return. That's where the vulnerability lies. We surface this information to users, and what's interesting is that this logic lives in several different places, so they couldn't get this level of insight before.
So why did this happen? It's easy for a team to go into their config file and just change the timeout, but who's to say it won't happen again? We also help teams figure out why these things happen. The engineers weren't making bad choices; there were simply a lot of things to update at once, and that's a lesson in itself.
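Here's the arithmetic from that example worked through in code. It's a simplified sketch that just sums per-attempt timeouts and ignores backoff delay; it is not Hystrix or ChAP code.

```java
// Worked version of the timeout-versus-retry example, using the numbers above.
public class RetryBudgetCheck {

    /** Worst-case time for an RPC given each attempt's timeout (retries included). */
    static long worstCaseMillis(long... perAttemptTimeoutsMillis) {
        long total = 0;
        for (long timeout : perAttemptTimeoutsMillis) {
            total += timeout;
        }
        return total;
    }

    public static void main(String[] args) {
        long hystrixTimeout = 500; // surrounding Hystrix command timeout

        // REST client get: 200 ms per attempt, one retry -> 400 ms worst case. Fits.
        long restClientWorstCase = worstCaseMillis(200, 200);
        System.out.println("rest client fits: " + (restClientWorstCase <= hystrixTimeout));

        // Retry client: 100 ms first attempt, 600 ms retry -> 700 ms worst case.
        // Hystrix abandons the request at 500 ms, before the retry can return.
        long retryClientWorstCase = worstCaseMillis(100, 600);
        System.out.println("retry client fits: " + (retryClientWorstCase <= hystrixTimeout));
    }
}
```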
Automating Experiment Creation and Prioritization
We also use Monocle for automatic experiment creation. Creating an experiment by hand means choosing from a combinatorial explosion of possible inputs, so we take everything Monocle knows and are working to automate the creation and running of these experiments so that users don't have to. We automatically create and prioritize failure, latency, and latency-causing-failure experiments at the RPC and Hystrix layers. ACA configs are added by default, including the deviation configurations for SPS, system metrics, and request statistics, and the experiments are run automatically as well.
Prioritization follows an algorithm that, at a high level, weighs the RPS stats range bucket, the number of retries, and the number of Hystrix commands associated with a dependency. We also take into account the number of commands without fallbacks and any curated impacts a customer adds to their dependency, such as "this has a known impact on login," "this has a known impact on signup," or "this has a known impact on SPS." Curated impacts are weighted negatively, and we don't run an experiment if its score is negative. Test cases are then ranked and run according to their criticality score: the higher the score, the sooner and the more often it runs.
Ironically enough, Monocle has given us feedback that allows us to run fewer experiments in production. It has become a feedback loop: because we've run so many experiments, we've seen patterns across them, and we can now look at certain configuration files, recognize certain anti-patterns, and know they will cause a failure, which we couldn't do before. It has also led to new safety measures. As before, a failed experiment must be marked as resolved, and now it must be marked as resolved before it can run again. Users can also explicitly add curated impacts to a dependency in Monocle, such as "this has a known login impact" or "this has a known SPS impact," and we're working on a feedback loop in which a failure automatically adds a curated impact. The runner will not run experiments with known impacts.
In Summary: Remember Why You're Testing in Production
ChAP's Monocle gives us crucial optics in one place, automatically generated experiments, automatically prioritized experiments, and vulnerabilities found before they become full-blown outages. If I can leave you with one piece of advice, it's to remember why you're doing chaos experiments and Testing in Production in the first place: to understand how customers are using your service, and to not lose sight of them. You want them to have the best experience possible, which is why monitoring and safety are of the utmost importance; at Netflix, that means making sure customers are never left unable to stream a video.