API Troubleshooting in 2020

Jonathan Wesley
4 min read · Mar 30, 2020


The evolution of software from monolithic to distributed architectures, such as microservices and Function-as-a-Service ("FaaS"), is driven by the benefits of flexibility, scalability, and faster deployment. These advantages are somewhat offset by a tremendous amount of complexity. Debugging was once the annoying practice of finding a needle in a haystack; in these evolving architectures, troubleshooting has become a mission-critical process of finding a needle in one of dozens of haystacks that can be scattered all over the world. Developers spend, on average, half their programming time troubleshooting.

The execution of a highly distributed system involves a substantial number of API interactions, which often form complex invocation chains. For example, as of 2017 Netflix estimated that it ran around 700 individual microservices, handling two billion API requests every day. That is a lot of services to maintain!

“The average time used to locate and fix a fault increases with the number of microservices involved in the fault: 9.5 hours for one microservice, 20 hours for two microservices, 40 hours for three microservices, 48 hours for more than three microservices.”

The complexity and dynamism of distributed systems require developers to reason about the concurrent behaviors of different services and to understand how each one interacts with the system as a whole. Developers may spend hours searching for a problem that does not exist in their own services. But what happens when a bug makes it into production?

The traditional flow for troubleshooting typically follows something like this:

[Image: the traditional troubleshooting flow. There should probably be several other fire emojis in this.]

My experience with bugs in distributed systems, whether in dev, QA, staging, or production, generally includes some or all of the following events. QA spends lots of time documenting the bug in a ticket for the dev team. The developer spends a few hours trying to recreate the bug and, at some point, says: "it works on my machine!" The developer then spends hours digging through log files to find where the bug might be, determines they probably aren't logging the right thing, and starts inserting statements in key areas to try to find it. QA is feeling pressure from the product manager, so they start pushing the dev team. The dev team starts questioning DevOps, then starts pointing fingers at the third-party API without any data to prove the problem is really on that side. In extreme situations I have had to step in, put on my firefighter hat, and attach remote debuggers to other environments to solve mission-critical problems.
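To make the log-digging step concrete, here is a minimal sketch of the kind of ad-hoc instrumentation this usually amounts to, written in Python with the standard logging module. The handle_payment and charge_card functions and the X-Request-ID header are illustrative assumptions rather than code from any particular system; the point is simply tagging every log line with a correlation ID so one transaction can be followed across services.

```
import logging
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("payments")


def charge_card(payment):
    # Placeholder for a call to a downstream service.
    return {"status": "charged", "amount": payment["amount"]}


def handle_payment(payment, headers):
    # Tag every log line with a correlation ID so a single transaction
    # can be followed across services when digging through log files.
    request_id = headers.get("X-Request-ID", str(uuid.uuid4()))

    log.info("[%s] payment request received: %s", request_id, payment)
    try:
        result = charge_card(payment)
        log.info("[%s] charge succeeded: %s", request_id, result)
        return result
    except Exception:
        log.exception("[%s] charge failed", request_id)
        raise


if __name__ == "__main__":
    handle_payment({"amount": 42.00}, {"X-Request-ID": "req-123"})
```

It works, but it is exactly the trial-and-error loop described above: guess where to log, redeploy, wait for the bug to happen again, and repeat.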

I have experienced this problem as the new dev, the experienced dev, the team lead, the dev manager, and the CTO. Regardless of my role, this process has been painful and expensive. Countless hours are spent trying to identify, locate, and fix bugs in software, particularly when many services are involved.

Couldn’t have said it better myself.

“The main problem with debugging and finding the root cause in a distributed system is being able to recreate the state of the system when the error occurred so that you can obtain a holistic view.”

So we set out to create a solution.

Our organization wasted so much time troubleshooting that we decided to cut to the root of it and be done with logging and tracing, trial and error. What we built is a network-layer solution, requiring zero changes to source code, that lets you securely route API endpoints directly to the developer's IDE from any environment. A developer can create conditional filters that route API transactions and debug them in real time. We call it Vortex.
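Vortex's internals aren't the subject of this post, but to make the idea of a conditional filter concrete, here is a rough sketch of conditional routing, written as a tiny reverse proxy in Python using Flask and requests. Everything in it, the upstream address, the developer's address, and the X-Debug-User header, is a hypothetical illustration of the concept rather than Vortex's actual configuration or API: transactions that match the filter are diverted to the developer's machine, where a debugger is attached, and everything else flows to the real service untouched.

```
# Hypothetical illustration of conditional routing at the network layer.
# This is not Vortex; it only sketches the concept.
import requests
from flask import Flask, Response, request

app = Flask(__name__)

UPSTREAM = "http://orders.internal:8080"   # the real service (assumed address)
DEVELOPER = "http://10.0.0.42:8080"        # a developer's machine running the service under a debugger


def matches_filter(req):
    # Conditional filter: only transactions explicitly tagged for debugging are diverted.
    return req.headers.get("X-Debug-User") == "jonathan"


@app.route("/<path:path>", methods=["GET", "POST", "PUT", "PATCH", "DELETE"])
def proxy(path):
    target = DEVELOPER if matches_filter(request) else UPSTREAM

    upstream = requests.request(
        method=request.method,
        url=f"{target}/{path}",
        params=request.args,
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=30,
    )

    # Strip hop-by-hop headers before relaying the response to the caller.
    excluded = {"content-encoding", "content-length", "transfer-encoding", "connection"}
    headers = [(k, v) for k, v in upstream.headers.items() if k.lower() not in excluded]
    return Response(upstream.content, status=upstream.status_code, headers=headers)
```

In a real deployment the filter and the routing decision live in the network layer rather than in application code, which is what makes the zero-source-change claim possible, but the shape of the decision is the same: match a condition, divert that transaction, and leave everything else alone.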

Vortex is not meant to replace your existing development lifecycle, but it can speed it up by putting the problem right in your IDE. No more reproducing the error. No more documenting how it happened. One click, and the problematic transaction is in front of you.

The developer can then step through the code in their IDE, identify the problem, make a change locally, and push the change through to see the result. Once you have fixed the problem, you can then go through your normal deployment process with the changes.

Let us know what you think!


Written by Jonathan Wesley

Founder and CEO of Bitvisory Inc.
