This is the third and final part in a multi-part series on using OpenTelemetry and distributed tracing to level up your system observability.
In part 2, we looked at Velocity’s case study: our goal was to allow our developers to debug and understand complex workflows as quickly as possible.
In this final part, we’ll learn what we had to do to achieve our use case and discuss the benefits and challenges of this approach.
So, we’ve shown you how great OpenTelemetry is… but there is always room for improvement, and that’s exactly what we wanted: to make using traces as easy as possible for us.
So we took everything that bothered us and optimized and standardized it until it was a non-issue.
Usually, when writing logs, we need to craft a message that describes the event in English. Many times, after working on a complex feature, being satisfied with our solution, opening a pull request (or merge request), and waiting for code review, the review comes back with ten comments about grammar and language: “Your English isn’t right; use the present progressive.”
It’s super unproductive, and we wanted to do better. So, at runtime, when we produce a span, instead of defining its name as a hand-written string, we extract the name of the function and the complete path to it from the call stack, giving us names in the format “repository/package/object/method”. Since most of the functions in our code create a span, we get a kind of “always-on” profiling.
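Here’s a minimal Go sketch of the idea, assuming a small internal wrapper (the helper names are mine for illustration, not Velocity’s actual code): the span name is derived from the caller’s entry in the call stack rather than typed by hand.

```go
package tracing

import (
	"context"
	"runtime"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// callerSpanName derives a name like
// "github.com/acme/repo/pkg.(*Worker).Process" from the call stack,
// so nobody has to invent (or review) span names by hand.
func callerSpanName(skip int) string {
	pc, _, _, ok := runtime.Caller(skip)
	if !ok {
		return "unknown"
	}
	fn := runtime.FuncForPC(pc)
	if fn == nil {
		return "unknown"
	}
	return fn.Name() // already in "repository/package.(object).method" form
}

// StartSpan names the span after the function that called it.
func StartSpan(ctx context.Context) (context.Context, trace.Span) {
	// skip=2: frame 0 is callerSpanName, frame 1 is StartSpan,
	// frame 2 is the function we actually want to name the span after.
	return otel.Tracer("tracing").Start(ctx, callerSpanName(2))
}
```

With a helper like this, almost every function can simply open with `ctx, span := tracing.StartSpan(ctx); defer span.End()`, and the trace view doubles as a profile of the call tree.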
In OpenTelemetry, you can attach errors to spans. But we wanted to get a stack trace, too. So we attached stack traces to every error attached to a span. It’s pretty simple, but it can be very helpful: often the stack trace alone tells us what happened, and we can move on. We don’t always have to dig into traces to understand the problem.
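With the OpenTelemetry Go SDK, one way to get this behavior (a sketch, not necessarily how Velocity’s internal helper does it) is to ask RecordError to capture the stack at the point where the error is recorded:

```go
package example

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// recordErr attaches the error to the current span together with the
// stack trace at the point where it was recorded, so the on-call engineer
// often doesn't need to open the full trace at all.
func recordErr(ctx context.Context, err error) error {
	span := trace.SpanFromContext(ctx)
	span.RecordError(err, trace.WithStackTrace(true)) // adds exception.stacktrace
	span.SetStatus(codes.Error, err.Error())
	return err
}

func doWork(ctx context.Context) error {
	ctx, span := otel.Tracer("example").Start(ctx, "doWork")
	defer span.End()

	if err := step(ctx); err != nil {
		return recordErr(ctx, err)
	}
	return nil
}

func step(ctx context.Context) error { return errors.New("boom") }
```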
We love linting. It’s a great way to automate many “I think this - you think that” debates instead of having to add all sorts of annoying notes to the code review. So we decided to take anything we thought was important enough and add a custom lint rule for it.
To do this, we use a tool called Semgrep - semantic “grep”. It lets us define a pattern in any language we choose - in our case, around “CreateSpan”. If a function calls CreateSpan and returns an error, we want to wrap that error and report to OpenTelemetry that it happened under this span, so that when we look at the traces we’ll understand there was an error.
Our static analysis rule looks for functions that call CreateSpan and return an error without recording it on the span first.
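To make the convention concrete, here is a hedged Go illustration of the shape such a rule flags. I’m using the standard OpenTelemetry API here; in Velocity’s code the span comes from their CreateSpan wrapper.

```go
package example

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
)

// What the rule flags: a span is created, but an error path returns
// without recording the error on it. The fix is to call
// span.RecordError(err) and span.SetStatus(...) before the return,
// exactly as in the previous snippet.
func handleJob(ctx context.Context) error {
	ctx, span := otel.Tracer("example").Start(ctx, "handleJob")
	defer span.End()

	if err := process(ctx); err != nil {
		return err // <-- flagged: err never recorded on the span
	}
	return nil
}

func process(ctx context.Context) error { return errors.New("boom") }
```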
This way, we don’t have to look around for things or worry about somebody forgetting to record an error – the developer receives an error before she does a commit and everybody’s happy.
Often, during a post-mortem, we’ll say, “If only we’d had a log in the right place, we would have known much sooner why an event happened.” So we decided to gather spans by default – even if we end up with huge traces, we’d rather go the exclusion route. That means we automatically add spans around almost every call. If we feel it’s too much, we can always pull back a little.
At the end of the day, if I’m a developer dealing with an on-call issue, it doesn’t matter what I need to check: I have one defined place I can go to understand where my problems are. I don’t have to think about whether to look at metrics, logs, traces, or some other system. Traces answer all of our questions, which is very helpful when time is of the essence.
The moment that we introduced conventions in the way that we work, everything became much easier. It’s even simpler to write traces than logs, due to the structured manner in which we create them. Static analysis tools such as Semgrep help us avoid mundane usage errors via automation. Finally, traces just provide more context than logs, making it easier to troubleshoot issues with them.
The combination of OpenTelemetry with all the adjustments we made makes it very easy to take any code base and introduce traces to it. All we have to do is hook it up to the library. The infrastructure already exists, and we can start working very quickly.
Sometimes the product team comes to us asking for analytics on a feature to understand if it’s working. Instead of introducing a new tool, rolling out a new analytics system, reporting events, and all that stuff we don’t have time for, we realized that because we already report all this information, we have everything we need for proper analytics. With the way we’ve decided to work, we can know pretty much everything about our product usage after the fact.
Here’s an example. We looked at the usage rates of one of our features and plotted the data as a graph, so we could visualize and understand what’s happening in our system even at the product level.
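As a hedged sketch of what makes that kind of graph possible (the attribute names here are invented for the example), the same spans we use for debugging can carry product-level dimensions that the trace backend can later group and count:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// exportProject is an ordinary traced function; the extra attributes are
// what later let us slice spans by project or customer plan and plot usage
// over time, without adding a separate analytics pipeline.
func exportProject(ctx context.Context, projectID, plan string) error {
	ctx, span := otel.Tracer("example").Start(ctx, "exportProject")
	defer span.End()

	span.SetAttributes(
		attribute.String("project.id", projectID), // hypothetical attribute names
		attribute.String("customer.plan", plan),
	)

	// ... pass ctx on to the actual work here ...
	return nil
}
```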
Based on everything I've written so far, you might think this method is perfect. But it is important to note that it doesn’t work for everybody. We’ve seen companies with very high workloads of thousands and even hundreds of thousands of requests per second. For them, it’s just not practical to gather so much data.
That said, I do want to mention some things you can do with traces that will allow you to work with them even with high workloads and still receive some of their benefits.
In order to deal with large data quantities, OpenTelemetry has built-in support for sampling. Instead of collecting every trace, we can define the percentage of requests or processes whose traces will end up in the trace data store. The rest we lose. This is great for understanding what’s going on with the average process and what’s happening with slow requests, which you’ll still see within the chosen percentage.
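For instance, with the Go SDK, a ratio-based sampler can be configured on the tracer provider (a sketch with exporter setup omitted; the 10% ratio is just an example value):

```go
package tracing

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracerProvider() *sdktrace.TracerProvider {
	// Keep roughly 10% of traces. ParentBased makes child spans follow
	// their parent's sampling decision, so sampled traces stay complete.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		// sdktrace.WithBatcher(exporter) would be added here in real code.
	)
	otel.SetTracerProvider(tp)
	return tp
}
```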
But, of course, if there’s a very serious error, you might not be able to see it. To make up for this, you can use error monitoring tools like Sentry and BugSnag so you can report every bug and see when errors happen.
Because they’re just statistics that can be added and aggregated, metrics are lightweight. You can decide that if you want to know what’s happening with a particular feature or part of the system, you’ll report metrics where it’s necessary and receive the relevant information instead of using traces. It’s not perfect, but it’s definitely a solution that I’ve seen work well at other companies.
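As a rough sketch of what that looks like with the OpenTelemetry Go metrics API (the instrument and attribute names are made up for the example):

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var featureUses, _ = otel.Meter("example").Int64Counter(
	"feature.uses", // hypothetical instrument name
	metric.WithDescription("how often a given feature is used"),
)

// trackFeatureUse bumps a counter instead of emitting a full span. Counters
// are just aggregated numbers, so this stays cheap even at very high request rates.
func trackFeatureUse(ctx context.Context, feature string) {
	featureUses.Add(ctx, 1, metric.WithAttributes(attribute.String("feature", feature)))
}
```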
Finally, you can simply decide that every trace will contain much less information. You won’t get the same beautiful profiling and understand everything, but you can still receive a trace for every request and understand more or less what’s happening.
To summarize the benefits: by applying the strategies we’ve outlined in this blog series, we are giving on-call developers a single, well-defined place to go when something breaks, writing traces with less effort than we used to spend on logs, capturing far more context (including stack traces) for every error, and answering product analytics questions from data we already collect.
Using OpenTelemetry and distributed tracing the way we do isn’t magic, and it isn’t going to solve all of everybody’s problems. But it sure does a lot for us - and I hope you can find a way to benefit from it, too.