This is the first post in a multi-part series on using OpenTelemetry and distributed tracing to level up your system observability.
As developers, there’s no denying that the work we do and the systems we work with are complicated. It can be difficult to understand the flow of all of our microservices, event-driven architectures, and distributed systems. Bugs and unpredictable behavior make such systems even more difficult to comprehend and troubleshoot. At the end of the day, the monitoring systems and tools available to understand what’s going on just aren’t good enough. That’s the problem we faced at Velocity, and this is the story of how we solved it.
Before we dive into the details, let’s level-set about a few key terms.
The word telemetry is derived from the Greek roots “tele” (remote) and “metron” (measure) - remote measurement. With modern cloud systems, it’s possible to report different types of observability data from our services to remote systems, where it can ultimately be viewed in the monitoring tools of our choice.
Telemetry is made up of three main pillars: logs, metrics, and traces.
Logs are structured objects that contain different pieces of information - usually a timestamp, a severity level, and a message describing what happened.
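For instance, in Go, a structured log entry might be emitted with the standard library’s log/slog package (the field names here are invented for illustration):

```go
package main

import "log/slog"

func main() {
	// A structured log entry: a message plus key/value fields.
	// slog adds the timestamp and severity level automatically.
	slog.Info("user logged in",
		"user_id", 1234,         // hypothetical fields,
		"source_ip", "10.0.0.7", // purely for illustration
	)
}
```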
Metrics are numbers that you can conveniently represent on graphs. As you can see in our example here, the average number of readers while you’re reading this story is approximately 800 - not bad ;)
This allows us to easily know what’s happening in our systems at a higher level and see processes from a bird’s-eye view, which can be quite helpful for understanding where our problems are located.
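To make that concrete, here is a minimal sketch of recording a metric with the OpenTelemetry Go meter API; the meter and counter names are hypothetical, and until a real MeterProvider is configured the calls are harmless no-ops:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

func main() {
	ctx := context.Background()

	// Obtain a meter from the globally registered MeterProvider.
	meter := otel.Meter("blog-service") // hypothetical instrumentation name

	// Create a counter instrument once...
	pageViews, err := meter.Int64Counter("page.views")
	if err != nil {
		log.Fatal(err)
	}

	// ...and record a data point every time the page is served.
	pageViews.Add(ctx, 1)
}
```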
Now on to the really fun stuff - traces. A trace represents a complete request in the system, a job, or any complete process that has a beginning and end.
You can essentially think of traces as a tree of spans, which are smaller, logical processes that we as developers decide how to scope. The control is in our hands to represent our systems however we imagine them - that is what makes them so powerful. Spans have a beginning and end, which makes it easier for us to understand, well, when something begins and then ends.
Let’s take a look at an example.
Above, we can see two types of visual representations of traces. On the left, we see the trace in tree form; on the right, the trace is presented as a timeline. Each individual segment (circle or rectangular block) is a span. Let’s take a look at the simple flow of a login.
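To get a feel for how such a trace is produced, here is a rough sketch of that login flow expressed as nested spans with the OpenTelemetry Go tracing API; the service, function, and span names are made up for illustration:

```go
package auth

import (
	"context"

	"go.opentelemetry.io/otel"
)

// The tracer comes from the globally registered TracerProvider.
var tracer = otel.Tracer("auth-service") // hypothetical service name

// Login opens the parent span; each step below opens a child span,
// so the whole flow shows up as a single trace tree.
func Login(ctx context.Context, username, password string) error {
	ctx, span := tracer.Start(ctx, "login")
	defer span.End()

	if err := validateCredentials(ctx, username, password); err != nil {
		return err
	}
	return createSession(ctx, username)
}

func validateCredentials(ctx context.Context, username, password string) error {
	_, span := tracer.Start(ctx, "validate-credentials")
	defer span.End()
	// ...look the user up and check the password...
	return nil
}

func createSession(ctx context.Context, username string) error {
	_, span := tracer.Start(ctx, "create-session")
	defer span.End()
	// ...issue a session token...
	return nil
}
```

In a trace viewer, the two child spans would appear nested under the login span, one after the other on the timeline.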
With the help of traces, it’s possible to clearly understand what led to what and which processes happened in parallel by using both a timeline and a hierarchy.
OpenTelemetry is a collection of tools, APIs, SDKs, and protocols that allow us to instrument our code and to generate, collect, and export telemetry data - metrics, logs, and traces.
What’s great about OpenTelemetry is that it is open source and vendor-neutral. That means users can instrument their code once, using an OpenTelemetry SDK, and then switch their monitoring system to any other one they choose with only a configuration change.
As you can see, OpenTelemetry supports nearly every programming language that you can imagine. Since OpenTelemetry is relatively new, the maturity level varies from one programming language to another. At Velocity, we use Golang and it works great.
Similarly, OpenTelemetry backends - typically the observability products and solutions on the receiving end - include a wide variety of commercial and open-source vendors: Prometheus, Jaeger, Datadog, New Relic, etc. This allows us to use OpenTelemetry knowing that no matter which technology we choose, it’s supported.
So, how does using OpenTelemetry work?
Developers can send the metrics to Prometheus and the traces to Jaeger and it all works. If they change their mind and decide to replace the monitoring system, everything keeps working – that’s the beauty of being vendor-neutral. This way, they know it will continue to serve them in the future no matter which technology they choose.
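As a rough illustration of what that looks like in Go, here is a minimal sketch that wires the SDK up to an OTLP trace exporter; the endpoint is a placeholder, and pointing it at a different OTLP-compatible backend (a Jaeger instance, an OpenTelemetry Collector, a commercial vendor) is purely a configuration change - the instrumentation code that creates spans stays the same:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// The exporter speaks the vendor-neutral OTLP protocol; swapping the
	// backend only means changing this endpoint (placeholder value below).
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Register a TracerProvider that batches spans and ships them out.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	defer func() { _ = tp.Shutdown(ctx) }()

	// Application code creates spans via otel.Tracer(...) as usual;
	// it never needs to know which backend is on the other end.
}
```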
Stay tuned for Part 2, where we’ll showcase how we decided to use OpenTelemetry tracing in Velocity, and discuss the benefits and challenges of this approach.