Goodbye Logs, Hello Traces: Part 2 - See it in the wild

October 6, 2022

This is part two in a multi-part series on using OpenTelemetry and distributed tracing to level up your system observability.

In part 1, we covered the basics of telemetry and introduced OpenTelemetry, a collection of tools, APIs, and SDKs that can be used to instrument, generate, collect, and export telemetry data (metrics, logs, and traces).

To see how we use OpenTelemetry (OTEL for short) at Velocity, and what you can take away from our experience, let’s walk through Velocity as a case study.

Case study: Velocity

At Velocity, our goal was to let our developers debug and understand complex workflows as quickly as possible, reducing both the time it takes to detect a malfunction (MTTD) and the time it takes to recover from it (MTTR).

To make sure we always have the information we need, we decided to fully embrace tracing and cover most of our system with traces. Every request, whether it passes through the client, the server, the API, or other backend systems, and nearly every central function along the way, is wrapped in an OpenTelemetry span. That’s right; we’re practically running system profiling for every request, all the time.
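To make "nearly every central function is covered with a span" concrete, here is a minimal sketch of the idea using only the Python standard library. It is a toy stand-in, not the OpenTelemetry SDK and not Velocity's actual code: the `traced` decorator, the `RECORDED_SPANS` list, and `create_environment` are all hypothetical names made up for illustration.

```python
import functools
import time

# Stand-in for a span exporter; a real setup would ship spans to a backend.
RECORDED_SPANS = []


def traced(func):
    """Wrap a function so every call records a span-like dict."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"name": func.__qualname__, "args": args, "kwargs": kwargs, "error": None}
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Keep the error on the span itself, then let it propagate.
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            RECORDED_SPANS.append(record)
    return wrapper


@traced
def create_environment(name, services=1):
    # Hypothetical "central function" we want covered by a span.
    return f"{name}:{services}"


create_environment("demo", services=3)
```

Because the decorator captures the arguments, the duration, and any exception on every call, each wrapped function contributes one record without any tracing code in its body.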

So what does that look like?

Well, Velocity’s product automatically spins up production-like environments on k8s for easier development and testing. Our customers’ systems are already complicated enough, and our system is the one spinning them up, so you can probably imagine that our flow is also pretty complex. We often aren’t familiar with our customers’ flows, which makes it really difficult to understand what happened and where. We need the very best tools, and that’s where distributed traces come in.

The “distributed” in “distributed trace” means that multiple processes, systems, and technologies can be involved in the same trace. A trace can start with a client, continue to an API endpoint, and end in our main worker – all in the same trace. Tracing allows us to visualize that same request as it moves through different services. Each service has its own set of spans and all the information we need to understand exactly what happened.
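What ties spans from different processes into one trace is a shared trace ID that travels with each request. The sketch below illustrates this with the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`), using only the standard library. The helper names (`start_trace`, `inject`, `extract`) are made up for this example, and in practice an OpenTelemetry SDK handles this propagation for you.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def start_trace() -> dict:
    """Start a brand-new trace in the first process (for example, the client)."""
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(), "parent_id": None}


def inject(span: dict) -> dict:
    """Build outgoing request headers that carry the current span's context."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}


def extract(headers: dict) -> dict:
    """Continue the caller's trace in the receiving process."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return start_trace()  # no valid context: begin a fresh trace
    trace_id, parent_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_id": parent_id}


# Client -> API endpoint -> worker, each hop passing its context along.
client_span = start_trace()
api_span = extract(inject(client_span))
worker_span = extract(inject(api_span))
```

All three spans end up with the same `trace_id`, and each one's `parent_id` points at the span that called it, which is what lets a tracing backend stitch them into a single tree.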

If we looked at what’s happening in our system through logs, we’d see a long list of discrete event objects. We could filter them later, but a flat list doesn’t show the hierarchy of requests. If things are happening at the same time or causing one another, you can’t tell just from looking at logs. For us at Velocity, this wasn’t good enough.

Then we found traces and all the amazing things they can do for us. For example…

Hierarchical display with start and end markers

Traces allow us to look at our information hierarchically and know when something started and ended. As you can see here, the colorful display clearly shows us what is happening in our command line, the API server, and the worker. Every span has a beginning and an end, allowing us to understand the exact scope, context, and chain of events for each error.

When using tracing, errors are saved as part of the span data. This means we always know where they came from. If we want to understand how an error happened, we just go to the tree, see which parameters were sent to every span, and easily glean the origin of the problem.
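The following sketch models that idea in plain Python: spans form a tree, each span keeps its start time, end time, parameters, and any error, and finding the origin of a failure is a walk down to the deepest span that recorded one. This is a toy model under our own assumptions, not the OpenTelemetry API; `Span`, `Tracer`, `error_origin`, and the span names are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    attributes: dict
    parent: Optional["Span"] = field(default=None, repr=False, compare=False)
    start: float = 0.0
    end: float = 0.0
    error: Optional[str] = None
    children: list = field(default_factory=list)


class Tracer:
    def __init__(self):
        self.root = None
        self._current = None

    @contextmanager
    def span(self, name, **attributes):
        span = Span(name, attributes, parent=self._current, start=time.perf_counter())
        if self._current is None:
            self.root = span
        else:
            self._current.children.append(span)
        self._current = span
        try:
            yield span
        except Exception as exc:
            span.error = f"{type(exc).__name__}: {exc}"  # error lives on the span
            raise
        finally:
            span.end = time.perf_counter()
            self._current = span.parent


def error_origin(span):
    """Return the deepest span that recorded an error, or None."""
    for child in span.children:
        found = error_origin(child)
        if found is not None:
            return found
    return span if span.error else None


tracer = Tracer()
try:
    with tracer.span("cli.create_env", env="demo"):
        with tracer.span("api.create_env", env="demo"):
            with tracer.span("worker.deploy", image="web:latest", replicas=-1):
                raise ValueError("replicas must be positive")
except ValueError:
    pass

origin = error_origin(tracer.root)
print(origin.name, origin.attributes, origin.error)
```

The exception marks every span it passes through on its way up, but the deepest marked span, together with the parameters it received, points at where the problem actually started.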

Latency distribution graph

A latency distribution graph lets us easily see where our slowest requests or processes are, debug them, and understand why they are so slow. Since spans are nested and every span records a start and end time, we can see exactly where our bottleneck is.
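As a rough illustration of how start and end times lead to a bottleneck, the sketch below takes a handful of made-up spans, buckets their durations into a simple histogram, and computes each span's "self time" (its duration minus the time spent in its direct children). The span data and names are invented, and the self-time calculation assumes that sibling spans do not overlap in time.

```python
from collections import Counter

# Hypothetical finished spans: (span_id, parent_id, name, start_ms, end_ms)
TIMED_SPANS = [
    ("a", None, "api.create_env", 0, 900),
    ("b", "a", "db.load_project", 10, 60),
    ("c", "a", "worker.deploy", 70, 880),
    ("d", "c", "k8s.apply", 80, 200),
    ("e", "c", "k8s.wait_ready", 200, 870),
]


def histogram(durations_ms, bucket_ms=250):
    """Count durations per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((d // bucket_ms) * bucket_ms for d in durations_ms)
    return dict(sorted(counts.items()))


def self_times(spans):
    """Duration of each span minus the time spent in its direct children."""
    totals = {sid: end - start for sid, _, _, start, end in spans}
    names = {sid: name for sid, _, name, _, _ in spans}
    child_time = Counter()
    for _, parent, _, start, end in spans:
        if parent is not None:
            child_time[parent] += end - start
    return {names[sid]: totals[sid] - child_time[sid] for sid in totals}


durations = [end - start for _, _, _, start, end in TIMED_SPANS]
distribution = histogram(durations)
bottleneck = max(self_times(TIMED_SPANS).items(), key=lambda item: item[1])
print(distribution)
print(bottleneck)
```

The top-level request looks slow at 900 ms, but subtracting child time shows that almost all of it is spent in the `k8s.wait_ready` span, which is the kind of answer a flat log line cannot give.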

Network dependency graph

Without defining anything specific, the monitoring system knows how to parse traces and display them as a graph that easily demonstrates which component is communicating with which other components. This is incredibly useful when we’re dealing with large systems and can’t figure out where an event, request, or bug is located.
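One way a backend could derive such a graph, sketched here under our own assumptions: every span knows its parent and the service that emitted it, so whenever a span's service differs from its parent's service, that is an edge from caller to callee. The span data and service names below are hypothetical, and real monitoring systems may use richer signals than this.

```python
from collections import defaultdict

# Hypothetical spans: (span_id, parent_id, service)
SERVICE_SPANS = [
    ("1", None, "cli"),
    ("2", "1", "api-server"),
    ("3", "2", "api-server"),
    ("4", "2", "worker"),
    ("5", "4", "kubernetes-api"),
    ("6", "3", "postgres"),
]


def dependency_graph(spans):
    """Map each calling service to the sorted list of services it calls."""
    service_of = {span_id: service for span_id, _, service in spans}
    graph = defaultdict(set)
    for _, parent_id, service in spans:
        caller = service_of.get(parent_id)
        # Only a parent/child pair that crosses a service boundary is an edge.
        if caller is not None and caller != service:
            graph[caller].add(service)
    return {caller: sorted(callees) for caller, callees in graph.items()}


graph = dependency_graph(SERVICE_SPANS)
for caller, callees in graph.items():
    print(f"{caller} -> {', '.join(callees)}")
```

Nothing about the topology is declared up front; the edges fall out of the parent-child relationships that the traces already contain.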

External dependencies

Some external components are outside of our codebase. We cannot choose how to instrument them with traces and where/what information to record. However, OTEL’s wide automatic instrumentation support means that we can usually see these components as well. This can be very helpful in allowing us to understand what’s going on because errors are often located in external services that aren’t under our control. Using this, we can understand what’s happening in our system much faster.

Stay tuned for the next part, where we will cover how to level it all up and discuss the advantages and disadvantages of the approach we took.

Join the discussion!

Have any questions or comments about this post? Maybe you have a similar project or an extension to this one that you'd like to showcase? Join the Velocity Discord server to ask away, or just stop by to talk K8s development with the community.
