Goodbye Logs, Hello Traces: Part 1 - A proper introduction

As developers, we can’t deny that the work we do and the systems we work with are complicated. It can be difficult to understand the flow through our microservices, event-driven architectures, and distributed systems. Bugs and unpredictable behavior make such systems even more difficult to comprehend and troubleshoot. Too often, the monitoring systems and tools available to understand what’s going on just aren’t good enough. That’s the problem we faced at Velocity, and this is the story of how we solved it.

Basic terms in the observability world

Before we dive into the details, let’s level-set about a few key terms.

Telemetry

The word telemetry is derived from the Greek words “tele” (remote) and “metron” (measure) - remote measurement. In modern cloud systems, applications report different types of observability data to remote backends, where the data is ultimately viewed in the monitoring systems of your choice.

Telemetry is made up of three main pillars:

  • Logging - Records of discrete events
  • Metrics - Essentially numbers or values that developers can aggregate, run calculations on, display on graphs, etc.
  • Tracing - Traces show a flow through the system as a whole, rather than a discrete event or a single number

Logs

Logs are structured records that contain several pieces of information. Usually, these are:

  • The time when an event happened
  • The service from which the event was sent
  • A text description of the event
  • Additional context such as the account, the tenant ID, which client is involved, and so on

Sample code
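
As a rough illustration of the fields listed above, here is a minimal sketch using Go’s standard log/slog package (Go 1.21 or later). The service name, message, and attribute keys (account, tenant_id, client) are invented for this example and are not taken from Velocity’s code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// buildLogEntry writes one structured log record as JSON and returns it decoded.
// All field values below are illustrative assumptions.
func buildLogEntry() (map[string]any, error) {
	var buf bytes.Buffer

	// Every record from this logger carries the name of the emitting service.
	logger := slog.New(slog.NewJSONHandler(&buf, nil)).With("service", "auth-api")

	// slog adds the "time", "level", and "msg" keys on its own.
	logger.Info("user logged in",
		"account", "acme",
		"tenant_id", "t-42",
		"client", "web",
	)

	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return nil, err
	}
	return entry, nil
}

func main() {
	entry, err := buildLogEntry()
	if err != nil {
		panic(err)
	}
	fmt.Println(entry["time"], entry["service"], entry["msg"], entry["tenant_id"])
}
```

The record is decoded back into a map only so the individual fields are easy to inspect. A real service would typically write straight to standard output or a log shipper.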

Metrics

Metrics are numbers that you can very conveniently represent on graphs. As you can see in our example here, the average number of people reading this story at any given time is approximately 800 - not bad ;)

Graph of readers

This allows us to easily know what’s happening in our systems at a higher level, aggregate values, and see processes from a bird’s eye view, which can be quite helpful for understanding where our problems are located.
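
To make the idea of aggregating a metric concrete, here is a small sketch in plain Go. The Gauge type and the sample values are hypothetical. Real metric SDKs aggregate far more efficiently than by storing every sample.

```go
package main

import "fmt"

// Gauge is a hypothetical metric that keeps every sample so it can be
// aggregated later.
type Gauge struct {
	samples []float64
}

// Record stores one observation of the measured value.
func (g *Gauge) Record(v float64) {
	g.samples = append(g.samples, v)
}

// Average returns the mean of all samples, or 0 when none were recorded.
func (g *Gauge) Average() float64 {
	if len(g.samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range g.samples {
		sum += v
	}
	return sum / float64(len(g.samples))
}

// Max returns the largest sample, or 0 when none were recorded.
func (g *Gauge) Max() float64 {
	if len(g.samples) == 0 {
		return 0
	}
	max := g.samples[0]
	for _, v := range g.samples[1:] {
		if v > max {
			max = v
		}
	}
	return max
}

func main() {
	readers := &Gauge{}
	// Invented reader counts sampled at three points in time.
	for _, v := range []float64{780, 800, 820} {
		readers.Record(v)
	}
	fmt.Println("average readers:", readers.Average())
	fmt.Println("peak readers:", readers.Max())
}
```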

Traces

Now on to the really fun stuff - traces. A trace represents a complete request in the system, a job, or any complete process that has a beginning and end.

You can essentially think of traces as a tree of Spans, which are smaller, logical processes that we as developers decide how to scope. The control is in our hands to represent our systems as we imagine. That is what makes them so powerful. Every span has a beginning and an end, which makes it easy to see how long each step took and how the steps relate to one another in time.

Let’s take a look at an example.

Trace and span illustration

Above, we can see two types of visual representations for traces. On the left, we see the trace in tree form; on the right, the trace is presented as a timeline. Each individual segment (circle or rectangular block) is a span. Let’s take a look at the simple flow of a login:

  1. A person presses the login button on a browser
  2. A command or request is sent to our API server at the path /login
  3. The server queries the DB to check whether the user’s credentials are correct
  4. A “send event” process reports the fact that there is a login in the system, which can be used for analytics
  5. The browser does a redirect to our home page or dashboard

With the help of traces, it’s possible to clearly understand what led to what and which processes happened in parallel by using both a timeline and a hierarchy.
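
The login flow above can be modeled as a small span tree. This is a simplified sketch in plain Go, not the OpenTelemetry API. The Span type, the span names, and all timings are invented, and the login event is assumed to be sent asynchronously so that it overlaps the redirect.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Span is a hypothetical, simplified span: a named step with start and end
// offsets measured from the start of the trace, plus its child spans.
type Span struct {
	Name     string
	Start    time.Duration
	End      time.Duration
	Children []*Span
}

// Duration returns how long the span took.
func (s *Span) Duration() time.Duration { return s.End - s.Start }

// Render returns the span and its descendants as indented lines (the tree
// view), with each span's start and end offsets (the timeline view).
func (s *Span) Render(depth int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s%s [%v - %v]\n", strings.Repeat("  ", depth), s.Name, s.Start, s.End)
	for _, c := range s.Children {
		b.WriteString(c.Render(depth + 1))
	}
	return b.String()
}

// Overlaps reports whether two spans were running at the same time.
func Overlaps(a, b *Span) bool {
	return a.Start < b.End && b.Start < a.End
}

// loginTrace builds the login flow with made-up timings.
func loginTrace() *Span {
	ms := time.Millisecond
	query := &Span{Name: "db: check credentials", Start: 15 * ms, End: 45 * ms}
	event := &Span{Name: "send login event", Start: 50 * ms, End: 80 * ms}
	api := &Span{Name: "POST /login", Start: 10 * ms, End: 55 * ms, Children: []*Span{query, event}}
	redirect := &Span{Name: "redirect to dashboard", Start: 60 * ms, End: 90 * ms}
	return &Span{Name: "click login button", Start: 0, End: 90 * ms, Children: []*Span{api, redirect}}
}

func main() {
	root := loginTrace()
	fmt.Print(root.Render(0))

	event := root.Children[0].Children[1]
	redirect := root.Children[1]
	fmt.Println("event overlaps redirect:", Overlaps(event, redirect))
}
```

The parent and child links give the hierarchy (what led to what), while the start and end offsets give the timeline (what ran in parallel).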

OpenTelemetry - and how it can solve all of our problems

OpenTelemetry is a collection of tools, APIs, SDKs, and protocols that allow us to:

  • Automatically create all sorts of metrics, traces, and logs - every type of telemetry data
  • Define them
  • Collect them
  • Export them to our monitoring systems

What’s great about OpenTelemetry is that it is open source and vendor-neutral. That means that users can instrument their code once with the SDK that OpenTelemetry offers, and then switch to any other monitoring system they choose with only a configuration change.
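
The idea of switching backends through configuration alone can be sketched with a plain Go interface. This is not the OpenTelemetry SDK. The Exporter interface, both implementations, and the TRACE_BACKEND variable are invented for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// Exporter is a hypothetical stand-in for a telemetry backend.
type Exporter interface {
	Export(spanName string) string
}

type jaegerExporter struct{}

func (jaegerExporter) Export(spanName string) string { return "jaeger <- " + spanName }

type datadogExporter struct{}

func (datadogExporter) Export(spanName string) string { return "datadog <- " + spanName }

// NewExporter picks a backend by name, so application code never has to
// mention a specific vendor.
func NewExporter(backend string) (Exporter, error) {
	switch backend {
	case "jaeger":
		return jaegerExporter{}, nil
	case "datadog":
		return datadogExporter{}, nil
	default:
		return nil, fmt.Errorf("unknown backend %q", backend)
	}
}

func main() {
	// TRACE_BACKEND is an invented setting; it defaults to "jaeger" here.
	backend := os.Getenv("TRACE_BACKEND")
	if backend == "" {
		backend = "jaeger"
	}

	exporter, err := NewExporter(backend)
	if err != nil {
		fmt.Println(err)
		return
	}

	// The instrumentation call is identical for every backend.
	fmt.Println(exporter.Export("POST /login"))
}
```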

Logos of supported technologies

As you can see, OpenTelemetry supports most popular programming languages. Since OpenTelemetry is relatively new, the maturity level varies from one programming language to another. At Velocity, we use Golang and it works great.

Similarly, the OpenTelemetry backends, typically the observability products and solutions, include a wide variety of commercial and open-source vendors - Prometheus, Jaeger, Datadog, New Relic, etc. This allows us to use OpenTelemetry knowing that whichever of these technologies we choose, it’s supported.

OpenTelemetry architecture

So, how does using OpenTelemetry work?

OpenTelemetry architecture flow chart

  • A developer imports the OpenTelemetry SDK in their application code and uses it.
  • The SDK reports telemetry to an OpenTelemetry Collector agent, which usually sits on the same machine as the application for low latency. The Collector agent can also report metrics from the host like CPU, memory, etc.
  • The Collector agent sends the telemetry to a centralized OpenTelemetry Collector service - this is where the magic happens.
  • The Collector service batches the telemetry it receives, converts it to the format that the chosen monitoring system expects, and exports it to that system, or to multiple monitoring systems if desired.

Developers can send metrics to Prometheus and traces to Jaeger, and it all works. If they change their mind and decide to replace the monitoring system, everything keeps working - that’s the beauty of being vendor-neutral. This way, they know their instrumentation will continue to serve them in the future no matter which technology they choose.
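
The batching and fan-out behavior of the Collector service can be sketched roughly as follows. This is a toy model in plain Go, not the real OpenTelemetry Collector. The Collector, Sink, and memorySink types are hypothetical, and format conversion is left out for brevity.

```go
package main

import "fmt"

// Sink is a hypothetical destination, for example a tracing or metrics backend.
type Sink interface {
	Send(batch []string)
}

// memorySink records the batches it receives so the sketch can be inspected.
type memorySink struct {
	name    string
	batches [][]string
}

func (m *memorySink) Send(batch []string) {
	// Copy the batch so later changes by the caller cannot affect it.
	m.batches = append(m.batches, append([]string(nil), batch...))
}

// Collector buffers incoming items and forwards them to every sink in batches.
type Collector struct {
	batchSize int
	pending   []string
	sinks     []Sink
}

// NewCollector creates a collector that flushes after batchSize items.
func NewCollector(batchSize int, sinks ...Sink) *Collector {
	return &Collector{batchSize: batchSize, sinks: sinks}
}

// Receive buffers one item and flushes once the batch is full.
func (c *Collector) Receive(item string) {
	c.pending = append(c.pending, item)
	if len(c.pending) >= c.batchSize {
		c.Flush()
	}
}

// Flush sends the pending items to all sinks and clears the buffer.
func (c *Collector) Flush() {
	if len(c.pending) == 0 {
		return
	}
	for _, s := range c.sinks {
		s.Send(c.pending)
	}
	c.pending = nil
}

func main() {
	jaeger := &memorySink{name: "jaeger"}
	backup := &memorySink{name: "backup"}
	collector := NewCollector(2, jaeger, backup)

	for _, span := range []string{"POST /login", "db query", "send event"} {
		collector.Receive(span)
	}
	collector.Flush() // push out the final, partially filled batch

	for _, s := range []*memorySink{jaeger, backup} {
		fmt.Println(s.name, "received", len(s.batches), "batches:", s.batches)
	}
}
```

Because the application only talks to the collector, adding or replacing a sink changes the wiring in one place and leaves the instrumented code alone.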

Stay tuned for Part 2, where we’ll showcase how we decided to use OpenTelemetry tracing in Velocity, and discuss the benefits and challenges of this approach.