Distributed Tracing for Web Applications

Aug. 23, 2020

I’ve had observability on my mind, and I recently got the opportunity to shoehorn some tracing information into our events at work. The idea is pretty straightforward, and luckily there’s even a recommended standard nowadays for how to pass the data along.

Basic Components

Trace ID

In its most basic form, a trace is just an identifier that you generate for a request and pass along to all the activity that follows. Boom, now you have something you can join on to link all that activity. This is especially handy when searching logs for information about “random” errors.
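As a minimal sketch of that idea, assume every log entry is a plain object that carries the trace ID. The `logEvent` and `findByTrace` helpers and the field names are hypothetical, made up for illustration only.

```javascript
// Hypothetical in-memory log store; a real system would ship these entries
// to whatever log aggregator you already use.
const logs = [];

// Every entry records the trace ID of the request that caused it.
function logEvent(traceId, app, message) {
  logs.push({ traceId, app, message });
}

// "Joining" on the trace ID is just a filter over the entries.
function findByTrace(traceId) {
  return logs.filter((entry) => entry.traceId === traceId);
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
logEvent(traceId, 'web', 'GET /checkout');
logEvent(traceId, 'payments', 'card declined');
logEvent('00000000000000000000000000000001', 'web', 'GET /home');

// Only the two entries that belong to the checkout request come back.
const related = findByTrace(traceId);
```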

Span ID

Let’s say you want a bit more granularity. You can also break up the work within a request into smaller units, which we call spans. How you split them is up to you, for example one span per application.

If you wanted to get real fancy, whenever you record your trace data you could also record the ID of the span that triggered the current one, aka the parent span ID.
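To make the parent link concrete, here is a small sketch that assumes each recorded span is a plain object with `traceId`, `spanId`, and `parentSpanId` fields. The field names, the short IDs, and the `renderTree` helper are all hypothetical. Following the parent links is enough to rebuild the call tree.

```javascript
// Hypothetical span records: each carries the shared trace ID, its own ID,
// and the ID of the span that triggered it (null for the root span).
const spans = [
  { traceId: 't1', spanId: 'a', parentSpanId: null, name: 'browser: page load' },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'api: GET /orders' },
  { traceId: 't1', spanId: 'c', parentSpanId: 'b', name: 'db: select orders' },
  { traceId: 't1', spanId: 'd', parentSpanId: 'a', name: 'api: GET /user' },
];

// Render the spans as indented lines by recursively following parentSpanId.
function renderTree(records, parentSpanId = null, depth = 0) {
  return records
    .filter((s) => s.parentSpanId === parentSpanId)
    .flatMap((s) => [
      '  '.repeat(depth) + s.name,
      ...renderTree(records, s.spanId, depth + 1),
    ]);
}

const tree = renderTree(spans);
console.log(tree.join('\n'));
```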

TraceContext

The W3C Trace Context spec basically attempts to standardize how we pass trace information from one application to another via two headers, traceparent and tracestate. Since tracestate is for letting vendors basically put whatever they want, let’s stick to looking at traceparent.

It’s pretty much the fields we mentioned above: the trace ID and the parent span ID, aka the parent ID. The other fields are fairly straightforward: a version to indicate which version of the spec we’re on, and flags for trace configs we want to pass along (the only one right now is whether to sample or not).
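For reference, a version 00 traceparent value is four lowercase hex fields joined by dashes: a 2-character version, a 32-character trace ID, a 16-character parent ID, and 2 characters of flags. Below is a minimal sketch of parsing and building one. The `parseTraceparent` and `formatTraceparent` names are my own, and the parser deliberately handles only the version 00 layout rather than the spec’s rules for future versions.

```javascript
// version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null if the header is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  // The lowest flag bit is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentId, sampled };
}

// Builds a version 00 header value from already-validated IDs.
function formatTraceparent({ traceId, parentId, sampled }) {
  return `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
```

Here `parsed.traceId` is the value every downstream service keeps unchanged, while `parsed.parentId` is what each service records as the parent of the spans it creates.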

In the Browser

Splitting up traces by request makes sense for backend applications, but didn’t quite jibe for me on the browser side. After a bit of searching, I found this article extremely helpful for getting a lay of the land for tracing on the client side. Basically, Emily suggested creating traces per page or per session.

I thought this made sense for most browser applications, but was still unsure whether there’s a “standard” if your application is a mix of single page applications (SPAs) and server-side rendered (SSR) pages. I suppose we could have our backend fake it with an expected parent span ID, somehow send this information back to the caller, and hope they use it.

Turns out, after doing some digging in the trace-context GitHub repo, there’s a spec for a traceresponse header! And what do you know, as part of it they have a trace-id and a proposed-parent-id. It seems like there’s still a bit of debate here, however, since normally you can’t read the response headers of a page load from JavaScript in the browser. So that’s a bit of a downer.

Using the same idea, though, there are a couple of workarounds, like passing the information in cookies or embedding it in DOM elements. In my particular case, we were already jerry-rigging initial state into our page, so I just went with that.

// Returns `len` random hex characters (len must be even).
function generateHexString(len) {
  const arr = new Uint8Array(len / 2);
  window.crypto.getRandomValues(arr);
  return Array.from(arr, (n) => n.toString(16).padStart(2, '0')).join('');
}

// Pick ONE of the three readers below, depending on how the server
// hands the IDs to the page.

// via cookies
function readFromCookies() {
  const cookies = document.cookie.split(';').map((s) => s.trim());
  const get = (name) => {
    const row = cookies.find((c) => c.startsWith(name + '='));
    return row ? row.split('=')[1] : undefined;
  };
  return { traceId: get('trace-id'), spanId: get('proposed-parent-id') };
}

// via meta tags
function readFromMetaTags() {
  const get = (name) => {
    const meta = document.querySelector(`meta[name="${name}"]`);
    return meta ? meta.getAttribute('content') : undefined;
  };
  return { traceId: get('trace-id'), spanId: get('proposed-parent-id') };
}

// via global vars
function readFromGlobals() {
  return { traceId: window.TRACE_ID, spanId: window.PROPOSED_PARENT_ID };
}

function loadTraceData(read = readFromGlobals) {
  const { traceId, spanId } = read();
  // Fall back to fresh IDs if the server did not provide any.
  return {
    rootTraceId: traceId || generateHexString(32),
    rootSpanId: spanId || generateHexString(16)
  };
}
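The browser code above only reads the IDs, so something on the server has to put them into the page first. As a rough sketch of the meta tag variant, assuming a server that builds its HTML as strings, it could look like the following. The `renderTraceMetaTags` helper is hypothetical and not part of any framework.

```javascript
const HEX_RE = /^[0-9a-f]+$/;

// Returns the two <meta> tags for the page <head>.
// Only well-formed lowercase hex IDs are accepted, so nothing unexpected
// can end up inside the generated HTML.
function renderTraceMetaTags(traceId, proposedParentId) {
  if (traceId.length !== 32 || !HEX_RE.test(traceId)) {
    throw new Error('trace ID must be 32 lowercase hex characters');
  }
  if (proposedParentId.length !== 16 || !HEX_RE.test(proposedParentId)) {
    throw new Error('proposed parent ID must be 16 lowercase hex characters');
  }
  return [
    `<meta name="trace-id" content="${traceId}">`,
    `<meta name="proposed-parent-id" content="${proposedParentId}">`,
  ].join('\n');
}

const metaTags = renderTraceMetaTags(
  '4bf92f3577b34da6a3ce929d0e0e4736',
  '00f067aa0ba902b7'
);
```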

I think from there we can proceed as normal, splitting up the span IDs however we like. My implemented solution was a little messier than this, but we finished quickly and it seems good enough for now. When I get time™ I’d rather move to using a third party to handle the handoffs anyway.
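For what “proceed as normal” might look like, here is one possible sketch: create a child span for each outgoing request and send it along in a traceparent header. The `startChildSpan` and `tracingHeaders` helpers are hypothetical, the `/api/orders` URL is a placeholder, and the flags are hard-coded to 01 (sampled) as a simplifying assumption.

```javascript
// Default span ID generator: 8 random bytes as 16 hex characters.
// Assumes the Web Crypto global is available, as it is in browsers.
function randomSpanId() {
  const bytes = globalThis.crypto.getRandomValues(new Uint8Array(8));
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// A child span keeps the trace ID and remembers which span triggered it.
function startChildSpan(parent, newSpanId = randomSpanId) {
  return {
    traceId: parent.traceId,
    spanId: newSpanId(),
    parentSpanId: parent.spanId,
  };
}

// The receiving service sees this span's ID as its parent ID.
function tracingHeaders(span) {
  return { traceparent: `00-${span.traceId}-${span.spanId}-01` };
}

// Usage in the browser, with the values returned by loadTraceData():
// const span = startChildSpan({ traceId: rootTraceId, spanId: rootSpanId });
// fetch('/api/orders', { headers: tracingHeaders(span) });
```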