Drilling down on Rust Performance Bottlenecks with tokio-tracing and texray
When a Rust program feels sluggish, adding instrumentation can shine a light on where the time is going. In this post, we'll walk through a guided journey of using Tokio's tracing framework and the tracing-texray tool to drill into performance issues. We assume you're familiar with the basics of tokio-tracing (if not, see the Tokio tracing introduction for the fundamentals of spans and events). Our journey will start with a simple sequential task, then ramp up to parallel execution, illustrating how to maintain insight at each step.
Setting the Stage: A Sequential File Reader
Imagine we have a Rust program that reads a list of files one by one. Perhaps it’s processing log files or loading data at startup. For this example, we’ll use a mock scenario of reading a few files from disk sequentially. Here’s a simplified version of the code without any instrumentation:
use std::fs;

fn main() {
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    for file in files {
        // Read the entire file
        let _data = fs::read_to_string(file)
            .expect("Failed to read file");
    }
}
This code will successfully read the files, but if the overall operation is slow, we have no visibility into which file or step is the culprit. We could measure the total execution time, but that wouldn’t tell us which file took most of the time. We need more granular insight.
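To make that limitation concrete, here is a minimal stopwatch sketch using std::time::Instant (standard library only): it yields one total duration with no per-file breakdown.

use std::fs;
use std::time::Instant;

fn main() {
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    let start = Instant::now();
    for file in &files {
        let _data = fs::read_to_string(file).expect("Failed to read file");
    }
    // A single total duration: no hint of which file dominated.
    println!("total elapsed: {:?}", start.elapsed());
}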
Instrumenting with Spans for Insight
To get visibility, we'll instrument the code using tracing spans. A span represents a period of time in the program (it has a start and an end) and can carry contextual information. By placing spans around regions of interest (our file-read loop and each individual file read), we can collect timing data. We'll use Tokio's tracing crate to create spans and tracing-texray as a subscriber to visualize them.
First, we set up tracing-texray as our global tracing subscriber. tracing-texray is a tracing layer that outputs a plaintext timeline of spans and events. It stays dormant until we mark a span to examine, then prints that span's entire subtree when the span completes. In our code, we call tracing_texray::init() to initialize it, and then wrap our code in an examined span. Let's add an outer span for the whole file-reading loop:
use tracing::info_span;
use tracing_texray::examine;

fn main() {
    tracing_texray::init(); // Initialize `texray` as the global subscriber
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    // Examine the root span for the entire operation
    examine(info_span!("process_files")).in_scope(|| {
        for file in &files {
            // Create a child span carrying the file name, and enter its scope
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                // Use the contents so the read isn't optimized away
                println!("Read {} bytes from {}", contents.len(), file);
            });
        }
    });
}
When running this program, texray prints a timeline to stderr when the "process_files" span completes. The output timeline for Example 1 (sequential read) might look like this:
=== Example 1: Sequential File Reading ===
process_files 728μs ├─────────────────────────────────────────────────┤
read_file{file: small.txt} 39μs ├─┤
read_file{file: medium.txt} 27μs ├┤
read_file{file: large.txt} 58μs ├──┤
read_file{file: huge.txt} 535μs ├───────────────────────────────────┤
Here, the top-level process_files span took 728μs in total. Each read_file{file: ...} line is a child span showing how long it took to read that specific file. Because the files were read sequentially, their timeline bars appear one after another (no overlaps). For instance, small.txt took 39μs, then medium.txt took 27μs, and so on, with huge.txt dominating at 535μs. The ASCII bars give a visual sense of timing: each bar's length corresponds to its duration, and since these reads happened back-to-back, the bars are placed end-to-end.
At this point, we've used tracing-texray to pinpoint which file is slow. In a real scenario, this insight might prompt us to investigate why huge.txt is slow (perhaps it's larger or on a slower disk). It might also prompt us to consider optimizations, such as reading the files in parallel to reduce overall time. Let's try that next.
Parallelizing with Rayon for Speed
Knowing that one file dominates the timeline, a natural optimization is to read the files in parallel. We can use the Rayon crate for data parallelism: Rayon makes it easy to convert sequential iterators into parallel iterators, safely distributing the work across a thread pool. For our file reader, switching to Rayon is straightforward: we replace the sequential loop with Rayon's par_iter() over the file list.
First, ensure Rayon is in our dependencies; a minimal Cargo.toml entry might look like the following (versions are illustrative):
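[dependencies]
rayon = "1"              # data-parallel iterators (version illustrative)
tracing = "0.1"
tracing-texray = "0.2"   # version illustrative

With that in place, our code becomes: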
use rayon::prelude::*;
use tracing::info_span;
use tracing_texray::examine;

fn main() {
    tracing_texray::init();
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    // Examine the root span for parallel processing
    examine(info_span!("process_files")).in_scope(|| {
        // Parallel iteration over the files
        files.par_iter().for_each(|file| {
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                println!("Read {} bytes from {}", contents.len(), file);
            });
        });
    });
}
With these small changes, our program will read files in parallel. If we run this, we should see a significant speed-up in total execution time.
However, when we check the tracing-texray output now, we encounter a problem: the timeline no longer shows the inner read_file spans under the process_files span. In fact, we only see the outer process_files span with no children; the details of the individual file reads are missing. What happened to our nice breakdown?
=== Example 2: Rayon Parallel File Reading ===
process_files 619μs ├────────────────────────────────────────────────────────────────────┤
Where did the child spans go? We only see the top-level process_files span (about 619μs total) and no read_file spans underneath it. This happens because the spans created inside Rayon threads are not recorded as children of the examined span. The tracing context (which carries the parent span) isn't automatically propagated to the new threads in Rayon's thread pool. As a result, texray doesn't realize those read_file spans are part of process_files, and so it omits them from the timeline. We end up with a single flat span indicating the total time for processing all files in parallel, but without any per-file breakdown.
In other words, the parallel file reads did happen (the total time dropped to ~619μs, roughly the duration of the slowest file alone), but texray couldn't trace the inner spans across threads. This can make debugging parallel operations tricky, since we lose visibility into the concurrent tasks.
The Challenge: Tracing Spans Across Threads
The missing spans are a result of how tracing contexts propagate (or rather, don't propagate) across threads. By default, a span entered on one thread isn't automatically carried into work happening on other threads. In our parallel code, we start the process_files span on the main thread, but the file reading happens on Rayon's thread-pool threads. Those threads create new read_file spans, yet they don't inherit the parent span's context. In other words, the child spans are disconnected from the parent.
This behavior is a known aspect of tracing: the core library doesn't implicitly track spans across thread or task boundaries. If we want a span's context to continue on another thread, we have to do it manually; the usual recommendation is to pass a handle to the span to the new thread or task and enter it (or declare it as the parent) there. In our case, that would mean capturing the process_files span and re-establishing it as the parent inside the par_iter closure on each worker thread.
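As a concrete illustration, here is a minimal sketch of what that manual propagation might look like. The process_files_manual helper is hypothetical, but Span::current() and the parent: argument to info_span! are standard tracing APIs:

use rayon::prelude::*;
use tracing::{info_span, Span};

// Hypothetical helper: read files in parallel while preserving span parentage.
// Must be called from inside the process_files scope for the capture to work.
fn process_files_manual(files: &[&str]) {
    // Capture a handle to the current (parent) span on the calling thread.
    let parent = Span::current();
    files.par_iter().for_each(|file| {
        // Explicitly declare the captured span as the parent, so the child
        // span is linked even though it is created on a pool thread.
        let span = info_span!(parent: &parent, "read_file", file = %file);
        span.in_scope(|| {
            let contents = std::fs::read_to_string(file)
                .expect("Failed to read file");
            println!("Read {} bytes from {}", contents.len(), file);
        });
    });
}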
Doing this by hand is possible, but it’s a bit tricky and adds extra complexity to our nicely abstracted Rayon loop. We’d rather not mess with manually managing span guards in every thread. So, how can we regain our tracing insight without abandoning Rayon or significantly complicating our code?
A Workaround: Using maybe-rayon for Consistent Traces
One way to regain that visibility is to run the tasks on a single thread (so that span context is preserved), while still keeping the option to use multiple threads when we need raw speed. This is where p3-maybe-rayon comes in. The p3-maybe-rayon crate (a modern alternative to the now-abandoned maybe-rayon crate) is a feature-gated wrapper around Rayon. It can act as Rayon (parallel) or as a single-threaded facade depending on a compile-time feature flag, allowing us to easily toggle parallelism.
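To make the toggle concrete, here is a sketch of how the Cargo.toml might look, assuming p3-maybe-rayon exposes a "parallel" feature as described below (versions illustrative); forwarding the feature from our own crate lets us flip parallelism from the command line:

[dependencies]
p3-maybe-rayon = "0.1"   # serial facade by default (version illustrative)

[features]
# Opting in turns par_iter() into real Rayon parallelism:
parallel = ["p3-maybe-rayon/parallel"]

With a setup like this, a plain cargo run traces cleanly, while cargo run --features parallel would restore the Rayon behavior of Example 2.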
We replace Rayon's import with p3_maybe_rayon::prelude::* and keep the same par_iter() code:
use p3_maybe_rayon::prelude::*;
use tracing::info_span;
use tracing_texray::examine;

fn main() {
    tracing_texray::init();
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    // Examine the root span (using maybe-rayon for optional parallelism)
    examine(info_span!("process_files")).in_scope(|| {
        files.par_iter().for_each(|file| {
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                println!("Read {} bytes from {}", contents.len(), file);
            });
        });
    });
}
If we run this with the parallel feature turned off (by not enabling the "parallel" feature of p3-maybe-rayon, so that par_iter() actually runs sequentially on the current thread), texray captures all the file-reading spans just as in the purely sequential Example 1. The Example 3 output shows the nested spans properly:
=== Example 3: Maybe-rayon for Debugging ===
process_files 846μs ├─────────────────────────────────────────────────┤
read_file{file: small.txt} 42μs ├─┤
read_file{file: medium.txt} 26μs ├┤
read_file{file: large.txt} 35μs ├┤
read_file{file: huge.txt} 698μs ├────────────────────────────────────────┤
Wrapping Up
All the code for this post is available at https://github.com/huitseeker/texray-demo.
Using tracing-texray, we visualized the execution of reading multiple files under different modes. In sequential mode, texray clearly showed each file-read span and how long it took relative to the others. With naive Rayon-based parallelism, the timeline lost detail because the span context did not carry over to the new threads. By using the p3-maybe-rayon crate (essentially running the tasks serially for tracing purposes), we regained the detailed breakdown.
Note: if we enable parallelism in p3-maybe-rayon (making it use Rayon internally), we face the same issue as in Example 2: the spans execute on worker threads without the parent context, and texray's output looks just as empty as before. The crate doesn't magically propagate tracing contexts for us; it simply gives us the flexibility to switch between parallel and serial execution. In a real application, to trace multi-threaded spans properly, we might need to propagate the span context manually (as sketched above) or use tracing facilities that support cross-thread span relationships.
texray proved useful for quickly understanding the performance characteristics of our code. We saw which file was the slowest and how the total time was distributed among tasks. When writing parallel code, tools like texray can help ensure that we don't sacrifice observability for speed. And with crates like p3-maybe-rayon, we have the option to turn off parallelism when debugging, giving us the best of both worlds: clarity during development and concurrency in production.