Motivation

As we begin a systematic series of performance investigations into AI training and inference frameworks, it's worth establishing a consistent methodology. The goal is reproducibility: every investigation should use the same measurement approach so results are directly comparable across frameworks, models, and hardware.

Profiling Stack

Our standard toolchain for NVIDIA GPU workloads:

Vendor Tools

  • Nsight Systems — Timeline-based system profiler. Primary tool for understanding end-to-end execution flow, CPU-GPU interactions, kernel launch patterns, and communication overhead.
  • Nsight Compute — Kernel-level profiler. Used for deep-dive analysis of individual CUDA kernels: occupancy, memory throughput, instruction mix, warp efficiency.
  • PyTorch Profiler — Framework-level integration. Captures operator-level timing, memory allocation patterns, and generates Chrome traces for visualization.
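As a concrete starting point, the PyTorch Profiler can be driven entirely from Python. A minimal sketch (the two-layer model, input shape, and trace filename here are illustrative, not part of our standard harness); on a GPU box you would add ProfilerActivity.CUDA to the activities list:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative model; any nn.Module is profiled the same way.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
inputs = torch.randn(32, 256)

# Capture operator-level timing and input shapes.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Per-operator table, sorted by self time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))

# Chrome trace, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```

The exported Chrome trace is what we attach to investigation write-ups, since it lets readers scrub the timeline themselves rather than trust a summary table.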

Custom Instrumentation

Beyond vendor tools, we develop lightweight instrumentation for metrics that matter in production:

# Example: Simple kernel timing wrapper
import torch
from torch.cuda import Event

def timed_forward(model, inputs, warmup=5, repeats=20):
    """Measure forward-pass latency with proper CUDA synchronization."""
    start = Event(enable_timing=True)
    end = Event(enable_timing=True)

    with torch.no_grad():  # inference timing: skip autograd bookkeeping
        # Warmup: discard cold-start effects (JIT, allocator, caches)
        for _ in range(warmup):
            _ = model(inputs)
        torch.cuda.synchronize()

        # Timed runs
        times = []
        for _ in range(repeats):
            start.record()
            _ = model(inputs)
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))

    times.sort()
    return {
        "mean_ms": sum(times) / len(times),
        "median_ms": times[len(times) // 2],
        "min_ms": times[0],
        "max_ms": times[-1],
    }

Measurement Principles

  1. Always warm up — First N iterations are discarded to avoid cold-start effects (JIT compilation, memory allocation, caching).
  2. Synchronize properly — Call torch.cuda.synchronize() before and after measurement regions. Asynchronous kernel launches make naive timing meaningless.
  3. Report distributions, not averages — Median, P95, P99 alongside mean. Variance often reveals more than central tendency.
  4. Control the environment — Pin GPU clocks, disable frequency scaling, document driver and framework versions.
  5. Measure what matters — Wall-clock time, GPU utilization, memory bandwidth utilization, and achieved FLOPS relative to theoretical peak (roofline position).
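Principle 3 needs nothing beyond the standard library. A minimal sketch of the summary we report for a list of per-iteration latencies (the sample values and the helper name summarize are illustrative):

```python
import statistics

def summarize(times_ms):
    """Distribution summary for per-iteration latencies in milliseconds."""
    qs = statistics.quantiles(times_ms, n=100)  # cut points q1..q99
    return {
        "mean_ms": statistics.fmean(times_ms),
        "median_ms": statistics.median(times_ms),
        "p95_ms": qs[94],
        "p99_ms": qs[98],
        "max_ms": max(times_ms),
    }

# Illustrative sample: steady ~10 ms iterations with occasional 25 ms stalls.
times = [10.1, 10.2, 10.0, 10.3, 10.1, 10.2, 10.4, 10.1, 25.0, 10.2] * 10
stats = summarize(times)
print(stats)
```

On a sample like this the mean sits well above the median, which is exactly the signal the principle is after: averaged alone, the stalls would be invisible.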

Coming Next

With this methodology established, our first investigation will profile a standard PyTorch training loop on a cloud GPU instance, establishing baseline performance numbers before moving to NeMo and Megatron-Core.