Motivation

As we begin a systematic series of performance investigations into AI training and inference frameworks, it's worth establishing a consistent methodology. The goal is reproducibility: every investigation should use the same measurement approach so results are directly comparable across frameworks, models, and hardware.

Profiling Stack

Our standard toolchain for NVIDIA GPU workloads:

Vendor Tools

  • Nsight Systems — Timeline-based system profiler. Primary tool for understanding end-to-end execution flow, CPU-GPU interactions, kernel launch patterns, and communication overhead.
  • Nsight Compute — Kernel-level profiler. Used for deep-dive analysis of individual CUDA kernels: occupancy, memory throughput, instruction mix, warp efficiency.
  • PyTorch Profiler — Framework-level integration. Captures operator-level timing, memory allocation patterns, and generates Chrome traces for visualization.
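As a concrete starting point, the PyTorch Profiler can be driven entirely from Python. A minimal sketch (the two-layer model, input shape, and trace filename here are illustrative, not part of our standard harness); on a GPU box you would add ProfilerActivity.CUDA to the activities list:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative model; any nn.Module is profiled the same way.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
inputs = torch.randn(32, 256)

# Capture operator-level timing and input shapes.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Per-operator table, sorted by self time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))

# Chrome trace, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```

The exported Chrome trace is what we attach to investigation write-ups, since it lets readers scrub the timeline themselves rather than trust a summary table.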

Custom Instrumentation

Beyond vendor tools, we develop lightweight instrumentation for metrics that matter in production:

# Example: Simple kernel timing wrapper
import torch
from torch.cuda import Event

def timed_forward(model, inputs, warmup=5, repeats=20):
    """Measure forward-pass latency with proper CUDA synchronization."""
    start = Event(enable_timing=True)
    end = Event(enable_timing=True)

    with torch.no_grad():  # inference timing: skip autograd bookkeeping
        # Warmup: discard cold-start effects (JIT, allocator, caches)
        for _ in range(warmup):
            _ = model(inputs)
        torch.cuda.synchronize()

        # Timed runs
        times = []
        for _ in range(repeats):
            start.record()
            _ = model(inputs)
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))

    times.sort()
    return {
        "mean_ms": sum(times) / len(times),
        "median_ms": times[len(times) // 2],
        "min_ms": times[0],
        "max_ms": times[-1],
    }

Measurement Principles

  1. Always warm up — First N iterations are discarded to avoid cold-start effects (JIT compilation, memory allocation, caching).
  2. Synchronize properly — Call torch.cuda.synchronize() before and after measurement regions. Asynchronous kernel launches make naive timing meaningless.
  3. Report distributions, not averages — Median, P95, P99 alongside mean. Variance often reveals more than central tendency.
  4. Control the environment — Pin GPU clocks, disable frequency scaling, document driver and framework versions.
  5. Measure what matters — Wall-clock time, GPU utilization, memory bandwidth utilization, and achieved FLOPS relative to theoretical peak (roofline position).
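Principle 3 needs nothing beyond the standard library. A minimal sketch of the summary we report for a list of per-iteration latencies (the sample values and the helper name summarize are illustrative):

```python
import statistics

def summarize(times_ms):
    """Distribution summary for per-iteration latencies in milliseconds."""
    qs = statistics.quantiles(times_ms, n=100)  # cut points q1..q99
    return {
        "mean_ms": statistics.fmean(times_ms),
        "median_ms": statistics.median(times_ms),
        "p95_ms": qs[94],
        "p99_ms": qs[98],
        "max_ms": max(times_ms),
    }

# Illustrative sample: steady ~10 ms iterations with occasional 25 ms stalls.
times = [10.1, 10.2, 10.0, 10.3, 10.1, 10.2, 10.4, 10.1, 25.0, 10.2] * 10
stats = summarize(times)
print(stats)
```

On a sample like this the mean sits well above the median, which is exactly the signal the principle is after: averaged alone, the stalls would be invisible.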

Coming Next

With this methodology established, our first investigation will profile a standard PyTorch training loop on a cloud GPU instance, establishing baseline performance numbers before moving to NeMo and Megatron-Core.