Motivation
As we begin a systematic series of performance investigations into AI training and inference frameworks, it's worth establishing a consistent methodology. The goal is reproducibility: every investigation should use the same measurement approach so results are directly comparable across frameworks, models, and hardware.
Profiling Stack
Our standard toolchain for NVIDIA GPU workloads:
Vendor Tools
- Nsight Systems — Timeline-based system profiler. Primary tool for understanding end-to-end execution flow, CPU-GPU interactions, kernel launch patterns, and communication overhead.
- Nsight Compute — Kernel-level profiler. Used for deep-dive analysis of individual CUDA kernels: occupancy, memory throughput, instruction mix, warp efficiency.
- PyTorch Profiler — Framework-level integration. Captures operator-level timing, memory allocation patterns, and generates Chrome traces for visualization.
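As a minimal sketch of the framework-level layer, the snippet below captures an operator-level trace with `torch.profiler` and exports a Chrome trace. The model, tensor sizes, and output filename here are illustrative choices, not part of any standard setup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-in model and input
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
inputs = torch.randn(32, 256)

# Profile CPU activity always; add CUDA activity when a GPU is present
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, inputs = model.cuda(), inputs.cuda()

with profile(activities=activities, profile_memory=True) as prof:
    model(inputs)

# Operator-level timing table, plus a trace viewable in chrome://tracing
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
prof.export_chrome_trace("trace.json")
```

The same `profile` context can wrap a full training step; `export_chrome_trace` is what produces the Chrome traces mentioned above.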
Custom Instrumentation
Beyond vendor tools, we develop lightweight instrumentation for metrics that matter in production:
```python
# Example: Simple kernel timing wrapper
import torch
from torch.cuda import Event


def timed_forward(model, inputs, warmup=5, repeats=20):
    """Measure forward pass with proper CUDA synchronization."""
    start = Event(enable_timing=True)
    end = Event(enable_timing=True)

    # Warmup
    for _ in range(warmup):
        _ = model(inputs)
    torch.cuda.synchronize()

    # Timed runs
    times = []
    for _ in range(repeats):
        start.record()
        _ = model(inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds

    return {
        "mean_ms": sum(times) / len(times),
        "min_ms": min(times),
        "max_ms": max(times),
    }
```
Measurement Principles
- Always warm up — First N iterations are discarded to avoid cold-start effects (JIT compilation, memory allocation, caching).
- Synchronize properly — Call `torch.cuda.synchronize()` before and after measurement regions. Asynchronous kernel launches make naive timing meaningless.
- Report distributions, not averages — Median, P95, and P99 alongside the mean. Variance often reveals more than central tendency.
- Control the environment — Pin GPU clocks, disable frequency scaling, document driver and framework versions.
- Measure what matters — Wall-clock time, GPU utilization, memory bandwidth utilization, and achieved FLOPS relative to theoretical peak (roofline position).
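To make the distribution-reporting principle concrete, here is a small pure-Python helper using nearest-rank percentiles. The function name and result fields are our own choices for illustration:

```python
import math


def latency_stats(times_ms):
    """Summarize per-iteration latencies (ms): mean plus distribution."""
    xs = sorted(times_ms)
    n = len(xs)

    def percentile(p):
        # Nearest-rank: smallest sample covering at least p% of the data
        return xs[max(0, math.ceil(p / 100 * n) - 1)]

    return {
        "mean_ms": sum(xs) / n,
        "median_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
        "min_ms": xs[0],
        "max_ms": xs[-1],
    }


# One slow outlier skews the mean well above the median
print(latency_stats([10.0, 10.2, 10.1, 10.3, 35.0]))
```

With real runs, the gap between `p99_ms` and `median_ms` is often the first signal of interference, clock throttling, or allocator stalls that a lone average would hide.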
Coming Next
With this methodology established, our first investigation will profile a standard PyTorch training loop on a cloud GPU instance, establishing baseline performance numbers before moving to NeMo and Megatron-Core.