Benchmark Methodology

This page documents the methodology behind the performance benchmarks reported in our 2026 Outlook blog post.


Hardware

Spec       Value
GPU        NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Memory     96GB GDDR7
Bandwidth  1792 GB/s
TDP        600W
CUDA       13.1.0
Driver     r580.44.80
TensorRT   10.14

Workload

Parameter           Value
Model               FLUX.1-dev
Resolution          Megapixel (1024×1024)
Architecture        AdaLN-modulated DiT
Transformer Blocks  57 (38 double + 19 single)
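For context on the architecture row above: an AdaLN-modulated DiT block derives per-channel shift/scale/gate terms from the conditioning signal and applies them around its attention/MLP sublayers. A minimal numpy sketch of the modulation itself (the shapes and the `w_mod` projection are illustrative, not FLUX's actual parameterization):

```python
import numpy as np

def adaln_modulate(x, cond, w_mod):
    """Project conditioning into per-channel (shift, scale, gate) and modulate x."""
    shift, scale, gate = np.split(cond @ w_mod, 3, axis=-1)
    return gate * (x * (1 + scale) + shift)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))       # token activations (tokens, channels)
cond = rng.standard_normal((2, 4))    # timestep/text conditioning
w_mod = rng.standard_normal((4, 24))  # cond -> concatenated (shift, scale, gate)
out = adaln_modulate(x, cond, w_mod)  # same shape as x
```

Because the (shift, scale, gate) tensors are recomputed from the conditioning at every denoising step, they become the serialization points discussed under integration overhead below.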

FLUX.1-dev Benchmark Results

Iterations per second on RTX PRO 6000 Blackwell for megapixel inference:

Method                         Performance   Notes
FLUX Denoise SOL               ~1000 it/s    Theoretical ceiling (conjectured)
s4 TensorRT Codegen (myelin4)  100-300 it/s  Our implementation
idoru (fleek v1)               48 it/s       Previous implementation
nunchaku v2                    8 it/s
TRT ModelOpt                   7-8 it/s
torchao Float8                 4 it/s
torch.compile                  2.5 it/s
torch eager                    1.3 it/s      Baseline

Performance Gains

Comparison      Speedup
vs torch eager  77-230×
vs idoru        2-6×
vs nunchaku     12-37×

The range reflects end-to-end integration overhead: AdaLN memory ordering, VAE↔DiT precision boundaries, and CFG batching.
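The speedup ranges follow directly from the it/s figures in the results table; a quick check of the arithmetic:

```python
# Speedup ranges implied by the results table: s4 myelin4 runs at 100-300 it/s.
s4 = (100.0, 300.0)
baselines = {"torch eager": 1.3, "idoru": 48.0, "nunchaku": 8.0}
speedups = {name: (s4[0] / its, s4[1] / its) for name, its in baselines.items()}
# e.g. vs torch eager: 100/1.3 ≈ 77x, 300/1.3 ≈ 231x
```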


s4 Op-Gen Kernel Performance

NVFP4 fusion patterns on RTX PRO 6000 Blackwell — sustained TFLOPS:

Pattern                TFLOPS  Latency
s4 GEMM myelin tactic  1,850   SOL
Deep Linear Chain      1,177   0.47ms
Fused MLP              1,081   0.51ms
Residual Pattern       1,062   0.52ms
Transformer Block      976     0.85ms
Single Linear          821     0.08ms
MHA Pattern            799     0.34ms
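Sustained TFLOPS in this table is simply delivered FLOPs divided by measured latency; a small helper showing the conversion (the GEMM shape below is hypothetical, chosen only to exercise the formula):

```python
def sustained_tflops(m, n, k, latency_ms):
    """TFLOPS delivered by an m x n x k GEMM that finishes in latency_ms."""
    flops = 2 * m * n * k                     # one multiply-accumulate = 2 FLOPs
    return flops / (latency_ms * 1e-3) / 1e12

# Hypothetical shape: an 8192^3 GEMM finishing in 1.0 ms sustains ~1100 TFLOPS.
tf = sustained_tflops(8192, 8192, 8192, 1.0)
```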

Summary Stats

Metric                   Value
Peak Sustained           1.85 PFLOPS
Compression Ratio        7.1×
Quantization Throughput  25 GB/s

All patterns sustain roughly 800 TFLOPS or more.
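The 7.1× compression ratio is consistent with NVFP4's storage layout if the baseline is FP32: 4-bit E2M1 values plus one 8-bit E4M3 scale shared per 16-element block gives 4.5 effective bits per weight. (The FP32 baseline and the block size of 16 are assumptions, not stated above.)

```python
# NVFP4 storage: 4-bit E2M1 value + one 8-bit E4M3 scale per 16-element block.
block_size = 16
bits_per_weight = 4 + 8 / block_size   # 4.5 effective bits
ratio_vs_fp32 = 32 / bits_per_weight   # ~7.1x, matching the table
ratio_vs_fp16 = 16 / bits_per_weight   # ~3.6x if the baseline were FP16
```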


Kernel Fusion Analysis

Per-block latency was measured via nsys profiling of the TensorRT engine, using NVFP4 bidirectional noncausal attention. The native Blackwell FP4 tensor core path was confirmed:

Parameter  Value
Kernel     cutlass3x_sm120_bstensorop_s16864gemm_block_scaled
Format     ue4m3xe2m1 (NVFP4 E2M1 + E4M3 block scales)
Tile       128×128×128, pingpong scheduling
Measured   0.449ms mean @ seq_len=16384

Why Official Tools Miss This Path

TensorRT ModelOpt (~8 it/s)

  • W4A8 weight-only quant → attention hits unfused Myelin block, graph breaks at precision boundaries
  • Kernel: sm80_xmma_gemm_f16f16 (Ampere fallback, graph fragmentation at every boundary)

Nunchaku (~8 it/s)

  • W4A4 + FP16 low-rank branch → SVD outlier absorption adds ~16% memory traffic
  • Memory overhead from branch fusion (~57% naive, ~16% optimized)
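A toy numpy sketch of the SVD outlier-absorption idea (not Nunchaku's actual algorithm): quantizing directly lets one outlier column inflate the quantization scale, while first peeling off a low-rank branch and quantizing only the residual cuts the error, at the cost of the extra branch's memory traffic noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[:, 0] *= 50.0                          # one outlier column dominates the range

def fake_quant(w, levels=15):
    """Crude per-tensor uniform quantizer standing in for 4-bit weight quant."""
    s = np.abs(w).max() / (levels / 2)
    return np.round(w / s) * s

err_direct = np.linalg.norm(W - fake_quant(W))

# Absorb dominant structure (incl. the outlier column) into a rank-4 branch,
# then quantize only the residual, whose dynamic range is far smaller.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
low_rank = (U[:, :r] * S[:r]) @ Vt[:r]
err_branch = np.linalg.norm(W - (low_rank + fake_quant(W - low_rank)))
```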

VAE Decode

  • BF16 everywhere—carrying 8× more bits than measured latent entropy requires
  • "The only thing that's difficult is to forget precisely."

s4 myelin4 (~200 it/s)

  • Block-scaled NVFP4 with forced tactic selection
  • Bidirectional noncausal attention emitted as single fused op → Myelin selects sm120 tactics
  • Precision matched to measured latent entropy—epilogue fusion at gauge boundaries
  • Kernel: cutlass3x_sm120_bstensorop (Native Blackwell FP4, whole-graph fusion intact)

The moat isn't the quantization math; it's knowing where to force tactics, which compositions preserve fusion, and how to match precision to measured entropy at every boundary.


Full Generation Estimate

Kernel Benchmark (Isolated)

Component              Calculation   Result
DiT forward            57 × 0.449ms  25.6ms
4-step CFG (8 passes)  8 × 25.6ms    205ms
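The isolated-kernel numbers above are straightforward arithmetic from the measured 0.449 ms per block:

```python
# 57 transformer blocks at the measured 0.449 ms each; 4 denoising steps with
# classifier-free guidance means 2 forward passes per step = 8 passes total.
blocks, per_block_ms = 57, 0.449
dit_forward_ms = blocks * per_block_ms   # ~25.6 ms per forward pass
total_ms = 8 * dit_forward_ms            # ~205 ms for 4-step CFG
```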

End-to-End Projected (with Integration Overhead)

Estimate      Time        Equivalent
Conservative  ~300-600ms  100-200 it/s
Optimistic    ~200-300ms  200-300 it/s

What Drives The Range

Negative Factors

  • AdaLN memory ordering — Modulation tensors computed per-step create serialization points
  • VAE↔DiT boundary — Precision mismatch at FP4 DiT → FP16 VAE decode
  • CFG batching — Doubles effective batch mid-graph, memory pressure
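The CFG batching point is visible in the standard guidance formulation: each step runs the conditional and unconditional passes as one concatenated batch, doubling the effective batch size mid-graph. A minimal sketch with a stand-in denoiser (`toy` and the `scale` value are illustrative):

```python
import numpy as np

def cfg_step(denoise, x, cond, uncond, scale=3.5):
    """One guidance step: a single batched forward over [uncond; cond]."""
    batch = np.concatenate([x, x], axis=0)       # effective batch doubles here
    ctx = np.concatenate([uncond, cond], axis=0)
    eps = denoise(batch, ctx)
    eps_u, eps_c = np.split(eps, 2, axis=0)
    return eps_u + scale * (eps_c - eps_u)       # guided noise prediction

# Stand-in denoiser: predicted noise is x scaled by the mean of its context.
toy = lambda x, c: x * c.mean(axis=-1, keepdims=True)
out = cfg_step(toy, np.ones((1, 4)), cond=np.full((1, 4), 2.0),
               uncond=np.zeros((1, 4)))
```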

Positive Factors

  • Coherent precision — Single FP4 precision across entire DiT eliminates cast overhead
  • Graph-level fusion — Myelin sees full transformer block, not isolated ops

Caveats

  1. High variance (100-300 it/s range) from thermal/power modulation at 600W sustained and end-to-end integration overhead

  2. Synthetic weights — Kernel timing from synthetic weight-shared model; full 6GB unique weights may see additional memory bandwidth cost

  3. Pattern requirements — NVFP4 bidirectional noncausal attention pattern required for Myelin to select optimal tactics
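On the pattern requirement in point 3: "bidirectional noncausal" attention simply means no mask is applied, so every token attends to every other token, which is what allows the whole pattern to be emitted as one fused op. A minimal numpy illustration of the difference from causal attention:

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention; noncausal applies no mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
full = attention(q, k, v)                  # bidirectional: all tokens visible
masked = attention(q, k, v, causal=True)   # causal mask changes the output
```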