Benchmark Methodology
This page documents the methodology behind the performance benchmarks reported in our 2026 Outlook blog post.
Hardware
| Spec | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Workstation Edition |
| Memory | 96GB GDDR7 |
| Bandwidth | 1792 GB/s |
| TDP | 600W |
| CUDA | 13.1.0 |
| Driver | r580.44.80 |
| TensorRT | 10.14 |
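For reproducibility, the relevant environment details can be captured alongside results. A minimal sketch in PyTorch (not the exact logging used for the numbers above):

```python
import torch

# Record the GPU and software stack alongside benchmark results.
# Minimal environment-capture sketch; not the exact harness used here.
props = torch.cuda.get_device_properties(0)
env = {
    "gpu": props.name,                                  # e.g. "NVIDIA RTX PRO 6000 Blackwell ..."
    "vram_gb": round(props.total_memory / 1024**3, 1),  # 96 GB on this card
    "cuda": torch.version.cuda,                         # CUDA toolkit PyTorch was built against
    "torch": torch.__version__,
}
try:
    import tensorrt as trt
    env["tensorrt"] = trt.__version__                   # 10.x on the benchmark machine
except ImportError:
    env["tensorrt"] = None

print(env)
```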
Workload
| Parameter | Value |
|---|---|
| Model | FLUX.1-dev |
| Resolution | Megapixel (1024×1024) |
| Architecture | AdaLN-modulated DiT |
| Transformer Blocks | 57 (38 double + 19 single) |
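The workload can be instantiated with the Hugging Face diffusers FluxPipeline. The sketch below shows only the reference configuration; the model ID, prompt, and step count are illustrative, and the benchmarked s4/TensorRT path (which replaces the transformer with a compiled engine) is not shown:

```python
import torch
from diffusers import FluxPipeline  # standard Hugging Face pipeline assumed

# Workload sketch: FLUX.1-dev at megapixel (1024x1024) resolution.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed model ID
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a photo of a forest at dawn",    # placeholder prompt
    height=1024,
    width=1024,
    num_inference_steps=4,            # matches the 4-step estimate used below
).images[0]
image.save("flux_megapixel.png")
```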
FLUX.1-dev Benchmark Results
Iterations per second on RTX PRO 6000 Blackwell for megapixel inference:
| Method | Performance | Notes |
|---|---|---|
| FLUX Denoise SOL | ~1000 it/s | Theoretical ceiling (conjectured) |
| s4 TensorRT Codegen (myelin4) | 100-300 it/s | Our implementation |
| idoru (fleek v1) | 48 it/s | Previous implementation |
| nunchaku v2 | 8 it/s | — |
| TRT ModelOpt | 7-8 it/s | — |
| torchao Float8 | 4 it/s | — |
| torch.compile | 2.5 it/s | — |
| torch eager | 1.3 it/s | Baseline |
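it/s here means denoiser iterations per second. A minimal sketch of how such a figure can be measured with CUDA events (the `dit_forward` callable and its inputs are placeholders, not our harness):

```python
import torch

def measure_it_per_s(dit_forward, latents, cond, n_warmup=10, n_iters=50):
    """Time denoiser iterations with CUDA events and return iterations/second.

    `dit_forward`, `latents`, and `cond` are placeholders for the model call
    and its inputs; this is a measurement sketch, not the production harness.
    """
    for _ in range(n_warmup):                 # warm up clocks and caches
        dit_forward(latents, cond)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        dit_forward(latents, cond)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
    return n_iters / seconds
```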
Performance Gains
| Comparison | Speedup |
|---|---|
| vs torch eager | 77-230× |
| vs idoru | 2-6× |
| vs nunchaku | 12-37× |
The range reflects end-to-end integration overhead: AdaLN memory ordering, VAE↔DiT precision boundaries, and CFG batching.
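The speedup ranges are straight ratios of the it/s figures above:

```python
# Speedup ranges are ratios of the it/s figures in the tables above.
ours = (100, 300)                                    # s4 TensorRT codegen, it/s
baselines = {"torch eager": 1.3, "idoru": 48, "nunchaku v2": 8}

for name, its in baselines.items():
    print(f"vs {name}: {ours[0] / its:.1f}-{ours[1] / its:.1f}x")
# -> vs torch eager: 76.9-230.8x, vs idoru: 2.1-6.2x, vs nunchaku v2: 12.5-37.5x
```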
s4 Op-Gen Kernel Performance
NVFP4 fusion patterns on RTX PRO 6000 Blackwell — sustained TFLOPS:
| Pattern | TFLOPS | Latency |
|---|---|---|
| s4 GEMM myelin tactic | 1,850 | SOL |
| Deep Linear Chain | 1,177 | 0.47ms |
| Fused MLP | 1,081 | 0.51ms |
| Residual Pattern | 1,062 | 0.52ms |
| Transformer Block | 976 | 0.85ms |
| Single Linear | 821 | 0.08ms |
| MHA Pattern | 799 | 0.34ms |
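Sustained TFLOPS is the analytic FLOP count of a pattern divided by its measured latency. A minimal sketch for a single dense GEMM (shapes and latency are illustrative, not the benchmarked fusion patterns):

```python
def gemm_tflops(m, n, k, latency_ms):
    """Convert a measured GEMM latency into sustained TFLOPS.

    Uses the standard 2*M*N*K FLOP count for a dense matmul; shapes are
    illustrative, not the actual fusion-pattern shapes benchmarked above.
    """
    flops = 2 * m * n * k
    return flops / (latency_ms * 1e-3) / 1e12

# Example: a 16384 x 3072 x 3072 GEMM (roughly hidden-dim scale, assumed)
# measured at 0.3 ms would sustain about 1030 TFLOPS.
print(gemm_tflops(16384, 3072, 3072, 0.3))
```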
Summary Stats
| Metric | Value |
|---|---|
| Peak Sustained | 1.85 PFLOPS |
| Compression Ratio | 7.1× |
| Quantization Throughput | 25 GB/s |
All fusion patterns sustain roughly 800 TFLOPS or more.
Kernel Fusion Analysis
Per-block latency was measured by profiling the TensorRT engine with Nsight Systems (nsys), using the NVFP4 bidirectional noncausal attention pattern. The profile confirms the native Blackwell FP4 tensor core path:
| Parameter | Value |
|---|---|
| Kernel | cutlass3x_sm120_bstensorop_s16864gemm_block_scaled |
| Format | ue4m3xe2m1 (NVFP4 E2M1 + E4M3 block scales) |
| Tile | 128×128×128, pingpong scheduling |
| Measured | 0.449ms mean @ seq_len=16384 |
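The per-kernel breakdown comes from the nsys trace; whole-enqueue latency can be cross-checked with CUDA events around the TensorRT execution context. A sketch, assuming a separately built engine with static shapes (file name and buffer handling are illustrative):

```python
import tensorrt as trt
import torch

# CUDA-event cross-check of whole-enqueue engine latency. This sketch only
# times the full execution context; it does not reproduce the per-kernel
# nsys breakdown. Engine path, static shapes, and the generous
# bytes-per-element allocation are illustrative assumptions.
logger = trt.Logger(trt.Logger.WARNING)
with open("dit_block.engine", "rb") as f:                     # assumed engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    nbytes = trt.volume(engine.get_tensor_shape(name)) * 4    # 4 B/elem upper bound
    buffers[name] = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

stream = torch.cuda.current_stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10):                                           # warm-up passes
    context.execute_async_v3(stream.cuda_stream)
start.record(stream)
for _ in range(100):
    context.execute_async_v3(stream.cuda_stream)
end.record(stream)
torch.cuda.synchronize()
print("mean ms per enqueue:", start.elapsed_time(end) / 100)
```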
Why Official Tools Miss This Path
TensorRT ModelOpt (~8 it/s)
- W4A8 weight-only quantization → attention hits an unfused Myelin block; the graph breaks at precision boundaries
- Kernel: sm80_xmma_gemm_f16f16 (Ampere fallback, graph fragmentation at every boundary)
Nunchaku (~8 it/s)
- W4A4 + FP16 low-rank branch → SVD outlier absorption adds ~16% memory traffic
- Memory overhead from branch fusion (~57% naive, ~16% optimized)
VAE Decode
- BF16 everywhere—carrying 8× more bits than measured latent entropy requires
- "The only thing that's difficult is to forget precisely."
s4 myelin4 (~200 it/s)
- Block-scaled NVFP4 with forced tactic selection
- Bidirectional noncausal attention emitted as single fused op → Myelin selects sm120 tactics
- Precision matched to measured latent entropy—epilogue fusion at gauge boundaries
- Kernel: cutlass3x_sm120_bstensorop (Native Blackwell FP4, whole-graph fusion intact)
The moat isn't quantization math—it's knowing where to force tactics, what compositions preserve fusion, and matching precision to measured entropy at every boundary.
Full Generation Estimate
Kernel Benchmark (Isolated)
| Component | Calculation | Result |
|---|---|---|
| DiT forward | 57 × 0.449ms | 25.6ms |
| 4-step CFG (8 passes) | 8 × 25.6ms | 205ms |
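The same estimate in code, with integration overhead left as the free parameter:

```python
# Full-generation estimate composed from the isolated kernel numbers above.
block_ms = 0.449                    # measured mean latency per transformer block
blocks = 57                         # 38 double + 19 single
cfg_passes = 2 * 4                  # 4 denoising steps, CFG doubles the passes

dit_forward_ms = blocks * block_ms              # ~25.6 ms
kernel_only_ms = cfg_passes * dit_forward_ms    # ~205 ms

# Integration overhead (AdaLN ordering, VAE boundary, CFG batching) is the free
# parameter; roughly 1x-3x spans the optimistic-to-conservative range above.
for overhead in (1.0, 1.5, 2.0, 3.0):
    print(f"{overhead:.1f}x overhead -> {kernel_only_ms * overhead:.0f} ms end-to-end")
```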
End-to-End Projected (with Integration Overhead)
| Estimate | Time | Equivalent |
|---|---|---|
| Conservative | ~300-600ms | 100-200 it/s |
| Optimistic | ~200-300ms | 200-300 it/s |
What Drives The Range
Negative Factors
- AdaLN memory ordering — Modulation tensors computed per-step create serialization points
- VAE↔DiT boundary — Precision mismatch at FP4 DiT → FP16 VAE decode
- CFG batching — Doubles effective batch mid-graph, memory pressure (illustrated in the sketch after these lists)
Positive Factors
- Coherent precision — Single FP4 precision across entire DiT eliminates cast overhead
- Graph-level fusion — Myelin sees full transformer block, not isolated ops
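To make the CFG batching point concrete: classifier-free guidance runs a conditional and an unconditional branch each step, usually by concatenating them along the batch dimension, so every fused kernel in the DiT graph carries a doubled effective batch. A sketch with illustrative shapes:

```python
import torch

# Illustration of CFG batching: conditional and unconditional inputs are
# concatenated along the batch dimension, so every fused kernel in the DiT
# graph carries a doubled effective batch. Shapes are illustrative only.
latents = torch.randn(1, 4096, 64, device="cuda", dtype=torch.float16)
cond    = torch.randn(1, 512, 4096, device="cuda", dtype=torch.float16)
uncond  = torch.zeros_like(cond)

latents_in = torch.cat([latents, latents], dim=0)   # batch 1 -> 2
context_in = torch.cat([cond, uncond], dim=0)       # batch 1 -> 2

# dit(latents_in, context_in) would run both branches in one doubled batch;
# the two halves are then split and mixed with the guidance scale.
```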
Caveats
- High variance (100-300 it/s range) from thermal/power modulation at 600W sustained and end-to-end integration overhead
- Synthetic weights — Kernel timing from synthetic weight-shared model; full 6GB unique weights may see additional memory bandwidth cost
- Pattern requirements — NVFP4 bidirectional noncausal attention pattern required for Myelin to select optimal tactics
Related Reading
- 2026 Outlook Blog Post — Full context and charts
- Research Overview — Technical foundations
- Weyl .plan — Deep-dive technical articles