Company Update

Fleek 2026 Outlook: All In On Efficient Inference

A look back at 2025, where we're headed in 2026, and why we're more excited than ever.

Harrison Hines

CEO & Co-founder

January 8, 2026 · 8 min read

2025 was a rollercoaster year for Fleek. We set out to transition from decentralized infrastructure to AI, but the journey took longer than expected and had plenty of turns, especially with the FLK token launching in the middle of it. From the outside it probably looked confusing, but from the inside, things were starting to click.

Now as we start 2026, the dust has settled, and we've finally found our lane. And the path forward has never been clearer or more exciting.

2025 was the flame. 2026 is the phoenix.

Fleek's 2025 AI Transition

Quick background: Fleek spent years building decentralized web infrastructure. IPFS hosting, ENS domains, Fleek Network. Real products, real users, but never true product-market fit. And we weren't alone—very few decentralized infra projects have found PMF. So at the start of 2025, we made the call to pivot to AI.

That pivot started with hosting Eliza agents in January 2025. Through that experience we learned how developers were actually using the framework, which led us to explore AI social use cases in the second half of the year: characters, twins, generative AI, and so on. But every exploration pointed to the same bottleneck: we needed fast, high-quality, low-cost inference to compete.

That initial exploration and its results were showcased in our previous blog post, Introducing Weyl AI (how the Fleek and Weyl brands will be used going forward is explained at the end of this post), so I will pick up the story from there.

The Open Lane We Found

1) DeepSeek proved algorithmic innovation beats brute-force scaling.
2) NVIDIA's TensorRT-LLM proved you can build a world-class LLM optimization toolchain.
3) Recent attention kernel research proved you can hit breakthrough performance for narrow operations.

We took inspiration from all three and set out to see whether the same ideas could be applied generally to diffusion inference.

So we dug. Deep into nvfuser commits, TensorRT internals, half-documented code paths. And we found something: hardware fast paths on Blackwell that exist in the silicon but aren't exposed through standard tooling. The performance numbers Jensen touts on stage—the ones nobody in the industry believed were achievable in practice.

Turns out they are achievable. You just need to know which IterDomain compositions preserve whole-model fusion, where to force tactic selection, and how to match precision to measured entropy at every gauge boundary.

FLUX.1-dev Megapixel Inference*

RTX PRO 6000 Blackwell · iterations per second

FLUX Denoise SOL (optimization headroom): ~1000 it/s
s4 TensorRT Codegen (myelin4): 100-300 it/s
idoru (fleek v1): 48 it/s
nunchaku v2: 8 it/s
TRT ModelOpt: 7-8 it/s
torchao Float8: 4 it/s
torch.compile: 2.5 it/s
torch eager: 1.3 it/s

77-230× vs. torch eager · 2-6× vs. idoru · 12-37× vs. nunchaku

*Range reflects end-to-end integration overhead: AdaLN memory ordering, VAE↔DiT precision boundaries, and CFG batching. Figures are projected end-to-end performance from kernel fusion benchmarks; the conservative range accounts for full pipeline integration. See the benchmark methodology.

The chart shows the full inference stack on FLUX.1-dev, from PyTorch eager to our myelin4 runtime. The official tools (TensorRT ModelOpt, nunchaku) plateau around 8 it/s. They hit the same ceiling from different angles: ModelOpt's W4A8 quantization chokes on unfused attention, while nunchaku's W4A4 loses gains to its low-rank outlier branch. Our approach bypasses both bottlenecks by emitting ONNX patterns that trigger native Blackwell FP4 tensor core tactics.

The key insight: NVIDIA shipped Blackwell with incredible FP4 tensor cores, but the official tools don't generate the right graph patterns to trigger them. We figured out the exact ONNX quantization structure that makes TensorRT's Myelin compiler select the cutlass3x_sm120_bstensorop kernels instead of falling back to sm80 FP16 compute.
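
A quick way to sanity-check claims like this on your own engines is to ask TensorRT itself which kernels it picked. The sketch below is a minimal example, not part of our tooling: it assumes the TensorRT Python bindings are installed, that the engine was built with detailed profiling verbosity (so per-layer tactic names are recorded), and that flux_dit.engine is just a placeholder filename.

```python
# Minimal sketch: list which kernels a built TensorRT engine actually selected.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# "flux_dit.engine" is a placeholder path; build the engine with
# ProfilingVerbosity.DETAILED so per-layer tactic/kernel names survive.
with open("flux_dit.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)

# Grep the per-layer JSON for architecture hints: Blackwell-native tactics show
# up with sm120-style kernel names, Ampere fallbacks with sm80-style names.
for line in info.splitlines():
    if "sm120" in line or "sm80" in line:
        print(line.strip())
```

On a Blackwell card, output dominated by sm80-style names means the graph fell back to FP16 Ampere kernels somewhere along the way.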

Why The Gap Exists

Same hardware, different optimization depths

Official Tools (~8 it/s)

TRT ModelOpt: W4A8 weight-only quant → attention hits an unfused Myelin block, and the graph breaks at precision boundaries
nunchaku: W4A4 + FP16 low-rank branch → SVD outlier absorption adds ~16% memory traffic
VAE: BF16 decode, 8× more bits than the latent entropy justifies
Kernel: sm80_xmma_gemm_f16f16 (Ampere fallback, graph fragmentation at every boundary)

s4 TensorRT Codegen (~200 it/s)

FP4: block-scaled NVFP4 with forced tactic selection; we know which IterDomain compositions preserve whole-model fusion
Attention: bidirectional noncausal attention emitted as a single fused op → Myelin selects sm120 tactics
VAE: precision matched to measured latent entropy, with epilogue fusion at gauge boundaries
Kernel: cutlass3x_sm120_bstensorop (native Blackwell FP4, whole-graph fusion intact)

The moat isn't quantization math—it's knowing where to force tactics, what compositions preserve fusion, and matching precision to measured entropy at every boundary.

Whole-model fusion requires compiler archaeology: IterDomain algebra, tactic forcing, and searching only the polyhedral space that preserves fusion
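
To make "precision matched to measured entropy" concrete, here is a rough illustration of the idea, not the actual s4 pipeline: estimate the empirical entropy of a tensor's values and compare it against candidate storage widths. The bin count, the random stand-in latent, and the format table below are all arbitrary choices for the example.

```python
# Illustrative sketch only: measure roughly how much information a tensor
# carries before deciding what precision to store it in.
import torch

def bits_per_element(x: torch.Tensor, bins: int = 4096) -> float:
    """Empirical (histogram) entropy of a tensor's values, in bits per element."""
    flat = x.detach().float().flatten()
    hist = torch.histc(flat, bins=bins, min=flat.min().item(), max=flat.max().item())
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log2()).sum())

# Hypothetical stand-in for a VAE latent (FLUX-style 16-channel, 128x128 grid).
latent = torch.randn(1, 16, 128, 128)
h = bits_per_element(latent)

# Candidate storage formats and their total bit widths (illustrative table,
# not the thresholds the s4 pipeline uses).
formats = {"nvfp4": 4, "fp8_e4m3": 8, "fp16": 16, "bf16": 16}
print(f"measured ~{h:.1f} bits/element of information")
for name, width in sorted(formats.items(), key=lambda kv: kv[1]):
    verdict = "enough" if width >= h else "too narrow"
    print(f"{name:>9} ({width:2d} bits): {verdict}")
```

Histogram entropy is only a crude proxy; a real decision also has to account for dynamic range, quantization error, and how downstream ops amplify it. But it shows why a BF16 VAE decode can carry far more bits than the latent's information content requires.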

s4 Op-Gen Kernel Performance

NVFP4 fusion patterns · RTX PRO 6000 Blackwell · TFLOPS sustained

s4 GEMM: 1,850 TFLOPS (SOL)
Deep Linear: 1,177 TFLOPS (0.47 ms)
Fused MLP: 1,081 TFLOPS (0.51 ms)
Residual: 1,062 TFLOPS (0.52 ms)
Transformer: 976 TFLOPS (0.85 ms)
Single Linear: 821 TFLOPS (0.08 ms)
MHA Pattern: 799 TFLOPS (0.34 ms)

1.85 PFLOP peak sustained · 7.1× compression ratio · 25 GB/s quantization throughput

Op-level kernel performance from the s4 myelin4.py benchmark suite; every pattern sustains roughly 800 TFLOPS or more.
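
For readers who want to reproduce numbers in this format: a "TFLOPS sustained" figure for a GEMM is just 2·M·N·K floating-point operations divided by measured wall time. The sketch below times a stock PyTorch FP16 matmul on an arbitrary 4096³ shape, not the s4/myelin4 kernels, and needs a CUDA GPU to run.

```python
# Back-of-envelope sketch of how a "TFLOPS sustained" figure is derived:
# count 2*M*N*K floating-point ops for a GEMM, divide by measured time.
import torch

M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

# Warm up so we are not timing allocation or kernel-selection overhead.
for _ in range(10):
    a @ b
torch.cuda.synchronize()

# CUDA events measure GPU time; elapsed_time() returns milliseconds.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters
tflops = 2 * M * N * K / seconds / 1e12
print(f"{tflops:.0f} TFLOPS sustained")
```

CUDA events are used here because host-side timers would include launch latency and miss work still queued on the GPU.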

Why This Matters

Efficient inference is THE game for the next decade, and one of the biggest commercial opportunities of our lifetimes. Every chatbot response, every AI-generated image, video, and voice clip, every robot, drone, and self-driving car: they all run on inference.

The inference market is projected to hit $255B annually by 2030, and every company deploying and utilizing AI is already under pressure to cut inference costs. NVIDIA's $20B acquisition of Groq signals that the one-size-fits-all era of inference is ending. A new era of efficient, customized inference is beginning, and Fleek is positioned to compete and win on the software side.

Several billion-dollar inference platforms already exist (Together, Fireworks, Fal, Baseten), but their optimization approaches are largely the same set of standard techniques. Ours is completely different: we don't guess at precision based on convention; we measure the actual information content and optimize each architecture accordingly. The physics says this should work, and our benchmarks already confirm it does. General-purpose optimization is the endgame, and we've now proven it's possible.

What We're Actually Building

We are mainly focused on building two things:

  1. A general-purpose optimization toolchain (think TensorRT-LLM, but for any PyTorch model)
  2. An inference platform to deploy and run the optimized models

Here's the upcoming roadmap of what we will be releasing:

Jan (est.)

Initial Live s4-Optimized Models

FLUX.1-dev, Wan 2.2, and other flagship diffusion models on s4 TensorRT codegen, at 100-300 it/s for megapixel generation.

Q1 2026

General Purpose Torch Compiler Stack

Targeting Blackwell/Rubin and MLIR linalg. Paste a HuggingFace link, get an s4-optimized deployment with whole-model fusion intact.

Q2 2026

Edge & Embedded on Jetson Thor

Our compiler stack thrives in constrained environments. Same precision-matched fusion, 10-50W power envelope.

Q3 2026

Safety-Critical Verticals via Proof Certification

Formal verification of precision boundaries and fusion correctness for automotive, aerospace, and medical deployment.

We're also open-sourcing research and code along the way—starting today with some of our core infrastructure for the NVIDIA ecosystem.

Sticking with the Fleek Brand

We considered starting fresh. But there's something better about a comeback.

So Fleek is still the brand, but we are switching to the fleek.sh domain to better reflect the new efficient inference and AI developer direction. The new homepage is already live.

Weyl becomes our internal research lab brand—think DeepMind to Google. R&D, papers, open-source contributions. The engine behind the product.

The FLK token remains central to the project, now tied directly to efficient inference and platform usage, which is a way bigger and more exciting opportunity. Updated tokenomics here.

Long Live Fleek

This is the start of (yet another) new chapter for Fleek, but this time is different. We've got a real technical cofounder/CTO in Ben. We've got real differentiated AI tech that actually works. And we're early movers positioned exactly where the puck is going in one of the biggest commercial market opportunities of our generation: efficient inference.

All the past Fleek failures, mistakes, and pivots, as painful as they were, were worth it because they led us to this opportunity and the exact position we're in right now. 2026 is the year Fleek becomes a global leader in efficient inference, and a startup comeback story for the ages. I don't expect anybody to believe it until they see it, but I promise you will start to see it in the coming weeks and months with what we put out.

Now back to the garage.

— Harrison