2025 was a rollercoaster year for Fleek. We set out to transition from decentralized infrastructure to AI, but the journey took longer than expected and had more than a few turns, especially with the FLK token launching in the middle of it. From the outside it admittedly looked confusing; from the inside, things were starting to click.
Now as we start 2026, the dust has settled, and we've finally found our lane. And the path forward has never been clearer or more exciting.
2025 was the flame. 2026 is the phoenix.
Fleek's 2025 AI Transition
Quick background: Fleek spent years building decentralized web infrastructure. IPFS hosting, ENS domains, Fleek Network. Real products, real users, but never true product-market fit. And we weren't alone—very few decentralized infra projects have found PMF. So at the start of 2025, we made the call to pivot to AI.
That pivot started with hosting Eliza agents in January 2025. Through that experience we learned how developers were actually using the framework, which led us to explore AI social use cases in the second half of the year: characters, twins, generative AI, and so on. But every exploration pointed to the same bottleneck: we needed fast, high-quality, low-cost inference to compete.
The initial exploration and results that grew out of that need were showcased in our previous blog post, Introducing Weyl AI (how the Fleek & Weyl brands will be used going forward is explained at the end of this post). I'll pick up the story from there.
The Open Lane We Found
1) DeepSeek proved algorithmic innovation beats brute-force scaling.
2) NVIDIA's TensorRT-LLM proved you can build a world-class LLM optimization toolchain.
3) Recent attention kernel research proved you can hit breakthrough performance for narrow operations.
We took inspiration from all three and set out to see whether the same approach could be applied generally to diffusion inference.
So we dug. Deep into nvfuser commits, TensorRT internals, half-documented code paths. And we found something: hardware fast paths on Blackwell that exist in the silicon but aren't exposed through standard tooling. The performance numbers Jensen touts on stage—the ones nobody in the industry believed were achievable in practice.
Turns out they are achievable. You just need to know which IterDomain compositions preserve whole-model fusion, where to force tactic selection, and how to match precision to measured entropy at every gauge boundary.
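To make "match precision to measured entropy" a little more concrete, here is a minimal, illustrative sketch in PyTorch: estimate the empirical entropy of a latent tensor and compare it against the bit budget of candidate formats. This is a toy version of the idea, not our actual tooling; the histogram size, the format table, and the decision rule are all placeholder choices.

```python
# Toy illustration of entropy-matched precision (not production code).
# Estimate the empirical entropy of a tensor and compare it to the bit budget
# of candidate formats; the bin count and format table are placeholder choices.
import torch


def empirical_entropy_bits(x: torch.Tensor, n_bins: int = 4096) -> float:
    """Shannon entropy, in bits per element, of a histogram over x's values."""
    hist = torch.histc(x.float().flatten(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log2()).sum())


# Rough "bits per value" budget for each candidate format.
FORMATS = {"nvfp4 (e2m1 + block scales)": 4.0, "fp8 (e4m3)": 8.0, "bf16": 16.0}

latent = torch.randn(1, 16, 128, 128)  # stand-in for a diffusion latent
h = empirical_entropy_bits(latent)
print(f"measured entropy ~ {h:.2f} bits/element")
for name, bits in FORMATS.items():
    verdict = "enough headroom" if bits >= h else "potentially lossy"
    print(f"  {name:28s} {bits:>4.0f} bits -> {verdict}")
```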
[Chart: FLUX.1-dev megapixel inference on an RTX PRO 6000 Blackwell, in iterations per second, across runtimes up to myelin4. The range shown reflects end-to-end integration overhead: AdaLN memory ordering, VAE↔DiT precision boundaries, and CFG batching.]
The chart shows the full inference stack on FLUX.1-dev, from PyTorch eager to our myelin4 runtime. The official tools (TensorRT ModelOpt, nunchaku) plateau around 8 it/s. They hit the same ceiling from different angles: ModelOpt's W4A8 quantization chokes on unfused attention, while nunchaku's W4A4 loses gains to its low-rank outlier branch. Our approach bypasses both bottlenecks by emitting ONNX patterns that trigger native Blackwell FP4 tensor core tactics.
The key insight: NVIDIA shipped Blackwell with incredible FP4 tensor cores, but the official tools don't generate the right graph patterns to trigger them. We figured out the exact ONNX quantization structure that makes TensorRT's Myelin compiler select the cutlass3x_sm120_bstensorop kernels instead of falling back to sm80 FP16 compute.
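If you want to check this on your own engines, TensorRT's engine inspector reports which kernels were actually selected. Below is a hedged sketch: the engine path is a placeholder, and the engine has to be built with detailed profiling verbosity for tactic and kernel names to appear in the dump.

```python
# Sketch: dump TensorRT's per-layer information and count kernels by SM target.
# Assumes the engine was built with ProfilingVerbosity.DETAILED so that tactic
# and kernel names appear in the JSON; "engine.plan" is a placeholder path.
import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("engine.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)

# Count how many layer entries mention native Blackwell (sm120) kernels versus
# Ampere (sm80) fallbacks anywhere in their metadata.
layers = json.loads(info).get("Layers", [])
counts = {"sm120": 0, "sm80": 0}
for layer in layers:
    blob = json.dumps(layer)
    for marker in counts:
        counts[marker] += marker in blob

print(counts)  # a healthy FP4 build should be dominated by sm120 kernels
```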
Why The Gap Exists
Same hardware, different optimization depths:

Official tooling (ModelOpt W4A8 / nunchaku W4A4):
- W4A8 weight-only quant → attention hits an unfused Myelin block, and the graph breaks at precision boundaries
- W4A4 + FP16 low-rank branch → SVD outlier absorption adds ~16% memory traffic
- BF16 decode → 8× more bits than the latent entropy justifies
- Kernel: sm80_xmma_gemm_f16f16 (Ampere fallback, graph fragmentation at every boundary)

Our myelin4 runtime:
- Block-scaled NVFP4 with forced tactic selection (we know which IterDomain compositions preserve whole-model fusion)
- Bidirectional noncausal attention emitted as a single fused op → Myelin selects sm120 tactics
- Precision matched to measured latent entropy → epilogue fusion at gauge boundaries
- Kernel: cutlass3x_sm120_bstensorop (native Blackwell FP4, whole-graph fusion intact)
The moat isn't quantization math—it's knowing where to force tactics, what compositions preserve fusion, and matching precision to measured entropy at every boundary.
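For readers who want a feel for what "block-scaled NVFP4" means numerically, here is a small, self-contained PyTorch sketch of blockwise e2m1 fake-quantization. It only illustrates the rounding behavior; it does not produce packed tensors or the fused Blackwell kernels described above, and the block size and scale handling are simplifying assumptions.

```python
# Minimal sketch of block-scaled FP4 (e2m1) fake-quantization in plain PyTorch.
# Illustrative only: block size and scale format are simplifying assumptions.
import torch

# The non-negative values representable by an e2m1 (FP4) element.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_nvfp4_blockwise(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Fake-quantize a tensor to e2m1 with one scale per `block` values."""
    flat = w.flatten()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block)

    # One scale per block, chosen so the largest magnitude maps to 6.0 (e2m1 max).
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 6.0
    scaled = blocks / scale

    # Snap each scaled value to the nearest representable e2m1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    deq = E2M1_GRID[idx] * scaled.sign() * scale

    return deq.flatten()[: w.numel()].view_as(w)


w = torch.randn(3072, 3072)
w_q = quantize_nvfp4_blockwise(w)
print("mean abs error:", (w - w_q).abs().mean().item())
```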
[Chart: s4 op-gen kernel performance for NVFP4 fusion patterns on an RTX PRO 6000 Blackwell: peak and sustained TFLOPS, compression ratio, and quantization throughput.]
Why This Matters
Efficient inference is THE game for the next decade, and one of the biggest commercial opportunities of our lifetimes. Every chatbot response, every AI-generated image, video, and voice clip, every robot, drone, and self-driving car: they all run on inference.
The inference market is projected to hit $255B annually by 2030, and every company deploying AI is already under pressure to cut inference costs. NVIDIA's $20B acquisition of Groq signals that the one-size-fits-all era of inference is ending. A new era of efficient, customized inference is beginning, and Fleek is positioned to compete and win on the software side.
Several billion-dollar inference platforms already exist (Together, Fireworks, Fal, Baseten), but their optimization approaches are largely similar and conventional. Ours is different: we don't guess at precision based on convention, we measure the actual information content and optimize each architecture accordingly. The physics says this should work, and our benchmarks already confirm it does. General-purpose optimization is the endgame, and we've now shown it's possible.
What We're Actually Building
We are mainly focused on building two things:
- A general-purpose optimization toolchain (think TensorRT-LLM, but for any PyTorch model)
- An inference platform to deploy and run the optimized models
Here's the upcoming roadmap of what we will be releasing:
Initial Live s4-Optimized Models
FLUX.1-dev, Wan 2.2, and flagship diffusion models on s4 TensorRT codegen. 100-300 it/s megapixel generation.
General Purpose Torch Compiler Stack
Targeting Blackwell/Rubin and MLIR linalg. Paste a HuggingFace link, get an s4-optimized deployment with whole-model fusion intact.
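To make that concrete, here is a purely hypothetical sketch of how that flow could feel from a developer's point of view. Nothing here exists yet: the s4 package name, the optimize/deploy entry points, and every parameter are placeholders for the workflow described above, not a published API.

```python
# Hypothetical developer flow (illustrative only; none of these names are a
# real, published API).
from s4 import optimize, deploy  # placeholder package and entry points

# Point the toolchain at a Hugging Face model ID...
engine = optimize(
    "black-forest-labs/FLUX.1-dev",  # any PyTorch model on the Hub
    target="blackwell",              # e.g. RTX PRO 6000 / Rubin
    precision="auto",                # chosen from measured entropy per boundary
)

# ...and get back a deployable, whole-model-fused engine.
endpoint = deploy(engine)
print(endpoint.url)
```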
Edge & Embedded on Jetson Thor
Our compiler stack thrives in constrained environments. Same precision-matched fusion, 10-50W power envelope.
Safety-Critical Verticals via Proof Certification
Formal verification of precision boundaries and fusion correctness for automotive, aerospace, and medical deployment.
We're also open-sourcing research and code along the way—starting today with some of our core infrastructure for the NVIDIA ecosystem.
Sticking with the Fleek Brand
We considered starting fresh. But there's something better about a comeback.
So Fleek is still the brand, but we're switching to the fleek.sh domain to better reflect the new focus on efficient inference and AI developers. The new homepage is already live.
Weyl becomes our internal research lab brand—think DeepMind to Google. R&D, papers, open-source contributions. The engine behind the product.
The FLK token remains central to the project, now tied directly to efficient inference and platform usage, which is a way bigger and more exciting opportunity. Updated tokenomics here.
Long Live Fleek
This is the start of (yet another) new chapter for Fleek, but this time is different. We've got a real technical cofounder/CTO in Ben. We've got real differentiated AI tech that actually works. And we're early movers positioned exactly where the puck is going in one of the biggest commercial market opportunities of our generation: efficient inference.
All of Fleek's past failures, mistakes, and pivots, as painful as they were, were worth it because they led us to this opportunity and to the exact position we're in right now. 2026 is the year Fleek becomes a global leader in efficient inference, and a startup comeback story for the ages. I don't expect anybody to believe it until they see it, but I promise you'll start to see it in the coming weeks and months with what we put out.
Now back to the garage.
— Harrison