It’s just really impractical to use a licensed programming language in 2025.
Possibly rose-tinted glasses on my part, but I’m optimistic for 2026. Chris Lattner has a pretty strong track record of getting these things right.
I haven’t tried it in a long time, but since it’s a Python superset, I tried to drop it into my Jupyter notebook Docker container: you had to agree to license terms, register your email, and install a Modular package that contained a bunch of extra things.
If you want widespread adoption for a Python superset, you would probably want it included in the official Jupyter Docker images, since people who do this sort of programming like to use a Jupyter REPL, but they just made it so difficult.
I’m no open source zealot and I’m happy to pay for software, but I think the underlying language needs to be a lot more open to be practical.
Btw, Mojo's development is a masterclass in language development and community building; it's been fun watching Chris go back to fix technical debt in existing features rather than pressing ahead with new ones.
The same thing happened with MLIR while Lattner was at Google:
> MLIR was born—a modular, extensible compiler infrastructure designed to bring order to the chaos. It brought forth a foundation that could scale across hardware platforms, software frameworks, and the rapidly evolving needs of machine learning. It aimed to unify these systems, and provide a technology platform that could harmonize compute from many different hardware makers.
But unification is hard. What started as a technical project quickly turned into a battleground: open-source governance, corporate rivalries, and competing visions all collided. What could have been a straightforward engineering win became something much more complicated.
https://www.modular.com/blog/democratizing-ai-compute-part-8...
I can't say for sure because I couldn't find the CUDA kernel but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all, which is mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you are doing is a transpose you don't need any of this to go fast (though it might simplify your indexing code…?)
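For what it's worth, GB/s figures like the ones compared in this thread are easy to reproduce with a plain CUDA-event timer and no TMA anywhere. A minimal sketch, where the matrix size, iteration count, and kernel name are placeholders of mine rather than the article's setup:

```cuda
// Hypothetical harness: time a transpose kernel with CUDA events and report
// effective bandwidth, counting each element as read once and written once.
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for whatever kernel is being measured (e.g. the plain
// shared-memory version sketched further down, which uses no TMA);
// assumed to be defined elsewhere in the same build.
__global__ void transpose_tiled(float *out, const float *in, int n);

int main() {
    const int n = 1 << 14;                      // 16384 x 16384 floats, 1 GiB per buffer
    const int iters = 100;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  sizeof(float) * n * n);
    cudaMalloc((void **)&d_out, sizeof(float) * n * n);

    dim3 block(32, 8), grid(n / 32, n / 32);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    transpose_tiled<<<grid, block>>>(d_out, d_in, n);   // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        transpose_tiled<<<grid, block>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // read + write = 2 * 4 bytes per element per transpose
    double gbps = 2.0 * sizeof(float) * n * n * iters / (ms * 1e-3) / 1e9;
    printf("effective bandwidth: %.2f GB/s\n", gbps);
    return 0;
}
```

Counting read plus write per element is the usual convention for transpose benchmarks and is consistent with the percent-of-peak numbers quoted below.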
Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.
Great write-up! I learned a lot!
Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?
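Sometimes, yes: instead of materializing the transpose, the consuming kernel can just read the input with swapped indices. A hypothetical elementwise sketch in CUDA (the names, shapes, and the scaling op are illustrative, not from the article):

```cuda
// Hypothetical: compute out = alpha * A^T without ever writing A^T to memory.
// A is rows x cols, row-major; out is cols x rows, row-major.
__global__ void scale_of_transpose(float *out, const float *A, float alpha,
                                   int rows, int cols) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of out = column of A
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // col of out = row of A
    if (i < cols && j < rows)
        out[i * rows + j] = alpha * A[j * cols + i]; // coalesced write, strided read
}
```

The catch is that one side of the access is now strided and uncoalesced (the read of A above), which is exactly the problem the shared-memory tiling described in the reply below is there to fix; and when the next operation is a library GEMM you usually just pass a transpose flag (e.g. CUBLAS_OP_T) rather than writing anything yourself.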
You have global memory and shared memory; global memory is slower.
You read rows from global memory (faster than reading columns).
You write columns into shared memory (slower than writing rows, but shared memory is fast; this is the transpose step).
You read rows from shared memory (very fast).
You write rows to global memory (faster than writing columns).
The idea behind the tiling is to hide the slow, strided part in the memory that is faster.
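For concreteness, here is a minimal CUDA sketch of that classic shared-memory tiling (not the article's TMA-based Mojo kernel). It assumes a square n x n float matrix with n divisible by 32, and it does the strided access on the shared-memory read side rather than the write side, but the idea is the same:

```cuda
#define TILE_DIM 32
#define BLOCK_ROWS 8

// Classic tiled transpose: both global-memory accesses are row-wise
// (coalesced); the strided access happens only in fast shared memory.
__global__ void transpose_tiled(float *out, const float *in, int n) {
    // +1 column of padding so the column-wise reads below
    // don't hit shared-memory bank conflicts
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Read rows from global memory, write rows into shared memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * n + x];

    __syncthreads();

    // Swap the block coordinates so the output write is also row-wise.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Read columns from shared memory (cheap), write rows to global memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

Launch with dim3 block(TILE_DIM, BLOCK_ROWS) and an (n/TILE_DIM, n/TILE_DIM) grid. The +1 padding plays the same bank-conflict-avoidance role as the swizzling discussed elsewhere in the thread, and the j loop is a simple form of thread coarsening: each thread moves TILE_DIM/BLOCK_ROWS = 4 elements.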
> (2771.35 / 2775.49 - 1) * 100 = -0.14916285052369131300
Flagged.
"This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) which is still impressive
He has a bit of a track record already.
Now he's running a for-profit company, and there's already the MAX and MAX Enterprise stuff, so I don't trust that the open-source part will be competitive with the already great inference frameworks, for example.
Are you talking about your libc equivalent or MAX?
The Mojo standard library is already open source. Mojo at the moment does not need a runtime (but if it ever needs one, it'd get open sourced). My point was that Mojo as a whole, as a programming language and a reference implementation, will definitely get open sourced.
MAX itself is a bigger beast, and I'm out of my depth talking about it. I think it'll get open sourced as well; just the timeline might be different (shorter or longer, IDK).
Also, the improvement is 0.14%, not 14%, making the editorialized linkbait particularly egregious.
transpose_naive - Basic implementation with TMA transfers
transpose_swizzle - Adds swizzling optimization for better memory access patterns (see the sketch after this list)
transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
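As an aside on what the swizzle step buys: the article's kernel uses TMA's hardware swizzle modes, but the underlying idea can be sketched in plain CUDA with a generic XOR swizzle. This assumes a square n x n float matrix, n divisible by 32, and a 32x32 thread block; it illustrates the bank-conflict trick, not the article's implementation:

```cuda
#define TILE 32

// Same tiled-transpose idea as the padded version, but instead of padding
// the tile, the column index is XORed with the row index when addressing
// shared memory. Both the row-wise store and the column-wise load then
// touch 32 distinct banks, so neither access has bank conflicts.
__global__ void transpose_xor_swizzle(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE];   // no +1 padding needed

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    // Logical element (row, col) is stored at tile[row][col ^ row].
    tile[threadIdx.y][threadIdx.x ^ threadIdx.y] = in[y * n + x];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y ^ threadIdx.x];
}
```

Padding (tile[TILE][TILE + 1]) achieves the same thing at the cost of a little extra shared memory; swizzled layouts matter more once TMA is doing the copies, since TMA moves whole contiguous tiles and, as far as I know, cannot use a padded layout.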
Performance comparison with CUDA: the Mojo implementations achieve the following bandwidths (the "% of max" figures are relative to a theoretical maximum of 3300 GB/s):
transpose_naive: 1056.08 GB/s (32.0025% of max)
transpose_swizzle: 1437.55 GB/s (43.5622% of max)
transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via the GitHub repo simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
So there is a highly efficient matrix transpose in Mojo.
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."
14% all the time vs 35% some of the time
edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone