People know that one of my minor obsessions has been alternative ways to get use out of the ray tracing unit.
I think I've found my first genuine candidate!
Take a look at Uni-STC: Unified Sparse Tensor Core (2026):
https://www.ssslab.cn/assets/papers/2026-lian-UniSTC.pdf
This is a really nice idea. The problem to be solved is providing an accelerator for sparse linear algebra (e.g. dense matrix times sparse vector, sparse matrix times sparse matrix, etc.). The big ideas are:
1. Whatever compression scheme is used for the sparse arrays, it is pre-processed ahead of the accelerator to generate a (temporary) bitmap indicating zero vs non-zero elements. This bitmap may be loaded directly from memory (it's one reasonable way to represent a sparse vector/matrix), or it may be constructed on demand by some auxiliary HW in parallel with loading the sparse vectors/matrices.
2. We have essentially three async HW loops, connected to each other by queues.
The outermost loop considers items at a 16*16*16 granularity, using an outer product bitmap to figure out which of the 16*16 output elements might be non-zero.
Any time the resultant 16*16 tile is non-zero, it's passed on to the next stage, which considers the 16*16 tile at a 4*4*4 granularity, figuring out which particular 4-element dot products will be non-zero.
Descriptors for these non-zero 4-element dot products are passed to the final stage, which then performs the dot products.
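To make the three stages concrete, here's a rough software model of the flow as I've described it. This is my own sketch, not the paper's hardware design: the names (`bitmap`, `sparse_matmul`, `tile_q`, `dot_q`) are made up, the matrices are stored densely for simplicity, the deques stand in for the HW queues, and the stages run one after another rather than concurrently. It also assumes square inputs whose size is a multiple of 16.

```python
from collections import deque

T, S = 16, 4  # tile edge (16*16*16 stage) and dot-product length (4*4*4 stage)

def bitmap(m):
    """Zero vs non-zero bitmap of a matrix given as a list of rows."""
    return [[x != 0 for x in row] for row in m]

def sparse_matmul(a, b):
    """Compute a @ b, skipping work that the bitmaps show to be all-zero."""
    n = len(a)  # assumption: a and b are n*n, with n a multiple of 16
    abm, bbm = bitmap(a), bitmap(b)
    c = [[0] * n for _ in range(n)]
    tile_q, dot_q = deque(), deque()  # stand-ins for the inter-stage queues
    stats = {"tiles": 0, "dots": 0}

    # Stage 1: at 16*16*16 granularity, keep a tile triple only if some k has
    # a non-zero in A's column k (within the tile rows) and in B's row k
    # (within the tile columns), i.e. the bitmap outer product is non-empty.
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                hit = any(
                    any(abm[i][k] for i in range(i0, i0 + T))
                    and any(bbm[k][j] for j in range(j0, j0 + T))
                    for k in range(k0, k0 + T)
                )
                if hit:
                    tile_q.append((i0, j0, k0))

    # Stage 2: within each surviving tile, emit a descriptor for every
    # 4-element dot product that has at least one non-zero term.
    while tile_q:
        i0, j0, k0 = tile_q.popleft()
        stats["tiles"] += 1
        for i in range(i0, i0 + T):
            for j in range(j0, j0 + T):
                for k1 in range(k0, k0 + T, S):
                    if any(abm[i][k] and bbm[k][j] for k in range(k1, k1 + S)):
                        dot_q.append((i, j, k1))

    # Stage 3: perform the described dot products and accumulate into C.
    while dot_q:
        i, j, k1 = dot_q.popleft()
        stats["dots"] += 1
        c[i][j] += sum(a[i][k] * b[k][j] for k in range(k1, k1 + S))

    return c, stats
```

Calling `sparse_matmul(a, b)` returns the product plus counts of how many tiles and dot products survived the filtering. The thing to notice is that only the last stage does any arithmetic; the first two do nothing but bitmap tests and bookkeeping.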
The details are in the paper. This is all rather nice! And, except for extremely sparse matrices, it works better than other known solutions.
However what caught my eye was that there are many similarities between what's being done here and what's being done in the ray tracing unit. Both basically walk a data structure performing simple auxiliary calculations and waiting on memory so that they can occasionally feed some genuine computational work to the primary GPU datapath.
It feels like it shouldn't be too difficult to augment the existing ray tracing unit with these ideas, and thereby have a mechanism that can substantially boost the performance of sparse linear algebra (which is, of course, probably going to be helpful to AI).
And once the idea has been validated in the GPU, something similar could probably be added to the ANE.
And maybe even to AMX? (The fit to AMX is not so great, though it is feasible, especially if you shard your large computation over multiple cores so that each of these sparse pre-processors is working simultaneously. I want to see it attached to AMX for the selfish reason that science other than just AI wants to use sparse linear algebra, and AMX is the only place where we have FP64.)