SASS King, Part 2: Reading the Compiler's Mind

1. Why a Second Article

When I started SASS King, I did not really know its purpose beyond understanding compiler behaviour better. But the deeper I went into CUDA, the more I found myself drawn to the click between software and hardware, that tiny frontier few people actually care about. As a GPU programmer, when you write a kernel you still think at a level of abstraction that delegates a lot of superpower to the CUDA machinery. And nowadays NVIDIA keeps abstracting further up the stack, giving developers room to focus on high-level challenges, which are already numerous. If we agree that the principal use cases today are LLM architectures, we can start to see that the main challenges a developer faces are rarely at the register level. Going further, we can even conclude that the main bottlenecks are rarely at the compute level at all; they sit at the memory level, in how we transfer data: host to device (H2D), device to device (D2D), or device to host (D2H).

A recent documented case makes this concrete. Anam, a company building real-time AI avatars, found that their Cara-3 inference pipeline was not GPU-bound at all. The GPU would run a burst of kernels, then sit idle, then run again. The bottleneck was Python runtime overhead stalling CPU dispatch and leaving the GPU waiting. Fixing the dispatch path, not the kernels, gave them a 2.5x latency improvement on the same hardware. Considering this, chasing a full understanding of the deep machinery seems to sit on the wrong side of the risk/reward ratio. ...
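The shape of that bottleneck is easy to reproduce on the CPU side alone. The sketch below is an analogy, not Anam's actual fix (whose details are theirs): it contrasts many tiny Python-level dispatches against one batched operation, using NumPy timing as a stand-in for kernel launches driven one-by-one from Python.

```python
# Analogy only: per-call Python dispatch overhead vs. one batched dispatch.
# The same pattern stalls a GPU when kernels are launched one at a time
# from a Python loop instead of being batched or graph-captured.
import time
import numpy as np

N = 50_000
a = np.arange(N, dtype=np.float64)
b = np.ones(N, dtype=np.float64)

# Many small dispatches: one Python-level call per element.
t0 = time.perf_counter()
out_slow = np.empty(N)
for i in range(N):
    out_slow[i] = a[i] + b[i]
per_call = time.perf_counter() - t0

# One batched dispatch: a single vectorized call does the same work.
t0 = time.perf_counter()
out_fast = a + b
batched = time.perf_counter() - t0

assert np.array_equal(out_slow, out_fast)
print(f"per-call: {per_call:.4f}s, batched: {batched:.6f}s")
```

The arithmetic is identical in both paths; only the number of dispatches changes, and that alone dominates the runtime of the first version.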

May 8, 2026 · 16 min · 3397 words · Florian Mattana