<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Florian Mattana</title>
    <link>https://florianmattana.com/</link>
    <description>Recent content on Florian Mattana</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://florianmattana.com/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>I Wrote an MXFP4 Quantization Kernel and Ranked #1 on Tensara</title>
      <link>https://florianmattana.com/posts/mxfp4_article/</link>
      <pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://florianmattana.com/posts/mxfp4_article/</guid>
      <description>&lt;h2 id=&#34;why-i-did-this&#34;&gt;Why I Did This&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m building an FP4 fused attention kernel for consumer Blackwell GPUs (SM120). That means I spend my days thinking about how to squeeze 32-bit numbers into 4 bits without losing too much information.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://tensara.org&#34;&gt;Tensara&lt;/a&gt; is a platform where you submit GPU kernels and compete on real hardware. They had an MXFP4 quantization problem with almost no submissions. I figured: I already know this format inside out on SM120, how hard can it be to write a standalone quantization kernel?&lt;/p&gt;</description>
    </item>
    <item>
      <title>Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120)</title>
      <link>https://florianmattana.com/posts/fp4-fused-attention-kernel-sm120/</link>
      <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://florianmattana.com/posts/fp4-fused-attention-kernel-sm120/</guid>
      <description>A deep dive into writing a fused FP4 attention kernel for the RTX 5070 Ti, using inline PTX and warp-level MMA instructions.</description>
    </item>
    <item>
      <title>From Silicon to Thread Identity: How CUDA Threads Know Who They Are</title>
      <link>https://florianmattana.com/posts/from-silicon-to-thread-identity/</link>
      <pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://florianmattana.com/posts/from-silicon-to-thread-identity/</guid>
      <description>How threadIdx, blockIdx and blockDim are hardware registers, not software functions, and what that means for performance.</description>
    </item>
    <item>
      <title>Exploring PTX: A Close Look at Tile Optimization in CUDA</title>
      <link>https://florianmattana.com/posts/exploring-ptx-tile-optimization/</link>
      <pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://florianmattana.com/posts/exploring-ptx-tile-optimization/</guid>
      <description>Comparing naive and tiled matrix transpose kernels at the PTX level to understand shared memory, coalescing, and bank conflicts.</description>
    </item>
    <item>
      <title>About</title>
      <link>https://florianmattana.com/about/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://florianmattana.com/about/</guid>
      <description>&lt;p&gt;I&amp;rsquo;m Florian Mattana, a GPU kernel engineer based in France.&lt;/p&gt;
&lt;p&gt;I write CUDA kernels at the PTX level for LLM inference: fused attention, quantized GEMM, online softmax. You can read about the technical work on the &lt;a href=&#34;https://florianmattana.com/posts/&#34;&gt;blog&lt;/a&gt; and on &lt;a href=&#34;https://github.com/florianmattana&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I got into GPU computing through a weird path. Started in finance (Sorbonne master&amp;rsquo;s), then built a crypto mining rig around 2015 and got hooked on understanding why memory bandwidth matters more than clock speed. That led to production CUDA work at Geopost, Airbus, and Melexis, and eventually to writing inference kernels full time.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
