# FlashKDA Delta Attention Skill

Skill by ara.so, Daily 2026 Skills collection.

FlashKDA provides high-performance CUDA kernels for Kimi Delta Attention (KDA) built on CUTLASS. It targets SM90+ GPUs (H100/H20 class) and integrates as a drop-in backend for 's operation.

## Requirements

- GPU: SM90+ (H100, H20, or newer)
- CUDA 12.9+
- PyTorch 2.4+
- Python 3.8+

## Installation

Install the FLA integration (optional but recommended):

## Core Kernel API

The primary low-level kernel call:

Tensor shapes and dtypes:

| Parameter     | Dtype        | Shape              | Notes |
|---------------|--------------|--------------------|---…