Tutorial: Adding a New Kernel to FlashInfer

This tutorial walks through adding a simple element-wise scale operation to FlashInfer. We'll implement it to demonstrate the complete workflow.

Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor and scalar
- Output: scaled tensor (element-wise)
- Support multiple dtypes (FP16, BF16, FP32)

Step 1: Define CUDA Kernel

Create the kernel file. Key points:

- Framework-agnostic (no Torch headers)
- Uses raw pointers
- Template-based for dtype flexibility
- Only includes what's needed (CUDA runtime, CUDA FP16, CUDA BF16)

Step 2: Create La…