dflash-mlx-speculative-decoding

dflash-mlx Speculative Decoding Skill by ara.so, from the Daily 2026 Skills collection.

DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (1B params) generates 16 tokens in parallel using block diffusion; the target model then verifies all 16 in a single forward pass. Tokens are emitted only after target verification, so the output is lossless: every token is the target model's greedy argmax.

Typical speedups: 1.7x–4.1x over baseline, depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.

Installation

Requires Python 3.1…
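The draft-then-verify loop described in the overview can be illustrated with a small, library-free sketch. This is not the dflash-mlx API: `target_next`, `draft_next`, `speculative_decode`, and `greedy_decode` are hypothetical names, both "models" are toy deterministic functions, and the draft here proposes its block sequentially rather than with block diffusion. The sketch only aims to show why accepting the longest verified prefix, plus the target's own token at the first mismatch, reproduces plain greedy decoding exactly.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the target model's greedy argmax given a prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


def draft_next(prefix: List[int]) -> int:
    """Toy stand-in for the draft model: agrees with the target most of
    the time, but is deliberately wrong at every fifth position."""
    tok = target_next(prefix)
    return (tok + 1) % VOCAB if len(prefix) % 5 == 4 else tok


def greedy_decode(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[int], max_new_tokens: int, block_size: int = 16
) -> Tuple[List[int], float]:
    """Draft a block, verify it against the target, keep the agreed prefix."""
    out = list(prompt)
    final_len = len(prompt) + max_new_tokens
    proposed = accepted = 0

    while len(out) < final_len:
        # 1. The draft proposes a block of candidate tokens.
        draft: List[int] = []
        for _ in range(block_size):
            draft.append(draft_next(out + draft))

        # 2. The target scores every drafted position. A real system would
        #    do this in one batched forward pass; here it is a plain loop.
        verified = [target_next(out + draft[:i]) for i in range(block_size)]

        # 3. Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < block_size and draft[n_ok] == verified[n_ok]:
            n_ok += 1

        # 4. Emit the accepted tokens, plus the target's own token at the
        #    first mismatch, so every emitted token is a target argmax.
        emitted = draft[:n_ok]
        if n_ok < block_size:
            emitted.append(verified[n_ok])
        out.extend(emitted)

        proposed += block_size
        accepted += n_ok

    return out[:final_len], accepted / proposed


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rate = speculative_decode(prompt, max_new_tokens=40)
    # Lossless: identical to decoding with the target alone.
    assert tokens == greedy_decode(prompt, max_new_tokens=40)
    print(f"toy acceptance rate: {rate:.2f}")
```

Each loop iteration emits at least one token, so the loop always terminates. Any speedup in a real system comes from step 2 costing roughly one target forward pass regardless of how many drafted tokens are accepted.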