stepfun-tts — Skillopedia

StepFun stepaudio-2.5-tts

Generate Chinese / Japanese speech with stepaudio-2.5-tts (released 2026-04, verified 2026-04-23). This is contextual TTS: emotion and prosody go through natural-language description, not fixed labels.

Companion: for transcription with (the sibling model), use the skill; they share an API key but live on different endpoints with different body shapes.

Why this skill exists: StepAudio 2.5 has two non-obvious pitfalls that cost hours if you don't know them:

1. rejects (the step-tts-2 way). Emotion/prosody now goes through (natural-language description, ≤200 chars) and inline parentheses insi…