multimodal-llm — Skillopedia

# Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 std/lite/fast tiers, Runway Gen-4.5 via ).

Canonical model IDs (pinned against ):

| Provider | Model IDs |
|----------|-----------|
| Anthropic | (latest), , , , |
| OpenAI | (current flagship) |
| Google | (flagship), (cost) |
| Veo | / / |
| Kling | (model name field in Kling API) |
| Runway | (product label: Gen-4.5) |

Quick…