# Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

---

## Model Overview

| Model | Modality | Task |
|-------|----------|------|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

---

## CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

### CLIP Use Cases

| Task | How |
|------|-----|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
|…
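The zero-shot classification and image search use cases listed for CLIP both reduce to comparing embeddings in a shared space. The following is a minimal sketch of that idea in plain Python. It uses hand-written toy vectors in place of real CLIP embeddings, and the function names, the `scale` factor, and the file names are assumptions made for illustration, not part of any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Score one image embedding against a dict of text label embeddings.

    `scale` is an assumed temperature-like factor that sharpens the
    softmax; it is a hypothetical default chosen for this sketch.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[l]) for l in labels]
    probs = softmax([scale * s for s in sims])
    return dict(zip(labels, probs))


def image_search(query_embedding, image_embeddings, top_k=1):
    """Return the names of the top_k images most similar to a text query."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy stand-ins for embeddings a real model would produce (assumed values).
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.1, 0.0]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)

image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
top_images = image_search([1.0, 0.0, 0.0], image_embeddings, top_k=2)
```

In practice the vectors would come from a CLIP image encoder and text encoder rather than being typed by hand; the comparison step itself stays the same.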