Multimodal RAG Embedding Model Matrix

| Model | Modalities | Dim | Notes |
|---|---|---|---|
| CLIP ViT-L/14 (OpenAI) | image + text | 768 | Classic baseline, weak on OCR |
| SigLIP 2 (Google) | image + text | 768/1152 | Stronger zero-shot than CLIP |
| ImageBind (Meta) | image/video, text, audio, depth, thermal, IMU | 1024 | Only choice for 6-modality joint space |
| VoyageAI | image + text (interleaved) | 1024 | SOTA for PDF pages with mixed layout |
| Cohere | image + text | 1536 | Competitive, enterprise SLAs |
| BGE-M3 (BAAI) | text (dense + sparse + ColBERT) | 1024 | Text-only but multi-vector;…