Multimodal Embedding Models vs Text Embedding Models: A Benchmark Study

There are two major approaches to building multimodal search systems. The first encodes images directly into vector representations using vision models such as CLIP (multimodal) or DINOv2 (image-only). The second applies text embedding models to text descriptions of the original data: for example, using an LLM to generate image descriptions, then embedding those descriptions.

In production environments, the text embedding approach is often the natural choice because modern AI pipelines already rely heavily on LLMs. Working with text data makes it easier to integrate multimodal content into existing infrastructure. Choosing the right embedding approach affects cost, development effort, maintenance overhead, and retrieval performance.

However, text embedding models are often considered just a fallback—an alternative when multimodal models aren't available. The assumption is that multimodal embedding models are inherently superior since they're specifically trained to handle visual data. But is this really the case? Can text embedding methods perform just as well, or even better?

To answer these questions, I compared multimodal embedding models against text embedding models using standardized benchmarks, focusing on image and text modalities.

Models Tested

I selected models that are cost-effective and practical for production deployments:

| Model | Type | Description | Tested On |
| --- | --- | --- | --- |
| CLIP (ViT-B/32) | Multimodal | OpenAI's contrastive image-text model | MS COCO, DreamSim |
| DINOv2 (Base) | Image-only | Meta's self-supervised vision model | DreamSim |
| GPT-5-nano + text-embedding-3-small | Text | LLM-generated descriptions + text embeddings | MS COCO, DreamSim |

For the text embedding approach, I used GPT-5-nano to generate detailed text descriptions of images, then embedded those descriptions using OpenAI's text-embedding-3-small model.
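
For illustration, here is a minimal sketch of that two-step pipeline using the OpenAI Python SDK. The prompt wording, helper names, and file handling are my own assumptions; the article does not include its exact generation code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_image(path: str) -> str:
    """Step 1: ask GPT-5-nano for a detailed description of a local image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def embed_description(description: str) -> list[float]:
    """Step 2: embed the description with text-embedding-3-small."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=description,
    )
    return response.data[0].embedding


description = describe_image("example.jpg")  # hypothetical local file
vector = embed_description(description)      # 1536-dimensional embedding
```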

Why these models?

  • Cost-effective: All models are practical for production use
  • Representative: CLIP and DINOv2 are among the most popular open-source vision models; text-embedding-3-small is widely used for text retrieval (a short encoding sketch for the two vision models follows this list)
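
For reference, here is a minimal sketch of how images can be encoded with the two vision models via their Hugging Face transformers checkpoints (openai/clip-vit-base-patch32 and facebook/dinov2-base). The article does not specify its exact inference setup, so treat this as one plausible configuration.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32: joint image-text embedding space
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# DINOv2 (Base): self-supervised, image-only embeddings
dino_model = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()
dino_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")


@torch.no_grad()
def clip_image_embedding(image: Image.Image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt").to(device)
    features = clip_model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)


@torch.no_grad()
def dinov2_embedding(image: Image.Image) -> torch.Tensor:
    inputs = dino_processor(images=image, return_tensors="pt").to(device)
    # Use the CLS token of the last hidden state as a global image descriptor
    cls = dino_model(**inputs).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)


image = Image.open("example.jpg").convert("RGB")
print(clip_image_embedding(image).shape)  # torch.Size([1, 512])
print(dinov2_embedding(image).shape)      # torch.Size([1, 768])
```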

Benchmark 1: Cross-Modal Retrieval (MS COCO)

About MS COCO

MS COCO is a large-scale dataset containing images paired with human-written captions. It serves as the de facto standard benchmark for image-text retrieval tasks. Each image has five human-annotated captions describing its contents.

Sample from MS COCO
[Image: a small dog sleeping on a wire shoe rack filled with shoes]

Human-Written Captions:

• This wire metal rack holds several pairs of shoes and sandals

• A dog sleeping on a shoe rack in the shoes

• Various slides and other footwear rest in a metal basket outdoors

• A small dog is curled up on top of the shoes

• A shoe rack with some shoes and a dog sleeping on them

The benchmark measures how well an embedding model can match images with their correct captions (and vice versa). Higher Recall@K indicates better performance.
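
As a reference point, Recall@K can be computed from a query-candidate similarity matrix roughly as follows. This is a generic sketch of the metric, not the evaluation script used here, and it assumes one correct candidate per query (MS COCO has five captions per image, where a hit on any of them typically counts).

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """similarity[i, j]: score between query i and candidate j.
    Assumes the correct candidate for query i sits at column i."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]  # best k candidates per query
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example with 3 queries: the correct match is ranked 1st, 3rd, and 2nd
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],
    [0.2, 0.7, 0.4],
])
print(recall_at_k(sim, 1))  # 0.333...
print(recall_at_k(sim, 3))  # 1.0
```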

Results

MS COCO results (R@K in %, averaged over both retrieval directions; see the Appendix for per-direction numbers)

| Model | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| Text Embedding | 46.35 | 69.89 | 78.64 |
| CLIP (ViT-B/32) | 40.31 | 65.50 | 75.21 |

Key Finding: The text embedding model outperformed CLIP on cross-modal retrieval. This result suggests that LLM-generated descriptions effectively capture semantic content for matching with human-written captions.

Benchmark 2: Image-to-Image Retrieval (DreamSim NIGHTS)

About DreamSim NIGHTS

The NIGHTS dataset from DreamSim evaluates perceptual image similarity. It consists of human-curated triplets: one reference image and two candidate images. Human annotators selected which candidate is more visually similar to the reference.

More established benchmarks for image-to-image retrieval exist, such as Revisited Oxford and Paris (ROxford/RParis). However, these datasets measure instance-level similarity—finding the exact same landmark or product across different photos. For example, in ROxford, a photo of a statue and a photo of a museum building can be labeled as the "same landmark" simply because they belong to the same location, even though they look nothing alike visually. I chose DreamSim NIGHTS because it evaluates perceptual similarity, which better reflects how humans intuitively judge visual resemblance and is more representative of real-world retrieval scenarios.

Sample Triplet from NIGHTS
[Images: Reference, Candidate A, Candidate B ✓]

Humans judged Candidate B as more visually similar to the reference.

The benchmark measures how accurately embedding models predict which candidate image humans would choose as more similar to the reference.
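
Concretely, 2AFC accuracy can be computed from precomputed embeddings roughly as follows; this is a generic sketch, not the benchmark's official evaluation code.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def two_afc_accuracy(triplets) -> float:
    """triplets: iterable of (ref, cand_a, cand_b, human_choice) tuples,
    where the first three are embedding vectors and human_choice is 'a' or 'b'."""
    correct = 0
    total = 0
    for ref, cand_a, cand_b, human_choice in triplets:
        # The model "chooses" whichever candidate is closer to the reference
        model_choice = "a" if cosine(ref, cand_a) > cosine(ref, cand_b) else "b"
        correct += int(model_choice == human_choice)
        total += 1
    return correct / total
```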

Results

DreamSim results

| Model | 2AFC Accuracy |
| --- | --- |
| DINOv2 (Base) | 84.72% |
| CLIP (ViT-B/32) | 82.17% |
| Text Embedding | 73.21% |

Key Finding: Image-native models (CLIP and DINOv2) significantly outperformed the text embedding approach on visual similarity tasks. DINOv2 achieved the highest accuracy, demonstrating that models with dedicated image encoders excel at capturing fine-grained visual details.

Qualitative Examples

To help illustrate the benchmark, here are some example triplets where the two models disagreed. Note that all human judgments in DreamSim NIGHTS are based purely on visual similarity—not conceptual or semantic similarity.

Cases Where CLIP Matched Human Judgment but Text Embeddings Did Not

Out of 2,120 triplets, there were 408 cases where CLIP correctly matched human judgment but text embeddings failed.

[Image triplets omitted: references of a Mexican Aster, a Miniature Poodle, and a Starfish, each shown with Option A, Option B, and the CLIP and Text Embedding picks]

Cases Where Text Embeddings Matched Human Judgment but CLIP Did Not

There were 216 cases where text embeddings correctly matched human judgment but CLIP failed.

[Image triplets omitted: references of an Art Gallery, a Hamburger, and a Tractor, each shown with Option A, Option B, and the CLIP and Text Embedding picks]

Summary

Summary comparison

| Task | Best Model | Winner |
| --- | --- | --- |
| Cross-Modal Retrieval (Text↔Image) | Text Embedding (R@1: 46.35%) | Text Embedding |
| Image-to-Image Retrieval | DINOv2 (84.72% accuracy) | Multimodal |

  1. For cross-modal retrieval (text↔image): Text embedding models performed better than CLIP. LLM-generated descriptions effectively capture semantic content for matching with human-written text.
  2. For image-to-image retrieval: Multimodal models (CLIP, DINOv2) showed a clear advantage. Models with dedicated image encoders are better at capturing fine-grained visual details that are difficult to express in text.
  3. Text embeddings performed surprisingly well even on pure visual similarity benchmarks (73% accuracy), despite being an indirect approach that relies on text as an intermediate representation.

Discussion

At first glance, the results suggest that CLIP and DINOv2 have better capability at capturing visual similarity compared to text embeddings. This seems intuitive: models with dedicated image encoders can directly process visual information, while text embeddings rely on an intermediate text representation. But does the ~10 percentage point gap in accuracy (82-85% vs 73%) reflect a fundamental limitation of text-based methods, or something else?

To explore this, consider how CLIP actually works. CLIP was trained only on image-text pairs; there was no dataset from which it could directly learn image-to-image similarity. Instead, CLIP learned image-image similarity indirectly, through natural language. This has an important implication: CLIP can only capture visual similarities that are either expressible in natural language or statistically correlated with captioned concepts across its training data.

Given this, in theory, a text-based embedding method should be able to perform just as well as CLIP—if the text descriptions are sufficiently detailed and accurate. However, in this benchmark, the text embedding approach (GPT-5-nano + text-embedding-3-small) did not outperform CLIP on image-to-image retrieval. The most likely explanation is information loss during the two-step process: first when converting images to text descriptions, and again when converting those descriptions to embeddings. Each step discards information that the next step cannot recover. In contrast, CLIP performs image-to-embedding conversion in a single step, preserving more visual information in the final representation.

DINOv2, on the other hand, was trained using pure self-supervision on images alone—no text involved at all. This means DINOv2 learns visual similarity directly from pixel patterns, not indirectly via language. Yet despite CLIP's inherent limitation of learning through text, it performed nearly as well as DINOv2 on the image-to-image benchmark (82.17% vs 84.72%). This suggests that most visual similarities humans perceive are indeed capturable through language, which may explain why CLIP's indirect approach achieves comparable results to DINOv2's direct visual learning.

Practical Recommendations

For most use cases, text embedding models are sufficient. In production, text embeddings are nearly unavoidable since LLMs are already part of most AI pipelines. Adding multimodal embedding models requires:

  • Managing multiple embedding spaces
  • Self-hosting models (CLIP, DINOv2)
  • Additional complexity at query time

Consider multimodal models if:

  • Your application requires high-precision image-to-image similarity search
  • You are willing to accept the additional infrastructure complexity

Appendix: Detailed Benchmark Results

MS COCO (val2014-5k)

CLIP (ViT-B/32)

Image→Caption: R@1: 50.08%, R@5: 75.01%, R@10: 83.55%

Caption→Image: R@1: 30.55%, R@5: 55.99%, R@10: 66.87%

Average: R@1: 40.31%, R@5: 65.50%, R@10: 75.21%

Text Embedding (GPT-5-nano + text-embedding-3-small)

Description→Caption: R@1: 54.64%, R@5: 77.39%, R@10: 84.95%

Caption→Description: R@1: 38.07%, R@5: 62.40%, R@10: 72.34%

Average: R@1: 46.35%, R@5: 69.89%, R@10: 78.64%

DreamSim NIGHTS (test split, 2,120 triplets)

| Model | 2AFC Accuracy |
| --- | --- |
| DINOv2 (Base) | 84.72% |
| CLIP (ViT-B/32) | 82.17% |
| Text Embedding | 73.21% |