Introduction

I will be honest with you: I have a bit of a thing for vision encoders, and I have wanted to start a series about them for a while. This blog post is meant to be the start of that series. And how better to open a Vision Encoders series than with a method that wasn't built solely for vision encoding? Or at least not entirely.

When we train a neural network on labeled examples, those examples come from a finite set of categories. This is a big limitation, because we can never capture all the categories that exist in the real world.

This limitation was long used as an argument for why ML models, especially computer vision ones, would never fully understand the real world. I say computer vision models suffered the most because, in parallel, the language world's text-to-text training paradigm was achieving state-of-the-art results while avoiding the need for specialized output heads or dataset-specific customization.

Taking inspiration from NLP, the authors of the CLIP paper decided to experiment with a method that learns what an image means from natural language rather than from a fixed set of classes.

CLIP was not the first method to use natural language supervision for image understanding. There were several prior works, but their best results were nowhere near those of supervised models (11.5% vs. 88.4% accuracy on ImageNet). What was missing from these early works was data scale.

Method

We already established that training a model on a finite set of categories will not hold up in the real world, and that language supervision should do better. But why? Let's take a simple example:

The Kangaroo Problem

An autonomous car is trained to avoid hitting animals. Because CLIP was not a thing when it was trained, the machine learning researchers did their best to include as many animals as possible in the training examples. They showed the network cats, dogs, and even polar bears, but somehow they couldn't get pictures of kangaroos into the training dataset. On its first test in Australia, the network sees a kangaroo on the road, assigns it a random label, and never hits the brakes.

This happens because finite supervision makes the neural network learn blueprints of each class it sees at training time. At test time, if an object doesn't have a blueprint (like a kangaroo when trained only on cats and dogs), it won't have any meaningful reference, thus forcing the predictions to be essentially random.

With language supervision, instead of learning fixed, rigid blueprints, the model learns meaningful associations between images and words. Going back to the autonomous vehicle example, at training time the model sees images of cats, dogs, and polar bears, and learns to associate the pixels of each cat image with the words in its caption ("this is a photo of a cat").

At test time, the car knows it should hit the brake when an animal is in front of it. Because CLIP was trained on a massive and diverse dataset of image–text pairs scraped from the internet, it likely encountered the word "kangaroo" in many contexts. Over time, it learns that "kangaroo" often appears near words like "animal" or "wildlife," and learns to associate such text with similar-looking visuals. So, when it sees a kangaroo, even if it's the first time, it can still make the connection—and hit the brakes.
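To make that intuition concrete, here is a toy sketch of how this matching could work at inference time. The embeddings below are made-up 2-D vectors standing in for real encoder outputs, and pick_best_caption is a hypothetical helper, not CLIP's actual API:

import numpy as np

def pick_best_caption(img_vec, caption_vecs, captions):
    # Normalize so the dot product becomes cosine similarity, then take the best match.
    img_vec = img_vec / np.linalg.norm(img_vec)
    caption_vecs = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = caption_vecs @ img_vec
    return captions[int(np.argmax(scores))], scores

# Made-up embeddings standing in for encoder outputs (2-D for readability).
captions = ["a photo of a dog", "a photo of a kangaroo", "an empty road"]
caption_vecs = np.array([[1.0, 0.1], [0.1, 1.0], [-0.8, 0.2]])
kangaroo_image_vec = np.array([0.2, 0.9])   # pretend output of the image encoder

best, scores = pick_best_caption(kangaroo_image_vec, caption_vecs, captions)
print(best)   # "a photo of a kangaroo" scores highest

The point is that nothing about a "kangaroo" class is hard-coded anywhere; the match depends only on where the image and the captions land in the shared embedding space.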

How was CLIP trained?

Now that we have a basic understanding of CLIP's inner workings, let's dive into the details. The authors collected 400 MILLION (image, text) pairs and trained an image encoder and a text encoder from scratch, aiming to bring visual features close to their language counterparts while pushing apart pairs that don't match.

Explained in intuitive terms, a picture of a kangaroo will be pushed closer to phrases like "An image with a kangaroo", "Wildlife", and variations such as "Animals in Australia", while mismatched pairs are pushed apart. In order to "push" images and text together into the same space, they used the following algorithm:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
Figure 3. Code block presented in the original paper.

The first few comment lines describe the two trained encoders. For image encoding, they experimented with different architectures: five ResNet variants, which were the trend at the time, and three Vision Transformer (ViT) variants. For language, they used a plain 63M-parameter Transformer. The outputs of the two encoders then have to be projected into a unified space, one where both visual and text features are represented, using the learned projection matrices W_i and W_t. CLIP also learns a temperature parameter, noted as t in the code above, which controls how confident the model becomes when comparing similarities. Instead of being manually tuned like a hyperparameter, it is learned through backpropagation and sharpens or softens the softmax outputs.

Then the actual algorithm starts, with feature extraction on both modalities. The resulting features are first projected into the unified space, then L2-normalized. The normalization is part of a bigger step: measuring how similar the two sets of features are, using cosine similarity.
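To see what the projection, normalization, and temperature scaling look like with real arrays, here is a small runnable sketch. The random matrices stand in for actual encoder outputs and learned weights, and the toy dimensions are arbitrary, so treat this as an illustration rather than the paper's implementation:

import numpy as np

n, d_i, d_t, d_e = 8, 512, 256, 128        # toy batch and feature sizes
rng = np.random.default_rng(0)

I_f = rng.normal(size=(n, d_i))            # stand-in for image encoder outputs
T_f = rng.normal(size=(n, d_t))            # stand-in for text encoder outputs
W_i = rng.normal(size=(d_i, d_e))          # learned image projection (random here)
W_t = rng.normal(size=(d_t, d_e))          # learned text projection (random here)
t = np.log(1 / 0.07)                       # CLIP initializes the learnable temperature so exp(t) = 1/0.07

def l2_normalize(x, axis=1):
    # Divide each row by its L2 norm so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

I_e = l2_normalize(I_f @ W_i)              # [n, d_e]
T_e = l2_normalize(T_f @ W_t)              # [n, d_e]
logits = (I_e @ T_e.T) * np.exp(t)         # [n, n] scaled pairwise cosine similarities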

Cosine similarity

Cosine similarity measures how similar the directions of two vectors are, and because we represent our image and text features as vectors, it is a natural way to measure how alike they are. For example, if we have A = [3, 4] and B = [0, 1] and we draw them, we can tell that both vectors point upward. But how similar are they (or, rather, their directions)? Well, it turns out that the cosine similarity between A and B is exactly 0.8. How do I know that?

First, since we only care about how similar their directions are, we want to eliminate the vectors' lengths from the computation by normalizing them, which gives vectors of length 1. This gets us A = [0.6, 0.8] and B = [0, 1].

The dot product formula:
\[A \cdot B = \sum_{i=1}^{n} A_i B_i = A_1 B_1 + A_2 B_2 + ... + A_n B_n\]
Vectors A and B with the projection of A onto B, showing their directional similarity.

The normalization is followed by a dot product. The dot product between two vectors is a sum of element-wise products that tells us whether one vector has a component pointing in the same direction as the other, and by how much. In our example, after normalization, 0.6 of A points right and 0.8 of A points up, while B points only up. So the dot product is 0.6 · 0 + 0.8 · 1 = 0.8: exactly 0.8 of A shares B's direction.
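The same computation takes a couple of lines of NumPy, just to confirm the number:

import numpy as np

A = np.array([3.0, 4.0])
B = np.array([0.0, 1.0])

# Normalize so only direction matters, then take the dot product.
cos_sim = (A / np.linalg.norm(A)) @ (B / np.linalg.norm(B))
print(cos_sim)   # 0.8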

Computing the score

We've extracted the features and measured how each image relates to each text in our batch, resulting in an n × n matrix whose main diagonal holds the similarities between the matching pairs.

Next, we need to remember our goal. We don't just want to predict the best match; we also want to push apart texts and images that don't match. To do that, each row and each column of the similarity matrix is treated as a classification problem whose correct answer is the diagonal entry. Cross-entropy then pushes the matching pair's score higher, making the model more confident, while forcing even smaller scores onto pairs that don't match, creating a unified space where it is clear which images and phrases relate to each other.

This can be seen in the last lines of the code above, where the authors scale the similarity scores by the learned temperature parameter, controlling how sharp the model's confidence is, and then compute a symmetric cross-entropy loss over both the image and text directions, pushing the model to learn clear relations between pixels and language.
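Here is one way to make that symmetric loss concrete. The softmax_ce_diag helper below is my own stand-in for the paper's cross_entropy_loss pseudocode, with a toy sanity check at the end:

import numpy as np

def softmax_ce_diag(logits):
    # Cross-entropy where row i's correct "class" is column i:
    # softmax over each row, then average the -log prob of the diagonal entries.
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def clip_style_loss(logits):
    # logits[i, j] = scaled cosine similarity between image i and text j
    loss_i = softmax_ce_diag(logits)       # each image classifies over all texts
    loss_t = softmax_ce_diag(logits.T)     # each text classifies over all images
    return (loss_i + loss_t) / 2

# Sanity check: a batch where matching pairs dominate the diagonal
# gives a much smaller loss than one where every pair looks alike.
well_aligned = np.eye(4) * 10.0
indistinct = np.ones((4, 4)) * 10.0
print(clip_style_loss(well_aligned), clip_style_loss(indistinct))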

Conclusion

CLIP wasn't the first to try language supervision for vision tasks, but it was the first to really make it work. The secret? Scale. Not just in terms of model size, but in the sheer diversity and volume of data: 400 million image–text pairs. That kind of scale made it possible to train a model that generalizes far beyond the limited world of ImageNet-style classification.

At the time of its release, CLIP matched or outperformed state-of-the-art supervised models on over 30 different benchmarks, without being fine-tuned on any of them.

Instead of memorizing class blueprints, it learned to understand the meaning of images through language. That's a pretty big deal, and a clear sign that rigid labels weren't cutting it anymore.

This shift towards language-driven learning sparked a new wave of methods looking to refine, simplify, or scale up this idea. One of those methods? SigLip. It's a bit of a twist on CLIP's recipe, and we'll dive into that next. Stay tuned.