Introduction
In the last blog, we talked about vision encoders and how some of them work, but why do we even use vision encoders, and what is their goal? I mean, telling cats from dogs, estimating depth in images, or distinguishing anger from sadness on human faces are all pretty cool, but is this all we want?
Well, as Lucas Beyer (one of SigLIP's authors) says in this video, the target is way bigger than that. We hope to combine all these little tasks we've been working on into a General Visual Representation model: a model that can be taught to see anything, without any extra training. But until we're able to combine tasks, we need at least one task with decent performance on data from any domain.
Let's take classification, for example: we have models that can classify flower species, animal breeds, or furniture types with accuracy that exceeds human capabilities. But if you take a model specialized in flowers and test it on dog breeds, it will happily predict that a German Shepherd is a beautiful bearded iris. In the best case, fine-tuning the last layer is enough, but for good accuracy, full re-training is mandatory.
After a ton of research in different directions, Google released BiT in 2019, which performed really well on classification tasks from many different domains and proved one thing that seems like common knowledge today: training a very big network, on a huge labeled dataset, for a very long time, results in really good visual generalization. The only issue is that no one besides Google owns a dataset of 300 million labeled images, so something else was needed.
CLIP Recap
CLIP, and its twin brother from Google, ALIGN, try to solve this issue by using natural language as labels while training on a huge dataset scraped from the internet. So instead of having human labelers look at each image, CLIP's training images are collected from the web, and the labels are the alt texts of those images.
With this novel annotation idea, CLIP can be used on any task without any fine-tuning; you just need to spell out the classes. For example, if the model is presented with a picture of a German Shepherd and a picture of a bearded iris, it will correctly link the dog image with the text “a german shepherd”, because at training time it saw both flowers and dogs and learned to differentiate between them.
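To make “spelling out the classes” concrete, here is a minimal zero-shot classification sketch using the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image path and the prompt wording are placeholders, not anything prescribed by CLIP itself:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# a public CLIP checkpoint (an example choice, any CLIP checkpoint works the same way)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_photo.jpg")  # placeholder path to any local image
labels = ["a photo of a german shepherd", "a photo of a bearded iris"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# one similarity score per (image, text) pair; softmax turns them into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Whichever label gets the highest probability is the zero-shot prediction; no gradient step is ever taken.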
The intuition behind CLIP is pretty simple. In each batch, the model classifies each image using the texts as labels, then classifies each text using the images as labels, and trains everything with a temperature-scaled softmax loss.

Dropping the complicated terminology, let's say the model sees 3 images, as in Figure 1: a dog, a flower, and a chair, while the texts are: “dog”, “iris” and “old chair”. CLIP takes one image at a time, for example the one with the dog, and uses all the texts as labels. A well-trained model gives the highest score to “dog”, while “iris” and “old chair” get small scores.


Then, it takes one text at a time and uses all the images as labels. Taking the text “iris”, it predicts that the flower image is the right class, while the dog and the chair are wrong (negative) classes. This can be seen in Figure 2, from both perspectives.
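Put in code, the whole objective fits in a few lines. Here is a simplified sketch of a CLIP-style loss in PyTorch (the temperature is hard-coded for brevity; the real model learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: row i = image i scored against every text
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)  # positives on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # each image classified over all texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # each text classified over all images
    return (loss_i2t + loss_t2i) / 2
```

The two cross_entropy calls are the two perspectives from Figure 2: one softmax per row (each image over all texts) and one per column (each text over all images).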
The fact that it looks at texts and images separately is both a blessing and a curse. It helps build a shared space where images and texts are represented together, but at the same time it is very heavy computation-wise:
- Softmax computes percentages, so the scores in each row must add up to 100%. Because each image is taken one at a time, a sum over all the text elements is computed for every image. On top of that, CLIP looks at both texts and images, thus doubling the computations.
- The previous point is not an issue with small batches. But in order to learn a general visual representation, a huge batch size is needed: the more varied images the model sees at once, the better it can tell what really makes a dog a dog instead of “anything furry”. For CLIP, each row contains only one positive (true) pair, while the rest are negatives, so more negative examples mean a sharper, clearer differentiation between positive and negative pairs. CLIP was trained with a batch size of 32 thousand images, while its open-source variant, OpenCLIP, was trained with a batch size of 86 thousand images.
Now imagine a matrix like the one in Figure 1, but with 32k × 32k elements. Because the softmax has to see all the rows and all the columns for its global sums, the similarity matrix has to live on a single device, and that imposes a hard limit on scaling CLIP models, limiting their improvements for visual representation.
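A quick back-of-the-envelope calculation (mine, not from the paper) shows what that means in memory:

```python
batch = 32_768                    # CLIP's batch size
cells = batch * batch             # one logit per image-text pair
gib = cells * 4 / 2**30           # float32 logits

print(f"{cells:,} logits -> {gib:.1f} GiB")  # 1,073,741,824 logits -> 4.0 GiB
# ...and roughly the same again for their gradients, before even counting
# the activations of the two encoders themselves
```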
Why is SigLIP better?
In the last chapter, we saw three main pitfalls of CLIP's approach:
- The loss is computed in both directions, doubling the computations.
- There are a lot of global sums, because for each computed score, softmax has to add up all the other elements of the row/column.
- Because softmax has to see the entire similarity matrix, the matrix has to be stored on only one device, thus limiting any multi-device scaling.
SigLIP's authors understood that they could solve all these limitations by changing one simple thing, replacing softmax with sigmoid (a sketch of the resulting loss follows this list):
- The sigmoid function looks at one pair at a time, instead of linking each image with all the texts and the other way around, so the loss is computed in only one direction.
- There are no global sums, because sigmoid assigns an individual score to each pair instead of a percentage score.
- If there are no global sums, each device can compute its own loss, so scaling the system in a multi-device fashion is now possible.
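To show how little actually changes, here is a PyTorch sketch of the pairwise sigmoid loss, closely following the pseudocode in the SigLIP paper (the learnable temperature and bias are passed in as scalar tensors; the paper initializes them to log 10 and −10 respectively):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t_prime, bias):
    # image_emb, text_emb: (batch, dim); t_prime, bias: learnable scalar tensors
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) pairwise scores, scaled and shifted by the learned scalars
    logits = image_emb @ text_emb.t() * t_prime.exp() + bias

    # +1 on the diagonal (matching pairs), -1 everywhere else
    batch = logits.shape[0]
    labels = 2 * torch.eye(batch, device=logits.device) - 1

    # independent binary loss for every pair: -log sigmoid(label * logit)
    return -F.logsigmoid(labels * logits).sum() / batch
```

Every entry of the matrix is now an independent binary problem: the diagonal pairs should be pushed towards +1, everything else towards −1, and no entry ever needs to know the sum of its row or column.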
How does it work?
Softmax function: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
Sigmoid function: sigmoid(x) = 1 / (1 + exp(−x))
Every beginner Machine Learning course starts with two lessons: binary classification, then multi-class classification. In those courses, you learn that for binary classification, sigmoid is used as the activation function, predicting a score for each item. The scores don't have to sum up to 1; they just have to show which elements are positive and which are negative. Then, when the exercise is extended to multiple classes, you usually learn that softmax is the better choice, because the scores are normalized over the classes and are easier to interpret.
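A tiny numerical example (mine, purely for illustration) makes the difference obvious: a softmax score changes whenever new classes join the row, while a sigmoid score depends only on its own logit:

```python
import torch

scores = torch.tensor([4.0, 1.0, 0.5])            # dog, iris, old chair
more   = torch.tensor([4.0, 1.0, 0.5, 2.0, 3.0])  # same scores plus two extra classes

print(torch.softmax(scores, dim=0)[0])   # ~0.93: "dog" dominates a 3-way softmax
print(torch.softmax(more, dim=0)[0])     # ~0.63: same logit, diluted by the new classes
print(torch.sigmoid(scores[0]), torch.sigmoid(more[0]))  # ~0.98 both times, unaffected
```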
SigLIP vs CLIP follows the exact same intuition. CLIP classifies each image in a multi-class fashion, using all the texts as labels. SigLIP takes every possible image-text pair and computes one score that doesn't depend in any way on the other pairs.
So now that each pair is independent and the similarity matrix can be split across multiple devices, there is only one requirement left in order to compute the full loss: each image still has to be scored against every possible text.

SigLIP's authors designed a very clever and simple algorithm to achieve this. They split the matrix into equal parts, giving each device a subset of the images and texts. Each device computes scores for the pairs it holds, then sends its text embeddings to the device on its left. We can see in the image that the second device sends its text embeddings to the first device, and so on, until every image has been scored against every text.
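Here is a rough single-process sketch of that chunked computation (the device-to-device transfer is stood in for by a simple list rotation, and the temperature and bias are omitted; the real implementation uses collective permutes across accelerators):

```python
import torch
import torch.nn.functional as F

def pair_loss(img, txt, positives_on_diagonal):
    logits = img @ txt.t()                        # temperature and bias omitted for brevity
    labels = -torch.ones_like(logits)             # by default, every pair is a negative
    if positives_on_diagonal:
        labels += 2 * torch.eye(logits.shape[0])  # own chunk: matching pairs sit on the diagonal
    return -F.logsigmoid(labels * logits).sum()

def chunked_sigmoid_loss(image_chunks, text_chunks):
    # image_chunks[d], text_chunks[d]: the embeddings "device" d is responsible for
    num_devices = len(image_chunks)
    total = 0.0
    for step in range(num_devices):
        for d in range(num_devices):
            # at step 0 every device still holds its own texts, so the diagonal is positive
            total = total + pair_loss(image_chunks[d], text_chunks[d],
                                      positives_on_diagonal=(step == 0))
        # every device hands its current text chunk to the device on its left
        text_chunks = text_chunks[1:] + text_chunks[:1]
    return total
```

After num_devices rotations, every image chunk has met every text chunk, which is exactly the full similarity matrix, just never materialized in one place.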
Results
All that theory is nice, but what did the numbers actually say when the SigLIP crew let the model loose in the wild?
- ImageNet zero-shot. With the exact same ViT-B/16 backbone and the same WebLI training set, SigLIP lands around 73% top-1. The CLIP replica built on the very same recipe sits in the low 71s. Two percentage points may sound small, but in ImageNet land that's the difference between a solid baseline and a leaderboard contender.
- Smaller batch, bigger gap. If the batch size shrinks further, to 16k pairs, the delta widens to roughly +3 pp. Softmax quickly loses steam when it can't feast on tens of thousands of negatives at once, while sigmoid keeps improving because every pair is judged on its own.
- Retrieval tasks. On COCO image-to-text retrieval, SigLIP nudges recall@1 from ~74% to 77–78%. Similar single-digit but consistent gains show up on Flickr30k and on the cross-lingual XM3600 benchmark.
- Compute & memory. The loss needs one forward/backward pass, no global normalization, and half the activations. In practice, the authors doubled the batch on the same number of TPUv4 chips and still used ≈1.8× less cross-device traffic.
Why does the tiny swap help? Two main reasons:
- Gradient scale is batch-size independent. In softmax/InfoNCE, every score is normalized by the sum over the whole row, so the signal each individual pair receives gets diluted as the row grows. Sigmoid treats every pair separately, with no hidden division by N (the small experiment after this list shows the effect).
- No double counting. CLIP computes an image-to-text loss and a text-to-image loss, effectively using each positive pair twice. That sounds good until you realise you're also paying twice in memory and communication. SigLIP's one-directional, pairwise view keeps the useful signal and drops the dead weight.
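A quick and admittedly informal toy experiment (mine, not from the paper) for the first point: give the positive and the negatives the same logit and watch the gradient a single negative receives as the row gets longer:

```python
import torch
import torch.nn.functional as F

for n in (8, 1024, 32_768):
    # one softmax row: the positive sits at index 0, the rest are negatives, all logits equal
    logits = torch.zeros(n, requires_grad=True)
    F.cross_entropy(logits[None, :], torch.tensor([0])).backward()
    softmax_grad = logits.grad[1].item()          # gradient landing on one negative

    # the same negative pair scored alone with a sigmoid (binary) loss
    pair = torch.zeros(1, requires_grad=True)
    F.binary_cross_entropy_with_logits(pair, torch.zeros(1)).backward()
    sigmoid_grad = pair.grad.item()

    print(f"n={n:6d}  softmax grad per negative: {softmax_grad:.6f}  sigmoid grad: {sigmoid_grad:.2f}")
# the softmax gradient shrinks roughly like 1/n, while the sigmoid one stays at 0.5
```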
Could there be deeper, fancier theoretical explanations? Probably. Do we have airtight proofs? Not yet. What we do have is a pile of experiments that all point in the same direction: when the batch is anything you could realistically train on a few dozen GPUs/TPUs, the pairwise sigmoid loss just ends up with cleaner gradients and a happier optimizer.
Conclusion
So, is sigmoid “cooler” than softmax? In the context of large-scale vision-language pre-training, the answer is a pretty confident yes, at least up to the batch sizes we mortals can run.
SigLIP shows that you can:
- Swap one line of code (softmax → sigmoid + binary labels)
- Train with half the memory and communication
- Pull out +2–3 pp of zero-shot accuracy and similar bumps in retrieval
- Scale across devices without shoving a 32k × 32k similarity matrix onto a single host
The catch is simple and honest: if you already inhabit a world where 100k-plus batches are routine, the raw accuracy edge fades. But for everyone else, SigLIP offers a cheaper, cleaner road to the same destination. And that destination gets closer and closer with each new vision encoder.