A quick guide to Contrastive Learning

Contrastive learning, much like the secret sauce behind our brain’s ability to discern patterns, is revolutionizing the field of machine learning. But what exactly is contrastive learning, and why is it causing such a stir? In this post, I’ll guide you through the fundamentals of contrastive learning. Let’s get started! 🎉

Table of Contents

  1. Understanding Contrastive Learning
  2. Awesome Contrastive Learning Papers

Understanding Contrastive Learning

The Concept of Contrastive Learning

Imagine you’re a detective trying to solve a puzzle. You’re given two sets of clues: one set that helps you identify the culprit based on evidence like fingerprints and eyewitness accounts, and another set that throws you off track with misleading information such as planted alibis and false leads. Contrastive learning operates on a similar principle, but instead of suspects and clues, it deals with data points and similarities.

In essence, contrastive learning involves presenting a model with pairs of data points and teaching it to differentiate between similar pairs (positive samples) and dissimilar pairs (negative samples). By optimizing the model to maximize the similarity between positive pairs and minimize the similarity between negative pairs, the model learns to capture meaningful representations of the data.

Key Components of Contrastive Learning

Positive Generation

In contrastive learning, generating positive pairs is crucial for training the model effectively. Positive pairs consist of data points that are similar or belong to the same class. Creating positive pairs involves various strategies depending on the dataset and the task at hand.

One common method is to use data augmentation to create variations of the same data point. In computer vision, for example, positive pairs can be generated by applying random transformations such as rotation, cropping, flipping, or color jittering to an image; the augmented versions serve as positive examples, since they can still be considered to represent the same object. Another approach is to leverage domain-specific knowledge to identify similarities between data points. In natural language processing, for instance, positive pairs can be created from synonyms or paraphrases of the same sentence or phrase. Deep generative models, such as NeRF, can also be used to synthesize additional views of an object; however, this adds substantial computational cost.
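As a concrete illustration, here is a minimal sketch of augmentation-based positive generation in NumPy. The `augment` function and its specific transforms (flip, crop, brightness jitter) are illustrative choices, not a standard pipeline; real setups such as SimCLR use stronger augmentations.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Create one 'view' of an image with simple random transforms."""
    view = image.copy()
    if rng.random() < 0.5:                       # random horizontal flip
        view = view[:, ::-1]
    # random crop to 3/4 of the original height and width
    h, w = view.shape[:2]
    ch, cw = (3 * h) // 4, (3 * w) // 4
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    view = view[top:top + ch, left:left + cw]
    # random brightness jitter, clipped back to valid range
    view = np.clip(view * rng.uniform(0.8, 1.2), 0.0, 1.0)
    return view

image = rng.random((32, 32, 3))                  # a fake 32x32 RGB image
view_a, view_b = augment(image), augment(image)  # two views = a positive pair
```

Two independently augmented views of the same image form one positive pair; views of different images would serve as negatives.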

Negative Sampling

Another crucial aspect of contrastive learning is negative sampling, which involves selecting dissimilar pairs of data points to serve as negative examples during training. Negative sampling plays a pivotal role in guiding the model to focus on relevant features and discard irrelevant ones.

The most basic approach to negative sampling is to randomly select data points from the dataset that are simply not the anchor point. This is particularly effective in datasets with diverse classes, since it reduces the likelihood of sampling instances from the same class as the anchor. Another method is to employ hard negative mining or generation, where challenging examples that are close to positive samples are identified or generated based on given rules and used as negative examples. By focusing on challenging examples, the model is forced to learn more robust representations and improve its performance on difficult tasks.
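Hard negative mining can be sketched as follows, assuming we already have embeddings and class labels. The `hard_negatives` helper and its signature are hypothetical; the idea is simply to pick the negatives most similar to the anchor.

```python
import numpy as np

def hard_negatives(anchor, candidates, labels, anchor_label, k=2):
    """Return indices of the k candidates most similar (by cosine
    similarity) to the anchor that carry a different label."""
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8
    )
    is_negative = labels != anchor_label     # only true negatives qualify
    order = np.argsort(-sims)                # most similar first
    return [i for i in order if is_negative[i]][:k]

rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 4))                # 6 embeddings of dimension 4
labels = np.array([0, 0, 1, 1, 2, 2])
idx = hard_negatives(emb[0], emb, labels, anchor_label=0, k=2)
```

Random negative sampling would instead draw indices uniformly from the non-anchor points; the mining variant trades that simplicity for harder training signal.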

Contrastive Loss

The contrastive loss, in its widely used InfoNCE form, can be written as:

\[\mathcal{L}_{N} = \mathbb{E}_{\mathbf{X}} \left[ -\log \frac{e^{f_{\text{sim}}(\mathbf{x}_{i}, \mathbf{x}_{j})/\tau}}{\sum_{k=1}^{N} \mathbb{I}_{[k \neq i]} e^{f_{\text{sim}}(\mathbf{x}_{i}, \mathbf{x}_{k})/\tau}} \right]\]


  • \(\mathcal{L}_{N}\) is the overall contrastive loss.
  • \(N\) is the number of samples.
  • \(\mathbb{E}_{\mathbf{X}}\) is the expectation operator over the batch of data \(\mathbf{X}\).
  • \(\mathbf{x}_{i}\) is an anchor sample.
  • \(\mathbf{x}_{j}\) is a positive sample (similar to the anchor).
  • \(\mathbf{x}_{k}\) is a negative sample (dissimilar to the anchor).
  • \(\tau\) is the temperature parameter to scale the logits.
  • \(f_{\text{sim}}(\cdot, \cdot)\) is a similarity metric (e.g., cosine similarity or negative Euclidean distance).
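The loss above can be sketched in NumPy for a single anchor. This version uses cosine similarity as \(f_{\text{sim}}\) and follows SimCLR's NT-Xent convention of including the positive pair in the denominator; the function name and shapes are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss for one anchor (sketch).

    anchor, positive: (d,) vectors; negatives: (n, d) matrix.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))      # low when positive dominates

rng = np.random.default_rng(0)
a = rng.normal(size=8)
p = a + 0.05 * rng.normal(size=8)          # near-duplicate: the positive
negs = rng.normal(size=(5, 8))             # random negatives
loss = info_nce(a, p, negs)
```

Because the ratio inside the log is a probability over the positive plus the negatives, the loss is always positive and shrinks as the anchor-positive similarity grows relative to the anchor-negative similarities.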

Awesome Contrastive Learning Papers

Here’s a curated list of papers on contrastive learning:

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Authors: Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
Venue: ICML 2020

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning
Authors: Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick
Venue: CVPR 2020

SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Authors: Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
Venue: NeurIPS 2020

MoCo v2: Improved Baselines with Momentum Contrastive Learning
Authors: Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
Venue: arXiv preprint

BYOL: Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning
Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
Venue: NeurIPS 2020

Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Authors: Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny
Venue: ICML 2021

CLIP: Learning Transferable Visual Models From Natural Language Supervision
Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Venue: ICML 2021

SimCSE: Simple Contrastive Learning of Sentence Embeddings
Authors: Tianyu Gao, Xingcheng Yao, Danqi Chen
Venue: EMNLP 2021