Yuxuan Shu Vasileios Lampos
preprint 2024
Abstract
We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model. UNA uses TF-IDF scores to ascertain the perceived importance of terms in a sentence and then produces negative samples by replacing terms according to that importance. Our experiments demonstrate that models trained with UNA improve overall performance on semantic textual similarity tasks. Additional performance gains are obtained when combining UNA with the paraphrasing augmentation. Further results show that our method is compatible with different backbone models. Ablation studies also support the choice of having a TF-IDF-driven control on negative augmentation.
Highlights
We propose Unsupervised hard Negative Augmentation (UNA), an augmentation strategy for generating negative samples in self-supervised contrastive learning. The method uses TF-IDF scores to generate hard negative pairs: words with more substance (higher TF-IDF scores) have a greater probability of being swapped, whereas more common words are less likely to be replaced.
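As a rough illustration of TF-IDF-driven term replacement, the Python sketch below scores each term in a tokenised sentence and swaps it with a probability proportional to that score. It is not the implementation from the paper. The function names (`tfidf_scores`, `make_negative`), the `max_prob` parameter, the plain `tf * log(N / df)` weighting, and the choice of drawing replacements uniformly from the vocabulary are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def tfidf_scores(corpus):
    """Return one {term: tf-idf} dict per tokenised sentence in `corpus`.

    Assumption: tf = count / sentence length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for sent in corpus for term in set(sent))
    scores = []
    for sent in corpus:
        tf = Counter(sent)
        scores.append(
            {t: (tf[t] / len(sent)) * math.log(n_docs / df[t]) for t in tf}
        )
    return scores


def make_negative(sentence, scores, vocab, rng, max_prob=0.8):
    """Swap each term with probability proportional to its TF-IDF score.

    `max_prob` (hypothetical) is the swap probability given to the
    highest-scoring term in the sentence; terms with a score of zero
    are never swapped.
    """
    top = max(scores.values(), default=0.0) or 1.0  # avoid dividing by zero
    negative = []
    for term in sentence:
        p_swap = max_prob * scores[term] / top
        if rng.random() < p_swap:
            # Assumption: the replacement is a uniformly random different term.
            candidates = [w for w in vocab if w != term]
            negative.append(rng.choice(candidates))
        else:
            negative.append(term)
    return negative


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "flew"],
]
vocab = sorted({term for sent in corpus for term in sent})
scores = tfidf_scores(corpus)
rng = random.Random(0)
negative = make_negative(corpus[0], scores[0], vocab, rng, max_prob=1.0)
print(negative)
```

In this toy corpus "the" appears in every sentence, so its score is zero and it is always kept, while "cat" and "sat" carry the highest score and are always replaced when `max_prob=1.0`. Lower values of `max_prob` would leave some informative terms in place, which keeps the negative closer to the original sentence.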
Datasets
We pretrained on the English Wikipedia corpus (1 million sentences) used by SimCSE and evaluated on seven Semantic Textual Similarity (STS) tasks: STS 2012-2016, STS Benchmark, and SICK Relatedness. All datasets can be found in the official code base of SimCSE.
Citation
@article{shu2024unsupervised,
  title={Unsupervised hard Negative Augmentation for contrastive learning},
  author={Yuxuan Shu and Vasileios Lampos},
  year={2024},
  journal={arXiv preprint arXiv:2401.02594}
}