Alibaba Group’s Institute for Intelligent Computing has unveiled the gte-v1.5 series, a significant upgrade to its gte embedding models that extends the supported context length to 8192 tokens. The models are built on a transformer++ encoder backbone, which combines BERT with rotary position embeddings (RoPE) and gated linear units (GLU) to improve performance. The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within its model-size category and performs competitively on the LoCo long-context retrieval tests. A standout model, gte-Qwen1.5-7B-instruct, excels at multilingual embedding, securing top positions on both the MTEB and C-MTEB leaderboards.
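For readers who want to try the models, here is a minimal sketch of embedding long documents with the series via the sentence-transformers library. It assumes the models are published on Hugging Face under the Alibaba-NLP organization (e.g. `Alibaba-NLP/gte-large-en-v1.5`); the exact repo ID and the `trust_remote_code` requirement should be checked against the model card.

```python
# Hedged sketch: embedding long inputs with a gte-v1.5 model through
# sentence-transformers. trust_remote_code is assumed to be required because
# the transformer++ (BERT + RoPE + GLU) backbone ships as custom modeling code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

docs = [
    "A short query about long-context retrieval.",
    "A much longer document segment. " * 400,  # well past the old 512-token limit
]
embeddings = model.encode(docs, normalize_embeddings=True)

# On normalized vectors, cosine similarity reduces to a dot product.
similarity = embeddings[0] @ embeddings[1]
print(embeddings.shape, similarity)
```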
Developed with a focus on text embeddings, the series underwent a rigorous training regimen, including masked language modeling and weakly supervised contrastive pre-training, to support the extended context length. Training proceeded in multiple stages, with the context length increased gradually to optimize performance across the target benchmarks.
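To make the contrastive pre-training step concrete, below is a hedged sketch of the standard InfoNCE objective with in-batch negatives, a common formulation for weakly supervised contrastive training of embedding models. The temperature value and batch construction are illustrative assumptions, not the published gte-v1.5 recipe.

```python
# Illustrative InfoNCE loss with in-batch negatives: each row i of
# query_emb / doc_emb is a positive pair; all other rows serve as negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim) tensors of paired embeddings."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```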
In summary, the gte-v1.5 series represents a significant step forward for text embeddings, delivering stronger long-context processing and setting new standards for multilingual embedding.
Read more…