FlagEmbedding/BGE_M3/README.md

In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Linguality: It supports more than 100 working languages.
- Multi-Granularity: It can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

For more details, please refer to our paper: [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216.pdf)
**Some suggestions for retrieval pipeline in RAG**

We recommend the following pipeline: hybrid retrieval + re-ranking.

- Hybrid retrieval leverages the strengths of several methods, offering higher accuracy and stronger generalization. A classic example: using both embedding retrieval and the BM25 algorithm. You can now use BGE-M3, which supports both embedding and sparse retrieval; this allows you to obtain token weights (similar to BM25) at no additional cost when generating dense embeddings. To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
- As cross-encoder models, re-rankers demonstrate higher accuracy than bi-encoder embedding models. Applying a re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text. A minimal sketch of this pipeline is shown after this list.
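As a rough illustration, here is a minimal sketch of the suggested pipeline using the `BGEM3FlagModel` and `FlagReranker` classes from the FlagEmbedding library. The candidate documents and the 0.6/0.4 fusion weights are illustrative assumptions, not tuned values; in production, engines such as Vespa or Milvus handle indexing and score fusion for you.

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Bi-encoder for dense + sparse retrieval, cross-encoder for re-ranking.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

query = "What is BGE M3?"
docs = [
    "BGE M3 is an embedding model supporting dense, lexical, and multi-vector retrieval.",
    "BM25 is a bag-of-words ranking function used in lexical search.",
    "The Eiffel Tower is located in Paris.",
]

q = model.encode([query], return_dense=True, return_sparse=True)
d = model.encode(docs, return_dense=True, return_sparse=True)

# Dense scores: inner product of the single-vector embeddings.
dense_scores = d['dense_vecs'] @ q['dense_vecs'][0]
# Sparse scores: overlap of the BM25-like token weights.
sparse_scores = np.array([
    model.compute_lexical_matching_score(q['lexical_weights'][0], w)
    for w in d['lexical_weights']
])
# Hybrid score: a simple weighted sum (weights are an illustrative assumption).
hybrid_scores = 0.6 * dense_scores + 0.4 * sparse_scores

# Re-rank the top candidates with the cross-encoder.
top = np.argsort(-hybrid_scores)[:2]
rerank_scores = reranker.compute_score([[query, docs[i]] for i in top])
for i, s in zip(top, rerank_scores):
    print(s, docs[i])
```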
## News:
- 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py)
- 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as those from OpenAI.**
- 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
## FAQ

**1. Introduction for different retrieval methods**

- Dense retrieval: map the text into a single embedding, e.g., [DPR](https://arxiv.org/abs/2004.04906), [BGE-v1.5](https://github.com/FlagOpen/FlagEmbedding)
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
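BGE-M3 can produce all three kinds of representation from a single `encode` call. Here is a minimal sketch using the `BGEM3FlagModel` class from the FlagEmbedding library; the query and passage texts are made up, and the scores are only meant to show how each method is computed:

```python
from FlagEmbedding import BGEM3FlagModel

# Load BGE-M3; use_fp16 speeds up inference with a minor accuracy cost.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

query = ["What is BGE M3?"]
passage = ["BGE M3 is an embedding model supporting dense, lexical, and multi-vector retrieval."]

# One encode call can return all three representations.
q = model.encode(query, return_dense=True, return_sparse=True, return_colbert_vecs=True)
p = model.encode(passage, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Dense retrieval: a single vector per text, scored by inner product.
dense_score = q['dense_vecs'][0] @ p['dense_vecs'][0]
# Sparse retrieval: per-token weights, scored via overlapping tokens.
sparse_score = model.compute_lexical_matching_score(
    q['lexical_weights'][0], p['lexical_weights'][0])
# Multi-vector retrieval: one vector per token, scored ColBERT-style.
colbert_score = model.colbert_score(q['colbert_vecs'][0], p['colbert_vecs'][0])

print(dense_score, sparse_score, colbert_score)
```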
**2. Comparison with BGE-v1.5 and other monolingual models**

BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages. However, we still recommend trying BGE-M3 because of its versatility (support for multiple languages and long texts). Moreover, unlike most existing models, which can only perform dense retrieval, it can simultaneously generate multiple representations, and using them together can enhance accuracy and generalization.

In the open-source community, there are many excellent models (e.g., jina-embedding, colbert, e5, etc.), and users can choose the model that suits their specific needs based on practical considerations, such as whether they require multilingual or cross-language support, and whether they need to process long texts.
**3. How to use BGE-M3 in other projects?**
For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE. The only difference is that BGE-M3 no longer requires adding an instruction to the queries.
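For instance, a minimal dense-retrieval sketch (the query is encoded as-is, with no instruction prefix; the texts are made up):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# No instruction prefix is prepended to the query, unlike bge-v1.5.
q = model.encode(["How many languages does BGE-M3 support?"])['dense_vecs']
p = model.encode(["BGE-M3 supports more than 100 working languages."])['dense_vecs']

# Score passages by the inner product of the dense embeddings.
print(q @ p.T)
```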
For sparse retrieval, most open-source libraries currently do not support direct utilization of the BGE-M3 model; contributions from the community are welcome. For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) (thanks @jobergum) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py). In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
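To see what the sparse output looks like (i.e., what a library would need to index), here is a small sketch; `convert_id_to_token` maps the token ids in the lexical weights back to readable tokens, and the printed weights are illustrative:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# Request only the sparse (lexical) representation.
out = model.encode(["What is BGE M3?"], return_dense=False, return_sparse=True)

# Most vocabulary positions are zero; only tokens present in the text
# receive a weight, similar to BM25.
print(model.convert_id_to_token(out['lexical_weights']))
# e.g. [{'What': 0.08, 'is': 0.08, 'BGE': 0.25, 'M3': 0.25, '?': 0.04}]
```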
**4. How to fine-tune the BGE-M3 model?**
You can follow the steps in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to fine-tune the dense embedding.

If you want to fine-tune all the embedding functions of BGE-M3 (dense, sparse, and colbert), you can refer to the [unified fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune).
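Both examples consume training data as a JSON-lines file in which each line pairs a query with positive and negative passages. A minimal sketch of producing one such record (field names follow the finetune examples; the texts themselves are made up):

```python
import json

# One training record: a query, positive passages, and (hard) negative passages.
record = {
    "query": "What is BGE M3?",
    "pos": ["BGE M3 is a multilingual embedding model supporting dense, sparse, and multi-vector retrieval."],
    "neg": ["BM25 is a lexical ranking function and not an embedding model."],
}

with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```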