You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
1. make sure `gemma_config.py` and `gemma_model.py` from [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight/tree/main) in your local path.
In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
4
4
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
@@ -7,7 +7,6 @@ In this project, we introduce BGE-M3, which is distinguished for its versatility
7
7
8
8
For more details, please refer to our paper: [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216.pdf)
9
9
10
-
11
10
**Some suggestions for retrieval pipeline in RAG**
12
11
13
12
We recommend to use following pipeline: hybrid retrieval + re-ranking.
@@ -19,23 +18,24 @@ To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engin
19
18
) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
20
19
21
20
- As cross-encoder models, re-ranker demonstrates higher accuracy than bi-encoder embedding model.
22
-
Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text.
21
+
Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker#2-normal-reranker), [bge-reranker-v2](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker#3-llm-based-reranker)) after retrieval can further filter the selected text.
23
22
24
23
25
24
## News:
26
25
27
26
- 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
27
+
28
28
<details>
29
29
<summary> Details </summary>
30
-
30
+
31
31
> The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
32
-
32
+
33
33
</details>
34
34
- 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples
- 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
37
-
- 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
38
-
- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
37
+
- 2024/3/2: Release unified fine-tuning [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-bge-m3) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
38
+
- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB/MLDR).
39
39
- 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
40
40
41
41
@@ -81,10 +81,10 @@ For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvesp
81
81
82
82
**3. How to fine-tune bge-M3 model?**
83
83
84
-
You can follow the common in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
84
+
You can follow the common in this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#1-standard-model)
85
85
to fine-tune the dense embedding.
86
86
87
-
If you want to fine-tune all embedding function of m3 (dense, sparse and colbert), you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune)
87
+
If you want to fine-tune all embedding function of m3 (dense, sparse and colbert), you can refer to the [unified_fine-tuning example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-bge-m3)
Copy file name to clipboardExpand all lines: research/LM_Cocktail/README.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -237,7 +237,7 @@ Merge 10 models fine-tuned on other tasks based on five examples for new tasks:
237
237
- Examples Data for dataset from FLAN: [./llm_examples.json]()
238
238
- MMLU dataset: https://huggingface.co/datasets/cais/mmlu (use the example in dev set to do in-context learning)
239
239
240
-
You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/llm_embedder/docs/evaluation.md) as following:
240
+
You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/llm_embedder/docs/evaluation.md) as following:
Copy file name to clipboardExpand all lines: research/baai_general_embedding/README.md
+10-9Lines changed: 10 additions & 9 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,11 +13,12 @@ Therefore, make sure to use the correct method to obtain sentence vectors. You c
13
13
14
14
**1. How to fine-tune bge embedding model?**
15
15
16
-
Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
16
+
Following this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder) to prepare data and fine-tune your model.
17
17
Some suggestions:
18
-
- Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
19
-
- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json), `--gradient_checkpointing`, etc.
20
-
- If you want to maintain the performance on other tasks when fine-tuning on your data, you can use [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail) to merge the fine-tuned model and the original bge model. Besides, if you want to fine-tune on multiple tasks, you also can approximate the multi-task learning via model merging as [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).
18
+
19
+
- Mine hard negatives following this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#hard-negatives), which can improve the retrieval performance.
20
+
- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json), `--gradient_checkpointing`, etc.
21
+
- If you want to maintain the performance on other tasks when fine-tuning on your data, you can use [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) to merge the fine-tuned model and the original bge model. Besides, if you want to fine-tune on multiple tasks, you also can approximate the multi-task learning via model merging as [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
21
22
- If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
22
23
- If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
23
24
@@ -57,7 +58,7 @@ please select an appropriate similarity threshold based on the similarity distri
57
58
For the `bge-*-v1.5`, we improve its retrieval ability when not using instruction.
58
59
No instruction only has a slight degradation in retrieval performance compared with using instruction.
59
60
So you can generate embedding without instruction in all cases for convenience.
60
-
61
+
61
62
For a retrieval task that uses short queries to find long related documents,
62
63
it is recommended to add instructions for these short queries.
63
64
**The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
`baai-general-embedding` models achieve **state-of-the-art performance on both MTEB and C-MTEB leaderboard!**
195
-
For more details and evaluation tools see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md)
196
+
For more details and evaluation tools see our [scripts](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB)
196
197
197
-
If you want to evaluate the model(or your model) on **your data**, you can refer to this [tool](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#6-evaluate-model).
198
+
If you want to evaluate the model(or your model) on **your data**, you can refer to this [tool](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#8-custom-dataset).
198
199
199
200
200
201
-**MTEB**:
@@ -224,7 +225,7 @@ If you want to evaluate the model(or your model) on **your data**, you can refer
224
225
-**C-MTEB**:
225
226
We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
226
227
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
0 commit comments