
Commit 7ac70bc

update readme
1 parent b81fc65 commit 7ac70bc

20 files changed

Lines changed: 708 additions & 26 deletions


examples/inference/reranker/README.md

Lines changed: 30 additions & 0 deletions
@@ -389,6 +389,36 @@ with torch.no_grad():
     print(scores)
 ```
 
+## Load model locally
+
+### Load the llm-based layerwise reranker locally
+
+If you have downloaded bge-reranker-v2-minicpm-layerwise, you can load it with the following steps:
+
+1. Make sure `configuration_minicpm_reranker.py` and `modeling_minicpm_reranker.py` are in `/path/bge-reranker-v2-minicpm-layerwise`.
+2. Modify the following part of `config.json`:
+
+```
+"auto_map": {
+  "AutoConfig": "configuration_minicpm_reranker.LayerWiseMiniCPMConfig",
+  "AutoModel": "modeling_minicpm_reranker.LayerWiseMiniCPMModel",
+  "AutoModelForCausalLM": "modeling_minicpm_reranker.LayerWiseMiniCPMForCausalLM"
+},
+```
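With the `auto_map` entries above pointing at the local modules, loading reduces to the standard `transformers` remote-code flow. A minimal sketch, assuming the local directory from step 1 and a bfloat16 dtype:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory that holds config.json plus the two custom .py files from step 1
path = "/path/bge-reranker-v2-minicpm-layerwise"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# trust_remote_code=True lets transformers import the classes named in auto_map
model = AutoModelForCausalLM.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.bfloat16
)
model.eval()
```

Scoring then follows the layerwise usage shown earlier in this README (e.g., passing `cutoff_layers` to the forward call).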
+### Load the llm-based lightweight reranker locally
+
+1. Make sure `gemma_config.py` and `gemma_model.py` from [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight/tree/main) are in your local path.
+2. Modify the following part of `config.json`:
+
+```
+"auto_map": {
+  "AutoConfig": "gemma_config.CostWiseGemmaConfig",
+  "AutoModel": "gemma_model.CostWiseGemmaModel",
+  "AutoModelForCausalLM": "gemma_model.CostWiseGemmaForCausalLM"
+},
+```
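The lightweight model follows the same pattern (again a sketch; the local directory name is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/path/bge-reranker-v2.5-gemma2-lightweight"  # assumed local directory
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)
model.eval()
```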
+
 ## Citation
 
 If you find this repository useful, please consider giving a star :star: and citation

research/BGE_M3/README.md

Lines changed: 9 additions & 9 deletions
@@ -1,4 +1,4 @@
-# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
+# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3))
 
 In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
 - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.

@@ -7,7 +7,6 @@ In this project, we introduce BGE-M3, which is distinguished for its versatility
 
 For more details, please refer to our paper: [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216.pdf)
 
-
 **Some suggestions for a retrieval pipeline in RAG**
 
 We recommend the following pipeline: hybrid retrieval + re-ranking.
@@ -19,23 +18,24 @@ To use hybrid retrieval, you can refer to [Vespa](https://github.com/vespa-engin
 ) and [Milvus](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
 
 - As cross-encoder models, re-rankers demonstrate higher accuracy than bi-encoder embedding models.
-Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), [bge-reranker-v2](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)) after retrieval can further filter the selected text.
+Utilizing a re-ranking model (e.g., [bge-reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker#2-normal-reranker), [bge-reranker-v2](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker#3-llm-based-reranker)) after retrieval can further filter the selected text, as sketched below.
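A minimal sketch of that pipeline with the `FlagEmbedding` package (the 0.6/0.4 hybrid weights and the example texts are illustrative assumptions, not tuned values):

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

query = "What is BGE M3?"
docs = [
    "BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval.",
    "BM25 is a bag-of-words retrieval function.",
]

# Hybrid retrieval: combine dense and sparse (lexical) scores from BGE-M3
q = model.encode([query], return_dense=True, return_sparse=True)
d = model.encode(docs, return_dense=True, return_sparse=True)
dense_scores = q['dense_vecs'] @ d['dense_vecs'].T
lexical_scores = [
    model.compute_lexical_matching_score(q['lexical_weights'][0], w)
    for w in d['lexical_weights']
]
hybrid = [0.6 * dense_scores[0][i] + 0.4 * lexical_scores[i] for i in range(len(docs))]

# Re-ranking: rescore the top candidates with the cross-encoder
top = sorted(range(len(docs)), key=lambda i: hybrid[i], reverse=True)
rerank_scores = reranker.compute_score([[query, docs[i]] for i in top])
print(rerank_scores)
```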
 
 
 ## News:
 
 - 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
+
 <details>
 <summary> Details </summary>
-
+
 > The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
-
+
 </details>
 - 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
 - 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
-- 2024/3/2: Release unified fine-tuning [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
-- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
+- 2024/3/2: Release unified fine-tuning [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-bge-m3) and [data](https://huggingface.co/datasets/Shitao/bge-m3-data)
+- 2024/2/6: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and [evaluation pipeline](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB/MLDR).
 - 2024/2/1: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
 
 
@@ -81,10 +81,10 @@ For hybrid retrieval, you can use [Vespa](https://github.com/vespa-engine/pyvesp
 
 **3. How to fine-tune bge-M3 model?**
 
-You can follow the common in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
+You can follow the commands in this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#1-standard-model)
 to fine-tune the dense embedding.
 
-If you want to fine-tune all embedding function of m3 (dense, sparse and colbert), you can refer to the [unified_fine-tuning example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune)
+If you want to fine-tune all embedding functions of M3 (dense, sparse and colbert), you can refer to the [unified fine-tuning example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-bge-m3)
 
 
research/C_MTEB/README.md

Lines changed: 3 additions & 3 deletions
@@ -30,7 +30,7 @@ pip install -U C_MTEB
 Or clone this repo and install as editable
 ```
 git clone https://github.com/FlagOpen/FlagEmbedding.git
-cd FlagEmbedding/C_MTEB
+cd FlagEmbedding/research/C_MTEB
 pip install -e .
 ```
 

@@ -40,7 +40,7 @@
 ```bash
 python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base
 ```
-
+
 ### Evaluate embedding model
 * **With our scripts**
 

@@ -54,7 +54,7 @@ python eval_MTEB.py --model_name_or_path BAAI/bge-large-en
 ```
 
 * **With sentence-transformers**
-
+
 You can use C-MTEB easily in the same way as [MTEB](https://github.com/embeddings-benchmark/mteb).
 
 Note that the original sentence-transformers model doesn't support instruction.
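For instance, a minimal sketch (the task name `T2Retrieval` and the model choice are illustrative assumptions):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")
# Any C-MTEB task name can be passed here; T2Retrieval is one example
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model, output_folder="results/zh/bge-base-zh-v1.5")
```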

research/LM_Cocktail/README.md

Lines changed: 1 addition & 1 deletion
@@ -237,7 +237,7 @@ Merge 10 models fine-tuned on other tasks based on five examples for new tasks:
 - Examples Data for dataset from FLAN: [./llm_examples.json]()
 - MMLU dataset: https://huggingface.co/datasets/cais/mmlu (use the examples in the dev set to do in-context learning)
 
-You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/llm_embedder/docs/evaluation.md) as following:
+You can use these models and our code to produce a new model and evaluate its performance using the [llm-embedder script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/llm_embedder/docs/evaluation.md) as follows:
 ```
 # for 30 tasks from FLAN
 torchrun --nproc_per_node 8 -m evaluation.eval_icl \
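For the merging step itself, the LM-Cocktail API can be used roughly as follows (a sketch; the model names and uniform weights are placeholders):

```python
from LM_Cocktail import mix_models

# Merge a fine-tuned model with its base model; weights must sum to 1
model = mix_models(
    model_names_or_paths=["Shitao/llama2-ag-news", "meta-llama/Llama-2-7b-chat-hf"],
    model_type='decoder',
    weights=[0.5, 0.5],
    output_path='./mixed_llm',
)
```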

research/baai_general_embedding/README.md

Lines changed: 10 additions & 9 deletions
@@ -13,11 +13,12 @@ Therefore, make sure to use the correct method to obtain sentence vectors. You c
 
 **1. How to fine-tune the bge embedding model?**
 
-Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
+Follow this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder) to prepare data and fine-tune your model.
 Some suggestions:
-- Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
-- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json), `--gradient_checkpointing`, etc.
-- If you want to maintain the performance on other tasks when fine-tuning on your data, you can use [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail) to merge the fine-tuned model and the original bge model. Besides, if you want to fine-tune on multiple tasks, you also can approximate the multi-task learning via model merging as [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).
+
+- Mine hard negatives following this [example](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#hard-negatives), which can improve the retrieval performance.
+- In general, a larger `per_device_train_batch_size` brings better performance. You can increase it by enabling `--fp16`, `--deepspeed df_config.json` (for `df_config.json`, refer to [ds_stage0.json](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json)), `--gradient_checkpointing`, etc.
+- If you want to maintain performance on other tasks when fine-tuning on your data, you can use [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) to merge the fine-tuned model and the original bge model. Besides, if you want to fine-tune on multiple tasks, you can also approximate multi-task learning via model merging with [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives are also needed to fine-tune the reranker.
 
@@ -57,7 +58,7 @@ please select an appropriate similarity threshold based on the similarity distri
 For `bge-*-v1.5`, we improved its retrieval ability when not using instruction.
 Using no instruction causes only a slight degradation in retrieval performance compared with using instruction.
 So you can generate embeddings without instruction in all cases for convenience.
-
+
 For a retrieval task that uses short queries to find long related documents,
 it is recommended to add instructions for these short queries.
 **The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
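A minimal sketch of both settings with `FlagEmbedding` (the Chinese instruction string is the one recommended for the `bge-*-zh` models; the texts are placeholders):

```python
from FlagEmbedding import FlagModel

# The instruction is prepended to queries only; passages are encoded without it
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：",
                  use_fp16=True)

q_embeddings = model.encode_queries(["样例查询"])  # with instruction
p_embeddings = model.encode(["样例文档"])          # without instruction
print(q_embeddings @ p_embeddings.T)
```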
@@ -80,7 +81,7 @@ or:
 ```
 pip install -U FlagEmbedding
 ```
-
+
 
 ```python
 from FlagEmbedding import FlagModel
@@ -192,9 +193,9 @@ print("Sentence embeddings:", sentence_embeddings)
 ## Evaluation
 
 `baai-general-embedding` models achieve **state-of-the-art performance on both the MTEB and C-MTEB leaderboards!**
-For more details and evaluation tools see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md)
+For more details and evaluation tools, see our [scripts](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB).
 
-If you want to evaluate the model(or your model) on **your data**, you can refer to this [tool](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#6-evaluate-model).
+If you want to evaluate the model (or your model) on **your data**, you can refer to this [tool](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#8-custom-dataset).
 
 
 - **MTEB**:
@@ -224,7 +225,7 @@ If you want to evaluate the model(or your model) on **your data**, you can refer
 - **C-MTEB**:
 We created the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks.
 Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
-
+
 | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
 |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
 | [**BAAI/bge-large-zh-v1.5**](https://huggingface.co/BAAI/bge-large-zh-v1.5) | 1024 | **64.53** | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |

research/llm_dense_retriever/README.md

Lines changed: 3 additions & 1 deletion
@@ -188,8 +188,10 @@ run.py \
     --dataloader_drop_last True \
     --normlized True \
     --temperature 0.02 \
-    --query_max_len 512 \
+    --query_max_len 2048 \
     --passage_max_len 512 \
+    --example_query_max_len 256 \
+    --example_passage_max_len 256 \
     --train_group_size 8 \
     --logging_steps 1 \
     --save_steps 250 \

research/llm_reranker/README.md

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ You can fine-tune the reranker with the following code:
 
 **For normal reranker** (bge-reranker-base / bge-reranker-large / bge-reranker-v2-m3)
 
-Refer to: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker
+Refer to: [reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker#1-standard-model)
 
 **For llm-based reranker** (bge-reranker-v2-gemma)
 