
Commit 97bc10c

Merge branch 'new-flagembedding-v1' of https://github.com/hanhainebula/FlagEmbedding into new-flagembedding-v1

2 parents: b8e1cb0 + ef51aeb

8 files changed: 52 additions & 30 deletions

FlagEmbedding/abc/inference/AbsEmbedder.py (0 additions & 2 deletions)

```diff
@@ -167,8 +167,6 @@ def encode(
         instruction_format: Optional[str] = None,
         **kwargs: Any
     ):
-        if instruction is None: instruction = self.instruction
-        if instruction_format is None: instruction_format = self.instruction_format
        if batch_size is None: batch_size = self.batch_size
        if max_length is None: max_length = self.passage_max_length
        if convert_to_numpy is None: convert_to_numpy = self.convert_to_numpy
```
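The two deleted lines move instruction defaulting out of `encode`; the surviving lines all follow the same fall-back-to-instance-defaults idiom. A minimal runnable sketch of that pattern (simplified class with hypothetical default values, not the real `AbsEmbedder`):

```python
from typing import Optional

class Embedder:
    """Sketch of the fall-back-to-instance-defaults pattern used by
    AbsEmbedder.encode (names simplified; default values are assumptions)."""

    def __init__(self, batch_size: int = 256, passage_max_length: int = 512,
                 convert_to_numpy: bool = True):
        self.batch_size = batch_size
        self.passage_max_length = passage_max_length
        self.convert_to_numpy = convert_to_numpy

    def encode(self, sentences, batch_size: Optional[int] = None,
               max_length: Optional[int] = None,
               convert_to_numpy: Optional[bool] = None):
        # Per-call arguments override the instance defaults set in __init__.
        if batch_size is None: batch_size = self.batch_size
        if max_length is None: max_length = self.passage_max_length
        if convert_to_numpy is None: convert_to_numpy = self.convert_to_numpy
        return batch_size, max_length, convert_to_numpy

embedder = Embedder(batch_size=128)
embedder.encode(["hi"])                # falls back to instance defaults
embedder.encode(["hi"], batch_size=8)  # per-call override wins
```

The advantage of `None` sentinels over hard-coded keyword defaults is that the instance-level configuration stays the single source of truth.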

FlagEmbedding/inference/reranker/model_mapping.py (18 additions & 1 deletion)

```diff
@@ -54,5 +54,22 @@ class RerankerConfig:
         "bge-reranker-v2.5-gemma2-lightweight",
         RerankerConfig(LightWeightFlagLLMReranker)
     ),
-    # TODO: Add more models, such as Jina, e5, etc.
+    # others
+    (
+        "jinaai/jina-reranker-v2-base-multilingual",
+        RerankerConfig(FlagReranker)
+    ),
+    (
+        "Alibaba-NLP/gte-multilingual-reranker-base",
+        RerankerConfig(FlagReranker)
+    ),
+    (
+        "maidalun1020/bce-reranker-base_v1",
+        RerankerConfig(FlagReranker)
+    ),
+    (
+        "jinaai/jina-reranker-v1-turbo-en",
+        RerankerConfig(FlagReranker)
+    ),
+    # TODO: Add more models.
 ])
```
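These entries pair model names with `RerankerConfig` objects so the right reranker class can be instantiated from a model name alone. A minimal sketch of how such a lookup might work, with stub classes, a simplified `RerankerConfig`, and an assumed matching rule (substring match with a `FlagReranker` fallback):

```python
from collections import OrderedDict

# Stub stand-ins for the real reranker classes referenced in model_mapping.py.
class FlagReranker: ...
class LightWeightFlagLLMReranker: ...

class RerankerConfig:
    """Simplified: the real RerankerConfig carries more fields than this."""
    def __init__(self, reranker_class):
        self.reranker_class = reranker_class

# Name -> config mapping in the spirit of the diff above (variable name assumed).
RERANKER_MAPPING = OrderedDict([
    ("bge-reranker-v2.5-gemma2-lightweight", RerankerConfig(LightWeightFlagLLMReranker)),
    ("jinaai/jina-reranker-v2-base-multilingual", RerankerConfig(FlagReranker)),
    ("Alibaba-NLP/gte-multilingual-reranker-base", RerankerConfig(FlagReranker)),
])

def resolve_reranker_class(model_name_or_path: str):
    # Substring match so local paths containing the model name also resolve;
    # unknown models fall back to the plain cross-encoder class.
    for name, config in RERANKER_MAPPING.items():
        if name in model_name_or_path:
            return config.reranker_class
    return FlagReranker
```

An `OrderedDict` keeps iteration order deterministic, so more specific patterns can be listed before more general ones.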

README.md (15 additions & 12 deletions)

```diff
@@ -1,4 +1,6 @@
-<h1 align="center">FlagEmbedding</h1>
+![bge_logo](./imgs/bge_logo.jpg)
+
+<h1 align="center">⚡️BGE: One-Stop Retrieval Toolkit For Search and RAG</h1>
 <p align="center">
     <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d">
         <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow">
@@ -12,7 +14,7 @@
     <a href="https://huggingface.co/C-MTEB">
         <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
     </a>
-    <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
+    <a href="https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding">
         <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
     </a>
 </p>
@@ -30,25 +32,26 @@
     <p>
 </h4>
 
+[English](README.md) | [中文](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/README_zh.md)
 
 
-[English](README.md) | [中文](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/README_zh.md)
 
-FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently:
+BGE (BAAI General Embedding) focuses on retrieval-augmented LLMs, consisting of the following projects currently:
+
+![projects](./imgs/projects.png)
 
 - **Inference**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker)
 - **Finetune**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker)
-- **Evaluation**: [MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#1-mteb), [BEIR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#2-beir), [MSMARCO](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#3-msmarco), [MIRACL](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#4-miracl), [MLDR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#5-mldr), [MKQA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#6-mkqa), [AIR-Bench](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#7-air-bench), [Custom Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#8-custom-dataset)
-- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**: [MLDR](https://huggingface.co/datasets/Shitao/MLDR), [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), [public-data](https://huggingface.co/datasets/cfli/bge-e5data), [full-data](https://huggingface.co/datasets/cfli/bge-full-data), [reranker-data](Shitao/bge-reranker-data)
+- **[Evaluation](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation)**
+- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**
 - **[Tutorials](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/Tutorials)**
-- **research**:
-  - **Long-Context LLM**: [Activation Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora)
-  - **Fine-tuning of LM**: [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail)
-  - **Embedding Model**: [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), [BGE-M3](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3), [LLM Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), [BGE Embedding](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding)
-  - **Reranker Model**: [llm rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker), [BGE Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker)
-  - **Benchmark**: [C-MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)
+- **[research](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research)**
 
 ## News
+
+- 29/10/2024: :earth_asia: We created a WeChat group for BGE. Scan the [QR code](./imgs/BGE_WeChat_Group.png) to join the group chat! Join us to get first-hand news about our updates and new releases, or to share any questions or ideas!
+  - <img src="./imgs/BGE_WeChat_Group.png" alt="bge_wechat_group" class="center" width="200">
+
 - 22/10/2024: :fire: We release another interesting model: [OmniGen](https://github.com/VectorSpaceLab/OmniGen), which is a unified image generation model supporting various tasks. OmniGen can accomplish complex image generation tasks without the need for additional plugins like ControlNet, IP-Adapter, or auxiliary models such as pose detection and face detection.
 - 9/10/2024: Introducing **MemoRAG**, a step forward towards RAG 2.0 on top of memory-inspired knowledge discovery (repo: https://github.com/qhjqhj00/MemoRAG, paper: https://arxiv.org/pdf/2409.05591v1) :fire:
 - 9/2/2024: Started maintaining the [tutorials](./Tutorials/). The contents within will be actively updated and enriched, stay tuned! :books:
```

README_zh.md (14 additions & 12 deletions)

```diff
@@ -1,4 +1,6 @@
-<h1 align="center">FlagEmbedding</h1>
+![bge_logo](./imgs/bge_logo.jpg)
+
+<h1 align="center">⚡️BGE: One-Stop Retrieval Toolkit For Search and RAG</h1>
 <p align="center">
     <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d">
         <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow">
@@ -12,11 +14,12 @@
     <a href="https://huggingface.co/C-MTEB">
         <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
     </a>
-    <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
+    <a href="https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding">
         <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
     </a>
 </p>
 
+
 <h4 align="center">
     <p>
         <a href=#更新>News</a> |
@@ -30,25 +33,24 @@
         <a href="#license">License</a>
     <p>
 </h4>
-
 [English](README.md) | [中文](README_zh.md)
 
+BGE (BAAI General Embedding) focuses on the field of retrieval-augmented LLMs and currently includes the following projects:
 
-FlagEmbedding focuses on the field of retrieval-augmented LLMs and currently includes the following projects:
+![projects](./imgs/projects.png)
 
 - **Inference**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/reranker)
 - **Finetune**: [Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder), [Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker)
-- **Evaluation**: [MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#1-mteb), [BEIR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#2-beir), [MSMARCO](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#3-msmarco), [MIRACL](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#4-miracl), [MLDR](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#5-mldr), [MKQA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#6-mkqa), [AIR-Bench](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#7-air-bench), [Custom Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation#8-custom-dataset)
-- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**: [MLDR](https://huggingface.co/datasets/Shitao/MLDR), [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), [public-data](https://huggingface.co/datasets/cfli/bge-e5data), [full-data](https://huggingface.co/datasets/cfli/bge-full-data), [reranker-data](Shitao/bge-reranker-data)
+- **[Evaluation](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/evaluation)**
+- **[Dataset](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/dataset)**
 - **[Tutorials](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/Tutorials)**
-- **Research**:
-  - **Long-Context LLM**: [Activation Beacon](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/activation_beacon), [LongLLM QLoRA](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/Long_LLM/longllm_qlora)
-  - **Fine-tuning of LM**: [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail)
-  - **Embedding Model**: [Visualized-BGE](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/visual_bge), [BGE-M3](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/BGE_M3), [LLM Embedder](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_embedder), [BGE Embedding](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding)
-  - **Reranker Model**: [llm rerankers](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/llm_reranker), [BGE Reranker](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/reranker)
-  - **Benchmark**: [C-MTEB](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)
+- **[Research](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research)**
 
 ## News
+
+- 29/10/2024: :earth_asia: We created a [BGE WeChat group](./BGE_WeChat_Group.png); scan the QR code to join!
+  - <img src="./imgs/BGE_WeChat_Group.png" alt="bge_wechat_group" class="center" width="200">
+
 - 9/2/2024: Started maintaining the [tutorials](./Tutorials/); their contents will be continuously enriched, stay tuned! :books:
 - 7/26/2024: Released [bge-en-icl](https://huggingface.co/BAAI/bge-en-icl), a text retrieval model with in-context learning capability: by providing task-relevant query-answer examples, it can encode semantically richer queries and further strengthen the semantic representation power of the embeddings. :fire:
 - 7/26/2024: Released [bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2), a multilingual embedding model based on gemma-2-9b that supports multiple languages and diverse downstream tasks, achieving the best results to date on the multilingual retrieval datasets MIRACL, MTEB-fr, and MTEB-pl. :fire:
```

examples/finetune/embedder/README.md (5 additions & 3 deletions)

````diff
@@ -75,9 +75,11 @@ cd FlagEmbedding/scripts
 python add_reranker_score.py \
 --input_file toy_finetune_data_minedHN.jsonl \
 --output_file toy_finetune_data_score.jsonl \
---range_for_sampling 2-200 \
---negative_number 15 \
---use_gpu_for_searching
+--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
+--devices cuda:0 cuda:1 \
+--cache_dir ./cache/model \
+--reranker_query_max_length 512 \
+--reranker_max_length 1024
 ```
 
 - **`input_file`**: path to save JSON data with mined hard negatives for finetuning
````
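The new flags hand the scoring off to a cross-encoder reranker (here BAAI/bge-reranker-v2-m3). Conceptually, the script reads the mined-hard-negative JSONL and attaches a reranker score to each query-passage pair. A minimal sketch of that step, with an assumed per-line schema (`query`, `pos`, `neg`) and a caller-supplied `score_fn` standing in for the real reranker:

```python
import json

def add_reranker_scores(input_file: str, output_file: str, score_fn) -> None:
    """Attach a teacher score to every positive and negative passage.

    Assumed input schema: one JSON object per line with a "query" string
    and "pos"/"neg" lists of passage strings (the toy_finetune_data layout).
    """
    with open(input_file, encoding="utf-8") as fin, \
         open(output_file, "w", encoding="utf-8") as fout:
        for line in fin:
            item = json.loads(line)
            query = item["query"]
            # Scores are stored alongside the passages so finetuning can use
            # them later, e.g. as distillation targets for the embedder.
            item["pos_scores"] = [score_fn(query, p) for p in item["pos"]]
            item["neg_scores"] = [score_fn(query, n) for n in item["neg"]]
            fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```

In the real script, `score_fn` would be the loaded reranker's scoring call across the configured `--devices`; the sketch only shows the data flow.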

Binary files added:

- imgs/BGE_WeChat_Group.png (60.9 KB)
- imgs/bge_logo.jpg (3.64 MB)
- imgs/projects.png (113 KB)
