
Commit 71cbc1f

Merge branch 'FlagOpen:master' into master
2 parents: 1e4d428 + ad08b9a

173 files changed: 48864 additions & 12371 deletions


C_MTEB/MKQA/dense_retrieval/step1-search_results.py

Lines changed: 0 additions & 4 deletions
@@ -136,10 +136,6 @@ def save_result(search_results, result_save_path: str, qids: list, max_hits: int
                                  max_passage_hits=1000)
     with output_writer:
         for topic, hits in search_results:
-            # For some test collections, a query is doc from the corpus (e.g., arguana in BEIR).
-            # Remove the query from the results.
-            hits = [hit for hit in hits if hit.docid != topic]
-
             output_writer.write(topic, hits)

C_MTEB/MLDR/dense_retrieval/step1-search_results.py

Lines changed: 0 additions & 4 deletions
@@ -120,10 +120,6 @@ def save_result(search_results, result_save_path: str, qids: list, max_hits: int
                                  max_passage_hits=1000)
     with output_writer:
         for topic, hits in search_results:
-            # For some test collections, a query is doc from the corpus (e.g., arguana in BEIR).
-            # Remove the query from the results.
-            hits = [hit for hit in hits if hit.docid != topic]
-
             output_writer.write(topic, hits)

FlagEmbedding/.DS_Store

0 Bytes
Binary file not shown.

FlagEmbedding/BGE_M3/README.md

Lines changed: 10 additions & 1 deletion
@@ -23,6 +23,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen


 ## News:
+
+- 2024/7/1: **We update the MIRACL evaluation results of BGE-M3**. To reproduce the new results, you can refer to: [bge-m3_miracl_2cr](https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr). We have also updated our [paper](https://arxiv.org/pdf/2402.03216) on arXiv.
+<details>
+<summary> Details </summary>
+
+> The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
+
+</details>
 - 2024/3/20: **Thanks Milvus team!** Now you can use hybrid retrieval of bge-m3 in Milvus: [pymilvus/examples/hello_hybrid_sparse_dense.py](https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py).
 - 2024/3/8: **Thanks for the [experimental results](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) from @[Yannael](https://huggingface.co/Yannael). In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.**
@@ -205,14 +213,15 @@ print(model.compute_score(sentence_pairs,

 We provide the evaluation script for [MKQA](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MKQA) and [MLDR](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR)

+
 ### Benchmarks from the open-source community
 ![avatar](./imgs/others.webp)
 The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI).
 For more details, please refer to the [article](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05) and [Github Repo](https://github.com/Yannael/multilingual-embeddings)


 ### Our results
-- Multilingual (Miracl dataset)
+- Multilingual (MIRACL dataset)

 ![avatar](./imgs/miracl.jpg)
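
The `--remove-query` note above corresponds to the filter this commit deletes from the two `step1-search_results.py` scripts. A minimal sketch of that behavior, with a stand-in `Hit` class and made-up ids rather than pyserini's real hit objects:

```python
# Sketch of the filter removed in this commit (the behavior pyserini's
# --remove-query flag restores). Hit stands in for pyserini's hit type;
# all ids here are hypothetical.
class Hit:
    def __init__(self, docid: str, score: float):
        self.docid, self.score = docid, score

topic = "q42"  # query id; in arguana-style collections a query shares its id with a corpus doc
hits = [Hit("d1", 9.1), Hit("q42", 8.7), Hit("d2", 7.3)]

# Pre-commit behavior: drop any hit whose docid equals the query id.
hits = [hit for hit in hits if hit.docid != topic]
assert [h.docid for h in hits] == ["d1", "d2"]
```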

FlagEmbedding/BGE_M3/imgs/bm25.jpg

61.4 KB → 126 KB (image updated; preview not shown)

FlagEmbedding/baai_general_embedding/README.md

Lines changed: 5 additions & 1 deletion
@@ -192,7 +192,11 @@ print("Sentence embeddings:", sentence_embeddings)
 ## Evaluation

 `baai-general-embedding` models achieve **state-of-the-art performance on both MTEB and C-MTEB leaderboard!**
-For more details and evaluation tools see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md)
+For more details and evaluation tools see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md)
+
+If you want to evaluate the model(or your model) on **your data**, you can refer to this [tool](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#6-evaluate-model).
+
+
 - **MTEB**:

 | Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) |Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |

FlagEmbedding/baai_general_embedding/finetune/eval_msmarco.py

Lines changed: 40 additions & 8 deletions
@@ -26,6 +26,15 @@ class Args:
         default=False,
         metadata={'help': 'Add query-side instruction?'}
     )
+
+    corpus_data: str = field(
+        default="namespace-Pt/msmarco",
+        metadata={'help': 'candidate passages'}
+    )
+    query_data: str = field(
+        default="namespace-Pt/msmarco-corpus",
+        metadata={'help': 'queries and their positive passages for evaluation'}
+    )

     max_query_length: int = field(
         default=32,
@@ -143,7 +152,10 @@ def search(model: FlagModel, queries: datasets, faiss_index: faiss.Index, k:int
     return all_scores, all_indices


-def evaluate(preds, labels, cutoffs=[1,10,100]):
+def evaluate(preds,
+             preds_scores,
+             labels,
+             cutoffs=[1, 10, 100]):
     """
     Evaluate MRR and Recall at cutoffs.
     """
@@ -177,15 +189,37 @@ def evaluate(preds, labels, cutoffs=[1,10,100]):
         recall = recalls[i]
         metrics[f"Recall@{cutoff}"] = recall

-    return metrics
+    # AUC
+    pred_hard_encodings = []
+    for pred, label in zip(preds, labels):
+        pred_hard_encoding = np.isin(pred, label).astype(int).tolist()
+        pred_hard_encodings.append(pred_hard_encoding)
+
+    from sklearn.metrics import roc_auc_score, roc_curve, ndcg_score
+    pred_hard_encodings1d = np.asarray(pred_hard_encodings).flatten()
+    preds_scores1d = preds_scores.flatten()
+    auc = roc_auc_score(pred_hard_encodings1d, preds_scores1d)
+
+    metrics['AUC@100'] = auc

+    # nDCG
+    for k, cutoff in enumerate(cutoffs):
+        nDCG = ndcg_score(pred_hard_encodings, preds_scores, k=cutoff)
+        metrics[f"nDCG@{cutoff}"] = nDCG
+
+    return metrics

 def main():
     parser = HfArgumentParser([Args])
     args: Args = parser.parse_args_into_dataclasses()[0]
-
-    eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
-    corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
+
+    if args.query_data == 'namespace-Pt/msmarco-corpus':
+        assert args.corpus_data == 'namespace-Pt/msmarco'
+        eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")
+        corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")
+    else:
+        eval_data = datasets.load_dataset('json', data_files=args.query_data, split='train')
+        corpus = datasets.load_dataset('json', data_files=args.corpus_data, split='train')

     model = FlagModel(
         args.encoder,
@@ -223,9 +257,7 @@ def main():
     for sample in eval_data:
         ground_truths.append(sample["positive"])

-    from FlagEmbedding.llm_embedder.src.utils import save_json
-
-    metrics = evaluate(retrieval_results, ground_truths)
+    metrics = evaluate(retrieval_results, scores, ground_truths)

     print(metrics)
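
For reference, a self-contained sketch of the AUC and nDCG computation that `evaluate()` gains above, run on toy retrieval results instead of the MSMARCO pipeline (all ids and scores below are made up):

```python
# Toy reproduction of the new metrics: binary-encode each ranked list
# against the relevant ids, then score with sklearn.
import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score

preds = np.array([[3, 7, 1], [2, 9, 4]])        # retrieved doc ids per query
preds_scores = np.array([[0.9, 0.5, 0.1],
                         [0.8, 0.6, 0.2]])      # scores aligned with preds
labels = [[7], [2, 4]]                          # relevant doc ids per query

# 1/0 relevance encoding of each ranked list (np.isin, as in the diff)
pred_hard_encodings = [np.isin(pred, label).astype(int).tolist()
                       for pred, label in zip(preds, labels)]

auc = roc_auc_score(np.asarray(pred_hard_encodings).flatten(),
                    preds_scores.flatten())
ndcg_at_3 = ndcg_score(pred_hard_encodings, preds_scores, k=3)
print({"AUC": auc, "nDCG@3": ndcg_at_3})
```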

FlagEmbedding/flag_reranker.py

Lines changed: 17 additions & 12 deletions
@@ -6,7 +6,7 @@
 from torch.utils.data import DataLoader
 from tqdm import tqdm, trange
 from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, is_torch_npu_available
-
+from peft import PeftModel
 import warnings
 from torch.utils.data import Dataset
 import os
@@ -218,15 +218,14 @@ def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str,
         if normalize:
             all_scores = [sigmoid(score) for score in all_scores]

-        if len(all_scores) == 1:
-            return all_scores[0]
         return all_scores


 class FlagLLMReranker:
     def __init__(
             self,
             model_name_or_path: str = None,
+            peft_path: str = None,
             use_fp16: bool = False,
             use_bf16: bool = False,
             cache_dir: str = None,
@@ -240,6 +239,9 @@ def __init__(
             cache_dir=cache_dir,
             trust_remote_code=True,
             torch_dtype=torch.bfloat16 if use_bf16 else torch.float32)
+        if peft_path:
+            self.model = PeftModel.from_pretrained(self.model,peft_path)
+            self.model = self.model.merge_and_unload()
         self.model_name_or_path = model_name_or_path
         self.cache_dir = cache_dir

@@ -270,7 +272,7 @@ def __init__(
     @torch.no_grad()
     def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]], batch_size: int = 16,
                       max_length: int = 512, prompt: str = None, normalize: bool = False,
-                      use_dataloader: bool = True, num_workers: int = None) -> List[float]:
+                      use_dataloader: bool = False, num_workers: int = None) -> List[float]:
         assert isinstance(sentence_pairs, list)
         if isinstance(sentence_pairs[0], str):
             sentence_pairs = [sentence_pairs]
@@ -365,8 +367,8 @@ def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str,
         if normalize:
             all_scores = [sigmoid(score) for score in all_scores]

-        if len(all_scores) == 1:
-            return all_scores[0]
+        # if len(all_scores) == 1:
+        #     return all_scores[0]

         return all_scores

@@ -392,6 +394,7 @@ class LayerWiseFlagLLMReranker:
     def __init__(
             self,
             model_name_or_path: str = None,
+            peft_path: str = None,
             use_fp16: bool = False,
             use_bf16: bool = False,
             cache_dir: str = None,
@@ -410,7 +413,9 @@ def __init__(
             trust_remote_code=True,
             local_files_only=True,
             torch_dtype=torch.bfloat16 if use_bf16 else torch.float32)
-
+        if peft_path:
+            self.model = PeftModel.from_pretrained(self.model,peft_path)
+            self.model = self.model.merge_and_unload()
         self.model_name_or_path = model_name_or_path
         self.cache_dir = cache_dir

@@ -444,7 +449,7 @@ def __init__(
     @torch.no_grad()
     def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]], batch_size: int = 16,
                       max_length: int = 512, cutoff_layers: List[int] = None, prompt: str = None,
-                      normalize: bool = False, use_dataloader: bool = True,
+                      normalize: bool = False, use_dataloader: bool = False,
                       num_workers: int = None) -> Union[float, List[float], List[List[float]]]:
         assert isinstance(sentence_pairs, list)
         if isinstance(sentence_pairs[0], str):
@@ -556,10 +561,10 @@ def compute_score(self, sentence_pairs: Union[List[Tuple[str, str]], Tuple[str,
             if normalize:
                 all_scores[i] = [sigmoid(score) for score in all_scores[i]]

-        if len(all_scores) == 1:
-            if len(all_scores[0]) == 1:
-                return all_scores[0][0]
-            return all_scores[0]
+        # if len(all_scores) == 1:
+        #     if len(all_scores[0]) == 1:
+        #         return all_scores[0][0]
+        #     return all_scores[0]

         return all_scores
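
A hypothetical usage sketch for the `peft_path` hook added above (the model id and adapter path are placeholders, not from this commit). Note that with the scalar short-circuits removed or commented out, `compute_score` now returns a list even for a single pair:

```python
# Placeholder paths; peft_path loads a LoRA adapter and merges it
# into the base model via merge_and_unload(), as in the diff above.
from FlagEmbedding import FlagLLMReranker

reranker = FlagLLMReranker(
    'BAAI/bge-reranker-v2-gemma',
    peft_path='path/to/lora_adapter',
    use_fp16=True,
)

scores = reranker.compute_score([['what is panda?', 'The giant panda is a bear native to China.']])
print(scores)  # a one-element list, no longer a bare float
```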

FlagEmbedding/llm_reranker/README.md

Lines changed: 10 additions & 6 deletions
@@ -251,7 +251,11 @@ See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/tree/mas

 You can fine-tune the reranker with the following code:

-**For llm-based reranker**
+**For normal reranker** (bge-reranker-base / bge-reranker-large / bge-reranker-v2-m3 )
+
+Refer to: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker
+
+**For llm-based reranker** (bge-reranker-v2-gemma)

 ```shell
 torchrun --nproc_per_node {number of gpus} \
@@ -282,7 +286,7 @@ torchrun --nproc_per_node {number of gpus} \
     --target_modules q_proj k_proj v_proj o_proj
 ```

-**For llm-based layerwise reranker**
+**For llm-based layerwise reranker** (bge-reranker-v2-minicpm-layerwise)

 ```shell
 torchrun --nproc_per_node {number of gpus} \
@@ -360,21 +364,21 @@ merge_layerwise_finetuned_llm('BAAI/bge-reranker-v2-minicpm-layerwise', 'lora_ll

 - BEIR.

-  rereank the top 100 results from bge-en-v1.5 large.
+  rerank the top 100 results from bge-en-v1.5 large.

   ![image-20240319140555921](./evaluation/BEIR-bge-en-v1.5.png)

-  rereank the top 100 results from e5 mistral 7b instruct.
+  rerank the top 100 results from e5 mistral 7b instruct.

   ![image-20240317172949713](./evaluation/BEIR-e5-mistral.png)

 - CMTEB-retrieval.
-  It rereank the top 100 results from bge-zh-v1.5 large.
+  It rerank the top 100 results from bge-zh-v1.5 large.

   ![image-20240317173026235](./evaluation/CMTEB-retrieval-bge-zh-v1.5.png)

 - miracl (multi-language).
-  It rereank the top 100 results from bge-m3.
+  It rerank the top 100 results from bge-m3.

   ![image-20240317173117639](./evaluation/miracl-bge-m3.png)