First, we will introduce the commonly used variables, followed by an introduction to the evaluation of each benchmark.

### 1. MTEB
For MTEB, we primarily rely on the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders and restricts the output format of evaluation results to JSON. The following new variables have been introduced:
- **`languages`**: Languages to evaluate. Default: eng
- **`tasks`**: Tasks to evaluate. Default: None
- **`task_types`**: The task types to evaluate. Default: None
- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
- **`use_special_examples`**: Whether to use specific examples in `examples.py` for evaluation. Default: False

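Here is an example for evaluation. Treat it as a sketch: the `FlagEmbedding.evaluation.mteb` module path and the exact flag set are assumed to mirror the other benchmarks in this README, so verify them against your installed version:

```shell
# Sketch of an MTEB run; the module path and flags are assumed to mirror
# the other benchmarks in this README -- check them before use.
python -m FlagEmbedding.evaluation.mteb \
    --eval_name mteb \
    --output_dir ./mteb/results \
    --languages eng \
    --task_types Retrieval STS \
    --use_special_instructions False \
    --use_special_examples False \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model
```
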
### 2. BEIR

BEIR supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new variable has been introduced:
- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False

Here is an example for evaluation:
```shell
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa arguana cqadupstack \
    --splits test dev \
    --corpus_embd_save_dir ./beir/corpus_embd \
    --output_dir ./beir/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./beir/beir_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
### 3. MSMARCO
MSMARCO supports evaluations on both `passage` and `document`, each providing the evaluation splits `dev`, `dl19`, and `dl20`.
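Here is an example for evaluation. This is a sketch: the `FlagEmbedding.evaluation.msmarco` module path is an assumption based on the naming pattern of the other benchmarks in this README, so verify it against your installation:

```shell
# Sketch: evaluate on the MSMARCO passage task (module path assumed
# from the naming pattern of the other benchmarks in this README).
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./msmarco/corpus_embd \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./msmarco/msmarco_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
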
### 4. MIRACL

MIRACL supports evaluations in multiple languages, using the language codes as dataset names: `ar`, `bn`, `en`, `es`, `fa`, `fi`, `fr`, `hi`, `id`, `ja`, `ko`, `ru`, `sw`, `te`, `th`, `zh`, `de`, `yo`. For `de` and `yo`, the supported split is `dev`; for the rest, the supported splits are `train` and `dev`.
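Here is an example for evaluation. As with MSMARCO above, the `FlagEmbedding.evaluation.miracl` module path is an assumption based on the naming pattern of the other benchmarks, so verify it against your installation:

```shell
# Sketch: evaluate MIRACL on two languages (module path assumed
# from the naming pattern of the other benchmarks in this README).
python -m FlagEmbedding.evaluation.miracl \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names bn hi \
    --splits dev \
    --corpus_embd_save_dir ./miracl/corpus_embd \
    --output_dir ./miracl/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./miracl/miracl_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
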
### 5. MLDR

MLDR supports evaluations in multiple languages, using the language codes as dataset names: `ar`, `de`, `en`, `es`, `fr`, `hi`, `it`, `ja`, `ko`, `pt`, `ru`, `th`, `zh`. The available splits are `train`, `dev`, and `test`.
Here is an example for evaluation:
```shell
python -m FlagEmbedding.evaluation.mldr \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names hi \
    --splits test \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
### 6. MKQA
MKQA supports multi-language evaluation, using different languages as dataset names, including `en`, `ar`, `fi`, `ja`, `ko`, `ru`, `es`, `sv`, `he`, `th`, `da`, `de`, `fr`, `it`, `nl`, `pl`, `pt`, `hu`, `vi`, `ms`, `km`, `no`, `tr`, `zh_cn`, `zh_hk`, `zh_tw`. The supported split is `test`.
Here is an example for evaluation:
```shell
python -m FlagEmbedding.evaluation.mkqa \
    --eval_name mkqa \
    --dataset_dir ./mkqa/data \
    --dataset_names en zh_cn \
    --splits test \
    --corpus_embd_save_dir ./mkqa/corpus_embd \
    --output_dir ./mkqa/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 20 \
    --eval_output_method markdown \
    --eval_output_path ./mkqa/mkqa_eval_results.md \
    --eval_metrics qa_recall_at_20 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
### 7. AIR-Bench
The AIR-Bench evaluation is mainly based on the official [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench/tree/main) framework and requires the use of its official evaluation metrics. Below are some important variables:
- **`benchmark_version`**: The benchmark version to evaluate (e.g., `AIR-Bench_24.05`).
- **`task_types`**: Task types to evaluate (e.g., `qa`, `long-doc`).
- **`domains`**: Domains to evaluate.
- **`languages`**: Languages to evaluate.

Here is an example for evaluation:
```shell
python -m FlagEmbedding.evaluation.air_bench \
    --benchmark_version AIR-Bench_24.05 \
    --task_types qa long-doc \
    --domains arxiv \
    --languages en \
    --splits dev test \
    --output_dir ./air_bench/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_dir ./cache/data \
    --overwrite False \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --model_cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
### 8. Custom Dataset
You can refer to the MLDR setup when evaluating a custom dataset: you only need to rewrite the `DataLoader`, overriding the loading method for your own dataset. A sketch of a custom evaluation run is shown at the end of this section.
The example data for `corpus.jsonl`:
```json
{"id": "77628", "title": "Recover deleted cache", "text": "Is it possible to recover cache photos? The files were deleted by Clean Master to save space. I have no idea where to start. The photos are precious and are irreplaceable."}
{"id": "806", "title": "How do I undelete or recover deleted files on Android?", "text": "> **Possible Duplicate:** > How can I recover a deleted file on Android? Is there a way to recover deleted files on Android phones without using standard USB storage recovery tools?"}
{"id": "74923", "title": "Recovering deleted pictures", "text": "I recently deleted all of my pictures by mistake from my samsung galaxy s4. I went into my files and documents and deleted not realising it would delete all my pics! Is there a way for me to recover them? My phone is not rooted. I have not taken any pictures since but have received pictures through whatsapp?"}
{"id": "50864", "title": "How to recover deleted files on Android phone", "text": "I was a using an autocall recorder app on my HTC Wildfire. I saved a call on my phones SD card and in my Dropbox. However, I accidently deleted the saved call and it was removed from my dropbox file. I now need this call and I tried some data recovery software. I scanned both my phone and pc. The software found the deleted call and recovered it, but the file which has .AMR extension does not work. The size of the file is only 143kb. 1. What is the likelihood this file is corrupted/stiil intact? Can I check that? 2. Which software can I use to salvage/replay the AMR file?"}
{"id": "81285", "title": "How to recover deleted photo album saved on internal memory - Note 3", "text": "I have a Samsung Note 3 and I accidentally deleted an entire photo album from my phones gallery. I didn't enable my device to sync with Gmail. I didn't manually backup any of the data. The images were saved on my phone, not on the SD card. Is there any way for me to recover this deleted photo album? I Google'd and came across SDrescan but that won't work since the images were not initially saved on my SD card."}
```
The example data for `test_queries.jsonl`:
```json
{"id": "79085", "text": "HTC One Mini data recovery after root"}
```
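
Here is the sketch of a custom evaluation run referenced above. The `FlagEmbedding.evaluation.custom` module path and the flag set are assumptions based on the pattern of the benchmarks above; adapt them to your rewritten `DataLoader` and verify them against your installation:

```shell
# Sketch: run evaluation on a custom dataset. The module path and flags
# are assumed from the pattern of the benchmarks above -- verify before use.
python -m FlagEmbedding.evaluation.custom \
    --eval_name your_data \
    --dataset_dir ./your_data \
    --splits test \
    --corpus_embd_save_dir ./your_data/corpus_embd \
    --output_dir ./your_data/search_results \
    --search_top_k 1000 \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./your_data/eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:0 \
    --cache_dir ./cache/model
```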