Commit a525e44

update evaluation readme

1 parent ef30a53 commit a525e44

1 file changed: examples/evaluation/README.md (213 additions & 6 deletions)
@@ -101,29 +101,236 @@ First, we will introduce the commonly used variables, followed by an introduction
In the evaluation of MTEB, we primarily use the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders and restricts the output format of evaluation results to JSON. The following new variables have been introduced:

- **`languages`**: Languages to evaluate. Default: eng
- **`tasks`**: Tasks to evaluate. Default: None
- **`task_types`**: The task types to evaluate. Default: None
- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
- **`use_special_examples`**: Whether to use specific examples in `examples.py` for evaluation. Default: False

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.mteb \
    --eval_name mteb \
    --output_dir ./data/mteb/search_results \
    --languages eng \
    --tasks NFCorpus BiorxivClusteringS2S SciDocsRR \
    --eval_output_path ./mteb/mteb_eval_results.json \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:7 \
    --cache_dir ./cache/model
```
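
Since the results are written as JSON, they can be inspected with a few lines of Python. This is a minimal sketch that assumes only that the file at `--eval_output_path` is standard JSON; the exact schema depends on the MTEB version and the tasks you ran:

```python
# Minimal sketch: inspect the JSON results written to --eval_output_path.
# Assumption: the file is standard JSON; its schema varies by MTEB version.
import json

with open("./mteb/mteb_eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Print the top-level structure to see how scores are organized.
print(json.dumps(results, indent=2)[:1000])
```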

### 2. BEIR

BEIR supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new variable has been introduced:

- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa arguana cqadupstack \
    --splits test dev \
    --corpus_embd_save_dir ./beir/corpus_embd \
    --output_dir ./beir/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./beir/beir_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 3. MSMARCO

MSMARCO supports evaluations on both `passage` and `document`, providing the evaluation splits `dev`, `dl19`, and `dl20` for each.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./msmarco/corpus_embd \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite True \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./msmarco/msmarco_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 4. MIRACL

MIRACL supports evaluations in multiple languages, using language codes as dataset names: `ar`, `bn`, `en`, `es`, `fa`, `fi`, `fr`, `hi`, `id`, `ja`, `ko`, `ru`, `sw`, `te`, `th`, `zh`, `de`, `yo`. For `de` and `yo`, the only supported split is `dev`; for the rest, the supported splits are `train` and `dev`.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.miracl \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names bn hi sw te th yo \
    --splits dev \
    --corpus_embd_save_dir ./miracl/corpus_embd \
    --output_dir ./miracl/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./miracl/miracl_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 5. MLDR

MLDR supports evaluations in multiple languages, using language codes as dataset names: `ar`, `de`, `en`, `es`, `fr`, `hi`, `it`, `ja`, `ko`, `pt`, `ru`, `th`, `zh`. The available splits are `train`, `dev`, and `test`.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.mldr \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names hi \
    --splits test \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
119245

120246
### 6. MKQA
121247

248+
MKQA supports multi-language evaluation, using different languages as dataset names, including `en`, `ar`, `fi`, `ja`, `ko`, `ru`, `es`, `sv`, `he`, `th`, `da`, `de`, `fr`, `it`, `nl`, `pl`, `pt`, `hu`, `vi`, `ms`, `km`, `no`, `tr`, `zh_cn`, `zh_hk`, `zh_tw`. The supported split is `test`.
249+
250+
Here is an example for evaluation:
251+
252+
```shell
253+
python -m FlagEmbedding.evaluation.mkqa \
254+
--eval_name mkqa \
255+
--dataset_dir ./mkqa/data \
256+
--dataset_names en zh_cn \
257+
--splits test \
258+
--corpus_embd_save_dir ./mkqa/corpus_embd \
259+
--output_dir ./mkqa/search_results \
260+
--search_top_k 1000 \
261+
--rerank_top_k 100 \
262+
--cache_path ./cache/data \
263+
--overwrite False \
264+
--k_values 20 \
265+
--eval_output_method markdown \
266+
--eval_output_path ./mkqa/mkqa_eval_results.md \
267+
--eval_metrics qa_recall_at_20 \
268+
--embedder_name_or_path BAAI/bge-m3 \
269+
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
270+
--devices cuda:0 cuda:1 \
271+
--cache_dir ./cache/model \
272+
--reranker_query_max_length 512 \
273+
--reranker_max_length 1024
274+
```
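
Note that `qa_recall_at_20` differs from standard retrieval recall: for MKQA-style QA evaluation it is commonly computed as the fraction of queries for which at least one gold answer string appears in the top-20 retrieved passages. The sketch below illustrates that definition only; the function name and data shapes are illustrative, not FlagEmbedding's API:

```python
# Illustrative sketch of answer-recall@k as commonly defined for MKQA-style
# evaluation: a query is a hit if any gold answer string occurs in its top-k
# passages. This is an assumption about the metric, not FlagEmbedding's code.
from typing import Dict, List

def qa_recall_at_k(
    retrieved: Dict[str, List[str]],   # query id -> top-ranked passage texts
    answers: Dict[str, List[str]],     # query id -> gold answer strings
    k: int = 20,
) -> float:
    hits = 0
    for qid, passages in retrieved.items():
        top_k = [p.lower() for p in passages[:k]]
        if any(ans.lower() in p for ans in answers[qid] for p in top_k):
            hits += 1
    return hits / len(retrieved)

# Example: the single query's answer appears in its top passages -> 1.0
print(qa_recall_at_k({"q1": ["the capital of france is paris"]},
                     {"q1": ["Paris"]}))
```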
275+
276+
### 7. AIR-Bench
277+
278+
The AIR-Bench is mainly based on the official [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench/tree/main) framework, and it necessitates the use of official evaluation metrics. Below are some important variables:
279+
280+
- **`benchmark_version`**: Benchmark version.
281+
- **`task_types`**: Task types.
282+
- **`domains`**: Domains to evaluate.
283+
- **`languages`**: Languages to evaluate.
284+
285+
Here is an example for evaluation:
286+
287+
```shell
288+
python -m FlagEmbedding.evaluation.air_bench \
289+
--benchmark_version AIR-Bench_24.05 \
290+
--task_types qa long-doc \
291+
--domains arxiv \
292+
--languages en \
293+
--splits dev test \
294+
--output_dir ./air_bench/search_results \
295+
--search_top_k 1000 \
296+
--rerank_top_k 100 \
297+
--cache_dir ./cache/data \
298+
--overwrite False \
299+
--embedder_name_or_path BAAI/bge-m3 \
300+
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
301+
--devices cuda:0 cuda:1 \
302+
--model_cache_dir ./cache/model \
303+
--reranker_query_max_length 512 \
304+
--reranker_max_length 1024
305+
```

### 8. Custom Dataset

To evaluate a custom dataset, you can refer to the MLDR setup: you only need to rewrite the `DataLoader`, overriding the loading method for your dataset. A minimal sanity check for the data files is sketched after the examples below.

The example data for `corpus.jsonl`:

```json
{"id": "77628", "title": "Recover deleted cache", "text": "Is it possible to recover cache photos? The files were deleted by Clean Master to save space. I have no idea where to start. The photos are precious and are irreplaceable."}
{"id": "806", "title": "How do I undelete or recover deleted files on Android?", "text": "> **Possible Duplicate:** > How can I recover a deleted file on Android? Is there a way to recover deleted files on Android phones without using standard USB storage recovery tools?"}
{"id": "74923", "title": "Recovering deleted pictures", "text": "I recently deleted all of my pictures by mistake from my samsung galaxy s4. I went into my files and documents and deleted not realising it would delete all my pics! Is there a way for me to recover them? My phone is not rooted. I have not taken any pictures since but have received pictures through whatsapp?"}
{"id": "50864", "title": "How to recover deleted files on Android phone", "text": "I was a using an autocall recorder app on my HTC Wildfire. I saved a call on my phones SD card and in my Dropbox. However, I accidently deleted the saved call and it was removed from my dropbox file. I now need this call and I tried some data recovery software. I scanned both my phone and pc. The software found the deleted call and recovered it, but the file which has .AMR extension does not work. The size of the file is only 143kb. 1. What is the likelihood this file is corrupted/stiil intact? Can I check that? 2. Which software can I use to salvage/replay the AMR file?"}
{"id": "81285", "title": "How to recover deleted photo album saved on internal memory - Note 3", "text": "I have a Samsung Note 3 and I accidentally deleted an entire photo album from my phones gallery. I didn't enable my device to sync with Gmail. I didn't manually backup any of the data. The images were saved on my phone, not on the SD card. Is there any way for me to recover this deleted photo album? I Google'd and came across SDrescan but that won't work since the images were not initially saved on my SD card."}
```

The example data for `test_queries.jsonl`:

```json
{"id": "79085", "text": "HTC One Mini data recovery after root"}
```

The example data for `test_qrels.jsonl`:

```json
{"qid": "79085", "docid": "77628", "relevance": 1}
{"qid": "79085", "docid": "806", "relevance": 1}
{"qid": "79085", "docid": "74923", "relevance": 1}
{"qid": "79085", "docid": "50864", "relevance": 1}
{"qid": "79085", "docid": "81285", "relevance": 1}
```
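
Before plugging a custom dataset in, it can help to verify that the three files are consistent with each other. The following is a minimal sketch using only the standard library; the file names match the examples above, and it assumes one JSON object per line with the field names shown:

```python
# Minimal sanity check for a custom dataset in the format shown above.
# Assumption: JSONL files (one JSON object per line) with the example fields.
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

corpus = {doc["id"] for doc in load_jsonl("corpus.jsonl")}
queries = {q["id"] for q in load_jsonl("test_queries.jsonl")}
qrels = load_jsonl("test_qrels.jsonl")

for rel in qrels:
    # Every qrel must point at an existing query and document.
    assert rel["qid"] in queries, f"unknown query id: {rel['qid']}"
    assert rel["docid"] in corpus, f"unknown doc id: {rel['docid']}"
    assert isinstance(rel["relevance"], int), "relevance should be an integer"

print(f"{len(corpus)} docs, {len(queries)} queries, {len(qrels)} qrels: consistent")
```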