Commit a525e44

update evaluation readme

1 parent ef30a53 commit a525e44

1 file changed: examples/evaluation/README.md (213 additions & 6 deletions)
@@ -101,29 +101,236 @@ First, we will introduce the commonly used variables, followed by an introduction
In the evaluation of MTEB, we primarily use the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders and restricts the output format of evaluation results to JSON. The following new variables have been introduced:

- **`languages`**: Languages to evaluate. Default: eng
- **`tasks`**: Tasks to evaluate. Default: None
- **`task_types`**: The task types to evaluate. Default: None
- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
- **`use_special_examples`**: Whether to use specific examples in `examples.py` for evaluation. Default: False

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.mteb \
    --eval_name mteb \
    --output_dir ./data/mteb/search_results \
    --languages eng \
    --tasks NFCorpus BiorxivClusteringS2S SciDocsRR \
    --eval_output_path ./mteb/mteb_eval_results.json \
    --embedder_name_or_path BAAI/bge-m3 \
    --devices cuda:7 \
    --cache_dir ./cache/model
```
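
Since the results are written as JSON, they can be inspected with a few lines of Python. This is a minimal sketch that assumes only that the file at `--eval_output_path` is standard JSON; the exact schema depends on the MTEB version and the tasks you ran:

```python
# Minimal sketch: inspect the JSON results written to --eval_output_path.
# Assumption: the file is standard JSON; its schema varies by MTEB version.
import json

with open("./mteb/mteb_eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Print the top-level structure to see how scores are organized.
print(json.dumps(results, indent=2)[:1000])
```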

### 2. BEIR

BEIR supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new variable has been introduced:

- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.beir \
    --eval_name beir \
    --dataset_dir ./beir/data \
    --dataset_names fiqa arguana cqadupstack \
    --splits test dev \
    --corpus_embd_save_dir ./beir/corpus_embd \
    --output_dir ./beir/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./beir/beir_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 3. MSMARCO

MSMARCO supports evaluations on both `passage` and `document`, providing the evaluation splits `dev`, `dl19`, and `dl20` for each.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.msmarco \
    --eval_name msmarco \
    --dataset_dir ./msmarco/data \
    --dataset_names passage \
    --splits dev dl19 dl20 \
    --corpus_embd_save_dir ./msmarco/corpus_embd \
    --output_dir ./msmarco/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite True \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./msmarco/msmarco_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 4. MIRACL

MIRACL supports evaluations in multiple languages, using language codes as dataset names: `ar`, `bn`, `en`, `es`, `fa`, `fi`, `fr`, `hi`, `id`, `ja`, `ko`, `ru`, `sw`, `te`, `th`, `zh`, `de`, `yo`. For `de` and `yo`, the only supported split is `dev`; for the rest, the supported splits are `train` and `dev`.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.miracl \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names bn hi sw te th yo \
    --splits dev \
    --corpus_embd_save_dir ./miracl/corpus_embd \
    --output_dir ./miracl/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./miracl/miracl_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```

### 5. MLDR

MLDR supports evaluations in multiple languages, using language codes as dataset names: `ar`, `de`, `en`, `es`, `fr`, `hi`, `it`, `ja`, `ko`, `pt`, `ru`, `th`, `zh`. The available splits are `train`, `dev`, and `test`.

Here is an example for evaluation:

```shell
python -m FlagEmbedding.evaluation.mldr \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names hi \
    --splits test \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024
```
119245

120246
### 6. MKQA
121247

248+
MKQA supports multi-language evaluation, using different languages as dataset names, including `en`, `ar`, `fi`, `ja`, `ko`, `ru`, `es`, `sv`, `he`, `th`, `da`, `de`, `fr`, `it`, `nl`, `pl`, `pt`, `hu`, `vi`, `ms`, `km`, `no`, `tr`, `zh_cn`, `zh_hk`, `zh_tw`. The supported split is `test`.
249+
250+
Here is an example for evaluation:
251+
252+
```shell
253+
python -m FlagEmbedding.evaluation.mkqa \
254+
--eval_name mkqa \
255+
--dataset_dir ./mkqa/data \
256+
--dataset_names en zh_cn \
257+
--splits test \
258+
--corpus_embd_save_dir ./mkqa/corpus_embd \
259+
--output_dir ./mkqa/search_results \
260+
--search_top_k 1000 \
261+
--rerank_top_k 100 \
262+
--cache_path ./cache/data \
263+
--overwrite False \
264+
--k_values 20 \
265+
--eval_output_method markdown \
266+
--eval_output_path ./mkqa/mkqa_eval_results.md \
267+
--eval_metrics qa_recall_at_20 \
268+
--embedder_name_or_path BAAI/bge-m3 \
269+
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
270+
--devices cuda:0 cuda:1 \
271+
--cache_dir ./cache/model \
272+
--reranker_query_max_length 512 \
273+
--reranker_max_length 1024
274+
```
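
Note that `qa_recall_at_20` differs from standard retrieval recall: for MKQA-style QA evaluation it is commonly computed as the fraction of queries for which at least one gold answer string appears in the top-20 retrieved passages. The sketch below illustrates that definition only; the function name and data shapes are illustrative, not FlagEmbedding's API:

```python
# Illustrative sketch of answer-recall@k as commonly defined for MKQA-style
# evaluation: a query is a hit if any gold answer string occurs in its top-k
# passages. This is an assumption about the metric, not FlagEmbedding's code.
from typing import Dict, List

def qa_recall_at_k(
    retrieved: Dict[str, List[str]],   # query id -> top-ranked passage texts
    answers: Dict[str, List[str]],     # query id -> gold answer strings
    k: int = 20,
) -> float:
    hits = 0
    for qid, passages in retrieved.items():
        top_k = [p.lower() for p in passages[:k]]
        if any(ans.lower() in p for ans in answers[qid] for p in top_k):
            hits += 1
    return hits / len(retrieved)

# Example: the single query's answer appears in its top passages -> 1.0
print(qa_recall_at_k({"q1": ["the capital of france is paris"]},
                     {"q1": ["Paris"]}))
```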
275+
276+
### 7. AIR-Bench
277+
278+
The AIR-Bench is mainly based on the official [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench/tree/main) framework, and it necessitates the use of official evaluation metrics. Below are some important variables:
279+
280+
- **`benchmark_version`**: Benchmark version.
281+
- **`task_types`**: Task types.
282+
- **`domains`**: Domains to evaluate.
283+
- **`languages`**: Languages to evaluate.
284+
285+
Here is an example for evaluation:
286+
287+
```shell
288+
python -m FlagEmbedding.evaluation.air_bench \
289+
--benchmark_version AIR-Bench_24.05 \
290+
--task_types qa long-doc \
291+
--domains arxiv \
292+
--languages en \
293+
--splits dev test \
294+
--output_dir ./air_bench/search_results \
295+
--search_top_k 1000 \
296+
--rerank_top_k 100 \
297+
--cache_dir ./cache/data \
298+
--overwrite False \
299+
--embedder_name_or_path BAAI/bge-m3 \
300+
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
301+
--devices cuda:0 cuda:1 \
302+
--model_cache_dir ./cache/model \
303+
--reranker_query_max_length 512 \
304+
--reranker_max_length 1024
305+
```

### 8. Custom Dataset

To evaluate a custom dataset, you can refer to the MLDR setup: you only need to rewrite the `DataLoader`, overriding the loading method for your dataset. A minimal sanity check for the data files is sketched after the examples below.

The example data for `corpus.jsonl`:

```json
{"id": "77628", "title": "Recover deleted cache", "text": "Is it possible to recover cache photos? The files were deleted by Clean Master to save space. I have no idea where to start. The photos are precious and are irreplaceable."}
{"id": "806", "title": "How do I undelete or recover deleted files on Android?", "text": "> **Possible Duplicate:** > How can I recover a deleted file on Android? Is there a way to recover deleted files on Android phones without using standard USB storage recovery tools?"}
{"id": "74923", "title": "Recovering deleted pictures", "text": "I recently deleted all of my pictures by mistake from my samsung galaxy s4. I went into my files and documents and deleted not realising it would delete all my pics! Is there a way for me to recover them? My phone is not rooted. I have not taken any pictures since but have received pictures through whatsapp?"}
{"id": "50864", "title": "How to recover deleted files on Android phone", "text": "I was a using an autocall recorder app on my HTC Wildfire. I saved a call on my phones SD card and in my Dropbox. However, I accidently deleted the saved call and it was removed from my dropbox file. I now need this call and I tried some data recovery software. I scanned both my phone and pc. The software found the deleted call and recovered it, but the file which has .AMR extension does not work. The size of the file is only 143kb. 1. What is the likelihood this file is corrupted/stiil intact? Can I check that? 2. Which software can I use to salvage/replay the AMR file?"}
{"id": "81285", "title": "How to recover deleted photo album saved on internal memory - Note 3", "text": "I have a Samsung Note 3 and I accidentally deleted an entire photo album from my phones gallery. I didn't enable my device to sync with Gmail. I didn't manually backup any of the data. The images were saved on my phone, not on the SD card. Is there any way for me to recover this deleted photo album? I Google'd and came across SDrescan but that won't work since the images were not initially saved on my SD card."}
```

The example data for `test_queries.jsonl`:

```json
{"id": "79085", "text": "HTC One Mini data recovery after root"}
```

The example data for `test_qrels.jsonl`:

```json
{"qid": "79085", "docid": "77628", "relevance": 1}
{"qid": "79085", "docid": "806", "relevance": 1}
{"qid": "79085", "docid": "74923", "relevance": 1}
{"qid": "79085", "docid": "50864", "relevance": 1}
{"qid": "79085", "docid": "81285", "relevance": 1}
```
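
Before plugging a custom dataset in, it can help to verify that the three files are consistent with each other. The following is a minimal sketch using only the standard library; the file names match the examples above, and it assumes one JSON object per line with the field names shown:

```python
# Minimal sanity check for a custom dataset in the format shown above.
# Assumption: JSONL files (one JSON object per line) with the example fields.
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

corpus = {doc["id"] for doc in load_jsonl("corpus.jsonl")}
queries = {q["id"] for q in load_jsonl("test_queries.jsonl")}
qrels = load_jsonl("test_qrels.jsonl")

for rel in qrels:
    # Every qrel must point at an existing query and document.
    assert rel["qid"] in queries, f"unknown query id: {rel['qid']}"
    assert rel["docid"] in corpus, f"unknown doc id: {rel['docid']}"
    assert isinstance(rel["relevance"], int), "relevance should be an integer"

print(f"{len(corpus)} docs, {len(queries)} queries, {len(qrels)} qrels: consistent")
```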