
Commit fb105af

committed: update evaluation readme
1 parent f3cc40a commit fb105af

1 file changed: 4 additions & 4 deletions

File tree:

examples/evaluation/README.md
```diff
@@ -2,7 +2,7 @@
 
 After finetuning, the model needs to be evaluated. To facilitate this, we have provided scripts for assessing it on various datasets, including **MTEB**, **BEIR**, **MSMARCO**, **MIRACL**, **MLDR**, **MKQA**, and **AIR-Bench**. You can find the specific bash scripts in the respective folders. This document provides an overview of these evaluations.
 
-First, we will introduce the commonly used variables, followed by an introduction to the variables for each dataset.
+First, we will introduce the commonly used parameters, followed by an introduction to the parameters for each dataset.
 
 ## Introduction
 
```

```diff
@@ -99,7 +99,7 @@ First, we will introduce the commonly used variables, followed by an introductio
 
 ### 1. MTEB
 
-In the evaluation of MTEB, we primarily utilize the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders. Additionally, it restricts the output format of evaluation results to JSON. The following new variables have been introduced:
+In the evaluation of MTEB, we primarily utilize the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders. Additionally, it restricts the output format of evaluation results to JSON. The following new parameters have been introduced:
 
 - **`languages`**: Languages to evaluate. Default: eng
 - **`tasks`**: Tasks to evaluate. Default: None
```
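
For illustration, here is a minimal sketch of how these two parameters might be passed on the command line, following the `python -m FlagEmbedding.evaluation.mteb \` invocation visible in the next hunk header. Only `--languages` and `--tasks` are documented by this diff; the `--embedder_name_or_path` model flag and the example task names are assumptions and may differ in the actual script:

```bash
# Hypothetical sketch: evaluate an embedder on two English MTEB tasks.
# Only --languages and --tasks are documented in the diff above; the
# model flag and task names are assumed placeholders.
python -m FlagEmbedding.evaluation.mteb \
    --embedder_name_or_path BAAI/bge-m3 \
    --languages eng \
    --tasks NFCorpus SciFact
```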
```diff
@@ -123,7 +123,7 @@ python -m FlagEmbedding.evaluation.mteb \
 
 ### 2. BEIR
 
-[BEIR](https://github.com/beir-cellar/beir/) supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new variables have been introduced:
+[BEIR](https://github.com/beir-cellar/beir/) supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new parameters have been introduced:
 
 - **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
 
```
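
As a hedged sketch only: assuming a `FlagEmbedding.evaluation.beir` entry point parallel to the `FlagEmbedding.evaluation.mteb` one shown in the hunk header above (the entry-point name and model flag are assumptions, not confirmed by this diff), the new boolean might be passed like this:

```bash
# Hypothetical sketch: run BEIR with the task-specific instructions
# from prompts.py enabled. Everything except --use_special_instructions
# is an assumption.
python -m FlagEmbedding.evaluation.beir \
    --embedder_name_or_path BAAI/bge-m3 \
    --use_special_instructions True
```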

```diff
@@ -275,7 +275,7 @@ python -m FlagEmbedding.evaluation.mkqa \
 
 ### 7. AIR-Bench
 
-The AIR-Bench is mainly based on the official [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench/tree/main) framework, and it necessitates the use of official evaluation metrics. Below are some important variables:
+The AIR-Bench is mainly based on the official [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench/tree/main) framework, and it necessitates the use of official evaluation metrics. Below are some important parameters:
 
 - **`benchmark_version`**: Benchmark version.
 - **`task_types`**: Task types.
```
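
A sketch of a possible AIR-Bench run, assuming a `FlagEmbedding.evaluation.air_bench` entry point mirroring the `FlagEmbedding.evaluation.mkqa` one in the hunk header, and assuming version/task values of the kind the official AIR-Bench framework accepts; only `benchmark_version` and `task_types` come from this diff:

```bash
# Hypothetical sketch: evaluate one AIR-Bench version and task type.
# The entry-point name, model flag, and example values are assumptions;
# only --benchmark_version and --task_types appear in this diff.
python -m FlagEmbedding.evaluation.air_bench \
    --embedder_name_or_path BAAI/bge-m3 \
    --benchmark_version AIR-Bench_24.05 \
    --task_types qa
```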
