
Commit ef30a53

update mteb eval
1 parent: ab7462f

4 files changed

Lines changed: 133 additions & 13 deletions


FlagEmbedding/evaluation/mteb/arguments.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ class MTEBEvalArgs(AbsEvalArgs):
         default=None, metadata={"help": "Tasks to evaluate. Default: None"}
     )
     task_types: List[str] = field(
-        default=None, metadata={"help": "The tasks to evaluate. Default: None"}
+        default=None, metadata={"help": "The task types to evaluate. Default: None"}
     )
     use_special_instructions: bool = field(
         default=False, metadata={"help": "Whether to use specific instructions in `prompts.py` for evaluation. Default: False"}

FlagEmbedding/evaluation/mteb/examples.py

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

FlagEmbedding/evaluation/mteb/runner.py

Lines changed: 2 additions & 12 deletions
@@ -10,6 +10,7 @@
 from .arguments import MTEBEvalArgs
 from .searcher import MTEBEvalDenseRetriever, MTEBEvalReranker
 from .prompts import get_task_def_by_task_name_and_type
+from .examples import examples_dict
 
 logger = logging.getLogger(__name__)
 
@@ -133,18 +134,7 @@ def run(self):
 
             if self.eval_args.use_special_examples:
                 try:
-                    eg_file_path = f'./examples/{task_name}.csv'
-                    eg_pairs = []
-                    df = pd.read_csv(eg_file_path)
-                    for i in range(len(df)):
-                        task_def = self.retriever.get_instruction()
-                        eg_pairs.append(
-                            {
-                                'instruct': task_def,
-                                'query': df[df.keys()[0]][i],
-                                'response': df[df.keys()[1]][i]
-                            }
-                        )
+                    eg_pairs = examples_dict[task_name]
                     self.retriever.set_examples(eg_pairs)
                 except:
                     logger.logger.info(f"No examples found for {task_name}")
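
The `examples.py` diff is collapsed above ("Large diffs are not rendered by default"), so the committed contents of `examples_dict` are not visible here. The sketch below shows the structure it presumably has, inferred from the CSV-loading code this commit removes: a mapping from MTEB task name to a list of example dicts with `instruct`, `query`, and `response` keys. The task name and strings are placeholders, not the repository's actual examples.

```python
# Hypothetical illustration of the assumed examples_dict layout in
# FlagEmbedding/evaluation/mteb/examples.py; task name and texts are placeholders.
examples_dict = {
    "ExampleRetrievalTask": [
        {
            "instruct": "Given a query, retrieve relevant passages.",  # task definition
            "query": "what is dense retrieval",                        # example query
            "response": "Dense retrieval encodes queries and passages into vectors ...",  # example relevant passage
        },
    ],
}

# After this commit, runner.py looks the examples up by task name:
#     eg_pairs = examples_dict[task_name]
#     self.retriever.set_examples(eg_pairs)
```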

examples/evaluation/README.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# Evaluation

After finetuning, the model needs to be evaluated. To facilitate this, we have provided scripts for assessing it on various datasets, including **MTEB**, **BEIR**, **MSMARCO**, **MIRACL**, **MLDR**, **MKQA**, and **AIR-Bench**. You can find the specific bash scripts in the respective folders. This document provides an overview of these evaluations.

First, we will introduce the commonly used variables, followed by an introduction to the variables for each dataset.

## Introduction

### 1. EvalArgs

**Parameters for evaluation setup** (a short usage sketch follows the list):

- **`eval_name`**: Name of the evaluation task (e.g., msmarco, beir, miracl).

- **`dataset_dir`**: Path to the dataset directory. This can be:
  1. A local path to perform evaluation on your own dataset (must exist). It should contain:
     - `corpus.jsonl`
     - `<split>_queries.jsonl`
     - `<split>_qrels.jsonl`
  2. A path to store datasets downloaded via API. Provide `None` to use the cache directory.

- **`force_redownload`**: Set to `true` to force redownload of the dataset.

- **`dataset_names`**: List of dataset names to evaluate, or `None` to evaluate all available datasets.

- **`splits`**: Dataset splits to evaluate. Default is `test`.

- **`corpus_embd_save_dir`**: Directory to save corpus embeddings. If `None`, embeddings will not be saved.

- **`output_dir`**: Directory to save evaluation results.

- **`search_top_k`**: Top-K results for initial retrieval.

- **`rerank_top_k`**: Top-K results for reranking.

- **`cache_path`**: Cache directory for datasets.

- **`token`**: Token used for accessing the model.

- **`overwrite`**: Set to `true` to overwrite existing evaluation results.

- **`ignore_identical_ids`**: Set to `true` to ignore identical IDs in search results.

- **`k_values`**: List of K values for evaluation (e.g., [1, 3, 5, 10, 100, 1000]).

- **`eval_output_method`**: Format for outputting evaluation results (options: 'json', 'markdown'). Default is `markdown`.

- **`eval_output_path`**: Path to save the evaluation output.

- **`eval_metrics`**: Metrics used for evaluation (e.g., ['ndcg_at_10', 'recall_at_10']).
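
The sketch below shows one way these parameters are typically supplied, assuming the `EvalArgs` dataclasses (such as `MTEBEvalArgs` from `FlagEmbedding/evaluation/mteb/arguments.py`) are parsed from command-line flags with `transformers.HfArgumentParser`, as the `field(metadata={"help": ...})` pattern in `arguments.py` suggests. The flag values are placeholders; in practice the provided bash scripts set these flags for you.

```python
# A minimal sketch, not the official entry point: parse EvalArgs-style flags
# into the MTEBEvalArgs dataclass. Field names follow the list above; any
# field not passed keeps its default.
from transformers import HfArgumentParser
from FlagEmbedding.evaluation.mteb.arguments import MTEBEvalArgs

parser = HfArgumentParser(MTEBEvalArgs)
(eval_args,) = parser.parse_args_into_dataclasses(args=[
    "--eval_name", "mteb",
    "--output_dir", "./mteb/search_results",
    "--eval_output_method", "json",
    "--k_values", "1", "3", "5", "10", "100", "1000",
])
print(eval_args.eval_name, eval_args.k_values)
```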
### 2. ModelArgs

**Parameters for Model Configuration** (a short example follows the list):

- **`embedder_name_or_path`**: The name or path of the embedder.

- **`embedder_model_class`**: Class of the model used for embedding (options include 'auto', 'encoder-only-base', etc.). Default is `auto`.

- **`normalize_embeddings`**: Set to `true` to normalize embeddings.

- **`use_fp16`**: Use FP16 precision for inference.

- **`devices`**: List of devices used for inference.

- **`query_instruction_for_retrieval`**, **`query_instruction_format_for_retrieval`**: Instruction and instruction format for queries during retrieval.

- **`examples_for_task`**, **`examples_instruction_format`**: Few-shot examples for the task and their instruction format.

- **`trust_remote_code`**: Set to `true` to trust remote code execution.

- **`reranker_name_or_path`**: The name or path of the reranker.

- **`reranker_model_class`**: Reranker model class (options include 'auto', 'decoder-only-base', etc.). Default is `auto`.

- **`reranker_peft_path`**: Path to a PEFT (parameter-efficient fine-tuning) adapter for the reranker.

- **`use_bf16`**: Use BF16 precision for inference.

- **`query_instruction_for_rerank`**, **`query_instruction_format_for_rerank`**: Instruction and instruction format for queries during reranking.

- **`passage_instruction_for_rerank`**, **`passage_instruction_format_for_rerank`**: Instruction and instruction format for passages during reranking.

- **`cache_dir`**: Cache directory for models.

- **`embedder_batch_size`**, **`reranker_batch_size`**: Batch sizes for embedding and reranking.

- **`embedder_query_max_length`**, **`embedder_passage_max_length`**: Maximum lengths for embedding queries and passages.

- **`reranker_query_max_length`**, **`reranker_max_length`**: Maximum length for reranker queries and for the full reranker input.

- **`normalize`**: Normalize the reranking scores.

- **`prompt`**: Prompt for the reranker.

- **`cutoff_layers`**, **`compress_ratio`**, **`compress_layers`**: Parameters for configuring the output and compression of layerwise or lightweight rerankers.
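
For illustration, here is a typical embedder-plus-reranker configuration expressed as values for the fields above. The model names and numbers are placeholders chosen for the example, not requirements; any field left out keeps its default.

```python
# Illustrative values for the ModelArgs fields above; in practice they are
# passed as the corresponding CLI flags in the evaluation scripts.
model_args_example = {
    "embedder_name_or_path": "BAAI/bge-large-en-v1.5",
    "normalize_embeddings": True,
    "use_fp16": True,
    "devices": ["cuda:0"],
    "query_instruction_for_retrieval": "Represent this sentence for searching relevant passages: ",
    "reranker_name_or_path": "BAAI/bge-reranker-v2-m3",
    "embedder_batch_size": 256,
    "reranker_batch_size": 256,
    "embedder_query_max_length": 512,
    "embedder_passage_max_length": 512,
    "reranker_max_length": 1024,
}
```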
## Usage

### 1. MTEB

In the evaluation of MTEB, we primarily utilize the official [MTEB](https://github.com/embeddings-benchmark/mteb) code, which supports only the assessment of embedders and restricts the output format of evaluation results to JSON. The following new variables have been introduced (see `arguments.py` and `runner.py` under `FlagEmbedding/evaluation/mteb/`):

- **`task_types`**: The task types to evaluate. Default: None
- **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
- **`use_special_examples`**: Whether to use the per-task examples in `examples.py` (see `runner.py`).
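
A sketch of setting these MTEB-specific variables, reusing the `MTEBEvalArgs` dataclass from `arguments.py`; the values are placeholders and the runner invocation that consumes them is omitted.

```python
# Sketch only: construct MTEBEvalArgs with the MTEB-specific fields.
# Values are placeholders; field names come from arguments.py and runner.py.
from FlagEmbedding.evaluation.mteb.arguments import MTEBEvalArgs

mteb_args = MTEBEvalArgs(
    eval_name="mteb",
    output_dir="./mteb/search_results",
    eval_output_method="json",          # the MTEB code outputs JSON results
    task_types=["Retrieval", "STS"],    # restrict evaluation to these task types
    use_special_instructions=True,      # use the task instructions from prompts.py
    use_special_examples=True,          # use the per-task examples from examples.py
)
```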

### 2. BEIR

### 3. MSMARCO

### 4. MIRACL

### 5. MLDR

### 6. MKQA

### 7. AIR-Bench

### 8. Custom Dataset
