Commit 770f5af

add reranker score
1 parent: aeb25da

3 files changed: 104 additions & 46 deletions

File tree:

- examples/finetune/embedder/README.md
- examples/finetune/reranker/README.md
- scripts/add_reranker_score.py

examples/finetune/embedder/README.md

Lines changed: 26 additions & 10 deletions
````diff
@@ -55,12 +55,12 @@ python hn_mine.py \
 --use_gpu_for_searching
 ```
 
-- `input_file`: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
-- `output_file`: path to save the JSON data with mined hard negatives for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
-- `candidate_pool`: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
+- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning
+- **`negative_number`**: the number of sampled negatives
+- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
+- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
+- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
 
 ### Teacher Scores
 
@@ -82,10 +82,26 @@ python add_reranker_score.py \
 
 - `input_file`: path to the JSON data with mined hard negatives for finetuning
 - `output_file`: path to save the JSON data with reranker scores for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: where to sample negative. For example, `2-100` means sampling `negative_number` negatives from top2-top200 documents. **You can set larger value to reduce the difficulty of negatives (e.g., set it `60-300` to sample negatives from top60-300 passages)**
-- `candidate_pool`: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- **`use_fp16`**: Whether to use fp16 for inference. Default: True
+- **`devices`**: Devices to use for inference. Default: None; multiple values allowed
+- **`trust_remote_code`**: Whether to trust remote code. Default: False
+- **`reranker_name_or_path`**: The reranker name or path. Default: None
+- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
+- **`reranker_peft_path`**: The reranker PEFT path. Default: None
+- **`use_bf16`**: Whether to use bf16 for inference. Default: False
+- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
+- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
+- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
+- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
+- **`cache_dir`**: Cache directory for models. Default: None
+- **`reranker_batch_size`**: Batch size for inference. Default: 3000
+- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
+- **`reranker_max_length`**: Max length for reranking. Default: 512
+- **`normalize`**: Whether to normalize the reranking scores. Default: False
+- **`prompt`**: The prompt for the reranker. Default: None
+- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
+- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
+- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
 
 ## 3. Train
````
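For orientation, the sketch below mirrors what `add_reranker_score.py` does with these parameters: it loads a reranker through `FlagAutoReranker.from_finetuned` (the same entry point used in the `scripts/add_reranker_score.py` diff further down) and scores every (query, passage) pair. The teacher model name and the `pos_scores`/`neg_scores` output fields are illustrative assumptions, not taken from this commit.

```python
import json

from FlagEmbedding import FlagAutoReranker

# Minimal sketch of the scoring loop, assuming the usual FlagEmbedding
# triplet format {"query": ..., "pos": [...], "neg": [...]} and assuming
# the teacher scores are stored as `pos_scores`/`neg_scores`.
reranker = FlagAutoReranker.from_finetuned(
    model_name_or_path="BAAI/bge-reranker-v2-m3",  # assumed teacher reranker
    use_fp16=True,
)

with open("toy_finetune_data_minedHN.jsonl") as fin, \
        open("toy_finetune_data_score.jsonl", "w") as fout:
    for line in fin:
        d = json.loads(line)
        pairs = [(d["query"], passage) for passage in d["pos"] + d["neg"]]
        scores = reranker.compute_score(pairs)
        d["pos_scores"] = scores[:len(d["pos"])]
        d["neg_scores"] = scores[len(d["pos"]):]
        fout.write(json.dumps(d) + "\n")
```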

examples/finetune/reranker/README.md

Lines changed: 66 additions & 25 deletions
````diff
@@ -36,31 +36,72 @@ Train data should be a json file, where each line is a dict like this:
 
 See [example_data](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder/example_data) for more detailed files.
 
-### Hard Negatives
-
-Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives using the following commands:
-
-```shell
-git clone https://github.com/FlagOpen/FlagEmbedding.git
-cd FlagEmbedding/scripts
-```
-
-```shell
-python hn_mine.py \
---model_name_or_path BAAI/bge-base-en-v1.5 \
---input_file toy_finetune_data.jsonl \
---output_file toy_finetune_data_minedHN.jsonl \
---range_for_sampling 2-200 \
---negative_number 15 \
---use_gpu_for_searching
-```
-
-- `input_file`: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
-- `output_file`: path to save the JSON data with mined hard negatives for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
-- `candidate_pool`: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- Hard Negatives
+
+Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives using the following commands:
+
+```shell
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding/scripts
+```
+
+```shell
+python hn_mine.py \
+--model_name_or_path BAAI/bge-base-en-v1.5 \
+--input_file toy_finetune_data.jsonl \
+--output_file toy_finetune_data_minedHN.jsonl \
+--range_for_sampling 2-200 \
+--negative_number 15 \
+--use_gpu_for_searching
+```
+
+- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
+- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning
+- **`negative_number`**: the number of sampled negatives
+- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
+- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
+- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
+
+### Teacher Scores
+
+Teacher scores can be used for model distillation. You can obtain the scores using the following command:
+
+```shell
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding/scripts
+```
+
+```shell
+python add_reranker_score.py \
+--input_file toy_finetune_data_minedHN.jsonl \
+--output_file toy_finetune_data_score.jsonl \
+--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
+--reranker_query_max_length 512 \
+--reranker_max_length 1024
+```
+
+- `input_file`: path to the JSON data with mined hard negatives for finetuning
+- `output_file`: path to save the JSON data with reranker scores for finetuning
+- **`use_fp16`**: Whether to use fp16 for inference. Default: True
+- **`devices`**: Devices to use for inference. Default: None; multiple values allowed
+- **`trust_remote_code`**: Whether to trust remote code. Default: False
+- **`reranker_name_or_path`**: The reranker name or path. Default: None
+- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
+- **`reranker_peft_path`**: The reranker PEFT path. Default: None
+- **`use_bf16`**: Whether to use bf16 for inference. Default: False
+- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
+- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
+- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
+- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
+- **`cache_dir`**: Cache directory for models. Default: None
+- **`reranker_batch_size`**: Batch size for inference. Default: 3000
+- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
+- **`reranker_max_length`**: Max length for reranking. Default: 512
+- **`normalize`**: Whether to normalize the reranking scores. Default: False
+- **`prompt`**: The prompt for the reranker. Default: None
+- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
+- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
+- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
 
 ## 3. Train
````
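To make the `range_for_sampling` and `negative_number` semantics concrete, here is a simplified illustration of the sampling step; it is a sketch of the idea, not the actual `hn_mine.py` implementation.

```python
import random

# Simplified hard-negative sampling: keep only the configured rank window
# of the retrieval results, drop known positives, then randomly sample the
# requested number of negatives.
range_for_sampling = (2, 200)  # --range_for_sampling 2-200
negative_number = 15           # --negative_number 15

ranked_docs = [f"doc_{i}" for i in range(1, 1001)]  # pretend retrieval output
positives = {"doc_5", "doc_17"}

start, end = range_for_sampling
window = [d for d in ranked_docs[start - 1:end] if d not in positives]
hard_negatives = random.sample(window, negative_number)
print(hard_negatives)
```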

scripts/add_reranker_score.py

Lines changed: 12 additions & 11 deletions
```diff
@@ -78,16 +78,7 @@ class ModelArgs:
         default=None, metadata={"help": "The compress layers of lightweight reranker.", "nargs": "+"}
     )
 
-
-if __name__ == '__main__':
-    parser = HfArgumentParser((
-        ScoreArgs,
-        ModelArgs
-    ))
-    score_args, model_args = parser.parse_args_into_dataclasses()
-    eval_args: ScoreArgs
-    model_args: ModelArgs
-
+def main(score_args, model_args):
     reranker = FlagAutoReranker.from_finetuned(
         model_name_or_path=model_args.reranker_name_or_path,
         model_class=model_args.reranker_model_class,
@@ -138,4 +129,14 @@ class ModelArgs:
         for d in data:
             f.write(json.dumps(d) + '\n')
 
-    del reranker
+
+if __name__ == '__main__':
+    parser = HfArgumentParser((
+        ScoreArgs,
+        ModelArgs
+    ))
+    score_args, model_args = parser.parse_args_into_dataclasses()
+    score_args: ScoreArgs
+    model_args: ModelArgs
+    main(score_args, model_args)
+
```
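One practical effect of this refactor is that the scoring step can now be driven from Python instead of only through the CLI, since `main()` is importable. A hypothetical sketch (paths and model name are illustrative; `HfArgumentParser.parse_dict` fills the remaining fields from the dataclass defaults):

```python
from transformers import HfArgumentParser

# Hypothetical programmatic use of the refactored script: build the same
# dataclasses the CLI would parse, then call main() directly.
from add_reranker_score import ModelArgs, ScoreArgs, main

parser = HfArgumentParser((ScoreArgs, ModelArgs))
score_args, model_args = parser.parse_dict({
    "input_file": "toy_finetune_data_minedHN.jsonl",     # illustrative paths
    "output_file": "toy_finetune_data_score.jsonl",
    "reranker_name_or_path": "BAAI/bge-reranker-v2-m3",  # assumed teacher model
})
main(score_args, model_args)
```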
