Commit 770f5af

add reranker score
1 parent: aeb25da

3 files changed: 104 additions & 46 deletions

File tree:

- examples/finetune/embedder/README.md
- examples/finetune/reranker/README.md
- scripts/add_reranker_score.py

examples/finetune/embedder/README.md

Lines changed: 26 additions & 10 deletions
````diff
@@ -55,12 +55,12 @@ python hn_mine.py \
 --use_gpu_for_searching
 ```
 
-- `input_file`: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
-- `output_file`: path to save the JSON data with mined hard negatives for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
-- `candidate_pool`: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
+- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning
+- **`negative_number`**: the number of sampled negatives
+- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
+- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
+- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
 
 ### Teacher Scores
 
@@ -82,10 +82,26 @@ python add_reranker_score.py \
 
 - `input_file`: path to the JSON data with mined hard negatives for finetuning
 - `output_file`: path to save the JSON data with reranker scores for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: where to sample negative. For example, `2-100` means sampling `negative_number` negatives from top2-top200 documents. **You can set larger value to reduce the difficulty of negatives (e.g., set it `60-300` to sample negatives from top60-300 passages)**
-- `candidate_pool`: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- **`use_fp16`**: Whether to use fp16 for inference. Default: True
+- **`devices`**: Devices to use for inference. Default: None; multiple values allowed
+- **`trust_remote_code`**: Whether to trust remote code. Default: False
+- **`reranker_name_or_path`**: The reranker name or path. Default: None
+- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
+- **`reranker_peft_path`**: The reranker PEFT path. Default: None
+- **`use_bf16`**: Whether to use bf16 for inference. Default: False
+- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
+- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
+- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
+- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
+- **`cache_dir`**: Cache directory for models. Default: None
+- **`reranker_batch_size`**: Batch size for inference. Default: 3000
+- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
+- **`reranker_max_length`**: Max length for reranking. Default: 512
+- **`normalize`**: Whether to normalize the reranking scores. Default: False
+- **`prompt`**: The prompt for the reranker. Default: None
+- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
+- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
+- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
 
 ## 3. Train
````
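For orientation, the sketch below mirrors what `add_reranker_score.py` does with these parameters: it loads a reranker through `FlagAutoReranker.from_finetuned` (the same entry point used in the `scripts/add_reranker_score.py` diff further down) and scores every (query, passage) pair. The teacher model name and the `pos_scores`/`neg_scores` output fields are illustrative assumptions, not taken from this commit.

```python
import json

from FlagEmbedding import FlagAutoReranker

# Minimal sketch of the scoring loop, assuming the usual FlagEmbedding
# triplet format {"query": ..., "pos": [...], "neg": [...]} and assuming
# the teacher scores are stored as `pos_scores`/`neg_scores`.
reranker = FlagAutoReranker.from_finetuned(
    model_name_or_path="BAAI/bge-reranker-v2-m3",  # assumed teacher reranker
    use_fp16=True,
)

with open("toy_finetune_data_minedHN.jsonl") as fin, \
        open("toy_finetune_data_score.jsonl", "w") as fout:
    for line in fin:
        d = json.loads(line)
        pairs = [(d["query"], passage) for passage in d["pos"] + d["neg"]]
        scores = reranker.compute_score(pairs)
        d["pos_scores"] = scores[:len(d["pos"])]
        d["neg_scores"] = scores[len(d["pos"]):]
        fout.write(json.dumps(d) + "\n")
```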

examples/finetune/reranker/README.md

Lines changed: 66 additions & 25 deletions
````diff
@@ -36,31 +36,72 @@ Train data should be a json file, where each line is a dict like this:
 
 See [example_data](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder/example_data) for more detailed files.
 
-### Hard Negatives
-
-Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives using the following commands:
-
-```shell
-git clone https://github.com/FlagOpen/FlagEmbedding.git
-cd FlagEmbedding/scripts
-```
-
-```shell
-python hn_mine.py \
---model_name_or_path BAAI/bge-base-en-v1.5 \
---input_file toy_finetune_data.jsonl \
---output_file toy_finetune_data_minedHN.jsonl \
---range_for_sampling 2-200 \
---negative_number 15 \
---use_gpu_for_searching
-```
-
-- `input_file`: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
-- `output_file`: path to save the JSON data with mined hard negatives for finetuning
-- `negative_number`: the number of sampled negatives
-- `range_for_sampling`: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
-- `candidate_pool`: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
-- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+- Hard Negatives
+
+Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives using the following commands:
+
+```shell
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding/scripts
+```
+
+```shell
+python hn_mine.py \
+--model_name_or_path BAAI/bge-base-en-v1.5 \
+--input_file toy_finetune_data.jsonl \
+--output_file toy_finetune_data_minedHN.jsonl \
+--range_for_sampling 2-200 \
+--negative_number 15 \
+--use_gpu_for_searching
+```
+
+- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
+- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning
+- **`negative_number`**: the number of sampled negatives
+- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
+- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from that file.
+- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
+
+### Teacher Scores
+
+Teacher scores can be used for model distillation. You can obtain the scores using the following command:
+
+```shell
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding/scripts
+```
+
+```shell
+python add_reranker_score.py \
+--input_file toy_finetune_data_minedHN.jsonl \
+--output_file toy_finetune_data_score.jsonl \
+--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
+--reranker_query_max_length 512 \
+--reranker_max_length 1024
+```
+
+- `input_file`: path to the JSON data with mined hard negatives for finetuning
+- `output_file`: path to save the JSON data with reranker scores for finetuning
+- **`use_fp16`**: Whether to use fp16 for inference. Default: True
+- **`devices`**: Devices to use for inference. Default: None; multiple values allowed
+- **`trust_remote_code`**: Whether to trust remote code. Default: False
+- **`reranker_name_or_path`**: The reranker name or path. Default: None
+- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
+- **`reranker_peft_path`**: The reranker PEFT path. Default: None
+- **`use_bf16`**: Whether to use bf16 for inference. Default: False
+- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
+- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
+- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
+- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
+- **`cache_dir`**: Cache directory for models. Default: None
+- **`reranker_batch_size`**: Batch size for inference. Default: 3000
+- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
+- **`reranker_max_length`**: Max length for reranking. Default: 512
+- **`normalize`**: Whether to normalize the reranking scores. Default: False
+- **`prompt`**: The prompt for the reranker. Default: None
+- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
+- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
+- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
 
 ## 3. Train
````
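To make the `range_for_sampling` and `negative_number` semantics concrete, here is a simplified illustration of the sampling step; it is a sketch of the idea, not the actual `hn_mine.py` implementation.

```python
import random

# Simplified hard-negative sampling: keep only the configured rank window
# of the retrieval results, drop known positives, then randomly sample the
# requested number of negatives.
range_for_sampling = (2, 200)  # --range_for_sampling 2-200
negative_number = 15           # --negative_number 15

ranked_docs = [f"doc_{i}" for i in range(1, 1001)]  # pretend retrieval output
positives = {"doc_5", "doc_17"}

start, end = range_for_sampling
window = [d for d in ranked_docs[start - 1:end] if d not in positives]
hard_negatives = random.sample(window, negative_number)
print(hard_negatives)
```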

scripts/add_reranker_score.py

Lines changed: 12 additions & 11 deletions
```diff
@@ -78,16 +78,7 @@ class ModelArgs:
         default=None, metadata={"help": "The compress layers of lightweight reranker.", "nargs": "+"}
     )
 
-
-if __name__ == '__main__':
-    parser = HfArgumentParser((
-        ScoreArgs,
-        ModelArgs
-    ))
-    score_args, model_args = parser.parse_args_into_dataclasses()
-    eval_args: ScoreArgs
-    model_args: ModelArgs
-
+def main(score_args, model_args):
     reranker = FlagAutoReranker.from_finetuned(
         model_name_or_path=model_args.reranker_name_or_path,
         model_class=model_args.reranker_model_class,
@@ -138,4 +129,14 @@ class ModelArgs:
         for d in data:
             f.write(json.dumps(d) + '\n')
 
-    del reranker
+
+if __name__ == '__main__':
+    parser = HfArgumentParser((
+        ScoreArgs,
+        ModelArgs
+    ))
+    score_args, model_args = parser.parse_args_into_dataclasses()
+    score_args: ScoreArgs
+    model_args: ModelArgs
+    main(score_args, model_args)
+
```
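One practical effect of this refactor is that the scoring step can now be driven from Python instead of only through the CLI, since `main()` is importable. A hypothetical sketch (paths and model name are illustrative; `HfArgumentParser.parse_dict` fills the remaining fields from the dataclass defaults):

```python
from transformers import HfArgumentParser

# Hypothetical programmatic use of the refactored script: build the same
# dataclasses the CLI would parse, then call main() directly.
from add_reranker_score import ModelArgs, ScoreArgs, main

parser = HfArgumentParser((ScoreArgs, ModelArgs))
score_args, model_args = parser.parse_dict({
    "input_file": "toy_finetune_data_minedHN.jsonl",     # illustrative paths
    "output_file": "toy_finetune_data_score.jsonl",
    "reranker_name_or_path": "BAAI/bge-reranker-v2-m3",  # assumed teacher model
})
main(score_args, model_args)
```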
