## examples/finetune/embedder/README.md (26 additions, 10 deletions)
```shell
python hn_mine.py \
    ...
    --use_gpu_for_searching
```
- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning.
- **`negative_number`**: the number of negatives to sample.
- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-100` means sampling `negative_number` negatives from the top-2 to top-100 documents. **You can set a larger range to reduce the difficulty of the negatives (e.g., `60-300` samples negatives from the top-60 to top-300 passages).**
- **`candidate_pool`**: the pool to retrieve from. The default is None, in which case the script retrieves from the combination of all `neg` entries in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, the script retrieves negatives from it.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
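The interplay between `range_for_sampling` and `negative_number` can be sketched in a few lines of Python. This is a minimal illustration of the sampling rule described above, not the script's actual implementation; the function name and the with-replacement fallback for small windows are assumptions:

```python
import random

def sample_hard_negatives(ranked_docs, positives, sampling_range=(2, 100),
                          negative_number=7, seed=42):
    """Sample hard negatives from a rank window of retrieved documents.

    ranked_docs: documents sorted by retrieval score, best first (rank 1 first).
    positives: documents that must never be sampled as negatives.
    sampling_range: 1-based inclusive rank window, e.g. (2, 100) for `2-100`.
    """
    lo, hi = sampling_range
    # Slice the 1-based rank window [lo, hi] and drop any positive documents.
    window = [d for d in ranked_docs[lo - 1:hi] if d not in positives]
    rng = random.Random(seed)
    # If the window holds fewer candidates than requested, fall back to
    # sampling with replacement (a sketch-level choice, not the script's).
    if len(window) < negative_number:
        return [rng.choice(window) for _ in range(negative_number)]
    return rng.sample(window, negative_number)
```

Widening the window (e.g., `60-300`) pushes sampling toward lower-ranked, easier negatives, which is exactly the difficulty knob the flag exposes.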
- **`input_file`**: path to the JSON data with mined hard negatives for finetuning.
- **`output_file`**: path to save the JSON data with scores for finetuning.
- **`use_fp16`**: whether to use fp16 for inference. Default: True
- **`devices`**: devices to use for inference. Default: None; multiple values allowed
- **`reranker_name_or_path`**: the reranker name or path. Default: None
- **`reranker_model_class`**: the reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: the reranker PEFT path. Default: None
- **`use_bf16`**: whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: instruction for the query. Default: None
- **`query_instruction_format_for_rerank`**: format for the query instruction. Default: `{}{}`
- **`passage_instruction_for_rerank`**: instruction for the passage. Default: None
- **`passage_instruction_format_for_rerank`**: format for the passage instruction. Default: `{}{}`
- **`cache_dir`**: cache directory for models. Default: None
- **`reranker_batch_size`**: batch size for inference. Default: 3000
- **`reranker_query_max_length`**: max length for reranking queries. Default: None
- **`reranker_max_length`**: max length for reranking. Default: 512
- **`normalize`**: whether to normalize the reranking scores. Default: False
- **`prompt`**: the prompt for the reranker. Default: None
- **`cutoff_layers`**: the output layers of the layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: the compress ratio of the lightweight reranker. Default: 1
- **`compress_layers`**: the compress layers of the lightweight reranker. Default: None; multiple values allowed
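A note on `normalize`: rerankers emit unbounded logits, and a common way to normalize them into comparable (0, 1) scores is an elementwise sigmoid. The sketch below shows that mapping; it is an assumption about what the option does, not a quote of the script's code:

```python
import math

def normalize_scores(raw_scores):
    """Map raw reranker logits into (0, 1) with a sigmoid.

    The sigmoid is monotonic, so the ranking induced by the raw
    scores is preserved; only the scale changes.
    """
    return [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]
```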
## examples/finetune/reranker/README.md (66 additions, 25 deletions)
Train data should be a JSON file where each line is a dict like this:

See [example_data](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder/example_data) for more detailed files.
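The example dict itself does not appear in this hunk. The sketch below shows the general query/pos/neg shape used across FlagEmbedding's finetune examples; the field names here are assumptions, so treat example_data as the authoritative format:

```python
import json

# Hypothetical training line in the query / pos / neg shape used by
# FlagEmbedding's finetune examples; verify field names against example_data.
line = {
    "query": "what is hard negative mining?",
    "pos": ["Hard negative mining selects difficult negatives for training."],
    "neg": ["The weather in Paris is mild in spring."],
}
print(json.dumps(line))  # one JSON object per line of the train file
```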
### Hard Negatives

Hard negatives are a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following command:
- **`input_file`**: JSON data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
- **`output_file`**: path to save the JSON data with mined hard negatives for finetuning.
- **`negative_number`**: the number of negatives to sample.
- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-100` means sampling `negative_number` negatives from the top-2 to top-100 documents. **You can set a larger range to reduce the difficulty of the negatives (e.g., `60-300` samples negatives from the top-60 to top-300 passages).**
- **`candidate_pool`**: the pool to retrieve from. The default is None, in which case the script retrieves from the combination of all `neg` entries in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, the script retrieves negatives from it.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
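The mining command itself is elided in this hunk. Using only the flags documented above, an invocation would look roughly like this; the file names are placeholders, the script may require additional arguments (such as the embedder model), and its path may differ in your checkout:

```shell
python hn_mine.py \
    --input_file finetune_data.jsonl \
    --output_file finetune_data_minedHN.jsonl \
    --range_for_sampling 2-200 \
    --negative_number 15 \
    --use_gpu_for_searching
```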
### Teacher Scores

Teacher scores can be used for model distillation. You can obtain the scores using the following command:
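One common way to use such teacher scores for distillation is to turn each query's candidate scores into a soft target distribution with a temperature softmax, which the student is then trained to match. The sketch below covers only that soft-label step (the training loop, and whether this repo distills exactly this way, are outside what this diff shows):

```python
import math

def soft_labels(teacher_scores, temperature=1.0):
    """Convert one query's teacher scores into a soft target distribution."""
    scaled = [s / temperature for s in teacher_scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Raising the temperature flattens the distribution, softening the teacher's preferences; temperature 1.0 reproduces the plain softmax over the scores.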
- **`reranker_name_or_path`**: the reranker name or path. Default: None
- **`reranker_model_class`**: the reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: the reranker PEFT path. Default: None
- **`use_bf16`**: whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: instruction for the query. Default: None
- **`query_instruction_format_for_rerank`**: format for the query instruction. Default: `{}{}`
- **`passage_instruction_for_rerank`**: instruction for the passage. Default: None
- **`passage_instruction_format_for_rerank`**: format for the passage instruction. Default: `{}{}`
- **`cache_dir`**: cache directory for models. Default: None
- **`reranker_batch_size`**: batch size for inference. Default: 3000
- **`reranker_query_max_length`**: max length for reranking queries. Default: None
- **`reranker_max_length`**: max length for reranking. Default: 512
- **`normalize`**: whether to normalize the reranking scores. Default: False
- **`prompt`**: the prompt for the reranker. Default: None
- **`cutoff_layers`**: the output layers of the layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: the compress ratio of the lightweight reranker. Default: 1
- **`compress_layers`**: the compress layers of the lightweight reranker. Default: None; multiple values allowed
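For the layerwise/lightweight rerankers, each layer listed in `cutoff_layers` contributes its own score for a query-passage pair. The sketch below shows one plausible way such per-layer scores could be selected and averaged; the real model's aggregation may differ, so this illustrates what the option selects rather than how the model implements it:

```python
def select_layer_scores(per_layer_scores, cutoff_layers=None):
    """Keep scores only from the requested output layers and average them.

    per_layer_scores: mapping from layer index to that layer's score.
    cutoff_layers: layer indices to keep; None keeps every layer.
    """
    keep = per_layer_scores if cutoff_layers is None else {
        layer: s for layer, s in per_layer_scores.items() if layer in cutoff_layers
    }
    return sum(keep.values()) / len(keep)
```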