Mining hard negatives is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following command:
```shell
python hn_mine.py \
--input_file toy_finetune_data.jsonl \
--output_file toy_finetune_data_minedHN.jsonl \
--range_for_sampling 2-200 \
--negative_number 15 \
--use_gpu_for_searching \
--embedder_name_or_path BAAI/bge-base-en-v1.5
```
- **`input_file`**: JSON data for fine-tuning. This script retrieves the top-k documents for each query and randomly samples negatives from them, excluding the positive documents (see the example record after this argument list).
- **`range_for_sampling`**: where to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top-2 to top-200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top-60 to top-300 passages).**
- **`candidate_pool`**: The pool to retrieve from. The default value is None, in which case the script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is provided, the script retrieves negatives from this file.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
- **`search_batch_size`**: batch size for searching. Default is 64.
- **`embedder_name_or_path`**: The name or path of the embedder.
- **`embedder_model_class`**: Class of the model used for embedding (current options include 'encoder-only-base', 'encoder-only-m3', 'decoder-only-base', 'decoder-only-icl'). Default is None. For a custom model, you should set this argument.
- **`normalize_embeddings`**: Set to `True` to normalize embeddings.
- **`pooling_method`**: The pooling method for the embedder.
- **`use_fp16`**: Use FP16 precision for inference.
- **`devices`**: List of devices used for inference.
- **`query_instruction_for_retrieval`**, **`query_instruction_format_for_retrieval`**: Instruction and instruction format for queries during retrieval.
- **`examples_for_task`**, **`examples_instruction_format`**: Example tasks and their instruction format. This is only used when `embedder_model_class` is set to `decoder-only-icl`.
- **`trust_remote_code`**: Set to `True` to trust remote code execution.
- **`cache_dir`**: Cache directory for models.
- **`embedder_batch_size`**: Batch size for embedding.
- **`embedder_query_max_length`**, **`embedder_passage_max_length`**: Maximum lengths for embedding queries and passages.
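For reference, each line of the input file is one JSON object containing a query, its positive passages, and a list of negatives; `hn_mine.py` writes the mined hard negatives into the `neg` field of the output file. The record below is illustrative only; the field names follow the FlagEmbedding fine-tuning data format, and the texts are made up. A `candidate_pool` file, by contrast, is a jsonl file whose lines are objects with a single `text` key.

```json
{"query": "how do solar panels generate electricity?", "pos": ["Photovoltaic cells in solar panels convert sunlight directly into electricity."], "neg": ["The Amazon river has the largest discharge volume of any river in the world."]}
```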
### Teacher Scores
Teacher scores can be used for model distillation. You can obtain the scores using the following command:

```shell
python add_reranker_score.py \
--input_file toy_finetune_data_minedHN.jsonl \
--output_file toy_finetune_data_score.jsonl \
--reranker_name_or_path BAAI/bge-reranker-v2-m3
```
- **`input_file`**: path to the JSON data with mined hard negatives for fine-tuning
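The scored file keeps the same records and adds reranker scores for the positive and negative passages. The sketch below is a minimal sanity check; the key names `pos_scores` / `neg_scores` are an assumption based on the script's purpose, so inspect the record keys if your version writes something else.

```python
import json

# Peek at the first scored record written by add_reranker_score.py.
# "pos_scores" / "neg_scores" are assumed key names; print record.keys()
# to confirm what the script actually writes in your version.
with open("toy_finetune_data_score.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))
print(record.get("pos_scores"), record.get("neg_scores"))
```

The `hn_mine.py` options documented above are declared as dataclass fields in the script source; the relevant fragment follows.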
metadata={"help": "The input file for hard negative mining."}
21
+
)
22
+
output_file: str=field(
23
+
metadata={"help": "The output file for hard negative mining."}
24
+
)
25
+
candidate_pool: Optional[str] =field(
26
+
default=None, metadata={"help": "The candidate pool for hard negative mining. If provided, it should be a jsonl file, each line is a dict with a key 'text'."}
27
+
)
28
+
range_for_sampling: str=field(
29
+
default="10-210", metadata={"help": "The range to sample negatives."}
30
+
)
31
+
negative_number: int=field(
32
+
default=15, metadata={"help": "The number of negatives."}
33
+
)
34
+
use_gpu_for_searching: bool=field(
35
+
default=False, metadata={"help": "Whether to use faiss-gpu for searching."}
36
+
)
37
+
search_batch_size: int=field(
38
+
default=64, metadata={"help": "The batch size for searching."}
39
+
)
40
+
41
+
42
+
@dataclass
43
+
classModelArgs:
44
+
"""
45
+
Model arguments for embedder.
46
+
"""
47
+
embedder_name_or_path: str=field(
48
+
metadata={"help": "The embedder name or path.", "required": True}
49
+
)
50
+
embedder_model_class: Optional[str] =field(
51
+
default=None, metadata={"help": "The embedder model class. Available classes: ['encoder-only-base', 'encoder-only-m3', 'decoder-only-base', 'decoder-only-icl']. Default: None. For the custom model, you need to specifiy the model class.", "choices": ["encoder-only-base", "encoder-only-m3", "decoder-only-base", "decoder-only-icl"]}
52
+
)
53
+
normalize_embeddings: bool=field(
54
+
default=True, metadata={"help": "whether to normalize the embeddings"}
55
+
)
56
+
pooling_method: str=field(
57
+
default="cls", metadata={"help": "The pooling method fot the embedder."}
58
+
)
59
+
use_fp16: bool=field(
60
+
default=True, metadata={"help": "whether to use fp16 for inference"}
61
+
)
62
+
devices: Optional[str] =field(
63
+
default=None, metadata={"help": "Devices to use for inference.", "nargs": "+"}
0 commit comments