# examples/finetune/reranker/README.md (68 additions, 68 deletions)
Train data should be a json file, where each line is a dict like this:
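The dict example itself falls outside this hunk. As a rough sketch (assuming the embedder fine-tuning format that this README later says the reranker shares), each line would look like:

```json
{"query": "an example query", "pos": ["a relevant passage"], "neg": ["an irrelevant passage", "another irrelevant passage"]}
```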
See [example_data](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder/example_data) for more detailed files.
### Hard Negatives
Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives following this command:
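The command itself falls outside this hunk. A minimal sketch is shown below; the script name `hn_mine.py` is an assumption, while the flags mirror the parameter list that follows:

```shell
# Illustrative sketch: the script name is an assumption; the flags match the
# parameters documented below. Values are placeholders.
python hn_mine.py \
    --input_file toy_finetune_data.jsonl \
    --output_file toy_finetune_data_minedHN.jsonl \
    --range_for_sampling 2-200 \
    --negative_number 15 \
    --use_gpu_for_searching True
```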
- **`input_file`**: JSON data for fine-tuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
- **`output_file`**: path to save the JSON data with mined hard negatives for fine-tuning.
- **`negative_number`**: the number of sampled negatives.
- **`range_for_sampling`**: where to sample negatives. For example, `2-200` means sampling `negative_number` negatives from the top-2 to top-200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top-60 to top-300 passages).**
- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is given, this script retrieves negatives from this file.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
### Teacher Scores
Teacher scores can be used for model distillation. You can obtain the scores using the following command:
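The command is likewise outside this hunk. A minimal sketch, assuming a script named `add_reranker_score.py` and the same `input_file`/`output_file` convention as the hard-negative script (both assumptions); the reranker flags are documented below:

```shell
# Illustrative sketch: the script name and the input/output flags are
# assumptions; the reranker flags are documented below.
python add_reranker_score.py \
    --input_file toy_finetune_data_minedHN.jsonl \
    --output_file toy_finetune_data_score.jsonl \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --reranker_max_length 512
```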
- **`reranker_name_or_path`**: The reranker name or path. Default: None
- **`reranker_model_class`**: The reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: The reranker peft path. Default: None
- **`use_bf16`**: Whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: Instruction for the query. Default: None
- **`query_instruction_format_for_rerank`**: Format for the query instruction. Default: `{}{}`
- **`passage_instruction_for_rerank`**: Instruction for the passage. Default: None
- **`passage_instruction_format_for_rerank`**: Format for the passage instruction. Default: `{}{}`
- **`cache_dir`**: Cache directory for models. Default: None
- **`reranker_batch_size`**: Batch size for inference. Default: 3000
- **`reranker_query_max_length`**: Max length for reranking queries. Default: None
- **`reranker_max_length`**: Max length for reranking. Default: 512
- **`normalize`**: Whether to normalize the reranking scores. Default: False
- **`prompt`**: The prompt for the reranker. Default: None
- **`cutoff_layers`**: The output layers of the layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: The compress ratio of the lightweight reranker. Default: 1
- **`compress_layers`**: The compress layers of the lightweight reranker. Default: None; multiple values allowed
Besides the negatives in this group, the in-batch negatives will also be used in fine-tuning.
For more training arguments, please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
### 4. Model merging via [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) [optional]
For more details please refer to [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
Fine-tuning the base bge model can improve its performance on the target task, but may lead to severe degeneration of the model's general capabilities.
You can fine-tune the base model on more tasks and merge them to achieve better performance.
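As a rough illustration of the merging step, a minimal sketch assuming the `mix_models` helper that the LM-Cocktail page describes (treat the exact signature and the `reranker` model type as assumptions):

```python
# Hedged sketch: merge the base reranker with the fine-tuned one to limit
# degeneration of general capabilities; API per the LM-Cocktail project.
from LM_Cocktail import mix_models

model = mix_models(
    model_names_or_paths=["BAAI/bge-reranker-base", "path/to/your/finetuned_model"],
    model_type="reranker",   # merging two rerankers
    weights=[0.5, 0.5],      # weights should sum to 1
    output_path="./mixed_reranker",
)
```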
### 5. Load your model
After fine-tuning the BGE model, you can easily load it in the same way as [here](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/baai_general_embedding#usage).
Please replace the `query_instruction_for_retrieval` with your instruction if you set a different value for hyper-parameter `--query_instruction_for_retrieval` when fine-tuning.
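For instance, a minimal loading sketch, assuming the `FlagModel` class from the linked usage section (the model path and instruction string are placeholders):

```python
# Hedged sketch of loading a fine-tuned BGE model with FlagEmbedding.
from FlagEmbedding import FlagModel

model = FlagModel(
    "path/to/your/finetuned_model",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)
embeddings = model.encode(["a sample sentence to embed"])
```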
### 6. Evaluate model
We provide [a simple script](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/research/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance.
A brief summary of how the script works:
1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
2. Encode the corpus and offload the embeddings into a `faiss` Flat index. By default, `faiss` also dumps the index on all available GPUs.
3. Encode the queries and search `100` nearest neighbors for each query.
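Steps 2 and 3 correspond roughly to the sketch below (not the actual script; the embedding arrays are random placeholders):

```python
# Hedged sketch of steps 2-3: a flat index over corpus embeddings, then a
# top-100 nearest-neighbor search for each query.
import faiss
import numpy as np

corpus_emb = np.random.rand(10000, 768).astype("float32")  # placeholder corpus embeddings
query_emb = np.random.rand(16, 768).astype("float32")      # placeholder query embeddings

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product over embedding dim
index.add(corpus_emb)
scores, neighbor_ids = index.search(query_emb, 100)  # 100 nearest neighbors per query
```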
You can check the data formats for the msmarco corpus on Hugging Face.
The data format for reranker is the same as [embedding fine-tune](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#2-data-format).
Besides, we strongly suggest [mining hard negatives](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/reranker#hard-negatives) to fine-tune the reranker.
## 3. Train
```
torchrun --nproc_per_node {number of gpus} \
-m run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-reranker-base \
--train_data ./toy_finetune_data.jsonl \
...
```
Besides the negatives in this group, the in-batch negatives will also be used in fine-tuning.
For more training arguments, please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
### 4. Model merging via [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail) [optional]
For more details please refer to [LM-Cocktail](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/research/LM_Cocktail).
Fine-tuning the base bge model can improve its performance on the target task, but may lead to severe degeneration of the model's general capabilities.
# research/old-examples/unified_finetune/README.md (4 additions, 4 deletions)
See [toy_train_data](./toy_train_data) for an example of training data.
## 3. Train
> **Note**: If you only want to fine-tune the dense embedding of `BAAI/bge-m3`, you can refer to [here](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/finetune/embedder#1-standard-model).
Here is a simple example of how to perform unified fine-tuning (dense embedding, sparse embedding and colbert) based on `BAAI/bge-m3`:
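The command block itself falls outside this hunk. A rough sketch is given below; the module path and the flags shown are assumptions, so see the script referenced next for the exact invocation:

```shell
# Illustrative sketch only: the module path and flags are assumptions.
torchrun --nproc_per_node {number of gpus} \
    -m FlagEmbedding.BGE_M3.run \
    --model_name_or_path BAAI/bge-m3 \
    --train_data ./toy_train_data \
    --unified_finetuning True \
    ...
```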
You can also refer to [this script](./unified_finetune_bge-m3_exmaple.sh) for more details. In this script, we use `deepspeed` to perform distributed training. Learn more about `deepspeed` at https://www.deepspeed.ai/getting-started/. Note that there are some important parameters to be modified in this script:
- `HOST_FILE_CONTENT`: Machines and GPUs for training. If you want to use multiple machines for training, please refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node (note that you should configure `pdsh` and `ssh` properly).
- `DS_CONFIG_FILE`: Path of the deepspeed config file. [Here](https://github.com/hanhainebula/FlagEmbedding/blob/new-flagembedding-v1/examples/finetune/ds_stage0.json) is an example of `ds_config.json`.
- `DATA_PATH`: One or more paths of training data. **Each path must be a directory containing one or more jsonl files**.
- `DEFAULT_BATCH_SIZE`: Default batch size for training. If you use the efficient batching strategy, which means you have split your data into different parts by sequence length, the batch size for each part is decided by the `get_file_batch_size()` function in [`BGE_M3/data.py`](../../BGE_M3/data.py). Before starting training, you should set the corresponding batch size for each part in this function according to the GPU memory of your machines. `DEFAULT_BATCH_SIZE` is used for any part whose sequence length is not covered in `get_file_batch_size()`.
- `EPOCHS`: Number of training epochs.
- `LEARNING_RATE`: The initial learning rate.
- `SAVE_PATH`: Path to save the fine-tuned model.
You should set these parameters appropriately.
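For concreteness, a hedged sketch of how these variables might be set at the top of the script (every value below is a placeholder, not a recommendation):

```shell
# Placeholder values for the script variables described above.
HOST_FILE_CONTENT="localhost slots=8"              # deepspeed hostfile entry
DS_CONFIG_FILE=./ds_config.json
DATA_PATH="./train_data_part1 ./train_data_part2"  # directories of jsonl files
DEFAULT_BATCH_SIZE=32
EPOCHS=5
LEARNING_RATE=1e-5
SAVE_PATH=./bge-m3-unified-finetuned
```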
For more detailed argument settings, please refer to [`BGE_M3/arguments.py`](../../BGE_M3/arguments.py).