Commit aeb25da

add reranker score
1 parent fa8a76d commit aeb25da

2 files changed

Lines changed: 28 additions & 1 deletion

examples/finetune/embedder/README.md

Lines changed: 25 additions & 0 deletions
@@ -62,6 +62,31 @@ python hn_mine.py \
 - `candidate_pool`: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
 - `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
 
+### Teacher Scores
+
+Teacher scores can be used for model distillation. You can obtain the scores using the following command:
+
+```shell
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding/scripts
+```
+
+```shell
+python add_reranker_score.py \
+--input_file toy_finetune_data_minedHN.jsonl \
+--output_file toy_finetune_data_score.jsonl \
+--range_for_sampling 2-200 \
+--negative_number 15 \
+--use_gpu_for_searching
+```
+
+- `input_file`: path to save JSON data with mined hard negatives for finetuning
+- `output_file`: path to save JSON data with scores for finetuning
+- `negative_number`: the number of sampled negatives
+- `range_for_sampling`: where to sample negative. For example, `2-100` means sampling `negative_number` negatives from top2-top200 documents. **You can set larger value to reduce the difficulty of negatives (e.g., set it `60-300` to sample negatives from top60-300 passages)**
+- `candidate_pool`: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
+- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.
+
 ## 3. Train
 
 Detailed examples of various fine-tuning can be found in the bash files located in the corresponding folders. Here, we simply provide the training methods for the `standard model`, `bge-m3`, `bge-multilingual-gemma2` and `bge-en-icl`.
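
Note on the hunk above: the `range_for_sampling` and `negative_number` options describe sampling a fixed number of negatives from a window of a ranked retrieval list. The following is a minimal sketch of that idea, not the code in `hn_mine.py` or `add_reranker_score.py`. The helper names `parse_range` and `sample_negatives`, the `seed` parameter, and the choice to treat `START-END` as a 0-indexed, end-exclusive slice are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a 'START-END' string such as '2-200' into two integers."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if not 0 <= start < end:
        raise ValueError(f"invalid sampling range: {spec!r}")
    return start, end


def sample_negatives(
    ranked_docs: List[str],
    range_for_sampling: str,
    negative_number: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Sample `negative_number` documents from a window of a ranked list.

    Assumption: the window is the slice ranked_docs[start:end]; the real
    scripts may treat the bounds differently.
    """
    start, end = parse_range(range_for_sampling)
    window = ranked_docs[start:end]
    if len(window) <= negative_number:
        # Not enough candidates in the window: keep all of them.
        return list(window)
    return random.Random(seed).sample(window, negative_number)


# Toy ranked list standing in for retrieval results, best match first.
ranked = [f"doc{i}" for i in range(300)]
negatives = sample_negatives(ranked, "2-200", 15, seed=0)
```

Under this reading, a later window such as `60-300` draws from lower-ranked passages, which is why the README describes it as reducing the difficulty of the negatives.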

scripts/add_reranker_score.py

Lines changed: 3 additions & 1 deletion
@@ -136,4 +136,6 @@ class ModelArgs:
 
 with open(score_args.output_file, 'w') as f:
     for d in data:
-        f.write(json.dumps(d) + '\n')
+        f.write(json.dumps(d) + '\n')
+
+del reranker
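
Note on the hunk above: the script writes one JSON object per line and then drops its reference to the reranker. The sketch below shows that write-then-release pattern under stated assumptions. `toy_score` is a hypothetical word-overlap stand-in for a real reranker, and the `pos_scores` / `neg_scores` field names are assumed rather than taken from this diff.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


def add_scores(record: Dict, score_fn: Callable[[str, str], float]) -> Dict:
    """Return a copy of `record` with one score per positive and negative passage."""
    scored = dict(record)
    # Field names below are assumptions for illustration.
    scored["pos_scores"] = [score_fn(record["query"], p) for p in record["pos"]]
    scored["neg_scores"] = [score_fn(record["query"], n) for n in record["neg"]]
    return scored


def write_jsonl(path: str, records: List[Dict]) -> None:
    """Write one JSON object per line, each terminated by a newline."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


data = [
    {
        "query": "what is faiss",
        "pos": ["faiss is a similarity search library"],
        "neg": ["bananas are yellow"],
    }
]
scorer = toy_score
scored = [add_scores(d, scorer) for d in data]

out_path = os.path.join(tempfile.mkdtemp(), "toy_finetune_data_score.jsonl")
write_jsonl(out_path, scored)

# Drop the reference once scoring is done. For a real model object this lets
# Python reclaim it when no other references remain.
del scorer

with open(out_path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
```

`del` only removes the name binding; whether memory is actually released depends on other live references and, for GPU models, on the framework's own allocator.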