# 1. Introduction

In this example, we show how to use the provided scripts to make your fine-tuning process more convenient.

# 2. Installation

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
```

# 3. Usage

### Hard Negatives

Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following command:

```shell
python hn_mine.py \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--input_file toy_finetune_data.jsonl \
--output_file toy_finetune_data_minedHN.jsonl \
--range_for_sampling 2-200 \
--negative_number 15 \
--use_gpu_for_searching
```

- **`input_file`**: JSONL data for fine-tuning. The script retrieves the top-k documents for each query and randomly samples negatives from those top-k documents (excluding the positive documents).
- **`output_file`**: path to save the JSONL data with mined hard negatives for fine-tuning
- **`negative_number`**: the number of sampled negatives
- **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top-2 to top-200 documents. **You can set a larger range to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top-60 to top-300 passages)**
- **`candidate_pool`**: the pool to retrieve from. The default value is None, in which case the script retrieves from the combination of all `neg` entries in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a `candidate_pool` is provided, the script retrieves negatives from that file instead.
- **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives
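
The rank-window sampling described above can be sketched as follows. This is a simplified illustration of the idea, not the actual `hn_mine.py` implementation; the `mine_hard_negatives` helper and the toy document ids are hypothetical:

```python
import random

def mine_hard_negatives(ranked_ids, positive_ids, sample_range=(2, 200),
                        negative_number=15, seed=42):
    """Sample hard negatives for one query from a ranked candidate list.

    ranked_ids: candidate ids sorted by similarity to the query, best first.
    sample_range: 1-based (start, end) rank window, like --range_for_sampling.
    """
    start, end = sample_range
    positives = set(positive_ids)
    # Keep only the rank window, excluding any positives that fall inside it.
    window = [d for d in ranked_ids[start - 1:end] if d not in positives]
    random.seed(seed)
    # If the window holds fewer candidates than requested, take them all.
    return random.sample(window, min(negative_number, len(window)))

# Toy run: id 0 is the positive and ranks first, so it can never be sampled.
ranked = list(range(100))  # pretend ids 0..99 are already sorted by score
negs = mine_hard_negatives(ranked, positive_ids=[0],
                           sample_range=(2, 50), negative_number=5)
print(negs)
```

In the real script, the ranking comes from a faiss search over embeddings produced by `model_name_or_path`; the sketch only shows how the window and exclusion interact.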

### Teacher Scores

Teacher scores can be used for model distillation. You can obtain the scores using the following command:

```shell
python add_reranker_score.py \
--input_file toy_finetune_data_minedHN.jsonl \
--output_file toy_finetune_data_score.jsonl \
--reranker_name_or_path BAAI/bge-reranker-v2-m3
```

- **`input_file`**: path to the JSONL data with mined hard negatives for fine-tuning
- **`output_file`**: path to save the JSONL data with scores for fine-tuning
- **`use_fp16`**: whether to use fp16 for inference. Default: True
- **`devices`**: devices to use for inference. Default: None; multiple values allowed
- **`trust_remote_code`**: whether to trust remote code. Default: False
- **`reranker_name_or_path`**: the reranker name or path. Default: None
- **`reranker_model_class`**: the reranker model class. Available classes: ['auto', 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight']. Default: auto
- **`reranker_peft_path`**: the reranker PEFT path. Default: None
- **`use_bf16`**: whether to use bf16 for inference. Default: False
- **`query_instruction_for_rerank`**: instruction for the query. Default: None
- **`query_instruction_format_for_rerank`**: format for the query instruction. Default: `{}{}`
- **`passage_instruction_for_rerank`**: instruction for the passage. Default: None
- **`passage_instruction_format_for_rerank`**: format for the passage instruction. Default: `{}{}`
- **`cache_dir`**: cache directory for models. Default: None
- **`reranker_batch_size`**: batch size for inference. Default: 3000
- **`reranker_query_max_length`**: max length for reranking queries. Default: None
- **`reranker_max_length`**: max length for reranking. Default: 512
- **`normalize`**: whether to normalize the reranking scores. Default: False
- **`prompt`**: the prompt for the reranker. Default: None
- **`cutoff_layers`**: the output layers of the layerwise/lightweight reranker. Default: None
- **`compress_ratio`**: the compress ratio of the lightweight reranker. Default: 1
- **`compress_layers`**: the compress layers of the lightweight reranker. Default: None; multiple values allowed
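
Conceptually, the script walks over each training record and attaches a teacher score to every positive and negative passage. A minimal sketch of that transformation, with a hypothetical word-overlap scorer standing in for a real cross-encoder reranker (the parallel `pos_scores`/`neg_scores` lists are assumed here as the output shape):

```python
import json

def add_scores(records, score_fn):
    """Attach teacher scores to training records.

    records: dicts with "query", "pos" (positive passages), "neg" (negatives).
    score_fn: any callable scoring a (query, passage) pair; in practice this
    would be a cross-encoder reranker.
    Returns new records with "pos_scores"/"neg_scores" parallel to "pos"/"neg".
    """
    scored = []
    for r in records:
        out = dict(r)
        out["pos_scores"] = [score_fn(r["query"], p) for p in r["pos"]]
        out["neg_scores"] = [score_fn(r["query"], n) for n in r["neg"]]
        scored.append(out)
    return scored

# Hypothetical toy scorer: word-overlap ratio stands in for a real reranker.
def overlap(query, passage):
    q = set(query.split())
    return len(q & set(passage.split())) / max(len(q), 1)

records = [{"query": "red fox", "pos": ["the red fox"], "neg": ["blue sky"]}]
scored = add_scores(records, overlap)
print(json.dumps(scored[0]))
```

During distillation, the student embedding model is trained to match these teacher scores rather than only the binary positive/negative labels.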

### Split Data by Length

You can split the data using the following command:

```shell
python split_data_by_length.py \
--input_path train_data \
--output_dir train_data_split \
--cache_dir .cache \
--log_name .split_log \
--length_list 0 500 1000 2000 3000 4000 5000 6000 7000 \
--model_name_or_path BAAI/bge-m3 \
--num_proc 16 \
--overwrite False
```

- **`input_path`**: the path of the input data. (Required)
- **`output_dir`**: the directory for the output data. (Required)
- **`cache_dir`**: the cache directory. Default: None
- **`log_name`**: the name of the log file. Default: `.split_log`, which will be saved to `output_dir`
- **`length_list`**: the length boundaries to split on. Default: [0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000]
- **`model_name_or_path`**: the model name or path of the tokenizer. Default: `BAAI/bge-m3`
- **`num_proc`**: the number of processes. Default: 16
- **`overwrite`**: whether to overwrite the output file. Default: False
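
The bucketing behind `length_list` can be sketched as follows. This is a simplified illustration, not the actual `split_data_by_length.py` implementation: the `split_by_length` helper is hypothetical, and in a real run the lengths would come from tokenizing each record with the `model_name_or_path` tokenizer.

```python
from bisect import bisect_right

def split_by_length(lengths, boundaries):
    """Assign each sequence length to a bucket.

    boundaries: sorted bucket starts, e.g. [0, 500, 1000] gives buckets
    [0, 500), [500, 1000), and [1000, inf).
    Returns {bucket_start: [indices of sequences in that bucket]}.
    """
    buckets = {b: [] for b in boundaries}
    for i, n in enumerate(lengths):
        # bisect_right - 1 finds the rightmost boundary that is <= n.
        buckets[boundaries[bisect_right(boundaries, n) - 1]].append(i)
    return buckets

# Toy token counts for four documents (illustrative, not tokenizer output).
buckets = split_by_length([120, 640, 640, 2500], [0, 500, 1000, 2000])
print(buckets)
```

Grouping records of similar length lets each training batch be padded to a much shorter maximum, which saves memory and compute.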