Commit a2d650e

upload embedder inference

1 parent 4750923 commit a2d650e

4 files changed

Lines changed: 300 additions & 2 deletions


FlagEmbedding/finetune/embedder/encoder_only/base/runner.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -16,10 +16,14 @@ class EncoderOnlyEmbedderRunner(AbsEmbedderRunner):
     def load_tokenizer_and_model(self) -> Tuple[PreTrainedTokenizer, AbsEmbedderModel]:
         tokenizer = AutoTokenizer.from_pretrained(
             self.model_args.model_name_or_path,
+            cache_dir=self.model_args.cache_dir,
+            token=self.model_args.token,
             trust_remote_code=self.model_args.trust_remote_code
         )
         base_model = AutoModel.from_pretrained(
             self.model_args.model_name_or_path,
+            cache_dir=self.model_args.cache_dir,
+            token=self.model_args.token,
             trust_remote_code=self.model_args.trust_remote_code
         )
```
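In effect, the runner now forwards the user-supplied cache directory and Hugging Face access token when loading the tokenizer and base model. A minimal sketch of the equivalent direct call (the checkpoint name and values below are illustrative placeholders, not repo defaults):

```python
from transformers import AutoModel, AutoTokenizer

# equivalent direct loading once cache_dir and token are forwarded;
# the checkpoint name and values here are placeholders
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-base-en-v1.5",
    cache_dir="./cache/model",  # where downloaded weights are cached
    token=None,                 # or a Hugging Face access token for gated/private repos
    trust_remote_code=False,
)
model = AutoModel.from_pretrained(
    "BAAI/bge-base-en-v1.5",
    cache_dir="./cache/model",
    token=None,
    trust_remote_code=False,
)
```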

FlagEmbedding/finetune/embedder/encoder_only/m3/runner.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -79,6 +79,8 @@ def get_model(
     def load_tokenizer_and_model(self) -> Tuple[PreTrainedTokenizer, AbsEmbedderModel]:
         tokenizer = AutoTokenizer.from_pretrained(
             self.model_args.model_name_or_path,
+            cache_dir=self.model_args.cache_dir,
+            token=self.model_args.token,
             trust_remote_code=self.model_args.trust_remote_code
         )
```

examples/README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -112,7 +112,7 @@ torchrun --nproc_per_node 2 \
     --model_name_or_path BAAI/bge-reranker-large \
     --cache_dir ./cache/model \
     --train_data ./finetune/reranker/example_data/normal/examples.jsonl \
-    --cache_path ~/.cache \
+    --cache_path ./cache/data \
     --train_group_size 8 \
     --query_max_len 256 \
     --passage_max_len 256 \
@@ -131,7 +131,7 @@ torchrun --nproc_per_node 2 \
     --weight_decay 0.01 \
     --deepspeed ./finetune/ds_stage0.json \
     --logging_steps 1 \
-    --save_steps 1000 \
+    --save_steps 1000
 ```
 
 # 5. Evaluation
````
Lines changed: 292 additions & 0 deletions
(new file)

# Embedder

- [Model List](#model-list)
- [Usage](#usage)
An embedder can encode text into embeddings.

When given a query and a passage, the embedder encodes each separately, then uses the similarity between their embeddings as the similarity score.

For more detailed usage, see [embedder (encoder only)](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/embedder/encoder_only) or [embedder (decoder only)](https://github.com/hanhainebula/FlagEmbedding/tree/new-flagembedding-v1/examples/inference/embedder/decoder_only).
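As a minimal sketch of that scoring scheme (with hypothetical, L2-normalized vectors, for which the dot product equals cosine similarity):

```python
import numpy as np

# hypothetical, already L2-normalized query/passage embeddings
q = np.array([0.6, 0.8])
p = np.array([0.8, 0.6])

# on normalized vectors, the dot product is the cosine similarity
score = q @ p
print(score)  # 0.96
```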
## Model List
`bge` is short for `BAAI general embedding`.

| Model | Language | Description | query instruction for retrieval |
| :---- | :------: | :---------: | :------------------------------: |
| [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl) | English | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on few-shot examples | Provide instructions and few-shot examples freely based on the given task. |
| [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Multilingual | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | Multi-functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), multi-linguality, and multi-granularity (8192 tokens) | |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5) | Chinese | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | embedding model that maps text into a vector | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | a base-scale model with ability similar to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | embedding model that maps text into a vector | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
## Usage

### Using FlagEmbedding

#### 1. Auto Model

You can use `FlagAutoModel` to load the model. If the model isn't included in `model_mapping`, it won't load correctly; either modify the `model_mapping` file yourself or submit a pull request.

```python
from FlagEmbedding import FlagAutoModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagAutoModel.from_finetuned('BAAI/bge-large-zh-v1.5',
                                     query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                                     use_fp16=True,
                                     devices=['cuda:1'])  # setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_corpus(sentences_1)
embeddings_2 = model.encode_corpus(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
# the corpus can still be encoded with encode_corpus(), since passages don't need the instruction
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
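Under the hood, `FlagAutoModel` matches the model name against `model_mapping` to pick the appropriate wrapper class (such as the `FlagModel`, `BGEM3FlagModel`, `FlagLLMModel`, and `FlagICLModel` classes shown below), which is why a model missing from the mapping cannot be loaded this way.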
#### 2. Normal Model

`FlagModel` supports `BAAI/bge-large-en-v1.5`, `BAAI/bge-base-en-v1.5`, `BAAI/bge-small-en-v1.5`, `BAAI/bge-large-zh-v1.5`, `BAAI/bge-base-zh-v1.5`, `BAAI/bge-small-zh-v1.5`, `BAAI/bge-large-en`, `BAAI/bge-base-en`, `BAAI/bge-small-en`, `BAAI/bge-large-zh`, `BAAI/bge-base-zh`, and `BAAI/bge-small-zh`:

```python
from FlagEmbedding import FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True,
                  devices=['cuda:1'])  # setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_corpus(sentences_1)
embeddings_2 = model.encode_corpus(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
# the corpus can still be encoded with encode_corpus(), since passages don't need the instruction
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
#### 3. M3 Model

`BGEM3FlagModel` supports `BAAI/bge-m3`:

```python
from FlagEmbedding import BGEM3FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = BGEM3FlagModel('BAAI/bge-m3',
                       use_fp16=True,
                       pooling_method='cls',
                       devices=['cuda:1'])  # setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_corpus(
    sentences_1,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=False,
)
embeddings_2 = model.encode_corpus(
    sentences_2,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=False,
)
dense_similarity = embeddings_1["dense_vecs"] @ embeddings_2["dense_vecs"].T
print('dense similarity:', dense_similarity)
sparse_similarity = model.compute_lexical_matching_score(
    embeddings_1["lexical_weights"],
    embeddings_2["lexical_weights"],
)
print('sparse similarity:', sparse_similarity)

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(
    queries,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=False,
)
p_embeddings = model.encode_corpus(
    passages,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=False,
)
dense_scores = q_embeddings["dense_vecs"] @ p_embeddings["dense_vecs"].T
print('dense scores:', dense_scores)
sparse_scores = model.compute_lexical_matching_score(
    q_embeddings["lexical_weights"],
    p_embeddings["lexical_weights"],
)
print('sparse scores:', sparse_scores)
```
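The dense and sparse scores can also be fused into a single hybrid score; a weighted sum is one common recipe (the weights below are illustrative assumptions, not values from this README):

```python
# illustrative hybrid scoring, assuming both score matrices are
# numpy arrays of the same shape
w_dense, w_sparse = 0.7, 0.3  # assumed weights; tune for your task
hybrid_scores = w_dense * dense_scores + w_sparse * sparse_scores
print('hybrid scores:', hybrid_scores)
```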
#### 4. LLM-based Model

`FlagLLMModel` supports `BAAI/bge-multilingual-gemma2`, `gte-Qwen2-7B-instruct`, `e5-mistral-7b-instruct`, etc.:

```python
from FlagEmbedding import FlagLLMModel

model = FlagLLMModel('BAAI/bge-multilingual-gemma2',
                     query_instruction_for_retrieval="Given a question, retrieve passages that answer the question.",
                     query_instruction_format="<instruct>{}\n<query>{}",
                     use_fp16=True,
                     devices=['cuda:1'])  # setting use_fp16 to True speeds up computation with a slight performance degradation
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
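The `query_instruction_format` template has two slots, filled with the retrieval instruction and the query text. A minimal sketch of the query text that ends up being encoded (assuming the template is applied with plain `str.format`, as the two `{}` slots suggest):

```python
query_instruction_format = "<instruct>{}\n<query>{}"
instruction = "Given a question, retrieve passages that answer the question."
query = "query_1"

# the formatted text encoded for each query under this assumption
print(query_instruction_format.format(instruction, query))
# <instruct>Given a question, retrieve passages that answer the question.
# <query>query_1
```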
#### 5. LLM-based ICL Model

`FlagICLModel` supports `BAAI/bge-en-icl`:

```python
from FlagEmbedding import FlagICLModel

examples = [
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'what is a virtual interface',
        'response': "A virtual interface is a software-defined abstraction that mimics the behavior and characteristics of a physical network interface. It allows multiple logical network connections to share the same physical network interface, enabling efficient utilization of network resources. Virtual interfaces are commonly used in virtualization technologies such as virtual machines and containers to provide network connectivity without requiring dedicated hardware. They facilitate flexible network configurations and help in isolating network traffic for security and management purposes."
    },
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'causes of back pain in female for a week',
        'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."
    }
]
model = FlagICLModel(
    'BAAI/bge-en-icl',
    query_instruction_for_retrieval="Given a question, retrieve passages that answer the question.",
    query_instruction_format="<instruct>{}\n<query>{}",
    examples_for_task=examples,
    examples_instruction_format="<instruct>{}\n<query>{}\n<response>{}",
    use_fp16=True,
    devices=['cuda:1']
)  # setting use_fp16 to True speeds up computation with a slight performance degradation
queries = [
    "how much protein should a female eat",
    "summit define"
]
passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
### Using Sentence-Transformers

You can also use the `bge` models with [sentence-transformers](https://www.sbert.net/):

```shell
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

For the s2p (short query to long passage) retrieval task, each short query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions), but the instruction is not needed for passages.

```python
from sentence_transformers import SentenceTransformer

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
```
### Using LangChain

You can use `bge` in LangChain like this:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="为这个句子生成表示以用于检索相关文章:"
)
```
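The wrapper follows LangChain's standard `Embeddings` interface, so queries and documents are embedded through separate methods; a quick usage sketch:

```python
# embed_query prepends query_instruction; embed_documents does not
q_embedding = model.embed_query("query_1")
p_embeddings = model.embed_documents(["样例文档-1", "样例文档-2"])
```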
### Using HuggingFace Transformers

With the transformers package, you can use the model like this: first pass your input through the transformer model, then take the last hidden state of the first token (i.e., `[CLS]`) as the sentence embedding.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# for the s2p (short query to long passage) retrieval task, add an instruction to each query (no instruction for passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```
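The normalized embeddings can then be scored just as in the earlier examples:

```python
# cosine similarity reduces to a dot product on the L2-normalized embeddings
similarity = sentence_embeddings @ sentence_embeddings.T
print("Similarity:", similarity)
```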
