Commit 8c8b52d

Merge pull request #1076 from ZiyiXia/master
Reform README
2 parents 06a2113 + 3aae73d commit 8c8b52d

1 file changed

Lines changed: 67 additions & 24 deletions

File tree

README.md

@@ -1,5 +1,8 @@
 <h1 align="center">FlagEmbedding</h1>
 <p align="center">
+    <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d">
+        <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow">
+    </a>
     <a href="https://github.com/FlagOpen/FlagEmbedding">
         <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
     </a>
@@ -17,6 +20,8 @@
 <h4 align="center">
     <p>
         <a href=#news>News</a> |
+        <a href=#installation>Installation</a> |
+        <a href=#quick-start>Quick Start</a> |
         <a href="#projects">Projects</a> |
         <a href=#model-list>Model List</a> |
         <a href="#contributor">Contributor</a> |
@@ -40,6 +45,13 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
 - 7/26/2024: Release a new embedding model [bge-en-icl](https://huggingface.co/BAAI/bge-en-icl), an embedding model that incorporates in-context learning capabilities. By providing task-relevant query-response examples, it can encode semantically richer queries, further enhancing the semantic representation ability of the embeddings. :fire:
 - 7/26/2024: Release a new embedding model [bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2), a multilingual embedding model based on gemma-2-9b, which supports multiple languages and diverse downstream tasks, achieving new SOTA on multilingual benchmarks (MIRACL, MTEB-fr, and MTEB-pl). :fire:
 - 7/26/2024: Release a new lightweight reranker [bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight), a lightweight reranker based on gemma-2-9b, which supports token compression and layerwise lightweight operations, and can still ensure good performance while saving a significant amount of resources. :fire:
+
+
+
+<details>
+<summary>More</summary>
+<!-- ### More -->
+
 - 6/7/2024: Release a new benchmark [MLVU](https://github.com/JUNJIE99/MLVU), the first comprehensive benchmark specifically designed for long video understanding. MLVU features an extensive range of video durations, a diverse collection of video sources, and a set of evaluation tasks uniquely tailored for long-form video understanding. :fire:
 - 5/21/2024: Release a new benchmark [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench) together with Jina AI, Zilliz, HuggingFace, and other partners. AIR-Bench focuses on a fair out-of-distribution evaluation for Neural IR & RAG. It generates synthetic data for benchmarking w.r.t. diverse domains and languages. It is dynamic and will be updated on a regular basis. [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) :fire:
 - 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a few synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
@@ -57,12 +69,6 @@ It is the first embedding model which supports all three retrieval methods, achi
 - 09/12/2023: New models:
     - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
     - **Updated embedding model**: release `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance their retrieval ability without instruction.
-
-
-<details>
-<summary>More</summary>
-<!-- ### More -->
-
 - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): add a script to mine hard negatives and support adding an instruction during fine-tuning.
 - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
 - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
@@ -72,28 +78,67 @@ It is the first embedding model which supports all three retrieval methods, achi
 
 </details>
 
+## Installation
+- Using pip:
+```
+pip install -U FlagEmbedding
+```
+- Install from source:
+Clone the repository
+```
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding
+pip install .
+```
+For development in editable mode:
+```
+pip install -e .
+```
+
+## Quick Start
+First, load one of the BGE embedding models:
+```
+from FlagEmbedding import FlagModel
 
+model = FlagModel('BAAI/bge-base-en-v1.5',
+                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
+                  use_fp16=True)
+```
+Then, feed some sentences to the model and get their embeddings:
+```
+sentences_1 = ["I love NLP", "I love machine learning"]
+sentences_2 = ["I love BGE", "I love text retrieval"]
+embeddings_1 = model.encode(sentences_1)
+embeddings_2 = model.encode(sentences_2)
+```
+Once we have the embeddings, we can compute their similarity:
+```
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+```
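The Quick Start above scores sentence pairs with a matrix product of their embeddings; since BGE embeddings are typically L2-normalized, each entry is a cosine similarity. A minimal sketch of that math, using toy unit vectors in place of `model.encode()` output (no FlagEmbedding install needed):

```python
import numpy as np

# Toy stand-ins for model.encode() output: two L2-normalized
# embeddings per sentence list (real BGE vectors have 768+ dims).
embeddings_1 = np.array([[0.6, 0.8], [1.0, 0.0]])
embeddings_2 = np.array([[0.8, 0.6], [0.0, 1.0]])

# similarity[i, j] is the cosine similarity between
# sentences_1[i] and sentences_2[j].
similarity = embeddings_1 @ embeddings_2.T
print(similarity.shape)  # (2, 2)
```

The result is a full pairwise similarity matrix, which is why one matrix product suffices for many-to-many comparison.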
 
 ## Projects
 
-### BGE-M3([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
-In this project, we introduce BGE-M3, the first embedding model which supports multiple retrieval modes, multilingual and multi-granularity retrieval.
-- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
-- Multi-Linguality: It can support more than 100 working languages.
-- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
+### BGE-M3 ([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
+
+
+In this project, we introduce BGE-M3, the first embedding model which supports:
+- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
+- **Multi-Linguality**: It supports more than 100 working languages.
+- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
 
-We propose a novel self-knowledge distillation approach to improve the performance of a single retrieval mode.
-We optimize the batching strategy, enabling a large batch size, which can be used simply when fine-tuning with long text or large language models.
-We also construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long text.
 **The training code and fine-tuning data will be open-sourced in the near future.**
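Since BGE-M3 produces dense, sparse, and multi-vector scores for the same query-passage pair, a common way to combine them is a weighted sum. A minimal illustrative sketch of that hybrid ranking on toy scores (the scores and equal weights are assumptions for illustration, not values from the BGE-M3 paper or code):

```python
import numpy as np

# Toy scores standing in for BGE-M3's three retrieval outputs for
# one query against three candidate passages.
dense_scores = np.array([0.62, 0.41, 0.55])     # dense (CLS) similarity
sparse_scores = np.array([0.30, 0.05, 0.20])    # lexical-weight overlap
multivec_scores = np.array([0.58, 0.37, 0.50])  # multi-vector (ColBERT-style)

# Hybrid score: weighted sum of the three modes (equal weights assumed).
w = (1 / 3, 1 / 3, 1 / 3)
hybrid = w[0] * dense_scores + w[1] * sparse_scores + w[2] * multivec_scores

best = int(np.argmax(hybrid))  # index of the top-ranked passage
```

In practice the weights are a tuning knob: leaning on the sparse score favors exact keyword matches, while the dense and multi-vector scores capture semantic relevance.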
 
 ### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
 In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.
 
 Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
 
+
+
 ### [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
-We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic data samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computing resources.
+We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on one 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts.
+
 
 ### [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)
 
@@ -104,13 +149,10 @@ More details please refer to our [paper](https://arxiv.org/abs/2401.03462) and [
 
 
 ### [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
-
-Model merging has been used to improve the performance of a single model.
-We find this method is also useful for large language models and dense embedding models,
-and design the LM-Cocktail strategy which automatically merges fine-tuned models and the base model using a simple function to compute merging weights.
-LM-Cocktail can be used to improve performance on a target domain without decreasing
-the general capabilities beyond the target domain.
-It can also be used to generate a model for new tasks without fine-tuning.
+
+LM-Cocktail automatically merges fine-tuned models and the base model using a simple function to compute merging weights.
+LM-Cocktail can be used to improve performance on a target domain without decreasing the general capabilities beyond the target domain,
+as well as to generate a model for new tasks without fine-tuning.
 You can use it to merge LLMs (e.g., Llama) or embedding models.
 For more details, please refer to our report: [LM-Cocktail](https://arxiv.org/abs/2311.13534) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).
 
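The core of model merging as described here is a weighted average of corresponding parameters. A minimal sketch of that idea on toy parameter dicts (`merge` and the two-entry "models" are hypothetical stand-ins, not LM-Cocktail's actual API, which operates on full checkpoints):

```python
import numpy as np

# Hypothetical toy "checkpoints": parameter name -> weight array.
base = {"w": np.array([1.0, 1.0])}
finetuned = {"w": np.array([3.0, 5.0])}

def merge(models, weights):
    """Element-wise weighted average of parameter dicts.

    `weights` should sum to 1; LM-Cocktail computes such weights
    automatically, whereas here they are supplied by hand.
    """
    merged = {}
    for name in models[0]:
        merged[name] = sum(w * m[name] for w, m in zip(weights, models))
    return merged

merged = merge([base, finetuned], [0.5, 0.5])
```

Because merging happens purely in parameter space, it needs no training data or gradient steps, which is why it can produce a model for a new task without fine-tuning.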

@@ -119,7 +161,7 @@ More details please refer to our report: [LM-Cocktail](https://arxiv.org/abs/231
 ### [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)
 
 LLM Embedder is fine-tuned based on the feedback from LLMs.
-It can support the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
+It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
 It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation,
 Long-Range Language Modeling, In-Context Learning, and Tool Learning.
 For more details, please refer to the [report](https://arxiv.org/abs/2310.07554) and [./FlagEmbedding/llm_embedder/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)
@@ -135,7 +177,7 @@ The data format is the same as embedding model, so you can fine-tune it easily f
 For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
 
 
-
+### [LLM Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)
 We provide a new version of the cross-encoder that supports more languages and longer lengths. The data format is similar to our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers. You can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#fine-tune).
 For more details, please refer to [./FlagEmbedding/llm_reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker).
@@ -189,6 +231,7 @@ Refer to our [report: c-pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](h
 
 
 ### Contributors:
+We thank all our contributors for their efforts and warmly welcome new members to join in!
 
 <a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors">
 <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" />
