Commit 8c8b52d

Merge pull request #1076 from ZiyiXia/master
Reform README
2 parents 06a2113 + 3aae73d commit 8c8b52d

1 file changed

Lines changed: 67 additions & 24 deletions

File tree

README.md

@@ -1,5 +1,8 @@
 <h1 align="center">FlagEmbedding</h1>
 <p align="center">
+    <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d">
+        <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow">
+    </a>
     <a href="https://github.com/FlagOpen/FlagEmbedding">
         <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
     </a>
@@ -17,6 +20,8 @@
 <h4 align="center">
     <p>
         <a href=#news>News</a> |
+        <a href=#installation>Installation</a> |
+        <a href=#quick-start>Quick Start</a> |
         <a href="#projects">Projects</a> |
         <a href=#model-list>Model List</a> |
         <a href="#contributor">Contributor</a> |
@@ -40,6 +45,13 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
 - 7/26/2024: Release a new embedding model [bge-en-icl](https://huggingface.co/BAAI/bge-en-icl), an embedding model that incorporates in-context learning capabilities. By providing task-relevant query-response examples, it can encode semantically richer queries, further enhancing the semantic representation ability of the embeddings. :fire:
 - 7/26/2024: Release a new embedding model [bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2), a multilingual embedding model based on gemma-2-9b, which supports multiple languages and diverse downstream tasks, achieving new SOTA on multilingual benchmarks (MIRACL, MTEB-fr, and MTEB-pl). :fire:
 - 7/26/2024: Release a new lightweight reranker [bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight), a lightweight reranker based on gemma-2-9b, which supports token compression and layerwise lightweight operations, and can still ensure good performance while saving a significant amount of resources. :fire:
+
+
+
+<details>
+<summary>More</summary>
+<!-- ### More -->
+
 - 6/7/2024: Release a new benchmark [MLVU](https://github.com/JUNJIE99/MLVU), the first comprehensive benchmark specifically designed for long video understanding. MLVU features an extensive range of video durations, a diverse collection of video sources, and a set of evaluation tasks uniquely tailored for long-form video understanding. :fire:
 - 5/21/2024: Release a new benchmark [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench) together with Jina AI, Zilliz, HuggingFace, and other partners. AIR-Bench focuses on a fair out-of-distribution evaluation for Neural IR & RAG. It generates synthetic data for benchmarking w.r.t. diverse domains and languages. It is dynamic and will be updated on a regular basis. [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) :fire:
 - 4/30/2024: Release [Llama-3-8B-Instruct-80K-QLoRA](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA), extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a few synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora) :fire:
@@ -57,12 +69,6 @@ It is the first embedding model which supports all three retrieval methods, achi
 - 09/12/2023: New models:
     - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
     - **Updated embedding model**: release `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance their retrieval ability without instruction.
-
-
-<details>
-<summary>More</summary>
-<!-- ### More -->
-
 - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): add a script to mine hard negatives and support adding an instruction during fine-tuning.
 - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
 - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
@@ -72,28 +78,67 @@ It is the first embedding model which supports all three retrieval methods, achi
 
 </details>
 
+## Installation
+- Using pip:
+```
+pip install -U FlagEmbedding
+```
+- Install from source:
+Clone the repository
+```
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding
+pip install .
+```
+For development in editable mode:
+```
+pip install -e .
+```
+
+## Quick Start
+First, load one of the BGE embedding models:
+```
+from FlagEmbedding import FlagModel
 
+model = FlagModel('BAAI/bge-base-en-v1.5',
+                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
+                  use_fp16=True)
+```
+Then, feed some sentences to the model and get their embeddings:
+```
+sentences_1 = ["I love NLP", "I love machine learning"]
+sentences_2 = ["I love BGE", "I love text retrieval"]
+embeddings_1 = model.encode(sentences_1)
+embeddings_2 = model.encode(sentences_2)
+```
+Once we have the embeddings, we can compute their similarity:
+```
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+```
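The Quick Start above scores sentence pairs with a matrix product of their embeddings; since BGE embeddings are typically L2-normalized, each entry is a cosine similarity. A minimal sketch of that math, using toy unit vectors in place of `model.encode()` output (no FlagEmbedding install needed):

```python
import numpy as np

# Toy stand-ins for model.encode() output: two L2-normalized
# embeddings per sentence list (real BGE vectors have 768+ dims).
embeddings_1 = np.array([[0.6, 0.8], [1.0, 0.0]])
embeddings_2 = np.array([[0.8, 0.6], [0.0, 1.0]])

# similarity[i, j] is the cosine similarity between
# sentences_1[i] and sentences_2[j].
similarity = embeddings_1 @ embeddings_2.T
print(similarity.shape)  # (2, 2)
```

The result is a full pairwise similarity matrix, which is why one matrix product suffices for many-to-many comparison.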
 
 ## Projects
 
-### BGE-M3([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
-In this project, we introduce BGE-M3, the first embedding model which supports multiple retrieval modes, multilingual and multi-granularity retrieval.
-- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
-- Multi-Linguality: It can support more than 100 working languages.
-- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
+### BGE-M3 ([Paper](https://arxiv.org/pdf/2402.03216.pdf), [Code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
+
+
+In this project, we introduce BGE-M3, the first embedding model which supports:
+- **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
+- **Multi-Linguality**: It supports more than 100 working languages.
+- **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
 
-We propose a novel self-knowledge distillation approach to improve the performance of a single retrieval mode.
-We optimize the batching strategy, enabling a large batch size, which can be used simply when fine-tuning with long text or large language models.
-We also construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long text.
 **The training code and fine-tuning data will be open-sourced in the near future.**
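Since BGE-M3 produces dense, sparse, and multi-vector scores for the same query-passage pair, a common way to combine them is a weighted sum. A minimal illustrative sketch of that hybrid ranking on toy scores (the scores and equal weights are assumptions for illustration, not values from the BGE-M3 paper or code):

```python
import numpy as np

# Toy scores standing in for BGE-M3's three retrieval outputs for
# one query against three candidate passages.
dense_scores = np.array([0.62, 0.41, 0.55])     # dense (CLS) similarity
sparse_scores = np.array([0.30, 0.05, 0.20])    # lexical-weight overlap
multivec_scores = np.array([0.58, 0.37, 0.50])  # multi-vector (ColBERT-style)

# Hybrid score: weighted sum of the three modes (equal weights assumed).
w = (1 / 3, 1 / 3, 1 / 3)
hybrid = w[0] * dense_scores + w[1] * sparse_scores + w[2] * multivec_scores

best = int(np.argmax(hybrid))  # index of the top-ranked passage
```

In practice the weights are a tuning knob: leaning on the sparse score favors exact keyword matches, while the dense and multi-vector scores capture semantic relevance.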
 
 ### [Visualized-BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
 In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.
 
 Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
 
+
+
 ### [LongLLM QLoRA](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora)
-We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic data samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computing resources.
+We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on one 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts.
+
 
 ### [Activation Beacon](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon)
 
@@ -104,13 +149,10 @@ More details please refer to our [paper](https://arxiv.org/abs/2401.03462) and [
 
 
 ### [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
-
-Model merging has been used to improve the performance of a single model.
-We find this method is also useful for large language models and dense embedding models,
-and design the LM-Cocktail strategy which automatically merges fine-tuned models and the base model using a simple function to compute merging weights.
-LM-Cocktail can be used to improve performance on a target domain without decreasing
-the general capabilities beyond the target domain.
-It can also be used to generate a model for new tasks without fine-tuning.
+
+LM-Cocktail automatically merges fine-tuned models and the base model using a simple function to compute merging weights.
+LM-Cocktail can be used to improve performance on a target domain without decreasing the general capabilities beyond the target domain,
+as well as to generate a model for new tasks without fine-tuning.
 You can use it to merge LLMs (e.g., Llama) or embedding models.
 For more details, please refer to our report: [LM-Cocktail](https://arxiv.org/abs/2311.13534) and [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).
 
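The core of model merging as described here is a weighted average of corresponding parameters. A minimal sketch of that idea on toy parameter dicts (`merge` and the two-entry "models" are hypothetical stand-ins, not LM-Cocktail's actual API, which operates on full checkpoints):

```python
import numpy as np

# Hypothetical toy "checkpoints": parameter name -> weight array.
base = {"w": np.array([1.0, 1.0])}
finetuned = {"w": np.array([3.0, 5.0])}

def merge(models, weights):
    """Element-wise weighted average of parameter dicts.

    `weights` should sum to 1; LM-Cocktail computes such weights
    automatically, whereas here they are supplied by hand.
    """
    merged = {}
    for name in models[0]:
        merged[name] = sum(w * m[name] for w, m in zip(weights, models))
    return merged

merged = merge([base, finetuned], [0.5, 0.5])
```

Because merging happens purely in parameter space, it needs no training data or gradient steps, which is why it can produce a model for a new task without fine-tuning.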

@@ -119,7 +161,7 @@ More details please refer to our report: [LM-Cocktail](https://arxiv.org/abs/231
 ### [LLM Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)
 
 LLM Embedder is fine-tuned based on the feedback from LLMs.
-It can support the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
+It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval.
 It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation,
 Long-Range Language Modeling, In-Context Learning, and Tool Learning.
 For more details, please refer to the [report](https://arxiv.org/abs/2310.07554) and [./FlagEmbedding/llm_embedder/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder)
@@ -135,7 +177,7 @@ The data format is the same as embedding model, so you can fine-tune it easily f
 For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
 
 
-
+### [LLM Reranker](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker)
 We provide a new version of the cross-encoder that supports more languages and longer lengths. The data format is similar to our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers. You can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker#fine-tune).
 For more details, please refer to [./FlagEmbedding/llm_reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker).
@@ -189,6 +231,7 @@ Refer to our [report: c-pack](https://arxiv.org/pdf/2309.07597.pdf) and [code](h
 
 
 ### Contributors:
+We thank all our contributors for their efforts and warmly welcome new members to join in!
 
 <a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors">
 <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" />
