
Commit 56e2c8e

update beacon code
1 parent 888f34b commit 56e2c8e

37 files changed

Lines changed: 4987 additions & 2652 deletions

Long_LLM/activation_beacon/README.md

Lines changed: 2 additions & 3 deletions
@@ -2,9 +2,8 @@
 <h1>Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon [<a href="https://arxiv.org/abs/2401.03462">paper</a>]</h1>
 </div>
 
-This is the codebase for Activation Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM by **x100** times. Currently we only apply activation beacon to [Llama-2-chat-7b](https://huggingface.co/namespace-Pt/activation-beacon-llama2-7b-chat) and [Mistral-7B-Instruct-v0.2](https://huggingface.co/namespace-Pt/activation-beacon-mistral-7b). More LLMs will be supported in the future.
-
+This is the codebase for Activation Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs by compressing the KV cache.
 
 ## File structure:
 - The [old](./old/) folder contains our initial implementation of Activation Beacon for Llama-2. You can use the code in it to reproduce the training/evaluation of the Llama-2 based model shown in our paper.
-- The [new](./new/) folder contains **newer** implementation of Activation Beacon for both Llama-2 and Mistral. It also supports more features, including **Deepspeed Zero3 training**, adding **chat template** in training and inference, and **evaluating on more tasks**. However, code in this folder are under development and subject to change in the future.
+- The [new](./new/) folder contains a **newer** implementation of Activation Beacon. It supports more LLMs, including Mistral, Llama-3, and Qwen-2, and more features, including **Deepspeed Zero3 training**, **Flash-Attention-2**, adding a **chat template** in training and inference, and **evaluating on more tasks**. However, the code in this folder is under development and subject to change in the future.

Long_LLM/activation_beacon/new/README.md

Lines changed: 25 additions & 9 deletions
@@ -1,15 +1,26 @@
 # Activation-Beacon
 
-This folder contains the newer code for activation beacon with the support of **Mistral models**, **Deepspeed Zero3 training**, **chat templates**, and **more evaluation tasks**. The code here are under development and subject to change in the future.
+[Activation Beacon](https://arxiv.org/abs/2401.03462) compresses the original KV cache into fewer yet more compact states (a.k.a. beacons), enabling the LLM to perceive a longer context within its fixed context window. It is known for the following features:
+- **Effective**
+  - there is little information loss at compression ratios of 2, 4, and 8;
+- **Efficient**
+  - it drastically reduces the GPU memory consumption of the KV cache;
+- **Compatible**
+  - it can work together with position extrapolation (e.g. YaRN) to further extend the context length, and with grouped query attention to further reduce the KV cache size;
+- **Low-Cost**
+  - it is lightweight and can be efficiently trained with roughly 1B tokens.
+
+This folder contains the newer code for Activation Beacon. It supports more LLMs, including Mistral, Llama-3, and Qwen-2, and more features, including **Deepspeed Zero3 training**, **Flash-Attention-2**, adding a **chat template** in training and inference, and **evaluating on more tasks**. However, the code in this folder is under development and subject to change in the future.
 
 ## Environment
 ```bash
 conda create -n beacon python=3.10.14
 
 conda activate beacon
 
+# You may need to adjust the CUDA version
 conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
-pip install transformers==4.39.3 deepspeed accelerate datasets peft pandas seaborn rouge fuzzywuzzy jieba
+pip install transformers==4.39.3 deepspeed accelerate datasets peft pandas seaborn rouge fuzzywuzzy jieba python-Levenshtein
 pip install flash-attn --no-build-isolation
 ```
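The "Efficient" claim above is simple cache arithmetic: compressing the KV cache by a ratio r shrinks its memory footprint by the same factor. A minimal sketch, assuming an illustrative Mistral-7B-like configuration (32 layers, 8 KV heads, head dim 128, bf16); these numbers are assumptions for the example, not values taken from this repository:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Keys and values each occupy n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

seq_len = 100_000
for ratio in (1, 2, 4, 8):
    # a compression ratio of r keeps roughly seq_len / r beacon states
    gib = kv_cache_bytes(seq_len // ratio) / 2**30
    print(f"compression ratio {ratio}: {gib:.2f} GiB of KV cache")
```

At ratio 1 (no compression) this configuration holds about 12.2 GiB of KV cache for a 100K-token context; ratio 8 brings it down to roughly 1.5 GiB.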

@@ -19,10 +30,15 @@ import json
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_id = "namespace-Pt/activation-beacon-mistral-7b"
+model_id = "namespace-Pt/beacon-qwen-2-7b-instruct"
 
 tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2"
+)
 
 model = model.cuda().eval()
@@ -32,7 +48,7 @@ with torch.no_grad():
     inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
     outputs = model.generate(**inputs, max_new_tokens=50)
     print(f"Input Length: {inputs['input_ids'].shape[1]}")
-    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
+    print(f"Output: {repr(tokenizer.decode(outputs[0], skip_special_tokens=True))}")
 
 # reset memory before new generation task
 model.memory.reset()
@@ -55,16 +71,16 @@ with torch.no_grad():
 You should download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`:
 ```bash
 # feel free to change /data to your preferred location
-wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/activation-beacon-new.tar.gz?download=true -O /data/activation-beacon-new.tar.gz
+wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/long-llm.tar.gz?download=true -O /data/long-llm.tar.gz
 
 cd /data
-tar -xzvf activation-beacon-new.tar.gz
+tar -xzvf long-llm.tar.gz
 ```
 
 **IMPORTANT NOTE**
 
-For any path specified for `train_data` and `eval_data`: if it is prefixed with `activation-beacon:`, it will be solved to the relative path against [`data_root`](./src/args.py).
-- e.g. `activation-beacon:lm/pg19.json` becomes `${data_root}/lm/pg19.json`
+For any path specified for `train_data` and `eval_data`: if it is prefixed with `long-llm:`, it will be resolved relative to [`data_root`](./src/args.py).
+- e.g. `long-llm:lm/pg19.json` becomes `${data_root}/lm/pg19.json`
 - you can modify the default value of [`data_root`](./src/args.py), so that you don't need to type it for each command.
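The `long-llm:` prefix rule above amounts to a small path rewrite. A minimal sketch of that behavior; the function name and the default `data_root` value here are hypothetical (the real logic and default live in `src/args.py`):

```python
import os

def resolve_data_path(path, data_root="/data/long-llm"):
    """Expand a 'long-llm:'-prefixed path against data_root; leave other paths untouched."""
    prefix = "long-llm:"
    if path.startswith(prefix):
        # 'long-llm:lm/pg19.json' -> '<data_root>/lm/pg19.json'
        return os.path.join(data_root, path[len(prefix):])
    return path

print(resolve_data_path("long-llm:lm/pg19.json"))  # /data/long-llm/lm/pg19.json
print(resolve_data_path("/abs/other.json"))        # unchanged
```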

Lines changed: 6 additions & 45 deletions
@@ -1,69 +1,30 @@
 # Evaluation
 
-## Prerequisite
-
 Make sure you have created the environment and downloaded the data according to [README](../README.md).
 
 
-## Evaluating Beacon Models
 ```bash
 conda activate beacon
 
-model=namespace-Pt/activation-beacon-mistral-7b
+model=namespace-Pt/beacon-qwen-2-7b-instruct
 
 # language modeling perplexity
 torchrun --nproc_per_node 8 -m main.eval_lm --max_length 100000 --stride 32768 --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
 
 # passkey retrieval accuracy
-torchrun --nproc_per_node 8 -m main.eval_passkey --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --chat_template mistral
+torchrun --nproc_per_node 8 -m main.eval_passkey --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
 
 # needle-in-a-haystack accuracy
-OPENAI_API_KEY="<you_api_key>" torchrun --nproc_per_node 8 -m main.eval_needle --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --chat_template mistral --gpt_eval
+OPENAI_API_KEY="<your_api_key>" torchrun --nproc_per_node 8 -m main.eval_needle --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --gpt_eval
 
 # topic retrieval accuracy
-torchrun --nproc_per_node 8 -m main.eval_topic --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --chat_template mistral
+torchrun --nproc_per_node 8 -m main.eval_topic --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
 
 # longbench
-torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --chat_template mistral
+torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
 
 # infinitebench
-torchrun --nproc_per_node 8 -m main.eval_infbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --chat_template mistral
+torchrun --nproc_per_node 8 -m main.eval_infbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
 ```
 
 All evaluation results will be saved at `data/results`.
-
-
-
-## Evaluating Full-Attention Models
-
-Full-attention models cannot run with more than 32K context length on a single A800 GPU. Parallel strategies are required. We use [`tensor_parallel`](https://github.com/BlackSamorez/tensor_parallel). You should create another environment, downgrade to `transformers==4.35.1`, and install `tensor_parallel`:
-```bash
-conda create full --clone beacon
-pip install transformers==4.35.1 tensor_parallel
-```
-
-Then, run the following commands (feel free to switch `mistralai/Mistral-7B-Instruct-v0.2` to any model on huggingface):
-
-```bash
-conda activate full
-
-model=mistralai/Mistral-7B-Instruct-v0.2
-
-# language modeling perplexity
-python -m main.eval_lm --max_length 100000 --stride 32768 --model_name_or_path $model --attn_impl flash_attention_2 --enable_tp
-
-# passkey retrieval accuracy
-python -m main.eval_passkey --model_name_or_path $model --attn_impl flash_attention_2 --enable_tp --chat_template mistral
-
-# needle-in-a-haystack accuracy
-OPENAI_API_KEY="<your_api_key>" python -m main.eval_needle --model_name_or_path $model --attn_impl flash_attention_2 --enable_tp --chat_template mistral --gpt_eval
-
-# topic retrieval accuracy
-torchrun --nproc_per_node 8 -m main.eval_topic --model_name_or_path $model --attn_impl flash_attention_2 --chat_template mistral
-
-# longbench
-torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --attn_impl flash_attention_2 --chat_template mistral
-
-# infbench
-python -m main.eval_infbench --model_name_or_path $model --attn_impl flash_attention_2 --chat_template mistral --enable_tp
-```
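The `--max_length 100000 --stride 32768` pair in `main.eval_lm` suggests strided language-modeling evaluation: the text is scored window by window, advancing by `stride` tokens so each token contributes to perplexity exactly once. A minimal sketch of such a window layout; the function name and the exact scoring policy are assumptions for illustration, not this repository's implementation:

```python
def stride_windows(n_tokens, max_length=100_000, stride=32_768):
    """Yield (begin, end, n_new) windows: a window may overlap the previous one
    for context, but only its n_new trailing tokens are newly scored."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = stride_windows(150_000)
print(ws)
# every token is scored exactly once across the windows
print(sum(n_new for _, _, n_new in ws))  # 150000
```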
