
Commit 2754e97: release activation beacon for mistral (parent: 502c2f2)

121 files changed: 13,071 additions & 85,891 deletions

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+outputs
+results
+pretrain
Lines changed: 4 additions & 90 deletions

@@ -1,96 +1,10 @@
 <div align="center">
 <h1>Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon [<a href="https://arxiv.org/abs/2401.03462">paper</a>]</h1>
-
-<img src="imgs/impress.png" width="80%" class="center">
 </div>
 
-This is the codebase for Activation Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs by **100x**. Currently we only apply Activation Beacon to [Llama-2-chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). More LLMs will be supported in the future.
-
-## Features
-- **Effectiveness**
-  - significantly improves the performance of Llama-2 on long-context generation (language modeling) and long-context understanding (e.g. long-document QA).
-- **Efficiency**
-  - low memory usage; low inference latency (competitive against FlashAttention2); inference latency increases linearly with the input length.
-- **Compatibility**
-  - preserves the short-context capability of Llama-2;
-  - can be combined with context window extension techniques for further context extension (e.g. 1M with NTK-Aware);
-  - can be combined with retrieval for higher memory accuracy (*ongoing*).
-- **Low-Cost Training**
-  - trains on 80,000 texts within 9 hours;
-  - most training samples are shorter than 4096 tokens.
-
-## Note
-Activation Beacon is a working project. We have released newer code in the [new folder](./new/), which supports:
-- DeepSpeed ZeRO-3 training
-- fine-tuning and evaluating with a chat template
-- needle-in-a-haystack evaluation
-
-You can use the code there if you're interested. The code in the current folder will be deprecated in the future.
-
-## Environment
-The main dependencies are:
-```
-pytorch==2.1.2 transformers==4.36.1 accelerate==0.25.0 datasets==2.14.7 numpy==1.26.2 flash-attn==2.4.2
-```
-You can install our environment with:
-```bash
-conda env create -f environment.yaml --name activation-beacon
-```
-
-## Usage
-```python
-import json
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_id = "namespace-Pt/activation-beacon-llama2-7b-chat"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
-
-model = model.cuda().eval()
-
-with torch.no_grad():
-    # short context
-    text = "Tell me about yourself."
-    inputs = tokenizer(text, return_tensors="pt").to("cuda")
-    outputs = model.generate(**inputs, max_new_tokens=20)
-    print(f"Input Length: {inputs['input_ids'].shape[1]}")
-    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
-
-    # reset memory before a new generation task
-    model.memory.reset()
-
-    # long context
-    with open("data/toy/narrativeqa.json", encoding="utf-8") as f:
-        example = json.load(f)
-    inputs = tokenizer(example["context"], return_tensors="pt").to("cuda")
-    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
-    print("*" * 20)
-    print(f"Input Length: {inputs['input_ids'].shape[1]}")
-    print(f"Answer: {example['answer']}")
-    print(f"Prediction: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
-```
-**NOTE**: It's okay to see warnings like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` Just ignore them.
-
-## Training
-See the [training section](./docs/training.md).
-
-## Evaluation
-See the [evaluation section](./docs/evaluation.md).
+This is the codebase for Activation Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs by **100x**. Currently we only apply Activation Beacon to [Llama-2-chat-7b](https://huggingface.co/namespace-Pt/activation-beacon-llama2-7b-chat) and [Mistral-7B-Instruct-v0.2](https://huggingface.co/namespace-Pt/activation-beacon-mistral-7b). More LLMs will be supported in the future.
 
-## Citation
-If you find this repository useful, please give us a star ⭐.
 
-To cite our work:
-```
-@misc{zhang2024soaring,
-  title={Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon},
-  author={Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou},
-  year={2024},
-  eprint={2401.03462},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
-}
-```
+## File structure
+- The `old` folder contains our initial implementation of Activation Beacon for Llama-2. You can use the code in it to reproduce the training/evaluation of the Llama-2-based model shown in our paper.
+- The `new` folder contains the **newer** implementation of Activation Beacon for both Llama-2 and Mistral. It also supports more features, including **DeepSpeed ZeRO-3 training**, adding a **chat template** in training and inference, and **evaluating on more tasks**. However, the code in this folder is under development and subject to change in the future.
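The "4K to 400K" claim in the README above follows from simple arithmetic: if every window of raw activations is condensed into fewer beacon activations at a given ratio, the span of text the cache can cover grows roughly linearly with that ratio. A back-of-the-envelope sketch (not part of the repository; `effective_context` is a hypothetical helper):

```python
def effective_context(native_context: int, beacon_ratio: int) -> int:
    """Hypothetical helper: approximate number of tokens whose activations
    fit in the cache when every `beacon_ratio` raw activations are
    condensed into one beacon activation."""
    return native_context * beacon_ratio

# Llama-2's native window is 4096 tokens; the training recipe later in this
# commit uses beacon ratios up to 128.
print(effective_context(4096, 128))  # 524288, i.e. beyond the advertised 400K
```

This is only an order-of-magnitude illustration; the actual attainable context depends on the mix of ratios used during training.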
Lines changed: 59 additions & 64 deletions

@@ -1,79 +1,74 @@
 # Activation-Beacon
 
-This folder contains the newer code for Activation Beacon with support for DeepSpeed ZeRO-3 training. This project is under development and subject to change in the future.
+This folder contains the newer code for Activation Beacon with support for **Mistral models**, **DeepSpeed ZeRO-3 training**, **chat templates**, and **more evaluation tasks**. The code here is under development and subject to change in the future.
 
 ## Environment
-The main dependencies are:
-```
-pytorch==2.1.2 transformers==4.36.1 accelerate==0.25.0 datasets==2.14.7 numpy==1.26.2 flash-attn==2.4.2
-```
-You can install our environment with:
 ```bash
-conda env create -f environment.yaml --name activation-beacon
+conda create -n beacon python=3.10.14
+conda activate beacon
+
+conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
+pip install transformers==4.39.3 deepspeed accelerate datasets peft pandas seaborn
+pip install flash-attn --no-build-isolation
+
+# these packages are used in evaluation
+pip install rouge fuzzywuzzy jieba
 ```
 
-## Data
-You should download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`, which results in a folder `/data/activation-beacon`:
-```bash
-# feel free to change /data to your preferred location
-wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/activation-beacon.tar.gz?download=true -O /data/activation-beacon.tar.gz
-
-cd /data
-tar -xzvf activation-beacon.tar.gz
-
-# you must download the new longalpaca dataset that was organized into single-turn conversations
-wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/longalpaca.json?download=true -O /data/activation-beacon/finetune/longalpaca.new.json
-```
-
-**IMPORTANT NOTE**
-- For any path specified for `train_data` and `eval_data`: if it is prefixed with `activation-beacon:`, it will be resolved relative to [`data_root`](../src/args.py).
-  - e.g. `activation-beacon:lm/pg19.json` becomes `${data_root}/lm/pg19.json`
-- you can modify the default value of [`data_root`](../src/args.py) so that you don't need to type it for each command.
-
-## Command
-```bash
-cd new
-
-torchrun --nproc_per_node 8 -m main.train \
-  --output_dir data/outputs/activation-beacon-llama2-chat-7b \
-  --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
-  --train_data activation-beacon:pretrain/redpajama-sample.json activation-beacon:finetune/longalpaca.new.json \
-  --max_length 8192 \
-  --min_length 1200 \
-  --max_train_num_per_data 200000 \
-  --num_train_epochs 1 \
-  --enable_beacon \
-  --beacon_window 1024 \
-  --beacon_stride 1024 \
-  --beacon_attn step-expansion \
-  --beacon_sink_size 1 \
-  --beacon_ratio 2 4 8 16 32 64 128 \
-  --beacon_ratio_mix step-random \
-  --beacon_param q k v o \
-  --gradient_checkpointing \
-  --save_strategy steps \
-  --max_steps 10000 \
-  --save_steps 10000 \
-  --logging_steps 50 \
-  --chat_template llama-2 \
-  --group_by_stride strict \
-  --deepspeed data/deepspeed/stage3.json
-
-# Evaluation
-for model in data/outputs/activation-beacon-llama2-chat-7b/*
-do
-  COMMAND="--beacon_sink_size 1"
-
-  # 100K perplexity
-  torchrun --nproc_per_node 8 -m main.eval_lm --model_name_or_path $model --max_length 100000 --beacon_ratio 32 --min_length 400000 --enable_beacon --stride 0 $COMMAND
-  # 400K perplexity
-  torchrun --nproc_per_node 8 -m main.eval_lm --model_name_or_path $model --max_length 400000 --beacon_ratio 128 --min_length 400000 --enable_beacon --stride 0 $COMMAND
-  # LongBench
-  torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --max_length 15500 --enable_beacon $COMMAND
-  # Topic Retrieval
-  torchrun --nproc_per_node 8 -m main.eval_longeval --model_name_or_path $model --enable_beacon $COMMAND
-done
-```
+## Usage
+```python
+import json
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "namespace-Pt/activation-beacon-mistral-7b"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
+
+model = model.cuda().eval()
+
+with torch.no_grad():
+    # short context
+    text = "Tell me about yourself."
+    inputs = tokenizer(text, return_tensors="pt").to("cuda")
+    outputs = model.generate(**inputs, max_new_tokens=20)
+    print(f"Input Length: {inputs['input_ids'].shape[1]}")
+    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
+
+    # reset memory before a new generation task
+    model.memory.reset()
+
+    # long context
+    with open("data/toy/infbench.json", encoding="utf-8") as f:
+        example = json.load(f)
+    inputs = tokenizer(example["context"], return_tensors="pt").to("cuda")
+    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
+    print("*" * 20)
+    print(f"Input Length: {inputs['input_ids'].shape[1]}")
+    print(f"Answers: {example['answer']}")
+    print(f"Prediction: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
+```
+**NOTE**: It's okay to see warnings like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` Just ignore them.
+
+## Training
+See the [training section](./docs/training.md). **The training script for Mistral will be released in the future.**
+
+## Evaluation
+See the [evaluation section](./docs/evaluation.md).
+
+## Citation
+If you find this repository useful, please give us a star ⭐.
+
+To cite our work:
+```
+@misc{zhang2024soaring,
+  title={Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon},
+  author={Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou},
+  year={2024},
+  eprint={2401.03462},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
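The removed data note in the diff above describes how `train_data`/`eval_data` paths carrying the `activation-beacon:` prefix are resolved against `data_root`. A minimal sketch of that rule (not the repository's actual code in `src/args.py`; `resolve_path` and the default `DATA_ROOT` are hypothetical):

```python
import os

# Assumed default location, matching the `/data/activation-beacon` example
# in the data-download instructions; the real default lives in src/args.py.
DATA_ROOT = "/data/activation-beacon"

def resolve_path(path: str, data_root: str = DATA_ROOT) -> str:
    """Resolve an `activation-beacon:`-prefixed path against data_root;
    leave any other path unchanged."""
    prefix = "activation-beacon:"
    if path.startswith(prefix):
        return os.path.join(data_root, path[len(prefix):])
    return path

print(resolve_path("activation-beacon:lm/pg19.json"))
# /data/activation-beacon/lm/pg19.json
print(resolve_path("data/toy/infbench.json"))  # unchanged
```

Changing the default value of `data_root` then amounts to changing `DATA_ROOT` once instead of typing it on every command.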

Long_LLM/activation_beacon/new/data/deepspeed/stage2_offload.json renamed to Long_LLM/activation_beacon/new/data/deepspeed/stage2-offload.json

File renamed without changes.

Long_LLM/activation_beacon/new/data/deepspeed/stage2_small.json

Lines changed: 0 additions & 43 deletions
This file was deleted.

Long_LLM/activation_beacon/new/data/deepspeed/stage0.json renamed to Long_LLM/activation_beacon/new/data/deepspeed/stage3-offload-optim.json

Lines changed: 27 additions & 11 deletions

@@ -1,17 +1,37 @@
 {
+    "zero_optimization": {
+        "stage": 3,
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_16bit_weights_on_model_save": true,
+
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        }
+    },
     "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
+        "initial_scale_power": 10,
         "loss_scale_window": 1000,
-        "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
     },
-
     "bf16": {
-        "enabled": "auto"
+        "enabled": "auto",
+        "loss_scale": 0,
+        "initial_scale_power": 10,
+        "loss_scale_window": 1000,
+        "hysteresis": 2,
+        "min_loss_scale": 1
     },
-
     "optimizer": {
         "type": "AdamW",
         "params": {
@@ -31,15 +51,11 @@
             "total_num_steps": "auto"
         }
     },
-
-    "zero_optimization": {
-        "stage": 0
-    },
-
+
     "gradient_accumulation_steps": "auto",
     "gradient_clipping": "auto",
-    "steps_per_print": 100,
+    "steps_per_print": 1000,
     "train_batch_size": "auto",
     "train_micro_batch_size_per_gpu": "auto",
     "wall_clock_breakdown": false
-}
+}
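The diff above turns the former ZeRO stage-0 config into a stage-3 config with CPU optimizer offload (hence the `stage3-offload-optim.json` rename). An illustrative sanity check (not part of the repository) that such a JSON actually promises what its file name claims, using an inline literal that reproduces only the fields relevant to the check:

```python
import json

# Abbreviated copy of the key fields from the config in the diff above.
config_text = """
{
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "stage3_gather_16bit_weights_on_model_save": true,
        "offload_optimizer": {"device": "cpu", "pin_memory": true}
    },
    "steps_per_print": 1000,
    "train_batch_size": "auto"
}
"""
config = json.loads(config_text)

zero = config["zero_optimization"]
# The file name promises ZeRO stage 3 with optimizer-state offload.
assert zero["stage"] == 3
assert zero["offload_optimizer"]["device"] == "cpu"
print("config ok")
```

Running a check like this before launching a multi-GPU job catches a stale or mis-renamed config early, which is cheaper than a failed DeepSpeed startup.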
