
Commit 0247d32

update activation beacon code
1 parent 7745d3a commit 0247d32

3 files changed

Lines changed: 254 additions & 2 deletions

File tree

research/Long_LLM/activation_beacon/README.md
research/Long_LLM/activation_beacon/examples/evaluation.md
research/Long_LLM/activation_beacon/examples/training.md

research/Long_LLM/activation_beacon/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -77,10 +77,10 @@ For any path specified for `train_data` and `eval_data`: if it is prefixed with
 
 
 ## Training
-See [training section](./docs/training.md).
+See [training section](./examples/training.md).
 
 ## Evaluation
-See [evaluation section](./docs/evaluation.md).
+See [evaluation section](./examples/evaluation.md).
 
 
 ## Citation
```
research/Long_LLM/activation_beacon/examples/evaluation.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Evaluation

Make sure you have created the environment and downloaded the data according to [README](../README.md).

```bash
conda activate beacon

model=namespace-Pt/beacon-qwen-2-7b-instruct

# language modeling perplexity
torchrun --nproc_per_node 8 -m main.eval_lm --max_length 100000 --stride 32768 --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024

# passkey retrieval accuracy
torchrun --nproc_per_node 8 -m main.eval_passkey --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024

# needle-in-a-haystack accuracy
OPENAI_API_KEY="<your_api_key>" torchrun --nproc_per_node 8 -m main.eval_needle --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --gpt_eval

# topic retrieval accuracy
torchrun --nproc_per_node 8 -m main.eval_topic --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024

# longbench
torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024

# infinitebench
torchrun --nproc_per_node 8 -m main.eval_infbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024
```

All evaluation results will be saved at `data/results`.
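
To browse the results, here is a minimal sketch that walks `data/results` and prints whatever it finds. It assumes, as a guess, that the scripts dump JSON metric files; the exact layout depends on each evaluation script, so anything else is simply listed:

```python
import json
from pathlib import Path

# Walk data/results (created by the eval scripts) and show what is there.
# The layout and file formats depend on each script, so JSON files are
# pretty-printed and anything else is just listed.
for path in sorted(Path("data/results").rglob("*")):
    if path.is_file() and path.suffix == ".json":
        try:
            metrics = json.loads(path.read_text())
            print(path)
            print(json.dumps(metrics, indent=2))
        except json.JSONDecodeError:
            print(f"{path}: not valid JSON, skipping")
    elif path.is_file():
        print(path)
```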
research/Long_LLM/activation_beacon/examples/training.md

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Training

There are two stages in training:

- Pretrain
  - 1B tokens from [redpajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample) with auto-regressive language modeling
  - An EOS token is appended to each document, and documents are not packed together (see the sketch after this list)
  - 20K context length at maximum
- Finetune
  - 5K samples from [LongAlpaca](https://huggingface.co/datasets/Yukang/LongAlpaca-12k), 2K samples from [Booksum](https://huggingface.co/datasets/kmfoda/booksum), 16K synthetic long-context QA data from GPT-3.5, and 5K samples from the pretraining data
  - 20K context length at maximum
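
For concreteness, here is a minimal sketch of the "EOS per document, no packing" preprocessing used in the pretraining stage. This is illustrative only, not the repository's actual data pipeline; the tokenizer choice and the `text` field name are assumptions:

```python
from transformers import AutoTokenizer

# Illustrative sketch of the pretraining data format: one document per
# sequence (no packing), terminated by EOS, capped at the 20K maximum.
# The tokenizer and the "text" field are assumptions, not the repo's code.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
MAX_LENGTH = 20000

def encode_document(doc: dict) -> dict:
    input_ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
    input_ids.append(tokenizer.eos_token_id)  # EOS marks the document boundary
    input_ids = input_ids[:MAX_LENGTH]        # truncate; never pack a second doc
    # Auto-regressive language modeling: labels mirror the inputs
    # (the causal shift happens inside the model's loss).
    return {"input_ids": input_ids, "labels": list(input_ids)}

# Example: encode_document({"text": "Some long document..."})
```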
## Prerequisite

Make sure you have created the environment and downloaded the data according to [README](../README.md).

### Mistral

#### Pretrain

```bash
output_name=beacon-mistral-pretrain

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
    --train_data long-llm:redpajama/train.json \
    --min_length 2400 \
    --max_length 20000 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 2048 \
    --beacon_stride 2048 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 0 \
    --beacon_ratio 2 4 8 16 32 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --num_train_epochs 1 \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json
```
#### Finetune

The finetune recipe resumes from the pretraining checkpoint; the `*` glob picks up the checkpoint directory saved at the end of the pretraining epoch.

```bash
output_name=beacon-mistral-finetune

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path data/outputs/beacon-mistral-pretrain/* \
    --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \
    --max_length 20000 \
    --min_length 7200 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 2048 \
    --beacon_stride 2048 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 0 \
    --beacon_ratio 2 4 8 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --num_train_epochs 1 \
    --save_strategy epoch \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json \
    --chat_template mistral
```
### Llama-3

NOTE: according to our experiments, Llama-3 requires an attention sink, hence `--beacon_sink_size 1` below (Mistral and Qwen-2 train with `--beacon_sink_size 0`). Llama-3 also uses a shorter beacon window and stride (1024 instead of 2048).

#### Pretrain

```bash
output_name=beacon-llama3-pretrain

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --train_data long-llm:redpajama/train.json \
    --min_length 2400 \
    --max_length 20000 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 1024 \
    --beacon_stride 1024 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 1 \
    --beacon_ratio 2 4 8 16 32 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --num_train_epochs 1 \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json
```
#### Finetune

```bash
output_name=beacon-llama3-finetune

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path data/outputs/beacon-llama3-pretrain/* \
    --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \
    --max_length 20000 \
    --min_length 7200 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 1024 \
    --beacon_stride 1024 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 1 \
    --beacon_ratio 2 4 8 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --num_train_epochs 1 \
    --save_strategy epoch \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json \
    --chat_template llama-3
```
### Qwen-2

#### Pretrain

```bash
output_name=beacon-qwen2-pretrain

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path Qwen/Qwen2-7B-Instruct \
    --train_data long-llm:redpajama/train.json \
    --min_length 2400 \
    --max_length 20000 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 2048 \
    --beacon_stride 2048 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 0 \
    --beacon_ratio 2 4 8 16 32 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --num_train_epochs 1 \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json
```
#### Finetune

```bash
# the original snippet never sets $output_name; named here to match the other recipes
output_name=beacon-qwen2-finetune

torchrun --nproc_per_node 8 $DDP -m main.train \
    --output_dir data/outputs/$output_name \
    --model_name_or_path data/outputs/beacon-qwen2-pretrain/* \
    --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \
    --max_length 20000 \
    --min_length 7200 \
    --group_by_stride strict \
    --enable_beacon \
    --beacon_window 2048 \
    --beacon_stride 2048 \
    --beacon_attn full-coverage \
    --beacon_attend_prev True \
    --beacon_sink_size 0 \
    --beacon_ratio 2 4 8 \
    --beacon_ratio_mix step-random \
    --beacon_param q k v \
    --beacon_pos interleave \
    --attn_impl flash_attention_2 \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --use_reentrant False \
    --save_only_model \
    --num_train_epochs 1 \
    --save_strategy epoch \
    --logging_steps 50 \
    --bf16 \
    --deepspeed data/deepspeed/stage2.json \
    --chat_template qwen
```
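
Once finetuning completes, the checkpoint can be sanity-checked with a short generation run. A minimal sketch, assuming the saved checkpoint loads through `transformers` with `trust_remote_code=True` the way the released beacon models (e.g. `namespace-Pt/beacon-qwen-2-7b-instruct`) do; `doc.txt` is a stand-in for any long input:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the finetuned checkpoint ships the beacon modeling code and
# loads via trust_remote_code, like the released beacon models on the Hub.
path = "data/outputs/beacon-qwen2-finetune"  # adjust to your checkpoint dir
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# doc.txt is a placeholder for any long document you want to test with
messages = [{"role": "user", "content": "Summarize: " + open("doc.txt").read()}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=True))
```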
