Commit 325be85

Merge pull request karpathy#420 from vinjn/fix-371-enc-is-not-defined
Move enc to global namespace to fix karpathy#371
2 parents a022d02 + dccf362 commit 325be85

1 file changed

Lines changed: 2 additions & 1 deletion

data/openwebtext/prepare.py

@@ -16,6 +16,8 @@
 # it is better than 1 usually though
 num_proc_load_dataset = num_proc
 
+enc = tiktoken.get_encoding("gpt2")
+
 if __name__ == '__main__':
     # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
     dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
@@ -38,7 +40,6 @@
     # })
 
     # we now want to tokenize the dataset. first define the encoding function (gpt2 bpe)
-    enc = tiktoken.get_encoding("gpt2")
     def process(example):
         ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
         ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
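
Why moving `enc` to module level can matter: the sketch below is an illustration, not the confirmed root cause of karpathy#371. It assumes worker processes re-import the script instead of inheriting the parent's memory, as the standard library's spawn start method does (the main module is re-imported under the name `__mp_main__`). The two source strings are hypothetical miniatures of prepare.py, and a plain string stands in for the tiktoken encoder so the example runs without any downloads.

```python
# Hypothetical miniature of the script BEFORE the change: enc is only
# assigned when the file is executed as a script.
BROKEN = '''
if __name__ == "__main__":
    enc = "tokenizer"

def process(text):
    return enc + ":" + text   # enc is looked up in module globals at call time
'''

# Hypothetical miniature AFTER the change: enc is assigned on every import.
FIXED = '''
enc = "tokenizer"

def process(text):
    return enc + ":" + text
'''


def import_as_worker(source):
    """Execute source with __name__ set to "__mp_main__", mimicking how a
    spawned worker re-imports the parent's main module."""
    namespace = {"__name__": "__mp_main__"}
    exec(source, namespace)
    return namespace


def call_process(source):
    """Call process() in the simulated worker and report what happens."""
    worker = import_as_worker(source)
    try:
        return worker["process"]("hello")
    except NameError as err:
        return f"NameError: {err}"


if __name__ == "__main__":
    print("before:", call_process(BROKEN))
    print("after: ", call_process(FIXED))
```

In the simulated worker, the guarded assignment never runs, so `process` cannot find `enc`; with the module-level assignment the lookup succeeds. A side effect of the module-level form is that the encoder is constructed on every import of the script, including in each worker.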
