Commit 41d7014

Merge pull request karpathy#301 from okuvshynov/master
[easy] allow multithreading in load_dataset
2 parents 7339b90 + bb7e967 commit 41d7014

1 file changed

Lines changed: 6 additions & 1 deletion

data/openwebtext/prepare.py
@@ -11,8 +11,13 @@
 # good number to use is ~order number of cpu cores // 2
 num_proc = 8
 
+# number of workers in load_dataset() call
+# best number might be different from num_proc above as it also depends on NW speed.
+# it is better than 1 usually though
+num_proc_load_dataset = num_proc
+
 # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
-dataset = load_dataset("openwebtext")
+dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
 
 # owt by default only contains the 'train' split, so create a test split
 split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
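
The comments in this change suggest a worker count of roughly half the CPU cores, and note that the best value for the load_dataset() call can differ because it also depends on network speed. A minimal sketch of deriving that starting value instead of hard-coding 8 might look like the following. The helper name suggest_num_proc is hypothetical and not part of prepare.py, and the result is only a starting point to tune by hand.

```python
import os


def suggest_num_proc(cpu_count=None, floor=1):
    """Hypothetical helper (not in prepare.py): about half the CPU cores, never below `floor`."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count cannot be determined
        cpu_count = os.cpu_count() or 1
    return max(floor, cpu_count // 2)


# Same two names as in the diff; the values are now derived from the machine.
num_proc = suggest_num_proc()
num_proc_load_dataset = num_proc  # assumption: tune separately if the download is network-bound

# The call from the diff would then be used unchanged:
# from datasets import load_dataset
# dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
```

The load_dataset() call is left commented out here so the sketch runs without downloading the dataset.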
