2 parents 7339b90 + bb7e967, commit 41d7014
1 file changed
data/openwebtext/prepare.py
@@ -11,8 +11,13 @@
 # good number to use is ~order number of cpu cores // 2
 num_proc = 8
 
+# number of workers in load_dataset() call
+# best number might be different from num_proc above as it also depends on NW speed.
+# it is better than 1 usually though
+num_proc_load_dataset = num_proc
+
 # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
-dataset = load_dataset("openwebtext")
+dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
 
 # owt by default only contains the 'train' split, so create a test split
 split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
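
The comments in this change suggest two separate knobs: a worker count sized from the CPU cores, and a download worker count that may need its own value because it also depends on network speed. A minimal sketch of deriving both, rather than hard-coding 8, might look like the following. The helper `pick_num_proc` and its `divisor` parameter are hypothetical and not part of the commit; the `load_dataset` call is left commented out because it downloads the full dataset.

```python
import os


def pick_num_proc(cpu_count=None, divisor=2):
    """Hypothetical helper: roughly cpu_count // divisor, never below 1."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is undetermined.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // divisor)


# Worker count for CPU-bound steps, following the "cores // 2" rule of thumb.
num_proc = pick_num_proc()

# Assumption: start the download workers at the same value, then tune it
# separately, since the best setting also depends on network speed.
num_proc_load_dataset = num_proc

if __name__ == "__main__":
    print(num_proc, num_proc_load_dataset)
    # from datasets import load_dataset
    # dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
```

Keeping the two values as separate names, as the commit does, lets the download parallelism be changed later without touching the setting used for the CPU-bound steps.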