Commit f961f12

Merge pull request #601 from nntoan209/fix-split-data
Fix "idx" bug in split_data_by_length.py of BGE-M3
2 parents e0cbfab + 549a5f9 commit f961f12

1 file changed

Lines changed: 1 addition & 1 deletion

FlagEmbedding/BGE_M3/split_data_by_length.py

@@ -130,7 +130,7 @@ def _process_file(self, file_path: str, output_path: str):
 continue
 
 idxs = mapped_dataset.filter(lambda x: length_l <= x['max_length'] < length_r, num_proc=self.num_proc)
-split_dataset = dataset.select(idxs['idx'])
+split_dataset = dataset.select(list(idxs._indices.to_pandas()['indices'].values))
 
 split_info_dict[f'len-{length_l}-{length_r}'] = len(split_dataset)
 
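
The change above selects rows by the positions that survive the length filter instead of reading a stored `idx` column. A minimal pure-Python sketch of that idea follows. It does not use the `datasets` library, and `split_by_length`, `rows`, `max_lengths`, and `boundaries` are hypothetical names chosen for illustration rather than names from the repository.

```python
from typing import Dict, List


def split_by_length(rows: List[dict], max_lengths: List[int],
                    boundaries: List[int]) -> Dict[str, List[dict]]:
    """Group rows into half-open length buckets [length_l, length_r)."""
    if len(rows) != len(max_lengths):
        raise ValueError("rows and max_lengths must be aligned one-to-one")

    splits: Dict[str, List[dict]] = {}
    for length_l, length_r in zip(boundaries, boundaries[1:]):
        # Record the positions that pass the filter, not a stored column value.
        kept = [i for i, n in enumerate(max_lengths) if length_l <= n < length_r]
        if not kept:
            continue  # assumption for this sketch: empty buckets are skipped
        # Select the original rows by those positions.
        splits[f"len-{length_l}-{length_r}"] = [rows[i] for i in kept]
    return splits


# Hypothetical data: four rows with precomputed maximum lengths.
rows = [{"query": "a"}, {"query": "b"}, {"query": "c"}, {"query": "d"}]
max_lengths = [12, 700, 40, 512]

splits = split_by_length(rows, max_lengths, [0, 500, 1000])
split_info = {name: len(part) for name, part in splits.items()}
```

Because the kept positions come from enumerating the filtered sequence itself, the selection in this sketch stays aligned with the original rows regardless of what any `idx` field in the data contains.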
