About multiple .loom files

#115
by allenxiao - opened

Hi, thank you for the useful tool.
For pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M: if the device does not have enough memory for the large dataset, could I split the dataset into multiple parts and generate a .loom file for each part, then tokenize each .loom file into a .arrow file, and finally merge those .arrow files into one with Dataset.concatenate_datasets? Is that correct?
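A rough sketch of this proposed pipeline in Python: the `TranscriptomeTokenizer` and `concatenate_datasets` calls follow their documented usage, but the directory layout (`loom_chunks/` with one .loom file per subdirectory), the chunk naming, and the `nproc` value are illustrative assumptions, and exact tokenizer arguments may vary across Geneformer versions.

```python
from pathlib import Path

from datasets import concatenate_datasets, load_from_disk
from geneformer import TranscriptomeTokenizer

# Hypothetical layout: loom_chunks/part_0/, loom_chunks/part_1/, ...
# with each subdirectory holding one .loom chunk of the corpus.
tokenizer = TranscriptomeTokenizer(nproc=4)  # nproc is a placeholder

# Tokenize each .loom chunk into its own Hugging Face dataset.
for i, chunk_dir in enumerate(sorted(Path("loom_chunks").iterdir())):
    tokenizer.tokenize_data(str(chunk_dir), "tokenized_chunks", f"chunk_{i}")

# Merge the per-chunk datasets into one arrow-backed dataset.
parts = [
    load_from_disk(str(p))
    for p in sorted(Path("tokenized_chunks").glob("*.dataset"))
]
merged = concatenate_datasets(parts)
merged.save_to_disk("pretraining_corpus.dataset")
```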

Thank you for your question. The transcriptome tokenizer scans through .loom files without loading the whole file into memory, so if you have many .loom files, you can point the tokenizer at the directory containing them and it will compose them into a single .dataset. If you are encountering memory limitations in the step of generating the .dataset, though, then yes, you could generate a separate .dataset for each batch of data and concatenate them once they are in the .dataset format.
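For reference, a minimal sketch of the single-directory route described above (the paths and `nproc` value are placeholders; the call follows Geneformer's documented `tokenize_data` usage, though argument details may differ by version):

```python
from geneformer import TranscriptomeTokenizer

# Point the tokenizer at one directory containing every .loom file;
# it scans them without loading whole files into memory and writes
# a single combined dataset.
tokenizer = TranscriptomeTokenizer(nproc=4)
tokenizer.tokenize_data(
    "loom_data/",          # directory holding all the .loom files
    "tokenized/",          # output directory
    "pretraining_corpus",  # output prefix -> tokenized/pretraining_corpus.dataset
)
```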

ctheodoris changed discussion status to closed
