About multiple .loom files

#115
by allenxiao - opened

Hi, thank you for the useful tool.
For pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M: if the device does not have enough memory for the large dataset, could I split the dataset into multiple parts and generate a .loom file for each part, then tokenize each .loom file into a .arrow file, and finally merge those .arrow files into one with Dataset.concatenate_datasets? Is that correct?
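A rough sketch of this proposed pipeline in Python: the `TranscriptomeTokenizer` and `concatenate_datasets` calls follow their documented usage, but the directory layout (`loom_chunks/` with one .loom file per subdirectory), the chunk naming, and the `nproc` value are illustrative assumptions, and exact tokenizer arguments may vary across Geneformer versions.

```python
from pathlib import Path

from datasets import concatenate_datasets, load_from_disk
from geneformer import TranscriptomeTokenizer

# Hypothetical layout: loom_chunks/part_0/, loom_chunks/part_1/, ...
# with each subdirectory holding one .loom chunk of the corpus.
tokenizer = TranscriptomeTokenizer(nproc=4)  # nproc is a placeholder

# Tokenize each .loom chunk into its own Hugging Face dataset.
for i, chunk_dir in enumerate(sorted(Path("loom_chunks").iterdir())):
    tokenizer.tokenize_data(str(chunk_dir), "tokenized_chunks", f"chunk_{i}")

# Merge the per-chunk datasets into one arrow-backed dataset.
parts = [
    load_from_disk(str(p))
    for p in sorted(Path("tokenized_chunks").glob("*.dataset"))
]
merged = concatenate_datasets(parts)
merged.save_to_disk("pretraining_corpus.dataset")
```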

Thank you for your question. The transcriptome tokenizer scans through .loom files without loading the whole file into memory, so if you have many .loom files, you can point the tokenizer at the directory containing them and it will compose them into a single .dataset. If you are encountering memory limitations in the step of generating the .dataset, though, then yes, you could generate a separate .dataset for each batch of data and concatenate them once they are in the .dataset format.
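For reference, a minimal sketch of the single-directory route described above (the paths and `nproc` value are placeholders; the call follows Geneformer's documented `tokenize_data` usage, though argument details may differ by version):

```python
from geneformer import TranscriptomeTokenizer

# Point the tokenizer at one directory containing every .loom file;
# it scans them without loading whole files into memory and writes
# a single combined dataset.
tokenizer = TranscriptomeTokenizer(nproc=4)
tokenizer.tokenize_data(
    "loom_data/",          # directory holding all the .loom files
    "tokenized/",          # output directory
    "pretraining_corpus",  # output prefix -> tokenized/pretraining_corpus.dataset
)
```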

ctheodoris changed discussion status to closed
