ctheodoris/Geneformer · About gene median dict pickle files

Jul 10, 2023

Thank you so much for the amazing work.
According to the tutorial (obtain_nonzero_median_digests.ipynb), three .pickle files are generated: gene_median_digest_dict.pickle, total_gene_median_dict.pickle, and detected_gene_median_dict.pickle.
For the purpose of pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M, which one should be used?

As I understand it, gene_median_digest_dict.pickle is produced for a single dataset, and total_gene_median_dict.pickle is generated by merging the gene_median_digest_dict.pickle files from different datasets. So if I have only one dataset, total_gene_median_dict.pickle is the same as gene_median_digest_dict.pickle, right?
And detected_gene_median_dict.pickle keeps only the medians of detected genes, obtained by filtering out the NaN values in total_gene_median_dict.pickle.
So the final .pickle file that should be used for pretraining is detected_gene_median_dict.pickle, right?
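
To make the last step concrete, here is a minimal sketch of the NaN filtering described above. It is not the notebook's code: it assumes the total dictionary maps a gene ID to a float median, with NaN for genes that were never detected. The example values, the helper name `keep_detected`, and the temporary output path are all hypothetical.

```python
import math
import pickle
import tempfile
from pathlib import Path

# Hypothetical example values: gene ID -> nonzero median,
# NaN when the gene was never detected in the corpus (assumed layout).
total_gene_median_dict = {
    "ENSG00000000003": 2.0,
    "ENSG00000000005": float("nan"),
    "ENSG00000000419": 3.5,
}


def keep_detected(median_dict):
    """Return a new dict without the genes whose median is NaN."""
    return {
        gene: median
        for gene, median in median_dict.items()
        if not math.isnan(median)
    }


detected_gene_median_dict = keep_detected(total_gene_median_dict)

# Save the filtered dictionary and load it back to check the round trip.
out_path = Path(tempfile.gettempdir()) / "detected_gene_median_dict.pickle"
with open(out_path, "wb") as fp:
    pickle.dump(detected_gene_median_dict, fp)
with open(out_path, "rb") as fp:
    reloaded = pickle.load(fp)
```

Under these assumptions, only the two genes with a numeric median remain in the filtered dictionary, and that filtered file is the one that would be passed on for pretraining.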

ctheodoris changed discussion title from About gene mdeidan dict pickle files to About gene median dict pickle files Jul 10, 2023

ctheodoris

Owner Jul 10, 2023

Thank you for your interest in Geneformer - yes, that's correct.

ctheodoris changed discussion status to closed Jul 10, 2023