About gene median dict pickle files
Thank you so much for the amazing work.
According to the tutorial (obtain_nonzero_median_digests.ipynb), there are three kinds of .pickle files generated, including gene_median_digest_dict.pickle, total_gene_median_dict.pickle, and detected_gene_median_dict.pickle.
For the purpose of pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M, which one should be used?
As I understand it, gene_median_digest_dict.pickle covers a single dataset, and total_gene_median_dict.pickle is generated by merging the gene_median_digest_dict.pickle files from several datasets. So if I have only one dataset, total_gene_median_dict.pickle is the same as gene_median_digest_dict.pickle, right?
And detected_gene_median_dict.pickle keeps only the medians of detected genes, by filtering out the NaN values in total_gene_median_dict.pickle.
So the final .pickle file that should be used for pretraining is detected_gene_median_dict.pickle, right?
Thank you for your interest in Geneformer - yes, that's correct.
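For reference, the filtering step described in the question can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: it assumes the total dictionary maps gene IDs to a numeric median, with NaN for genes that were never detected. The gene IDs, the values, and the helper name `filter_detected` are hypothetical.

```python
import math
import pickle


def filter_detected(total_gene_median_dict):
    """Keep only genes whose median is a real number (drop NaN entries)."""
    return {
        gene: median
        for gene, median in total_gene_median_dict.items()
        if not math.isnan(median)
    }


# Toy stand-in for the contents of total_gene_median_dict.pickle
# (hypothetical gene IDs and values, for illustration only).
total_gene_median_dict = {
    "ENSG_A": 2.5,
    "ENSG_B": float("nan"),  # gene never detected in the corpus
    "ENSG_C": 0.75,
}

detected_gene_median_dict = filter_detected(total_gene_median_dict)

# Round-trip through pickle in memory; writing to a file would use
# pickle.dump(detected_gene_median_dict, fp) on a file opened with "wb".
restored = pickle.loads(pickle.dumps(detected_gene_median_dict))
```

With a single dataset, the same filter applies to the medians computed from that one dataset, since there is nothing else to merge.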