🕰️ Llama-3.1-405B took ~39 million GPU-hours to train, i.e. about 4,500 years of compute on a single GPU.
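For scale, that conversion is simple arithmetic (a quick sanity check, not a figure from the Llama report):

```python
# 39 million GPU-hours expressed as years of single-GPU compute.
gpu_hours = 39e6
years = gpu_hours / (24 * 365)
print(f"{years:,.0f} years")  # ~4,452 years
```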
👴🏻 If training had really taken that long, we would have GPU stories from the time of the Pharaohs 👑: "Alas, Lord of the Two Lands, the shipment of counting-stones arriving from Cathay was lost to pirates; this shall delay the building of your computing temple by many moons."
🛠️ But instead, they just parallelized the training across 24k H100s, bringing it down to a few months. This required parallelizing along 4 dimensions: data, tensor, context, and pipeline. It is infamously hard to do, making for bloated code repos that hold together only by magic.
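To give a feel for what a 4D layout means in practice, here is a minimal, purely illustrative Python sketch (not Picotron's or Nanotron's actual code) that maps a flat GPU rank onto (data, pipeline, context, tensor) coordinates; the group sizes are made-up assumptions.

```python
# Illustrative sketch: unflattening a global GPU rank onto a 4D parallelism grid.
# The group sizes below are hypothetical, not Llama-3.1's actual configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelCoords:
    data: int      # which data-parallel replica this rank belongs to
    pipeline: int  # which pipeline stage it serves
    context: int   # which slice of the sequence (context parallelism) it holds
    tensor: int    # which shard of each weight matrix it holds

def rank_to_coords(rank: int, dp: int, pp: int, cp: int, tp: int) -> ParallelCoords:
    """Unflatten a global rank into (data, pipeline, context, tensor) coordinates,
    with tensor parallelism as the innermost (fastest-varying) dimension."""
    assert 0 <= rank < dp * pp * cp * tp
    tensor, rank = rank % tp, rank // tp
    context, rank = rank % cp, rank // cp
    pipeline, data = rank % pp, rank // pp
    return ParallelCoords(data, pipeline, context, tensor)

if __name__ == "__main__":
    # Hypothetical 24,576-GPU layout: 192 data x 16 pipeline x 2 context x 4 tensor.
    dp, pp, cp, tp = 192, 16, 2, 4
    print(dp * pp * cp * tp)                        # 24576 GPUs in total
    print(rank_to_coords(0, dp, pp, cp, tp))        # ParallelCoords(0, 0, 0, 0)
    print(rank_to_coords(24575, dp, pp, cp, tp))    # ParallelCoords(191, 15, 1, 3)
```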
🤏 But now we don't need huge repos anymore! Instead of building mega training codebases, Hugging Face colleagues cooked in the other direction, towards tiny 4D-parallelism libs. One team built Nanotron, already widely used in industry. And now a team releases Picotron, a radical approach that codes 4D parallelism in just a few hundred lines, a real feat of engineering that makes it much easier to understand what's actually happening!
⚡ It's tiny, yet powerful: measured in MFU (Model FLOPs Utilization, i.e. how much of the hardware's compute potential is actually used), the lib reaches ~50% on the SmolLM-1.7B model with 8 H100 GPUs, which is really close to what the huge libs reach. (Caution: the team is running further benchmarks to verify this.)
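For reference, a back-of-the-envelope sketch of how MFU is typically estimated (the ~6·N FLOPs-per-token training rule of thumb and the H100 peak throughput below are standard assumptions, and the tokens/sec figure is made up for illustration — none of these numbers come from the Picotron benchmark):

```python
# Back-of-the-envelope MFU estimate (a sketch, not Picotron's benchmark code).
# Training a dense transformer costs roughly 6 * n_params FLOPs per token
# (forward + backward); peak dense BF16 throughput of an H100 SXM is ~989 TFLOP/s.
def model_flops_utilization(n_params: float, tokens_per_sec: float,
                            n_gpus: int, peak_flops_per_gpu: float = 989e12) -> float:
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    return achieved_flops_per_sec / (n_gpus * peak_flops_per_gpu)

# Hypothetical throughput for SmolLM-1.7B on 8 H100s (illustrative number only):
print(f"{model_flops_utilization(1.7e9, 390_000, 8):.0%}")  # ~50%
```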