You may have noticed that last week, Microsoft and Nvidia announced they had trained "the world's largest and most powerful generative language model," known as "Megatron-Turing NLG 530B," as ZDNet's Chris Duckett reported[1].
The model, in this case, is a neural network program based on the "Transformer" approach that has become widely popular in deep learning. Megatron-Turing can produce realistic-seeming text and also perform various language tasks, such as sentence completion.
The news was somewhat perplexing in that Microsoft had already announced a program a year ago that seemed to be bigger and more powerful. While Megatron-Turing NLG 530B uses 530 billion neural "weights," or parameters, to compose its language model, what's known as "1T" has one trillion parameters.
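To get a rough sense of what those parameter counts mean in practice, here is a back-of-the-envelope sketch, assuming purely for illustration that each parameter is stored as a 16-bit (2-byte) floating-point number; the companies have not published this exact accounting, and training requires several times more memory for optimizer state and activations.

```python
# Back-of-the-envelope memory footprint for the two models' weights alone,
# assuming (as an illustration) 2 bytes per parameter (fp16 storage).
for name, params in [("Megatron-Turing NLG 530B", 530e9), ("1T", 1e12)]:
    bytes_fp16 = params * 2  # 2 bytes per fp16 parameter
    print(f"{name}: {params / 1e9:,.0f}B parameters "
          f"~ {bytes_fp16 / 1e12:.2f} TB of weights in fp16")
```

Even under that generous assumption, the weights alone run to roughly a terabyte or two, which is why such models are split across many GPUs.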
Microsoft's blog post[2] explaining Megatron-Turing linked to the GitHub repo[3] maintained by Nvidia's Jared Casper, where the various language models are listed along with their stats. Those stats show that not only is 1T bigger than Megatron-Turing NLG 530B, it also posts higher numbers on every performance figure, including the peak tera-FLOPs, or trillions of floating point operations per second, that were achieved.
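Those achieved tera-FLOPs figures measure sustained hardware throughput during training. As a rough illustration only (not the repo's exact accounting), a common rule of thumb is that training a transformer costs about 6 floating-point operations per parameter per token; dividing that total work by wall-clock time and GPU count gives achieved TFLOPs per GPU. The numbers below are hypothetical placeholders, not Nvidia's published stats.

```python
def achieved_tflops_per_gpu(params, tokens, seconds, num_gpus):
    """Rough throughput estimate using the common ~6 * N * D rule of thumb
    for transformer training FLOPs (N = parameters, D = training tokens).
    All inputs here are hypothetical placeholders, not Nvidia's figures."""
    total_flops = 6 * params * tokens
    return total_flops / (seconds * num_gpus) / 1e12  # TFLOPs per GPU per second

# Made-up example: a 530B-parameter model, 100B tokens,
# 2,000 GPUs, running for 30 days.
print(achieved_tflops_per_gpu(530e9, 100e9, 30 * 24 * 3600, 2000))
```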
So how can Megatron-Turing NLG 530B be the biggest if 1T is bigger by every measure? To resolve the matter, ZDNet spoke with Nvidia's Paresh Kharya, senior director of product marketing and management.
The key is that 1T was never "trained to convergence," a term meaning the model has been fully trained and can now be used to perform inference, the stage where predictions are made. Instead, said Kharya, 1T went through a limited number of training passes, known as "epochs," which did not lead to convergence.
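To make that distinction concrete, here is a minimal, purely illustrative gradient-descent sketch (nothing like the actual Megatron-Turing training code): one run stops after a fixed, small epoch budget, the way a partial training run would; the other loops until the loss essentially stops improving, which is what "trained to convergence" refers to.

```python
def loss(w):
    # Toy objective: squared error of a single weight against a target of 3.0.
    return (w - 3.0) ** 2

def grad(w):
    return 2 * (w - 3.0)

def train(w, lr=0.1, max_epochs=1000, tol=None):
    """Run gradient descent. If `tol` is set, stop when the loss improvement
    falls below it ("trained to convergence"); otherwise stop after a fixed
    epoch budget, like the limited training passes described for 1T."""
    prev = loss(w)
    for epoch in range(max_epochs):
        w -= lr * grad(w)
        cur = loss(w)
        if tol is not None and prev - cur < tol:
            return w, epoch + 1
        prev = cur
    return w, max_epochs

# Fixed, limited budget: the weight is still far from its optimum.
print(train(w=0.0, max_epochs=3))
# Trained to convergence: loops until the loss stops improving meaningfully.
print(train(w=0.0, tol=1e-8))
```

In the first call the model is usable but unfinished; in the second, training has run long enough that further epochs no longer help, which is the state Kharya says 1T never reached.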
As Kharya explains, "Training large models to convergence takes weeks and even months depending