A driving force of the competition between AI companies is the belief that bigger is better. GPT-4, the model that powers the most advanced version of ChatGPT, contains an estimated 1.8 trillion parameters, or variables that determine how it responds to inputs. That’s roughly ten times more than the 175 billion parameters possessed by its predecessor, GPT-3—and about 1,200 times more than those contained in GPT-2. The datasets used to train these models also continue to grow. OpenAI used 570GB of Internet text data to train GPT-3—a massive expansion beyond the 8 million web pages used to train GPT-2.
Bigger large language models (LLMs) have gotten better at producing humanlike, coherent, and contextually appropriate text. But this improvement has come with costs. Researchers estimated that training GPT-3 consumed roughly as much energy as 120 American households use over the course of a year. An October study projected that by 2027, the AI sector could have an annual energy consumption roughly equivalent to that of the Netherlands.
The exponential growth of LLMs’ size—and energy consumption—isn’t likely to stop any time soon. As OpenAI, Google, Meta, and other companies race to develop better models, they’ll probably rely on expanding datasets and adding more parameters. But some researchers question the rationale behind this competition. At the first session of the Harvard Efficient ML Seminar Series, researcher Sara Hooker likened the pursuit of larger and larger models to “building a ladder to the moon”: costly and inefficient, with no realistic endpoint.
“Why do we need [these models] so big in the first place?” asks Hooker, head of the nonprofit research group Cohere For AI. “What is this scale giving us?” Some benefits of size can also be achieved through other techniques, researchers say, such as efficient parameterizations (activating only the relevant parameters for a given input) and meta-learning (teaching models to learn independently). Instead of pouring immense resources into the pursuit of ever-bigger models, AI companies could contribute to research developing these efficiency methods.
One draw of bigger models is that they seem to develop “emergent properties,” or behaviors that weren’t explicitly programmed into a system: the sudden ability to produce multilingual responses, for instance, or solve math problems. When these abilities emerged from LLMs, it “was mind-blowing to a lot of people in the field,” says Jonathan Richard Schwarz, an AI researcher at Harvard Medical School and the seminar series’ lead organizer. Earlier AI models had been explicitly programmed to learn from experience, such as image recognition systems that can recognize new categories after being shown fewer than a dozen examples. But LLMs did not contain algorithms that allowed them to learn and adapt; they were simply trained to predict Internet text. The fact that they seemed to be learning from experience anyway was exciting—and many assumed this was happening because of their sheer size.
But these emergent properties didn’t just appear “out of nowhere,” Hooker says. Often, “the data is there in the pre-training,” she says. “It’s just that the dataset is so big that we don’t know it’s there.” In the case of multilingual abilities, training datasets may have included multilingual sources—developers simply thought they were including only English-language sources. Larger models are also more likely to capture rare data points, resulting in unexpected outcomes. But that doesn’t mean only large models can achieve this. The question becomes, Hooker says, how smaller models can be taught to similarly memorize those data points.
One way to encourage the emergence of similar abilities in smaller models is by training them in meta-learning, or teaching them to learn on their own. “The properties that seem to emerge almost by accident,” Schwarz says, might also arise in smaller models “if you found a way of more directly encouraging that behavior.” Researchers can draw from meta-learning techniques used in non-LLM AI models. But more research would also be useful: “There’s a lot of room for making these techniques more stable and easier to use,” Schwarz says, “and to reduce forgetting,” or the tendency of models to lose previously learned information when they are trained on something new.
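To make the idea concrete, here is a minimal sketch of one first-order meta-learning scheme (in the spirit of Reptile), written in PyTorch. The toy model, the randomly generated regression “tasks,” and the learning rates are illustrative assumptions rather than the specific techniques Schwarz describes; the point is only the structure: an inner loop adapts a copy of the model to a single task, and an outer loop nudges the shared weights toward that adapted copy so that future tasks can be learned in just a few steps.

```python
# A toy first-order meta-learning loop (in the spirit of Reptile). The model,
# the random regression tasks, and the hyperparameters are illustrative
# assumptions, not the article's specific methods.
import copy
import torch
import torch.nn as nn

def sample_task():
    """Hypothetical task sampler: a small regression problem with its own linear rule."""
    x = torch.randn(16, 8)
    w = torch.randn(8, 1)
    return x, x @ w

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

for meta_step in range(1000):
    x, y = sample_task()
    # Inner loop: adapt a copy of the model to this single task.
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    # Outer loop: move the shared initialization toward the adapted weights,
    # so that new tasks can be learned with only a few gradient steps.
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p += meta_lr * (p_adapted - p)
```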
The data used in training also matters. “If you have a lot of junk in your dataset, you spend a lot of extra capacity and extra model size trying to deal with it,” Hooker says. “By focusing on data quality, you can get away with smaller and more highly performing models.” Pruning huge datasets isn’t easy: even if one figures out a way to distinguish between high- and low-quality data, there is the additional question of redundancy within the good data. Here is another place to direct resources: “It’s a massive inflection point in our field,” Hooker says, “where we’ve moved away from the algorithm and now think about what the data is and how it interacts with the algorithm.”
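As a rough illustration of what such pruning involves, the sketch below applies a crude quality filter and drops exact duplicates from a toy corpus. The heuristics here (a length cutoff, an alphabetic-character ratio, and hashing of whitespace-normalized text) are assumptions made for the example; real pipelines use far more sophisticated filters and near-duplicate detection.

```python
# A toy pruning pass: keep documents that pass a crude quality check and drop
# exact duplicates of documents already kept. Thresholds are illustrative.
import hashlib

def looks_high_quality(doc: str) -> bool:
    """Very rough quality heuristic: long enough and mostly alphabetic text."""
    if len(doc) < 200:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    return alpha_ratio > 0.6

def dedup_key(doc: str) -> str:
    """Hash of lowercased, whitespace-normalized text, so trivial copies collide."""
    normalized = " ".join(doc.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def prune(corpus):
    seen, kept = set(), []
    for doc in corpus:
        if not looks_high_quality(doc):
            continue                  # discard low-quality documents
        key = dedup_key(doc)
        if key in seen:
            continue                  # discard redundant copies of good documents
        seen.add(key)
        kept.append(doc)
    return kept
```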
Not only do many models train on more data than they need, she argues; they also contain far more parameters than most inputs require. A 2013 paper found that many deep learning models contain significant redundancy in their parameters. And of the millions or billions of parameters possessed by a model, only a fraction are required to process most inputs. The rest are needed only for fringe examples, “like if you keep your finger on the J key, and just input a page of J,” Hooker says. “That’s not a coherent pattern, so the model will spend a lot of time trying to understand what that is—and that’s not a good use of capacity.” Other fringe examples, though, are worth the capacity, such as text in rare languages.
To address this problem, researchers can develop efficient parameterization systems, activating only the relevant parameters for a given input. “Instead of training my model of billions or trillions of parameters to be a single sort of system, I can instead try to actively encourage different parts of the model to specialize in different types of problems,” Schwarz says. This technique—called “mixture of experts”—consists of separating parameters into clusters (“experts”) and activating only the relevant experts for a given query. The challenge then becomes identifying which experts are needed for a given input. Research is ongoing to develop ways of doing this, such as by detecting keywords.
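The sketch below shows the general shape of this idea: a small sparse mixture-of-experts layer with a learned gating network that routes each input to its top two experts and mixes their outputs. The dimensions, expert count, and routing rule are illustrative assumptions, not the design of any particular production model.

```python
# A toy sparse mixture-of-experts layer. The gate scores each input, only the
# top-k experts run for that input, and their outputs are combined with
# softmax weights. All sizes here are made-up illustrative values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(dim, n_experts)      # learned router
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, dim)
        scores = self.gate(x)                      # (batch, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)    # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = top_idx[:, slot] == e     # inputs whose slot-th choice is expert e
                if routed.any():
                    out[routed] += weights[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out

moe = SparseMoE()
y = moe(torch.randn(4, 64))   # each input activates only 2 of the 8 experts
```

In this setup, only the selected experts do any computation for a given input, which is why such models can grow their total parameter count without a proportional increase in the cost of each query.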
Experience in efficiently training such large-scale machine learning systems is in relatively “short supply,” according to the Efficient ML Seminar Series organizers. The organizations creating the most resource-intensive models, such as Meta and OpenAI, have not shared details about their energy consumption. Without that information—and subsequent public pressure—the incentives may continue to encourage size over efficiency.
Hooker believes that a solution could lie in a centralized system that rates models for energy efficiency: “In the same way we rate buildings for energy standards, we can do the same thing for models,” she says. “I’m in favor of badges like this, that reflect the energy standards that were used during training. Because right now it’s a black box, and we don’t communicate what the cost is.”