Don't just look for the biggest model; instead, look for a right-sized model that solves your problem
Is a bigger machine learning model always a better model? It might look like it, but it's not that simple. It mainly depends on what you want to achieve. Smaller can be smarter. And cheaper.
- News: Large language models' surprise emergent behavior written off as 'a mirage'
- Summary: "Forget those huge hyped-up systems, that smaller one might be right for you. And here's why."
- By: Thomas Claburn, 2023 (via The Register)
Published in January 2024
With everybody talking about the latest Artificial Intelligence (AI) trend around “Large Language Models” (LLMs), I found that an article in The Register provided an interesting perspective on this cutting-edge technology.
The article starts with a reference to an academic paper published in the
“Transactions on Machine Learning Research” (TMLR) journal in August 2022:
“Emergent Abilities of Large Language Models”
(arXiv:2206.07682v2 [cs.CL], 26 Oct 2022).
The paper comes out of Google Research, Stanford University, UNC Chapel Hill, and DeepMind (a Google acquisition founded in the UK).
“Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.”
This sounds like bigger is better. However, most of the article is based on a different preprint paper from 2023: “Are Emergent Abilities of Large Language Models a Mirage?” (arXiv:2304.15004 [cs.AI]) by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo.
That paper tries to put these new “emergent” capabilities into a wider context:
“Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale.”
“[…] we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.”
The Register article opens with a very concise summary:
“GPT-3, PaLM, LaMDA and other next-gen language models have been known to exhibit unexpected “emergent” abilities as they increase in size. However, some Stanford scholars argue that’s a consequence of mismeasurement rather than miraculous competence.”
This is relevant because, for many people, LLMs - or more accurately ChatGPT and other LLM applications - appear to be the solution to everything, not least to their own business and non-business tasks.
“The idea that some capability just suddenly appears in a model at a certain scale feeds concerns people have about the opaque nature of machine-learning models and fears about losing control to software. Well, those emergent abilities in AI models are a load of rubbish, say computer scientists at Stanford.”
But due to the nature of these models, it’s difficult or even impossible to understand exactly why some input produces a certain output. And of course, relying on the non-deterministic behaviour of machines or algorithms is not necessarily a great idea. But perhaps it’s not as bad as it sounds!
“Stanford’s Schaeffer, Miranda, and Koyejo propose that when researchers are putting models through their paces and see unpredictable responses, it’s really due to poorly chosen methods of measurement rather than a glimmer of actual intelligence.”
At first, this all sounds very abstract, but it’s really pretty simple:
“The issue with using such pass-or-fail tests to infer emergent behavior, the researchers say, is that nonlinear output and lack of data in smaller models creates the illusion of new skills emerging in larger ones. Simply put, a smaller model may be very nearly right in its answer to a question, but because it is evaluated using the binary Exact String Match, it will be marked wrong whereas a larger model will hit the target and get full credit.”
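To make that concrete, here is a minimal toy sketch (my own illustration, not the paper’s code or data): the same gradual improvement across a few hypothetical model sizes registers as a sudden jump under a binary exact-string-match metric, but as a smooth curve under a continuous similarity metric.

```python
# Toy illustration: apparent "emergence" under a binary metric vs. smooth
# improvement under a continuous one. The outputs below are made up and
# only stand in for progressively larger models of one family.

from difflib import SequenceMatcher

TARGET = "2717"  # correct answer to a multi-digit arithmetic task

# Hypothetical answers from increasingly large models
outputs = ["31", "280", "2600", "2710", "2717"]

def exact_match(pred: str, target: str) -> float:
    """Binary metric: full credit only for a perfect answer."""
    return 1.0 if pred == target else 0.0

def similarity(pred: str, target: str) -> float:
    """Continuous metric: partial credit for nearly-right answers."""
    return SequenceMatcher(None, pred, target).ratio()

for size, pred in zip(["S", "M", "L", "XL", "XXL"], outputs):
    print(f"{size:>3}: exact={exact_match(pred, TARGET):.1f}  "
          f"similarity={similarity(pred, TARGET):.2f}")

# Exact match stays at 0.0 and jumps to 1.0 only for the largest model,
# while the similarity score rises steadily with scale.
```

Under the binary metric the capability seems to “emerge” at the largest size; under the continuous metric the smaller models were simply getting closer all along.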
The conclusion is probably even more relevant for many commercial applications, where model size ultimately translates into cost:
“It means smaller models, which are more affordable to run, aren’t deficient because of some test deviation and are probably good enough to do the required job.”
Finally, something that has nothing to do with LLMs. While browsing through the papers, I found an interesting website I hadn’t come across before: OpenReview
I like the idea behind the project:
“OpenReview aims to promote openness in scientific communication, particularly the peer review process, by providing a flexible cloud-based web interface and underlying database API”
(Prompt for Craiyon V3 to generate the header image: “Show a small pile of £ money in front of a small machine learning algorithm. Next to it a large pile of £ money in front of a big machine learning algorithm. Show a trend line between the two piles to illustrate the cost difference.” / Style: Drawing.)