The Tiered Approach Explained
What?
In the context of the EU AI Act, the tiered approach is a regulatory approach in which foundation models above a given threshold face requirements that are more demanding than those applying to models below that threshold. The principle behind the tiered approach is already applied in many sectors of EU regulation: finance, social media, privacy, etc.
Why?
Keeping the regulatory burden proportionate to the level of risk: it spares the many low-risk models from unwarranted regulatory burdens.
Ensuring that the risks of systems deployed at large scale, like ChatGPT, are regulated: the most generally capable systems, such as ChatGPT, are deployed to hundreds of millions of users. Hence, they must receive a commensurate level of scrutiny before being deployed.
Putting the regulatory burden on large actors: all thresholds currently on the table affect very few actors (fewer than 25), namely those able to build the larger models that carry most of the risk. These actors are all capitalised at more than 100 000 000€. Spending a few million on safety and compliance is not a barrier for such actors. On the other hand, the tiered approach spares smaller startups, capitalised with only a few hundred thousand euros of funding, from a heavy regulatory burden, which they would have to bear if the regulation targeted the use case instead of foundation models.
Requiring developers to improve their understanding of the risks of models they do not understand: surprising as it may sound, developers of large foundation models do not understand the risks of their models well, essentially because they do not know what their models are capable of. This is why foundation models are called "black boxes". Since developers do not fully understand their systems, regulation must ensure that they apply sufficient scrutiny to mitigate potential risks before deploying them to thousands of customers.
How?
Answering the "How?" requires answering three questions:
Do we already have measures of risks?
How do we set the thresholds?
Given the pace of the technology, is there a legal structure that allows us to adapt and update those thresholds & criteria?
1. Measures of risks
Current risks arise from increasingly general and capable models. Hence, measuring risks among foundation models is mostly about measuring capabilities. There are currently five metrics that correlate well with the level of capabilities of a system. All of these metrics are captured in the graphic below, which shows how capabilities (each colored curve summarizes a particular set of evaluated capabilities) evolve as a function of computing power (in FLOP), data, model size, and loss (how far a model is from perfect prediction).
As you can see, capabilities and hence risks increase relatively predictably with the following four metrics:
Computing power (often called "compute"), measured in FLOP: Most capabilities of modern foundation models arise from their massive scale. There is little difference between the four successive GPTs, OpenAI's flagship models, other than their scale. Each GPT uses 100 times more computing power than the previous one, which causes reliable increases in capabilities: the capabilities of each GPT are drastically different despite comparable architectures. This is the result of what we call scaling laws, i.e. the observation that a system developed with more "compute" will predictably have more capabilities, which makes it possible to approximate its level of capabilities on known benchmarks (see the illustrative sketch after this list).
Data: More computing power goes hand in hand with the larger datasets used to develop models. Furthermore, a model of any given size can be trained on more data in order to gain more capabilities. Hence, thresholds on data are also useful for foundation models.
Size: Models that are larger and trained optimally are more powerful than smaller models. Hence, in practice, risks are mostly caused by systems with a large number of parameters.
Loss: Less prominently, well-defined relations exist between the loss and the computing power of AI systems (the scaling laws mentioned above); the loss also decreases steadily and is correlated with risks.
Beyond that, the science of capability measurement (called evals) is moving forward and allows us to estimate the capabilities of a system with increasing precision. This is a fifth criterion that should be used as it becomes more precise.
Thresholds based on a combination of two of the five criteria are sufficiently precise to be hard to circumvent and well correlated with risks.
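To illustrate what a scaling law looks like in practice, here is a minimal sketch in Python. The constants are hypothetical placeholders rather than values from any published scaling law; the point is only to show how a simple power law lets one anticipate the loss, and hence a rough level of capabilities, reached for a given training compute budget.

    # Illustrative sketch of a scaling law: loss falls as a smooth power law of
    # training compute, which is what makes capability levels roughly predictable.
    # The constants below are hypothetical placeholders, not published values.

    IRREDUCIBLE_LOSS = 1.7   # assumed loss floor that no amount of compute removes
    COEFFICIENT = 11.0       # assumed scale of the reducible part of the loss
    EXPONENT = 0.05          # assumed rate at which loss falls with compute

    def predicted_loss(compute_flop: float) -> float:
        """Toy power law mapping training compute (FLOP) to expected loss."""
        return IRREDUCIBLE_LOSS + COEFFICIENT * compute_flop ** -EXPONENT

    for flop in (1e23, 1e24, 1e25, 1e26):
        print(f"{flop:.0e} FLOP -> predicted loss {predicted_loss(flop):.2f}")

For transformer-style models, a common rule of thumb is that training compute is roughly 6 × (number of parameters) × (number of training tokens), which is one reason why thresholds on compute, data and model size tend to move together.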
2. What Thresholds?
Everyone agrees that thresholds should be improved over time, as the science of measuring capabilities and risks improves. We explain in 3) how to do so. But until then, we still need to manage risks. There are ongoing discussions about which temporary threshold makes the most sense for a tiered approach.
What threshold in the short run?
Most agree that, for a tiered approach, the threshold implemented in the short run should be between 10^23 and 10^26 FLOP. For reference, here are some models, the compute needed to train them, and the amount of money at stake to acquire the necessary compute (a short sketch of this arithmetic follows the list):
BERT (used by industry for many simple tasks): 10^20 FLOP
GPT-J (widely used in open-source communities): 10^21 FLOP
Mistral's best model: not reported, but likely below 10^23 FLOP
LLaMa-2: 10^23 FLOP (750 000€ for the compute only)
ChatGPT-3.5 (free version): 10^24 FLOP (7 500 000€ for the compute only)
ChatGPT-4: 10^25 FLOP (75 000 000€ for the compute only)
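As a rough cross-check of the cost figures above, here is a minimal sketch of the arithmetic. It assumes a hypothetical flat price of 7.5 × 10^-18 € per FLOP, chosen so that the outputs line up with the figures quoted in this document; real costs depend on hardware, utilisation and negotiated prices.

    # Order-of-magnitude arithmetic linking training compute (FLOP) to compute cost.
    # EUR_PER_FLOP is a hypothetical flat rate chosen to match the figures above;
    # it is not an authoritative market price.

    EUR_PER_FLOP = 7.5e-18

    training_compute_flop = {
        "LLaMa-2": 1e23,
        "ChatGPT-3.5": 1e24,
        "ChatGPT-4": 1e25,
        "hypothetical next-generation model": 1e26,
    }

    for name, flop in training_compute_flop.items():
        print(f"{name}: {flop:.0e} FLOP -> ~{flop * EUR_PER_FLOP:,.0f}€ of compute")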
Some of the options discussed are listed below; a short sketch afterwards illustrates what each threshold captures:
Regulating the riskiest systems available to all bad actors: it would require a 10^23 FLOP threshold and affect fewer than 25 developers, all capitalised at more than 100 000 000€ and all having spent at least 750 000€ on computing power.
Regulating no existing foundation model (i.e. ChatGPT remains unregulated), but regulating the next generation of models developed by Big Tech companies: it would require a 10^26 FLOP threshold and affect 0 current developers and probably up to 5 developers next year, all capitalised at more than 1 000 000 000€ and spending 750 000 000€ on compute alone.
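To see concretely what each option captures, here is a minimal sketch that applies both candidate thresholds to the compute figures quoted above; the figures are the rough estimates from this document, not authoritative measurements.

    # Which of the models listed above each candidate threshold would place
    # in the more heavily regulated tier (compute figures as quoted above).

    training_compute_flop = {
        "LLaMa-2": 1e23,
        "ChatGPT-3.5": 1e24,
        "ChatGPT-4": 1e25,
    }

    for threshold in (1e23, 1e26):
        covered = [name for name, flop in training_compute_flop.items() if flop >= threshold]
        print(f"{threshold:.0e} FLOP threshold covers: {covered or 'no current model'}")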
Signatory Professor Bengio supports a threshold of 10^26 FLOP on the basis that:
it will affect no one currently and won't stop any European company.
it is reached only once companies reach billion-scale investment, at which point they can easily afford compliance.
it should still prevent many of the most extreme risks, especially from the next generation of foundation models.
Signatory SaferAI supports a threshold of 10^23 FLOP on the basis that models trained with approximately this amount of computing power, like LLaMa-2, require risk management, because they:
already present potentially significant risks once one accounts for the enhancements that can be added on top of them
cannot be shown not to democratize access to biological weapons development for ordinary citizens, for instance by providing most of the instructions necessary to obtain, reconstitute and release the 1918 influenza virus (Esvelt et al., 2023)
empower bad actors such as North Korea with a means to enhance their cyberoffence capabilities, as reported by US national security officials
democratize access to mass manipulation, which can be used for political campaigning, for ads, or to extract ransoms from citizens, the elderly and children (see below).
3. How do we ensure that we can update thresholds and criteria?
Adapting thresholds or criteria without needing to amend the regulation itself is not uncommon in EU law. There are various ways to achieve this, one of the most commonly cited being delegated acts, which allow the European Commission to supplement or amend elements of a piece of legislation within a limited scope. This could apply to the criteria and thresholds defined for the tiered approach.