BY DANIEL FANG
The Coming Data Drought: How AI Firms Are Adapting
For years, artificial intelligence has developed at an exponential pace. Each new model is larger, faster and more capable than the last, and this growth has been fuelled primarily by the vast supply of information available on the internet. Recently, however, researchers have begun to ask: what happens when we run out of data? Data itself is practically limitless, generated by users every day, but the high-quality text used to train AI models, known as training data, is now running short.
This is worrying because AI systems depend on large banks of high-quality text, images and code to learn and improve. Better algorithms and greater computational power will play a part, but the process becomes far harder without fresh data to consume. Experts now warn that we are approaching "peak data": the point at which there won't be enough fresh, usable material left online to feed the growing appetite of AI models.
So, what are the leading AI developers doing to alleviate this issue?
OpenAI and the Limits of the Internet
When OpenAI released GPT-4 in 2023, it was clearly a major breakthrough for the industry. The model could analyse literature, solve equations and even simulate emotional tone. But the simple truth behind GPT-4 is that it had already been trained on most of the accessible text on the internet. Ilya Sutskever, then OpenAI's chief scientist, himself admitted that "most of the useful text online has already been used," and that the rapid development of AI would "unquestionably end" when the source is completely depleted.
Since then, OpenAI has utilised two methods to mitigate this issue. First, the company has started to generate synthetic data (data produced by AI itself) to expand its training supply. Second, it has invested in Reinforcement Learning from Human Feedback (RLHF), where people rate the quality of AI-generated answers so that future versions can learn from those judgements.
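The core idea behind RLHF can be sketched in miniature: a human compares two answers, and a simple "reward model" is nudged until it scores the preferred answer higher. Everything here, the answers, the features and the update rule, is invented purely for illustration; real systems use neural reward models trained on thousands of such comparisons.

```python
# Toy illustration of the RLHF idea: humans compare pairs of answers,
# and a tiny linear "reward model" learns to prefer the winner.
# The features and example answers are hypothetical.

def features(answer: str) -> list[float]:
    # Invented features: answer length and whether it cites a source.
    return [len(answer.split()) / 10.0, 1.0 if "source:" in answer else 0.0]

def score(weights: list[float], answer: str) -> float:
    return sum(w * f for w, f in zip(weights, features(answer)))

def update(weights, preferred, rejected, lr=0.1):
    # If the model currently disagrees with the human judgement,
    # nudge the weights towards the preferred answer (perceptron-style).
    if score(weights, preferred) <= score(weights, rejected):
        fp, fr = features(preferred), features(rejected)
        return [w + lr * (p - r) for w, p, r in zip(weights, fp, fr)]
    return weights

# A single human judgement: the answer with a citation was rated better.
preferred = "The sky is blue due to Rayleigh scattering. source: textbook"
rejected = "The sky is blue."

weights = [0.0, 0.0]
for _ in range(20):
    weights = update(weights, preferred, rejected)

assert score(weights, preferred) > score(weights, rejected)
```

After a few updates the model ranks the answers the way the human did, which is exactly the signal used to steer future model versions.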
The challenge is balance. Synthetic data is easy to make but can be misleading: models trained on their own outputs risk repeating and amplifying their mistakes. Human feedback is far more reliable but also far more costly and time-consuming. In practice, the two must be combined, each correcting the other just enough to keep progress alive.
Meta and the Race for Refinement
Meta has chosen a different approach. Rather than waiting for more data to appear, it is trying to make existing data more useful. In 2025, the company invested billions in Scale AI, one of the world's largest data-labelling firms, and also brought in its founder, Alexandr Wang, as Chief AI Officer to ensure that future models are trained on high-quality annotated material.
At the same time, Meta’s research division is pushing for self-supervised learning, where models teach themselves by predicting missing parts of the information they see. With techniques like this, Meta hopes to reduce its dependence on human-labelled datasets altogether.
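The principle of self-supervised learning, predicting a hidden part of the data from the rest, can be shown with an almost trivially small sketch. The tiny corpus and the bigram "model" below are purely illustrative; real systems mask words in billions of sentences and train large networks to fill the blanks, with no human labels at any step.

```python
# A minimal sketch of self-supervised learning: mask a word in raw text
# and predict it from its neighbour, using no human labels at all.
# The corpus and the bigram model are illustrative only.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Learn from the raw text itself: which word tends to follow each word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_masked(prev_word: str) -> str:
    # Fill in the blank after prev_word with its most frequent follower.
    return follows[prev_word].most_common(1)[0][0]

print(predict_masked("the"))  # predicts "cat", the most common follower
```

The "label" (the masked word) comes from the data itself, which is why the technique reduces dependence on human annotators.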
This dual strategy — human precision combined with machine independence — represents Meta’s attempt to avoid the bottleneck by improving quality rather than quantity. It is less about finding new stacks of information and more about refining and perfecting the ones we already have.
NVIDIA and the Creation of Synthetic Worlds
For NVIDIA, the data bottleneck is an engineering issue. Its Omniverse Replicator allows developers to build lifelike 3-D simulations from which AI systems can learn. If a factory robot needs to recognise defective parts, engineers can create thousands of virtual examples instead of waiting for real ones. In these digital worlds, every object is perfectly labelled, every condition precisely controlled. The company is, in effect, turning computation into a new source of data.
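The appeal of simulation is that labels come free: because the program generates each example, it already knows the ground truth. The sketch below mimics that idea for the factory-robot scenario with a made-up one-number "measurement" per part; a real pipeline such as Omniverse Replicator renders full 3-D scenes instead.

```python
# A rough sketch of the synthetic-data idea: simulate parts with a known
# defect status, so every example is perfectly labelled by construction.
# The defect model (a shifted width measurement) is entirely invented.
import random

random.seed(0)

def make_part():
    # Defective parts are simulated slightly out of spec.
    defective = random.random() < 0.3
    width = random.gauss(10.0, 0.1) + (0.8 if defective else 0.0)
    return width, defective  # the label comes free with the simulation

dataset = [make_part() for _ in range(1000)]

# A trivial "detector" evaluated on the perfectly labelled synthetic set.
threshold = 10.4
correct = sum((width > threshold) == defective for width, defective in dataset)
print(f"accuracy on synthetic set: {correct / len(dataset):.0%}")
```

Generating a thousand labelled examples takes milliseconds here, which is the whole point: computation substitutes for scarce real-world data.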
In that sense, NVIDIA may be providing us with a glimpse into the future — one where synthetic worlds play a key role in generating data for AI as a new source of growth.
Better Data > Better Models
A decade ago, breakthroughs in AI came from designing smarter algorithms. Today, however, progress depends on collecting better data. This shift, often referred to as data-centric AI, suggests that the future of this industry lies not in building ever-larger networks but in refining the material that feeds them.
Andrew Ng, one of AI’s early pioneers, has said that “smaller amounts of high-quality data often outperform mountains of mediocre data.” His point is simple: better data makes better intelligence.
Companies like Scale AI and Labelbox have turned this principle into an industry, employing thousands of human annotators to check and correct the material AI trains on. This heavy reliance on human labour shows that, ultimately, the data bottleneck is not just a technical problem.
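A data-centric workflow spends its effort auditing and cleaning examples rather than collecting more of them. The toy records and cleaning rules below (drop empty texts, exact duplicates, and inconsistently labelled items) are invented to illustrate the kind of checks annotation teams perform at scale.

```python
# A small sketch of data-centric cleaning: audit existing examples
# instead of adding more. Records and quality rules are illustrative.

raw = [
    {"text": "Paris is the capital of France.", "label": "fact"},
    {"text": "paris is the capital of france.", "label": "fact"},  # duplicate
    {"text": "", "label": "fact"},                                 # empty
    {"text": "The moon is made of cheese.", "label": "fact"},      # conflicting
    {"text": "The moon is made of cheese.", "label": "myth"},      # labels
]

def clean(records):
    # Flag texts that received more than one label across annotators.
    conflicts = {}
    for r in records:
        conflicts.setdefault(r["text"].lower(), set()).add(r["label"])

    seen, kept = set(), []
    for r in records:
        key = r["text"].lower()
        if not r["text"]:
            continue  # drop empty examples
        if key in seen:
            continue  # drop exact duplicates
        if len(conflicts[key]) > 1:
            continue  # set aside conflicting labels for human review
        seen.add(key)
        kept.append(r)
    return kept

print(len(clean(raw)))  # only the clean, consistently labelled example survives
```

Of five raw records, only one survives, which is Ng's point in miniature: a small, trustworthy dataset beats a large, noisy one.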
An Overview of the Different Paths Taken
Each major player has tackled the problem in their own way:
- OpenAI relies on a mix of synthetic generation and human feedback, prioritising reliability over speed.
- Meta invests in refinement and self-supervised methods to stretch every piece of information further.
- NVIDIA builds entire virtual environments, moving away from reality to simulation.
Shifting Priorities
In short, we are entering a stage where AI must evolve from consuming data to understanding it. Rather than mindlessly acquiring ever more material, models should be built to interpret and make better use of the vast store of information they already have.
The bottleneck may, in the end, work in our favour, forcing us to step back and reflect on how we employ AI. It encourages a move away from competition-driven expansion and towards intentional innovation, where models are purposefully built to tackle specific challenges precisely and efficiently.
Bibliography
- Business Insider (2025). AI Has Already Run Out of Training Data, Goldman’s Data Chief Says. https://www.businessinsider.com/ai-training-data-shortage-slop-goldman-sachs-2025-10
- TechCrunch (2025). Can Scale AI and Alexandr Wang Reignite Meta’s AI Efforts? https://bestofai.com/article/can-scale-ai-and-alexandr-wang-reignite-metas-ai-efforts-techcrunch
- TIME Magazine (2025). Alexandr Wang on AI’s Potential and Its Deficiencies. https://time.com/7296215/alexandr-wang-interview/
- MIT Sloan (2022). Why It’s Time for Data-Centric Artificial Intelligence. https://mitsloan.mit.edu/ideas-made-to-matter/why-its-time-data-centric-artificial-intelligence
- NVIDIA Technical Blog (2023). Training Defect Detection Models Using Synthetic Data with Omniverse Replicator. https://developer.nvidia.com/blog/how-to-train-a-defect-detection-model-using-synthetic-data-with-nvidia-omniverse-replicator
- Label Your Data (2020). The Big Data Labelling Challenge. https://labelyourdata.com/articles/the-crisis-of-ai-the-big-data-labeling-challenge
- Andreessen Horowitz (2024). Unlocking AI’s Future: The Power of Frontier Data. https://a16z.com/frontier-data-foundries-alex-wang-scale-ai
- Meta AI Blog (2021). Self-Supervised Learning: The Dark Matter of Intelligence. https://www.christianhaller.me/blog/projectblog/2021-05-03-Self-Supervised/

