I first saw the initial version of this paper in late 2022, and I thought it was just another empirical claim. However, the June 2024 version (Paper attached in the comments) caught my attention. Over the past two years of working with AI, I’ve gained a deeper understanding of the data challenge we face.

The paper highlights that as large language models (LLMs) continue to grow, they will soon exhaust the available stock of public human-generated text data.

🎯 This could happen as early as 2026 to 2032, depending on how models are trained.

The datasets used to train current LLMs are already enormous, running into the trillions of tokens. Yet the total stock of public human-generated text is finite.

As we roll into 2025, this issue is becoming more prominent in discussions and news articles. The pressure is mounting on corporations to use data responsibly and adhere to emerging government regulations regarding data usage and privacy.

Put simply – the “wow factor” of new AI models may begin to diminish as they hit the limits of available training data.

To address this challenge, the authors propose several strategies:
1️⃣ synthetic data generation,
2️⃣ transfer learning,
3️⃣ and enhancing data efficiency.

While I agree that synthetic data can help fill gaps, it raises questions about whether it can truly replace human-generated content.

If I had to place a bet – moving forward, we may need to shift our focus from merely improving models to deploying AI in practical use cases. That shift could generate more real-time data that can be leveraged for future training.

📍 This is a pivotal time in AI’s journey and a topic that is going to gain momentum in 2025.

Download the full paper here