AI Giants Grapple with Data Dearth: Will the Internet Be Enough?

Written by Rich Ord

In the ever-evolving landscape of artificial intelligence (AI), a new challenge has emerged: the scarcity of digital information. As a recent Wall Street Journal article noted, AI behemoths like OpenAI and Google are racing to develop increasingly powerful systems while grappling with a dilemma: the internet might be too small for their grand plans.

These AI systems’ appetite for training data is rapidly outpacing the available pool of quality public data online. The imbalance is exacerbated by data owners increasingly blocking AI companies’ access, citing concerns over privacy and fair compensation.

“The demand for high-quality text data could outstrip supply within two years, potentially slowing AI’s development,” warns a concerned executive in the industry.

Amidst this data drought, AI companies are searching for untapped information sources and rethinking their training strategies. OpenAI, renowned for creating ChatGPT, is reportedly considering training its next model, GPT-5, on transcriptions of public YouTube videos—a testament to the industry’s hunger for alternative data sources.

However, concerns loom over the use of AI-generated, or synthetic, data for training, with many researchers fearing it could lead to crippling malfunctions. Yet, amidst the secrecy that shrouds these endeavors, executives remain steadfast in pursuing solutions, viewing them as potential competitive advantages.

“The data shortage is a frontier research problem,” remarks Ari Morcos, an AI researcher and founder of DatologyAI. His company, backed by AI pioneers, is at the forefront of developing tools to improve data selection—a crucial step towards alleviating the industry’s data woes.

But data scarcity is just one piece of the puzzle. The shortage of chips required to power large language models and concerns about data center capacity and energy consumption further compound the industry’s challenges.

OpenAI’s most advanced language model, GPT-4, was reportedly trained on trillions of tokens, setting a new standard for AI capabilities. However, estimates suggest that future models, like GPT-5, may require even larger datasets, exacerbating the data shortage.

“The biggest uncertainty is what breakthroughs you’ll see,” muses AI researcher Pablo Villalobos. His comparison to “peak oil” underscores the potential for technological advancements to address resource constraints—an optimism shared by many in the AI community.

Yet, questions loom over data quality and privacy as the industry navigates these challenges. Social media platforms and news publishers are increasingly restricting access to their data, while public reluctance to share private conversational data limits available resources.

In response, AI companies are exploring innovative data selection and generation approaches. Curriculum learning, which involves feeding data to language models in a specific order, shows promise in optimizing training efficiency.
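
As a rough illustration of what "feeding data in a specific order" can mean, here is a minimal Python sketch of curriculum ordering. It is not how any company named here trains its models; the function names and the difficulty proxy (word count) are assumptions chosen purely for illustration.

```python
from typing import Callable, Iterator, List, Sequence


def difficulty_by_length(text: str) -> int:
    """Hypothetical difficulty proxy: texts with more words count as harder."""
    return len(text.split())


def curriculum_batches(
    examples: Sequence[str],
    batch_size: int,
    difficulty: Callable[[str], int] = difficulty_by_length,
) -> Iterator[List[str]]:
    """Yield batches ordered from the easiest example to the hardest."""
    # sorted() is stable, so examples with equal difficulty keep their input order.
    ordered = sorted(examples, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield list(ordered[start:start + batch_size])


# Toy corpus, invented for this example.
corpus = [
    "The cat sat.",
    "Large language models are trained on trillions of tokens of text.",
    "Data is scarce.",
    "Curriculum learning orders training data by difficulty.",
]
batches = list(curriculum_batches(corpus, batch_size=2))
```

In this toy setup the short sentences land in the first batch and the longer ones in the second. A real curriculum would likely use a more meaningful difficulty signal than word count.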

However, challenges remain in ensuring the quality and relevance of synthetic data. “This is the dirty secret of deep learning: It’s throwing spaghetti against the wall,” Morcos remarks, underscoring the task’s complexity.

Despite the uncertainty surrounding data scarcity, the industry remains undeterred in pursuing AI advancement. Whether through novel data selection methods, synthetic data generation, or other innovative approaches, it continues to push the boundaries of what’s possible.

As the quest for digital gold continues, only time will tell whether the internet will prove vast enough to fuel the next generation of AI breakthroughs. In the meantime, industry leaders remain steadfast in their commitment to overcoming the challenges that lie ahead.
