Artificial intelligence has raced forward on the back of bigger models and vast training sets. Yet a stubborn limit persists. Much of the knowledge that matters lives on the public web, in formats never meant for machines that query at machine speed.
The web itself wasn’t built for this. Billions of new URLs appear weekly. Sites render content with JavaScript, block automated visitors, and change without notice. Traditional snapshots of data grow stale before models can act on them. Enterprises now hit this wall daily. Stale answers produce bad decisions. Models hallucinate when context runs dry.
Or Lenchner, CEO of Bright Data, puts it plainly. “If it can’t retrieve real-time information, it lacks context,” he told MIT Technology Review. “In a business setting, that’s not acceptable anymore. Stale answers lead to bad decisions and disappointed consumers.”
So a new layer of infrastructure is taking shape. It sits between the raw, chaotic web and the AI systems that need fresh, structured inputs. This layer handles discovery across millions of domains. It delivers data with low latency. And it tailors results to the exact context a query demands. Call it the web data infrastructure layer for AI. Its emergence marks a shift from static training to continuous, grounded operation.
Early AI gains came from scaling compute and parameters. That era is giving way to one defined by data quality and timeliness. Retrieval-augmented generation helped, yet many deployments still falter. According to Gartner, 60% of AI projects unsupported by ready data (accurate, structured, organized, and contextualized) will likely be abandoned by year’s end. The numbers tell a consistent story. A survey cited in Bright Data’s materials found 56% of AI practitioners believe real-time web access is needed to build trust in outputs. Meanwhile, 97% of organizations already depend on such infrastructure, though 90% say they feel constrained by technical and legal barriers.
The problem runs deeper than latency. Models trained on frozen datasets miss price shifts, sentiment swings, or breaking news. They operate without the living knowledge that turns intelligence into useful action. Lenchner draws a sharp analogy. “Think of the trained model as intelligence and relevant data as knowledge. A powerful intelligence layer sitting on top of a hollow knowledge layer is like a genius who knows nothing—useless in practice. Intelligence and knowledge have to come together.”
Providers are responding with platforms that emulate human browsing at massive scale. These systems rotate through proxies, mimic browser fingerprints, and respect site expectations. They process 80 billion interactions a day across millions of sites while appearing exactly as a legitimate user would. The result? Structured feeds that flow directly into agents or RAG pipelines. No more brittle scrapers. No more outdated indexes.
Bright Data stands out in this space. The company surpassed $300 million in annualized revenue, growing more than 40% year over year, according to a Forbes analysis of the market. It supports 14 of the top 20 global LLM labs and powers over 100 million daily AI-agent interactions. Its approach emphasizes ethical collection, compliance with GDPR and CCPA, and a focus on public data only. Other players populate a fragmented field.
The market itself is large and expanding fast. Big data infrastructure reached $209 billion in 2024 with a 21.6% compound annual growth rate, the same Forbes report notes. Web scraping software, a key piece, stood at $754 million that year and is forecast to hit nearly $2.9 billion by 2034. Alternative data markets, which often rely on web sources, add billions more. Four tiers define the competitive map: tech giants like Google and AWS at the top; enterprise specialists such as Bright Data and Oxylabs with nine-figure revenue; high-growth challengers including Apify, which saw revenue jump 80% to $13.3 million in 2024; and specialized innovators like Diffbot that focus on turning unstructured pages into knowledge graphs.
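For readers who want to sanity-check the web scraping forecast, the growth rate it implies can be worked out from the two endpoints cited above. The sketch below assumes simple annual compounding between 2024 and 2034 and treats "nearly $2.9 billion" as exactly $2,900 million, so the result is an approximation rather than a figure from the Forbes report. The function name is illustrative.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate that links two values.

    Assumes growth compounds once per year over `years` whole years.
    """
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures in millions of dollars, taken from the article;
# 2,900 stands in for "nearly $2.9 billion" (an assumption).
scraping_cagr = implied_cagr(754, 2900, 2034 - 2024)
print(f"Implied web scraping software CAGR: {scraping_cagr:.1%}")
```

On those assumptions the forecast works out to roughly 14% a year, a slower pace than the 21.6% rate cited for big data infrastructure as a whole.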
Yet size alone doesn’t capture the shift. Tools built specifically for AI agents now go further. Nimble’s web search agents, highlighted in an April analysis on the Nimble site, use headless browsers to pull live structured JSON rather than cached snippets. This approach grounds model inference in data that is current at request time. The agents adapt to site changes, control search depth, and stream results into frameworks like LangChain or CrewAI. Market projections in that piece peg the broader AI data infrastructure sector at $90 billion this year, climbing toward $465 billion by 2033.
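What "current at request time" means in practice can be shown with a small sketch. It is not Nimble's API or any vendor's implementation. The `Snapshot` type, the `fresh_context` function, and the `fetch` callable are hypothetical names chosen for illustration, and the fetcher here is a stub rather than a real headless browser. The idea is simply that every payload carries a retrieval timestamp, and anything older than a caller-supplied limit is fetched again before a model sees it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional


@dataclass
class Snapshot:
    url: str
    payload: dict            # structured JSON-like data extracted from the page
    retrieved_at: datetime   # when the payload was fetched (UTC)


def fresh_context(
    url: str,
    fetch: Callable[[str], dict],
    cache: Dict[str, Snapshot],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> Snapshot:
    """Return a snapshot no older than `max_age`, refetching when stale."""
    now = now or datetime.now(timezone.utc)
    cached = cache.get(url)
    if cached is not None and now - cached.retrieved_at <= max_age:
        return cached  # still fresh enough to ground an answer
    snapshot = Snapshot(url=url, payload=fetch(url), retrieved_at=now)
    cache[url] = snapshot
    return snapshot


# Stub standing in for a live retrieval service (hypothetical).
def stub_fetch(url: str) -> dict:
    return {"source": url, "price": 19.99}


store: Dict[str, Snapshot] = {}
snap = fresh_context("https://example.com/item", stub_fetch, store,
                     max_age=timedelta(minutes=5))
print(snap.payload, snap.retrieved_at.isoformat())
```

The acceptable age is a per-query decision: a price comparison might tolerate minutes, while a breaking-news lookup might tolerate none, in which case `max_age` can be set to zero to force a live fetch every time.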
Real-world applications already show the difference. Retailers adjust prices based on competitors’ live listings. Brands scan for trademark violations across global domains. Financial teams track sentiment or supply signals that change hourly. Each case demands infrastructure that discovers relevant pages, extracts cleanly, and delivers without delay. Latency isn’t a technical footnote. It determines whether an AI system feels responsive or obsolete.
Governance questions grow alongside capability. Continuous retrieval from public sources must respect privacy rules and site terms. Leading platforms limit activity to openly accessible content, avoid logins or paywalls, and maintain auditable consent mechanisms for proxy networks. They treat compliance as table stakes rather than an afterthought. When data infrastructure becomes critical to operations, building it internally turns into a distraction. “When this is critical infrastructure for a company,” Lenchner observed, “doing it in-house becomes a full-time engineering problem that competes with the actual AI work.”
That tension explains why specialized providers are gaining traction. They absorb the complexity of anti-bot evasion, JavaScript rendering, geographic variation, and format diversity. Organizations then focus engineering talent on model refinement and application logic instead of fighting the web’s defenses.
The layer’s maturation could blur old boundaries. Over time, the distinction between the AI model and the infrastructure that feeds it may fade. Models will operate in tighter loops with live sources. Agents will navigate the web more autonomously. Outputs will carry higher confidence because they rest on verifiable, timely knowledge rather than memorized patterns.
Challenges remain. Scale brings cost. Regulatory scrutiny is rising. Not every site welcomes automated access, even when done transparently. Yet the direction is clear. Everything happening in the world ends up on the public web, Lenchner notes. The volume of new data accelerates. AI systems that cannot keep pace will fall behind those that treat real-time web access as foundational rather than optional.
Investors have taken notice. The Forbes piece describes the sector as one of technology’s most strategic battlegrounds, combining explosive AI-driven demand with technical moats around scale, compliance, and adaptive intelligence. Early leaders haven’t locked in permanent dominance. Consolidation appears likely as enterprises demand production-grade reliability over experimental tools.
Look at recent signals. Just this month, discussions on X echoed the MIT Technology Review piece, with users sharing the article as evidence that data infrastructure now defines AI progress as much as model architecture does. Separate coverage from VAST Data on June 18 highlighted how legacy web-era data stacks fail under AI workloads, calling for collapsed architectures that reduce latency for GPUs and agents alike.
None of this suggests the web data layer will solve every AI shortcoming. Models still require careful design. Bias, reasoning limits, and energy demands persist. But without reliable, fresh inputs, even the most sophisticated systems operate half-blind. The infrastructure now emerging aims to open their eyes to the full, dynamic scope of online knowledge.
Enterprises that build or adopt these capabilities early stand to gain. Their AI applications can track markets in motion, respond to customers with current information, and maintain trust through reduced hallucinations. Those that treat data retrieval as an afterthought risk watching competitors pull ahead with systems that actually know what is happening right now.
The web holds more data than any single organization can map alone. The new infrastructure layer doesn’t change that fact. It simply gives AI the tools to explore it, retrieve what matters, and act before the moment passes. That ability may prove decisive in the next phase of artificial intelligence.


WebProNews is an iEntry Publication