The Chip Startup That Wants to Be the Air Traffic Controller for AI Inference

A small Israeli semiconductor company called NeuReality thinks the biggest bottleneck in artificial intelligence isn’t the GPU. It’s everything around it.

The company, which emerged from stealth in 2022 and has since attracted backing from former Google AI hardware chief Amin Vahdat as a strategic advisor, is betting that the industry’s obsession with raw chip performance has obscured a more fundamental problem: how AI inference workloads are actually orchestrated, routed, and served inside data centers. Its answer is a new kind of processor: not a GPU competitor, but a purpose-built system-on-chip designed to sit between the network and the AI accelerators, managing inference traffic the way a switchboard operator once managed phone calls.

The product is called NR-Nexus. And it represents one of the more contrarian hardware plays in a market drowning in GPU hype.

As reported by The Next Web, NeuReality’s NR-Nexus is an inference-centric system-on-chip that handles the overhead tasks surrounding AI model execution — networking, scheduling, load balancing, preprocessing, and postprocessing — all on a single piece of silicon. The idea is to strip away the general-purpose CPU that traditionally manages these functions and replace it with dedicated hardware that does the job faster, cheaper, and with far less power. NeuReality CEO Moshe Tanach has described the conventional approach as using “a sledgehammer to crack a nut,” pointing out that server-class CPUs consume enormous power and rack space just to shuttle data to and from accelerators.

The pitch resonates because inference — running trained AI models in production — is rapidly becoming the dominant workload in cloud computing. Training gets the headlines. Inference pays the bills. Every time a user queries ChatGPT, generates an image with Midjourney, or gets a recommendation from Netflix, an inference operation fires somewhere in a data center. Goldman Sachs has estimated that inference will account for the vast majority of AI compute spending over the coming years, dwarfing training costs as models move from research labs into commercial deployment at scale.

But here’s the catch. The infrastructure designed for inference today was largely repurposed from training. GPUs sit in servers managed by beefy CPUs, connected by standard networking stacks, all originally architected for a different purpose. NeuReality argues this creates massive inefficiency. Its internal benchmarks, cited by The Next Web, suggest that in a typical inference pipeline, the host CPU can consume 30% to 50% of total system power, not to run the model but to manage the data flow around it. That’s a staggering overhead.
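To put that overhead in perspective, here is a back-of-the-envelope sketch in Python. It assumes the host CPU draws a fraction of total system power (the 30% to 50% range cited above) and is swapped for a part drawing some fraction of the CPU's power. The function name and the 0.2 replacement ratio are hypothetical illustrations, not figures from NeuReality.

```python
def perf_per_watt_gain(host_fraction: float, replacement_ratio: float) -> float:
    """Estimate the gain in inference work per watt, assuming throughput is unchanged.

    host_fraction: share of total system power drawn by the host CPU (0.3-0.5 in the article).
    replacement_ratio: power of the replacement part relative to the CPU it replaces
                       (hypothetical; the article gives no figure).
    """
    if not 0.0 <= host_fraction < 1.0:
        raise ValueError("host_fraction must be in [0, 1)")
    if replacement_ratio < 0.0:
        raise ValueError("replacement_ratio must be non-negative")
    # New total power, normalized so the original system draws 1.0.
    new_total = (1.0 - host_fraction) + host_fraction * replacement_ratio
    return 1.0 / new_total


# Illustrative only: a replacement drawing 20% of the CPU's power.
low = perf_per_watt_gain(0.30, 0.2)
high = perf_per_watt_gain(0.50, 0.2)
print(f"work per watt improves by {low:.2f}x to {high:.2f}x")
```

Under those assumed numbers the same accelerators would deliver roughly 1.3 to 1.7 times as much work per watt, which is why the overhead matters even if the model execution itself is untouched.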

The NR-Nexus chip is designed to collapse that overhead. It integrates networking interfaces, Arm-based compute cores for lightweight control tasks, hardware accelerators for pre- and post-processing, and a smart scheduler that can dynamically route inference requests to whichever accelerator — GPU, custom ASIC, or otherwise — is best suited and available. Think of it less as a competitor to Nvidia’s H100 and more as the traffic cop standing at the intersection in front of it.

Amin Vahdat’s involvement adds credibility. Vahdat spent years at Google leading the engineering of its cloud networking and AI infrastructure before joining NeuReality’s advisory board. His presence signals that the problem NeuReality is targeting is real and recognized at the highest levels of hyperscale computing. Google, after all, builds its own TPUs and has spent billions optimizing inference serving — so when one of its former top architects backs a startup attacking inference infrastructure from a different angle, the industry pays attention.

NeuReality isn’t alone in identifying the inference bottleneck, though its approach is distinctive. Companies like d-Matrix, Groq, and Cerebras have all built inference-optimized silicon, but they’ve focused primarily on the compute itself, building faster, more efficient engines for executing model weights and activations. NeuReality is going after the plumbing. The analogy the company uses internally is telling: it compares the current state of AI inference to the early days of networking, before dedicated routers and switches replaced general-purpose computers handling packet forwarding. The transition from software-based routing to purpose-built networking hardware transformed the internet. NeuReality believes a similar transition is overdue for AI serving.

The technical architecture is worth examining. The NR-Nexus SoC is fabricated on a modern process node and integrates high-speed network interfaces capable of handling 400 Gbps Ethernet. It includes a hardware inference scheduler that can manage multiple downstream accelerators simultaneously, distributing requests based on latency, throughput, and model-specific requirements. The chip also handles tokenization, detokenization, and other pre- and post-processing steps that conventional systems run on the host CPU. By moving these functions into dedicated silicon, NeuReality claims it can reduce end-to-end inference latency while also cutting power consumption by a significant margin.
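The scheduling idea can be illustrated in software. The Python sketch below is a simplified stand-in, not NeuReality's design: it routes each request to whichever accelerator supports the requested model and has the lowest estimated wait. The class, the field names, and the latency figures are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    models: set            # models this device can serve (hypothetical)
    latency_ms: float      # assumed per-request service time
    queue_depth: int = 0   # requests already waiting on this device

    def estimated_wait_ms(self) -> float:
        # Time until a newly queued request would finish on this device.
        return (self.queue_depth + 1) * self.latency_ms


def route(model: str, accelerators: list) -> Accelerator:
    """Pick the eligible accelerator with the lowest estimated wait and enqueue on it."""
    candidates = [a for a in accelerators if model in a.models]
    if not candidates:
        raise LookupError(f"no accelerator serves model {model!r}")
    best = min(candidates, key=lambda a: a.estimated_wait_ms())
    best.queue_depth += 1
    return best


pool = [
    Accelerator("gpu-0", {"llm", "vision"}, latency_ms=20.0),
    Accelerator("asic-0", {"llm"}, latency_ms=12.0),
]
picks = [route("llm", pool).name for _ in range(3)]
print(picks)  # ['asic-0', 'gpu-0', 'asic-0']
```

The faster device takes the first request, but as its queue grows the slower one becomes the better choice, so traffic spreads across both. A hardware scheduler would make the same kind of decision per request without involving a host CPU.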

The business model targets cloud service providers, telecom operators, and enterprises building large-scale inference deployments. NeuReality has been working with select partners to integrate the NR-Nexus into server designs, positioning the chip as a complement to — not a replacement for — existing accelerators. This is a shrewd go-to-market strategy. Rather than asking customers to rip out their Nvidia GPUs or Intel CPUs, NeuReality is offering a new component that slides into the existing rack, potentially improving the utilization and efficiency of hardware customers have already purchased.

The timing matters. Enterprises and cloud providers are grappling with the economics of inference at scale. Nvidia’s GPUs are expensive and supply-constrained. Power and cooling costs are spiraling. Data center capacity is tight. Anything that can extract more useful inference work per watt and per dollar of existing hardware has an immediate market. NeuReality’s value proposition is essentially: you’ve already bought the expensive accelerators — now let us help you use them properly.

There are risks. Plenty of them.

Semiconductor startups face brutal odds. The capital requirements are enormous, the design cycles are long, and the incumbents are formidable. Nvidia doesn’t just sell GPUs — it sells CUDA, TensorRT, Triton Inference Server, and an entire software stack that makes its hardware sticky. Intel and AMD are both pushing inference-optimized products. And the hyperscalers — Google, Amazon, Microsoft, Meta — are all designing custom silicon internally, often with the explicit goal of reducing dependence on third-party chips.

NeuReality also faces the classic startup challenge of market education. The concept of a dedicated “inference networking” chip is new. Convincing data center architects to add another component to their bill of materials requires demonstrating not just technical superiority but also integration simplicity and total cost of ownership benefits that are compelling enough to overcome institutional inertia. That’s a high bar.

Still, the underlying thesis is sound. The AI industry is in the early innings of a massive buildout of inference infrastructure, and the architectures being deployed today are, in many ways, kludges — training-era designs pressed into service for a fundamentally different workload. The companies that figure out how to optimize inference serving at the system level, not just the chip level, stand to capture significant value.

Recent developments across the semiconductor industry underscore the urgency. Nvidia’s latest earnings calls have repeatedly highlighted inference as the fastest-growing segment of its data center business. AMD has been aggressively positioning its Instinct MI300X for inference workloads. And a wave of startups — from Groq with its deterministic inference engine to Etched with its transformer-specific ASIC — are all circling the same opportunity from different angles. NeuReality’s differentiation is that it isn’t trying to replace the accelerator. It’s trying to make every accelerator work better.

Moshe Tanach has said publicly that the company views the NR-Nexus not as a product but as a new category. That’s ambitious language. But the history of computing infrastructure suggests he might be onto something. Dedicated load balancers, network interface cards with offload engines, SmartNICs, DPUs: the industry has a long track record of carving out specialized silicon for functions that general-purpose processors handle poorly at scale. Nvidia itself recognized this pattern when it acquired Mellanox for roughly $7 billion, in a deal that closed in 2020, and subsequently expanded Mellanox’s BlueField DPU line. NeuReality is making a parallel argument: inference serving deserves its own dedicated hardware layer.

Whether NeuReality can execute on that vision, with sufficient capital, engineering talent, and market traction, remains an open question. The semiconductor graveyard is littered with companies that had the right idea at the wrong time, or the right technology without the right go-to-market motion. But in a market where Nvidia’s data center revenue, much of it driven by inference, is measured in tens of billions of dollars per quarter, even capturing a thin slice of the supporting infrastructure opportunity could build a substantial business.

For now, NeuReality occupies an interesting position: a company that isn’t trying to win the AI chip war, but is betting it can profit from making everyone else’s weapons more effective. In a market defined by escalating compute costs, power constraints, and insatiable demand for real-time AI services, that’s not a bad place to be.
