Robotics teams chasing end-to-end learning face a quiet drag on progress. They call it the data layer tax. It shows up in wasted engineering hours, underused GPUs, and iteration cycles stretched across days instead of hours. Unlike the language models that scaled on mature pipelines for text and code, physical AI systems wrestle with multimodal streams that arrive at different rates, carry spatial relationships, and demand precise time alignment.
The Rerun blog post maps this tax from evaluation back to collection. Written by Nikolaus West, it lays out how immature infrastructure for physical data compounds costs at every stage. Recent analysis from Voxel51 echoes the point. In its May 2026 guide, the computer vision platform reports that top-performing physical AI teams spend three times as much time on data work as those falling behind. It also reports that 97 percent struggle with dataset iteration.
Start at evaluation. Real-world robot trials take hours or days to run. Teams cannot iterate on hundreds of repeatable tests the way LLM developers do. They turn to proxy metrics instead. Reward models score task progress. Trajectory smoothness estimates success. Yet these signals judge individual episodes, not final policy quality. Researchers watch rollouts, build intuition from repeated viewings, and trace failures backward. That tracing often means switching between disconnected tools and mismatched file formats. Friction accumulates. Insights arrive too slowly to reshape the next training run.
Training exposes the tax even more clearly. Models output actions over time. A single sample might pull camera frames from three views, joint states from thirty motors, gripper positions, and a language command. Then it needs a chunk of future actions, sometimes the next 50 to 100 steps. Constructing these samples on the fly while feeding GPUs at full speed proves tricky. Naive row-oriented reads pull far more data than required. Column-aware loaders help, but they must handle non-uniform sampling patterns that change with each new architecture.
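To make the sample-construction problem concrete, here is a minimal sketch in plain Python. It assumes each stream is stored as a sorted list of integer timestamps with a parallel list of values, and that an observation is "the latest value at or before the query time." The function names, stream names, and data layout are hypothetical illustrations, not any particular library's API.

```python
from bisect import bisect_right


def latest_at(timestamps, t):
    """Index of the last entry with timestamp <= t, or None if there is none."""
    i = bisect_right(timestamps, t) - 1
    return i if i >= 0 else None


def build_sample(streams, action_times, actions, t, horizon):
    """Assemble one training sample at query time t.

    streams: dict of name -> (sorted timestamps, values); each stream may
             run at its own rate (hypothetical layout).
    Returns the latest observation from every stream plus the next
    `horizon` actions after t, or None if the sample cannot be built.
    """
    obs = {}
    for name, (ts, values) in streams.items():
        i = latest_at(ts, t)
        if i is None:
            return None  # this stream has not produced data yet
        obs[name] = values[i]

    start = bisect_right(action_times, t)  # first action strictly after t
    chunk = actions[start:start + horizon]
    if len(chunk) < horizon:
        return None  # too close to the end of the episode
    return {"obs": obs, "actions": chunk}


# A camera at 10 Hz, joints at 100 Hz, actions at 50 Hz (timestamps in ms).
streams = {
    "cam_front": ([0, 100, 200], ["f0", "f1", "f2"]),
    "joints": (list(range(0, 301, 10)), list(range(31))),
}
action_times = list(range(0, 401, 20))
actions = list(range(21))
sample = build_sample(streams, action_times, actions, t=150, horizon=3)
```

Even in this toy form, every sample touches each stream at a different index, which is why a loader that reads whole rows or whole episodes ends up pulling far more data than the sample needs.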
Video makes the problem worse. It often represents 90 percent of dataset size even with temporal compression. Codecs store groups of pictures with one keyframe followed by delta frames. Random access to a middle frame can require decoding a dozen others. Larger groups of pictures save storage but slow training. Smaller ones speed access at the cost of disk space. LeRobot defaults to a group of pictures size of two to favor random access. The choice matters when a model conditions on frames spaced at irregular intervals across multiple cameras. One sample can trigger twelve separate decode operations. Dataloaders grow complex. Teams either accept GPU starvation or invest in heavy preprocessing jobs that reduce flexibility.
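The trade-off can be sketched with a simplified cost model. It assumes every group of pictures starts with a keyframe, each later frame depends only on earlier frames in the same group (no B-frames), and the decoder keeps no state between requests. Real codecs and decoders are more nuanced, so treat the numbers as illustrative only. The function names are hypothetical.

```python
def frames_to_decode(target, gop_size):
    """Frames that must be decoded to display frame `target` (0-indexed).

    Simplified model: decoding starts at the keyframe that opens the
    target's group and runs forward through the target itself.
    """
    if gop_size < 1:
        raise ValueError("gop_size must be at least 1")
    return target % gop_size + 1


def decodes_for_sample(targets, gop_size):
    """Total frames decoded when each target frame is fetched independently."""
    return sum(frames_to_decode(t, gop_size) for t in targets)


# Three irregularly spaced frames from one camera.
targets = [10, 25, 40]
small_gop = decodes_for_sample(targets, gop_size=2)   # 1 + 2 + 1 = 4
large_gop = decodes_for_sample(targets, gop_size=30)  # 11 + 26 + 11 = 48
```

Under these assumptions the same three-frame request costs 4 decoded frames with a group size of two and 48 with a group size of thirty, before multiplying by the number of cameras.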
The original Rerun analysis notes that inflexible dataloaders discourage rapid experimentation with dataset mixes. Physical Intelligence’s pi0 model mixed teleoperated data, simulation, and open datasets using power-law weights to avoid over-representing common tasks. A CoRL 2024 best paper found that, beyond a baseline number of demonstrations, task diversity drives gains more than additional examples of the same task. Yet testing a new mix often means exporting a fresh combined dataset. The overhead discourages systematic tuning.
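As a rough illustration of power-law weighting, the sketch below gives each source a sampling weight proportional to its size raised to an exponent. The exponent and the dataset sizes are hypothetical placeholders, and this is not presented as the exact recipe any particular model used.

```python
def mixture_weights(sizes, alpha):
    """Normalized sampling weights proportional to size ** alpha.

    alpha = 1 samples in proportion to size, alpha = 0 samples every
    source equally, and values in between damp the largest sources.
    """
    raw = {name: n ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


# Hypothetical episode counts for three sources.
sizes = {"teleop": 10_000, "simulation": 1_000_000, "open_data": 100}
proportional = mixture_weights(sizes, alpha=1.0)
damped = mixture_weights(sizes, alpha=0.5)
```

With a flexible loader, changing the mix means changing `alpha` or the size table and resampling, instead of exporting a new combined dataset.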
Curation carries its own costs. Real recordings contain missing streams, schema mismatches, and outright failures. One paper found 33.5 percent of certain trajectories in the DROID dataset were failures. Simple filters based on jerkiness or gripper usage require easy access to synchronized multimodal data. Visual review remains essential. Humans spot hesitation patterns, awkward camera angles, and inconsistent approaches that no automated metric catches. Learned reward models and mutual information estimators add scale but introduce their own compute demands.
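A jerkiness filter of the kind mentioned above might look like the following sketch. It assumes one joint position series per episode, sampled at a uniform interval, and uses the mean absolute third finite difference as the jerk estimate. The threshold and episode names are hypothetical and would need tuning per robot and task.

```python
def mean_abs_jerk(positions, dt):
    """Mean absolute jerk of a uniformly sampled 1-D position series.

    Jerk is approximated by the third forward finite difference
    divided by dt ** 3.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2]
         + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)


def flag_jerky(episodes, dt, threshold):
    """Names of episodes whose mean absolute jerk exceeds the threshold."""
    return [name for name, pos in episodes.items()
            if mean_abs_jerk(pos, dt) > threshold]


episodes = {
    "smooth": [0.5 * t * t for t in range(10)],  # constant acceleration
    "jerky": [0, 1, 0, 1, 0, 1, 0, 1],           # oscillating motion
}
flagged = flag_jerky(episodes, dt=1.0, threshold=1.0)
```

The arithmetic is trivial. The cost in practice lies in getting synchronized joint data out of each recording so that a check like this can run across the whole dataset.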
But the tax begins even earlier. Collection setups vary wildly. Teleoperation rigs tie directly to specific robots. Simulation data floods in at different rates and formats. Merging sources demands custom loaders and conversion scripts. Storage decisions lock in downstream pain. File-based archives resist efficient querying. Without a queryable layer, teams copy and transform data repeatedly.
A newer post from the same team, “A new data layer for robot learning,” proposes a purpose-built alternative. Rerun treats column chunks as the core storage primitive. Tall chunks suit high-frequency scalar data such as joint velocities. Short chunks handle bulky video frames. The .rrd file format encodes these chunks with Apache Arrow for zero-copy access to tools like DataFusion or Polars. An index tracks byte ranges, schemas, and time spans so readers fetch only needed sections from object storage.
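The idea of an index over column chunks can be illustrated with a small sketch. This is not the .rrd layout or any Rerun API. The `ChunkEntry` fields and the `plan_reads` function are hypothetical, chosen only to show how byte ranges and time spans let a reader skip data it does not need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkEntry:
    """One index row describing a stored column chunk (hypothetical schema)."""
    column: str
    t_start: int      # first timestamp in the chunk, inclusive
    t_end: int        # last timestamp in the chunk, inclusive
    byte_offset: int
    byte_length: int


def plan_reads(index, columns, t0, t1):
    """Byte ranges to fetch for the given columns within [t0, t1].

    Returns (offset, length) pairs sorted by offset, covering only
    chunks whose time span overlaps the requested window.
    """
    wanted = set(columns)
    hits = [e for e in index
            if e.column in wanted and e.t_start <= t1 and e.t_end >= t0]
    return sorted((e.byte_offset, e.byte_length) for e in hits)


index = [
    ChunkEntry("joint_velocity", 0, 999, 0, 4_000),          # tall scalar chunk
    ChunkEntry("joint_velocity", 1000, 1999, 4_000, 4_000),
    ChunkEntry("cam_front", 0, 99, 8_000, 500_000),          # short video chunk
    ChunkEntry("cam_front", 100, 199, 508_000, 500_000),
]
reads = plan_reads(index, ["joint_velocity"], t0=1200, t1=1500)
```

In this toy index, a request for joint velocities in a narrow window resolves to a single 4,000-byte range and never touches the video chunks.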
This design supports selective streaming. Compute jobs pull precisely the columns and time windows required instead of entire recordings. A PyTorch dataloader streams encoded images, scalars, or compressed video directly. It handles random access and works with distributed training. Chunk processing APIs let teams merge streams from MCAP, Parquet, or URDF files into Arrow records, then apply declarative lenses that resemble jq for mutation, casting, or reshaping. The catalog server indexes local directories for SQL and dataframe queries. Rerun Hub, now in private preview, extends the pattern to shared storage at scale.
These capabilities address the tax directly. Visualization happens in a unified viewer that scrubs synchronized multimodal timelines. Queries replace manual detective work. Transformations run without repeated full exports. Training loops iterate faster because data reaches GPUs without constant reformatting. Early users report reduced engineering overhead on pipeline glue.
Industry observers see the same gap. A June 2026 guide from Robotics Center notes that physical AI lacks any equivalent to the internet-scale text that trained large language models. Every demonstration requires physical labor on rigs costing tens of thousands of dollars and yielding limited samples per day. Voxel51’s 2026 physical AI data platform report adds that 89 percent of teams view data as the primary success driver, yet 36 percent say less than half their annotated data ever reaches production. Annotation remains painful. Dataset iteration feels endless.
A February 2026 article in The Next Platform warned that robotics will break existing AI infrastructure. Data movement becomes the constraint once fleets generate synchronized video, LiDAR, joint states, and simulation output. Usability demands indexing, time synchronization, and searchable organization. Without it, even massive GPU clusters sit idle waiting for the right samples.
Scaling laws have begun to work for robotics. Models like pi0.5 and NVIDIA’s GR00T show capabilities once considered distant. Yet progress hinges on closing the data layer gap. LLM teams iterated quickly because their infrastructure abstracted away format conversions, decoding logic, and sample construction. Robot learning teams still carry that burden themselves.
Some organizations respond by building dedicated data collection facilities. Others double down on simulation before fine-tuning on real demonstrations. Hybrid approaches help, but they still require a data layer that treats physical recordings as first-class citizens rather than awkward appendages to analytics pipelines. The teams that reduce the tax most effectively will move faster from idea to working policy. They will experiment with mixes, filters, and architectures without waiting weeks for each data preparation job.
Rerun’s bet is that a columnar, time-aware, multimodal store with built-in visualization, querying, and training integration can become that shared foundation. Its recent 0.33 release added headless rendering and further ROS 2 support, signaling continued investment in production readiness. Whether one vendor provides the full answer or the market fragments into specialized tools, the direction is clear. Physical AI needs infrastructure designed for its data types from the start.
Until then the tax remains. Engineers debug across formats. GPUs wait on decoders. Researchers trust intuition built from painful manual review. The race toward capable robots will reward those who treat data infrastructure as a core competitive advantage instead of an afterthought. The difference between stalling and scaling may come down to how efficiently a team can move from raw recording to better model, again and again.


WebProNews is an iEntry Publication