The AI Layer — Perception, Manipulation, and Task Learning

Even if you solve the hardware, the robot still needs to see, understand, and act in an unstructured environment. The AI problems are in some ways harder than the mechanical ones.

## Perception

A robot doing home cleaning needs to identify and localize objects in 3D space using cameras or depth sensors. This sounds like a solved problem — object detection has been impressive since 2015. But there's a gap between "identify that there is a glass on the table" and "determine the glass's exact 3D pose, weight class, and fragility estimate well enough to pick it up without breaking it."

The pose estimation step is called 6-DoF object pose estimation (6 degrees of freedom: x, y, z position plus roll, pitch, yaw orientation); grasp planning then uses that pose to decide where and how to grip. Pose estimation is computationally intensive and still fails regularly on transparent objects, reflective surfaces, and objects in unusual orientations. Cleaning involves all of these.
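To make "6 degrees of freedom" concrete, here is a minimal sketch of how a pose might be stored and used once it has been estimated. The `Pose6DoF` class, its field names, and the Z-Y-X (yaw, pitch, roll) rotation convention are assumptions chosen for illustration; real perception stacks differ in conventions and usually use quaternions or matrices from a library.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose6DoF:
    """Hypothetical 6-DoF pose: position in metres, angles in radians."""
    x: float
    y: float
    z: float
    roll: float   # rotation about the x axis
    pitch: float  # rotation about the y axis
    yaw: float    # rotation about the z axis

    def rotation_matrix(self):
        """Return the 3x3 rotation R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cy, sy = math.cos(self.yaw), math.sin(self.yaw)
        return [
            [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
            [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
            [-sp, cp * sr, cp * cr],
        ]

    def transform_point(self, point):
        """Map a point from the object's own frame into the world frame."""
        rot = self.rotation_matrix()
        offset = (self.x, self.y, self.z)
        return tuple(
            sum(rot[i][j] * point[j] for j in range(3)) + offset[i]
            for i in range(3)
        )


# A glass standing 0.8 m up, rotated 90 degrees about the vertical axis.
glass = Pose6DoF(x=0.5, y=0.0, z=0.8, roll=0.0, pitch=0.0, yaw=math.pi / 2)
# A point on the rim, expressed relative to the glass itself.
rim_in_world = glass.transform_point((0.04, 0.0, 0.1))
print(rim_in_world)
```

All six numbers matter to a grasp planner: an error in any one of them moves the rim point, and a few millimetres can be the difference between gripping the glass and knocking it over.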

## Manipulation

Grasping is an unsolved problem in the general sense. Robotic grippers designed for one class of objects (industrial robot arms picking identical parts from fixed positions) work well. A two-fingered or three-fingered gripper that works reliably across arbitrary household objects — the way a human hand does — does not yet exist.

Dexterous manipulation (turning a doorknob, picking up a delicate glass, scrubbing a surface with appropriate pressure) requires force feedback at the finger level, fast control loops, and a learned model of how different objects respond to forces. Tesla's Optimus demo of folding a shirt took months of training to produce a single reliable demo. That's the state of the art.
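The role of force feedback and a fast control loop can be illustrated with a toy simulation. This is a sketch under strong assumptions, not how any real gripper is controlled: the object is modelled as a linear spring, the sensor is noise-free, and the function names, stiffness, and gain values are all made up for the example.

```python
def measured_force(finger_pos, contact_pos, stiffness):
    """Toy sensor model: zero force until contact, then a linear spring."""
    return max(0.0, stiffness * (finger_pos - contact_pos))


def close_until_force(target_force, contact_pos=0.01, stiffness=2000.0,
                      gain=0.0002, steps=200):
    """Proportional force loop: move the finger in proportion to force error.

    Units are assumed: positions in metres, forces in newtons,
    stiffness in N/m, gain in m/N.
    """
    finger_pos = 0.0
    for _ in range(steps):
        force = measured_force(finger_pos, contact_pos, stiffness)
        # Positive error closes the finger further; negative error backs off.
        finger_pos += gain * (target_force - force)
    return finger_pos, measured_force(finger_pos, contact_pos, stiffness)


position, force = close_until_force(target_force=2.0)
print(f"finger at {position:.4f} m, grip force {force:.3f} N")
```

In this toy model the loop settles only when `gain * stiffness` stays below 2. A stiffer object with the same gain oscillates or diverges, which hints at why a learned model of how different objects respond to force matters: the right controller settings depend on what is being held.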

## Task Learning

The most significant AI development in robotics over the past two years has been the application of large-scale imitation learning and reinforcement learning from human demonstration. Instead of programming tasks explicitly, robots learn from watching humans do them.
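In its simplest form, imitation learning is behavior cloning: treat recorded demonstrations as (state, action) pairs and fit a policy to them with ordinary supervised learning. The sketch below assumes a one-dimensional state and a linear policy purely for illustration; real visuomotor policies are neural networks mapping camera images to motor commands, and `fit_linear_policy` and the demonstration data are hypothetical.

```python
def fit_linear_policy(demos):
    """Least-squares fit of action = w * state + b to (state, action) pairs."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    var = sum((s - mean_s) ** 2 for s, _ in demos)
    w = cov / var
    b = mean_a - w * mean_s
    # The learned policy is just a function from state to action.
    return lambda state: w * state + b


# Made-up demonstrations: state is distance to the target (m), action is the
# speed the human demonstrator chose (m/s), here always half the distance.
demos = [(d, 0.5 * d) for d in (0.1, 0.2, 0.4, 0.8)]
policy = fit_linear_policy(demos)
print(policy(0.6))  # a state that never appeared in the demonstrations
```

The policy is never told the rule "move at half the distance"; it recovers it from examples. The same property is the weakness: the policy only knows what the demonstrations covered, so coverage of the data sets the limit on what it can do.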

This is why companies are racing to collect human demonstration data. Figure AI's work and Google DeepMind's RT-2 project showed that visuomotor policies trained on large datasets can generalize across environments better than hand-coded controllers. But the sample efficiency problem remains: a robot needs thousands of demonstrations to learn a task a human child learns in minutes.

The path forward is probably combining large vision-language models (which provide world understanding) with robot-specific fine-tuning on task demonstrations. That combination is what companies like Physical Intelligence (pi) and Covariant are developing.

Getting from "this works in the lab" to "this works reliably across diverse homes" is as much a data problem as an algorithm problem. The companies that collect the most high-quality human demonstration data for home tasks will have a structural advantage, which is why Gatsby's real commercial deployment is a data collection opportunity as much as a revenue opportunity.
