Robot Learning / Visuomotor Policy Research
A history of the solo visuomotor-policy research I did from August 2023 to October 2025: what each phase taught me, why it led to the next, and where it stands now. The throughline is policy learning for bimanual manipulation, aimed at real industrial deployment. To give perspective on the time and money constraints, all of the work was done alongside 20-40 hours a week of SWE contract work and time spent knocking on the doors of local Los Angeles manufacturers. All research was done on a rig with a 3090 Ti, a 4090, and 64 GB of RAM. I used both the YAM and UFactory xArm 7 robot arms, with OAK-D Pro and Intel RealSense D405 cameras.
The working loop has been the same throughout: form a theory about what should work, dig into the related literature, then test and validate it on real hardware. Each section below is a turn of that loop.
Humble Beginnings — Foundations · VR Teleop · Isaac Sim · Action Chunking Transformers
Starting out on August 19th, 2023, I didn’t have the capital to buy a robot arm, so the best way to get started was simulation. The starting point was a bimanual robot setup in Isaac Sim and a VR-based teleoperation rig I built to collect human demonstration rollouts for imitation learning. The operator’s hands drove the arms, with video streamed back to the headset, and a session recorder captured paired video + action sequences.
To get a baseline of existing policies I modified Action Chunking with Transformers (ACT) to work with my data pipeline and trained on 200-300 human demonstrations of stacking blocks in a randomized area. This was the first model I trained end-to-end on data I’d collected myself, and the foundation every subsequent experiment built on. Over the next couple of months I read all of the interesting robot learning papers I could find to understand the current direction of research. With that understanding in hand, I took a step back and trained very basic autoregressive visuomotor policies while layering in industry tricks to build an intuition for how these things worked and why.
It was clear to me that the current research direction in autoregression-based policies wasn’t going to get us to production-grade policies. I spent a few months running experiments on a mix of different architectures applied to robot learning, including V-JEPA and Mamba, with varying success.
Real Robot Hardware
I had finally saved enough from contract work to buy a UFactory xArm 7, a DH Robotics AG-160-95 gripper, and two OAK-D Pros for the testing rig, and had a basic policy running on it by May 21st, along with a very fast, easy-to-use VR-based data collection system. I recreated, trained, and tested both Chi et al.’s Diffusion Policy and Chen et al.’s Diffusion Forcing on the testbench to get a baseline from each. I then used an SD-VAE training recipe with a Swin V2 U-Net backbone so the Diffusion Forcing policy could operate on latents instead of raw pixels, with interesting success.
Flow Matching Policies
After wrapping up with Diffusion Forcing I took a step back to think about the problem from a different perspective. For the use cases I was targeting, I needed a model that could perform one task at high enough reliability to get paying customers, and I didn’t have access to large amounts of data or compute. I needed something equivalent to the pre-training step in LLMs: pre-train on low-quality data, then post-train the robot on task-relevant data. I was inspired by infant motor babbling and early motor learning, and by the possibility of applying something similar to robotics. My first idea was to have the robot move around randomly and have a vision model predict the robot’s state, but I didn’t think autoregressive or diffusion models would be a good fit for this. My hypothesis was that giving the model an explicit inductive bias for the robot’s configuration space, rather than letting it learn the C-space structure implicitly from data, would generate better trajectories. That pointed me back to a paper I’d previously bookmarked, Braun et al.’s Riemannian Flow Matching Policy for Robot Motion Learning, which fit the hypothesis well. I refreshed my understanding of manifolds and implemented it.
The first version was a minimal Conditional Flow Matching policy: a small MLP vector field, Gaussian source distribution per the original Lipman et al. recipe, integrated as an ODE with zuko. Pure toy implementation to make sure I understood the flow matching formulation before adding any of the moving parts.
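To make the formulation concrete, here is a minimal sketch of the conditional flow matching training target with the straight-line path from Lipman et al., written in plain Python rather than PyTorch and zuko. The function names (`cfm_sample`, `cfm_loss`, `euler_integrate`) and the fixed-step Euler solver are assumptions for illustration, not the original implementation.

```python
import random

def cfm_sample(x1, sigma_min=0.0, rng=random):
    """Draw one conditional flow matching training tuple for data point x1.

    Returns (t, x_t, u_t): a time in [0, 1], a point on the straight-line
    path from Gaussian noise x0 to x1, and the target velocity at that point.
    """
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # Gaussian source sample
    scale = 1.0 - (1.0 - sigma_min) * t           # weight on the noise term
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return t, x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)

def euler_integrate(vector_field, x0, steps=50):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = vector_field(k * dt, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, `v_pred` would come from the vector-field network evaluated at `(t, x_t)`; at inference the learned field takes the place of the function passed to `euler_integrate`.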
From there I started wiring it to the actual robot data. I plugged in SAM 2 features as the vision conditioning first, then swapped over to NVIDIA’s Cosmos Causal Video Tokenizer for compressed latent video conditioning, which was the encoder I stuck with. The tokenizer output (concatenated with the proprioception tokens) became the memory, and the action chunk was passed through a standard PyTorch transformer decoder that cross-attended to that memory. The decoder was the action head, predicting the flow-matching velocity field over the action chunk conditioned on the encoded video and the current robot state.
The hardest piece, and the one that took the most iteration, was getting the rotation half of the action right. Treating the SO(3) components as just-another-dimension in R⁴ and renormalizing the quaternion at each integration step distorted the geometry of the path. I pivoted the integrator onto the actual manifold, a product of S³ (the unit-quaternion 3-sphere) and R⁴ for the Euclidean components, so the rotation half is integrated as geodesics on the sphere rather than projected back from Euclidean space. Concretely, the 8-dimensional action vector splits into translation (R³), rotation as a quaternion constrained to S³ (4 dims, 3-dimensional manifold), and gripper (R¹). I tested several manifold configurations along the way: Euclidean baseline, R³ × S³, and finally the S³ × R⁴ product manifold I settled on.
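The difference between the two treatments of rotation can be sketched with the log and exp maps on the unit sphere. The snippet below is an illustrative assumption, not the original integrator: it interpolates an 8-dimensional action along the product-manifold geodesic, with the layout `[translation(3), quaternion(4), gripper(1)]` and the helper names chosen for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(q):
    n = math.sqrt(dot(q, q))
    return [x / n for x in q]

def sphere_log(p, q):
    """Tangent vector at p pointing toward q; its norm is the geodesic distance."""
    c = max(-1.0, min(1.0, dot(p, q)))
    w = [qi - c * pi for pi, qi in zip(p, q)]     # part of q orthogonal to p
    wn = math.sqrt(dot(w, w))
    if wn < 1e-12:
        return [0.0] * len(p)                     # q == p (antipodal is undefined)
    return [math.acos(c) * wi / wn for wi in w]

def sphere_exp(p, v):
    """Follow the geodesic from p along tangent vector v for unit time."""
    n = math.sqrt(dot(v, v))
    if n < 1e-8:
        return list(p)
    return normalize([math.cos(n) * pi + math.sin(n) * vi / n
                      for pi, vi in zip(p, v)])

def interpolate_action(a0, a1, t):
    """Point at time t on the geodesic from a0 to a1 on R^3 x S^3 x R^1."""
    q0, q1 = normalize(a0[3:7]), normalize(a1[3:7])
    if dot(q0, q1) < 0.0:
        q1 = [-x for x in q1]                     # q and -q are the same rotation
    pos = [(1.0 - t) * a + t * b for a, b in zip(a0[:3], a1[:3])]
    quat = sphere_exp(q0, [t * v for v in sphere_log(q0, q1)])
    grip = (1.0 - t) * a0[7] + t * a1[7]
    return pos + quat + [grip]
```

Linear interpolation in R⁴ followed by renormalization traces the same arc between two quaternions, but not at constant angular speed, which is one way the Euclidean treatment can distort the path.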
I noticed the model was short-circuiting, leaning on its own joint state to predict actions instead of attending to the visual signal. This turned out to match a known imitation-learning failure mode, de Haan et al.’s Causal Confusion in Imitation Learning. I tested masking the proprioception tokens at 50% and 75% during training to push the model onto the visual input, and settled on 50%.
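One way to read "masking at 50%" is a per-sample coin flip that zeroes every proprioception token for that training example. That reading is an assumption (the masking could equally be applied per token), and `mask_proprio` is a hypothetical helper name; the sketch only shows the mechanism.

```python
import random

def mask_proprio(proprio_tokens, p_mask=0.5, training=True, rng=random):
    """With probability p_mask, replace every proprioception token with zeros.

    proprio_tokens is a list of token vectors (lists of floats). Masking is
    applied only during training; at inference the tokens pass through unchanged.
    """
    if training and rng.random() < p_mask:
        return [[0.0] * len(tok) for tok in proprio_tokens]
    return [list(tok) for tok in proprio_tokens]
```

With the joint state missing on a fraction of training examples, the only remaining path to a low loss on those examples runs through the visual conditioning.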
Worth noting that Physical Intelligence’s π₀ paper appeared shortly after I started this direction. The flow-matching action head with cross-attention conditioning is the same core idea, arrived at independently, though theirs is a full vision-language-action model on a PaliGemma backbone, where mine was vision-action only with the Cosmos Tokenizer as the visual backbone. Their “action expert” framing pushed me to revisit the action-head architecture, and I swapped my plain transformer decoder for a Switch-Transformer (sparse mixture-of-experts) encoder-decoder.
I moved the training stack onto PyTorch Lightning with CARBS for cost-aware hyperparameter sweeps, and trained against teleop demonstrations on the UFactory xArm 7 with pick-and-place as the canonical evaluation task.
Mamba2 Policy
Mamba2 was attractive for two reasons: it’s fast at inference, which matters for running policies at the edge, and its recurrent state-transition structure is a natural fit for sequential robot control. The question was how a Mamba2-based vision-action model would perform.
The setup: a Mamba2-based policy (12 layers, hidden dim 1024) conditioned via per-layer cross-attention FiLM, with modulation parameters generated from action queries against frozen DINOv3-S/16 patch tokens. The action head was a residual MLP. The model was trained with an L1 loss.
I ran two variants. The first used flat DINOv3 patches and zeroed proprioception at training to test how far vision alone could carry the model. The second swapped in a custom 3D bidirectional Mamba (BiMamba2_3D) to keep the patch grid intact across time.
On the pick-and-place tennis ball task, even with 214 episodes in the dataset, the operator had naturally collected more demonstrations from certain zones than others, and the L1-regression Mamba2 head collapsed onto the densest one. It would only pick up the ball from that region.
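A toy calculation shows one mechanism consistent with this collapse. If the conditioning is too weak to tell the zones apart, the best constant prediction under L1 is the median of the targets, which sits inside the densest zone whenever that zone holds more than half the data. The zone positions and the 130/50/34 split below are hypothetical numbers chosen to sum to 214, not measurements from the dataset.

```python
import statistics

def l1_cost(pred, targets):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(pred - y) for y in targets) / len(targets)

# Hypothetical 1-D pick positions and episode counts for three zones.
zone_centers = {"A": 0.2, "B": 0.5, "C": 0.8}
zone_counts = {"A": 130, "B": 50, "C": 34}
targets = [zone_centers[z] for z, n in zone_counts.items() for _ in range(n)]

l1_best = statistics.median(targets)   # minimizer of mean absolute error
l2_best = statistics.fmean(targets)    # minimizer of mean squared error
```

Here `l1_best` lands exactly on zone A, while `l2_best` falls between zones A and B. A deterministic regression head has to commit to a single answer, which is the motivation for a generative action head that can represent several modes.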
One nice property despite the collapse: the policy could recover from human intervention. If I moved the ball or took it out of the gripper while the policy was running, it would go back and complete the pick with a fairly high success rate.
What I took away from this thread: Mamba2 looks great for cases with limited compute and tight latency budgets, since the model produces actions at 30 Hz off the camera’s latest image at each step rather than relying on action chunking. The natural follow-up would be pairing this latency advantage with a generative action head to also get multi-modal action distributions.
Data Collection Pipeline
flowchart LR
    subgraph hw[" External hardware "]
        VR["VR Controller<br/>Meta Quest"]
        Robot[I2RT YAM]
        Cam1["RealSense<br/>D405/D450"]
        Cam2["OAK-D Pro<br/>DepthAI v2"]
    end
    subgraph ctrl[" yam_robot_controller (Python) "]
        IK["Mink IK<br/>+ controller loop"]
    end
    subgraph dcs[" data_collection_service (C++) "]
        direction TB
        TR[Telemetry Receiver]
        CM[Camera Manager]
        BM["Buffer Manager<br/>lock-free ring buffers"]
        RM[Recording Manager]
        Writer[MCAP Writer]
    end
    %% control loop
    VR -- pose / buttons --> IK
    IK -- joint commands --> Robot
    Robot -- joint state --> IK
    %% observation path
    IK -- "telemetry (UDS)<br/>VR + joint state + TCP" --> TR
    Cam1 --> CM
    Cam2 --> CM
    %% internal flow
    TR --> BM
    CM --> BM
    %% recording control (signal, not data)
    TR -. VR state / B-button .-> RM
    %% output
    BM --> Writer
    RM -. start / stop / abort .-> Writer
    Writer --> Disk[(MCAP files)]
I needed a multi-stream capture system that could gather every input the policy would consume (cameras, robot state, VR controller pose, button events) with consistent per-sample timestamps and zero frame drops, and store everything in a single format that could be synced, processed, and chunked into training datasets after the fact.
The architecture is a fan-in. Separate capture threads per source (RealSense D405/D450, OAK-D Pro via DepthAI v2, telemetry from yam_robot_controller over UDS) write into per-source lock-free atomic ring buffers, single-writer / single-reader with acquire/release ordering, 64-byte aligned to avoid false sharing, pre-allocated at startup so there is no allocation on the hot path. A dedicated writer thread drains them into an MCAP file with one channel per stream: /camera/<id>/<stream>/data for FlatBuffer-encoded RawImages (Foxglove schema, so any recording opens directly in Foxglove for visual debugging), /telemetry/vr_controller and /telemetry/robot_state as JSON, plus /robot/config and /system/health for run metadata.
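The index discipline behind a single-writer / single-reader ring buffer can be sketched in a few lines. The real service is C++, and Python has no equivalent of acquire/release atomics or 64-byte alignment, so this illustrates only the head/tail logic. The class name, the power-of-two capacity, and the choice to reject a push when the ring is full are all assumptions.

```python
class SpscRing:
    """Bounded single-producer / single-consumer ring buffer (index logic only)."""

    def __init__(self, capacity):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity   # allocated once, reused for every sample
        self._mask = capacity - 1
        self._head = 0                    # advanced only by the consumer
        self._tail = 0                    # advanced only by the producer

    def try_push(self, item):
        """Producer side: returns False instead of blocking when the ring is full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = item
        self._tail += 1                   # publish only after the slot is written
        return True

    def try_pop(self):
        """Consumer side: returns None when the ring is empty."""
        if self._head == self._tail:
            return None
        index = self._head & self._mask
        item, self._slots[index] = self._slots[index], None
        self._head += 1                   # hand the slot back to the producer
        return item
```

A capture thread would push `(timestamp_ns, payload)` tuples and the writer thread would pop them. In the C++ version the two index updates are where the release stores go, paired with acquire loads on the opposite side.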
Every sample is timestamped at the capture site, before it enters the buffer, so the MCAP file preserves the real wall-clock arrival time of every sample regardless of when it got flushed. That made downstream sync tractable. A Python post-processing step loads the MCAP, time-aligns the streams to a target rate, normalizes the robot and action data, applies any training-specific preprocessing, and emits chunked datasets the training stack can mmap directly.
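The time-alignment step could look roughly like the sketch below, which resamples each stream onto a shared clock by picking the nearest captured sample. Nearest-neighbor matching, the overlap-only time window, and the function names are assumptions; the post-processing described above may interpolate or align differently.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the entry in sorted timestamps closest to t (ties go earlier)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align_streams(streams, rate_hz):
    """Resample every stream onto one grid at rate_hz.

    streams maps a name to a sorted list of (timestamp_ns, value) pairs.
    Returns (grid, aligned), where aligned[name][k] is the value captured
    closest to grid[k]. The grid covers only the span where all streams overlap.
    """
    start = max(samples[0][0] for samples in streams.values())
    end = min(samples[-1][0] for samples in streams.values())
    step = round(1_000_000_000 / rate_hz)
    grid = list(range(start, end + 1, step))
    aligned = {}
    for name, samples in streams.items():
        ts = [t for t, _ in samples]
        aligned[name] = [samples[nearest_index(ts, t)][1] for t in grid]
    return grid, aligned
```

Because every sample carries its capture-site timestamp, the alignment depends only on the recorded times and not on the order in which the writer flushed the buffers.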
Recording lifecycle is VR-driven so the operator never has to switch context. The Recording Manager watches the telemetry stream for VR engagement (start a new session), VR disengagement (close the MCAP cleanly), and the B-button (abort and delete the in-progress file). Pre/post-capture buffers preserve a few seconds before and after each engagement so the demonstration isn’t cut off at the boundary.
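The lifecycle reduces to a two-state machine driven by telemetry. The sketch below is a simplified Python stand-in for the C++ Recording Manager: it models start, stop, and abort plus a pre-capture window, and leaves out the post-capture buffer. The class shape, the 2-second window, and the signal tuples are assumptions.

```python
from collections import deque

class RecordingManager:
    """Turns VR telemetry into start / stop / abort signals for a writer."""

    def __init__(self, pre_roll_ns=2_000_000_000):
        self.pre_roll_ns = pre_roll_ns    # hypothetical pre-capture window
        self.recording = False
        self.pre_roll = deque()           # recent samples kept while idle
        self.signals = []                 # what the writer thread would receive

    def on_telemetry(self, t_ns, engaged, b_button, sample=None):
        if not self.recording:
            # While idle, keep only the most recent pre_roll_ns of samples.
            self.pre_roll.append((t_ns, sample))
            while t_ns - self.pre_roll[0][0] > self.pre_roll_ns:
                self.pre_roll.popleft()
            if engaged:
                self.recording = True
                self.signals.append(("start", list(self.pre_roll)))
                self.pre_roll.clear()
        elif b_button:
            self.recording = False
            self.signals.append(("abort", None))   # writer deletes the file
        elif not engaged:
            self.recording = False
            self.signals.append(("stop", None))    # writer closes the file
```

The B-button is checked before disengagement, so an abort wins if both arrive in the same telemetry sample.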
Remote Teleoperation (Iris)
I ran a short experiment to see how feasible remote teleoperation was. I set up a cooking task and routed the teleop traffic through an AWS-hosted relay (Iris) to simulate the round-trip lag a real remote operator would face. With some optimization tricks, teleoperation seemed feasible for some tasks.
Under the hood, Iris is a C++ relay server that handles video through rtpengine’s kernel-mode RTP forwarding (sub-millisecond pass-through, since the kernel module bypasses userspace) and a separate UDP relay for the VR controller packets, with an HTTP API for setting up operator-robot session pairings.