Watch the animation: a human arm (cyan) and a robot arm (red) both try to reach the same target.
The human takes multiple different paths - sometimes from the left, sometimes from the right. All valid!
The Problem
What if the robot averages these paths?
Averaging a left path and a right path gives a path down the middle: straight into the obstacle!
- Human demonstrations show multiple valid solutions
- A network trained to regress one action (e.g. with an MSE loss) gets pulled toward their average
- The average is often INVALID
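To make the averaging problem concrete, here is a minimal 1-D sketch. Everything in it is an assumption made up for illustration: the obstacle sits at y = 0, the demos pass at roughly y = -1 or y = +1, and the clearance threshold is 0.5. It shows that the prediction with the lowest squared error is the mean of the demos, and that the mean is exactly the spot no human ever chose.

```python
# Toy 1-D setup (all numbers are illustrative assumptions):
# an obstacle sits at y = 0, demos pass it on the left (y ~ -1) or right (y ~ +1).
left_demos = [-1.0, -0.9, -1.1]
right_demos = [1.0, 1.1, 0.9]
demos = left_demos + right_demos


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def is_valid(y, clearance=0.5):
    """A path is valid if it stays at least `clearance` away from the obstacle."""
    return abs(y) >= clearance


# The constant that minimizes MSE is the mean of the targets.
best_mse_prediction = sum(demos) / len(demos)

print("every demo is valid:", all(is_valid(d) for d in demos))
print("MSE-optimal prediction is valid:", is_valid(best_mse_prediction))
print("loss at the mean:", mse(best_mse_prediction, demos))
print("loss when committing to the right:", mse(1.0, demos))
```

Note that committing to one side scores a worse loss than the invalid middle path, which is why a plain regression objective cannot fix this on its own.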
Now watch two robots race: one predicts single actions, one predicts action chunks.
The single-action robot jitters and drifts. The chunking robot moves smoothly.
Action Chunking
Predict K=100 actions at once!
Instead of "what's next?", ask "what's the next 100 steps?"
- Single-action prediction: errors compound over time
- Action chunking: predict a smooth trajectory
- Execute chunk → get fresh prediction from the current observation → drift doesn't carry into the next chunk
Result
Smooth motion instead of jerky steps!
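The chunked control loop can be sketched in a few lines. This is only a sketch under stated assumptions: `toy_policy` and `toy_env_step` are hypothetical 1-D stand-ins for a real policy and robot, and the 200-step episode length is made up. The point is the loop structure: the policy is queried once per chunk, so an episode has K times fewer decision points where errors can start to compound.

```python
def run_chunked(policy, env_step, state, total_steps, chunk_size):
    """Query the policy once per chunk, then execute that chunk step by step."""
    policy_calls = 0
    t = 0
    while t < total_steps:
        chunk = policy(state, chunk_size)  # fresh prediction from the current state
        policy_calls += 1
        for action in chunk[: total_steps - t]:  # don't run past the episode end
            state = env_step(state, action)
            t += 1
    return state, policy_calls


# Hypothetical 1-D stand-ins for a real policy and environment.
def toy_policy(state, chunk_size, target=10.0):
    """Plan `chunk_size` equal steps toward the target, at most 1.0 per step."""
    step = max(-1.0, min(1.0, (target - state) / chunk_size))
    return [step] * chunk_size


def toy_env_step(state, action):
    return state + action


chunked_state, chunked_calls = run_chunked(
    toy_policy, toy_env_step, 0.0, total_steps=200, chunk_size=100
)
single_state, single_calls = run_chunked(
    toy_policy, toy_env_step, 0.0, total_steps=200, chunk_size=1
)
print("policy calls with K=100:", chunked_calls)
print("policy calls with K=1:", single_calls)
```

Setting `chunk_size=1` recovers ordinary single-action prediction, so the same function covers both robots in the race.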
The Conditional VAE (CVAE) solves the averaging problem. It captures the style of each demonstration.
How CVAE Works
Encoder compresses demonstration → latent Z
Z captures "which approach" - left, right, high, low...
Watch: multiple trajectories enter the encoder. The Z variable (gold) emerges - it selects one style, doesn't average!
- Encoder (used only during training): action sequence + joint positions → latent Z
- Z is a style variable - not an average
- Decoder: Z + current observation (camera images, joint positions) → coherent trajectory
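Two standard pieces sit behind the encoder and decoder bullets: sampling Z with the reparameterization trick, and the KL term that keeps Z close to a standard normal prior. The sketch below shows both in plain Python. `toy_decoder` is a hypothetical stand-in (a real decoder is a learned network), and the numbers fed to it are made up. It only illustrates that the decoder commits to the style Z names instead of blending styles.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dim."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def toy_decoder(z, state, horizon=5):
    """Hypothetical decoder: the sign of z[0] picks a side, never a blend."""
    side = -1.0 if z[0] < 0 else 1.0
    return [state + side * (i + 1) / horizon for i in range(horizon)]


rng = random.Random(0)

# Pretend the encoder mapped a "left" demonstration to this mean and log-variance.
mu, logvar = [-2.0], [-4.0]
z_train = reparameterize(mu, logvar, rng)
kl_penalty = kl_to_standard_normal(mu, logvar)

left_path = toy_decoder([-2.0], state=0.0)
right_path = toy_decoder([2.0], state=0.0)
print("left path ends at:", left_path[-1])
print("right path ends at:", right_path[-1])
```

At test time ACT drops the encoder and sets Z to the prior mean (zero), so the decoder alone produces the trajectory.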
Temporal ensembling makes predictions even smoother by blending overlapping chunks.
The Technique
Multiple chunks overlap in time
Blend them with exponential weighting → ultra-smooth motion
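Watch the animation: every chunk that covers the current timestep contributes a prediction, and the predictions are blended into one action.
- Each timestep has multiple predictions from different chunks
- Blend with weights w_i = exp(-m·i), where i = 0 is the oldest prediction; a smaller m lets new predictions count sooner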
- Result: seamless trajectory execution
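Here is a minimal sketch of the blending step for a single timestep. It follows the weighting given in the ACT paper, w_i = exp(-m·i) with i = 0 for the oldest prediction, so the oldest prediction gets the largest weight and m controls how quickly newer ones are incorporated. The default `m=0.01` and the three sample predictions are assumptions for illustration, and actions are scalars here to keep it short.

```python
import math


def temporal_ensemble(predictions, m=0.01):
    """Blend the predictions that overlapping chunks made for one timestep.

    predictions: actions for the same timestep, ordered oldest chunk first.
    m: decay rate; weights are w_i = exp(-m * i), normalized to sum to 1.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Three overlapping chunks each proposed an action for the current timestep.
preds_for_now = [0.50, 0.52, 0.60]  # oldest ... newest (made-up values)

blended = temporal_ensemble(preds_for_now, m=0.01)
plain_mean = temporal_ensemble(preds_for_now, m=0.0)   # m = 0 is a plain average
mostly_oldest = temporal_ensemble(preds_for_now, m=50.0)  # large m ~ oldest only
print("blended:", blended)
print("plain mean:", plain_mean)
print("large m:", mostly_oldest)
```

With a small m the weights are nearly uniform, so the blend behaves like a moving average over chunks, which is where the smoothness comes from.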
Demo → Style → Fusion → Chunk → Smooth
The complete ACT pipeline
You've just learned the three software innovations that made ACT successful on $20K hardware!
The Complete Picture
1. CVAE: Capture style, don't average
2. Action Chunking: Predict K actions at once
3. Temporal Ensembling: Blend for smoothness
These ideas aren't new - but combining them correctly made bimanual manipulation actually work.
- 84% success on Ziploc opening (others: 0-12%)
- 96% success on battery insertion (others: 0-8%)
- Same hardware, different software