Imagine 100 humans demonstrating how to pick up a cup. Every single trajectory is valid: some approach from the left, some from the right, some arc high, some stay low.
This is the beautiful diversity of human demonstration data.
Key Insight
There's no single "correct" way to reach the goal. All 100 trajectories are valid solutions.
What happens if we try to learn from all 100 demonstrations by averaging them?
The red trajectory shows the mathematical average of all paths. It goes straight into the table, missing the cup entirely!
The Problem
The average of valid solutions can be completely invalid.
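This is the multimodality problem: the demonstrations cluster around several distinct modes, and naive supervised regression collapses them into one.
- Average of "go left" and "go right" = crash in the middle
- MSE loss encourages this averaging, because the prediction that minimizes it is the mean of the targets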
- The robot learns the worst of all worlds
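The averaging failure can be reproduced in a few lines. This is a minimal sketch under invented assumptions, not the data behind the visualization: each trajectory is reduced to a sideways offset over time, half of the 100 demonstrations arc left and half arc right, and the arc width of 0.3 is arbitrary.

```python
import numpy as np

# Hypothetical toy setup: x is the hand's sideways offset during the reach.
# Every demonstrator swings around the centerline (x = 0) on one side or the other.
t = np.linspace(0.0, 1.0, 50)
left = -0.3 * np.sin(np.pi * t)    # demonstrations that arc to the left
right = 0.3 * np.sin(np.pi * t)    # demonstrations that arc to the right
demos = np.stack([left] * 50 + [right] * 50)   # 100 demonstrations

# The single trajectory that minimizes mean squared error against all
# demonstrations is their pointwise mean.
mean_path = demos.mean(axis=0)

def mse(path):
    """Mean squared error of one candidate path against every demonstration."""
    return float(np.mean((demos - path) ** 2))

mid = len(t) // 2
print("demo offset at mid-reach:", abs(left[mid]))
print("mean offset at mid-reach:", abs(mean_path[mid]))
print("mean beats 'left' on MSE:", mse(mean_path) < mse(left))
```

In this toy case the mean path stays on the centerline that no demonstrator ever followed, yet it scores a lower MSE than either real style. That is the sense in which the loss rewards the invalid answer.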
The latent variable Z does not average styles; it encodes them. Each region of Z-space corresponds to a different valid trajectory style.
Watch: the purple orb represents Z floating above the scene, with connection beams showing how it relates to different trajectory styles.
How Z Works
Z encodes "which style" as a continuous variable.
Different Z values = different valid approaches to the same task.
- Z ≈ 0.2 → "arc from left" style
- Z ≈ 0.8 → "arc from right" style
- Each Z value gives a coherent trajectory
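To make "regions of Z-space" concrete, here is a small sketch with a hand-written stand-in for the decoder. Everything in it is an assumption made for illustration: the function `decode`, the split at Z = 0.5, and the arc widths are invented, and a trained decoder would be a neural network that learns this mapping from data.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)

def decode(z):
    """Hypothetical stand-in for a trained decoder: one Z value -> one path."""
    side = -1.0 if z < 0.5 else 1.0        # which region of Z-space we are in
    width = 0.2 + 0.2 * abs(z - 0.5)       # style varies smoothly within a region
    return side * width * np.sin(np.pi * t)

left_style = decode(0.2)     # sideways offsets are negative: arc from the left
right_style = decode(0.8)    # sideways offsets are positive: arc from the right

mid = len(t) // 2
print("Z = 0.2 offset at mid-reach:", left_style[mid])
print("Z = 0.8 offset at mid-reach:", right_style[mid])
```

Each call returns one complete arc on one side. Nothing in the function blends the two styles.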
When we sample a specific Z value, the decoder commits to ONE coherent trajectory. No averaging, no compromise.
The golden trajectory shows the selected path, while all other styles fade to gray. Z has made its choice.
Eureka!
Selection preserves coherence.
Instead of blending all styles into mush, we pick one and execute it cleanly.
- At inference: sample Z from prior (standard normal)
- Decoder produces one complete trajectory
- Result: smooth, valid motion
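The inference steps above can be sketched as follows, again with a hand-written toy decoder rather than a trained model. The function `decode`, its split point, and its arc widths are hypothetical; only the sampling step (draw Z from a standard normal, then decode once) mirrors the text.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)

def decode(z):
    """Hypothetical stand-in for a trained decoder: one Z value -> one path."""
    side = -1.0 if z < 0.5 else 1.0
    width = 0.2 + 0.2 * abs(z - 0.5)
    return side * width * np.sin(np.pi * t)

rng = np.random.default_rng(0)
zs = rng.standard_normal(5)             # sample Z from the standard normal prior
paths = [decode(z) for z in zs]         # one complete trajectory per sample

mid = len(t) // 2
for z, path in zip(zs, paths):
    style = "left" if path[mid] < 0 else "right"
    print(f"z = {z:+.2f} -> arcs {style}, offset at mid-reach {path[mid]:+.2f}")
```

Every sampled Z yields a path that commits to one side. In this toy decoder, no sample produces the straight-down-the-middle path that averaging would.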
SELECT, Don't AVERAGE
The key insight that makes the CVAE (conditional variational autoencoder) work
The final comparison: AVERAGE vs SELECT.
The Punchline
Left side: averaging crashes (red trajectory into table)
Right side: selection succeeds (golden trajectory to cup)
The ablation study supports this: removing the CVAE drops performance by 33%. Z is essential.
- CVAE doesn't find the "best" trajectory
- It commits to ONE valid style
- That's why it works!