CVAE Style Selection
Why Z Selects, Not Averages

The Variety

100 valid ways to reach the goal

Imagine 100 humans demonstrating how to pick up a cup. Every single trajectory is valid: some approach from the left, some from the right, some arc high, some stay low.

This is the beautiful diversity of human demonstration data. There's no single "correct" way to accomplish the task.

Key Insight

There's no single "correct" way to reach the goal. All 100 trajectories are valid solutions.

What happens if we try to learn from all 100 demonstrations by averaging them?

The red trajectory shows the mathematical average of all paths. It goes straight into the table, missing the cup entirely!
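To make the crash concrete, here is a minimal sketch with just two demonstrations instead of 100. Everything in it is an assumption for illustration: the 2-D layout, the `arc` and `clearance` helpers, and an obstacle placed halfway between the start and the cup.

```python
import numpy as np

# Hypothetical 2-D setup: the gripper starts at (0, 0), the cup sits at (0, 1),
# and an obstacle sits halfway between them at (0, 0.5).
t = np.linspace(0.0, 1.0, 50)

def arc(side: float) -> np.ndarray:
    """Return a (50, 2) path from start to cup that bows out to one side."""
    x = side * 0.4 * np.sin(np.pi * t)   # lateral bulge, zero at both ends
    y = t                                # steady progress toward the cup
    return np.stack([x, y], axis=1)

left, right = arc(-1.0), arc(+1.0)
obstacle = np.array([0.0, 0.5])

def clearance(path: np.ndarray) -> float:
    """Smallest distance between any point on the path and the obstacle."""
    return float(np.min(np.linalg.norm(path - obstacle, axis=1)))

mean_path = (left + right) / 2.0         # pointwise average of the two demos

print("left demo clearance: ", clearance(left))
print("right demo clearance:", clearance(right))
print("averaged clearance:  ", clearance(mean_path))
```

Both demonstrations keep their distance from the obstacle, while the averaged path runs straight up the middle and passes almost through it.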

The Problem

The average of valid solutions can be completely invalid.

This is the multimodality problem: the demonstrations cluster into several distinct styles, and naive supervised learning blends them together.

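  • Average of "go left" + "go right" = crash in the middle
  • MSE loss encourages this averaging, because the prediction that minimizes squared error is the mean of the targets
  • The robot learns a path that none of the demonstrators actually took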
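Why MSE pushes toward the average can be checked numerically. The sketch below assumes a 1-D toy version of the problem (the `targets` array and the candidate grid are made up for illustration): half the demonstrators are at x = -1, half at x = +1, and the model has to output a single number.

```python
import numpy as np

# Hypothetical 1-D version: at some timestep, half of the demonstrators are
# at x = -1 ("go left") and half are at x = +1 ("go right").
targets = np.array([-1.0] * 50 + [+1.0] * 50)

# A model that outputs one number is scored by
# MSE(pred) = mean((pred - targets)^2). Scan a grid of candidate predictions.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((c - targets) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]
print("MSE-optimal prediction:", best)
print("mean of the targets:   ", targets.mean())
```

The lowest MSE lands at the mean of the targets, which is the middle position that no demonstrator ever occupied.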

The latent variable Z doesn't average styles; it encodes them. Each region of Z-space corresponds to a different valid trajectory style.

Watch: the purple orb represents Z floating above the scene, with connection beams showing how it relates to different trajectory styles.

How Z Works

Z encodes "which style" as a continuous variable.

Different Z values = different valid approaches to the same task.

  • Z ≈ 0.2 → "arc from left" style
  • Z ≈ 0.8 → "arc from right" style
  • Each Z value gives a coherent trajectory
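The mapping from Z to style can be pictured with a toy decoder. This is a hand-written stand-in, not a trained network: the `toy_decoder` function, the 0.5 threshold, and the arc widths are all assumptions chosen to mirror the Z values above. A real CVAE decoder would also be conditioned on the robot's observations, which this sketch leaves out.

```python
import numpy as np

def toy_decoder(z: float, steps: int = 50) -> np.ndarray:
    """Hand-written stand-in for a trained decoder (not a learned model).

    z < 0.5 yields an arc from the left, z >= 0.5 an arc from the right;
    the distance from 0.5 sets how wide the arc swings.
    """
    t = np.linspace(0.0, 1.0, steps)
    side = -1.0 if z < 0.5 else 1.0
    width = 0.3 + 0.4 * abs(z - 0.5)        # between 0.3 and 0.5 for z in [0, 1]
    x = side * width * np.sin(np.pi * t)    # lateral bulge, zero at both ends
    y = t                                   # progress from start (y=0) to cup (y=1)
    return np.stack([x, y], axis=1)

for z in (0.2, 0.8):
    path = toy_decoder(z)
    style = "arc from left" if path[:, 0].mean() < 0 else "arc from right"
    print(f"z = {z}: {style}")
```

Nearby Z values give slightly different arcs on the same side, so every Z decodes to one complete path from start to cup.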

When we sample a specific Z value, the decoder commits to ONE coherent trajectory. No averaging, no compromise.

The golden trajectory shows the selected path - all other styles fade to gray. Z has made its choice.

Eureka!

Selection preserves coherence.

Instead of blending all styles into mush, we pick one and execute it cleanly.

  • At inference: sample Z from the prior (a standard normal)
  • Decoder produces one complete trajectory
  • Result: smooth, valid motion
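The inference steps above can be sketched as follows. The decoder is again a hand-written stand-in rather than a trained model, and here the sign of Z picks the side so that samples from a standard normal spread over both styles. The `toy_decoder` and `clearance` helpers and the obstacle position are assumptions for illustration.

```python
import numpy as np

def toy_decoder(z: float, steps: int = 50) -> np.ndarray:
    """Hand-written stand-in for a trained decoder: the sign of z picks the side."""
    t = np.linspace(0.0, 1.0, steps)
    side = -1.0 if z < 0.0 else 1.0
    width = 0.3 + 0.1 * min(abs(z), 2.0)    # arc width between 0.3 and 0.5
    x = side * width * np.sin(np.pi * t)    # lateral bulge, zero at both ends
    return np.stack([x, t], axis=1)         # y = t runs from start to cup

obstacle = np.array([0.0, 0.5])             # hypothetical obstacle on the direct line

def clearance(path: np.ndarray) -> float:
    """Smallest distance between any point on the path and the obstacle."""
    return float(np.min(np.linalg.norm(path - obstacle, axis=1)))

rng = np.random.default_rng(0)
zs = rng.standard_normal(5)                 # inference: z ~ N(0, 1), no encoder involved
paths = [toy_decoder(z) for z in zs]        # one complete trajectory per sample

for z, path in zip(zs, paths):
    side = "left" if z < 0 else "right"
    print(f"z = {z:+.2f} -> {side} arc, clearance {clearance(path):.2f}")
```

Each sample commits to a single side and clears the obstacle, whichever Z happens to be drawn.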

SELECT, Don't AVERAGE
The key insight that makes CVAE work

The final comparison: AVERAGE vs SELECT.

The Punchline

Left side: averaging crashes (red trajectory into table)

Right side: selection succeeds (golden trajectory to cup)

The ablation study supports this: removing the CVAE drops performance by 33%. Z is essential.

  • CVAE doesn't find the "best" trajectory
  • It commits to ONE valid style
  • That's why it works!