CVAE Style Selection - Why Z Selects, Not Averages

Imagine 100 humans demonstrating how to pick up a cup. Every single trajectory is valid - some go from the left, some from the right, some arc high, some stay low.

This is the beautiful diversity of human demonstration data.

Key Insight

There's no single "correct" way to reach the goal. All 100 trajectories are valid solutions.

What happens if we try to learn from all 100 demonstrations by averaging them?

The red trajectory shows the mathematical average of all paths. It goes straight into the table, missing the cup entirely!

The Problem

The average of valid solutions can be completely invalid.

This is the multimodality problem: the demonstrations fall into several distinct modes, and naive supervised learning collapses them into one.

Average of "go left" + "go right" = crash in middle
MSE loss encourages this averaging behavior, because the prediction that minimizes squared error is the mean of the targets
The robot learns the worst of all worlds
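
To put numbers on the averaging problem, here is a minimal Python sketch. Everything in it is a made-up toy, not data from a real robot: the 0.30 m offsets, the obstacle at 0.0, and the names demos, mse, and nearest_demo_gap are all assumptions for illustration. It reduces each demonstration to one number (the sideways offset at the midpoint of the reach) and checks which single prediction MSE prefers.

```python
import statistics

# Hypothetical toy data: sideways offset (metres) at the midpoint of each reach.
# 50 demonstrations pass left of an obstacle at 0.0, 50 pass right of it.
left_demos = [-0.30] * 50
right_demos = [0.30] * 50
demos = left_demos + right_demos


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# The constant that minimises MSE is the arithmetic mean of the targets.
mse_optimal = statistics.fmean(demos)

# Cross-check on a coarse grid of candidate offsets from -0.50 to +0.50.
candidates = [i / 100 for i in range(-50, 51)]
best_on_grid = min(candidates, key=lambda c: mse(c, demos))

# How far is the MSE-preferred offset from the closest real demonstration?
nearest_demo_gap = min(abs(mse_optimal - d) for d in demos)

print("MSE-optimal offset:", mse_optimal)
print("best offset on grid:", best_on_grid)
print("distance to nearest demonstration:", nearest_demo_gap)
```

In this toy setup the MSE-preferred offset sits right on the obstacle, as far as possible from anything a human actually demonstrated.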

The latent variable Z doesn't average styles - it encodes them. Each region of Z-space corresponds to a different valid trajectory style.

Watch: the purple orb represents Z floating above the scene, with connection beams showing how it relates to different trajectory styles.

How Z Works

Z encodes "which style" as a continuous variable.

Different Z values = different valid approaches to the same task.

Z ≈ 0.2 → "arc from left" style
Z ≈ 0.8 → "arc from right" style
Each Z value gives a coherent trajectory
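
The mapping above can be sketched with a hand-written stand-in for the decoder. This is only an illustration under stated assumptions: toy_decoder is a hypothetical function, not a trained network, and the 0.5 threshold, the sine-shaped arc, and the 0.30 m peak offset are invented so that Z ≈ 0.2 and Z ≈ 0.8 land on different styles. A real CVAE decoder learns this mapping from data.

```python
import math


def toy_decoder(z, n_steps=5, max_offset=0.30):
    """Hypothetical stand-in for a trained CVAE decoder (illustrative only).

    Returns the sideways offset at each waypoint of a reach. The value of z
    picks the side once, and every waypoint then uses that same side, so the
    resulting path never mixes the two styles.
    """
    side = -1.0 if z < 0.5 else 1.0  # assumed split: low z = left, high z = right
    return [
        side * max_offset * math.sin(math.pi * i / (n_steps - 1))
        for i in range(n_steps)
    ]


left_path = toy_decoder(0.2)   # "arc from left" style
right_path = toy_decoder(0.8)  # "arc from right" style

print("z = 0.2:", [round(y, 3) for y in left_path])
print("z = 0.8:", [round(y, 3) for y in right_path])
```

The point of the sketch is that Z is an input to the decoder, so changing Z changes which whole trajectory comes out rather than blending two of them.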

When we sample a specific Z value, the decoder commits to ONE coherent trajectory. No averaging, no compromise.

The golden trajectory shows the selected path - all other styles fade to gray. Z has made its choice.

Eureka!

Selection preserves coherence.

Instead of blending all styles into mush, we pick one and execute it cleanly.

At inference: sample Z from the prior (a standard normal)
Decoder produces one complete trajectory
Result: smooth, valid motion
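
The inference steps above can be sketched as follows, with a pointwise blend added for contrast. As before, toy_decoder is a hypothetical hand-written stand-in rather than a trained model; here it is assumed that the sign of Z picks the side, so that a standard normal prior covers both styles. The seed, the sample count, and the 0.30 m offset are arbitrary choices for the illustration.

```python
import math
import random


def toy_decoder(z, n_steps=5, max_offset=0.30):
    """Hypothetical stand-in for a trained decoder: the sign of z picks the side."""
    side = -1.0 if z < 0.0 else 1.0
    return [
        side * max_offset * math.sin(math.pi * i / (n_steps - 1))
        for i in range(n_steps)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# SELECT: draw one z from the standard normal prior and decode it once.
z = rng.gauss(0.0, 1.0)
selected = toy_decoder(z)

# AVERAGE (for contrast): decode many samples and blend them waypoint by waypoint.
samples = [toy_decoder(rng.gauss(0.0, 1.0)) for _ in range(1000)]
blended = [sum(step) / len(samples) for step in zip(*samples)]

mid = len(selected) // 2
print("selected midpoint offset:", selected[mid])
print("blended midpoint offset:", blended[mid])
```

In this toy, the selected path keeps its full sideways clearance at the midpoint, while the blend of many decoded paths drifts back toward the centre line where the obstacle is.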

SELECT, Don't AVERAGE

The key insight that makes CVAE work

The final comparison: AVERAGE vs SELECT.

The Punchline

Left side: averaging crashes (red trajectory into table)

Right side: selection succeeds (golden trajectory to cup)

The ablation study supports this: removing the CVAE drops performance by 33%. Z is essential.

CVAE doesn't find the "best" trajectory
It commits to ONE valid style
That's why it works!

The Variety