Action Chunking with Transformers
Three Software Innovations

The Imitation Problem

Why averaging destroys good solutions

Watch the animation: a human arm (cyan) and a robot arm (red) both try to reach the same target.

The human takes multiple different paths - sometimes from the left, sometimes from the right. All valid!

The Problem

What if the robot averages these paths?

Averaging a left path and a right path gives a path that goes straight into the obstacle!

  • Human demonstrations show multiple valid solutions
  • A simple network trained by plain regression tends to average them
  • The average is often INVALID
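The averaging failure can be reproduced with a few lines of NumPy. This is a minimal sketch under assumed geometry: the start point, target, obstacle position and radius, and the sine-shaped detours are all made up for illustration and are not taken from the animation. It relies on the fact that, when two targets are equally likely for the same input, the prediction that minimizes mean squared error is their mean.

```python
import numpy as np

# Hypothetical 2-D scene: start at (0, 0), target at (0, 2),
# circular obstacle of radius 0.5 centered at (0, 1).
obstacle_center = np.array([0.0, 1.0])
obstacle_radius = 0.5

t = np.linspace(0.0, 1.0, 21)   # normalized time along the path
y = 2.0 * t                     # both demos advance steadily toward the target
bulge = np.sin(np.pi * t)       # sideways detour, largest at the midpoint

left_demo = np.stack([-bulge, y], axis=1)    # passes the obstacle on the left
right_demo = np.stack([+bulge, y], axis=1)   # passes the obstacle on the right
mean_path = (left_demo + right_demo) / 2.0   # pointwise average of the two demos


def min_clearance(path):
    """Smallest distance from any waypoint to the obstacle surface.

    A negative value means at least one waypoint lies inside the obstacle.
    """
    distances = np.linalg.norm(path - obstacle_center, axis=1)
    return distances.min() - obstacle_radius


for name, path in [("left", left_demo), ("right", right_demo), ("mean", mean_path)]:
    print(f"{name:>5}: clearance = {min_clearance(path):+.2f}")
```

Both demonstrations keep a positive clearance, while the averaged path has a negative one because its midpoint lands on the obstacle center.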

Now watch two robots race: one predicts single actions, one predicts action chunks.

The single-action robot jitters and drifts. The chunking robot moves smoothly.

Action Chunking

Predict K=100 actions at once!

Instead of "what's next?", ask "what's the next 100 steps?"

  • Single-action prediction: errors compound over time
  • Action chunking: predict a smooth trajectory
  • Execute chunk → get fresh prediction → reset errors

Result

Smooth motion instead of jerky steps!
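The "execute chunk → get fresh prediction" loop can be sketched as follows. Everything here is a toy stand-in: `toy_policy`, `run_chunked`, the 1-D position state, and the target at 1.0 are hypothetical names and choices for illustration, not part of ACT. The point is only that the policy is queried once every K steps instead of every step, which leaves fewer decision points at which errors can compound.

```python
import numpy as np

K = 100  # chunk size, as in the text


def toy_policy(obs, k):
    """Hypothetical stand-in for a learned policy.

    Returns k actions (1-D velocities) that move in equal steps from the
    observed position toward a fixed target at 1.0.
    """
    target = 1.0
    return np.full(k, (target - obs) / k)


def run_chunked(policy, start, horizon, k):
    """Query the policy once, execute all k actions open loop, then re-query."""
    pos = start
    queries = 0
    trajectory = [pos]
    for t in range(horizon):
        if t % k == 0:              # chunk exhausted: get a fresh prediction
            chunk = policy(pos, k)
            queries += 1
        pos = pos + chunk[t % k]    # toy dynamics: position integrates the action
        trajectory.append(pos)
    return np.array(trajectory), queries


traj, queries = run_chunked(toy_policy, start=0.0, horizon=300, k=K)
print("policy queries with chunking:", queries)

_, single_queries = run_chunked(toy_policy, start=0.0, horizon=300, k=1)
print("policy queries with single actions:", single_queries)
```

Setting `k=1` recovers ordinary single-action prediction, so the same loop can be used to compare the two modes.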

The Conditional VAE (CVAE) solves the averaging problem. It captures the style of each demonstration.

How CVAE Works

Encoder compresses demonstration → latent Z

Z captures "which approach" - left, right, high, low...

Watch: multiple trajectories enter the encoder. The Z variable (gold) emerges: it selects one style instead of averaging!

  • Encoder (training only): action sequence + joint positions → latent Z
  • Z is a style variable - not an average
  • Decoder: Z + current state → coherent trajectory
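The latent-variable bookkeeping behind a CVAE can be sketched without any neural network code. The snippet below is an assumption-laden illustration: the encoder and decoder networks are omitted, the latent size of 4 and the example `mu` values are made up, and the function names are hypothetical. It shows the two standard ingredients, reparameterized sampling of Z and a KL penalty toward a standard normal prior, plus the convention described in the ACT paper of fixing Z to the prior mean (zero) at test time, when the encoder is no longer used.

```python
import numpy as np

rng = np.random.default_rng(0)


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)


def cvae_loss(pred_actions, demo_actions, mu, logvar, beta):
    """Reconstruction error (L1 here) plus a beta-weighted KL regularizer."""
    recon = np.mean(np.abs(pred_actions - demo_actions))
    return recon + beta * kl_to_standard_normal(mu, logvar)


# Hypothetical encoder output for one demonstration (latent size 4 is illustrative).
mu = np.array([0.5, -0.2, 0.0, 0.1])
logvar = np.zeros(4)

z_train = reparameterize(mu, logvar, rng)  # training: z sampled from the encoder's posterior
z_test = np.zeros_like(mu)                 # inference: encoder dropped, z fixed to the prior mean

print("KL term:", kl_to_standard_normal(mu, logvar))
```

During training, the decoder would receive `z_train` together with the current state and be scored with `cvae_loss`; at inference it would receive `z_test` instead.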

Temporal ensembling makes predictions even smoother by blending overlapping chunks.

The Technique

Multiple chunks overlap in time

Blend them with exponential weighting → ultra-smooth motion

Watch the animation: overlapping predictions for the same timestep merge into a single action.

  • Each timestep has multiple predictions from different chunks
  • Blend with weights w_i = exp(-m·i), where i = 0 is the oldest prediction; a smaller m brings in new observations faster
  • Result: seamless trajectory execution
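The blending step can be sketched as below. This follows the weighting given in the ACT paper, w_i = exp(-m·i) with i = 0 for the oldest prediction, so a positive m favors older predictions and a negative m would favor newer ones. The function names, the 1-D actions, the buffer layout, and the default m = 0.01 are assumptions made for this illustration, and the policy is assumed to be queried at every timestep.

```python
import numpy as np


def ensemble_action(predictions, m=0.01):
    """Blend all predictions made for the current timestep.

    predictions: array of shape (n, action_dim), ordered oldest first.
    m: decay rate; w_i = exp(-m * i), with i = 0 the oldest prediction.
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()                       # normalize so weights sum to 1
    return (weights[:, None] * predictions).sum(axis=0)


def run_with_ensembling(policy, obs_at, horizon, k, m=0.01):
    """Query the policy at every step and execute the blended 1-D action."""
    # buffer[t_query, t_target] holds the action predicted at t_query for t_target.
    buffer = np.full((horizon, horizon + k), np.nan)
    executed = []
    for t in range(horizon):
        buffer[t, t:t + k] = policy(obs_at(t), k)  # new chunk covers steps t .. t+k-1
        column = buffer[:, t]
        valid = column[~np.isnan(column)]          # rows are in query order: oldest first
        executed.append(ensemble_action(valid[:, None], m)[0])
    return np.array(executed)


# Two predictions for the same timestep: the older says 0.0, the newer says 1.0.
print(ensemble_action([[0.0], [1.0]], m=0.0))        # equal weights
print(ensemble_action([[0.0], [1.0]], m=np.log(2)))  # older prediction weighted 2:1
```

With `m=0.0` the blend is a plain mean of the overlapping predictions; increasing m shifts weight toward the prediction that was made earliest.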
Demo → Style → Fusion → Chunk → Smooth
The complete ACT pipeline

You've just learned the three software innovations that made ACT successful on $20K hardware!

The Complete Picture

1. CVAE: Capture style, don't average

2. Action Chunking: Predict K actions at once

3. Temporal Ensembling: Blend for smoothness

These ideas aren't new - but combining them correctly made bimanual manipulation actually work.

  • 84% success on Ziplock opening (others: 0-12%)
  • 96% success on battery insertion (others: 0-8%)
  • Same hardware, different software