What will I learn here?

A Q-table over discretized angle and angular velocity stores a value for each of three actions: push left, coast, or push right. Height-based rewards and ε-greedy exploration illustrate the core RL loop.
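As a rough sketch of what "discretized" means here, the snippet below bins angle and angular velocity into a small grid and allocates one value per action. The bin counts, clip ranges, and the names `to_bin` and `state_index` are assumptions for illustration; the page does not state the values the demo uses.

```python
import math

# Assumed grid resolution and clip ranges (not taken from the page).
N_ANGLE_BINS = 24
N_VEL_BINS = 16
MAX_ANGLE = math.pi   # radians
MAX_VEL = 6.0         # radians per second
ACTIONS = ("push_left", "coast", "push_right")

def to_bin(value, limit, n_bins):
    """Map a value in [-limit, limit] to an integer bin in [0, n_bins - 1]."""
    clipped = min(max(value, -limit), limit)
    frac = (clipped + limit) / (2 * limit)      # 0.0 .. 1.0
    return min(int(frac * n_bins), n_bins - 1)  # keep the upper edge in range

def state_index(theta, omega):
    """Return the (angle_bin, velocity_bin) pair for a continuous state."""
    return (to_bin(theta, MAX_ANGLE, N_ANGLE_BINS),
            to_bin(omega, MAX_VEL, N_VEL_BINS))

# One Q-value per (angle bin, velocity bin, action), all starting at zero.
q_table = [[[0.0] * len(ACTIONS) for _ in range(N_VEL_BINS)]
           for _ in range(N_ANGLE_BINS)]
```

With this layout, `q_table[i][j]` is the list of three action values for the state whose bins are `(i, j)`.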

How is reward defined?

Each step rewards swing height (1−cos θ) with a bonus for high swings. Pushing incurs a small cost; extreme speed or angle ends the episode.
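One way such a reward could be written is sketched below. Only the height term 1−cos θ comes from the page; the bonus threshold, bonus size, push cost, and termination limits are placeholder assumptions, and `step_reward` is a hypothetical name.

```python
import math

# Placeholder constants; the page does not give the demo's actual values.
HIGH_SWING_HEIGHT = 0.8           # height above which the bonus applies
HIGH_SWING_BONUS = 0.5
PUSH_COST = 0.02
MAX_ANGLE = math.radians(150)     # episode ends beyond this angle
MAX_SPEED = 8.0                   # episode ends beyond this angular speed

def step_reward(theta, omega, action):
    """Return (reward, done). action is -1 (push left), 0 (coast), +1 (push right)."""
    height = 1.0 - math.cos(theta)        # 0 at the bottom, 2 at the very top
    reward = height
    if height >= HIGH_SWING_HEIGHT:
        reward += HIGH_SWING_BONUS        # bonus for a high swing
    if action != 0:
        reward -= PUSH_COST               # pushing is slightly penalized
    done = abs(theta) > MAX_ANGLE or abs(omega) > MAX_SPEED
    return reward, done
```

For a swing hanging nearly straight down, the height term is close to zero, so almost all reward has to be earned by building amplitude over many steps.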

What are α, γ, and ε?

α is the learning rate, γ the discount factor, and ε the probability of taking a random action instead of the current best one. Adjust the sliders and watch how returns and the policy change.
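For reference, these three parameters enter the standard tabular Q-learning rule as follows. This is the textbook form, written on the assumption that the demo uses it unchanged; s is the current state, a the chosen action, r the reward, and s' the next state.

```latex
% Q-learning update: alpha scales the step, gamma discounts the next state's best value
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

% epsilon-greedy action selection
a =
\begin{cases}
  \text{a uniformly random action} & \text{with probability } \varepsilon \\
  \arg\max_{a'} Q(s, a')           & \text{otherwise}
\end{cases}
```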

Swing RL

Q-learning discovers when to push and when to coast—just like pumping a swing!

Training settings

Push on the way down, coast on the way up—the Q-table learns this timing from rewards.

Learning rate α: How far each update moves a Q-value toward its target. Large values learn fast but can oscillate.
Discount γ: How much future rewards matter now. Values closer to 1 weight distant rewards more.
Exploration ε: The chance of taking a random action (push or coast). High ε tries many rhythms; low ε sticks to what worked.
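The three settings above can be seen at work in a minimal sketch of ε-greedy action selection and a single Q-learning update. The function names `choose_action` and `q_update` and the numbers in the worked example are illustrative assumptions, not the demo's code.

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q_row, action, reward, next_q_row, alpha, gamma, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # On a terminal step there is no next state to bootstrap from.
    target = reward if done else reward + gamma * max(next_q_row)
    td_error = target - q_row[action]
    q_row[action] += alpha * td_error   # alpha sets how far the value moves
    return td_error

# Worked example: action values for one state (push left, coast, push right).
q_row = [0.0, 0.0, 0.0]
next_q_row = [0.2, 0.5, 0.1]
td = q_update(q_row, action=2, reward=1.0, next_q_row=next_q_row,
              alpha=0.1, gamma=0.9)
# target = 1.0 + 0.9 * 0.5 = 1.45, so Q for "push right" moves from 0.0 to 0.145
```

Raising `alpha` to 1.0 in the example would jump the value straight to the target of 1.45, which is why large learning rates react quickly but can overshoot when rewards are noisy.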

Swing setup

Rope, friction, and wind change the challenge

Reward is swing height (1−cos θ). The agent learns left/right pushes to build amplitude.

Rope: 1.20 m
Friction: 0.035
Push: 2.2

Push in the direction of motion near the bottom to add energy
Coasting near the top usually helps

Swing simulator

Purple robot = agent · bar = height

Steps this episode: 0.0000
Episode return: 0.0000
Height: 0.0043
High swings: 0.0000
θ: -5.3°, ω: 0.23

Episode return

Higher swings raise the curve