Unit 6 Actor-Critic Methods with Robotics Environments
Intro
- Value-based methods
  - Q-learning
- Policy-based methods
  - Policy-gradient methods
    - Use Monte-Carlo sampling to estimate the return, since we cannot calculate it over all trajectories.
    - Need lots of samples, because different trajectories can lead to very different returns, giving high variance.
    - This high variance causes slower training, since many samples are needed.
- Actor-Critic methods, a hybrid of the two families:
  - An Actor that controls how our agent behaves. (Policy-based)
  - A Critic that measures how good the action taken is. (Value-based)
- We will study one of these hybrid methods, Advantage Actor-Critic (A2C).
- The high-level idea is to use a value function (the Critic) in place of Monte-Carlo sampling to reduce variance (see the sketch below).
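
To make the variance point concrete, here is a minimal sketch comparing a full Monte-Carlo return with a bootstrapped TD target that plugs in a critic's value estimate. The reward and value tensors and the discount factor are made-up illustrative values, not anything from the course.

```python
import torch

# Hypothetical rollout data: rewards and critic value estimates per step.
rewards = torch.tensor([1.0, 0.0, 2.0, 1.0])       # r_1 ... r_T
values  = torch.tensor([0.9, 0.4, 1.8, 1.1, 0.5])  # V(s_0) ... V(s_T)
gamma = 0.99

# Monte-Carlo estimate: full discounted return from each step (high variance,
# needs the whole trajectory).
mc_returns = torch.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    mc_returns[t] = running

# Bootstrapped (TD) estimate: one real reward plus the critic's estimate of the
# next state's value (lower variance, some bias).
td_targets = rewards + gamma * values[1:]

print(mc_returns)
print(td_targets)
```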
Advantage Actor-Critic (A2C)
The Actor-Critic Process
- Actor, a policy function parameterized by $\theta$: $\pi_\theta(s)$.
- Critic, a value function parameterized by $w$: $\hat{q}_w(s, a)$.

Process:

- At timestep $t$, we get the current state $S_t$ from the environment and pass it to both the Actor and the Critic.
- The Actor (policy function) takes the state $S_t$ as input and outputs an action $A_t$.
- The Critic (value function) takes the state $S_t$ and action $A_t$ as input and outputs a Q-value $\hat{q}_w(S_t, A_t)$.
- The action $A_t$ performed in the environment results in a new state $S_{t+1}$ and a reward $R_{t+1}$.
- The Actor updates its parameters using the Q-value, then produces the next action $A_{t+1}$ given the new state $S_{t+1}$:
  $$\Delta\theta = \alpha \, \nabla_\theta \big(\log \pi_\theta(S_t, A_t)\big) \, \hat{q}_w(S_t, A_t)$$
  - Here we use $\hat{q}_w(S_t, A_t)$ to approximate the cumulative reward of the trajectory, the quantity Monte-Carlo sampling would otherwise estimate.
- The Critic then updates its value parameters:
  $$\Delta w = \beta \big(R_{t+1} + \gamma \hat{q}_w(S_{t+1}, A_{t+1}) - \hat{q}_w(S_t, A_t)\big) \, \nabla_w \hat{q}_w(S_t, A_t)$$
  - $R_{t+1} + \gamma \hat{q}_w(S_{t+1}, A_{t+1}) - \hat{q}_w(S_t, A_t)$: the temporal-difference (TD) error.
  - $\nabla_w \hat{q}_w(S_t, A_t)$: the gradient of the value function.
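
Below is a minimal PyTorch sketch of one such update step, assuming a discrete action space and a critic that takes a one-hot action concatenated to the state. The network shapes, names, and the fake transition are illustrative assumptions, not the course's implementation.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

# Actor pi_theta(s) and critic q_w(s, a) as small MLPs (illustrative sizes).
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim + n_actions, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def one_hot(a):
    return torch.nn.functional.one_hot(a, n_actions).float()

# Fake transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) for illustration.
s_t = torch.randn(1, obs_dim)
dist = torch.distributions.Categorical(logits=actor(s_t))
a_t = dist.sample()
r_next, s_next = torch.tensor([1.0]), torch.randn(1, obs_dim)
a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()

# Critic output q_w(S_t, A_t).
q_sa = critic(torch.cat([s_t, one_hot(a_t)], dim=-1)).squeeze(-1)

# Actor update: gradient of log pi_theta(A_t|S_t) weighted by q_w(S_t, A_t).
actor_loss = -(dist.log_prob(a_t) * q_sa.detach()).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Critic update: regress q_w(S_t, A_t) toward the TD target
# R_{t+1} + gamma * q_w(S_{t+1}, A_{t+1}).
with torch.no_grad():
    q_next = critic(torch.cat([s_next, one_hot(a_next)], dim=-1)).squeeze(-1)
    td_target = r_next + gamma * q_next
q_sa = critic(torch.cat([s_t, one_hot(a_t)], dim=-1)).squeeze(-1)
critic_loss = (td_target - q_sa).pow(2).mean()
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```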
Adding Advantage in Actor-Critic (A2C)
The advantage function measures how much better taking that action at a state is compared to the average value of the state, i.e., the extra reward we get beyond what is expected at that state. If the advantage is positive, the gradient is pushed in that direction; if negative, in the opposite direction:
$$A(s, a) = Q(s, a) - V(s)$$
We use it to replace the action-value function in the Actor update:
$$\Delta\theta = \alpha \, \nabla_\theta \big(\log \pi_\theta(S_t, A_t)\big) \, A(S_t, A_t)$$
Implementing this directly would require two value functions, $Q(s, a)$ and $V(s)$; instead, we can use the TD error as a good estimator of the advantage function:
$$A(s, a) = r + \gamma V(s') - V(s)$$
$r$ is the immediate reward.
$V(s')$ is the average value of the next state, discounted by $\gamma$.
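
Here is a minimal sketch of this advantage-based update, assuming a discrete action space and a critic that now predicts only $V(s)$. Names, shapes, the fake transition, and the 0.5 critic-loss weight are illustrative assumptions rather than the course's code.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s), not Q(s, a)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

# Fake transition (S_t, A_t, R_{t+1}, S_{t+1}) for illustration.
s_t, s_next = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
dist = torch.distributions.Categorical(logits=actor(s_t))
a_t = dist.sample()
r_next = torch.tensor([1.0])

v_t = critic(s_t).squeeze(-1)            # V(S_t)
with torch.no_grad():
    v_next = critic(s_next).squeeze(-1)  # V(S_{t+1}), treated as a fixed target

# Advantage estimated by the TD error: A(s, a) ~ r + gamma * V(s') - V(s).
advantage = r_next + gamma * v_next - v_t

# Actor loss weights log pi_theta(A_t|S_t) by the (detached) advantage;
# critic loss regresses V(S_t) toward the TD target.
actor_loss = -(dist.log_prob(a_t) * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + 0.5 * critic_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```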