Unit 4. Policy Gradient with PyTorch
Value-based, Policy-based, and Actor-critic methods
- Value based
- Learn a value function leading to an optimal policy.
- Minimize the loss between the predicted value and the target value.
- Generate policy directly from value function.
- Policy based
- Learn to approximate the optimal policy $\pi^*$ directly, without learning a value function.
- Parameterize the policy with parameters $\theta$.
- e.g. Stochastic Policy: $\pi_\theta(a \mid s) = \mathbb{P}[A = a \mid S = s]$ (a minimal PyTorch sketch is given after this list).
- Define an objective function $J(\theta)$, the expected cumulative reward, and find the value of $\theta$ that maximizes it.
- Actor-critic
- A combination of both.
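As a minimal sketch of such a parameterized stochastic policy (the layer sizes and the discrete action space here are illustrative assumptions, not values from the course), a small PyTorch network can map a state to a probability distribution over actions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    """Stochastic policy pi_theta(a | s) over a discrete action space."""
    def __init__(self, state_dim=4, hidden_dim=16, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # outputs a probability distribution over actions
        )

    def forward(self, state):
        return self.net(state)

    def act(self, state):
        """Sample an action and return its log-probability (used later by Reinforce)."""
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```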
Difference between policy-based and policy-gradient methods
Policy-gradient methods are a subclass of policy-based methods.
- In policy-based methods, the optimization is most of the time on-policy, since for each update we only use data (trajectories) collected by the most recent version of the policy.
The difference lies in how we optimize the parameter $\theta$:
- In policy-based methods, we search directly for the optimal policy. We optimize the parameter $\theta$ indirectly by maximizing a local approximation of the objective function with techniques such as hill climbing, simulated annealing, or evolution strategies.
- In policy-gradient methods, we also search directly for the optimal policy, but we optimize the parameter $\theta$ directly by performing gradient ascent on the performance of the objective function $J(\theta)$ (a toy comparison of the two approaches is sketched below).
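As a rough illustration of this distinction (a toy one-dimensional function stands in for $J(\theta)$; in RL the objective can only be estimated from sampled trajectories), hill climbing keeps random perturbations that improve the objective, while gradient ascent follows its gradient directly:

```python
import torch

# Toy 1-D "objective" standing in for J(theta).
def J(theta):
    return -(theta - 3.0) ** 2

# Hill climbing (policy-based flavor): perturb theta, keep the perturbation if it helps.
theta = torch.tensor(0.0)
for _ in range(200):
    candidate = theta + 0.1 * torch.randn(())
    if J(candidate) > J(theta):
        theta = candidate

# Gradient ascent (policy-gradient flavor): follow the gradient of J directly.
theta_g = torch.tensor(0.0, requires_grad=True)
for _ in range(200):
    loss = -J(theta_g)          # minimizing -J == maximizing J
    loss.backward()
    with torch.no_grad():
        theta_g -= 0.05 * theta_g.grad
    theta_g.grad.zero_()

print(theta.item(), theta_g.item())  # both approach the maximizer, 3.0
```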
Advantages
- Simplicity.
- Learn a stochastic policy.
- Don’t need to implement exploration/exploitation trade-off by hand.
- Get rid of perceptual aliasing (two states seem the same but need different actions).
- More efficient in high-dimensional action spaces and continuous action spaces.
- Deep Q-learning assigns a score to each possible action, but policy gradients output a probability distribution over actions.
- Better convergence properties.
- Smooth change of action probabilities at each step, instead of the argmax used in value-based methods, where an arbitrarily small change in the estimated values can completely change the selected action.
Disadvantages
- Converge to a local maximum instead of a global optimum.
- Slower: Step by step. It can take longer to train.
- High variance (to be discussed in the actor-critic unit).
Deeper dive into policy-gradient methods
Policy Gradient Algorithm
- Training Loop
- Collect an episode with the policy.
- Calculate the return (sum of rewards).
- Update weights of the policy.
- If the return is positive -> increase the probability of each (state, action) pair taken during the episode.
- If the return is negative -> decrease it.
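A minimal sketch of this training loop in PyTorch is shown below; it assumes the `Policy` module sketched earlier, a Gymnasium-style environment `env`, and illustrative hyperparameters (none of these are prescribed by the course):

```python
import torch

# Reinforce-style training loop (sketch).
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(1000):
    # 1. Collect an episode with the current policy.
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        action, log_prob = policy.act(torch.as_tensor(state, dtype=torch.float32))
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)

    # 2. Calculate the (discounted) return of the episode.
    G = sum(gamma ** t * r for t, r in enumerate(rewards))

    # 3. Update the policy weights: gradient ascent on sum_t log pi * return,
    #    implemented as gradient descent on the negative.
    loss = -torch.stack(log_probs).sum() * G
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```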
Objective Function
Performance of the agent: given a trajectory $\tau$ (a state-action sequence, without considering rewards), it outputs the expected cumulative reward.
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \sum_{\tau} P(\tau;\theta)\, R(\tau)$$
in which $R(\tau) = \sum_{t} \gamma^{t} r_{t+1}$ is the discounted cumulative reward of the trajectory, and $P(\tau;\theta)$ is the probability of the trajectory under the policy $\pi_\theta$.
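For concreteness, here is a tiny sketch of how $R(\tau)$ could be computed for one sampled trajectory; the reward list and discount factor are made-up illustrative values:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_{t+1} for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: a 4-step trajectory with rewards 1, 0, 0, 1
print(discounted_return([1.0, 0.0, 0.0, 1.0]))  # 1 + 0.99**3 ≈ 1.9703
```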
Optimize Objective Function
The problem:
- Can't calculate the true gradient of the objective function, $\nabla_\theta J(\theta)$, since it requires calculating the probability of every possible trajectory. We need a sample-based estimate instead.
- Can't differentiate the state distribution (the Markov Decision Process dynamics), as we might not know it.
The solution:
- Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
- Math proof. The main tricks are:
- Derivative log trick (likelihood ratio trick or reinforce trick): $\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)$.
- Translate $\nabla_\theta \log P(\tau;\theta)$ into $\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, since the dynamics terms do not depend on $\theta$ and drop out of the gradient.
- Use sampling to approximate the distribution/expectation.
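As a sanity check on the derivative-log trick (a sketch with made-up numbers, not part of the course material), the autograd gradient of $\log \pi_\theta(a \mid s)\, R(\tau)$ for a softmax policy can be compared against the known analytic form $R\,(\mathbf{1}_a - \pi_\theta)$:

```python
import torch

# Softmax policy over 3 actions; logits, sampled action, and return are illustrative.
logits = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)
action = torch.tensor(2)   # pretend this action was sampled during the episode
R = 1.7                    # pretend this was the trajectory's return R(tau)

# Autograd gradient of log pi_theta(a|s) * R(tau) with respect to the logits.
log_prob = torch.log_softmax(logits, dim=-1)[action]
(log_prob * R).backward()

# Analytic gradient of the same quantity for a softmax policy: R * (one_hot(a) - pi).
probs = torch.softmax(logits.detach(), dim=-1)
one_hot = torch.nn.functional.one_hot(action, num_classes=3).float()
print(logits.grad)            # matches the analytic expression below
print(R * (one_hot - probs))
```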
The Reinforce algorithm (Monte Carlo Reinforce)
- Monte Carlo Reinforce: uses an estimated return from an entire episode to update the policy parameter $\theta$.
- One trajectory: $\nabla_\theta J(\theta) \approx \hat{g} = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$
- Multiple trajectories: $\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$ (a batched sketch is given after the intuitions below)
- Intuitions
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the direction of steepest increase of the (log) probability of selecting action $a_t$ from state $s_t$.
- $R(\tau)$ is the scoring function.
- If return is high, it will push up the probabilities of the (state, action) combinations.
- Otherwise, push down.
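As referenced above, here is a rough sketch of the multiple-trajectory estimator implemented as an averaged loss in PyTorch; `trajectories` is an assumed list of (log_probs, rewards) pairs collected with the current policy, and the discount factor is illustrative:

```python
import torch

def reinforce_loss(trajectories, gamma=0.99):
    """Average Reinforce loss over m collected trajectories."""
    losses = []
    for log_probs, rewards in trajectories:
        # R(tau^(i)): discounted return of trajectory i
        R = sum(gamma ** t * r for t, r in enumerate(rewards))
        # sum_t log pi(a_t | s_t) * R(tau); negated so that gradient descent
        # on this loss performs gradient ascent on the objective
        losses.append(-torch.stack(log_probs).sum() * R)
    return torch.stack(losses).mean()   # 1/m sum over the m trajectories
```

Minimizing this loss with an optimizer pushes up the probabilities of (state, action) pairs from high-return trajectories and pushes down those from low-return ones, matching the intuition above.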