1. Policy Evaluation
Recall that, for an arbitrary policy $\pi$:
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big], \quad \forall s \in \mathcal{S}
\end{aligned}
$$
If the environment's dynamics are completely known, this is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns. In this case, iterative solution methods are most suitable: consider a sequence of approximate value functions $v_0, v_1, v_2, \dots$, with the initial approximation $v_0$ chosen arbitrarily, except at the terminal state, where it has to be 0. Each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:
$$
\begin{aligned}
v_{k+1}(s) &= \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big], \quad \forall s \in \mathcal{S}
\end{aligned}
$$
$v_k = v_\pi$ is a fixed point for this update rule because the Bellman equation for $v_\pi$ assures us of equality in this case. It can be shown that the sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$.
Expected updates can be performed in place or using two arrays. With the former, new values immediately overwrite the old ones.
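As a rough illustration, here is a minimal in-place sketch of iterative policy evaluation in Python. The `p(s, a)` interface (returning `(probability, next_state, reward, done)` tuples) and the `(n_states, n_actions)` policy array are assumptions made for this example, not something prescribed by the text.

```python
import numpy as np

def iterative_policy_evaluation(policy, p, n_states, n_actions, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation (sketch).

    policy: array of shape (n_states, n_actions) holding pi(a|s).
    p:      hypothetical dynamics function, p(s, a) -> list of
            (probability, next_state, reward, done) tuples standing in for p(s', r | s, a).
    """
    V = np.zeros(n_states)  # v_0 = 0 everywhere, including terminal states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Expected update: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]
            v_new = 0.0
            for a in range(n_actions):
                for prob, s_next, reward, done in p(s, a):
                    v_new += policy[s, a] * prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place: the new value immediately overwrites the old one
        if delta < theta:  # stop once the largest change in a sweep is small
            return V
```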
2. Policy Improvement
Suppose we follow an existing policy $\pi$; consider selecting action $a$ in state $s$ and thereafter following $\pi$. The value of this way of behaving is:
$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]
\end{aligned}
$$
If it is better to select $a$ in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then the new policy $\pi'$ that makes this selection is better overall. More precisely, if
$$
q_\pi(s, \pi'(s)) \ge v_\pi(s), \quad \forall s \in \mathcal{S},
$$
then $\pi'$ is as good as, or better than, $\pi$, i.e. $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$.
Proof:
$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] && \text{(the condition)} \\
&= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma\, \mathbb{E}_\pi[R_{t+2} + \gamma v_\pi(S_{t+2}) \mid S_{t+1}, A_{t+1} = \pi'(S_{t+1})] \mid S_t = s\big] && \text{(the action-value function)} \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t = s] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \mid S_t = s] \\
&= v_{\pi'}(s)
\end{aligned}
$$
So far we have seen how to evaluate a change in the policy at a single state. It is natural to consider changes at all states, using the greedy policy $\pi'$ given by:
$$
\begin{aligned}
\pi'(s) &= \arg\max_a q_\pi(s, a) \\
&= \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \arg\max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]
\end{aligned}
$$
If the new greedy policy, $\pi'$, is as good as, but not better than, the old policy $\pi$, then:
$$
\begin{aligned}
v_{\pi'}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi'}(s')\big]
\end{aligned}
$$
which is the same as the Bellman optimality equation, and therefore $v_{\pi'}$ must be $v_*$. Policy improvement also applies to stochastic policies, where a probability $\pi(a \mid s)$ is given for each possible action at state $s$.
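A corresponding greedy-improvement step, sketched under the same assumed `p(s, a)` interface as the evaluation example above:

```python
import numpy as np

def greedy_policy_improvement(V, p, n_states, n_actions, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V (sketch).

    pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
    """
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward, done in p(s, a):
                q[a] += prob * (reward + gamma * V[s_next] * (not done))
        policy[s, np.argmax(q)] = 1.0  # put all probability on one greedy action
    return policy
```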
3. Policy Iteration
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. A sequence of monotonically improving policies and value functions can be obtained:
$$
\pi_0 \xrightarrow{\text{Evaluation}} v_{\pi_0} \xrightarrow{\text{Improvement}} \pi_1 \xrightarrow{\text{Evaluation}} v_{\pi_1} \xrightarrow{\text{Improvement}} \pi_2 \xrightarrow{\text{Evaluation}} v_{\pi_2} \xrightarrow{\text{Improvement}} \dots
$$
Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy and the optimal value function in a finite number of iterations.
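Combining the two steps, a policy-iteration loop might look like the following sketch, which reuses the hypothetical `iterative_policy_evaluation` and `greedy_policy_improvement` helpers from the earlier examples:

```python
import numpy as np

def policy_iteration(p, n_states, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable (sketch)."""
    # Start from an arbitrary policy, here uniformly random over actions.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        V = iterative_policy_evaluation(policy, p, n_states, n_actions, gamma)
        new_policy = greedy_policy_improvement(V, p, n_states, n_actions, gamma)
        if np.array_equal(new_policy, policy):  # policy stable: greedy w.r.t. its own value function
            return new_policy, V
        policy = new_policy
```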
4. Value Iteration
Each iteration of policy iteration involves policy evaluation, which may itself be an iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One special case is when policy evaluation is stopped after just one sweep (one update of each state). This is called value iteration:
$$
\begin{aligned}
v_{k+1}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big], \quad \forall s \in \mathcal{S}
\end{aligned}
$$
For arbitrary $v_0$, the sequence $\{v_k\}$ can be shown to converge to $v_*$. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration update is identical to the policy evaluation update except that it requires the max to be taken over all actions.
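A value-iteration sketch under the same assumed `p(s, a)` interface as the earlier examples; the only change from the evaluation sweep is the max over actions:

```python
import numpy as np

def value_iteration(p, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration (sketch): the Bellman optimality equation used as an update rule."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One sweep of truncated evaluation, but with a max over actions
            # instead of an expectation under a fixed policy.
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in p(s, a):
                    q[a] += prob * (reward + gamma * V[s_next] * (not done))
            v_new = q.max()
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update
        if delta < theta:
            # A greedy policy with respect to V can then be extracted
            # exactly as in the policy improvement sketch above.
            return V
```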