1. Policy Evaluation

Recall that, for an arbitrary policy π:

\begin{align*} v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\ &= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')], \quad \forall s \in \mathcal{S} \end{align*}

If the environment’s dynamics are completely known, then the Bellman equation above is a system of |S| simultaneous linear equations in |S| unknowns (the values v_{\pi}(s) for s \in \mathcal{S}). Solving this system directly is possible in principle, but iterative solution methods are most suitable for our purposes: consider a sequence of approximate value functions v_0, v_1, v_2, \dots, with the initial approximation v_0 chosen arbitrarily, except at any terminal state, where it must be 0. Each successive approximation is obtained by using the Bellman equation for v_{\pi} as an update rule:

\begin{align*} v_{k+1}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s] \\ &= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a)[r + \gamma v_k(s')], \quad \forall s \in \mathcal{S} \end{align*}

v_k = v_{\pi} is a fixed point for this update rule because the Bellman equation for v_{\pi} assures us of equality in this case. It can be shown that the sequence \{v_k\} converges to v_{\pi} as k \to \infty.

Expected updates can be performed in place or using two arrays. With the former, new values immediately overwrite the old ones.
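To make the procedure concrete, here is a minimal sketch of in-place iterative policy evaluation in Python/NumPy. The MDP representation is an assumption made only for illustration: a hypothetical table `P` where `P[s][a]` lists `(prob, next_state, reward)` triples encoding p(s', r | s, a), and `policy[s][a]` gives \pi(a|s).

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation (a sketch).

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples,
    and policy[s][a] the probability pi(a|s); both are hypothetical inputs.
    """
    n_states = len(P)
    V = np.zeros(n_states)                  # v_0 = 0 everywhere, including terminal states
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):      # pi(a|s)
                for prob, s_next, r in P[s][a]:        # p(s', r | s, a)
                    v_new += pi_sa * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                               # overwrite immediately (in place)
        if delta < theta:                              # stop once a full sweep changes little
            return V
```

Because the sweep overwrites values in place, later updates within the same sweep already use the freshest estimates, which usually converges faster than the two-array version.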

2. Policy Improvement

Suppose we have determined the value function v_{\pi} for an existing policy \pi. For some state s, consider selecting an action a and thereafter following the existing policy \pi. The value of this way of behaving is:

\begin{align*} q_{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a] \\ &= \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')] \end{align*}

If it is better to select a in s and thereafter follow \pi than it would be to follow \pi all the time, then a new policy \pi' that does so would be better overall. This is the policy improvement theorem: if, for a pair of deterministic policies \pi and \pi',

q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s), \quad \forall s \in \mathcal{S},

then \pi' is as good as or better than \pi, i.e. v_{\pi'}(s) \geq v_{\pi}(s) for all s \in \mathcal{S}.

Proof:

\begin{align*} v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\ &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = \pi'(s)] \\ &= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) | S_t = s] && \text{(the condition)} \\ &= \mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}[R_{t+2} + \gamma v_{\pi}(S_{t+2}) | S_{t+1}, A_{t+1} = \pi'(S_{t+1})] | S_t = s] && \text{(the action-value function)} \\ &= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_{\pi}(S_{t+3}) | S_t = s] \\ &\;\;\vdots \\ &\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots | S_t = s] \\ &= v_{\pi'}(s) \end{align*}

So far we have seen how we can easily evaluate a change in the policy at a single state. Consider changes at all states, using the new greedy policy, \pi', given by:

\begin{align*} \pi'(s) &= \arg\max_a q_{\pi}(s, a) \\ &= \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a] \\ &= \arg\max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')] \end{align*}

If the new greedy policy, \pi', is as good as, but not better than, the old policy \pi, then v_{\pi'} = v_{\pi}, and it follows that for all s \in \mathcal{S}:

\begin{align*} v_{\pi'}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) | S_t = s, A_t = a] \\ &= \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi'}(s')] \end{align*}

which is the same as the Bellman optimality equation; therefore v_{\pi'} must be v_*, and both \pi and \pi' must be optimal policies. The policy improvement theorem also applies to stochastic policies, where a probability \pi(a | s) is given for each possible action a at each state s.
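Below is a short sketch of the greedy improvement step under the same hypothetical `P[s][a]` representation as before: compute q_{\pi}(s, a) from the current value estimates and make the new policy deterministic on the argmax. The function names `q_from_v` and `policy_improvement` are illustrative, not from any particular library.

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.9):
    """q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * [r + gamma * v_pi(s')], for each a."""
    q = np.zeros(len(P[s]))
    for a in range(len(P[s])):
        for prob, s_next, r in P[s][a]:
            q[a] += prob * (r + gamma * V[s_next])
    return q

def policy_improvement(P, V, gamma=0.9):
    """Greedy policy pi'(s) = argmax_a q_pi(s, a), returned as a |S| x |A| probability matrix."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, np.argmax(q_from_v(P, V, s, gamma))] = 1.0   # ties broken by lowest index
    return policy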

3. Policy Iteration

Once a policy, \pi, has been improved using v_{\pi} to yield a better policy, \pi', we can then compute v_{\pi'} and improve it again to yield an even better \pi''. A sequence of monotonically improving policies and value functions can be obtained:

\pi_0 \xrightarrow{\text{Evaluation}} v_{\pi_0} \xrightarrow{\text{Improvement}} \pi_1 \xrightarrow{\text{Evaluation}} v_{\pi_1} \xrightarrow{\text{Improvement}} \pi_2 \xrightarrow{\text{Evaluation}} v_{\pi_2} \xrightarrow{\text{Improvement}} \dots \xrightarrow{\text{Improvement}} \pi_* \xrightarrow{\text{Evaluation}} v_*

Because a finite MDP has only a finite number of deterministic policies, this process must converge to an optimal policy and the optimal value function in a finite number of iterations.
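Putting the two routines together gives the loop below. It is a sketch that reuses the hypothetical `policy_evaluation` and `policy_improvement` functions from the earlier snippets, starts from the equiprobable random policy, and stops when greedy improvement leaves the policy unchanged.

```python
import numpy as np

def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate evaluation and improvement until the policy is stable (a sketch)."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions    # equiprobable starting policy
    while True:
        V = policy_evaluation(P, policy, gamma, theta)     # evaluation: compute v_pi
        new_policy = policy_improvement(P, V, gamma)       # improvement: greedy w.r.t. v_pi
        if np.array_equal(new_policy, policy):             # policy stable => optimal
            return policy, V
        policy = new_policy
```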

4. Value Iteration

Each iteration of policy iteration involves policy evaluation, which may itself be an iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This is called value iteration:

\begin{align*} v_{k+1}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s, A_t = a] \\ &= \max_a \sum_{s',r} p(s', r |s,a)[r + \gamma v_k(s')] \end{align*}

For arbitrary v_0, the sequence \{v_k\} can be shown to converge to v_*. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration update is identical to the policy evaluation update except that it requires the max to be taken over all actions.
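A matching sketch of value iteration under the same hypothetical `P[s][a]` representation: each sweep applies the max-over-actions update directly, and a greedy policy is read off the converged values at the end, reusing the `policy_improvement` helper sketched above.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Turn the Bellman optimality equation into an update rule (a sketch)."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # q(s, a) for every action, then take the max instead of averaging over pi(a|s)
            q = [sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                 for a in range(len(P[s]))]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                                   # in-place sweep, as in policy evaluation
        if delta < theta:
            break
    return policy_improvement(P, V, gamma), V              # greedy policy w.r.t. the converged values
```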