1. The Agent–Environment Interface
Markov property: A state has the Markov property if it includes information about all aspects of the past agent–environment interaction that make a difference for the future. The MDP framework is an abstraction of the problem of goal-directed learning from interaction.
State, action, reward chain: S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots (by convention, receiving the reward is the very first thing that happens at each time step).
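As a minimal sketch of the interaction loop that produces this chain (the two-state environment, its step function, and the random action choice below are all hypothetical, not from the text):

```python
import random

# Hypothetical two-state environment: step() returns the reward first, mirroring
# the convention that R_{t+1} is received before the next action is chosen.
def step(state, action):
    """Made-up dynamics: action 1 usually moves to state 'B', which pays +1."""
    next_state = "B" if (action == 1 and random.random() < 0.8) else "A"
    reward = 1.0 if next_state == "B" else 0.0
    return reward, next_state

state = "A"                                # S_0
trajectory = [state]
for t in range(3):
    action = random.choice([0, 1])         # A_t, drawn from a random policy
    reward, state = step(state, action)    # R_{t+1}, S_{t+1}
    trajectory += [action, reward, state]

print(trajectory)   # e.g. ['A', 1, 1.0, 'B', 0, 0.0, 'A', 1, 1.0, 'B']
```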
In a finite MDP, R_t and S_t have well-defined discrete probability distributions that depend only on the preceding state and action:

p(s',r|s,a) = \Pr(S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a)

and these probabilities sum to 1: \sum_{s'\in S}\sum_{r\in R}p(s',r|s,a) = 1 , for all s \in S , a \in A .
From the four-argument dynamics function p , we can compute the state-transition probabilities, the expected rewards for state–action pairs, and the expected rewards for state–action–next-state triples:

p(s'|s,a) = \Pr(S_t=s'|S_{t-1}=s,A_{t-1}=a) = \sum_{r\in R}p(s',r|s,a)

r(s,a) = \mathbb{E}[R_t|S_{t-1}=s,A_{t-1}=a] = \sum_{r\in R}r \, p(r|s,a) = \sum_{r\in R}r\sum_{s'\in S}p(s',r|s,a)

r(s,a,s') = \mathbb{E}[R_t|S_{t-1}=s,A_{t-1}=a, S_{t}=s'] = \sum_{r\in R}r \, p(r|s,a,s') = \sum_{r\in R}r\frac{p(s',r|s,a)}{p(s'|s,a)}
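As a concrete sketch (the tiny MDP below and its numbers are hypothetical), the four-argument dynamics can be stored as a nested table and the three derived quantities computed directly from it:

```python
# Hypothetical four-argument dynamics p(s', r | s, a) for a two-state MDP,
# stored as {(s, a): {(s', r): probability}}.
p = {
    ("A", 0): {("A", 0.0): 1.0},
    ("A", 1): {("B", 1.0): 0.8, ("A", 0.0): 0.2},
    ("B", 0): {("B", 0.0): 0.5, ("A", 0.0): 0.5},
    ("B", 1): {("A", 2.0): 1.0},
}

# Sanity check: for every (s, a), the probabilities sum to 1.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

def state_transition_prob(s_next, s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

def expected_reward_triple(s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    num = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / state_transition_prob(s_next, s, a)

print(state_transition_prob("B", "A", 1))   # 0.8
print(expected_reward("A", 1))              # 0.8
print(expected_reward_triple("A", 1, "B"))  # 1.0
```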
2. Goals, Rewards, Policies, and Value Functions
Reward: A simple number R_t \in \mathbb{R} .
Return: The sum of rewards can be defined as

G_t = R_{t+1} + R_{t+2} + \cdots + R_{T}

or
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum^{\infty}_{k=0} \gamma^k R_{t+k+1}, \quad 0 \leq \gamma \leq 1

where \gamma is the discount rate. Note G_t is the future return (it does not include the reward at time t ). It can also be written as:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} +... = R_{t+1} + \gamma ( R_{t+2} + \gamma R_{t+3} +...) = R_{t+1} + \gamma G_{t+1}, \quad t < T

Note that although the return is a sum of infinitely many terms, it is still finite if the rewards are bounded and \gamma < 1 :

G_t \leq \sum_{k=0}^{\infty}\gamma^k R_{\max} = \frac{1}{1-\gamma}R_{\max}
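A short sketch of this recursion (the reward sequence and \gamma below are made up): returns can be computed by sweeping backwards over an episode, and each one stays below R_{\max}/(1-\gamma).

```python
# Compute G_t = R_{t+1} + gamma * G_{t+1} backwards over a finite episode.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]      # hypothetical R_1, ..., R_T with T = 5

returns = [0.0] * (len(rewards) + 1)     # returns[t] holds G_t, with G_T = 0
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]   # rewards[t] stores R_{t+1}

print(returns[:-1])                                    # G_0, ..., G_{T-1}
print(max(returns) <= max(rewards) / (1 - gamma))      # bound holds: True
```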
Policy: A mapping from states to probabilities of selecting each possible action: \pi(a|s) = \Pr(A_t=a | S_t=s) .

Value function: The expected return when starting in s and following \pi thereafter, v_{\pi}(s) . For MDPs, we can define v_{\pi} to be:
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s] = \mathbb{E}_{\pi}[\sum^{\infty}_{k=0}\gamma^k R_{t+k+1} | S_t = s], \quad \forall s \in S

Action-value function: Define the value of taking action a in state s under a policy \pi , denoted q_{\pi}(s,a) , as the expected return starting from s , taking action a , and thereafter following policy \pi :
q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] = \mathbb{E}_{\pi}[\sum^{\infty}_{k=0}\gamma^k R_{t+k+1} | S_t = s, A_t = a], \quad \forall s \in S, a \in A

A fundamental property of value functions used in RL and DP is that they satisfy recursive relationships: for any policy \pi and any state s , the following consistency condition holds between the value of a state and its possible successor states:
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s] = \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] = \sum_a\pi(a|s)\sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \mathbb{E}_{\pi}[G_{t+1}|S_{t+1} = s'] ] = \sum_a\pi(a|s)\sum_{s', r}p(s',r|s,a)[r + \gamma v_{\pi}(s')], \quad \forall s \in S
Note that the above equation is the Bellman equation for v_{\pi} .
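As a sketch of how this equation can be used (the solution method itself is not part of this section, and the toy dynamics and equiprobable policy below are hypothetical), the Bellman equation for v_{\pi} can be turned into an update rule and swept over the states until the values stop changing:

```python
# Iteratively apply the Bellman equation for v_pi as an update rule.
gamma = 0.9
p = {  # hypothetical dynamics {(s, a): {(s', r): probability}}
    ("A", 0): {("A", 0.0): 1.0},
    ("A", 1): {("B", 1.0): 0.8, ("A", 0.0): 0.2},
    ("B", 0): {("B", 0.0): 0.5, ("A", 0.0): 0.5},
    ("B", 1): {("A", 2.0): 1.0},
}
pi = {"A": {0: 0.5, 1: 0.5}, "B": {0: 0.5, 1: 0.5}}   # pi(a | s), equiprobable

v = {s: 0.0 for s in pi}
while True:
    delta = 0.0
    for s in pi:
        # v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
        new_v = sum(pi[s][a] * sum(prob * (r + gamma * v[sp])
                                   for (sp, r), prob in p[(s, a)].items())
                    for a in pi[s])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:      # stop once the values are (numerically) self-consistent
        break

print(v)   # approximate v_pi for the equiprobable policy
```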
3. Optimal Policies and Optimal Value Functions
Optimal Policy: A policy \pi is better than or equal to a policy \pi' if v_{\pi}(s) \geq v_{\pi'}(s) , \forall s \in S . An optimal policy is one that is better than or equal to all other policies. All optimal policies share the same optimal state-value function:

v_{*}(s)=\max_{\pi}v_{\pi}(s), \quad \forall s \in S

Optimal policies also share the same optimal action-value function:
q_*(s,a) = \max_{\pi}q_{\pi}(s,a), \quad \forall s \in S, a\in A

For the state-action pair (s,a) , this function gives the expected return for taking action a in state s and thereafter following an optimal policy, thus:
q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t=a]

Because v_* is the value function for a policy, it must satisfy the consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, v_* 's consistency condition can be written in a special form without reference to any specific policy. This is the Bellman equation for v_* , or the Bellman optimality equation:

v_*(s) = \max_{a} q_{\pi_*}(s,a) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] = \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v_*(s')]

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
The Bellman optimality equation for q_* is:

q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s',r} p(s',r|s,a)[r + \gamma \max_{a'} q_*(s',a')]
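As code, the right-hand side of this equation is a single backup; a minimal sketch, assuming the same {(s, a): {(s', r): prob}} dynamics table used in the earlier sketches:

```python
# One application of the Bellman optimality backup for q.
def q_optimality_backup(s, a, q, p, actions, gamma):
    """sum_{s',r} p(s',r|s,a) * (r + gamma * max_{a'} q[(s', a')])."""
    return sum(prob * (r + gamma * max(q[(sp, ap)] for ap in actions[sp]))
               for (sp, r), prob in p[(s, a)].items())
```

Applying this backup repeatedly over all state–action pairs drives the estimates toward q_* , analogously to the v_* iteration sketched below.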
Solution: For finite MDPs, the Bellman optimality equation for v_* has a unique solution. It is a system of equations, one for each state, and only the values v_*(s) are unknown if the dynamics p of the environment are known.
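A minimal sketch of solving this system iteratively, reusing the hypothetical toy dynamics from above: repeatedly applying the Bellman optimality backup drives the estimates to the unique fixed point v_* .

```python
# Turn the Bellman optimality equation into an iterative backup and sweep to convergence.
gamma = 0.9
p = {("A", 0): {("A", 0.0): 1.0},
     ("A", 1): {("B", 1.0): 0.8, ("A", 0.0): 0.2},
     ("B", 0): {("B", 0.0): 0.5, ("A", 0.0): 0.5},
     ("B", 1): {("A", 2.0): 1.0}}
actions = {"A": [0, 1], "B": [0, 1]}

v = {s: 0.0 for s in actions}            # initial guess for v_*
while True:
    delta = 0.0
    for s in actions:
        # v(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))
        new_v = max(sum(prob * (r + gamma * v[sp])
                        for (sp, r), prob in p[(s, a)].items())
                    for a in actions[s])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)   # approximate v_*(s) for the toy MDP
```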
Once one has v_* , determining an optimal policy is easy: for each state s , there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to such actions is optimal (a one-step search suffices, as v_* already takes into account the reward consequences of all possible future behavior).
If one has q_* , the agent can simply find any action that maximizes q_*(s,a) .
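A small sketch of both cases (the function names are mine, and v_star, q_star, p, actions, and gamma are assumed to come from sketches like the ones above): extracting a greedy action from v_* needs the one-step lookahead through p , while extracting it from q_* is just an argmax.

```python
def greedy_from_v(s, v_star, p, actions, gamma):
    """One-step search: pick an action achieving the max in the Bellman optimality equation."""
    return max(actions[s],
               key=lambda a: sum(prob * (r + gamma * v_star[sp])
                                 for (sp, r), prob in p[(s, a)].items()))

def greedy_from_q(s, q_star, actions):
    """With q_*, no lookahead through the dynamics is needed."""
    return max(actions[s], key=lambda a: q_star[(s, a)])
```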