COMP90054 AI Planning for Autonomy
Are there any resources that help with this course?
What is this course about?
This course is an introduction to AI planning for autonomy. It covers search algorithms, classical planning and relaxation heuristics, Markov decision processes, and reinforcement learning, and concludes with an introduction to game theory.
What is AI Planning?
AI planning is the subfield of artificial intelligence concerned with finding a sequence of actions that achieves a goal. It is used in a wide range of applications, including robotics, logistics, and scheduling.
Are there any applications of AI Planning?
AI planning has a wide range of applications, including:
- Robotics: planning the actions a robot must take to achieve a goal.
- Game playing: planning the moves of players in games.
- Autonomous vehicles: planning the manoeuvres of self-driving vehicles.
What is BFS?
A queue: first in, first out (FIFO).
Optimal when action costs are uniform.
def breadthFirstSearch(problem):
    candidate_queue = util.Queue()
    init_state = problem.getStartState()
    init_path = []
    candidate_queue.push((init_state, init_path))
    viewed = []
    while not candidate_queue.isEmpty():
        state, path = candidate_queue.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_queue.push((child_state, child_path))
    return None
What is DFS?
A stack: last in, first out (LIFO).
The search space may be infinite (infinitely deep), so DFS is not guaranteed to terminate and is not optimal.
def depthFirstSearch(problem):
    candidate_stack = util.Stack()
    init_state = problem.getStartState()
    init_path = []
    candidate_stack.push((init_state, init_path))
    viewed = []
    while not candidate_stack.isEmpty():
        state, path = candidate_stack.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_stack.push((child_state, child_path))
    return None
What is Uniform Cost Search?
A priority queue: the node closest to the start (i.e., with the lowest path cost) is explored first.
Breadth-first search is a special case of uniform cost search in which every action costs 1.
UCS considers only the path cost and ignores any heuristic.
def uniformCostSearch(problem):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, child_cost in problem.getSuccessors(state):
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), problem.getCostOfActions(child_path))
    return None
What is Greedy Best First Search?
Greedy best-first search considers only the heuristic and ignores the path cost.
If h = 0, its behaviour depends on how the priority queue breaks ties.
def bestFirstSearch(problem, heuristic=nullHeuristic):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), heuristic(init_state, problem))
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost; greedy best-first search uses only the heuristic
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), heuristic(child_state, problem))
    return None
What is A* Search?
If h=0, A* is equivalent to Uniform Cost Search.
If h is admissible, A* is optimal: since h never overestimates the true remaining cost, the states on an optimal path are expanded before a more expensive path can reach the goal.
def aStarSearch(problem, heuristic=nullHeuristic):
    """Search the node that has the lowest combined cost and heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (not state in best_g) or (cost < best_g[state]):  # re-open the state if this path is cheaper
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, heuristic(succState, problem) + new_cost)
    return None  # Goal not found
Why is there a need for Weighted A* Search?
Weighting the heuristic trades off search speed against solution quality.
The larger w is, the faster the search, but the more likely it is to miss the optimal solution; the smaller w is, the slower the search, but the more likely it is to find the optimal solution.
When w = 0, it is equivalent to uniform cost search.
As w → ∞, it behaves like greedy best-first search.
def weightedAStarSearch(problem, heuristic=nullHeuristic, weight=1):
    """Search the node that has the lowest combined cost and weighted heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, weight * heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (not state in best_g) or (cost < best_g[state]):
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, weight * heuristic(succState, problem) + new_cost)
    return None  # Goal not found
What is Hill Climbing?
Hill climbing can get stuck in a local optimum.
σ := make-root-node(init())
forever:
    if is-goal(state(σ)):
        return extract-solution(σ)
    Σ′ := {make-node(σ, a, s′) | (a, s′) ∈ succ(state(σ))}
    σ := an element of Σ′ minimising h  /* (break ties randomly) */
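A minimal Python sketch of this procedure, assuming the same problem and heuristic interface used by the search functions above; note that it may loop forever on plateaus or get trapped in a local minimum.
import random

def hillClimbing(problem, heuristic=nullHeuristic):
    # Greedily move to the successor with the lowest heuristic value.
    state = problem.getStartState()
    path = []
    while not problem.isGoalState(state):
        successors = problem.getSuccessors(state)
        if not successors:
            return None  # dead end
        best_h = min(heuristic(s, problem) for s, _, _ in successors)
        best = [(s, a) for s, a, _ in successors if heuristic(s, problem) == best_h]
        state, action = random.choice(best)  # break ties randomly
        path.append(action)
    return path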
What are the properties of a heuristic function?
The perfect heuristic h*(s) is the remaining cost of an optimal plan from s to the goal.
Let Π be a planning task with state space Θ, and let h be a heuristic for Π. The heuristic is called:
- Safe: if h(s) = ∞ implies h*(s) = ∞ for all states s.
- Goal-aware: if h(s) = 0 for all goal states s.
- Admissible: if h(s) ≤ h*(s) for all states s.
- Consistent: if h(s) ≤ h(s′) + c(a) for all transitions s →a s′.
Proposition: let Π be a planning task with state space Θ, and let h be a heuristic for Π.
- If h is consistent and goal-aware, then h is admissible.
- If h is admissible, then h is goal-aware.
- If h is admissible, then h is safe.
- No other implications of this form hold.
Inadmissible: if a node on the optimal path is overestimated, other nodes are expanded first and the optimal solution can be missed.
Consistency: guarantees that the states along an optimal path are expanded in order with their optimal costs.
What is STRIPS?
A STRIPS planning task is a tuple Π = ⟨F, O, I, G⟩ where:
- F is a set of facts (atoms, variables)
- O is a set of operators (actions)
- I ⊆ F represents the initial situation
- G ⊆ F represents the goal situation
Each operator o ∈ O is represented by:
- an add list Add(o)
- a delete list Del(o)
- a precondition list Pre(o)
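A small Python sketch of this representation and of applying an operator by progression; the class name, fields, and the blocks-world example below are illustrative, not course code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre: frozenset     # precondition list Pre(o)
    add: frozenset     # add list Add(o)
    delete: frozenset  # delete list Del(o)

def applicable(state, op):
    # An operator is applicable when all of its preconditions hold in the state.
    return op.pre <= state

def apply(state, op):
    # Progression: remove the delete list, then add the add list.
    return (state - op.delete) | op.add

# Illustrative blocks-world operator: stack block a onto block b.
stack_ab = Operator("stack(a,b)",
                    pre=frozenset({"clear(a)", "clear(b)", "ontable(a)"}),
                    add=frozenset({"on(a,b)"}),
                    delete=frozenset({"clear(b)", "ontable(a)"}))
state = frozenset({"clear(a)", "clear(b)", "ontable(a)", "ontable(b)"})
if applicable(state, stack_ab):
    state = apply(state, stack_ab)  # {"clear(a)", "on(a,b)", "ontable(b)"}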
What is Relaxation?
Goal: relaxation helps to compute heuristic functions.
Let P be a class of problems with perfect heuristic h*. A relaxation of P is a triple R = (P′, r, h′), where P′ is an arbitrary set and r : P → P′ and h′ : P′ → R⁺₀ ∪ {∞} are functions such that the relaxation heuristic h^R(Π) := h′(r(Π)) satisfies h^R(Π) ≤ h*(Π) for all Π ∈ P. The relaxation is:
- native, if P′ ⊆ P and h′ = h*;
- efficiently constructible, if there is a polynomial-time algorithm that, given Π ∈ P, computes r(Π);
- efficiently computable, if there is a polynomial-time algorithm that, given Π′ ∈ P′, computes h′(Π′).
Example:
- Problem P: path finding on a road map.
- Simpler problem P′: path finding for birds.
- Perfect heuristic h′ of P′: straight-line distance.
- Transformation r: pretend you are a bird.
Reminder: you have a problem P whose perfect heuristic h* you wish to estimate. You define a simpler problem P′ whose perfect heuristic h′ can be used to estimate h* (admissibly!). You define a transformation r from P to P′. Given Π ∈ P, you estimate h*(Π) by h′(r(Π)).
Notation in this course:
- Π_s: the task Π with its initial state replaced by s, i.e., I is changed to s.
- c: a conjunction (clause) of facts, such as preconditions and effects.
Delete Relaxation
Ignore all delete effects during the search: facts that become true stay true, which reduces the number of distinct states to explore before the goal is found.
- State dominance: if one state dominates another (contains all of its facts), the dominated state can be ignored, since it is subsumed.
def greedyRelaxedPlan(problem):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), 0)
    return None
Neither admissible nor consistent, because it does not guarantee an optimal relaxed plan, only that some relaxed plan is found.
Computing the optimal delete-relaxation heuristic h⁺ is NP-hard.
Additive and Max Heuristics
- Additive (hadd): sums the estimates of the subgoals; clearly not admissible, since shared actions may be counted several times.
- Max (hmax): takes the maximum estimate over the subgoals, i.e., the hardest subgoal; admissible but often much less informative.
- Both are goal-aware, and whenever h⁺ = ∞ they are also ∞, so they are safe.
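For reference, a sketch of the standard recursive definitions, writing h(s, g) for the estimated cost of achieving a conjunction of facts g from state s:
$$ h^{\max}(s, g) = \max_{p \in g} h(s, p), \qquad h^{\mathrm{add}}(s, g) = \sum_{p \in g} h(s, p) $$
$$ h(s, p) = \begin{cases} 0 & \text{if } p \in s \\ \min_{a:\, p \in \mathrm{add}(a)} \big( c(a) + h(s, \mathrm{pre}(a)) \big) & \text{otherwise} \end{cases} $$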
Best Supporter Heuristic
- The best supporter of a fact p is the achiever with the lowest estimated cost: $bs(p) = \text{argmin}_{a:\, p \in \mathrm{add}(a)} \big( c(a) + h(s, \mathrm{pre}(a)) \big)$.
- It combines the action cost c(a) with the estimated cost of the action's precondition h(s, pre(a)) and selects the best supporter.
- argmin: among the actions that achieve p, pick the one with the smallest h value.
Bellman-Ford for hmax and hadd
A Bellman-Ford variant computes hadd (or hmax) for a state s.
Repeatedly update the table Tadd until the values no longer change. In each iteration, for every fact (subgoal) g a new value f_i(g) is computed, which is the current estimate of the cost of achieving g from state s.
def bellmanFordHadd(problem, state):
    # Fixed-point (Bellman-Ford style) computation of hadd for state s.
    # Assumes the problem exposes getFacts() and getActions(), where each
    # action has .preconditions, .addList and .cost -- hypothetical accessors.
    INF = float('inf')
    Tadd = {p: (0 if p in state else INF) for p in problem.getFacts()}
    changed = True
    while changed:
        changed = False
        for a in problem.getActions():
            pre = [Tadd[p] for p in a.preconditions]
            if INF in pre:
                continue  # the action's precondition is not yet reachable
            pre_cost = sum(pre)  # use max(pre or [0]) instead for hmax
            for p in a.addList:
                new_value = a.cost + pre_cost
                if new_value < Tadd[p]:
                    Tadd[p] = new_value
                    changed = True
    return Tadd
Iterated Width
Novelty only considers which tuples of state-variable atoms appear for the first time during the search.
IW(i) is a breadth-first search that prunes every state whose novelty is greater than i; iterated width runs IW(1), IW(2), ... until the goal is found, up to the number of state variables (atoms).
The novelty of a state is the size of the smallest subset of atoms in s that is true for the first time in the search.
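A small Python sketch of a novelty check up to a given width, assuming states are represented as sets of atoms and seen_tuples is a set maintained over the whole search; the function names are illustrative.
from itertools import combinations

def novelty(state_atoms, seen_tuples, max_width=2):
    # Size of the smallest tuple of atoms in the state that has not been
    # seen before, or max_width + 1 if every tuple up to max_width is old.
    for k in range(1, max_width + 1):
        for combo in combinations(sorted(state_atoms), k):
            if frozenset(combo) not in seen_tuples:
                return k
    return max_width + 1

def record(state_atoms, seen_tuples, max_width=2):
    # Remember all tuples of this state so later states can be pruned.
    for k in range(1, max_width + 1):
        for combo in combinations(sorted(state_atoms), k):
            seen_tuples.add(frozenset(combo))

# In IW(i): prune a newly generated state if novelty(...) > i, otherwise record it.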
What is Markov Decision Process?
The Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is used in a wide range of applications, including robotics, economics, and game theory.
An MDP (in its goal-based, shortest-path formulation) consists of:
- a state space S
- an initial state s0 ∈ S
- a set of goal states G ⊆ S
- actions A(s) ⊆ A applicable in each state s
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s)
- action costs c(a, s) > 0
where:
- solutions are functions (policies) mapping states to actions
- optimal solutions minimise the expected cost to reach a goal
Partially Observable MDPs (POMDPs)
- states s ∈ S
- actions A(s) applicable in each state s
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s)
- an initial belief state b0
- a set of final (goal) belief states bF
- a sensor model given by probabilities P_a(o | s) for observations o ∈ Obs
where:
- belief states are probability distributions over S
- solutions are policies mapping belief states into actions
- optimal policies minimise the expected cost to go from b0 to bF
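For reference, a sketch of the standard belief update after applying action a and observing o (notation as above; this equation is not stated explicitly in these notes):
$$ b_a(s) = \sum_{s' \in S} P_a(s \mid s')\, b(s'), \qquad b_a^{o}(s) = \frac{P_a(o \mid s)\, b_a(s)}{\sum_{s''} P_a(o \mid s'')\, b_a(s'')} $$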
see also: https://gibberblot.github.io/rl-notes/single-agent/MDPs.html
Value Iteration
A dynamic programming algorithm that computes the optimal value function, and hence an optimal policy, of an MDP.
def value_iteration(states, actions, P, r, gamma, theta):
    V = {s: 0 for s in states}  # Initialize value function
    while True:
        delta = 0
        for s in states:
            v = V[s]
            V[s] = max(sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                           for s_prime in states) for a in actions[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # Check for convergence
            break
    return V
Bellman Optimality Equation
$V(s) = \max_{a \in A(s)} \sum_{s'} P_a(s' \mid s)\, \big[ r(s, a, s') + \gamma\, V(s') \big]$
- the probabilities of all possible next states
- the reward of the action
- the value of the next state times the discount factor, taken from the previous iteration
Q-Value
For each state s, the quality of an applicable action a is:
$Q(s, a) = \sum_{s'} P_a(s' \mid s)\, \big[ r(s, a, s') + \gamma\, V(s') \big]$
where $\gamma$ is the discount factor: the closer it is to 1, the more weight is given to long-term rewards; the closer it is to 0, the more weight is given to short-term rewards.
Successive powers $\gamma, \gamma^2, \gamma^3, \gamma^4, \dots$ for different discount factors:
- γ = 0.95: 0.95, 0.9025, 0.857375, 0.81450625, ...
- γ = 0.9: 0.9, 0.81, 0.729, 0.6561, ...
- γ = 0.8: 0.8, 0.64, 0.512, 0.4096, ...
- γ = 0.7: 0.7, 0.49, 0.343, 0.2401, ...
Some concepts related to Q-values
- Value: in reinforcement learning, value usually refers to the value of a state, i.e., the expected return obtained by starting from that state and following a given policy.
- Q-value: the value of a state-action pair, i.e., the expected return obtained by taking a given action in a given state and then following a given policy.
- Q-value table: a data structure that stores the Q-value of every state-action pair. In tabular reinforcement learning algorithms (such as Q-learning), the Q-value table is the main data structure.
- Q-value function: a function that takes a state and an action as input and returns the Q-value of that pair. In function-approximation methods (such as deep Q-networks), the Q-value function is usually represented by a neural network.
- Q-table: the same concept as a Q-value table, just a different name.
Policy
A policy π maps states to actions; its value is the expected return obtained by following it from a given state.
Multi-Armed Bandit
Try multiple actions (arms) in parallel, balancing exploitation and exploration.
Minimising regret: the loss incurred by not always choosing the best action.
Input: a multi-armed bandit problem
Output: Q-function Q
Initialise Q(a) arbitrarily; e.g., Q(a) = 0 for every arm a
Initialise N(a) = 0 for every arm a (the number of times a has been pulled)
while t < T do
    select an arm a to pull in round t (e.g. using ε-greedy or UCB over Q)
    execute arm a in round t and observe reward r
    N(a) ← N(a) + 1;  Q(a) ← Q(a) + (1 / N(a)) · (r − Q(a))
end while
ε-greedy
With probability 1 − ε choose the best action (according to Q), and with probability ε choose a random action.
ε-greedy with decay
Decrease the value of ε over time, so the agent explores less as its estimates improve.
UCB
$\text{argmax}_{a}\left(Q(a) + \sqrt{\frac{2 \ln t}{N(a)}}\right)$, where $t$ is the number of rounds played so far and $N(a)$ is the number of times arm $a$ has been chosen.
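A small, self-contained Python sketch of this selection rule inside a bandit loop; the run_bandit helper and its Gaussian rewards are illustrative assumptions, not part of the course framework.
import math
import random

def ucb_select(Q, N, t):
    # Pull every arm once before applying the UCB formula.
    for a in Q:
        if N[a] == 0:
            return a
    return max(Q, key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))

def run_bandit(arm_means, rounds=1000):
    # arm_means: true mean reward of each arm (unknown to the agent).
    Q = {a: 0.0 for a in arm_means}
    N = {a: 0 for a in arm_means}
    for t in range(1, rounds + 1):
        a = ucb_select(Q, N, t)
        reward = random.gauss(arm_means[a], 1.0)  # noisy reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]  # incremental mean update
    return Q, N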
Q-Learning
Input: MDP M = ⟨S, s0, A, P_a(s′ | s), r(s, a, s′)⟩
Output: Q-function Q
Initialise Q arbitrarily; e.g., Q(s, a) = 0 for all s and a
repeat
    s ← the first state in episode e
    repeat (for each step in episode e)
        Select action a to apply in s; e.g. using Q and a multi-armed bandit algorithm such as ε-greedy
        Execute action a in state s
        Observe reward r and new state s′
        Q(s, a) ← Q(s, a) + α · [r + γ · max_{a′} Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is the last state of episode e (a terminal state)
until Q converges
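A compact Python sketch of tabular Q-learning following this pseudocode, assuming a generic environment with reset(), step(action) returning (next_state, reward, done), and actions(state) — an illustrative interface, not course code.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions(state))
        return max(env.actions(state), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap with the best action in the next state.
            target = reward if done else reward + gamma * max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q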
SARSA
Input: MDP M = ⟨S, s0, A, P_a(s′ | s), r(s, a, s′)⟩
Output: Q-function Q
Initialise Q arbitrarily; e.g., Q(s, a) = 0 for all s and a
repeat
    s ← the first state in episode e
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    repeat (for each step in episode e)
        Take action a, observe r, s′
        Choose a′ from s′ using a policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α · [r + γ · Q(s′, a′) − Q(s, a)]
        s ← s′, a ← a′
    until s is the last state of episode e (a terminal state)
until Q converges
- The bootstrap term γ · max_{a′} Q(s′, a′) (or γ · Q(s′, a′) for SARSA) can also be written as γ · V(s′), i.e., the value of the next state.
off-policy vs on-policy
- Q-learning is off-policy: it updates towards the best action in the next state, regardless of which action the current (exploring) policy actually takes, so its Q-values are more optimistic about the current policy.
- SARSA is on-policy: it updates using the action actually chosen by the current policy, so its Q-values are more conservative.
n-step reinforcement learning
Keep a record of the last n rewards and update the Q-value only after n steps, using the accumulated discounted rewards.
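For reference, a sketch of the n-step target used in such an update (shown here with a Q-learning-style bootstrap; SARSA would bootstrap with the action actually chosen):
$$ G_t^{(n)} = r_t + \gamma\, r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a') $$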
MCTS
- Selection: descend the tree by selecting child nodes until an unexpanded node is found
- Expansion: expand the unexpanded node
- Simulation: simulate a random playout from it until the end
- Backpropagation: update the values of all nodes visited on the path
offline vs online
- offline: complete all simulations first, then choose the best action
- online: choose the best action after simulating, then keep simulating from the new node; at the next selection the previous simulation results are reused. MCTS is online.
Update with the running average: new Q = old Q + learning rate × error; with a learning rate of 1/N this is exactly the incremental mean.
UCT
MCTS in which the selection step uses UCB.
The exploration constant C_p is chosen by you, depending on whether you prefer more exploration or more exploitation.
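One common form of the UCT selection rule (a sketch; the constants vary between presentations, and C_p = 1/√2 recovers plain UCB1):
$$ \text{argmax}_{s' \in \text{children}(s)} \left( Q(s') + 2\, C_p \sqrt{\frac{2 \ln N(s)}{N(s')}} \right) $$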
Linear Q-function Approximation
Number of features: with one indicator feature per state-action pair, # features = # states × # actions (the tabular case); in practice far fewer, hand-crafted features are used.
The Q-function is approximated as a weighted sum of features: $Q(s, a) = \sum_{i} w_i\, f_i(s, a)$.
Update: $w_i \leftarrow w_i + \alpha \cdot \delta \cdot f_i(s, a)$
where
$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ if Q-learning
$\delta = r + \gamma\, Q(s', a') - Q(s, a)$ if SARSA
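A short Python sketch of this update for the Q-learning case, assuming a feature function features(state, action) that returns a list of floats — an illustrative interface, not course code.
def linear_q_update(weights, features, state, action, reward,
                    next_state, next_actions, alpha=0.01, gamma=0.9):
    # Q(s, a) is the dot product of the weight vector and the feature vector.
    def q(s, a):
        return sum(w * f for w, f in zip(weights, features(s, a)))

    # Q-learning target uses the best action in the next state
    # (for SARSA, use q(next_state, chosen_next_action) instead).
    best_next = max(q(next_state, a) for a in next_actions) if next_actions else 0.0
    delta = reward + gamma * best_next - q(state, action)
    feats = features(state, action)
    return [w + alpha * delta * f for w, f in zip(weights, feats)]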
Shaped Reward
Potential-based reward shaping adds $F(s, s') = \gamma\, \Phi(s') - \Phi(s)$ to the reward, which speeds up learning without changing the optimal policy.
For example, in GridWorld, $\Phi(s)$ could be based on the (negative, normalised) distance from s to the goal, so that moving closer to the goal yields a small extra reward.
Policy Iteration
Alternates policy evaluation and policy improvement: evaluation uses a modified Bellman equation in which the maximum over all actions is replaced by the action chosen by the current policy; improvement then makes the policy greedy with respect to the evaluated values.
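A compact Python sketch of policy iteration, assuming the same states/actions/P/r/gamma dictionary conventions as the value_iteration function above (a sketch, not a reference implementation).
def policy_iteration(states, actions, P, r, gamma, theta=1e-6):
    policy = {s: next(iter(actions[s])) for s in states}  # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: Bellman equation with the policy's action (no max).
        while True:
            delta = 0
            for s in states:
                a = policy[s]
                v = sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                        for s_prime in states)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V.
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: sum(
                P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                for s_prime in states))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V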
What is Game Theory?
Game theory is the study of mathematical models of strategic interaction among rational decision-makers. It has applications in all fields of social science, as well as in logic, systems science and computer science.
Nash Equilibrium
Every agent has chosen a best response to the other agents' strategies, so no agent can benefit by unilaterally changing its own strategy.
- A weakly (strictly) dominant strategy: a strategy that is always at least as good as (strictly better than) every other strategy, regardless of what the opponents do.
- Indifference: by adjusting the probabilities in my mixed strategy, I change my opponent's expected payoff so that, whatever the opponent chooses, their expected utility is the same. For example, in matching pennies, playing heads with probability 0.5 makes the opponent indifferent between heads and tails.
Utility Function
u_i(a): the payoff agent i obtains from the (joint) action a.
Normal Form Game
A single round of simultaneous play: you do not know the opponent's strategy, only the opponent's utility function.
Extensive Form Game
Players take turns to move, so each player knows the opponent's earlier moves.
Subgame Perfect Equilibrium
A strategy profile that is a Nash equilibrium in every subgame: the current player plays optimally at their turn, the opponent plays optimally at theirs, and so on.
Backward Induction
Input: an extensive form game G
Output: the subgame-perfect equilibrium payoff for each state s
function BackwardInduction(s):
    if s is a terminal state, return reward(s)
    best_child ← (−∞, ..., −∞)
    for each child state s′ of s:
        child_reward ← BackwardInduction(s′)
        if child_reward(player(s)) > best_child(player(s)), then best_child ← child_reward
    return best_child
return BackwardInduction(s0)
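A minimal recursive Python sketch of this procedure, assuming a game object with is_terminal(s), reward(s) returning a dict of payoffs per player, player(s), and children(s) — illustrative accessors, not course code.
def backward_induction(game, state):
    # Returns the payoff profile (player -> payoff) reached under
    # subgame-perfect play starting from `state`.
    if game.is_terminal(state):
        return game.reward(state)
    mover = game.player(state)
    best = None
    for child in game.children(state):
        child_reward = backward_induction(game, child)
        if best is None or child_reward[mover] > best[mover]:
            best = child_reward
    return best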
Multi-agent Q-learning
Input: a stochastic game
Output: Q-function Q_i, where i is the self agent
Initialise Q_i arbitrarily; e.g., Q_i(s, a) = 0 for all states s and joint actions a
repeat
    s ← the first state in episode e
    repeat (for each step in episode e)
        Select an action a_i to apply in s,
        e.g. using Q_i and a multi-armed bandit algorithm such as ε-greedy
        Execute action a_i in state s
        Observe reward r_i, the joint action a, and the new state s′
        Q_i(s, a) ← Q_i(s, a) + α · [r_i + γ · max_{a′} Q_i(s′, a′) − Q_i(s, a)]
        s ← s′
    until the end of episode e (a terminal state)
until Q_i converges