COMP90054 AI Planning for Autonomy
Are there any resources that help with this course?
What is this course about?
This course is an introduction to AI planning for autonomy. It covers AI planning, including search algorithms, Markov decision processes, relaxation, and reinforcement learning, and concludes with an introduction to game theory.
What is AI Planning?
AI planning is the process of determining a sequence of actions that achieves a given goal. It is a subfield of artificial intelligence concerned with automatically synthesising such action sequences from a model of the world. AI planning is used in a wide range of applications, including robotics, logistics, and scheduling.
Are there any applications of AI Planning?
AI planning has a wide range of applications, including:
- Robotics: planning the sequence of actions a robot must execute to achieve a goal.
- Game playing: planning the moves of computer-controlled players.
- Autonomous vehicles: planning routes and manoeuvres for self-driving vehicles.
What is BFS?
Breadth-first search uses a queue: first in, first out.
It is optimal when all action costs are uniform.
def breadthFirstSearch(problem):
    candidate_queue = util.Queue()  # FIFO frontier
    init_state = problem.getStartState()
    init_path = []
    candidate_queue.push((init_state, init_path))
    viewed = []
    while not candidate_queue.isEmpty():
        state, path = candidate_queue.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_queue.push((child_state, child_path))
    return None
What is DFS?
Depth-first search uses a stack: last in, first out.
The search space may be infinite (infinitely deep), so DFS is not guaranteed to terminate and is not optimal.
def depthFirstSearch(problem):
    candidate_stack = util.Stack()  # LIFO frontier
    init_state = problem.getStartState()
    init_path = []
    candidate_stack.push((init_state, init_path))
    viewed = []
    while not candidate_stack.isEmpty():
        state, path = candidate_stack.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_stack.push((child_state, child_path))
    return None
What is Uniform Cost Search?
Uniform cost search uses a priority queue: the node with the lowest path cost from the start (i.e. closest to the start) is expanded first.
Breadth-first search is a special case of uniform cost search in which every action has cost 1.
UCS considers only the path cost g(n); it does not use a heuristic.
def uniformCostSearch(problem):
    candidate_pq = util.PriorityQueue()  # ordered by path cost g(n)
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, child_cost in problem.getSuccessors(state):
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), problem.getCostOfActions(child_path))
    return None
What is Greedy Best First Search?
Greedy best-first search considers only the heuristic h(n); it ignores the path cost.
If h = 0, its behaviour depends on how ties are broken by the priority queue implementation.
def bestFirstSearch(problem, heuristic=nullHeuristic):
    candidate_pq = util.PriorityQueue()  # ordered by heuristic value h(n) only
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), heuristic(init_state, problem))
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost: greedy uses h only
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), heuristic(child_state, problem))
    return None
What is A* Search?
If h=0, A* is equivalent to Uniform Cost Search.
If h is admissible, A* is optimal: since h never overestimates the remaining cost, every state on an optimal path is expanded with its optimal g value before a goal can be returned via a more expensive path.
def aStarSearch(problem, heuristic=nullHeuristic):
    """Search the node that has the lowest combined cost and heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (state not in best_g) or (cost < best_g[state]):  # re-open the state if a cheaper path was found
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, heuristic(succState, problem) + new_cost)
    return None  # Goal not found
Why is there a need for Weighted A* Search?
Weighting the heuristic, f(n) = g(n) + w * h(n), trades off search speed against solution quality.
The larger w is, the faster the search tends to find a solution, but it may miss the optimal one; the smaller w is, the slower the search, but the more likely it is to find the optimal solution.
When w = 0, it is equivalent to uniform cost search.
As w → ∞, it behaves like greedy best-first search.
def weightedAStarSearch(problem, heuristic=nullHeuristic, weight=1):
    """Search the node that has the lowest combined cost and heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, weight * heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (state not in best_g) or (cost < best_g[state]):
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, weight * heuristic(succState, problem) + new_cost)
    return None  # Goal not found
What is Hill Climbing?
Hill climbing may get stuck in a local optimum.
σ := make-root-node(init())
forever:
    if is-goal(state(σ)):
        return extract-solution(σ)
    Σ′ := {make-node(σ, a, s′) | (a, s′) ∈ succ(state(σ))}
    σ := an element of Σ′ minimising h /* (break ties randomly) */
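A minimal Python sketch of this loop, assuming the same problem interface as the search code above (getStartState, isGoalState, getSuccessors) and a heuristic(state, problem) function; it can loop forever on plateaus and stop in local optima.

import random

def hillClimbingSearch(problem, heuristic):
    # Always move to a best (lowest-h) successor, breaking ties randomly.
    state, path = problem.getStartState(), []
    while not problem.isGoalState(state):
        successors = problem.getSuccessors(state)
        if not successors:
            return None  # dead end
        best_h = min(heuristic(s, problem) for s, _, _ in successors)
        best = [(s, a) for s, a, _ in successors if heuristic(s, problem) == best_h]
        state, action = random.choice(best)
        path = path + [action]
    return path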
What are the properties of a heuristic function?
A heuristic h(s) estimates the remaining cost to reach the goal from s; h*(s) denotes the optimal (true) remaining cost.
Let Π be a planning task with state space Θ, and let h be a heuristic for Π. The heuristic is called:
- Safe: if h*(s) = ∞ for all states s with h(s) = ∞.
- Goal-aware: if h(s) = 0 for all goal states s.
- Admissible: if h(s) ≤ h*(s) for all states s.
- Consistent: if h(s) ≤ h(s′) + c(a) for all transitions s →a s′.
Proposition: let Π be a planning task with state space Θ, and let h be a heuristic for Π.
- If h is consistent and goal-aware, then h is admissible.
- If h is admissible, then h is goal-aware.
- If h is admissible, then h is safe.
- No other implications of this form hold.
- If h is not admissible: a state on the optimal path whose cost is overestimated may be expanded after other states, so the optimal solution can be missed.
- Consistency: guarantees that states along an optimal path are expanded in order, each with its optimal g value (f never decreases along a path).
What is STRIPS?
A STRIPS planning task is given by:
- facts (atoms, propositional variables),
- operators (actions),
- an initial state I, and
- a goal G.
Each operator is represented by:
- a precondition list,
- an add list, and
- a delete list.
What is Relaxation?
Goal: a relaxation helps compute a heuristic function by mapping the problem to a simpler one that is easier to solve.
Example:
- Problem P: route finding.
- Simpler problem P′: route finding for birds.
- Perfect heuristic h′* for P′: straight-line distance.
- Transformation r: pretend you are a bird.
A relaxation is:
- native, if P′ ⊆ P and h′* = h*;
- efficiently constructible, if there is a polynomial-time algorithm that, given a task Π ∈ P, computes r(Π);
- efficiently computable, if there is a polynomial-time algorithm that, given a task Π′ ∈ P′, computes h′*(Π′).
Reminder: you have a problem Π whose states you want to evaluate with a heuristic.
Notation in this course:
- Π_s: the task Π with its initial state replaced by s, i.e., I is changed to s.
- c: clause, preconditions + effects.
Delete Relaxation
Ignore all delete effects during the search; this reduces duplicate states encountered before the goal is found.
- State dominance: if one state dominates another (contains all of its atoms), the dominated state can be ignored.
def greedyRelaxedPlan(problem):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), 0)
    return None
Neither admissible nor consistent: a greedy relaxed plan is not guaranteed to be optimal, only to be some solution of the relaxed task.
Computing the optimal delete-relaxation heuristic h+ is NP-hard.
Additive and Max Heuristics
- Additive (h_add): sum the estimates of the subgoals; clearly not admissible, since shared actions are counted multiple times.
- Max (h_max): take the maximum over the subgoals, i.e., the cost of the hardest subgoal.
Both are goal-aware, and h = ∞ whenever h+ = ∞.
Best Supporter Heuristic
- Combines the ideas of h_max and h_add: for each subgoal, commit to its best supporter (the best action that achieves it).
- argmin: among the actions that achieve a fact, pick the one with the smallest h value.
Bellman-Ford for hmax and hadd
Bellman-Ford variant computing hadd for state s.
Repeatedly update the table Tadd until its values no longer change. In each iteration, a new value fi(g) is computed for every fact g: an estimate of the cost of achieving g from the state s.
def bellmanFordHadd(problem, state):
    # Fixpoint (Bellman-Ford style) computation of the h_add table for `state`.
    # Assumes a STRIPS-like interface: problem.getFacts(), problem.getActions(),
    # and each action has .preconditions, .add_effects and .cost.
    Tadd = {p: (0 if p in state else float('inf')) for p in problem.getFacts()}
    changed = True
    while changed:  # iterate until the table no longer changes
        changed = False
        for action in problem.getActions():
            # additive assumption: the cost of the preconditions is the sum of their costs
            pre_cost = sum(Tadd[p] for p in action.preconditions)
            for p in action.add_effects:
                new_value = action.cost + pre_cost
                if new_value < Tadd[p]:
                    Tadd[p] = new_value
                    changed = True
    # h_add(state) = sum of Tadd over the goal facts; use max instead of sum for h_max
    return Tadd
Iterated Width
Iterated Width runs IW(1), IW(2), ..., increasing k up to the number of state variables, until a goal state is found; IW(k) only keeps states whose novelty is at most k.
The novelty of a state s is the size of the smallest subset of atoms in s that is true for the first time in the search.
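A minimal sketch of the pruning test used by IW(1), assuming each state is represented as an iterable of ground atoms; the closure keeps the set of atoms seen so far and a state is novel if it makes some atom true for the first time.

def makeNoveltyOneCheck():
    seen_atoms = set()

    def is_novel(state):
        new_atoms = set(state) - seen_atoms
        seen_atoms.update(new_atoms)
        return len(new_atoms) > 0  # novelty 1: some atom is true for the first time

    return is_novel

# usage: prune any generated state for which is_novel(state) is False
is_novel = makeNoveltyOneCheck()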
What is Markov Decision Process?
The Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is used in a wide range of applications, including robotics, economics, and game theory.
MDP
An MDP (in the goal-based, stochastic shortest path formulation) is given by:
- a state space S,
- an initial state s0 ∈ S,
- a set of goal states G ⊆ S,
- actions A(s) ⊆ A applicable in each state s,
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s), and
- action costs c(a, s) > 0,
where:
- a solution is a function (policy) mapping states to actions, and
- an optimal solution minimises the expected cost to reach the goal.
Partially Observable MDPs (POMDPs)
A POMDP is given by:
- states s ∈ S,
- actions A(s) applicable in each state s,
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s),
- an initial belief state b0,
- a set of final belief states bF, and
- a sensor model given by probabilities P_a(o | s) for observations o ∈ Obs,
where:
- belief states are probability distributions over S,
- a solution is a policy mapping belief states to actions, and
- an optimal policy minimises the expected cost from b0 to bF.
see also: https://gibberblot.github.io/rl-notes/single-agent/MDPs.html
Value Iteration
A dynamic-programming algorithm that computes the optimal value function of an MDP, from which an optimal policy can be extracted.
def value_iteration(states, actions, P, r, gamma, theta):
    V = {s: 0 for s in states}  # Initialize value function
    while True:
        delta = 0
        for s in states:
            v = V[s]
            V[s] = max(sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                           for s_prime in states)
                       for a in actions[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # Check for convergence
            break
    return V
Bellman Optimality Equation
$V(s) = \max_{a \in A(s)} \sum_{s'} P_a(s' \mid s)\,[r(s, a, s') + \gamma V(s')]$
- $P_a(s' \mid s)$: the probability of each possible next state
- $r(s, a, s')$: the reward of the action
- $\gamma V(s')$: the value of the next state times the discount factor, taken from the previous iteration
Q-Value
For each state $s$ and action $a$:
$Q(s, a) = \sum_{s'} P_a(s' \mid s)\,[r(s, a, s') + \gamma V(s')]$, where $V(s') = \max_{a'} Q(s', a')$.
How fast the discount factor shrinks future rewards ($\gamma, \gamma^2, \gamma^3, \gamma^4, \dots$):
- $\gamma = 0.95$: 0.95, 0.9025, 0.857375, 0.81450625...
- $\gamma = 0.9$: 0.9, 0.81, 0.729, 0.6561...
- $\gamma = 0.8$: 0.8, 0.64, 0.512, 0.4096...
- $\gamma = 0.7$: 0.7, 0.49, 0.343, 0.2401...
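A minimal sketch of computing a Q-value from a value function, assuming the same states/P/r/gamma structures as the value_iteration code above.

def q_value(V, s, a, states, P, r, gamma):
    # Q(s, a): expected immediate reward plus discounted value of the next state.
    return sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
               for s_prime in states)

# The greedy policy picks, in each state, the action with the highest Q-value:
# policy[s] = max(actions[s], key=lambda a: q_value(V, s, a, states, P, r, gamma))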
Some concepts related to Q-values
Value: in reinforcement learning, the value of a state is the expected return obtained by starting from that state and following a given policy.
Q-value: the value of a state-action pair, i.e., the expected return obtained by executing that action in that state and then following a given policy.
Q-value table: a data structure that stores the Q-value of every state-action pair; in tabular RL algorithms (such as Q-learning) it is the main data structure.
Q-value function: a function that takes a state and an action and returns the Q-value of that pair; in function-approximation methods (such as deep Q-networks) it is usually represented by a neural network.
Q-table: the same concept as a Q-value table, just a different name.
Policy
A policy π maps each state to an action; the aim is a policy that maximises the expected discounted return.
Multi-Armed Bandit
Try multiple actions in parallel, balancing exploitation and exploration.
minimising regret
Regret is the loss incurred by not having selected the best action.
ε-greedy strategy:
Input: multi-armed bandit problem
Output: Q-function Q
Initialise Q(a) := 0 for all actions a
Initialise N(a) := 0 for all actions a
while budget remains:
    with probability ε choose a random action a (explore); otherwise choose a := argmax_a Q(a) (exploit)
    execute a and observe reward r
    N(a) := N(a) + 1; Q(a) := Q(a) + (1 / N(a)) * (r − Q(a))
end while
Decay ε over time, so that exploration decreases as the estimates improve.
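A minimal ε-greedy bandit sketch, assuming a pull(arm) function that returns a stochastic reward; the fixed ε and the incremental-average update follow the pseudocode above.

import random

def epsilon_greedy_bandit(pull, arms, steps=1000, epsilon=0.1):
    Q = {a: 0.0 for a in arms}  # estimated value of each arm
    N = {a: 0 for a in arms}    # number of times each arm was pulled
    for t in range(steps):
        if random.random() < epsilon:
            a = random.choice(arms)            # explore
        else:
            a = max(arms, key=lambda x: Q[x])  # exploit
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental average of observed rewards
    return Q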
UCB
$\text{argmax}_{a}\left(Q(a) + \sqrt{\frac{2 \ln t}{N(a)}}\right)$
Q-Learning
Input: MDP M, learning rate α, discount factor γ
Output: Q-function Q
Initialise Q(s, a) arbitrarily (e.g. to 0) for all s, a
repeat (for each step of each episode):
    in state s, choose action a using a policy derived from Q (e.g. ε-greedy)
    execute a, observe reward r and next state s′
    Q(s, a) := Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
    s := s′
until converged (or the time budget is exhausted)
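A minimal tabular Q-learning sketch. The environment object with reset() and step(state, action) -> (next_state, reward, done) is a hypothetical interface used only for illustration, not the course's code.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], default 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(state, action)  # hypothetical environment interface
            # off-policy target: best action in the next state (no bootstrap on terminal states)
            target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q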
SARSA
Input: MDP M, learning rate α, discount factor γ
Output: Q-function Q
Initialise Q(s, a) arbitrarily (e.g. to 0) for all s, a
repeat (for each step of each episode):
    in state s, choose action a using a policy derived from Q (e.g. ε-greedy)
    execute a, observe reward r and next state s′; choose the next action a′ in s′ using the same policy
    Q(s, a) := Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
    s := s′; a := a′
until converged (or the time budget is exhausted)
The target can also be written using V(s′) = Q(s′, a′), i.e., the value of the next state under the current policy.
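A minimal sketch of the SARSA update on its own, to highlight the only line that differs from Q-learning: the target uses the action actually taken in the next state.

def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.9):
    # On-policy target: the value of the action actually chosen in next_state.
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])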
off-policy vs on-policy
- Q-Learning is off-policy: its target uses max_{a′} Q(s′, a′), the value of the greedy policy rather than the policy actually being followed, so its estimates are more optimistic about the current policy.
- SARSA is on-policy: its target uses Q(s′, a′) for the action actually taken under the current (e.g. ε-greedy) policy, so its estimates are more conservative.
n-step reinforcement learning
Keep a record of the last n rewards and update the Q-value only after n steps, using the discounted n-step return as the target.
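A minimal sketch of computing the discounted n-step return used as that target, assuming a list of the last n rewards and a bootstrap value for the state reached after n steps.

def n_step_return(rewards, bootstrap_value, gamma=0.9):
    # rewards: [r_1, ..., r_n] collected over the last n steps
    # bootstrap_value: estimated value of the state reached after n steps
    G = 0.0
    for i, r in enumerate(rewards):
        G += (gamma ** i) * r
    return G + (gamma ** len(rewards)) * bootstrap_value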
MCTS
- Selection: descend the tree by repeatedly selecting child nodes until a node that is not fully expanded is reached.
- Expansion: expand one unexpanded child of that node.
- Simulation: run a random playout from the new node until a terminal state (or horizon) is reached.
- Backpropagation: propagate the result back up, updating the values of all visited nodes.
offline vs online
- offline: complete all simulations first, then choose the best action.
- online: after each round of simulations, choose the best action, act, and continue simulating from the new node, reusing the earlier simulation results at the next selection; MCTS is online.
Update with a running average: new Q = old Q + α * (return − old Q); with α = 1/N this is exactly the incremental mean.
UCT
UCT uses UCB1 for the selection step:
$\text{argmax}_{a}\left(Q(s, a) + 2 C_p \sqrt{\frac{2 \ln N(s)}{N(s, a)}}\right)$
where $N(s)$ is the number of visits to node $s$, $N(s, a)$ is the number of times action $a$ was selected at $s$, and $C_p$ is the exploration constant.
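A minimal sketch of the UCT selection rule, assuming dictionaries Q, N_state and N_action hold the statistics accumulated by backpropagation; unvisited actions are tried first.

import math

def uct_select(state, actions, Q, N_state, N_action, Cp=1.0 / math.sqrt(2)):
    # Try any action that has not been sampled yet.
    for a in actions:
        if N_action.get((state, a), 0) == 0:
            return a
    # Otherwise pick the action maximising the UCB1 value.
    def ucb(a):
        return Q[(state, a)] + 2 * Cp * math.sqrt(
            2 * math.log(N_state[state]) / N_action[(state, a)])
    return max(actions, key=ucb)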
Linear Q-function Approximation
A Q-table is the special case with one indicator feature per state-action pair, so # features = # states * # actions; in practice far fewer hand-crafted features are used.
$Q(s, a) = \sum_{i} f_i(s, a)\, w_i$
Update: $w_i \leftarrow w_i + \alpha\,[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]\, f_i(s, a)$
where $f_i(s, a)$ is the value of feature $i$ for the pair $(s, a)$ and $w_i$ is its weight.
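A minimal sketch of the linear approximation with a Q-learning target, assuming a features list holding the values f_i(s, a) of the executed state-action pair.

def linear_q_value(weights, features):
    # Q(s, a) = sum_i f_i(s, a) * w_i
    return sum(w * f for w, f in zip(weights, features))

def linear_q_update(weights, features, reward, max_next_q, alpha=0.01, gamma=0.9):
    # features: f_i(s, a) for the state-action pair that was just executed
    delta = reward + gamma * max_next_q - linear_q_value(weights, features)
    return [w + alpha * delta * f for w, f in zip(weights, features)]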
Shaped Reward
Potential-based Reward Shaping
The agent learns from the shaped reward $r + F(s, s')$, where $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ for a potential function $\Phi$; potential-based shaping leaves the optimal policy unchanged.
For example, in Gridworld, $\Phi(s)$ can measure how close the agent is to the goal (e.g. based on the Manhattan distance), so that moves towards the goal receive a small bonus.
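A minimal sketch of a potential-based shaping term for a grid, assuming states are (x, y) tuples and a known goal position; the normalisation by width + height is an illustrative choice.

def potential(state, goal, width, height):
    # Higher potential the closer we are to the goal (normalised Manhattan distance).
    x, y = state
    gx, gy = goal
    return 1.0 - (abs(gx - x) + abs(gy - y)) / (width + height)

def shaped_reward(reward, state, next_state, goal, width, height, gamma=0.9):
    # F(s, s') = gamma * phi(s') - phi(s), added to the environment reward.
    F = gamma * potential(next_state, goal, width, height) - potential(state, goal, width, height)
    return reward + F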
Policy Iteration
Policy iteration modifies the Bellman update: instead of maximising over all actions, policy evaluation plugs in the action chosen by the current policy; the policy is then greedily improved, and the two steps repeat until the policy is stable.
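A minimal policy-iteration sketch, assuming the same states/actions/P/r/gamma/theta structures as the value_iteration code above.

def policy_iteration(states, actions, P, r, gamma, theta):
    policy = {s: actions[s][0] for s in states}  # arbitrary initial policy
    V = {s: 0 for s in states}
    while True:
        # Policy evaluation: Bellman update with the policy's action instead of a max.
        while True:
            delta = 0
            for s in states:
                a = policy[s]
                v = sum(P[s][a][s2] * (r[s][a][s2] + gamma * V[s2]) for s2 in states)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: sum(
                P[s][a][s2] * (r[s][a][s2] + gamma * V[s2]) for s2 in states))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V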
What is Game Theory?
Game theory is the study of mathematical models of strategic interaction among rational decision-makers. It has applications in all fields of social science, as well as in logic, systems science and computer science.
Nash Equilibrium
In a Nash equilibrium every agent plays a best response to the others' strategies, so no agent can benefit by unilaterally changing its strategy.
- A weakly (strictly) dominant strategy: at least as good as (strictly better than) every other strategy, regardless of what the opponents do.
- Indifference: by adjusting the probabilities in my mixed strategy, I change the opponent's expected payoff so that, whatever the opponent chooses, their expected utility is the same.
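A minimal sketch of solving the indifference condition for a 2x2 game: choose p, the probability of my first action, so that the opponent's two actions give the same expected payoff. The payoff matrix used in the example is the standard matching pennies game, not course material.

def indifference_probability(opp_payoffs):
    # opp_payoffs[i][j]: opponent's payoff when I play row i and they play column j.
    # Solve p*a00 + (1-p)*a10 = p*a01 + (1-p)*a11 for p.
    (a00, a01), (a10, a11) = opp_payoffs
    denom = (a00 - a01) - (a10 - a11)
    if denom == 0:
        return None  # the opponent is already indifferent (no unique mixed solution)
    return (a11 - a10) / denom

# Example (matching pennies, column player's payoffs): p = 0.5
print(indifference_probability([(-1, 1), (1, -1)]))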
Utility Function
Normal Form Game
A single simultaneous round: you do not know the opponent's chosen strategy, only their utility function.
Extensive Form Game
Players move in turns, so later players observe the earlier moves before deciding.
Subgame Perfect Equilibrium
A strategy profile that is a Nash equilibrium in every subgame: each player's strategy is optimal at every decision node where they move, not just on the equilibrium path.
Backward Induction
Input: extensive form game G
Output: the equilibrium (best-response) utility of every state
function BackwardInduction(s):
    if s is a terminal state: return u(s)
    best := null
    for each action a applicable in s:
        v := BackwardInduction(result(s, a))
        if best is null or v[player(s)] > best[player(s)]: best := v
    return best
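A minimal Python sketch of backward induction, assuming a hypothetical game object with is_terminal(s), utility(s) (a dict mapping players to payoffs), player(s), actions(s) and result(s, a); memoisation can be added for games with shared subtrees.

def backward_induction(game, state):
    # Returns the utility profile reached when every player best-responds.
    if game.is_terminal(state):
        return game.utility(state)  # dict: player -> payoff
    mover = game.player(state)
    best = None
    for action in game.actions(state):
        value = backward_induction(game, game.result(state, action))
        if best is None or value[mover] > best[mover]:
            best = value  # the mover picks the subtree that is best for them
    return best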
Multi-agent Q-learning
Input: stochastic game G, learning rate α, discount factor γ
Output: a Q-function Q_i for each agent i
Initialise Q_i(s, a) arbitrarily (e.g. to 0)
repeat:
    each agent i selects its action a_i (e.g. ε-greedy with respect to Q_i)
    execute the joint action (a_1, ..., a_n); observe the rewards r_i and the next state s′
    each agent updates Q_i(s, a_i) := Q_i(s, a_i) + α [r_i + γ max_{a′} Q_i(s′, a′) − Q_i(s, a_i)]
    s := s′
until converged (or the time budget is exhausted)