COMP90054 AI Planning for Autonomy
Are there any resources that help with this course?
What is this course about?
This course is an introduction to AI planning for autonomy. It covers search algorithms, classical planning and relaxation heuristics, Markov decision processes, and reinforcement learning, and concludes with an introduction to game theory.
What is AI Planning?
AI planning is the subfield of artificial intelligence concerned with finding a sequence of actions that achieves a goal. It is used in a wide range of applications, including robotics, logistics, and scheduling.
Are there any applications of AI Planning?
AI planning has a wide range of applications, including:
- Robotics: planning the actions a robot must take to achieve a goal.
- Game playing: planning the moves of players in games.
- Autonomous vehicles: planning the manoeuvres of self-driving vehicles.
What is BFS?
A queue: first in, first out (FIFO).
Optimal when action costs are uniform.
def breadthFirstSearch(problem):
    candidate_queue = util.Queue()
    init_state = problem.getStartState()
    init_path = []
    candidate_queue.push((init_state, init_path))
    viewed = []
    while not candidate_queue.isEmpty():
        state, path = candidate_queue.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_queue.push((child_state, child_path))
    return None
What is DFS?
A stack: last in, first out (LIFO).
The search space may be infinite (infinitely deep), so DFS is not guaranteed to terminate and is not optimal.
def depthFirstSearch(problem):
    candidate_stack = util.Stack()
    init_state = problem.getStartState()
    init_path = []
    candidate_stack.push((init_state, init_path))
    viewed = []
    while not candidate_stack.isEmpty():
        state, path = candidate_stack.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_stack.push((child_state, child_path))
    return None
What is Uniform Cost Search?
A priority queue: the node closest to the start (i.e., with the lowest path cost) is explored first.
Breadth-first search is a special case of uniform cost search in which every action costs 1.
UCS considers only the path cost and ignores any heuristic.
def uniformCostSearch(problem):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, child_cost in problem.getSuccessors(state):
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), problem.getCostOfActions(child_path))
    return None
What is Greedy Best First Search?
Greedy best-first search considers only the heuristic and ignores the path cost.
If h = 0, its behaviour depends on how the priority queue breaks ties.
def bestFirstSearch(problem, heuristic=nullHeuristic):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), heuristic(init_state, problem))
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost; greedy best-first search uses only the heuristic
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), heuristic(child_state, problem))
    return None
What is A* Search?
If h=0, A* is equivalent to Uniform Cost Search.
If h is admissible, A* is optimal: since h never overestimates the true remaining cost, the states on an optimal path are expanded before a more expensive path can reach the goal.
def aStarSearch(problem, heuristic=nullHeuristic):
    """Search the node that has the lowest combined cost and heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (not state in best_g) or (cost < best_g[state]):  # re-open the state if this path is cheaper
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, heuristic(succState, problem) + new_cost)
    return None  # Goal not found
Why is there a need for Weighted A* Search?
Weighting the heuristic trades off search speed against solution quality.
The larger w is, the faster the search, but the more likely it is to miss the optimal solution; the smaller w is, the slower the search, but the more likely it is to find the optimal solution.
When w = 0, it is equivalent to uniform cost search.
As w → ∞, it behaves like greedy best-first search.
def weightedAStarSearch(problem, heuristic=nullHeuristic, weight=1):
    """Search the node that has the lowest combined cost and weighted heuristic first."""
    myPQ = util.PriorityQueue()
    startState = problem.getStartState()
    startNode = (startState, 0, [])
    myPQ.push(startNode, weight * heuristic(startState, problem))
    best_g = dict()
    while not myPQ.isEmpty():
        node = myPQ.pop()
        state, cost, path = node
        if (not state in best_g) or (cost < best_g[state]):
            best_g[state] = cost
            if problem.isGoalState(state):
                return path
            for succ in problem.getSuccessors(state):
                succState, succAction, succCost = succ
                new_cost = cost + succCost
                newNode = (succState, new_cost, path + [succAction])
                myPQ.push(newNode, weight * heuristic(succState, problem) + new_cost)
    return None  # Goal not found
What is Hill Climbing?
Hill climbing can get stuck in a local optimum.
σ := make-root-node(init())
forever:
    if is-goal(state(σ)):
        return extract-solution(σ)
    Σ′ := {make-node(σ, a, s′) | (a, s′) ∈ succ(state(σ))}
    σ := an element of Σ′ minimising h  /* (break ties randomly) */
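A minimal Python sketch of this procedure, assuming the same problem and heuristic interface used by the search functions above; note that it may loop forever on plateaus or get trapped in a local minimum.
import random

def hillClimbing(problem, heuristic=nullHeuristic):
    # Greedily move to the successor with the lowest heuristic value.
    state = problem.getStartState()
    path = []
    while not problem.isGoalState(state):
        successors = problem.getSuccessors(state)
        if not successors:
            return None  # dead end
        best_h = min(heuristic(s, problem) for s, _, _ in successors)
        best = [(s, a) for s, a, _ in successors if heuristic(s, problem) == best_h]
        state, action = random.choice(best)  # break ties randomly
        path.append(action)
    return path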
What are the properties of a heuristic function?
The perfect heuristic h*(s) is the remaining cost of an optimal plan from s to the goal.
Let Π be a planning task with state space Θ, and let h be a heuristic for Π. The heuristic is called:
- Safe: if h(s) = ∞ implies h*(s) = ∞ for all states s.
- Goal-aware: if h(s) = 0 for all goal states s.
- Admissible: if h(s) ≤ h*(s) for all states s.
- Consistent: if h(s) ≤ h(s′) + c(a) for all transitions s →a s′.
Proposition: let Π be a planning task with state space Θ, and let h be a heuristic for Π.
- If h is consistent and goal-aware, then h is admissible.
- If h is admissible, then h is goal-aware.
- If h is admissible, then h is safe.
- No other implications of this form hold.
Inadmissible: if a node on the optimal path is overestimated, other nodes are expanded first and the optimal solution can be missed.
Consistency: guarantees that the states along an optimal path are expanded in order with their optimal costs.
What is STRIPS?
A STRIPS planning task is a tuple Π = ⟨F, O, I, G⟩ where:
- F is a set of facts (atoms, variables)
- O is a set of operators (actions)
- I ⊆ F represents the initial situation
- G ⊆ F represents the goal situation
Each operator o ∈ O is represented by:
- an add list Add(o)
- a delete list Del(o)
- a precondition list Pre(o)
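A small Python sketch of this representation and of applying an operator by progression; the class name, fields, and the blocks-world example below are illustrative, not course code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre: frozenset     # precondition list Pre(o)
    add: frozenset     # add list Add(o)
    delete: frozenset  # delete list Del(o)

def applicable(state, op):
    # An operator is applicable when all of its preconditions hold in the state.
    return op.pre <= state

def apply(state, op):
    # Progression: remove the delete list, then add the add list.
    return (state - op.delete) | op.add

# Illustrative blocks-world operator: stack block a onto block b.
stack_ab = Operator("stack(a,b)",
                    pre=frozenset({"clear(a)", "clear(b)", "ontable(a)"}),
                    add=frozenset({"on(a,b)"}),
                    delete=frozenset({"clear(b)", "ontable(a)"}))
state = frozenset({"clear(a)", "clear(b)", "ontable(a)", "ontable(b)"})
if applicable(state, stack_ab):
    state = apply(state, stack_ab)  # {"clear(a)", "on(a,b)", "ontable(b)"}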
What is Relaxation?
Goal: relaxation helps to compute heuristic functions.
Let P be a class of problems with perfect heuristic h*. A relaxation of P is a triple R = (P′, r, h′), where P′ is an arbitrary set and r : P → P′ and h′ : P′ → R⁺₀ ∪ {∞} are functions such that the relaxation heuristic h^R(Π) := h′(r(Π)) satisfies h^R(Π) ≤ h*(Π) for all Π ∈ P. The relaxation is:
- native, if P′ ⊆ P and h′ = h*;
- efficiently constructible, if there is a polynomial-time algorithm that, given Π ∈ P, computes r(Π);
- efficiently computable, if there is a polynomial-time algorithm that, given Π′ ∈ P′, computes h′(Π′).
Example:
- Problem P: path finding on a road map.
- Simpler problem P′: path finding for birds.
- Perfect heuristic h′ of P′: straight-line distance.
- Transformation r: pretend you are a bird.
Reminder: you have a problem P whose perfect heuristic h* you wish to estimate. You define a simpler problem P′ whose perfect heuristic h′ can be used to estimate h* (admissibly!). You define a transformation r from P to P′. Given Π ∈ P, you estimate h*(Π) by h′(r(Π)).
Notation in this course:
- Π_s: the task Π with its initial state replaced by s, i.e., I is changed to s.
- c: a conjunction (clause) of facts, such as preconditions and effects.
Delete Relaxation
Ignore all delete effects during the search: facts that become true stay true, which reduces the number of distinct states to explore before the goal is found.
- State dominance: if one state dominates another (contains all of its facts), the dominated state can be ignored, since it is subsumed.
def greedyRelaxedPlan(problem):
    candidate_pq = util.PriorityQueue()
    init_state = problem.getStartState()
    init_path = []
    candidate_pq.push((init_state, init_path), 0)
    viewed = []
    while not candidate_pq.isEmpty():
        state, path = candidate_pq.pop()
        if problem.isGoalState(state):
            return path
        if state not in viewed:
            viewed.append(state)
            for child_state, child_action, _ in problem.getSuccessors(state):  # ignore cost as we are blind
                child_path = path + [child_action]
                candidate_pq.push((child_state, child_path), 0)
    return None
Neither admissible nor consistent, because it does not guarantee an optimal relaxed plan, only that some relaxed plan is found.
Computing the optimal delete-relaxation heuristic h⁺ is NP-hard.
Additive and Max Heuristics
- Additive (hadd): sums the estimates of the subgoals; clearly not admissible, since shared actions may be counted several times.
- Max (hmax): takes the maximum estimate over the subgoals, i.e., the hardest subgoal; admissible but often much less informative.
- Both are goal-aware, and whenever h⁺ = ∞ they are also ∞, so they are safe.
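For reference, a sketch of the standard recursive definitions, writing h(s, g) for the estimated cost of achieving a conjunction of facts g from state s:
$$ h^{\max}(s, g) = \max_{p \in g} h(s, p), \qquad h^{\mathrm{add}}(s, g) = \sum_{p \in g} h(s, p) $$
$$ h(s, p) = \begin{cases} 0 & \text{if } p \in s \\ \min_{a:\, p \in \mathrm{add}(a)} \big( c(a) + h(s, \mathrm{pre}(a)) \big) & \text{otherwise} \end{cases} $$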
Best Supporter Heuristic
- The best supporter of a fact p is the achiever with the lowest estimated cost: $bs(p) = \text{argmin}_{a:\, p \in \mathrm{add}(a)} \big( c(a) + h(s, \mathrm{pre}(a)) \big)$.
- It combines the action cost c(a) with the estimated cost of the action's precondition h(s, pre(a)) and selects the best supporter.
- argmin: among the actions that achieve p, pick the one with the smallest h value.
Bellman-Ford for hmax and hadd
A Bellman-Ford variant computes hadd (or hmax) for a state s.
Repeatedly update the table Tadd until the values no longer change. In each iteration, for every fact (subgoal) g a new value f_i(g) is computed, which is the current estimate of the cost of achieving g from state s.
def bellmanFordHadd(problem, state):
    # Fixed-point (Bellman-Ford style) computation of hadd for state s.
    # Assumes the problem exposes getFacts() and getActions(), where each
    # action has .preconditions, .addList and .cost -- hypothetical accessors.
    INF = float('inf')
    Tadd = {p: (0 if p in state else INF) for p in problem.getFacts()}
    changed = True
    while changed:
        changed = False
        for a in problem.getActions():
            pre = [Tadd[p] for p in a.preconditions]
            if INF in pre:
                continue  # the action's precondition is not yet reachable
            pre_cost = sum(pre)  # use max(pre or [0]) instead for hmax
            for p in a.addList:
                new_value = a.cost + pre_cost
                if new_value < Tadd[p]:
                    Tadd[p] = new_value
                    changed = True
    return Tadd
Iterated Width
Novelty only considers which tuples of state-variable atoms appear for the first time during the search.
IW(i) is a breadth-first search that prunes every state whose novelty is greater than i; iterated width runs IW(1), IW(2), ... until the goal is found, up to the number of state variables (atoms).
The novelty of a state is the size of the smallest subset of atoms in s that is true for the first time in the search.
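A small Python sketch of a novelty check up to a given width, assuming states are represented as sets of atoms and seen_tuples is a set maintained over the whole search; the function names are illustrative.
from itertools import combinations

def novelty(state_atoms, seen_tuples, max_width=2):
    # Size of the smallest tuple of atoms in the state that has not been
    # seen before, or max_width + 1 if every tuple up to max_width is old.
    for k in range(1, max_width + 1):
        for combo in combinations(sorted(state_atoms), k):
            if frozenset(combo) not in seen_tuples:
                return k
    return max_width + 1

def record(state_atoms, seen_tuples, max_width=2):
    # Remember all tuples of this state so later states can be pruned.
    for k in range(1, max_width + 1):
        for combo in combinations(sorted(state_atoms), k):
            seen_tuples.add(frozenset(combo))

# In IW(i): prune a newly generated state if novelty(...) > i, otherwise record it.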
What is Markov Decision Process?
The Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is used in a wide range of applications, including robotics, economics, and game theory.
An MDP (in its goal-based, shortest-path formulation) consists of:
- a state space S
- an initial state s0 ∈ S
- a set of goal states G ⊆ S
- actions A(s) ⊆ A applicable in each state s
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s)
- action costs c(a, s) > 0
where:
- solutions are functions (policies) mapping states to actions
- optimal solutions minimise the expected cost to reach a goal
Partially Observable MDPs (POMDPs)
- states s ∈ S
- actions A(s) applicable in each state s
- transition probabilities P_a(s′ | s) for s ∈ S and a ∈ A(s)
- an initial belief state b0
- a set of final (goal) belief states bF
- a sensor model given by probabilities P_a(o | s) for observations o ∈ Obs
where:
- belief states are probability distributions over S
- solutions are policies mapping belief states into actions
- optimal policies minimise the expected cost to go from b0 to bF
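For reference, a sketch of the standard belief update after applying action a and observing o (notation as above; this equation is not stated explicitly in these notes):
$$ b_a(s) = \sum_{s' \in S} P_a(s \mid s')\, b(s'), \qquad b_a^{o}(s) = \frac{P_a(o \mid s)\, b_a(s)}{\sum_{s''} P_a(o \mid s'')\, b_a(s'')} $$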
see also: https://gibberblot.github.io/rl-notes/single-agent/MDPs.html
Value Iteration
A dynamic programming algorithm that computes the optimal value function, and hence an optimal policy, of an MDP.
def value_iteration(states, actions, P, r, gamma, theta):
    V = {s: 0 for s in states}  # Initialize value function
    while True:
        delta = 0
        for s in states:
            v = V[s]
            V[s] = max(sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                           for s_prime in states) for a in actions[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # Check for convergence
            break
    return V
Bellman Optimality Equation
$V(s) = \max_{a \in A(s)} \sum_{s'} P_a(s' \mid s)\, \big[ r(s, a, s') + \gamma\, V(s') \big]$
- the probabilities of all possible next states
- the reward of the action
- the value of the next state times the discount factor, taken from the previous iteration
Q-Value
For each state s, the quality of an applicable action a is:
$Q(s, a) = \sum_{s'} P_a(s' \mid s)\, \big[ r(s, a, s') + \gamma\, V(s') \big]$
where $\gamma$ is the discount factor: the closer it is to 1, the more weight is given to long-term rewards; the closer it is to 0, the more weight is given to short-term rewards.
Successive powers $\gamma, \gamma^2, \gamma^3, \gamma^4, \dots$ for different discount factors:
- γ = 0.95: 0.95, 0.9025, 0.857375, 0.81450625, ...
- γ = 0.9: 0.9, 0.81, 0.729, 0.6561, ...
- γ = 0.8: 0.8, 0.64, 0.512, 0.4096, ...
- γ = 0.7: 0.7, 0.49, 0.343, 0.2401, ...
Some concepts related to Q-values
- Value: in reinforcement learning, value usually refers to the value of a state, i.e., the expected return obtained by starting from that state and following a given policy.
- Q-value: the value of a state-action pair, i.e., the expected return obtained by taking a given action in a given state and then following a given policy.
- Q-value table: a data structure that stores the Q-value of every state-action pair. In tabular reinforcement learning algorithms (such as Q-learning), the Q-value table is the main data structure.
- Q-value function: a function that takes a state and an action as input and returns the Q-value of that pair. In function-approximation methods (such as deep Q-networks), the Q-value function is usually represented by a neural network.
- Q-table: the same concept as a Q-value table, just a different name.
Policy
A policy π maps states to actions; its value is the expected return obtained by following it from a given state.
Multi-Armed Bandit
Try multiple actions (arms) in parallel, balancing exploitation and exploration.
Minimising regret: the loss incurred by not always choosing the best action.
Input: a multi-armed bandit problem
Output: Q-function Q
Initialise Q(a) arbitrarily; e.g., Q(a) = 0 for every arm a
Initialise N(a) = 0 for every arm a (the number of times a has been pulled)
while t < T do
    select an arm a to pull in round t (e.g. using ε-greedy or UCB over Q)
    execute arm a in round t and observe reward r
    N(a) ← N(a) + 1;  Q(a) ← Q(a) + (1 / N(a)) · (r − Q(a))
end while
ε-greedy
With probability 1 − ε choose the best action (according to Q), and with probability ε choose a random action.
ε-greedy with decay
Decrease the value of ε over time, so the agent explores less as its estimates improve.
UCB
$\text{argmax}_{a}\left(Q(a) + \sqrt{\frac{2 \ln t}{N(a)}}\right)$, where $t$ is the number of rounds played so far and $N(a)$ is the number of times arm $a$ has been chosen.
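A small, self-contained Python sketch of this selection rule inside a bandit loop; the run_bandit helper and its Gaussian rewards are illustrative assumptions, not part of the course framework.
import math
import random

def ucb_select(Q, N, t):
    # Pull every arm once before applying the UCB formula.
    for a in Q:
        if N[a] == 0:
            return a
    return max(Q, key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))

def run_bandit(arm_means, rounds=1000):
    # arm_means: true mean reward of each arm (unknown to the agent).
    Q = {a: 0.0 for a in arm_means}
    N = {a: 0 for a in arm_means}
    for t in range(1, rounds + 1):
        a = ucb_select(Q, N, t)
        reward = random.gauss(arm_means[a], 1.0)  # noisy reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]  # incremental mean update
    return Q, N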
Q-Learning
Input: MDP M = ⟨S, s0, A, P_a(s′ | s), r(s, a, s′)⟩
Output: Q-function Q
Initialise Q arbitrarily; e.g., Q(s, a) = 0 for all s and a
repeat
    s ← the first state in episode e
    repeat (for each step in episode e)
        Select action a to apply in s; e.g. using Q and a multi-armed bandit algorithm such as ε-greedy
        Execute action a in state s
        Observe reward r and new state s′
        Q(s, a) ← Q(s, a) + α · [r + γ · max_{a′} Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is the last state of episode e (a terminal state)
until Q converges
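A compact Python sketch of tabular Q-learning following this pseudocode, assuming a generic environment with reset(), step(action) returning (next_state, reward, done), and actions(state) — an illustrative interface, not course code.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions(state))
        return max(env.actions(state), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap with the best action in the next state.
            target = reward if done else reward + gamma * max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q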
SARSA
Input: MDP M = ⟨S, s0, A, P_a(s′ | s), r(s, a, s′)⟩
Output: Q-function Q
Initialise Q arbitrarily; e.g., Q(s, a) = 0 for all s and a
repeat
    s ← the first state in episode e
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    repeat (for each step in episode e)
        Take action a, observe r, s′
        Choose a′ from s′ using a policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α · [r + γ · Q(s′, a′) − Q(s, a)]
        s ← s′, a ← a′
    until s is the last state of episode e (a terminal state)
until Q converges
- The bootstrap term γ · max_{a′} Q(s′, a′) (or γ · Q(s′, a′) for SARSA) can also be written as γ · V(s′), i.e., the value of the next state.
off-policy vs on-policy
- Q-learning is off-policy: it updates towards the best action in the next state, regardless of which action the current (exploring) policy actually takes, so its Q-values are more optimistic about the current policy.
- SARSA is on-policy: it updates using the action actually chosen by the current policy, so its Q-values are more conservative.
n-step reinforcement learning
Keep a record of the last n rewards and update the Q-value only after n steps, using the accumulated discounted rewards.
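For reference, a sketch of the n-step target used in such an update (shown here with a Q-learning-style bootstrap; SARSA would bootstrap with the action actually chosen):
$$ G_t^{(n)} = r_t + \gamma\, r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a') $$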
MCTS
- Selection: descend the tree by selecting child nodes until an unexpanded node is found
- Expansion: expand the unexpanded node
- Simulation: simulate a random playout from it until the end
- Backpropagation: update the values of all nodes visited on the path
offline vs online
- offline: complete all simulations first, then choose the best action
- online: choose the best action after simulating, then keep simulating from the new node; at the next selection the previous simulation results are reused. MCTS is online.
Update with the running average: new Q = old Q + learning rate × error; with a learning rate of 1/N this is exactly the incremental mean.
UCT
MCTS in which the selection step uses UCB.
The exploration constant C_p is chosen by you, depending on whether you prefer more exploration or more exploitation.
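One common form of the UCT selection rule (a sketch; the constants vary between presentations, and C_p = 1/√2 recovers plain UCB1):
$$ \text{argmax}_{s' \in \text{children}(s)} \left( Q(s') + 2\, C_p \sqrt{\frac{2 \ln N(s)}{N(s')}} \right) $$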
Linear Q-function Approximation
Number of features: with one indicator feature per state-action pair, # features = # states × # actions (the tabular case); in practice far fewer, hand-crafted features are used.
The Q-function is approximated as a weighted sum of features: $Q(s, a) = \sum_{i} w_i\, f_i(s, a)$.
Update: $w_i \leftarrow w_i + \alpha \cdot \delta \cdot f_i(s, a)$
where
$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ if Q-learning
$\delta = r + \gamma\, Q(s', a') - Q(s, a)$ if SARSA
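A short Python sketch of this update for the Q-learning case, assuming a feature function features(state, action) that returns a list of floats — an illustrative interface, not course code.
def linear_q_update(weights, features, state, action, reward,
                    next_state, next_actions, alpha=0.01, gamma=0.9):
    # Q(s, a) is the dot product of the weight vector and the feature vector.
    def q(s, a):
        return sum(w * f for w, f in zip(weights, features(s, a)))

    # Q-learning target uses the best action in the next state
    # (for SARSA, use q(next_state, chosen_next_action) instead).
    best_next = max(q(next_state, a) for a in next_actions) if next_actions else 0.0
    delta = reward + gamma * best_next - q(state, action)
    feats = features(state, action)
    return [w + alpha * delta * f for w, f in zip(weights, feats)]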
Shaped Reward
Potential-based reward shaping adds $F(s, s') = \gamma\, \Phi(s') - \Phi(s)$ to the reward, which speeds up learning without changing the optimal policy.
For example, in GridWorld, $\Phi(s)$ could be based on the (negative, normalised) distance from s to the goal, so that moving closer to the goal yields a small extra reward.
Policy Iteration
Alternates policy evaluation and policy improvement: evaluation uses a modified Bellman equation in which the maximum over all actions is replaced by the action chosen by the current policy; improvement then makes the policy greedy with respect to the evaluated values.
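A compact Python sketch of policy iteration, assuming the same states/actions/P/r/gamma dictionary conventions as the value_iteration function above (a sketch, not a reference implementation).
def policy_iteration(states, actions, P, r, gamma, theta=1e-6):
    policy = {s: next(iter(actions[s])) for s in states}  # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: Bellman equation with the policy's action (no max).
        while True:
            delta = 0
            for s in states:
                a = policy[s]
                v = sum(P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                        for s_prime in states)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V.
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: sum(
                P[s][a][s_prime] * (r[s][a][s_prime] + gamma * V[s_prime])
                for s_prime in states))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V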
What is Game Theory?
Game theory is the study of mathematical models of strategic interaction among rational decision-makers. It has applications in all fields of social science, as well as in logic, systems science and computer science.
Nash Equilibrium
Every agent has chosen a best response to the other agents' strategies, so no agent can benefit by unilaterally changing its own strategy.
- A weakly (strictly) dominant strategy: a strategy that is always at least as good as (strictly better than) every other strategy, regardless of what the opponents do.
- Indifference: by adjusting the probabilities in my mixed strategy, I change my opponent's expected payoff so that, whatever the opponent chooses, their expected utility is the same. For example, in matching pennies, playing heads with probability 0.5 makes the opponent indifferent between heads and tails.
Utility Function
u_i(a): the payoff agent i obtains from the (joint) action a.
Normal Form Game
A single round of simultaneous play: you do not know the opponent's strategy, only the opponent's utility function.
Extensive Form Game
Players take turns to move, so each player knows the opponent's earlier moves.
Subgame Perfect Equilibrium
A strategy profile that is a Nash equilibrium in every subgame: the current player plays optimally at their turn, the opponent plays optimally at theirs, and so on.
Backward Induction
Input: an extensive form game G
Output: the subgame-perfect equilibrium payoff for each state s
function BackwardInduction(s):
    if s is a terminal state, return reward(s)
    best_child ← (−∞, ..., −∞)
    for each child state s′ of s:
        child_reward ← BackwardInduction(s′)
        if child_reward(player(s)) > best_child(player(s)), then best_child ← child_reward
    return best_child
return BackwardInduction(s0)
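A minimal recursive Python sketch of this procedure, assuming a game object with is_terminal(s), reward(s) returning a dict of payoffs per player, player(s), and children(s) — illustrative accessors, not course code.
def backward_induction(game, state):
    # Returns the payoff profile (player -> payoff) reached under
    # subgame-perfect play starting from `state`.
    if game.is_terminal(state):
        return game.reward(state)
    mover = game.player(state)
    best = None
    for child in game.children(state):
        child_reward = backward_induction(game, child)
        if best is None or child_reward[mover] > best[mover]:
            best = child_reward
    return best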
Multi-agent Q-learning
Input: a stochastic game
Output: Q-function Q_i, where i is the self agent
Initialise Q_i arbitrarily; e.g., Q_i(s, a) = 0 for all states s and joint actions a
repeat
    s ← the first state in episode e
    repeat (for each step in episode e)
        Select an action a_i to apply in s,
        e.g. using Q_i and a multi-armed bandit algorithm such as ε-greedy
        Execute action a_i in state s
        Observe reward r_i, the joint action a, and the new state s′
        Q_i(s, a) ← Q_i(s, a) + α · [r_i + γ · max_{a′} Q_i(s′, a′) − Q_i(s, a)]
        s ← s′
    until the end of episode e (a terminal state)
until Q_i converges