Mathematical Foundations of Reinforcement Learning (English edition)

Author: Shiyu Zhao (趙世鈺) | Editor: Guo Sai (郭賽)
Publisher: Tsinghua University Press
ISBN: 9787302658528

Publication date: 2024/07/01
Binding: Paperback
Pages: 301

Price: RMB 118

Synopsis

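This book starts from the most basic concepts of reinforcement learning and introduces the fundamental analysis tools, including the Bellman equation and the Bellman optimality equation. It then extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning algorithms based on function approximation. The book emphasizes introducing concepts, analyzing problems, and analyzing algorithms from a mathematical perspective, rather than the programming implementation of algorithms. It assumes no prior background in reinforcement learning and requires only some knowledge of probability theory and linear algebra. For readers who already have a foundation in reinforcement learning, the book can deepen their understanding of certain problems and offer new perspectives.
This book is intended for undergraduate students, graduate students, researchers, and practitioners in industry or research institutes who are interested in reinforcement learning.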

About the Author

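Shiyu Zhao is a distinguished research fellow in the AI division of the School of Engineering at Westlake University, head of the Intelligent Unmanned Systems Laboratory, and a recipient of the youth project of the national overseas high-level talent recruitment program. Zhao received bachelor's and master's degrees from Beihang University and a Ph.D. from the National University of Singapore, and previously served as a Lecturer in the Department of Automatic Control and Systems Engineering at the University of Sheffield, UK. Zhao's work is dedicated to developing interesting, useful, and challenging next-generation robotic systems, with a focus on control, decision-making, and perception in multi-robot systems.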

Overview of this Book
Chapter 1  Basic Concepts
  1.1  A grid world example
  1.2  State and action
  1.3  State transition
  1.4  Policy
  1.5  Reward
  1.6  Trajectories, returns, and episodes
  1.7  Markov decision processes
  1.8  Summary
  1.9  Q&A
Chapter 2  State Values and the Bellman Equation
  2.1  Motivating example 1: Why are returns important?
  2.2  Motivating example 2: How to calculate returns?
  2.3  State values
  2.4  The Bellman equation
  2.5  Examples for illustrating the Bellman equation
  2.6  Matrix-vector form of the Bellman equation
  2.7  Solving state values from the Bellman equation
    2.7.1  Closed-form solution
    2.7.2  Iterative solution
    2.7.3  Illustrative examples
  2.8  From state value to action value
    2.8.1  Illustrative examples
    2.8.2  The Bellman equation in terms of action values
  2.9  Summary
  2.10  Q&A
Chapter 3  Optimal State Values and the Bellman Optimality Equation
  3.1  Motivating example: How to improve policies?
  3.2  Optimal state values and optimal policies
  3.3  The Bellman optimality equation
    3.3.1  Maximization of the right-hand side of the BOE
    3.3.2  Matrix-vector form of the BOE
    3.3.3  Contraction mapping theorem
    3.3.4  Contraction property of the right-hand side of the BOE
  3.4  Solving an optimal policy from the BOE
  3.5  Factors that influence optimal policies
  3.6  Summary
  3.7  Q&A
Chapter 4  Value Iteration and Policy Iteration
  4.1  Value iteration
    4.1.1  Elementwise form and implementation
    4.1.2  Illustrative examples
  4.2  Policy iteration
    4.2.1  Algorithm analysis
    4.2.2  Elementwise form and implementation
    4.2.3  Illustrative examples
  4.3  Truncated policy iteration
    4.3.1  Comparing value iteration and policy iteration
    4.3.2  Truncated policy iteration algorithm
  4.4  Summary
  4.5  Q&A
Chapter 5  Monte Carlo Methods
  5.1  Motivating example: Mean estimation
  5.2  MC Basic: The simplest MC-based algorithm
    5.2.1  Converting policy iteration to be model-free
    5.2.2  The MC Basic algorithm
    5.2.3  Illustrative examples
  5.3  MC Exploring Starts
    5.3.1  Utilizing samples more efficiently
    5.3.2  Updating policies more efficiently
    5.3.3  Algorithm description
  5.4  MC ε-Greedy: Learning without exploring starts
    5.4.1  ε-greedy policies
    5.4.2  Algorithm description
    5.4.3  Illustrative examples
  5.5  Exploration and exploitation of ε-greedy policies
  5.6  Summary
  5.7  Q&A
Chapter 6  Stochastic Approximation
  6.1  Motivating example: Mean estimation
  6.2  Robbins-Monro algorithm
    6.2.1  Convergence properties
    6.2.2  Application to mean estimation
  6.3  Dvoretzky's convergence theorem
    6.3.1  Proof of Dvoretzky's theorem
    6.3.2  Application to mean estimation
    6.3.3  Application to the Robbins-Monro theorem
    6.3.4  An extension of Dvoretzky's theorem
  6.4  Stochastic gradient descent
    6.4.1  Application to mean estimation
    6.4.2  Convergence pattern of SGD
    6.4.3  A deterministic formulation of SGD
    6.4.4  BGD, SGD, and mini-batch GD
    6.4.5  Convergence of SGD
  6.5  Summary
  6.6  Q&A
Chapter 7  Temporal-Difference Methods
  7.1  TD learning of state values
    7.1.1  Algorithm description
    7.1.2  Property analysis
    7.1.3  Convergence analysis
  7.2  TD learning of action values: Sarsa
    7.2.1  Algorithm description
    7.2.2  Optimal policy learning via Sarsa
  7.3  TD learning of action values: n-step Sarsa
  7.4  TD learning of optimal action values: Q-learning
    7.4.1  Algorithm description
    7.4.2  Off-policy vs. on-policy
    7.4.3  Implementation
    7.4.4  Illustrative examples
  7.5  A unified viewpoint
  7.6  Summary
  7.7  Q&A
Chapter 8  Value Function Approximation
  8.1  Value representation: From table to function
  8.2  TD learning of state values with function approximation
    8.2.1  Objective function
    8.2.2  Optimization algorithms
    8.2.3  Selection of function approximators
    8.2.4  Illustrative examples
    8.2.5  Theoretical analysis
  8.3  TD learning of action values with function approximation
    8.3.1  Sarsa with function approximation
    8.3.2  Q-learning with function approximation
  8.4  Deep Q-learning
    8.4.1  Algorithm description
    8.4.2  Illustrative examples
  8.5  Summary
  8.6  Q&A
Chapter 9  Policy Gradient Methods
  9.1  Policy representation: From table to function
  9.2  Metrics for defining optimal policies
  9.3  Gradients of the metrics
    9.3.1  Derivation of the gradients in the discounted case
    9.3.2  Derivation of the gradients in the undiscounted case
  9.4  Monte Carlo policy gradient (REINFORCE)
  9.5  Summary
  9.6  Q&A
Chapter 10  Actor-Critic Methods
  10.1  The simplest actor-critic algorithm (QAC)
  10.2  Advantage actor-critic (A2C)
    10.2.1  Baseline invariance
    10.2.2  Algorithm description
  10.3  Off-policy actor-critic
    10.3.1
