多臂赌博机问题

问题描述

In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. –Wiki

• $T$ 步累积奖赏： $E[\frac{1}{T} \sum_{t=1}^{T} r_{t}]$
• $\gamma$ 折现累积奖赏： $E[\sum_{t=0}^{T-1} \gamma^{t} r_{t+1}]$
• 累积遗憾： $Th^{*} - E[\sum_{t=1}^{T} r_{t}]$ ，其中 $h^{*} = \max_{i=1,2,3,\dotsc,K} h_{i}$

蒙特卡洛树搜索

问题描述

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. MCTS combines the generality of random simulation with the precision of tree search.

John von Neumann's 1928 minimax theorem paved the way for adversarial tree search methods that have formed the basis of decision making in computer science and AI almost since their inception. Monte Carlo methods were later formalised in the 1940s as a way to approach less well-defined problems unsuitable for tree search through the use of random sampling. Rémi Coulomb combined these two ideas in 2006 to provide a new approach for move planning in Go now known as MCTS.

Research interest in MCTS has risen sharply due to its spectacular success with computer Go and potential application to a number of other difficult problems. Its application extends beyond games, and MCTS can theoretically be applied to any domain that can be described in terms of {state, action} pairs and simulation used to forecast outcomes.

• 利用树结构来重新表达决策问题
• 利用蒙特卡洛方法来进行搜索

Demo展示

• 有人用Wolfram Mathematica可视化了不同数据集的无向图，基于此来找规律，有点类似于领域知识的探索
• 有人没用蒙特卡洛树搜索，而是用了贪心算法，考虑到有的无向图规模太大，反而取得了不错的成绩

作为同构的仿真优化

• Ranking & Selection
• Discrete Optimization via Simulation (Heuristic methods / Random Search)
• Response Surface Methodology
• Stochastic Approximation
• Sample Average Approximation
• Variance Reduction Techniques

Ranking & Selection

OCBA的核心思路是利用信噪比来分配仿真资源，即仿真资源应该尽可能的分配给噪声大（比如样本方差）的选择以及信号比较强烈（同当前最优选择比较接近）的选择。可以看到从多臂赌博机问题的角度出发，OCBA相对更倾向于exploration，因为其目的只是关心『最终』能选择出最好的选择，而不在乎每次仿真得到的收益。

Footnotes:

1

2

softmax distribution的均值随其参数变化，将包含观测值从最小到最大的所有可能取值

4

Dec 24 2016