Journal ArticleDOI
Solving two‐armed Bernoulli bandit problems using a Bayesian learning automaton
TLDR
Research is reported into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family, based upon merely counting rewards/penalties, combined with random sampling from a pair of twin Beta distributions.
Abstract
Purpose – The two‐armed Bernoulli bandit (TABB) problem is a classical optimization problem in which an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in either a reward or a penalty. The reward probability of each arm is unknown, so the agent must balance exploiting existing knowledge about the arms against obtaining new information. The purpose of this paper is to report research into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family.
Design/methodology/approach – Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. BLA avoids the problem of computational intractability by not explicitly performing the Bayesian computations. Rather, it is based upon merely counting rewards/penalties, combined with random sampling from a pair of twin Beta distributions. This is intuitively appealing since the Bayesian conjugate prior for a binomial likelihood is the Beta distribution.
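The scheme described in the abstract can be sketched in a few lines: each arm keeps a count of observed rewards and penalties, and at every step the agent draws one sample from each arm's Beta posterior and pulls the arm whose sample is larger. This is a minimal illustration of the counting-plus-sampling idea, not the paper's implementation; the function name, signature, and fixed horizon are assumptions for the sketch.

```python
import random

def bla_two_armed(reward_probs, horizon, seed=0):
    """Minimal sketch of a Bayesian learning automaton (Thompson-style
    sampling) for a two-armed Bernoulli bandit.

    reward_probs: true (unknown to the agent) reward probability of each arm.
    horizon: number of pulls to simulate.
    Returns the total reward collected and the final Beta parameters.
    """
    rng = random.Random(seed)
    # Beta(1, 1) uniform priors; alpha-1 counts rewards, beta-1 counts penalties.
    alpha = [1, 1]
    beta = [1, 1]
    total_reward = 0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in (0, 1)]
        # Pull the arm with the larger sampled value.
        arm = 0 if samples[0] >= samples[1] else 1
        # Simulated environment: Bernoulli reward with the arm's true probability.
        reward = 1 if rng.random() < reward_probs[arm] else 0
        total_reward += reward
        # Update the pulled arm's counts; no other Bayesian computation is needed.
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return total_reward, alpha, beta
```

Because the posterior samples concentrate around each arm's empirical reward rate as counts accumulate, the automaton gradually shifts its pulls toward the better arm while still occasionally exploring the other.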
Citations
Proceedings Article
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle,Lihong Li +1 more
TL;DR: Empirical results using Thompson sampling on simulated and real data are presented, and it is shown that it is highly competitive and should be part of the standard baselines to compare against.
Proceedings Article
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
Shipra Agrawal,Navin Goyal +1 more
TL;DR: In this paper, the Thompson sampling algorithm is shown to achieve logarithmic expected regret for the stochastic multi-armed bandit problem; for the two-armed case the expected regret is O(ln T/Δ + 1/Δ³).
Proceedings Article
Thompson Sampling for Contextual Bandits with Linear Payoffs
Shipra Agrawal,Navin Goyal +1 more
TL;DR: In this article, a generalization of Thompson sampling is proposed for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary, and a high-probability regret bound of O(d²/ε · √(T^(1+ε))) is shown.
Book ChapterDOI
Thompson sampling: an asymptotically optimal finite-time analysis
TL;DR: The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem is answered positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.
Journal ArticleDOI
Combinatorial bandits
Nicolò Cesa-Bianchi,Gábor Lugosi +1 more
TL;DR: A variant of a strategy by Dani, Hayes and Kakade is introduced, achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln|S|), where n is the time horizon.
References
Book
Reinforcement Learning: An Introduction
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Journal ArticleDOI
Finite-time Analysis of the Multiarmed Bandit Problem
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Journal ArticleDOI
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Book ChapterDOI
Bandit based monte-carlo planning
Levente Kocsis,Csaba Szepesvári +1 more
TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs); Monte-Carlo planning is one of the few viable approaches for finding near-optimal solutions in such problems.