Journal ArticleDOI

Solving two‐armed Bernoulli bandit problems using a Bayesian learning automaton

TLDR
This paper reports research into a completely new family of solution schemes for the TABB problem, the Bayesian learning automaton (BLA) family, based on merely counting rewards/penalties combined with random sampling from a pair of twin Beta distributions.
Abstract
Purpose – The two‐armed Bernoulli bandit (TABB) problem is a classical optimization problem in which an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in either a reward or a penalty. The reward probability of each arm is unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. The purpose of this paper is to report research into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family.

Design/methodology/approach – Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. BLA avoids the problem of computational intractability by not explicitly performing the Bayesian computations. Rather, it is based upon merely counting rewards/penalties, combined with random sampling from a pair of twin Beta distributions. This is intuitively appealing since the Bayesian conjugate prior for a bino...
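The counting-plus-sampling scheme the abstract describes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the function name, parameters, and the uniform Beta(1, 1) prior are assumptions made for the example.

```python
import random

def bayesian_learning_automaton(reward_probs, horizon, seed=0):
    """Sketch of the BLA scheme: keep reward/penalty counts per arm,
    sample once from each arm's Beta posterior, and pull the arm whose
    sample is larger. (Illustrative; assumes a Beta(1, 1) prior.)"""
    rng = random.Random(seed)
    alpha = [1, 1]  # 1 + rewards observed on each arm
    beta = [1, 1]   # 1 + penalties observed on each arm
    total_reward = 0
    for _ in range(horizon):
        # Random sampling from the pair of twin Beta distributions.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(2)]
        arm = 0 if samples[0] >= samples[1] else 1
        # Pull the chosen arm: reward (1) or penalty (0).
        reward = 1 if rng.random() < reward_probs[arm] else 0
        total_reward += reward
        if reward:
            alpha[arm] += 1  # count the reward
        else:
            beta[arm] += 1   # count the penalty
    return total_reward, alpha, beta
```

No explicit Bayesian computation is performed: updating the posterior is just incrementing a counter, which is what makes the scheme computationally cheap.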


Citations
Proceedings Article

An Empirical Evaluation of Thompson Sampling

TL;DR: Empirical results using Thompson sampling on simulated and real data are presented, showing that it is highly competitive and should be part of the standard baselines to compare against.
Proceedings Article

Analysis of Thompson Sampling for the Multi-armed Bandit Problem

TL;DR: In this paper, the Thompson sampling algorithm is shown to achieve logarithmic expected regret for the stochastic multi-armed bandit problem; for the two-armed case the expected regret is O(ln T / Δ + 1/Δ³), where Δ is the gap between the arms' means.
Proceedings Article

Thompson Sampling for Contextual Bandits with Linear Payoffs

TL;DR: In this article, a generalization of Thompson sampling is proposed for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary, and a high-probability regret bound of O((d²/ε)√(T^(1+ε))) is shown.
Book ChapterDOI

Thompson sampling: an asymptotically optimal finite-time analysis

TL;DR: The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem is answered positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.
Journal ArticleDOI

Combinatorial bandits

TL;DR: A variant of a strategy by Dani, Hayes, and Kakade is introduced, achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln|S|), where n is the time horizon.
References
Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Journal ArticleDOI

Finite-time Analysis of the Multiarmed Bandit Problem

TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
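The "simple and efficient policies" this reference describes include the UCB1 index policy; a minimal sketch of UCB1 follows, with the function name and parameters chosen for illustration rather than taken from the paper.

```python
import math
import random

def ucb1(reward_probs, horizon, seed=0):
    """Sketch of UCB1: pull each arm once, then repeatedly pull the arm
    with the largest index mean_i + sqrt(2 * ln t / n_i)."""
    rng = random.Random(seed)
    k = len(reward_probs)
    counts = [0] * k   # pulls per arm
    sums = [0.0] * k   # total reward per arm

    def pull(i):
        counts[i] += 1
        sums[i] += 1.0 if rng.random() < reward_probs[i] else 0.0

    for i in range(k):          # initialization: one pull per arm
        pull(i)
    for t in range(k, horizon):
        # Upper confidence index: empirical mean plus exploration bonus.
        index = [sums[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
                 for i in range(k)]
        pull(max(range(k), key=lambda i: index[i]))
    return counts, sums
```

The exploration bonus shrinks as an arm accumulates pulls, which is what yields logarithmic regret uniformly over time for bounded rewards.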
Book ChapterDOI

Bandit Based Monte-Carlo Planning

TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs); Monte-Carlo planning is one of the few viable approaches to finding near-optimal solutions in such problems.