Signal detection theory (SDT) has been widely used to identify the optimal response of a receiver to a stimulus when it could be generated by more than one signaler type. While SDT assumes that the receiver adopts the optimal response at the outset, in reality receivers often have to learn how to respond. We therefore recast a simple signal detection problem as a multi-armed bandit (MAB) in which inexperienced receivers choose between accepting a signaler (gaining information and an uncertain payoff) and rejecting it (gaining no information but a certain payoff). An exact solution to this exploration–exploitation dilemma can be identified by solving the relevant dynamic programming equation (DPE). However, to evaluate how the problem is solved in practice, we conducted an experiment in which humans (n = 135) were repeatedly presented with four readily discriminable signaler types, some of which were, on average, profitable to accept in the long term and others unprofitable. We then compared the performance of SDT, DPE, and three candidate exploration–exploitation models (Softmax, Thompson sampling, and Greedy) in explaining the observed sequences of acceptance and rejection. All of the models predicted volunteer behavior well when signalers were clearly profitable or clearly unprofitable to accept. Overall, however, the Softmax and Thompson sampling models, which predict the optimal (SDT) response towards signalers with borderline profitability only after extensive learning, explained the responses of volunteers significantly better. By highlighting the relationship between the MAB and SDT models, we encourage others to evaluate how receivers strategically learn about their environments.
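To make the accept/reject bandit concrete, the following is a minimal, self-contained sketch of how the three candidate learning rules might be instantiated. Everything here is an illustrative assumption rather than the paper's implementation: the payoff means, noise level, trial count, the Normal-posterior belief model, and the `run` helper are all hypothetical, chosen only to show the structure of the problem (accepting yields information plus an uncertain payoff; rejecting yields a certain payoff and no information).

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical task parameters (illustrative; not the experiment's values) ---
TRUE_MEANS = np.array([1.5, 0.5, -0.5, -1.5])  # mean payoff of accepting each signaler type
NOISE_SD = 1.0                                  # payoff noise when a signaler is accepted
REJECT_PAYOFF = 0.0                             # certain payoff for rejecting
N_TRIALS = 200

def run(policy, temperature=0.5):
    """Simulate one receiver. Each trial presents a random signaler type;
    the receiver accepts (noisy payoff, belief update) or rejects (certain
    payoff, no update). Beliefs are Normal posteriors with known variance."""
    mu = np.zeros(4)       # posterior mean of accept payoff per type
    prec = np.ones(4)      # posterior precision (1/variance); vague-ish prior
    accepts = np.zeros(4)  # acceptance counts per type, for inspection
    for _ in range(N_TRIALS):
        k = rng.integers(4)  # signaler type presented this trial
        if policy == "greedy":
            # Accept whenever the current estimate is at least the certain payoff
            accept = mu[k] >= REJECT_PAYOFF
        elif policy == "softmax":
            # Accept with probability given by a logistic choice rule over value difference
            p = 1.0 / (1.0 + np.exp(-(mu[k] - REJECT_PAYOFF) / temperature))
            accept = rng.random() < p
        elif policy == "thompson":
            # Accept if a posterior draw of the accept payoff beats the certain payoff
            draw = rng.normal(mu[k], 1.0 / np.sqrt(prec[k]))
            accept = draw > REJECT_PAYOFF
        else:
            raise ValueError(policy)
        if accept:
            reward = rng.normal(TRUE_MEANS[k], NOISE_SD)
            # Conjugate Normal update with known observation variance NOISE_SD**2
            obs_prec = 1.0 / NOISE_SD**2
            mu[k] = (prec[k] * mu[k] + obs_prec * reward) / (prec[k] + obs_prec)
            prec[k] += obs_prec
            accepts[k] += 1
        # Rejection yields REJECT_PAYOFF and teaches the receiver nothing.
    return accepts

for policy in ("greedy", "softmax", "thompson"):
    print(policy, run(policy))
```

Under these assumptions the sketch exhibits the qualitative pattern the abstract describes: the Greedy rule stops sampling any type whose estimate falls below the certain payoff, even when that estimate is a noisy fluke, whereas Softmax and Thompson sampling keep probabilistically re-testing borderline types and so converge on the optimal (SDT) response only after extended learning.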
