How the Multi-Armed Bandit Problem Changed the Way I Live

How probability theory can influence your everyday decisions

Indecision

If you're like me, you're often plagued by indecision. You spend a debilitating epoch selecting an ice cream flavour, or the right drink at a bar. Or you find yourself in the aisle of Tesco, paralysed by the countless flavours of crisp. And with this struggle comes a familiar self-hatred: not only are you impotent, but you are also wasting your precious, finite time on this earth on the absolute epitome of tedium. YET, upon choosing, you are distinctly excluding yourself from the other choices, and you experience, in advance, the dull sense of accompanying grief.

'Decide' comes from the Latin decidere, 'to cut off', and that's exactly what it feels like. You are cutting off the other perfectly fine choices, along with their possibly superior futures. Perhaps I'm missing out on riches, on love, on enriching and soul-nourishing experiences. Or perhaps it's that salt & pepper Kettle Chips are simply more delicious than the Tyrrells equivalent. But how will I know? Sure, I enjoyed them the last time, but today it may be different. But it doesn't matter. But it DOES matter. Maybe I'll try something new?

Then there is the vast array of other facets: price, calories, salt, weight, current hunger levels, current thirst levels. Will I be meeting someone later? Where will I eat them? Has the bag been crushed? Etc., etc., etc. It's astonishing we arrive at any single decision in our lifetimes at all.

Aside from existential fear (in my experience, years of therapy are the only answer), the multi-armed bandit problem may offer you some decision solace, along with a strategy that cuts through the decision feedback loop.

The Problem

A 'one-armed bandit' is a nickname for a slot machine: it robs you of all your worth, despite having only one arm to rob you with.

The multi-armed bandit problem is a classic in probability theory, and its solutions are applicable in game theory, reinforcement learning, business decision making and, as I see it, conquering decision fatigue.

The problem: You are in a casino hall full of slot machines. Your goal is to find the strategy with the highest payout. (You could argue that the best strategy is to leave the casino altogether, but we will assume that, just like in life, you have to play. There may not be a money-making strategy, but it works the same if you aim to minimise expected loss.) Each slot machine has an unknown probability of payout and, even worse, in the harder versions of the problem these probabilities change over time.

So what's your strategy? Do you walk around randomly pulling levers? Or stick to one trusty machine? The tricky part is balancing exploration with exploitation, i.e. balancing the search for a better slot machine against cashing in when you've found a good one. There are different solutions to this problem. We will look at the epsilon-greedy solution.

The Epsilon-greedy solution

This strategy tries to pick the best next move most of the time: the action that looks like it will maximise our payout. First, we need a way of quantifying the value of each action in a situation where we don't know the probability that each machine will pay out. Since we don't know, we use the average reward we have seen so far to estimate the value of each action.

For an action a, its action-value V_a is defined as:

V_a = (Total rewards from action a so far)/(Total times we have tried action a)
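To make the formula concrete, here is a minimal Python sketch of the estimate. The function names are my own, and treating an untried action as having value 0 is an assumption (the formula itself is undefined when we have never tried the action). The second function is the same average written as a running update, so you don't have to remember every past reward.

```python
def action_value(total_reward: float, times_tried: int) -> float:
    """Sample-average estimate V_a = total rewards / times tried.

    Returning 0.0 for an action we have never tried is an assumption;
    the formula in the text is undefined in that case.
    """
    if times_tried == 0:
        return 0.0
    return total_reward / times_tried


def update_average(current_value: float, times_tried: int, reward: float) -> float:
    """Return the new average after one more reward.

    `times_tried` is the count BEFORE this reward. This is algebraically
    the same as recomputing total / count from scratch.
    """
    return current_value + (reward - current_value) / (times_tried + 1)


# Example: a machine paid out 3 times in 4 pulls, then pays out once more.
v = action_value(3.0, 4)          # 3 / 4
v_next = update_average(v, 4, 1)  # same as 4 / 5
```
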

And here comes the strategy:

  • Pick the action with the highest expected payout (the action with largest V_a) most of the time (with probability 1-ε)
  • Pick a random action a small amount of the time (with probability ε)
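The two rules above can be sketched in a few lines of Python. This is only an illustration: `choose_action` is a name I made up, `values` is assumed to be a list holding the current V_a for each machine, and breaking ties at random is my own choice rather than part of the rule.

```python
import random


def choose_action(values, epsilon=0.1, rng=random):
    """Pick an index into `values` using the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: with probability epsilon, pick any action at random.
        return rng.randrange(len(values))
    # Exploit: otherwise pick the action with the largest estimated value,
    # choosing at random among equally good ones (an assumption).
    best = max(values)
    best_actions = [i for i, v in enumerate(values) if v == best]
    return rng.choice(best_actions)


# Example: three machines with estimated values; machine 1 looks best.
next_machine = choose_action([0.1, 0.9, 0.3], epsilon=0.1)
```

Note that the random pick can also land on the current favourite, so the best-looking action is chosen slightly more often than 1-ε of the time.
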

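So as you move forward, every action you pick updates your knowledge of that action's value. But you need to be prepared for the possibility that another action (picking another slot machine) is better, so with some small probability you try one at random.

The exact value of epsilon is not critical as long as it's small (try something between 0.01 and 0.1). A smaller epsilon learns more slowly but wastes fewer pulls on exploring. Either way, as long as the payouts aren't shifting under you, your estimates improve the more you play, and you end up picking the best machine most of the time.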
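If you want to see the strategy at work, here is a small simulated casino. It is a sketch under some loud assumptions: each machine pays 1 with a fixed probability and 0 otherwise (so the probabilities do not change over time), the payout probabilities are numbers I invented, and `run_epsilon_greedy` is a hypothetical name.

```python
import random


def run_epsilon_greedy(payout_probs, epsilon=0.1, pulls=5000, seed=0):
    """Play `pulls` rounds against simulated machines with FIXED payout
    probabilities (an assumption) and return (values, counts, total_reward)."""
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n    # times we have tried each machine
    values = [0.0] * n  # current estimate V_a for each machine
    total_reward = 0

    for _ in range(pulls):
        if rng.random() < epsilon:
            a = rng.randrange(n)  # explore
        else:
            best = max(values)    # exploit, breaking ties at random
            a = rng.choice([i for i, v in enumerate(values) if v == best])

        # Pull the lever: pays 1 with probability payout_probs[a], else 0.
        reward = 1 if rng.random() < payout_probs[a] else 0

        # Update the running average for the machine we just played.
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]
        total_reward += reward

    return values, counts, total_reward


values, counts, total_reward = run_epsilon_greedy([0.2, 0.5, 0.8])
print("estimates:", [round(v, 2) for v in values])
print("pulls per machine:", counts)
```

With payouts this far apart, most of the pulls should end up on the third machine. Try changing `epsilon` and see how the split between machines moves.
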

What it means for everyday decision making

So you're still in that aisle in Tesco. How do you make a decision?

The change in mindset that enables you to finally collapse the smorgasbord of options is that you will, in the course of your life, make many, many decisions. So by employing the epsilon-greedy strategy, we pick the packet of crisps we know we like most of the time, and we pick a random packet of crisps some of the time, knowing that we are using this pick to explore the collection of crisps on offer (some of which we might like more).

If we do like that second packet of crisps more, we try it again whenever it comes up, until our estimate of its value overtakes our old favourite and it becomes the default pick. If there are crisps of equal value, most of the time we pick randomly between them, while always reserving that epsilon wild card.

Bear in mind that this is just one solution to the problem, and there are many others. (I think that in my life I value variety and novelty over time, and that this factors into my mental action-value model.)

But for me, knowing that I am tending towards a pretty good solution is enough to get me out of the bloody shop.