The Reader’s Digest conducted an experiment attempting to measure how honest the world’s cities are. In each of a variety of cities, they “lost” 12 wallets, each containing about $50 in the local currency and the contact information of the wallet’s owner. They then counted how many wallets were returned to the owner.

In this short post, I would like to highlight how little this experiment can really say about honesty across cities because of its small sample size.

First, here are the results of the experiment.

One can think of the wallet experiment as a series of coin flips. You flip a coin (lose a wallet) 12 times, and each flip comes up either heads or tails (you either get the wallet back or you don’t). Depending on how biased the coin is, one outcome may be more likely than the other.

Mathematically, this is described by the binomial distribution. This distribution can be thought of as a series of trials, each of which is independent of the others, has the same probability of success, and ends in either a success or a failure.

Two numbers are enough to specify a binomial distribution: the number of trials (which is 12 for our wallet experiment), and the probability of success. The latter can be thought of as the probability of getting heads in a coin flip, or the probability of getting the wallet back in the experiment. For instance, if we have a fair coin, the probability will be 0.5, or 50%.
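For reference, here is the standard formula for the binomial probability of exactly $k$ successes in $n$ independent trials, each with success probability $p$. The symbols $n$, $k$, and $p$ are notation introduced here for illustration only.

$$
P(X = k) = \binom{n}{k} \, p^{k} (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n
$$

For example, with $n = 12$ and a fair coin ($p = 0.5$), the chance of exactly 8 heads is $\binom{12}{8} / 2^{12} = 495 / 4096 \approx 0.121$.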

Using this distribution, we can easily establish a confidence interval around the results of the wallet experiment. For instance, in New York 8 out of 12 wallets were returned. We can use the binomial distribution to calculate the probability that 8 of 12 wallets get returned if only 50% of New Yorkers are honest. We can of course ask the same question for any arbitrary percentage of honest people (e.g. what if 72.53% of New Yorkers are honest?).
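As a minimal sketch of that calculation, base R's `dbinom` gives the probability of an exact number of successes. The variable names below are illustrative, and the call is a shortcut for the same quantity rather than the exact code behind the numbers in this post.

```r
# Number of wallets "lost" in each city (from the experiment).
n_wallets <- 12

# Probability of exactly 8 returns if 50% of New Yorkers are honest.
p_half <- dbinom(8, size = n_wallets, prob = 0.5)

# The same question for an arbitrary hypothesized fraction, e.g. 72.53%.
p_arbitrary <- dbinom(8, size = n_wallets, prob = 0.7253)

print(p_half)       # 495 / 4096, roughly 0.121
print(p_arbitrary)
```

The second probability comes out larger than the first, which is what one would expect, since 72.53% is closer to the observed return rate of 8/12.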

Obviously, if 8/12 wallets are retrieved, then the single most plausible estimate is that 8/12 ≈ 67% of New Yorkers are honest. But other possibilities (say, that only 50% of them are honest) are also compatible with this result. This is because the sample size is small (only 12 trials were run), so it’s hard to establish confidence in the results.

Now, suppose we find that there is only a 5% chance of getting 8 wallets back when just 50% of New Yorkers are honest. We may then say this is too low a probability, and conclude that it is quite unlikely that only 50% of New Yorkers are honest and yet 8 wallets got returned. What counts as too low a probability is hard to determine and to some extent subjective.

Below is a table where I set this cut-off probability to 10%. This means that whenever there is a less than 10% chance of having 8/12 wallets retrieved when the fraction of honest people is x%, I will conclude that the fraction of honest people probably cannot be x%. This allows me to establish confidence bounds around the results.

With this we can see how little confidence there is behind the results. Essentially, as long as the confidence intervals for two cities overlap in the table, we cannot really be sure that the fraction of honest people differs between them. For instance, this experiment has so little power that, for all we know, Amsterdam and Moscow (7 retrieved wallets each) are just as honest as Helsinki (11 retrieved wallets).
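The overlap check for that last comparison can be sketched as follows. `plausible_range` is a hypothetical helper written for this illustration; it assumes the same 10% cut-off and the same 5% grid of honesty fractions used in the appendix, and it uses `dbinom` as a shortcut for the probability of an exact count.

```r
# Hypothetical helper: range of honesty fractions (on a 5% grid) under which
# seeing exactly `retrieved` of `n` wallets returned has probability > cutoff.
plausible_range <- function(retrieved, n = 12, cutoff = 0.1,
                            grid = seq(0, 1, by = 0.05)) {
  probs <- dbinom(retrieved, size = n, prob = grid)
  range(grid[probs > cutoff])
}

amsterdam_moscow <- plausible_range(7)   # 7 wallets returned
helsinki <- plausible_range(11)          # 11 wallets returned

# Two ranges overlap when each one starts no later than the other ends.
ranges_overlap <- amsterdam_moscow[1] <= helsinki[2] &&
  helsinki[1] <= amsterdam_moscow[2]

print(amsterdam_moscow)
print(helsinki)
print(ranges_overlap)
```

On this coarse grid the two ranges only just meet, at 75%, so the endpoints (and how much the ranges overlap) depend on the grid spacing and on the chosen cut-off.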

I suppose this post is just a cautionary tale: one should be very careful about trusting (and drawing far-reaching conclusions from) all kinds of arbitrary analyses and numbers.

**Appendix.** The numbers above were generated using the following R code.

    # Function to calculate probability for given:
    # - number of wallets retrieved,
    # - hypothesized fraction of honest people.
    walletProb = function(successes, prob) {
      pbinom(successes, 12, prob=prob) - pbinom(successes-1, 12, prob=prob)
    }

    # Use function to establish confidence bounds
    for (retrieved in 1:12) {
      high.probs = c()
      for (honest in seq(0, 1, by=.05)) {
        prob.honest = walletProb(retrieved, honest)
        if (prob.honest > .1) {
          high.probs = c(high.probs, honest)
        }
      }
      print(paste(min(high.probs), max(high.probs), sep=","))
    }