A Cultural Distance Matrix

In this post I introduce a dataset that I developed recently. The dataset is simply a matrix of 77 countries of the world that shows how close each country pair is in terms of cultural values. Values are measured by answers to the World Values Survey.

The online version of the dataset can be found here, and you can download a .csv version here. The values of cultural distance have been normalized to be between 0 and 1, where 0 is complete agreement on all questions and 1 is complete disagreement on all questions.

The idea behind the matrix is simple. I took the average answer of each country’s respondents to each question to be that particular country’s answer. Then I calculated the compatibility of these answers with the similarly calculated answers of other countries. The way this compatibility is calculated is detailed in another post of mine which you can see here.
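The averaging step can be sketched roughly as follows. This is only an illustration, not the code used to build the matrix: the `country_means` helper, the tuple layout of `responses`, the country codes and the answers are all made up for the example, and it only covers questions whose answers can be averaged.

```python
from collections import defaultdict

def country_means(responses):
    """Average the numeric answers per (country, question) pair.

    `responses` is an iterable of (country, question, answer) tuples;
    this layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for country, question, answer in responses:
        totals[(country, question)] += answer
        counts[(country, question)] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical toy data: three respondents in "AA", two in "BB".
responses = [
    ("AA", "Q1", 1), ("AA", "Q1", 2), ("AA", "Q1", 3),
    ("BB", "Q1", 4), ("BB", "Q1", 2),
]
means = country_means(responses)
```

Each entry of `means` then plays the role of that country’s answer to that question.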

If you read that post, you may realize that there is one caveat: for type 2 and type 3 questions, I have to take a country’s answer to be one single discrete choice, as opposed to an average. This works, but it means that even a country’s residual with itself will be nonzero.

To see this, suppose in a country 50% preferred option A, 30% option B and 20% option C for a type 2 question. Then when I compute the compatibility of this country with other countries, I would have to take its answer to be option A, because that is the option chosen by the most respondents.

This is fine, but the residual of this country with itself is then 1 – 0.5 = 0.5 rather than zero. This is not a huge problem, but it means that each country’s compatibility score with itself would be nonzero and would quite likely differ across countries. Since this number is meaningless, it is better to artificially normalize each country’s score with itself to 0.
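The example can be reproduced in a few lines. The `self_residual` helper and the dictionary layout are hypothetical names chosen for this sketch; the rule itself (take the most popular option, then subtract its share from one) is the one described above.

```python
def self_residual(shares):
    """Return (top_choice, residual) for one type 2 or type 3 question.

    `shares` maps each answer option to the fraction of respondents
    choosing it. The country's answer is taken to be the most popular
    option, and its residual with itself is one minus that option's share.
    """
    top_choice = max(shares, key=shares.get)
    return top_choice, 1 - shares[top_choice]

top, residual = self_residual({"A": 0.5, "B": 0.3, "C": 0.2})
print(top, residual)  # A 0.5
```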

Note that this problem only occurs with type 2 and type 3 questions, of which there are only eight in the data (out of 130). So it would be useless to try to interpret this nonzero residual of a country with itself as, say, a lack of agreement or a degree of polarization within the population. This is another argument for just normalizing these scores to 0.

Another thing to notice is that the maximum residual a country can attain with itself is quite low. Of the eight questions where a nonzero residual can occur, there are six with four answer possibilities. The residual for each question is one minus the percentage of people choosing the top choice within the country. But if there are four possibilities, then the top choice has to be chosen by at least 25% of the population. Hence the maximum residual for each of these questions is 1 - 0.25 = 0.75, so altogether for the six questions it’s 6 × 0.75 = 4.5.

There is one question with five answer choices. There the maximum residual is 0.8. And there is one question with three choices, where it is 0.67. So altogether, the theoretical maximum residual a country can have with itself is 4.5 + 0.8 + 0.67 = 5.97.
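As a quick check of this bound, the sketch below sums 1 - 1/k over the eight questions, where k is the number of answer options. The function name is made up for the illustration.

```python
def max_self_residual(option_counts):
    """Upper bound on a country's residual with itself.

    With k options the top choice has a share of at least 1/k,
    so each question contributes at most 1 - 1/k.
    """
    return sum(1 - 1 / k for k in option_counts)

# Six questions with four options, one with five, one with three.
option_counts = [4] * 6 + [5, 3]
bound = max_self_residual(option_counts)
print(round(bound, 2))  # 5.97
```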

Indeed, in the matrix the maximum of such residuals is 4.92, the mean is 4.30 and the minimum is 3.29. By contrast, if we look at each country’s residual with other countries only, the minimum is 10.28, the mean is 20.85 and the maximum is 34.11. So the within-country residuals are small compared to the between-country residuals.

Anyway, this is just a methodological choice I made, and there are other options. One could, for instance, simply ignore the eight type 2 and type 3 questions and calculate the residuals with the 122 type 1 questions only. Such alternative versions are available upon request.

As for normalization, it is done the usual way, that is for every element i in the matrix I calculate

$x^n_i = \frac{x_i - x_{min}}{x_{max}-x_{min}},$

where x^n is the normalized value, x is the non-normalized value and x_min and x_max are the minimum and maximum values x can attain, respectively. The minimum value is clearly 0. This is true theoretically, and also in practice, at least if we artificially normalize each country’s residual with itself to 0. For the maximum value there are two candidates: the theoretical maximum and the practical maximum.

The theoretical maximum residual is 130. This is the case if two countries completely disagree on all questions. In practice, however, the maximum is only 34.11. So one can normalize either way. I choose the theoretical maximum.
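To make the difference between the two choices concrete, here is a small sketch of the formula above. The `normalize` helper is a name chosen for this example; the defaults encode the minimum of 0 and the theoretical maximum of 130.

```python
def normalize(x, x_min=0.0, x_max=130.0):
    """Min-max normalization: map x from [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

largest_observed = 34.11  # largest between-country residual in the matrix

# Theoretical maximum (130): the most distant pair lands well below 1.
print(round(normalize(largest_observed), 4))  # 0.2624

# Practical maximum (34.11): the most distant pair is exactly 1.
print(normalize(largest_observed, x_max=34.11))  # 1.0
```

With the theoretical maximum, all values in the matrix therefore fall in roughly the lowest quarter of the 0 to 1 range.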