Have you ever wondered which country fits your cultural and other values best? Where do people think the most like you? Even if you never thought about these questions, you might be interested in the answer. If yes, do read on.

In short, based on the World Values Survey’s questions and results, I developed an online questionnaire that compares your answers with what people all over the world responded to the very same questions. Without further ado, here is the link to the questionnaire. Enjoy. Below I detail the methodology behind the questionnaire for those interested.

I took the micro datasets from Wave 5 and Wave 6 of the World Values Survey, conducted in 2005-2009 and 2010-2014, respectively. I identified the questions that appear in both waves and dropped every question that was missing from either one. I also compared the countries covered by the two waves (they are not the same) and kept from Wave 5 only those countries that were not in Wave 6. Additionally, I dropped Hong Kong and Kuwait because too many questions were missing for them.
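One possible reading of this data preparation step is sketched below in Python. The function name `merge_waves`, the dictionary layout, and the toy numbers are assumptions for illustration only; this is not the code used to build the questionnaire. The sketch also assumes that a country present in both waves is represented by its Wave 6 data.

```python
def merge_waves(wave5, wave6, excluded=("Hong Kong", "Kuwait")):
    """Combine two waves given as {country: {question: answer}}."""
    # Questions asked in each wave (union over that wave's countries).
    questions5 = set().union(*(answers.keys() for answers in wave5.values()))
    questions6 = set().union(*(answers.keys() for answers in wave6.values()))
    common = questions5 & questions6

    merged = {}
    # Wave 6 is processed first, so Wave 5 only adds countries Wave 6 lacks.
    for wave in (wave6, wave5):
        for country, answers in wave.items():
            if country in merged or country in excluded:
                continue
            merged[country] = {q: a for q, a in answers.items() if q in common}
    return merged


# Made-up toy data with hypothetical question labels, not real survey values.
wave5 = {"Country A": {"Q1": 2.0, "Q2": 3.0}, "Country B": {"Q1": 1.0, "Q2": 4.0}}
wave6 = {"Country B": {"Q1": 1.5, "Q3": 2.0}, "Kuwait": {"Q1": 3.0, "Q3": 1.0}}
merged = merge_waves(wave5, wave6)
print(merged)
```

In the toy data only `Q1` is asked in both waves, so it is the only question that survives the merge.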

The main idea of the online questionnaire is to have a respondent answer all the questions and then compare her answers to the (mean) answers of each country. For each question I calculate the discrepancy between the respondent and each country, and then add up the discrepancies per country. The country with the lowest total discrepancy is the most compatible one.
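The matching step can be sketched in a few lines of Python. The per-question discrepancies are taken as given here; the function name `rank_countries` and the numbers are hypothetical.

```python
def rank_countries(discrepancies):
    """Sort countries from most to least compatible.

    discrepancies maps each country to a list with one discrepancy
    per question; a lower total means a better match.
    """
    totals = {country: sum(values) for country, values in discrepancies.items()}
    return sorted(totals, key=totals.get)


# Made-up discrepancies for three hypothetical countries and three questions.
discrepancies = {
    "Country A": [0.10, 0.50, 0.30],
    "Country B": [0.20, 0.10, 0.25],
    "Country C": [0.60, 0.40, 0.90],
}
ranking = rank_countries(discrepancies)
print(ranking[0])  # the most compatible country
```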

I identified three types of questions that needed to be treated differently. A short discussion of how each one was treated follows.

**Type 1.** These are questions whose answers can be easily ranked. For instance, questions of the form “*do you agree with …*” have answers ranging from completely disagree to strongly agree. Because these answers have a natural order, taking an average of many answers makes sense. Consequently, for these questions I took country-level averages and assigned each country its average score on each question.

Then I normalized the answers. This is necessary because some questions have two answer options, while others have ten. After normalization, all answers lie between 0 and 1. Normalization follows the formula

$$x^{n}_{i} = \frac{x_{i} - x^{\min}_{i}}{x^{\max}_{i} - x^{\min}_{i}}$$

where for question *i*, *x^n* denotes the normalized score, *x* is the “raw” score and *x^min* and *x^max* are the minimum and maximum values one can attain for that question, respectively. This transformation ensures that all answers are between 0 and 1, with the minimum answer mapped to 0 and the maximum to 1.
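As a small illustration, here is the normalization in Python, assuming the usual min-max scaling (raw score minus minimum, divided by the range). The function name `normalize` and the example scales are illustrative assumptions.

```python
def normalize(raw, minimum, maximum):
    """Rescale a raw answer to [0, 1] given the question's answer range."""
    if maximum == minimum:
        raise ValueError("a question needs at least two distinct answer values")
    return (raw - minimum) / (maximum - minimum)


# A two-option question coded 1..2 and a ten-option question coded 1..10.
print(normalize(2, 1, 2))    # top answer of the two-option question
print(normalize(10, 1, 10))  # top answer of the ten-option question
print(normalize(1, 1, 10))   # bottom answer of the ten-option question
```

Both top answers map to 1 and the bottom answer maps to 0, so questions with different numbers of options become comparable.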

Now that all questions are normalized, I take the user’s normalized answer and compare it to the average answer to the same question in each country. Let us call the discrepancy between user *u*’s answer and country *j*’s average answer to question *i* the residual, written *r^u_{ij}* here. Then the residual is calculated as

$$r^{u}_{ij} = \left| x^{n}_{iu} - x^{n}_{ij} \right|$$

where the *x*’s are normalized answers. Thus the residual is merely the absolute difference between the user’s answer and the average response in a given country. The lower the residual, the closer the user’s opinion is to the country’s.
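A minimal Python sketch of the type 1 residual follows. The function name and the country averages are made up for illustration; both inputs are assumed to be normalized to the 0 to 1 range already.

```python
def type1_residual(user_answer, country_average):
    """Absolute difference between two normalized answers."""
    return abs(user_answer - country_average)


# Hypothetical normalized country averages for one question.
country_averages = {"Country A": 0.25, "Country B": 0.75}
user_answer = 0.5

type1 = {
    country: type1_residual(user_answer, average)
    for country, average in country_averages.items()
}
print(type1)
```

With these numbers the user sits exactly halfway between the two countries, so both residuals are equal.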

**Type 2.** The second type of questions are the ones whose answers cannot be ordered. For instance, if I ask “*of the following, what do you prefer?*” and the answer choices are ice cream, candy and chocolate, then it’s impossible to rank the answers, and the average score will not be meaningful. To see this, consider coding ice cream with 1, candy with 2 and chocolate with 3. If we have a country where half the population prefers ice cream (1) and half prefers chocolate (3), then the average will be 2. But the average will also be 2 in a second country where everybody prefers candy. Now, if I have a user who prefers candy, his residual (if merely compared to the average score of each country) would be 0 for both countries. But country 2 clearly likes candy much more than country 1, so such a residual would be useless.

For this reason, for questions of type 2 I calculate the percentage of people choosing each alternative in each country. The residual is then defined as one minus the percentage that prefers the user’s alternative. Using the example above, in country 1, 50% prefer ice cream, 0% candy and 50% chocolate; in country 2, 0% prefer ice cream, 100% candy and 0% chocolate. If the user chooses candy, then the residual for country 1 would be one minus the percentage of candy-choosers, i.e. *1-0% = 1*. Similarly, for country 2 it would be *1-100% = 0*. So the residual is lower the more people share the user’s preference.

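More generally, if respondent *u* chooses answer *a*, then the residual between them and country *j* is

$$r^{u}_{j} = 1 - \frac{n_{aj}}{\sum_{b} n_{bj}}$$

where *n_{aj}* is the total number of people who chose option *a* in country *j*, and the sum in the denominator runs over all answer options *b*, so that the fraction is the share of country *j*’s respondents who chose *a*.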
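The ice cream example can be reproduced with a short Python sketch. The function name `type2_residual` and the sample sizes of 100 respondents per country are assumptions made for illustration.

```python
from collections import Counter


def type2_residual(user_choice, country_answers):
    """One minus the share of a country's respondents who chose user_choice."""
    if not country_answers:
        raise ValueError("no answers recorded for this country")
    counts = Counter(country_answers)
    # Counter returns 0 for an option nobody picked.
    return 1 - counts[user_choice] / len(country_answers)


# Country 1: half ice cream, half chocolate. Country 2: everybody picks candy.
country_1 = ["ice cream"] * 50 + ["chocolate"] * 50
country_2 = ["candy"] * 100

print(type2_residual("candy", country_1))
print(type2_residual("candy", country_2))
```

A candy lover gets the maximal residual with country 1 and a residual of zero with country 2, matching the worked example.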

**Type 3.** Finally, let us move on to questions of type 3. These questions are like type 2 questions, but the respondent has to choose two alternatives. So for the ice cream question, you can have a first choice of what you prefer, and a second choice. Of course, I could just treat the first and second choices as separate questions and use the type 2 methodology. This would, however, have the following problem.

Consider the ice cream question with a fourth alternative, say cookies. Then suppose country A’s first choice on average is chocolate and its second is candy, while country B’s first choice is cookies and its second is ice cream. Now if we have a respondent who wants candy first and chocolate second, then, using the type 2 methodology, the residual would likely be quite high with both country A and country B. However, the respondent’s answer is clearly much more compatible with country A, which chose the same two answers but in a different order.

To remedy this problem, I again calculate what percentage of people chose each alternative, separately for first and second choices. The residual for the respondent’s first choice is then a weighted average of two type 2 residuals: one based on the fraction of people who picked the same alternative as their first choice, and one based on the fraction who picked it as their second choice. Naturally, the weight on the fraction who picked it as a first choice is higher. The same works vice versa for the respondent’s second choice.

As an example, consider a country where first choices are distributed as 70% chocolate, 30% candy, 0% others; second choices are 40% candy, 30% chocolate, 30% ice cream, 0% cookies. Then the residual for someone who chose chocolate is some weighted average of *1-0.7* and *1-0.3*. Similarly, if the respondent chooses candy, then their residual is some weighted average of *1-0.3* and *1-0.4*.

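Formally, the residual for respondent *u* preferring alternative *a* with country *j*, if the question asks for the *k*th choice (i.e. *k = 1, 2*), is

$$r^{u,k}_{j} = w_{k} \left( 1 - \frac{n^{k}_{aj}}{\sum_{b} n^{k}_{bj}} \right) + (1 - w_{k}) \left( 1 - \frac{n^{-k}_{aj}}{\sum_{b} n^{-k}_{bj}} \right)$$

where *w_k* is some weight (between 0 and 1), *n^k_{aj}* is the number of people in country *j* who chose alternative *a* as their *k*th choice, and *n^{-k}_{aj}* is the number of people in country *j* who chose alternative *a* not as their *k*th but as their other choice. That is, if *k = 1*, then with this notation *-k = 2*, and vice versa. The sums in the denominators run over all alternatives *b*, so each fraction is a share of country *j*’s respondents.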
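The sketch below applies this weighted average to the example country with 70% chocolate and 30% candy as first choices. The weight of 0.75 is purely an assumed value (the text only says the weight on the matching choice is the higher one), and the function name `type3_residual` is hypothetical.

```python
def type3_residual(choice, same_rank_shares, other_rank_shares, weight=0.75):
    """Weighted average of two 'one minus share' residuals.

    same_rank_shares:  shares of each alternative at the rank the user answered
    other_rank_shares: shares of each alternative at the other rank
    weight:            weight on the same-rank term (assumed value, above 0.5)
    """
    same = 1 - same_rank_shares.get(choice, 0.0)
    other = 1 - other_rank_shares.get(choice, 0.0)
    return weight * same + (1 - weight) * other


# Shares from the example country in the text.
first = {"chocolate": 0.7, "candy": 0.3}
second = {"candy": 0.4, "chocolate": 0.3, "ice cream": 0.3}

# User picks chocolate first: weighted average of 1-0.7 and 1-0.3.
chocolate_first = type3_residual("chocolate", first, second)
# User picks candy first: weighted average of 1-0.3 and 1-0.4.
candy_first = type3_residual("candy", first, second)
print(chocolate_first, candy_first)
```

For a second-choice answer the two share dictionaries simply swap places, for example `type3_residual("candy", second, first)`.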

**Other adjustments.** Last but not least, let me mention one more problem that had to be overcome: missing answers. In some countries, very few to no people answered certain questions, for whatever reason. If a question is missing in some country, I have three options. I could delete the whole country from the sample, but this is too extreme and would shrink the set of countries, which I find largely undesirable. I could delete the missing question from the whole data (i.e. for all countries), but this would dramatically decrease the number of remaining questions. Or I could fill in the missing answers somehow, in which case neither the country nor the question needs to be deleted. I chose this last option. Of course, one has to decide what to fill in for missing answers. I think the most natural choice is the average of all other countries’ answers to the same question, so this is what I did.
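A minimal sketch of this fill-in rule for a single question, in Python. The function name `fill_missing`, the use of `None` to mark a missing answer, and the numbers are assumptions for illustration.

```python
def fill_missing(scores):
    """Replace missing country scores with the mean of the available ones.

    scores maps country -> score for one question, with None for missing.
    """
    known = [value for value in scores.values() if value is not None]
    if not known:
        raise ValueError("no country answered this question")
    mean = sum(known) / len(known)
    return {
        country: mean if value is None else value
        for country, value in scores.items()
    }


# Made-up normalized averages; Country B did not answer the question.
scores = {"Country A": 0.25, "Country B": None, "Country C": 0.75}
filled = fill_missing(scores)
print(filled)
```

Here Country B receives 0.5, the mean of the other two countries, and both the country and the question stay in the data.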
