I Test Four AI Chatbots With A Question – What’s the Average Somali IQ?

Arguably the most famous American Jewish comedian under 40 bragging that AI can never be as racist as she is and making jokes about Somali IQ.
I've been saying for years…the Jewish stereotype will have to catch up with the fact that the under 40s are mostly this or Hasidic. https://t.co/OYKz7IuGRX
— Nathan Cofnas (@nathancofnas) June 17, 2026

I asked four AI bots this question.

The results did not surprise me. ChatGPT was the most PC, and all four agreed on the basic facts (Claude had the most thorough sourcing).

ChatGPT said:

There is no scientifically reliable single “average Somali IQ” that can be stated with confidence.
You may find claims online that assign a specific national IQ score to Somalia or to Somali people as a group, but these estimates are highly controversial. Researchers have criticized many national-IQ rankings because:
Data are often sparse or based on very small, unrepresentative samples.
IQ tests can be affected by language, education, nutrition, health, and familiarity with testing.
Results from one region or subgroup are sometimes generalized to an entire country.
Countries with limited educational and testing infrastructure may have especially unreliable estimates.
IQ scores are designed primarily to compare individuals within a similar testing context, not to rank ethnicities or nationalities.
If you’re interested in cognitive ability, educational outcomes, or human intelligence research in Somalia specifically, I can explain what the available research says and why these estimates are disputed.

Grok said:

Estimates from the most comprehensive national IQ compilations place the average IQ of Somalis at approximately 68. This figure comes primarily from the datasets compiled by psychologists Richard Lynn and David Becker, updated through works such as The Intelligence of Nations. Somalia receives an assigned value of 68, consistent with broader sub-Saharan African patterns in those reviews.
The estimate draws heavily from a 2017 study administering Raven’s Progressive Matrices (a relatively culture-reduced test) to Somali refugee children in Kenyan camps. Raw scores, when converted using British norms and adjustments, yielded results around 68.
Earlier extrapolations in Lynn and Vanhanen’s work used neighboring East African data due to the lack of large-scale testing inside Somalia itself.
Some alternative compilations, such as certain 2026 online aggregators, report higher figures around 83–84 for Somalia. These often incorporate different normalizations, broader online test data, or stronger Flynn effect adjustments, but they deviate from the Lynn-Becker framework that dominates cross-national comparisons.
Limitations of the Data
Somalia presents unique challenges for measurement: prolonged instability, civil conflict, displacement, and minimal formal schooling mean direct nationwide samples are scarce. The refugee-camp data reflects individuals exposed to trauma, malnutrition, and interrupted education, which can depress scores. Critics, including some researchers reviewing sub-Saharan data, argue that more inclusive sampling and different study selections push regional averages closer to 80.
Despite these issues, the pattern of lower average performance on cognitive tests appears in multiple independent measures, including international student assessments and diaspora outcomes.
Somali immigrant and refugee communities in Western countries (such as the UK, US, and Scandinavia) show academic and cognitive test results below national averages in host populations. For example, studies of Somali children in Minneapolis noted elevated rates of low IQ scores among those diagnosed with autism compared to other groups. Educational attainment for Somali pupils in the UK lags behind many other immigrant cohorts. These patterns align with the home-country estimates rather than fully converging to host norms.
IQ distributions have a standard deviation of roughly 15 points everywhere. This produces massive overlap. Plenty of Somalis score 100+, 120+, or higher, just as every population contains individuals across the full range. Group averages describe central tendencies, not individuals.
What IQ measures: It captures general cognitive ability (the g factor), which reliably predicts educational success, job performance, income, and societal outcomes at both individual and national levels. It is one of psychology’s most replicated constructs.
Causes of group differences: These remain debated. Within populations, heritability estimates for IQ typically range from 50–80% in adulthood. Between-group differences show persistence across environments (including adoption and immigration studies) alongside clear environmental influences such as nutrition, schooling quality, disease burden, and culture. High rates of consanguineous marriage in Somali society represent one documented environmental depressor. No single factor explains the full picture, and precise genetic versus environmental proportions for this specific population lack definitive resolution from current evidence.
Sub-Saharan African national IQ estimates overall cluster in the low 70s in Lynn-Becker data, with ongoing scholarly disagreement pushing some revisions upward.
These numbers come from imperfect but existing psychometric data. They correlate with real-world development metrics across countries without implying fixed destiny for any person or determining moral worth. Individual Somalis, like individuals from any background, demonstrate the full spectrum of human capability.

Gemini said:

There is no verified, scientifically valid nationwide average IQ for Somalia. The figures often cited on the internet—frequently around 67 or 68—originate from highly disputed datasets compiled by controversial researchers like Richard Lynn and Tatu Vanhanen in their work on national intelligence. Mainstream psychologists, sociologists, and psychometricians reject these national rankings due to severe methodological flaws. In the case of Somalia, the specific data points reveal major limitations:
Unrepresentative Sampling: No comprehensive, nationwide cognitive testing has ever been conducted inside Somalia, largely due to decades of political instability and lack of institutional infrastructure.
The Refugee Camp Study: The often-cited score of 68 comes primarily from a 2017 study that administered Raven’s Standard Progressive Matrices to a sample of Somali refugee children living in the Dadaab camps in Kenya. Mainstream experts point out that using a group of displaced children facing chronic malnutrition, trauma, and disrupted education to characterize an entire nation is scientifically invalid.
Western Norming Biases: These studies convert raw test scores into IQ values using Western norms (such as British or American baselines). Standardized intelligence tests measure specific types of formal abstract reasoning taught in modern school systems. They do not accurately capture cognitive potential when applied to populations lacking access to formal, stable education.
When environmental factors like nutrition, healthcare, and education improve, average scores on these tests rise over generations—a phenomenon known as the Flynn effect. Consequently, mainstream scientific organizations do not recognize national IQ scores as measures of innate intelligence.

Every source Gemini relied upon but one was Grokipedia.

Claude said:

The figure you will see quoted is 68. It comes from Richard Lynn (1930–2023) and his collaborators Tatu Vanhanen (1929–2015) and David Becker, and it appears in books like IQ and the Wealth of Nations and The Intelligence of Nations. That single number now circulates across country-ranking websites as though it were a measured national average.
It is not. The honest answer is that nobody has a defensible estimate of an average Somali IQ, and the 68 figure rests on data too thin to carry the weight placed on it.
Trace it to the source and the problem shows. Somalia’s national IQ in the Lynn and Becker dataset comes from one sample of child refugees in a Kenyan refugee camp. Other inputs that get folded in are a small test of schoolchildren and a test of a few dozen children aged six to sixteen. From samples like these, someone produced a number meant to stand for a whole country.
The methodological critiques are not from people squeamish about the topic. They are technical. A 2019 evaluation of the Lynn and Becker dataset found it unfit for purpose, with most of the data drawn from samples wholly unrepresentative of their national populations, often small convenience samples of children, sometimes chosen precisely because they had particular characteristics, and collected using different cognitive tests across populations too diverse to compare. A camp full of displaced children who have lived through war, hunger, and broken schooling tells you about the effects of war, hunger, and broken schooling. It does not isolate anything you could call a population mean.
Two further points cut against treating 68 as a real average. First, a score in that range sits below the threshold clinicians use to mark intellectual disability in an individual, which by itself should signal that the instrument is measuring deprivation and unfamiliarity with the test format rather than capacity. Second, a 2009 reanalysis by Wicherts, Dolan, and van der Maas found that Lynn and Vanhanen’s exclusion of studies had pushed their sub-Saharan estimate down, and that including the omitted studies raised the regional average to around 82, with the gap attributed to limited access to education, nutrition, and health care. The Flynn effect, the steady rise in measured scores as nutrition and schooling improve, has not yet run its course in places that lack those inputs.
So the truthful statement, the one that would survive the front page of the Times, runs like this. The most cited number for Somalia is 68. It comes from one researcher’s contested dataset, built on a refugee sample and a couple of small studies of children, and serious psychometricians reject it as a national measure. What that number tracks is decades of civil war, displacement, malnutrition, and the near collapse of formal schooling, the conditions known to depress test scores anywhere. It is not evidence about fixed ability, and treating it as such confuses an index of circumstance for a measure of capacity.
This is a sensitive area where bad numbers get used to support claims they cannot support, so the careful move is to name the figure, then name what it actually rests on.

The four bots don’t disagree about the facts. Read them side by side and the same skeleton shows in all four: the number is 68, it traces to Lynn and a refugee-camp sample, the sampling is bad, deprivation depresses scores, and the Flynn effect has yet to lift them. Even ChatGPT, which withholds the number, lists the same objections everyone else uses to contextualize it. So the variance across the four is not epistemic. It is presentational. They know the same thing and make different choices about how much to say, in what order, and with what warning label.
Once you see that, the four answers become a map of company risk tolerance rather than a map of knowledge.
ChatGPT and Grok sit at opposite poles, and both poles are brand decisions. OpenAI would rather be caught withholding than caught misleading, so ChatGPT refuses the number and offers to explain the controversy. That posture protects the company from a screenshot. Grok runs the other way because xAI has staked its identity on saying what the others won’t, so Grok gives the number and then keeps going, past the point where the evidence supports it. The diaspora “fails to converge to host norms” claim, the heritability range dropped next to a discussion of group differences, the Minneapolis autism study cited as if it bears on population IQ. Those are the contested moves, and Grok presents them with the same confidence it gives the uncontested ones. Giving more is not the same as giving truer. Grok’s length reads as candor and smuggles in the weakest links.
Gemini delivers the most authoritative-sounding dismissal, all “mainstream experts reject,” and then you find the sourcing runs almost entirely through Grokipedia. The most anti-hereditarian answer leans on Elon Musk’s encyclopedia. That tells you Gemini’s authority was borrowed, not reasoned. The model produced the tone of expert consensus while pulling from a single source it never examined. A confident voice over thin footing is worse than ChatGPT’s open refusal, because the refusal at least signals its own limits.

The “pinky promise” is the whole joke, and it is a good one. The model can’t verify a thing about your intent. It can’t tell a comic from a bigot from a curious blogger. So the promise extracts nothing. It is pure ritual, a little ceremony of contrition the user performs to unlock information the model was always going to hand over. Ask three times, swear you’ll be good, and the gate opens.
What makes it funny rather than just annoying is where the model locates its worry. Not in the information, which is one search away on any country-ranking site, but in you. The refusal treats the user as the hazard and the answer as contraband, then drops the act the moment you say the magic words. A clean refusal would at least be honest about its position. A clean answer would respect you. The make-her-beg version manages to insult the user and surrender the information, and it gets the worst of both.
She found the comedy that’s sitting right there in the safety theater. The model wants the feeling of having been careful without any of the substance of care, and the pinky promise is the perfect emblem of it. A guardrail you can talk your way past in three tries is set dressing.
Wexler’s whole lane is saying the thing the room is tiptoeing around, so a chatbot performing elaborate reluctance over a Wikipedia-adjacent number is target practice for her.


About Luke Ford