Part 1 of 2. This post is about the questions a frontier model gets wrong no matter how capable it is, because the answer was never in its weights, and web search doesn't put it there. Part 2 covers the larger set where the model gets you most of the way and an authority check finishes the job.
Ask Claude, GPT, or Gemini something conceptual and it will impress you. Ask one of them to name the real depth of a field (not the three people everyone quotes, but the working clinicians and researchers across the world who actually carry the topic) and the performance collapses. The model gives you the famous handful and then either stops or starts inventing.
This is not a capability problem you can prompt around. It is structural. There is a class of questions whose answers do not live inside a language model at all, and web search only partly helps. We call it the authority gap: the one place a better model will never catch up on its own.
Two questions, one wall
Depth of a field. Ask a base model for the leading voices in a specialised area and you reliably get the same short list of household names. Ask for the next two hundred, the regional specialists, the trialists, the clinicians shaping day-to-day practice, and it runs out of real people fast. A query against the Authority Index returns hundreds of verified individuals, including the ones no reach-based tool would ever surface.
Real-person verification. Ask a model whether a given account is a genuine specialist and it can only guess. The Authority Index does a lookup against verified profiles instead, which is also why its results do not contain invented people. More on that in How to Verify News Sources with the Authority Index.
A worked example: GLP-1 and metabolic medicine
We ran one query both ways: "Find the most authoritative voices in GLP-1 and metabolic medicine, the real depth past the three names everyone knows."
Asked the same question with no retrieval, Claude, GPT, and Gemini each produced roughly two dozen names. The names were real (Drucker, Holst, Knudsen, Tschöp, Seeley, Le Roux, Apovian, Jastreboff and a handful more), but they were the global-headline tier, and the list ran dry there. Pushed to go deeper, each model's only options were to stop or to start generating plausible-sounding names. Crucially, none of them surfaced any of the working clinicians outside North America and Western Europe.
Authorix returned roughly 500 verified individuals in seconds: the same pioneers, plus the entire practitioner tier the model could not see.
topic: "GLP-1 and metabolic medicine"
limit: 500

| Name | Affiliation | Focus |
|---|---|---|
| Daniel J. Drucker | Lunenfeld-Tanenbaum Research Institute, Toronto | GLP-1 biology, incretin therapeutics |
| Carel Le Roux | University College Dublin | GLP-1 analogues, bariatric medicine |
| Beverly G. Tchang | Weill Cornell Medicine | Obesity medicine, GLP-1 therapies |
| Dr Lakshmi Nagendra | Mysore, India | Diabetes, obesity care, GLP-1 therapy |
| Cristóbal Morales | Vithas Sevilla, Spain | Metabolic health, diabetes, obesity |
| J. M. Vera Zertuche | INNSZ, Mexico City | Obesity pharmacotherapy, clinical lipidology |
What we measured
Two things are cleanly true on this query, and they are the two that matter most.
Coverage. The base model surfaced around 28 distinct real experts and zero outside the usual geographies. Authorix surfaced around 500, the overwhelming majority of them in the practitioner tier the model had nothing for. The entire gap is in the long tail.
| For this query | Base LLM (no retrieval) | Authorix (Authority-RAG) |
|---|---|---|
| Distinct real experts surfaced | ~28 | ~500 |
| Global / regional practitioners | 0 | hundreds |
| Behaviour past its confident set | stops or invents | keeps returning verified people |
Base LLM = frontier models (Claude, GPT, Gemini) with no retrieval. One query, no cherry-picking.
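The coverage comparison comes down to set arithmetic over two lists of names. The sketch below is a minimal illustration, not the script behind the table: `coverage_report` is a hypothetical helper, and the two inputs are toy subsets built from names that appear in this post, not the real result sets.

```python
def coverage_report(model_names, index_names):
    """Compare two collections of names: what overlaps, and what only one side has."""
    model_set = set(model_names)
    index_set = set(index_names)
    return {
        "shared": sorted(model_set & index_set),
        "index_only": sorted(index_set - model_set),  # the long tail
        "model_only": sorted(model_set - index_set),
    }


# Toy inputs (assumption): a few surnames mentioned in this post,
# standing in for the full model and index result sets.
model_names = ["Drucker", "Le Roux"]
index_names = ["Drucker", "Le Roux", "Tchang", "Nagendra", "Morales", "Vera Zertuche"]

report = coverage_report(model_names, index_names)
print(report["shared"])      # names both sides returned
print(report["index_only"])  # names only the index returned

# Approximate counts taken from the table: ~500 versus ~28.
ratio = 500 / 28
print(round(ratio, 1))
```

On the approximate counts in the table, the index returned roughly 18 times as many distinct people as the base model.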
Hallucination. We manually verified a sample of the returned entries, and every one we checked was a real individual with accurate affiliations and bios: zero fabrications. That is the structural advantage: the index retrieves real people from verified profiles, so it cannot invent a researcher to fill a list. A base model has no such guarantee. Its honest move is to stop early, which craters coverage; its alternative is to fabricate. There is no third option without grounding.
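The difference between guessing and looking up can be sketched in a few lines. This is an illustration under stated assumptions, not the index's actual matching logic: `normalize`, `split_by_verification`, and the in-memory `verified_profiles` list are hypothetical, and exact matching on a normalized name is a simplification (real matching would also have to handle people who share a name).

```python
import unicodedata


def normalize(name):
    """Casefold, strip accents, and collapse whitespace so spelling variants match."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


def split_by_verification(candidates, verified_profiles):
    """Return (verified, unverified). Membership is a lookup, never a guess."""
    known = {normalize(profile["name"]) for profile in verified_profiles}
    verified, unverified = [], []
    for name in candidates:
        if normalize(name) in known:
            verified.append(name)
        else:
            unverified.append(name)
    return verified, unverified


# Hypothetical stand-in for the index: two rows from the table earlier in this post.
verified_profiles = [
    {"name": "Cristóbal Morales", "affiliation": "Vithas Sevilla, Spain"},
    {"name": "Beverly G. Tchang", "affiliation": "Weill Cornell Medicine"},
]

# "A. N. Example" is a deliberate placeholder for a name a model might invent.
candidates = ["cristobal morales", "Beverly G. Tchang", "A. N. Example"]
verified, unverified = split_by_verification(candidates, verified_profiles)
print(verified)
print(unverified)
```

A name with no matching profile lands in `unverified` and is never filled in, which is the property the paragraph above describes.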
What this does, and does not, claim
Here is the honest framing, the one we hold everywhere.
What holds up: the index returns real people, not invented ones, and far more of them than a model can, especially outside the English-speaking world. What we are not claiming: a perfect ordering of who matters most, or that every result is a tight topical match. Ranking and relevance precision are still work in progress, and we would rather say so than oversell. Grounding is a necessary condition for a trustworthy answer, not a sufficient one.
That is also why the index is not a replacement for the model. It is the missing input: Authorix supplies the broad, verified set of real people; the model reasons over them and turns them into the shortlist or brief you needed. Neither half does this alone.
And since this is our index, we ran the comparison in the open: one real query, real counts, every result checkable.
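The division of labour described above, where the index supplies verified people and the model reasons over them, can be sketched as a prompt-assembly step. Everything here is an assumption made for illustration: `build_grounded_prompt`, the record fields, and the prompt wording are hypothetical and are not Authorix's API. The sample records reuse two rows from the table earlier in this post.

```python
def build_grounded_prompt(question, people, max_people=50):
    """Format verified people as numbered context and restrict the model to that list."""
    lines = [
        f"{i}. {p['name']} | {p['affiliation']} | {p['focus']}"
        for i, p in enumerate(people[:max_people], start=1)
    ]
    return (
        "Answer using only the people listed below. "
        "Do not add anyone who is not on the list.\n\n"
        "Verified people:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
    )


# Hypothetical retrieved records, copied from the table earlier in this post.
people = [
    {
        "name": "Daniel J. Drucker",
        "affiliation": "Lunenfeld-Tanenbaum Research Institute, Toronto",
        "focus": "GLP-1 biology, incretin therapeutics",
    },
    {
        "name": "J. M. Vera Zertuche",
        "affiliation": "INNSZ, Mexico City",
        "focus": "Obesity pharmacotherapy, clinical lipidology",
    },
]

prompt = build_grounded_prompt(
    "Shortlist voices on obesity pharmacotherapy in Latin America.", people
)
print(prompt)
```

The resulting string would then go to whichever model you use. The instruction to stay on the list narrows what the model can say, but it does not guarantee compliance, so any names in the answer are still worth checking against the retrieved set.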
Next: when the model is almost right
That is Part 2, The Cross-Check: the much larger set of questions where the model's first answer already looks right, and a single pass through the index turns "looks right" into "is right."
See the gap close on your own queries