Part 1 of 2. This post is about the questions a frontier model gets wrong no matter how capable it is, because the answer was never in its weights, and web search doesn't put it there. Part 2 covers the larger set where the model gets you most of the way and an authority check finishes the job.

Ask Claude, GPT, or Gemini something conceptual and it will impress you. Ask one of them to name the real depth of a field (not the three people everyone quotes, but the working clinicians and researchers across the world who actually carry the topic) and the performance collapses. The model gives you the famous handful and then either stops or starts inventing.

This is not a capability problem you can prompt around. It is structural. There is a class of questions whose answers do not live inside a language model at all, and web search only partly helps. We call it the authority gap: the one place a better model will never catch up on its own.

Two questions, one wall

Depth of a field. Ask a base model for the leading voices in a specialised area and you reliably get the same short list of household names. Ask for the next two hundred, the regional specialists, the trialists, the clinicians shaping day-to-day practice, and it runs out of real people fast. A query against the Authority Index returns hundreds of verified individuals, including the ones no reach-based tool would ever surface.

Real-person verification. Ask a model whether a given account is a genuine specialist and it can only guess. The Authority Index does a lookup against verified profiles instead, which is also why its results do not contain invented people. More on that in How to Verify News Sources with the Authority Index.
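
The difference between guessing and looking up can be sketched in a few lines. This is a minimal illustration, not the Authority Index API: `VerifiedProfile`, `VERIFIED`, and `verify_account` are hypothetical names, and the single entry is a placeholder rather than a real profile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerifiedProfile:
    handle: str
    name: str
    focus: str


# Hypothetical in-memory stand-in for an index of verified profiles.
# The entry below is a placeholder, not a real person.
VERIFIED = {
    "dr_example": VerifiedProfile("dr_example", "Example Clinician", "obesity medicine"),
}


def verify_account(handle: str) -> Optional[VerifiedProfile]:
    """Return the verified profile for a handle, or None when no record exists."""
    # Normalise the handle so " Dr_Example " and "dr_example" hit the same key.
    return VERIFIED.get(handle.strip().lower())
```

The point of the sketch is the return type: a lookup yields either a stored record or nothing, so an unknown account produces "no record" rather than a plausible-sounding guess.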

A worked example: GLP-1 and metabolic medicine

We ran one query both ways: "Find the most authoritative voices in GLP-1 and metabolic medicine, the real depth past the three names everyone knows."

Frontier models like Claude, GPT, and Gemini, asked the same question with no retrieval, each produced roughly two dozen names. They were real (Drucker, Holst, Knudsen, Tschöp, Seeley, Le Roux, Apovian, Jastreboff and a handful more), but they were the global-headline tier, and the list ran dry there. Pushed to go deeper, a model's only options are to stop or to start generating plausible-sounding names. Crucially, none of the models surfaced any of the working clinicians outside North America and Western Europe.

Authorix returned roughly 500 verified individuals in seconds: the same pioneers, plus the entire practitioner tier the model could not see.

Claude, with Authorix connected

You: Find the most authoritative voices in GLP-1 and metabolic medicine, the real depth past the three names everyone knows.

Authorix index tool call: topic: "GLP-1 and metabolic medicine", limit: 500

Authorix returned a breadth sample:

| Name | Affiliation | Focus |
| --- | --- | --- |
| Daniel J. Drucker | Lunenfeld-Tanenbaum Research Institute, Toronto | GLP-1 biology, incretin therapeutics |
| Carel Le Roux | University of Cape Town | GLP-1 analogues, bariatric medicine |
| Beverly G. Tchang | Weill Cornell Medicine | Obesity medicine, GLP-1 therapies |
| Dr Lakshmi Nagendra | Mysore, India | Diabetes, obesity care, GLP-1 therapy |
| Cristóbal Morales | Vithas Sevilla, Spain | Metabolic health, diabetes, obesity |
| J. M. Vera Zertuche | INNSZ, Mexico City | Obesity pharmacotherapy, clinical lipidology |

...and ~490 more, each with a verified profile and bio.

Past the handful of names that dominate every GLP-1 article, the field has real depth, and most of it is regional. The model could name the pioneers; it could not produce the working clinicians across India, Spain, Mexico, Saudi Arabia, Turkey and Japan who carry the topic in practice. Asked to reach that far on its own, it would have started inventing.

What we measured

Two things are cleanly true on this query, and they are the two that matter most.

Coverage. The base model surfaced around 28 distinct real experts and zero outside the usual geographies. Authorix surfaced around 500, the overwhelming majority of them in the practitioner tier the model had nothing for. The entire gap is in the long tail.

| For this query | Base LLM (no retrieval) | Authorix (Authority-RAG) |
| --- | --- | --- |
| Distinct real experts surfaced | ~28 | ~500 |
| Global / regional practitioners | 0 | hundreds |
| Behaviour past its confident set | stops or invents | keeps returning verified people |

Base LLM = frontier models (Claude, GPT, Gemini) with no retrieval. One query, no cherry-picking.

Figure: Experts surfaced for one query, GLP-1 and metabolic medicine (distinct real experts, higher = better). Globally-known names: Base LLM 28, Authorix 28. Regional long-tail practitioners: Base LLM 0, Authorix 470.
Both the model and Authorix return the globally-known names. The authority gap is entirely the regional, long-tail practitioner layer.
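
The arithmetic behind that statement is short enough to spell out. The sketch below uses the counts reported for this one query (28 globally-known names for both systems; 0 versus 470 regional practitioners), which the post gives as approximate figures; the dictionary layout and function names are illustrative only.

```python
# Counts reported in this post for a single query; treat them as approximate.
counts = {
    "base_llm": {"global": 28, "regional": 0},
    "authorix": {"global": 28, "regional": 470},
}


def total(system: str) -> int:
    """Total distinct experts surfaced by one system."""
    return sum(counts[system].values())


def long_tail_share(system: str) -> float:
    """Fraction of a system's results that come from the regional long tail."""
    t = total(system)
    return counts[system]["regional"] / t if t else 0.0


# The difference in totals (498 - 28 = 470) equals the regional count,
# because the globally-known tier is identical for both systems.
gap = total("authorix") - total("base_llm")
```

With these figures, roughly 94% of the indexed results sit in the regional tier, and the gap between the two totals is exactly that tier.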

Hallucination. We manually verified a sample of the returned entries, and every one we checked was a real individual with accurate affiliations and bios: zero fabrications. That is the structural advantage: the index retrieves real people from verified profiles, so it cannot invent a researcher to fill a list. A base model has no such guarantee. Its honest move is to stop early, which craters coverage; its alternative is to fabricate. There is no third option without grounding.

Figure: Fabricated people in the answer. Authorix (sampled): 0, no invented people. Base LLM (forced to depth): not scored, declined to pad.
Authorix retrieves from verified profiles; sampled entries contained no invented individuals. A base model forced past its confident set has no such floor: in our run it declined to pad rather than invent, trading hallucination for a collapse in coverage.

What this does, and does not, claim

Here is the honest framing, the one we hold everywhere.

What holds up: the index returns real people, not invented ones, and far more of them than a model can, especially outside the English-speaking world. What we are not claiming: a perfect ordering of who matters most, or that every result is a tight topical match. Ranking and relevance precision are work in progress, and we would rather say so than oversell. Grounding is a necessary condition for a trustworthy answer, not a sufficient one.

That is also why the index is not a replacement for the model. It is the missing input: Authorix supplies the broad, verified set of real people; the model reasons over them and turns them into the shortlist or brief you needed. Neither half does this alone.
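
That division of labour can be sketched as a retrieve-then-reason pipeline. Everything here is an assumption for illustration: `retrieve_verified` is a hypothetical stand-in for the index call (it returns three rows copied from the breadth sample earlier in this post), `pick` stands in for the model's reasoning step, and `grounded_shortlist` is one possible guard, not a description of how Authorix works internally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Person:
    name: str
    affiliation: str
    focus: str


def retrieve_verified(topic: str, limit: int) -> List[Person]:
    """Hypothetical stand-in for the index call; `topic` is ignored by this stub."""
    sample = [
        Person("Dr Lakshmi Nagendra", "Mysore, India",
               "Diabetes, obesity care, GLP-1 therapy"),
        Person("Cristóbal Morales", "Vithas Sevilla, Spain",
               "Metabolic health, diabetes, obesity"),
        Person("J. M. Vera Zertuche", "INNSZ, Mexico City",
               "Obesity pharmacotherapy, clinical lipidology"),
    ]
    return sample[:limit]


def grounded_shortlist(
    candidates: List[Person],
    pick: Callable[[List[Person]], List[Person]],
) -> List[Person]:
    """Let `pick` (the model step) choose, then keep only retrieved people."""
    allowed = set(candidates)
    # Anything the model step adds that was not retrieved is dropped.
    return [p for p in pick(candidates) if p in allowed]
```

A call such as `grounded_shortlist(retrieve_verified("GLP-1 and metabolic medicine", limit=500), pick)` keeps the shortlist a subset of what was retrieved, so the reasoning step can reorder and filter but cannot add a name the index did not return.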

And since this is our index, we ran the comparison in the open: one real query, real counts, every result checkable.

Next: when the model is almost right

That is Part 2, The Cross-Check: the much larger set of questions where the model's first answer already looks right, and a single pass through the index turns "looks right" into "is right."

See the gap close on your own queries

$5 in free credits. No credit card required.