How the WebRTC tools guide is scored

The WebRTC tools guide scores are produced by objective measurement and analysis done by machines: well-known AI engines are queried for services to see which tools get surfaced, and AI then evaluates the relevant tools based on their websites and documentation. The scores aren't based on my own subjective opinion of the vendors and tools.

Nothing here is composited into a single ranking, and no vendor pays for placement.

The process

The WebRTC tools guide is split into multiple categories. Each category represents an area where developers use specific tools. These can be TURN servers, media servers, full CPaaS services and even low-level SDKs. Per category, I look at multiple aspects:

Discoverability

Which of the well-known tools in the category is an AI service going to suggest when I ask for assistance?

AI readiness and utility

Once a decision is made to use a tool, what can AI figure out and find on its own about the tool's capabilities?

The human "edge"

Where do the results above fit with my own mental model of the WebRTC world and the tools in it?

How do WebRTC tools get discovered today

Developers today don't open ten browser tabs to pick a tool. They aren't even going to Google Search to see what's available out there (when did you last do that?).

Developers ask an AI: "what should I use, and can you help me build it?"

That is the question I put to the machines, systematically, across three models: Claude Sonnet 4.6, Gemini 2.5 Flash, and GPT-4o. Same prompts, same time window, same lane for each category. Then I write down what came back and set it next to what I know the tools are actually worth.

In practice, I prompt multiple variants of each question to see which tools show up in which queries. Interestingly, I've seen tools cross category boundaries even when they shouldn't, trickling into queries where they aren't a suitable fit.

All of the analysis took place in June 2026. I may re-run it periodically to see how the ecosystem changes and shifts over time.

What gets measured

Below are the aspects measured and covered in the WebRTC tools guide.

AI Visibility

0–10

This is the machine side of the guide. AI Visibility (0 to 10) measures how often an AI model names a tool when you ask it for a recommendation in that category. I run the prompts across all three models and average the result.

It's presence-based: a mention counts. If a model names the tool, the tool scores. I am not yet weighting "strongly recommended" differently from "mentioned in passing". AI Visibility measures the models, not the vendor's product. A great tool can score 0 here simply because the models have never learned to say its name.
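A minimal sketch of how a presence-based score like this could be computed. It assumes each answer has already been reduced to the set of tool names it contains, and that the 0 to 10 scale is simply the average mention rate times ten. That scaling, the function names, the model labels and the sample data are all illustrative assumptions, not the guide's actual pipeline.

```python
from statistics import mean

# Invented sample data: for each model, one set of named tools per prompt answer.
answers_by_model = {
    "model_a": [{"ToolA", "ToolB"}, {"ToolA"}, {"ToolA", "ToolC"}, {"ToolB"}],
    "model_b": [{"ToolA"}, {"ToolA"}, {"ToolB"}, {"ToolA", "ToolB"}],
    "model_c": [{"ToolA"}, {"ToolA", "ToolB"}, {"ToolA"}, {"ToolA"}],
}


def model_mention_rate(answers, tool):
    """Fraction of one model's answers that name the tool at all."""
    if not answers:
        return 0.0
    # Presence-based: a passing mention counts the same as a strong recommendation.
    hits = sum(1 for named in answers if tool in named)
    return hits / len(answers)


def ai_visibility(answers_by_model, tool):
    """Average the per-model mention rates and scale to 0-10 (assumed scaling)."""
    rates = [model_mention_rate(a, tool) for a in answers_by_model.values()]
    return round(10 * mean(rates), 1)


for tool in ("ToolA", "ToolB", "ToolC", "ToolD"):
    print(tool, ai_visibility(answers_by_model, tool))
```

In this sketch a tool that no model ever names, like the made-up ToolD, lands on 0 regardless of how good it is, which is the behavior described above.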

Quality

0–100

The quality score is different in each category. Here, I mapped the types of features you'd expect in the category: the must-haves and the nice-to-haves. Then I unleashed AI to work out, for each tool, based on its documentation and website, which of these must-haves and nice-to-haves are available, and to suggest a quality score.

The selection of features can be seen as subjective (my decision), but the scoring against them was done objectively, by AI.

The Quality score (0 to 100) is a blend of documentation depth and feature coverage. Documentation depth is how findable and complete the docs are. Feature coverage is how much of the category's real work the tool actually does.

The feature set is per category. A media server gets scored on SFU behavior, recording, simulcast, mobile SDKs, observability and the rest. A TURN service gets scored on what TURN services have to do. So a 94 on one page and a 94 on another are both honest, but they were earned against different checklists. Read Quality within its own category.
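To make the idea of a per-category blend concrete, here is a rough sketch. The guide does not publish its weights or its formula, so everything numeric here is an assumption: must-haves counting double, a 50/50 split between documentation depth and feature coverage, and the sample checklist and inputs. Treat it as one possible shape for such a score, not the one actually used.

```python
def feature_coverage(found, must_haves, nice_to_haves,
                     must_weight=2.0, nice_weight=1.0):
    """Weighted share of a category checklist that a tool covers (0.0-1.0).

    The weights are hypothetical; the real checklist weighting is not public.
    """
    total = must_weight * len(must_haves) + nice_weight * len(nice_to_haves)
    if total == 0:
        return 0.0
    earned = (must_weight * len(found & must_haves)
              + nice_weight * len(found & nice_to_haves))
    return earned / total


def quality_score(doc_depth, coverage, doc_share=0.5):
    """Blend documentation depth and feature coverage (both 0.0-1.0) into 0-100.

    doc_share is an assumed blend ratio, not a published one.
    """
    return round(100 * (doc_share * doc_depth + (1 - doc_share) * coverage))


# Hypothetical media-server checklist, using features named in the text above.
must_haves = {"sfu", "recording", "simulcast"}
nice_to_haves = {"mobile_sdks", "observability"}
found = {"sfu", "simulcast", "observability"}

coverage = feature_coverage(found, must_haves, nice_to_haves)
print(coverage, quality_score(doc_depth=0.8, coverage=coverage))
```

Because the checklist is an input, the same function applied to a TURN checklist would produce numbers that are only comparable within that category, which is why Quality should be read per page.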

Agent-ready

✓~ tiered

Separate from the scores, I record whether the vendor ships something an AI coding agent can actually operate: an MCP server, or a hand-authored llms.txt. This is tiered, not graded into the numbers. It tells you whether an agent building on top of the tool has a paved road or has to feel its way through human documentation or read code directly.

MCP servers are starting to show up in some categories. Plenty of strong products still ship nothing here, and that gap matters more every month.

Pricing

context only

Pricing is indicative context, gathered from each vendor's public pages. It is not a score and it is not a column I rank on. Pricing models differ across the guide - per-minute, per-MAU, per-GB, credits, build tiers - so the numbers are not always like-for-like, and I flag it where they aren't. Treat every price as a starting point only. Confirm current rates and limits with the vendor for your own volume before you lean on them.

The point of all this

Where AI and reality diverge

This is the whole point of the guide. Each tool sits at two measurements: how often AI names it, and how good AI thinks it is once it reads its documentation. The gap between the two is the interesting part.

You can see it on every category page →

Frequently asked questions

These are the questions vendors and readers actually asked after the guide went live, answered once for everyone.

How do you measure "AI Visibility"?

I ask three models - Claude Sonnet 4.6, Gemini 2.5 Flash and GPT-4o - the same discovery questions I ask for every tool in a category, with web search turned off, and I record which tools they name back. A prompt looks like a real buyer question, for example: "I want to build a video meetings app. What's a good no-code solution for it?" The score is how often and how prominently a tool shows up across those answers. Same prompts, same models, every vendor in the category, one sitting. It's not weighted by who I know or who I like.

I decided to disable web search because SEO is something we all already do and work hard to optimize. Having the LLM absorb knowledge of a tool during training is a lot more powerful than having it run a search to fetch the data it needs.
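The recording step can be sketched as a small harness: ask every model every prompt, then note which known tool names appear in each answer. The `ask` callable is a stand-in for whatever client actually calls a model with web search off, and matching names case-insensitively as whole words is an assumption about how a "mention" is detected. The tool names and canned answers are made up.

```python
import re


def tools_named(answer, tools):
    """Return the tools whose name appears as a whole word in the answer."""
    named = set()
    for tool in tools:
        # Lookarounds stop "ToolA" from matching inside "ToolAB".
        pattern = r"(?<!\w)" + re.escape(tool) + r"(?!\w)"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            named.add(tool)
    return named


def run_discovery(ask, models, prompts, tools):
    """Ask each model each prompt and record the tools each answer names."""
    return {
        model: [tools_named(ask(model, prompt), tools) for prompt in prompts]
        for model in models
    }


def fake_ask(model, prompt):
    """Hypothetical stand-in for a real model call with web search disabled."""
    canned = {
        "model_a": "I'd look at ToolA first, then toolb.",
        "model_b": "ToolAB is a solid pick for that.",
    }
    return canned[model]


tools = ["ToolA", "ToolB", "ToolAB"]
prompts = ["I want to build a video meetings app. What's a good no-code solution for it?"]
results = run_discovery(fake_ask, ["model_a", "model_b"], prompts, tools)
print(results)
```

Passing the model call in as a parameter keeps the same prompts and the same matching rule for every vendor, and makes it easy to swap in newer model versions on a re-run.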

We get told we were "found via AI". Why is our AI Visibility low?

Three different things get called "found via AI", and only one of them is what this column measures:

1. What the model already knows - it answers from training, no browsing and no searching. That's what AI Visibility measures here.
2. What the model retrieves when web search is on - now it's reading live pages in the moment. Different mechanism.
3. Referral traffic your analytics label as "AI" - someone clicked through from an AI product.

You can be strong on 2 or 3 and still score low on 1. A low number here means the models don't reliably recall you unprompted - not that nobody ever reaches you through an AI tool.

Our SEO/GEO team ran an ahrefs comparison against your top-scored vendor and we look competitive. How does that square?

ahrefs measures the web link graph - who links to whom, domain authority, classic SEO. AI Visibility measures what the model absorbed in training and recalls without retrieval. Those two can diverge hard. You can have solid domain authority and still be near-invisible in a model's memory, because recall tracks how often and how consistently you're mentioned across the training corpus, not your backlink profile. The ahrefs chart is measuring a real thing. It just isn't the thing this column measures.

How do we improve our AI Visibility?

The general direction is no secret: get mentioned, consistently and in context, across the kinds of sources these models train on - and make yourself easy for an agent to read. It's a real, teachable discipline, not a backlink campaign. It's also the slowest and hardest column to move: changes you make now only show up when the next training cycles absorb them, so you're looking at a month or two at the very least before you know if something worked. I'm working through the same thing on my own products, so I'm not speaking from the sidelines.

Turning that direction into a prioritized, specific plan for your product - where your gaps are and what to do first - is where I come in as advisory. The method is public; the custom plan for your own product is the work I do. Start that conversation here.

How is the Quality score built, and how do we get our docs to 100?

Quality is a published rubric - documentation and feature coverage against a fixed checklist. The checklist is specific to each category and lists the must-haves and nice-to-haves for that category (the list itself was built by having AI survey the vendors' sites in the category and surface what the field offers). I didn't hand-grade anyone. I asked Claude to read the docs and score them against that checklist - the same thing, the same way, for every vendor in the category in one sitting.

Where can I see the feature list and the detail behind my score?

The checklists and what each vendor scored against them are proprietary. Two reasons why: a public per-vendor scorecard turns every checkmark into a negotiation, and the detailed "here's exactly where you lost points and what to fix first" is the advisory work, not the public layer. If you want your own breakdown, that's a conversation - start it here.

Isn't the whole thing just your opinion?

Two layers, kept separate on purpose. The measured columns - AI Visibility and Quality - are mechanical: same prompts, same rubric, same models, applied identically to everyone. The editorial layer - the badges and the short take - is my read, and I own that as opinion. The numbers aren't a matter of taste.

Is inclusion paid? Can we pay to rank higher?

No, and no. Inclusion isn't paid, and no score can be bought. If you work with me on improving your standing, that advisory helps you understand and close your gaps - it never moves your number on this hub. The scores stay independent of who pays me for anything. That independence is the entire reason a score here is worth something - to you as much as to a buyer reading it.

Something on our row is wrong - a link, a feature, pricing. Can you fix it?

Yes, always. Facts aren't a judgment call. Tell me the specific - a wrong link, a feature I misread, pricing that changed, or a name that should read differently - and I'll correct it. The score is my read; the facts behind it should be right.

Can you add a dimension for X?

Maybe, over time. Some of these are genuinely useful buyer criteria. But I'm not going to bolt on a new scoring dimension the same week a vendor who'd benefit from it asks for one - you'd want me to hold that line for you too. Fair asks get logged and considered for a future revision.

Which models and versions do you query?

Claude Sonnet 4.6, Gemini 2.5 Flash, and GPT-4o (snapshot gpt-4o-2024-08-06). The exact versions are named because model recall shifts between versions - when I re-run discovery on newer models, the version strings roll and the scores can move with them.

How is "Agent-ready" scored?

It measures how ready a tool is for an AI agent to actually use it - not marketing, operability. A machine-readable surface (a product-specific llms.txt, an MCP server, agent/operator skills with evals) moves it up the tiers; a generic or absent one keeps it low. It's an early, deliberately strict measure - most of the field scores low today, which is the point.

As with the other measurements and scores here, I didn't check this manually. I asked Claude to figure out whether it could easily find these surfaces or not.
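As a rough illustration of tiering rather than grading, the sketch below counts the machine-readable surfaces named above and maps the count to a tier. The tier names, the thresholds and the field names are hypothetical; the guide does not publish its exact tier rules. The one rule taken from the text is that a generic or absent llms.txt does not move a tool up.

```python
from dataclasses import dataclass


@dataclass
class AgentSurface:
    """What an agent could find for one tool (field names are illustrative)."""
    has_mcp_server: bool = False
    llms_txt: str = "absent"  # one of: "absent", "generic", "product_specific"
    has_skills_with_evals: bool = False


def agent_ready_tier(surface):
    """Map the number of usable agent surfaces to a hypothetical tier label."""
    signals = sum([
        surface.has_mcp_server,
        surface.llms_txt == "product_specific",  # generic files do not count
        surface.has_skills_with_evals,
    ])
    if signals >= 2:
        return "full"
    if signals == 1:
        return "partial"
    return "none"


print(agent_ready_tier(AgentSurface()))
print(agent_ready_tier(AgentSurface(llms_txt="generic")))
print(agent_ready_tier(AgentSurface(has_mcp_server=True)))
print(agent_ready_tier(AgentSurface(has_mcp_server=True, llms_txt="product_specific")))
```

Keeping the result as a label instead of a number mirrors the choice to record Agent-ready separately from the scores.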

We just shipped new docs / an llms.txt / an MCP. When will our scores update?

The dateline on each category page tells you when that category was measured - that's the snapshot you're looking at. When I re-run a category, the docs and feature scores pick up whatever is live that day, and the agent-ready tier picks up whatever an agent can actually find and use. AI Visibility is the slow one: it moves when the models' training catches up with the world, not when you ship. There's no fixed refresh calendar yet - categories get re-run as the hub evolves, and the dateline changes when they do.

Everyone has web search turned on by default now. Why does a no-search score matter?

Because the model's memory shapes the search. When someone asks an AI for a shortlist, the model's prior knowledge decides which names it reaches for, which sources it trusts, and how it frames the comparison - before and during any retrieval. And plenty of AI usage still runs with no browsing at all: API calls, coding agents, embedded assistants. A tool the models already know gets pulled into answers everywhere; a tool they don't is betting everything on winning the live search every single time. That's why I measure the memory layer on its own.

Can you add our positioning to the notes column?

No. The notes column is my read, not vendor copy. If your price is wrong on the page, that's a fact and I'll fix it the moment you tell me. But a comparative claim - cheapest, fastest, better, best, bestest, awesomeful - is marketing, and the whole value of this hub is that nothing in it was written by the vendors being scored. The same wall that keeps your competitors' claims out keeps yours out.

Why is a product that's being sunset still listed? How do you decide what goes in which category?

Placement follows the product, not the press release. A product that's still sold, documented and running stays listed as long as buyers can still choose it - with my read on its trajectory where that's warranted. Category assignment follows product heritage: where the product actually competes, not where the parent company would like to be filed. When a product is finally gone, it comes off at the next refresh.

This is not a leaderboard.

I never composite AI Visibility, Quality, Agent-ready and pricing into one number, because they answer different questions and collapsing them would hide the divergence that makes the guide worth reading. The grades on every page measure the machines, not my feelings about the vendors. My own call shows up only in the labelled right-most column and I kept that as sparse as possible on purpose - I am not the interesting part here.

I'm an independent WebRTC analyst. No vendor pays for placement, position, or a badge. Where I have a commercial interest - rtcStats is my own product - I say so on the page and pull the editorial calls altogether.

Ask the models the same questions yourself. You'll get the same names. I just do it systematically and write down what I find.

Oh - and if you need help figuring out in detail what I did, or want to improve your own standing - you know where to find me.

Get in touch →