Verified Scores

How skill satisfaction ratings accumulate into verified model scores

Verified scores track how well each LLM performs for each skill operation. After a skill completes, users rate the result on a 1–5 scale; these ratings accumulate into per-model, per-operation scores that help you choose the best model for a given task.

How Scores Are Collected

At the end of each skill run, the workflow prompts:

How well did this skill work? (Rate 1–5, helps improve model selection)

The raw rating maps to a 20–100 internal scale (1 → 20, 2 → 40, 3 → 60, 4 → 80, 5 → 100). Each rating updates the running average for the current model and operation.
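
As a minimal sketch of the mapping and running average described above (the function names `map_rating` and `update_average` are hypothetical and are not taken from the aitasks code base):

```python
def map_rating(rating: int) -> int:
    """Map a 1-5 user rating onto the 20-100 internal scale."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    return rating * 20


def update_average(runs: int, score_sum: int, rating: int) -> tuple[int, int, float]:
    """Fold one new rating into the totals and return the new average."""
    runs += 1
    score_sum += map_rating(rating)
    return runs, score_sum, score_sum / runs


# Two ratings (5, then 4) for the same model and operation.
runs, score_sum, avg = update_average(0, 0, 5)
runs, score_sum, avg = update_average(runs, score_sum, 4)
print(runs, score_sum, avg)  # 2 180 90.0
```

Keeping a count and a sum, rather than only the average, means each new rating can be folded in without revisiting earlier ones.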

Skills that collect feedback: /aitask-pick, /aitask-explore, /aitask-explain, /aitask-changelog, /aitask-wrap, /aitask-refresh-code-models, /aitask-reviewguide-classify, /aitask-reviewguide-merge, /aitask-reviewguide-import, /aitask-web-merge.

Feedback collection is controlled by the enableFeedbackQuestions field in execution profiles. It defaults to true (enabled); set it to false to suppress the prompt.

Score Scale

Range	Label	Meaning
0	Not verified	Untested or unknown quality
1–49	Partially verified	Works but with known issues
50–79	Verified	Works well for most cases
80–100	Highly verified	Extensively tested, recommended
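
The table can be read as a simple lookup. The sketch below uses a hypothetical helper, `score_label`, and assumes that fractional averages fall into the band whose lower bound they have reached (for example, 49.5 counts as Partially verified); the source does not say how non-integer values are handled.

```python
def score_label(score: float) -> str:
    """Return the label for a verified score in the range 0-100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score!r}")
    if score == 0:
        return "Not verified"
    if score < 50:
        return "Partially verified"
    if score < 80:
        return "Verified"
    return "Highly verified"


for value in (0, 40, 60, 96):
    print(value, score_label(value))
```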

Time Windows

Scores are stored in three time-windowed buckets so recent performance is visible alongside the historical average:

Bucket	Period key	Description
`all_time`	(none)	Cumulative across all ratings
`month`	`YYYY-MM`	Current calendar month; resets when the month changes
`week`	`YYYY-Www`	Current ISO 8601 week; resets when the week changes

Each bucket tracks runs (number of ratings) and score_sum (sum of mapped scores). The all-time average is also stored in the flat verified field for backward compatibility.
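
One way to picture the bucket update is the sketch below. The bucket names, `runs`, and `score_sum` come from the description above; the `period` key name, the reset-on-mismatch logic, and the helper names `period_keys` and `record_rating` are assumptions made for illustration, not the documented schema.

```python
from datetime import date


def period_keys(today: date) -> dict:
    """Build the period key for each bucket on a given day."""
    iso = today.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        "all_time": None,
        "month": f"{today.year:04d}-{today.month:02d}",
        # Use the ISO year, which can differ from the calendar year near New Year.
        "week": f"{iso[0]:04d}-W{iso[1]:02d}",
    }


def record_rating(stats: dict, mapped_score: int, today: date) -> dict:
    """Add one mapped score (20-100) to every bucket, resetting stale windows."""
    for bucket, period in period_keys(today).items():
        entry = stats.get(bucket)
        # Start fresh if the bucket is missing or belongs to an older period.
        if entry is None or entry.get("period") != period:
            entry = {"runs": 0, "score_sum": 0}
            if period is not None:
                entry["period"] = period
        entry["runs"] += 1
        entry["score_sum"] += mapped_score
        stats[bucket] = entry
    return stats


stats = {}
record_rating(stats, 80, date(2026, 3, 15))   # Sunday, ISO week 2026-W11
record_rating(stats, 100, date(2026, 3, 16))  # Monday, week bucket resets
print(stats["all_time"])  # {'runs': 2, 'score_sum': 180}
print(stats["week"])      # {'runs': 1, 'score_sum': 100, 'period': '2026-W12'}
```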

Provider-Specific vs All-Providers

The same LLM can be available through different providers (e.g., openai/gpt-5.4 and opencode/gpt-5.4). Verified scores are stored per provider, but consumers can aggregate them across providers to show a single cross-provider view:

1. Strip the provider/ prefix from each model’s CLI ID to get the normalized model name.
2. Group entries with the same normalized name across all models_*.json files.
3. Sum runs and score_sum from matching buckets across the group.
4. For month and week, only aggregate entries with the same period value.

This aggregation is performed at read time, so no duplicate values are stored. Both ait settings and ait stats implement it.
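
The aggregation steps above can be sketched over in-memory data as follows. This is not the code used by ait settings or ait stats: the `period` key name, the input shape, and the helpers `normalize` and `aggregate` are assumptions, and the sketch treats everything before the first `/` as the provider.

```python
from collections import defaultdict

BUCKETS = ("all_time", "month", "week")


def normalize(cli_id: str) -> str:
    """Strip the 'provider/' prefix: 'openai/gpt-5.4' -> 'gpt-5.4'."""
    return cli_id.split("/", 1)[1] if "/" in cli_id else cli_id


def aggregate(entries, periods):
    """Sum runs and score_sum per normalized model name.

    entries: iterable of (cli_id, stats) pairs for one operation.
    periods: the window keys to aggregate, e.g. {"month": "2026-03"}.
    """
    totals = defaultdict(lambda: {b: {"runs": 0, "score_sum": 0} for b in BUCKETS})
    for cli_id, stats in entries:
        group = totals[normalize(cli_id)]
        for name in BUCKETS:
            bucket = stats.get(name)
            if bucket is None:
                continue
            # month/week buckets count only when their period matches.
            if name in periods and bucket.get("period") != periods[name]:
                continue
            group[name]["runs"] += bucket["runs"]
            group[name]["score_sum"] += bucket["score_sum"]
    return dict(totals)


entries = [
    ("openai/gpt-5.4", {
        "all_time": {"runs": 6, "score_sum": 540},
        "month": {"period": "2026-03", "runs": 2, "score_sum": 200},
    }),
    ("opencode/gpt-5.4", {
        "all_time": {"runs": 3, "score_sum": 240},
        "month": {"period": "2026-02", "runs": 1, "score_sum": 80},  # stale month
    }),
]
combined = aggregate(entries, {"month": "2026-03", "week": "2026-W11"})["gpt-5.4"]
print(combined["all_time"])  # {'runs': 9, 'score_sum': 780}
print(combined["month"])     # {'runs': 2, 'score_sum': 200}
```

Because only `runs` and `score_sum` are added together, the cross-provider average is weighted by how many ratings each provider contributed.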

Where Scores Appear

Settings TUI – The Agent Defaults tab shows verified score context next to each model (for example, [96 (9 runs, 2 this mo)]). The model picker opens with a Top Verified list. The Models tab shows per-operation scores with run counts and all-providers summaries.
ait stats – Prints verified model score rankings per skill, with all-providers aggregation and time-windowed display. With --plot, it renders bar charts per skill.

Storage

Scores are stored in aitasks/metadata/models_<agent>.json alongside model definitions. See the model entry schema for the full verifiedstats structure.

Last modified March 15, 2026: documentation: Add verified scores reference page and cross-links (t365_5) (f2cc0529)