How we score
Every score, explained.
A six-dimension scoring framework with explicit weights, a confidence gate, and no editorial discretion to nudge a brand up or down. Here is exactly how it works.
The 1–10 scale
Calibrated to category, never inflated.
Each product receives a score from 1 to 10 in six dimensions. Dimension scores are weighted and averaged to produce an overall score. A 9 in the $300 monitor tier is not the same as a 9 in the $1,500 monitor tier, and we do not pretend otherwise.
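The weighted average described above can be sketched in a few lines. The dimension names and weight values here are illustrative only, not GoodPickr's actual category table:

```python
# Hypothetical sketch of the weighted average described above.
# Dimension names and weights are invented for illustration.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension 1-10 scores; weights must sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return round(sum(scores[d] * weights[d] for d in scores), 1)

scores = {"performance": 8, "value": 7, "features": 9,
          "build": 8, "reliability": 7, "support": 6}
weights = {"performance": 0.30, "value": 0.20, "features": 0.15,
           "build": 0.15, "reliability": 0.10, "support": 0.10}
print(overall_score(scores, weights))
```

Because the scale is calibrated per category, the same `overall_score` value means different things in different price tiers; the number is only comparable within one comparison.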
Six dimensions
The weights, visible.
Weights are not invented per-comparison. They live in a category-specific table and are then nudged by the user's stated use case. Each ring shows the typical floor-to-ceiling range we apply.

Weights sum to 100% within a comparison. A laptop comparison weights Performance higher than a desktop monitor comparison would. A budget-focused query nudges Value up and Features down.
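One way to picture the nudge-then-renormalize step is the sketch below. The function name, the starting weights, and the nudge factors are all assumptions; the only constraint taken from the text is that the final weights sum to 100%:

```python
# Illustrative sketch (assumed names and values) of nudging category weights
# by use case, then renormalizing so they still sum to 100%.
def apply_use_case(weights: dict[str, float], nudges: dict[str, float]) -> dict[str, float]:
    """Multiply each weight by its nudge factor, then renormalize to sum to 1.0."""
    nudged = {d: w * nudges.get(d, 1.0) for d, w in weights.items()}
    total = sum(nudged.values())
    return {d: w / total for d, w in nudged.items()}

monitor_weights = {"performance": 0.25, "value": 0.25, "features": 0.20,
                   "build": 0.10, "reliability": 0.10, "support": 0.10}
budget_nudges = {"value": 1.5, "features": 0.7}   # a budget-focused query
reweighted = apply_use_case(monitor_weights, budget_nudges)
print(round(sum(reweighted.values()), 6))  # 1.0 -- still sums to 100%
```

Renormalizing after the nudge means boosting one dimension implicitly shaves a little from every other, which is exactly the "Value up, Features down" behavior described above.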
The confidence gate
Below 0.70 doesn’t ship.
Before any comparison is persisted to our database or surfaced in search-engine sitemaps, it runs through a four-signal confidence scorer.
Comparisons that score below 0.70 are never persisted, never indexed, and never shown with full prominence. Depending on how they fail, they are silently regenerated, served with a low-confidence label, or rejected outright.
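A minimal sketch of that gate is below. The four signal names, the simple-average combining rule, and the 0.50 lower band are all assumptions; only the 0.70 threshold comes from the text:

```python
# Hedged sketch of the four-signal confidence gate. Signal names, the
# averaging rule, and the 0.50 band are assumptions; 0.70 is from the text.
THRESHOLD = 0.70

def gate(signals: dict[str, float]) -> str:
    """Average four 0-1 signals and route the comparison accordingly."""
    confidence = sum(signals.values()) / len(signals)
    if confidence >= THRESHOLD:
        return "publish"              # persisted, indexed, full prominence
    if confidence >= 0.50:            # assumed lower band
        return "low_confidence_label"
    return "regenerate_or_reject"

print(gate({"spec_coverage": 0.9, "source_agreement": 0.8,
            "owner_feedback_depth": 0.7, "model_self_report": 0.75}))  # publish
```

The key property is that the gate runs before persistence and indexing, so a failing comparison never reaches the database or a sitemap in the first place.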
Pipeline
From spec sheet to verdict.
- 01
Data gather
Verified specs from retailer APIs and manufacturer datasheets. Owner feedback aggregated from public retailer reviews. The model never invents a spec; it works from the table we hand it.
- 02
Dimension scoring
Grok-3 scores each of the six dimensions on the 1–10 scale, with reasoning, against the spec table and use case.
- 03
Weighted aggregation
The per-dimension scores are combined using the category-specific weight table.
- 04
Verdict synthesis
A plain-English verdict is generated, with explicit reasoning per dimension and a clear winner.
- 05
Confidence & safety
The confidence scorer runs; the image safety filter runs; only verdicts that clear both are surfaced.
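The five steps above can be chained as in this sketch. Every function body is a stand-in (the real retailer APIs, the Grok-3 call, and the safety filters are not shown), and all names are hypothetical:

```python
# Minimal end-to-end sketch of the five pipeline steps. All function bodies
# are stubs; the real data sources, model call, and filters are not shown.
def gather_specs(product_ids):
    # 01: verified specs from retailer APIs / manufacturer datasheets (stubbed)
    return {pid: {"panel": "IPS", "refresh_hz": 144} for pid in product_ids}

def score_dimensions(spec_table, use_case):
    # 02: the model scores each of six dimensions 1-10 (stubbed, fixed scores)
    return {pid: {"performance": 8, "value": 7, "features": 9,
                  "build": 8, "reliability": 7, "support": 6}
            for pid in spec_table}

def aggregate(dim_scores):
    # 03: weighted aggregation (equal weights here, for illustration)
    return {pid: round(sum(s.values()) / len(s), 1) for pid, s in dim_scores.items()}

def synthesize_verdict(overall):
    # 04: plain-English verdict naming a clear winner
    winner = max(overall, key=overall.get)
    return {"winner": winner, "scores": overall}

def passes_gates(verdict):
    # 05: confidence scorer + image safety filter (stubbed as always-pass)
    return True

specs = gather_specs(["monitor-a", "monitor-b"])
verdict = synthesize_verdict(aggregate(score_dimensions(specs, "gaming")))
print(verdict if passes_gates(verdict) else None)
```

The important design point is the ordering: the model only ever sees the spec table handed to it in step 01, and nothing is surfaced until step 05 clears both gates.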
How context changes scores
Same products. Different question. Different verdict.
Tell GoodPickr you care about portability and battery life, and the laptop comparison’s weight table is reweighted toward those dimensions before the verdict is generated. Same products. Same data. Different question, different verdict. That is intentional. “Best TV for a dark home theater” and “Best TV for a bright living room” should not produce the same answer, and on GoodPickr they don’t.
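A worked example, with invented numbers and only four dimensions for brevity, shows how identical dimension scores can flip the winner under different use-case weightings:

```python
# Invented numbers: the same dimension scores produce different winners
# under different use-case weightings. Four dimensions used for brevity.
scores = {
    "laptop_a": {"performance": 9, "value": 6, "battery": 5, "portability": 5},
    "laptop_b": {"performance": 6, "value": 7, "battery": 9, "portability": 9},
}

def winner(weights):
    totals = {p: sum(s[d] * weights[d] for d in weights) for p, s in scores.items()}
    return max(totals, key=totals.get)

desk_bound = {"performance": 0.5, "value": 0.3, "battery": 0.1, "portability": 0.1}
traveler   = {"performance": 0.2, "value": 0.2, "battery": 0.3, "portability": 0.3}
print(winner(desk_bound), winner(traveler))  # laptop_a laptop_b
```

Nothing about the products changed between the two calls; only the stated use case did, which is the behavior the paragraph above describes.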
Editorial integrity guarantee
- We never inflate scores for products we earn higher commissions from. Affiliate payout has no input into the scoring pipeline. The model does not know, and could not use, what we are paid.
- We do not accept payment for higher scores or for inclusion in any list, ranking, or comparison.
- The same methodology runs on every product. No brand-specific overrides exist in code.
What our scores are not
Honest about what this is.
- Not hands-on testing. We do not personally test or use the products in our home. Scores synthesize retailer data, manufacturer specs, and aggregated owner feedback — nothing more.
- Not a single canonical truth. Two queries with different use cases will produce different verdicts on the same product.
- Not a substitute for verifying critical specs. Before a significant purchase, confirm key specs on the manufacturer’s site.
Limitations we are honest about
Where the data thins out.
- Newly released products may have thin owner-feedback data; we flag confidence accordingly.
- Score precision is real but not absolute. A 7.2 vs a 7.4 is a close call. Read the per-dimension reasoning rather than fixating on a single decimal.
- Regional skew. Our retailer-API data is US-market-rich; pricing and availability outside the US may differ.
Have a question?
Spot a score that looks wrong?
Email billy@goodpickr.com with the URL and the specific data point. Confirmed errors are corrected within 48 hours.