How we score

Every score, explained.

A six-dimension scoring framework with explicit weights, a confidence gate, and no editorial discretion to nudge a brand up or down. Here is exactly how it works.

The 1–10 scale

Calibrated to category, never inflated.

Each product receives a score from 1 to 10 in six dimensions. Dimension scores are weighted and averaged to produce an overall score. A 9 in the $300 monitor tier is not the same as a 9 in the $1,500 monitor tier, and we do not pretend otherwise.

9–10: Best-in-class for its tier. Category leader with minimal trade-offs.
7–8: Strong performer. Recommended for most buyers in this tier.
5–6: Average. Gets the job done with notable compromises.
3–4: Below average. Significant weaknesses in this dimension.
1–2: Poor. Major issues that affect usability or value.
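
If you prefer the rubric as data rather than prose, here is a minimal lookup sketch. The band labels come from the table above; the SCORE_BANDS constant and band_label helper are illustrative names, not our production code.

```python
# Minimal sketch of the rubric as a lookup table; illustrative, not production code.
SCORE_BANDS = [
    (9.0, "Best-in-class"),     # 9-10: category leader with minimal trade-offs
    (7.0, "Strong performer"),  # 7-8: recommended for most buyers in this tier
    (5.0, "Average"),           # 5-6: gets the job done with notable compromises
    (3.0, "Below average"),     # 3-4: significant weaknesses in this dimension
    (1.0, "Poor"),              # 1-2: major issues that affect usability or value
]

def band_label(score: float) -> str:
    """Map a 1-10 dimension score to its rubric band."""
    for floor, label in SCORE_BANDS:
        if score >= floor:
            return label
    return "Poor"
```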

Six dimensions

The weights, visible.

Weights are not invented per comparison. They live in a category-specific table, then get nudged by the user’s stated use case. Each dimension carries a typical floor-to-ceiling weight range rather than a single fixed number.

Performance

Raw capability against same-tier competitors. Benchmarks, real-world throughput, output quality.

Value

Price-to-performance against alternatives at the same or nearby price points.

Build Quality

Materials, durability, fit-and-finish, long-term reliability signal.

Features

Useful capabilities relative to the stated use case — not spec-sheet padding.

Ecosystem

Software support, accessory availability, update track record.

User Experience

Setup, day-to-day usability, owner-reported friction.

Weights sum to 100% within a comparison. A laptop comparison weights Performance higher than a desktop monitor comparison would. A budget-focused query nudges Value up and Features down.
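
In code, that aggregation is a straight weighted average over the category table. The weights, scores, and overall_score helper below are illustrative examples, not our production weight table.

```python
# Illustrative laptop-category weights (must sum to 1.0); not our production table.
LAPTOP_WEIGHTS = {
    "performance": 0.30,
    "value": 0.20,
    "build_quality": 0.15,
    "features": 0.15,
    "ecosystem": 0.10,
    "user_experience": 0.10,
}

def overall_score(dimension_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-dimension 1-10 scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(dimension_scores[d] * w for d, w in weights.items())

# Example: a laptop scoring 8.1 overall despite a middling Value score.
scores = {"performance": 9, "value": 6, "build_quality": 8,
          "features": 8, "ecosystem": 9, "user_experience": 9}
print(round(overall_score(scores, LAPTOP_WEIGHTS), 1))  # 8.1
```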

The confidence gate

Below 0.70 doesn’t ship.

Before any comparison is persisted to our database or surfaced in search-engine sitemaps, it runs through a four-signal confidence scorer.

0.4 Spec verification: A retailer API (Best Buy, eBay) returns matching product names with high fuzzy-match confidence.
0.3 Category coherence: Both products are the same kind of thing — a 27″ monitor is not compared to a 65″ TV.
0.2 Engine confidence: The AI model’s self-reported confidence on the comparison row.
0.1 Image quality: Both products have a real, non-fallback product image.

Comparisons below 0.70 are not persisted, not indexed, and not shown with full prominence. They are silently regenerated, served with a low-confidence label, or rejected outright.
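
A rough sketch of how the gate combines the four signals, assuming each signal has already been reduced to a value between 0 and 1; the names and helpers below are illustrative, not the production scorer.

```python
SIGNAL_WEIGHTS = {
    "spec_verification": 0.4,   # retailer API fuzzy-match on product names
    "category_coherence": 0.3,  # both products are the same kind of thing
    "engine_confidence": 0.2,   # model's self-reported confidence
    "image_quality": 0.1,       # both products have real, non-fallback images
}

CONFIDENCE_THRESHOLD = 0.70

def confidence(signals: dict[str, float]) -> float:
    """Weighted sum of the four signals, each in [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * value for name, value in signals.items())

def should_ship(signals: dict[str, float]) -> bool:
    """Below 0.70: not persisted, not indexed, not shown at full prominence."""
    return confidence(signals) >= CONFIDENCE_THRESHOLD

# Example: verified specs and coherent category, middling engine confidence,
# no usable product image.
signals = {"spec_verification": 1.0, "category_coherence": 1.0,
           "engine_confidence": 0.6, "image_quality": 0.0}
print(round(confidence(signals), 2), should_ship(signals))  # 0.82 True
```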

Pipeline

From spec sheet to verdict.

  1. Data gather

     Verified specs from retailer APIs and manufacturer datasheets. Owner feedback aggregated from public retailer reviews. The model never invents a spec; it works from the table we hand it.

  2. Dimension scoring

     Grok-3 scores each of the six dimensions on the 1–10 scale, with reasoning, against the spec table and use case.

  3. Weighted aggregation

     The per-dimension scores are combined using the category-specific weight table.

  4. Verdict synthesis

     A plain-English verdict is generated, with explicit reasoning per dimension and a clear winner.

  5. Confidence & safety

     The confidence scorer runs; the image safety filter runs; only verdicts that clear both are surfaced.
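
Read as code, the hand-off between stages looks roughly like the sketch below. Every function body and value is a stub standing in for a stage described above; only the stage order and the 0.70 gate come from this page.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    winner: str
    overall: dict[str, float]   # product -> weighted overall score
    confidence: float

def run_pipeline(spec_table: dict[str, dict[str, str]],
                 weights: dict[str, float],
                 use_case: str) -> Optional[Verdict]:
    # 2. Dimension scoring (stubbed): the model scores six dimensions per product.
    dim_scores = {product: {dim: 7.0 + i for dim in weights}
                  for i, product in enumerate(spec_table)}
    # 3. Weighted aggregation with the category-specific weight table.
    overall = {product: sum(scores[dim] * w for dim, w in weights.items())
               for product, scores in dim_scores.items()}
    # 4. Verdict synthesis (stubbed): real verdicts add per-dimension reasoning.
    winner = max(overall, key=overall.get)
    # 5. Confidence & safety: only verdicts clearing both gates are surfaced.
    conf = 0.85        # stubbed; a weighted sum of the four signals in practice
    images_ok = True   # stubbed image safety filter result
    if conf < 0.70 or not images_ok:
        return None    # regenerate, label low-confidence, or reject
    return Verdict(winner=winner, overall=overall, confidence=conf)
```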

How context changes scores

Same products. Different question. Different verdict.

Tell GoodPickr you care about portability and battery life, and the laptop comparison’s weight table reweights toward those signals before the verdict is generated. Same products. Same data. Different question, different verdict. That is intentional. “Best TV for a dark home theater” and “Best TV for a bright living room” should not produce the same answer, and on GoodPickr they don’t.
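
To make that concrete, here is a toy reweighting sketch. The nudge amounts, the base weights, and the reweight helper are invented for illustration; only the idea of nudging and renormalizing comes from this page.

```python
def reweight(base: dict[str, float], nudges: dict[str, float]) -> dict[str, float]:
    """Apply use-case nudges to the category weight table, then renormalize to 100%."""
    nudged = {d: max(w + nudges.get(d, 0.0), 0.0) for d, w in base.items()}
    total = sum(nudged.values())
    return {d: w / total for d, w in nudged.items()}

base = {"performance": 0.30, "value": 0.20, "build_quality": 0.15,
        "features": 0.15, "ecosystem": 0.10, "user_experience": 0.10}

# A budget-focused query might nudge Value up and Features down; the amounts are invented.
budget_weights = reweight(base, {"value": +0.10, "features": -0.10})

scores = {"performance": 9, "value": 6, "build_quality": 8,
          "features": 8, "ecosystem": 9, "user_experience": 9}

default = sum(scores[d] * w for d, w in base.items())            # about 8.1 with the base weights
budget = sum(scores[d] * w for d, w in budget_weights.items())   # about 7.9: same data, different verdict
```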

Editorial integrity guarantee

What our scores are not

Honest about what this is.

  • Not hands-on testing. We do not personally test the products or use them in our homes. Scores synthesize retailer data, manufacturer specs, and aggregated owner feedback — nothing more.
  • Not a single canonical truth. Two queries with different use cases will produce different verdicts on the same product.
  • Not a substitute for verifying critical specs. Before a significant purchase, confirm key specs on the manufacturer’s site.

Limitations we are honest about

Where the data thins out.

  • Newly released products may have thin owner-feedback data; we flag confidence accordingly.
  • Score precision is real but not absolute. A 7.2 vs a 7.4 is a close call. Read the per-dimension reasoning rather than fixating on a single decimal.
  • Regional skew. Our retailer-API data is US-market-rich; pricing and availability outside the US may differ.

Have a question?

Spot a score that looks wrong?

Email billy@goodpickr.com with the URL and the specific data point. Confirmed errors are corrected within 48 hours.