Object Detection from Scratch: Part 4 - Reading Metrics Like an Engineer

June 15, 2026 · 20 min read

Part 4 explains how to read detection metrics in this project: precision, recall, mAP50, mAP50-95, per-class behavior, confusion patterns, and what the numbers imply for real product use.


Engineers are often told to "watch the metrics," but metrics without interpretation are just decorated uncertainty.

This project does a better job than most of connecting the numbers to behavior. The repo does not stop at a headline score. It explains what each metric means, where the model is strong, and where the numbers hint at data or localization problems.

Precision and Recall Are Product Questions

These questions are not abstract. They surface directly in the user experience, and every error propagates into the product.

Precision asks:

When the detector says it found a region, how often is that claim correct?

Recall asks:

When a region really exists, how often does the detector find it?

If precision is low, the product hallucinates structure. If recall is low, the product misses useful information. Both are annoying in different ways.
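To keep the definitions concrete, here is a minimal sketch of both quantities computed from raw counts. The counts are illustrative placeholders, not the repo's actual validation numbers.

```python
def precision(tp: int, fp: int) -> float:
    """Of the regions the detector claimed, what fraction were real?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Of the regions that really exist, what fraction were found?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative counts, not the repo's actual numbers:
tp, fp, fn = 90, 10, 30
print(f"precision = {precision(tp, fp):.2f}")  # 0.90 -> few hallucinated regions
print(f"recall    = {recall(tp, fn):.2f}")     # 0.75 -> a quarter of real regions missed
```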

Precision and recall relationship
The precision-recall trade-off becomes easier to reason about when you can see the tension between false positives, false negatives, and usable operating points.

Why mAP50 Is Only Half the Story

The repo's strongest reported result is its mAP50, which is excellent. That means the detector usually places boxes with enough overlap to count as correct under a forgiving threshold.

That is good news. It means the model has learned the task.

But mAP50-95 tells the harder truth. It averages AP over IoU thresholds from 0.50 to 0.95, so it asks how well localization holds up as the overlap requirement becomes strict.

For this project, the gap between those metrics is not failure. It is signal.

IoU strictness comparison
IoU thresholds are not abstract scoring trivia. They change whether a box counts as approximately right or production-tight.
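A small sketch makes the relationship concrete: mAP50-95 sweeps the IoU requirement from 0.50 to 0.95 in steps of 0.05, so a box that is only approximately right clears just the lower part of the sweep. The IoU value here is illustrative.

```python
import numpy as np

iou = 0.62  # an "approximately right" box, illustrative value
thresholds = np.arange(0.50, 1.00, 0.05)  # the mAP50-95 sweep: 0.50, 0.55, ..., 0.95

hits = iou >= thresholds
print(f"correct at {hits.sum()} of {len(thresholds)} thresholds")
# -> correct at 3 of 10: fine for mAP50, a drag on mAP50-95
```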

Small Regions Distort Strict Metrics

The repo's metrics guide explains the pattern clearly: large regions tolerate small coordinate errors well. Small regions do not.

Think about the art box versus the mana-cost box. A two-pixel shift barely changes the IoU of a large art rectangle, but it can materially hurt the IoU of a tiny mana-cost label. That is why small classes often show a larger gap between relaxed and strict metrics.
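Here is that comparison as a runnable sketch. The box coordinates are made up for illustration; only the geometry matters.

```python
def iou(a, b):
    """Intersection-over-union for [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def shifted(box, px):
    """Slide a box horizontally by px pixels."""
    return [box[0] + px, box[1], box[2] + px, box[3]]

# Illustrative geometry, not the repo's actual annotations:
art  = [0, 0, 300, 400]  # large art region
mana = [0, 0, 20, 20]    # tiny mana-cost label

print(f"art  IoU after 2px shift: {iou(art, shifted(art, 2)):.3f}")   # ~0.987
print(f"mana IoU after 2px shift: {iou(mana, shifted(mana, 2)):.3f}") # ~0.818
```

The same two-pixel error costs the large box about one point of IoU and the tiny box nearly twenty, which is exactly the gap that strict thresholds expose.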

This is not merely "the model is worse at small objects." It can also mean:

  • annotation boxes are slightly inconsistent
  • labels around tiny regions are noisy
  • strict IoU punishes small-box error much harder

Per-Class Analysis Is Where the Story Lives

Mean metrics are useful, but class-level metrics are more honest.

In this project, the broad pattern is intuitive:

  • large classes like art, card, and description perform strongly
  • smaller classes like mana-cost and power are harder
  • narrow text bands like title and tags are sensitive to localization precision

Those results line up with the product's geometry and the dataset's annotation difficulty. That coherence is what makes the metrics trustworthy.
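One way to make that per-class story visible is to rank classes by how much they degrade between the relaxed and strict metrics. The scores below are placeholder values for illustration, not the repo's reported results.

```python
# Placeholder per-class scores, for illustration only:
map50    = {"art": 0.99, "card": 0.99, "description": 0.98,
            "title": 0.97, "tags": 0.95, "mana-cost": 0.93, "power": 0.92}
map50_95 = {"art": 0.92, "card": 0.90, "description": 0.85,
            "title": 0.78, "tags": 0.74, "mana-cost": 0.61, "power": 0.58}

# Rank classes by how hard strict IoU hits them:
gaps = sorted(((c, map50[c] - map50_95[c]) for c in map50),
              key=lambda item: item[1], reverse=True)
for cls, gap in gaps:
    print(f"{cls:12s} strict-IoU gap = {gap:.2f}")
```

A large gap on a small class is the signature pattern from the previous section: localization, or labels, rather than recognition.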

Confusion Matrices Need Context

A confusion matrix can look scientific without being useful. The real question is whether the confusion pattern makes sense for the task. If it does not, the detector is telling you where the pipeline is misaligned.

For this detector, plausible confusion includes:

  • title vs tags
  • description vs neighboring text regions
  • partial card boundaries under challenging framing

That kind of confusion is understandable because the regions are semantically related and visually adjacent. It is a believable mistake. If the detector were confusing art with mana-cost constantly, that would point to a structural problem instead.

Validation confusion matrix
The raw confusion matrix shows which classes the detector actually mixes up and where misses accumulate in absolute terms.
Normalized confusion matrix
The normalized view is easier to compare class by class because it converts the same confusion patterns into percentages instead of raw counts.
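The normalized view is simple to derive from the raw counts: divide each ground-truth row by its total so every class sums to one. A minimal sketch with made-up counts; conventions differ between tools, so always check which axis is ground truth.

```python
import numpy as np

# Made-up raw counts; here rows = ground truth, columns = prediction.
classes = ["title", "tags", "description"]
raw = np.array([
    [120,   9,   3],  # true title: mostly right, some confusion with tags
    [  7,  80,   5],  # true tags: the adjacent-text confusion shows up here
    [  2,   4, 150],  # true description
])

# Row-normalize so each ground-truth class sums to 1:
normalized = raw / raw.sum(axis=1, keepdims=True)
print(np.round(normalized, 2))
```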

Metrics Should Influence the Pipeline

This is the part many ML posts miss. The metric is not the end of the story. It tells the rest of the pipeline what it can safely assume.

Examples:

  • Strong title detection helps OCR by delivering cleaner crops.
  • Strong art detection helps DINOv2 by isolating the right visual content.
  • Weaker small-box precision may be acceptable if those classes are not critical in the first production path.

That is why product context matters. A detector can be imperfect in areas that do not break the user-facing experience, and unacceptable in areas that do.

A Good Metric Reading Produces Decisions

When I read this repo's metrics, the decisions it suggests are:

  • keep the current baseline because it is strong where the pipeline needs strength
  • improve labels if the goal is to raise strict localization quality
  • be careful about assuming bigger models are the right next step
  • optimize around downstream reliability, not only benchmark aesthetics

That is exactly what good measurement should do.

Precision-recall curve from validation
The PR curve shows whether the detector can stay accurate as recall rises, which is why its shape matters more than any single threshold.

Start with the PR curve. It answers the biggest question first: does the detector remain useful as you try to recover more real regions?

F1 curve from validation
The F1 curve helps locate the confidence threshold where precision and recall are best balanced for the actual project.

Then look at the F1 curve to choose an operating point instead of guessing. It is the fastest way to see where the detector's trade-off feels most balanced.
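Choosing that operating point is a one-liner once you have precision and recall sampled across confidence thresholds. The arrays below are synthetic stand-ins for the validation curves, not real measurements.

```python
import numpy as np

# Synthetic stand-ins for the validation curves:
thresholds  = np.linspace(0.05, 0.95, 19)
precision_c = np.linspace(0.70, 0.99, 19)  # precision climbs as confidence tightens
recall_c    = np.linspace(0.98, 0.40, 19)  # recall pays for that strictness

f1 = 2 * precision_c * recall_c / (precision_c + recall_c + 1e-9)
best = int(np.argmax(f1))
print(f"best F1 = {f1[best]:.3f} at confidence threshold {thresholds[best]:.2f}")
```

In a real run you would read those arrays off the validator's output rather than synthesize them.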

Precision curve from validation
Precision rises as the confidence threshold gets stricter, which is useful when the product would rather suppress weak detections than hallucinate regions.
Recall curve from validation
Recall falls as the threshold rises, which is why threshold tuning is always a product trade-off instead of a purely academic one.

The precision and recall curves make the same threshold decision legible from opposite sides: one shows how aggressively false positives drop, the other shows what you pay in missed detections.

Conclusion

The detector's metrics are impressive, but the most valuable thing about them is not that they are high. It is that they are interpretable.

You can read the results and infer where the model is robust, where localization is brittle, and where the real ceiling is likely coming from. That makes the metrics useful for engineering, not just presentation.

Next, we follow the model output into the identification pipeline, where detection becomes OCR, lookup, and visual matching.
