Object Detection from Scratch: Part 2 - Dataset, Labels, and the Reality of Training Data

April 15, 2026 · 20 min read

Part 2 examines the dataset behind the MTG detector: splits, YOLO labels, class design, annotation noise, and why data quality sets the real performance ceiling.


In machine learning, the dataset is not "input." It is the first version of the system's truth model.

That is why I like the way this repo documents its data. The files are not hidden behind a vague training command. The project explains where the data comes from, how the splits work, and what the labels actually mean in practice.

The Dataset Source

The project uses a Roboflow Universe dataset for Magic: The Gathering card regions. According to the repo documentation, the working version contains thousands of annotated images split into train, validation, and test subsets.

The important detail is not just the count. It is that the annotations are region-level, not card-level. The model is being taught to understand a structured object.

Roboflow Universe dataset overview
The actual Universe page for the MTG detection project exposes the class chips, image count, and model metrics the article works from, so the rest of the post is anchored to a specific, inspectable source.

The Split Strategy

The repo explains the classic three-way split clearly:

  • train: where the model learns
  • valid: where training progress is monitored
  • test: where final evaluation stays honest

This matters because a detector that performs well on images it has effectively memorized is not useful. The split is one of the first safeguards against self-deception.
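A quick way to keep that safeguard honest is to audit the split on disk before training. The sketch below assumes the usual Roboflow YOLO export layout (`train/`, `valid/`, `test/`, each with `images/` and `labels/` subfolders); `count_split` is a hypothetical helper name, not something from the repo.

```python
from pathlib import Path

def count_split(root: str) -> dict:
    """Count images and label files per split in a YOLO-style export.

    Assumes the common Roboflow layout: <root>/<split>/images and
    <root>/<split>/labels. Mismatched counts are an early warning sign.
    """
    counts = {}
    for split in ("train", "valid", "test"):
        images = list(Path(root, split, "images").glob("*.jpg"))
        labels = list(Path(root, split, "labels").glob("*.txt"))
        counts[split] = {"images": len(images), "labels": len(labels)}
    return counts
```

If the image and label counts disagree in any split, that discrepancy is worth resolving before a single GPU-hour is spent.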

Dataset split overview
The train, validation, and test split is not paperwork. It is the control surface that separates learning, model selection, and honest evaluation.

The Seven Classes

The class set is practical and downstream-aware:

| ID | Class | Why it exists |
|----|-------------|-----------------------------------|
| 0 | art | Used for visual matching |
| 1 | card | Captures the full card boundary |
| 2 | description | Preserves rules text region |
| 3 | mana-cost | Small, semantically useful detail |
| 4 | power | Another small but important region |
| 5 | tags | Type line / category signal |
| 6 | title | Primary OCR target |

This class design is stronger than a generic "card" detector because it aligns with downstream product needs. The title region exists because OCR needs a clean crop. The art region exists because print disambiguation needs one. The classes reflect system requirements, not academic neatness.
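In code, this design reduces to a fixed id-to-name mapping. The ids below mirror the table; in a real export they come from the dataset's `data.yaml`, so order matters when decoding label files. `class_name` is an illustrative helper, not part of the repo.

```python
# Class map mirroring the table above. The ids are fixed by the
# dataset export, so decoding label files depends on this order.
CLASS_NAMES = {
    0: "art",
    1: "card",
    2: "description",
    3: "mana-cost",
    4: "power",
    5: "tags",
    6: "title",
}

def class_name(class_id: int) -> str:
    """Resolve a YOLO class id, flagging anything outside the known set."""
    return CLASS_NAMES.get(class_id, f"unknown-{class_id}")
```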

Seven detection classes on a real card
The class legend makes the task concrete: the detector is learning a structured document with seven distinct semantic regions, not a single global label.
Roboflow Universe Annotate — Layers view
In the Universe editor, the same seven classes render as filled regions over the card. That is the exact geometry the detector has to reproduce at inference time, one bounding box at a time.

What a YOLO Label Actually Stores

The file format is minimal:

class_id center_x center_y width height

All coordinates are normalized to [0, 1], which keeps the labels resolution-independent.

6 0.5524 0.9418 0.7480 0.1164
1 0.4949 0.5015 0.9754 0.9970

That simplicity is one of YOLO's practical strengths. The same simplicity is also unforgiving. If the label is sloppy, the model will faithfully learn sloppy targets.

YOLO label format diagram
The label file is compact on disk, but every normalized coordinate becomes part of the detector's truth model.
Roboflow Universe Raw Data view
Inside the platform each box is a JSON record with a class label and pixel-space coordinates. The YOLO text row above is just a normalized projection of exactly this structure.

Annotation Quality Is the Hidden Bottleneck

One of the most revealing details in the repo is the performance gap between relaxed and strict metrics:

  • strong mAP50
  • meaningfully lower mAP50-95

That pattern is common when the detector is finding roughly the right region but cannot consistently achieve near-pixel-perfect overlap. For large boxes, that is often acceptable. For small boxes like mana-cost and power, a few bad pixels matter a lot.

This is why the repo repeatedly returns to annotation quality in the metrics and training documents. Bigger models do not magically fix inconsistent boxes.
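The size asymmetry is easy to demonstrate with plain IoU arithmetic. The boxes below are made-up numbers, not repo data, but the effect is general: the same few pixels of slop cost a small box far more overlap than a large one.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Shift each box 4 px to the right and compare overlap with the original.
big = (0, 0, 400, 560)            # roughly a full-card box
small = (10, 10, 50, 30)          # roughly a mana-cost region
shift = lambda r, d: (r[0] + d, r[1], r[2] + d, r[3])

iou(big, shift(big, 4))           # ≈ 0.98, still near-perfect
iou(small, shift(small, 4))       # ≈ 0.82, already failing strict thresholds
```

At mAP50 both boxes count as hits; at the strict end of mAP50-95 only the big one survives, which is exactly the gap pattern described above.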

Data Quality Checks Matter More Than People Admit

The exploration scripts in this repo are not filler. They generate sample grids, class distributions, and quality checks. That is the right instinct.

Before tuning augmentations or scaling to cloud GPUs, an engineer should be able to answer:

  • Are labels missing?
  • Are some classes underrepresented?
  • Are small boxes consistently too loose or too tight?
  • Do train and validation samples look like the same world?

That is basic engineering hygiene. Without it, parameter tuning becomes expensive superstition.
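The first two questions can be answered mechanically. This is a minimal audit in the spirit of the repo's exploration scripts, assuming the same `images/` + `labels/` layout as before; `audit_split` is an illustrative name, not the repo's actual script.

```python
from collections import Counter
from pathlib import Path

def audit_split(split_dir: str):
    """Flag images without label files and tally per-class box counts."""
    images = {p.stem for p in Path(split_dir, "images").glob("*")}
    labels = {p.stem for p in Path(split_dir, "labels").glob("*.txt")}
    missing = sorted(images - labels)

    classes = Counter()
    for label_file in Path(split_dir, "labels").glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                classes[int(line.split()[0])] += 1
    return missing, classes
```

Run against each split, this catches unlabeled images and class imbalance in seconds; box tightness and train/valid drift still need the visual sample grids.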

Sample grid from dataset exploration
A real sample grid is what turns "the dataset looks fine" into something you can actually inspect before spending compute on training.

Annotation Noise and Small Objects

The hardest classes in this project are not surprising. Small semantic regions are always more fragile:

  • mana-cost
  • power
  • title
  • tags

These classes compress a lot of meaning into a small area. Any mismatch in labeling, cropping, camera angle, or resolution hurts disproportionately.

Annotated MTG card example
A single card already shows the precision burden of the task: several small semantic regions must be boxed tightly enough to support OCR and downstream matching.

This is also why the training strategy later leans on augmentations like mosaic, multi-scale training, and geometric transforms. Those choices are trying to help the model survive real-world variance without pretending the labels are perfect.
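In an Ultralytics-style trainer those choices map onto a handful of hyperparameters. The values below are illustrative defaults, not the repo's tuned numbers; the point is the reasoning attached to each knob.

```python
# Hypothetical augmentation settings for an Ultralytics-style trainer.
# Values are illustrative; only the rationale carries over.
AUGMENT = {
    "mosaic": 1.0,      # stitch 4 images per sample: more scale variety
    "degrees": 10.0,    # small rotations to mimic off-angle card photos
    "translate": 0.1,   # shifts so small regions don't anchor to one spot
    "scale": 0.5,       # aggressive rescaling: survive resolution changes
    "fliplr": 0.0,      # no horizontal flips: card text is orientation-bound
}
```

The `fliplr: 0.0` line is the one worth pausing on: for a document-like object with readable text, a mirrored training image would teach the model a world that never occurs at inference time.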

The Correction Workflow Is Part of the Architecture

One of my favorite details in the project is the explicit correction loop:

  • download test images
  • export predictions
  • review in Label Studio
  • merge corrected labels
  • retrain

That is the right way to think about quality improvement. The model is not just trained once. The dataset and the detector can improve together.
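The merge step of that loop is simple enough to sketch. This assumes corrected labels come back as YOLO `.txt` files under a separate directory, with human-reviewed files taking precedence; `merge_labels` and the directory names are hypothetical.

```python
import shutil
from pathlib import Path

def merge_labels(original: str, corrected: str, out: str) -> None:
    """Copy label files to `out`, preferring human-corrected versions.

    `original` holds the model-era labels, `corrected` holds the files
    that came back from Label Studio review. Both are YOLO .txt dirs.
    """
    out_dir = Path(out)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in Path(original).glob("*.txt"):
        fixed = Path(corrected, src.name)
        shutil.copy(fixed if fixed.exists() else src, out_dir / src.name)
```

Keeping the merge as a separate, inspectable step (rather than overwriting in place) means every retraining run can be traced back to exactly which labels were human-verified.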

Manual annotation inside Roboflow Universe
The Annotate editor is where a human actually lives while the dataset grows. Each class in the sidebar is a box that was placed, checked, and adjusted one card at a time, which is exactly why label quality becomes a real engineering cost.

That is the baseline cost. It makes the value of assisted tooling obvious, but it also explains why teams get tempted to trust automation too early.

Roboflow Universe Analytics
The platform's analytics panel surfaces annotation counts, class balance, and image-level distributions. Automation compresses the first pass, but a dashboard like this is what tells you whether the result is still trustworthy.

The right takeaway is not "automation replaces judgment." It is "automation compresses the first pass, so human review can focus on the hard cases."

Manual versus assisted annotation comparison
The side-by-side comparison makes the trade-off obvious: speed improves with automation, but correctness still depends on review and correction loops.

Conclusion

A detector learns whatever definition of reality the dataset encodes. If the labels are thoughtful, consistent, and close to the downstream task, the model has a chance. If the labels are noisy, even a strong architecture will plateau early.

This repo gets the important part right: it treats the dataset as a first-class engineering artifact.

In the next article, we move from the data itself to the choices that shape training: transfer learning, augmentation, optimization, and the experiments that pushed this detector to its best result.



Arthur Costa

Senior Full-Stack Engineer & Tech Lead

Senior Full-Stack Engineer with 8+ years in React, TypeScript, and Node.js. Expert in performance optimization and leading engineering teams.
