Object Detection from Scratch: Part 2 - Dataset, Labels, and the Reality of Training Data

April 15, 2026 · 20 min read

Part 2 examines the dataset behind the MTG detector: splits, YOLO labels, class design, annotation noise, and why data quality sets the real performance ceiling.


In machine learning, the dataset is not "input." It is the first version of the system's truth model.

That is why I like the way this repo documents its data. The files are not hidden behind a vague training command. The project explains where the data comes from, how the splits work, and what the labels actually mean in practice.

The Dataset Source

The project uses a Roboflow Universe dataset for Magic: The Gathering card regions. According to the repo documentation, the working version contains thousands of annotated images split into train, validation, and test subsets.

The important detail is not just the count. It is that the annotations are region-level, not card-level. The model is being taught to understand a structured object.

Roboflow Universe dataset overview
The actual Universe page for the MTG detection project exposes the class chips, image count, and model metrics the article works from, so the rest of the post is anchored to a specific, inspectable source.

The Split Strategy

The repo explains the classic three-way split clearly:

  • train: where the model learns
  • valid: where training progress is monitored
  • test: where final evaluation stays honest

This matters because a detector that performs well on images it has effectively memorized is not useful. The split is one of the first safeguards against self-deception.
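A quick way to keep that safeguard honest is to audit the split on disk before training. The sketch below assumes the usual Roboflow YOLO export layout (`train/`, `valid/`, `test/`, each with `images/` and `labels/` subfolders); `count_split` is a hypothetical helper name, not something from the repo.

```python
from pathlib import Path

def count_split(root: str) -> dict:
    """Count images and label files per split in a YOLO-style export.

    Assumes the common Roboflow layout: <root>/<split>/images and
    <root>/<split>/labels. Mismatched counts are an early warning sign.
    """
    counts = {}
    for split in ("train", "valid", "test"):
        images = list(Path(root, split, "images").glob("*.jpg"))
        labels = list(Path(root, split, "labels").glob("*.txt"))
        counts[split] = {"images": len(images), "labels": len(labels)}
    return counts
```

If the image and label counts disagree in any split, that discrepancy is worth resolving before a single GPU-hour is spent.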

Dataset split overview
The train, validation, and test split is not paperwork. It is the control surface that separates learning, model selection, and honest evaluation.

The Seven Classes

The class set is practical and downstream-aware:

| ID | Class | Why it exists |
|----|-------------|-----------------------------------|
| 0 | art | Used for visual matching |
| 1 | card | Captures the full card boundary |
| 2 | description | Preserves rules text region |
| 3 | mana-cost | Small, semantically useful detail |
| 4 | power | Another small but important region |
| 5 | tags | Type line / category signal |
| 6 | title | Primary OCR target |

This class design is stronger than a generic "card" detector because it aligns with downstream product needs. The title region exists because OCR needs a clean crop. The art region exists because print disambiguation needs one. The classes reflect system requirements, not academic neatness.
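In code, this design reduces to a fixed id-to-name mapping. The ids below mirror the table; in a real export they come from the dataset's `data.yaml`, so order matters when decoding label files. `class_name` is an illustrative helper, not part of the repo.

```python
# Class map mirroring the table above. The ids are fixed by the
# dataset export, so decoding label files depends on this order.
CLASS_NAMES = {
    0: "art",
    1: "card",
    2: "description",
    3: "mana-cost",
    4: "power",
    5: "tags",
    6: "title",
}

def class_name(class_id: int) -> str:
    """Resolve a YOLO class id, flagging anything outside the known set."""
    return CLASS_NAMES.get(class_id, f"unknown-{class_id}")
```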

Seven detection classes on a real card
The class legend makes the task concrete: the detector is learning a structured document with seven distinct semantic regions, not a single global label.
Roboflow Universe Annotate — Layers view
In the Universe editor, the same seven classes render as filled regions over the card. That is the exact geometry the detector has to reproduce at inference time, one bounding box at a time.

What a YOLO Label Actually Stores

The file format is minimal:

class_id center_x center_y width height

All coordinates are normalized to [0, 1], which keeps the labels resolution-independent.

6 0.5524 0.9418 0.7480 0.1164
1 0.4949 0.5015 0.9754 0.9970

That simplicity is one of YOLO's practical strengths. The same simplicity is also unforgiving. If the label is sloppy, the model will faithfully learn sloppy targets.

YOLO label format diagram
The label file is compact on disk, but every normalized coordinate becomes part of the detector's truth model.
Roboflow Universe Raw Data view
Inside the platform each box is a JSON record with a class label and pixel-space coordinates. The YOLO text row above is just a normalized projection of exactly this structure.

Annotation Quality Is the Hidden Bottleneck

One of the most revealing details in the repo is the performance gap between relaxed and strict metrics:

  • strong mAP50
  • meaningfully lower mAP50-95

That pattern is common when the detector is finding roughly the right region but cannot consistently achieve near-pixel-perfect overlap. For large boxes, that is often acceptable. For small boxes like mana-cost and power, a few bad pixels matter a lot.

This is why the repo repeatedly returns to annotation quality in the metrics and training documents. Bigger models do not magically fix inconsistent boxes.
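The size asymmetry is easy to demonstrate with plain IoU arithmetic. The boxes below are made-up numbers, not repo data, but the effect is general: the same few pixels of slop cost a small box far more overlap than a large one.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Shift each box 4 px to the right and compare overlap with the original.
big = (0, 0, 400, 560)            # roughly a full-card box
small = (10, 10, 50, 30)          # roughly a mana-cost region
shift = lambda r, d: (r[0] + d, r[1], r[2] + d, r[3])

iou(big, shift(big, 4))           # ≈ 0.98, still near-perfect
iou(small, shift(small, 4))       # ≈ 0.82, already failing strict thresholds
```

At mAP50 both boxes count as hits; at the strict end of mAP50-95 only the big one survives, which is exactly the gap pattern described above.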

Data Quality Checks Matter More Than People Admit

The exploration scripts in this repo are not filler. They generate sample grids, class distributions, and quality checks. That is the right instinct.

Before tuning augmentations or scaling to cloud GPUs, an engineer should be able to answer:

  • Are labels missing?
  • Are some classes underrepresented?
  • Are small boxes consistently too loose or too tight?
  • Do train and validation samples look like the same world?

That is basic engineering hygiene. Without it, parameter tuning becomes expensive superstition.
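The first two questions can be answered mechanically. This is a minimal audit in the spirit of the repo's exploration scripts, assuming the same `images/` + `labels/` layout as before; `audit_split` is an illustrative name, not the repo's actual script.

```python
from collections import Counter
from pathlib import Path

def audit_split(split_dir: str):
    """Flag images without label files and tally per-class box counts."""
    images = {p.stem for p in Path(split_dir, "images").glob("*")}
    labels = {p.stem for p in Path(split_dir, "labels").glob("*.txt")}
    missing = sorted(images - labels)

    classes = Counter()
    for label_file in Path(split_dir, "labels").glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                classes[int(line.split()[0])] += 1
    return missing, classes
```

Run against each split, this catches unlabeled images and class imbalance in seconds; box tightness and train/valid drift still need the visual sample grids.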

Sample grid from dataset exploration
A real sample grid is what turns "the dataset looks fine" into something you can actually inspect before spending compute on training.

Annotation Noise and Small Objects

The hardest classes in this project are not surprising. Small semantic regions are always more fragile:

  • mana-cost
  • power
  • title
  • tags

These classes compress a lot of meaning into a small area. Any mismatch in labeling, cropping, camera angle, or resolution hurts disproportionately.

Annotated MTG card example
A single card already shows the precision burden of the task: several small semantic regions must be boxed tightly enough to support OCR and downstream matching.

This is also why the training strategy later leans on augmentations like mosaic, multi-scale training, and geometric transforms. Those choices are trying to help the model survive real-world variance without pretending the labels are perfect.
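In an Ultralytics-style trainer those choices map onto a handful of hyperparameters. The values below are illustrative defaults, not the repo's tuned numbers; the point is the reasoning attached to each knob.

```python
# Hypothetical augmentation settings for an Ultralytics-style trainer.
# Values are illustrative; only the rationale carries over.
AUGMENT = {
    "mosaic": 1.0,      # stitch 4 images per sample: more scale variety
    "degrees": 10.0,    # small rotations to mimic off-angle card photos
    "translate": 0.1,   # shifts so small regions don't anchor to one spot
    "scale": 0.5,       # aggressive rescaling: survive resolution changes
    "fliplr": 0.0,      # no horizontal flips: card text is orientation-bound
}
```

The `fliplr: 0.0` line is the one worth pausing on: for a document-like object with readable text, a mirrored training image would teach the model a world that never occurs at inference time.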

The Correction Workflow Is Part of the Architecture

One of my favorite details in the project is the explicit correction loop:

  • download test images
  • export predictions
  • review in Label Studio
  • merge corrected labels
  • retrain

That is the right way to think about quality improvement. The model is not just trained once. The dataset and the detector can improve together.
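The merge step of that loop is simple enough to sketch. This assumes corrected labels come back as YOLO `.txt` files under a separate directory, with human-reviewed files taking precedence; `merge_labels` and the directory names are hypothetical.

```python
import shutil
from pathlib import Path

def merge_labels(original: str, corrected: str, out: str) -> None:
    """Copy label files to `out`, preferring human-corrected versions.

    `original` holds the model-era labels, `corrected` holds the files
    that came back from Label Studio review. Both are YOLO .txt dirs.
    """
    out_dir = Path(out)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in Path(original).glob("*.txt"):
        fixed = Path(corrected, src.name)
        shutil.copy(fixed if fixed.exists() else src, out_dir / src.name)
```

Keeping the merge as a separate, inspectable step (rather than overwriting in place) means every retraining run can be traced back to exactly which labels were human-verified.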

Manual annotation inside Roboflow Universe
The Annotate editor is where a human actually lives while the dataset grows. Each class in the sidebar is a box that was placed, checked, and adjusted one card at a time, which is exactly why label quality becomes a real engineering cost.

That is the baseline cost. It makes the value of assisted tooling obvious, but it also explains why teams get tempted to trust automation too early.

Roboflow Universe Analytics
The platform's analytics panel surfaces annotation counts, class balance, and image-level distributions. Automation compresses the first pass, but a dashboard like this is what tells you whether the result is still trustworthy.

The right takeaway is not "automation replaces judgment." It is "automation compresses the first pass, so human review can focus on the hard cases."

Manual versus assisted annotation comparison
The side-by-side comparison makes the trade-off obvious: speed improves with automation, but correctness still depends on review and correction loops.

Conclusion

A detector learns whatever definition of reality the dataset encodes. If the labels are thoughtful, consistent, and close to the downstream task, the model has a chance. If the labels are noisy, even a strong architecture will plateau early.

This repo gets the important part right: it treats the dataset as a first-class engineering artifact.

In the next article, we move from the data itself to the choices that shape training: transfer learning, augmentation, optimization, and the experiments that pushed this detector to its best result.



Arthur Costa

Senior Full-Stack Engineer & Tech Lead

Senior Full-Stack Engineer with 8+ years in React, TypeScript, and Node.js. Expert in performance optimization and leading engineering teams.
