Object Detection from Scratch: Part 5 - From Detection to Identification

July 15, 2026 · 24 min read

Part 5 follows the live identification pipeline: detection, OCR, Scryfall lookup, DINOv2 art matching, error propagation, and why the system needs more than a detector.


This is the chapter where the project stops looking like a detector and starts looking like a product.

By the time the user sees a card name, a price, and the exact printing, several independent systems have already cooperated:

  • the detector found the relevant regions
  • OCR read the title crop
  • Scryfall resolved a card identity
  • DINOv2 compared art crops to distinguish printings

That is a serious pipeline.

The Pipeline Shape

The solution document summarizes the core flow cleanly: detect the regions, read the title, resolve the card through Scryfall, then match the art to an exact printing with DINOv2.

This architecture is effective because every stage has a narrow, defensible job.

Identification pipeline overview
The full identification path runs two branches in parallel: the title branch resolves the card through OCR and Scryfall, while the art branch resolves the exact printing through DINOv2 similarity.
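As a rough sketch of what each stage hands to the next, the types below are illustrative only; the names are not taken from the repository.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical intermediate artifacts, for illustration only.
@dataclass
class DetectedRegions:      # Stage 1: detector output
    title_crop: np.ndarray  # tight crop around the card title
    art_crop: np.ndarray    # tight crop around the card art

@dataclass
class CardIdentity:         # Stages 2-3: OCR text resolved via Scryfall
    name: str
    prices: dict
    printings: list         # candidate printings for Stage 4

@dataclass
class PrintingMatch:        # Stage 4: DINOv2 art similarity
    set_code: str
    collector_number: str
    similarity: float       # cosine similarity of the winning candidate
```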

Stage 1: Detection Creates Structure

The detector's job is not "solve card identity." Its job is to transform an unstructured image into semantically useful regions.

That matters because OCR quality depends heavily on crop quality. A generic OCR pass over the entire card would invite clutter from rules text, symbols, borders, and background noise. A targeted title crop gives the OCR engine a much cleaner problem.
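A minimal cropping helper makes the point concrete; the (x1, y1, x2, y2) box format and the padding value are assumptions, not the project's actual detector output.

```python
import numpy as np

def crop_region(image: np.ndarray, box: tuple, pad: int = 4) -> np.ndarray:
    """Cut a detected region out of the full frame, with a small margin.

    `box` is assumed to be (x1, y1, x2, y2) in pixel coordinates; the
    real detector's output format may differ.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1) - pad), max(0, int(y1) - pad)
    x2, y2 = min(w, int(x2) + pad), min(h, int(y2) + pad)
    return image[y1:y2, x1:x2]

# title_crop = crop_region(frame, title_box)   # goes straight to OCR
# art_crop   = crop_region(frame, art_box)     # goes to DINOv2 matching
```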

Detection result example
A real prediction screenshot shows the detector doing the unglamorous but essential work: carving the image into crops that the rest of the pipeline can trust.

Stage 2: OCR Turns Pixels into Searchable Meaning

The project uses RapidOCR to read the title region. Paired with the detector, that is an excellent division of labor:

  • detection handles spatial localization
  • OCR handles text extraction

The combination is stronger than either one alone. OCR without detection sees too much noise. Detection without OCR still does not know the card's semantic identity.
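A minimal sketch of that title-reading step, assuming the rapidocr-onnxruntime package and its (results, elapse) return shape; the project's actual wrapper may differ.

```python
from rapidocr_onnxruntime import RapidOCR  # pip install rapidocr-onnxruntime

engine = RapidOCR()

def read_title(title_crop) -> str | None:
    """Run OCR on the detected title crop and return the best guess.

    Each result is roughly [box, text, confidence]; exact fields can
    vary by version, so treat this as a sketch.
    """
    results, _ = engine(title_crop)
    if not results:
        return None
    # Join fragments left-to-right; long titles can split into pieces.
    return " ".join(text for _, text, _ in results)
```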

Stage 3: Scryfall Resolves the Card

Once OCR produces a candidate title, the project calls the Scryfall API for fuzzy lookup and metadata retrieval.

This is a powerful design choice because it outsources a large body of card knowledge to a specialized external service:

  • canonical names
  • oracle text
  • prices
  • printings
  • images

The local system does not need to model that entire domain from scratch. It only has to turn vision output into a robust lookup query. That is a much narrower job.
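Scryfall's documented fuzzy-named endpoint makes that lookup a few lines. The error handling below is a sketch, not the project's actual service code.

```python
import requests

SCRYFALL_NAMED = "https://api.scryfall.com/cards/named"

def resolve_card(ocr_title: str) -> dict | None:
    """Fuzzy-resolve an OCR'd title to a canonical Scryfall card."""
    resp = requests.get(SCRYFALL_NAMED, params={"fuzzy": ocr_title}, timeout=10)
    if resp.status_code != 200:
        return None  # no confident match, or a garbled title
    card = resp.json()
    return {
        "name": card["name"],
        "prices": card.get("prices", {}),
        "prints_search_uri": card.get("prints_search_uri"),
    }
```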

Stage 4: DINOv2 Disambiguates Printings

This is where the project becomes especially interesting.

Text is often enough to identify the card. It is not always enough to identify the printing. Different printings can share the same name and rules text while still differing in art, set, and numbering.

That is why the art detection class matters so much. The art crop becomes the input to a DINOv2 embedding comparison against candidate printings.

Conceptually:

  1. crop the art region from the input image
  2. embed that crop into vector space
  3. embed candidate printing art images
  4. compare vectors with cosine similarity
  5. select the closest match

That is an elegant example of mixing specialized models instead of demanding one model do everything.
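A minimal sketch of that comparison, assuming the small DINOv2 backbone from torch.hub; the project's actual variant and preprocessing may differ.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Small DINOv2 backbone via torch.hub (assumption: the project may use
# a different variant or weights).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),  # 224 is divisible by the 14px patch size
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    x = preprocess(image.convert("RGB")).unsqueeze(0)
    return F.normalize(model(x), dim=-1)  # unit-length CLS embedding

def best_printing(art_crop, candidate_images):
    """Return the index of the candidate art closest to the input crop."""
    query = embed(art_crop)
    sims = [F.cosine_similarity(query, embed(c)).item() for c in candidate_images]
    return max(range(len(sims)), key=sims.__getitem__)
```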

Printing disambiguation flow
OCR narrows the candidate set, but DINOv2 still has to compare the art crop against multiple candidate printings before the product can return the exact card version.

Error Propagation Is the Real Challenge

Pipelines are powerful because they decompose a hard problem. They are also dangerous. Every stage can compound failure.
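A back-of-the-envelope illustration, with made-up per-stage numbers: even four individually strong stages multiply into a noticeably weaker whole.

```python
import math

# Hypothetical per-stage success rates, purely for illustration.
stage_success = {"detection": 0.98, "ocr": 0.95, "scryfall": 0.97, "dinov2": 0.96}

end_to_end = math.prod(stage_success.values())
print(f"end-to-end success ≈ {end_to_end:.3f}")  # ≈ 0.867
```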

This means the project cannot evaluate stages in isolation forever. It eventually has to reason about end-to-end reliability. That is where product quality lives.

The practical implication is simple. The best improvement is not always in the most glamorous stage. A slightly better crop can beat a much fancier matching model if it removes noise early.

Why the Service Split Is Strong

The web/services/ directory reflects sound decomposition:

  • detection.py
  • ocr.py
  • scryfall.py
  • image_match.py

That structure keeps responsibilities narrow and debuggable. If card resolution fails, the team can inspect which boundary broke:

  • Did detection miss the title?
  • Did OCR misread the crop?
  • Did Scryfall return poor fuzzy matches?
  • Did DINOv2 rank the wrong art highest?

This is exactly how production systems should be designed.
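One sketch of what that debuggability can look like in practice: an orchestrator that names the boundary that failed. The callables stand in for the four service modules; their real interfaces in web/services/ are not shown in the source and may differ.

```python
def identify(image, detect, read_title, lookup_card, match_printing):
    """Hypothetical orchestrator. Each callable stands in for one service
    module (detection, ocr, scryfall, image_match)."""
    regions = detect(image)
    if regions is None:
        return {"error": "detection: no title or art region found"}

    title = read_title(regions["title_crop"])
    if not title:
        return {"error": "ocr: title crop was unreadable"}

    card = lookup_card(title)
    if card is None:
        return {"error": f"scryfall: no fuzzy match for {title!r}"}

    printing = match_printing(regions["art_crop"], card["printings"])
    if printing is None:
        return {"error": "image_match: no confident art match"}

    return {"card": card, "printing": printing}
```

Returning the failing stage by name is what makes "which boundary broke?" a one-glance question instead of a debugging session.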

Web application demo
The real application flow shows the full chain working together: upload, detect, read, resolve, and return a usable card result instead of raw model output.

The Best Architectural Lesson

The project succeeds here because it does not confuse "AI-powered" with "single-model." It composes the right specialized tools:

  • detector for spatial structure
  • OCR for text
  • API lookup for canonical metadata
  • embedding model for visual similarity

That compositional mindset is more valuable than any single library choice.

Conclusion

The final user experience feels simple because the architecture is layered, not because the problem is simple.

Detection is the gateway. Identification is the orchestration that follows. That is the difference between a model demo and a working product.

Final identification screenshot
The finished interface makes the argument concrete: multiple model outputs and external services are fused into one user-facing answer.

In the final article, we look at how this system is exposed through the web application and what it takes to ship it responsibly.
