Object Detection from Scratch: Part 6 - Shipping the System

August 15, 2026 · 22 min read

Part 6 closes the series by looking at the web application, deployment boundaries, correction workflow, limitations, operational tradeoffs, and what it takes to ship the detector as a usable system.


Most ML projects look complete when the model works. Most users disagree.

The final job is not training. It is shipping. That means packaging the pipeline behind stable interfaces, exposing it through a usable frontend, handling failure gracefully, and creating a path for the system to improve after release.

The Web Layer Is Where the Project Becomes Real

The web/ directory is the bridge from internal scripts to a product surface:

  • web/app.py orchestrates the API
  • web/services/ encapsulates the pipeline dependencies
  • web/static/ provides the browser experience

That separation matters because it keeps the product boundary explicit. Training scripts can evolve independently from the service layer as long as the served model contract stays stable.
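To make that concrete, here is a minimal sketch of what the service boundary can look like. The class name, checkpoint argument, and Detection shape are illustrative assumptions, not the repo's actual code:

```python
# web/services/detector.py -- a minimal sketch; names and shapes here are
# illustrative assumptions, not the repo's actual API.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

class DetectorService:
    """Stable contract: callers receive Detection objects, never raw tensors.

    Training code can swap architectures or frameworks freely as long as
    detect() keeps returning this shape.
    """

    def __init__(self, checkpoint_path: str) -> None:
        # Framework-specific loading stays behind this boundary.
        self.checkpoint_path = checkpoint_path
        self._model = None  # loaded lazily in a real service

    def detect(self, image_bytes: bytes) -> list[Detection]:
        # Preprocess -> inference -> post-process, all hidden from callers.
        raise NotImplementedError("inference elided in this sketch")
```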

FastAPI as the Product Shell

FastAPI is a practical choice for this project because it gives the pipeline:

  • explicit request/response boundaries
  • easy integration with Python-based ML dependencies
  • clean service composition
  • a path to validation, logging, and future monitoring

The important insight is that the API does not merely expose a detector. It exposes the full orchestration result. That keeps frontend complexity low and centralizes pipeline logic where it can be tested and observed.
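A sketch of what that orchestration endpoint might look like. The route, helper names, and response fields below are assumptions for illustration, not the repo's actual app.py:

```python
# Illustrative sketch of an orchestration endpoint; route, helper names,
# and response fields are assumptions.
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

@app.post("/detect")
async def detect(file: UploadFile) -> dict:
    image_bytes = await file.read()
    if not image_bytes:
        raise HTTPException(status_code=400, detail="empty upload")

    # One endpoint returns the whole orchestration result, not just boxes:
    # detect -> crop -> OCR -> Scryfall candidates -> art-similarity ranking.
    detections = run_detector(image_bytes)
    cards = [identify_card(image_bytes, d) for d in detections]
    return {"detections": detections, "cards": cards}

def run_detector(image_bytes: bytes) -> list[dict]:
    # Placeholder for the detector service call; elided in this sketch.
    return []

def identify_card(image_bytes: bytes, detection: dict) -> dict:
    # Placeholder for crop + OCR + lookup + art match; elided in this sketch.
    return {"detection": detection, "match": None}
```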

[Image: Final product screenshot]
Shipping means a user can move from raw camera input to a trustworthy answer in one product surface, not across disconnected scripts.

The Frontend Has Two Jobs

The static frontend supports both upload and live camera flows.

Those two modes imply different product concerns:

  • upload mode optimizes for clarity and reproducibility
  • live mode optimizes for responsiveness and trust

A good vision product should show users enough intermediate evidence to trust the result. Bounding boxes, cropped regions, and panel details help a lot here. The repo already leans in that direction.
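One way to make that evidence part of the contract rather than a UI afterthought is to return it explicitly in the response schema. A sketch with assumed field names, not the repo's actual schema:

```python
# Sketch of a response schema that surfaces intermediate evidence;
# every field name here is an assumption, not the repo's actual contract.
from pydantic import BaseModel

class CardEvidence(BaseModel):
    box: tuple[float, float, float, float]  # where the card was found
    crop_url: str                           # cropped region the user can inspect
    ocr_title: str | None                   # what OCR read from the title bar
    candidates: list[str]                   # Scryfall candidates considered
    best_match: str | None                  # final answer, backed by the above

class DetectResponse(BaseModel):
    cards: list[CardEvidence]
```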

[Image: Webcam flow example]
A live webcam frame with detected regions is enough to prove the UI requirement: users trust the system more when they can see intermediate evidence instead of only a final answer.

Operational Boundaries Matter

This system relies on several external or heavyweight components:

  • model inference
  • OCR runtime
  • Scryfall availability
  • DINOv2 embedding comparisons

That creates operational questions the codebase should keep answering:

  • Which steps are local and which are remote?
  • What happens when the API lookup fails?
  • How should timeouts and retries behave?
  • Which intermediate results can still be shown when a later stage fails?

These are shipping questions, not academic questions.
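They deserve answers in code. Here is one hedged sketch of the Scryfall boundary: a bounded timeout, a small retry budget, and graceful degradation so a failed lookup still leaves boxes, crops, and OCR text to show. The timeout and retry numbers are illustrative choices, and lookup_card is a hypothetical helper; the endpoint is Scryfall's public fuzzy-name lookup:

```python
# Sketch: bounded timeout, small retry budget, graceful degradation for the
# remote card lookup. Numbers are illustrative; lookup_card is a hypothetical
# helper, not the repo's actual function.
import httpx

async def lookup_card(title: str, retries: int = 2) -> dict | None:
    """Return Scryfall data, or None so earlier stages can still render."""
    for attempt in range(retries + 1):
        try:
            async with httpx.AsyncClient(timeout=3.0) as client:
                resp = await client.get(
                    "https://api.scryfall.com/cards/named",
                    params={"fuzzy": title},
                )
                resp.raise_for_status()
                return resp.json()
        except httpx.HTTPError:
            if attempt == retries:
                return None  # degrade: show boxes, crops, OCR without a match
    return None
```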

The Correction Loop Belongs in Production Thinking

One of the strongest parts of the repo is the explicit annotation-correction workflow. That should not be treated as a side document. It is the system's learning loop.

This is the mature answer to model drift and blind spots. You do not just train once and hope. You build the path by which the product teaches the dataset where it is weak.
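A minimal sketch of how the service layer can feed that loop: queue low-confidence or user-flagged predictions for relabeling. The directory layout and confidence threshold are illustrative assumptions:

```python
# Sketch: capture production misses for the annotation-correction workflow.
# The directory layout and threshold are illustrative assumptions.
import json
import time
from pathlib import Path

CORRECTION_QUEUE = Path("data/corrections/pending")

def queue_for_correction(image_bytes: bytes, prediction: dict,
                         user_flagged: bool = False) -> None:
    """Save frames worth relabeling so they can re-enter training."""
    if not user_flagged and prediction.get("confidence", 1.0) >= 0.5:
        return  # confident and unchallenged: nothing to learn here
    CORRECTION_QUEUE.mkdir(parents=True, exist_ok=True)
    stamp = f"{time.time():.0f}"
    (CORRECTION_QUEUE / f"{stamp}.jpg").write_bytes(image_bytes)
    (CORRECTION_QUEUE / f"{stamp}.json").write_text(json.dumps(prediction))
```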

[Image: Active learning cycle]
The correction loop is not a side quest. Production mistakes are collected, re-labeled, and folded back into training so the deployed system gets stronger on real edge cases.

Limitations Should Be Stated Plainly

The repo already contains enough information to state realistic limitations:

  • small regions are harder to localize tightly
  • annotation quality appears to limit strict metrics
  • OCR quality depends on clean title crops
  • printing disambiguation is only as strong as candidate retrieval and art similarity
  • cloud scaling can cost more than it helps if the data ceiling is unchanged

That kind of honesty is a strength. Users trust systems more when their limits are legible.

Where the Next Wins Probably Are

If I were continuing this project, the highest-value next steps would likely be:

  • improve or correct labels for small difficult classes
  • add more targeted examples for real webcam edge cases
  • instrument the end-to-end pipeline for stage-by-stage failure rates (sketched below)
  • evaluate latency and caching around Scryfall and art matching
  • measure real user errors, not only validation metrics

Notice how few of those ideas begin with "train a much bigger model." That is deliberate.
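For the instrumentation item, here is a small sketch of what stage-by-stage failure counting can look like. The stage names and in-memory counters are assumptions; a real deployment would export these to whatever metrics backend is in place:

```python
# Sketch: per-stage success/failure counters for the end-to-end pipeline.
# Stage names and the counter backend are illustrative assumptions.
from collections import Counter
from contextlib import contextmanager

stage_ok: Counter[str] = Counter()
stage_failed: Counter[str] = Counter()

@contextmanager
def stage(name: str):
    """Wrap a pipeline stage so failure rates are measurable per stage."""
    try:
        yield
        stage_ok[name] += 1
    except Exception:
        stage_failed[name] += 1
        raise  # let the caller decide how to degrade

# Usage: wrap each hop so "where does it break?" has a numeric answer.
# with stage("detection"): boxes = run_detector(img)
# with stage("ocr"):       title = read_title(crop)
# with stage("scryfall"):  card = lookup_card(title)
```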

[Image: Next steps map]
The most valuable future work clusters around better data, tighter evaluation loops, and stronger product surfaces, not just a larger checkpoint.

Conclusion

Shipping a vision system means accepting that the model is only one layer in a larger contract with the user.

This project gets that right. It trains a strong detector, but it also builds the scripts, services, correction workflow, and frontend surface required to make the result usable. That is why it is a compelling engineering project and not just a benchmark entry.
