Creating Fully Automatic OCR for Japanese Games Using Machine Learning

I'm developing a tool for learning Japanese from video games. It started out as a Google Cloud Vision frontend for running OCR on games, but just today I finished setting up the initial infrastructure for fully automatic, intelligent OCR. This solves several problems, such as having to press a key every time you want to scan text, or full-screen scans picking up unnecessary text.

I built an object detection model that detects dialogue boxes and the lines of text within them, then wrote a lot of logic to decide when a detected dialogue box should be scanned, based on changes in the lines it contains.
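To give a rough idea of what "deciding when to scan" can look like, here is a minimal sketch of one possible heuristic, not the actual logic in my tool: wait until the detected lines inside a dialogue box have stopped changing for a few frames (the text has finished printing), and only then scan, once per distinct state. The names `ScanTrigger`, `quantize`, `stable_frames`, and `step` are made up for this illustration.

```python
from dataclasses import dataclass

Box = tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def quantize(lines: list[Box], step: int = 8) -> tuple:
    """Snap line boxes to a coarse grid so small detector jitter is mostly ignored."""
    return tuple(sorted(tuple(round(v / step) for v in box) for box in lines))


@dataclass
class ScanTrigger:
    """Tracks the lines of one dialogue box and decides when it is worth scanning."""

    stable_frames: int = 3      # assumed threshold: frames without change before scanning
    _last_seen: tuple = ()      # quantized lines from the previous frame
    _last_scanned: tuple = ()   # quantized lines at the time of the last scan
    _streak: int = 0            # consecutive frames with identical lines

    def update(self, lines: list[Box]) -> bool:
        """Feed the line boxes from one frame; returns True when a scan should run."""
        current = quantize(lines)
        if current == self._last_seen:
            self._streak += 1
        else:
            self._last_seen = current
            self._streak = 1
        ready = (
            bool(current)                        # there is text at all
            and self._streak >= self.stable_frames  # text has stopped growing
            and current != self._last_scanned    # this state was not scanned yet
        )
        if ready:
            self._last_scanned = current
        return ready
```

You would call `update()` once per inference with the line boxes found inside a given dialogue box. While the text is still being typed out the lines keep growing, so the streak keeps resetting and nothing is scanned until the box settles.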

I created the dataset with Roboflow and trained the model on Ultralytics HUB, starting from YOLOv8n. The model is exported to OpenVINO and runs amazingly well on my Intel laptop (~35 ms per inference).

The game is captured from a virtual camera streamed from OBS, which runs in portable mode and which I control programmatically with obsws-python.

I wrapped all of this in a WebSocket server that communicates with my existing web app. I barely had to add anything new to the web app, since all the heavy lifting is done in Python and all I have to do is send an event with base64-encoded images of the areas to OCR.
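As an illustration of that kind of event, here is a small sketch using only the standard library. The message shape (an `event` name plus an `images` list) and the names `build_ocr_event`, `parse_ocr_event`, and `ocr_request` are assumptions for this example, not the actual protocol my server uses. It also assumes each cropped area has already been encoded to PNG bytes.

```python
import base64
import json


def build_ocr_event(crops: list[bytes], event: str = "ocr_request") -> str:
    """Pack already-encoded image bytes into a JSON text message."""
    images = [base64.b64encode(png).decode("ascii") for png in crops]
    return json.dumps({"event": event, "images": images})


def parse_ocr_event(message: str) -> list[bytes]:
    """Inverse of build_ocr_event: recover the raw image bytes."""
    payload = json.loads(message)
    return [base64.b64decode(img) for img in payload["images"]]
```

The resulting string can be sent as a regular text frame over the WebSocket connection. Base64 makes the payload about a third larger than the raw bytes, which is a reasonable trade-off for a few small dialogue-box crops.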