Building an OCR pipeline in Python

Rami Awar
Published in Agriplace
4 min read · Jun 27, 2023


Our two-day winter hackathon at Agriplace started with diverse teams forming around pitched ideas. Every team had a mix of marketing, sales, product, customer support, and development members. Our team wanted to experiment with extracting specific content from documents and analyzing it.

To do that, we tested several OCR methods for the extraction step and learned a few things along the way that are worth sharing. This was only one part of our hackathon project, but it would lay the foundation for our production OCR pipelines later on!

PyTesseract

Tesseract animation

As a first attempt, we used PyTesseract to test out some basic OCR. That worked OK: it supported multiple document types and multiple language packs, and it had a simple interface. We ran OCR twice per document, once to deduce the language and then again with the right language pack. That took quite some extra time, but it wasn’t a big deal since the processing didn’t have to be real-time.
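The two-pass idea can be sketched roughly as follows. This is a minimal illustration, not the code we ran at the hackathon: the stop-word lists, the `guess_language` helper, and the combined `eng+nld+deu` first pass are all assumptions made for the example. The only PyTesseract call used is `pytesseract.image_to_string(image, lang=...)`, and it is imported lazily so the language-guessing logic can be tried without Tesseract installed.

```python
import re
from typing import Callable, Tuple

# Hypothetical, tiny stop-word lists keyed by Tesseract language code.
# A real pipeline would use a proper language-detection library.
STOPWORDS = {
    "eng": {"the", "and", "of", "to", "with", "for"},
    "nld": {"de", "het", "een", "van", "voor", "niet"},
    "deu": {"der", "das", "und", "ist", "für", "nicht"},
}


def guess_language(text: str, default: str = "eng") -> str:
    """Pick the language whose stop words occur most often in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def tesseract_ocr(image_path: str, lang: str) -> str:
    """Run Tesseract on one image with the given language pack(s)."""
    import pytesseract  # imported lazily; requires Tesseract to be installed
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def two_pass_ocr(
    image_path: str,
    ocr: Callable[[str, str], str] = tesseract_ocr,
    first_pass_lang: str = "eng+nld+deu",  # assumed set of candidate languages
) -> Tuple[str, str]:
    """First pass to guess the language, second pass with that language pack."""
    rough_text = ocr(image_path, first_pass_lang)
    lang = guess_language(rough_text)
    return lang, ocr(image_path, lang)
```

Calling `two_pass_ocr("scan.png")` returns the guessed language code together with the text from the second pass. Every document is processed twice, which is where the extra time goes.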

However, when we looked into deploying it, we realized that we’d have to write an API wrapper around our little pipeline and deploy it as its own microservice so that several of our teams could use it.

We knew that Amazon Textract would fit the task perfectly. But given that we had hundreds of thousands of documents with no limit on their number of pages, we estimated that using Textract would cost us thousands of dollars for this small experiment and would limit us mainly to PDFs.

Tika

At this point we discovered Apache Tika, a content analysis toolkit that supports over a thousand file types. The Tika team has also built a containerized API wrapper that offers a simple interface and is deployable with ease.

One thing that was missing, though, was a Python client for this API. We found several out there, but after going over the Tika Server documentation, none of them offered the customization we were looking for. So we set out to build our own, starting by wrapping the core functions we were planning to use. And thus, PyTika was born! We also borrowed a cool design pattern from Golang (Functional API Options) to configure the client flexibly.
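To give a feel for the functional options pattern in Python, here is a small sketch of a Tika Server client. It is not PyTika's actual API: the class and option names (`TikaClient`, `with_base_url`, `with_ocr_language`, `with_timeout`) are made up for illustration. It assumes a Tika Server listening on its default port 9998, the `PUT /tika` endpoint for text extraction, and the `X-Tika-OCRLanguage` request header as we understand them from the Tika Server documentation.

```python
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TikaConfig:
    """Settings that the options below are allowed to modify."""
    base_url: str = "http://localhost:9998"  # Tika Server's default port
    timeout: float = 60.0
    headers: Dict[str, str] = field(
        default_factory=lambda: {"Accept": "text/plain"}
    )


# An option is just a function that tweaks the config in place.
Option = Callable[[TikaConfig], None]


def with_base_url(url: str) -> Option:
    def apply(config: TikaConfig) -> None:
        config.base_url = url.rstrip("/")
    return apply


def with_ocr_language(lang: str) -> Option:
    def apply(config: TikaConfig) -> None:
        # Header name taken from the Tika Server docs (assumption).
        config.headers["X-Tika-OCRLanguage"] = lang
    return apply


def with_timeout(seconds: float) -> Option:
    def apply(config: TikaConfig) -> None:
        config.timeout = seconds
    return apply


class TikaClient:
    """Hypothetical client; starts from defaults and applies each option."""

    def __init__(self, *options: Option) -> None:
        self.config = TikaConfig()
        for option in options:
            option(self.config)

    def extract_text(self, data: bytes) -> str:
        """Send raw file bytes to Tika Server and return the extracted text."""
        request = urllib.request.Request(
            f"{self.config.base_url}/tika",
            data=data,
            headers=self.config.headers,
            method="PUT",
        )
        with urllib.request.urlopen(request, timeout=self.config.timeout) as resp:
            return resp.read().decode("utf-8")
```

A caller would write something like `TikaClient(with_ocr_language("nld"), with_timeout(120))`. The nice part of the pattern is that new settings can be added as new option functions without changing the constructor signature.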

After processing the document outputs, we started noticing weird errors: “April” was extracted as “Apri!”, colons were captured as characters, and more. Many words were nearly correct but had one bad character somewhere, which caused errors in our content analysis down the line. We saw the same errors when using Tesseract directly (Tika uses Tesseract for OCR internally), so we knew it wasn’t a Tika problem.

However, when we ran the same documents through Amazon Textract (supposedly best-in-class OCR as of this writing), we found that it somehow avoided these errors. Either they run a better OCR model or they post-process the output and correct the results.

OCR Needs Error Correction

We learned the practical way that error-correction modules are a big factor in what makes an OCR product good or bad. There are a few papers on writing error correctors for OCR output, one of which is https://arxiv.org/pdf/1604.06225.pdf. I bet a big reason why Textract is so good is a top-tier error corrector in its document post-processing.

We started with a simple multilingual, dictionary-based error corrector, and that already fixed a lot of the errors we were seeing. Now “Apri!” gets mapped to “April”, which is the ‘closest’ word in the dictionaries provided.
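A dictionary-based corrector of this kind can be sketched in a few lines. This is an illustration rather than our exact implementation: it uses `difflib.get_close_matches` from the standard library as the notion of "closest", and the tiny vocabulary and the 0.75 similarity cutoff are assumptions picked for the example. A real setup would load full word lists for each language.

```python
import difflib
import re
import string
from typing import Iterable


def build_corrector(words: Iterable[str], cutoff: float = 0.75):
    """Return a function that fixes near-miss tokens using the given words."""
    vocabulary = {word.lower() for word in words}
    candidates = sorted(vocabulary)  # stable order for difflib

    def correct_token(token: str) -> str:
        core = token.strip(string.punctuation)
        # Leave known words (with or without surrounding punctuation) alone,
        # as well as tokens without letters such as numbers.
        if core.lower() in vocabulary or not any(c.isalpha() for c in token):
            return token
        matches = difflib.get_close_matches(
            token.lower(), candidates, n=1, cutoff=cutoff
        )
        if not matches:
            return token  # nothing close enough, keep the OCR output
        best = matches[0]
        # Restore a leading capital if the OCR token had one.
        return best.capitalize() if token[0].isupper() else best

    def correct_text(text: str) -> str:
        # Replace token by token so the original whitespace is preserved.
        return re.sub(r"\S+", lambda m: correct_token(m.group(0)), text)

    return correct_text


# Hypothetical mixed English/Dutch vocabulary.
correct = build_corrector(["april", "maart", "invoice", "factuur", "date"])
print(correct("Invoice date: 3 Apri! 2023"))  # Invoice date: 3 April 2023
```

Note that a misread token is replaced as a whole, so the stray “!” disappears together with the error. Tokens with no close match are left untouched, which keeps names and codes that are not in the dictionaries from being "corrected" into something else.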

What’s Next?

There’s still much room for improvement, but we got all we needed from this simple implementation for now. In startup and hackathon spirit, we’ll fix it when it breaks! And with some Friday afternoon food and drinks, our winter hackathon came to a close.

For future OCR cases, we could just use Amazon Textract, since we won’t be paying a massive cost upfront. Instead, the cost will be spread over years of collected documents, making it much more tolerable. And combining it with Tika gives us maximum flexibility over different file types while still benefiting from Textract’s accuracy with PDFs.
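One way such a combination could look is a small router that picks a backend per file type. This is only a sketch of the idea, not something we have built: `make_router` is a hypothetical helper, and the two extractors are passed in as plain functions so that no Textract or Tika calls are assumed here.

```python
from pathlib import Path
from typing import Callable

# An extractor takes raw file bytes and returns the extracted text.
Extractor = Callable[[bytes], str]


def make_router(
    pdf_extractor: Extractor, fallback_extractor: Extractor
) -> Callable[[str, bytes], str]:
    """Send PDFs to one backend and every other file type to another."""

    def extract(filename: str, data: bytes) -> str:
        if Path(filename).suffix.lower() == ".pdf":
            return pdf_extractor(data)
        return fallback_extractor(data)

    return extract
```

In practice, `pdf_extractor` would wrap a Textract call and `fallback_extractor` a Tika Server call.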
