Python OCR Data Retrieval
Optical character recognition (OCR) is the process of recognizing the text in an image or PDF document and converting it into machine-readable text. It is commonly used on scanned documents and photographs. Because many images and PDFs start out as scanned paper documents, converting them automatically saves time and resources. It also turns a non-editable document into editable text.
The benefits of automated OCR workflows are:
* Elimination of manual data entry
* Better accessibility and search capability
* Less physical storage space needed
* Improved accuracy
* Faster processing
* Improved productivity
Steps involved in OCR
1. Preprocessing the Image: The text, whether printed or handwritten, is first captured as a two-dimensional image using an imaging device such as a camera or scanner. The acquired image is then enhanced, for example by reducing noise and improving contrast, so that characters can be recognized more accurately.
2. Localization of text: The OCR toolkit then analyzes the layout of the enhanced image. The aim of this stage is to locate the regions of the image that contain written or printed text. Commercial toolkits such as ABBYY's do this with AI-based algorithms and ABBYY Adaptive Document Recognition Technology.
3. Identification of Characters: The recognition engine (the FineReader Engine in ABBYY's toolkit) classifies each character as a letter, numeral, or special character.
4. Character Segmentation: Once characters have been identified, they are divided into groups according to their characteristics.
5. Post-processing: Once the characters are machine-readable and segmented, the text can be exported to other file formats or processed further, for example edited or translated.
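To make the segmentation idea more concrete, here is a toy sketch of one classical approach, a vertical projection profile, which splits a line of text wherever a column contains no ink. It is only an illustration on a hand-made bitmap: the function name segment_columns and the '#' pixel convention are made up for this example, and it is not how Tesseract or any particular engine works internally.

```python
def segment_columns(bitmap):
    """Return (start, end) column spans that contain ink.

    bitmap is a list of equal-length strings; '#' marks a dark pixel.
    The end index is exclusive.
    """
    width = len(bitmap[0]) if bitmap else 0
    # Vertical projection: does column x contain any dark pixel?
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))   # the glyph ended before column x
            start = None
    if start is not None:              # glyph touching the right edge
        spans.append((start, width))
    return spans


bitmap = [
    "##..#.###",
    "#...#...#",
    "##..#.###",
]
print(segment_columns(bitmap))  # [(0, 2), (4, 5), (6, 9)]
```

Each span could then be cropped and passed to a classifier. Real documents need more care, since touching or overlapping characters do not leave a blank column between them.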
There are many different OCR tools available, and we have not found significant quality differences between them. Some of the most developer-friendly ones are Tesseract, OCRopus, and SwiftOCR. In this blog, let's focus on Tesseract OCR.
Tesseract OCR
Tesseract is an optical character recognition engine that runs on several operating systems. It was originally developed by Hewlett-Packard and is distributed as free software under the Apache License. Tesseract can be used from many programming languages and frameworks. In Python, we have a library called pytesseract.
Pytesseract
Pytesseract, or Python-tesseract, is an OCR tool for Python that wraps the Tesseract engine. It can read and recognize text in scanned documents and is commonly used for image-to-text tasks in Python.
Python-tesseract can be installed using pip as shown below.
pip install pytesseract
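Note that pytesseract is a wrapper that calls the Tesseract executable, so the engine itself has to be installed separately through your operating system's package manager or installer. A small check like the following can confirm that the executable is reachable before any OCR is attempted. The helper name tesseract_available is made up for this sketch.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    # pytesseract.pytesseract.tesseract_cmd can be set to the full path
    # of the executable when it is installed outside PATH.
    print("Tesseract executable not found; install it or set "
          "pytesseract.pytesseract.tesseract_cmd to its full path.")
```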
Once it has been installed, the user can extract text from images. For best results, the image should first be preprocessed so that the characters appear dark on a light background. Tesseract works best when the typeface is uniform, and its accuracy drops on images that have a lot of noise or distortion.
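As a rough illustration of the preprocessing described above, the sketch below uses Pillow to convert an image to grayscale, stretch its contrast, and binarize it with a fixed threshold. The helper name preprocess_for_ocr and the threshold of 128 are assumptions made for this example; real scans often need a tuned or adaptive threshold, and this is not a step that pytesseract requires.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, threshold=128):
    """Return a black-and-white copy of img (illustrative helper)."""
    gray = ImageOps.grayscale(img)        # drop colour information
    gray = ImageOps.autocontrast(gray)    # stretch contrast to the full range
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)


# Tiny synthetic example: light grey background with one dark pixel.
sample = Image.new("RGB", (4, 4), (200, 200, 200))
sample.putpixel((0, 0), (30, 30, 30))
cleaned = preprocess_for_ocr(sample)
```

The result can be passed to pytesseract.image_to_string in place of the original image.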
The code below shows how we can use pytesseract inside an Odoo model to extract text from scanned documents.
# Import modules
import io
from PIL import Image
import pytesseract
from odoo import fields, models
We import the Image class from the Pillow imaging library; it is needed to load the input file from storage as a PIL image object. We also import pytesseract, io, and the Odoo fields and models modules.
class OCRBlog(models.Model):
    _name = "ocr.blog"

    # Binary field to read document
    image = fields.Binary("Document", required=True,
                          help="Upload .jpg, .jpeg, .png or .pdf files")
    # Text field to assign extracted data
    extracted_data = fields.Text()

    def action_submit(self):
        # The method works on a single record because it uses self.id
        self.ensure_one()
        # File fetching from ir.attachment model
        file_attachment = self.env["ir.attachment"].search([
            '|', ('res_field', '!=', False), ('res_field', '=', False),
            ('res_id', '=', self.id), ('res_model', '=', 'ocr.blog'),
        ], limit=1)
        # Getting the file path
        file_path = file_attachment._full_path(file_attachment.store_fname)
        with open(file_path, mode='rb') as f:
            binary_data = f.read()
        # Create an image object of PIL library
        img = Image.open(io.BytesIO(binary_data))
        # Pass image into pytesseract module
        text = pytesseract.image_to_string(img) + "\n"
        self.extracted_data = text
Here, the image file is uploaded through a binary field. The method looks up the stored attachment, resolves its path on disk, and opens the file in binary read mode. It then creates a PIL image object from the bytes and converts the image into text using the pytesseract library.
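As a possible alternative to going through ir.attachment, the value of an Odoo Binary field can usually be decoded directly, since Odoo returns it base64-encoded. The sketch below assumes the record is read without the bin_size context key, and the helper name binary_field_to_image is made up for this example.

```python
import base64
import io

from PIL import Image


def binary_field_to_image(b64_value):
    """Decode a base64 payload (as returned by a Binary field) into a PIL image."""
    raw = base64.b64decode(b64_value)
    img = Image.open(io.BytesIO(raw))
    img.load()  # decode now so a corrupt upload fails here, not later
    return img
```

Inside action_submit, this could be called as binary_field_to_image(self.image) and the result passed to pytesseract.image_to_string.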
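Similarly, we can extract the content of a PDF file. Note that Pillow cannot open PDF files directly, so each page must first be rendered to an image before it is passed to pytesseract.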
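A minimal sketch of the PDF case is shown below. It assumes the third-party pdf2image library (which in turn needs Poppler installed) to render pages to images; this is one option among several, not something the original code uses. The function name extract_text_from_pdf and its render_pages and ocr_page parameters are hypothetical, added so the two steps can be swapped or tested independently.

```python
def extract_text_from_pdf(pdf_bytes, render_pages=None, ocr_page=None):
    """OCR every page of a PDF and join the results (illustrative sketch)."""
    if render_pages is None:
        # Assumption: pdf2image and Poppler are installed.
        from pdf2image import convert_from_bytes
        render_pages = convert_from_bytes
    if ocr_page is None:
        import pytesseract
        ocr_page = pytesseract.image_to_string

    pages = render_pages(pdf_bytes)   # one PIL image per PDF page
    return "\n".join(ocr_page(page) for page in pages)
```

In the Odoo method above, the bytes read from the attachment could be passed to this function whenever the uploaded file is a PDF.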