Simple Text Extraction Using Python And Tesseract OCR

June 3, 2022

Introduction

Hello! In this quick tutorial I’ll show how to create a simple program using Python and Tesseract OCR that can extract text from image files.

What is Tesseract?

Tesseract is an open source OCR (Optical Character Recognition) engine that can recognize multiple languages.

An OCR engine can save time by digitizing documents so that their content does not have to be typed out manually.

Installing Tesseract and the language data

The installation method will vary depending on the operating system you use.

Instructions on how to download Tesseract can be found here:
https://tesseract-ocr.github.io/tessdoc/Downloads.html

You will also need the trained data for the language you are dealing with. Pre-trained data can be found via the link below, or downloaded via the command line, depending on the operating system you are using.
https://github.com/tesseract-ocr/tessdata
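
To check which languages are already installed, one option is to list the ".traineddata" files in your tessdata directory. The following is a minimal sketch and not part of the tutorial's program: the helper name installed_languages is hypothetical, and the directory you pass in is an assumption, since its location depends on how Tesseract was installed.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes of the .traineddata files in tessdata_dir."""
    # Each language is stored as <code>.traineddata, for example eng.traineddata
    return sorted(path.stem for path in Path(tessdata_dir).glob("*.traineddata"))
```

Calling installed_languages with the path to your tessdata directory returns a sorted list of language codes. An empty list suggests that the path is wrong or that no trained data has been downloaded yet.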

Initializing the Python Virtual Environment

Creating and activating a virtual environment can be done via the following commands:

python3 -m venv env
source env/bin/activate

Coding the program

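Next we need to actually write some Python! The code below uses the pytesseract and opencv-python packages, so install them with pip in the activated environment first.
Open up “main.py” in your preferred IDE and add the following: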

Importing the modules and setting the Tesseract command.

Please note that the Tesseract command will again vary based on the operating system you are using. (For the record, I am using Linux.)

import argparse
import pytesseract
import cv2

# Path to the location of the Tesseract-OCR executable/command
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
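
If you would rather not hard-code the path, one option is to look the executable up on your PATH and only fall back to a fixed location. This is a small sketch under that assumption. The helper name find_tesseract is hypothetical, and the default path is just the Linux location used above.

```python
import shutil


def find_tesseract(default="/usr/bin/tesseract"):
    """Return the tesseract executable found on PATH, or the default path."""
    # shutil.which returns None when no matching executable is on PATH
    return shutil.which("tesseract") or default
```

The result can then be assigned to pytesseract.pytesseract.tesseract_cmd in place of the fixed string.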

Next we need to create a method that reads text from an image and saves it into a file called “results.txt”.

def read_text_from_image(image):
  """Reads text from an image and appends the found text to results.txt"""
  # Convert the image to grayscale
  gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

  # Perform Otsu thresholding (inverted, so dark text becomes white on black)
  ret, thresh = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

  # Dilate with a large rectangular kernel so nearby characters merge into blocks
  rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))
  dilation = cv2.dilate(thresh, rect_kernel, iterations = 1)

  # Find the outer contour of each block of text
  contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

  image_copy = image.copy()

  # Open the results file once; the with block closes it even if no text is found
  with open("results.txt", "a") as file:
    for contour in contours:
      # Crop the bounding box of the block and run OCR on it
      x, y, w, h = cv2.boundingRect(contour)
      cropped = image_copy[y : y + h, x : x + w]

      text = pytesseract.image_to_string(cropped)

      file.write(text)
      file.write("\n")

This method takes an image, converts it to grayscale, thresholds and dilates it to find blocks of text, then runs Tesseract on each block and appends the recognized text to the results file.
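
To give a feel for what the Otsu threshold step does, here is a pure Python sketch of the idea: pick the gray level that best separates the pixels into a dark group and a light group, then invert so that dark pixels end up white. It is for illustration only. The names otsu_threshold and binarize_inverted are hypothetical, it works on a flat list of 8-bit values rather than a real image, and OpenCV's own implementation may choose a slightly different threshold when several levels are equally good.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit grayscale values."""
    # Count how often each gray level (0-255) occurs
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        # Pixels at or below this level form the "low" class
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue
        if weight_high == 0:
            break
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize_inverted(pixels, threshold):
    """Map values above the threshold to 0 and all other values to 255."""
    return [0 if value > threshold else 255 for value in pixels]
```

For a list that mixes dark text pixels with light background pixels, the threshold falls between the two groups, so binarize_inverted turns the text pixels white (255) and the background black (0). That is the form the dilation step works on.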

Finally we need to write the main method.

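if __name__ == "__main__":
  ap = argparse.ArgumentParser()
  ap.add_argument("-i", "--image", required = True, help = "Path to input file")
  args = vars(ap.parse_args())

  image = cv2.imread(args["image"])
  # cv2.imread returns None when the file cannot be read
  if image is None:
    ap.error("Could not read image: " + args["image"])
  read_text_from_image(image)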

As in my previous tutorials, the main method simply takes an image file as input and passes it on to the text extraction method.

The program can then be run via the following command:

python main.py -i text.png

If all works well you should have a “results.txt” file in your working directory that contains the text from the image.
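
The raw output can contain blank lines, and some Tesseract versions end each result with a form feed character. If that gets in the way, the text could be tidied up before it is written. This is only a sketch of one possible clean-up and the helper name clean_ocr_text is hypothetical.

```python
def clean_ocr_text(raw_text):
    """Drop form feeds, trailing spaces and blank lines from OCR output."""
    # Remove form feed characters before splitting into lines
    lines = raw_text.replace("\f", "").splitlines()
    # Keep only lines that contain something other than whitespace
    return "\n".join(line.rstrip() for line in lines if line.strip())
```

In the extraction method this would be applied to the value returned by pytesseract.image_to_string before calling file.write.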

Conclusion

Here I have shown how to create a simple program that extracts text from an image using Python and Tesseract OCR.

If you have any suggestions for improvement, please let me know.

The source code for this tutorial can be found here:
https://github.com/ethand91/python-text-extraction

Side Note

If you are following my WebRTC tutorial, please wait just a little longer. I’m currently in the process of switching jobs, so I haven’t had much time to dedicate to the project.