
Talking about text recognition software OCR

Feb 18, 2019



The task of Chinese character recognition software is to study how to make a computer "literate". The system usually uses a photoelectric conversion device to convert printed Chinese characters or a document into an electrical signal and sends it to a computer, which recognizes and reads the text automatically. Because the input is captured optically, the technique is called Optical Character Recognition (OCR).


OCR development overview


The concept of OCR was first proposed by the German scientist Tauschek in 1929. Later, the American scientist Handel also proposed the idea of using technology to recognize text. The earliest research on recognizing printed Chinese characters was done by Casey and Nagy of IBM: in 1966 they published the first paper on Chinese character recognition, which used template matching to recognize 1,000 printed Chinese characters. In the early 1970s, Japanese scholars began to study Chinese character recognition and did a great deal of work.


Research on Chinese character recognition in China started relatively late, with OCR research beginning in the late 1970s. Early OCR software did not meet practical requirements because of factors such as recognition rate and productization. In addition, the hardware was expensive and slow, so the technology did not reach a practical level. Only a few organizations, such as information departments and press and publishing units, used OCR software. After 1986, OCR research in China made great progress, with innovations in Chinese character modeling and recognition methods and fruitful results in system development and applications, and many organizations successively launched Chinese OCR products. From the 1990s onward, the widespread use of flatbed scanners and the spread of information automation and office automation in China greatly promoted the further development of OCR technology, and the recognition accuracy and speed of OCR came to meet users' requirements.


At present, many OCR packages are in common use. English OCR software mainly includes OmniPage; Chinese OCR software mainly includes Tsinghua Unisplendour OCR, Tsinghua Wentong OCR, Hanwang OCR, Zhongjing Shangshu OCR, Danqing OCR, and Mengyu OCR. Despite the large number of Chinese characters and the complexity of their fonts, OCR technology has matured. Many OCR packages can recognize not only black-and-white printed Chinese characters but also grayscale and color printed ones. Recognition is fast, and the recognition accuracy exceeds 99%. The software can recognize various typefaces such as Song and bold, including Traditional characters, as well as text that mixes several typefaces and font sizes; some OCR software can also identify images and tables. At the same time, great progress has been made in handwritten Chinese character recognition, where the correct recognition rate has reached more than 70%.


OCR software application


In the scanner market, many office and home scanners are bundled with OCR software. For example, Unisplendour scanners come with Unisplendour OCR, Zhongjing scanners come with Shangshu OCR, and Mustek scanners come with Danqing OCR. Together, the scanner and the OCR software cover the entire process from document input to text recognition.


Document scanning is common in office work. Material from newspapers, magazines, and other publications is scanned and then either recognized by OCR immediately or stored as an image file for later recognition. In both cases the image is converted into a text file or a Word file for storage.


In addition, storing and transmitting information in digital form is not only cheap and efficient but also suited to the growing needs of typesetting and network transmission. China has a large legacy of valuable paper materials such as books, newspapers, and magazines, and converting them into electronic form is urgent. For example, building an electronic library requires scanning books page by page. Letting OCR software recognize the text instead of typing it manually greatly shortens entry time, reduces labor intensity, and saves manpower and cost, while improving entry accuracy, work efficiency, and the level of office automation.


At present, the combination of OCR software and scanners is used in many fields of the information age, such as digital libraries, the recognition of various reports, and the recognition of standardized documents in the banking and tax systems. As networks and information technology develop and spread, its range of applications will continue to widen.


Composition of the OCR system


The function of Chinese character recognition software is to have a computer recognize the image of each Chinese character on a document, whether printed or handwritten, and output the corresponding character code. Chinese character recognition is therefore ultimately an image recognition problem. Because there are so many Chinese characters, in varied fonts and with complex structures, the recognition process is extremely complicated. The workflow of OCR software is shown schematically below:


Document data → scan input → image processing → layout division → text recognition → text editing → document storage


Due to the popularity and wide application of scanners, OCR software only needs to provide an interface with the scanner and use the scanner driver software. Therefore, the OCR software is mainly composed of an image processing module, a layout division module, a text recognition module, and a text editing module.
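The workflow and module list above can be read as a chain of stages, each taking the output of the previous one. The following is only a minimal sketch of that idea: the stage functions are hypothetical placeholders that log their own names, and none of the names (`make_stage`, `run_ocr`, `MODULES`) come from any real OCR product.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]

def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only logs its own name."""
    def stage(log: List[str]) -> List[str]:
        return log + [name]
    return stage

# The four modules named in the text, in processing order.
MODULES: List[Stage] = [
    make_stage("image processing"),
    make_stage("layout division"),
    make_stage("text recognition"),
    make_stage("text editing"),
]

def run_ocr(scanned_file: str) -> List[str]:
    """Feed a scanned file through every module in order."""
    log = [f"scan input: {scanned_file}"]
    for module in MODULES:
        log = module(log)
    return log + ["document storage"]

if __name__ == "__main__":
    print(" -> ".join(run_ocr("page1.png")))
```

In a real system each placeholder would be replaced by a function that transforms image or text data, but the ordering of the stages would stay the same.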


1. Image processing module

The image processing module mainly provides functions such as document scanning, image scaling, and image rotation. After a document is input through the scanner it becomes an image file. The image processing module can then enlarge the image and remove stains and scratches. If the image is not oriented correctly, it can be rotated manually or automatically. All of this creates better conditions for text recognition and raises the recognition rate.
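Two of these operations, rotation and stain removal, can be illustrated on a tiny black-and-white image. This is a sketch under simplifying assumptions: the image is a non-empty rectangular list of rows with 1 for ink and 0 for background, and a "stain" is taken to be any ink pixel with no ink among its eight neighbours. The function names are illustrative only.

```python
from typing import List

Image = List[List[int]]  # 1 = ink (black), 0 = background (white)

def rotate_90_clockwise(image: Image) -> Image:
    """Rotate a binary image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def remove_isolated_pixels(image: Image) -> Image:
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned
```

Applying `rotate_90_clockwise` two or three times gives the 180° and 270° rotations. Real scans are grayscale or color and need more careful filtering, so the isolated-pixel rule here should be seen as the simplest possible stand-in.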


2. Layout division module

The layout division module mainly covers dividing the layout and modifying that division, that is, understanding the layout, segmenting the text, normalization, and so on. Layout division can be automatic or manual. Its purpose is to tell the OCR software how to separate the articles, tables, and other elements on the same page so that they can be processed separately, and in what order to process them.
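One simple way to divide a page automatically is to look for bands of blank rows between blocks. The sketch below assumes a binary image (1 for ink, 0 for background) and a hypothetical `min_gap` parameter giving the number of blank rows that counts as a block boundary. It handles only vertically stacked blocks, not multi-column layouts.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background

def split_into_blocks(image: Image, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (first_row, end_row_exclusive) for each block, top to bottom.

    Two inked rows belong to different blocks when at least
    min_gap blank rows lie between them.
    """
    blocks = []
    start = None     # first inked row of the current block
    last_ink = None  # most recent inked row seen
    for y, row in enumerate(image):
        if not any(row):
            continue
        if start is None:
            start = y
        elif y - last_ink > min_gap:
            blocks.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks
```

Because the blocks are returned from top to bottom, the list order can also serve as a default recognition order, which a user could then change manually.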


3. Text recognition module

The text recognition module is the core of the OCR software. A simplified diagram of the recognition process is shown below. The module "reads" the input Chinese characters, but it cannot read several lines at once: the text must first be cut into lines. Chinese text is then usually recognized one character at a time (single-character recognition), with each character normalized before recognition. The module extracts features from the character samples, completes the recognition, automatically flags suspicious characters, and offers functions such as contextual association with the preceding and following characters.


Scanned original → line cutting → character cutting → normalization → recognition feature extraction → character recognition ┐

                                └-→ Pre-classification feature extraction → Feature library (dictionary) → Output manuscript
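The cutting, normalization, and recognition steps can be sketched on a toy binary image. This example makes several assumptions that are not from the article: lines and characters are separated by completely blank rows and columns, normalization is a nearest-neighbour resample to a small square, and recognition is plain template matching by counting differing pixels. Real Chinese characters often contain internal gaps, so blank-column cutting alone would split many of them.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background

def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) ranges where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def cut_lines(page: Image) -> List[Image]:
    """Line cutting: split the page at rows that contain no ink."""
    row_profile = [sum(row) for row in page]
    return [page[top:bottom] for top, bottom in runs_of_ink(row_profile)]

def cut_characters(line: Image) -> List[Image]:
    """Character cutting: split a line at columns that contain no ink."""
    column_profile = [sum(column) for column in zip(*line)]
    return [[row[left:right] for row in line]
            for left, right in runs_of_ink(column_profile)]

def normalize(glyph: Image, size: int) -> Image:
    """Crop to the ink bounding box, then resample to size x size."""
    rows = [y for y, row in enumerate(glyph) if any(row)]
    cols = [x for x, col in enumerate(zip(*glyph)) if any(col)]
    cropped = [row[cols[0]:cols[-1] + 1] for row in glyph[rows[0]:rows[-1] + 1]]
    height, width = len(cropped), len(cropped[0])
    return [[cropped[y * height // size][x * width // size] for x in range(size)]
            for y in range(size)]

def recognize(glyph: Image, templates: Dict[str, Image]) -> str:
    """Template matching: choose the template with the fewest differing pixels."""
    size = len(next(iter(templates.values())))  # templates assumed square
    candidate = normalize(glyph, size)
    def distance(a: Image, b: Image) -> int:
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda label: distance(candidate, templates[label]))

def read_page(page: Image, templates: Dict[str, Image]) -> List[str]:
    """Recognize every character on every line of the page."""
    return ["".join(recognize(glyph, templates) for glyph in cut_characters(line))
            for line in cut_lines(page)]
```

Here the `templates` dictionary plays the role of the feature library: each label maps to an already normalized reference image. Production systems extract more robust features than raw pixels and pre-classify characters to avoid comparing against every entry.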


4. Text editing module

The text editing module is mainly used to correct and edit the text recognized by OCR. Where the system suspects that a character was recognized incorrectly, it displays that character in bold red or blue and offers similar characters to choose from. The edited text is then output.
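The flagging behaviour described here can be sketched as follows. The example assumes the recognizer returns, for each character position, a list of candidates with confidence scores, best first. The scores, the 0.8 threshold, and the use of square brackets in place of colored text are all illustrative assumptions.

```python
from typing import List, Tuple

Candidates = List[Tuple[str, float]]  # (character, confidence), best first

def mark_suspicious(results: List[Candidates], threshold: float = 0.8) -> str:
    """Join the best candidates, wrapping low-confidence ones in [ ]."""
    pieces = []
    for candidates in results:
        best_char, best_conf = candidates[0]
        pieces.append(f"[{best_char}]" if best_conf < threshold else best_char)
    return "".join(pieces)

def alternatives(candidates: Candidates) -> List[str]:
    """Similar characters offered to the proofreader, excluding the current choice."""
    return [char for char, _ in candidates[1:]]

if __name__ == "__main__":
    results = [
        [("c", 0.99)],
        [("a", 0.60), ("o", 0.30)],
        [("t", 0.95)],
    ]
    print(mark_suspicious(results))   # prints c[a]t
    print(alternatives(results[1]))   # prints ['o']
```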


How to use OCR software


Although there are many OCR packages, they are all used in much the same way: first scan the document, then run recognition. The procedure is as follows:


1. Document scanning

Documents can be scanned directly from within the OCR software. After the OCR software is started, its main window appears.


Place the document on the scanner glass with the side to be scanned facing the glass and the top of the document pointing downward, align it with the edge of the ruler, and close the scanner lid. Click the "Scan" button in the window to open the scanner driver software and start scanning. The scanning procedure itself is not described here, but note that the resolution can be set to 200 to 400 dpi, and that for text documents adjusting the brightness is critical.
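To get a feel for what the 200 to 400 dpi range means, the short calculation below converts a page size and a character size into pixels. The A4 page (210 mm × 297 mm) and the 12-point character are assumed examples, not values from the article. The conversion itself uses the standard definitions of 25.4 mm per inch and 72 points per inch.

```python
from typing import Tuple

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72

def scan_size_pixels(width_mm: float, height_mm: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

def glyph_height_pixels(point_size: float, dpi: int) -> float:
    """Approximate height in pixels of a character of the given point size."""
    return point_size * dpi / POINTS_PER_INCH

if __name__ == "__main__":
    for dpi in (200, 300, 400):
        print(dpi, scan_size_pixels(210, 297, dpi), glyph_height_pixels(12, dpi))
```

At 300 dpi an A4 page is 2480 × 3508 pixels and a 12-point character is about 50 pixels tall. Lower resolutions leave fewer pixels per character for the recognizer to work with, while higher ones increase file size and processing time.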


2. OCR recognition

For ease of operation, select Options from the menu. A set of tool icons then appears on the left side of the window.

The icons on the left side of the screen are, from top to bottom:


"Zoom In" tool: enlarges the image.

"Zoom Out" tool: reduces the image.

"Set Recognition Area" tool: sets the recognition area.

"Set Recognition Order" tool: sets the recognition order.

"Delete Recognition Area" tool: deletes a recognition area.

"Erase Image Noise" tool: erases an area of the image.

"Rotate Image" tool: rotates the image by 90°, 180° or 270°.

"Tilt Correction" tool: corrects image tilt manually.


General steps for OCR identification:


(1) After the document is scanned, the text to be recognized appears very small in the window. First select the "Zoom In" tool to enlarge the image until it is clear. If necessary, select the "Zoom Out" tool to reduce it again.

(2) If the image needs to be rotated by 90°, 180° or 270°, use the "Rotate Image" tool. If the text is tilted, use the "Tilt Correction" tool to straighten it.

(3) Select the "Set Recognition Area" tool and draw a frame around the area to be recognized. Several areas can be framed if the page requires it. If an area is framed incorrectly, use the "Delete Recognition Area" tool to delete it.

(4) To improve the recognition rate, if a selected recognition area contains noise or unrecognizable graphics, use the "Erase Image Noise" tool to erase the noise bit by bit. To erase a whole block at once, use the "Wipe Image Block" tool.

(5) Click the "Recognize" icon. The OCR software first indicates that it is segmenting the text, then switches to the "Recognizing" screen, where the recognized text is displayed step by step, and finally switches to the "Document Proofreading" window.

Many OCR packages have a text correction function: text that may have been recognized incorrectly is displayed in a conspicuous color and can be corrected.

(6) Save the recognized text as a plain text (TXT) file or a Word RTF file.
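For the plain text case, saving the result amounts to writing the recognized lines to a file. The sketch below is not taken from any OCR package: the function name `save_as_text` and the choice of UTF-8 are assumptions. UTF-8 is used because it can represent both Simplified and Traditional Chinese characters.

```python
from pathlib import Path
from typing import List

def save_as_text(lines: List[str], path: str) -> None:
    """Write recognized lines to a UTF-8 text file, one line per row."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")

def load_text(path: str) -> List[str]:
    """Read the file back as a list of lines, e.g. for later proofreading."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```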
