Error Detection and Correction of OCR Document in Hindi Language by Vipul Chauhan

By:

Chauhan, Vipul

Contributor(s):

Harit, Gaurav

Material type: Text

Publication details: IIT Jodhpur Department of Computer Science & Engineering 2020

Description: xiii, 45p. HB

DDC classification:

006.4 C496E

Summary: Development of an OCR module for the Hindi language (the fourth most spoken language in the world) is an extensively researched topic in the document analysis community. The accuracy of the OCR systems developed so far depends on the quality of the source document, so postprocessing steps are required to correct wrongly identified words. In this thesis, we propose two methods for error detection and correction.

First, we propose a Part of Speech (POS) tagging based approach to identify and correct nonword errors jointly. POS tagging has proved to be effective in sentence correction, as it considers the semantics of the sentence while correcting it. We apply the Viterbi decoding scheme to extract the best possible correction among the candidate words. The correction scheme considers the n-gram probabilities and transition probabilities for tags and the emission probability of the candidate word given the tag. We propose a Joint Viterbi Decoding Scheme that benefits from POS tagging and error correction jointly.

Second, we propose a postprocessing scheme that combines error pattern learning and language modelling. Statistical error pattern learning favours candidate words based on frequent OCR character confusions. A statistical n-gram language model is used to score each candidate word based on the sentential context of the error. We jointly address the tasks of error detection and correction. The method can detect and correct real-word errors.

Additionally, we address the issue of handling unseen correct words and errors: the statistical language model is replaced with an LSTM neural language model. fastText embeddings are used in the neural language model; because fastText represents a word as a bag of character n-grams, it can embed words not seen during training. This shows some positive results in handling correct out-of-vocabulary (OOV) words.
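The joint decoding idea in the summary can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `joint_viterbi`, the probability tables, the floor value for unseen events, and the romanized toy tokens are all assumptions made for illustration. Each state is a (candidate word, tag) pair, and a path is scored by the tag transition probability, the emission probability of the word given the tag, and a word bigram probability.

```python
import math
from itertools import product


def joint_viterbi(candidates, tags, trans, emit, bigram, floor=1e-8):
    """Pick one (word, tag) pair per OCR token by Viterbi search.

    candidates: list of non-empty candidate-word lists, one per token.
    trans[(prev_tag, tag)], emit[(tag, word)], bigram[(prev_word, word)]
    are probabilities; missing entries fall back to `floor` (an assumption).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # "<s>" marks the sentence start for both the word and the tag.
    prev = {("<s>", "<s>"): (0.0, [])}
    for cands in candidates:
        cur = {}
        for word, tag in product(cands, tags):
            emission = logp(emit, (tag, word))
            best_score, best_path = -math.inf, None
            # Keep only the best-scoring predecessor for this state.
            for (p_word, p_tag), (score, path) in prev.items():
                s = (score + logp(trans, (p_tag, tag))
                     + logp(bigram, (p_word, word)) + emission)
                if s > best_score:
                    best_score, best_path = s, path
            cur[(word, tag)] = (best_score, best_path + [(word, tag)])
        prev = cur
    return max(prev.values(), key=lambda v: v[0])[1]


# Hypothetical toy data: the second token has two candidate corrections.
candidates = [["ram"], ["ghar", "ghat"], ["gaya"]]
tags = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.8, ("<s>", "VERB"): 0.2,
         ("NOUN", "NOUN"): 0.4, ("NOUN", "VERB"): 0.6,
         ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.5}
emit = {("NOUN", "ram"): 0.5, ("NOUN", "ghar"): 0.3,
        ("NOUN", "ghat"): 0.2, ("VERB", "gaya"): 0.9}
bigram = {("<s>", "ram"): 0.5, ("ram", "ghar"): 0.4, ("ram", "ghat"): 0.05,
          ("ghar", "gaya"): 0.5, ("ghat", "gaya"): 0.05}

best = joint_viterbi(candidates, tags, trans, emit, bigram)
print(best)
```

With these toy tables the decoder prefers "ghar" over "ghat" for the second token, because the bigram and emission scores both favour it. Working in log space avoids numerical underflow on longer sentences.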


Holdings
Item type	Home library	Collection	Call number	Status	Date due	Barcode	Item holds
Thesis	S. R. Ranganathan Learning Hub Course Reserve	Reference	006.4 C496E	Not For Loan		TM00194


