Error Detection and Correction of OCR Document in Hindi Language by Vipul Chauhan

Material type: Text
Publication details: IIT Jodhpur, Department of Computer Science & Engineering, 2020
Description: xiii, 45 p.; HB
DDC classification:
  • 006.4 C496E
Summary: Development of an OCR module for the Hindi language (the fourth most spoken language in the world) is an extensively researched topic in the document analysis community. The accuracy of the OCR systems developed so far depends on the quality of the source document, so postprocessing steps are required to correct wrongly identified words. In this thesis, we propose two methods for error detection and correction.

First, we propose a Part of Speech (POS) tagging based approach to identify and correct nonword errors jointly. POS tagging has proved effective in sentence correction because it considers the semantics of the sentence while correcting it. We apply the Viterbi decoding scheme to select the best correction from among the candidate words. The correction scheme considers the n-gram probabilities and transition probabilities for tags, and the emission probability of the candidate word given the tag. We propose a Joint Viterbi Decoding Scheme that benefits from performing POS tagging and error correction jointly.

Second, we propose a postprocessing scheme that combines error pattern learning and language modelling. Statistical error pattern learning favours candidate words that are consistent with frequent OCR character confusions. A statistical n-gram language model scores each candidate word based on the sentential context of the error. We address the tasks of error detection and correction jointly. The method can detect and correct real-word errors.

Additionally, to handle unseen correct words and errors, the statistical language model is replaced with an LSTM neural language model. The neural language model uses fastText embeddings, which help with such words because fastText builds each embedding from a bag of character n-grams. This shows some positive results in handling correct out-of-vocabulary (OOV) words.
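As a rough illustration of the joint decoding idea described in the summary, the sketch below runs Viterbi over (candidate word, POS tag) states, so that the correction and the tag are chosen together. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `joint_viterbi`, the probability tables, the smoothing floor, and the English placeholder words are all hypothetical, and the word n-gram term mentioned in the summary is left out for brevity.

```python
import math


def joint_viterbi(candidates, tags, trans, emit, floor=1e-8):
    """Pick one (word, tag) per position, maximising tag-transition and
    emission log-probabilities jointly.

    candidates: list of candidate-word lists, one list per token position.
    trans[(prev_tag, tag)] and emit[(tag, word)] are probabilities;
    missing entries fall back to `floor` (an assumed smoothing value).
    """
    # best maps a state (word, tag) to (log-probability, best path so far).
    best = {("<s>", "<s>"): (0.0, [])}
    for options in candidates:
        nxt = {}
        for word in options:
            for tag in tags:
                e = math.log(emit.get((tag, word), floor))
                # Best predecessor state for this (word, tag) state.
                score, path = max(
                    (
                        (lp + math.log(trans.get((pt, tag), floor)) + e, p)
                        for (_, pt), (lp, p) in best.items()
                    ),
                    key=lambda item: item[0],
                )
                nxt[(word, tag)] = (score, path + [(word, tag)])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy, made-up probabilities; English words stand in for Hindi tokens.
tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8}
emit = {
    ("DET", "the"): 0.9,
    ("NOUN", "cat"): 0.5,
    ("NOUN", "cot"): 0.01,
    ("VERB", "sat"): 0.4,
    ("VERB", "set"): 0.3,
    ("NOUN", "set"): 0.1,
}
# Each position lists the candidate corrections for one OCR token.
candidates = [["the"], ["cat", "cot"], ["sat", "set"]]
result = joint_viterbi(candidates, tags, trans, emit)
```

Because each state pairs a word with a tag, a candidate that is unlikely under every plausible tag sequence is penalised through both its emission and its transition scores, which is the intuition behind decoding the two jointly.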
Holdings
Item type: Thesis
Home library: S. R. Ranganathan Learning Hub (Course Reserve)
Collection: Reference
Call number: 006.4 C496E
Status: Not For Loan
Barcode: TM00194
Total holds: 0
