C# - OCR word recognition logic


The function below uses tessnet2 (an OCR framework) to scan through a list of words captured by an OCR function built on tessnet2. Since the pages I'm scanning are of less than perfect quality, the detection of words is not 100% accurate.

So it will sometimes confuse an 'S' with a '5', or an 'l' with a '1'. Also, it doesn't take capitalization into account, so I have to search for both cases.

The way it works is by searching for words that are close to each other on the paper. The first set of words [i] is "Abstracting Service Ordered". If the page contains those words next to each other, it moves on to the next set of words [j], and then to the next [h]. If the page contains all 3 sets of words, it returns true.

This is the best method I've thought of, but I'm hoping someone here can give me another way to try.

    public bool IsPageAbstracting(List<tessnet2.Word> wordList)
    {
        for (int i = 0; i < wordList.Count; i++) //scan through words
        {
            if ((wordList[i].Text == "Abstracting" || wordList[i].Text == "abstracting" || wordList[i].Text == "Abstractmg" || wordList[i].Text == "abstractmg" && wordList[i].Confidence >= 50)
                && (wordList[i + 1].Text == "Service" || wordList[i + 1].Text == "service" || wordList[i + 1].Text == "5ervice" && wordList[i + 1].Confidence >= 50)
                && (wordList[i + 2].Text == "Ordered" || wordList[i + 2].Text == "ordered" && wordList[i + 2].Confidence >= 50)) //find 1st tier check
            {
                for (int j = 0; j < wordList.Count; j++) //scan through words again
                {
                    if ((wordList[j].Text == "Due" || wordList[j].Text == "Oue" && wordList[j].Confidence >= 50)
                        && (wordList[j + 1].Text == "Date" || wordList[j + 1].Text == "Oate" && wordList[j + 1].Confidence >= 50)
                        && (wordList[j + 2].Text == "&" && wordList[j + 2].Confidence >= 50)) //find 2nd tier check
                    {
                        for (int h = 0; h < wordList.Count; h++) //scan through words again
                        {
                            if ((wordList[h].Text == "Additional" || wordList[h].Text == "additional" && wordList[h].Confidence >= 50)
                                && (wordList[h + 1].Text == "Comments" || wordList[h + 1].Text == "comments" && wordList[h + 1].Confidence >= 50)
                                && (wordList[h + 2].Text == "About" || wordList[h + 2].Text == "about" && wordList[h + 2].Confidence >= 50)
                                && (wordList[h + 3].Text == "This" || wordList[h + 3].Text == "this" && wordList[h + 3].Confidence >= 50)) //find 3rd tier check
                            {
                                return true;
                            }
                        }
                    }
                }
            }
        }

        return false;
    }

Firstly, there's no need for the redundant nesting of loops. Each inner loop doesn't depend on anything from its outer loop, so there's no need to pay the huge performance penalty of looping over the words N^3 times (as opposed to just 3N).
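As a rough sketch of the un-nested version, each phrase can get its own independent pass over the list. The names `SimpleWord`, `PhraseScan` and `ContainsPhrase` are made up for illustration; `SimpleWord` is only a stand-in so the sketch compiles without tessnet2, and in real code you would read `Text` and `Confidence` from `tessnet2.Word` instead. It also assumes the confidence threshold is meant to apply to every word, and it stops the scan early enough that indexing never runs past the end of the list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for tessnet2.Word so the sketch is self-contained.
public class SimpleWord
{
    public string Text;
    public double Confidence;
}

public static class PhraseScan
{
    // True if the phrase matches consecutive words somewhere in the list.
    // phrase[k] holds the accepted spellings for the k-th word of the phrase.
    public static bool ContainsPhrase(IList<SimpleWord> words, string[][] phrase, double minConfidence = 50)
    {
        // The loop bound keeps words[i + k] inside the list.
        for (int i = 0; i + phrase.Length <= words.Count; i++)
        {
            bool all = true;
            for (int k = 0; k < phrase.Length && all; k++)
            {
                SimpleWord w = words[i + k];
                all = w.Confidence >= minConfidence
                      && phrase[k].Contains(w.Text, StringComparer.OrdinalIgnoreCase);
            }
            if (all) return true;
        }
        return false;
    }

    public static bool IsPageAbstracting(IList<SimpleWord> words)
    {
        // Three independent passes: about 3N word visits instead of N^3.
        return ContainsPhrase(words, new[] { new[] { "abstracting", "abstractmg" }, new[] { "service", "5ervice" }, new[] { "ordered" } })
            && ContainsPhrase(words, new[] { new[] { "due", "oue" }, new[] { "date", "oate" }, new[] { "&" } })
            && ContainsPhrase(words, new[] { new[] { "additional" }, new[] { "comments" }, new[] { "about" }, new[] { "this" } });
    }
}
```

Because `&&` short-circuits, the second and third passes only run when the earlier phrase was found, which is the same early exit the nested version was trying to get.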

Secondly, I think there are more elegant approaches (like using a dictionary of words and calculating the best match for words that aren't in the dictionary, or other more dynamic approaches), but they would involve more complicated algorithms. A simple equivalent approach can be done using regular expressions:

    // Requires: using System.Linq; using System.Text.RegularExpressions;

    // Combine all the words into one string separated by spaces.
    // Where a word's confidence isn't high enough, substitute a placeholder
    // that none of the regexes will match.
    var text = string.Join(" ", wordList.Select(w => w.Confidence >= 50 ? w.Text : "DONTMATCH").ToArray());

    // Run the text through regular expressions that match each criterion,
    // allowing for case insensitivity and the known misidentifications.
    if (!Regex.IsMatch(text, @"abstract(in|m)g\s+(s|5)ervice\s+ordered", RegexOptions.IgnoreCase))
        return false;

    if (!Regex.IsMatch(text, @"(d|o)ue\s+(d|o)ate\s+&", RegexOptions.IgnoreCase))
        return false;

    if (!Regex.IsMatch(text, @"additional\s+comments\s+about\s+this", RegexOptions.IgnoreCase))
        return false;

    return true;

Since the algorithm is only interested in a few specific phrases, and you don't want a match when the confidence for a word is too low, we can combine all the words into one long string separated by spaces (for convenience). We then construct regular expressions that cater for the 3 phrases of interest along with their known alternatives, and simply test the concatenated string against those regular expressions.
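If the list of known alternatives grows, the patterns could be generated from a small confusion table instead of being written by hand. The following is only a sketch under that assumption: `OcrPattern`, `ForPhrase` and the contents of `Confusions` are hypothetical, and the table only covers single-character mix-ups, so a multi-character case such as "in" read as "m" would still need a hand-written alternative.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class OcrPattern
{
    // Assumed confusion table: each character maps to a character class
    // listing the readings the OCR engine is known to produce for it.
    static readonly Dictionary<char, string> Confusions = new Dictionary<char, string>
    {
        { 's', "[s5]" },
        { 'l', "[l1]" },
        { 'd', "[do]" }
    };

    // Builds a case-insensitive regex for a phrase, with \s+ between words.
    public static Regex ForPhrase(string phrase)
    {
        var sb = new StringBuilder();
        string[] words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            if (sb.Length > 0) sb.Append(@"\s+");
            foreach (char c in word.ToLowerInvariant())
            {
                string cls;
                // Use the confusion class when one exists, else the escaped literal.
                sb.Append(Confusions.TryGetValue(c, out cls) ? cls : Regex.Escape(c.ToString()));
            }
        }
        return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
    }
}
```

For example, `OcrPattern.ForPhrase("Due Date &").IsMatch(text)` would take the place of the second hand-written check. The generated patterns are looser than the originals, because every listed character is allowed to vary everywhere it occurs in the phrase.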

It is only going to cater for this very specific case, though...
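For the more general "best match" direction mentioned earlier, one common building block is edit distance (Levenshtein distance) between the OCR text and the expected word. This is a sketch of that swapped-in technique, not part of tessnet2; `FuzzyWord`, `EditDistance`, `RoughlyEquals` and the default of 2 allowed edits are all assumptions chosen for illustration.

```csharp
using System;

public static class FuzzyWord
{
    // Classic dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions and substitutions turning a into b.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Case-insensitive comparison that tolerates up to maxEdits differences.
    public static bool RoughlyEquals(string ocrText, string expected, int maxEdits = 2)
    {
        return EditDistance(ocrText.ToLowerInvariant(), expected.ToLowerInvariant()) <= maxEdits;
    }
}
```

With this, "Abstractmg" is 2 edits away from "abstracting" and "5ervice" is 1 edit away from "service", so neither misreading has to be listed explicitly. A fixed threshold is too generous for short words such as "Due", so in practice the allowed number of edits would probably need to scale with the word length.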

