Updated based on Ray's comment

https://github.com/tesseract-ocr/tessdata_best/pull/17#issuecomment-374455979
Shreeshrii 2018-03-20 10:29:25 +05:30 committed by GitHub
parent f102e00ba3
commit b3ad50c3a7


@ -2,9 +2,11 @@
This repository contains fast integer versions of trained models for the [Tesseract Open Source OCR Engine](https://github.com/tesseract-ocr/tesseract).
- Most users will want to use these traineddata files to do OCR and these will be shipped as part of Linux distributions.
- These models are a speed/accuracy compromise: the network configuration chosen was the one that offered the best "value for money" in speed vs. accuracy.
- For some languages, this is still the best model, but for most it is not.
- The "best value for money" network configuration was then integerized for further speed.
- Most users will want to use these traineddata files to do OCR, and these will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
- Fine tuning/incremental training will **NOT** be possible from these `fast` models, as they are 8-bit integer.
- It will be possible to convert a tuned `best` model to integer to make it faster, but some of the speed of the `fast` models comes from their smaller network.
- When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy `tesseract` engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
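The engine-mode restriction above can be sketched as a small helper that builds a `tesseract` command line and rejects the legacy modes. This is only an illustration: the function name `build_tesseract_command` and its defaults are hypothetical, and it assumes the standard `tesseract IMAGE OUTPUTBASE -l LANG --oem N` invocation, where mode 1 is LSTM only and mode 3 is the default.

```python
# Hypothetical helper, not part of Tesseract: builds an argument list for the CLI.
LEGACY_OEM_MODES = {0, 2}  # 0 = legacy only, 2 = legacy + LSTM; both need legacy data
LSTM_OEM_MODES = {1, 3}    # 1 = LSTM only, 3 = default (uses what is available)


def build_tesseract_command(image, output_base, lang="eng", oem=1):
    """Return the argv list for running tesseract with an LSTM-capable engine mode."""
    if oem in LEGACY_OEM_MODES:
        raise ValueError(
            f"--oem {oem} needs the legacy engine, which these files do not support"
        )
    if oem not in LSTM_OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # Example file names are placeholders.
    print(" ".join(build_tesseract_command("page.png", "out", lang="hin")))
```

The returned list can be passed to `subprocess.run` as-is; keeping it as a list avoids shell quoting problems with file names.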
## Two types of models
@ -15,13 +17,13 @@ The repository contains two types of models,
Most of the script models include English training data as well as the script, but not **Cyrillic**, as that would have a major ambiguity problem.
On Linux, the language-based traineddata packages are named `tesseract-ocr-LANG`, where LANG is the three-letter language code, e.g. `tesseract-ocr-eng` (English language), `tesseract-ocr-hin` (Hindi language), etc.
On Debian and Ubuntu, the language-based traineddata packages are named `tesseract-ocr-LANG`, where LANG is the three-letter language code, e.g. `tesseract-ocr-eng` (English language), `tesseract-ocr-hin` (Hindi language), etc.
On Linux, the script-based traineddata packages are named `tesseract-ocr-script-SCRIPT`, where SCRIPT is the four-letter script code, e.g. `tesseract-ocr-script-latn` (Latin script), `tesseract-ocr-script-deva` (Devanagari script), etc.
On Debian and Ubuntu, the script-based traineddata packages are named `tesseract-ocr-script-SCRIPT`, where SCRIPT is the four-letter script code, e.g. `tesseract-ocr-script-latn` (Latin script), `tesseract-ocr-script-deva` (Devanagari script), etc.
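The naming scheme above can be written out as a minimal sketch. The function names are hypothetical and exist only for illustration; the code simply lowercases the code and applies the two patterns quoted in the text, without checking that such a package actually exists.

```python
# Hypothetical helpers that apply the package naming patterns described above.

def language_package(lang_code):
    """Package name for a three-letter language code, e.g. 'eng' or 'hin'."""
    if len(lang_code) != 3:
        raise ValueError("expected a three-letter language code")
    return f"tesseract-ocr-{lang_code.lower()}"


def script_package(script_code):
    """Package name for a four-letter script code, e.g. 'Latn' or 'Deva'."""
    if len(script_code) != 4:
        raise ValueError("expected a four-letter script code")
    return f"tesseract-ocr-script-{script_code.lower()}"


if __name__ == "__main__":
    print(language_package("hin"))
    print(script_package("Deva"))
```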
### Data files for a particular script
Initial capitals in the filename indicate the one model for all languages in that script.
Initial capitals in the filename indicate the one model for all languages in that script. These are now available under the `script` subdirectory.
- **Latin** covers all Latin-based languages except Vietnamese (`vie`).
- **Vietnamese** is for the Latin-based Vietnamese language.
@ -30,11 +32,7 @@ Initial capitals in the filename indicate the one model for all languages in tha
### LSTM training details for different languages and scripts
For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. The approximate font counts are:
- Latin ~4500 fonts
- Devanagari ~50 fonts
- Kannada 15.
For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines, e.g. Latin ~4500 fonts, Devanagari ~50 fonts, Kannada 15.
On the theory that poor accuracy on test data and over-fitting on training data were caused by the lack of fonts, the training data has been mixed with English, so that some of the font diversity might generalize to the other script. The overall effect was slightly positive, hence the script models also include English.
@ -46,7 +44,7 @@ With a theory that poor accuracy on test data and over-fitting on training data
**'jpn_vert'** is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).
'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
**'jpn'** loads **'jpn_vert'** as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
--------------------------------