From 86db1f450b0909c662fd631828748aa608baafe9 Mon Sep 17 00:00:00 2001
From: Shreeshrii
Date: Wed, 21 Feb 2018 14:16:38 +0530
Subject: [PATCH] Update README.md

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 4604453..eb6daca 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,10 @@ The repository contains two types of models,
 
 Most of the script models include English training data as well as the script, but not **Cyrillic**, as that would have a major ambiguity problem.
 
+On Linux, the language based traineddata packages are named `tesseract-ocr-LANG` where LANG is the three letter language code eg. tesseract-ocr-eng (English language), tesseract-ocr-hin (Hindi language), etc.
+
+On Linux, the script based traineddata packages are named `tesseract-ocr-script-SCRIPT` where SCRIPT is the four letter script code eg. tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari Script), etc.
+
 ### Data files for a particular script
 
 Initial capitals in the filename indicate the one model for all languages in that script.
@@ -44,6 +48,8 @@ With a theory that poor accuracy on test data and over-fitting on training data
 
 'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
 
+--------------------------------
+
 See the [Tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files) for additional information.
 
 All data in the repository are licensed under the