# tessdata_fast – Fast integer versions of trained models
This repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.
- Most users will want to use these traineddata files to do OCR, and these will be shipped as part of Linux distributions.
- Fine tuning/incremental training will NOT be possible from these `fast` models, as they are 8-bit integer.
- It will be possible to convert a tuned `best` model to integer to make it faster, but some of the speed in `fast` comes from the smaller model.
- When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy `tesseract` engine is not supported with these files, so Tesseract's OEM modes '0' and '2' won't work with them.
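A minimal command-line sketch of the OEM restriction above; the image and output names are placeholders, and this assumes a standard Tesseract 4+ install with a fast model on the search path:

```shell
# --oem 1 selects the LSTM engine, the only engine the fast models support.
# 'page.png' and 'out' are placeholder file names.
tesseract page.png out --oem 1 -l eng

# --oem 0 (legacy engine) and --oem 2 (legacy + LSTM) will fail with these
# files, because fast models contain no legacy recognition data.
```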
## Two types of models
The repository contains two types of models,
- those for a single language and
- those for a single script supporting one or more languages.
Most of the script models include English training data as well as the script, but not Cyrillic, as that would have a major ambiguity problem.
On Linux, the language-based traineddata packages are named tesseract-ocr-LANG, where LANG is the three-letter language code, e.g. tesseract-ocr-eng (English language), tesseract-ocr-hin (Hindi language), etc.
On Linux, the script-based traineddata packages are named tesseract-ocr-script-SCRIPT, where SCRIPT is the four-letter script code, e.g. tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script), etc.
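On a Debian/Ubuntu-style system (an assumption; exact package availability depends on the distribution), these packages can be installed with the distribution package manager following the naming convention above:

```shell
# Install the English language model and the Devanagari script model.
# Package names follow the tesseract-ocr-LANG / tesseract-ocr-script-SCRIPT
# convention described above.
sudo apt-get install tesseract-ocr-eng tesseract-ocr-script-deva
```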
## Data files for a particular script
Initial capitals in the filename indicate the one model for all languages in that script.
- Latin is all Latin-based languages, except vie.
- Vietnamese is for the Latin-based Vietnamese language.
- Fraktur is basically a combination of all the Latin-based languages that have an 'old' variant.
- Devanagari is for hin+san+mar+nep+eng.
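The script models live under a script/ subdirectory of the tessdata directory, so they are selected with a script/ prefix on the language argument; the file names below are placeholders:

```shell
# Use the Devanagari script model (covers hin+san+mar+nep+eng) instead of a
# single-language model. 'scan.png' and 'out' are placeholder file names.
tesseract scan.png out -l script/Devanagari

# Language and script models can also be combined with '+':
tesseract scan.png out -l hin+script/Devanagari
```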
## LSTM training details for different languages and scripts
For Latin-based languages, the existing model data provided has been trained on about 400,000 textlines spanning about 4,500 fonts. For other scripts, not as many fonts are available, but they have still been trained on a similar number of textlines. E.g.:
- Latin ~4,500 fonts
- Devanagari ~50 fonts
- Kannada ~15 fonts
On the theory that poor accuracy on test data and over-fitting on training data were caused by the lack of fonts, the training data has been mixed with English, so that some of the font diversity might generalize to the other script. The overall effect was slightly positive, hence the script models also include the English language.
### Example - jpn and Japanese
'jpn' contains whatever appears on the www that is labelled as the language, trained only with fonts that can render Japanese.
'Japanese' contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization across the 4,500 English training fonts will also apply to the other script, which has far fewer.
'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).
'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
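The models visible to Tesseract, including vertical variants such as 'jpn_vert', can be checked from the command line; the TESSDATA_PREFIX path below is an assumption about a typical install location:

```shell
# Point Tesseract at the directory holding the traineddata files
# (adjust the placeholder path to your installation).
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata

# List every installed language and script model.
tesseract --list-langs
```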
See the Tesseract wiki for additional information.
All data in the repository are licensed under the Apache-2.0 License, see file COPYING.