Tesseract is an open-source OCR engine.

Installation and supporting tools

Install Tesseract with Homebrew

  • --with-training-tools: install the training tools
  • --all-languages: download all language packs
brew install tesseract --with-training-tools

ImageMagick is a recommended companion for image preprocessing:

brew install imagemagick

jTessBoxEditor is also recommended for editing box files.

Training

Reference: TrainingTesseract (link)

#!/bin/bash

name="my_lang"

## Generate the box file, then correct it by hand with jTessBoxEditor
tesseract "${name}.tif" "${name}" batch.nochop makebox

# ------------------- #

## Run training to produce ${name}.tr
tesseract "${name}.tif" "${name}" nobatch box.train

## Extract the character set to produce unicharset
unicharset_extractor "${name}.box"

## Clustering
#### Create font_properties
echo "${name} 0 0 0 0 0" > font_properties

#### Produce shapetable
shapeclustering -F font_properties -U unicharset "${name}.tr"

#### Produce inttemp and pffmtable
mftraining -F font_properties -U unicharset "${name}.tr"

#### Produce normproto
cntraining "${name}.tr"

## Prefix every output with the language name, as combine_tessdata expects
mv unicharset "${name}.unicharset"
mv inttemp "${name}.inttemp"
mv normproto "${name}.normproto"
mv pffmtable "${name}.pffmtable"
mv shapetable "${name}.shapetable"

#### Build the traineddata file; place it in the tessdata directory to use it
combine_tessdata "${name}."
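
Before the final combine_tessdata step it can help to confirm that all five renamed files are actually present, since a failed earlier step is easy to miss. The helper below is a minimal sketch; the function name missing_training_files is hypothetical and not part of Tesseract.

```shell
#!/bin/bash
# Hypothetical helper (not a Tesseract tool): print every training output
# that the script above renames but that is missing from the current directory.
missing_training_files() {
  local name="$1"
  local ext
  for ext in unicharset inttemp normproto pffmtable shapetable; do
    [ -f "${name}.${ext}" ] || echo "${name}.${ext}"
  done
}
```

If missing_training_files my_lang prints nothing, all five inputs exist and combine_tessdata my_lang. can be run.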

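Image preprocessing

Binarization

Depending on the image, binarize it when necessary, or convert it to grayscale first and then binarize, to remove interference.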
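
As a rough illustration of grayscale followed by binarization, the sketch below wraps ImageMagick's convert with the -colorspace and -threshold options. The function name binarize and the 50% default threshold are assumptions chosen for the example; a suitable threshold depends on the image.

```shell
#!/bin/bash
# Sketch: convert to grayscale, then binarize with a fixed threshold.
# Usage: binarize INPUT OUTPUT [THRESHOLD]
# The function name and the 50% default are assumptions, not recommendations.
binarize() {
  convert "$1" -colorspace Gray -threshold "${3:-50%}" "$2"
}
```

For example, binarize orig.jpg out.png 60% writes a pure black-and-white image in which pixels brighter than 60% become white.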

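Noise reduction

Options include smoothing and various filtering algorithms. From personal experience, shrinking the original image before binarizing it sometimes also reduces noise to a degree.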
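
The shrink-then-binarize trick can be tried with a single convert call, as in the sketch below. Downscaling averages neighbouring pixels, which tends to wash out small speckles before the threshold is applied. The function name shrink_then_binarize and both 50% defaults are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch: shrink the image first, then grayscale and binarize it.
# Usage: shrink_then_binarize INPUT OUTPUT [SCALE] [THRESHOLD]
# Function name and default values are assumptions; tune them per image.
shrink_then_binarize() {
  convert "$1" -resize "${3:-50%}" -colorspace Gray -threshold "${4:-50%}" "$2"
}
```

Whether this helps depends on the text size: shrinking too far makes small characters unreadable, so it is worth comparing the OCR output with and without the resize step.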

Common ImageMagick commands

## Convert the image format
convert xxx.jpg -auto-level -compress none xxx.tiff

## Produce a grayscale image
convert orig.jpg -colorspace Gray out.jpg

## Produce a black-and-white image
convert orig.jpg -monochrome out.jpg

## Scale
convert orig.jpg -scale 600% out.jpg

Recognition with Tesseract

MacBook-Pro:tessdata fang$ tesseract --help
Usage:
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
--tessdata-dir /path specify the location of tessdata path
--user-words /path/to/file specify the location of user words file
--user-patterns /path/to/file specify the location of user patterns file
-l lang[+lang] specify language(s) used for OCR
-c configvar=value set value for control parameter.
Multiple -c arguments are allowed.
-psm pagesegmode specify page segmentation mode.
These options must occur before any configfile.

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine. Can be used with --tessdata-dir.
--print-parameters: print tesseract parameters to the stdout.

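Common options

  • -l: select the language
  • -c / configfile: can set a character whitelist, which improves recognition accuracy
  • -psm: select the page segmentation mode; choosing the right mode helps recognition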
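
The three options combine naturally. The sketch below reads a single line of digits by restricting the character set through the tessedit_char_whitelist parameter and using page segmentation mode 7 from the help output above. The function name ocr_digits_line and the choice of the eng language are assumptions for the example, and the -psm spelling follows the help text shown here; newer Tesseract releases spell it --psm.

```shell
#!/bin/bash
# Sketch: OCR one text line that contains only digits, printing to stdout.
# Usage: ocr_digits_line IMAGE
# The function name and "-l eng" are assumptions for illustration.
ocr_digits_line() {
  tesseract "$1" stdout -l eng -psm 7 -c tessedit_char_whitelist=0123456789
}
```

The same whitelist can live in a config file instead, as a line of the form tessedit_char_whitelist 0123456789, with the file name passed as the last argument.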

Tesseract-OCR-iOS

https://github.com/gali8/Tesseract-OCR-iOS