Overview
========
OCR (Optical Character Recognition) is a technique used to
extract text from images. In the world of subtitles, subtitles stored
in bitmap format are common, and OCR is needed to convert them
into text format.

Dependencies
============
Tesseract (OCR library by Google)
Leptonica (image processing library)

How to compile CCExtractor on Linux with OCR
=============================================

Download and Install Leptonica.
-------------------------------
Leptonica is available as a package on many distributions; you need the
liblept-devel library (the exact package name varies by distribution).

If Leptonica isn't available for your distribution, or you want to use a newer version
than it offers, you can compile your own.

You can download Leptonica from http://www.leptonica.com/download.html

Download and Install Tesseract.
-------------------------------
Tesseract is available directly from many Linux distributions. The package is generally
called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to
find it. Packages are also generally available for language training data (search the
repositories), but if not, you will need to download the appropriate training data,
unpack it, and copy the .traineddata file into the 'tessdata' directory, probably
/usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
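To see which tessdata directory actually holds the training data for a language, a small check like the following can help. It is only a sketch: the function name find_tessdata_dir is made up for illustration, and the default directories are the two probable locations named above, which may differ on your system.

```shell
#!/bin/sh
# Sketch: print the first candidate directory that contains <lang>.traineddata.
# find_tessdata_dir is a hypothetical helper, not part of Tesseract or CCExtractor.
find_tessdata_dir() {
    lang="$1"
    shift
    # With no directories given, fall back to the two probable locations.
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tesseract-ocr/tessdata /usr/share/tessdata
    fi
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example: look for the English training data in the default locations.
if found=$(find_tessdata_dir eng); then
    echo "eng.traineddata found in $found"
else
    echo "eng.traineddata not found; copy it into a tessdata directory"
fi
```

Extra directories can be passed after the language code, for example a custom install prefix, and they are searched in the order given.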

If Tesseract isn't available for your distribution, or you want to use a newer version
than it offers, you can compile your own.

If you compile Tesseract yourself, the following commands, run in its source directory, are enough:
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

Note:
1) CCExtractor is tested with Tesseract 3.04, but it also works with older versions.

You can download Tesseract from https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz
You can download Tesseract training data from https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
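Before building CCExtractor, it can be useful to confirm that both libraries are visible to the build tools. The sketch below uses pkg-config for that. The helper name check_pkg is hypothetical, and it assumes the libraries registered the usual pkg-config module names, lept and tesseract; a library installed without a .pc file will be reported as missing even if it is present.

```shell
#!/bin/sh
# Sketch: ask pkg-config whether a library is installed and report its version.
# check_pkg is a hypothetical helper; "lept" and "tesseract" are assumed module names.
check_pkg() {
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config is not installed; cannot check $1"
        return 2
    fi
    if pkg-config --exists "$1"; then
        echo "$1: found, version $(pkg-config --modversion "$1")"
        return 0
    fi
    echo "$1: not found"
    return 1
}

# Check both OCR dependencies; keep going even if one is missing.
for module in lept tesseract; do
    check_pkg "$module" || true
done
```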

Compile CCExtractor with the following flag
--------------------------------------------
make ENABLE_OCR=yes
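One way to check that the resulting binary really picked up the OCR libraries is to inspect its shared-library dependencies with ldd. This is a sketch under assumptions: the helper has_ocr_libs is made up for illustration, the binary is assumed to be ./ccextractor in the current directory, and the check only applies when the libraries are linked dynamically.

```shell
#!/bin/sh
# Sketch: read `ldd` output on stdin and succeed only if both OCR libraries appear.
# has_ocr_libs is a hypothetical helper; the binary path below is an assumption.
has_ocr_libs() {
    libs=$(cat)
    printf '%s\n' "$libs" | grep -q 'libtesseract' &&
        printf '%s\n' "$libs" | grep -q 'liblept'
}

binary=./ccextractor
if [ -x "$binary" ]; then
    if ldd "$binary" | has_ocr_libs; then
        echo "OCR libraries are linked"
    else
        echo "OCR libraries are not linked; rebuild with ENABLE_OCR=yes"
    fi
else
    echo "$binary not found; build it first"
fi
```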

How to compile CCExtractor on Windows with OCR
===============================================

Download the prebuilt Leptonica and Tesseract libraries from the following link:
https://drive.google.com/file/d/0B2ou7ZfB-2nZOTRtc3hJMHBtUFk/view?usp=sharing

Add the lib and include paths of Leptonica and Tesseract to the project's directories.
Step 1) In Visual Studio 2013, right-click <Project> and select Properties.
Step 2) Select Configuration Properties in the left panel of the Properties dialog.
Step 3) Select VC++ Directories.
Step 4) In the right pane, in the right-hand column of the Include Directories and
Library Directories properties, open the drop-down menu and choose Edit.
Step 5) Add the path of the directory where you have kept the uncompressed Leptonica
and Tesseract libraries.

Set the preprocessor flag ENABLE_OCR=1.
Step 1) In Visual Studio 2013, right-click <Project> and select Properties.
Step 2) In the left panel, select Configuration Properties, C/C++, Preprocessor.
Step 3) In the right panel, in the right-hand column of the Preprocessor Definitions property, open the drop-down menu and choose Edit.
Step 4) In the Preprocessor Definitions dialog box, add ENABLE_OCR=1. Choose OK to save your changes.

Add the libraries to the linker.
Step 1) Open the project's Properties.
Step 2) Select Configuration Properties.
Step 3) Select Linker in the left panel.
Step 4) Select Input.
Step 5) Select Additional Dependencies in the right panel.
Step 6) Add libtesseract304d.lib on a new line.
Step 7) Add liblept172.lib on a new line.

Download the language data from the following link:
https://code.google.com/p/tesseract-ocr/downloads/list
After downloading tesseract-ocr-3.02.eng.tar.gz, extract the archive and put the
tessdata folder in the directory where you have kept the CCExtractor executable.

Copy the Tesseract and Leptonica DLLs from the lib folder of the prebuilt library download above to the executable's folder or to system32.