Overview
========
OCR (Optical Character Recognition) is a technique used to extract text
from images. Subtitles are commonly stored in bitmap format, and OCR is
needed to convert such bitmap subtitles into text format.
Dependency
==========
Tesseract (OCR library by Google)
Leptonica (Image processing library)
How to compile CCExtractor on Linux with OCR
=============================================
Download and Install Leptonica.
-------------------------------
Leptonica is packaged by many distributions; you need the development library
(liblept-devel; the exact package name varies by distribution).
If Leptonica isn't available for your distribution, or you want a newer version
than the one it offers, you can compile your own.
You can download the Leptonica source from http://www.leptonica.com/download.html
Download and Install Tesseract.
-------------------------------
Tesseract is available directly from many Linux distributions. The package is generally
called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to
find it. Packages for language training data are generally available too (search the
repositories). If they are not, download the appropriate training data,
unpack it, and copy the .traineddata file into the 'tessdata' directory, probably
/usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
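
The manual copy step can be sketched as a small shell helper. This is only an
illustration: the function name install_traineddata is hypothetical, and the
candidate directories are simply the two locations mentioned above, so adjust
them to wherever your installation keeps tessdata.

```shell
#!/bin/sh
# Sketch: copy a .traineddata file into the first candidate tessdata
# directory that exists. install_traineddata is a hypothetical helper name.

install_traineddata() {
    src=$1                  # path to a .traineddata file, e.g. eng.traineddata
    shift
    for dir in "$@"; do     # remaining arguments: candidate tessdata directories
        if [ -d "$dir" ]; then
            cp "$src" "$dir"/ || return 1
            echo "$dir"     # report which directory was used
            return 0
        fi
    done
    echo "no tessdata directory found" >&2
    return 1
}

# Example call (writing to system directories usually requires root):
#   install_traineddata eng.traineddata \
#       /usr/share/tesseract-ocr/tessdata /usr/share/tessdata
```

The helper prints the directory it copied into, which makes it easy to see
which of the candidate locations is actually in use on a given system.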
If Tesseract isn't available for your distribution, or you want a newer version
than the one it offers, you can compile your own.
If you compile Tesseract yourself, running the following commands in its source directory is enough:
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
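
After installing, it can be useful to check that the tesseract binary is on
the PATH and can see its language data. A minimal sketch, assuming the
command-line tool was installed along with the library; check_tesseract is a
hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report the installed Tesseract version and the languages it finds.
# check_tesseract is a hypothetical helper name.

check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH"
        return 1
    fi
    # Older releases print this information on stderr, so merge both streams.
    tesseract --version 2>&1 | head -n 1
    tesseract --list-langs 2>&1
}

# Example call:
#   check_tesseract
```

If the language list is empty or an error about tessdata appears, the
.traineddata files are probably not in the directory Tesseract searches.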
Note:
1) CCExtractor is tested with Tesseract 3.04, but it also works with older versions.
You can download Tesseract from https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz
You can download Tesseract training data from https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
Compile CCExtractor with the OCR flag enabled
---------------------------------------------
make ENABLE_OCR=yes
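
One way to confirm that the resulting binary was linked against both libraries
is to inspect its shared-library dependencies. A minimal sketch, assuming a
dynamically linked binary named ccextractor in the current directory (a
statically linked build would not show up this way); has_ocr_libs is a
hypothetical helper name:

```shell
#!/bin/sh
# Sketch: read a dependency listing (such as ldd output) on stdin and succeed
# only if both Tesseract and Leptonica appear in it.
# has_ocr_libs is a hypothetical helper name.

has_ocr_libs() {
    deps=$(cat)
    printf '%s\n' "$deps" | grep -q 'libtesseract' &&
        printf '%s\n' "$deps" | grep -q 'liblept'
}

# Example call (assumes ./ccextractor exists):
#   ldd ./ccextractor | has_ocr_libs && echo "OCR libraries linked"
```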
How to compile CCExtractor on Windows with OCR
===============================================
Download the prebuilt Leptonica and Tesseract libraries from the following link:
https://drive.google.com/file/d/0B2ou7ZfB-2nZOTRtc3hJMHBtUFk/view?usp=sharing
Add the lib and include paths of Leptonica and Tesseract to the project's directory paths.
Step 1) In Visual Studio 2013, right-click <Project> and select Properties.
Step 2) In the left panel, select Configuration Properties.
Step 3) Select VC++ Directories.
Step 4) In the right panel, in the right-hand column of the Include Directories or
Library Directories property, open the drop-down menu and choose Edit.
Step 5) Add the path of the directory where you have kept the uncompressed Leptonica
and Tesseract libraries.
Set the preprocessor flag ENABLE_OCR=1
Step 1) In Visual Studio 2013, right-click <Project> and select Properties.
Step 2) In the left panel, select Configuration Properties, C/C++, Preprocessor.
Step 3) In the right panel, in the right-hand column of the Preprocessor Definitions property, open the drop-down menu and choose Edit.
Step 4) In the Preprocessor Definitions dialog box, add ENABLE_OCR=1. Choose OK to save your changes.
Add the libraries to the linker
Step 1) Open the project's Properties.
Step 2) Select Configuration Properties.
Step 3) Select Linker in the left panel.
Step 4) Select Input.
Step 5) Select Additional Dependencies in the right panel.
Step 6) Add libtesseract304d.lib on a new line.
Step 7) Add liblept172.lib on a new line.
Download the language data from the following link:
https://code.google.com/p/tesseract-ocr/downloads/list
After downloading tesseract-ocr-3.02.eng.tar.gz, extract the archive and put the
tessdata folder in the directory that contains the CCExtractor executable.
Copy the Tesseract and Leptonica DLLs from the lib folder of the prebuilt package downloaded above into the executable's folder or into system32.