NAME

Zack

DESCRIPTION

Zack is a nexus of text extraction utilities. We use PDF::OCR2, antiword, etc etc.. to convert various file types into text output.

Also, we cache the output to a database, associated via md5sum. Again, this is a very important concept to note; we associate text output with sum of the original content. If you run zack against an image, use ocr, and cache the output to that file's md5 sum. If the file does not change, we do not regenerate output. You can 'uncache' a file.

So if the file contebt does not change, we don't regenerate text. This is crucial in ocr procedures, which are timely.

This package is comprised of various scripts and perl modules.

Why cache text output

Two reasons.

expense

Some of the text extraction methods are timely, like ocr. Why run the same procedure twice on the same data?

information

If you run zack on a multitude of files, now you have a rich database of information for what files have what 'text' inside them. Without changing the files themselves. To make use of this, coordination must happen together with Alister, for example. This will be accomplished in another project.

CLI

Zack is primarily intended to be used via the command line.

zack

The main interface is the executable 'zack'.

Getting text out of files

As mentioned, zack aims to be a nexus of text extraction methods. You can use ocr to extract text from images and pdfs, antiword to get text ot of

Caching output

Whenever you use zack to extract text, the output is cached automatically to the database. To cache text output for a doc file:

zack ./file.doc

Uncaching a file

If for some reason a file's text output is cached and you don't want it so, you can ask zack to uncache.

You can remove a cache entry by:

zack -x MD5SUM

Where MD5SUM is a sum string like '87c1e9938cd60077f3cdfbda765d8695'. You can alternatively point to a file on disk, and zack resolves md5sum behind the scenes:

zack -x PATH

Where PATH is like './file.pdf',

The zack command line argument parsing is 'smart', you can mix up sums and paths at will.

zack -x ~/file1.tif 87c1e9938cd60077f3cdfbda765d8695 ~/file2.pdf

zack comand line options

Usage: zack [OPTION] [FILE|SUM].. Extract text from various file formats and mime types, caching output.

-h                help
-d                debug on
-q                quiet
-a path           abs config file, defaul is /etc/zack.conf
-x                delete on
-o                cache override on
-S                stats, also shows what types we can get text from.   
-c                check if cached
-J                files are sum named injectfiles, not regular files

zack usage examples

Get the text inside file to stdout

zack ./files.txt

Get whatever's cached for sum

zack 87c1e9938cd60077f3cdfbda765d8695

Just make sure files are cached

zack -q ./files*

Cache again, getting rid of previous cache zack -o -q ./files*

Asking if file or sum is cached

zack -c ./files*
zack -c 87c1e9938cd60077f3cdfbda765d8695
zack -c 87c1e9938cd60077f3cdfbda765d8695 ./file1.txt ./files2* 87c1e9938cd6a077f3cdfbda76528695

Remove cache entries by file path(s) or sum(s)

zack -x 87c1e9938cd60077f3cdfbda765d8695
zack -x ./files*
zack -x 87c1e9938cd60077f3cdfbda765d8695 ./files*

Injectfiles

An injectfile is a text file named after the sum of the source document. The content is the text content of the file.

If your source document is a pdf called

filename: ~/x.pdf 

The sum of that file is

sum: 87c1e9938cd60077f3cdfbda765d8695

And the content is

output: 'This is a sentence'

Create a file called

/tmp/87c1e9938cd60077f3cdfbda765d8695.txt

In which you place 'This is a sentence'.

That's a 'injectfile' in regards to zack.

Using an injectfile

The target must be a text file whose filename has a sum... Here we use the 'x' option flag to delete the file after successful insert.

zack -J -x 7099ba7e5103429c3c51c998aee05dec.txt 

If you have a directory with a million file entries..

zack -J -x /path/to/dir

This process checks if the file sum is already cached. If so, skips and leaves the file. If you want to overrite possible previous caches, use the -o flag.

zack -J -x -o /path/to/dir

CAVEAT

Injectfiles must have a md5sum as part of filename, 32 char a-f0-9. You must specify the -J flag to treat file arguments as injectfiles.

SEE ALSO

Zack::File2Text

AUTHOR

Leo Charre leocharre at cpan dot org

COPYRIGHT

Copyright (c) Leo Charre. All rights reserved.

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms and conditions as Perl itself.

This means that you can, at your option, redistribute it and/or modify it under either the terms the GNU Public License (GPL) version 1 or later, or under the Perl Artistic License.

See http://dev.perl.org/licenses/

DISCLAIMER

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Use of this software in any way or in any form, source or binary, is not allowed in any country which prohibits disclaimers of any implied warranties of merchantability or fitness for a particular purpose or any disclaimers of a similar nature.

IN NO EVENT SHALL I BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION (INCLUDING, BUT NOT LIMITED TO, LOST PROFITS) EVEN IF I HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE