NAME

File::Find::Similars - Fast similar-files finder

SYNOPSIS

use File::Find::Similars;

File::Find::Similars->init(0, \@ARGV);
similarity_check_name();

DESCRIPTION

An extremely fast file similarity checker. Files with similar sizes and similar names are picked out as likely duplicates.

It uses a soundex vector algorithm to determine the similarity between files. If there are n files, each with approximately m words in its file name, the computational cost is on the order of

O(n^2 * m)

which can be thousands of times faster than existing file fingerprinting techniques.
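To make the cost estimate concrete, here is a minimal sketch of comparing file names through per-word soundex codes. It is an illustration only, not the module's actual implementation: the helper names soundex_code, name_vector and name_similarity are hypothetical, the soundex variant is simplified (H and W are treated like vowels), and the overlap score is an assumed measure.

```perl
use strict;
use warnings;

# Simplified soundex (assumption: not necessarily the variant the module uses).
# Returns the first letter plus up to three consonant-class digits.
sub soundex_code {
    my ($word) = @_;
    (my $letters = uc $word) =~ s/[^A-Z]//g;
    return $word if $letters eq '';          # e.g. "2001" is kept as-is
    my $first = substr $letters, 0, 1;
    (my $digits = $letters)
        =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    $digits =~ tr/0-6//s;                    # squeeze runs of identical codes
    $digits = substr $digits, 1;             # the first letter stays a letter
    $digits =~ tr/0//d;                      # vowel-like letters carry no code
    return substr $first . $digits . '000', 0, 4;
}

# Hypothetical helper: the set of codes for the m words of one file name.
sub name_vector {
    my ($file_name) = @_;
    my %codes;
    for my $token (grep { length } split /[^A-Za-z0-9]+/, $file_name) {
        $codes{ soundex_code($token) } = 1;
    }
    return \%codes;
}

# Hypothetical score: share of codes two names have in common (0 to 1).
sub name_similarity {
    my ($vec_a, $vec_b) = @_;
    my $shared = grep { exists $vec_b->{$_} } keys %$vec_a;
    my $total  = keys(%$vec_a) + keys(%$vec_b);
    return $total ? 2 * $shared / $total : 0;
}

my @names = (
    'GNU - Python Standard Library (2001).rar',
    'GNU - 2001 - Python Standard Library.pdf',
    'PopupTest.java',
);
my @vectors = map { name_vector($_) } @names;

# Every pair is visited once; each comparison looks at about m codes.
for my $i (0 .. $#names) {
    for my $j ($i + 1 .. $#names) {
        printf "%.2f  %s  <->  %s\n",
            name_similarity($vectors[$i], $vectors[$j]),
            $names[$i], $names[$j];
    }
}
```

The nested loop visits n(n-1)/2 pairs and each comparison touches roughly m codes, which is where the O(n^2 * m) estimate comes from. Only file names are inspected in this sketch.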

ALGORITHM EXPLANATION

The self-test output will help you understand what the module does and what to expect from its output.

$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-Iblib/lib" "-Iblib/arch" test.pl
1..4
# Running under perl version 5.010000 for linux
# Current time local: Wed Oct 29 11:35:06 2008
# Current time GMT:   Wed Oct 29 15:35:06 2008
# Using Test.pm version 1.25
# Testing File::Find::Similars version 1.23

== Testing 1, files under test/ subdir:

  9 test/(eBook) GNU - Python Standard Library 2001.pdf
  3 test/CardLayoutTest.java
  5 test/GNU - 2001 - Python Standard Library.pdf
  4 test/GNU - Python Standard Library (2001).rar
  9 test/LayoutTest.java
  3 test/PopupTest.java
  2 test/Python Standard Library.zip
  5 test/TestLayout.java
ok 1

Note:

- The fileSimilars.pl script will pick out similar files from them in the next test.
- Let's assume that the numbers represent the file sizes in KB.

== Testing 2 result should be:

## =========
           3 'CardLayoutTest.java' 'test/'
           5 'TestLayout.java' 'test/'

## =========
           4 'GNU - Python Standard Library (2001).rar' 'test/'
           5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
ok 2

Note:

- There are 2 groups of similar files picked out by the script.
  The second group makes more sense.
- The similar files are picked because their file names look similar.
- However, the file size plays an important role as well.
- There are 2 files in the second similar files group.
- The file 'Python Standard Library.zip' is not considered to be similar to
  the group because its size is not similar to the group.

== Testing 3, if Python.zip is bigger, result should be:

## =========
           3 'CardLayoutTest.java' 'test/'
           5 'TestLayout.java' 'test/'

## =========
           4 'Python Standard Library.zip' 'test/'
           4 'GNU - Python Standard Library (2001).rar' 'test/'
           5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
ok 3

Note:

- There are 3 files in the second similar files group.
- The file 'Python Standard Library.zip' is now in the 2nd similar files
  group because its size is now similar to the group.

== Testing 4, if Python.zip is even bigger, result should be:

## =========
           3 'CardLayoutTest.java' 'test/'
           5 'TestLayout.java' 'test/'

## =========
           4 'GNU - Python Standard Library (2001).rar'       'test/'
           5 'GNU - 2001 - Python Standard Library.pdf'       'test/'
           6 'Python Standard Library.zip'                    'test/'
           9 '(eBook) GNU - Python Standard Library 2001.pdf' 'test/'
ok 4

Note:

- There are 4 files in the second similar files group.
- The file 'Python Standard Library.zip' is still in the group.
- But this time, because it is also considered to be similar to the
  '(eBook) GNU - Python Standard Library 2001.pdf' file (their sizes are
  now similar, 6 vs 9), that .pdf file is included in the 2nd group as
  its 4th member.
- If the size of file 'Python Standard Library.zip' is 12 KB, then the
  second similar files group will be split into two. Do you know why and
  which files each group will contain?
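The size behaviour in tests 2 to 4 can be explored with a small sketch. Everything here is an assumption made for illustration and is not taken from the module: the threshold $MIN_SIZE_RATIO (0.58 is just one value consistent with the four test outputs above), the helper names sizes_similar and group_by_size, and the idea that groups are formed by chaining pairs of similar-sized files. The four file names are assumed to have already passed the name check.

```perl
use strict;
use warnings;

# HYPOTHETICAL threshold: smaller size / larger size must reach this ratio.
my $MIN_SIZE_RATIO = 0.58;

sub sizes_similar {
    my ($size_a, $size_b) = @_;
    my ($small, $large) = sort { $a <=> $b } $size_a, $size_b;
    return $large > 0 && $small / $large >= $MIN_SIZE_RATIO;
}

# Chain files together: a file joins a group when its size is similar to
# the size of any file already in that group. Single files are dropped.
sub group_by_size {
    my (%size_of) = @_;
    my @names = sort { $size_of{$a} <=> $size_of{$b} or $a cmp $b }
                keys %size_of;
    my (%seen, @groups);
    for my $start (@names) {
        next if $seen{$start}++;
        my @group = ($start);
        my @queue = ($start);
        while (defined(my $current = shift @queue)) {
            for my $other (@names) {
                next if $seen{$other};
                next unless sizes_similar($size_of{$current}, $size_of{$other});
                $seen{$other} = 1;
                push @group, $other;
                push @queue, $other;
            }
        }
        push @groups, \@group if @group > 1;
    }
    return @groups;
}

# Sizes in KB, as in the test listing; only the .zip size varies.
my %fixed = (
    'GNU - Python Standard Library (2001).rar'       => 4,
    'GNU - 2001 - Python Standard Library.pdf'       => 5,
    '(eBook) GNU - Python Standard Library 2001.pdf' => 9,
);

for my $zip_size (2, 4, 6, 12) {
    my @groups = group_by_size(%fixed, 'Python Standard Library.zip' => $zip_size);
    printf "zip = %2d KB: %d group(s)\n", $zip_size, scalar @groups;
    print "    [", join(' | ', @$_), "]\n" for @groups;
}
```

Under these assumptions a 6 KB .zip bridges the 5 KB and 9 KB files, while a 12 KB .zip is close in size only to the 9 KB file, so the chain between the small and the large files is broken.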

The File::Find::Similars package comes with a fully functional demo script fileSimilars.pl. Please refer to its help file for further explanations.

This package is highly customizable. Refer to hash variable %config and/or the 3 arrwash_ functions for customization hints.

BUGS

Please report any bugs or feature requests to bug-file-find-similars at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Find-Similars. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc File::Find::Similars

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=File-Find-Similars

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/File-Find-Similars

CPAN Ratings

http://cpanratings.perl.org/d/File-Find-Similars

Search CPAN

http://search.cpan.org/dist/File-Find-Similars/

AUTHOR

SUN, Tong <suntong at cpan.org> http://xpt.sourceforge.net/

COPYRIGHT

This program is released under the BSD license.

TODO
