NAME

Text::PhraseDistance - A measure of the degree of proximity of 2 given phrases

SYNOPSIS

 use Text::PhraseDistance qw(pdistance);

 sub distance {

	#your own implementation of a distance between strings
	#
	#that needs 2 strings (2 arguments) and returns a number
 }

 # otherwise you can use Text::Levenshtein or others, e.g.
 # use Text::Levenshtein qw(distance);

 my $phrase1="a yellow dog";
 my $phrase2="a dog yellow";

 my $set="abcdefghijklmnopqrstuvwxyz";

 print pdistance($phrase1,$phrase2,$set,\&distance);

DESCRIPTION

This module provides a way to compare two phrases and to measure their proximity. In this context, a phrase is a group of words, each formed from a given set of characters and separated by elements of the complement of that set. E.g. if the set is [abcdefghijklmnopqrstuvwxyz], then "hello, world!" is a phrase whose words are "hello" and "world", while ", " and "!" belong to the complementary set.
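The split described above can be sketched in a few lines of Perl. This is only an illustration of the definition, not the module's actual code; split_phrase is a hypothetical helper name introduced here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::PhraseDistance): split a phrase
# into the words built from $set and the separators between them.
sub split_phrase {
    my ($phrase, $set) = @_;
    my $class = quotemeta $set;                    # escape characters special in a class
    my @words      = $phrase =~ /([$class]+)/g;    # runs of characters from the set
    my @separators = $phrase =~ /([^$class]+)/g;   # runs from the complementary set
    return (\@words, \@separators);
}

my ($words, $seps) = split_phrase("hello, world!", "abcdefghijklmnopqrstuvwxyz");
print join("|", @$words), "\n";   # hello|world
print join("|", @$seps),  "\n";   # , |!
```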

This module does not provide a "classic" string distance (e.g. Levenshtein), i.e. a way to compare two strings as single entities. Instead, it uses a string distance to compare the words one by one and tries to "match" the ones that have the smallest distance. It also calculates a positional distance for every word belonging to the set and for the elements of the complementary set. For example, for the two phrases:

"a yellow dog"
"a dog yellow"

the Levenshtein distance is 8. For the phrases:

"a yellow dog"
"a good cat"

the Levenshtein distance is also 8, but the phrases of the first pair are much closer to each other than those of the second.
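The two values can be checked with a plain unit-cost Levenshtein distance. The sketch below is a textbook dynamic-programming version written for this illustration (the name levenshtein is chosen here); it is not the code of Text::Levenshtein.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Unit-cost Levenshtein distance, keeping only the previous row of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);      # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $subst = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $subst,   # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein("a yellow dog", "a dog yellow"), "\n";   # 8
print levenshtein("a yellow dog", "a good cat"),   "\n";   # 8
```

Both pairs get the same score because the character-level distance cannot tell that the first pair only swaps two whole words.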

With the phrase distance implemented in this module, using Text::Levenshtein as the string distance, the phrases:

"a yellow dog"
"a good cat"

have distance 8, but the phrases:

"a yellow dog"
"a dog yellow"

have distance 2. This is because the module evaluates the string distance for the words, which is 0 (there are 3 pairs of words with minimal string distance equal to 0), and the positional distance, which is 0 for the two "a"s, plus 1 for "yellow" in the first phrase compared with "yellow" in the second (i.e. they are 1 position apart), plus 1 for "dog" in the first phrase compared with "dog" in the second.
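The positional part of this arithmetic can be reproduced with a small sketch. It is a simplification that assumes every word matches exactly and appears at most once per phrase; positional_distance is a hypothetical name and the module's real matching logic is more general.

```perl
use strict;
use warnings;

# Simplified illustration: sum of |i - j| over words found in both phrases.
# Assumes exact matches only and no repeated words.
sub positional_distance {
    my ($words1, $words2) = @_;
    my %pos2;
    @pos2{@$words2} = (0 .. $#$words2);    # word => index in the second phrase
    my $total = 0;
    for my $i (0 .. $#$words1) {
        my $j = $pos2{ $words1->[$i] };
        next unless defined $j;            # unmatched words are ignored here
        $total += abs($i - $j);
    }
    return $total;
}

# 0 for "a", 1 for "yellow", 1 for "dog"
print positional_distance([qw(a yellow dog)], [qw(a dog yellow)]), "\n";   # 2
```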

These two components of the phrase distance (i.e. the string distance and the positional distance) can each be given a cost different from the default (1 for both) to define your own type of phrase distance (see OPTIONAL PARAMETERS below for the syntax).

By default, this module sums the phrase distance computed from the words of the set (i.e. formed from the defined set of characters) and the phrase distance computed from the "words" belonging to the complementary set. To change this behaviour, see the -mode parameter below.

The phrase distance implemented in this module is very slow because it calculates the string distance n x m times, where n is the number of words in the first phrase and m is the number of words in the second. Moreover, if there are many minimums (i.e. pairs of strings that have the smallest distance at that moment), the algorithm needs more iterations to find the best choice.
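The n x m cost can be made visible by filling a table of pairwise distances and counting the calls. This is a sketch under assumptions: distance_table is a hypothetical helper, and the stand-in distance (0 for equal words, 1 otherwise) only replaces a real string distance to keep the example short.

```perl
use strict;
use warnings;

my $calls = 0;
# Stand-in string distance that counts how often it is called.
my $counting_distance = sub {
    my ($s1, $s2) = @_;
    $calls++;
    return $s1 eq $s2 ? 0 : 1;
};

# Hypothetical helper: one string distance per pair of words.
sub distance_table {
    my ($words1, $words2, $dist) = @_;
    my @table;
    for my $i (0 .. $#$words1) {
        for my $j (0 .. $#$words2) {
            $table[$i][$j] = $dist->($words1->[$i], $words2->[$j]);
        }
    }
    return \@table;
}

my $table = distance_table([qw(a yellow dog)], [qw(a dog yellow)], $counting_distance);
print "$calls calls\n";                    # 9 calls (3 x 3)
print join(" ", @$_), "\n" for @$table;    # one row per word of the first phrase
```

Each row of the table contains exactly one 0 here, so the best pairing is unambiguous. With repeated or similar words several cells share the minimum, which is the case where more iterations are needed.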

USAGE

You have to import the pdistance function into the current namespace:

use Text::PhraseDistance qw(pdistance);

then you have to declare your distance function:

 sub distance {

	#your own implementation of a distance between strings
	#
	#that needs 2 strings (2 arguments) and returns a number
 }

otherwise you can use Text::Levenshtein or others, e.g.

use Text::Levenshtein qw(distance);

You also need the set of characters for the words, e.g.

my $set="abcdefghijklmnopqrstuvwxyz";

and then the two phrases, e.g.:

my $phrase1="a yellow dog";
my $phrase2="a dog yellow";

so you can call the phrase distance:

print pdistance($phrase1,$phrase2,$set,\&distance);

In order to define a custom distance subroutine that wraps an existing one (e.g. the distance function of Text::WagnerFischer with a custom cost array), you can use a closure like this:

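use Text::WagnerFischer qw(distance);

my $mydistance;
{
    # cost array handed to Text::WagnerFischer on every call
    my $array_ref = [0, 1, 2];
    $mydistance = sub {
        my ($string1, $string2) = @_;
        return distance( $array_ref, $string1, $string2 );
    };
}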

OPTIONAL PARAMETERS

 pdistance($phrase1,$phrase2,$set,\&distance,{-cost=>[1,0],-mode=>'set'});

 -mode
 accepted values are:
	complementary	the distance is calculated only
			from the "words" of the complementary set

	both		default, the distance is calculated from both sets

	set		the distance is calculated only
			from the words of the given set

 -cost
 accepted value is an array reference with 2 elements: the first is the
 cost for the string distance and the second is the cost for the
 positional distance. The default is [1,1].

THANKS

Many thanks to Stefano L. Rodighiero <larsen at perlmonk.org> and to D. Frankowski for the suggestions.

AUTHOR

Copyright 2002 Dree Mistrut <dree@friul.it>

This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Text::Levenshtein, Text::WagnerFischer