NAME
Encode::Mapper - Perl extension for intuitive, yet efficient construction of mappings for Encode
$Id: Mapper.pm,v 1.10 2003/08/04 12:07:40 smrz Exp $
SYNOPSIS
use Encode::Mapper; ############################################# Enjoy the ride ^^
## Types of rules for mapping the data and controlling the engine's configuration #####
@rules = (
'x', 'y', # single 'x' be 'y', unless greediness prefers ..
'xx', 'Y', # .. double 'x' be 'Y' or other rules
'uc(x)x', sub { 'sorry ;)' }, # if 'x' follows 'uc(x)', be sorry, else ..
'uc(x)', [ '', 'X' ], # .. alias this *engine-initial* string
'xuc(x)', [ '', 'xX' ], # likewise, alias for the 'x' prefix
'Xxx', [ sub { $i++; '' }, 'X' ], # count the still married 'x'
);
## Constructors of the engine, i.e. one Encode::Mapper instance #######################
$mapper = Encode::Mapper->compile( @rules ); # engine constructor
$mapper = Encode::Mapper->new( @rules ); # equivalent alias
## Elementary performance of the engine ###############################################
@source = ( 'x', 'xx', 'xxuc(x)', 'xxx', '', 'xx' ); # distribution of the data ..
$source = join '', @source; # .. is ignored in this sense
@result = ($mapper->process(@source), $mapper->recover()); # the mapping procedure
@result = ($mapper->process($source), $mapper->recover()); # completely equivalent
$result = join '', map { ref $_ eq 'CODE' ? $_->() : $_ } @result;
# maps 'xxxxxuc(x)xxxxx' into ( 'Y', 'Y', '', 'y', CODE(...), CODE(...), 'y' ), ..
# .. then converts it into 'YYyy', setting $i == 2
@follow = $mapper->compute(@source); # follow the engine's computation over @source
$dumper = $mapper->dumper(); # returns the engine as a Data::Dumper object
## Module's higher API implemented for convenience ####################################
$encoder = [ $mapper, Encode::Mapper->compile( ... ), ... ]; # reference to mappers
$result = Encode::Mapper->encode($source, $encoder, 'utf8'); # encode down to 'utf8'
$decoder = [ $mapper, Encode::Mapper->compile( ... ), ... ]; # reference to mappers
$result = Encode::Mapper->decode($source, $decoder, 'utf8'); # decode up from 'utf8'
ABSTRACT
Encode::Mapper serves for intuitive, yet efficient construction of mappings for Encode.
The module finds direct application in Encode::Arabic and Encode::Korean, providing an
object-oriented programming interface to convert data consistently, follow the engine's
computation, dump the engine using Data::Dumper etc.
DESCRIPTION
It looks like the author of the extension ... ;) preferred giving formal and terse examples to writing English. Please see Encode::Arabic and Encode::Korean, where Encode::Mapper is used for solving complex real-world problems.
INTRO AND RULE TYPES
The module's core is an algorithm which, from the rules given by the user, builds a finite-state transducer, i.e. an engine performing greedy search in the input stream and producing output data and side effects relevant to the results of the search. Transducers may be linked one with another, thus forming multi-level devices suitable for nontrivial encoding/decoding tasks.
The rules declare which input sequences of bytes to search for, and what to do upon their occurrence. If the left-hand side (LHS) of a rule is the longest left-most string out of those applicable on the input, the right-hand side (RHS) of the rule is evaluated. The RHS defines the corresponding output string, and possibly controls the engine as if some extra text were prepended to the rest of the input:
$A => $X # $A .. literal string
# $X .. literal string or subroutine reference
$A => [$X, $Y] # $Y .. literal string for which 'length $Y < length $A'
The order of the rules does not matter, except when several rules with the same LHS are stated. In such a case, a redefinition warning is issued before the RHS is overridden.
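The rule semantics described above can be imitated with a naive scanner. This is only a sketch, not the module's engine (which compiles a transducer): naive_map and the sample rules are hypothetical, and passing unmatched characters through unchanged is an assumption of the sketch. The real engine may differ in detail, for instance in how empty strings show up in the result list.

```perl
use strict;
use warnings;

# Naive longest-match rewriting at the current input position.
# Hypothetical helper for illustration; not part of Encode::Mapper.
sub naive_map {
    my ($rules, $input) = @_;
    my @lhs = sort { length $b <=> length $a } keys %$rules;   # longest LHS first
    my @out;

    while (length $input) {
        my ($hit) = grep { index($input, $_) == 0 } @lhs;

        unless (defined $hit) {                  # assumption: unmatched chars pass through
            push @out, substr($input, 0, 1, '');
            next;
        }

        substr($input, 0, length $hit, '');      # consume the matched LHS
        my $rhs = $rules->{$hit};

        if (ref $rhs eq 'ARRAY') {               # [$X, $Y]: emit $X, re-read $Y as input
            push @out, $rhs->[0];
            $input = $rhs->[1] . $input;
        }
        else {                                   # literal string or subroutine reference
            push @out, $rhs;
        }
    }

    return @out;
}

my %rules  = ('x' => 'y', 'xx' => 'Y', 'ab' => ['', 'x']);
my $result = join '', map { ref $_ eq 'CODE' ? $_->() : $_ } naive_map(\%rules, 'xxxabx');

print "$result\n";      # prints YyY
```

Here 'xx' wins over 'x' at the start, the lone 'x' before 'ab' becomes 'y', and 'ab' outputs nothing but pushes an 'x' back, which then pairs with the final 'x'. The condition 'length $Y < length $A' is what guarantees that such a loop makes progress.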
LOW-LEVEL METHODS
- compile ($class, @list)
-
The constructor of an Encode::Mapper instance. The first argument is the name of the class, the rest is the list of rules, with the LHS as the odd elements and the RHS as the even elements. Redefinition of a repeated LHS is allowed, and warned about.
The compilation algorithm, and the search algorithm itself, were inspired by the Aho-Corasick and Boyer-Moore algorithms, and by the studies of finite automata with the restart operation. The engine is implemented in the classical sense, using hashes for the transition function, for instance. We expect to improve this to Perl code evaluation, if the speed-up is significant.
It remains to be explored how Perl's regular expressions would cope with the task, i.e. to verify our initial doubts which prevented us from trying. Since Encode::Mapper's functionality is much richer than pure search, simulating it completely might be resource-expensive and inelegant. Therefore, experiment reports are welcome.
- new ($class, @list)
-
Name alias to the compile constructor.
- process ($obj, @list)
-
Process the input list with the engine. There is no resetting within the call of the method. Internally, the text in the list is split into bytes, so there is no need for the user to join their strings or lines of data. Note the unveiled properties of the Encode::Mapper class as well:
    sub process ($@) {  # returns the list of search results performed by Mapper
        my $obj = shift @_;
        my (@returns, $phrase, $token, $q);

        use bytes;      # ensures splitting into one-byte tokens

        $q = $obj->{'current'};

        foreach $phrase (@_) {
            foreach $token (split //, $phrase) {
                until (defined $obj->{'tree'}[$q]->{$token}) {
                    push @returns, @{$obj->{'bell'}[$q]};
                    $q = $obj->{'skip'}[$q];
                }
                $q = $obj->{'tree'}[$q]->{$token};
            }
        }

        $obj->{'current'} = $q;
        return @returns;
    }
- recover ($obj, $r, $q)
-
Since the search algorithm is greedy and the engine does not know when the end of the data comes, there must be a method to tell. Normally, recover is called on the object without the other two optional parameters setting the initial and the final state, respectively.
    sub recover ($;$$) {    # returns the 'in-progress' search result and resets Mapper
        my ($obj, $r, $q) = @_;
        my (@returns);

        $q = $obj->{'current'} unless defined $q;

        until ($q == 0) {
            push @returns, @{$obj->{'bell'}[$q]};
            $q = $obj->{'skip'}[$q];
        }

        $obj->{'current'} = defined $r ? $r : 0;
        return @returns;
    }
- compute ($obj, @list)
-
Tracks down the computation over the list of data, resetting the engine before and after to its initial state. Developers might like this ;)
    local $\ = "\n"; local $, = ":\t";              # just define the display

    foreach $result ($mapper->compute($source)) {   # follow the computation
        print "Token",  $result->[0];
        print "Source", $result->[1];
        print "Output", join " + ", @{$result->[2]};
        print "Target", $result->[3];
        print "Bell",   join ", ", @{$result->[4]};
        print "Skip",   $result->[5];
    }
- dumper ($obj, $ref)
-
The individual instances of Encode::Mapper can be stored as revertible data structures. For minimalistic reasons, dumping needs to include explicit short-identifier references to the empty array and the empty hash of the engine. For details, see Data::Dumper.
    sub dumper ($;$) {
        my ($obj, $ref) = @_;

        $ref = ['L', 'H', 'mapper'] unless defined $ref;

        require Data::Dumper;

        return Data::Dumper->new([$obj->{'null'}{'list'}, $obj->{'null'}{'hash'}, $obj], $ref);
    }
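To see how the 'tree', 'bell', 'skip' and 'current' properties used by process and recover interact, here is a toy engine with hand-built tables for the rules 'x' => 'y' and 'xx' => 'Y'. The table layout, the two-letter alphabet and the names toy_process and toy_recover are assumptions made for this sketch. The tables that compile actually produces may look different.

```perl
use strict;
use warnings;

# Hand-built tables over the alphabet { x, a }; state 0 is the initial state.
# This layout is an assumption for illustration, not the output of compile.
my %engine = (
    current => 0,
    tree    => [ { x => 1, a => 3 }, { x => 2 }, {}, {} ],   # transitions per state
    bell    => [ [], ['y'], ['Y'], ['a'] ],                  # output emitted on leaving a state
    skip    => [ 0, 0, 0, 0 ],                               # fallback state
);

# Same loop as in process; the input must stay within the toy alphabet.
sub toy_process {
    my ($obj, @phrases) = @_;
    my @returns;
    my $q = $obj->{current};

    for my $phrase (@phrases) {
        for my $token (split //, $phrase) {
            until (defined $obj->{tree}[$q]{$token}) {
                push @returns, @{ $obj->{bell}[$q] };
                $q = $obj->{skip}[$q];
            }
            $q = $obj->{tree}[$q]{$token};
        }
    }

    $obj->{current} = $q;
    return @returns;
}

# Same loop as in recover, without the optional parameters.
sub toy_recover {
    my ($obj) = @_;
    my @returns;
    my $q = $obj->{current};

    until ($q == 0) {
        push @returns, @{ $obj->{bell}[$q] };
        $q = $obj->{skip}[$q];
    }

    $obj->{current} = 0;
    return @returns;
}

my @pending = toy_process(\%engine, 'xx', 'xax');   # ( 'Y', 'y', 'a' )
my @flushed = toy_recover(\%engine);                # ( 'y' )
my $result  = join '', @pending, @flushed;          # 'Yyay'
```

The last 'x' is still "in progress" when toy_process returns, because a second 'x' could yet arrive. Only the recover step flushes it, which is why the two calls are paired in the SYNOPSIS.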
HIGH-LEVEL METHODS
In the Encode world, one can work with different encodings and is also given a function for telling whether the data are in Perl's internal utf8 format or not. In the Encode::Mapper business, one is encouraged to compile different mappers and stack them on top of each other, getting an easy-to-work-with filtering device.
In combination, this module offers the following encode and decode methods. In their prototypes, $encoder/$decoder represent merely a reference to an array of mappers, although mathematics might do more than that in future implementations ;)
Currently, the mappers involved are not reset with recover before the computation:
foreach $mapper (@{$_[2]}) { # either $encoder or $decoder
$text = join "", map {
UNIVERSAL::isa($_, 'CODE') ? $_->() : $_
} $mapper->process($text), $mapper->recover();
}
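The loop above can be tried without any compiled mappers. In the sketch below, ToyStage is a hypothetical stand-in that merely offers the same process and recover method names; it applies a callback per chunk and buffers nothing, so it is not a model of the real engine.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a mapper: same method names, trivial behaviour.
    package ToyStage;

    sub new     { my ($class, $code) = @_; return bless { code => $code }, $class }
    sub process { my ($self, @text) = @_; return map { $self->{code}->($_) } @text }
    sub recover { return () }       # nothing is buffered in this toy
}

my $count  = 0;
my @stages = (
    ToyStage->new(sub { my $t = shift; $t =~ tr/a-z/A-Z/; return $t }),
    ToyStage->new(sub { my $t = shift; return ($t, sub { $count++; '!' }) }),
);

my $text = 'abc';

foreach my $mapper (@stages) {      # the stacking loop from the text above
    $text = join "", map {
        UNIVERSAL::isa($_, 'CODE') ? $_->() : $_
    } $mapper->process($text), $mapper->recover();
}

print "$text\n";        # prints ABC!
```

The second stage returns a subroutine reference next to its string, and the loop calls it while joining the output. This is how side effects such as the counter travel through the stack.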
- encode ($class, $text, $encoder, $enc)
-
If $enc is defined, the $text is encoded into that encoding, using Encode. Then, the $encoder's engines are applied in series on the data. The returned text should have the utf8 flag off.
- decode ($class, $text, $decoder, $enc)
-
The $text is run through the sequence of engines in $decoder. If the result does not have the utf8 flag on, decoding from $enc is further performed by Encode. If $enc is not defined, utf8 is assumed.
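The order of operations described for encode and decode can be sketched with Encode alone. The helpers sketch_encode and sketch_decode are hypothetical, and each stage is a plain code reference standing in for a mapper, so this only mirrors the documented flow and the utf8 flag behaviour. It is not the module's implementation.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode into $enc first, then apply the stages in series.
sub sketch_encode {
    my ($text, $stages, $enc) = @_;

    $text = Encode::encode($enc, $text) if defined $enc;
    $text = $_->($text) for @$stages;

    return $text;
}

# Hypothetical helper: apply the stages first, then decode if the flag is off.
sub sketch_decode {
    my ($text, $stages, $enc) = @_;

    $text = $_->($text) for @$stages;
    $enc  = 'utf8' unless defined $enc;         # documented default
    $text = Encode::decode($enc, $text) unless Encode::is_utf8($text);

    return $text;
}

my $identity = [ sub { $_[0] } ];               # a do-nothing stage
my $source   = "\x{0633}\x{0644}";              # two Arabic letters as characters

my $bytes = sketch_encode($source, $identity, 'utf8');   # four bytes, utf8 flag off
my $chars = sketch_decode($bytes, $identity);            # two characters, utf8 flag on
```

With real mappers in place of the identity stage, the stages would see bytes on the encode side, which matches the byte-wise splitting done by process.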
EXPORT
This module does not export any symbols.
SEE ALSO
There are related theoretical studies which the implementation may have touched. You might be interested in the Aho-Corasick and Boyer-Moore algorithms as well as in finite automata with the restart operation.
Encode, Encode::Arabic, Encode::Korean, Data::Dumper
AUTHOR
Otakar Smrz, http://ckl.mff.cuni.cz/smrz/
eval { 'E<lt>' . 'smrz' . "\x40" . (join '.', qw 'ckl mff cuni cz') . 'E<gt>' }
Perl is also designed to make the easy jobs not that easy ;)
COPYRIGHT AND LICENSE
Copyright 2003 by Otakar Smrz
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.