NAME

Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS

use Lingua::JA::NormalizeText;
use utf8;

my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
my $normalizer = Lingua::JA::NormalizeText->new(@options);

print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
# -> 鳥がトンドルです♥

sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;

    return $text;
}

# or

use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
use utf8;

my $text = '㈱㋰㋫㋫&hearts;';
print decode_entities( nfkc($text) );
# -> (株)ムフフ♥

DESCRIPTION

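Lingua::JA::NormalizeText normalizes Japanese text by applying a list of normalization options, such as Unicode normalization, full-width/half-width conversion, and kana conversion, in a specified order.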

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available.

OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
---------------------  ------------------  -----------------------
lc                     DdD                 ddd
uc                     DdD                 DDD
nfkc                   ㌦                  ドル (length: 2)
nfkd                   ㌦                  ドル (length: 3)
nfc
nfd
decode_entities        &hearts;            ♥
strip_html             <em>あ</em>         あ
alnum_z2h              ＡＢＣ１２３        ABC123
alnum_h2z              ABC123              ＡＢＣ１２３
space_z2h
space_h2z
katakana_z2h           ハァハァ            ﾊｧﾊｧ
katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
katakana2hiragana      パンツ              ぱんつ
hiragana2katakana      ぱんつ              パンツ
unify_3dots            はぁ。。。          はぁ…
wave2tilde             〜                  ～
tilde2wave             ～                  〜
wavetilde2long         〜, ～              ー
wave2long              〜                  ー
tilde2long             ～                  ー
fullminus2long         −                   ー
dashes2long            —                   ー
drawing_lines2long     ─                   ー
unify_long_repeats     ヴァーーー          ヴァー
nl2space               (new line)          (space)
unify_long_spaces      (space)(space)      (space)
remove_head_space      (space)あ(space)あ  あ(space)あ
remove_tail_space      ああ(space)(space)  ああ
old2new_kana           ゐヰゑヱ            いイえエ
old2new_kanji          亞逸鬭              亜逸闘
tab2space              (tab)(tab)          (space)(space)
remove_controls        あ\x{0000}あ        ああ
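
The length annotations on the nfkc and nfkd rows can be reproduced with the core Unicode::Normalize module. This is a minimal sketch that assumes the two options follow the standard Unicode normalization forms NFKC and NFKD; it does not use Lingua::JA::NormalizeText itself.

```perl
use strict;
use warnings;
use Unicode::Normalize qw/NFKC NFKD/;

binmode STDOUT, ':encoding(UTF-8)';

# U+3326 SQUARE DORU, the sample input used in the table above.
my $square_doru = "\x{3326}";

# NFKC recomposes after decomposing: U+30C9 (ド) + U+30EB (ル).
my $nfkc = NFKC($square_doru);

# NFKD leaves the voiced sound mark separate:
# U+30C8 (ト) + U+3099 (combining voiced sound mark) + U+30EB (ル).
my $nfkd = NFKD($square_doru);

printf "NFKC: %s (length: %d)\n", $nfkc, length $nfkc;    # length: 2
printf "NFKD: %s (length: %d)\n", $nfkd, length $nfkd;    # length: 3
```

Both results render as ドル, but string comparison and length() treat them differently, so nfkc is usually the more convenient choice before pattern matching.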

Options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.

You can also pass references to your own functions as options. (See the dearinsu_to_desu function in the SYNOPSIS section.)
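
Because options run in list order, the position of a user-defined function relative to the built-in options can change the result. The following sketch illustrates this with a hypothetical function, doru_to_en, which is not part of the module; the expected outputs assume that nfkc expands ㌦ to ドル as shown in the table above.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::NormalizeText;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical user-defined normalizer: replaces the katakana word ドル with 円.
sub doru_to_en
{
    my $text = shift;
    $text =~ s/ドル/円/g;

    return $text;
}

# nfkc runs first, so ㌦ has already become ドル when doru_to_en runs.
my $nfkc_first = Lingua::JA::NormalizeText->new( 'nfkc', \&doru_to_en );

# doru_to_en runs first and finds nothing to replace; nfkc then expands ㌦.
my $func_first = Lingua::JA::NormalizeText->new( \&doru_to_en, 'nfkc' );

my $text = '100㌦';

print $nfkc_first->normalize($text), "\n";    # expected: 100円
print $func_first->normalize($text), "\n";    # expected: 100ドル
```

As a rule of thumb, place options that unify character forms (such as nfkc) before functions that match on specific strings.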

remove_controls

Note that this option does not remove the following characters:

CHARACTER TABULATION(tab)
LINE FEED(LF)
CARRIAGE RETURN(CR)
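
A short sketch of this behavior, based only on the description above: the NUL character is removed, while the tab, CR, and LF pass through unchanged. The output is printed as code points so that the invisible characters can be seen.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::NormalizeText;

my $normalizer = Lingua::JA::NormalizeText->new('remove_controls');

# Input contains NUL (removed) plus tab, CR, and LF (documented as kept).
my $input  = "あ\x{0000}あ\tあ\r\n";
my $output = $normalizer->normalize($input);

# Show each remaining character as a code point.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $output;
print "$code_points\n";
# expected: U+3042 U+3042 U+0009 U+3042 U+000D U+000A
```

To handle tabs or line breaks as well, add tab2space or nl2space to the option list.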

normalize($text)

Normalizes $text by applying the options passed to new() in order, and returns the normalized text.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Table of old and new kanji forms (新旧字体表): http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.