Introducing Text Cleansing in BERT [BERT]

May 28, 2019 / July 27, 2020

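While using the general-purpose language model BERT, I came across the functions that perform text cleansing. Reading them turned out to be instructive, so I wrote this article.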

The reference is the Google Research implementation.

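First, here is the text cleansing function I found in the BERT code (in tokenization.py; it is called from BasicTokenizer.tokenize, which the tokenize function of the FullTokenizer class invokes).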

  def _clean_text(self, text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
      cp = ord(char)
      if cp == 0 or cp == 0xfffd or _is_control(char):
        continue
      if _is_whitespace(char):
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)

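The function is simple: it removes invalid characters and replaces every whitespace character with an ordinary space. What is interesting is that it first converts each character to an integer with ord*2 and then branches on that number. Note that ord returns the Unicode code point, which coincides with the ASCII code*1 only for ASCII characters. In the condition on line 6, cp == 0 is the null character*3 and cp == 0xfffd is � (U+FFFD, the replacement character)*4. The cp == 0xfffd check detects mojibake (garbled text), which is very useful when handling raw text. The remaining condition, _is_control, is defined by the following function.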
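To make the role of ord concrete, here is a minimal sketch. The helper name code_points is my own and does not appear in BERT; the snippet only illustrates what ord returns and where U+FFFD typically comes from.

```python
def code_points(text):
    """Return the Unicode code point of every character in text."""
    return [ord(ch) for ch in text]


# ord() and chr() are inverses, for ASCII and non-ASCII characters alike.
for ch in ["A", "あ", "\x00"]:
    assert chr(ord(ch)) == ch

# For ASCII characters the code point equals the ASCII code.
print(code_points("Az"))   # [65, 122]
# Outside ASCII, ord() still works and returns the Unicode code point.
print(hex(ord("あ")))       # 0x3042

# Decoding invalid UTF-8 with errors="replace" produces U+FFFD,
# the character that the cp == 0xfffd check filters out.
broken = b"abc\xff".decode("utf-8", errors="replace")
print(code_points(broken)[-1] == 0xFFFD)   # True
```

So a text that was decoded with replacement somewhere upstream can still be cleaned later by dropping every U+FFFD.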

def _is_control(char):
  """Checks whether `chars` is a control character."""
  # These are technically control characters but we count them as whitespace
  # characters.
  if char == "\t" or char == "\n" or char == "\r":
    return False
  cat = unicodedata.category(char)
  if cat in ("Cc", "Cf"):
    return True
  return False

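_is_control checks whether a character is a control character. It first tests for "\t" (tab), "\n" (line feed), and "\r" (carriage return)*5 and returns False for them, because they are treated as whitespace instead. It then uses Python's unicodedata.category to obtain the character's general category (uppercase letter, lowercase letter, digit, and so on)*6. If the category is "Cc" (C0 and C1 control codes) or "Cf" (format control characters), the function returns True.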
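The categories are easier to grasp with a few concrete characters. The snippet below is only an illustration; the sample characters and the dictionary names are my own choice, not something taken from BERT.

```python
import unicodedata

# A handful of sample characters, keyed by a readable label.
samples = {
    "NUL": "\x00",
    "TAB": "\t",
    "ZERO WIDTH SPACE": "\u200b",
    "BYTE ORDER MARK": "\ufeff",
    "A": "A",
    "a": "a",
    "1": "1",
}

# Map each label to the two-letter Unicode general category.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}
for name, cat in categories.items():
    print(name, cat)
```

TAB itself is in category "Cc", which is why _is_control has to special-case "\t", "\n", and "\r" before looking at the category.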

These three conditions together decide which characters are treated as invalid.

Now, back in _clean_text, let's look at the next step, _is_whitespace.

def _is_whitespace(char):
  """Checks whether `chars` is a whitespace character."""
  # \t, \n, and \r are technically control characters but we treat them
  # as whitespace since they are generally considered as such.
  if char == " " or char == "\t" or char == "\n" or char == "\r":
    return True
  cat = unicodedata.category(char)
  if cat == "Zs":
    return True
  return False

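_is_whitespace decides whether a character is whitespace. Line 5 checks the characters already discussed above (plus the ordinary space), so I will skip it. On lines 7 and 8, the function returns True when unicodedata.category classifies the character as "Zs" (space separator). In _clean_text, every character for which _is_whitespace returns True is replaced with a half-width (ASCII) space.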
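For Japanese text, the "Zs" check matters because the full-width space (U+3000) falls into that category. The following is a minimal sketch of the whitespace part only; normalize_spaces is a hypothetical helper of mine that mirrors the logic described above, not a function in BERT.

```python
import unicodedata


def normalize_spaces(text):
    """Replace every whitespace character with an ASCII space."""
    output = []
    for ch in text:
        # Same test as described above: explicit characters, then category "Zs".
        if ch in (" ", "\t", "\n", "\r") or unicodedata.category(ch) == "Zs":
            output.append(" ")
        else:
            output.append(ch)
    return "".join(output)


# U+3000 (full-width space) and U+00A0 (no-break space) are both "Zs".
print(unicodedata.category("\u3000"), unicodedata.category("\u00a0"))
print(normalize_spaces("BERT\u3000model\u00a0v1\tok"))
```

Note that consecutive whitespace characters are not collapsed here; each one becomes its own space, just as in _clean_text.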

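That concludes the explanation of _clean_text. Converting each character to its code point first and then branching on the number was a very useful idea to learn.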

tokenization.py in BERT also has a function that detects punctuation.

def _is_punctuation(char):
  """Checks whether `chars` is a punctuation character."""
  cp = ord(char)
  # We treat all non-letter/number ASCII as punctuation.
  # Characters such as "^", "$", and "`" are not in the Unicode
  # Punctuation class but we treat them as punctuation anyways, for
  # consistency.
  if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
      (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
    return True
  cat = unicodedata.category(char)
  if cat.startswith("P"):
    return True
  return False

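_is_punctuation detects punctuation. On line 8, it compares the code point obtained with ord against numeric ranges; being able to specify ranges is probably the main benefit of converting to a number. On line 12, it returns True for any character whose category starts with "P" (Punctuation). Why check both the code point ranges and unicodedata.category? The comment in the code gives the reason: ASCII symbols such as "^", "$", and "`" are not in the Unicode Punctuation class, so the range check catches them, while the category check covers punctuation outside ASCII. Note that ord accepts any single character, so the second check is not there to handle characters that ord cannot convert.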
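
To see which ASCII symbols the range check actually rescues, one can list the characters in those four ranges whose category does not start with "P". This is a small experiment of my own; ASCII_RANGES and ascii_symbols_outside_p are hypothetical names, and only the numeric ranges are taken from _is_punctuation.

```python
import unicodedata

# The four ASCII ranges used in _is_punctuation (inclusive bounds).
ASCII_RANGES = [(33, 47), (58, 64), (91, 96), (123, 126)]


def ascii_symbols_outside_p():
    """ASCII symbols in the ranges above that Unicode does not class as P*."""
    result = []
    for lo, hi in ASCII_RANGES:
        for cp in range(lo, hi + 1):
            ch = chr(cp)
            if not unicodedata.category(ch).startswith("P"):
                result.append(ch)
    return result


print(ascii_symbols_outside_p())
# Non-ASCII punctuation is only caught by the category check.
print(unicodedata.category("。"))
```

The printed list contains symbol-category characters such as "$", "^", and "`", while the Japanese full stop "。" falls in a "P" category and is handled by the second check.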
_is_chinese_char, in turn, detects Chinese characters.
  def _is_chinese_char(self, cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and handled
    # like the all of the other languages.
    if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
        (cp >= 0x3400 and cp <= 0x4DBF) or  #
        (cp >= 0x20000 and cp <= 0x2A6DF) or  #
        (cp >= 0x2A700 and cp <= 0x2B73F) or  #
        (cp >= 0x2B740 and cp <= 0x2B81F) or  #
        (cp >= 0x2B820 and cp <= 0x2CEAF) or
        (cp >= 0xF900 and cp <= 0xFAFF) or  #
        (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
      return True
    return False
_is_chinese_char detects Chinese characters. As in _is_punctuation, it defines them by ranges of code points. The same kind of check looks feasible for Japanese*7.
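
As a rough idea of what the Japanese version could look like, here is a sketch that checks the Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks. is_kana is a hypothetical helper, not part of BERT, and the ranges are Unicode code points as returned by ord, not Shift JIS codes. Half-width katakana and kanji are deliberately left out.

```python
def is_kana(cp):
    """Hypothetical helper: True if cp lies in the Hiragana or Katakana block."""
    return (0x3040 <= cp <= 0x309F) or (0x30A0 <= cp <= 0x30FF)


def kana_only(text):
    """Keep only the hiragana and katakana characters of text."""
    return "".join(ch for ch in text if is_kana(ord(ch)))


print(is_kana(ord("あ")), is_kana(ord("ア")), is_kana(ord("漢")))
print(kana_only("BERTで漢字とカナを分ける"))
```

Kanji would still be caught by the ranges in _is_chinese_char, since Japanese kanji share the CJK blocks.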
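There were also a function that strips accents and a function that converts text to Unicode, but I ran out of energy, so I plan to add explanations at a later date.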
  def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
      cat = unicodedata.category(char)
      if cat == "Mn":
        continue
      output.append(char)
    return "".join(output)
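
Until the full explanation is added, a short experiment shows what the accent stripping does. strip_accents below is my own standalone rewrite of the same idea (NFD normalization, then dropping category "Mn"), so treat it as a sketch rather than the BERT method itself.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters with NFD, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("café"))
# NFD also splits voiced kana into a base kana plus U+3099 (category Mn),
# so the voicing mark is removed as well.
print(strip_accents("が"))
```

The second line prints "か", which suggests that this step needs care when it is applied to Japanese text.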
def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")
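
The key detail in convert_to_unicode is the "ignore" error handler passed to decode. The sketch below reproduces only the Python 3 branch without six; to_text is a hypothetical name of mine.

```python
def to_text(value):
    """Python 3 only sketch: return str unchanged, decode bytes as UTF-8."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # "ignore" silently drops bytes that are not valid UTF-8.
        return value.decode("utf-8", "ignore")
    raise ValueError("Unsupported string type: %s" % type(value))


raw = "BERT入門".encode("utf-8") + b"\xff"   # valid text plus one invalid byte
print(to_text(raw))
print(to_text("already text"))
```

With "ignore", the invalid byte simply disappears, so no U+FFFD is introduced at this stage.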
That concludes this introduction to the cleansing methods in BERT. When cleansing text, it seems the code inevitably ends up relying heavily on for loops and pattern matching.

*1: ASCII code table http://www3.nit.ac.jp/~tamura/ex2/ascii.html
*2: Applying chr after ord restores the original character https://python.civic-apps.com/char-ord/
*3: What is the null character for? http://www.altima.jp/column/fpga_edison/null.html
*4: Replacement Character https://www.fileformat.info/info/unicode/char/fffd/index.htm
*5: LF and CR https://marusunrise2.blogspot.com/2014/06/lfcrcrlf.html
*6: General Category Values http://www.unicode.org/reports/tr44/#General_Category_Values
*7: Character code table, Shift JIS http://charset.7jp.net/sjis.html





Posted by vastee
