Removing stop words, punctuation, and numbers from text with scispaCy

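When preprocessing natural-language text, it is common to remove frequently occurring tokens such as "I", "you", "we", "am", and "are" (stop words), along with numbers and punctuation. Libraries such as Gensim and NLTK are the usual choice for this step, but spaCy lets you write it more elegantly.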

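Without further ado, the full code is shown below. The txt variable holds the abstract of "Attention Is All You Need", a well-known paper in the NLP community. This example uses scispaCy, a set of models developed by AI2 for biomedical papers; the en_core_sci_sm model loaded in the script has to be installed separately from spaCy itself.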

preprocess.py

import spacy

txt = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

nlp = spacy.load("en_core_sci_sm")

doc = nlp(txt)

# Remove punctuation, numbers, symbols, and stop words in a single pass
filtered_doc = [token.text for token in doc if not (token.pos_ in ("PUNCT", "NUM", "SYM") or token.is_stop)]

print(filtered_doc)

The output is the list below. It confirms that punctuation, numbers, and stop words have been removed.

python preprocess.py
>> ['dominant', 'sequence', 'transduction', 'models', 'based', 'complex', 'recurrent', 'convolutional', 'neural', 'networks', 'encoder-decoder', 'configuration', 'best', 'performing', 'models', 'connect', 'encoder', 'decoder', 'attention', 'mechanism', 'propose', 'new', 'simple', 'network', 'architecture', 'Transformer', 'based', 'solely', 'attention', 'mechanisms', 'dispensing', 'recurrence', 'convolutions', 'entirely', 'Experiments', 'machine', 'translation', 'tasks', 'models', 'superior', 'quality', 'parallelizable', 'requiring', 'significantly', 'time', 'train', 'model', 'achieves', 'BLEU', 'WMT', 'English-to-German', 'translation', 'task', 'improving', 'existing', 'best', 'results', 'including', 'ensembles', 'BLEU', 'WMT', 'English-to-French', 'translation', 'task', 'model', 'establishes', 'new', 'single-model', 'state-of-the-art', 'BLEU', 'score', 'training', 'days', 'GPUs', 'small', 'fraction', 'training', 'costs', 'best', 'models', 'literature', 'Transformer', 'generalizes', 'tasks', 'applying', 'successfully', 'English', 'constituency', 'parsing', 'large', 'limited', 'training', 'data']
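The filter condition can also be pulled out into a small reusable function, which makes it easier to test and to adjust the set of excluded tags. The following is only a minimal sketch: the names EXCLUDED_POS, keep_token, and filter_tokens are hypothetical helpers, not part of spaCy, and the sample tokens are hand-made stand-ins that carry the same three attributes (text, pos_, is_stop) the script above reads from real spaCy tokens.

```python
from types import SimpleNamespace

# Coarse part-of-speech tags to drop: punctuation, numerals, symbols.
EXCLUDED_POS = frozenset({"PUNCT", "NUM", "SYM"})


def keep_token(token):
    """Return True for tokens that should survive the filter."""
    return token.pos_ not in EXCLUDED_POS and not token.is_stop


def filter_tokens(tokens):
    """Return the text of every token that passes keep_token."""
    return [token.text for token in tokens if keep_token(token)]


# Hand-made stand-ins for spaCy tokens (assumed values, not model output).
sample = [
    SimpleNamespace(text="Our", pos_="PRON", is_stop=True),
    SimpleNamespace(text="model", pos_="NOUN", is_stop=False),
    SimpleNamespace(text="achieves", pos_="VERB", is_stop=False),
    SimpleNamespace(text="28.4", pos_="NUM", is_stop=False),
    SimpleNamespace(text="BLEU", pos_="PROPN", is_stop=False),
    SimpleNamespace(text=".", pos_="PUNCT", is_stop=False),
]

print(filter_tokens(sample))  # ['model', 'achieves', 'BLEU']
```

With the pipeline from preprocess.py, the same function would presumably be called as filter_tokens(nlp(txt)), since iterating over a spaCy Doc yields tokens with these attributes.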

Incidentally, if you want to add your own stop words, you can write the following.

preprocess_custom.py

import spacy

txt = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

nlp = spacy.load("en_core_sci_sm")

# List the stop words you want to add
customize_stop_words = ["dominant", "new", "data"]
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

doc = nlp(txt)

# Remove punctuation, numbers, symbols, and stop words in a single pass
filtered_doc = [token.text for token in doc if not (token.pos_ in ("PUNCT", "NUM", "SYM") or token.is_stop)]

print(filtered_doc)

The resulting list:

python preprocess_custom.py
>> ['sequence', 'transduction', 'models', 'based', 'complex', 'recurrent', 'convolutional', 'neural', 'networks', 'encoder-decoder', 'configuration', 'best', 'performing', 'models', 'connect', 'encoder', 'decoder', 'attention', 'mechanism', 'propose', 'simple', 'network', 'architecture', 'Transformer', 'based', 'solely', 'attention', 'mechanisms', 'dispensing', 'recurrence', 'convolutions', 'entirely', 'Experiments', 'machine', 'translation', 'tasks', 'models', 'superior', 'quality', 'parallelizable', 'requiring', 'significantly', 'time', 'train', 'model', 'achieves', 'BLEU', 'WMT', 'English-to-German', 'translation', 'task', 'improving', 'existing', 'best', 'results', 'including', 'ensembles', 'BLEU', 'WMT', 'English-to-French', 'translation', 'task', 'model', 'establishes', 'single-model', 'state-of-the-art', 'BLEU', 'score', 'training', 'days', 'GPUs', 'small', 'fraction', 'training', 'costs', 'best', 'models', 'literature', 'Transformer', 'generalizes', 'tasks', 'applying', 'successfully', 'English', 'constituency', 'parsing', 'large', 'limited', 'training']
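If you would rather not modify the pipeline's vocabulary, or you want the extra words matched regardless of capitalization, one alternative is to post-filter the list of token strings against a plain Python set. This is a sketch of a different technique from the script above, not a spaCy API; the helper name drop_custom_stop_words and the short input list are assumptions made for illustration.

```python
def drop_custom_stop_words(tokens, extra_stop_words):
    """Remove tokens whose lowercase form is in extra_stop_words."""
    # Normalize the custom words once so the comparison is case-insensitive.
    blocked = {word.lower() for word in extra_stop_words}
    return [token for token in tokens if token.lower() not in blocked]


# A short, made-up list of already-filtered token strings.
filtered = ["dominant", "sequence", "New", "simple", "network", "training", "data"]

print(drop_custom_stop_words(filtered, ["dominant", "new", "data"]))
# ['sequence', 'simple', 'network', 'training']
```

The trade-off is that the custom words live outside the spaCy pipeline, so token.is_stop itself is unchanged and any other code relying on that attribute will not see them.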

This article was written with reference to the following pages.

https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936

https://stackoverflow.com/questions/45375488/how-to-filter-tokens-from-spacy-document
