Pre-training ELECTRA on your own data

Preparation

This article is a continuation of the previous one.

As a prerequisite, you need to have downloaded openwebtext and run build_openwebtext_pretraining_dataset.py, as described in the previous article.

This time, I walk through the steps for pre-training ELECTRA on a corpus (a collection of text data) extracted from openwebtext.

First, copy the text files in /home/ubuntu/data/openwebtext/tmp/job_0 to /home/ubuntu/data/mycorpus/txt, and copy vocab.txt into /home/ubuntu/data/mycorpus.

mkdir -p /home/ubuntu/data/mycorpus/txt
cp /home/ubuntu/script/electra/vocab.txt /home/ubuntu/data/mycorpus/
cp /home/ubuntu/data/openwebtext/tmp/job_0/*.txt /home/ubuntu/data/mycorpus/txt
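Before converting, it can help to confirm that the copy produced the expected layout. The following is a minimal sketch and not part of ELECTRA; check_corpus_layout is a hypothetical helper, and it only assumes the directory structure created by the commands above (vocab.txt at the top level and the .txt files under txt/).

```python
from pathlib import Path


def check_corpus_layout(corpus_root):
    """Return the number of .txt files under <corpus_root>/txt.

    Hypothetical helper: raises FileNotFoundError if vocab.txt or the
    txt/ directory is missing, or if txt/ contains no .txt files.
    """
    root = Path(corpus_root)
    vocab = root / "vocab.txt"
    txt_dir = root / "txt"
    if not vocab.is_file():
        raise FileNotFoundError(f"missing vocabulary file: {vocab}")
    if not txt_dir.is_dir():
        raise FileNotFoundError(f"missing text directory: {txt_dir}")
    # Count only files directly under txt/ that end in .txt.
    n_files = sum(1 for _ in txt_dir.glob("*.txt"))
    if n_files == 0:
        raise FileNotFoundError(f"no .txt files found in {txt_dir}")
    return n_files
```

For the paths used in this article, check_corpus_layout("/home/ubuntu/data/mycorpus") would return the number of text files that were copied.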

After copying, convert the text into training data.

python build_pretraining_dataset.py --corpus-dir /home/ubuntu/data/mycorpus/txt --vocab-file /home/ubuntu/data/mycorpus/vocab.txt --output-dir /home/ubuntu/data/mycorpus/pretrain_tfrecords
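Once the conversion finishes, it is worth confirming that output files were actually written. A minimal sketch follows; list_tfrecord_shards is a hypothetical helper, and the default file-name pattern is an assumption based on the pretrain_data.tfrecord* shards that ELECTRA's scripts appear to write and read, so adjust it if your output is named differently.

```python
import glob
import os


def list_tfrecord_shards(output_dir, pattern="pretrain_data.tfrecord*"):
    """Return sorted (path, size_in_bytes) pairs for shards in output_dir.

    The default pattern is an assumption about the shard naming; pass a
    different one if your files are named otherwise.
    """
    paths = sorted(glob.glob(os.path.join(output_dir, pattern)))
    return [(p, os.path.getsize(p)) for p in paths]
```

An empty list for /home/ubuntu/data/mycorpus/pretrain_tfrecords would suggest that the conversion wrote nothing or wrote to a different directory.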

Then run pre-training.

python run_pretraining.py --data-dir /home/ubuntu/data/mycorpus --model-name electra_small_owt
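With a custom corpus you will often want to change settings such as the number of training steps. In the ELECTRA repository, run_pretraining.py accepts an --hparams argument holding a JSON dictionary (or a path to a JSON file). The sketch below builds such a command line in Python; build_pretraining_command is a hypothetical helper, and the values shown for model_size, num_train_steps and train_batch_size are illustrative assumptions, not settings used in this article.

```python
import json
import shlex


def build_pretraining_command(data_dir, model_name, hparams=None):
    """Return the run_pretraining.py invocation as a list of arguments."""
    cmd = ["python", "run_pretraining.py",
           "--data-dir", data_dir,
           "--model-name", model_name]
    if hparams:
        # Serialize the overrides as a JSON dict for the --hparams argument.
        cmd += ["--hparams", json.dumps(hparams, sort_keys=True)]
    return cmd


# Illustrative overrides only; these values are assumptions.
example_hparams = {
    "model_size": "small",
    "num_train_steps": 100000,
    "train_batch_size": 64,
}

command = build_pretraining_command(
    "/home/ubuntu/data/mycorpus", "electra_small_owt", example_hparams)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can also be passed directly to subprocess.run when the working directory is the ELECTRA checkout.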

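As you can see, ELECTRA makes it easy to pre-train even on txt data that you prepared yourself. For reference, the input used here is plain text split into separate lines, as in the following example.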

0261009-be09eb4e8359abe9837ca83b29048092.txt

Share this article:A man who was arrested at Los Angeles International Airport with more than 1,000 dried insects in his luggage — including 150 endangered butterflies — was expected to plead not guilty Monday to federal charges.

Alexander Bic, 25, is charged with violating the U.S. Endangered Species Act in connection with the alleged attempt to import Ornithoptera — or birdwing — butterflies into the United States.

The charge carries a possible federal prison sentence of up to 20 years upon conviction, according to Assistant U.S. Attorney Diana M. Kwok.

The vividly colored specimens were found by customs officers on April 7 as LAX, as Bic and his wife were returning from a trip to Japan, according to documents filed in Los Angeles federal court. Bic’s wife was not charged.

The dried and folded 5-inch butterflies from New Guinea were allegedly found among eight boxes of dead bugs discovered in Bic’s carry-on and checked baggage, the document states.

Kwok said Bic operates an Internet mail-order business in which he sells pinned and framed insect specimens to customers throughout the world.

The endangered birdwing species sells for upwards of $100, the prosecutor said.

“There are certainly enough collectors (of dried insects) to support an eBay business,” Kwok told City News Service.

—City News Service

Man charged with smuggling endangered butterflies at LAX was last modified: by

>> Want to read more stories like this? Get our Free Daily Newsletters Here!

Follow us:

If your text needs preprocessing such as sentence splitting or tokenization, the following link should be a useful reference.

https://github.com/stefan-it/turkish-bert/blob/master/CHEATSHEET.md#cased-model
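As a rough illustration of what sentence splitting means for this input format, the sketch below turns paragraphs into one sentence per line using a naive regular expression. This is not the method used in the linked cheatsheet; split_sentences and to_one_sentence_per_line are hypothetical helpers, and a real pipeline would normally use a proper sentence tokenizer, since this rule splits incorrectly after abbreviations such as "U.S.".

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at ., ! or ?
# followed by whitespace and then an uppercase letter or an opening quote.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"“])")


def split_sentences(paragraph):
    """Split a paragraph into sentences; returns a list of stripped strings."""
    parts = _SENTENCE_BOUNDARY.split(paragraph.strip())
    return [p.strip() for p in parts if p.strip()]


def to_one_sentence_per_line(paragraphs):
    """Join the sentences of each paragraph with newlines, and separate
    paragraphs (treated here as documents) with a blank line."""
    blocks = ["\n".join(split_sentences(p)) for p in paragraphs if p.strip()]
    return "\n\n".join(blocks) + "\n"
```

Writing the result of to_one_sentence_per_line to a .txt file gives one sentence per line, with a blank line between the original paragraphs.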

ELECTRA,Python

Posted by vastee