How to create vocab.txt? [BERT][ELECTRA][Tokenizers]

Train a WordPiece vocabulary with the Hugging Face tokenizers library, then convert the saved JSON into the vocab.txt format that BERT and ELECTRA expect. First, install the library:
pip install tokenizers
from tokenizers import BertWordPieceTokenizer
from glob import glob
import json
# Collect all plain-text training files.
txt_path = '/path/to/your/txts/*.txt'
txts = glob(txt_path)
# Settings for a cased model (see the turkish-bert cheatsheet in References):
# keep case and accents, normalize whitespace and control characters.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
)
# train() works in place and returns None, so there is nothing to assign.
tokenizer.train(
    txts,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    # special_tokens defaults to ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
    limit_alphabet=1000,
    wordpieces_prefix="##",
)
# Save the full tokenizer (vocabulary + settings) as JSON.
tokenizer.save("./vocab.json", pretty=True)
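
Before converting, it is worth sanity-checking the trained tokenizer on a sample sentence (a minimal sketch; the input string is an arbitrary placeholder):

# Quick check: the trained tokenizer should split unseen text into
# WordPiece subwords from the new vocabulary.
output = tokenizer.encode("Hello world!")
print(output.tokens)  # subwords; continuation pieces carry the "##" prefix
print(output.ids)     # the corresponding row indices in vocab.txt
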
# vocab.json to vocab.txt: one token per line, line number == token id.
with open('./vocab.json') as f:
    d = json.load(f)

vocab = d['model']['vocab']  # dict mapping token -> id
tokens = sorted(vocab, key=vocab.get)  # order explicitly by token id
with open('./vocab.txt', 'wt') as f:
    f.write('\n'.join(tokens))
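
Depending on your version of tokenizers, the JSON round trip can be skipped entirely; a minimal sketch, assuming a version where the implementation classes expose save_model() (tokenizers >= 0.8) and that the transformers package is installed:

# Alternative: write vocab.txt directly into the current directory.
tokenizer.save_model('.')

# The resulting file should load with transformers' BertTokenizer; keep
# do_lower_case consistent with the lowercase=False setting above.
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer('./vocab.txt', do_lower_case=False)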
References
https://github.com/huggingface/tokenizers/tree/master/bindings/python
https://github.com/stefan-it/turkish-bert/blob/master/CHEATSHEET.md#cased-model