Deepstory – GPT-2 and transformers

Originally, I was using gpt-2-simple to train my model because it was really simple. However, almost all of my other modules use PyTorch while gpt-2-simple uses TensorFlow, so I wanted to keep things consistent and use a PyTorch implementation of GPT-2. That reminded me of a repo I had discovered quite early on: Transformers from Huggingface. It is built with PyTorch and supports lots of popular deep learning language models such as BERT and XLNet, but I won’t be using any of those; I just need it to generate sentences, not do any classification. First, I had to convert the model trained in TensorFlow (like the originally released model) to PyTorch. Conveniently, they provide a script that automates it. Well, it should have been that simple, but after the model had been converted, I ran into some trouble.

transformers-cli convert --model_type gpt2 \
  --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
  [--config OPENAI_GPT2_CONFIG] \
  [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]

And I ran this.

transformers-cli convert --model_type gpt2 --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH --pytorch_dump_output output --config gpt2/hparams.json

Firstly, mapping the config to hparams.json is important for converting models other than 124M. And it seems that no one wants to mention that the file layout is different: I actually found out from a book that I had to rename ‘encoder.json’ to ‘vocab.json’ and ‘vocab.bpe’ to ‘merges.txt’. I mean, who would have known? It took me a few hours because there is simply no straightforward documentation for this very simple and basic process. Still, I figured it out, and it works.
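Concretely, the renaming only takes a couple of lines, something like the sketch below; the folder names here are just assumptions based on how I laid out the original checkpoint and the converted output, so adjust them to your own paths.

import shutil

src = 'gpt2'    # folder holding the original OpenAI checkpoint files (assumed layout)
dst = 'output'  # folder holding the converted PyTorch model

# transformers expects 'vocab.json' and 'merges.txt' instead of 'encoder.json' and 'vocab.bpe'
shutil.copy(f'{src}/encoder.json', f'{dst}/vocab.json')
shutil.copy(f'{src}/vocab.bpe', f'{dst}/merges.txt')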

Then I wrote a class to make this look more organized.

generate.py

# SIU KING WAI SM4701 Deepstory
from transformers import GPT2Tokenizer, GPT2LMHeadModel


class Generator:
    def __init__(self, model_name):
        self.model_name = model_name
        # load the tokenizer and the converted PyTorch model once, up front, and keep them on the GPU
        self.tokenizer = GPT2Tokenizer.from_pretrained(f'data/gpt2/{model_name}')
        self.model = GPT2LMHeadModel.from_pretrained(f'data/gpt2/{model_name}').to('cuda')

    def generate(self, text, max_length, top_p, top_k, temperature, do_sample):
        # encode the input context into a tensor of token ids (None lets the model start from scratch)
        input_ids = self.tokenizer.encode(text, return_tensors='pt').to('cuda') if text else None
        # sample a continuation; the output comes back as a batch of sequences
        outputs = self.model.generate(input_ids=input_ids,
                                      max_length=max_length,
                                      top_p=top_p,
                                      top_k=top_k,
                                      temperature=temperature,
                                      do_sample=do_sample)
        # decode the first (and only) generated sequence back into text
        return self.tokenizer.decode(outputs[0], clean_up_tokenization_spaces=True)

As you can tell, there’s not much logic needed. First you convert the text into tokens in a tensor format using the tokenizer. Then you pass that tensor to the model’s generate() method, which returns a tensor of generated token ids. Since the output is returned as a batch and only one sequence is generated, only the first element is needed, hence outputs[0]. That sequence is then passed back to the tokenizer to decode the token ids into the corresponding vocabulary.
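To give an idea of how it is used, here is a rough example; the model folder name and the sampling values below are only placeholders, not the settings I actually ship with.

# rough usage example; the folder name and sampling values are placeholders
generator = Generator('mymodel')  # expects the converted model under data/gpt2/mymodel
story = generator.generate(
    text='The knight opened the door and',
    max_length=200,
    top_p=0.9,
    top_k=40,
    temperature=0.8,
    do_sample=True,
)
print(story)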

About the performance: I’m not wrapping this in a Python context manager. The model is not loaded only when it is needed; it is loaded in advance so that subsequent generations are faster. And indeed, most of the time in this whole procedure is spent on loading, while the generation itself is pretty fast. There is one little problem, though: something goes wrong when generating over 2000 characters of text, most likely because GPT-2 can only handle a context of 1024 tokens. But for now, it suffices for a short clip.
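If the 1024-token limit really is the culprit, a small guard like the sketch below (not something in the current code) would probably avoid the crash by capping the request at the model’s context window.

# possible guard, not in the code above: never ask for more tokens than the model can attend to
def safe_max_length(model, requested_length):
    # the released GPT-2 models have a context window of model.config.n_positions tokens (1024)
    return min(requested_length, model.config.n_positions)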