Generate Jokes using GPT-2

Make sure a GPU is enabled (Edit -> Notebook settings -> Hardware accelerator -> GPU)
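One quick way to confirm that the runtime really has a GPU attached is to look for the NVIDIA driver tool. This is only a sketch: it assumes an NVIDIA GPU and that `nvidia-smi` is on the PATH whenever a GPU is present. The helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi exists and lists at least one GPU (assumption: NVIDIA runtime)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tool on the PATH, so most likely no GPU runtime.
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: ...".
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU visible:", gpu_visible())
```

If this prints `False`, change the runtime type before going any further, since fine-tuning on CPU is likely to be impractically slow.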

1 Data preparation

Clone and cd into repo

In [0]:
!git clone
In [0]:
cd gpt-2

Install requirements

In [0]:
!pip3 install -r requirements.txt
Download the model

In [0]:
!python3 117M

Clone joke dataset

In [0]:
cd /content/gpt-2
In [0]:
! git clone
Extract useful jokes and save them to plain text

In [0]:
cd joke-dataset/
In [0]:
import json

# Concatenate the joke bodies of each dataset into one plain-text file.
with open("stupidstuff.json") as f:
  jokes = json.load(f)
  text = ''.join(x.get("body") for x in jokes)
  with open('stupidstuff.txt', 'w') as out_file:
    out_file.write(text)

with open("wocka.json") as f:
  jokes = json.load(f)
  text = ''.join(x.get("body") for x in jokes)
  with open('wocka.txt', 'w') as out_file:
    out_file.write(text)

def catjoke(x):
  # Reddit jokes keep the setup in "title" and the punchline in "body".
  title = x.get("title")
  body = x.get("body")
  return title + body

with open("reddit_jokes.json") as f:
  jokes = json.load(f)
  text = ''.join(catjoke(x) for x in jokes)
  with open('reddit_jokes.txt', 'w') as out_file:
    out_file.write(text)
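The extraction above joins jokes with no separator, so one joke runs straight into the next. A possible variant, sketched below, puts a delimiter between jokes and skips empty fields. The helper names and the choice of `<|endoftext|>` as delimiter are assumptions, not part of the original notebook. Whether the training script maps that string to GPT-2's single end-of-text token depends on its encoder, so treat it here as a plain visible marker.

```python
def joke_to_text(joke, fields=("title", "body")):
    """Join the non-empty fields of one joke dict with newlines."""
    parts = [(joke.get(name) or "").strip() for name in fields]
    return "\n".join(p for p in parts if p)

def jokes_to_text(jokes, fields=("title", "body"), separator="\n<|endoftext|>\n"):
    """Join all non-empty jokes, placing the separator between consecutive jokes."""
    texts = (joke_to_text(j, fields) for j in jokes)
    return separator.join(t for t in texts if t)

# Tiny in-memory sample standing in for json.load(f); the third entry is empty and is dropped.
sample = [
    {"title": "Title one", "body": "Body one"},
    {"title": "Title two", "body": None},
    {"body": ""},
]
print(jokes_to_text(sample))
```

For the datasets that only have a body, pass `fields=("body",)`.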

2 Start training

Let's get our train on! In this case the training file is reddit_jokes.txt, built from jokes scraped from /r/jokes. We are going to fine-tune the GPT-2 117M model on this custom text dataset. Small datasets work too, but the fine-tuning must not run for too long or the model will overfit badly.

The default training setting will save checkpoints every 1000 steps.
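A common way to notice overfitting is to hold back a small part of the text and watch whether the loss on it stops improving while the training loss keeps falling. The sketch below only shows the split. The function name and the 5% default are assumptions, and how (or whether) the training script accepts a separate validation file is not covered by this notebook.

```python
def split_holdout(text, holdout_fraction=0.05):
    """Split text into (train, holdout), cutting at a line boundary when one exists."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(text) * (1.0 - holdout_fraction))
    # Move the cut back to just after the previous newline so no line is split in two.
    newline = text.rfind("\n", 0, cut)
    if newline > 0:
        cut = newline + 1
    return text[:cut], text[cut:]

demo_text = "first joke\nsecond joke\nthird joke\nfourth joke\n" * 5
train_text, holdout_text = split_holdout(demo_text, holdout_fraction=0.1)
print(len(train_text), len(holdout_text))
```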

In [0]:
cd /content/gpt-2
In [0]:
!PYTHONPATH=src ./ --dataset /content/gpt-2/joke-dataset/reddit_jokes.txt
Saving checkpoint/run1/model-9
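The log line above suggests that checkpoints are named `model-<step>`. Before copying a run into the model directory, it can be useful to check which step is the newest one on disk. The helper below is a sketch under that naming assumption (files such as `model-9.index`). The function name is hypothetical, and the demo uses a temporary directory rather than the real `checkpoint/run1`.

```python
import os
import re
import tempfile

def latest_checkpoint_step(run_dir):
    """Return the highest step among files named model-<step>[.suffix], or None if there are none."""
    pattern = re.compile(r"^model-(\d+)(?:\.|$)")
    steps = []
    for name in os.listdir(run_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None

# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("checkpoint", "model-9.index", "model-1000.index"):
        open(os.path.join(demo_dir, name), "w").close()
    print(latest_checkpoint_step(demo_dir))
```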

3 Generate jokes

Load trained model for use

In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

Generate conditional samples from the model given a prompt. Change the top-k hyperparameter if desired (default is 40).

In [0]:
!python3 src/ --top_k 40
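To make the top-k setting concrete: at each step only the k highest-scoring tokens stay eligible, and the next token is sampled from those. The pure-Python sketch below illustrates the idea on a plain list of scores. It is not the repository's implementation, the function names are made up, and treating `k <= 0` as "no filtering" is a choice made for this sketch.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are all kept)."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def sample_top_k(logits, k, rng=random):
    """Sample an index from the softmax over the top-k filtered logits."""
    filtered = top_k_filter(logits, k)
    peak = max(filtered)
    # exp(-inf) is 0.0, so filtered-out tokens get zero weight.
    weights = [math.exp(x - peak) for x in filtered]
    return rng.choices(range(len(filtered)), weights=weights, k=1)[0]

scores = [1.0, 3.0, 2.0, 0.5]
print(top_k_filter(scores, 2))
print(sample_top_k(scores, 2))
```

A smaller k gives safer, more repetitive text, while a larger k admits more unusual tokens.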

To check flag descriptions, use:

In [0]:
!python3 src/ -- --help

Generate unconditional samples from the model

In [0]:
!python3 src/ | tee /tmp/samples

To check flag descriptions, use:

In [0]:
!python3 src/ -- --help