Generate Jokes using GPT-2

Make sure the GPU is enabled (Edit -> Notebook settings -> Hardware accelerator -> GPU)

1 Prepare the data

Clone the repo and cd into it

In [0]:
!git clone https://github.com/nshepperd/gpt-2.git
Cloning into 'gpt-2'...
remote: Enumerating objects: 212, done.
remote: Total 212 (delta 0), reused 0 (delta 0), pack-reused 212
Receiving objects: 100% (212/212), 4.37 MiB | 14.62 MiB/s, done.
Resolving deltas: 100% (112/112), done.
In [0]:
cd gpt-2
/content/gpt-2

Install requirements

In [0]:
!pip3 install -r requirements.txt
Requirement already satisfied: fire>=0.1.3 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 1)) (0.1.3)
Requirement already satisfied: regex==2017.4.5 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 2)) (2017.4.5)
Requirement already satisfied: requests==2.21.0 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 3)) (2.21.0)
Requirement already satisfied: tqdm==4.31.1 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 4)) (4.31.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from fire>=0.1.3->-r requirements.txt (line 1)) (1.11.0)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests==2.21.0->-r requirements.txt (line 3)) (2.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests==2.21.0->-r requirements.txt (line 3)) (2019.3.9)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests==2.21.0->-r requirements.txt (line 3)) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests==2.21.0->-r requirements.txt (line 3)) (1.22)

Download the model

In [0]:
!python3 download_model.py 117M
Fetching checkpoint: 1.00kit [00:00, 757kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 56.3Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 844kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:08, 61.7Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 5.60Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 53.7Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 54.3Mit/s]                                                       
Set the I/O encoding so sampled text prints cleanly. Note that !export only affects a throwaway subshell in a notebook, so use the %env magic instead:

In [0]:
%env PYTHONIOENCODING=UTF-8

Clone joke dataset

In [0]:
cd /content/gpt-2
/content/gpt-2
In [0]:
! git clone https://github.com/taivop/joke-dataset
Cloning into 'joke-dataset'...
remote: Enumerating objects: 37, done.
remote: Total 37 (delta 0), reused 0 (delta 0), pack-reused 37
Unpacking objects: 100% (37/37), done.

Extract the joke text and save it to plain text files

In [0]:
cd joke-dataset/
/content/gpt-2/joke-dataset
In [0]:
import json

# Each file is a JSON list of joke records with at least a "body" field.
with open("stupidstuff.json") as f:
  jokes = json.load(f)
  # Separate jokes with blank lines so they don't run together in the training text.
  text = '\n\n'.join(x.get("body", "") for x in jokes)
  with open('stupidstuff.txt', 'w') as out_file:
    out_file.write(text)

with open("wocka.json") as f:
  jokes = json.load(f)
  text = '\n\n'.join(x.get("body", "") for x in jokes)
  with open('wocka.txt', 'w') as out_file:
    out_file.write(text)

def catjoke(x):
  # Reddit jokes split the setup (title) and punchline (body) across two fields.
  title = x.get("title", "")
  body = x.get("body", "")
  return title + "\n" + body

with open("reddit_jokes.json") as f:
  jokes = json.load(f)
  text = '\n\n'.join(catjoke(x) for x in jokes)
  with open('reddit_jokes.txt', 'w') as out_file:
    out_file.write(text)
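The reddit records also carry a score field (per the joke-dataset repo); a hedged sketch of dropping low-scoring jokes before writing the training text (the threshold and function name are illustrative, not part of the notebook):

```python
# Sketch: drop low-scoring reddit jokes before building the training text.
# The "score" field appears in reddit_jokes.json; the threshold is arbitrary.

def filter_jokes(records, min_score=5):
    """Keep "title\\nbody" for records whose score clears the threshold."""
    return [
        (rec.get("title", "") + "\n" + rec.get("body", "")).strip()
        for rec in records
        if rec.get("score", 0) >= min_score
    ]

# Tiny inline sample in the same shape as the reddit records.
sample = [
    {"title": "Why did the chicken cross the road?",
     "body": "To get to the other side.", "score": 12},
    {"title": "Low effort", "body": "no punchline", "score": 1},
]
kept = filter_jokes(sample)
```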

2 Start training

Let's get our train on! In this case the training file is reddit_jokes.txt, jokes scraped from /r/jokes. We are going to fine-tune the GPT-2 117M model on this custom text dataset. Small datasets work, but be careful not to run the fine-tuning for too long or the model will overfit badly.
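As a rough guide to what "too long" means: assuming train.py's defaults of batch size 1 and a 1024-token context per step (an assumption from the nshepperd fork), one full pass over the roughly 12.3 million-token dataset reported by the run below is about 12,000 steps:

```python
# Back-of-envelope: how many training steps make one epoch over the dataset.
# Assumes batch_size=1 and 1024 tokens consumed per step (train.py defaults).
dataset_tokens = 12_344_266   # token count reported when the dataset is loaded
tokens_per_step = 1 * 1024
steps_per_epoch = dataset_tokens // tokens_per_step
print(steps_per_epoch)  # 12054
```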

By default, train.py saves a checkpoint every 1000 steps (adjustable with the --save_every flag), writing it to checkpoint/run1.

In [0]:
cd /content/gpt-2
/content/gpt-2
In [0]:
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/joke-dataset/reddit_jokes.txt
2019-04-19 04:43:39.110980: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-04-19 04:43:39.111312: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x172a260 executing computations on platform Host. Devices:
2019-04-19 04:43:39.111350: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-19 04:43:39.273724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-19 04:43:39.274248: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x1729e40 executing computations on platform CUDA. Devices:
2019-04-19 04:43:39.274289: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-04-19 04:43:39.274681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-04-19 04:43:39.274710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-19 04:43:39.733811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-19 04:43:39.733876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-19 04:43:39.733888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-19 04:43:39.734209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14115 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /content/gpt-2/src/sample.py:51: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /content/gpt-2/src/sample.py:53: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Loading checkpoint models/117M/model.ckpt
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Loading dataset...
100% 1/1 [01:14<00:00, 74.17s/it]
dataset has 12344266 tokens
Training...
2019-04-19 04:45:20.203553: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[1 | 8.00] loss=3.15 avg=3.15
[2 | 10.17] loss=3.19 avg=3.17
[3 | 12.35] loss=3.50 avg=3.28
[4 | 14.54] loss=3.40 avg=3.31
[5 | 16.73] loss=3.66 avg=3.38
[6 | 18.93] loss=3.42 avg=3.39
[7 | 21.12] loss=3.10 avg=3.35
[8 | 23.32] loss=3.37 avg=3.35
interrupted
Saving checkpoint/run1/model-9
^C
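The loss printed at each step is the average cross-entropy per token in nats, so its exponential is the perplexity on the training batches; the running average of 3.35 above corresponds to a perplexity of about 28.5:

```python
import math

# Perplexity = exp(cross-entropy per token, measured in nats).
avg_loss = 3.35                 # running average from the log above
perplexity = math.exp(avg_loss)
print(round(perplexity, 1))     # 28.5
```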

3 Generate jokes

Copy the fine-tuned checkpoint over the downloaded 117M model so the sampling scripts load it

In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

Generate conditional samples from the model given a prompt. Adjust the top_k hyperparameter if desired; here we pass 40, so only the 40 most likely tokens are considered at each sampling step.

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 40
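For intuition about what top_k does: sampling is restricted to the k most likely next tokens, with the rest of the probability mass discarded and the survivors renormalized. A minimal plain-Python sketch (the toy distribution is made up, not model output):

```python
import random

def top_k_filter(probs, k):
    """Zero out all but the k largest probabilities, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def sample(probs, rng=random):
    """Draw an index from a discrete distribution by inverse CDF."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Toy distribution over 4 "tokens"; with k=2 only indices 1 and 2 survive.
probs = [0.1, 0.4, 0.3, 0.2]
filtered = top_k_filter(probs, k=2)
```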

To check flag descriptions, use:

In [0]:
!python3 src/interactive_conditional_samples.py -- --help

Generate unconditional samples from the model

In [0]:
!python3 src/generate_unconditional_samples.py | tee /tmp/samples

To check flag descriptions, use:

In [0]:
!python3 src/generate_unconditional_samples.py -- --help