A single-day generative-AI game in less than 200 lines of code
Recently, while flipping through my newsfeed, I came across a clickbait post: "how a neural network draws sayings".
It was fun to look at the pictures, and I suddenly caught myself wanting to guess the proverb from the image. After reflecting a little on that urge, I decided to turn it into a simple game, similar in spirit to wordalle ("Guess which phrase is drawn"), but in trivia style.
Since generative models are now very easy to use, I decided to create such a game in a couple of evenings.
Idea:
1. Take a list of films
2. Generate a picture using the film title as a prompt
3. Show a quiz with 4 answer options
4. The game is won after 10 correctly guessed pictures in a row
To make the game more interesting, the 3 wrong options should also fit the picture to some extent rather than being completely random (a similar idea is used to increase engagement in the Imaginarium / Dixit board games).
The first model generates the image from the film's title.
The second model looks for the 4 film titles that describe the generated image best. As the source list, I decided to use the titles of the 1000 best films from IMDB.
0. Creating the project folders: a place to store the generated images and the web page of the game (see GitHub)
1. Reading the 1000 best IMDB movies
import pandas as pd

# the column name contains a trailing '\r\n', exactly as it appears in the CSV
movies = pd.read_csv('imdb (1000 movies) in june 2022.csv')
names = movies['movie name\r\n'][:]
2. Generating the images using a pretrained model from Hugging Face. Although the code looks simple, the generation hides some complexity under the hood.
import os
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", use_auth_token=False
).to("cuda")

answers = [0] * 1000
for i in range(1000):
    # skip images that were already generated on a previous run
    if '{}.png'.format(i) not in os.listdir('1000movies/'):
        image = pipe([names[i]])['images'][0]
        answers[i] = names[i]
        image.save('1000movies/{}.png'.format(i))
        print(i)
3. To generate the hard, misleading variants, we use the CLIP model.
import os
import tqdm
import torch
import numpy as np
from PIL import Image
from natsort import natsorted
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

imagenames = natsorted(os.listdir('1000movies/'))
images = [Image.open('1000movies/' + imgname) for imgname in imagenames]

# embed every movie title into CLIP's text space
moviesfeat = []
for name in tqdm.tqdm(names):
    moviesfeat.append(model.get_text_features(**processor.tokenizer([name], return_tensors='pt')))
moviesfeat = torch.cat(moviesfeat, 0)

# embed every generated image into CLIP's image space
img_embed = []
for i in tqdm.tqdm(range(len(images))):
    batch = processor(text=None, images=images[i:i+1], return_tensors='pt', padding=True)['pixel_values']
    img_embed.append(model.get_image_features(pixel_values=batch))
img_embed = torch.cat(img_embed, 0)

# cosine similarity = scalar product of L2-normalized vectors
imgnorm = img_embed / img_embed.norm(dim=1).unsqueeze(1)
textnorm = (moviesfeat / moviesfeat.norm(dim=1).unsqueeze(1)).T
sim = imgnorm @ textnorm
np.save('movieSimilarity.npy', sim.detach().numpy())
Here is a simple explanation of the code above:
The CLIP model is an essential building block of most vision-language models. Its main task is to take an image and a text and transform both into a shared embedding space: any text and any image can be represented as a vector in a 512-dimensional space. CLIP performs this transformation consistently, so if the text precisely describes the image, the two vectors end up close to each other. This way, for each generated image and each movie title we get a 512-dimensional vector.
The similarity between the 1000 movie titles and the 1000 generated images can be measured as cosine similarity: a value that equals 1 when two vectors point in the same direction and -1 when they point in opposite directions. The formula is just the scalar product of the normalized vectors.
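As a toy illustration (my own example, not part of the game code), here is cosine similarity for two small numpy vectors:

import numpy as np

def cosine(u, v):
    # scalar product of L2-normalized vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0  (same direction)
print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (opposite directions)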
Measuring this metric for each (generated image, movie title) pair gives a similarity matrix, which is interesting to observe. Here is the top-left corner of the matrix, corresponding to the top 100 films according to IMDb:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))
ax.matshow(sim[:100, :100], cmap=plt.cm.Blues)
An important observation is that the main diagonal is highly prominent, so the generated images really do correspond to their film titles. The dark points off the main diagonal correspond to movie titles that could be interpreted as descriptions of the image, so they are candidates for the "fake variants".
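For illustration, here is a minimal sketch of how such variants could be picked from the similarity matrix; the helper name pick_distractors is hypothetical, and names is the title list from step 1:

import numpy as np

sim = np.load('movieSimilarity.npy')

def pick_distractors(i, k=3):
    # titles most similar to image i, in descending order of similarity
    order = np.argsort(sim[i])[::-1]
    # exclude the correct answer itself, keep the k strongest "lookalikes"
    candidates = [j for j in order if j != i]
    return [names[j] for j in candidates[:k]]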
The UI and the game logic can be implemented in Python (see the file diffgame_inference.py on GitHub; it is self-explanatory).
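For a rough idea of what that file does, one quiz round might look like the following sketch (a simplification under my own assumptions, reusing pick_distractors from above, not the actual diffgame_inference.py):

import random

def play_round(i):
    # one correct title plus three misleading variants, shuffled
    options = [names[i]] + pick_distractors(i, k=3)
    random.shuffle(options)
    print('Which movie is drawn in 1000movies/{}.png?'.format(i))
    for n, option in enumerate(options):
        print('{}. {}'.format(n + 1, option))
    guess = int(input('Your answer: '))
    return options[guess - 1] == names[i]

# the game is won after 10 correct guesses in a row
streak = 0
while streak < 10:
    streak = streak + 1 if play_round(random.randrange(1000)) else 0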
The problem with such a "game" is that an inexperienced user cannot launch it, and it cannot be launched on a mobile device at all. This problem is typical for any Python-based development, and in many cases Python prototypes have to be rewritten for better accessibility.
Since I have all the necessary data (images, movie titles, and the similarity matrix) and the inference logic is quite simple, it can be reimplemented in JS as a web-based app. Even better, such an app does not need a backend at all: the logic is simple enough to run entirely on the client side.
So, first, all the data should be serialized as JSON objects. The matrix becomes a list of lists:
import json

class NumpyArrayEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

# the similarity matrix computed in step 3
numpyData = np.load('movieSimilarity.npy')
with open('diffgamehtml/js/similaritytable.json', 'w') as f:
    json.dump(numpyData, f, cls=NumpyArrayEncoder)
And the movie titles as a list:
movies['movie name\r\n'].to_json('diffgamehtml/js/movieslist.json')
Since I have almost no experience with JS frameworks, the implementation may look quite ugly: it uses only vanilla JS, which I have almost forgotten. The whole project is written in vanilla HTML/CSS/JS and placed in the diffgamehtml folder. All the logic is less than 100 lines of code and sits in the gamelogics.js file; there is no need to explain it here. After some testing, an idea for difficulty tweaks came to mind: each difficulty level uses a different size of the matrix corner (from the top 50 to the top 1000 films).
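In Python terms (the JS version does the equivalent), a difficulty setting that restricts the game to the top-N corner of the matrix could look like this hypothetical sketch:

def pick_distractors_leveled(i, top_n=50, k=3):
    # easier difficulty: draw images and variants only from the well-known top-N films
    corner = sim[:top_n, :top_n]            # requires i < top_n
    order = np.argsort(corner[i])[::-1]
    return [names[j] for j in order if j != i][:k]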
Then I deployed it on my server as a static HTML web page: https://aidle.org/diffusionguesser/