
ViTMem: Vision Transformers for Image Memorability

In this blog post we describe the ViTMem model and the Python package for estimating image memorability.

Image memorability estimation is the task of estimating the probability that a human will recognize an image as a repeat after having seen it only once.

Convolutional neural networks (CNNs) have to date provided the state of the art in image memorability estimation. Our experiments show that vision transformers can provide more reliable estimates than CNNs.

We trained a vision transformer model on a large set of images scored for memorability and named the resulting model ViTMem. To make the model easy for other researchers to use, we created a Python package that wraps it in a simple interface.

On our test set, ViTMem achieves a Spearman rank correlation of 0.77 and a mean squared error of 0.006.
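
For readers unfamiliar with these two metrics, the following is a minimal sketch of how they can be computed from paired observed and predicted scores. It uses only the Python standard library. The function names and the example scores are made up for illustration; this is not the evaluation code behind the numbers above.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_squared_error(x, y):
    """Mean of the squared differences between paired values."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up scores for illustration only; these are not ViTMem results.
observed = [0.62, 0.81, 0.74, 0.55, 0.90]
predicted = [0.65, 0.78, 0.80, 0.60, 0.85]

rho = spearman(observed, predicted)
mse = mean_squared_error(observed, predicted)
print(f"Spearman rank correlation: {rho:.2f}")
print(f"Mean squared error: {mse:.4f}")
```

The rank correlation measures how well the predicted ordering of images matches the observed ordering, while the mean squared error measures how close the predicted values are to the observed ones.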

Using the model

If you are new to Python, make sure that both Python and pip are installed before you continue; see, for example, the guides here and here.

The package can be installed from the Python Package Index (PyPI) with the following command:

pip install vitmem

Image memorability can be estimated with the following code. In this example, a file name is passed to the model, which returns a memorability estimate for that image. Note that the first time you use the ViTMem model, it downloads a required model file from the internet (about 327 MB).

from vitmem import ViTMem
model = ViTMem()
memorability = model("image.jpg")
print(f"Estimated memorability: {memorability}")

The ViTMem model interface is flexible with regard to input type. In the following example, a PIL Image object is passed to the model instead of a file name.

from PIL import Image
from vitmem import ViTMem
model = ViTMem()
image = Image.open("image.jpg")
memorability = model(image)
print(f"Estimated memorability: {memorability}")

The model also accepts an image tensor produced by the package's transform function.

from PIL import Image
from vitmem import ViTMem, transform

model = ViTMem()
image = Image.open("image.jpg")
# Convert the image to a tensor with the package's transform
tensor = transform(image)
memorability = model(tensor)
print(f"Estimated memorability: {memorability}")
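
Because the model can be called with a file name, scoring a whole folder of images is a small extension of the examples above. The following is a rough sketch, not part of the vitmem package: `rank_images`, `IMAGE_SUFFIXES`, and the list of file extensions are hypothetical names and choices made for this illustration. It also assumes the value returned by the model can be converted with `float()`, which this post does not state.

```python
from pathlib import Path

# Hypothetical choice of extensions to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def rank_images(folder, scorer):
    """Score every image in `folder` and return (name, score) pairs, highest first.

    `scorer` is any callable that maps a file name to a number.
    Assumption: its return value can be converted with float().
    """
    paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    scores = [(p.name, float(scorer(str(p)))) for p in paths]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With the package installed, passing a ViTMem instance as the scorer, for example rank_images("photos", ViTMem()) for a hypothetical folder named photos, should list the images from most to least memorable under the assumptions above. Taking the scorer as an argument also means the model is created only once, rather than once per image.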

The package is open source and available on GitHub.