
I'm a newbie to AI. After using GenAI for some time, I was wondering how different an AI-generated image can be from the data it was trained on. To be clearer, I can give an example: there is a site called thispersondoesnotexist dot com, and each refresh shows a new human face. Technically speaking, if we take one of the generated images and compare it to the images in the dataset, will we find a similar image in the training set, just slightly changed? Maybe skin color, wider eyebrows, etc.?

Omair

1 Answer


In theory, each image generated by the most popular probabilistic models, such as VAEs and diffusion models, should be a novel creation that doesn't directly replicate any single image from the training set. However, because of how training works, a generated image can sometimes closely resemble a specific training image, either as a whole or in isolated objects, with only minor variations. This phenomenon, known as memorization, is problematic: the model inadvertently reproduces parts of its training data.

We know that image generators create impressive original works, but do they sometimes replicate their training data? Recent work found that replication does occur. A common way to detect similarity between two images is to produce an embedding of each and compute the dot product between the embeddings; high values indicate similar images. However, while this method catches large-scale similarity, it can miss local matches. To detect a small region shared by two images, one strategy is to split the embeddings into pieces, compute the dot product between corresponding pieces, and look for high values, as sketched below.
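
Here is a minimal sketch of that idea in Python, assuming the embeddings are plain NumPy vectors. The random vectors only stand in for the output of a real image-embedding model (e.g., a CLIP-style encoder), which is not shown here.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of L2-normalized vectors: values near 1.0 mean very similar."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def max_chunk_sim(a, b, n_chunks=8):
    """Split both embeddings into equal chunks and compare chunk-wise.

    A high score in any single chunk can flag a locally copied region
    (e.g., one object) even when the whole-image score is low.
    """
    a_chunks = np.array_split(a, n_chunks)
    b_chunks = np.array_split(b, n_chunks)
    return max(cosine_sim(x, y) for x, y in zip(a_chunks, b_chunks))

# Hypothetical usage: random vectors stand in for embeddings of a
# generated image and of one training image.
rng = np.random.default_rng(0)
gen_emb = rng.normal(size=512)    # embedding of a generated image
train_emb = rng.normal(size=512)  # embedding of a training image

print("whole-image similarity:", cosine_sim(gen_emb, train_emb))
print("max chunk similarity:  ", max_chunk_sim(gen_emb, train_emb))
```

In practice you would embed the generated image and every training image with the same encoder, then flag pairs whose whole-image or chunk-wise similarity exceeds some threshold as possible memorization.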

While these models are designed to produce novel outputs, they can only recombine patterns drawn from the training data's distribution, so guaranteeing that generated images go beyond the learned patterns without memorizing specific examples is still an open area of AI research.

cinch