Definition and Problem Overview
2023 was the year AI-generated images took the world by storm: seemingly overnight, they flooded social media. Internet users churned out AI-generated images ranging from realistic to amusing to completely absurd. However, while some enjoyed the viral AI-generated memes, others grew increasingly uneasy about how believable some of the images were becoming.
These images are created using text-to-image (TTI) generation, in which an image is produced from a user’s written prompt. This is a relatively new offshoot of deep learning: the first model that could draw a picture on written request, alignDRAW, appeared in 2015. The model relies on a bidirectional Recurrent Neural Network (RNN) and variational autoencoders. Its debut was preceded by the introduction of the Generative Adversarial Network (GAN) by Goodfellow et al. in 2014 and by Google’s Deep Dream, which is based on a Convolutional Neural Network (CNN) and focused on image pattern analysis and algorithmic pareidolia.
TTI’s development took a new turn in 2021, when OpenAI announced CLIP (Contrastive Language–Image Pre-training). It was designed with zero-shot capability in mind: once pre-trained on a large collection of image-text pairs, a single model can be applied to multiple tasks, such as image classification or matching text prompts to images, without task-specific training data.
While some herald TTI as the start of a new era in digital art, game design, and media production, it has also stirred controversy online, mainly among digital artists. The Do Not AI campaign was created to raise awareness of the ethics of TTI, whose models are trained on art that is available online.
TTI has also raised larger questions as to the future of AI art in particular. Can an image be considered “art” if it is machine generated? Can TTI become so sophisticated that its replication of art styles will be indistinguishable from human artists? And, more importantly, what implications does this have for fraudsters using TTI?
To see the direction TTI is heading, it’s important to first address how it has developed since its inception.
The Evolution of Text-to-Image Generators
Since 2015, a multitude of architectures have appeared, each offering a new perspective on how TTI can interpret natural language.
AttnGAN, or the Attentional Generative Adversarial Network, was released in 2017. It relies on two primary components:
· Attentional generative network. It converts the sentence vector into a conditioning vector, computes a word-context vector for each image sub-region, and combines the image features with the word-context features, among other steps.
· Deep attentional multimodal similarity model. It consists of two neural networks that map the words of a written prompt and the image sub-regions into a shared semantic space.
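The word-context step can be illustrated with a minimal sketch. It assumes that word features and image sub-region features have already been projected into the same D-dimensional space; the function name, shapes, and random inputs are illustrative and are not taken from the AttnGAN code.

```python
import numpy as np

def word_context_vectors(region_feats, word_feats):
    """Attend over words for every image sub-region.

    region_feats: (N, D) features of N image sub-regions.
    word_feats:   (T, D) features of T prompt words, in the same space.
    Returns the (N, D) word-context vectors and the (N, T) attention weights.
    """
    scores = region_feats @ word_feats.T              # (N, T) similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over words
    return weights @ word_feats, weights

rng = np.random.default_rng(0)
regions = rng.standard_normal((16, 8))   # e.g. a 4x4 grid of sub-regions
words = rng.standard_normal((5, 8))      # a five-word prompt
context, attention = word_context_vectors(regions, words)
```

Each row of `attention` shows how strongly one sub-region attends to each word of the prompt.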
According to the reported experiments, AttnGAN achieved an inception score of 4.36 on the CUB dataset. The inception score is a metric used to judge the quality of AI-generated images.
Diffusion models are inspired by non-equilibrium thermodynamics. Such a model employs a Markov chain of T diffusion steps to gradually add random noise to data. It then learns to reverse the diffusion and reconstruct data from the noise.
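For readers unfamiliar with the metric, here is a hedged sketch of how an inception score is usually computed from class probabilities. In practice the probabilities come from an Inception classifier applied to the generated images; the small arrays below are stand-ins chosen for illustration.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: (num_images, num_classes) predicted class probabilities.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)          # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions give a high score...
diverse = np.eye(4)
# ...while identical, uncertain predictions give the minimum score of 1.
uniform = np.full((4, 4), 0.25)
score_diverse = inception_score(diverse)
score_uniform = inception_score(uniform)
```

The score grows when each image is classified confidently and the set as a whole covers many classes.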
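The forward (noising) half of this process can be sketched in a few lines. The linear noise schedule, the chain length of 1000, and the toy 8x8 "image" are assumptions made for illustration; the learned reverse process is not shown.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Run the forward Markov chain, adding Gaussian noise at every step.

    Each step computes x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise.
    Returns the list [x_0, x_1, ..., x_T].
    """
    x = x0.copy()
    trajectory = [x]
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                       # toy stand-in for an image
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule, T = 1000
steps = forward_diffusion(x0, betas, rng)

# Fraction of the original signal variance that survives after T steps.
signal_kept = float(np.prod(1.0 - betas))
```

With this schedule almost none of the original signal survives, so the final sample is close to pure Gaussian noise, which is the starting point for the reverse process.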
Stable Diffusion by Stability AI premiered in 2022 and immediately enjoyed massive success online. The model addresses the cost of running resource-intensive function evaluations in pixel space. It does so by explicitly separating the compressive learning phase from the generative one: an autoencoding model first learns a lower-dimensional latent space that is perceptually equivalent to the image space.
Thanks to this perceptual image compression, barely noticeable high-frequency details are discarded, and the generative model is trained on the latent representations instead of raw pixels. Attention is thus paid only to the crucial bits of image data, which reduces computational costs. This approach is known as a Latent Diffusion Model (LDM).
The original Dall-E is trained by maximizing the evidence lower bound (ELBO) in two stages. In the first, a discrete variational autoencoder learns to compress RGB images into a grid of image tokens. In the second, BPE-encoded text tokens are concatenated with the image tokens, and an autoregressive transformer is trained to model the joint distribution over the text and image tokens. This helps achieve a generalization effect similar to the one observed in AI text generators.
Dall-E 2 is based on a diffusion model and employs direct CLIP embedding, classifier-free guidance, and other concepts that help it economize resources: it has only 3.5 billion parameters, whereas the original Dall-E had 12 billion.
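A quick back-of-the-envelope calculation shows where the savings come from. The figures below (a 512x512 RGB image, a downsampling factor of 8, and 4 latent channels) are commonly quoted for Stable Diffusion but are assumptions here, not values stated in this article.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape of the latent grid for an image of the given size (assumed settings)."""
    return (height // factor, width // factor, latent_channels)

pixel_values = 512 * 512 * 3              # values the model would process in pixel space
h, w, c = latent_shape(512, 512)
latent_values = h * w * c                 # values processed in latent space
ratio = pixel_values / latent_values
```

Under these assumptions the diffusion model works on a 64x64x4 grid, i.e. 48 times fewer values than the original image.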
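For reference, the evidence lower bound in its generic variational-autoencoder form is given below. This is the textbook version, with x as the data, z as the latent code, q_φ as the encoder, and p_θ as the decoder; the objective in the Dall-E paper additionally conditions on the text, so treat this as a simplified illustration rather than the exact training loss.

```latex
\ln p_\theta(x) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[ \ln p_\theta(x \mid z) \right]
\;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of the image from its tokens, while the second keeps the encoder's distribution close to the prior.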
In October 2023, Dall-E 3 premiered. The key feature of this version is its integration with ChatGPT, which simplifies prompting. Previously, users had to write detailed prompts to generate a desirable drawing. The LLM component makes it possible for the generator to understand brief user instructions.
Midjourney is a TTI service launched in 2022 by an independent lab. The authors have not disclosed the model’s actual architecture, but it has been suggested that it builds on v-diffusion and on techniques such as Progressive Distillation, which speeds up sampling by repeatedly distilling a diffusion model into one that needs half as many sampling steps.
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a TTI model also developed by OpenAI. It builds on a diffusion model architecture previously used for ImageNet and extends it with text conditioning: the prompt is encoded into a sequence of K tokens, which are fed into a Transformer model.
The authors report scaling the model width to 512 channels, which yields roughly 2.3 billion parameters for the model’s visual component. An upsampling model with 1.5 billion parameters is trained additionally to achieve a higher resolution. After pre-training, the model is fine-tuned with the same procedure, except that about 20% of the text token sequences are replaced with empty sequences. This lets the model keep generating images unconditionally, which classifier-free guidance relies on.
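The halving idea behind Progressive Distillation can be illustrated with a tiny sketch. It only tracks how the number of sampling steps shrinks from one distillation round to the next; the starting value of 1024 steps and the function name are assumptions, and nothing here reflects Midjourney's undisclosed implementation.

```python
def distillation_schedule(initial_steps, rounds):
    """Number of sampling steps after each distillation round (halved every time)."""
    steps = [initial_steps]
    for _ in range(rounds):
        steps.append(max(1, steps[-1] // 2))
    return steps

schedule = distillation_schedule(1024, 8)
```

After eight rounds a sampler that started with 1024 steps needs only 4, which is where the speed-up comes from.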
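The role of the empty sequences can be sketched as follows. During training, a share of captions is dropped so the model also learns an unconditional prediction; at sampling time, the conditional and unconditional noise predictions are combined. The helper names, the token lists, and the guidance scale are illustrative assumptions rather than GLIDE's actual code.

```python
import numpy as np

def drop_captions(captions, drop_prob, rng):
    """Replace each caption's token sequence with the empty sequence with probability drop_prob."""
    return [() if rng.random() < drop_prob else tuple(c) for c in captions]

def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
batch = [[101, 7, 42], [101, 9], [101, 3, 3, 8]]       # hypothetical token ids
training_batch = drop_captions(batch, drop_prob=0.2, rng=rng)

eps_cond = np.array([0.5, -1.0])      # model output given the caption
eps_uncond = np.array([0.1, -0.2])    # model output given the empty sequence
eps_guided = guided_noise(eps_cond, eps_uncond, scale=3.0)
```

A scale of 1 reproduces the purely conditional prediction, while larger values trade diversity for closer adherence to the prompt.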
Craiyon (DALL-E mini)
Craiyon, formerly known as Dall-E Mini, is a free, simplified TTI generator inspired by Dall-E. Unlike paid services, it has a limited set of features, offering only three picture modes: Art, Drawing, and Photo. Users have observed that Craiyon delivers lower-quality output and often struggles to depict human anatomy correctly, especially faces.
X-LXMERT is a cross-modality transformer that takes image and text inputs. It extends LXMERT with image generation capabilities, which rely on discretizing visual representations, uniform masking with a wide range of masking ratios, and aligning the pre-training datasets with the right objectives. The generated images are assessed with the inception score for image quality and the Fréchet Inception Distance (FID) for authenticity, among other metrics. The complete architecture allows X-LXMERT to generate both images and captions.
Comparison and Experiments
Three solutions (Stable Diffusion, Midjourney, and Dall-E 2) were pitted against each other in a test to determine which TTI generator produces the most realistic human faces. Fréchet Inception Distance (FID) was used as the primary evaluation metric, and captions from the COCO dataset were used to formulate prompts.
Among these contenders, Midjourney showed the worst results, while Stable Diffusion performed best. Even so, according to the FID evaluation, its images were still far less believable than photos of real faces.
A study suggests that the TIFA metric could be useful for gauging how accurately an AI-generated image reflects its prompt. TIFA (Text-to-Image Faithfulness evaluation with Question Answering) measures the faithfulness of a generated image to the text input through Visual Question Answering (VQA). The approach relies on multiple-choice question-answer pairs generated from the prompt, each consisting of a question Qi, a set of answer choices Ci, and the so-called “gold answer” Ai ∈ Ci. A VQA model then answers the questions by looking at the image, and the share of correct answers serves as the score. The TIFA v1.0 benchmark is offered with premade question-answer pairs.
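To make the metric more concrete, here is a hedged sketch of the Fréchet distance between two sets of feature vectors, each summarized by its mean and covariance. In a real FID evaluation the features come from an Inception network; the random arrays below are stand-ins, and lower values mean the two sets are more alike.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))      # stand-in for features of real photos
fake = real + 3.0                          # same spread, shifted mean
fd_same = frechet_distance(real, real)
fd_shifted = frechet_distance(real, fake)
```

Identical feature sets give a distance of about zero, while shifting every feature by 3 across 4 dimensions adds 4 x 3² = 36 through the mean term alone.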
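The scoring step can be sketched as plain VQA accuracy over the question-answer pairs. The `stub_vqa` function below is a hypothetical stand-in that "looks" at a set of tags instead of a real image, and the questions are invented for illustration; an actual evaluation would plug in a trained VQA model and the benchmark's own pairs.

```python
def tifa_score(image, qa_items, vqa_model):
    """Share of questions for which the VQA model picks the gold answer."""
    correct = 0
    for question, choices, gold in qa_items:
        if vqa_model(image, question, choices) == gold:
            correct += 1
    return correct / len(qa_items)

def stub_vqa(image, question, choices):
    """Hypothetical VQA model: returns the first choice found among the image tags."""
    for choice in choices:
        if choice in image:
            return choice
    return choices[0]

# Each item is (question Q, answer choices C, gold answer A).
qa_items = [
    ("What animal is in the picture?", ["dog", "cat", "horse"], "cat"),
    ("What is the cat sitting on?", ["table", "sofa", "grass"], "sofa"),
    ("What is next to the cat?", ["lamp", "ball", "book"], "ball"),
]
image_tags = {"cat", "sofa"}               # stand-in for a generated image
score = tifa_score(image_tags, qa_items, stub_vqa)
```

Here the stand-in image contains the cat and the sofa but not the ball, so two of the three questions are answered correctly.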
Using AI-Generated Images for Spoofing
Concerns are being raised about the potential misuse of artificially generated images for disseminating disinformation. For instance, a fabricated image of Trump being arrested could be verified against actual news reports, but many social media users were initially deceived by a widely circulated image of Pope Francis dressed in high fashion.
Recently, Midjourney decided to prohibit the use of the word "arrested" in prompts following a surge in the creation of false images of Trump. The company's founder announced via Discord that free trials of the software were suspended earlier this week due to "extraordinary demand and trial abuse," as reported by Gizmodo.
The previous summer, Midjourney also decided to ban images of China's President Xi Jinping to "minimize drama." The company's founder explained on Discord, "Political satire in China is pretty not-okay and at some point would endanger people in China from using the service."
The issue of age-appropriate content is another ongoing challenge with generative AI, as is the concern of privacy, such as the creation of explicit content featuring real individuals.
On Midjourney, all content generated must adhere to a "PG-13" rating, despite the fact that Stable Diffusion was "trained" using explicit and adult content. The company's founder acknowledges that they are grappling with establishing appropriate guidelines.
Other potential examples of AI-generated image misuse include:
- Facial Recognition Systems: An attacker could generate a realistic image of a person's face based on a text description and use this image to spoof a facial recognition system. This could lead to unauthorized access to systems and services.
- E-commerce Platforms: An attacker could generate an image of a product or item based on a text description and use this image to spoof an e-commerce platform or an automated inventory management system. This could lead to financial losses and damage to the reputation of businesses.
- Malware Classification: In another related case, GANs have been used to generate images for use in image-based malware classification. While this is a legitimate use of the technology, it illustrates how GAN-generated images could potentially be used in more malicious ways, such as spoofing malware detection systems.
Detecting AI-Generated Images
A number of synthesized media detectors have already been announced or launched, among them Sensity, Deepware Scanner, and Microsoft Video Authenticator. However, malicious actors can draw on a wide range of techniques to mount a successful spoofing attack. These include adding adversarial perturbations, removing traces that indicate the use of a neural network (e.g. GAN fingerprints), adding fake camera fingerprints, and so forth.
Image generation is not the only new frontier of spoofing threats. The output of text-based generators is also becoming increasingly difficult to distinguish from human language.
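To show why adversarial perturbations are a concern for detectors, here is a toy sketch of the fast gradient sign method (FGSM) applied to a simple logistic-regression "detector". The detector weights, the 64-value input, and the step size are all made up for illustration; real detectors are deep networks, and this is not how any of the products named above work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, label, eps):
    """Shift every input value by at most eps in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad = (p - label) * w                 # gradient of cross-entropy w.r.t. x
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)                # toy detector weights (assumed)
b = 0.0
x = rng.random(64)                         # toy "image" with values in [0, 1]

before = sigmoid(w @ x + b)                # detector's "synthetic" probability
x_adv = fgsm_perturb(x, w, b, label=1.0, eps=0.05)
after = sigmoid(w @ x_adv + b)
```

Although no input value changes by more than 0.05, the detector's confidence that the input is synthetic drops, which is why detectors need to be tested against such perturbations.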
- ‘ChatGPT said I did not exist’: how artists and writers are fighting back against AI
- AlignDRAW model for generating images by learning an alignment between the input captions and generating canvas
- The ‘Martian face’ image is an example of pareidolia studied in image synthesis
- AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
- What are Diffusion Models?
- High-Resolution Image Synthesis with Latent Diffusion Models
- Midjourney image that won first prize in a digital art competition
- X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
- Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2
- Understanding Variational Autoencoders (VAEs)
- Welcome to Deep Dream Generator
- Stable Diffusion Online
- DALL·E now available without waitlist
- DALL·E 2 is an AI system that can create realistic images and art from a description in natural language
- ELBO — What & Why
- OpenAI debuts DALL-E for generating images from text
- Is Midjourney AI more-or-less the same architecture as DALL-E 2?
- Progressive Distillation for Fast Sampling of Diffusion Models
- GLIDE by OpenAI
- Diffusion Models Beat GANs on Image Synthesis
- Fréchet inception distance
- COCO dataset
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
- TIFA: Text-to-Image Faithfulness Evaluation with Question Answering
- Sensity. Deepfake Detection
- Deepware Scanner
- New Steps to Combat Disinformation