
The introduction of Generative AI led to prompt-based media generators that can draw, compose music, or write sophisticated texts. Text-to-Video (T2V) is one instance of GenAI, with the first documented video generator being Sync-DRAW (2016). A new wave of AI video tools inspired by Text-to-Image (T2I) synthesizers has appeared recently: for example, Make-A-Video by Meta was announced about a year after T2I models became widespread.

However, generating a video clip from a written prompt is much more challenging than generating a static image. Both diffusion-based and non-diffusion T2V models require tremendous amounts of computational power and memory, as well as extensive high-quality datasets, which creates serious deployment and scaling issues.
Even though T2V is still in its infancy, it is already considered a tangible threat, since anyone will be able to create a completely fabricated video without the specific skills or knowledge that deepfake production typically requires.

Main Methods of Generating Video
Several main methods are currently employed in T2V creation:
Unguided generation
A rather rudimentary approach, regarded as the earliest method of video generation. It turns a single-frame image into a repetitive scene, such as moving ocean waves. The technique is described as producing a “spatially repetitive pattern with time-varying visualization”. It is dubbed unguided because no textual prompt provides instructions.
Generative Adversarial Networks
Generative Adversarial Networks (GANs), first explored in 2014, have been applied to create highly realistic images. Essentially, a GAN consists of two main elements:
- Generator (G), which transforms input noise into synthesized samples.
- Discriminator (D), which classifies the received samples as real or fake.
Video GANs have a specific architecture that combines a Recurrent Neural Network (RNN) with 2D convolutional networks; this is necessary because video data adds a temporal dimension. Alternative approaches use 3D convolutional networks, two-stream video generation, or a coarse-to-fine technique based on a progressive architecture. A minimal generator/discriminator sketch is given below.
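The following sketch (PyTorch) illustrates the RNN-plus-2D-convolution idea described above; the module choices and tensor sizes are illustrative assumptions, not taken from any specific video-GAN paper.

```python
# Minimal video-GAN sketch: an RNN drives per-frame latents, a shared 2D
# transposed-conv decoder renders each frame, and a 3D-conv discriminator
# judges whole clips. All sizes are illustrative.
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    def __init__(self, z_dim=64, hidden=128, channels=3):
        super().__init__()
        self.rnn = nn.GRU(z_dim, hidden, batch_first=True)        # temporal dimension
        self.decode = nn.Sequential(                              # shared per-frame 2D decoder
            nn.ConvTranspose2d(hidden, 64, 4, 2, 1), nn.ReLU(),   # 4x4 -> 8x8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),       # 8x8 -> 16x16
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Tanh(), # 16x16 -> 32x32
        )
        self.hidden = hidden

    def forward(self, z):                              # z: (batch, frames, z_dim)
        h, _ = self.rnn(z)                             # (batch, frames, hidden)
        b, t, _ = h.shape
        h = h.reshape(b * t, self.hidden, 1, 1)
        h = nn.functional.interpolate(h, size=(4, 4))  # seed 4x4 feature map
        frames = self.decode(h)                        # (b*t, C, 32, 32)
        return frames.reshape(b, t, *frames.shape[1:]) # (batch, frames, C, H, W)

class VideoDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(                      # 3D convs see space and time jointly
            nn.Conv3d(channels, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, video):                          # video: (batch, frames, C, H, W)
        return self.net(video.transpose(1, 2))         # Conv3d expects (batch, C, T, H, W)

# Smoke test: 2 clips of 8 frames each.
G, D = VideoGenerator(), VideoDiscriminator()
fake = G(torch.randn(2, 8, 64))
print(fake.shape, D(fake).shape)    # torch.Size([2, 8, 3, 32, 32]) torch.Size([2, 1])
```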

Variational Autoencoders
In simple terms, autoencoders strip input data of unnecessary noise by compressing and then reconstructing it, which lets the model concentrate on the most informative features. This approach can be beneficial for video generation, as it helps economize computational resources.
Auto-Encoding Variational Bayes is an efficient concept because it relies on the Stochastic Gradient Variational Bayes (SGVB) estimator, which allows it to work with continuous latent variables and very large datasets. Another domain where video autoencoding is applicable is learning camera poses and disentangled representations.
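A minimal sketch of the reparameterization trick at the heart of the SGVB estimator, in PyTorch; the network sizes and the use of flattened frames as inputs are illustrative assumptions.

```python
# Minimal Variational Autoencoder sketch, illustrating the reparameterization
# trick behind the SGVB estimator: sample z = mu + sigma * eps so gradients
# flow through mu and sigma while eps carries the randomness.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # noise is the only stochastic node
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterized sample
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
x = torch.rand(8, 784)                 # a toy batch of flattened frames
x_hat, mu, logvar = model(x)
print(elbo_loss(x, x_hat, mu, logvar).item())
```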

Diffusion Models
Latent Diffusion Models (LDMs) are known for their ability to generate high-quality images or audio without demanding too much computational power, which is possible because they operate in a compressed, lower-dimensional latent space.
For video synthesis, a pre-trained image-generating model is augmented with additional neural network layers responsible for consistent and temporally accurate frame alignment; these temporal layers are interleaved with the existing spatial layers. When put together, these elements provide what the authors call a video-aware temporal “backbone”.
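A simplified sketch of the interleaving idea, not the actual Video LDM code: a frozen spatial block (here a plain convolution standing in for a pretrained image block) processes each frame independently, then a temporal attention layer mixes information across frames at every spatial location.

```python
# Sketch of inserting a temporal layer after a frozen spatial block so an image
# backbone becomes video-aware. Shapes and modules are illustrative.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for a pretrained image block
        self.temporal = nn.MultiheadAttention(channels, heads, batch_first=True)
        for p in self.spatial.parameters():
            p.requires_grad = False       # image weights stay frozen; only temporal layers train

    def forward(self, x):                 # x: (batch, frames, C, H, W)
        b, t, c, h, w = x.shape
        y = self.spatial(x.reshape(b * t, c, h, w))            # per-frame spatial pass
        y = y.reshape(b, t, c, h, w)
        seq = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c) # each pixel becomes a length-t sequence
        mixed, _ = self.temporal(seq, seq, seq)                 # attention across frames only
        mixed = mixed.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return y + mixed                                        # residual: temporal layer refines spatial output

block = SpatialTemporalBlock()
video_latents = torch.randn(2, 8, 64, 16, 16)   # (batch, frames, C, H, W) in latent space
print(block(video_latents).shape)               # torch.Size([2, 8, 64, 16, 16])
```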

The Early Video Generators
Before the emergence of today’s generative models, a group of T2V solutions was built on other methods.
Sync-DRAW
Sync-DRAW, presented in 2016, was the first and simplest T2V solution. It is based on learning the data distribution with a Recurrent Variational Auto-Encoder (R-VAE). This approach makes it possible to temporally synchronize the frames of a video while also letting a user type in the prompt they want.
It also employs a recurrent visual attention mechanism for every frame in a video clip, which lets it explore and learn latent spatio-temporal representations. As a result, the generated video exhibits smooth motion.

TGANs-C
TGANs-C is based on a GAN architecture in which the generator and discriminator are trained against each other in a minimax game, a decision-making framework for choosing the most advantageous next move. It also includes three pivotal loss terms (a toy sketch of how they combine follows the list):
- Video-level matching-aware loss for aligning video with a prompt.
- Frame-level matching-aware loss for boosting the visual realism.
- Temporal coherence loss for enforcing temporal consistency between successive frames.
It was trained on the MSVD, TBMG, and SBMG datasets.
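The sketch below shows one way the three loss terms could be combined; the scoring functions, loss weights, and tensor sizes are illustrative placeholders, not the paper’s exact formulation.

```python
# Illustrative sketch of combining TGANs-C style loss terms: video-level and
# frame-level matching-aware losses (does the sample match the caption?) plus
# a temporal coherence penalty between successive frames.
import torch
import torch.nn.functional as F

def matching_loss(scores, matched):
    # scores: discriminator logits for (sample, caption) pairs; matched: 1 if the
    # caption really describes the sample, 0 for mismatched pairs.
    return F.binary_cross_entropy_with_logits(scores, matched)

def temporal_coherence_loss(frame_feats):
    # frame_feats: (batch, frames, dim); penalize large jumps between neighbours.
    return ((frame_feats[:, 1:] - frame_feats[:, :-1]) ** 2).mean()

# Toy inputs standing in for discriminator outputs on generated videos.
video_scores = torch.randn(4)                # one logit per (video, caption) pair
frame_scores = torch.randn(4, 16)            # one logit per (frame, caption) pair
frame_feats  = torch.randn(4, 16, 128)       # per-frame embeddings
matched      = torch.ones(4)                 # pretend all captions match

total = (matching_loss(video_scores, matched)
         + matching_loss(frame_scores, matched.unsqueeze(1).expand(-1, 16))
         + 0.1 * temporal_coherence_loss(frame_feats))   # 0.1 is an arbitrary weight
print(total.item())
```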
GODIVA
GODIVA is an open-domain text-to-video pretrained model that synthesizes videos with the help of a three-dimensional sparse attention mechanism. Architecturally, it consists of three components: a language embedding, a Vector Quantized Variational Autoencoder (VQ-VAE), and a VQ-VAE decoder. It was trained on the HowTo100M dataset with 1.36 million video samples and is also capable of zero-shot generation.
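For intuition, here is a minimal sketch of the vector-quantization step inside a VQ-VAE: each continuous encoder vector is snapped to its nearest codebook entry, producing the discrete tokens that a model of this kind can feed to its attention stack. Codebook and grid sizes are illustrative.

```python
# Nearest-neighbour codebook lookup: the core of VQ-VAE quantization.
import torch

def quantize(latents, codebook):
    # latents: (N, D) continuous encoder outputs; codebook: (K, D) learned entries.
    dists = torch.cdist(latents, codebook)        # pairwise L2 distances, (N, K)
    tokens = dists.argmin(dim=1)                  # discrete code index per vector
    quantized = codebook[tokens]                  # replace each vector by its code
    return quantized, tokens

codebook = torch.randn(512, 64)                   # 512 codes of dimension 64
latents = torch.randn(16 * 16, 64)                # e.g. a 16x16 grid of frame features
quantized, tokens = quantize(latents, codebook)
print(quantized.shape, tokens.shape)              # torch.Size([256, 64]) torch.Size([256])
```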

NÜWA
NÜWA is a model based on a 3D transformer encoder-decoder architecture coupled with 3DNA, a 3D Nearby Attention mechanism that restricts attention to a local spatio-temporal neighbourhood and thereby decreases the computational complexity of the process.
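The snippet below is a simplified illustration of the “nearby” idea, not NÜWA’s actual implementation: on a 3D (time, height, width) token grid, each query is allowed to attend only to keys inside a small local window, which removes most attention pairs. Window sizes are arbitrary.

```python
# Build a boolean mask for local (nearby) attention over a 3D token grid.
import torch

def nearby_attention_mask(T, H, W, wt=1, wh=2, ww=2):
    # Returns a (T*H*W, T*H*W) boolean mask; True means "may attend".
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    ), dim=-1).reshape(-1, 3)                       # (T*H*W, 3) grid coordinates
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    return (diff[..., 0] <= wt) & (diff[..., 1] <= wh) & (diff[..., 2] <= ww)

mask = nearby_attention_mask(T=4, H=8, W=8)
full = mask.numel()
print(f"kept {mask.sum().item()} of {full} attention pairs "
      f"({100.0 * mask.sum().item() / full:.1f}%)")
```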
Generators Based on Diffusion Models

Imagen Video
Imagen Video is a cascade of seven diffusion sub-models specified in continuous time and built on a video U-Net architecture; together they perform text-conditional video generation, spatial super-resolution, and temporal super-resolution. It also features the v-prediction parameterization, which enables progressive distillation of the diffusion sampling process.
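The v-prediction idea can be stated in a few lines: instead of predicting the noise directly, the network predicts v = alpha_t * noise - sigma_t * x0. The snippet below is only a worked illustration of that target and its inversion, with arbitrary schedule values.

```python
# Worked example of the v-prediction target and how x0 is recovered from it.
import torch

def v_target(x0, noise, alpha_t, sigma_t):
    # x0: clean sample, noise: Gaussian noise, alpha_t/sigma_t: noise-schedule terms.
    return alpha_t * noise - sigma_t * x0

def noisy_sample(x0, noise, alpha_t, sigma_t):
    # Standard forward diffusion: z_t = alpha_t * x0 + sigma_t * noise.
    return alpha_t * x0 + sigma_t * noise

x0 = torch.randn(1, 3, 8, 8)                 # toy clean frame
noise = torch.randn_like(x0)
alpha_t, sigma_t = 0.8, 0.6                  # schedule values with alpha^2 + sigma^2 = 1
z_t = noisy_sample(x0, noise, alpha_t, sigma_t)
v = v_target(x0, noise, alpha_t, sigma_t)
# Given a perfect v prediction, x0 is recovered as alpha_t * z_t - sigma_t * v.
print(torch.allclose(alpha_t * z_t - sigma_t * v, x0, atol=1e-6))
```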
Make-A-Video
Designed by Meta, Make-A-Video includes three main components:
- A foundational Text-to-Image (T2I) model trained on text-image pairs.
- Spatio-temporal convolution and attention layers that extend the network to the temporal dimension.
- A second group of spatio-temporal networks, including frame interpolation, that enables generation at a high frame rate.
Pseudo-3D attention layers are an essential element of the model: they save computational power by, among other things, using the matrix operators Flatten and Unflatten to factorize attention into separate spatial and temporal passes (see the sketch below).
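The following sketch illustrates that factorization idea in general terms, not Meta’s actual code: rather than attending over all time-and-space tokens at once, the layer first attends spatially within each frame, then reshapes (flattens/unflattens) so it can attend temporally across frames at each spatial location.

```python
# Pseudo-3D attention via reshaping: spatial attention per frame, then temporal
# attention per spatial location. Module sizes are illustrative.
import torch
import torch.nn as nn

class Pseudo3DAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (batch, frames, tokens, dim)
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)              # "flatten" time into the batch
        s, _ = self.spatial(s, s, s)            # attention within each frame
        s = s.reshape(b, t, n, d)               # "unflatten" back
        q = s.permute(0, 2, 1, 3).reshape(b * n, t, d)  # flatten space into the batch
        q, _ = self.temporal(q, q, q)           # attention across frames per location
        return q.reshape(b, n, t, d).permute(0, 2, 1, 3)

layer = Pseudo3DAttention()
x = torch.randn(2, 8, 16 * 16, 64)              # 8 frames of 16x16 tokens
print(layer(x).shape)                           # torch.Size([2, 8, 256, 64])
```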
VQGAN for Long Video Generation
The Vector Quantized Generative Adversarial Network (VQGAN) is mainly used for T2I generation. For long video generation, the approach combines an upgraded, time-agnostic model that performs spatio-temporal compression into discrete tokens with a time-sensitive transformer that captures longer temporal dependencies.
SAVE
Spectral-Shift-Aware Video Editing (SAVE) is a novel approach that adapts a pre-trained T2I model to video editing using a single text-video pair. It features stages such as Denoising Diffusion Implicit Model (DDIM) sampling, a regularized spectral shift, temporal modeling, and privacy preservation.

DisCo
DisCo stands for Disentangled Control and is used for human dance generation. It pairs an LDM with ControlNet, a mechanism that conditions diffusion models on extra inputs and helps keep spatial consistency in balance.
Generative Disco
Generative Disco is a tool for producing music visualizers. It is based on Stable Diffusion Videos and wavesurfer.js, an open-source library for designing interactive waveforms. It employs amplitude normalization to estimate musical energy, which then drives the interpolation between successive images (a toy sketch follows).
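The small NumPy sketch below shows one way normalized audio amplitude can pace an interpolation between two keyframe images: louder intervals advance the blend faster. It illustrates the general idea only and is not Generative Disco’s actual code.

```python
# Normalize per-frame audio amplitude and turn it into interpolation weights.
import numpy as np

def energy_per_frame(samples, sample_rate, fps):
    # Mean absolute amplitude in each video-frame-sized audio window.
    hop = int(sample_rate / fps)
    n_frames = len(samples) // hop
    windows = samples[: n_frames * hop].reshape(n_frames, hop)
    energy = np.abs(windows).mean(axis=1)
    return energy / (energy.max() + 1e-8)       # amplitude normalization to [0, 1]

def interpolation_weights(energy):
    # Cumulative energy rescaled to [0, 1]: each frame's position between the
    # start image (0.0) and the end image (1.0).
    cum = np.cumsum(energy)
    return cum / cum[-1]

sr, fps = 44_100, 24
audio = np.random.randn(sr * 2).astype(np.float32)   # 2 seconds of toy audio
w = interpolation_weights(energy_per_frame(audio, sr, fps))
print(len(w), float(w[0]), float(w[-1]))             # 48 frames, weights rise to 1.0
```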

Video-to-Video Generators
Video-to-Video (V2V) generation mainly centers on learning a mapping function from a source video and modelling its temporal dynamics so that the source content is accurately reproduced in the output video at realistic quality. Such generators include few-shot vid2vid, image animation with the First Order Motion Model, motion modulation for human pose transfer, vid2vid-zero, and others.

Detecting AI-Generated Videos
A constellation of methods has been proposed to detect synthesized videos. Among them are a convolutional long short-term memory (LSTM) architecture for analyzing frame sequences, visual anomaly detection based on CNN models, photo response non-uniformity (PRNU) analysis, frame-based detection of generated street scenery with XceptionNet, and others. A minimal sketch of the CNN-plus-LSTM idea is given below.
However, since fully synthesized videos differ considerably from mere facial deepfakes, it remains an open question which tools will be most effective once T2V generation reaches an ultra-realistic quality level.
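The sketch below is an illustrative detector of this kind, not any published model: a small CNN embeds each frame, an LSTM aggregates the sequence, and a linear head scores the clip as real or synthetic.

```python
# Minimal convolutional + LSTM detector for frame sequences. Sizes are illustrative.
import torch
import torch.nn as nn

class CnnLstmDetector(nn.Module):
    def __init__(self, channels=3, feat=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                    # per-frame feature extractor
            nn.Conv2d(channels, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat),
        )
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)             # logit: synthetic vs. real

    def forward(self, video):                        # video: (batch, frames, C, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))        # (b*t, feat)
        feats = feats.reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)               # last hidden state summarizes the clip
        return self.head(h_n[-1])                    # (batch, 1)

detector = CnnLstmDetector()
clip = torch.rand(2, 16, 3, 64, 64)                  # two 16-frame clips
print(torch.sigmoid(detector(clip)).shape)           # torch.Size([2, 1]) probabilities
```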

References
- Meta announces Make-A-Video, which generates video from text
- SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing
- Generating Videos with Scene Dynamics
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
- GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
- MagicVideo: Efficient Video Generation With Latent Diffusion Models
- Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer
- Detection of GAN-synthesized street videos
- Text-to-Video: The Task, Challenges and the Current State
- A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?
- Generative Adversarial Networks
- Video Generative Adversarial Networks: A Review
- Auto-Encoding Variational Bayes
- To Create What You Tell: Generating Videos from Captions
- Mini-Max Algorithm in Artificial Intelligence
- MSVD (Microsoft Research Video Description Corpus)
- VQ-VAE
- What is HowTo100M?
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
- Imagen Video: High Definition Video Generation with Diffusion Models
- U-Net Based Multi-instance Video Object Segmentation
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
- Denoising Diffusion Implicit Models (DDIM) Sampling
- DisCo: Disentangled Control for Referring Human Dance Generation in Real World
- Generative Disco: Text-to-Video Generation for Music Visualization
- Wavesurfer.js is an open-source audio visualization library for creating interactive, customizable waveforms
- Video-to-Video Synthesis
- Few-shot Video-to-Video Synthesis
- First Order Motion Model for Image Animation
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
- A Survey on Deepfake Video Detection
- Photo response non-uniformity