
Generative AI in Video: Synthesis and Detection

Current GenAI models enable highly realistic text-to-video synthesis, a development that anti-spoofing and artificial intelligence experts need to take into account

A “Bear painting a bear” T2V sample generated by Make-a-Video

The introduction of Generative AI led to prompt-based media generators that can draw, compose music, or write sophisticated texts. Text-to-Video (T2V) is one instance of GenAI, with the first documented video generator being Sync-DRAW (2016). A new wave of AI video tools inspired by Text-to-Image (T2I) synthesizers has appeared recently: for example, Make-A-Video by Meta was announced a year after T2I models became widespread.

Example of SAVE video editing based on textual prompts

However, generating a video clip from a written prompt is much more challenging than generating a static image. Both diffusion and non-diffusion T2V models require tremendous computational power and memory, as well as extensive high-quality datasets, which creates serious deployment and scaling issues.

Even though T2V is still in its infancy, it is already considered a tangible threat: anyone will be able to create a completely fabricated video without the specific skills or knowledge that deepfake production typically requires.

A T2V framework utilized in SAVE

Main Methods of Generating Video

At the moment, several main methods are employed in T2V creation:

Unguided generation

A rather rudimentary approach, regarded as the earliest method of video generation. It turns a single-frame image into a repetitive scene, such as moving ocean waves, described as a “spatially repetitive pattern with time-varying visualization”. It is dubbed unguided because no textual prompt provides instructions.

Generative Adversarial Networks

Generative Adversarial Networks (GANs), first explored in 2014, have been applied to create highly realistic images. Essentially, a GAN consists of two main elements:

  • G — the generator model, tasked with transforming input noise into synthesized samples.
  • D — the discriminator, which classifies the received samples as real or fake.

Video-GANs have a specific architecture based on Recurrent Neural Networks (RNNs) and 2D convolutional networks. This is necessary because video data is more complex, adding a temporal dimension. Alternative approaches use 3D convolutional networks, two-stream video generation, and the Coarse-to-Fine technique based on a progressive architecture.
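The interplay of G and D can be illustrated with a toy example: a one-parameter logistic discriminator and an affine generator trained with hand-derived gradients on 1D Gaussian data. All names, hyperparameters, and the setup itself are illustrative assumptions, not taken from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Discriminator D(x) = sigmoid(w*x + b); generator G(z) = m + s*z
w, b = 0.1, 0.0          # discriminator parameters
m, s = 0.0, 1.0          # generator parameters (mean and scale)
lr_d, lr_g, batch = 0.1, 0.05, 128

for _ in range(3000):
    real = rng.normal(4.0, 1.0, batch)   # target data distribution: N(4, 1)
    z = rng.normal(0.0, 1.0, batch)
    fake = m + s * z

    # discriminator step: maximize log D(real) + log(1 - D(fake))
    p_real, p_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    grad_w = np.mean((p_real - 1) * real) + np.mean(p_fake * fake)
    grad_b = np.mean(p_real - 1) + np.mean(p_fake)
    w -= lr_d * grad_w
    b -= lr_d * grad_b

    # generator step: non-saturating loss, minimize -log D(fake)
    p_fake = sigmoid(w * (m + s * z) + b)
    m -= lr_g * np.mean((p_fake - 1) * w)
    s -= lr_g * np.mean((p_fake - 1) * w * z)

# the generator mean m drifts toward the data mean of 4 as training proceeds
```

The same adversarial dynamic, scaled up to convolutional networks over frame stacks, is what video-GANs exploit.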

The first GAN architecture used for video generation

Variational Autoencoders

In simple terms, autoencoders strip input data of unnecessary noise through a process of compression and reconstruction, which allows the model to concentrate on the most valuable features. This approach can benefit video generation, as it helps economize computational resources.

Auto-Encoding Variational Bayes is an efficient concept: it is built around a Stochastic Gradient Variational Bayes (SGVB) estimator that allows it to work with continuous latent variables and huge datasets. Another domain where video autoencoding is applicable is learning camera poses and disentangled representations.
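Two pieces make the SGVB estimator practical: the reparameterization trick, which keeps sampling differentiable, and a closed-form KL term for a diagonal Gaussian encoder. A minimal NumPy sketch of both (an illustration of the idea, not the paper's code):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # closed-form KL( q(z|x) || N(0, I) ) for a diagonal Gaussian encoder
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def reparameterize(mu, log_var, rng):
    # SGVB reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    # so the sample stays differentiable w.r.t. mu and log_var
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

The training objective (the ELBO) then combines a reconstruction term over decoded samples with this KL penalty.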

Video autoencoder used for camera pose prediction

Diffusion Models

Latent Diffusion Models (LDMs) are known for their capability to generate high-quality images or audio without demanding too much computational power, which is possible thanks to a compressed, lower-dimensional latent space.

For video synthesis, it is suggested to use a pre-trained image-generating model enhanced with additional neural network layers responsible for consistent and temporally accurate frame alignment. (These layers should also be linked to the spatial layers.) Put together, these elements provide a video-aware temporal foundation, or “backbone”, as the authors call it.
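The reshaping pattern behind this "inflation" of an image model can be sketched as follows: frames are folded into the batch for the frozen spatial layer, then a new temporal layer mixes activations across frames. The concrete operations below (tanh as a stand-in spatial op, a moving average as the temporal mixer) are placeholder assumptions; only the reshaping pattern is the point:

```python
import numpy as np

def spatial_layer(x):
    # stand-in for a frozen per-frame (image) layer; x: (batch*frames, h, w, c)
    return np.tanh(x)

def temporal_layer(x, frames):
    # new layer: mix activations across the time axis with a moving average
    b = x.shape[0] // frames
    v = x.reshape(b, frames, *x.shape[1:])
    mixed = 0.5 * v + 0.25 * np.roll(v, 1, axis=1) + 0.25 * np.roll(v, -1, axis=1)
    return mixed.reshape(x.shape)

def video_block(video):
    # video: (batch, frames, h, w, c)
    b, f = video.shape[:2]
    x = video.reshape(b * f, *video.shape[2:])
    x = spatial_layer(x)        # pre-trained spatial pass, frame by frame
    x = temporal_layer(x, f)    # added temporal pass aligning the frames
    return x.reshape(video.shape)
```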

A pre-trained Latent Diffusion Model enhanced with temporal layers

The Early Video Generators

Before the emergence of up-to-date generative models, there was a group of T2V solutions based on other methods.


Sync-DRAW

Sync-DRAW, presented in 2016, was the first and simplest T2V solution. It is based on the concept of learning data distribution with Recurrent Variational Auto-Encoders (R-VAE). This approach allows frames of a video to be temporally synchronized, while also enabling a user to type in the prompts they want.

It also employs a recurrent visual attention mechanism for every frame in a video clip, which allows it to explore and learn latent spatio-temporal representations. As a result, the generated video appears smooth in terms of motion.

Sync-DRAW’s architecture


TGANs-C

TGANs-C is based on a GAN architecture in which the generator and discriminator are fine-tuned with a minimax game mechanism, a decision-making algorithm for choosing the most appropriate next move. It also includes three pivotal elements:

  • Video-level matching-aware loss for aligning video with a prompt.
  • Frame-level matching-aware loss for boosting the visual realism.
  • Temporal coherence loss for utilizing temporal consistency between successive frames.

It was trained on the MSVD, TBMG, and SBMG datasets.
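Of the three losses, the temporal coherence loss is the easiest to illustrate: it penalizes large changes between feature representations of successive frames. A minimal sketch (the feature extraction itself, and how this term is weighted against the two matching-aware losses, are assumed to happen elsewhere):

```python
import numpy as np

def temporal_coherence_loss(frame_feats):
    # frame_feats: (frames, dim) features of consecutive frames;
    # mean squared difference between neighbors rewards smooth motion
    diffs = frame_feats[1:] - frame_feats[:-1]
    return float(np.mean(diffs**2))
```

Identical consecutive frames yield a loss of zero; abrupt jumps between frames are penalized quadratically.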


GODIVA

GODIVA is an open-domain text-to-video pretrained model that synthesizes videos with the help of a three-dimensional sparse attention mechanism. Architecturally, it consists of three components: a Language Embedding, a Vector Quantization Variational Autoencoder (VQ-VAE), and a VQ-VAE Decoder. It was trained on the HowTo100M dataset with 1.36 million video samples and is also capable of zero-shot generation.

GODIVA’s framework
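The VQ-VAE component rests on vector quantization: each continuous latent is snapped to its nearest codebook entry, and the resulting indices become the discrete video tokens the model operates on. A minimal nearest-neighbor sketch of that lookup:

```python
import numpy as np

def vector_quantize(latents, codebook):
    # latents: (n, d) continuous vectors; codebook: (k, d) learned codes.
    # Each latent is replaced by its nearest codebook entry (L2 distance).
    d2 = ((latents[:, None, :] - codebook[None, :, :])**2).sum(-1)
    idx = d2.argmin(1)             # discrete token ids
    return idx, codebook[idx]      # ids and the quantized vectors
```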


NÜWA

NÜWA is a model based on a 3D transformer encoder-decoder architecture coupled with 3DNA, a 3D Nearby Attention mechanism that supports local sparse attention and also decreases the computational complexity of the process.
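The "nearby" idea can be illustrated with an attention mask over a flattened (T, H, W) grid: each position may attend only to positions within a fixed window along every axis, which is what cuts the quadratic cost of full attention. A brute-force sketch for tiny grids (real implementations never materialize the full mask):

```python
import numpy as np
from itertools import product

def nearby_attention_mask(shape, window):
    # shape: (T, H, W) grid of tokens; a pair of positions may attend to
    # each other only if they differ by <= window along every axis
    coords = list(product(*[range(s) for s in shape]))
    n = len(coords)
    mask = np.zeros((n, n), dtype=bool)
    for i, a in enumerate(coords):
        for j, c in enumerate(coords):
            mask[i, j] = all(abs(x - y) <= window for x, y in zip(a, c))
    return mask
```

With window 0 each token sees only itself; a large window recovers full attention.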

Generators Based on Diffusion Models

MagicVideo’s pipeline based on a diffusion model

Imagen Video

Imagen Video includes seven diffusion sub-models specified in continuous time and enhanced with a Video U-Net; together they deliver three-dimensional output as well as text-conditional video generation, spatial super-resolution, and temporal super-resolution. It also features v-prediction parameterization, which enables progressive distillation during the diffusion process.
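The v-prediction parameterization can be written out directly. With a variance-preserving noise schedule (alpha² + sigma² = 1), the network predicts v = alpha·eps − sigma·x0 instead of the noise, and the clean sample remains recoverable from the noisy one, which is what makes progressive distillation convenient. A small sketch:

```python
import numpy as np

def v_prediction_target(x0, eps, alpha, sigma):
    # v-parameterization: the network's regression target
    return alpha * eps - sigma * x0

def x0_from_v(z, v, alpha, sigma):
    # recover the clean sample from a noisy one z = alpha*x0 + sigma*eps,
    # assuming a variance-preserving schedule (alpha**2 + sigma**2 == 1)
    return alpha * z - sigma * v
```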


Make-A-Video

Designed by Meta, Make-A-Video includes three main components:

  • A foundational Text-to-Image model trained on text-image pairs.
  • Attention and spatio-temporal layers responsible for the temporal dimension.
  • A second group of spatio-temporal layers that provide frame generation at a high frame rate.

Pseudo-3D attention layers are an essential element of the model: they save computational power by using, among other things, the matrix operators Flatten and Unflatten.
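The Flatten/Unflatten trick is purely a reshaping pattern: fold frames into the batch axis for the spatial pass, then fold pixels into the batch axis for the temporal pass. A sketch with placeholder callables standing in for the actual attention operators (which are not shown here):

```python
import numpy as np

def pseudo3d_pass(video, spatial_fn, temporal_fn):
    # video: (b, f, hw, c). Flatten frames into the batch for the spatial
    # pass, then Unflatten and refold pixels into the batch for the
    # temporal pass; spatial_fn/temporal_fn are placeholder operators.
    b, f, hw, c = video.shape
    x = video.reshape(b * f, hw, c)                   # frames -> batch
    x = spatial_fn(x)                                 # per-frame (spatial) op
    x = x.reshape(b, f, hw, c).transpose(0, 2, 1, 3)  # unflatten, swap axes
    x = temporal_fn(x.reshape(b * hw, f, c))          # pixels -> batch, temporal op
    return x.reshape(b, hw, f, c).transpose(0, 2, 1, 3)
```

Each 1D attention pass costs far less than full 3D attention over all frames and pixels at once.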

VQGAN for Long Video Generation

Vector Quantized Generative Adversarial Networks (VQGANs) are mainly used for T2I generation. For long video generation, an upgraded time-agnostic model for spatio-temporal compression, which generates discrete tokens, is combined with a time-sensitive transformer that captures longer temporal dependence.


SAVE

Spectral-Shift-Aware Video Editing (SAVE) is a novel architecture that merges a T2I model and a single T2V pair. It features such stages as Denoising Diffusion Implicit Model (DDIM) sampling, regularized spectral shift, temporal modeling, and privacy preservation.

Zero-shot generation framework utilized in SAVE


DisCo

DisCo stands for Disentangled Control and is used for human dance generation. It pairs an LDM with ControlNet, a solution that controls diffusion models with extra conditions and keeps spatial consistency in balance.

Generative Disco

Generative Disco is a tool for producing music visualizers. It is based on Stable Diffusion Videos and wavesurfer.js, an open-source library for designing interactive waveforms. It employs amplitude normalization to estimate musical energy and establish interpolation among successive images.
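The amplitude-to-interpolation step can be sketched as follows: per-interval audio amplitudes are normalized, and their cumulative energy is used as interpolation positions between a pair of images, so louder passages advance the visual transition faster. This is a hypothetical reconstruction of the idea, not Generative Disco's actual code:

```python
import numpy as np

def interpolation_weights(amplitudes):
    # normalize per-interval amplitudes to [0, 1], then use cumulative
    # energy as monotone interpolation positions between two keyframes
    a = np.asarray(amplitudes, dtype=float)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    t = np.cumsum(a)
    return t / (t[-1] + 1e-12)
```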

Deformable Motion Modulation (DMM) framework for repeating human poses

Video-to-Video Generators

Video-to-Video (V2V) generation is mainly centered on learning a mapping function from a source video and modelling its temporal dynamics, so that its content is accurately reproduced in the output video with realistic quality. These generators include Few-shot vid2vid, First Order Motion Model for image animation, DMM for human pose transfer, vid2vid-zero, and others.

Vid2Vid architecture

Detecting AI-Generated Videos

A constellation of methods has been proposed to detect synthesized videos. Among them are a convolutional long short-term memory (LSTM) structure for analyzing frame sequences, visual anomaly detection based on CNN models, photo response non-uniformity (PRNU) analysis, frame-based detection of generated street scenery with XceptionNet, and others.
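PRNU analysis rests on the sensor noise residual: real camera footage carries a consistent per-pixel noise fingerprint that fully synthesized frames lack, so correlating residuals against a known fingerprint separates the two. A crude sketch, using a box blur as the denoiser where real pipelines use wavelet denoising:

```python
import numpy as np

def noise_residual(frame):
    # residual = frame - denoised(frame); the residual carries the
    # sensor noise pattern (here a 3x3 box blur stands in for a denoiser)
    padded = np.pad(frame, 1, mode="edge")
    blur = sum(padded[i:i + frame.shape[0], j:j + frame.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return frame - blur

def prnu_correlation(residual, fingerprint):
    # normalized correlation between a residual and a camera fingerprint;
    # near-zero values suggest the frame did not come from that sensor
    r = residual - residual.mean()
    f = fingerprint - fingerprint.mean()
    return float((r * f).sum() / (np.linalg.norm(r) * np.linalg.norm(f) + 1e-12))
```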

However, since fully synthesized videos are quite different from mere facial deepfakes, it remains an open question which tools will be most effective once T2V generation reaches an ultra-realistic quality level.

A detection pipeline featuring XceptionNet


References

  1. Meta announces Make-A-Video, which generates video from text
  2. SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing
  3. Generating Videos with Scene Dynamics
  4. Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
  5. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
  6. Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
  7. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
  8. MagicVideo: Efficient Video Generation With Latent Diffusion Models
  9. Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer
  10. Detection of GAN-synthesized street videos
  11. Text-to-Video: The Task, Challenges and the Current State
  12. A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?
  13. Generative Adversarial Networks
  14. Video Generative Adversarial Networks: A Review
  15. Auto-Encoding Variational Bayes
  16. To Create What You Tell: Generating Videos from Captions
  17. Mini-Max Algorithm in Artificial Intelligence
  18. MSVD (Microsoft Research Video Description Corpus)
  19. VQ-VAE
  20. What is HowTo100M?
  21. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
  22. Imagen Video: High Definition Video Generation with Diffusion Models
  23. U-Net Based Multi-instance Video Object Segmentation
  24. Make-A-Video: Text-to-Video Generation without Text-Video Data
  25. Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
  26. Denoising Diffusion Implicit Models (DDIM) Sampling
  27. DisCo: Disentangled Control for Referring Human Dance Generation in Real World
  28. Generative Disco: Text-to-Video Generation for Music Visualization
  29. Wavesurfer.js is an open-source audio visualization library for creating interactive, customizable waveforms
  30. Video-to-Video Synthesis
  31. Few-shot Video-to-Video Synthesis
  32. First Order Motion Model for Image Animation
  33. Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
  34. A Survey on Deepfake Video Detection
  35. Photo response non-uniformity


