The introduction of Generative AI led to prompt-based media generators that can draw, compose music, and write sophisticated texts. Text-to-Video (T2V) is an instance of GenAI, with the first documented video generator being Sync-DRAW (2016). A new wave of AI video tools inspired by Text-to-Image (TTI) synthesizers has emerged recently: Make-A-Video by Meta, for example, was announced a year after TTI models became widespread.
However, generating a video clip from a written prompt is much more challenging than generating a static image. Both diffusion and non-diffusion T2V models require a tremendous amount of computational power and memory, as well as extensive high-quality datasets, which raises serious deployment and scaling issues.
Even though T2V is still in its infancy, it is already considered a tangible threat: anyone will be able to create a completely fabricated video without the specific skills or knowledge that deepfake production typically requires.
Main Methods of Generating Video
At the moment, a handful of main methods are employed in T2V creation.

Unguided Video Generation

A rather rudimentary approach, regarded as the earliest method of video generation. It turns a single-frame image into a repetitive scene, such as moving ocean waves. The technique is described as a "spatially repetitive pattern with time-varying visualization" and is dubbed unguided because no textual prompt provides instructions.
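A toy sketch of this idea (illustrative only, not any specific published system): a single frame containing a spatially repetitive pattern is animated by cyclically shifting it over time, producing a seamless loop.

```python
import numpy as np

def unguided_loop(frame: np.ndarray, n_frames: int = 8) -> np.ndarray:
    """Animate a single frame by cyclically shifting it horizontally,
    a toy stand-in for a 'spatially repetitive pattern with
    time-varying visualization' (e.g., drifting ocean waves)."""
    h, w = frame.shape
    clip = np.stack([np.roll(frame, shift=t * w // n_frames, axis=1)
                     for t in range(n_frames)])
    return clip  # shape (n_frames, h, w)

# A repetitive spatial pattern: vertical sine stripes.
x = np.linspace(0, 4 * np.pi, 64)
frame = np.tile(np.sin(x), (32, 1))
clip = unguided_loop(frame)   # 8-frame looping "wave" clip
```

Because the pattern is spatially periodic, the shifted copies tile seamlessly and the clip loops back to its first frame.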
Generative Adversarial Networks
Generative Adversarial Networks (GANs), first explored in 2014, have been applied to create highly realistic images. Essentially, a GAN consists of two main elements:
- G — generator model that is tasked to transform the input noise into synthesized samples.
- D — discriminator that classifies the received samples as true or fake.
Video GANs have a specific architecture that pairs a Recurrent Neural Network (RNN) with 2D convolutional networks, which is necessary because video data has an additional temporal dimension. Alternative approaches use 3D convolutional networks, two-stream video generation, and a coarse-to-fine technique based on a progressive architecture.
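The adversarial game between G and D can be sketched with the standard binary cross-entropy objectives (a minimal NumPy illustration of the loss terms only; real video GANs use the architectures described above):

```python
import numpy as np

def bce_gan_losses(d_real: np.ndarray, d_fake: np.ndarray):
    """Standard (non-saturating) GAN losses for the minimax game.
    d_real / d_fake are discriminator probabilities in (0, 1) for
    real samples and for generator outputs G(z), respectively."""
    eps = 1e-8
    # D tries to push d_real -> 1 and d_fake -> 0.
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # Non-saturating G objective: push d_fake -> 1.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# A well-trained D scores real samples high and fakes low,
# so its own loss is low while the generator's loss is high.
d_loss, g_loss = bce_gan_losses(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
```

Training alternates between minimizing `d_loss` with respect to D's parameters and `g_loss` with respect to G's, which is what makes the setup adversarial.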
Autoencoders

In simple terms, autoencoders strip the input data of unnecessary noise through a process of compressing and reconstructing it, which lets the model concentrate on the most valuable features. This approach can be beneficial for video generation, as it helps economize computational resources.
Auto-Encoding Variational Bayes is an efficient concept: it is enhanced with a Stochastic Gradient Variational Bayes (SGVB) estimator that allows it to work with continuous latent variables and huge datasets. Video autoencoding is also applicable to learning camera poses and disentangled representations.
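The key trick behind the SGVB estimator is reparameterization: sampling the latent variable as a differentiable function of the encoder's outputs plus external noise, so gradients can flow through the sampling step. A minimal NumPy sketch (the trick and the closed-form KL term only, not a full VAE):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu: np.ndarray, log_var: np.ndarray, rng) -> np.ndarray:
    """SGVB reparameterization: z ~ N(mu, sigma^2) expressed as
    z = mu + sigma * eps with external noise eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Closed-form KL(q(z|x) || N(0, I)) term of the VAE objective."""
    return float(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var)))

mu = np.zeros(4)
log_var = np.zeros(4)                  # sigma = 1 for every dimension
z = reparameterize(mu, log_var, rng)   # a differentiable sample
kl = kl_to_standard_normal(mu, log_var)  # 0 when q is already N(0, I)
```

In a real model, `mu` and `log_var` come from the encoder network; the point is that the randomness is isolated in `eps`, so the sample is a deterministic function of the parameters.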
Latent Diffusion Models

Latent Diffusion Models (LDMs) are known for their ability to generate high-quality images or audio without demanding too much computational power. This is possible because they operate in a compressed, lower-dimensional latent space.
For video synthesis, the suggested approach is to take a pre-trained image-generating model and enhance it with additional neural network layers responsible for consistent, temporally accurate frame alignment (these layers are also linked to the spatial ones). Put together, these elements form a video-aware temporal "backbone," as the authors call it.
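A back-of-the-envelope illustration of why the latent space helps (the 8x spatial downsampling factor and 4 latent channels below are typical but assumed values, not figures from the paper):

```python
# Diffusion in pixel space vs. a compressed latent space.
pixel_elems = 512 * 512 * 3                     # one 512x512 RGB frame
latent_elems = (512 // 8) * (512 // 8) * 4      # same frame after an 8x encoder
compression = pixel_elems / latent_elems        # elements saved per frame

# For video, the temporal layers operate on the latent tensor of a whole clip:
frames = 16
video_latent_elems = frames * latent_elems      # vs. frames * pixel_elems
```

Every denoising step touches each element many times, so a ~48x smaller tensor translates directly into less compute and memory per step.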
The Early Video Generators
Before the emergence of today's generative models, a group of T2V solutions relied on other methods.
Sync-DRAW, presented in 2016, was the first and simplest T2V solution. It is based on the concept of learning the data distribution with Recurrent Variational Auto-Encoders (R-VAE). This approach temporally synchronizes the frames of a video while also letting the user type in a prompt.
It also employs a recurrent visual attention mechanism for every frame in a clip, which allows it to explore and learn latent spatio-temporal representations. As a result, the generated video exhibits smooth motion.
TGANs-C is based on a GAN architecture in which the generator and discriminator are trained with a minimax game mechanism, a decision-making algorithm for choosing the most advantageous next move. It also includes three pivotal elements:
- Video-level matching-aware loss for aligning video with a prompt.
- Frame-level matching-aware loss for boosting the visual realism.
- Temporal coherence loss for utilizing temporal consistency between successive frames.
It was trained on the MSVD, TBMG, and SBMG datasets.
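As a hedged sketch, the three criteria above can be thought of as terms of one combined training objective; the equal weighting below is illustrative, not the paper's exact formulation:

```python
import numpy as np

def tgans_c_total_loss(video_match: float, frame_match: float,
                       temporal_coherence: float,
                       weights=(1.0, 1.0, 1.0)) -> float:
    """Illustrative combination of TGANs-C's three criteria:
      - video_match: video-level matching-aware loss (video <-> prompt)
      - frame_match: frame-level matching-aware loss (visual realism)
      - temporal_coherence: penalty on inconsistency between frames"""
    w1, w2, w3 = weights
    return w1 * video_match + w2 * frame_match + w3 * temporal_coherence

def temporal_coherence_loss(frames: np.ndarray) -> float:
    """One simple coherence measure: mean squared difference
    between successive frames."""
    return float(np.mean((frames[1:] - frames[:-1]) ** 2))

static = np.zeros((4, 8, 8))             # four identical frames
tc = temporal_coherence_loss(static)     # perfectly coherent clip
total = tgans_c_total_loss(0.5, 0.3, tc)
```

The temporal term is what distinguishes video GANs from image GANs: it explicitly rewards consistency between neighboring frames rather than judging each frame in isolation.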
GODIVA is an open-domain text-to-video pretrained model that synthesizes videos with the help of a three-dimensional sparse attention mechanism. Architecturally, it consists of three components: a Language Embedding, a Vector Quantized Variational Autoencoder (VQ-VAE), and a VQ-VAE Decoder. It was trained on the HowTo100M dataset with 1.36 million video samples and is also capable of zero-shot generation.
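The VQ-VAE's core quantization step, which turns continuous encoder outputs into the discrete tokens that a model like GODIVA predicts, can be sketched as a nearest-codebook lookup (minimal NumPy illustration):

```python
import numpy as np

def vq_quantize(z: np.ndarray, codebook: np.ndarray):
    """VQ-VAE quantization: snap each continuous encoder output vector
    to its nearest codebook entry, yielding discrete token indices.
    z: (N, D) encoder outputs; codebook: (K, D) learned entries."""
    # Pairwise squared distances between outputs and codebook: (N, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)            # discrete tokens
    return idx, codebook[idx]         # tokens and their quantized vectors

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
z = np.array([[0.1, -0.1], [0.9, 1.2]])          # toy encoder outputs
tokens, zq = vq_quantize(z, codebook)
```

The decoder then reconstructs frames from `zq`, while the sequence model works purely on the integer `tokens`, which is what makes sparse-attention sequence modeling over video tractable.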
NÜWA is a model based on a 3D transformer encoder-decoder architecture coupled with 3DNA, a 3D Nearby Attention mechanism that supports local-wise sparse attention and decreases the computational complexity of the process.
Generators Based on Diffusion Models
Imagen Video includes 7 diffusion sub-models specified in continuous time and enhanced with a Video U-Net; together they produce three-dimensional output and provide text-conditional video generation, spatial super-resolution, and temporal super-resolution. It also features v-prediction parameterization, which enables progressive distillation during the diffusion process.
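The v-prediction parameterization can be illustrated directly from its definition: for a noisy sample x_t = a*x0 + s*eps, the network's target is v = a*eps - s*x0, and the clean sample is recoverable as x0 = a*x_t - s*v (a minimal NumPy check of that identity):

```python
import numpy as np

def v_from_x_and_eps(x0, eps, a, s):
    """v-prediction target: v = a * eps - s * x0, where the noisy
    sample is x_t = a * x0 + s * eps (continuous-time diffusion)."""
    return a * eps - s * x0

def x0_from_v(x_t, v, a, s):
    """Recover the clean sample from (x_t, v): x0 = a * x_t - s * v."""
    return a * x_t - s * v

rng = np.random.default_rng(1)
x0 = rng.standard_normal(8)      # "clean" toy sample
eps = rng.standard_normal(8)     # injected Gaussian noise
a, s = 0.8, 0.6                  # satisfy a**2 + s**2 = 1
x_t = a * x0 + s * eps           # noisy sample at this noise level
v = v_from_x_and_eps(x0, eps, a, s)
x0_rec = x0_from_v(x_t, v, a, s)  # exactly recovers x0
```

Because x0 is exactly recoverable from a single v prediction at any noise level, a distilled model can take much larger denoising steps, which is what progressive distillation exploits.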
Designed by Meta, Make-A-Video includes three main components:
- Foundational Text-to-Image model trained on text-image pairs.
- Attention and spatio-temporal layers responsible for the temporal dimension.
- A second group of spatio-temporal layers that generate frames at a high rate.

Pseudo-3D attention layers are an essential element of the model: they save computational power by using, among other things, the matrix operators Flatten and Unflatten.
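The pseudo-3D factorization can be sketched with plain reshapes: flatten time into the batch axis so a 2D (spatial) operator sees individual frames, then flatten space into the batch axis so a 1D (temporal) operator sees per-pixel time series. Identity functions stand in for the attention layers in this illustrative NumPy sketch:

```python
import numpy as np

def pseudo_3d_pass(video, spatial_fn, temporal_fn):
    """Pseudo-3D factorization: instead of full 3D attention over
    (time, height, width), apply a spatial operator per frame and a
    temporal operator per pixel, using 'flatten'/'unflatten' reshapes
    to expose the right axis. video: (b, t, h, w, c)."""
    b, t, h, w, c = video.shape
    # Flatten time into batch: spatial op sees (b*t, h*w, c).
    x = video.reshape(b * t, h * w, c)
    x = spatial_fn(x)
    # Unflatten, then flatten space into batch: temporal op sees (b*h*w, t, c).
    x = x.reshape(b, t, h, w, c).transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
    x = temporal_fn(x)
    # Restore the original (b, t, h, w, c) layout.
    return x.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

video = np.arange(2 * 3 * 4 * 4 * 1, dtype=float).reshape(2, 3, 4, 4, 1)
out = pseudo_3d_pass(video, lambda x: x, lambda x: x)  # identity stand-ins
```

The saving comes from attention cost: two attentions over sequences of length h*w and t are far cheaper than one attention over length t*h*w.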
VQGAN for Long Video Generation
Vector Quantized Generative Adversarial Networks (VQGANs) are mainly used for T2I generation. For long video generation, one approach combines an upgraded time-agnostic model for spatio-temporal compression, which generates discrete tokens, with a time-sensitive transformer that captures longer temporal dependencies.
Spectral-Shift-Aware Video Editing (SAVE) is a novel architecture that merges a T2I model with a single text-video pair. Its stages include Denoising Diffusion Implicit Model (DDIM) sampling, a regularized spectral shift, temporal modeling, and privacy preservation.
DisCo stands for Disentangled Control and is used for dance generation. It pairs an LDM with ControlNet, a solution that steers diffusion models with extra conditions and keeps spatial consistency in balance.
Generative Disco is a tool for producing music visualizers. It is based on Stable Diffusion Videos and wavesurfer.js, an open-source library for designing interactive waveforms. It employs amplitude normalization to estimate musical energy and uses that estimate to drive interpolation between successive images.
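A minimal sketch of the amplitude-to-interpolation idea (the function names and the linear pixel blend are illustrative assumptions; the actual tool interpolates in a diffusion model's latent space):

```python
import numpy as np

def amplitude_to_weights(samples: np.ndarray, n_frames: int) -> np.ndarray:
    """Estimate per-frame musical energy from the audio amplitude
    envelope and normalize it to [0, 1] for use as interpolation weights."""
    chunks = np.array_split(np.abs(samples), n_frames)
    energy = np.array([c.mean() for c in chunks])
    lo, hi = energy.min(), energy.max()
    return (energy - lo) / (hi - lo + 1e-8)

def interpolate(img_a: np.ndarray, img_b: np.ndarray, w: float) -> np.ndarray:
    """Linear blend between a start and an end image (toy stand-in
    for interpolation in a generator's latent space)."""
    return (1 - w) * img_a + w * img_b

# A swelling tone: amplitude grows over time, so later frames
# should sit closer to the end image.
audio = np.sin(np.linspace(0, 20, 4000)) * np.linspace(0, 1, 4000)
weights = amplitude_to_weights(audio, n_frames=8)
frames = [interpolate(np.zeros((4, 4)), np.ones((4, 4)), w) for w in weights]
```

Loud passages thus move the visuals quickly between keyframe images while quiet passages hold them nearly still, which is what makes the output feel synchronized to the music.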
Video-to-Video (V2V) generation is mainly centered on learning a mapping function from a source video and modelling its temporal dynamics, so that the content is faithfully reproduced in the output video at realistic quality. These generators include Few-shot vid2vid, First Order Motion Model for Image Animation, motion modulation for human pose transfer, vid2vid-zero, and others.
Detecting AI-Generated Videos
A constellation of methods has been proposed to detect synthesized videos. Among them are a convolutional long short-term memory (LSTM) structure for analyzing frame sequences, visual anomaly detection based on CNN models, photo response non-uniformity (PRNU) analysis, frame-based detection of generated street scenery with XceptionNet, and others.
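The PRNU idea rests on the fact that a camera's sensor leaves a fixed noise fingerprint in every frame it captures, so correlating a frame's noise residual with a known fingerprint separates camera footage from synthesized frames, which carry no such pattern. A toy NumPy illustration (the crude box-blur denoiser is an assumption for brevity; real pipelines use wavelet denoisers):

```python
import numpy as np

def noise_residual(frame: np.ndarray) -> np.ndarray:
    """Crude denoiser: 3x3 box blur via wrap-around shifts;
    the residual is frame minus its blurred version."""
    acc = np.zeros_like(frame)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(np.roll(frame, dy, 0), dx, 1)
    return frame - acc / 9.0

def prnu_correlation(frame: np.ndarray, fingerprint: np.ndarray) -> float:
    """Normalized correlation between a frame's noise residual and a
    reference sensor fingerprint."""
    r = noise_residual(frame).ravel()
    f = fingerprint.ravel()
    r, f = r - r.mean(), f - f.mean()
    return float(r @ f / (np.linalg.norm(r) * np.linalg.norm(f) + 1e-12))

rng = np.random.default_rng(2)
y, x = np.mgrid[0:32, 0:32]
scene = np.cos(2 * np.pi * y / 32) + np.cos(2 * np.pi * x / 32)  # smooth content
fingerprint = rng.standard_normal((32, 32)) * 0.1    # toy sensor pattern
real_frame = scene + fingerprint       # camera frame carries the pattern
fake_frame = np.roll(scene, 7, axis=1)  # synthesized frame: no fingerprint
c_real = prnu_correlation(real_frame, fingerprint)
c_fake = prnu_correlation(fake_frame, fingerprint)
```

In this sketch `c_real` is high because the residual of the real frame retains the fingerprint, while `c_fake` hovers near zero; a threshold on the correlation then serves as the detector.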
However, since fully synthesized videos differ substantially from mere facial deepfakes, it remains an open question which tools will be most effective once T2V generation reaches an ultra-realistic quality level.
References

- Meta announces Make-A-Video, which generates video from text
- SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing
- Generating Videos with Scene Dynamics
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
- GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
- MagicVideo: Efficient Video Generation With Latent Diffusion Models
- Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer
- Detection of GAN-synthesized street videos
- Text-to-Video: The Task, Challenges and the Current State
- A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?
- Generative Adversarial Networks
- Video Generative Adversarial Networks: A Review
- Auto-Encoding Variational Bayes
- To Create What You Tell: Generating Videos from Captions
- Mini-Max Algorithm in Artificial Intelligence
- MSVD (Microsoft Research Video Description Corpus)
- What is HowTo100M?
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
- Imagen Video: High Definition Video Generation with Diffusion Models
- U-Net Based Multi-instance Video Object Segmentation
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
- Denoising Diffusion Implicit Models (DDIM) Sampling
- DisCo: Disentangled Control for Referring Human Dance Generation in Real World
- Generative Disco: Text-to-Video Generation for Music Visualization
- Wavesurfer.js: an open-source audio visualization library for creating interactive, customizable waveforms
- Video-to-Video Synthesis
- Few-shot Video-to-Video Synthesis
- First Order Motion Model for Image Animation
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
- A Survey on Deepfake Video Detection
- Photo response non-uniformity