AI Video & Image Generation: A Simple Guide to the Core Terms (T2I, T2V, I2V and More)
If you're getting into AI-generated images and video, the terminology can feel like alphabet soup.
Everyone throws around abbreviations like T2I or R2V, but rarely explains them in a simple way.
This guide breaks down the most common terms you'll see in AI image and video generation workflows,
including both standard industry labels and some less common but useful variants.
1. Text-Based Generation

T2I – Text to Image
You write a text prompt, and the model generates a still image.
Example: "a cyberpunk city at night, rain, cinematic lighting" → image output

T2V – Text to Video
You provide a text prompt, and the model generates a video sequence.
Example: "a spaceship flying through an asteroid field" → video output

T2A – Text to Animation (alternative naming)
A less standardized term, sometimes used for systems that generate stylized animated sequences rather than realistic video.

2. Image-Based Generation

I2I – Image to Image
You provide an image, and the model transforms it into another image while preserving structure or style.
Example: sketch → realistic render

I2V – Image to Video
You start with a single image, and the model animates it into a video.
Example: portrait → talking/moving character video

I2A – Image to Animation
Sometimes used for simpler animation pipelines where motion is limited (parallax, subtle movement, stylized effects).

3. Video-Based Workflows

V2V – Video to Video
You input an existing video, and the model modifies it (style transfer, enhancement, or full transformation).
Example: real video → anime style video

V2I – Video to Image
The model extracts or generates still frames from a video, or converts video content into keyframes or images.

V2T – Video to Text
The model analyzes a video and generates a text description, captions, or a summary.

4. Reference & Multi-Frame Conditioning

R2V – Reference to Video
The model uses reference images (or multiple frames) to generate a consistent video.
This is often used to maintain character identity or style consistency across frames.

R2I – Reference to Image
Similar to R2V, but the output is a still image guided by reference material.

Video Extension / Video Continuation
The model takes an existing video and generates what happens next.
This is sometimes labeled as:
• R2V (in extended form)
• V2V extension
• temporal continuation

5. Advanced / Less Standard Terms

FLF2V / FFLF2V (Frame / Flow-based generation)
This is not a universally standardized acronym, but it is sometimes used in research contexts
to describe workflows where motion is derived from:
• frame sequences
• optical flow estimation
• latent frame interpolation
In simpler terms: the model doesn't just generate video from text or images, but tries to infer motion dynamics from structured frame information.
Note that some model releases expand FLF2V as "first-last-frame to video": you supply a first and a last frame, and the model generates the motion in between.

F2V – Frame to Video
A more general version of frame-conditioned video generation.
Often overlaps with I2V and R2V, depending on the implementation.

TI2V – Text + Image to Video
A hybrid approach where both a text prompt and a reference image guide video generation.
This is increasingly common in modern models because it improves consistency and control.

MV2V – Multi-Video to Video
A less common but emerging concept where multiple input videos are blended or used as reference material
for generating a new output video.

6. Why These Terms Are Confusing

The main problem is that these acronyms are not fully standardized.
Different companies use slightly different naming conventions for similar processes.
For example:
• One platform might call something I2V
• Another calls it "image animation"
• A third calls it "video generation from reference"
Under the hood, the technology can be very similar; only the marketing label changes.

Summary

The simplest way to think about AI generation is:
• T = Text
• I = Image
• V = Video
• R = Reference / Conditioning material
• F = Frame-based input
Everything else is just a combination of these building blocks: the letters before the "2" name the input, and the letter after it names the output.
Once you understand this structure, most AI generation pipelines start to make a lot more sense.
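
To make the building-block idea concrete, here is a minimal Python sketch of a label decoder. It is only an illustration, not part of any real tool: the `decode` function and the `MODALITIES` table are hypothetical names, and the sketch assumes a label follows the simple pattern of one or more input letters, a "2", and a single output letter. The letter "A" (animation) is included because it appears in T2A and I2A.

```python
import re

# Letter codes from the guide; "A" (animation) appears in T2A / I2A.
# This table is an assumption for illustration, not an industry standard.
MODALITIES = {
    "T": "text",
    "I": "image",
    "V": "video",
    "R": "reference",
    "F": "frame",
    "A": "animation",
}


def decode(acronym: str) -> tuple[list[str], str]:
    """Split an 'X2Y' style label into (input modalities, output modality)."""
    # One or more input letters, a literal "2", then exactly one output letter.
    match = re.fullmatch(r"([A-Z]+)2([A-Z])", acronym.upper())
    if match is None:
        raise ValueError(f"not an X2Y-style label: {acronym!r}")
    inputs, output = match.groups()
    try:
        return [MODALITIES[ch] for ch in inputs], MODALITIES[output]
    except KeyError as exc:
        # A letter outside the table, e.g. the "M" in MV2V.
        raise ValueError(
            f"unknown modality letter {exc.args[0]!r} in {acronym!r}"
        ) from None
```

For example, `decode("TI2V")` returns `(["text", "image"], "video")`. Labels with multi-letter prefixes, such as MV2V or FLF2V, do not fit the one-letter-per-modality scheme, so this sketch rejects them with a `ValueError` rather than guessing.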