2026-04-15 · 10 min read

The Technical Stack Behind Modern AI Video Upscalers

A deep dive into GAN, diffusion, and proprietary architectures powering Topaz, Real-ESRGAN, FlashVSR, and the rest of the modern upscaling toolchain.

If you've ever tried to upscale a 480p video to 1080p, you've probably noticed something weird: the same input can produce dramatically different results depending on which tool you use. Faces come out plasticky in one, over-sharpened in another, and sometimes outright hallucinated in a third. These differences aren't a matter of UI polish; they're architectural. Every upscaler is a different combination of model family, training objective, and temporal strategy, and those choices leak into every frame of the output.

This post is a technical tour of the major AI video upscalers available today. We'll cover the underlying model architectures, the training tradeoffs, the artifacts each class is known for, and where each tool actually shines. By the end you should have a much clearer mental model of why your 144p grandma video looks plastic in one tool and crisp in another.

The Four Families of Video Upscalers

Before diving into specific tools, it helps to know the broad architecture families. Virtually every AI upscaler on the market today falls into one of these four:

  1. Classical interpolation — bicubic, Lanczos, bilinear. No AI. Cheap, fast, produces smooth but blurry results. This is your FFmpeg scale filter.
  2. CNN-based super-resolution — convolutional networks trained on paired low-res/high-res images. SRCNN, EDSR, VDSR fall in this category. Good detail recovery, but limited by the training distribution.
  3. GAN-based super-resolution — a generator CNN paired with a discriminator. ESRGAN, Real-ESRGAN, SRGAN. Produces sharper-looking output by "hallucinating" plausible detail, but prone to texture artifacts, face distortion, and halo edges.
  4. Diffusion-based super-resolution — the newest class. Uses iterative denoising from Gaussian noise, conditioned on the low-res input. StableSR, FlashVSR, RealBasicVSR-plus-diffusion variants. Produces perceptually natural output with fewer GAN-style artifacts, at the cost of much higher compute.
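For reference, the classical baseline in family 1 really is a one-liner. A minimal sketch (the helper name and defaults are mine, not from any tool above) that builds the FFmpeg command for a Lanczos upscale:

```python
def build_scale_cmd(src, dst, width, height, algo="lanczos"):
    """Build an FFmpeg command for a classical (non-AI) upscale.

    algo can be any FFmpeg scaler: "bilinear", "bicubic", "lanczos", ...
    Audio is copied through untouched.
    """
    vf = f"scale={width}:{height}:flags={algo}"
    return ["ffmpeg", "-i", src, "-vf", vf, "-c:a", "copy", dst]

# Example: 480p -> 1080p with Lanczos resampling
cmd = build_scale_cmd("input_480p.mp4", "output_1080p.mp4", 1920, 1080)
```

Cheap and deterministic, which is exactly why it's also blurry: no new information is being added, only resampled.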

With that framing, let's walk through the tools.

Topaz Video AI

Family: Proprietary CNN + GAN
Cost: $299 one-time
Deployment: Local, GPU-accelerated (CUDA/Metal)

Topaz Labs' Video AI is the commercial heavyweight, and the reason is that they ship task-specific models rather than a single general-purpose one. Their model zoo includes:

  • Proteus — general-purpose upscaling with tunable sharpness/detail/noise
  • Artemis — specialized for denoising noisy/grainy footage
  • Theia — detail-focused upscaling, tends to over-sharpen
  • Iris — face-focused, trained specifically to avoid distortion on faces
  • Gaia HQ / Gaia CG — for animated/CG content

Under the hood these are fairly standard CNN architectures (variants of ResNet-style super-resolution backbones) with the key IP being the training datasets and fine-tuning recipes. Topaz owns massive proprietary corpora of paired high-res/low-res footage in multiple genres (film, animation, interview, archival), which is what lets their models generalize better than open-source equivalents.

Strengths: best-in-class face handling (thanks to Iris), excellent denoising, stable temporal consistency, Apple Silicon support.

Weaknesses: expensive, locked to your own hardware (a 4K upscale of a 10-minute clip can take several hours on a mid-range GPU), closed-source so you can't tune or retrain, UI has a learning curve.

Best for: film restoration, archival digitization, professional video work where quality per minute of input matters more than cost per minute of processing. See our Topaz Video AI alternative comparison for more detail on how Topaz stacks up against cloud alternatives.

Real-ESRGAN

Family: GAN-based super-resolution
Cost: Free, open source (BSD 3-Clause)
Deployment: Local (Python, CUDA) or via wrappers

Real-ESRGAN is the spiritual descendant of ESRGAN and probably the most widely-deployed open-source super-resolution model. The "Real" prefix refers to its training innovation: instead of training on clean downsamples (which don't match real-world degradation), Real-ESRGAN synthesizes complex degradation pipelines — blur, noise, JPEG compression, video compression — so the model learns to invert realistic corruption patterns.

Architecturally, it's a deep residual-in-residual dense block (RRDB) generator paired with a U-Net discriminator, trained with a weighted combination of L1 reconstruction loss, perceptual loss (VGG features), and adversarial loss.
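That training objective can be sketched abstractly. The weights below are illustrative placeholders, not the published Real-ESRGAN recipe:

```python
def total_loss(l1, perceptual, adversarial,
               w_l1=1.0, w_perc=1.0, w_adv=0.1):
    """Weighted sum of the three Real-ESRGAN-style loss terms.

    Illustrative weights only. L1 reconstruction keeps output close to
    ground truth, perceptual (VGG-feature) loss favors natural-looking
    textures, and adversarial loss rewards fooling the discriminator.
    The relative size of w_adv is what governs how aggressively the
    generator "hallucinates" sharp detail.
    """
    return w_l1 * l1 + w_perc * perceptual + w_adv * adversarial
```

The tension between these terms is the source of the GAN artifact profile: push the adversarial weight up and you get sharper, more confident textures, along with more confident mistakes.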

Strengths: free, widely available, fast inference on modern GPUs, good at recovering texture detail in landscapes, textures, and text.

Weaknesses: the classic GAN failure modes — hallucinated features on faces (eyes and mouths often look wrong), repeating texture patterns on skin, halo artifacts around high-contrast edges, and temporal flicker when applied frame-by-frame to video. Real-ESRGAN is fundamentally an image model; applying it to video naively produces frame-to-frame inconsistency.
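That frame-to-frame inconsistency is easy to quantify. A crude flicker score (my own toy metric, not a standard benchmark): the mean absolute change between consecutive frames, which an image-only model inflates even on static scenes:

```python
def flicker_score(frames):
    """Mean absolute per-pixel difference between consecutive frames.

    frames: list of equal-length pixel lists. A temporally stable
    upscaler scores near 0 on a static scene; per-frame GAN output
    tends to score higher because textures are re-hallucinated from
    scratch on every frame.
    """
    diffs = [
        sum(abs(a - b) for a, b in zip(f0, f1)) / len(f0)
        for f0, f1 in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)

# A perfectly static three-frame clip flickers not at all:
stable = flicker_score([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
```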

Best for: upscaling still images, or video where you don't care about temporal stability and can post-process to smooth artifacts.

Video2x

Family: Meta-tool — wraps other engines
Cost: Free, open source (GPL-3.0)
Deployment: Local, command-line or GUI

Video2x isn't a model — it's an orchestration layer. Under the hood it calls Waifu2x, Real-ESRGAN, SRMD, RealSR, or Anime4K depending on your configuration. Its real job is handling the video pipeline: frame extraction, frame-by-frame upscaling via a chosen backend, and reassembly (usually via FFmpeg).
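That three-stage pipeline can be written down as data. A sketch of the same shape (the backend invocation is illustrative; each engine has its own CLI, which is exactly what Video2x abstracts over):

```python
def pipeline_commands(src, dst, backend="realesrgan-ncnn-vulkan",
                      fps=30, workdir="frames"):
    """Video2x-style orchestration: three commands, run in order.

    1. FFmpeg extracts numbered PNG frames.
    2. The chosen backend upscales the frame directory (flags here are
       illustrative, not any engine's exact CLI).
    3. FFmpeg reassembles the upscaled frames into a video.
    """
    extract = ["ffmpeg", "-i", src, f"{workdir}/lo/%06d.png"]
    upscale = [backend, "-i", f"{workdir}/lo", "-o", f"{workdir}/hi"]
    assemble = ["ffmpeg", "-framerate", str(fps),
                "-i", f"{workdir}/hi/%06d.png", dst]
    return [extract, upscale, assemble]

cmds = pipeline_commands("clip_480p.mp4", "clip_1080p.mp4")
```

Note that nothing in this loop knows about time: each frame is upscaled in isolation, which is where the temporal flicker comes from.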

The architecture choice is whatever backend you pick. If you choose Real-ESRGAN, you get GAN outputs. If you choose Waifu2x, you get a CNN trained on anime.

Strengths: flexibility (swap engines for different content types), free, active community, Windows-friendly GUI.

Weaknesses: inherits all artifacts of whichever backend you choose, no temporal consistency handling of its own, can be fiddly to configure, requires a local GPU.

Waifu2x

Family: CNN (pre-GAN era)
Cost: Free, open source
Deployment: Local or web

Waifu2x is the grandparent of the modern AI upscaler scene. It's a straightforward deep CNN (originally based on VDSR-style architectures) trained primarily on anime-style illustrations. It works by predicting the high-resolution residual from a bicubic-upsampled input.
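The residual formulation is worth making concrete. A toy 1D sketch (not Waifu2x's actual code): the network only predicts the difference between a cheap upsample and the target, which is an easier function to learn than the full high-resolution image:

```python
def upsample_2x(signal):
    """Cheap linear-interpolation upsample, standing in for bicubic."""
    out = []
    for a, b in zip(signal, signal[1:]):
        out += [a, (a + b) / 2]
    out.append(signal[-1])
    return out

def residual_sr(lr, predict_residual):
    """Waifu2x-style residual SR: upsample, then add a learned correction.

    predict_residual stands in for the CNN; here it's any callable that
    maps the upsampled signal to a per-sample correction.
    """
    up = upsample_2x(lr)
    return [u + r for u, r in zip(up, predict_residual(up))]

# With a zero residual, the output is just the interpolation baseline:
baseline = residual_sr([0.0, 1.0], lambda up: [0.0] * len(up))
```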

Strengths: legendary status in anime/manga communities, fast, small model, still perfectly fine for its intended use case.

Weaknesses: terrible on real-world video. The training distribution was flat-color anime with hard edges; photographs and live-action footage fall completely outside that distribution, so you get oil-painting-style output with flattened skin textures.

Best for: anime stills and animated video only. Don't use it for live-action, and definitely not for old home videos.

FlashVSR (and Cloud Services Built On It)

Family: Diffusion-based video super-resolution
Cost: Model is open source; compute costs vary
Deployment: Requires significant GPU memory (24GB+ for HD output)

FlashVSR is one of the newer diffusion-based approaches to video super-resolution, and it represents a meaningful architectural shift from the GAN era. Instead of training a generator to directly map low-res to high-res, diffusion models learn to iteratively denoise a sample, conditioned on the low-res input. This changes the artifact profile completely:

  • GAN artifacts come from mode collapse and perceptual loss pushing the generator toward "plausible-looking" textures regardless of fidelity. Results: hallucinated faces, repeating patterns.
  • Diffusion artifacts come from incomplete denoising or incorrect conditioning. Results: slight blur in high-frequency detail, occasional temporal inconsistency between chunks. But faces and smooth surfaces tend to come out much more naturally.

The FlashVSR v1.1 architecture specifically uses block-sparse attention for efficiency (most diffusion video models are otherwise prohibitively expensive), temporal chunking with 60-frame windows and 16-frame overlap for consistency, and bfloat16 inference. It also has a quirk where it drops 4 frames from every output chunk, so input must be padded to 8k+1 frames (for integer k) to guarantee you get back at least as many output frames as you need.
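The padding rule and chunk layout can be made precise. This is my reading of the stated constraints (input length of 8k+1, a 4-frame loss, 60-frame windows with 16-frame overlap), not code from the FlashVSR repo:

```python
def padded_input_frames(n):
    """Smallest m with m = 8k+1 and m >= n + 4, so that after the
    4-frame drop at least n frames come back (single-pass assumption)."""
    m = n + 4
    return m + (1 - m) % 8

def chunk_windows(total, window=60, overlap=16):
    """(start, end) frame-index pairs for overlapping temporal chunks."""
    stride = window - overlap  # 44 frames of fresh content per chunk
    windows, start = [], 0
    while start + window < total:
        windows.append((start, start + window))
        start += stride
    windows.append((start, total))  # final, possibly shorter, chunk
    return windows
```

For example, a 60-frame clip would be padded to 65 input frames, and a 150-frame clip would be processed in four overlapping windows. The 16-frame overlap is what lets adjacent chunks agree on shared frames and keeps chunk boundaries from flickering.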

Strengths: natural-looking faces without GAN hallucination, good temporal consistency, handles real-world degradation well.

Weaknesses: heavy compute (a 1-minute 1080p upscale takes tens of GPU-minutes), requires significant VRAM, diffusion models can produce slightly softer output than aggressive GANs.

Cloud deployments: running FlashVSR locally requires a 24GB+ GPU, CUDA toolkit, custom kernel compilation — non-trivial. ClearFrame wraps FlashVSR with a web UI and pay-per-minute compute, which is often cheaper than buying a GPU if your usage is occasional. You can try a free 1-second preview to see how FlashVSR responds to your specific footage before committing.

Best for: content with faces (portrait videos, interviews, family footage), anything where GAN hallucination is unacceptable. We have step-by-step guides for common use cases: fixing blurry video, enhancing old home videos, and upscaling 480p to 1080p.

Shutter Encoder

Family: FFmpeg wrapper with AI scaling filters
Cost: Free
Deployment: Local, cross-platform

Shutter Encoder is a Swiss-army-knife video tool that happens to include AI upscaling via integrated filters — it uses FFmpeg's various AI scaling options (including integrations with RealSR and ESRGAN models under the hood). Think of it as Video2x's more mainstream cousin.

Strengths: free, genuinely useful for many video tasks beyond upscaling (format conversion, subtitles, merging, etc.), no GPU required for non-AI features.

Weaknesses: the AI scaling is a feature, not the focus — quality is solid but not competitive with dedicated tools for demanding upscales.

Best for: general video workflow where upscaling is one of many operations you need.

VLC and Nvidia RTX Video Super Resolution

Family: Proprietary real-time super-resolution (not file-based)
Cost: Free if you have an RTX GPU
Deployment: Real-time, GPU-only

Nvidia RTX VSR is an interesting outlier. It's not a file-modification tool — it runs inside compatible browsers (Chrome, Edge) and players (VLC via its RTX edition) and upscales video in real time as you play it. The model runs on the RTX tensor cores and is optimized for 30-60fps realtime inference.

Architecturally it's Nvidia's proprietary neural network, likely a compact CNN tuned for speed over quality. They haven't published details but it's in the same family as DLSS (which upscales games in real time).

Strengths: zero workflow — just play your video and it looks better. No files, no processing queue, no waiting.

Weaknesses: requires an RTX 3000+ GPU, the output isn't saved (it's applied at playback time only), quality is good but not comparable to offline tools.

Best for: watching lower-resolution content on a big screen without pre-processing.

Comparison Matrix

Tool                  | Family              | Cost           | Face Quality | Temporal
Topaz Video AI        | Proprietary CNN+GAN | $299           | Excellent    | Excellent
Real-ESRGAN           | GAN                 | Free           | Poor         | None
Video2x               | Wrapper             | Free           | Varies       | None
Waifu2x               | CNN                 | Free           | Poor         | None
FlashVSR / ClearFrame | Diffusion           | Pay-per-minute | Excellent    | Good
Shutter Encoder       | FFmpeg+AI           | Free           | Moderate     | Basic
Nvidia RTX VSR        | Proprietary         | Free w/ RTX    | Moderate     | Playback

Choosing the Right Tool

A rough decision tree:

  • Faces matter more than cost? Topaz Video AI (if you're buying) or ClearFrame (if you'd rather pay per minute and skip the GPU purchase).
  • Textures/landscapes, artifacts OK? Real-ESRGAN. Free and fast.
  • Anime? Waifu2x, or Real-ESRGAN with the anime model variant.
  • You have an RTX GPU and just want to watch stuff at higher quality? RTX VSR. No processing, just play.
  • You want to experiment? Video2x gives you the most backend options.
  • Precious family memories from a 144p camera phone? This is the hardest case. Spend the $299 on Topaz or use a cloud diffusion service. The free tools will produce exactly the "weird faces and black edges" that so many Reddit threads complain about — see our guide to enhancing old home videos.

The Bigger Picture

The upscaling landscape is bifurcating. On one end, GAN-based tools are getting faster and more accessible — Real-ESRGAN runs on a laptop GPU in a few minutes per video. On the other, diffusion models like FlashVSR are getting cheaper and more practical — what used to require a data center can now run on a single 24GB GPU, or via cloud APIs for a few cents per minute.

The right choice depends less on which tool has "the best AI" and more on what kind of content you're processing and how much you care about specific failure modes. Faces? Diffusion or Topaz Iris. Textures? Real-ESRGAN is fine. Anime? Waifu2x is still the answer after all these years.

None of these tools are magic. A 144p source (176×144) has about 25,000 pixels per frame. A 1080p target has over 2 million. The AI is literally inventing roughly 99% of the image. The best you can hope for is "watchable and recognizable," not "HD remaster." Calibrate expectations accordingly, and pick the tool whose artifact profile you can live with.
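The arithmetic is worth running yourself (assuming a 176×144 source, one common "144p" frame size):

```python
src_pixels = 176 * 144   # 25,344 pixels per 144p frame
dst_pixels = 1920 * 1080 # 2,073,600 pixels per 1080p frame
invented = 1 - src_pixels / dst_pixels
# ~0.988: nearly 99% of the output pixels have no source data behind them
```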

Want to try diffusion-based upscaling?

ClearFrame runs FlashVSR in the cloud with a free 1-second preview on every video, so you can see the result before spending anything.

Try a free preview