How Scaling Method and Technique Impacts Quality and Throughput

The thing about FFmpeg is that there are almost always multiple ways to accomplish the same basic function. In this post, we look at four approaches to scaling to reveal how the scaling method and techniques used impact quality and throughput.

We found that if you’re scaling using the default -s function (-s 1280×720), you’re leaving a bit of quality on the table compared to other methods. How much depends upon the metric you prefer; about ten percent if you’re a VMAF (hand raised here) or SSIM fan, much less if you still bow to the PSNR gods. More importantly, if you’re chasing throughput via cascaded scaling with fast scaling algorithms (flags=fast_bilinear), you’re probably losing quality without a meaningful throughput increase.

That’s the TL/DR; here’s the backstory.

The Backstory

NETINT sells ASIC-based hardware transcoders. One key advantage over software-only/CPU-based encoding is throughput, so we perform lots of hardware vs. software benchmarking. Fairness dictates that we use the most efficient FFmpeg command string when deriving the command string for software-only encoding.

In addition, the NETINT T408 transcoder scales in software using the host CPU, so we are vested in techniques that increase throughput for T408 transcodes. In contrast, the NETINT Quadra scales and performs overlays in hardware and provides an AI engine, which is why it’s designated a Video Processing Unit (VPU) rather than a transcoder.

One proposed scaling technique for accelerating both software-only and T408 processing is cascading scaling, where you create a filter complex that starts at full resolution, scales to the next lower resolution, then uses the lower resolution to scale to the next lower resolution. Here’s an example.

filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]"

So, rather than performing multiple scales from full resolution to the target (4K > 2K, 4K to 1080p, 4K > 720p, 4K to 360p), you’re performing multiple scales from lower resolution sources (4K > 2K > 1080p >720p > 360p). The theory was that this would reduce CPU cycles and improve throughput, particularly when coupled with a fast scaling algorithm. Even assuming a performance increase (which turned out to be a bad assumption), the obvious concern is quality; how much does quality degrade because the lower-resolution transcodes are working from a lower-resolution source?

In contrast, if you’ve read this far,  you know that the typical scaling technique used by most beginning FFmpeg producers is the -s command (-s 1280×720). For all rungs below 4K, FFmpeg scales the source footage down to the target resolution using the bicubic scaling algorithm,

So, we had two proposed methods which I expanded to four, as follows.

  • Default (-s 1280×720)
  • Cascade using fast bilinear
  • Cascade using Lanczos
  • Video filter using Lanczos (-vf scale=1280×720 -sws_flags lanczos)

I tested the following encoding ladder using the HEVC codec.

  • 4K @ 12 Mbps
  • 2K @ 7 Mbps
  • 1080p @ 3.5 Mbps
  • 1080p @ 1.8 Mbps
  • 720p @ 1 Mbps
  • 360p @ 500 kbps

I encoded two 3-minute 4Kp30 files, excerpts from the Netflix Meridian and Harmonic Football test clips using the x265 codec and ultrafast preset. You can see full command strings at the end of the article. I measured throughput in frames per second and measured the 2K to 360p rung quality with VMAF, PSNR, and SSIM, compiling the results into BD-Rate comparisons in Excel.

I tested on a Dell Precision 7820 tower driven by two 2.9 GH Intel Xeon Gold (6226R) CPUs running Windows 10 Pro for Workstations with 64 GB of RAM. I tested with FFmpeg 5.0, a version downloaded from www.gyan.dev on December 15, 2022.

Performance

TABLE 1. FPS BY SCALING METHOD

Table 1 shows that cascading delivered negligible performance benefits with the two test files and the selected encoding parameters. I asked the engineer who suggested the cascading scaling approach why we saw no throughput increase. Here’s a brief exchange. 

Engineer: It’s not going to make any performance difference in your example anyways but it does reduce the scaling load

       Me: Why wouldn’t it make a performance difference if it reduces the scaling load?

Engineer: Because, as your example has shown, the x265 encoding load dominates. It would make a very small difference

       Me: Ah, so the slowest, most CPU-intensive process controls overall performance.

Engineer: Yes, when you compare 1000+1 with 1000+10 there is not too much difference.

What this means, of course, is that these results may vary by the codec. If you’re encoding with H.264, which is much faster, cascading scaling might increase throughput. If you’re encoding with AV1 or VVC, almost certainly not.

Given that the T408 transcoder is multiple times faster than real-time, I’m now wondering if cascaded scaling might increase throughput when producing with the T408. You probably wouldn’t attempt this approach if quality suffered, but what if cascaded scaling improved quality? Sound far-fetched? Read on.

Scaling Method and Technique Impact on Quality and Throughput

Quality Results

Table 2 shows the combined VMAF results for the two clips. Read this by choosing a row and moving from column to column. As you would suspect, green is good, and red is bad. So, for the Default row, that technique produces the same quality as Cascade – Fast Bilinear with a bitrate reduction of 18.55%. However, you’d have to boost the bitrate by 12.89% and 11.24%, respectively, to produce the same quality as Cascade – Lanczos and  Video Filter – Lanczos.

Scaling blog - Table 2 BD-Rate comparisons for the four techniques using the VMAF metric
Table 2. BD-Rate comparisons for the four techniques using the VMAF metric.

From a quality perspective, the Cascade approach combined with the fast bilinear algorithm was the clear loser, particularly compared to either method using the Lanczos algorithm. Even if there was a substantial performance increase, which there wasn’t, it’s hard to see a relevant use case for this algorithm.

The most interesting takeaway was that cascading scaling with the Lanczos algorithm produced the best results, slightly higher than using a video filter with Lanczos. The same pattern emerged for PSNR, where Cascade – Lanc was green in all three columns, indicating the highest-quality approach. 

Scaling blog - Table 3 BD-Rate comparisons for the four techniques using the PSNR metric
Table 3. BD-Rate comparisons for the four techniques using the PSNR metric.

Ditto for SSIM.

Scaling blog - Table 4 BD-Rate comparisons for the four techniques using the SSIM metric
Table 4. BD-Rate comparisons for the four techniques using the SSIM metric.

The cascading approach delivering better quality than the video filter was an anomaly. Not surprisingly, the engineer noted:

Engineer: It is odd that cascading with Lanczos has better quality than direct scaling. I’m not sure why that would be.

       Me: Makes absolutely no sense. Is anything funky in the two command strings?

Engineer: Nothing obvious but I can look some more.

Later analysis yielded no epiphanies. Perhaps they can come from a reader.

The Net Net

First, the normal caveats; your mileage may vary by codec and content. My takeaways are:

  • Try cascading scaling with Lanczos with the T408,
  • For software encodes, never use -s again.
  • Use cascade or the simpler video filter approach. 
  • With most software-based encoders, faster scaling methods may not deliver performance increases but could degrade quality.

Further, as we all know, there are several, if not dozens, additional approaches to scaling; if you have meaningful results that prove one is substantially better, please share them with me via THIS email.

Finally, taking a macro view, it’s worth remembering that a $12,000 + workstation could only produce 25 fps when producing a live 4K ladder to HEVC using x265’s ultrafast preset. Sure, there are faster software encoders available. Still, hardware encoding is the best answer for affordable live 4K transcoding from both an OPEX and CAPEX perspective.

Command Strings:

Default:

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_12M_default.mp4 ^

-s 2560×1440 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_7M_default.mp4  ^

-s 1920×1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_3_5M_default.mp4 ^

-s 1920×1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_default.mp4 ^

-s 1280×720  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_default.mp4 ^

-s 640×360  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_default.mp4

Cascade – Fast Bilinear

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-filter_complex “[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]” ^

-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_fast_bi.mp4 ^

-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_fast_bi.mp4  ^

-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_fast_bi.mp4 ^

-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_fast_bi.mp4 ^

-map [out720p]  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_fast_bi.mp4 ^

-map [out360p]  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_8_bit_cascade_500K_fast_bi.mp4

Cascade – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-filter_complex “[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=lanczos,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=lanczos,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=lanczos,split=2[out720p][in720p];[in720p]scale=640:360:flags=lanczos[out360p]” ^

-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_lanc.mp4 ^

-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_lanc.mp4  ^

-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_lanc.mp4 ^

-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_lanc.mp4 ^

-map [out720p]  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_lanc.mp4 ^

-map [out360p]  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_cascade_500K_lanc.mp4

Video Filter – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_12M_filter_lanc.mp4 ^

-vf scale=2560×1440 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_7M_filter_lanc.mp4  ^

-vf scale=1920×1080 -sws_flags lanczos  -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_3_5M_filter_lanc.mp4 ^

-vf scale=1920×1080 -sws_flags lanczos  -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_filter_lanc.mp4 ^

-vf scale=1280×720 -sws_flags lanczos -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_filter_lanc.mp4 ^

-vf scale=640×360 -sws_flags lanczos  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_filter_lanc.mp4

Related Article

Argos - an ASIC chip dedicated to YouTube video encoding was given a codename after a monster from Greek mythology called 'Argos'.
NETINT Blog

Argos dispels common myths about encoding ASICs

  Even in 2023, many high-volume streaming producers continue to rely on software-based transcoding, despite the clear CAPEX, OPEX, and environmental benefits of ASIC-based transcoding.