Which AWS CPU is Best for FFmpeg – AMD, Graviton, or Intel?


If you encode with FFmpeg on AWS, you probably know that you have three CPU options: AMD, Graviton, and Intel. Which delivers the most bang for the buck?

For those in a hurry, it’s Graviton for x264 and AMD for x265, often by a significant margin. But the devil is always in the details, and if you want to learn how we tested and how big a difference your CPU selection makes, you can follow the narrative or hopscotch through the fancy charts below. We conclude with a look at the optimal core count for those encoding with AMD CPUs.

Testing the AWS CPUs

Let me start by saying that this was my first foray into CPU testing on AWS, and while it appears straightforward, some unconsidered complexity may have skewed the results. If you see any errors or other factors worth considering, please drop me a note at jan.ozer@netint.com.

Second, your source clip and command string may produce different results than those shown below. If you’re spending big to encode with FFmpeg on AWS, don’t consider my results the final word; instead, consider them as evidence that your CPU choice really does matter and as motivation to perform your own tests. 

Those caveats aside, let’s dig into the testing.

Codecs/Configurations/Command Strings

I ran three test cases.

  • 8-bit 1080p30 with x264
  • 8-bit 1080p30 with x265
  • 10-bit 4K60p with x265

I present the command strings at the bottom of this article. Note that I used the veryslow preset for x264, slower for x265 at 1080p30, and slow for the 4K60 HEVC encodes. Why such demanding presets? Because when you consider the total cost of distribution (encoding plus bandwidth), a high-quality preset is the most economical choice once view counts exceed 10,000.


Remember, presets don’t determine quality; your quality expectations do. Most compressionists target a VMAF score of 93 to 95 for the top rung of their encoding ladders. Using the veryslow preset, you might achieve that at, say, 3 Mbps. Using ultrafast, you might need a bit rate of as much as 5 Mbps to achieve the same quality. Ultrafast might cut your encoding time and cost by 90%, but you only pay that once, while you pay bandwidth costs for each video view. Even at a cost per GB of $0.02, it takes fewer than 10,000 views for the veryslow preset to break even based on lower bandwidth costs.
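To make the break-even arithmetic concrete, here is a minimal Python sketch. The 3 Mbps and 5 Mbps bitrates and the $0.02 per GB price come from the paragraph above. The one-hour duration and the $9.00 extra encoding cost are hypothetical placeholders, not measured values, so substitute your own numbers.

```python
def break_even_views(extra_encode_cost, slow_mbps, fast_mbps, duration_s, cost_per_gb):
    """Views needed before bandwidth savings repay the extra one-time encoding cost."""
    # Bits saved per view -> bytes -> decimal gigabytes
    saved_gb_per_view = (fast_mbps - slow_mbps) * 1e6 * duration_s / 8 / 1e9
    saved_dollars_per_view = saved_gb_per_view * cost_per_gb
    return extra_encode_cost / saved_dollars_per_view


# Hypothetical inputs: a one-hour video where veryslow costs $9.00 more to
# encode than ultrafast. Only the bitrates and the per-GB price are from the text.
views = break_even_views(
    extra_encode_cost=9.00,
    slow_mbps=3.0,
    fast_mbps=5.0,
    duration_s=3600,
    cost_per_gb=0.02,
)
print(f"Break-even after about {views:.0f} views")
```

The function keeps the one-time cost (encoding) separate from the per-view cost (bandwidth), which is the whole argument: the first is paid once, the second scales with audience size.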

Instances and Pricing

I tested using the 8-core instances and on-demand pricing shown in Table 1. All systems ran Ubuntu 22.04. Note that the cost delta between Intel and AMD is ten percent, a number I’ll refer to below.

Table 1:  Instances and on-demand pricing tested.

Encoding Procedure

As you’ll see in the charts below, I started with a single FFmpeg encode and kept adding simultaneous encodes until the cost per stream began to increase, indicating that spinning up another instance was more cost-effective than adding encodes to the same system.
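Here is one way to express that stopping rule in Python. The article does not spell out its exact cost formula, so this sketch assumes that cost per stream equals the instance's hourly price times the wall-clock time of a batch, normalized to one hour of source video and divided by the number of encodes in the batch. The price and timings are made-up placeholders, not my measurements.

```python
def cost_per_stream_hour(price_per_hour, encodes, wall_seconds, clip_seconds):
    """Dollars per stream to encode one hour of video when `encodes` jobs run at once."""
    instance_hours_per_video_hour = wall_seconds / clip_seconds
    return price_per_hour * instance_hours_per_video_hour / encodes


def cheapest_concurrency(price_per_hour, clip_seconds, timings):
    """timings maps simultaneous-encode count -> wall-clock seconds for the whole batch."""
    costs = {
        n: cost_per_stream_hour(price_per_hour, n, seconds, clip_seconds)
        for n, seconds in timings.items()
    }
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical instance price and batch timings for a 60-second test clip
timings = {1: 300.0, 2: 320.0, 4: 480.0, 8: 1000.0}
best, costs = cheapest_concurrency(price_per_hour=0.40, clip_seconds=60.0, timings=timings)
for n, cost in sorted(costs.items()):
    print(f"{n} simultaneous encodes: ${cost:.3f} per stream-hour")
print(f"Lowest cost at {best} simultaneous encodes")
```

Once the cost at the next higher count exceeds the minimum, adding a second instance is the cheaper way to add capacity.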

FFmpeg Versions

Here’s where things get a bit complicated. My premise was that I would produce the optimal results using FFmpeg versions compiled specifically for each CPU tested. I downloaded builds for Graviton, AMD, and Intel from https://johnvansickle.com/ffmpeg/ and happily contributed via PayPal. However, I was also in touch with MulticoreWare, who requested that I test with an advanced version of their x265 codec that was optimized for Graviton.

Figure 1. I tested with CPU-specific versions of FFmpeg 6.0 from https://johnvansickle.com/ffmpeg/.

Before testing, I compared the performance of the stock version of FFmpeg (Version 4.4) with the CPU-specific versions from Vansickle on the AMD and Intel platforms and for x264 on Graviton. In all cases, the Vansickle version produced the same or better throughput with identical quality.

Note that in other tests on different AMD instances with core counts ranging from 2 to 32, the Vansickle version was not always the best performer. So, if you try the Vansickle versions or your own CPU-specific compiled versions, you should verify that they outperform the native version in all relevant use cases.

Note that the MulticoreWare version of FFmpeg performed much better on the Graviton system than the generic version of 4.4 or the Vansickle version, though still far behind Intel and particularly AMD. As you’ll see clearly below, if you’re running x265 on a Graviton system using high quality presets, you’re missing a great opportunity to shave your costs.

For the record, I tried upgrading the stock version of FFmpeg on the Ubuntu system to version 6.0 but ran into multiple issues that ultimately corrupted the system and forced me to start over from scratch. Unfortunately, Ubuntu operation and maintenance are not core strengths of mine, but since I ran all tests using Version 6.0, whether supplied by Vansickle or MulticoreWare, the results should be representative.

Table 2 shows the different versions of FFmpeg that I ran on the three systems for the three test cases.

Table 2. The FFmpeg versions deployed on the three systems for the three test cases.

Results

Here are the results for the three test cases.

1080p x264

Figure 2 shows the cost per hour to produce a 1080p30 stream using FFmpeg and the x264 codec. One of the more interesting results was that the combination of FFmpeg and Ubuntu handled multiple simultaneous FFmpeg encodes with minimal overhead, particularly on the Graviton CPU. You see this with the cost per hour for Graviton remaining consistent through twelve simultaneous encodes, while it increased slightly for Intel after ten and for AMD after twelve.

In all cases, you see the cost per stream drop significantly when moving from a single encode to multiple simultaneous encodes. If you’re performing a single 1080p x264 encode on an 8-core system, you’re probably wasting money.

On the other hand, once each CPU hits the lowest cost per hour, it’s time to consider adding another AWS instance. The cost per stream will remain the same, but your throughput will double. So, if you’re encoding on a Graviton system, each encode will take about twice as long if you run twelve simultaneous encodes as opposed to six, but your cost per hour will be almost exactly the same. If you instead spin up a second 8-core system and run six simultaneous encodes on each, your cost will be almost identical, but your throughput will double.

Figure 2. Cost per hour to produce a single 1080p stream using the x264 codec and FFmpeg. Graviton is clearly the most cost-effective.

1080p x265

What a difference a codec makes. Where Graviton was the clear leader for x264, it’s the clear laggard for x265. Again, I produced the Graviton results shown in Figure 3 using a version of FFmpeg supplied by x265 developer MulticoreWare; the results would have been much worse with either the Vansickle version or the stock version. As you may know, Graviton is an Arm-based CPU that uses a different instruction set than Intel or AMD CPUs. While the x264 codec was Arm-friendly, the x265 codec was decidedly the reverse, at least using the high-quality presets that I used in my tests.

Interestingly, for both Intel and AMD, we realized the lowest cost per stream at relatively low simultaneous stream counts: two for Intel and two or three for AMD. If your testing confirms this, you should consider adding instances once you reach this threshold rather than adding encodes to existing instances.

Figure 3. Cost per hour to produce a single 1080p stream using the x265 codec and FFmpeg.

Comparing the lowest-cost Intel result ($6.60) to the lowest-cost AMD result ($5.49) shows a cost delta of about 17%. As shown in Table 1, 10% of this relates to pricing, leaving about a 7% performance delta.
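The arithmetic behind those percentages can be sketched in a few lines of Python. The two dollar figures are from the text. The 0.90 price ratio is my reading of the ten percent delta in Table 1 (AMD priced 10% below Intel) and is an assumption. Subtracting percentage points, as I did above, is a quick approximation; dividing out the price ratio is a slightly stricter way to isolate performance.

```python
intel_cost = 6.60  # lowest Intel cost per stream, 1080p x265 (from the text)
amd_cost = 5.49    # lowest AMD cost per stream, 1080p x265 (from the text)

# Overall saving from choosing AMD, relative to Intel
cost_delta = (intel_cost - amd_cost) / intel_cost

# Assumption: the AMD instance is priced at 90% of the Intel instance
price_ratio = 0.90

# Instance time AMD needs per stream relative to Intel, with pricing divided out
time_ratio = (amd_cost / intel_cost) / price_ratio

print(f"Cost delta: {cost_delta:.1%}")
print(f"Approximate performance delta (subtraction): {cost_delta - 0.10:.1%}")
print(f"Performance delta (price divided out): {1 - time_ratio:.1%}")
```

Both methods land near 7%, so the conclusion does not depend on which one you prefer.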

For the record, note that an Amazon engineer ran similar tests here and found that Graviton was faster for both x264 and x265. Note, however, that the author used the ultrafast preset, while I used higher quality presets for the stated reasons. Have a look and draw your own conclusions.

4K60 x265

In 4K60p testing, the Graviton was clearly overwhelmed on both cost and performance, unable to complete even three simultaneous encodes. The overall cost delta between Intel and AMD narrowed slightly, dropping to 13.7% overall, with 10% relating to pricing. The actual throughput delta between the two in these tests is 3.7%.

Figure 4. Cost per hour to produce a single 4K60p stream using the x265 codec and FFmpeg.

This 4K60 test stressed memory usage much more than the 1080p tests, limiting successful simultaneous transcodes to two for Graviton and four for AMD and Intel. Interestingly, in these tests, AMD produced the lowest cost per stream while running a single encode, and Intel did so at two. With these challenging encodes, you may want to spin up new machines after only one or two encodes rather than attempting more simultaneous encodes. Or, perhaps, try a machine with more cores. Hold that thought until the last section.

For reference, Table 3 summarizes the lowest cost per hour for the three test cases.

Table 3. Cost per hour for the three test cases on the three tested CPUs.

Which leads us to the last section.

What’s the Optimal Number of Cores for FFmpeg?

AWS offers multiple core counts in all three CPU flavors: what’s the optimal core count? To evaluate this, I ran tests on multiple AMD CPUs for all three test cases and present the results below.

Let’s talk about expectations first. AWS charges linearly for the machine cores, so an 8-core system costs twice as much as a 4-core system and a quarter of a 32-core system. Given the results presented above, where FFmpeg/Ubuntu proved highly efficient when processing multiple instances, I expected a similar cost per hour for all CPUs. The results were close.

With x264, 2-core and 8-core systems were slightly more affordable than 16-core, though a 32-core system finally caught up at 32 simultaneous transcodes. If you’re going to run a 32-core system for 1080p30/x264 encodes, you need to be running quite a few simultaneous encodes to achieve the optimal cost per stream.

Figure 5. x264 encoding cost for the CPU core counts shown.

With x265 encoding at 1080p, the results were closer to what I expected, though again, the 2-core and 8-core systems were slightly more affordable. Unlike x264, the 32-core system became slightly more expensive as the number of simultaneous encodes increased, making eight simultaneous streams the most affordable.

Figure 6. x265 encoding cost for 1080p30 encodes and the CPU core counts shown.

When encoding 4K videos, the phrase “go big or go home” comes to mind. Here, 32 cores delivered the lowest cost, though only by a fraction, and only at four simultaneous encodes. After that, the cost per hour increases slightly through eight encodes and then starts a more serious climb.

Figure 7. x265 encoding cost for 4K60 encodes and the CPU core counts shown.

As you can see, all these results are highly specific to the codec and source material. The most important takeaway from this article should not be that Graviton is best for x264 and AMD best for x265. It should be that real differences exist between the performance of the CPUs, and these differences may translate to significant cost differentials. If you’re spending even a few thousand dollars a month on AWS for FFmpeg encoding, it makes sense to run tests like these to identify the most cost-effective CPU and core count.

Test Strings

1080p30 x264:

ffmpeg -y -i Orchestra.mp4 -c:v libx264 -profile:v high  -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0  -b:v 4200k -pass 1  -f mp4 /dev/null

ffmpeg -y -i Orchestra.mp4 -c:v libx264  -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0  -b:v 4200k -maxrate 8400k -bufsize 8400k -pass 2  orchestra_x264_output.mp4

1080p30 x265:

ffmpeg  -y -i Football_short.mp4 -c:v libx265 -preset slower -x265-params keyint=60:min-keyint=60:scenecut=0:bitrate=3500:pass=1  -f mp4 /dev/null

ffmpeg  -y -i Football_short.mp4 -c:v libx265 -preset slower -x265-params keyint=60:min-keyint=60:scenecut=0:bitrate=3500:vbv-maxrate=7000:vbv-bufsize=7000:pass=2  Football_x265_HD_output.mp4

4K60 x265:

ffmpeg -y -i Football_4K60.mp4 -c:v libx265 -preset slow -x265-params keyint=120:min-keyint=120:scenecut=0:bitrate=12500K:pass=1  -f mp4 /dev/null

ffmpeg -y -i Football_4K60.mp4 -c:v libx265 -preset slow -x265-params keyint=120:min-keyint=120:scenecut=0:bitrate=12500K:vbv-maxrate=25000K:vbv-bufsize=25000K:pass=2  Football_4K_output.mp4 

HARD QUESTIONS ON HOT TOPICS: AMD, Graviton, and Intel
– three CPU options to encode with FFmpeg on AWS
 
Watch the full conversation on YouTube: https://youtu.be/BOZZuiemMAU

How Scaling Method and Technique Impacts Quality and Throughput


The thing about FFmpeg is that there are almost always multiple ways to accomplish the same basic function. In this post, we look at four approaches to scaling to reveal how the scaling method and techniques used impact quality and throughput.

We found that if you’re scaling using the default -s function (-s 1280x720), you’re leaving a bit of quality on the table compared to other methods. How much depends upon the metric you prefer: about ten percent if you’re a VMAF (hand raised here) or SSIM fan, much less if you still bow to the PSNR gods. More importantly, if you’re chasing throughput via cascaded scaling with fast scaling algorithms (flags=fast_bilinear), you’re probably losing quality without a meaningful throughput increase.

That’s the TL;DR; here’s the backstory.

The Backstory

NETINT sells ASIC-based hardware transcoders. One key advantage over software-only/CPU-based encoding is throughput, so we perform lots of hardware vs. software benchmarking. Fairness dictates that we use the most efficient FFmpeg command string when deriving the command string for software-only encoding.

In addition, the NETINT T408 transcoder scales in software using the host CPU, so we have a vested interest in techniques that increase throughput for T408 transcodes. In contrast, the NETINT Quadra scales and performs overlays in hardware and provides an AI engine, which is why it’s designated a Video Processing Unit (VPU) rather than a transcoder.

One proposed scaling technique for accelerating both software-only and T408 processing is cascading scaling, where you create a filter complex that starts at full resolution, scales to the next lower resolution, then uses the lower resolution to scale to the next lower resolution. Here’s an example.

-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]"

So, rather than performing multiple scales from full resolution to the target (4K > 2K, 4K > 1080p, 4K > 720p, 4K > 360p), you’re performing multiple scales from lower resolution sources (4K > 2K > 1080p > 720p > 360p). The theory was that this would reduce CPU cycles and improve throughput, particularly when coupled with a fast scaling algorithm. Even assuming a performance increase (which turned out to be a bad assumption), the obvious concern is quality: how much does quality degrade because the lower-resolution transcodes are working from a lower-resolution source?
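Cascade filter graphs are tedious to type by hand, so here is a small Python sketch that generates one. The function name, its arguments, and the label scheme are my own illustration and not part of FFmpeg; the output string is ordinary FFmpeg filtergraph syntax built from the split and scale filters used in the example above.

```python
def cascade_filter(source, rungs, flags):
    """Build a cascaded-scaling filter graph string.

    source: (label, copies) for the unscaled top rung, e.g. ("4k", 1)
    rungs:  list of (label, width, height, copies), highest resolution first
    flags:  libswscale algorithm name, e.g. "fast_bilinear" or "lanczos"
    """
    stages = [(source[0], None, source[1])]
    stages += [(label, f"scale={w}:{h}:flags={flags}", copies)
               for label, w, h, copies in rungs]

    chains = []
    previous = "[0:v]"
    for index, (label, scale, copies) in enumerate(stages):
        # One output pad per encode at this resolution
        pads = [f"[out{label}]"] + [f"[out{label}{k}]" for k in range(2, copies + 1)]
        # Every stage except the last also feeds the next, lower resolution
        if index < len(stages) - 1:
            pads.append(f"[in{label}]")
        filters = []
        if scale:
            filters.append(scale)
        if len(pads) > 1:
            filters.append(f"split={len(pads)}")
        if not filters:
            filters.append("null")  # pass-through when nothing else is needed
        chains.append(previous + ",".join(filters) + "".join(pads))
        previous = f"[in{label}]"
    return ";".join(chains)


ladder = [("1440p", 2560, 1440, 1), ("1080p", 1920, 1080, 2),
          ("720p", 1280, 720, 1), ("360p", 640, 360, 1)]
graph = cascade_filter(("4k", 1), ladder, "fast_bilinear")
print(graph)
```

With this ladder, the function reproduces the fast bilinear graph shown above, and changing the last argument to "lanczos" yields the Lanczos variant.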

In contrast, if you’ve read this far, you know that the typical scaling technique used by most beginning FFmpeg producers is the -s option (-s 1280x720). For all rungs below 4K, FFmpeg scales the source footage down to the target resolution using the bicubic scaling algorithm.

So, we had two proposed methods which I expanded to four, as follows.

  • Default (-s 1280x720)
  • Cascade using fast bilinear
  • Cascade using Lanczos
  • Video filter using Lanczos (-vf scale=1280x720 -sws_flags lanczos)

I tested the following encoding ladder using the HEVC codec.

  • 4K @ 12 Mbps
  • 2K @ 7 Mbps
  • 1080p @ 3.5 Mbps
  • 1080p @ 1.8 Mbps
  • 720p @ 1 Mbps
  • 360p @ 500 kbps

I encoded two 3-minute 4Kp30 files, excerpts from the Netflix Meridian and Harmonic Football test clips using the x265 codec and ultrafast preset. You can see full command strings at the end of the article. I measured throughput in frames per second and measured the 2K to 360p rung quality with VMAF, PSNR, and SSIM, compiling the results into BD-Rate comparisons in Excel.

I tested on a Dell Precision 7820 tower driven by two 2.9 GHz Intel Xeon Gold (6226R) CPUs running Windows 10 Pro for Workstations with 64 GB of RAM. I tested with FFmpeg 5.0, a version downloaded from www.gyan.dev on December 15, 2022.

Performance

TABLE 1. FPS BY SCALING METHOD

Table 1 shows that cascading delivered negligible performance benefits with the two test files and the selected encoding parameters. I asked the engineer who suggested the cascading scaling approach why we saw no throughput increase. Here’s a brief exchange. 

Engineer: It’s not going to make any performance difference in your example anyways but it does reduce the scaling load

       Me: Why wouldn’t it make a performance difference if it reduces the scaling load?

Engineer: Because, as your example has shown, the x265 encoding load dominates. It would make a very small difference

       Me: Ah, so the slowest, most CPU-intensive process controls overall performance.

Engineer: Yes, when you compare 1000+1 with 1000+10 there is not too much difference.
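The engineer's 1000+1 versus 1000+10 comparison is easy to check in Python. The first call uses his numbers; the second uses a hypothetical, much cheaper encoder (a cost of 20 in the same arbitrary units) purely to show why the answer could change with a faster codec.

```python
def throughput_gain(encode_cost, scale_cost_before, scale_cost_after):
    """Fractional speed-up from cheaper scaling when the encoding cost is unchanged."""
    before = encode_cost + scale_cost_before
    after = encode_cost + scale_cost_after
    return before / after - 1


# The engineer's numbers: encoding dominates, so cheaper scaling barely helps
gain_x265 = throughput_gain(encode_cost=1000, scale_cost_before=10, scale_cost_after=1)

# Hypothetical faster encoder: the same scaling saving is now a large share of the work
gain_fast = throughput_gain(encode_cost=20, scale_cost_before=10, scale_cost_after=1)

print(f"Encode-dominated pipeline: {gain_x265:.1%} faster")
print(f"Hypothetical fast encoder: {gain_fast:.1%} faster")
```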

What this means, of course, is that these results may vary by the codec. If you’re encoding with H.264, which is much faster, cascading scaling might increase throughput. If you’re encoding with AV1 or VVC, almost certainly not.

Given that the T408 transcoder is multiple times faster than real-time, I’m now wondering if cascaded scaling might increase throughput when producing with the T408. You probably wouldn’t attempt this approach if quality suffered, but what if cascaded scaling improved quality? Sound far-fetched? Read on.

Quality Results

Table 2 shows the combined VMAF results for the two clips. Read this by choosing a row and moving from column to column. As you would expect, green is good, and red is bad. So, for the Default row, that technique produces the same quality as Cascade – Fast Bilinear with a bitrate reduction of 18.55%. However, you’d have to boost the bitrate by 12.89% and 11.24%, respectively, to produce the same quality as Cascade – Lanczos and Video Filter – Lanczos.

Table 2. BD-Rate comparisons for the four techniques using the VMAF metric.
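For readers who want to reproduce this kind of comparison outside Excel, here is a minimal pure-Python sketch of the classic four-point BD-Rate calculation: fit a cubic to log bitrate as a function of quality for each technique, integrate both curves over the shared quality range, and convert the average gap back to a percentage. I am not claiming this matches my workbook cell for cell, and the rate/quality points below are synthetic placeholders, not my test data.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial that passes through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integrate(xs, ys, lo, hi):
    """Simpson's rule, which is exact for the cubic through four points."""
    mid = (lo + hi) / 2
    return (hi - lo) / 6 * (_lagrange(xs, ys, lo)
                            + 4 * _lagrange(xs, ys, mid)
                            + _lagrange(xs, ys, hi))


def bd_rate(anchor, test):
    """Average bitrate change of `test` versus `anchor` at equal quality.

    Each argument is four (bitrate_kbps, quality) points with distinct qualities.
    A negative result means `test` needs less bitrate for the same quality.
    """
    if len(anchor) != 4 or len(test) != 4:
        raise ValueError("this sketch expects exactly four points per curve")
    qa, ra = [q for _, q in anchor], [math.log10(r) for r, _ in anchor]
    qt, rt = [q for _, q in test], [math.log10(r) for r, _ in test]
    lo, hi = max(min(qa), min(qt)), min(max(qa), max(qt))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")
    avg_log_diff = (_integrate(qt, rt, lo, hi) - _integrate(qa, ra, lo, hi)) / (hi - lo)
    return 10 ** avg_log_diff - 1


# Synthetic points: the "test" curve reaches each quality with 10% less bitrate
anchor = [(500, 70.0), (1000, 80.0), (1800, 88.0), (3500, 94.0)]
test = [(rate * 0.9, quality) for rate, quality in anchor]
result = bd_rate(anchor, test)
print(f"BD-Rate: {result:.2%}")
```

Working in log bitrate keeps the high-bitrate rungs from dominating the average, which is why the result reads as a percentage change rather than a kbps difference.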

From a quality perspective, the Cascade approach combined with the fast bilinear algorithm was the clear loser, particularly compared to either method using the Lanczos algorithm. Even if there were a substantial performance increase, which there wasn’t, it’s hard to see a relevant use case for this algorithm.

The most interesting takeaway was that cascading scaling with the Lanczos algorithm produced the best results, slightly higher than using a video filter with Lanczos. The same pattern emerged for PSNR, where Cascade – Lanc was green in all three columns, indicating the highest-quality approach. 

Table 3. BD-Rate comparisons for the four techniques using the PSNR metric.

Ditto for SSIM.

Table 4. BD-Rate comparisons for the four techniques using the SSIM metric.

The cascading approach delivering better quality than the video filter was an anomaly. Not surprisingly, the engineer noted:

Engineer: It is odd that cascading with Lanczos has better quality than direct scaling. I’m not sure why that would be.

       Me: Makes absolutely no sense. Is anything funky in the two command strings?

Engineer: Nothing obvious but I can look some more.

Later analysis yielded no epiphanies. Perhaps they can come from a reader.

The Net Net

First, the normal caveats: your mileage may vary by codec and content. My takeaways are:

  • Try cascading scaling with Lanczos with the T408.
  • For software encodes, never use -s again.
  • Use cascade or the simpler video filter approach. 
  • With most software-based encoders, faster scaling methods may not deliver performance increases but could degrade quality.

Further, as we all know, there are several, if not dozens, of additional approaches to scaling; if you have meaningful results that prove one is substantially better, please share them with me by email.

Finally, taking a macro view, it’s worth remembering that a $12,000+ workstation could only produce 25 fps when producing a live 4K ladder to HEVC using x265’s ultrafast preset. Sure, there are faster software encoders available. Still, hardware encoding is the best answer for affordable live 4K transcoding from both an OPEX and CAPEX perspective.

Command Strings:

Default:

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_12M_default.mp4 ^

-s 2560x1440 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_7M_default.mp4  ^

-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_3_5M_default.mp4 ^

-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_default.mp4 ^

-s 1280x720  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_default.mp4 ^

-s 640x360  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_default.mp4

Cascade – Fast Bilinear

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]" ^

-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_fast_bi.mp4 ^

-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_fast_bi.mp4  ^

-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_fast_bi.mp4 ^

-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_fast_bi.mp4 ^

-map [out720p]  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_fast_bi.mp4 ^

-map [out360p]  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_8_bit_cascade_500K_fast_bi.mp4

Cascade – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=lanczos,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=lanczos,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=lanczos,split=2[out720p][in720p];[in720p]scale=640:360:flags=lanczos[out360p]" ^

-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_lanc.mp4 ^

-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_lanc.mp4  ^

-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_lanc.mp4 ^

-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_lanc.mp4 ^

-map [out720p]  -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_lanc.mp4 ^

-map [out360p]  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_cascade_500K_lanc.mp4

Video Filter – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i  football_4K30_all_264_short.mp4 -y ^

-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M  -bufsize 24M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_12M_filter_lanc.mp4 ^

-vf scale=2560x1440 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M  -bufsize 14M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_7M_filter_lanc.mp4  ^

-vf scale=1920x1080 -sws_flags lanczos  -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M  -bufsize 7M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_3_5M_filter_lanc.mp4 ^

-vf scale=1920x1080 -sws_flags lanczos  -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M  -bufsize 3.6M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_filter_lanc.mp4 ^

-vf scale=1280x720 -sws_flags lanczos -c:v libx265 -an  -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M  -bufsize 2M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_filter_lanc.mp4 ^

-vf scale=640x360 -sws_flags lanczos  -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M  -bufsize 1M -preset ultrafast  -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_filter_lanc.mp4