How Scaling Method and Technique Impacts Quality and Throughput

The thing about FFmpeg is that there are almost always multiple ways to accomplish the same basic function. In this post, we look at four approaches to scaling to reveal how the scaling method and techniques used impact quality and throughput.

We found that if you’re scaling using the default -s option (-s 1280x720), you’re leaving a bit of quality on the table compared to other methods. How much depends upon the metric you prefer: about ten percent if you’re a VMAF (hand raised here) or SSIM fan, much less if you still bow to the PSNR gods. More importantly, if you’re chasing throughput via cascaded scaling with fast scaling algorithms (flags=fast_bilinear), you’re probably losing quality without a meaningful throughput increase.

That’s the TL;DR; here’s the backstory.

The Backstory

NETINT sells ASIC-based hardware transcoders. One key advantage over software-only/CPU-based encoding is throughput, so we perform lots of hardware vs. software benchmarking. Fairness dictates that we use the most efficient FFmpeg command string when deriving the command string for software-only encoding.

In addition, the NETINT T408 transcoder scales in software using the host CPU, so we are vested in techniques that increase throughput for T408 transcodes. In contrast, the NETINT Quadra scales and performs overlays in hardware and provides an AI engine, which is why it’s designated a Video Processing Unit (VPU) rather than a transcoder.

One proposed scaling technique for accelerating both software-only and T408 processing is cascading scaling, where you create a filter complex that starts at full resolution, scales to the next lower resolution, then uses the lower resolution to scale to the next lower resolution. Here’s an example.

-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]"

So, rather than performing multiple scales from full resolution to the target (4K > 2K, 4K > 1080p, 4K > 720p, 4K > 360p), you’re performing multiple scales from lower-resolution sources (4K > 2K > 1080p > 720p > 360p). The theory was that this would reduce CPU cycles and improve throughput, particularly when coupled with a fast scaling algorithm. Even assuming a performance increase (which turned out to be a bad assumption), the obvious concern is quality: how much does quality degrade because the lower-resolution transcodes are working from a lower-resolution source?
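As a rough way to quantify the theory, the sketch below counts how many input pixels the scaler reads per frame under each approach. It assumes the 4K source is 3840x2160 (the exact frame size is not stated here) and treats input pixels read as a proxy for scaling work, which ignores the cost differences between scaling algorithms. The function names are illustrative.

```python
# Rough proxy for scaler load: pixels read by each scale operation per frame.
# Assumption: the 4K source is 3840x2160 (not stated in the article).
SOURCE = (3840, 2160)
TARGETS = [(2560, 1440), (1920, 1080), (1280, 720), (640, 360)]


def pixels(size):
    """Return the pixel count of a (width, height) pair."""
    width, height = size
    return width * height


def direct_input_pixels(source, targets):
    """Every target is scaled straight from the full-resolution source."""
    return pixels(source) * len(targets)


def cascade_input_pixels(source, targets):
    """Each target is scaled from the rung just above it."""
    inputs = [source] + targets[:-1]
    return sum(pixels(size) for size in inputs)


if __name__ == "__main__":
    direct = direct_input_pixels(SOURCE, TARGETS)
    cascade = cascade_input_pixels(SOURCE, TARGETS)
    print(f"direct:  {direct:,} input pixels per frame")
    print(f"cascade: {cascade:,} input pixels per frame")
    print(f"cascade reads {cascade / direct:.0%} of the direct total")
```

Under these assumptions, the cascade reads a bit less than half the pixels that direct scaling does, which is the saving the technique is after. Whether that saving shows up in overall throughput is a separate question.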

In contrast, if you’ve read this far, you know that the typical scaling technique used by most beginning FFmpeg producers is the -s option (-s 1280x720). For all rungs below 4K, FFmpeg scales the source footage down to the target resolution using the bicubic scaling algorithm.

So, we had two proposed methods which I expanded to four, as follows.

  • Default (-s 1280x720)
  • Cascade using fast bilinear
  • Cascade using Lanczos
  • Video filter using Lanczos (-vf scale=1280x720 -sws_flags lanczos)

I tested the following encoding ladder using the HEVC codec.

  • 4K @ 12 Mbps
  • 2K @ 7 Mbps
  • 1080p @ 3.5 Mbps
  • 1080p @ 1.8 Mbps
  • 720p @ 1 Mbps
  • 360p @ 500 kbps

I encoded two 3-minute 4Kp30 files, excerpts from the Netflix Meridian and Harmonic Football test clips using the x265 codec and ultrafast preset. You can see full command strings at the end of the article. I measured throughput in frames per second and measured the 2K to 360p rung quality with VMAF, PSNR, and SSIM, compiling the results into BD-Rate comparisons in Excel.
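BD-Rate expresses the average bitrate difference between two rate-quality curves at equal quality. The Excel workbook used for these tests is not shown, so the following is only a simplified sketch: it interpolates log bitrate linearly between measured points, whereas common BD-Rate implementations fit a cubic polynomial or use piecewise cubic interpolation. The function names and the sample numbers are hypothetical.

```python
import math


def _log_rate_at(curve, quality):
    """Linearly interpolate log(bitrate) at a quality score on one curve."""
    for (q0, y0), (q1, y1) in zip(curve, curve[1:]):
        if q0 <= quality <= q1:
            return y0 + (y1 - y0) * (quality - q0) / (q1 - q0)
    raise ValueError("quality outside the curve's range")


def _mean_log_rate(points, q_lo, q_hi):
    """Average log(bitrate) over [q_lo, q_hi] using trapezoids between knots."""
    curve = sorted((q, math.log(rate)) for rate, q in points)
    knots = [q_lo] + [q for q, _ in curve if q_lo < q < q_hi] + [q_hi]
    area = sum(
        (b - a) * (_log_rate_at(curve, a) + _log_rate_at(curve, b)) / 2
        for a, b in zip(knots, knots[1:])
    )
    return area / (q_hi - q_lo)


def bd_rate(anchor, test):
    """Percent bitrate change of `test` versus `anchor` at equal quality.

    Each argument is a list of (bitrate, quality) pairs with distinct
    quality values. Negative results mean `test` needs less bitrate.
    """
    q_lo = max(min(q for _, q in anchor), min(q for _, q in test))
    q_hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if q_hi <= q_lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(test, q_lo, q_hi) - _mean_log_rate(anchor, q_lo, q_hi)
    return (math.exp(diff) - 1) * 100


if __name__ == "__main__":
    # Hypothetical (kbps, VMAF) points, not measurements from this article.
    anchor = [(1000, 70.0), (2000, 80.0), (4000, 88.0), (8000, 93.0)]
    test = [(rate * 1.1, score) for rate, score in anchor]
    print(f"BD-Rate: {bd_rate(anchor, test):+.2f}%")
```

In the hypothetical data, the second curve needs 10% more bitrate at every quality level, so the sketch reports a BD-Rate of about +10%.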

I tested on a Dell Precision 7820 tower driven by two 2.9 GHz Intel Xeon Gold (6226R) CPUs running Windows 10 Pro for Workstations with 64 GB of RAM. I tested with FFmpeg 5.0, a version downloaded from www.gyan.dev on December 15, 2022.

Performance

Table 1. FPS by scaling method.

Table 1 shows that cascading delivered negligible performance benefits with the two test files and the selected encoding parameters. I asked the engineer who suggested the cascading scaling approach why we saw no throughput increase. Here’s a brief exchange. 

Engineer: It’s not going to make any performance difference in your example anyways but it does reduce the scaling load

       Me: Why wouldn’t it make a performance difference if it reduces the scaling load?

Engineer: Because, as your example has shown, the x265 encoding load dominates. It would make a very small difference

       Me: Ah, so the slowest, most CPU-intensive process controls overall performance.

Engineer: Yes, when you compare 1000+1 with 1000+10 there is not too much difference.

What this means, of course, is that these results may vary by the codec. If you’re encoding with H.264, which is much faster, cascading scaling might increase throughput. If you’re encoding with AV1 or VVC, almost certainly not.
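The engineer’s 1000+1 versus 1000+10 comparison can be worked through directly. The sketch below assumes that encoding and scaling costs simply add per frame, which is a simplification for a multithreaded pipeline; the function name is illustrative.

```python
def pipeline_speedup(encode_cost, scale_cost_before, scale_cost_after):
    """Throughput ratio after reducing only the per-frame scaling cost.

    Assumes the two costs add serially, so total time per frame is
    encode_cost + scale_cost.
    """
    return (encode_cost + scale_cost_before) / (encode_cost + scale_cost_after)


if __name__ == "__main__":
    # The engineer's illustrative figures: 1000 units of encoding,
    # 10 units of scaling reduced to 1 unit.
    gain = pipeline_speedup(1000, 10, 1)
    print(f"throughput gain: {(gain - 1) * 100:.1f}%")
```

With the engineer’s figures, a tenfold reduction in scaling cost yields a gain of under one percent. The same function shows why a faster codec changes the picture: the smaller the encode cost relative to scaling, the larger the ratio becomes.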

Given that the T408 transcoder is multiple times faster than real-time, I’m now wondering if cascaded scaling might increase throughput when producing with the T408. You probably wouldn’t attempt this approach if quality suffered, but what if cascaded scaling improved quality? Sound far-fetched? Read on.

Quality Results

Table 2 shows the combined VMAF results for the two clips. Read this by choosing a row and moving from column to column. As you would suspect, green is good, and red is bad. So, for the Default row, that technique produces the same quality as Cascade – Fast Bilinear with a bitrate reduction of 18.55%. However, you’d have to boost the bitrate by 12.89% and 11.24%, respectively, to produce the same quality as Cascade – Lanczos and Video Filter – Lanczos.

Table 2. BD-Rate comparisons for the four techniques using the VMAF metric.

From a quality perspective, the Cascade approach combined with the fast bilinear algorithm was the clear loser, particularly compared to either method using the Lanczos algorithm. Even if there were a substantial performance increase, which there wasn’t, it’s hard to see a relevant use case for this algorithm.

The most interesting takeaway was that cascading scaling with the Lanczos algorithm produced the best results, slightly higher than using a video filter with Lanczos. The same pattern emerged for PSNR, where Cascade – Lanc was green in all three columns, indicating the highest-quality approach. 

Table 3. BD-Rate comparisons for the four techniques using the PSNR metric.

Ditto for SSIM.

Table 4. BD-Rate comparisons for the four techniques using the SSIM metric.

The cascading approach delivering better quality than the video filter was an anomaly. Not surprisingly, the engineer noted:

Engineer: It is odd that cascading with Lanczos has better quality than direct scaling. I’m not sure why that would be.

       Me: Makes absolutely no sense. Is anything funky in the two command strings?

Engineer: Nothing obvious but I can look some more.

Later analysis yielded no epiphanies. Perhaps they can come from a reader.

The Net Net

First, the normal caveats: your mileage may vary by codec and content. My takeaways are:

  • Try cascading scaling with Lanczos with the T408.
  • For software encodes, never use -s again.
  • Use cascade or the simpler video filter approach.
  • With most software-based encoders, faster scaling methods may not deliver performance increases but could degrade quality.

Further, as we all know, there are several, if not dozens, of additional approaches to scaling; if you have meaningful results that prove one is substantially better, please share them with me via THIS email.

Finally, taking a macro view, it’s worth remembering that a $12,000+ workstation could only produce 25 fps when producing a live 4K ladder to HEVC using x265’s ultrafast preset. Sure, there are faster software encoders available. Still, hardware encoding is the best answer for affordable live 4K transcoding from both an OPEX and CAPEX perspective.

Command Strings:

Default:

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 -y ^
-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_12M_default.mp4 ^
-s 2560x1440 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_7M_default.mp4 ^
-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_3_5M_default.mp4 ^
-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_default.mp4 ^
-s 1280x720 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_default.mp4 ^
-s 640x360 -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_default.mp4

Cascade – Fast Bilinear

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 -y ^
-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]" ^
-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_fast_bi.mp4 ^
-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_fast_bi.mp4 ^
-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_fast_bi.mp4 ^
-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_fast_bi.mp4 ^
-map [out720p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_fast_bi.mp4 ^
-map [out360p] -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_8_bit_cascade_500K_fast_bi.mp4

Cascade – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 -y ^
-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=lanczos,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=lanczos,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=lanczos,split=2[out720p][in720p];[in720p]scale=640:360:flags=lanczos[out360p]" ^
-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_lanc.mp4 ^
-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_lanc.mp4 ^
-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_lanc.mp4 ^
-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_lanc.mp4 ^
-map [out720p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_lanc.mp4 ^
-map [out360p] -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_cascade_500K_lanc.mp4
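
The two cascade filter graphs above differ only in the flags value, so they lend themselves to being generated. The following Python sketch rebuilds the same -filter_complex string from a list of rungs; the function name and the rung tuple layout are assumptions made for illustration, not part of FFmpeg.

```python
# Each rung: (label, width, height, number of encodes fed by this rung).
# The layout of this tuple is a convention invented for this sketch.
RUNGS = [
    ("1440p", 2560, 1440, 1),
    ("1080p", 1920, 1080, 2),
    ("720p", 1280, 720, 1),
    ("360p", 640, 360, 1),
]


def build_cascade_filter(source_label, rungs, flags):
    """Build a cascaded scale graph where each rung scales from the one above."""
    parts = [f"[0:v]split=2[out{source_label}][in{source_label}]"]
    previous = source_label
    for index, (label, width, height, encodes) in enumerate(rungs):
        # One output pad per encode: [outX], [outX2], [outX3], ...
        outs = [f"[out{label}]"]
        outs += [f"[out{label}{n}]" for n in range(2, encodes + 1)]
        chain = f"[in{previous}]scale={width}:{height}:flags={flags}"
        is_last = index == len(rungs) - 1
        if is_last and encodes == 1:
            # The final rung feeds a single encode, so no split is needed.
            chain += outs[0]
        else:
            # Every other rung also feeds the next scale via [inX].
            pads = outs + ([] if is_last else [f"[in{label}]"])
            chain += f",split={len(pads)}" + "".join(pads)
        parts.append(chain)
        previous = label
    return ";".join(parts)


if __name__ == "__main__":
    print(build_cascade_filter("4k", RUNGS, "lanczos"))
```

Passing "fast_bilinear" or "lanczos" as the flags argument reproduces the two graphs used in the tests.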

Video Filter – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 -y ^
-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_12M_filter_lanc.mp4 ^
-vf scale=2560x1440 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_7M_filter_lanc.mp4 ^
-vf scale=1920x1080 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_3_5M_filter_lanc.mp4 ^
-vf scale=1920x1080 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_filter_lanc.mp4 ^
-vf scale=1280x720 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_filter_lanc.mp4 ^
-vf scale=640x360 -sws_flags lanczos -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_filter_lanc.mp4

Is power consumption your company’s priority?

Power consumption is a priority for NETINT customers and a passion for NETINT engineers and technicians. Matthew Ariho, a system engineer in SoC Engineering at NETINT, recently answered some questions about:

  • How to test power consumption
  • Which computer components draw the most power
  • Why using older computers is bad for your power bills, and
  • The best way for video-centric data centers to reduce power consumption.

What are the different ways to test power consumption (and cost)?

Matthew Ariho

There are software and hardware-based solutions to this problem. I use one of each as a means of confirming any results.

One software tool is the IPMItool Linux package, which provides a simple command-line interface to IPMI-enabled devices through a Linux kernel driver. This tool reports the instantaneous, average, peak, and minimum power draw of the system over a sampling period.
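
As a sketch of how such polling might be scripted, the Python below runs ipmitool’s DCMI power reading command and extracts the four values. It assumes the BMC supports DCMI and that the output uses the field labels shown in the sample text; labels can differ between platforms, and the sample numbers are made up.

```python
import re
import subprocess

# Assumed layout of `ipmitool dcmi power reading` output. The numbers are
# placeholders, and field labels may vary between BMC implementations.
SAMPLE_OUTPUT = """
    Instantaneous power reading:                   152 Watts
    Minimum during sampling period:                148 Watts
    Maximum during sampling period:                221 Watts
    Average power reading over sample period:      163 Watts
"""

FIELDS = {
    "instantaneous": r"Instantaneous power reading:\s+(\d+)\s+Watts",
    "minimum": r"Minimum during sampling period:\s+(\d+)\s+Watts",
    "maximum": r"Maximum during sampling period:\s+(\d+)\s+Watts",
    "average": r"Average power reading over sample period:\s+(\d+)\s+Watts",
}


def parse_power_reading(text):
    """Extract wattage fields from ipmitool output; missing fields are skipped."""
    readings = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            readings[name] = int(match.group(1))
    return readings


def read_power():
    """Query the local BMC. Typically requires root and a loaded IPMI driver."""
    result = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_power_reading(result.stdout)


if __name__ == "__main__":
    print(parse_power_reading(SAMPLE_OUTPUT))
```

Calling read_power() in a loop and appending the results to a file would provide the logging that a plug-in meter lacks.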

On the hardware side of things, you can use different forms of multimeters. The Kill-A-Watt meter and a 208VAC power bar are examples of such devices available in our lab.

What are their pros and cons (and accuracy)?

Matthew Ariho

The IPMItool is great because it provides a lot of information, and it is fairly simple to set up and use. There is a question of reliability because it is software-based and depends on readings whose source I’m not familiar with.

The multimeters (like the Kill-A-Watt meter), while also simple to use, do not have any logging capabilities, which makes average or steady-state power draw difficult to measure. Both methods have a resolution of 1W, which is not ideal but more than sufficient for our use cases.

What activities do you run when you test power consumption?

Matthew Ariho

We run multiple instances that mimic streaming workloads, but only to the point that each of those instances is performing up to par with our standards (for example, 30 fps).

What’s the range of power consumption you’ve seen?

Matthew Ariho

I’ve seen reports of power consumption of up to 450 watts, but personally never tested a unit that drew that much. Typically, without any load on the T408 devices, the power consumption hovers around 150W, which increases to 210 to 220W during peak periods.
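
To turn readings like these into energy and cost, multiply the draw by the hours of operation. The sketch below assumes continuous operation at a constant draw and uses a placeholder electricity price; both the price and the function names are assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts):
    """Energy used in a year at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts, price_per_kwh):
    """Yearly electricity cost at a constant draw and a flat price."""
    return annual_kwh(watts) * price_per_kwh


if __name__ == "__main__":
    PRICE = 0.15  # placeholder price per kWh, not a figure from the interview
    for label, watts in (("idle", 150), ("peak", 220)):
        print(f"{label}: {annual_kwh(watts):.0f} kWh/yr, "
              f"{annual_cost(watts, PRICE):.2f} per year")
```

At a constant 150W the system uses about 1,314 kWh per year; at 220W, about 1,927 kWh.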

What’s the difference between Power Supply rating and actual power consumption (and are they related)?

Matthew Ariho

Power supplies take in 120VAC or 208VAC and convert to various DC voltages (12V, 5V, 3.3V) to power different devices in a computer. This conversion process inherently has several inefficiencies. The extent of these inefficiencies depends on the make of the power supply and the quality of components used.

Power supplies are offered with an efficiency rating that certifies how efficiently a power supply will function at different loads. Because of these conversion losses, power consumption measured at the wall will always be greater than the power supplied within the computer.
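
The relationship between what the components draw and what a wall meter shows can be sketched as wall power = DC load / efficiency. The efficiency figures below are illustrative round numbers, not certified values for any particular power supply, and the function name is hypothetical.

```python
def wall_power(dc_load_watts, efficiency):
    """Power drawn from the outlet for a given DC load and PSU efficiency.

    `efficiency` is a fraction between 0 and 1 (for example 0.85 for 85%).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return dc_load_watts / efficiency


if __name__ == "__main__":
    load = 200  # watts delivered to the components (illustrative)
    for efficiency in (0.80, 0.85):
        drawn = wall_power(load, efficiency)
        print(f"{efficiency:.0%} efficient: {drawn:.1f} W at the wall, "
              f"{drawn - load:.1f} W lost as heat")
```

For the same 200W load, the five-point efficiency difference in this example changes the wall draw from 250W to roughly 235W.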

What are the hidden sources of excessive power that most people don’t know about?

Matthew Ariho

The operating system of a computer can consume a lot of power performing background tasks, though this has become less of a problem with more efficient CPUs on the market. Another source of excessive power draw is bloatware, which is usually unnecessary programs that run in the background.

What distinguishes a power-hungry computer from an efficient one – what should the reader look for?

Matthew Ariho

The power supply rating is something to watch. Small variations in the power supply rating make significant differences in efficiency. The difference between a PSU rated at 80 PLUS and a PSU rated at 80 PLUS Bronze is about 2% to 5% depending on the load. This number only grows with better rated PSUs.

Other factors include the components of the computer. Recently, newer devices (CPUs, GPUs, and motherboards) have been made with significant generational improvements in efficiency. A top-of-the-line computer from 3 years ago simply cannot compete with some mid-range computers in terms of either power efficiency or performance. So, while sourcing older but cheaper components in the past may have been a good decision, nowadays, it’s not as clear cut.

Which components draw the most power?

Matthew Ariho

CPUs and GPUs. Even consumer CPUs can draw over 200W sustained. GPUs on the lower end consume around 150W, and more recent models can draw over 400W.

How does the number of cores in a computer impact power usage?

Matthew Ariho

I’m really not an expert on server components, and it is hard to say without having examples. There are too many options to provide a conclusion on a proper trend. There are AMD 64 core server CPUs that pull about 250 to 270 W and 12 to 38 core Intel server CPUs that do about the same. Ultimately, architectural advantages and features determine performance and efficiency when comparing CPUs across manufacturers or even CPUs from the same manufacturer.

You can't manage what you don't measure.

One famous quote attributed to Peter Drucker is that you can’t manage what you don’t measure. As power consumption becomes increasingly important, it’s incumbent upon all of us to both measure and manage it.

Insights from the Bitmovin Video Developer Report

The Bitmovin Video Developer Report, now in its 6th edition, is one of the most far-reaching and useful documents available to streaming professionals (now with no registration required). It’s a report that I happily download each December and generally refer to frequently during the next twelve months.

Like the proverbial elephant, what you find important in the report depends upon your interests. I typically zero in on video codec usage, encoding practices, and the most important problems and opportunities facing streaming developers. As discussed below, this year’s edition has some surprises, like the fact that more respondents are currently working with H.266/VVC than AV1.

Beyond this, the report also tracks details on development frameworks, content distribution, monetization practices, DRM, video analytics, and many other topics. This makes it extraordinarily valuable to anyone needing a finger on the pulse of streaming industry practices.

Let’s start with some details about how Bitmovin compiles the data and then jump to what I found most interesting.

Gathering the Data

Bitmovin collected the data between June and September 2022. A total of 424 respondents from over 80 countries answered the survey. Geographically, EMEA led the charge with 43%, followed by North America (34%), APAC (14%), and Latin America (8%). Regarding job function, 34% of respondents were manager/CEO/VP level, 23% developer/engineer, 14% technical manager, 10% product manager, 9% architect/consultant, 7% in R&D, and 3% in sales and marketing.

A quarter of respondents worked in OTT streaming services, 21% in online video platforms, 15% for broadcasters, 12% for integrators, 7% for publishers, 6% for telcos, 5% for social media sites, with 10% other. In terms of company size, 35% worked in companies with 300+ employees, 17% 101-300, 19% 51-100, and 29% 1-50. In other words, a very useful cross-section of geography, industry, job function, and company size.

To be clear, the results are not actual data from Bitmovin’s cloud encoding facility, which would be useful in its own right. Rather, the respondents answered questions about their current practices and future plans in each of the listed topics.

Current and Planned Codec Usage

Figure 1 shows current and planned codec usage for live encoding, with current usage in blue and planned usage in red. The numbers exceed 100% (of course) because most respondents use multiple codecs.

It’s always a surprise to see H.264 at less than 100%, but there it is at 78%, clear as day. Even given the breadth of industries that responded to the survey, it’s tough to imagine any publisher not supporting H.264.

Figure 1. Answers to the question, “Which streaming formats are you using in production for distribution and which ones are you planning to introduce within the next year?”

HEVC was next at 40%, with AV1 in fifth at 18%, bracketed by VP8 (19%) and VP9 (17%), presumably more for WebRTC than OTT. These are the codecs most likely to be used to actually publish video in 2022. Other codecs presumably implemented by infrastructure providers were H.266/VVC, a surprising third at 19%, with LCEVC and EVC both at 16%.

Looking ahead, HEVC looks to be most likely to succeed in 2023 with 43% of respondents planning to implement, with AV1 next at 34%, H.264/AVC at 33%, and VVC at 20%. Given that CanIUse lists AV1 support at 73% while VVC isn’t even listed, you’d have to assume that actual AV1 deployments in the near term will dwarf H.266/VVC, but you can’t ignore the interest this standards-based codec is receiving from the industry. VOD encoding tracks these results fairly closely for both current and planned usage.

Video Quality Related Findings

Quality is a constant concern for video professionals, and quality-related data appeared in several questions. In terms of challenges faced by respondents, “finding the root cause of quality issues” ranked fifth with 23%, while “quality of experience” ranked ninth, with 19%.

Interestingly, in response to the question, “For which of the following video use cases do you expect to use machine learning (ML) or artificial intelligence (AI) to improve the video experience for your viewers,” 33% cited “video quality optimization,” which ranked third, while 30% cited “quality of experience (QoE),” which ranked fourth.

With so many respondents looking for futuristic means to improve quality, it was ironic that so many ignored content-aware encoding (CAE), a proven method of improving both quality and quality of experience. Specifically, only 33% of respondents were currently using CAE, with 35% planning to implement CAE within the next 12 months. If you’re not in either of these camps, consider yourself scolded.

Live Encoding Practices

Lastly, I focused on live encoding practices, finding that 53% of respondents used commercial encoders, which presumably include both hardware and software. In comparison, 34% encode via open source, which is all software. What’s interesting is how poorly this group dovetails with both the most significant challenge faced by respondents and the largest opportunity for innovation perceived by respondents.

Figure 2. Answers to the question, “Where do you encode video?”

Specifically, controlling cost was the most significant challenge in the report, selected by 33% of respondents. On a cost-per-stream basis, considering both CAPEX and OPEX, software encoding is far more expensive than encoding with hardware, particularly ASICs.

The most significant opportunity for innovation reported by respondents was live streaming at scale, again at 33%. In this regard, the same lack of throughput that makes CPU-driven open-source encoding the most expensive solution makes it the least scalable. Simply stated, publishers currently encoding with CPU-driven open-source codecs can help address both their biggest challenge and their most significant opportunity by switching to ASIC-based transcoding.

Figure 3. Responses to the question, “Where do you see the most opportunity for innovation in your service?”

Curious? Download our white paper, How to Slash CAPEX, OPEX, and Carbon Emissions Using the NETINT T408 Video Transcoder here. Or, compute how long it will take to recoup your investment in ASIC-based encoding through reduced power costs via calculators available here.

And don’t forget to download the Bitmovin Video Developer Report, here.

How NETINT enables ASIC upgradeability with Software

ASICs provide tremendous energy efficiency, yet suffer from being fixed-function with limited programmability. This was a core engineering challenge that we addressed in the development of the Codensity ASIC family with upgradeable firmware that can be used for a variety of purposes, including adding new features and improving coding performance and functionality.

To explore these capabilities, we spoke with two members of the NETINT development team, Neil Gunn, who is NETINT’s Video Firmware Tech Lead, and Savio Lam, a firmware engineer. In this short discussion, they describe how firmware allows Codensity video transcoders and VPUs to evolve and improve long after leaving the foundry.

This conversation focuses mainly on our Codensity G4 ASIC; however, the capability to upgrade firmware applies to all of our ASIC platforms, including the Codensity G5.

What do you do with NETINT?

Neil Gunn

I am a firmware architect and also develop the firmware and, to a lesser extent, the host-side software (libxcoder and FFmpeg) for NETINT transcoding ASICs. I started at NETINT in 2018 working on T408 (Codensity G4 based) firmware development. Then, I moved to Quadra (Codensity G5 based) as a software architect and firmware/software developer. I continue to support T408 in the background.

Savio Lam

I am a firmware engineer working on our video transcoding products.

What did you do on the T408?

Neil Gunn

I implemented a number of video features in the firmware such as 10-bit transcoding, closed captions, HDR10, HDR10+, HLG10, Dolby Vision, HRD, Region of Interest, encoder parameter change, etc. I also worked on bug fixes and customer issues.

Savio Lam

I worked on the system design and integration. I mainly developed code that controls how video data comes in and out of our transcoder in the most efficient and reliable way.

What is firmware in an ASIC?

Neil Gunn

The firmware is software that runs on embedded CPUs within the ASIC. The firmware provides a high-level interface to the low-level encoding and decoding hardware. The firmware does a lot of the high-level bitstream processing, such as creating VPS, SPS, and PPS headers, and SEI processing, leaving the ASIC hardware to do the low-level number crunching. Functions that consume a lot of processing and are likely not to change are implemented in hardware.
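
For readers unfamiliar with the headers mentioned, VPS, SPS, PPS, and SEI are NAL unit types in an HEVC bitstream. The sketch below is not NETINT firmware; it is a small Python illustration that splits an Annex B byte stream on start codes and reads the 6-bit nal_unit_type from each NAL header. The sample bytes are placeholder payloads, not valid parameter sets.

```python
# HEVC nal_unit_type values for the units named in the text.
HEVC_NAL_NAMES = {
    32: "VPS",
    33: "SPS",
    34: "PPS",
    39: "prefix SEI",
    40: "suffix SEI",
}


def hevc_nal_type(nal):
    """Return nal_unit_type: bits 1-6 of the first byte of the 2-byte header."""
    if len(nal) < 2:
        raise ValueError("an HEVC NAL unit header is two bytes long")
    return (nal[0] >> 1) & 0x3F


def split_annex_b(stream):
    """Yield NAL units from an Annex B stream delimited by 00 00 01 start codes."""
    start_code = b"\x00\x00\x01"
    pos = stream.find(start_code)
    while pos != -1:
        begin = pos + len(start_code)
        nxt = stream.find(start_code, begin)
        end = len(stream) if nxt == -1 else nxt
        # Drop trailing zero bytes, including the leading zero of a
        # following four-byte start code.
        nal = stream[begin:end].rstrip(b"\x00")
        if nal:
            yield nal
        pos = nxt


if __name__ == "__main__":
    # Placeholder stream: correct headers, dummy one-byte payloads.
    sample = (
        b"\x00\x00\x00\x01\x40\x01\xaa"
        b"\x00\x00\x00\x01\x42\x01\xbb"
        b"\x00\x00\x01\x44\x01\xcc"
        b"\x00\x00\x01\x4e\x01\xdd"
    )
    for nal in split_annex_b(sample):
        nal_type = hevc_nal_type(nal)
        print(nal_type, HEVC_NAL_NAMES.get(nal_type, "other"))
```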

Savio Lam

To add to what Neil has already described, the firmware in our T408 ASIC manages several significant functions. For example, it comprises code responsible for the NVMe protocol, which allows us to efficiently receive and return up to 8GB/s of video input and output data. To properly consume and process the video data, the firmware sets up and schedules tasks to the appropriate hardware blocks.

Our firmware is also the brain that oversees the bigger picture part of the rate control. In this role, it’s part of a feedback loop that inputs subpicture data from low-level hardware blocks and uses that data to make better decisions that improve picture quality.

To sum up, the firmware is the brain that controls all the hardware blocks in the ASIC and gives instructions to each of them to perform their tasks as efficiently as possible.

How is firmware different from the gates burned into the chip?

Neil Gunn

Firmware, like all software, can be changed, unlike actual gates in a chip. It’s called firmware because it’s a little harder to change than software. Firmware is stored in Flash memory, which can be reprogrammed through an upgrade process. A T408 firmware release typically consists of new host-side software and firmware that must be version-matched for proper operation. Software provided to our customers with the release simplifies the upgrade for one or more T408s in a system.

Savio Lam

There is logic in our T408 ASIC that could have been designed as part of the hardware for better performance. However, that would significantly limit our ability to add and improve certain product features to suit different customer needs. We believe we have found the right balance in deciding what should be implemented in firmware or hardware.

What functions can you adjust and/or improve within firmware?

Neil Gunn

Things like the codec headers, SEIs, and rate control, to a certain extent, can be adjusted and/or improved within the firmware. Some lower-level rate control features are fixed in the hardware. Lower-level parts of the encoding standard are fixed in the hardware as these require a lot of processing and are unlikely to change.

Savio Lam

As Neil said, we are quite flexible when it comes to adding or improving support for different video metadata. And as we both explained earlier, since the firmware is also part of the brain that operates the picture rate control for encoding, we can continue to improve quality to a certain degree post-ASIC development.

Do you have any examples of significant improvements with the T408?

Neil Gunn

We significantly reduced codec delay on both the encoder and decoder. Our low delay mode removes all frame buffering and encodes and decodes a single frame at a time. Our encoder uses a low delay GOP and sets flags in the bitstream appropriately so that another decoder knows that it doesn’t need to add any delay while decoding.

Savio Lam

Based on feedback from different customers, we have made several rate control improvements (or fixes) through firmware updates, which improved or resolved some of the video quality-related problems they encountered.

When you hear people say ASICs are obsolete the day they come out of the foundry, what’s your response?

Neil Gunn

It’s not true. It is true that the hardware is fixed in an ASIC. Still, the functions implemented in the hardware are typically the lower-level parts of a video codec standard that do not change over time, so the hardware does not need to be updated. The higher-level parts of the video codecs are in firmware and driver software and can still be changed. For example, the T408 encoder hardware is designed for H.264 and H.265. We cannot add new codecs to the T408, but we can add new features to the existing codecs.

Savio Lam

There is a fine balance between what needs to be implemented in hardware for performance and what needs to be implemented in the firmware for flexibility (programmability). We think we struck the perfect balance with the Codensity G4, which is what makes it a great ASIC.

This conversation focuses mainly on our Codensity G4 ASIC; however, the capability to upgrade firmware applies to all of our ASIC platforms, including the Codensity G5.

Computing Payback Period on T408s


One of the most power-hungry processes performed in data centers is software-based live transcoding, which can be performed much more efficiently with ASIC-based transcoders. With power costs soaring and carbon emissions an ever-increasing concern, data centers that perform high-volume live transcoding should strongly consider switching to ASIC-based transcoders like the NETINT T408. Computing the Payback Period is easy with this calculator.

To assist in this transition, NETINT recently published two online calculators that measure the cost savings and payback period for replacing software-based transcoders with T408s. This article describes how to use these calculators and shows that data centers can recover their investment in T408 transcoders in just a few months, even less if you can repurpose servers previously used for encoding for other uses. Most of the data shown are from a white paper that you can access here.

About the T408

Briefly, NETINT designs, develops, and sells ASIC-powered transcoders like the T408, which is a video transcoder in a U.2 form factor containing a single ASIC. Operating in x86 and ARM-based servers, T408 transcoders output H.264 or HEVC at up to 4Kp60 or 4x 1080p60 streams per T408 module and draw only 7 watts.

Simply stated, a single T408 can produce roughly the same output as a 32-core workstation encoding in software but drawing anywhere from 250 – 500 watts of power. You can install up to 24 T408s in a single workstation, which essentially replaces 20 – 24 standalone encoding workstations, slashing power costs and the associated carbon emissions.

In a nutshell, these savings are why large video publishers like YouTube and Meta are switching to ASICs. By deploying NETINT’s T408s, you can achieve the same benefits without the associated R&D and manufacturing costs. The new calculators will help you quantify the savings.

Determining the Required Number of T408s

The first calculator, available here, computes the number of T408s required for your production. There are two steps; first, enter the rungs of your encoding ladder into the table as shown. If you don’t know the details of your ladder, you can click the Insert Sample HD or 4K Ladder buttons to insert sample ladders.

After entering your ladder information, insert the number of encoding ladders that you need to produce simultaneously, which in the table is 100. Then press the Compute button (not shown in the Figure but obvious on the calculator).

Calculator 1: Computing the number of required T408 transcoders.

This yields a total of 41 T408s. For perspective, the calculator should be very accurate for streams that don’t require scaling, like 1080p inputs output to 1080p. However, while the T408 decodes and transcodes in hardware, it relies on the host CPU for scaling. If you’re processing full encoding ladders, as we are in this example, throughput will be impacted by the power of the host CPU.

As designed, the calculator assumes that your T408 server is driven by a 32-core host CPU. On an 8-16 core system, expect perhaps 5 – 10% lower throughput. On a 64-core system, throughput could increase by 15 – 20%. Accordingly, please consider the output from this calculator as a good rough estimate accurate to about plus or minus 20%.
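The tolerance described above can be sketched in a few lines of Python. This is a rough illustration only: it treats the plus-or-minus 20% accuracy as a symmetric band around the T408 count that the calculator reports, and the function name and the round-up choice are assumptions rather than anything the calculator exposes.

```python
import math

def t408_estimate_band(calculated_units, tolerance_pct=20):
    """Return (low, high) unit counts for a +/- tolerance band.

    Assumption: the stated accuracy applies symmetrically to the unit
    count, and partial units round up because T408s are bought whole.
    """
    low = math.ceil(calculated_units * (100 - tolerance_pct) / 100)
    high = math.ceil(calculated_units * (100 + tolerance_pct) / 100)
    return low, high

low, high = t408_estimate_band(41)
print(f"Plan for roughly {low} to {high} T408s")
```

With the 41 units from the example, this suggests budgeting for somewhere between 33 and 50 T408s until you can test on your own host hardware.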

To compute the payback period, click the Compute Payback Period button shown in Figure 1. To restart the calculation, refresh your browser.

Computing Payback Period

Computing the payback period requires significantly more information, which is summarized in the following graphic.

Calculator 2: Information needed to compute the payback period.

Step by step

  1. Choose your currency in the drop-down list.

  2. Enter your current cost per kWh. The $0.25/kWh is the approximate UK cost as of March 2022 from this source, which you can also access by clicking the information button to the right of this field. This information button also contains a link to US power costs here.

  3. Enter the number of encoders currently transcoding your live streams. In the referenced white paper, 34 was the number of required servers needed to produce 100 H.264 encoding ladders.

  4. Enter the power consumption per encoding server. The 289 watts shown were the actual power consumption measured for the referenced white paper. If you don’t know your power consumption, click the Info button for some suggested values.

  5. Enter the number of encoding servers that can be repurposed. The T408s will dramatically improve encoding density; for example, in the white paper, it took 34 servers transcoding with software to produce the same streams as five servers with ten T408s each. Since you won’t need as many encoding servers, you can shift them to other applications, which has an immediate economic benefit. If you won’t be able to repurpose any existing servers for some reason, enter 0 here.

  6. Enter the current cost of the encoding servers that can be repurposed. This number will be used to compute the economic benefit of repurposing servers for other functions rather than buying new servers for those functions. You should use the current replacement cost for these servers rather than the original price.

  7. Enter the number of T408s required. If you start with the first calculator, this number will be auto-filled.

  8. Enter your cost for the T408s. $400 is the retail price of the T408 in low quantities. To request pricing for higher volumes, please check with a NETINT sales representative. You can arrange a meeting HERE. 

  9. Enter the power consumption for each T408. The T408 draws 7 watts of power which should be auto-filled.

  10. Enter the number of computers needed to host the T408s. You can deploy up to ten T408s in a 1RU server and up to 24 T408s in a 2RU server. We assumed that you would deploy using the first option (10 T408s in a single 1RU) and auto-filled this entry with that calculation. If the actual number is different, enter the number of computers you anticipate buying for the T408s.

  11. Enter the price for computers purchased to run T408s (USD). If you need to purchase new computers to house the T408, enter the cost here. Note that since the T408 decodes incoming H.264 and HEVC streams and transcodes on-board to those formats, most use cases work fine on workstations with 8-16 cores, though you’ll need a U.2 expansion chassis to house the T408s. Check this link for more information about choosing a server to house the T408s. We assumed $3,000 because that was the cost for the server used in the white paper.

    If you’re repurposing existing hardware, enter the current cost, similar to number 6.

  12. Enter power consumption for the servers (watts). As mentioned, you won’t need a very powerful computer to run the T408s, and CPU utilization and power consumption should be modest because the T408s are doing most of the work. This number is the base power consumption of the computer itself; the power utilized by the T408s will be added separately.

When you’ve entered all the data, press the Calculate button.
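Before looking at the results, it may help to see the kind of arithmetic these inputs feed into. The following Python sketch is an assumption about how such a calculation could work, not the calculator's actual code: it assumes servers run around the clock, a 30-day month, and a power price quoted per kilowatt-hour. The 150-watt host power in the example is a made-up placeholder, and all function names are illustrative.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: 24/7 operation, 30-day month

def hosts_needed(t408_count, t408s_per_host=10):
    """Servers needed to house the T408s (10 per 1RU, 24 per 2RU)."""
    return math.ceil(t408_count / t408s_per_host)

def monthly_power_cost(watts, price_per_kwh):
    """Monthly electricity cost for a constant load given in watts."""
    return watts / 1000 * HOURS_PER_MONTH * price_per_kwh

def monthly_savings(old_servers, old_watts_each, t408_count, t408_watts,
                    host_count, host_watts_each, price_per_kwh):
    """Old software-encoding power bill minus the new T408-based bill."""
    old_load = old_servers * old_watts_each
    new_load = host_count * host_watts_each + t408_count * t408_watts
    return (monthly_power_cost(old_load, price_per_kwh)
            - monthly_power_cost(new_load, price_per_kwh))

hosts = hosts_needed(41)  # 41 T408s at ten per 1RU server
# 150 W per host is a hypothetical value, not a measured figure.
savings = monthly_savings(34, 289, 41, 7, hosts, 150, 0.25)
print(f"{hosts} host servers, about {savings:.2f} saved per month")
```

The host power figure has a large effect on the result, so measure your own servers under load rather than relying on the placeholder.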

Interpreting the Results

The calculator computes the payback period under three assumptions:

  • Simple: Payback Period on T408 Purchases
  • Simple: Payback Period on T408 + New Computers
  • Comprehensive: Consider all costs

Figure 3. Simple payback on T408 purchases.

This result divides the cost of the T408 purchases by the monthly savings and shows a payback period of around 11 months. That said, if five servers with T408s essentially replaced 34 servers, unless you’re discarding the 29 servers, the third result is probably a more accurate reflection of the actual economic impact.

Figure 4. Simple: Payback Period on T408 + New Computers

This result includes the cost of the servers necessary to run the T408s, which extends the payback period to about 20.5 months. Again, however, if you’re able to allocate existing encoding servers into other roles, the third calculation is a more accurate reflection.

Figure 5. Comprehensive: consider all costs.

This result incorporates all economic factors. In this case, the value of the repurposed computers ($145,000) exceeds the costs of the T408s and the computers necessary to house them ($103,600), so you’re ahead the day you make the switch.
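The three views differ only in which costs and credits are counted. Here is a minimal Python sketch, assuming payback is simply the net upfront cost divided by the monthly savings; the calculator's exact formula is not shown in this article, and the function and parameter names are illustrative.

```python
def payback_months(upfront_cost, monthly_savings, repurposed_value=0.0):
    """Months to recover the net upfront cost; 0.0 if credits cover it."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    net_cost = upfront_cost - repurposed_value
    if net_cost <= 0:
        return 0.0  # repurposed servers already outweigh the purchase
    return net_cost / monthly_savings

# Comprehensive case using the totals quoted above; the monthly savings
# value (1,500) is a hypothetical placeholder.
print(payback_months(103_600, 1_500, repurposed_value=145_000))  # prints 0.0
```

For the two simple views, leave repurposed_value at zero and pass either the T408 cost alone or the T408 cost plus the cost of the new computers.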

However you run the numbers, data centers driving high-volume live transcoding operations will find that ASIC-based transcoders will pay for themselves in a matter of months. If power costs keep rising, the payback period will obviously shrink even further.

2022-Opportunities and Challenges for the Streaming Video Industry


As 2022 comes to a close, for those in the streaming video industry, it will be remembered as a turbulent year marked by new opportunities, including the emergence of new video platforms and services.

2022 started off with Meta’s futuristic vision of the internet known as the Metaverse. The Metaverse can be described as a combination of virtual reality, augmented reality, and video where users interact within a digital universe. The Metaverse continues to evolve with the trend of unique individual, one-to-one video streaming experiences in contrast to one-to-many video streaming services which are commonplace today. 

Recent surveys have shown that two-thirds of consumers are planning to cut back on streaming subscriptions due to rising costs and diminishing discretionary income. With consumers becoming more value-conscious and price-sensitive, Netflix and other platforms have introduced innovative new subscriber models. Netflix’s subscription offering, in addition to SVOD (Subscription Video on Demand), now includes an ad-based tier, AVOD (Advertising Video on Demand).

Netflix shows the way

This new ad-based tier targets the most price-sensitive customers, and it is projected that AVOD growth will lead SVOD by 3x in 2023. Netflix can potentially earn over $4B in advertising revenue, making it the second-largest ad-supported platform after YouTube. This year also saw Netflix making big moves into mobile cloud gaming with the purchase of its 6th gaming studio. Adding gaming to its product portfolio serves at least two purposes: it expands the number of platforms that can access its game titles and serves as another service to retain existing users.

These new services and platforms are a small sample of the continued growth in new streaming video services where business opportunities abound for video platforms willing to innovate and take risks.

Stop data center expansion

The new streaming video industry landscape requires platforms to provide innovative new services to highly cost-sensitive customers in a regulatory environment that discourages data center expansion. To prosper in 2023 and beyond, video platforms must address three key issues as they add services and subscribers.

  • Controlling data center sprawl – new services and extra capacity can no longer be contingent on the creation of new and larger data centers.
  • Controlling OPEX and CAPEX – in the current global economic climate, costs need to be controlled to keep prices under control and drive subscriber growth. In addition, in today’s economic uncertainty, access to financing and capital to fund data center expansion cannot be assumed.
  • Energy consumption and environmental impact are intrinsically linked, and both must be reduced. Governments are now enacting environmental regulations and platforms that do not adopt green policies do so at their own peril.

Application Specific Integrated Circuit

For a vision of what needs to be done to address these issues, one only needs to glimpse into the recent past at YouTube’s Argos VCU (Video Coding Unit). Argos is YouTube’s in-house designed ASIC (Application Specific Integrated Circuit) encoder that, among other objectives, enabled YouTube to reduce their encoding costs, server footprint, and power consumption. YouTube is encoding over 500 hours (about 3 weeks) of content per minute.

To stay ahead of this workload, Google designed their own ASIC, which enabled them to eliminate millions of Intel CPUs. Obviously, not everyone has their own in-house ASIC development team, but whether you are a hyperscale, commercial, institutional, or government video platform, the NETINT Codensity ASIC-powered video processing units are available.

To enable faster adoption, NETINT partnered with Supermicro, the global leader in green server solutions. The NETINT Video Transcoding Server is based on a 1RU Supermicro server powered with 10 NETINT T408 ASIC-based video transcoder modules. The NETINT Video Transcoding Server, with its ASIC encoding engine, enables a 20x reduction in operational costs compared to CPU/software-based encoding. The massive savings in operational costs offset the CAPEX associated with upgrading to the NETINT video transcoding server.

Supermicro and T408 Server Bundle

In addition to the extraordinary cost savings, the advantages of ASIC encoding include a reduction in server footprint of 25x or more, with a corresponding reduction in power consumption and, as a bonus, a 25x reduction in carbon emissions. This enables video platforms to expand encoding capacity without increasing their server or carbon footprints, avoiding potential regulatory setbacks.

In need of environmentally friendly technologies

2022 has seen the emergence of many new opportunities with the launch of new innovative video services and platforms. To ensure the business success of these services, in the light of global economic uncertainty and geopolitical unrest, video platforms must rethink how these services are deployed and embrace new cost-efficient, environmentally friendly technologies.

Introduction to AI Processing on Quadra


The intersection of video processing and artificial intelligence (AI) delivers exciting new functionality, from real-time quality enhancement for video publishers to object detection and optical character recognition for security applications. One key feature of NETINT’s Quadra Video Processing Units is their two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU. As you’ll learn in this post, this architecture makes Quadra the ideal platform for executing video-related AI applications.

This post introduces the reader to what AI is, how it works, and how you deploy AI applications on NETINT Quadra. Along the way, we’ll explore one Quadra-supported AI application, Region of Interest (ROI) encoding.

About AI

Let’s start by defining some terms and concepts. Artificial intelligence refers to a program that can sense, reason, act, and adapt. One AI subset that’s a bit easier to grasp is called machine learning, which refers to algorithms whose performance improves as they are exposed to more data over time.

Machine learning involves the five steps shown in the figure below. Let’s assume we’re building an application that can identify dogs in a video stream. The first step is to prepare your data. You might start with 100 pictures of dogs and then extract features, or represent them mathematically, that identify them as dogs: four legs, whiskers, two ears, two eyes, and a tail. So far, so good.

Figure 1. The high-level AI workflow (from Escon Info Systems)

To train the model, you apply your dog-finding algorithm to a picture database of 1,000 animals, only to find that rats, cats, possums, and small ponies are also identified as dogs. As you evaluate and further train the model, you extract new features from all the other animals that disqualify them from being a dog, along with more dog-like features that help identify true canines. This is the “machine learning” that improves the algorithm.

As you train and evaluate your model, at some point it achieves the desired accuracy rate and it’s ready to deploy.

The NETINT AI Tool Chain

Then it’s time to run the model. Here, you export the model for deployment on an AI-capable hardware platform like the NETINT Quadra. What makes Quadra ideal for video-related AI applications is the power of the Neural Processing Units (NPUs) and the proximity of the video to the NPUs. That is, since the video is entirely processed in Quadra, there are no transfers to a CPU or GPU, which minimizes latency and enables faster performance. More on this below.

Figure 2 shows the NETINT AI Toolchain workflow for creating and running models on Quadra. On the left are third-party tools for creating and training AI-related models. Once these models are complete, you use the free NETINT AI Toolkit to input the models and translate, export, and run them on the Quadra NPUs – you’ll see an example of how that’s done in a moment. On the NPUs, they perform the functions for which they were created and trained, like identifying dogs in a video stream.

Figure 2. The NETINT AI Tool Chain.

Quadra Region of Interest (ROI) Filter

Let’s look at a real-world example. One AI function supplied with Quadra is an ROI filter, which analyzes the input video to detect faces and generate Region of Interest (ROI) data to improve the encoding quality of the faces. Specifically, when the AI Engine identifies a face, it draws a box around the face and sends the box’s coordinates to the encoder, with encoding instructions specific to the box.

Technically, Quadra identifies the face using what’s called a YOLOv4 object detection model. YOLO stands for You Only Look Once, which is a technology that requires only a single pass of the image (or one look) for object detection. By way of background, YOLO is a highly regarded family of deep learning object detection models. The original versions of YOLO are implemented using the DARKNET framework, which you see as an input to the NETINT AI Toolkit in Figure 2.

Deep learning is different from the traditional machine learning described above in that it uses large datasets to create the model, rather than human intervention. To create the model deployed in the ROI filter, we trained the YOLOv4 model in DARKNET using hundreds of thousands of publicly available labeled images (where the labels are bounding boxes on people’s faces). This produced a highly accurate model with minimal manual input, which is faster and cheaper than traditional machine learning. Obviously, where relevant training data is available, deep learning is a better alternative than traditional machine learning.

Using the ROI Function

Most users will access the ROI function via FFmpeg, where it’s presented as a video filter with the filter-specific command string shown below. To execute the function, you call the filter (ni_quadra_roi), enter the name and location of the model (yolov4_head.nb), and enter a QP value to adjust the quality within each box (qpoffset=-0.6). Negative values increase video quality, while positive values decrease it, so this command string would increase the quality of the faces by approximately 60% over other regions in the video.

-vf 'ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6'
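If you generate FFmpeg commands from a script, the filter argument can be assembled programmatically. The Python sketch below builds only the filter string shown above; the helper name is an assumption for illustration, and the rest of the FFmpeg command (inputs, encoder, outputs) is intentionally omitted because it depends on your setup.

```python
def build_roi_filter(model_path, qp_offset):
    """Build the ni_quadra_roi filter argument from a model path and QP offset."""
    if not model_path:
        raise ValueError("model_path is required")
    # Negative offsets raise quality inside detected boxes; positive lower it.
    return f"ni_quadra_roi=nb={model_path}:qpoffset={qp_offset}"

# Keep the flag and its value as separate list items so no shell quoting
# is needed when the list is later passed to a process launcher.
vf_args = ["-vf", build_roi_filter("./yolov4_head.nb", -0.6)]
print(vf_args)
```

Passing the arguments as a list, rather than as one shell string, avoids the quoting problems that often appear when filter strings are copied from web pages.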

Obviously, this video is highly compressed; in a surveillance video, the ROI filter could preserve facial quality for face detection; in a gambling or similar video compressed at a higher bitrate, it could ensure that the players’ or performers’ faces look their best.

Figure 3. The region of interest filter at work; original on LEFT, ROI filter on the RIGHT

In terms of performance, a single Quadra unit can process about 200 frames per second or at least six 30fps streams. This would allow a single Quadra to detect faces and transcode streams from six security cameras or six player inputs in an interactive gambling application, along with other transcoding tasks performed without region of interest detection.
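The stream count follows directly from the frame budget. Here is a quick Python check, assuming the roughly 200 frames per second is shared evenly across streams and ignoring any other load on the unit; the function name is illustrative.

```python
def max_roi_streams(unit_fps_budget, stream_fps):
    """Whole streams that fit in a per-unit frames-per-second budget."""
    if stream_fps <= 0:
        raise ValueError("stream_fps must be positive")
    return unit_fps_budget // stream_fps

print(max_roi_streams(200, 30))  # prints 6
```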

Figure 4 shows the processing workflow within the Quadra VPU. Here we see the face detection operating within Quadra’s NPUs, with the location and processing instructions passing directly from the NPU to the encoder. As mentioned, since all instructions are processed on Quadra, there are no memory transfers outside the unit, reducing latency to a minimum and improving overall throughput and performance. This architecture represents the ideal execution environment for any video-related AI application.

Figure 4. Quadra’s on-board AI and encoding processing.

NETINT offers several other AI functions, including background removal and replacement, with others like optical character recognition, video enhancement, camera video quality detection, and voice-to-text on the long-term drawing board. Of course, via the NETINT Tool Chain, Quadra should be able to run most models created in any machine learning platform.

Here in late 2022, we’re only touching the surface of how AI can enhance video, whether by improving visual quality, extracting data, or any number of as-yet unimagined applications. Looking ahead, the NETINT AI Tool Chain should ensure that any AI model that you build will run on Quadra. Once deployed, Quadra’s integrated video processing/AI architecture should ensure highly efficient and extremely low-latency operation for that model.

Vindral CDN Against Dinosaurs’ Agreement


“One thing is the bill that you're getting, the other thing is the bill we're leaving to our children...”

WATCH FULL CONVERSATION HERE: https://youtu.be/tNPFpXPVpxI

We’re going to talk about Vindral – but first, tell us a little bit about RealSprint?

RealSprint, we’re a Swedish company based in Northern Sweden, which is kind of a great place to be running a tech company. When you’re in a university town, and any time after September, it gets dark outside for most parts of the day, which means people generally try to find things to do inside. So, it’s a good place to have a tech business because you’ll have people spending a lot of time in front of their screens, creating things. RealSprint is a heavily culture-focused team, with the majority located in Northern Sweden and a few based in Stockholm and in the U.S.

The company started around 10 years ago as a really small team that did not have the end game figured out yet. All they knew was that they wanted to do something around video, broadcasting, and streaming. From there it’s grown, and today we’re 30 people.

At a high level, what is Vindral?

Vindral is actually a product family. There is a live CDN, as you mentioned, and there’s also a video compositing software. As for the live CDN, it’s been around five or six years that it’s been running 24/7.

The product was born because we got questions from our clients about latency and quality. ‘Why do I have to choose if I want low latency or if I want high quality?’ There are solutions on both ends of that spectrum, but when we got introduced to the problem, there weren’t really any good ones. We started looking into real-time technologies, like WebRTC, in its current state and quickly found that it’s not really suitable if you want high quality. It’s amazing in terms of latency. But the client’s reality requires more. You can’t go all in on only one aspect of a solution. You need something that’s balanced.

Draw us a block diagram. So, you’ve got your encoder, you’ve got your CDN, you’ve got software…

We can take a typical client in entertainment or gaming. So, they have their content, and they want to broadcast that to a global audience. What they generally do is they ingest one signal to our endpoint, which is the most standard way of using our CDN. And there are several ways of ingesting multiple transfer protocols.

The first thing that happens on our end is we create the ABR ladder. We transcode all the qualities that are needed since network conditions vary between markets. Even in places that are well connected, the home Wi-Fi alone can be so bad at times, with a lot of jitter and latency.

After the ABR ladder is created, the next box fans out to the places in the world where there are potential viewers. And from there, we also have edge software as one part of this. Lastly, the signal is received by the player instanced on the device.

That’s basically it.

You’ve got an encoder in the middle of things creating the encoding ladder. Then you’ve got the CDN distributing. What about the software that you’ve contributed? How does that work? Do I log into some kind of portal and then administrate through there?

Exactly. Take a typical client in gaming, for example. They’re running 50 or 100 channels. And they want to see what’s going on in their operations, understand how much data is flowing through the system, and things like that. There is a portal where they can log in, see their usage, and see all of the channel information that they would need. It’s a very important part, of course, of any mature system that the client understands what’s going on.

Encoding is particularly important for us to solve because we have loads of channels running 24/7. So, that’s different. If you’re running a CDN, and your typical client is broadcasting for 20 minutes a month, then, of course, the encoding load is much lower. In our case, yes, we do have those types (minimal usage), but many of our clients are heavy users, and they own a lot of content rights. Therefore, the encoding part is several hundred terabytes ingested monthly, with only one quality for each stream on the ingest side.

You’re encoding ABR. Which codecs are you supporting? And which endpoints are you supporting?

So, codec-wise, everybody does H.264, of course. That’s the standard when it comes to live streaming with low latency. We have recently added AV1, as well, which was something we announced as a world first. We weren’t the world’s first with AV1, but we were the world’s first with AV1 at what many would call real-time. We call it low latency.

We chose to add it because there’s a market pointing to AV1.

Which devices are you targeting? Is it TV? Smart TV? Mobile? The whole gamut?

I would say the whole gamut. That list of devices is steadily growing. I’m trying to think of any devices that we don’t support. Essentially, as long as it’s using the internet, we deliver to it. Any desktop or mobile browser, including iOS as well.

iOS is, basically, the hardest one. If you’re delivering to iOS, the browsers are all running iOS Safari, and we’re getting the same performance on iOS Safari. And then Apple TV, Google Chromecast, Samsung, LG TVs, and Android TVs. There’s a plethora of different devices that our clients require us to support.

4K? 1080p? HDR? SDR?

Yes, we support all of them. One of the very important things for us is to prove that you can get quality on low latency.

Take a typical client. They’re broadcasting sports and their viewers are used to watching this on their television, maybe a 77-inch or 85-inch TV. You don’t want that user to get a 720p stream. This is where the configurable latency really comes into play, allowing the client to pick a second of latency or 800 milliseconds, with 4K to be maintained on that latency. That is one of the use cases where we shine.

There’s also a huge market for lower qualities as well, where that’s important.

So, you mentioned ABR ladders, and yes, there are markets where you get 600 kilobits per second on the last mile. You need a solution for that as well.

Your system is the delivery side, the encoding side. Which types of encoders did you consider when you chose the encoder to fit into Vindral?

There are actually two steps to consider depending on whether we’re doing it on-prem or off, like a cloud solution. The client often has their own encoders. Many of our clients use Elemental or something similar just to push the material to us. But on the transcoding, where we generate the ladder, unless we’re passing all qualities through (which is also a possibility), there are, of course, different ways and different directions to go for different scenarios. For example, if you take an Intel CPU-based server and use software to encode, that is a viable option in some scenarios, but not in all.

There’s an Nvidia GPU, for example, which you could use in some scenarios since there are many factors coming into play when making that decision.

The highest priority of all is something that our business generally does badly – maintaining business viability. You want to make sure that any client that is using the system can pay and make their business work. Now, if we have channels that are running 24/7, as we do, and if it’s in a region where it’s not impossible to allocate bare metal or collocation space, then that is a fantastic option in many ways.

CPU-based, GPU-based, and ASIC-based encoding are all different, and they are the three options that we’ve looked into.

So, how do you differentiate? You talked about software being a good option in some instances. When is it not a good option?

No option is good or bad in a sense, but if you compare them, both the GPU and the ASIC outperform the software encoding when it comes to heavier use.

The software option is useful when you need to spin it up, spin it down, and you need to move things. You need it to be flexible which is, usually, in the lower revenue parts of the markets.

When it comes to the big broadcasters and the large rights holders, the use case is heavier, with many channels and large usage over time. Then the GPU and especially the ASIC make a lot of sense.

You’re talking there about density. What is the quality picture?
A lot of people think software quality is going to be better than ASIC and GPUs. How do they compare?

It might be in some instances. We’ve found that the quality when using ASICs is fantastic. It all depends on what you want to do, because we need to understand that we’re talking about low latency here. We don’t have the option of multi-pass encoding or anything like that. Everything needs to work in real time. Our requirement on encoding is that it takes a frame to encode, and that’s all the time that you get.

You mentioned density, but there are a lot of other things coming into play, quality being one.

If you’re looking at ASICs, you’re comparing that to GPUs. In some scenarios we’ve had for the past two years, the decision could have been based on the availability factor – there’s a chip shortage. What can I get my hands on? In some cases, we’ve had a client banging on the door, and they want to go live right away.

Going back to the density part. That is a huge game changer because the ASIC is unmatched in terms of the number of streams per rack unit. If you just measure that KPI, and you’re willing to do the job of building your CDN in co-location spaces, which not everybody is, then that’s it. You have to ask yourself, though, who’s going to manage this? You don’t want to bloat when you’re managing this type of solution. If you have thousands of channels running, then cost is one thing when it comes to not having to take up a lot of rack space, but also, you don’t want it to bloat too much.

How formal an analysis did you make in choosing between the two hardware alternatives? Did you bring it down to cost per stream and power per stream?
Did you do any of that math? How did you make that decision between those two options?

Well, in a way, yes. But, on that particular metric, we need to look at the two options and say well, this is at a tenth of the cost. So I’m not going to give you the number, because I know it’s so much smaller.

We’re well aware of what costs are involved, but the cost per stream depends on profiles, etc. Just to compare them, we’ve, naturally, started encoding streams, especially in AV1. We look at what the actual performance is, how much load there is, what’s happening on the cards, and how much you can put on them before they start giving in… But then… there’s such a big difference…

Take, for example, a GPU. A great piece of hardware. But it’s also kind of like buying a car for the sound system. Because the GPU… If I’m buying an NVIDIA GPU to encode video, then I might not even be using the actual rendering capabilities. That is the biggest job that the GPU is typically built for. So, that’s one of the comparisons to make, of course.

What about the power side? How important is power consumption to either you yourself or your customers?

If you look at the energy crisis and how things are evolving I’d say it is very, very important. The typical offer you’ll be getting from the data center is: we’re going to charge you 2x the electrical bill. And that’s never been something that’s been charged because they don’t even bother. Only now, we’re seeing the first invoices coming in where the electrical bill is part of it. In Germany, the energy price peaked in August at 0.7 Euros per kilowatt hour.

Frankfurt, Germany, is one of the major exchanges that is extremely important. If you want performance streaming, you need to have something in Frankfurt. There’s another part of it as well, which is, of course, the environmental aspect of it. One thing is the bill that you’re getting. The other thing is the bill we’re leaving to our children.

It’s kind of contradictory because many of our clients make travel unnecessary. We have a Norwegian company that we’re working with that is doing remote inspections of ships. They were the first company in the world to do that. Instead of flying in an inspector, the ship owner, and two divers to the location, there’s only one operator of an underwater drone that is on the location. Everybody else is just connected. That’s obviously a good thing for the environment. But what are we doing?

Why did you decide to lead with AV1?

That’s a really good question. There are several reasons why we decided to lead with AV1. It is very compelling as soon as you can do it in real time. We had to wait for somebody to make it viable, which we found with NETINT’s ASIC.

Viable means at high quality, with latency and reliability that we could use, and also, of course, with throughput. We don’t have to buy too much hardware to get it working.

We’re seeing markers that our clients are going to want AV1. And there are several reasons why that is the case. One of which is, of course, it’s license free. If you’re a content owner, especially if you’re a content owner with a large crowd with many subscribers to your content, that’s a game-changer. Because the cost of licensing a codec can grow to become a significant part of your business expenses.

Look at what’s happening with FAST, free ad-supported television. There you’re trying to get even more viewers. And you have lower margins, so what you’re doing is creating eyeball minutes. And then, if you have codec and license costs, that’s a bit of an issue. It’s better if it’s free.

Is this what you’re hearing from your customers? Or is this what you’re assuming they’re thinking about?

That’s what we’re hearing from our customers, and that’s why we started implementing it.

For us, there’s also the bandwidth-to-quality aspect, which is great. I believe that it will explode in 2023. For example, if you look at what happened one month ago, Google made hardware decoding mandatory for Android 14 devices. That’s both phones and tablets. It opens so many possibilities.

We were not expecting to get business on it yet, but we are, and I’m happy about that. There are already clients reaching out because of the licensing aspect, as some of them are transmitting petabytes a month. If you can bring down the bandwidth while retaining the quality, that’s a good deal.

You mentioned before that your systems allow the user to dial in the latency and the quality. Could you explain how that works?

It’s important to make a difference between the user and the broadcaster. Our client is the broadcaster that owns the content, and they can pick the latency.

Vindral’s live CDN doesn’t work on a ‘fetch your file’ basis. The way it works is we’re going to push the file to you, and you’re going to play it out. And this is how much you’re going to buffer. Once you have that setup, and, of course, a lot of sync algorithms and things like that at work, then the stream is not allowed to drift.

A typical use case is where you have live auctions, for example. The typical setup for live auctions is 1080p, and you want below one second of latency because people are bidding. There are also people bidding in the actual auction house, so there’s the fairness aspect of it as well.

What we typically see is they configure maybe a 700-millisecond buffer, and it makes it possible. Even that small of a buffer makes such a huge difference. What we see in our metrics is that, basically, 99% of the viewers are getting the highest quality stream across all markets. That’s a huge deal.

How much does the quality drop off? What’s the lowest latency you support, and how much does the quality drop off at that latency as compared to one or two seconds?

I would say that the lowest that we would maybe recommend somebody to use our system for is 500 milliseconds. That would be about 250 milliseconds slower than a webRTC-based real-time solution. And why do I say that? It’s because other than that, I see no reason to use our approach. If you don’t want a buffer, you may as well use something else.

Actually, we don’t have that many clients trying that out, because for most of them, 500 milliseconds is the lowest anybody sets. And they’ve been like ‘this is so quick we don’t need anything more’. And it retains 4K at that latency.

How does the pitch work against webRTC?
If I’m a potential customer of yours and you come in and talk about your system as compared to webRTC, what are the pros and cons of each? It’s an interesting technological decision. I know that webRTC is going to be potentially lower latency, but it might only be one stream, it may not come with captioning, and it’s not going to be ABR. It would be interesting to hear how you differentiate.

Let’s look from the perspective of when you should be using which. If you need to have a two-way voice conversation, you should use webRTC. There are actually studies that have been made proving that if you bring the latency up above 200 milliseconds, the conversation starts feeling awkward. If you have half a second, it is possible, but it’s not good. So, if that’s an ultimate requirement, then webRTC all day long.

Both technologies are actually very similar. The main difference I would point out is that we have added this buffer that the platform owner can set. So, the player’s instance is at that buffer level. WebRTC currently does not support that. And if it did, we might even implement that as an option. It might go that way at some point, but today it doesn’t.

On the topic of differences, then. If 700 or 600 milliseconds of latency is good for you and quality is still important, then you should be using a buffer and using our solution. When you’re considering different vendors, the feature set, and what you’re actually getting in the package, there are huge differences. For some vendors, on their lower-tier products, ABR is not included. Things like that. Where the obvious thing is – you should be using ABR. Definitely.

You talked about the shortest. What’s the longest latency you see people dialing in?

We’ve actually had one use case in Hong Kong where they chose to set the latency at 3.7 seconds. That was because the television broadcast was at 3.7 seconds.

That’s the other thing. We talk a lot about latency. Latency is a hot topic, but honestly, many of our clients value synchronization even above latency. Not all clients, but some of them.

If you have a game show where you want to react to the chat and have some sort of interactivity… Maybe you have 1.5 seconds. That’s not a big issue if it’s at 1.5 seconds of latency. You will, naturally, get a little bit more stability since you’re increasing the buffer. Some of our clients have chosen to do that.

But around 3.5… That’s actually the only client we’ve had that has done that. But I think there could be more in the future. Especially in sports. If you have the satellite broadcast… It is at seven seconds of latency. We can match it to within hundreds of milliseconds.

And the advantage of higher latency is going to be stream stability and quality.
Do you know what the quality difference is going to be?

Definitely. However, as soon as you’re above even one second, the returns are diminishing. It’s not like it unlocks this whole universe of opportunities. On extreme markets, it might, but I would think that if you’re going above two seconds, you’re kind of done. There is no need to go higher. At least our clients have not found that need. The markets are basically from East Asia to South America and South Africa because we’ve expanded our CDN into those parts.

You’ve spoken a couple of times about where you install your equipment, and you’re talking about co-locating and things like that. What does your typical server look like? How many encoders are you putting in it? And what type of density are you expecting from that?

In general, it would be something like one server can do 10 times as many streams if you’re using the ASIC. Then if you’re using GPUs, like Nvidia, for example, it’s probably just the one. I’m not stating any numbers, because my IT guys are going to tell me that I was wrong.

What is the cost of low latency? If I decide to go to the smallest setting, what is that going to cost me? I guess there’s going to be a quality answer, and there’s going to be a stability answer… Is there a hard economic answer?

My hope is that there shouldn’t be a cost difference, depending on regions. The way we’ve chosen to operate is about the design paradigm of the product that you’ve created. We have competitors that are going with one partner. They’ve picked cloud vendor X, and they’re running everything in their cloud. And then what they can do is limited to the deal with that cloud vendor.

For example, we had an AV1 request from Greece. Huge egress for an internet TV channel that I was blown away by, and they mentioned their pricing. They wanted to save costs by cutting their traffic by using AV1. What we did with that request is we went out to our partners and vendors and asked them – can you help us match this, and we did. From a business perspective, it might, in some cases, cost more. But there is also a perception of high cost that plagues the low latency business, and that is because many of these companies have not considered their power consumption or their form factors.

Or actually being willing to take a CAPEX investment instead of just running in the cloud and paying as you go. Those are many of the things that we’ve chosen to put the time into so that there will not be that big a difference.

Take, for example, Tata Communications, one of our biggest partners, and their pricing. They’re running our software stack in their environments to run their VDM, and it’s on a cost parity. So that’s something that should always be the aim. Then, I’m not going to say it’s always going to be like that, but that’s just a short version when you’re talking about the business implications.

We’re often getting requests where the potential client has this notion that it’s going to be a very high cost. Then they find that this makes sense, and we can build a business.

Are you seeing companies moving away from the cloud towards creating their own co-located servers with encoders and producing that way, as opposed to paying cents per minute to different cloud providers?

I would say I’m seeing the opposite. We’re doing both, just to be clear. I think the way to go is to do a hybrid.

For some clients, they’re going to be broadcasting 20 minutes a month. Cloud is awesome for that. You spin it up when you need it, and you kill it when it’s done. But that’s not always going to cut it. But if you’re asking me what motion I’m seeing in the market? There are more and more of these companies that are deploying across one cloud. And that’s where it resides. There are also types of offerings that you can instance yourself in third-party clouds, which is also an option. But again, it’s the design choice that it’s a cloud service that uses underlying cloud functions. It’s a shame that it’s not more of both. It creates an opportunity for us, though.

What are the big trends that you’re chasing for 2023 and beyond? What are you seeing? What forces are going to impact your business? The new features you’re going to be picking up? What are the big technology directions you’re seeing?

I mean, for us on our roadmap, we have been working hard on our partner strategy, and we’ve been seeing a higher demand for white-label solutions, which is what we’re working on with some partners.

We’ve done a few of those installs, and that’s where we are putting a lot of effort into it because we’re running our own CDN. But we can also enable others to do it, even as a managed service. You have these telcos that have maybe an edge or less offering since before, and they’re sitting on tons of equipment and fiber. So that’s one thing.

If we’re making predictions, there are two things worth a mention. I would expect the sports betting markets, especially in the US, to explode. That’s something we are definitely keeping our eyes on.

Maybe live shopping becomes a thing outside of China. Many of the big players, the big retailers, and even financial companies, are working on their own offerings in live shopping.

The dinosaurs’ agreement?

Have I told you about the dinosaurs’ agreement? It’s comparable to a gentleman’s agreement. This might be provocative to some. And I get that it’s complicated in many cases.

There is, among some of the bigger players and also among independent consultants that have different stakes, a sort of mutual agreement to keep asking the question – do we really need low latency? Or do we really need synchronization?

And while it’s a valid question, it’s also kind of a self-fulfilling prophecy. Because as long as the bigger brands are not creating the experience that the audience is waiting for them to create, nobody’s going to have to move. So that is what I’m calling the dinosaurs here. They’re holding on to the thing that they’ve always been doing. And they’re optimizing that, but not moving on to the next generation. And the problem they’re going to be facing, hopefully, is that when it reaches critical mass, the viewers are going to start expecting it, and that’s when things might start changing.

There are many workflow considerations, of course. There are tech legacy considerations. There are cost considerations and different aspects when it comes to scaling. However, saying that you don’t need low latency is a bit of an excuse.

Meta AV1 Delivery Presentation: Six Key Takeaways

One of the most gracious things that large companies like Meta and Netflix do is to share their knowledge with others in the community. On November 3, Meta hosted Video @Scale Fall 2022 which featured multiple speakers from Meta and other companies. If you’re unfamiliar with the event, here’s the description, “Designed for engineers that develop or manage large-scale video systems serving millions of people.”

Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

One talk drew my attention: Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta. Watch above or use this link: https://bit.ly/Lei_AV1

For perspective, where Netflix has focused AV1 distribution on Smart TVs, Meta’s focus is mobile. Briefly, the company started delivering “AV1-encoded FB/IG Reels videos to selected iPhone and Android devices” in 2022. Lei’s talk included encoding, decoding, and some observations about the bandwidth savings, improved MOS scores, and increased viewing time that AV1 delivered.

Here are my top 6 takeaways from Lei’s excellent presentation.

1. Meta Finds that AV1 is 30% More Efficient than HEVC/VP9

As you’ll learn later in this article, Meta relies upon software playback on iOS and Android platforms. Since both platforms support HEVC decoding, iOS in hardware (since 2017) and Android mostly in hardware but also in software, it’s reasonable to ask why Meta didn’t just use HEVC?

The answer is that in Meta’s own tests, they found that AV1 was 30% more efficient than both VP9 and HEVC, about 21% lower than the 38% higher efficiency that I found in this study by Streaming Media. Lei didn’t discuss HEVC in his presentation, but you’d have to guess that Meta chose AV1 over HEVC because the superior quality AV1 was able to deliver outweighed the potential impact of software-playback on mobile device battery life.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

2. Meta Encodes with SVT-AV1 For Video On Demand (VOD)

The chart shown below tracks the encoding time and quality levels of the open-source codecs shown on the upper right, which include libaom-av1 (AV1), libvpx (VP9), x265 (HEVC), x264 (AVC), vvenc (VVC), and SVT-AV1 (AV1).

Here’s how Lei interpreted this data. “From this graph, we see that SVT-AV1 maintains a consistent performance across a wide range of complexity levels. No matter for an encoding efficiency or compute efficiency point of view, SVT-AV1 always achieves the most optimal results among open-source encoders.” Again, these results track my own findings, at least as it relates to SVT-AV1 as compared to libaom.

Interestingly, the chart only tracks software encoders, not hardware, which present a completely different quality/encoding time curve. You’ll see why this is important at the end of this post.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

3. Meta Creates Their Encoding Ladder Using the Convex Hull

There are many forms of per-title encoding. Some, like YouTube’s, are based on machine learning, while others, like Netflix’s, are based on multiple encodes to find the convex hull. Since Meta’s encoding task is much closer to YouTube’s than Netflix’s (high volume UGC), you might assume that Meta uses AI as well.

However, Meta actually uses the convex hull, a brute force technique that involves encoding at multiple resolutions and multiple bitrates to find the combination that comprises the convex hull for that video. In the example shown below, Meta encoded at seven resolutions and five CRF levels, a total of 35 encodes. To compute the convex hull, Meta plots the 35 data points and then draws a line connecting the points on the upper left boundary. The points on the convex hull are the optimal encoding configuration for that video.
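To make the geometry concrete, here is a minimal Python sketch of the hull selection. It is not Meta’s implementation. It assumes each encode has been reduced to a (bitrate, quality) pair, and the function name `upper_left_hull` and the sample numbers are hypothetical, invented for illustration rather than taken from Lei’s talk.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (bitrate in kbps, quality score such as VMAF)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Positive when the path o -> a -> b bends upward (counter-clockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_left_hull(points: List[Point]) -> List[Point]:
    """Return the encodes on the upper-left convex boundary, by rising bitrate."""
    # Keep only the best quality seen at each bitrate.
    best: Dict[float, float] = {}
    for bitrate, quality in points:
        if bitrate not in best or quality > best[bitrate]:
            best[bitrate] = quality
    ordered = sorted(best.items())
    if not ordered:
        return []

    hull: List[Point] = []
    for p in ordered:
        # Discard the previous point if it sits on or below the line
        # joining its neighbours; it is not on the upper boundary.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    # Past the highest-quality point, extra bitrate buys nothing.
    top = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: top + 1]


# Hypothetical (bitrate, quality) results from a handful of test encodes.
sample_encodes = [
    (100, 50), (200, 70), (250, 60), (300, 75),
    (400, 90), (500, 92), (600, 91),
]
hull = upper_left_hull(sample_encodes)
print(hull)
```

With a 7 x 5 grid like the one in Meta’s example, you would pass in all 35 points, tagging each with its resolution and CRF so that the surviving points map back to encoding settings.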

As Lei points out, “the complexity of this process is quite high.” To reduce the complexity, Meta uses techniques like computing the convex hull with high-speed presets, and then encoding the selected resolution and CRF points using higher-quality presets for final delivery. Lei noted that though there are more encodes using this hybrid approach, as the optimal configurations are encoded twice, overall encoding time is reduced. 

Just to state the obvious, this approach only works for video on demand, not live. Even with the fastest hardware encoders, you can’t produce 35 iterations to identify the optimal five. This indicates that Meta uses a different schema for live transcoding, which Lei doesn’t address.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

4. Meta Uses the Convex Hull Computed for AVC for VP9 and AV1

Like most large publishers, Meta encodes using multiple codecs like H.264, VP9, and AV1 to deliver to different devices. One surprising revelation was that Meta uses the convex hull computed for H.264 to guide the convex hull implementations for the VP9 and AV1 encodes.

Lei didn’t explain how this works – as you can see in the figure below, the resolutions and bitrates for the three codecs are obviously different, and that’s what you would expect. So, there must be some kind of interpolation of the convex hull information from one codec to another. But you see that VP9 delivers a 48% bitrate savings over the top H.264 ladder rung, while AV1 delivers 65%.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

5. Apple and Android Phones Present Completely Different Challenges

Again, no surprise. There are many fewer Apple devices, and all are premium high-performance models. In contrast, there’s a much greater range of Android devices, from low-cost/low-performance options to models that rival Apple in cost and performance.

Lei shared that Facebook tests Android devices to determine eligibility for AV1 videos. As you can see in the slide below, Meta delivers much different quality to iOS and Android devices.

It was clear from Lei’s talk that delivering AV1 to Apple phones was relatively simple compared to sending AV1 video to Android phones. This is actually the reverse of what you might expect, as iOS doesn’t support AV1 natively while Android does. Though you can deliver video via an app to iOS devices, as Meta does, Safari doesn’t support it. And even though Android does support AV1 playback natively, you’ll have to implement some type of testing protocol—like Meta—to ensure smooth playback until AV1 hardware support becomes pervasive, which probably won’t be until 2024 or beyond.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

6. AV1 has Delivered in Several Key Metrics

Integrating a new codec into your encoding and delivery pipeline isn’t trivial. So, the big question is, was AV1 worth it? The slide below displays three graphs. Sorry that the quality in the original slide is suboptimal, but here’s the net/net.

The graph on the top left shows the week-over-week playback MOS on all videos played on an iPhone. It shows about a 0.6 MOS point improvement. Since MOS (Mean Opinion Score) is usually computed on a scale from 1 to 5, 0.6 is a significant number. The second graph, on the upper right, is the bitrate of all videos delivered, and it shows about a 12% bitrate reduction.

The bottom chart presents the average iPhone watch time for the different codecs used in Facebook Reels and shows that AV1 watch time went up to about 70% within the first week after rollout. This doesn’t seem to mean that AV1 increased watch time; rather, it seems to show that a significant number of devices were able to play AV1, which is how AV1 delivered the MOS improvement and bitrate reductions shown in the top two charts.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

Lei’s talk was about 18 minutes long, and there’s a lot more useful data and observations than I’ve presented here. Again, here’s the link – https://bit.ly/Lei_AV1. If you’re considering deploying AV1 for VOD encoding in your organization, you’ll find the encoding-related portions of Lei’s talk illuminating.

What about live? Lei didn’t address it, but you can take some guidance from the fact that Meta recently announced their own Video Processing ASIC. After the announcement, David Ronca, Director, Video Encoding at Meta, commented that “ASICs are able to deliver video quality on par with SW encoders with significantly improved power efficiency. Because of the rapid commoditization of video processing, rising energy costs, and pollution concerns, Video Processing ASICS are inevitable.”

At NETINT, we’ve been shipping transcoders based upon custom encoding ASICs since 2019 and have real market validations of Ronca’s comments. While software encoding may be appropriate for VOD, ASIC-based transcoders are superior, if not essential, for live transcoding.

Back to Lei’s talk: whether you’re distributing VOD or live AV1 streams, Lei’s descriptions of the challenges of AV1 delivery to mobile will be instructive.

NETINT Quadra vs. NVIDIA T4 – Benchmarking Hardware Encoding Performance

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer

This article is the second in a series about benchmarking hardware encoding performance. In the first article, available here, I delineated a procedure for testing hardware encoders. Specifically, I recommended this three-step procedure:

  1. Identify the most critical quality and throughput-related options for the encoder.
  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput to identify the operating point that delivers the optimum blend of quality and throughput for your application.
  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies.
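Step 2 can be sketched in a few lines of Python. This is one possible selection rule, not the procedure used for the tables in this series: it assumes you keep the highest-throughput configuration whose VMAF stays within a tolerance of the best score. The `Config` class, the `pick_operating_point` function, the 1.0-point tolerance, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Config:
    name: str      # label for a preset/lookahead combination
    streams: int   # simultaneous streams sustained at full frame rate
    vmaf: float    # VMAF harmonic mean for that configuration


def pick_operating_point(configs: List[Config], max_vmaf_drop: float = 1.0) -> Config:
    """Pick the highest-throughput config within max_vmaf_drop of the best VMAF."""
    if not configs:
        raise ValueError("need at least one tested configuration")
    best_vmaf = max(c.vmaf for c in configs)
    eligible = [c for c in configs if best_vmaf - c.vmaf <= max_vmaf_drop]
    # Prefer more streams; break ties with the higher VMAF.
    return max(eligible, key=lambda c: (c.streams, c.vmaf))


# Hypothetical test results, not figures from the tables in this article.
tested = [
    Config("slow", streams=4, vmaf=95.0),
    Config("medium", streams=8, vmaf=94.4),
    Config("fast", streams=12, vmaf=91.0),
]
choice = pick_operating_point(tested)
print(choice.name)
```

A tighter tolerance pushes the choice toward the slower presets, while a looser one favors throughput, which is the trade-off that step 2 asks you to resolve for your own application.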

After laying out this procedure, I applied it to the NETINT Quadra Video Processing Unit (VPU) to find the optimum operating point and the associated quality, cost per stream, and watts per stream. In this article, we perform the same analysis on the NVIDIA T4 GPU-based encoder.

About The NVIDIA T4

The NVIDIA T4 is powered by NVIDIA Turing Tensor Cores and draws 70 watts in operation. Pricing varies by the reseller, with $2,299 around the median price, which puts it about $800 higher than the $1,500 quoted for the NETINT Quadra T1 VPU in the previous article.

In creating the command line for the NVIDIA encodes, I checked multiple NVIDIA documents, including a document entitled Video Benchmark Assumptions, this blog post entitled Turing H.264 Video Encoding Speed and Quality, and a document entitled Using FFmpeg with NVIDIA GPU Hardware acceleration that requires a login. I readily admit that I am not an expert on NVIDIA encoding, but the point of this exercise is not absolute quality as much as the range of quality and throughput that all hardware enables. You should check these documents yourself and create your own version of the optimized command string.

While there are many configuration options that impact quality and throughput, we focused our attention on two, lookahead and presets. As discussed in the previous article, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what is coming and can make more intelligent decisions. This improves encoding quality, particularly at and around scene changes, and it can improve bitrate efficiency. But lookahead adds latency equal to the lookahead duration, and it can decrease throughput.

Note that while the NVIDIA documentation recommends a lookahead buffer of 20 frames, I used 15 in my tests because, at 20, the hardware decoder kept crashing. I tested a 20-frame lookahead using software decoding, and the quality differential between 15 and 20 was inconsequential, so this shouldn’t impact the comparative results.

I also tested using various NVIDIA presets, which, like all encoding presets, trade off quality vs. throughput. To measure quality, I computed the VMAF harmonic mean and low-frame scores, the latter a measure of transient quality. For throughput, I tested the number of simultaneous 1080p30 files the hardware could process at 30 fps. I divided the price and the power draw in watts by the stream count to determine cost/stream and watts/stream.
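The per-stream arithmetic is simple enough to show as a short Python sketch. The price ($2,299) and power draw (70 watts) are the T4 figures quoted above; the stream count of 10 is a made-up placeholder rather than a measured result, and the function name `per_stream_metrics` is hypothetical.

```python
from typing import Tuple


def per_stream_metrics(price_usd: float, power_watts: float, streams: int) -> Tuple[float, float]:
    """Return (cost per stream in USD, watts per stream) for one device."""
    if streams <= 0:
        raise ValueError("stream count must be positive")
    return price_usd / streams, power_watts / streams


# T4 price and power from the text; 10 streams is a placeholder assumption.
cost_per_stream, watts_per_stream = per_stream_metrics(2299, 70, 10)
print(f"${cost_per_stream:.2f} per stream, {watts_per_stream:.1f} watts per stream")
```

Because both metrics divide by the same stream count, doubling throughput at the chosen operating point halves both the cost and the power attributed to each stream.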

As you can see in Table 1, I tested with a lookahead value of 15 for selected presets 1-9, and then with a 0 lookahead for preset 9. Line two shows the closest x264 equivalent score for perspective.

In terms of operating point for comparing to Quadra, I chose the lookahead 15/preset 4 configuration, which yielded twice the throughput of preset 2 with only a minor reduction in VMAF harmonic mean. We will consider low-frame scores in the final comparisons.

In general, the presets worked as they should, with higher quality and lower throughput at the left end, and the reverse at the right end, though LA15/P4 performance was an anomaly since it produced lower quality and higher throughput than LA15/P6. In addition, dropping the lookahead buffer did not produce the performance increase that we saw with Quadra, though it also did not produce a significant quality decrease.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 1
Table 1. H.264 options and results.

Table 2 shows the T4’s HEVC results. Though quality was again near the medium x265 preset with several combinations, throughput was very modest at 3 or 4 streams at that quality level. For HEVC, LA15/P4 stands out as the optimal configuration, with four times or better throughput than other combinations with higher-quality output.

In terms of expected preset behavior, LA15/P4 was again quite the anomaly, producing the highest throughput in the test suite with slightly lower quality than LA15/P6, which should deliver lower quality. Again, switching from LA 15 to LA 0 produced neither the expected spike in throughput nor a drop in quality, as we saw with the Quadra for both HEVC and H.264.

Table 2. HEVC options and results.

Quadra vs. T4

Now that we have identified the operating points for Quadra and the T4, let us compare quality, throughput, CAPEX, and OPEX. You see the data for H.264 in Table 3.

Here, the stream count was the same, so Quadra’s advantage in cost per stream and watts per stream stems from its lower cost and more efficient operation. At their respective operating points, the Quadra’s VMAF harmonic mean quality was slightly higher, with a more significant advantage in the low-frame score, a predictor of transient quality problems.

Table 3. Comparing Quadra and T4 at H.264 operating points.

Table 4 shows the same comparison for HEVC. Here, Quadra output 75% more streams than the T4, which increases the cost per stream and watts per stream advantages. VMAF harmonic mean scores were again very similar, though the T4’s low-frame score was substantially lower.

Table 4. Comparing Quadra and T4 at HEVC operating points. 

Figure 5 illustrates the low frames and the low-frame differential between the two files. It is the result plot from the Moscow State University Video Quality Measurement Tool (VQMT), which displays the VMAF score, frame-by-frame, over the entire duration of the two video files analyzed, with Quadra in red and the T4 in green. The top window shows the VMAF comparison for the entire two files, while the bottom window is a close-up of the highlighted region of the top window, right around the most significant downward spike at frame 1590.

Figure 5. The downward green spikes represent the low-frame scores in the T4 encode.

As you can see in the bottom window in Figure 5, the low-frame region extends for 2-3 frames, which might be borderline noticeable by a discerning viewer. Figure 6 shows a close-up of the lowest quality frame, Quadra on the left, T4 on the right, and the dramatic difference in VMAF score, 87.95 to 57, is certainly warranted. Not surprisingly, PSNR and SSIM measurements confirmed these low frames.

Figure 6. Quality comparisons, NETINT Quadra on the left, T4 on the right.

It is useful to track low frames because if they extend beyond 2-3 frames, they become noticeable to viewers and can degrade viewer quality of experience. Mathematically, in a two-minute test file, the impact of even 10 – 15 terrible frames on the overall score is negligible. That is why it is always useful to visualize the metric scores with a tool like VQMT, rather than simply relying on a single score.
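To see why a single score hides these drops, consider this small Python sketch with synthetic data. It builds a two-minute, 30 fps clip (3,600 frames) with a constant VMAF of 95, then drops 15 consecutive frames to 50. The scores, the threshold of 70, and the helper name `low_frame_runs` are all assumptions chosen for illustration, not values from the test files.

```python
from statistics import harmonic_mean


def low_frame_runs(scores, threshold, min_length):
    """Return (start_index, length) for each run of consecutive frames
    scoring below `threshold` that lasts at least `min_length` frames."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i              # a new low-quality run begins
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - start))
            start = None               # the run ended on this frame
    # Handle a run that continues to the last frame.
    if start is not None and len(scores) - start >= min_length:
        runs.append((start, len(scores) - start))
    return runs


# Synthetic clip: 3,600 frames at 95, with 15 frames at 50 from frame 1000.
clean = [95.0] * 3600
damaged = list(clean)
damaged[1000:1015] = [50.0] * 15

print(round(harmonic_mean(clean), 2))
print(round(harmonic_mean(damaged), 2))   # drops by well under one point
print(low_frame_runs(damaged, threshold=70.0, min_length=4))
```

The overall score barely moves, while the run detector flags a 15-frame stretch that a viewer would likely notice. Setting `min_length` to 4 mirrors the 2-3 frame visibility threshold mentioned above.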

Summing Up

Overall, you should consider the procedure discussed in this and the previous article as the most important takeaway from these two articles. I am not an expert in encoding with NVIDIA hardware, and the results from a single or even a limited number of files can be idiosyncratic.

Do your own research, test your own files, and draw your own conclusions. As stated in the previous article, do not be impressed by quality scores without knowing the throughput, and expect that impressive throughput numbers may be accompanied by a significant drop in quality.

Whenever you test any hardware encoder, identify the most important quality/throughput configuration options, test over the relevant range, and choose the operating point that delivers the best combination of quality and throughput. This will give the best chance to achieve a meaningful apples vs. apples comparison between different hardware encoders that incorporates quality, cost per stream, and watts per stream.
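One way to make the operating-point choice repeatable is to express it as a rule. The Python sketch below picks the highest-throughput configuration whose VMAF stays within a tolerance of the best score in the set. This rule, the function name, and the one-point tolerance are assumptions for illustration, not the procedure used to pick the operating points in this article, and the numbers are placeholders rather than results from the tables.

```python
def pick_operating_point(results, max_vmaf_drop):
    """Pick the highest-throughput configuration whose VMAF stays within
    `max_vmaf_drop` points of the best score in the set."""
    best_vmaf = max(r["vmaf"] for r in results)
    eligible = [r for r in results if best_vmaf - r["vmaf"] <= max_vmaf_drop]
    # Prefer more streams; break ties with the higher VMAF score.
    return max(eligible, key=lambda r: (r["streams"], r["vmaf"]))


# Placeholder measurements for three hypothetical configurations.
results = [
    {"config": "A", "vmaf": 94.0, "streams": 4},
    {"config": "B", "vmaf": 93.2, "streams": 9},
    {"config": "C", "vmaf": 88.0, "streams": 12},
]
choice = pick_operating_point(results, max_vmaf_drop=1.0)
print(choice["config"])  # B
```

Configuration C has the most streams but falls outside the quality tolerance, so B wins. You would still want to check low-frame scores for the chosen configuration before committing to it.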