AV1 Capped CRF Encoding with Quadra VPU

We’ve previously reported results for capped CRF encoding for H.264 and HEVC using NETINT Quadra video processing units (VPU). This post will detail AV1 performance, including both 1080p and 4K data.

For those with limited time, here’s what you need to know: Capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR. NETINT VPUs are the first hardware video encoders to adopt Capped CRF across the three most popular codecs in use today, AV1, HEVC, and H.264.

You can read a quick description of capped CRF here, and get a deep dive into H.264 and HEVC performance results here.

CAPPED CRF OVERVIEW

Briefly, capped CRF is a smart bitrate control technique that combines the benefits of CRF encoding with a bitrate cap. Unlike variable bitrate encoding (VBR) and constant bitrate encoding (CBR), which target specific bitrates, capped CRF targets a specific quality level, which is controlled by the CRF value. You also set a bitrate cap, which is applied if the encoder can’t meet the quality level below the bitrate cap.

On easy-to-encode videos, the CRF value sets the quality level, which it can usually achieve below the bitrate cap. In these cases, capped CRF typically delivers bitrate savings over CBR-encoded footage while delivering similar quality. For harder-to-encode footage, the bitrate cap usually controls, and capped CRF delivers close to the same quality and bitrate as CBR.

The value proposition is clear: lower bitrates and good quality during easy scenes, and similar to CBR in bitrate and quality for harder scenes. I’m not addressing VBR because NETINT’s focus is live streaming, where CBR usage dominates. If you’re analyzing capped CRF for VOD, you would compare against 2-pass VBR as well as potentially CBR.

One last detail. CRF values have an inverse relationship to quality and bitrate; the higher the CRF value, the lower the quality and bitrate. In general, video engineers select a CRF value that delivers their target quality level. For premium content, you might target an average VMAF score of 95. For user-generated content or training videos, you might target 93 or even lower. As you’ll see, the lower the quality score, the greater the bandwidth savings.
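
To make the mechanics concrete, here is a minimal sketch of the two rate-control modes expressed as FFmpeg command lines. It uses the open-source libaom-av1 encoder purely for illustration; the Quadra VPU is driven through NETINT’s own FFmpeg integration, whose encoder name and options are not shown here, and the clip name is hypothetical.

```python
# Illustrative only: build FFmpeg command lines for CBR vs. capped CRF AV1 encodes
# using the open-source libaom-av1 encoder. (The Quadra VPU uses NETINT's own
# FFmpeg plugin with different encoder/option names, not shown here.)
SOURCE = "football_1080p.mp4"   # hypothetical input clip
CAP_KBPS = 4500                 # 4.5 Mbps CBR target / capped CRF cap, as in the 1080p tests

def cbr_args(bitrate_kbps: int) -> list[str]:
    # Plain CBR: target, min, and max rate all pinned to the same value.
    rate = f"{bitrate_kbps}k"
    return ["-c:v", "libaom-av1", "-b:v", rate,
            "-minrate", rate, "-maxrate", rate, "-bufsize", f"{bitrate_kbps * 2}k"]

def capped_crf_args(crf: int, cap_kbps: int) -> list[str]:
    # Capped CRF: quality target via -crf, with -b:v acting as the bitrate ceiling
    # (libaom's constrained-quality mode when both are set).
    return ["-c:v", "libaom-av1", "-crf", str(crf), "-b:v", f"{cap_kbps}k"]

for label, args in [("CBR 4.5 Mbps", cbr_args(CAP_KBPS)),
                    ("Capped CRF 23 @ 4.5 Mbps cap", capped_crf_args(23, CAP_KBPS))]:
    print(f"{label}: ffmpeg -y -i {SOURCE} " + " ".join(args) + " out.mp4")
```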

1080p RESULTS

We show 1080p results in Table 1, which is divided between easy-to-encode and hard-to-encode content. We encoded the CBR clips to 4.5 Mbps and applied the same cap for capped CRF encoding.

Table 1. 1080p results using Quadra VPU and capped CRF encoding.

You can see that in CBR mode, Quadra VPUs don’t hit the target rate as accurately as they do in capped CRF mode, missing on the low side. This won’t degrade viewer quality of experience since the VMAF scores exceed 95, and undershooting the target saves bandwidth with no visual quality detriment.

In this comparison, bitrate savings are limited, particularly at CRF 19 and 21, because the capped CRF encodes of the hard-to-encode content have a higher bitrate than their CBR counterparts (4,419 and 4,092 kbps versus 3,889 kbps). Not surprisingly, CRF 19 and 21 deliver little bandwidth savings and slightly higher quality than CBR.

At CRF 23, things get interesting, with an overall bandwidth savings of 16.1% and a negligible quality delta from CBR. With a VMAF score of around 95, CRF 23 might be the target for engineers delivering premium content. Engineers targeting slightly lower quality can choose CRF 27 and achieve a bitrate savings of 43% and an efficient 2.4 Mbps bitrate for hard-to-encode footage. At CRF 27, Quadra VPUs encoded the hard-to-encode Football clip at 3,999 kbps with an impressive VMAF score of 93.39.
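
For clarity, the savings figures quoted above are simple percent reductions of the capped CRF bitrate relative to the CBR bitrate. Here is the arithmetic as a small sketch, using the hard-to-encode numbers cited above from Table 1.

```python
# Percent bandwidth savings of capped CRF relative to CBR.
def savings_pct(cbr_kbps: float, capped_crf_kbps: float) -> float:
    return (cbr_kbps - capped_crf_kbps) / cbr_kbps * 100.0

# Hard-to-encode numbers cited above: at CRF 19 and 21 the capped CRF bitrate
# exceeds the 3,889 kbps CBR figure, so the "savings" come out negative.
print(round(savings_pct(3889, 4419), 1))   # -13.6
print(round(savings_pct(3889, 4092), 1))   # -5.2
```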

Note that as with H.264 and HEVC, AV1 capped CRF does reduce throughput. Specifically, a single Quadra VPU installed in a 32-core workstation outputs 23 simultaneous streams using CBR encoding. This dropped to 18 for capped CRF, a reduction of roughly 22%.

4K RESULTS

Many engineers encoding with AV1 are delivering UHD content, so we ran similar tests with the Quadra and 4K30 8-bit content, using a CBR target and bitrate cap of 16 Mbps. We used four clips, ranging from a 4K version of the high-motion Football clip to much less dynamic content like Netflix’s Meridian and the Blender Foundation’s Sintel.

Table 2. 4K results for the Quadra VPU and capped CRF encoding.

In CBR mode, the Quadra VPU hit the bitrate target much more accurately at 4K than at 1080p, so even at CRF 19, the VPU delivered a 13% bitrate savings with a VMAF score of 96.23. Again, CRF 23 delivered a VMAF score very close to 95, with 45% savings over CBR. Impressively, at CRF 23, Quadra delivered an overall VMAF score of 94.87 for these 4K clips at 7.78 Mbps, and that’s with the Football clip weighing in at 14.3 Mbps.

Of course, these savings directly relate to the cap and CBR target. It’s certainly fair to argue that 16 Mbps is excessive for 4K AV1-encoded content, though Apple recommends 16.8 Mbps for 8-bit 4K content with HEVC here.

The point is, when you encode with CBR, you’re limiting quality to control bandwidth costs. With capped CRF, you can set the cap higher than your CBR target, knowing that all content contains easy-to-encode regions that will balance out the impact of the higher cap and deliver similar or lower bandwidth costs. With these comparative settings, capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR.

DENSER / LEANER / GREENER: Symposium on Building Your Own Streaming Cloud

Vindral’s CDN Against Dinosaurs’ Agreement

“One thing is the bill that you’re getting; the other thing is the bill we’re leaving to our children…”

We’re going to talk about Vindral – but first, tell us a little bit about RealSprint?

RealSprint is a Swedish company based in Northern Sweden, which is kind of a great place to be running a tech company. We’re in a university town, and any time after September it gets dark outside for most of the day, which means people generally try to find things to do inside. So, it’s a good place to have a tech business because you’ll have people spending a lot of time in front of their screens, creating things. RealSprint is a heavily culture-focused team, with the majority located in Northern Sweden and a few based in Stockholm and in the U.S.

The company started around 10 years ago as a really small team that did not have the end game figured out yet.  All they knew was that they wanted to do something around video, broadcasting, and streaming. From there it’s grown, and today we’re 30 people.

At a high level, what is Vindral?

Vindral is actually a product family. There is a live CDN, as you mentioned, and there’s also a video compositing software. As for the live CDN, it’s been running 24/7 for around five or six years now.

The product was born because we got questions from our clients about latency and quality. ‘Why do I have to choose if I want low latency or if I want high quality’. There are solutions on both ends of that spectrum, but when we got introduced to the problem, there weren’t really any good ones. We started looking into real-time technologies, like webRTC, in its current state and quickly found that it’s not really suitable if you want high quality. It’s amazing in terms of latency. But the client’s reality requires more. You can’t go all in on only one aspect of a solution. You need something that’s balanced.

Draw us a block diagram. So, you’ve got your encoder, you’ve got your CDN, you’ve got software…

We can take a typical client in entertainment or gaming. So, they have their content, and they want to broadcast that to a global audience. What they generally do is ingest one signal to our endpoint, which is the most standard way of using our CDN. And there are several ways of ingesting, across multiple transfer protocols.

The first thing that happens on our end is we create the ABR ladder. We transcode all the qualities that are needed since network conditions vary  between  markets. Even in places that are well connected, the home Wi-Fi alone can be so bad at times, with a lot of jitter and latency.

After the ABR ladder is created, the next box fans out to the places in the world where there are potential viewers. And from there, we also have edge software as one part of this. Lastly, the signal is received by the player instanced on the device.

That’s basically it.
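
To visualize the flow just described, here is a purely illustrative sketch, not Vindral’s actual code or configuration, of a single ingest being transcoded into an ABR ladder and fanned out to edge regions. The renditions and region names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rendition:
    width: int
    height: int
    bitrate_kbps: int

# Hypothetical ladder; real ladders depend on the channel, codec, and market.
ABR_LADDER = [
    Rendition(3840, 2160, 12000),
    Rendition(1920, 1080, 4500),
    Rendition(1280, 720, 2500),
    Rendition(640, 360, 600),    # for constrained last-mile markets
]

EDGE_REGIONS = ["eu-north", "us-east", "asia-east"]  # hypothetical fan-out targets

def fan_out(ladder, regions):
    """Pair every rendition with every edge region it should be pushed to."""
    return [(rendition, region) for rendition in ladder for region in regions]

for rendition, region in fan_out(ABR_LADDER, EDGE_REGIONS):
    print(f"push {rendition.height}p @ {rendition.bitrate_kbps} kbps -> {region}")
```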

You’ve got an encoder in the middle of things creating the encoding ladder. Then you’ve got the CDN distributing. What about the software that you’ve contributed? How does that work? Do I log into some kind of portal and then administrate through there?

Exactly. Take a typical client in gaming, for example. They’re running 50 or 100 channels. And they want to see what’s going on in their operations, understand how much data is flowing through the system, and things like that. There is a portal where they can log in, see their usage, and see all of the channel information that they would need. It’s a very important part, of course, of any mature system that the client understands what’s going on.

Encoding is particularly important for us to solve because we have loads of channels running 24/7. So, that’s different. If you’re running a CDN, and your typical client is broadcasting for 20 minutes a month, then, of course, the encoding load is much lower. In our case, yes, we do have those types (minimal usage), but many of our clients are heavy users, and they own a lot of content rights. Therefore, the encoding part amounts to several hundred terabytes ingested monthly, and that’s just one quality for each stream on the ingest side.

You’re encoding ABR. Which codecs are you supporting? And which endpoints are you supporting?

So, codec-wise, everybody does H.264, of course. That’s the standard when it comes to live streaming with low latency. We have recently added AV1 as well, which was something we announced as a world first. We weren’t the world’s first with AV1, but we were the world’s first with AV1 at what many would call real-time. We call it low latency.

We chose to add it because there’s a market pointing to AV1.

Which devices are you targeting? Is it TV? Smart TV? Mobile? The whole gamut?

I would say the whole gamut. That list of devices is steadily growing. I’m trying to think of any devices that we don’t support. Essentially, as long as it’s using the internet, we deliver to it. Any desktop or mobile browser, including iOS as well.

iOS is, basically, the hardest one. If you’re delivering to iOS browsers, they’re all running iOS Safari, and we’re getting the same performance on iOS Safari. And then Apple TV, Google Chromecast, Samsung and LG TVs, and Android TVs. There’s a plethora of different devices that our clients require us to support.

4K? 1080p? HDR? SDR?

Yes, we support all of them. One of the very important things for us is to prove that you can get quality on low latency.

Take a typical client. They’re broadcasting sports, and their viewers are used to watching this on their television, maybe a 77-inch or 85-inch TV. You don’t want that user to get a 720p stream. This is where the configurable latency really comes into play, allowing the client to pick a second of latency, or 800 milliseconds, with 4K maintained at that latency. That is one of the use cases where we shine.

There’s also a huge market for lower qualities as well, where that’s important.

So, you mentioned ABR ladders, and yes, there are markets where you get 600 kilobits per second on the last mile. You need a solution for that as well.

Your system is the delivery side, the encoding side. Which types of encoders did you consider when you chose the encoder to fit into Vindral?

There are actually two steps to consider depending on whether we’re doing it on-prem or off, like a cloud solution. The client often has their own encoders. Many of our clients use Elemental or something similar just to push the material to us. But on the transcoding, where we generate the ladder, unless we’re passing all qualities through (which is also a possibility), there are, of course, different ways and different directions to go for different scenarios. For example, you could take an Intel CPU-based server and use software to encode. That is a viable option in some scenarios, but not in all.

There’s an Nvidia GPU, for example, which you could use in some scenarios since there are many factors coming into play when making that decision.

The highest priority of all is something that our business generally does badly: maintaining business viability. You want to make sure that any client that is using the system can pay and make their business work. Now, if we have channels that are running 24/7, as we do, and if it’s in a region where it’s not impossible to allocate bare metal or colocation space, then that is a fantastic option in many ways.

CPU-based, GPU-based, and ASIC-based encoding are all different, and those are the three options we’ve looked into.

So, how do you differentiate? You talked about software being a good option in some instances. When is it not a good option?

No option is good or bad in a sense, but if you compare them, both the GPU and the ASIC outperform the software encoding when it comes to heavier use.

The software option is useful when you need to spin it up, spin it down, and you need to move things. You need it to be flexible, which is usually the case in the lower-revenue parts of the market.

When it comes to the big broadcasters and the large rights holders, where the use case is heavier, with many channels and large usage over time, the GPU and especially the ASIC make a lot of sense.

You’re talking there about density. What is the quality picture?
A lot of people think software quality is going to be better than ASIC and GPUs. How do they compare?

It might be in some instances. We’ve found that the quality when using ASICs is fantastic. It all depends on what you want to do. Because we need to understand we’re talking about low latency here. We don’t have the option of multi-pass encoding or anything like that. Everything needs to work in real time. Our requirement on encoding is that it takes a frame to encode, and that’s all the time that you get.

You mentioned density, but there are a lot of other things coming into play, quality being one.

If you’re looking at ASICs, you’re comparing that to GPUs. In some scenarios we’ve had for the past two years, the decision could have been based on the availability factor – there’s a chip shortage. What can I get my hands on? In some cases,  we’ve had a client banging on the door, and they want to go live right away.

Going back to the density part. That is a huge game changer because the ASIC is unmatched in terms of the number of streams per rack unit. If you just measure that KPI, and you’re willing to do the job of building your CDN in co-location spaces, which not everybody is, then that’s it. You have to ask yourself, though, who’s going to manage this? You don’t want to bloat when you’re managing this type of solution. If you have thousands of channels running, then cost is one thing when it comes to not having to take up a lot of rack space, but also, you don’t want it to bloat too much.

How formal of an analysis did you make in choosing between the two hardware alternatives? Did you bring it down to cost per stream and power per stream?
Did you do any of that math? How did you make that decision between those two options?

Well, in a way, yes. But on that particular metric, we can look at the two options and say, well, this one is at a tenth of the cost. So I’m not going to give you the number, because I know it’s so much smaller.

We’re well aware of what costs are involved, but the cost per stream depends on profiles, etc. Just comparing them. We’ve, naturally, looked at things like started encoding streams, especially in AV1. We look at what the actual performance is, how much load there is, and what’s happening on the cards, and how much you can put on them before they start giving in… But then… there’s such a big difference…

Take, for example, a GPU. A great piece of hardware. But it’s also kind of like buying a car for the sound system. Because the GPU… If I’m buying an NVIDIA GPU to encode video, then I might not even be using the actual rendering capabilities. That is the biggest job that the GPU is typically built for. So, that’s one of the comparisons to make, of course.

What about the power side? How important is power consumption to either you yourself or your customers?

If you look at the energy crisis and how things are evolving, I’d say it is very, very important. The typical offer you’ll be getting from the data center is: we’re going to charge you 2x the electrical bill. And that’s never been something that’s actually been charged, because they didn’t even bother. Only now are we seeing the first invoices coming in where the electrical bill is part of it. In Germany, the energy price peaked in August at 0.7 Euros per kilowatt-hour.

Frankfurt, Germany, is one of the major exchanges that is extremely important. If you want performance streaming, you need to have something in Frankfurt.  There’s another part of it as well, which is, of course, the environmental aspect of it. One thing is the bill that you’re getting. The other thing is the bill we’re leaving to our children.

It’s kind of contradictory because many of our clients  make travel unnecessary. We have a Norwegian company that we’re working with that is doing remote inspections of ships. They were the first company in the world to do that. Instead of flying in an inspector, the ship owner, and two divers to the location, there’s only one operator of an underwater drone that is on the location. Everybody else is just connected. That’s obviously a good thing for the environment. But what are we doing?

Why did you decide to lead with AV1?

That’s a really good question. There are several reasons why we decided to lead with AV1. It is very compelling as soon as you can do it in real time. We had to wait for somebody to make it viable, which we found with NETINT’s ASIC.

Viable as in high quality, with latency and reliability that we could use, and also, of course, with throughput, so we don’t have to buy too much hardware to get it working.

We’re seeing markers that our clients are going to want AV1. And there are several reasons why that is the case. One of which is, of course, it’s license free. If you’re a content owner, especially if you’re a content owner with a large crowd with many subscribers to your content, that’s a game-changer. Because the cost of licensing a codec can grow to become a significant part of your business expenses.

Look at what’s happening with FAST, free ad-supported streaming television. There you’re trying to get even more viewers. And you have lower margins, so what you’re doing is creating eyeball minutes. And then, if you have codec and license costs, that’s a bit of an issue. It’s better if it’s free.

Is this what you’re hearing from your customers? Or is this what you’re assuming they’re thinking about?

That’s what we’re hearing from our customers, and that’s why we started implementing it.

For us, there’s also the bandwidth-to-quality aspect, which is great. I believe that it will explode in 2023. For example, if you look at what happened one month ago, Google made hardware decoding mandatory for Android 14 devices. That’s both phones and tablets. It opens so many possibilities.

We were not expecting to get business on it yet, but we are, and I’m happy about that. There are already clients reaching out because of the licensing aspect, as some of them are transmitting petabytes a month. If you can bring down the bandwidth while retaining the quality, that’s a good deal.

You mentioned before that your systems allow the user to dial in the latency and the quality. Could you explain how that works?

It’s important to make a distinction between the user and the broadcaster. Our client is the broadcaster that owns the content, and they can pick the latency.

Vindral’s live CDN doesn’t work on a ‘fetch your file’ basis. The way it works is we’re going to push the file to you, and you’re going to play it out, and this is how much you’re going to buffer. Once you have that set up, and, of course, a lot of sync algorithms and things like that at work, then the stream is not allowed to drift.

A typical use case is live auctions, for example. The typical setup for live auctions is 1080p, and you want below one second of latency because people are bidding. There are also people bidding in the actual auction house, so there’s the fairness aspect of it as well.

What we typically see is they configure maybe a 700-millisecond buffer, and it makes it possible. Even that small of a buffer makes such a huge difference. What we see in our metrics is that, basically, 99% of the viewers are getting the highest quality stream across all markets. That’s a huge deal.
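
As an aside, here is a tiny sketch, built on my own assumptions rather than Vindral’s implementation, of what a push-based player buffer with an operator-configured latency looks like: frames are held until they reach the configured age, so playback stays locked to the chosen delay instead of drifting. The class and constant names are hypothetical.

```python
import time
from collections import deque

TARGET_LATENCY_S = 0.7   # e.g. the 700-millisecond buffer mentioned above

class LatencyLockedBuffer:
    """Hypothetical playout buffer: release frames once they reach the target age."""
    def __init__(self, target_latency_s: float):
        self.target = target_latency_s
        self.frames = deque()            # (capture_timestamp, frame) pushed by the CDN

    def push(self, capture_ts: float, frame) -> None:
        self.frames.append((capture_ts, frame))

    def pop_due(self, now: float) -> list:
        # Release every frame whose capture time is at least `target` seconds old.
        due = []
        while self.frames and now - self.frames[0][0] >= self.target:
            due.append(self.frames.popleft()[1])
        return due

buf = LatencyLockedBuffer(TARGET_LATENCY_S)
buf.push(time.time(), "frame-0")
time.sleep(0.75)                         # wait past the configured latency
print(buf.pop_due(time.time()))          # ['frame-0']
```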

How much does the quality drop off? What’s the lowest latency you support, and how much does the quality drop off at that latency compared to one or two seconds?

I would say that the lowest we would recommend somebody to use our system for is 500 milliseconds. That would be about 250 milliseconds slower than a webRTC-based real-time solution. And why do I say that? Because below that, I see no reason to use our approach. If you don’t want a buffer, you may as well use something else.

Actually, we don’t have that many clients trying that out, because for most of them, 500 milliseconds is the lowest anybody sets. And they’ve been like, ‘this is so quick, we don’t need anything more.’ And it retains 4K at that latency.

How does the pitch work against webRTC?
If I’m a potential customer of yours and you come in and talk about your system compared to webRTC, what are the pros and cons of each? It’s an interesting technological decision. I know that webRTC is going to be potentially lower latency, but it might only be one stream, it may not come with captioning, and it’s not going to be ABR. It’s interesting to hear how you differentiate.

Let’s look from the perspective of when you should be using which. If you need to have a two-way voice conversation, you should use webRTC. There are actually studies that have been made proving that if you bring the latency up above 200 milliseconds, the conversation starts feeling awkward. If you have half a second, it is possible, but it’s not good. So, if that’s an ultimate requirement, then webRTC all day long.

Both technologies are actually very similar. The main difference I would point out is that we have added this buffer that the platform owner can set. So, the player instance sits at that buffer level. WebRTC currently does not support that. If it did, we might even implement it as an option, and it might go that way at some point. Today it’s not there.

On the topic of differences, then. If 700 or 600 milliseconds of latency is good for you and quality is still important, then you should be using a buffer and using our solution. When you’re considering different vendors, the feature set, and what you’re actually getting in the package, there are huge differences. For some vendors, on their lower-tier products, ABR is not included. Things like that. Where the obvious thing is – you should be using ABR. Definitely.

You talked about the shortest. What’s the longest latency you see people dialing in?

We’ve actually had one use case in Hong Kong where they chose to set the latency at 3.7 seconds. That was because the television broadcast was at 3.7 seconds.

That’s the other thing. We talk a lot about latency. Latency is a hot topic, but honestly, many of our clients value synchronization even above latency. Not all clients, but some of them.

If you have a game show where you want to react to the chat and have some sort of interactivity… Maybe you have 1.5 seconds. That’s not a big issue if it’s at 1.5 seconds of latency. You will, naturally, get a little bit more stability since you’re increasing the buffer. Some of our clients have chosen to do that.

But around 3.5… That’s actually the only client we’ve had that has done that. But I think there could be more in the future. Especially in sports. If you have the satellite broadcast… It is at seven seconds of latency. We can match it to within a few hundred milliseconds.

And the advantage of higher latency is going to be stream stability and quality.
Do you know what the quality difference is going to be?

Definitely. However, as soon as you’re above even one second, the returns are diminishing. It’s not like it unlocks this whole universe of opportunities. In extreme markets, it might, but I would think that if you’re going above two seconds, you’re kind of done. There is no need to go higher. At least our clients have not found that need. The markets are basically from East Asia to South America and South Africa, because we’ve expanded our CDN into those parts.

You’ve spoken a couple of times about where you install your equipment, and you’re talking about co-locating and things like that. What does your typical server look like? How many encoders are you putting in it? And what type of density are you expecting from that?

In general, it would be something like one server can do 10 times as many streams if you’re using the ASIC. Then if you’re using GPUs, like Nvidia, for example, it’s probably just the one. I’m not stating any numbers, because my IT guys are going to tell me that I was wrong.

What is the cost of low latency? If I decide to go to the smallest setting, what is that going to cost me? I guess there’s going to be a quality answer, and there’s going to be a stability answer… Is there a hard economic answer?

My hope is that there shouldn’t be a cost difference, depending on regions. The way we’ve chosen to operate is about the design paradigm of the product that you’ve created. We have competitors that are going with one partner. They’ve picked cloud vendor X, and they’re running everything in their cloud. And then what they can do is limited to the deal with that cloud vendor.

For example, we had an AV1 request from Greece. Huge egress for an internet TV channel that I was blown away by, and they mentioned their pricing. They wanted to save costs by cutting their traffic by using AV1. What we did with that request is we went out to our partners and vendors and asked them: can you help us match this? And we did. From a business perspective, it might, in some cases, cost more. But there is also a perception of high cost that plagues the low-latency business, and that is because many of these companies have not considered their power consumption or their form factors.

Actually, being willing to take a CAPEX investment instead of just running in the cloud and paying as you go is another. Those are things we’ve chosen to put the time into so that there will not be that big a difference.

Take, for example, Tata Communications, one of our biggest partners, and their pricing. They’re running our software stack in their environments to run their VDM, and it’s on a cost parity. So that’s something that should always be the aim. Then, I’m not going to say it’s always going to be like that, but that’s just a short version when you’re talking about the business implications.

We’re often getting requests where the potential client has this notion that it’s going to be a very high cost. Then they find that this makes sense, and we can build a business.

Are you seeing companies moving away from the cloud towards creating their own co-located servers with encoders and producing that way, as opposed to paying cents per minute to different cloud providers?

I would say I’m seeing the opposite. We’re doing both, just to be clear. I think the way to go is to do a hybrid.

For some clients, they’re going to be broadcasting 20 minutes a month. Cloud is awesome for that. You spin it up when you need it, and you kill it when it’s done. But that’s not always going to cut it. But if you’re asking me what motion I’m seeing in the market? There are more and more of these companies that are deploying across one cloud. And that’s where it resides. There are also types of offerings that you can instance yourself in third-party clouds, which is also an option. But again, it’s the design choice that it’s a cloud service that uses underlying cloud functions. It’s a shame that it’s not more of both. It creates an opportunity for us, though.

What are the big trends that you’re chasing for 2023 and beyond? What are you seeing? What forces are going to impact your business? The new features you’re going to be picking up? What are the big technology directions you’re seeing?

I mean, for us on our roadmap, we have been working hard on our partner strategy, and we’ve been seeing a higher demand for white-label solutions, which is what we’re working on with some partners.

We’ve done a few of those installs, and that’s where we are putting a lot of effort, because we’re running our own CDN, but we can also enable others to do it, even as a managed service. You have these telcos that maybe have had an edge offering, or less, from before, and they’re sitting on tons of equipment and fiber. So that’s one thing.

If we’re making predictions, there are two things worth a mention. I would expect the sports betting markets, especially in the US, to explode. That’s something we are definitely keeping our eyes on.

Maybe live shopping becomes a thing outside of China. Many of the big players, the big retailers, and even financial companies, are working on their own offerings and live shopping.

The dinosaurs’ agreement?

Have I told you about the dinosaurs’ agreement? It’s comparable to a gentleman’s agreement. This might be provocative to some. And I get that it’s complicated in many cases.

There is, among some of the bigger players and also among independent consultants that have different stakes, a sort of mutual agreement to keep asking the question – do we really need low latency? Or do we really need synchronization?

And while it’s a valid question, it’s also kind of a self-fulfilling prophecy. Because as long as the bigger brands are not creating the experience that the audience is waiting for them to create, nobody’s going to have to move. So that is what I’m calling the dinosaurs here. They’re holding on to the thing that they’ve always been doing. And they’re optimizing that, but not moving on to the next generation. And the problem they’re going to be facing, hopefully, is that when it reaches critical mass, the viewers are going to start expecting it, and that’s when things might start changing.

There are many workflow considerations, of course. There are tech legacy considerations. There are cost considerations and different aspects when it comes to scaling. However, saying that you don’t need low latency is a bit of an excuse.

Meta AV1 Delivery Presentation: Six Key Takeaways

One of the most gracious things that large companies like Meta and Netflix do is to share their knowledge with others in the community. On November 3, Meta hosted Video @Scale Fall 2022, which featured multiple speakers from Meta and other companies. If you’re unfamiliar with the event, here’s the description: “Designed for engineers that develop or manage large-scale video systems serving millions of people.”

Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

One talk drew my attention: Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta. Watch above or use this link: https://bit.ly/Lei_AV1

For perspective, where Netflix has focused AV1 distribution on Smart TVs, Meta’s focus is mobile. Briefly, the company started delivering “AV1-encoded FB/IG Reels videos to selected iPhone and Android devices” in 2022. Lei’s talk included encoding, decoding, and some observations about the bandwidth savings, improved MOS scores, and increased viewing time that AV1 delivered.

Here are my top 6 takeaways from Lei’s excellent presentation.

1. Meta Finds that AV1 is 30% More Efficient than HEVC/VP9

As you’ll learn later in this article, Meta relies upon software playback on iOS and Android platforms. Since both platforms support HEVC decoding, iOS in hardware (since 2017) and Android mostly in hardware but also in software, it’s reasonable to ask why Meta didn’t just use HEVC.

The answer is that in Meta’s own tests, they found that AV1 was 30% more efficient than both VP9 and HEVC, about 21% lower than the 38% higher efficiency that I found in this study by Streaming Media. Lei didn’t discuss HEVC in his presentation, but you’d have to guess that Meta chose AV1 over HEVC because the superior quality AV1 was able to deliver outweighed the potential impact of software playback on mobile device battery life.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

2. Meta Encodes with SVT-AV1 For Video On Demand (VOD)

The chart shown below tracks the encoding time and quality levels of the open-source codecs shown on the upper right, which include libaom-av1 (AV1), libvpx (VP9), x265 (HEVC), x264 (AVC), vvenc (VVC), and SVT-AV1 (AV1).

Here’s how Lei interpreted this data. “From this graph, we see that SVT-AV1 maintains a consistent performance across a wide range of complexity levels. No matter for an encoding efficiency or compute efficiency point of view, SVT-AV1 always achieves the most optimal results among open-source encoders.” Again, these results track my own findings, at least as it relates to SVT-AV1 as compared to Libaom.

Interestingly, the chart only tracks software encoders, not hardware, which present a completely different quality/encoding time curve. You’ll see why this is important at the end of this post.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

3. Meta Creates Their Encoding Ladder Using the Convex Hull

There are many forms of per-title encoding. Some, like YouTube’s, are based on machine learning, while others, like Netflix’s, are based on multiple encodes to find the convex hull. Since Meta’s encoding task is much closer to YouTube than Netflix (high-volume UGC), you might assume that Meta uses AI as well.

However, Meta actually uses the convex hull, a brute force technique that involves encoding at multiple resolutions and multiple bitrates to find the combination that comprises the convex hull for that video. In the example shown below, Meta encoded at seven resolutions and five CRF levels, a total of 35 encodes. To compute the convex hull, Meta plots the 35 data points and then draws a line connecting the points on the upper left boundary. The points on the convex hull are the optimal encoding configuration for that video.

As Lei points out, “the complexity of this process is quite high.” To reduce the complexity, Meta uses techniques like computing the convex hull with high-speed presets, and then encoding the selected resolution and CRF points using higher-quality presets for final delivery. Lei noted that though there are more encodes using this hybrid approach, as the optimal configurations are encoded twice, overall encoding time is reduced. 
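
To make the selection step concrete, here is a minimal sketch with made-up numbers. It keeps only the points on the upper-left boundary of the bitrate/quality plot, the Pareto-optimal staircase, which is a simplification of the full convex hull Meta computes and is not Meta’s actual code.

```python
def hull_points(encodes):
    """encodes: (bitrate_kbps, quality, label) triples from the trial encodes.
    Returns the upper-left boundary: no kept point is matched or beaten on
    quality by a cheaper encode."""
    kept, best_quality = [], float("-inf")
    for bitrate, quality, label in sorted(encodes):      # cheapest first
        if quality > best_quality:
            kept.append((bitrate, quality, label))
            best_quality = quality
    return kept

# Hypothetical (bitrate, VMAF, "resolution@CRF") trial encodes.
trials = [
    (900,  78.0, "540p@40"),
    (1400, 84.5, "720p@40"),
    (1500, 83.0, "540p@30"),   # dominated: costs more than 720p@40, scores lower
    (2600, 91.0, "1080p@35"),
    (4200, 95.5, "1080p@27"),
]
for point in hull_points(trials):
    print(point)
```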

Just to state the obvious, this approach only works for video on demand, not live. Even with the fastest hardware encoders, you can’t produce 35 iterations to identify the optimal five. This indicates that Meta uses a different schema for live transcoding, which Lei doesn’t address.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

4. Meta Uses the Convex Hull Computed for AVC for VP9 and AV1

Like most large publishers, Meta encodes using multiple codecs like H.264, VP9, and AV1 to deliver to different devices. One surprising revelation was that Meta uses the convex hull computed for H.264 to guide the convex hull implementations for the VP9 and AV1 encodes.

Lei didn’t explain how this works – as you can see in the figure below, the resolutions and bitrates for the three codecs are obviously different, and that’s what you would expect. So, there must be some kind of interpolation of the convex hull information from one codec to another. But you see that VP9 delivers a 48% bitrate savings over the top H.264 ladder rung, while AV1 delivers 65%.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

5. Apple and Android Phones Present Completely Different Challenges

Again, no surprise. There are many fewer Apple devices, and all are premium high-performance models. In contrast, there’s a much greater range of Android devices, from low-cost/low-performance options to models that rival Apple in cost and performance.

Lei shared that Facebook tests Android devices to determine eligibility for AV1 videos. As you can see in the slide below, Meta delivers much different quality to iOS and Android devices.

It was clear from Lei’s talk that delivering AV1 to Apple phones was relatively simple compared to sending AV1 video to Android phones. This is actually the reverse of what you might expect, as iOS doesn’t support AV1 natively while Android does. Though you can deliver video via an app to iOS devices, as Meta does, Safari doesn’t support it. And even though Android does support AV1 playback natively, you’ll have to implement some type of testing protocol—like Meta—to ensure smooth playback until AV1 hardware support becomes pervasive, which probably won’t be until 2024 or beyond.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

6. AV1 has Delivered in Several Key Metrics

Integrating a new codec into your encoding and delivery pipeline isn’t trivial. So, the big question is, was AV1 worth it? The slide below displays three graphs. Sorry that the quality in the original slide is suboptimal, but here’s the net/net.

The graph on the top left shows the week-over-week playback MOS on all videos played on an iPhone. It shows about a 0.6 MOS point improvement. Since MOS (Mean Opinion Score) is usually computed on a scale from 1 to 5, 0.6 is a significant number. The second graph, on the upper right, is the bitrate of all videos delivered, and it shows about a 12% bitrate reduction.

The bottom chart presents the average iPhone watch time for the different codecs used in Facebook Reels and shows that AV1 watch time went up to about 70% within the first week after rollout. This doesn’t seem to mean that AV1 increased watch time; rather, it seems to show that a significant number of devices were able to play AV1, which is how AV1 delivered the MOS improvement and bitrate reductions shown in the top two charts.

SLIDE FROM Meta’s Ryan Lei speaking on Scaling AV1 End-To-End Delivery at Meta.

Lei’s talk was about 18 minutes long, and there’s a lot more useful data and observations than I’ve presented here. Again, here’s the link – https://bit.ly/Lei_AV1. If you’re considering deploying AV1 for VOD encoding in your organization, you’ll find the encoding-related portions of Lei’s talk illuminating.

What about live? Lei didn’t address it, but you can take some guidance from the fact that Meta recently announced their own Video Processing ASIC. After the announcement, David Ronca, Director, Video Encoding at Meta, commented that “ASICs are able to deliver video quality on par with SW encoders with significantly improved power efficiency. Because of the rapid commoditization of video processing, rising energy costs, and pollution concerns, Video Processing ASICS are inevitable.”

At NETINT, we’ve been shipping transcoders based upon custom encoding ASICs since 2019 and have real market validation of Ronca’s comments. While software encoding may be appropriate for VOD, ASIC-based transcoders are superior, if not essential, for live transcoding.

Back to Lei’s talk: whether you’re distributing VOD or live AV1 streams, Lei’s descriptions of the challenges of AV1 delivery to mobile will be instructive to all.
