AV1 Capped CRF Encoding with Quadra VPU


We’ve previously reported results for capped CRF encoding for H.264 and HEVC using NETINT Quadra video processing units (VPU). This post will detail AV1 performance, including both 1080p and 4K data.

For those with limited time, here’s what you need to know: Capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR. NETINT VPUs are the first hardware video encoders to adopt Capped CRF across the three most popular codecs in use today, AV1, HEVC, and H.264.

A quick description of capped CRF and a deep dive with H.264 and HEVC performance results are available in our earlier posts.

CAPPED CRF OVERVIEW

Briefly, capped CRF is a smart bitrate control technique that combines the benefits of CRF encoding with a bitrate cap. Unlike variable bitrate encoding (VBR) and constant bitrate encoding (CBR), which target specific bitrates, capped CRF targets a specific quality level, which is controlled by the CRF value. You also set a bitrate cap, which is applied if the encoder can’t meet the quality level below the bitrate cap.

On easy-to-encode videos, the CRF value sets the quality level, which it can usually achieve below the bitrate cap. In these cases, capped CRF typically delivers bitrate savings over CBR-encoded footage while delivering similar quality. For harder-to-encode footage, the bitrate cap usually controls, and capped CRF delivers close to the same quality and bitrate as CBR.
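To make the two controls concrete, here is a minimal sketch of how a capped CRF encode might be expressed with FFmpeg. It assumes the open source libx264 software encoder as a stand-in, where `-crf` sets the quality target and `-maxrate`/`-bufsize` impose the cap. The Quadra hardware encoder has its own codec name and parameters, which this post doesn't list, and the file names and the 2x buffer size are placeholders.

```python
def capped_crf_args(src, dst, crf, cap_kbps):
    """Build an FFmpeg argument list for a software capped CRF encode.

    Assumption: libx264 stands in for the hardware encoder.
    """
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-crf", str(crf),                # quality target
        "-maxrate", f"{cap_kbps}k",      # bitrate cap
        "-bufsize", f"{2 * cap_kbps}k",  # VBV buffer; 2x the cap is an assumed choice
        dst,
    ]

# Placeholder file names; 4,500 kbps matches the 1080p cap used in this post.
print(" ".join(capped_crf_args("input.mp4", "output.mp4", crf=23, cap_kbps=4500)))
```

Raising the CRF value lowers the quality target, so the encoder reaches the cap less often and the average bitrate falls.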

The value proposition is clear: lower bitrates and good quality during easy scenes, and similar to CBR in bitrate and quality for harder scenes. I’m not addressing VBR because NETINT’s focus is live streaming, where CBR usage dominates. If you’re analyzing capped CRF for VOD, you would compare against 2-pass VBR as well as potentially CBR.

One last detail. CRF values have an inverse relationship to quality and bitrate; the higher the CRF value, the lower the quality and bitrate. In general, video engineers select a CRF value that delivers their target quality level. For premium content, you might target an average VMAF score of 95. For user-generated content or training videos, you might target 93 or even lower. As you’ll see, the lower the quality score, the greater the bandwidth savings.

1080p RESULTS

We show 1080p results in Table 1, which is divided between easy-to-encode and hard-to-encode content. We encoded the CBR clips to 4.5 Mbps and applied the same cap for capped CRF encoding.

Table 1. 1080p results using Quadra VPU and capped CRF encoding.

You can see that in CBR mode, Quadra VPUs do not hit the target rate as accurately as in capped CRF mode. This won’t degrade viewer quality of experience since the VMAF scores exceed 95, so undershooting the target saves bandwidth with no visual quality penalty.

In this comparison, bitrate savings are limited, particularly at CRF 19 and 21, because the capped CRF clips in the hard-to-encode content have a higher bitrate than their CBR counterparts (4,419 and 4,092 kbps versus 3,889 kbps). Not surprisingly, CRF 19 and 21 deliver little bandwidth savings and slightly higher quality than CBR.

At CRF 23, things get interesting, with an overall bandwidth savings of 16.1% with a negligible quality delta from CBR. With a VMAF score of around 95, CRF 23 might be the target for engineers delivering premium content. Engineers targeting slightly lower quality can choose CRF 27 and achieve a bitrate savings of 43%, and an efficient 2.4 Mbps bit rate for hard-to-encode footage. At CRF 27, Quadra VPUs encoded the hard-to-encode Football clip at 3,999 kbps with an impressive VMAF score of 93.39.

Note that as with H.264 and HEVC, AV1 capped CRF does reduce throughput. Specifically, a single Quadra VPU installed in a 32-core workstation outputs 23 simultaneous streams using CBR encoding. This dropped to 18 for capped CRF, a reduction of 22%.
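The throughput reduction quoted above is a simple percentage change. A small sketch of the arithmetic, using the stream counts from the text (the function name is ours):

```python
def percent_reduction(baseline, reduced):
    """Percentage drop from baseline to reduced."""
    return (baseline - reduced) / baseline * 100

# 23 simultaneous streams with CBR versus 18 with capped CRF
print(f"{percent_reduction(23, 18):.1f}%")  # 21.7%, which rounds to 22%
```

The same function applies to the bitrate savings figures when you plug in the CBR and capped CRF bitrates from the tables.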

4K RESULTS

Many engineers encoding with AV1 are delivering UHD content, so we ran similar tests with the Quadra and 4K30 8-bit content with a CBR target and bitrate cap of 16 Mbps. We used four clips, ranging from a 4K version of the high-motion Football clip to much less dynamic content like Netflix’s Meridian clip and Blender Foundation’s Sintel.

Table 2. 4K results for the Quadra VPU and capped CRF encoding.

In CBR mode, the Quadra VPU hit the bitrate target much more accurately at 4K than 1080p, so even at CRF 19, the VPU delivered a 13% bitrate savings with a VMAF score of 96.23. Again, CRF 23 delivered a VMAF score very close to 95, with 45% savings over CBR. Impressively, at CRF 23, Quadra delivered an overall VMAF score of 94.87 for these 4K clips at 7.78 Mbps, and that’s with the Football clip weighing in at 14.3 Mbps.

Of course, these savings directly relate to the cap and CBR target. It’s certainly fair to argue that 16 Mbps is excessive for 4K AV1-encoded content, though Apple recommends 16.8 Mbps for 8-bit 4K content with HEVC.

The point is, when you encode with CBR, you’re limiting quality to control bandwidth costs. With capped CRF, you can set the cap higher than your CBR target, knowing that all content contains easy-to-encode regions that will balance out the impact of the higher cap and deliver similar or lower bandwidth costs. With these comparative settings, capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR.


Norsk and NETINT: Elevating Live Streaming Efficiency

With the growing demand for high-quality viewing experiences and the heightened attention on cost efficiency and environmental impact, hardware acceleration plays an ever-more-crucial role in live streaming. Here at NETINT, we want users to take full advantage of our transcoding hardware, so we’re pleased to announce that id3as NORSK now offers exceptionally efficient support for NETINT’s T408 and Quadra video processing unit (VPU) modules.


Using NETINT VPUs, users can leverage the Norsk low-code live streaming SDK to achieve higher throughput and greater efficiency compared to running software on CPUs in on-prem or cloud configurations. Combined with Norsk’s proven high-availability track record, this makes it easy to deliver exceptional services with maximum reliability and performance at a never-before-available OPEX.

NORSK AND NETINT

Norsk also takes advantage of Quadra’s hardware acceleration and onboard scaling to achieve complex compositions like picture-in-picture and resizing directly on the card. Even better, Norsk’s built-in ability to “do the right thing” also means that it knows when it can take advantage of hardware acceleration and when it can’t.  

 

For example, if you’re running Norsk on the T408, decoding will take place on the card, but Norsk will automatically utilize the host CPU for functions like picture-in-picture and resizing that the T408 doesn’t natively support, before returning the enriched media to the card for encoding. (Scaling and resizing functions are native to Quadra VPUs, so they are performed onboard without the host CPU.)

 

“As founding members of Greening of Streaming, we’re keenly aware of the pressing need to focus on energy efficiency at every point of the video stack,” says Norsk CEO Adrian Roe. “By utilizing the Quadra and T408 VPU modules, users can reduce energy usage while achieving maximum performance even on compute-intensive tasks. With Norsk seamlessly running on NETINT hardware, live streaming services can consume as little energy as possible while delivering a fantastic experience to their customers.” 


“Id3as has proven expertise in helping its customers produce polished, high-volume, compelling productions, and as a product, Norsk makes that expertise widely accessible,” commented Alex Liu, NETINT founder and COO. “With Norsk’s deep integration with our T408 and Quadra products, this partnership makes NETINT’s proven ASIC-based technology available to any video engineer seeking to create high-quality productions at scale.” 


Both Norsk and NETINT will be at IBC in Amsterdam, September 15-18. You can request a meeting with Norsk or NETINT, or visit NETINT at booth 5.A86.


Unveiling the Quadra Server: The Epitome of Power and Scalability

The Quadra Server review by Jan Ozer from NETINT Technologies

Streaming engineers face constant pressure to produce more streams at a lower cost per stream and reduced power consumption. However, those considering new transcoding technologies need a solution that integrates with their existing workflows while delivering the quality and flexibility of software with the cost efficiency of ASIC-based hardware.

If this sounds like you, the US $21,000 NETINT Quadra Video Server could be the ideal solution. Combining the Supermicro 1114S-WN10RT server, powered by an AMD EPYC 7543P CPU, with ten NETINT Quadra T1U Video Processing Units (VPUs), it is a powerhouse. The server outputs H.264, HEVC, and AV1 streams at normal or low latency, and you can control operation via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional FFmpeg-based software or GPU-based encoding stack.

As you’ll see below, the 1RU form factor server can output up to 20 8Kp30 streams, 80 4Kp30 streams, up to 320 1080p30 streams, and 640 720p30 streams for live and interactive video streaming applications. For ABR production, the server can output over 120 encoding ladders in H.264, HEVC, and AV1 formats. This unparalleled density enables all video engineers to greatly expand capacity while shrinking the number of required servers and the associated power bills.
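To put those density figures in perspective, the sketch below divides the quoted server-level stream counts across the ten VPUs and converts them to total pixel throughput. The frame dimensions are our assumption (7680x4320 for 8K, 3840x2160 for 4K), and the even split across VPUs is a simplification.

```python
# Stream counts quoted above for the ten-VPU server, all at 30 fps.
# Frame dimensions are assumed standard sizes, not taken from the review.
SERVER_STREAMS = {
    "8K": (7680, 4320, 20),
    "4K": (3840, 2160, 80),
    "1080p": (1920, 1080, 320),
    "720p": (1280, 720, 640),
}
VPUS = 10

def per_vpu(label):
    """Streams per VPU, assuming an even split across the ten cards."""
    return SERVER_STREAMS[label][2] // VPUS

def megapixels_per_second(label, fps=30):
    """Total pixel rate across all streams at this resolution."""
    width, height, streams = SERVER_STREAMS[label]
    return width * height * streams * fps / 1e6

for label in SERVER_STREAMS:
    print(f"{label}: {per_vpu(label)} per VPU, "
          f"{megapixels_per_second(label):,.0f} megapixels/s total")
```

Under these assumed frame sizes, the 8K, 4K, and 1080p figures work out to the same total pixel rate, with the 720p figure somewhat lower.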

I’ll start this review with a technical description of the server and transcoding hardware. Then we’ll review some performance results for one-to-one streaming and H.264, HEVC, and AV1 ladder generation and finish with a look at the server’s AI-based features and output.

Figure 1. The Quadra Video Server powered by the Codensity G5 ASIC.

Hardware Specs - The Quadra Server

The NETINT Quadra Video Server uses the Supermicro 1114S-WN10RT server platform with a 32-core AMD EPYC 7543P CPU running Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD drive, and has three PCIe slots and ten NVMe slots that house the Quadra T1U VPUs. NETINT also offers the server with two other CPUs, the 64-core AMD EPYC 7713P processor ($24,000) for more demanding applications, and the economical 8-core AMD EPYC 7232P processor ($19,000) for pure transcoding applications that may not require a 32-core CPU.

Supermicro* is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro* adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards, and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient server to house the NETINT Quadra VPUs.

*We are the leading server and storage vendor that designs, develops, and manufactures the majority of our development in the United States – at our headquarters in San Jose, Calif. Our Quality Management System is certified to ISO 9001:2015 and ISO 13485:2016 standards and our Environmental Management System is certified to ISO 14001:2015 standard. In addition to that, the Supermicro Information Security Managemen

SOURCE: https://www.supermicro.com/en/about

Hardware Specs – Quadra VPUs

The Quadra T1U VPUs are powered by the NETINT Codensity G5 ASIC and packaged in a U.2 form factor that plugs into the NVMe slots in the server and communicates via the ultra-high bandwidth PCIe 4.0 bus. Quadra VPUs can decode H.264, HEVC, and VP9 inputs and encode into the H.264, HEVC, and AV1 standards.

Beyond transcoding, Quadra VPUs house 2D processing engines that can crop, pad, and scale video, and perform video overlay, YUV and RGB conversion, reducing the load on the host CPU to increase overall throughput. These engines can perform xStack operations in hardware, making the server ideal for conferencing and security applications that combine multiple feeds into a multi-pane output mosaic window.

Each Quadra T1U in the server includes a 15 TOPS Deep Neural Network Inference Engine that can support models trained with all major deep learning frameworks, including Caffe, TensorFlow, TensorFlow Lite, Keras, Darknet, PyTorch, and ONNX. NETINT supplies several reference models, including a facial detection model that uses region of interest encoding to improve facial quality on security and other highly compressed streams. Another model provides background removal for conferencing applications.

Operational Overview

We tested the server with FFmpeg and GStreamer. Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the Quadra NVME interface and the FFmpeg/GStreamer software layers. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

Figure 2. The software architecture for controlling the server.

To allocate jobs to the ten Quadra T1U VPUs, the Quadra device driver software includes a resource management module that tracks Quadra capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available VPUs.

Alternatively, you can manually assign decoding and encoding tasks to different Quadra VPUs in the command line or application and even control which streams are decoded by the host CPU or a Quadra. With these and similar controls, you can most efficiently balance the overall transcoding load between the Quadra and host CPU and maximize throughput. We used auto distribution for all tests.
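As a rough mental model of automatic distribution, the sketch below places each new job on the least-loaded VPU that still has room. This is an illustration under our own assumptions, not the Quadra resource manager’s actual algorithm or API: the `Vpu` class, the `assign_auto` function, and the fractional job costs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Vpu:
    index: int
    load: float = 0.0  # fraction of capacity in use, 0.0 to 1.0 (hypothetical metric)
    jobs: list = field(default_factory=list)

def assign_auto(vpus, job, cost):
    """Place a job on the least-loaded VPU that still has room for it."""
    candidates = [v for v in vpus if v.load + cost <= 1.0]
    if not candidates:
        raise RuntimeError("no VPU has capacity for this job")
    target = min(candidates, key=lambda v: v.load)
    target.load += cost
    target.jobs.append(job)
    return target.index

# Ten VPUs, as in the server; each job is assumed to use a quarter of one VPU.
vpus = [Vpu(i) for i in range(10)]
placements = [assign_auto(vpus, f"stream-{n}", cost=0.25) for n in range(12)]
print(placements)
```

Manual assignment would simply bypass the selection step and name the target VPU directly.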

We tested with FFmpeg v 5.2.3 and GStreamer version 1.18 (with FFmpeg v 4.3.1), using Quadra release 3.2.0. As you’ll see, we weren’t able to complete all tests in all modes with both software programs, so we present the results we were able to complete.

In all tests, we configured the Quadra VPUs for maximum throughput as opposed to maximum quality. You can read about the configuration options and their impact on output quality and performance in Benchmarking Hardware Transcoder Performance. While quality will relate to each video and encoding configuration, the configuration used should produce quality at least equal to the veryfast x264 and x265 presets, with quality up to the slow presets available in configurations that optimize quality over throughput.

We tested multiple facets of system performance. The first series of tests involved a single stream in and single stream out, either at the same resolution as the incoming stream or scaled down and output at a lower resolution. Many applications use this mode of operation, including gaming, gambling, and auctions.

The second use case is ABR distribution, where a single input stream is transcoded to a full encoding ladder. Here we supplemented the results with software-only transcodes for comparison purposes. To assess AI-related throughput, we tested region-of-interest transcoding and background removal.

In most modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

Same-Resolution Transcoding

Table 1 shows transcoding results for 8K, 4K, 1080p, and 720p in latency-tolerant and low-delay modes. Each number represents the count of full frame rate outputs produced by the system in each configuration.

These results are most relevant for interactive gambling and similar applications that input a single stream, transcode the stream at full resolution, and stream it out. You see that 8K streaming is not available in the AV1 format and that H.264 and HEVC are not available in low latency mode with either program. Interestingly, FFmpeg outperformed GStreamer at this resolution while the reverse was true at 1080p.

4K and 720p results were consistent for all input and output codecs and for normal and low delay modes. All output numbers are impressive, but the 640 720p streams for AV1, H.264, or HEVC is remarkable density for a 1RU rack server.

At 1080p there are minor output differences between normal and low-delay mode and the different codecs, though the codec-related differences aren’t that substantial. Interestingly, HEVC throughput is slightly higher than H.264, with AV1 about 16% behind HEVC.

Table 1. Same resolution transcoding results.

Table 2 shows a collection of maximum data points (worst case) from the transcoding results presented in Table 1. As you can see, both Max CPU and power consumption track upwards with the number of streams produced. Max latency (decode plus encode) in normal latency mode tracks downward with the stream resolution, becoming quite modest at 720p. Max latency (decode plus encode) in low-delay mode for both decoding and encoding starts and stays under 30.9 milliseconds, which is less than a single frame.
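The "less than a single frame" comparison follows from the frame duration. A quick check, assuming 30 fps content to match the p30 stream counts quoted earlier in this review:

```python
def frame_duration_ms(fps):
    """Duration of one frame in milliseconds."""
    return 1000 / fps

FPS = 30                   # assumed frame rate
worst_low_delay_ms = 30.9  # maximum low-delay latency cited above

print(f"one frame at {FPS} fps: {frame_duration_ms(FPS):.1f} ms")
print("under one frame:", worst_low_delay_ms < frame_duration_ms(FPS))
```

At 60 fps a frame lasts about 16.7 ms, so the same latency would span roughly two frames.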

Table 2. Maximum CPU, power consumption, and latency data for pure transcoding.

Comparing FFmpeg and GStreamer, the latter proved more CPU and power efficient than the former in both normal and low-delay modes. For example, in all tests, GStreamer’s CPU utilization was less than half that of FFmpeg, though the power consumption delta was generally under 20%.

At 8K and 4K resolutions, the latency reported was about even between the two programs, but at the lower resolutions in low-delay mode, GStreamer’s latency was often half that of FFmpeg. You can see an example of these two observations in Figure 3, reporting 720p HEVC input and output as HEVC. Though the throughput was identical, GStreamer used much less energy and produced much lower latency. As you’ll see in the next section, this dynamic stayed true in transcoding with scaling tests, making GStreamer the superior app for applications involving same-resolution transcoding and transcoding with scaling.

Figure 3. GStreamer was much more CPU and power efficient and delivered substantially lower latency than FFmpeg in these same resolution transcode tests.

Transcoding and Scaling

Table 3 shows transcoding while scaling results, first 8K input to 4K output, then 4K to 1080p, and lastly 1080p to 720p. If you compare Table 3 with Table 1, you’ll see that performance tracks the input resolution, not output, which makes sense because decoding is a separate operation that obviously involves its own hardware limits.

Table 3. Transcoding while scaling results.

As the Quadra VPUs perform scaling on-board, there was no drop in throughput with the scaling related tests; rather, there was a slight increase in 8K > 4K and 4K > 1080p outputs over the same resolution transcoding reported in Table 1. In terms of throughput, the results were consistent between the codecs and software programs.

Table 4 shows the max CPU and power usage for all the transcodes in Table 3, which increased somewhat from the low-quantity high-resolution transcodes to the high-quantity low-resolution transcodes but was well within the performance envelope for this 32-core server.

The Max latency for all normal encodes was relatively consistent between five and six frames. With low delay engaged, 8K > 4K latency didn’t drop that significantly, though you’d assume that 8K to 4K transcodes are uncommon. Latency dropped to below a single frame in the two lower resolution transcodes.

Table 4. Maximum CPU, power consumption, and latency data for transcoding while scaling.

Comparing FFmpeg and GStreamer, we saw the same dynamic as with full resolution transcodes; in most tests, GStreamer consumed significantly less power and produced sharply lower latency. You can see an example of this in Figure 4, reporting the results of 1080p incoming HEVC output to AV1 at 720p.

Figure 4. GStreamer was much more CPU and power efficient and delivered much lower latency than FFmpeg in these scale-then-transcode tests.

Encoding Ladder Testing

Table 5 shows the results of full ladder testing with CPU, latency, and power consumption embedded in the output instances. Note that we tested a five-rung ladder for H.264, and four-rung ladders for HEVC and AV1. We didn’t test 4K H.264 output because few services would deploy this configuration. Also, we didn’t test with GStreamer because NETINT’s current GStreamer implementation can’t use Quadra’s internal scalers when producing more than a single file, an issue that the NETINT engineering team will resolve soon. Also, as you can see, low-delay mode wasn’t available for 4K testing.

With this fine print behind us, throughput was as impressive as in the single-file testing. The ability to deliver up to 140 HEVC 4-rung ladders from a single 1RU rack, in either normal or low-latency mode, is remarkable.

Table 5: Encoding ladder throughput. 

For comparison purposes, we produced the equivalent encoding ladders on the same server using software-only encoding with FFmpeg and the x264, x265, and SVT-AV1 codecs. To match the throughput settings used for Quadra, we used the ultrafast preset for x264 and x265, and preset eleven for SVT-AV1. You see the results in Table 6. 

Note that these numbers overstate software-based output since no engineer would produce a live stream with CPU utilization over 60 to 65%, because a sudden spike in CPU usage would crash all the streams. Not only is CPU utilization much lower for the Quadra-driven encodes, minimizing the risk of exceeding CPU capacity, but Quadra-based transcoding is also much more deterministic than CPU-based transcoding, so CPU requirements don’t typically change midstream.

All that said, Quadra proved much more efficient than software-based encoding for all codecs, particularly HEVC and AV1. In Table 6, the Multiple column shows the number of servers required to produce the same output as the Quadra server, plus the power consumed by all these servers. For H.264, you would need six servers instead of a single Quadra server to produce the 120 instances, and power costs would be nearly six times higher. That’s running each server at 98.3% CPU utilization. Running at a more reasonable 60% utilization would translate to ten servers and 4,287 watts.
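The jump from six servers to ten can be reproduced with a little arithmetic. This sketch assumes output scales linearly with CPU utilization, which is our simplification; the function name is ours as well.

```python
import math

def servers_at_utilization(servers_at_peak, peak_util_pct, target_util_pct):
    """Servers needed for the same total work at a lower per-server utilization.

    Assumes throughput scales linearly with CPU utilization.
    """
    total_work = servers_at_peak * peak_util_pct
    return math.ceil(total_work / target_util_pct)

# Six software-only servers at 98.3% utilization, re-provisioned at 60%
print(servers_at_utilization(6, 98.3, 60))  # 10
```

Rounding up matters here: 9.83 servers of work still requires ten physical machines.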

Table 6. Ladders, CPU utilization, and power consumed for CPU-only transcoding.

Even without factoring in the 60% CPU-utilization limits, the comparison reaches untenable levels with HEVC and AV1. As the data shows, CPU-based transcoding simply can’t keep up with these more complex codecs, while the ASIC-driven Quadra remains relatively consistent. 

AI-Related Functions

The next two tables benchmark AI-related functions, first region of interest encoding, then background removal. Briefly, region of interest encoding uses AI to search for faces in a stream and then increases the bits assigned to those faces to increase quality. This is useful in surveillance videos or any low-bitrate video environment where facial quality is important. 
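Conceptually, region of interest encoding can be pictured as a grid of per-block quantizer offsets, with lower values (more bits) wherever a face was detected. The sketch below is a generic illustration under our own assumptions, not Quadra’s implementation: the 64-pixel block size, the offset of -6, and the face box are hypothetical, and the face detection model itself isn’t shown.

```python
def roi_qp_offsets(width, height, faces, block=64, face_offset=-6):
    """Return a grid of QP offsets, one per block; negative means more bits.

    faces is a list of (x, y, w, h) boxes in pixels, as a detector might report.
    """
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    grid = [[0] * cols for _ in range(rows)]
    for x, y, w, h in faces:
        last_row = min(rows, (y + h - 1) // block + 1)
        last_col = min(cols, (x + w - 1) // block + 1)
        for r in range(y // block, last_row):
            for c in range(x // block, last_col):
                grid[r][c] = face_offset
    return grid

# One hypothetical 128x128 face in a 1080p frame
grid = roi_qp_offsets(1920, 1080, faces=[(640, 256, 128, 128)])
boosted = sum(row.count(-6) for row in grid)
print(f"{boosted} of {len(grid) * len(grid[0])} blocks get extra bits")
```

Because only a handful of blocks change, the encoder can raise facial quality without a large increase in overall bitrate.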

We tested 1080p AVC input and output with FFmpeg only, and the system delivered sixty outputs in both normal and low-delay modes, with very modest CPU utilization and power consumption. For more on Quadra’s AI-related functions, and for an example of the region of interest filter, see an Introduction to AI Processing on Quadra.

Table 7. Throughput for Region of Interest transcoding via Artificial Intelligence.

Table 8 shows 1080p input/output using the AVC codec with background removal, which is useful in conferencing and other applications to composite participants in a virtual environment (see Figure 2). This task involves considerably more CPU but delivers slightly greater throughput.

Table 8. Throughput for background removal and transcoding via Artificial Intelligence.

As you can read about in the Introduction to AI Processing on Quadra, Quadra comes with these and other AI-based applications and can deploy AI-based models developed in most machine learning programs. Over time, AI-based operations will become increasingly integral to video transcoding functions, and the Quadra Video Server provides a future-proof platform for that integration.

Figure 4. Compositing participants in a virtual environment with background removal.

Conclusion

While there’s a compelling case for ASIC-based transcoding solely for H.264 production, these tests show that as applications migrate to more complex codecs like HEVC and AV1, CPU-based transcoding is untenable economically and for the environment. Beyond pure transcoding functionality, if there’s anything that the ChatGPT-era has proven, it’s that AI-based transcoding-related functions will become mainstream much sooner than anyone might have thought. With highly efficient ASIC-based transcoding hardware and AI engines, the Quadra Video Server checks all the boxes for a server to strongly consider for all high-volume live streaming applications. 

What Can a VPU Do for You?

For Cloud-Gaming, a VPU can deliver 200 simultaneous 720p30 game sessions from a single 2RU server.

When you encode using a Video Processing Unit (VPU) rather than the built-in GPU encoder, you will decrease your cost per concurrent user (CCU) by 90%, enabling profitability at a much lower subscription price. How is this technically feasible? Two technology enablers make this possible. First, extraordinarily capable encoding hardware, known as a VPU (video processing unit), dedicated to the task of high-quality video encoding and processing. And second, peer-to-peer direct memory access (DMA) that enables video frames to be delivered at the speed of memory compared to the much slower NVMe bus between the GPU and VPU. Let’s discuss these in reverse order.

Peer-to-Peer Direct Memory Access (DMA)

Within a cloud gaming architecture, the primary role of the GPU is to render frames from the game engine output. These frames are then encoded into a standard codec that is easily decoded on a wide cross section of devices. Generally, this is H.264 or HEVC, though AV1 is becoming of interest to those with a broader Android user base. Encoding on the GPU is efficient from a data transfer standpoint because the rendering and encoding occur on the same silicon die; there’s no transfer of the rendered YUV frame to a separate transcoder over the slower PCIe or NVMe buses. However, since encoding requires substantial GPU resources, this dramatically reduces the overall throughput of the system. Interestingly, it’s the encoder that is often at full capacity and thus the bottleneck, not the rendering engine. Modern GPUs are built for general-purpose graphical operations, so more real estate is devoted to these than to video encoding.

By installing a dedicated video encoder in the system and using traditional data transfer techniques, the host CPU can easily manage the transfer of the YUV frame from the GPU to the transcoder. But as the number of concurrent game sessions increases, the probability of dropped frames or corrupted data makes this technique unusable.

NETINT, working with AMD, enabled peer-to-peer direct memory access (DMA) to overcome this problem. DMA is a technology that enables devices within a system to exchange data in memory. It allows the GPU to send frames directly to the VPU, which keeps the bus from becoming clogged as the concurrent session count increases above 48 720p streams.
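To get a feel for the data volumes involved, the sketch below estimates the raw frame traffic between the GPU and the encoder. It assumes uncompressed 8-bit 4:2:0 frames (1.5 bytes per pixel) at 720p30; the actual pixel format used in the reference system isn’t stated here.

```python
def raw_yuv420_bytes_per_second(width, height, fps, sessions):
    """Raw frame traffic for a number of concurrent sessions.

    Assumes 8-bit 4:2:0 frames, which take 1.5 bytes per pixel.
    """
    bytes_per_frame = width * height * 3 // 2
    return bytes_per_frame * fps * sessions

for sessions in (48, 200):
    gb_per_s = raw_yuv420_bytes_per_second(1280, 720, 30, sessions) / 1e9
    print(f"{sessions} sessions: {gb_per_s:.2f} GB/s")
```

If frames are staged in host memory on the way from the GPU to the encoder, each one is copied twice, which is the overhead that a direct GPU-to-VPU transfer avoids.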


The Benefits of Peer-to-Peer DMA

Peer-to-peer DMA delivers multiple benefits. First, by eliminating the need for CPU involvement in data transfers, peer-to-peer DMA significantly reduces latency, which translates to a more responsive and immersive gaming experience for end-users. NETINT VPUs feature latencies as low as 8ms in fully loaded and sustained operation.

In addition, peer-to-peer DMA relieves the CPU of the burden of managing inter-device data transfers. This frees up valuable CPU cycles, allowing the CPU to focus on other critical tasks, such as game logic and physics calculations, optimizing overall system performance and producing a smoother gaming experience.

By leveraging peer-to-peer communications, data can be transferred at greater speeds and efficiency than CPU-managed transfers. This improves productivity and scalability for cloud gaming production workflows.

These factors combine to produce higher throughput without the need for additional costly resources. This cost-effectiveness translates to improved return on investment (ROI) and a major competitive advantage.

Extraordinarily Capable VPUs

Peer-to-peer DMA has no value if the encoding hardware isn’t equally capable. With NETINT VPUs, that isn’t a concern.

The reference system that produces 200 720p30 cloud gaming sessions is built on the Supermicro AS-2015CS-TNR server platform with a single GPU and two Quadra T2A VPUs. This server supports AV1, HEVC, and H.264 video game streaming at up to 8K and 60fps, though as may be predicted, the simultaneous stream counts will be reduced as you increase framerate or resolution.

Quadra T2A is the most capable of the Quadra VPU line, the world’s first dedicated hardware to support AV1. With its embedded AI and 2D engines, the Quadra T2A can support AI-enhanced video encoding, region of interest, and content-adaptive encoding. Coupled with a P2P DMA-enabled GPU, Quadra T2A allows cloud gaming providers to achieve unprecedented throughput with ultra-low latency.

Quadra T2A is an AIC (HH HL) form-factor video processing unit with two Codensity G5 ASICs that operates in x86 or Arm-based servers requiring just 40 watts at maximum load. It enables cloud gaming platforms to transition from software or GPU-only based encoding with up to a 40x reduction in the total cost of ownership.

What Can A VPU Do For You?

It makes cloud gaming profitable, finally.

Peer-to-peer DMA is a game-changing technology that reduces latency and increases system throughput. When paired with an extraordinarily capable VPU like the NETINT Quadra T2A, now you can deliver an immersive gaming experience at a CCU that cannot be matched by any competing architecture.

Video Transcoder vs. Video Processing Unit (VPU)

When choosing a product for live stream processing, half the battle is knowing what to search for. Do you want a live transcoder, a video processing unit (VPU), a video coding unit (VCU), Scalable Video Processor (SVP) or something else? If you’re not quite sure what these terms mean and how they relate, this short article will educate you in four minutes or less.  

In the Beginning, There Were Transcoders

Simply stated, a transcoder is any technology, software or hardware, that can input a compressed stream (decode) and output a compressed stream (encode). FFmpeg is a transcoder, and it works fine for most low-volume video-on-demand applications.

For live applications, particularly high-volume live interactive applications (think Twitch), you’ll probably need a hardware transcoder to achieve the necessary cost per stream (CAPEX), operating cost per stream, and density.

For example, the NETINT Video Transcoding Server, a single 1RU server with ten NETINT T408 Video Transcoders, can deliver up to 80 H.264/HEVC 1080p30 streams while drawing under 250 watts. Performed in software using only the CPU, this same output could take up to ten separate 1RU servers, each drawing well over 250 watts.
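The comparison above can be reduced to watts per stream. The sketch below uses only the figures quoted in the paragraph (80 streams, 250 watts, ten software servers), so the hardware result is an upper bound and the software result a lower bound. The function name is illustrative, not part of any NETINT tool.

```python
def watts_per_stream(total_watts: float, streams: int) -> float:
    """Return average power draw per output stream."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return total_watts / streams


STREAMS = 80  # 1080p30 outputs quoted in the text

# One 1RU server with ten T408s, drawing under 250 W (upper bound).
hardware = watts_per_stream(250, STREAMS)

# Up to ten CPU-only 1RU servers, each drawing over 250 W (lower bound).
software = watts_per_stream(10 * 250, STREAMS)

print(f"ASIC server: under {hardware:.3f} W per stream")
print(f"CPU-only:    over {software:.2f} W per stream")
print(f"Ratio:       at least {software / hardware:.0f}x")
```

Because both figures are bounds, the real gap would be at least as large as the ratio printed here whenever the full ten software servers are needed.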

The NETINT T408 Video Transcoder.

Speaking of the T408, if Webster’s defined a transcoder (it doesn’t), it might use a picture of the T408 as the perfect example. Based on custom transcoding ASICs, the T408 is inexpensive ($400), capable (4K @ 60 FPS or 4x 1080p60 streams), flexible (H.264 and HEVC), and exceptionally efficient (only 7 watts).

What doesn’t the T408 do? Well, that leads us to the difference between a transcoder and a VPU.

The difference between a transcoder and a Video Processing Unit (VPU)

First, the T408 doesn’t scale video. If you’re building a full encoding ladder from a high-resolution source, all the scaling for the lower rungs is performed by the host CPU. In addition, the T408 doesn’t perform overlay in hardware. So, if you insert a logo or other bug over your videos, again, the CPU does the heavy lifting.

Finally, the T408 was launched in 2019, the first ASIC-based transcoder to ship in quite a long time. So, it’s not surprising that it doesn’t incorporate any artificial intelligence processing capabilities.

What is a Video Processing Unit (VPU)?

What’s a Video Processing Unit? A hardware device that does all that extra stuff: scaling, overlay, and AI. You see this in the transcoding pipeline shown below, which is for the NETINT Quadra.

Video Transcoder versus Video Processing Unit aka VPU - diagram 1

When it came to labeling the Quadra, you see the problem: it does much more than a video transcoder. Not only does it outperform the T408 by a factor of four, it adds AV1 output and all the additional hardware functionality. It’s much more than a simple video transcoder; it’s a video processing unit (VPU).

As much as we’d like to lay claim to the acronym, it actually existed before we applied it to the Quadra. It’s not surprising. It follows the terminology for CPU (central processing unit) and GPU (graphics processing unit). And, if Webster’s defined VPU (it doesn’t)... Oh, you get the point. Here’s the required Quadra glamour shot.

The NETINT Quadra Video Processing Unit.

VCUs and M(SVP)

While NETINT was busy developing ASIC-based transcoders and VPUs for the mass market, large video publishers like YouTube and Meta produced their own ASICs to achieve similar benefits (and produce more acronyms). In 2021, when Google shipped their own ASIC-based transcoder called Argos, they labeled it a Video Coding Unit, or VCU.

Like the T408 and Quadra, the benefits of this ASIC-based technology are profound; as reported by CNET, “Argos handles video 20 to 33 times more efficiently than conventional servers when you factor in the cost to design and build the chip, employ it in Google’s data centers, and pay YouTube’s colossal electricity and network usage bills.” Interestingly, despite YouTube’s heavy usage of the AV1 codec, Argos encodes only H.264 and VP9, not AV1.

In May 2023, Meta released their own ASIC, which, like Argos, outputs H.264 and VP9, but not AV1. Called the Meta Scalable Video Processor (MSVP), the unit delivered impressive results, including “a throughput gain of ~9x for H.264 when compared against libx264 SW encoding…[and] a throughput gain of ~50x when compared with libVPX speed 2 preset.” Meta also noted that the unit drew only 10 watts of power, which is skimpy but also about 43% higher than the T408.

Of course, neither Google nor Meta sells its ASIC to third parties, so if you want the CAPEX and OPEX efficiencies that ASIC-based VPUs deliver, you’ll have to buy from NETINT. The bottom line is that whether you call it a transcoder, VPU, VCU, or MSVP, you’ll get the highest throughput and lowest power consumption if it’s powered by an ASIC.

HARD QUESTIONS ON HOT TOPICS:
ASIC-based Video Transcoder versus Video Processing Unit (VPU)
Watch the full conversation on YouTube: https://youtu.be/iO7ApppgJAg

World’s First AV1 Live Streaming CDN powered by VPUs

RealSprint’s vision for Vindral, its live-streaming CDN, is to deliver the quality of HLS and the latency of WebRTC. Early trials revealed that CPU-only transcoding lacked scalability, and GPUs used excessive power and proved challenging to configure.

Implementing NETINT’s ASIC-based Quadra delivered the required quality and latency in a low-power, simple-to-configure package with H.264, HEVC, and AV1 output. As a result, Quadra became a “preferred component” of the Vindral setup.

The RealSprint Story

RealSprint is a tech company founded in 2013 and based in Umeå, Sweden. Since its inception, RealSprint has delivered industry-defining solutions that drive real business value. Its flagship solution, Vindral live CDN, combines ultra-low latency streaming with 4K support, sync, and absolute stability. The latest addition, Composer, streamlines the setup for live video compositing, effects, and encoding.

In explaining RealSprint’s goals to Streaming Media Magazine, RealSprint CEO Daniel Alinder stated that part of the company’s goal is “to disrupt, spur innovation, and ensure high-end streaming experiences.” This focus, and RealSprint’s painstaking execution, has brought customers like Sotheby’s, Hong Kong Jockey Club, and IcelandAir into RealSprint’s client roster.

Figure 1. Check out this Vindral demo at https://demo.vindral.com/?4k

Finding the Ideal Transcoder for Vindral

The Vindral live CDN is transforming the landscape for live streaming, offering high-quality streaming at low latency and synchronized playout. As a result, Vindral is highly optimized for verticals such as live sports, iGaming, live auctions, and entertainment markets with a desired latency of around one second and where stability is imperative, even at high video quality.

Alinder explains, “It is, of course, possible to configure for 0.5-second latency as well, but none of our clients has chosen to go that low. More common focus areas are image quality and synchronized playout. A game show with host-crowd interaction does not require real-time latency. Keeping all viewers in sync, around 1 second, while maintaining full-HD quality is a common request that we see.”

Elaborating on Alinder’s comments, Niclas Åström, founder and Chief Product Officer at RealSprint, adds, “we call it the Sweet Spot. Vindral is built to put clients in charge of their own sweet spot in terms of buffer and quality. While we are highly impressed by technologies such as WebRTC, we aim to pave the way for a new mainstream in which latency is only one of the parameters.”

Expanding upon Vindral’s target use cases, Alinder details, “A typical use case is live auctions. The usual setup for live auctions is 1080P, and you want below one second of latency because people are bidding online. There are also people bidding in the actual auction house, so there’s the fairness aspect of it as well.”

“Clients typically configure around a 700-millisecond buffer, and even that small of a buffer makes such a huge difference in quality and reliability. What we see in our metrics is that, basically, 99% of the viewers watch the highest quality stream across all markets. That’s a huge deal.”

HARD QUESTIONS ON HOT TOPICS:
World’s first AV1 live streaming CDN powered by NETINT’s Quadra VPU
Watch on YouTube: https://youtu.be/Qhe6wuJoOX0

Exploring Transcoder Options

To provide this flexible latency, Vindral depends upon a transcoder to produce the streams with minimal latency, and a vendor-agnostic hybrid content delivery network (CDN) to deliver the streams. To explain, the transcoder inputs the incoming stream from the live source and produces multiple outputs to deliver to viewers watching on different devices and connections.

Choosing the transcoder is obviously a critical decision for Vindral and RealSprint. When exploring its transcoder options, RealSprint considered multiple criteria, including cost per stream, power, output quality, format support, latency, and density.

According to CTO Per Mafrost, “We started using only CPUs but quickly concluded that we needed better scalability. We moved on to using GPUs, but the hardware setups got a bit more troublesome and more energy-demanding. A year back, we got in touch with NETINT to test their ASICs and were pleased with our findings.”

Figure 2. The NETINT Quadra T2 VPU.

“We’ve found that the quality when using ASICs is fantastic.”

RealSprint CEO Daniel Alinder

Quadra Fills the Gap

Specifically, Vindral implemented NETINT’s Quadra Video Processing Unit (VPU), which is driven by the Codensity G5 ASIC (Application-Specific Integrated Circuit). In terms of transcoding, Quadra inputs H.264, HEVC, and VP9 video and outputs H.264, HEVC, and AV1, all at sub-frame latencies, which translate to under one frame interval, or about 0.033 seconds for a 30-fps input stream. Quadra is called a VPU rather than a transcoder because, in addition to audio and video transcoding, it also offers onboard scaling and overlay and houses two Deep Neural Network engines capable of 18 Trillion Operations per Second (TOPS).

According to Alinder, Quadra delivers both top quality and the necessary low latency. “We’ve found that the quality when using ASICs is fantastic. It’s all depending on what you want to do. Because we need to understand we’re talking about low latency here. Everything needs to work in real time. Our requirement on encoding is that it takes a frame to encode, and that’s all the time that you get.”
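The one-frame encoding budget Alinder describes follows directly from the frame rate. A minimal sketch of that arithmetic is below; the function name is illustrative, and the 60 fps case is added only for comparison.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the duration of one frame in milliseconds.

    An encoder that must finish within a single frame has at most
    this long to process each picture.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for fps in (30, 60):
    print(f"{fps} fps: {frame_budget_ms(fps):.1f} ms per frame")
```

Doubling the frame rate halves the time available per frame, which is why real-time encoding gets harder as frame rates rise.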

Quadra’s AV1 output was another key consideration. As Alinder explained, “we’re seeing markers that our clients are going to want AV1. And there are several reasons why that is the case. One of which is, of course, it’s license free. If you’re a content owner, especially if you’re a content owner with a large crowd with many subscribers to your content, that’s a game-changer. Because the cost of licensing a codec can grow to become a significant part of your business expenses.”

“That is a huge game changer because ASICs are unmatched in terms of the number of streams per rack unit.”

RealSprint CEO Daniel Alinder

Density and Power Consumption

Density refers to the number of streams a device or server can output. Because ASICs are purpose-built for video transcoding, they’re extremely efficient transcoders that combine maximum density with very low power consumption. Speaking to Quadra’s density, Alinder commented, “That is a huge game changer because ASICs are unmatched in terms of the number of streams per rack unit.”

Of course, power consumption is also critical, particularly in Europe. As Alinder detailed, “If you look at the energy crisis and how things are evolving, I’d say [power consumption] is very, very important. The typical offer you’ll be getting from the data center is: we’re going to charge you 2x the electrical bill. In Germany, the energy price peaked in August 2022 at 0.7 Euros per kilowatt hour.”
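To put Alinder’s numbers in perspective, the sketch below estimates a yearly electricity charge from the quoted peak price (0.7 Euros per kilowatt hour) and the 2x data center markup. The 250-watt continuous load is a hypothetical figure chosen for illustration, not a measurement of any RealSprint or NETINT system.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_energy_cost_eur(load_watts: float,
                           price_eur_per_kwh: float,
                           markup: float = 1.0) -> float:
    """Estimate the yearly charge for a constant electrical load."""
    if load_watts < 0 or price_eur_per_kwh < 0 or markup < 0:
        raise ValueError("inputs must be non-negative")
    kwh_per_year = load_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * price_eur_per_kwh * markup


# Hypothetical 250 W server, quoted peak price, quoted 2x markup.
cost = annual_energy_cost_eur(load_watts=250,
                              price_eur_per_kwh=0.7,
                              markup=2.0)
print(f"About {cost:.0f} EUR per year")
```

The cost scales linearly with load, so every watt saved per server is multiplied by both the markup and the number of servers deployed.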

To be clear, in some instances, Vindral can reduce energy use and carbon emissions by making travel unnecessary. As Alinder explained, “We have a Norwegian company that we’re working with that is doing remote inspections of ships. They were the first company in the world to do that. Instead of flying in an inspector, the ship owner, and two divers to the location, there’s only one operator of an underwater drone that is on the location. Everybody else is just connected. That’s obviously a good thing for the environment.”

“Another seldom mentioned topic set NETINT ASICs apart from CPUs and many GPUs: linear load. Specifically, it was relatively easy to create a solution where we could feel safe when calculating the load and expected capacity for transcoder nodes. The density, cost/stream, and quality are bonuses.”

RealSprint CTO Per Mafrost

Linear Load

One final characteristic that set Quadra apart was a predictable “linear load” pattern. As described by CTO Mafrost, “in choosing between different alternatives, the usual suspects such as cost, power, quality, and density were our main criteria. But another seldom mentioned topic set NETINT ASICs apart from CPUs and many GPUs: linear load. Specifically, it was relatively easy to create a solution where we could feel safe when calculating the load and expected capacity for transcoder nodes. The density, cost/stream, and quality are bonuses.”
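A linear load pattern makes capacity planning simple arithmetic. The sketch below assumes each stream consumes an equal share of a transcoder node; the function name, the per-node capacity, and the headroom percentage are all hypothetical values for illustration, not RealSprint’s actual planning figures.

```python
def nodes_needed(target_streams: int,
                 streams_per_node: int,
                 headroom_pct: int = 20) -> int:
    """Return the node count for a target load under a linear model.

    headroom_pct reserves a share of each node's rated capacity so
    nodes are not planned to run at 100 percent.
    """
    if target_streams < 0 or streams_per_node <= 0:
        raise ValueError("invalid stream counts")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    usable = streams_per_node * (100 - headroom_pct) // 100
    if usable == 0:
        raise ValueError("headroom leaves no usable capacity")
    # Ceiling division using integers only.
    return -(-target_streams // usable)


# Hypothetical plan: 1,000 streams, 40 streams per node, 20% headroom.
print(nodes_needed(1000, 40, 20))
```

When load is not linear, as Mafrost suggests can happen with CPUs and many GPUs, the per-node capacity depends on the mix of streams, and a single division like this would no longer be a safe estimate.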

RealSprint began deploying NETINT Quadra VPUs in 2022. As Mafrost concluded, “Since then, ASICs have started to be a preferred component of our setup.”

Figure 3. NETINT Quadra has become a “preferred component” of Vindral.

The NETINT View

NETINT Technologies is an innovator of ASIC-based video processing solutions for low-latency video transcoding. Users of NETINT solutions realize a 10X increase in encoding density and a 20X reduction in carbon emissions compared to CPU-based software encoding solutions. NETINT makes it seamless to move from software to hardware-based video encoding so that hyper-scale services and platforms can unlock the full potential in their computing infrastructure.

Regarding Vindral’s use of Quadra, NETINT’s COO Alex Liu commented, “Live streaming video platforms demand more efficient and cost-effective video encoding solutions due to the emergence of new interactive video applications which can only be met with ASIC hardware encoding. Vindral, the industry’s first 4K AV1 streaming platform and powered with NETINT’s Quadra T2 real-time, low-latency 4K AV1 encoder, is a game changer. We are really excited about the amazing video experiences that Vindral users will bring to their customers as a result of this breakthrough in latency and quality.”

Figure 4. Streaming Media Magazine discussing Vindral with RealSprint CEO Daniel Alinder. https://youtu.be/xJ2Zfo2r7SM

The Industry Takes Notice

The potent combination of Vindral and Quadra has the industry taking notice. For example, in this Streaming Media interview, respected contributing editor Tim Siglin interviewed Alinder about Vindral, summarizing “the fact that [Quadra] is an ASIC that does more transcodes at a lower power consumption means that it gives you a better viability.” 

NETINT was the first company to ship AV1-based ASIC transcoders and has shipped tens of thousands of transcoders and VPUs, producing over 200 billion streams in 2022. In fact, NETINT has shipped more ASIC-based transcoders than any other supplier to the cloud gaming, broadcast, and similar live-streaming markets.

Validating NETINT’s approach, in 2021, Google launched their own ASIC-based transcoder, called Argos, as did Meta in 2023. Both products are used exclusively in-house by their respective companies.

The best way to leverage the benefits of encoding ASICs is to contact NETINT.