AV1 Capped CRF Encoding with Quadra VPU

We’ve previously reported results for capped CRF encoding for H.264 and HEVC using NETINT Quadra video processing units (VPU). This post will detail AV1 performance, including both 1080p and 4K data.

For those with limited time, here’s what you need to know: Capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR. NETINT VPUs are the first hardware video encoders to adopt Capped CRF across the three most popular codecs in use today, AV1, HEVC, and H.264.

You can read a quick description of capped CRF here and get a deep dive with H.264 and HEVC performance results here

CAPPED CRF OVERVIEW

Briefly, capped CRF is a smart bitrate control technique that combines the benefits of CRF encoding with a bitrate cap. Unlike variable bitrate encoding (VBR) and constant bitrate encoding (CBR), which target specific bitrates, capped CRF targets a specific quality level, which is controlled by the CRF value. You also set a bitrate cap, which is applied if the encoder can’t meet the quality level below the bitrate cap.

On easy-to-encode videos, the CRF value sets the quality level, which it can usually achieve below the bitrate cap. In these cases, capped CRF typically delivers bitrate savings over CBR-encoded footage while delivering similar quality. For harder-to-encode footage, the bitrate cap usually controls, and capped CRF delivers close to the same quality and bitrate as CBR.

The value proposition is clear: lower bitrates and good quality during easy scenes, and similar to CBR in bitrate and quality for harder scenes. I’m not addressing VBR because NETINT’s focus is live streaming, where CBR usage dominates. If you’re analyzing capped CRF for VOD, you would compare against 2-pass VBR as well as potentially CBR.

One last detail. CRF values have an inverse relationship to quality and bitrate; the higher the CRF value, the lower the quality and bitrate. In general, video engineers select a CRF value that delivers their target quality level. For premium content, you might target an average VMAF score of 95. For user-generated content or training videos, you might target 93 or even lower. As you’ll see, the lower the quality score, the greater the bandwidth savings.

1080p RESULTS

We show 1080p results in Table 1, which is divided between easy-to-encode and hard-to-encode content. We encoded the CBR clips to 4.5 Mbps and applied the same cap for capped CRF encoding.

Jan Ozer-AV1 Capped CRF-1
Table 1. 1080p results using Quadra VPU and capped CRF encoding.

You see that in CBR mode, Quadra VPUs do not reach the target rate as accurately as when using capped CRF mode. This won’t degrade viewer quality of experience since the VMAF scores exceed 95, so this missing on the low side saves excess bandwidth with no visual quality detriment.

In this comparison, bitrate savings is minimized, particularly at CRF 19 and 21, as the capped CRF clips in the hard-to-encode content have a higher bitrate than the CBR counterparts (4,419 and 4,092 to 3,889). Not surprisingly, CRF 19 and 21 deliver little bandwidth savings and a slighly higher quality than CBR.

At CRF 23, things get interesting, with an overall bandwidth savings of 16.1% with a negligible quality delta from CBR. With a VMAF score of around 95, CRF 23 might be the target for engineers delivering premium content. Engineers targeting slightly lower quality can choose CRF 27 and achieve a bitrate savings of 43%, and an efficient 2.4 Mbps bit rate for hard-to-encode footage. At CRF 27, Quadra VPUs encoded the hard-to-encode Football clip at 3,999 kbps with an impressive VMAF score of 93.39.

Note that as with H.264 and HEVC, AV1 capped CRF does reduce throughput. Specifically, a single Quadra VPU installed in a 32-core workstation outputs 23 simultaneous CBR streams using CBR encoding. This dropped to eighteen for capped CRF, a reduction of 22%.

4K RESULTS

Many engineers encoding with AV1 are delivering UHD content, so we ran similar tests with the Quadra and 4K30 8-bit content with a CBR target and bitrate cap of 16 Mbps. Using four clips, including a 4K version of the high-motion Football clip to much less dynamic content like Netflix’s Meridian clip and Blender Foundation’s Sintel.

Table 2. 4K results for the Quadra VPU and capped CRF encoding.

In CBR mode, the Quadra VPU hit the bitrate target much more accurately at 4K than 1080p, so even at CRF 19, the VPU delivered a 13% bitrate savings with a VMAF score of 96.23. Again, CRF 23 delivered a VMAF score of very close to 95, with 45% savings over CBR. Impressively, at CRF 23, Quadra delivered an overall VMAF score of 94.87 for these 4K clips at 7.78 Mbps, and that’s with the Football clip weighing in at 14.3 Mbps.

Of course, these savings directly relate to the cap and CBR target. It’s certainly fair to argue that 16 Mbps is excessive for 4K AV1-encoded content, though Apple recommends 16.8 for 8-bit 4K content with HEVC here.

The point is, when you encode with CBR, you’re limiting quality to control bandwidth costs. With capped CRF, you can set the cap higher than your CBR target, knowing that all content contains easy-to-encode regions that will balance out the impact of the higher cap and deliver similar or lower bandwidth costs. With these comparative settings, capped CRF delivers higher quality video during hard-to-encode regions than CBR, similar quality during all other scenes, and improved quality of experience at the same cost or lower than CBR.

DENSER / LEANER / GREENER : Symposium on Building Your Own Streaming Cloud

Norsk and NETINT: Elevating Live Streaming Efficiency

With the growing demand for high-quality viewing experiences and the heightened attention on cost efficiency and environmental impact,  hardware acceleration plays an ever-more-crucial role in live streaming.

Here at NETINT, we want users to take full advantage of our transcoding hardware, so we’re pleased to announce that id3as NORSK now offers exceptionally efficient support for NETINT’s T408 and Quadra video processing unit (VPU) modules.

Here at NETINT, we want users to take full advantage of our transcoding hardware, so we’re pleased to announce that id3as NORSK now offers exceptionally efficient support for NETINT’s T408 and Quadra video processing unit (VPU) modules.

Using NETINT VPU’s, users can leverage the Norsk low-code live streaming SDK to achieve higher throughput and greater efficiency compared to running software on CPUs in on-prem or cloud configurations. Combined with Norsk’s proven high-availability track record, this makes it easy to deliver exceptional services with maximum reliability and performance at a never-before-available OPEX. 

Norsk and NETINT.

Norsk also takes advantage of Quadra’s hardware acceleration and onboard scaling to achieve complex compositions like picture-in-picture and resizing directly on the card. Even better, Norsk’s built-in ability to “do the right thing” also means that it knows when it can take advantage of hardware acceleration and when it can’t.  

 

For example, if you’re running Norsk on the T408, decoding will take place on the card, but Norsk will automatically utilize the host CPU for functions like picture-in-picture and resizing that the T408 doesn’t natively support, before returning the enriched media to the card for encoding (Scaling and resizing functions are native to Quadra VPUs so are performed onboard without the host CPU). 

 

“As founding members of Greening of Streaming, we’re keenly aware of the pressing need to focus on energy efficiency at every point of the video stack,” says Norsk CEO Adrian Roe. “By utilizing the Quadra and T408 VPU modules, users can reduce energy usage while achieving maximum performance even on compute-intensive tasks. With Norsk seamlessly running on NETINT hardware, live streaming services can consume as little energy as possible while delivering a fantastic experience to their customers.” 

“By utilizing the Quadra and T408 VPU modules, users can reduce energy usage while achieving maximum performance even on compute-intensive tasks. With Norsk seamlessly running on NETINT hardware, live streaming services can consume as little energy as possible while delivering a fantastic experience to their customers.” 

– Norsk CEO Adrian Roe. 

“Id3as has proven expertise in helping its customers produce polished, high-volume, compelling productions, and as a product, Norsk makes that expertise widely accessible,” commented Alex Liu, NETINT founder and COO. “With Norsk’s deep integration with our T408 and Quadra products, this partnership makes NETINT’s proven ASIC-based technology available to any video engineer seeking to create high-quality productions at scale.” 

“With Norsk’s deep integration with our T408 and Quadra products, this partnership makes NETINT’s proven ASIC-based technology available to any video engineer seeking to create high-quality productions at scale.”  

– Alex Liu, NETINT founder and COO.

Both Norsk and NETINT will be at IBC in Amsterdam, September 15-18. Click to request a meeting with Norsk, or NETINT, and/or visit NETINT at booth 5.A86

ON-DEMAND: Adrian Roe - Make Live Easy with NORSK SDK

NETINT Buyer’s Guide. Choosing the Right VPU & Server for Your Workflow.

This guide is designed to help you choose the optimum NETINT Video Processing Unit (VPU) for your encoding workflow.

As an overview, note that all NETINT hardware products (VPUs and transcoders) run the same basic software controlled via FFmpeg and GStreamer patches or an SDK. This includes load balancing of all encoding resources in a server. In addition, both generations are similar in terms of latency and HDR support.

Question 1. Which ASIC Architecture: Codensity G4 (Logan) or Codensity G5 (Quadra)?

Tables 1 and 2 show the similarities and differences between Codensity G4 ASIC-powered products (T408 and T432) and Codensity G5-based products (Quadra T1U, T1A, T2A). Both architectures are available in either the U.2 or AIC form factor, the latter all half-height half-length (HHHL) configurations.

From a codec perspective, the main difference is that G5-based products support AV1 encoding and VP9 decoding. In terms of throughput, G5-based products offer four times the throughput but cost roughly three times more than G4, making the cost per output stream similar but with greater stream densities per host server. G5 power consumption is roughly 3x higher per ASIC than G4, but the throughput is 4x, making power consumption per stream lower.

Choosing the Right VPU & Server - Table 1. Codec support, throughput, and power consumption.
Table 1. Codec support, throughput, and power consumption.

Table 2 covers other hardware features. From an encoding perspective, G5-based products enable tuning of quality and throughput to match your applications, while quality and throughput are fixed for G4-based products. The G5’s quality ceiling is higher than the G4, at the cost of throughput, and the quality floor is lower, with an option for higher throughput.

G5-based products are much more capable hardware-wise, performing scaling, overlay, and audio compression and offer AI processing of 15 TOPS for T1U and 18 TOPS for T1A (36 TOPS T2A). In contrast, G4-based products scale, overlay, and encode audio via the host CPU and offer no AI processing. You can read about Quadra’s AI capability here.

Peer-to-peer DMA is a feature that allows G5-based products to communicate directly with some specific GPUs, which is particularly useful in cloud gaming. This is only available on G5-based products. Learn about peer-to-peer DMA here.

Note that G4 and G5-based devices can co-exist on the same server, so you can add G5 devices to a server with G4 devices already installed and vice versa.

Choosing the Right VPU & Server - Table 2 Advanced hardware functionality.
Table 2. Advanced hardware functionality.

Observations:

  • Codensity G4 and G5-based VPUs offer similar cost-per-stream, with Quadra slightly more efficient on a watts-per-stream basis. Both products transcode to H.264 and HEVC formats (G5 encodes to AV1 and decodes VP9).

  • Choose G4-based products for:
    • The absolute lowest overall cost
    • Compatibility with existing G4-based encoding stacks
    • Interactive same resolution-in/out productions (minimum scaling and overlay)

  • Choose G5-based products for:
    • AV1 output
    • AI integration
    • Applications that need quality and throughput tuning
    • Applications that involve scaling and overlay
    • Maximum throughput from a single server
    • Cloud gaming

Question 2: Which G4-based Product?

This section discusses your G4-based options shown in Figure 1, with the U.2-based T408 in the background and AIC-form factor T432 in the foreground. These products are designated as Transcoders since this is their primary hardware-based function.

Choosing the Right VPU & Server - Figure 1. The NETINT T408 in the back, T432 in the front.
Figure 1. The NETINT T408 in the back, T432 in the front.

Table 3 identifies the key differences between NETINT’s two G4-based VPUs, the T408, which includes a single G4 ASIC in a U.2 form factor, and the T432, which includes four G4 ASICS in an AIC half-height half-length configuration.

Choosing the Right VPU & Server - Table 3. NETINT’s two G4-based products.
Table 3. NETINT’s two G4-based products.

Observations:

  • The U.2-based T408 offers the best available density for installing units into a 1RU server.
  • The AIC-based T432 is the best option for computers without U.2 connections and for maximum server chassis density.

Question 3: Which G5-based Product?

Figure 2 identifies the three Quadra G5-based products, with the U.2-based T1U in the back, the AIC-based T1A in the middle, and the AIC-based T2A in the front. These products are designated Video Processing Units, or VPUs, because their hardware functionality extends far beyond simple transcoding.

Choosing the Right VPU & Server - Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.
Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.

Table 3 identifies the key differences between NETINT’s three G5-based VPUs:

  • The T1U includes a single G5 ASIC in a U.2 form factor.
  • The T1A includes a single G5 ASIC in an AIC half-height half-length configuration.
  • The T2A includes two G5 ASICs in an AIC half-height half-length configuration.
Choosing the Right VPU & Server - Table 4. NETINT’s two G4-based products.
Table 4. NETINT’s two G4-based products.

Observations:

  • The U.2-based Quadra T1U offers the best density for installing in a 1RU server.
  • The Quadra T2A offers the best density for AIC-based installation and is ideal for cloud gaming servers that need peer-to-peer DMA communication with GPUs.
  • The AIC-based Quadra T1A is the most affordable AIC option for installs that don’t need maximum density.

Question 4: VPU or Server?

NETINT offers two video servers that use the same Supermicro 1114S-WN10RT server chassis; the Logan Video Server contains ten T408 U.2 VPUs, while the Quadra Video Server contains ten Quadra T1U VPUs. Servers offer a turnkey option for fast and simple deployment.

An advantage of buying a NETINT Video Server is all components, including CPU, RAM, hard drive, OS, and software versions, have been extensively tested for compatibility, stability, and performance, making them the easiest and fastest way to transition from software to hardware encoding.

As for the choice between servers, your answer to question 1 should guide your selection.

If you have any questions about any products, please contact us here.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Unveiling the Quadra Server: The Epitome of Power and Scalability

The Quadra Server review by Jan Ozer from NETINT Technologies

Streaming engineers face constant pressure to produce more streams at a lower cost per stream and reduced power consumption. However, those considering new transcoding technologies need a solution that integrates with their existing workflows while delivering the quality and flexibility of software with the cost efficiency of ASIC-based hardware.

If this sounds like you, the US $21,000 NETINT Quadra Video Server could be the ideal solution. Combining the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered server hosting ten NETINT Quadra T1U Video Processing Units (VPUs), is a power house. The Quadra server outputs H.264, HEVC, and AV1 streams at normal or low latency, and you can control operation via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional FFmpeg-based software or GPU-based encoding stack.

As you’ll see below, the 1RU form factor server can output up to 20 8Kp30 streams, 80 4Kp30 streams, up to 320 1080p30 streams, and 640 720p30 streams for live and interactive video streaming applications. For ABR production, the server can output over 120 encoding ladders in H.264, HEVC, and AV1 formats. This unparalleled density enables all video engineers to greatly expand capacity while shrinking the number of required servers and the associated power bills.

I’ll start this review with a technical description of the server and transcoding hardware. Then we’ll review some performance results for one-to-one streaming and H.264, HEVC, and AV1 ladder generation and finish with a look at the Quadra server’s AI-based features and output.

The Quadra Server - Quadra video processing server powered by 10 Quadras, ASIC-based VPUs from NETINT
Figure 1. The Quadra Video Server powered by the Codensity G5 ASIC.

Hardware Specs - The Quadra Server

The NETINT Quadra Video Server uses the Supermicro 1114S-WN10RT server platform with a 32-core AMD EPYC 7543P CPU running Ubuntu 20.04.05 LTS. The Quadra server ships with 128 GB of DDR4-3200 RAM and a 400GB M.2 SSD drive with 3x PCIe slots and ten NVME slots that house the Quadra T1U VPUs. NETINT also offers the Quadra server with two other CPUs, the 64-core AMD EPYC 7713P processor ($24,000) for more demanding applications and the economical 8-core AMD EPYC 7232P processor ($19,000) for pure transcoding applications that may not require a 32-core CPU.

Supermicro* is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro* adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards, and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient server to house the NETINT Quadra VPUs.

*We are the leading server and storage vendor that designs, develops, and manufactures the majority of our development in the United States – at our headquarters in San Jose, Calif. Our Quality Management System is certified to ISO 9001:2015 and ISO 13485:2016 standards and our Environmental Management System is certified to ISO 14001:2015 standard. In addition to that, the Supermicro Information Security Managemen

SOURCE: https://www.supermicro.com/en/about

Hardware Specs – Quadra VPUs

The Quadra T1U VPUs are powered by the NETINT Codensity G5 ASIC and packaged in a U.2 form factor that plugs into the NVMe slots in the server and communicate via the ultra-high bandwidth PCIe 4.0 bus. Quadra VPUs can decode H.264, HEVC, and VP9 inputs and encode into the H.264, HEVC, and AV1 standards.

Beyond transcoding, Quadra VPUs house 2D processing engines that can crop, pad, and scale video, and perform video overlay, YUV and RGB conversion, reducing the load on the host CPU to increase overall throughput. These engines can perform xStack operations in hardware, making the Quadra server ideal for conferencing and security applications that combine multiple feeds into a multi-pane output mosaic window.

Each Quadra T1U in the Quadra server includes a 15 TOPS Deep Neural Network Inference Engine that can support models trained with all major deep learning frameworks, including Caffe, TensorFlow, TensorFlow Lite, Keras, Darknet, PyTorch, and ONNX. NETINT supplies several reference models, including a facial detection model that uses region of interest encoding to improve facial quality on security and other highly compressed streams. Another model provides background removal for conferencing applications.

Operational Overview

We tested the Quadra server with FFmpeg and GStreamer. Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the Quadra NVME interface and the FFmpeg/GStreamer software layers. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

Figure 2 - The Quadra Server - software architecture for controlling the Quadra Server
Figure 2. The software architecture for controlling the server.

To allocate jobs to the ten Quadra T1U VPUs, the Quadra device driver software includes a resource management module that tracks Quadra capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available VPUs.

Alternatively, you can manually assign decoding and encoding tasks to different Quadra VPUs in the command line or application and even control which streams are decoded by the host CPU or a Quadra. With these and similar controls, you can most efficiently balance the overall transcoding load between the Quadra and host CPU and maximize throughput. We used auto distribution for all tests.

We tested running FFmpeg v 5.2.3 and GStreamer version 1.18 (with FFmpeg v 4.3.1), and with Quadra release 3.2.0. As you’ll see, we weren’t able to complete all tests in all modes with both software programs, so we presented the results we were able to complete.

In all tests, we configured the Quadra VPUs for maximum throughput as opposed to maximum quality. You can read about the configuration options and their impact on output quality and performance in Benchmarking Hardware Transcoder Performance. While quality will relate to each video and encoding configuration, the configuration used should produce quality at least equal to the veryfast x264 and x265 presets, with quality up to the slow presets available in configurations that optimize quality over throughput.

We tested multiple facets of system performance. The first series of tests involved a single stream in and single stream out, either at the same resolution as the incoming stream or scaled down and output at a lower resolution. Many applications use this mode of operation, including gaming, gambling, and auctions.

The second use case is ABR distribution, where a single input stream is transcoded to a full encoding ladder. Here we supplemented the results with software-only transcodes for comparison purposes. To assess AI-related throughput, we tested region-of-interest transcoding and background removal.

In most modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the Quadra server and delivered the encoded file to RAM.

Same-Resolution Transcoding

Table 1 shows transcoding results for 8K, 4K, 1080p, and 720p in latency tolerant and low-delay modes. The number represents the amount of full frame rate outputs produced by the system at each configuration.

These results are most relevant for interactive gambling and similar applications that input a single stream, transcode the stream at full resolution, and stream it out. You see that 8K streaming is not available in the AV1 format and that H.264 and HEVC are not available in low latency mode with either program. Interestingly, FFmpeg outperformed GStreamer at this resolution while the reverse was true at 1080p.

4K and 720p results were consistent for all input and output codecs and for normal and low delay modes. All output numbers are impressive, but the 640 720p streams for AV1, H.264, or HEVC is remarkable density for a 1RU rack server.

At 1080p there are minor output differences between normal and low-delay mode and the different codecs, though the codec-related differences aren’t that substantial. Interestingly, HEVC throughput is slightly higher than H.264, with AV1 about 16% behind HEVC.

Jan Ozer - the Quadra server review-table-1
Table 1. Same resolution transcoding results.

Table 2 shows a collection of maximum data points (worst case) from the transcoding results presented in Table 1. As you can see, both Max CPU and power consumption track upwards with the number of streams produced. Max latency (decode plus encode) in normal latency mode tracks downward with the stream resolution, becoming quite modest at 720p. Max latency (decode plus encode) in low-delay mode for both decoding and encoding starts and stays under 30.9 milliseconds, which is less than a single frame.

Jan Ozer - the Quadra server review-table-2
Table 2. Maximum CPU, power consumption, and latency data for pure transcoding.

As between FFmpeg and GStreamer, the latter proved more CPU and power efficient than the former, in both normal and low-delay modes. For example, in all tests, GStreamer’s CPU utilization was less than half of FFmpeg, through the power consumption delta was generally under 20%.

At 8K and 4K resolutions, the latency reported was about even between the two programs, but at the lower resolutions in low-delay mode, GStreamer’s latency was often half that of FFmpeg. You can see an example of these two observations in Table 3, reporting 720p HEVC input and output as HEVC. Though the throughput was identical, GStreamer used much less energy and produced much lower latency. As you’ll see in the next section, this dynamic stayed true in transcoding with scaling tests, making GStreamer the superior app for applications involving same-resolution transcoding and transcoding with scaling. 

Quadra Server - Table 3. GStreamer was much more CPU and power efficient and delivered substantially lower latency than FFmpeg in these same resolution transcode tests.
Table 3. GStreamer was much more CPU and power-efficient
and delivered substantially lower latency than FFmpeg
in these same resolution transcode tests.

Transcoding and Scaling

Table 4 shows transcoding while scaling results, first 8K input to 4K output, then 4K to 1080p, and lastly 1080p to 720p. If you compare Table 3 with Table 1, you’ll see that performance tracks the input resolution, not output, which makes sense because decoding is a separate operation that obviously involves its own hardware limits.

Jan Ozer - the Quadra server review-table-4
Table 4. Transcoding while scaling results.

As the Quadra VPUs perform scaling on-board, there was no drop in throughput with the scaling related tests; rather, there was a slight increase in 8K > 4K and 4K > 1080p outputs over the same resolution transcoding reported in Table 1. In terms of throughput, the results were consistent between the codecs and software programs.

Table 5 shows the max CPU and power usage for all the transcodes in Table 3, which increased somewhat from the low-quantity high-resolution transcodes to the high-quantity low-resolution transcodes but was well within the performance envelope for this 32-core server.

The Max latency for all normal encodes was relatively consistent between five and six frames. With low delay engaged, 8K > 4K latency didn’t drop that significantly, though you’d assume that 8K to 4K transcodes are uncommon. Latency dropped to below a single frame in the two lower resolution transcodes.

Jan Ozer - the Quadra server review-table-5
Table 5. Maximum CPU, power consumption, and latency data for transcoding while scaling.

As between FFmpeg and GStreamer we saw the same dynamic as with full resolution transcodes; in most tests, GStreamer consumed significantly less power and produced sharply lower latency. You can see an example of this in Table 6, reporting the results of 1080p incoming HEVC output to AV1 at 720p. 

Table 6. GStreamer was much more CPU and power-efficient
and delivered much lower latency than FFmpeg in this scale then transcode tests.

Encoding Ladder Testing

Table 7 shows the results of full ladder testing with CPU, latency, and power consumption embedded in the output instances. Note that we tested a five-rung ladder for H.264, and four-rung ladders for HEVC and AV1. We didn’t test 4K H.264 output because few services would deploy this configuration. Also, we didn’t test with GSteamer because NETINT’s current GStreamer implementation can’t use Quadra’s internal scalers when producing more than a single file, an issue that the NETINT engineering team will resolve soon. Also, as you can see, low-delay mode wasn’t available for 4K testing. 

This fine print behind us, as with the single file testing, throughput was impressive. The ability to deliver up to 140 HEVC 4-rung ladders from a single 1RU rack, in either normal or low-latency mode, is remarkable.

Jan Ozer - the Quadra server review-table-7
Table 7: Encoding ladder throughput. 

For comparison purposes, we produced the equivalent encoding ladders on the same server using software-only encoding with FFmpeg and the x264, x265, and SVT-AV1 codecs. To match the throughput settings used for Quadra, we used the ultrafast preset for x264 and x265, and preset eleven for SVT-AV1. You see the results in Table 8

Note that these numbers over-represent software-based output since no engineer would produce a live stream with CPU utilization over 60 – 65%, since a sudden spike in CPU usage would crash all the streams. Not only is CPU utilization much lower for the Quadra-driven encodes, minimizing the risk of exceeding CPU capacity, Quadra-based transcoding is much more determinist than CPU-based transcoding, so CPU requirements don’t typically change in midstream.

All that said, Quadra proved much more efficient than software-based encoding for all codecs, particularly HEVC and AV1. In Table 7, the Multiple column shows the number of servers required to produce the same output as the Quadra server, plus the power consumed by all these servers. For H.264, you would need six servers instead of a single Quadra server to produce the 120 instances, and power costs would be nearly six times higher. That’s running each Quadra server at 98.3% CPU utilization. Running at a more reasonable 60% utilization would translate to ten servers and 4,287 watts per hour.

Jan Ozer - the Quadra server review-table-8
Table 8. Ladders, CPU utilization, and power consumed for CPU-only transcoding.

Even without factoring in the 60% CPU-utilization limits, the comparison reaches untenable levels with HEVC and AV1. As the data shows, CPU-based transcoding simply can’t keep up with these more complex codecs, while the ASIC-driven Quadra remains relatively consistent. 

AI-Related Functions

The next two tables benchmark AI-related functions, first region of interest encoding, then background removal. Briefly, region of interest encoding uses AI to search for faces in a stream and then increases the bits assigned to those faces to increase quality. This is useful in surveillance videos or any low-bitrate video environment where facial quality is important. 

We tested 1080p AVC input and output with FFmpeg only, and the system delivered sixty outputs in both normal and low-delay modes, with very modest CPU utilization and power consumption. For more on Quadra’s AI-related functions, and for an example of the region of interest filter, see an Introduction to AI Processing on Quadra.

Jan Ozer - the Quadra server review-table-9
Table 9. Throughput for Region of Interest transcoding via Artificial Intelligence.

Table 10 shows 1080p input/output using the AVC codec with background removal, which is useful in conferencing and other applications to composite participants in a virtual environment (see Figure 2). This task involves considerably more CPU but delivers slightly greater throughput.

Jan Ozer - the Quadra server review-table-10
Table 10. Throughput for background removal and transcoding via Artificial Intelligence.

As you can read about in the Introduction to AI Processing on Quadra, Quadra comes with these and other AI-based applications and can deploy AI-based models developed in most machine learning programs. Over time, AI-based operations will become increasingly integral to video transcoding functions, and the Quadra Video Server provides a future-proof platform for that integration.

Figure 3 -The Quadra Server - Compositing participants in a virtual environment with background removal
Figure 3. Compositing participants in a virtual environment with background removal

Conclusion

While there’s a compelling case for ASIC-based transcoding solely for H.264 production, these tests show that as applications migrate to more complex codecs like HEVC and AV1, CPU-based transcoding is untenable economically and for the environment. Beyond pure transcoding functionality, if there’s anything that the ChatGPT-era has proven, it’s that AI-based transcoding-related functions will become mainstream much sooner than anyone might have thought. With highly efficient ASIC-based transcoding hardware and AI engines, the Quadra Video Server checks all the boxes for a server to strongly consider for all high-volume live streaming applications. 

What Can a VPU Do for You?

What Can a VPU Do for You? - NETINT Technologies

For Cloud-Gaming, a VPU can deliver 200 simultaneous 720p30 game sessions from a single 2RU server.

When you encode using a Video Processing Unit (VPU) rather than the built-in GPU encoder, you will decrease your cost per concurrent user (CCU) by 90%, enabling profitability at a much lower subscription price. How is this technically feasible? Two technology enablers make this possible. First, extraordinarily capable encoding hardware, known as a VPU (video processing unit), dedicated to the task of high-quality video encoding and processing. And second, peer-to-peer direct memory access (DMA) that enables video frames to be delivered at the speed of memory compared to the much slower NVMe buss between the GPU and VPU. Let’s discuss these in reverse order.

Peer-to-Peer Direct Memory Access (DMA)

Within a cloud gaming architecture, the primary role of the GPU is to render frames from the game engine output. These frames are then encoded into a standard codec that is easily decoded on a wide cross section of devices. Generally this is H.264 or HEVC, though AV1 is becoming of interest to those with a broader Android user based. Encoding on the GPU is efficient from a data transfer standpoint because the rendering and encoding occurs on the same silicon die; there’s no transfer of the rendered YUV frame to a separate transcoder over the slower PCIe or NVMe busses. However, since encoding requires substantial GPU resources, this dramatically reduces the overall throughput of the system. Interestingly, it’s the encoder that is often at full capacity and, thus the bottleneck, not the rendering engine. Modern GPU’s are built for general-purpose graphical operations, thus, more real estate is devoted to this compared to video encoding.

By installing a dedicated video encoder in the system and using traditional data transfer techniques, the host CPU can easily manage the transfer of the YUV frame from the GPU to the transcoder but as the number of concurrent game sessions increase the probability of dropped frames or corrupted data makes this technique not usable.

NETINT, working with AMD enabled peer-to-peer direct memory access (DMA) to overcome this situation. DMA is a technology that enables devices within a system to exchange data in memory by allowing the GPU to send frames directly to the VPU whereby removing the situation of the buss becoming clogged as the concurrent session count increases above 48 720p streams.

What can a VPU do for you?

The Benefits of Peer-to-Peer DMA

Peer-to-peer DMA delivers multiple benefits. First, by eliminating the need for CPU involvement in data transfers, peer-to-peer DMA significantly reduces latency, which translates to a more responsive and immersive gaming experience for end-users. NETINT VPUs feature latencies as low as 8ms in fully loaded and sustained operation.

In addition, peer-to-peer DMA relieves the CPU of the burden of managing inter-device data transfers. This frees up valuable CPU cycles, allowing the CPU to focus on other critical tasks, such as game logic and physics calculations, optimizing overall system performance and producing a smoother gaming experience.

By leveraging peer-to-peer communications, data can be transferred at greater speeds and efficiency than CPU-managed transfers. This improves productivity and scalability for cloud gaming production workflows.

These factors combine to produce higher throughput without the need for additional costly resources. This cost-effectiveness translates to improved return on investment (ROI) and a major competitive advantage.

Extraordinarily Capable VPUs

Peer-to-peer DMA has no value if the encoding hardware used is not equally capable. With NETINT VPUs, that isn’t the case here.

The reference system that produces 200 720p30 cloud gaming sessions is built on the Supermicro AS-2015CS-TNR server platform with a single GPU and two Quadra T2A VPUs. This server supports AV1, HEVC, and H.264 video game streaming at up to 8K and 60fps, though as may be predicted, the simultaneous stream counts will be reduced as you increase framerate or resolution.

Quadra T2A is the most capable of the Quadra VPU line, the world’s first dedicated hardware to support AV1. With its embedded AI and 2D engines, the Quadra T2A can support AI-enhanced video encoding, region of interest, and content-adaptive encoding. Quadra T2A coupled with a P2P DMA enabled GPU, allows cloud gaming providers to achieve unprecedented high throughput with ultra-low latency.

Quadra T2A is an AIC (HH HL) form-factor video processing unit with two Codensity G5 ASICs that operates in x86 or Arm-based servers requiring just 40 watts at maximum load. It enables cloud gaming platforms to transition from software or GPU-only based encoding with up to a 40x reduction in the total cost of ownership.

What Can A VPU Do For You?

What Can A VPU Do For You?

It makes Cloud Gaming profitable, finally.

Peer-to-peer DMA is a game-changing technology that reduces latency and increases system throughput. When paired with an extraordinarily capable VPU like the NETINT Quadra T2A, now you can deliver an immersive gaming experience at a CCU that cannot be matched by any competing architecture.

Video Transcoder vs. Video Processing Unit (VPU)

When choosing a product for live stream processing, half the battle is knowing what to search for. Do you want a live transcoder, a video processing unit (VPU), a video coding unit (VCU), Scalable Video Processor (SVP) or something else? If you’re not quite sure what these terms mean and how they relate, this short article will educate you in four minutes or less.  

In the Beginning, There Were Transcoders

Simply stated, a transcoder is any technology, software or hardware, that can input a compressed stream (decode) and output a compressed stream (encode). FFmpeg is a transcoder, and for video-on-demand applications, it works fine in most low-volume applications.

For live applications, particularly high-volume live interactive applications (think Twitch), you’ll probably need a hardware transcoder to achieve the necessary cost per stream (CAPEX), operating cost per stream, and density.

For example, the NETINT Video Transcoding Server, a single 1RU server with ten NETINT T408 Video Transcoders, can deliver up to 80 H.264/HEVC 1080p30 streams while drawing under 250 watts. Performed in software using only the CPU, this same output could take up to ten separate 1RU servers, each drawing well over 250 watts.

Netint Codensity, ASIC-based T408 Video Transcoder
The NETINT T408 Video Transcoder.

Speaking of the T408, if Websters defined a transcoder (it doesn’t), it might have a picture of the T408 as the perfect example of a transcoder. Based on custom transcoding ASICs, the T408 is inexpensive ($400), capable (4K @ 60 FPS or 4x 1080p60 streams), flexible (H.264 and HEVC), and exceptionally efficient (only 7 watts).

What doesn’t the T408 do? Well, that leads us to the difference between a transcoder and a VPU.

The difference between a transcoder and a Video Processing Unit (VPU)

First, the T408 doesn’t scale video. If you’re building a full encoding ladder from a high-resolution source, all the scaling for the lower rungs is performed by the host CPU. In addition, the T408 doesn’t perform overlay in hardware. So, if you insert a logo or other bug over your videos, again, the CPU does the heavy lifting.

Finally, the T408 was launched in 2019, the first ASIC-based transcoder to ship in quite a long time. So, it’s not surprising that it doesn’t incorporate any artificial intelligence processing capabilities.

What is a Video Processing Unit (VPU)?

What’s a Video Processing Unit? A hardware device that does all that extra stuff, scaling, overlay, and AI. You see this in the transcoding pipeline shown below, which is for the NETINT Quadra.

When it came to labeling the Quadra, you see the problem; It does much more than a video transcoder. Not only does it outperform the T408 by a factor of four, it adds AV1 output and all the additional hardware functionality. It’s much more than a simple video transcoder, it’s a video processing unit (VPU).

As much as we’d like to lay claim to the acronym, it actually existed before we applied it to the Quadra. It’s not surprising. It follows the terminology for CPU (central processing unit) and GPU (graphical processing unit). And, if Websters defined VPU (it doesn’t). Oh, you get the point. Here’s the required Quadra glamour shot.

Netint Codensity, ASIC-based Quadra T1A Video Processing Unit
The NETINT Quadra Video Processing Unit.

VCUs and M(SVP)

While NETINT was busy developing ASIC-based transcoders and VPUs for the mass market, large video publishers like YouTube and Meta produced their own ASICs to achieve similar benefits (and produce more acronyms). In 2021, when Google shipped their own ASIC-based transcoder called Argos, they labeled it a Video Coding Unit, or VCU.

Like the T408 and Quadra, the benefits of this ASIC-based technology are profound; as reported by CNET, “Argos handles video 20 to 33 times more efficiently than conventional servers when you factor in the cost to design and build the chip, employ it in Google’s data centers, and pay YouTube’s colossal electricity and network usage bills.” Interestingly, despite YouTube’s heavy usage of the AV1 codec, Argos encodes only H.264 and VP9, not AV1.

In May 2023, Meta released their own ASIC, which, like Argos, outputs H.264 and VP9, but not AV1. Called the Meta Scalable Video Processor (MSVP), the unit delivered impressive results, including “a throughput gain of ~9x for H.264 when compared against libx264 SW encoding…[and] a throughput gain of ~50x when compared with libVPX speed 2 preset.” Meta also noted that the unit drew only 10 watts of power, which is skimpy but also about 43% higher than the T408.

Of course, neither Google or Meta sells their ASIC to third parties, so if want the CAPEX and OPEX efficiencies that ASIC-based VPUs deliver, you’ll have to buy from NETINT.

Of course, neither Google or Meta sells their ASIC to third parties, so if want the CAPEX and OPEX efficiencies that ASIC-based VPUs deliver, you’ll have to buy from NETINT. The bottom line is that whether you call it a transcoder, VPU, VCU, or MSVP, you’ll get the highest throughput and lowest power consumption if it’s powered by an ASIC.

Play Video about HARD QUESTIONS ON HOT TOPICS: ASIC-based Video Transcoder versus Video Processing Unit (VPU)
HARD QUESTIONS ON HOT TOPICS:
ASIC-based Video Transcoder versus Video Processing Unit (VPU)
Watch the full conversation on YouTube: https://youtu.be/iO7ApppgJAg