Even in 2023, many high-volume streaming producers continue to rely on software-based transcoding, despite the clear CAPEX, OPEX, and environmental benefits of ASIC-based transcoding. Part of the inertia relates to outdated concerns about the shortcomings of ASICs, including sub-par quality and lack of flexibility to add features or codec enhancements.
As a parent, I long ago concluded that there were no words that could come out of my mouth that would change my daughter’s views on certain topics. As a marketer, I feel some of that same dynamic, that no words can come out of my keyboard that would shake the negative beliefs about ASICs from staunch software-encoding supporters.
So, don’t take our word that these beliefs are outdated; consider the results from the world’s largest video producer, YouTube. The following slides and observations are from a Google presentation by Aki Kuusela and Clint Smullen on the Argos ASIC-based transcoder at Hot Chips 33 back in August 2021. The slides are available here, and the video here.
In the presentation, the speakers discussed why YouTube developed its own ASIC and the performance and power efficiency achieved during the first 16 months of deployment. Their comments go a long way toward dispelling the myths identified above and make for interesting reading.
Advanced Codecs Means Encoding Time Has Grown by 8,000% Since H.264
In discussing why Google created its own encoder, Kuusela explained that video was getting harder to compress, not only from a codec perspective but from a resolution and frame rate perspective. Here’s Kuusela (all quotes grabbed from the YouTube video and lightly edited for readability).
“In order to sustain the higher resolutions and frame rate requirements of video, we have to develop better video compression algorithms with improved compression efficiency. However, this efficiency comes with greatly increased complexity. For example, if we compare the vp9 from 2013 to the decade older H.264, the time to encode videos in software has grown to 10x. The more recent AV1 format from 2018 is already 200 times more time-consuming than the h.264 standard.
If we further compound this effect with the increase in resolution and frame rate for top-quality video, we can see that the time to encode a video from 2003 to 2018 has grown eight thousand-fold. It is very obvious that the CPU performance improvement has not kept up with this massive complexity growth, and to keep our video services running smoothly, we had to consider warehouse scale acceleration. We also knew things would not get any better with the next generation of compression.”
Figure 1. Google moved to hardware
to address skyrocketing encoding times.
Reviewing Figure 1, it should be noted that though few engineers use VP9 as extensively as YouTube, if you swap HEVC for VP9, the complexity difference between H.264 is the same. Beyond the higher resolutions and frame rates engineers must support to remain competitive, the need for hardware becomes even more apparent when you consider the demands of live production.
“Near Parity” with Software Encoding Quality
One consistent concern about ASICs has been quality, which admittedly lagged in early hardware generations. However, Google’s comparison shows that properly designed hardware can deliver near-parity to software-based transcoding.
Kuusela doesn’t spend a lot of time on the slide shown in Figure 2, merely stating that “we also wanted to be able to optimize the compression efficiency of the video encoder based on the real-time requirements and time available for each encoder and to have full access to all quality control algorithms such as bitrate allocation and group of picture selection. So, we could get near parity to software-based encoding quality with our no-compromises implementation.”
Figure 2. Argos delivers “near-parity”
with software encoders.
NETINT’s data more than supports this claim. For example, Table 1 compares the NETINT Quadra VPU with various x265 presets. Depending upon the test configuration, Quadra delivers quality on par with the x265 medium preset. When you consider that software-based live production often necessitates using the veryfast or ultrafast preset to achieve marginal throughput, Quadra’s quality far exceeds that of software-based transcoding.
Table 1. Quadra HEVC quality compared to x265
in high-quality latency tolerant configuration.
ASIC Performance Can Improve After Deployment
Another concern about ASIC-based transcoders is the inability to upgrade, and accelerated obsolescence. Proper ASIC design allows ASICs to balance encoding tasks between hardware, firmware, and control software to ensure continued upgradeability.
Figure 3 shows how the bitrate of VP9 and H.264 continued to improve compared to software in the months after the product launch, even without changes to the firmware or kernel driver. The second Google presenter, Clint Smullen attributed this to a hybrid hardware/software design, commenting that “Using a software approach was critical both to supporting the quality and feature development in the video core as well as allowing customer teams to iteratively improve quality and performance.”
Figure 3. Argos continued to improve after deployment
without changes to firmware or the kernel driver.
The NETINT Codensity G4 ASIC included in the T408 and the NETINT Codensity G5 ASIC that powers our Quadra family of VPUs, both use a hybrid design that distributes critical functions between the ASIC, driver software, and firmware.
We optimize ASIC design to maximize functional longevity as explained here on the role of firmware in ASIC implementations, “The functions implemented in the hardware are typically the lower-level parts of a video codec standard that do not change over time, so the hardware does not need to be updated. The higher levels parts of the video codecs are in firmware and driver software and can still be changed.”
As Google’s experience and NETINT’s data show, well-designed ASICs can continue improving in quality and functionality long after deployment.
90% Reduction in Power Consumption
Few engineers question the throughput and power efficiency of ASICs, and Google’s data bears this out. Commenting on Figure 4, Smullen stated, “For H.264 transcoding a single VCU matches the speed of the baseline system while using about one-tenth of the system level power. For VP9, a single 20 VCU machine replaces multiple racks of CPU-only systems.”
Figure 4. Throughput and comparative efficiency
of Argos vs software-only transcoding.
NETINT ASICs deliver similar results. For example, a single T408 transcoder (H.264 and HEVC) delivers roughly the same throughput as a 16-core computer encoding with software and draws only about 7 watts compared to 250+ for the computer. NETINT Quadra draws 20 watts and delivers roughly 4x the performance of the T408 for H.264, HEVC, and AV1. In one implementation, a single 1RU rack of ten Quadras can deliver 320 1080p streams or 200 720p cloud gaming sessions, which like Argos, replaces multiple racks of CPUs.