Evaluating Hardware Transcoder Performance

If you’ve ever benchmarked software codecs, you know the quality/throughput tradeoff; simply stated, the higher the quality, the lower the throughput. In contrast, for many first-generation hardware encoders, throughput was prioritized, but the quality was fixed; you got what you got.

Most next-gen hardware encoders offer presets or other switches to optimize quality at a cost to throughput that can be even more striking than with software. In comparing specifications for encoders, remember the quality/throughput tradeoff. And when you see quality stats, think, “hmm, at what throughput?” Or, if you see throughput stats, ask, “at what quality?”

Whenever you test a hardware encoder, you should start by identifying the configuration options that most impact quality and throughput and then test across a range of configurations to get a sense of the performance/quality tradeoff. If you plug in pricing and power consumption figures, you can also easily compute the cost per stream and watts per stream. This is the CAPEX and OPEX side of the equation.
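As a quick sketch of that computation, cost per stream and watts per stream fall directly out of the throughput at each operating point. The price and power figures below are placeholders for illustration, not actual hardware specs:

```python
# Sketch: derive per-stream CAPEX/OPEX metrics from throughput.
# All figures below are illustrative placeholders, not vendor specs.

def per_stream_metrics(unit_cost_usd, power_watts, streams):
    """Cost per stream and watts per stream at a given operating point."""
    return unit_cost_usd / streams, power_watts / streams

# Example: a hypothetical $1,500 card drawing 70 W at two operating points.
cost_hq, watts_hq = per_stream_metrics(1500, 70, 16)  # high quality: 16 streams
cost_lq, watts_lq = per_stream_metrics(1500, 70, 36)  # low quality: 36 streams

print(f"High quality: ${cost_hq:.2f}/stream, {watts_hq:.2f} W/stream")
print(f"Low quality:  ${cost_lq:.2f}/stream, {watts_lq:.2f} W/stream")
```

Running the same arithmetic across every settings combination you test gives the CAPEX/OPEX picture described above.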

Then you can choose the “operating point” that delivers the optimum blend of quality and throughput for your applications. When comparing multiple encoders, you should perform the same analysis for each to enable a complete apples-to-apples comparison.

Recently, I benchmarked the performance of NETINT’s Quadra Video Processing Unit (VPU) against the NVIDIA T4. In this post, I’ll review the testing and the Quadra results to give you a feel for the hardware evaluation process. In a future post, I’ll review the NVIDIA findings and compare the two.

Benchmarking Quadra

Briefly, Quadra is NETINT’s newest ASIC-based transcoder, called a VPU, because it has onboard decoding, scaling, encoding, and overlay, plus an 18 TOPS AI engine. The VPU can create encoded bitstreams in H.264, HEVC, and AV1.

Quadra has two major configuration options that impact quality: the lookahead buffer and rate-distortion optimization.

Briefly, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what’s coming and can make more intelligent decisions. This improves encoding quality, particularly at/around scene changes, and it can improve bitrate efficiency. But, lookahead adds latency equal to the lookahead duration, and it can decrease throughput.
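The math on that added latency is simple: the buffer delays the first output frame by the lookahead depth divided by the frame rate. A minimal illustration:

```python
# Lookahead latency: the encoder buffers this many frames before it
# can start encoding, delaying output by frames / fps seconds.
def lookahead_latency(lookahead_frames, fps):
    return lookahead_frames / fps

# A 40-frame lookahead at 30 fps adds roughly 1.33 seconds.
print(f"{lookahead_latency(40, 30):.2f} s")
```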

Table 1 shows the impact of a 40-frame lookahead buffer when encoding to the H.264 format. Without lookahead, the top-line harmonic mean VMAF score is 2.3 points lower, which is borderline significant, and the low-frame differential of almost 16 points could predict transient problems that might be apparent to some viewers. On the other hand, besides injecting 1.3 seconds of latency into the process, the lookahead cuts throughput by 33%, from 36 1080p streams to 24.

Table 1. Quality and performance impact
of a 40-frame lookahead on Quadra H.264 encoding.
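The harmonic-mean and low-frame numbers in tables like this derive directly from per-frame VMAF scores. Here's a minimal sketch of both metrics, using made-up scores rather than actual measurement data:

```python
from statistics import harmonic_mean

# Per-frame VMAF scores for a short clip (illustrative values only).
vmaf_scores = [95.2, 94.8, 93.1, 77.5, 92.0, 94.4]

top_line = harmonic_mean(vmaf_scores)  # headline quality number
low_frame = min(vmaf_scores)           # worst frame; flags transient glitches

print(f"Harmonic mean: {top_line:.2f}, low frame: {low_frame:.2f}")
```

The harmonic mean penalizes low-scoring frames more than a simple average does, which is why it's often preferred as the top-line metric, and tracking the minimum separately surfaces transient problems the aggregate can still hide.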

Rate-distortion optimization (RDO) functions like most presets; it adjusts several parameters that impact both quality and throughput, with higher values increasing quality and reducing throughput. With H.264 output, Quadra offers one level of RDO, while with HEVC there are three levels: 1, 2, and 3.

Table 2 shows the range of H.264 options tested during the recent benchmarking. LA is lookahead; I tested three values: 40, 20, and 0. I also tested with RDO on and off. To provide some quality perspective, the x264 Quality Equivalent column shows the quality of x264 output produced with the same parameters using the presets shown.

At the highest quality setting, Quadra’s output quality slightly exceeded that of x264 using the slow preset, and the unit produced 16 1080p streams. You see that dropping the lookahead from 40 to 20 with RDO disabled had little impact on quality or throughput but cut latency by 0.66 sec, making that choice easy for latency-sensitive events.

Table 2. H.264 options and results.

At the lowest quality setting, Quadra's quality dropped to slightly better than x264's veryfast preset, which is often used for live applications to ensure at least nominal throughput with CPU-only transcoding.

At this quality level, the VPU outputs 36 1080p streams. By adding the cost per stream and watts per stream data, you will get a true feel for the comparative CAPEX and OPEX costs produced by all settings combinations.

Table 3 shows the same data for HEVC transcoding using the same lookahead options and RDO at 1, 2, 3, and 0 (disabled). At the highest quality levels, the output quality nearly matched the x265 encoder using the slow preset, but the unit only produced four streams. At the other end of the spectrum, output quality nearly matched x265 using the veryfast preset, but Quadra produced 40 1080p30 streams, four more than with the H.264 format.

Table 3. HEVC options and results.

There are several new hardware encoders coming to market, and their launches will be accompanied by aggressive claims about quality and throughput. My recommendation: don't assume that the same settings were used for both. In short, you'd better do your own testing; "trust but verify" comes to mind.

When you perform your own testing, remember the methodology explained above:

  1. Identify the most critical quality-related options for your specific application. All producers have different priorities, whether it's bitrate efficiency, absolute quality, ultra-low latency, density, power consumption, or cost per stream. You need to know your critical constraints to arrive at the best solution.

  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput. Increasingly, even for quality-driven use cases, imperceptible quality tradeoffs might be necessary to meet an operational cost or energy efficiency target. You should choose the operating point that delivers the optimum blend for your application.

  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies. Remember to factor in the CAPEX of the additional servers required to run a software encoding service. Our customers report reducing the number of machines needed by 90% or more, which can translate to tens of millions or even hundreds of millions of dollars in savings, or to reclaimed CPUs that can be used in other parts of the operation.

In the next post, we’ll share quality results from the NVIDIA T4 GPU and compare them to Quadra.
