Computing Payback Period on T408s

One of the most power-hungry processes performed in data centers is software-based live transcoding, which can be performed much more efficiently with ASIC-based transcoders. With power costs soaring and carbon emissions an ever-increasing concern, data centers that perform high-volume live transcoding should strongly consider switching to ASIC-based transcoders like the NETINT T408. Computing the payback period for that switch is easy with NETINT's online calculators.

To assist in this transition, NETINT recently published two online calculators that measure the cost savings and payback period for replacing software-based transcoders with T408s. This article describes how to use these calculators and shows that data centers can recover their investment in T408 transcoders in just a few months, or even sooner if they can repurpose servers previously used for encoding. Most of the data shown are from a white paper that you can access here.

About the T408

Briefly, NETINT designs, develops, and sells ASIC-powered transcoders like the T408, which is a video transcoder in a U.2 form factor containing a single ASIC. Operating in x86 and ARM-based servers, T408 transcoders output H.264 or HEVC at up to 4Kp60 or 4x 1080p60 streams per T408 module and draw only 7 watts.

Simply stated, a single T408 can produce roughly the same output as a 32-core workstation encoding in software but drawing anywhere from 250 – 500 watts of power. You can install up to 24 T408s in a single workstation, which essentially replaces 20 – 24 standalone encoding workstations, slashing power costs and the associated carbon emissions.

In a nutshell, these savings are why large video publishers like YouTube and Meta are switching to ASICs. By deploying NETINT’s T408s, you can achieve the same benefits without the associated R&D and manufacturing costs. The new calculators will help you quantify the savings.

Determining the Required Number of T408s

The first calculator, available here, computes the number of T408s required for your production. There are two steps; first, enter the rungs of your encoding ladder into the table as shown. If you don’t know the details of your ladder, you can click the Insert Sample HD or 4K Ladder buttons to insert sample ladders.

After entering your ladder information, insert the number of encoding ladders that you need to produce simultaneously, which in the table is 100. Then press the Compute button (not shown in the Figure but obvious on the calculator).

Calculator 1: Computing the number of required T408 transcoders.

This yields a total of 41 T408s. For perspective, the calculator should be very accurate for streams that don’t require scaling, like 1080p inputs output to 1080p. However, while the T408 decodes and transcodes in hardware, it relies on the host CPU for scaling. If you’re processing full encoding ladders, as we are in this example, throughput will be impacted by the power of the host CPU.

As designed, the calculator assumes that your T408 server is driven by a 32-core host CPU. On an 8-16 core system, expect perhaps 5 – 10% lower throughput. On a 64-core system, throughput could increase by 15 – 20%. Accordingly, please consider the output from this calculator as a good rough estimate accurate to about plus or minus 20%.
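The arithmetic behind the first calculator can be approximated in a few lines, assuming each T408 delivers roughly 4x 1080p60 of transcoding throughput (per the spec above) and measuring ladder load in pixels per second. This is a simplified sketch, not the calculator's exact model; the sample ladder and the scaling derating are illustrative assumptions.

```python
import math

# Per the T408 spec above: one module handles up to 4x 1080p60,
# so express capacity in pixels per second.
T408_CAPACITY = 4 * 1920 * 1080 * 60  # pixels/second per T408

def t408s_required(ladder, num_ladders, scaling_factor=1.0):
    """Rough estimate of T408 modules needed.

    ladder: list of (width, height, fps) rungs.
    scaling_factor: illustrative derating, since the T408 relies on
    the host CPU for scaling and full ladders run somewhat slower.
    """
    pixels_per_sec = sum(w * h * fps for w, h, fps in ladder)
    total = pixels_per_sec * num_ladders / scaling_factor
    return math.ceil(total / T408_CAPACITY)

# A sample five-rung HD ladder (illustrative, not the calculator's built-in one).
hd_ladder = [(1920, 1080, 30), (1280, 720, 30), (960, 540, 30),
             (640, 360, 30), (426, 240, 30)]

print(t408s_required(hd_ladder, num_ladders=100))
```

Because host-CPU scaling affects throughput, treat any such estimate the same way the article suggests treating the calculator's output: as a rough figure accurate to about plus or minus 20%.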

To compute the payback period, click the Compute Payback Period button shown in Figure 1. To restart the calculation, refresh your browser.

Computing Payback Period

Computing the payback period requires significantly more information, which is summarized in the following graphic.

Calculator 2: Information needed to compute the payback period.

Step by step

  1. Choose your currency in the drop-down list.

  2. Enter your current cost per kWh. The $0.25/kWh is the approximate UK cost as of March 2022 from this source, which you can also access by clicking the information button to the right of this field. This information button also contains a link to US power costs here.

  3. Enter the number of encoders currently transcoding your live streams. In the referenced white paper, 34 servers were required to produce 100 H.264 encoding ladders.

  4. Enter the power consumption per encoding server. The 289 watts shown were the actual power consumption measured for the referenced white paper. If you don’t know your power consumption, click the Info button for some suggested values.

  5. Enter the number of encoding servers that can be repurposed. The T408s will dramatically improve encoding density; for example, in the white paper, it took 34 servers transcoding with software to produce the same streams as five servers with ten T408s each. Since you won’t need as many encoding servers, you can shift them to other applications, which has an immediate economic benefit. If you won’t be able to repurpose any existing servers for some reason, enter 0 here.

  6. Enter the current cost of the encoding servers that can be repurposed. This number will be used to compute the economic benefit of repurposing servers for other functions rather than buying new servers for those functions. You should use the current replacement cost for these servers rather than the original price.

  7. Enter the number of T408s required. If you start with the first calculator, this number will be auto-filled.

  8. Enter your cost for the T408s. $400 is the retail price of the T408 in low quantities. To request pricing for higher volumes, please check with a NETINT sales representative. You can arrange a meeting here.

  9. Enter the power consumption for each T408. The T408 draws 7 watts of power, which should be auto-filled.

  10. Enter the number of computers needed to host the T408s. You can deploy up to ten T408s in a 1RU server and up to 24 T408s in a 2RU server. We assumed that you would deploy using the first option (10 T408s in a single 1RU) and auto-filled this entry with that calculation. If the actual number is different, enter the number of computers you anticipate buying for the T408s.

  11. Enter the price for computers purchased to run T408s (USD). If you need to purchase new computers to house the T408s, enter the cost here. Note that since the T408 decodes incoming H.264 and HEVC streams and transcodes on-board to those formats, most use cases work fine on workstations with 8-16 cores, though you’ll need a U.2 expansion chassis to house the T408s. Check this link for more information about choosing a server to house the T408s. We assumed $3,000 because that was the cost for the server used in the white paper.

    If you’re repurposing existing hardware, enter the current cost, similar to number 6.

  12. Enter the power consumption for the servers (watts). As mentioned, you won’t need a very powerful computer to run the T408s, and CPU utilization and power consumption should be modest because the T408s are doing most of the work. This number is the base power consumption of the computer itself; the power utilized by the T408s will be added separately.

When you’ve entered all the data, press the Calculate button.
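At its core, the payback calculation compares monthly power savings, plus the one-time value of any repurposed servers, against the cost of the T408s and their host servers. The sketch below is a simplified model using figures from the white paper; the host server's power draw (250 W) is an assumption, and the actual calculator may weigh additional factors.

```python
def payback_months(cost_per_kwh, old_servers, old_watts,
                   t408_count, t408_price, t408_watts,
                   host_count, host_price, host_watts,
                   repurposed_count=0, repurposed_value=0.0):
    """Simplified payback model in months; illustrative only,
    not the calculator's exact formula."""
    hours_per_month = 24 * 30
    old_kwh = old_servers * old_watts / 1000 * hours_per_month
    new_kwh = (t408_count * t408_watts + host_count * host_watts) / 1000 * hours_per_month
    monthly_savings = (old_kwh - new_kwh) * cost_per_kwh
    capex = t408_count * t408_price + host_count * host_price
    capex -= repurposed_count * repurposed_value  # credit for repurposed servers
    if capex <= 0:
        return 0.0  # you're ahead the day you make the switch
    return capex / monthly_savings

# White-paper figures: 34 software servers at 289 W, replaced by
# 5 hosts (assumed 250 W each) running 50 T408s, at $0.25/kWh.
print(round(payback_months(0.25, 34, 289, 50, 400, 7,
                           5, 3000, 250), 1))
```

Note how the repurposed-server credit drives the comprehensive result: if the credit exceeds the new-hardware cost, the payback period drops to zero.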

Interpreting the Results

The calculator computes the payback period under three assumptions:

  • Simple: Payback Period on T408 Purchases
  • Simple: Payback Period on T408 + New Computers
  • Comprehensive: Consider all costs

Figure 3. Simple payback on T408 purchases.

This result divides the cost of the T408 purchases by the monthly savings and shows a payback period of around 11 months. That said, if five servers with T408s essentially replaced 34 servers, unless you’re discarding the 29 servers, the third result is probably a more accurate reflection of the actual economic impact.

Figure 4. Simple: Payback Period on T408 + New Computers

This result includes the cost of the servers necessary to run the T408s, which extends the payback period to about 20.5 months. Again, however, if you’re able to allocate existing encoding servers into other roles, the third calculation is a more accurate reflection.

Figure 5. Comprehensive: consider all costs.

This result incorporates all economic factors. In this case, the value of the repurposed computers ($145,000) exceeds the costs of the T408s and the computers necessary to house them ($103,600), so you’re ahead the day you make the switch.

However you run the numbers, data centers driving high-volume live transcoding operations will find that ASIC-based transcoders will pay for themselves in a matter of months. If power costs keep rising, the payback period will obviously shrink even further.

2022: Opportunities and Challenges for the Streaming Video Industry

For those in the streaming video industry, 2022 will be remembered as a turbulent year marked by new opportunities, including the emergence of new video platforms and services.

2022 started off with Meta’s futuristic vision of the internet known as the Metaverse, a combination of virtual reality, augmented reality, and video in which users interact within a digital universe. The Metaverse continues to evolve toward unique, individual, one-to-one video streaming experiences, in contrast to the one-to-many video streaming services that are commonplace today.

Recent surveys have shown that two-thirds of consumers plan to cut back on streaming subscriptions due to rising costs and diminishing discretionary income. With consumers becoming more value-conscious and price-sensitive, Netflix and other platforms have introduced innovative new subscriber models. Netflix’s subscription offering, in addition to SVOD (Subscription Video on Demand), now includes an ad-based tier, AVOD (Advertising Video on Demand).

Netflix shows the way

This new ad-based tier targets the most price-sensitive customers, and AVOD growth is projected to lead SVOD by 3x in 2023. Netflix can potentially earn over $4B in advertising revenue, making it the second largest ad-supported platform after YouTube. This year also saw Netflix making big moves into mobile cloud gaming with the purchase of its sixth gaming studio. Adding gaming to its product portfolio serves at least two purposes: it expands the number of platforms that can access its game titles, and it provides another way to retain existing subscribers.

These new services and platforms are a small sample of the continued growth in new streaming video services where business opportunities abound for video platforms willing to innovate and take risks.

Stop data center expansion

The new streaming video industry landscape requires platforms to provide innovative new services to highly cost-sensitive customers in a regulatory environment that discourages data center expansion. To prosper in 2023 and beyond, video platforms must address several key issues as they add services and subscribers.

  • Controlling data center sprawl – new services and extra capacity can no longer be contingent on the creation of new and larger data centers.
  • Controlling OPEX and CAPEX – in the current global economic climate, costs need to be controlled to keep prices under control and drive subscriber growth. In addition, given today’s economic uncertainty, access to financing and capital to fund data center expansion cannot be assumed.
  • Energy consumption and environmental impact are intrinsically linked, and both must be reduced. Governments are now enacting environmental regulations and platforms that do not adopt green policies do so at their own peril.

Application Specific Integrated Circuit

For a vision of what needs to be done to address these issues, one only needs to glimpse into the recent past at YouTube’s Argos VCU (Video Coding Unit). Argos is YouTube’s in-house designed ASIC (Application Specific Integrated Circuit) encoder that, among other objectives, enabled YouTube to reduce its encoding costs, server footprint, and power consumption. YouTube is encoding over 500 hours of content per minute.

To stay ahead of this workload, Google designed its own ASIC, which enabled it to eliminate millions of Intel CPUs. Obviously, not everyone has an in-house ASIC development team, but whether you run a hyperscale, commercial, institutional, or government video platform, NETINT’s Codensity ASIC-powered video processing units are available to you.

To enable faster adoption, NETINT partnered with Supermicro, the global leader in green server solutions. The NETINT Video Transcoding Server is based on a 1RU Supermicro server powered with 10 NETINT T408 ASIC-based video transcoder modules. The NETINT Video Transcoding Server, with its ASIC encoding engine, enables a 20x reduction in operational costs compared to CPU/software-based encoding. The massive savings in operational costs offset the CAPEX associated with upgrading to the NETINT video transcoding server.

Supermicro and T408 Server Bundle

In addition to the extraordinary cost savings, the advantages of ASIC encoding include enabling a reduction in the server footprint by a factor of 25x or more, which has a corresponding reduction in power consumption and, as a bonus, is also accompanied by a 25x reduction in carbon emissions. This enables video platforms to expand encoding capacity without increasing their server or carbon footprints, avoiding potential regulatory setbacks.

In need of environmentally friendly technologies

2022 has seen the emergence of many new opportunities with the launch of new innovative video services and platforms. To ensure the business success of these services, in the light of global economic uncertainty and geopolitical unrest, video platforms must rethink how these services are deployed and embrace new cost-efficient, environmentally friendly technologies.

Introduction to AI Processing on Quadra

The intersection of video processing and artificial intelligence (AI) delivers exciting new functionality, from real-time quality enhancement for video publishers to object detection and optical character recognition for security applications. One key feature of NETINT’s Quadra Video Processing Units is a pair of onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU. As you’ll learn in this post, this architecture makes Quadra the ideal platform for executing video-related AI applications.

This post introduces the reader to what AI is, how it works, and how you deploy AI applications on NETINT Quadra. Along the way, we’ll explore one Quadra-supported AI application, Region of Interest (ROI) encoding.

About AI

Let’s start by defining some terms and concepts. Artificial intelligence refers to a program that can sense, reason, act, and adapt. One AI subset that’s a bit easier to grasp is called machine learning, which refers to algorithms whose performance improves as they are exposed to more data over time.

Machine learning involves the five steps shown in the figure below. Let’s assume we’re building an application that can identify dogs in a video stream. The first step is to prepare your data. You might start with 100 pictures of dogs and then extract features, or represent them mathematically, that identify them as dogs: four legs, whiskers, two ears, two eyes, and a tail. So far, so good.

Figure 1. The high-level AI workflow (from Escon Info Systems)

To train the model, you apply your dog-finding algorithm to a picture database of 1,000 animals, only to find that rats, cats, possums, and small ponies are also identified as dogs. As you evaluate and further train the model, you extract new features from all the other animals that disqualify them from being a dog, along with more dog-like features that help identify true canines. This is the “machine learning” that improves the algorithm.

As you train and evaluate your model, at some point it achieves the desired accuracy rate and it’s ready to deploy.
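The five-step workflow above can be sketched with a toy model. The animals, features, and "models" below are purely illustrative, but they show the essential loop: extract features, evaluate accuracy, and retrain until the model distinguishes dogs from the look-alikes.

```python
# Toy version of the workflow above: prepare data, extract features,
# train, evaluate, deploy. Data and features are purely illustrative.
animals = [
    {"legs": 4, "tail": True,  "barks": True,  "is_dog": True},
    {"legs": 4, "tail": True,  "barks": False, "is_dog": False},  # cat
    {"legs": 4, "tail": True,  "barks": False, "is_dog": False},  # rat
    {"legs": 2, "tail": False, "barks": False, "is_dog": False},  # bird
]

def extract_features(a):
    # Represent each animal as a numeric feature vector.
    return [a["legs"], int(a["tail"]), int(a["barks"])]

def predict(model, a):
    # A "model" here is just a weight per feature plus a threshold.
    score = sum(w * f for w, f in zip(model["weights"], extract_features(a)))
    return score >= model["threshold"]

def evaluate(model, data):
    correct = sum(predict(model, a) == a["is_dog"] for a in data)
    return correct / len(data)

# Naive model: four legs plus a tail -- misclassifies cats and rats as dogs.
naive = {"weights": [1, 1, 0], "threshold": 5}
# Retrained model: barking turns out to be the distinguishing feature.
trained = {"weights": [0, 0, 1], "threshold": 1}

print(evaluate(naive, animals), evaluate(trained, animals))
```

Real training adjusts the weights automatically from data rather than by hand, but the evaluate-and-refine loop is the same.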

The NETINT AI Tool Chain

Then it’s time to run the model. Here, you export the model for deployment on an AI-capable hardware platform like the NETINT Quadra. What makes Quadra ideal for video-related AI applications is the power of the Neural Processing Units (NPUs) and the proximity of the video to the NPUs. That is, since the video is processed entirely in Quadra, there are no transfers to a CPU or GPU, which minimizes latency and enables faster performance. More on this below.

Figure 2 shows the NETINT AI Toolchain workflow for creating and running models on Quadra. On the left are third-party tools for creating and training AI-related models. Once these models are complete, you use the free NETINT AI Toolkit to input the models and translate, export, and run them on the Quadra NPUs – you’ll see an example of how that’s done in a moment. On the NPUs, they perform the functions for which they were created and trained, like identifying dogs in a video stream.

Figure 2. The NETINT AI Tool Chain.

Quadra Region of Interest (ROI) Filter

Let’s look at a real-world example. One AI function supplied with Quadra is an ROI filter, which analyzes the input video to detect faces and generate Region of Interest (ROI) data to improve the encoding quality of the faces. Specifically, when the AI Engine identifies a face, it draws a box around the face and sends the box’s coordinates to the encoder, with encoding instructions specific to the box.

Technically, Quadra identifies the face using what’s called a YOLOv4 object detection model. YOLO stands for You Only Look Once, a technology that requires only a single pass of the image (or one look) for object detection. By way of background, YOLO is a highly regarded family of “deep learning” object detection models. The original versions of YOLO are implemented using the DARKNET framework, which you see as an input to the NETINT AI Toolkit in Figure 2.

Deep learning differs from the traditional machine learning described above in that it uses large datasets, rather than human intervention, to create the model. To create the model deployed in the ROI filter, we trained the YOLOv4 model in DARKNET using hundreds of thousands of publicly available labeled images (where the labels are bounding boxes on people’s faces). This produced a highly accurate model with minimal manual input, which is faster and cheaper than traditional machine learning. Obviously, where relevant training data is available, deep learning is a better alternative than traditional machine learning.

Using the ROI Function

Most users will access the ROI function via FFmpeg, where it’s presented as a video filter with the filter-specific command string shown below. To execute the function, you call the filter (ni_quadra_roi), enter the name and location of the model (yolov4_head.nb), and a QP value to adjust the quality within each box (qpoffset=-0.6). Negative values increase video quality, while positive values decrease it, so this command string would increase the quality of the faces by approximately 60% relative to other regions in the video.

-vf 'ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6'

Obviously, this video is highly compressed; in a surveillance video, the ROI filter could preserve facial quality for face detection; in a gambling or similar video compressed at a higher bitrate, it could ensure that the players’ or performers’ faces look their best.

Figure 3. The region of interest filter at work; original on the left, ROI filter on the right.

In terms of performance, a single Quadra unit can process about 200 frames per second or at least six 30fps streams. This would allow a single Quadra to detect faces and transcode streams from six security cameras or six player inputs in an interactive gambling application, along with other transcoding tasks performed without region of interest detection.

Figure 4 shows the processing workflow within the Quadra VPU. Here we see the face detection operating within Quadra’s NPUs, with the location and processing instructions passing directly from the NPU to the encoder. As mentioned, since all instructions are processed on Quadra, there are no memory transfers outside the unit, reducing latency to a minimum and improving overall throughput and performance. This architecture represents the ideal execution environment for any video-related AI application.

Figure 4. Quadra’s on-board AI and encoding processing.

NETINT offers several other AI functions, including background removal and replacement, with others like optical character recognition, video enhancement, camera video quality detection, and voice-to-text on the long-term drawing board. Of course, via the NETINT Tool Chain, Quadra should be able to run most models created in any machine learning platform.

Here in late 2022, we’re only touching the surface of how AI can enhance video, whether by improving visual quality, extracting data, or any number of as-yet unimagined applications. Looking ahead, the NETINT AI Tool Chain should ensure that any AI model that you build will run on Quadra. Once deployed, Quadra’s integrated video processing/AI architecture should ensure highly efficient and extremely low-latency operation for that model.

NETINT Quadra vs. NVIDIA T4 – Benchmarking Hardware Encoding Performance

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer

This article is the second in a series about benchmarking hardware encoding performance. In the first article, available here, I delineated a procedure for testing hardware encoders. Specifically, I recommended this three-step procedure:

  1. Identify the most critical quality and throughput-related options for the encoder.
  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput to identify the operating point that delivers the optimum blend of quality and throughput for your application.
  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies.

After laying out this procedure, I applied it to the NETINT Quadra Video Processing Unit (VPU) to find the optimum operating point and the associated quality, cost per stream, and watts per stream. In this article, we perform the same analysis on the NVIDIA T4 GPU-based encoder.

About The NVIDIA T4

The NVIDIA T4 is powered by NVIDIA Turing Tensor Cores and draws 70 watts in operation. Pricing varies by reseller, with $2,299 around the median price, which puts it slightly higher than the $1,500 quoted for the NETINT Quadra T1 VPU in the previous article.

In creating the command line for the NVIDIA encodes, I checked multiple NVIDIA documents, including a document entitled Video Benchmark Assumptions, this blog post entitled Turing H.264 Video Encoding Speed and Quality, and a document entitled Using FFmpeg with NVIDIA GPU Hardware acceleration that requires a login. I readily admit that I am not an expert on NVIDIA encoding, but the point of this exercise is not absolute quality as much as the range of quality and throughput that all hardware enables. You should check these documents yourself and create your own version of the optimized command string.

While there are many configuration options that impact quality and throughput, we focused our attention on two, lookahead and presets. As discussed in the previous article, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what is coming and can make more intelligent decisions. This improves encoding quality, particularly at and around scene changes, and it can improve bitrate efficiency. But lookahead adds latency equal to the lookahead duration, and it can decrease throughput.

Note that while the NVIDIA documentation recommends a lookahead buffer of twenty frames, I use 15 in my tests because, at 20, the hardware decoder kept crashing. I tested a 20-frame lookahead using software decoding, and the quality differential between 15 and 20 was inconsequential, so this shouldn’t impact the comparative results.

I also tested using various NVIDIA presets, which, like all encoding presets, trade off quality vs. throughput. To measure quality, I computed the VMAF harmonic mean and low-frame scores, the latter a measure of transient quality. For throughput, I tested the number of simultaneous 1080p30 files the hardware could process at 30 fps. I divided price and power draw by the stream count to determine cost/stream and watts/stream.
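For reference, the metrics used throughout can be computed from per-frame VMAF scores as sketched below. The "low-frame" definition here (the single lowest frame score) is one reasonable reading of the term; the per-frame scores are illustrative, and the price and wattage are simply the T4 figures quoted above.

```python
def vmaf_harmonic_mean(scores):
    # The harmonic mean penalizes low-scoring frames more than the
    # arithmetic mean does, which is why it is often preferred for VMAF.
    return len(scores) / sum(1.0 / s for s in scores)

def low_frame(scores):
    # Lowest single-frame score: a proxy for transient quality drops.
    return min(scores)

def per_stream(price, watts, streams):
    # Divide card price and power draw by simultaneous 1080p30 streams.
    return price / streams, watts / streams

# Illustrative per-frame scores with one transient dip.
scores = [95.0, 94.0, 96.0, 60.0, 95.0]
print(round(vmaf_harmonic_mean(scores), 1), low_frame(scores))

# T4 figures quoted above: $2,299 and 70 W; the stream count is illustrative.
print(per_stream(2299, 70, 7))
```

Note how the single 60-point dip pulls the harmonic mean well below the arithmetic mean of the same scores, which is exactly the behavior you want when transient drops matter.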

As you can see in Table 1, I tested with a lookahead value of 15 for selected presets 1-9, and then with a 0 lookahead for preset 9. Line two shows the closest x264 equivalent score for perspective.

In terms of the operating point for comparing to Quadra, I chose the lookahead 15/preset 4 configuration, which yielded twice the throughput of preset 2 with only a minor reduction in VMAF harmonic mean. We will consider low-frame scores in the final comparisons.

In general, the presets worked as they should, with higher quality and lower throughput at the left end, and the reverse at the right end, though LA15/P4 performance was an anomaly since it produced lower quality and higher throughput than LA15/P6. In addition, dropping the lookahead buffer did not produce the performance increase that we saw with Quadra, though it also did not produce a significant quality decrease.

Table 1. H.264 options and results.

Table 2 shows the T4’s HEVC results. Though quality was again near the medium x265 preset with several combinations, throughput was very modest at 3 or 4 streams at that quality level. For HEVC, LA15/P4 stands out as the optimal configuration, with four times or better throughput than other combinations with higher-quality output.

In terms of expected preset behavior, LA15/P4 was again quite the anomaly, producing the highest throughput in the test suite with slightly lower quality than LA15/P6, which should deliver lower quality. Again, switching from LA 15 to LA 0 produced neither the expected spike in throughput nor a drop in quality, as we saw with the Quadra for both HEVC and H.264.

Table 2. HEVC options and results.

Quadra vs. T4

Now that we have identified the operating points for Quadra and the T4, let us compare quality, throughput, CAPEX, and OPEX. You see the data for H.264 in Table 3.

Here, the stream count was the same, so Quadra’s advantage in cost per stream and watts per stream related to its lower cost and more efficient operation. At their respective operating points, the Quadra’s VMAF harmonic mean quality was slightly higher, with a more significant advantage in the low-frame score, a predictor of transient quality problems.

Table 3. Comparing Quadra and T4 at H.264 operating points.

Table 4 shows the same comparison for HEVC. Here, Quadra output 75% more streams than the T4, which increases the cost per stream and watts per stream advantages. VMAF harmonic means scores were again very similar, though the T4’s low frame score was substantially lower.

Table 4. Comparing Quadra and T4 at HEVC operating points. 

Figure 5 illustrates the low frames and the low-frame differential between the two files. It is the results plot from the Moscow State University Video Quality Measurement Tool (VQMT), which displays the VMAF score, frame by frame, over the entire duration of the two video files analyzed, with Quadra in red and the T4 in green. The top window shows the VMAF comparison for the two entire files, while the bottom window is a close-up of the highlighted region of the top window, right around the most significant downward spike at frame 1590.

Figure 5. The downward green spikes represent the low-frame scores in the T4 encode.

As you can see in the bottom window in Figure 5, the low-frame region extends for 2-3 frames, which might be borderline noticeable by a discerning viewer. Figure 6 shows a close-up of the lowest-quality frame, Quadra on the left, T4 on the right; the dramatic difference in VMAF score, 87.95 to 57, is certainly warranted by the visual difference. Not surprisingly, PSNR and SSIM measurements confirmed these low frames.

Figure 6. Quality comparisons, NETINT Quadra on the left, T4 on the right.

It is useful to track low frames because if they extend beyond 2-3 frames, they become noticeable to viewers and can degrade viewer quality of experience. Mathematically, in a two-minute test file, the impact of even 10 – 15 terrible frames on the overall score is negligible. That is why it is always useful to visualize the metric scores with a tool like VQMT, rather than simply relying on a single score.

Summing Up

Overall, you should consider the procedure discussed in this and the previous article as the most important takeaway from these two articles. I am not an expert in encoding with NVIDIA hardware, and the results from a single or even a limited number of files can be idiosyncratic.

Do your own research, test your own files, and draw your own conclusions. As stated in the previous article, do not be impressed by quality scores without knowing the throughput, and expect that impressive throughput numbers may be accompanied by a significant drop in quality.

Whenever you test any hardware encoder, identify the most important quality/throughput configuration options, test over the relevant range, and choose the operating point that delivers the best combination of quality and throughput. This will give the best chance to achieve a meaningful apples vs. apples comparison between different hardware encoders that incorporates quality, cost per stream, and watts per stream.

Evaluating Hardware Transcoder Performance

If you’ve ever benchmarked software codecs, you know the quality/throughput tradeoff; simply stated, the higher the quality, the lower the throughput. In contrast, for many first-generation hardware encoders, throughput was prioritized, but the quality was fixed; you got what you got.

Most next-gen hardware encoders offer presets or other switches to optimize quality at a cost to throughput that can be even more striking than with software. In comparing specifications for encoders, remember the quality/throughput tradeoff. And when you see quality stats, think, “hmm, at what throughput?” Or, if you see throughput stats, ask, “at what quality?”

Whenever you test a hardware encoder, you should start by identifying the configuration options that most impact quality and throughput and then test across a range of configurations to get a sense of the performance/quality tradeoff. If you plug in pricing and power consumption figures, you can also easily compute the cost per stream and watts per stream. This is the CAPEX and OPEX side of the equation.

Then you can choose the “operating point” that delivers the optimum blend of quality and throughput for your applications. When comparing multiple encoders, you should perform the same analysis for each to enable a complete apples-to-apples comparison.
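The operating-point analysis above can be mechanized: record quality and stream count for each configuration, derive cost per stream and watts per stream, and select among the configurations that clear your quality floor. The configuration names, scores, and power draw below are illustrative placeholders, not measured results; only the $1,500 Quadra price comes from the article.

```python
# Each entry: (configuration, VMAF harmonic mean, simultaneous 1080p30 streams).
# Values are illustrative placeholders, not measured results.
configs = [
    ("LA40/RDO-on",  95.1,  8),
    ("LA20/RDO-off", 94.8, 16),
    ("LA0/RDO-off",  92.5, 24),
]

PRICE_USD = 1500  # Quadra price quoted in the article
WATTS = 20        # hypothetical power draw, for illustration only

def operating_points(configs, price, watts, min_vmaf=94.0):
    """Annotate each config with cost/stream and watts/stream, then
    keep those meeting a quality floor, best throughput first."""
    rows = [(name, vmaf, n, price / n, watts / n)
            for name, vmaf, n in configs if vmaf >= min_vmaf]
    return sorted(rows, key=lambda r: r[2], reverse=True)

best = operating_points(configs, PRICE_USD, WATTS)[0]
print(best[0], round(best[3], 2), round(best[4], 2))
```

With these placeholder numbers, the middle configuration wins: it nearly matches the top quality while doubling the stream count, which halves both cost per stream and watts per stream.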

Recently, I benchmarked the performance of NETINT’s Quadra Video Processing Unit (VPU) against the NVIDIA T4. In this post, I’ll review the testing and the Quadra results to give you a feel for the hardware evaluation process. In a future post, I’ll review the NVIDIA findings and compare the two.

Benchmarking Quadra

Briefly, Quadra is NETINT’s newest ASIC-based transcoder, called a VPU, because it has onboard decoding, scaling, encoding, and overlay, plus an 18 TOPS AI engine. The VPU can create encoded bitstreams in H.264, HEVC, and AV1.

Quadra has two major configuration options that impact quality: the lookahead buffer and rate-distortion optimization.

Briefly, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what’s coming and can make more intelligent decisions. This improves encoding quality, particularly at/around scene changes, and it can improve bitrate efficiency. But, lookahead adds latency equal to the lookahead duration, and it can decrease throughput.

Table 1 shows the impact of a 40-frame lookahead buffer when encoding to the H.264 format. Disabling the lookahead drops the top-line harmonic mean VMAF score by 2.3 points, which is borderline significant, while the low-frame score drops by almost 16 points, a differential that could predict transient problems apparent to some viewers. On the other hand, in addition to injecting 1.3 seconds of latency into the process, the lookahead cuts throughput by 33%, from 36 1080p streams to 24.

Table 1. Quality and performance impact
of a 40-frame lookahead on Quadra H.264 encoding.
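The latency figures above fall directly out of the frame rate: a lookahead buffer delays output by its depth in frames divided by frames per second. A minimal sketch (the function name is mine, not an encoder API):

```python
def lookahead_latency_sec(frames: int, fps: float = 30.0) -> float:
    """Added latency from a lookahead buffer: buffer depth / frame rate."""
    return frames / fps

# A 40-frame lookahead at 30 fps adds ~1.33 s, matching the figure above;
# a 20-frame lookahead adds ~0.67 s.
print(round(lookahead_latency_sec(40), 2))  # prints 1.33
print(round(lookahead_latency_sec(20), 2))  # prints 0.67
```

At 60 fps the same 40-frame buffer would add only ~0.67 seconds, so the latency cost of a given lookahead depth depends on the content's frame rate.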

Rate distortion optimization (RDO) functions like most presets and adjusts several parameters that impact both quality and throughput, with higher values increasing quality and reducing throughput. With H.264 output, Quadra offers one level of RDO, while with HEVC, there are three levels: 1, 2, and 3.

Table 2 shows the range of H.264 options tested during the recent benchmarking. LA is lookahead, and I tested three values: 40, 20, and 0. I also tested with RDO on and off. To provide some quality perspective, the x264 Quality Equivalent column shows the x264 preset that delivers roughly the same quality when encoding with otherwise identical parameters.

At the highest quality setting, Quadra’s output quality slightly exceeded that of x264 using the slow preset, and the unit produced 16 1080p streams. You see that dropping the lookahead from 40 to 20 with RDO disabled had little impact on quality or throughput but cut latency by 0.66 sec, making that choice easy for latency-sensitive events.

Table 2. H.264 options and results.

At the lowest possible quality setting, Quadra’s quality dropped to slightly better than veryfast quality, which is often the x264 preset used for live applications to ensure at least nominal throughput with CPU-only transcoding.

At this quality level, the VPU outputs 36 1080p streams. By adding the cost per stream and watts per stream data, you get a true feel for the comparative CAPEX and OPEX across all settings combinations.
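The cost-per-stream and watts-per-stream arithmetic is simply the device price and power draw divided by the simultaneous stream count at a given operating point. The price and wattage below are hypothetical placeholders, not Quadra specifications; only the 16- and 36-stream counts come from the H.264 results above.

```python
# Hedged sketch with hypothetical numbers: a $1,500 device drawing 40 W.
# Only the stream counts (16 high-quality, 36 low-quality) are from the text.
def per_stream(price_usd: float, watts: float, streams: int):
    """Return (cost per stream, watts per stream) at an operating point."""
    return price_usd / streams, watts / streams

print(per_stream(1500, 40, 16))  # prints (93.75, 2.5)
print(per_stream(1500, 40, 36))  # roughly (41.67, 1.11)
```

Running the same arithmetic for a software encoder means dividing the full server cost and its 250 – 500 W draw by its (much lower) stream count, which is where the CAPEX and OPEX gap appears.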

Table 3 shows the same data for HEVC transcoding using the same lookahead options and RDO at 1, 2, 3, and 0 (disabled). At the highest quality level, the output quality nearly matched the x265 encoder using the slow preset, but the unit only produced four streams. At the other end of the spectrum, output quality nearly matched x265 using the veryfast preset, but the Quadra produced 40 1080p30 streams, four more than with the H.264 format.

Table 3. HEVC options and results.

There are several new hardware encoders coming, and their launches will be accompanied by aggressive claims about quality and throughput. My recommendation is not to assume that the quality and throughput claims were measured at the same settings. In short, you'd better do your own testing. Trust but verify comes to mind.

When you perform your own testing, remember the methodology explained above:

  1. Identify the most critical quality-related options for your specific application. All producers have different priorities, whether it’s bitrate efficiency, absolute quality, ultra-low latency, density, power consumption, or cost per stream. You need to know your critical constraints in order to arrive at the best solution.

  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput. Increasingly, even for quality-driven use cases, imperceptible quality tradeoffs might be necessary to meet an operational cost or energy efficiency target. You should choose the operating point that delivers the optimum blend for your application.

  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies. Remember to factor in the CAPEX of the additional servers required to run a software encoding service. Our customers report a reduction in the number of machines needed by 90% or more, and this can translate to tens of millions or even hundreds of millions of dollars in savings, or reclaimed CPUs that can be used in other parts of the operation.
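Steps 2 and 3 above amount to a constrained selection: among the tested configurations, pick the one that maximizes throughput while staying above your quality floor. A minimal sketch; the config names and (VMAF, streams) tuples are illustrative, not measured results.

```python
# Sketch of the operating-point selection in steps 2-3.
# results maps a config name to (vmaf_score, streams_per_device);
# the numbers below are illustrative, not benchmark data.
def choose_operating_point(results, min_vmaf):
    """Pick the config with the most streams that still meets the quality floor."""
    eligible = {c: r for c, r in results.items() if r[0] >= min_vmaf}
    if not eligible:
        return None
    return max(eligible, key=lambda c: eligible[c][1])

results = {
    "LA40_RDO_on":  (95.0, 16),
    "LA20_RDO_off": (93.0, 24),
    "LA0_RDO_off":  (90.0, 36),
}
print(choose_operating_point(results, min_vmaf=92.0))  # prints LA20_RDO_off
```

If your constraint is latency or watts per stream rather than quality, the same pattern applies: filter on the hard constraint, then optimize the remaining metric.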

In the next post, we’ll share quality results from the NVIDIA T4 GPU and compare them to Quadra.