Understanding the Economics of Transcoding

Whether your business model is FAST or subscription-based premium content, your success depends upon your ability to deliver a high-quality viewing experience while relentlessly reducing costs. Transcoding is one of the most expensive production-related costs and the ultimate determinant of video quality, so it plays a huge role on both sides of this equation. This article identifies the most relevant metrics for ascertaining the true cost of transcoding and then uses these metrics to compare the relative cost of the available methods for live transcoding.

Economics of Transcoding: Cost Metrics

There are two potential cost categories associated with transcoding: capital costs and operating costs. Capital costs arise when you buy your own transcoding gear, while operating costs apply when you operate this equipment or use a cloud provider. Let’s discuss each in turn.

Economics of Transcoding: CAPEX

The simplest way to compare transcoders is to normalize capital and operating costs using the cost per stream or cost per ladder, which simplifies comparing disparate systems with different costs and throughput. The cost per stream applies to services inputting and delivering a single stream, while the cost per ladder applies to services inputting a single stream and outputting an encoding ladder.

We’ll present real-world comparisons once we introduce the available transcoding options, but for the purposes of this discussion, consider the simple example in Table 1. The top line shows that System B costs twice as much as System A, while line 2 shows that it also offers 250% of the capacity of System A. On a cost-per-stream basis, System B is actually cheaper.

TABLE 1: A simple cost-per-stream analysis.

The next few lines use this data to compute the number of required systems for each approach and the total CAPEX. Assuming that your service needs 640 simultaneous streams, the total CAPEX for System A dwarfs that of System B. Clearly, just because a particular system costs more than another doesn’t make it the more expensive option.
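
To make the arithmetic concrete, here is a minimal sketch of the cost-per-stream and CAPEX computation. The dollar figures and stream counts are invented for illustration, following the ratios described above (the real figures are in Table 1); only the method matters:

    import math

    # Invented prices and per-server stream counts, for illustration only:
    # System B costs twice as much but offers 250% of System A's capacity.
    systems = {
        "System A": {"price": 10_000, "streams": 32},
        "System B": {"price": 20_000, "streams": 80},
    }
    required_streams = 640  # the service's simultaneous-stream requirement

    for name, s in systems.items():
        cost_per_stream = s["price"] / s["streams"]
        servers = math.ceil(required_streams / s["streams"])  # can't buy a fraction
        print(f"{name}: ${cost_per_stream:,.2f}/stream, {servers} servers, "
              f"total CAPEX ${servers * s['price']:,}")

With these inputs, System B costs twice as much per server but needs eight servers instead of twenty, so its total CAPEX is still materially lower.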

For the record, the throughput of a particular server is also referred to as density, which obviously impacts OPEX charges. System B delivers over six times the streams from the same 1RU rack as System A, so it is much denser, which directly impacts both power consumption and storage charges.

Details Matter

Several factors complicate the otherwise simple analysis of cost per stream. First, you should analyze using the output codec or codecs you plan to deliver, both current and future. Many systems output H.264 quite competently but choke considerably with the much more complex HEVC codec. If AV1 is in your future plans, you should prioritize a transcoder that outputs AV1 and compare cost per stream against all alternatives.

The second requirement is to use consistent output parameters. Some vendors quote throughput at 30 fps, some at 60 fps; you need to use the same value for all transcoding options. As a rough rule of thumb, if a vendor quotes 60 fps, you can double the throughput for 30 fps, so a system that can output eight 1080p60 streams can likely output sixteen 1080p30 streams. Obviously, you should verify this before buying.

If a vendor quotes in streams and you’re outputting encoding ladders, it’s more complicated. Encoding ladders involve scaling to lower resolutions for the lower-quality rungs. If the transcoder performs scaling onboard, throughput should be greater than that of systems that scale using the host CPU, and you can deploy a less capable (and less expensive) host system.

The last consideration involves the concept of “operating point,” or the encoding parameters that you would likely use for your production, and the throughput and quality at those parameters. To explain, most transcoders include encoding options that trade off quality vs throughput much like presets do for x264 and x265. Choosing the optimal setting for your transcoding hardware is often a balance of throughput and bandwidth costs. That is, if a particular setting saves 10% bandwidth, it might make economic sense to encode using that setting even if it drops throughput by 10% and raises your capital cost accordingly. So, you’d want to compute your throughput numbers and cost per stream at that operating point.
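
Here is a minimal payback sketch of that operating-point decision, with every input invented (egress volume, CDN rate, and server price are assumptions, not measured values):

    # Does a setting that saves 10% bandwidth justify a 10% throughput drop?
    monthly_egress_tb = 5_000     # assumed monthly delivery volume
    cost_per_tb = 5.00            # assumed CDN cost per TB delivered
    bandwidth_saving = 0.10       # the higher-quality setting saves 10% bandwidth
    throughput_penalty = 0.10     # ...but cuts per-server throughput by 10%
    server_price = 20_000         # assumed price per transcoding server
    baseline_servers = 10

    monthly_saving = monthly_egress_tb * cost_per_tb * bandwidth_saving
    extra_servers = baseline_servers / (1 - throughput_penalty) - baseline_servers
    extra_capex = extra_servers * server_price

    print(f"Extra CAPEX ${extra_capex:,.0f}; bandwidth savings repay it "
          f"in {extra_capex / monthly_saving:.1f} months")

With these inputs, the extra servers pay for themselves in under nine months; with a tenth of the egress, they never would, which is why the operating point is a per-service decision.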

In addition, many transcoders produce lower throughput when operating in low latency mode. If you’re transcoding for low-latency productions, you should ascertain whether the quoted figures in the spec sheets are for normal or low latency.

For these reasons, completing a thorough comparison requires a two-step analysis. Use spec sheet numbers to identify transcoders that you’d like to consider and acquire them for further testing. Once you have them in your labs you can identify the operating point for all candidates, test at these settings, and compare them accordingly.

Economics of Transcoding: OPEX - Power

Now, let’s look at OPEX, which has two components: power and storage costs. Table 2 continues our example, looking at power consumption.

Unfortunately, ascertaining power consumption may be complicated if you’re buying individual transcoders rather than a complete system. That’s because while transcoder manufacturers often list the power consumption of their devices, you can only run these devices in a complete system. Within the system, power consumption will vary by the number of units configured in the system and the specific functions performed by the transcoder.

Note that the most significant contributor to overall system power consumption is the CPU. Referring back to the previous section, a transcoder that scales onboard will require a lower CPU contribution than one that scales using the host CPU, reducing overall power consumption. Along the same lines, a system without a hardware transcoder uses the CPU for all functions, maxing out CPU utilization and likely consuming about the same energy as a system loaded with transcoders that collectively might consume 200 watts.

Again, the only way to achieve a full apples-to-apples comparison is to configure the server as you would for production and measure power consumption directly. Fortunately, as you can see in Table 2, stream throughput is a major determinant of overall power consumption. Even if you assume that Systems A and B both consume the same power, System B’s throughput makes it much cheaper to operate over a five-year expected life, and much kinder to the environment.
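
The watts-per-stream math mirrors the cost-per-stream math. A sketch using the same invented stream counts as before and an assumed 500-watt draw for both systems:

    HOURS_PER_YEAR = 24 * 365
    COST_PER_KWH = 0.335  # the Cyprus rate used later in this article

    systems = {"System A": (500, 32), "System B": (500, 80)}  # (watts, streams)

    for name, (watts, streams) in systems.items():
        five_year_kwh = watts / 1000 * HOURS_PER_YEAR * 5
        print(f"{name}: {watts / streams:.1f} W/stream, 5-year power cost "
              f"${five_year_kwh * COST_PER_KWH:,.0f} per server")

Identical per-server power, but System B spreads it across 2.5 times the streams, and the service needs far fewer servers.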

TABLE 2. Computing the watts per stream of the two systems.

Economics of Transcoding: Storage Costs

Once you purchase the systems, you’ll have to house them. While these costs are easiest to compute if you’re paying for a third-party co-location service, you’ll have to estimate costs even for in-house data centers. Table 3 continues the five-year cost estimates for our two systems, and the denser System B proves much cheaper to house as well as to power.
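
Housing costs follow directly from the server count and rack space. A sketch with an assumed co-location rate (real rates vary widely by facility and region):

    MONTHLY_COST_PER_RU = 100  # assumed charge per 1RU of rack space
    servers = {"System A": 20, "System B": 8}  # counts from the CAPEX sketch

    for name, count in servers.items():
        print(f"{name}: ${count * MONTHLY_COST_PER_RU * 12 * 5:,} "
              f"in rack charges over five years")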

TABLE 3: Computing the storage costs for the two systems.

Economics of Transcoding: Transcoding Options

Those are the cost fundamentals; now let’s explore them within the context of different encoding architectures.

There are three general transcoding options: CPU-only, GPU, and ASIC-based. There are also FPGA-based solutions, though these will probably be supplanted by cheaper-to-manufacture ASIC-based devices over time. Briefly,

  • CPU-based transcoding, also called software-based transcoding, relies on the host central processing unit, or CPU, for all transcoding functions.
  • GPU-based transcoding relies on Graphics Processing Units, which are developed primarily for graphics-related functions but can also transcode video. These are added to the server via add-in PCIe cards.
  • ASICs are Application-Specific Integrated Circuits designed specifically for transcoding. These are added to the server as add-in PCIe cards or devices that conform to the U.2 form factor.

Economics of Transcoding: Real-World Comparison

NETINT manufactures ASIC-based transcoders and video processing units. Recently, we published a case study where a customer, Mayflower, rigorously and exhaustively compared these three alternatives, and we’ll share the results here.

By way of background, Mayflower’s use case needed to input 10,000 incoming simultaneous streams and distribute over a million outgoing simultaneous streams worldwide at a latency of one to two seconds. Mayflower hosts a worldwide service available 24/7/365.

Mayflower started with 80-core bare metal servers and tested CPU-based transcoding, then GPU-based transcoding, and then two generations of ASIC-based transcoding. Table 4 shows the net/net of their analysis, with NETINT’s Quadra T2 delivering the lowest cost per stream and the greatest density, which contributed to the lowest co-location and power costs.

RESULTS: COST AND POWER

TABLE 4. A real-world comparison of the cost per stream and OPEX associated with different transcoding techniques.

As you can see, the T2 delivered an 85% reduction in CAPEX and a roughly 90% reduction in OPEX as compared to CPU-based transcoding. CAPEX savings compared to the NVIDIA T4 GPU were about 57%, with OPEX savings around 70%.

Table 5 shows the five-year cost of the Mayflower T2-based solution using the Cyprus cost per kWh of $0.335. As you can see, the total is $2,225,241, a number we’ll return to in a moment.

TABLE 5: Five-year cost of the Mayflower transcoding facility.

Just to close a loop, Tables 1, 2, and 3 compare the cost and performance of a Quadra Video Server equipped with ten Quadra T1U VPUs (Video Processing Units) with CPU-based transcoding on the same server platform. You can read more details on that comparison here.

Table 6 shows the total cost of both solutions. In terms of overall outlay, meeting the transcoding requirements with the Quadra-based System B costs 73% less than the CPU-based system. If that sounds like a significant savings, keep reading. 

TABLE 6: Total cost of the CPU-based System A and Quadra T2-based System B.

Economics of Transcoding: Cloud Comparison

If you’re transcoding in the cloud, all of your costs are OPEX. With AWS, you have two alternatives: producing your streams with Elemental MediaLive or renting EC2 instances and running your own transcoding farm. We considered the MediaLive approach here, and it appears economically unviable for 24/7/365 operation.

Using Mayflower’s numbers, the CPU-only approach required 500 80-core Intel servers running 24/7. The closest CPU in the Amazon EC2 pricing calculator was the 64-core c6i.16xlarge, which, under the EC2 Instance Savings plan, with a 3-year commitment and no upfront payment, costs $1,125.84/month.

FIGURE 1. The annual cost of the Mayflower system if using AWS.

We used Amazon’s pricing calculator to roll these numbers out to 12 months and 500 simultaneous servers, and you see the annual result in Figure 1. Multiply this by five to get to the five-year cost of $33,775,056, which is 15 times the cost of the Quadra T2 solution shown in Table 5.

We ran the same calculation on the 13 systems required for the Quadra Video Server analysis shown in Tables 1-3, which were powered by a 32-core AMD CPU. Assuming a c6a.8xlarge instance with a 3-year commitment and no upfront payment, this produced an annual charge of $79,042.95, or $395,214.75 for the five-year period, which is about 8 times more costly than the Quadra-based solution.

FIGURE 2: The annual cost of an AWS system per the example schema presented in tables 1-3.

Cloud services are an effective means for getting services up and running, but are vastly more expensive than building your own encoding infrastructure. Service providers looking to achieve or enhance profitability and competitiveness should strongly consider building their own transcoding systems. As we’ve shown, building a system based on ASICs will be the least expensive option.

In August, NETINT held a symposium on Building Your Own Live Streaming Cloud. The on-demand version is available for any video engineer seeking guidance on which encoder architecture to acquire, the available software options for transcoding, where to install and run your encoding servers, and progress made on minimizing power consumption and your carbon footprint.

ON-DEMAND: Building Your Own Live Streaming Cloud

Choosing Transcoding Hardware: Deciphering the Superiority of ASIC-based Technology

Which technology reigns supreme in transcoding: CPU-only, GPU, or ASIC-based? Kenneth Robinson’s incisive analysis from the recent symposium makes a compelling case for ASIC-based transcoding hardware, particularly NETINT’s Quadra. Robinson’s metrics prioritized viewer experience, power efficiency, and cost. While CPU-only systems appear initially economical, they falter with advanced codecs like HEVC. NVIDIA’s GPU transcoding offers more promise, but the Quadra system still outclasses both in quality, cost per stream, and power consumption. Furthermore, Quadra’s adaptability allows a seamless switch between H.264 and HEVC without incurring additional costs. Independent assessments, such as Ilya Mikhaelis’, echo Robinson’s conclusions, cementing ASIC-based transcoding hardware as the optimal choice.


During the recent symposium, Kenneth Robinson, NETINT’s manager of Field Application Engineering, compared three transcoding technologies: CPU-only, GPU, and ASIC-based transcoding hardware. His analysis, which incorporated quality, throughput, and power consumption, is useful as a template for testing methodology and for the results. You can watch his presentation here and download a copy of his presentation materials here.

Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.

As a preview of his findings, Kenneth found that when producing H.264, ASIC-based hardware transcoding delivered CAPEX savings of 86% and 77% compared to CPU and GPU-based transcoding, respectively. OPEX savings were 95% vs. CPU-only transcoding and 88% compared to GPU.

For the more computationally complex HEVC codec, the savings were even greater. As compared to CPU-based transcoding, ASICs saved 94% on CAPEX and 98% on OPEX. As compared to GPU-based transcoding, ASICs saved 82% on CAPEX and 90% on OPEX. These savings are obviously profound and can make the difference between a successful and profitable service and one that’s mired in red ink.

Let’s jump into Kenneth’s analysis.

Determining Factors

Digging into the transcoding alternatives, Kenneth described the three options. First are CPUs from manufacturers like AMD or Intel. Second are GPUs from companies like NVIDIA or AMD. Third are ASICs, or Application Specific Integrated Circuits, from manufacturers like NETINT. Kenneth noted that NETINT calls its Quadra devices Video Processing Units (VPU), rather than transcoders because they perform multiple additional functions besides transcoding, including onboard scaling, overlay, and AI processing.

He then outlined the four factors used to determine the optimal choice, shown in Figure 2. Quality is the average quality as assessed using metrics like VMAF, PSNR, or subjective video quality evaluations involving A/B comparisons with viewers. Kenneth used VMAF for this comparison. VMAF has been shown to have the highest correlation with subjective scores, which makes it a good predictor of viewer quality of experience.
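
If you want to generate comparable scores yourself, FFmpeg can compute VMAF via its libvmaf filter, assuming your build includes it; the file names here are placeholders:

    ffmpeg -i transcode.mp4 -i source.mp4 \
      -lavfi libvmaf=log_path=vmaf.json:log_fmt=json -f null -

The first input is the transcode under test and the second is the reference; the JSON log records a score for every frame.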

Figure 2. How Kenneth compared the technologies.

Low-frame quality is the lowest VMAF score on any frame in the file. This is a predictor for transient quality issues that might only impact a short segment of the file. While these might not significantly impact overall average quality, short, low-quality regions may nonetheless degrade the viewer’s quality of experience, so are worth tracking in addition to average quality.

Server capacity measures how many streams each configuration can output, which is also referred to as throughput. Dividing server cost by the number of output streams produces the cost per stream, which is the most relevant capital cost comparison. The higher the number of output streams, the lower the cost per stream and the lower the necessary capital expenditures (CAPEX) when launching the service or sourcing additional capacity.

Power consumption measures the power draw of a server during operation. Dividing this by the number of streams produced results in the power per stream, the most useful figure for comparing different technologies.
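
The quality metrics above (average, low-frame, and the rolling average used later) are all simple computations over a per-frame VMAF log. A sketch, assuming a libvmaf JSON log like the one shown earlier and 30 fps content:

    import json
    import numpy as np

    with open("vmaf.json") as f:
        vmaf = np.array([fr["metrics"]["vmaf"] for fr in json.load(f)["frames"]])

    fps = 30
    window = 20 * fps  # 20-second rolling window, as in the rolling-quality charts
    rolling = np.convolve(vmaf, np.ones(window) / window, mode="valid")

    print(f"average {vmaf.mean():.1f}, low-frame {vmaf.min():.1f}, "
          f"worst 20-second stretch {rolling.min():.1f}")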

Detailing his test procedures, Kenneth noted that he tested CPU-only transcoding on a system equipped with a 32-core AMD Epyc CPU. Then he installed the NVIDIA L4 GPU (a recent release) for GPU testing and NETINT’s Quadra T1U U.2 form factor VPU for ASIC-based testing.

He evaluated two codecs, H.264 and HEVC, using a single file, the Meridian file from Netflix, which contains a mix of low and high-motion scenes and many challenging elements like bright lights, smoke and fog, and very dark regions. If you’re testing for your own deployments, Kenneth recommended testing with your own test footage.

Kenneth used FFmpeg to run all transcodes, testing CPU-only quality with the x264 and x265 codecs using the medium and very fast presets. He used FFmpeg for NVIDIA and NETINT testing as well, transcoding with the native H.264 and H.265 codec for each device.
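
For reference, the test commands would look broadly like the following. These are directional sketches, not Kenneth's exact command strings; the bitrates and file names are placeholders:

    # x264, medium preset (repeat with -preset veryfast, and with libx265 for HEVC)
    ffmpeg -y -i meridian.mp4 -c:v libx264 -preset medium -b:v 3000k out_x264.mp4

    # NVIDIA's hardware encoder via FFmpeg
    ffmpeg -y -i meridian.mp4 -c:v h264_nvenc -b:v 3000k out_nvenc.mp4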

H.264 Average, Low-Frame, and Rolling Frame Quality

The first result Kenneth presented was average H.264 quality. As shown in Figure 3, Kenneth encoded the Meridian file to four output files for each technology, with encodes at 2.2 Mbps, 3.0 Mbps, 3.9 Mbps, and 4.75 Mbps. In this “rate-distortion curve” display, the left axis is VMAF quality, and the bottom axis is bitrate. In all such displays, higher results are better, and Quadra’s blue line is the best alternative at all tested bitrates, beating NVIDIA and x264 using the medium and very fast presets.

Figure 3. Quadra was tops in H.264 quality at all tested bitrates.

Kenneth next shared the low-frame scores (Figure 4), noting that while the NVIDIA L4’s score was marginally higher than the Quadra’s, the difference at the higher end was only 1%. Since no viewer would notice this differential, this indicates operational parity in this measure.

Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.

The final H.264 quality finding displayed a 20-second rolling average of the VMAF score. As you can see in Figure 5, the Quadra, which is the blue line, is consistently higher than the NVIDIA L4 or x264 using the medium or very fast presets. So, even though the Quadra had a lower single-frame VMAF score than NVIDIA, over the course of the entire file, its quality was predominantly superior.

Figure 5. 20-second rolling frame quality over file duration.

HEVC Average, Low-Frame, and Rolling Frame Quality

Kenneth then related the same results for HEVC. In terms of average quality (Figure 6), NVIDIA was slightly higher than the Quadra, but the delta was insignificant. Specifically, NVIDIA’s advantage starts at 0.2% and drops to 0.04% at the higher bit rates. So, again, a difference that no viewer would notice. Both NVIDIA and Quadra produced better quality than CPU-only transcoding with x265 and the medium and very fast presets.

Figure 6. HEVC average quality at all tested bitrates.

In the low-frame measure (Figure 7), Quadra proved consistently superior, with NVIDIA significantly lower, again a predictor for transient quality issues. In this measure, Quadra also consistently outperformed x265 using medium and very fast presets, which is impressive.

Figure 7. Quadra proved consistently superior in HEVC low-frame testing.

Finally, HEVC moving average scoring (Figure 8) again showed Quadra to be consistently better across all frames when compared to the other alternatives. You see NVIDIA’s downward spike around frame 3796, which could indicate a transient quality drop that could impact the viewer’s quality of experience.

Figure 8. 20-second rolling frame quality over file duration.

Cost Per Stream and Power Consumption Per Stream - H.264

To measure cost and power consumption per stream, Kenneth first calculated the cost for a single server for each transcoding technology and then measured throughput and power consumption for that server using each technology. Then, he compared the results, assuming that a video engineer had to source and run systems capable of transcoding 320 1080p30 streams.

You see the first step for H.264 in Figure 9. The baseline computer without add-in cards costs $7,100 but can only output fifteen 1080p30 streams using an average of the medium and very fast presets, resulting in a cost per stream of $473. Kenneth installed two NVIDIA L4 cards in the same system, which boosted the price to $14,214 but more than tripled throughput to fifty streams, dropping the cost per stream to $285. Kenneth then installed ten Quadra T1U VPUs in the system, which increased the price to $21,000 but skyrocketed throughput to 320 1080p30 streams, cutting the cost per stream to $65.

This analysis reveals why computing and focusing on the cost per stream is so important; though the Quadra system costs roughly three times the CPU-only system, the ASIC-fueled output is over 21 times greater, producing a much lower cost per stream. You’ll see how that impacts CAPEX for our 320-stream required output in a few slides.
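
Scripted, the per-stream arithmetic is just division over the numbers quoted above (small differences from the slides are rounding):

    configs = {
        "CPU-only": (7_100, 15),
        "2x NVIDIA L4": (14_214, 50),
        "10x Quadra T1U": (21_000, 320),
    }
    for name, (system_cost, streams) in configs.items():
        print(f"{name}: ${system_cost / streams:,.2f} per stream")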

Figure 9. Computing system cost and cost per stream.

Figure 10 shows the power consumption per stream computation. Kenneth measured power consumption during processing and divided that by the number of output streams produced. This analysis again illustrates why normalizing power consumption on a per-stream basis is so necessary; though the CPU-only system draws the least power, making it appear to be the most efficient, on a per-stream basis, it’s almost 20x the power draw of the Quadra system.

Figure 10. Computing power per stream for H.264 transcoding.

Figure 11 summarizes CAPEX and OPEX for a 320-channel system. Note that Kenneth rounded down rather than up to compute the total number of servers for CPU-only and NVIDIA. That is, at a capacity of 15 streams for CPU-only transcoding, you would need 21.33 servers to produce 320 streams. Since you can’t buy a fractional server, you would need 22, not the 21 shown. Ditto for NVIDIA: at 50 output streams each, the six servers shown should have been 6.4, or actually 7. So, the savings shown are underrepresented by about 4.5% for CPU-only and 15% for NVIDIA. Even without the corrections, the CAPEX and OPEX differences are quite substantial.
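
The correction Kenneth describes is the standard round-up, since fractional servers don't exist:

    import math

    required = 320
    for name, per_server in [("CPU-only", 15), ("NVIDIA L4", 50), ("Quadra", 320)]:
        print(f"{name}: {math.ceil(required / per_server)} servers")

This yields 22, 7, and 1 servers, respectively, matching the corrected counts above.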

Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.

Cost Per Stream and Power Consumption Per Stream - HEVC

Kenneth performed the same analysis for HEVC. All systems cost the same, but throughput of the CPU-only and NVIDIA-equipped systems both drop significantly, boosting their costs per stream. The ASIC-powered Quadra outputs the same stream count for HEVC as for H.264, producing an identical cost per stream.

Figure 12. Computing system cost and cost per stream.

The throughput drop for CPU-only and NVIDIA transcoding also boosted the power consumption per stream, while Quadra’s remained the same.

Figure 13. Computing power per stream for HEVC transcoding.

Figure 14 shows the total CAPEX and OPEX for the 320-channel system, and this time, all calculations are correct. While CPU-only systems are tenuous at best for H.264, they’re clearly economically untenable with more advanced codecs like HEVC. While the differential isn’t quite so stark with the NVIDIA products, Quadra’s superior quality and much lower CAPEX and OPEX are compelling reasons to adopt the ASIC-based solution.

Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.

As Kenneth pointed out in his talk, even if you’re producing only H.264 today, if you’re considering HEVC in the future, it still makes sense to choose a Quadra-equipped system because you can switch over to HEVC at any time with no extra hardware cost. With a CPU-only system, you’ll have to more than double your CAPEX spending, while with NVIDIA, you’ll need to spend another 25% to meet capacity.

The Cost of Redundancy

Kenneth concluded his talk with a discussion of full hardware and geo-redundancy. He envisioned a setup where one location houses two servers (a primary and a backup) for full hardware redundancy. A similar setup would be replicated in a second location for geo-redundancy. Using the Quadra video server, four servers could provide both levels of redundancy, costing a total of $84,000. Obviously, this is much cheaper than any of the other transcoding alternatives.

NETINT’s Quadra VPU proved slightly superior in quality to the alternatives, vastly cheaper than CPU-only transcoding, and very meaningfully more affordable than GPU-based transcoders. While these conclusions may seem unsurprising (an employee at an encoding ASIC manufacturer concludes that his ASIC-based technology is best), you can check Ilya Mikhaelis’ independent analysis here and see that he reached the same result.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Hardware Transcoding: What it Is, How it Works, and Why You Care

What is Transcoding?

Like most terms relating to streaming, transcoding is defined more by practice than by a dictionary. In fact, transcoding isn’t in Webster’s or many other dictionaries. That said, it’s generally accepted that transcoding means converting a file from one format to another. More particularly, it’s typically used within the context of a live-streaming application.

As an example, suppose you were watching a basketball game on NBA.tv. Assuming that the game is produced on-site, somewhere in the arena, a video mixer pulls together all video, audio, and graphics. The output would typically be fed into a device that compresses it to a high-bitrate H.264 or another compressed format and sends it to the cloud. You would typically call this live encoding; if the encoder is hardware-based, it would be hardware-based live encoding.

In the cloud, the incoming stream is transcoded to lower resolution H.264 streams for delivery to mobile and other devices or HEVC for delivery to a smart TV. This can be done in software but is typically performed using a hardware transcoder because it’s more efficient. More on this below.

Looking further into the production and common uses of streaming terminology, during the event or after, a video editor might create short highlights from the original H.264 video to share on social media. After editing the clip, they would encode it to H.264 or another compressed format to upload to Instagram or Facebook. You would typically call rendering the output from the software editor encoding, not transcoding, even though the software converts the H.264 input file to H.264 output, just like the transcoder.

HARD QUESTIONS ON HOT TOPICS: Transcoding versus Encoding.
Watch the full conversation on YouTube: https://youtu.be/BcDVnoxMBLI

Boiling all this down in terms of common usage:

  • You encode a live stream from video input, in software or in hardware, to send it to the cloud for distribution. You use a live encoder, either hardware or software, for this.
  • In the cloud, you transcode the incoming stream to multiple resolutions or different formats using a hardware or software transcoder (sketched just after this list).
  • When outputting video for video-on-demand (VOD) deployment, you typically call this encoding (and not transcoding), even if you’re working from the same compressed format as the transcoding device.
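
To make the middle step concrete, here is roughly what a software transcode of a live stream to two ladder rungs might look like in FFmpeg. The ingest and origin URLs are placeholders, and a production ladder would have more rungs and carefully tuned settings:

    ffmpeg -i rtmp://ingest.example.com/live/stream \
      -c:v libx264 -s 1280x720 -b:v 3000k -c:a aac -f flv rtmp://origin.example.com/live/720p \
      -c:v libx264 -s 640x360 -b:v 1000k -c:a aac -f flv rtmp://origin.example.com/live/360p

A hardware transcoder performs the same job, with the encoding (and often the scaling) offloaded from the CPU to the ASIC or GPU.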

Hardware Transcoding Alternatives

Anyone who has ever encoded a file knows that it’s a demanding process for your computer. When producing for VOD, time matters, but if the process takes a moment or two longer than planned, no one really notices. Live, of course, is different; if the video stream slows or is interrupted, viewers notice and may click to another website or change channels.

This is why hardware transcoding is typically deployed for high-volume transcoding applications. You can encode with a CPU and software, but CPUs perform multiple functions within the computer and are not optimized for transcoding. This means that a single CPU-only server can produce fewer streams than one equipped with hardware transcoders, which translates to higher CAPEX and power consumption.

As the name suggests, hardware-based transcoding uses hardware devices other than the CPU to transcode the video. One alternative is the graphics processing unit (GPU), which is highly optimized for graphics-intensive applications like gaming. Transcoding is supported with dedicated hardware circuits in the GPU, but the vast majority of circuits are for graphics and other non-transcoding functions. While GPUs are more efficient than CPUs for transcoding, they are expensive and consume significant power.

ASIC-Based Transcoding

Which takes us to ASICs. Application-Specific Integrated Circuits (ASICs) are designed for a specific task or application, like video transcoding. Because they’re designed for this task, they are more efficient than CPU or GPU-based encoding, more affordable, and more power-efficient.

Because they’re designed for this task, Application-Specific Integrated Circuits (ASICs) are more efficient than CPU or GPU-based encoding, more affordable, and more power-efficient.

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

ASICs are also very compact, so you can pack more ASICs into a server than GPUs or CPUs, increasing the output from that server. This means that fewer servers can deliver the same number of streams as with GPU or CPU-based transcoding, which saves additional storage and maintenance costs.

While we’re certainly biased, if you’re looking for a cost-effective and power-efficient hardware alternative for high-volume transcoding applications, ASIC transcoders are the way to go. Don’t take our word for it; you can read here how YouTube converted much of their production operation to the ASIC-based Argos VCU (for video compression unit). Meta recently also released their own encoding ASIC. Of course, neither of these are for sale to the public; the primary vendor for ASIC-based transcoders is NETINT.

Vindral CDN Against Dinosaurs’ Agreement


One thing is the bill that you're getting, the other thing is the bill we're leaving to our children...”

WATCH FULL CONVERSATION HERE: https://youtu.be/tNPFpXPVpxI

We’re going to talk about Vindral – but first, tell us a little bit about RealSprint?

RealSprint, we’re a Swedish company based in Northern Sweden, which is kind of a great place to be running a tech company. When you’re in a university town, and any time after September it gets dark outside for most parts of the day, people generally try to find things to do inside. So, it’s a good place to have a tech business because you’ll have people spending a lot of time in front of their screens, creating things. RealSprint is a heavily culture-focused team, with the majority located in Northern Sweden and a few based in Stockholm and the U.S.

The company started around 10 years ago as a really small team that did not have the end game figured out yet.  All they knew was that they wanted to do something around video, broadcasting, and streaming. From there it’s grown, and today we’re 30 people.

At a high level, what is Vindral?

Vindral is actually a product family. There is a live CDN, as you mentioned, and there’s also a video compositing software. As for the live CDN, it’s been around five or six years that it’s been running 24/7.

The product was born because we got questions from our clients about latency and quality: ‘Why do I have to choose between low latency and high quality?’ There are solutions on both ends of that spectrum, but when we got introduced to the problem, there weren’t really any good ones. We started looking into real-time technologies, like webRTC, in its current state and quickly found that it’s not really suitable if you want high quality. It’s amazing in terms of latency. But the client’s reality requires more. You can’t go all in on only one aspect of a solution. You need something that’s balanced.

Draw us a block diagram. So, you’ve got your encoder, you’ve got your CDN, you’ve got software…

We can take a typical client in entertainment or gaming. So, they have their content, and they want to broadcast that to a global audience. What they generally do is they ingest one signal to our endpoint, which is the most standard way of using our CDN. And there are several ways of ingesting multiple transfer protocols.

The first thing that happens on our end is we create the ABR ladder. We transcode all the qualities that are needed since network conditions vary between markets. Even in places that are well connected, the home Wi-Fi alone can be so bad at times, with a lot of jitter and latency.

After the ABR ladder is created, the next box fans out to the places in the world where there are potential viewers. And from there, we also have edge software as one part of this. Lastly, the signal is received by the player instanced on the device.

That’s basically it.

You’ve got an encoder in the middle of things creating the encoding ladder. Then you’ve got the CDN distributing. What about the software that you’ve contributed? How does that work? Do I log into some kind of portal and then administrate through there?

Exactly. Take a typical client in gaming, for example. They’re running 50 or 100 channels. And they want to see what’s going on in their operations, understand how much data is flowing through the system, and things like that. There is a portal where they can log in, see their usage, and see all of the channel information that they would need. It’s a very important part, of course, of any mature system that the client understands what’s going on.

Encoding is particularly important for us to solve because we have loads of channels running 24/7. So, that’s different. If you’re running a CDN, and your typical client is broadcasting for 20 minutes a month, then, of course, the encoding load is much lower. In our case, yes, we do have those types (minimal usage), but many of our clients are heavy users, and they own a lot of content rights. Therefore, the encoding side is several hundred terabytes ingested monthly, and that’s with only one quality for each stream on the ingest side.

You’re encoding ABR. Which codecs are you supporting? And which endpoints are you supporting?

So, codec-wise, everybody does H.264, of course. That’s the standard when it comes to live streaming with low latency. We have recently added AV1 as well, which was something we announced as a world first. We weren’t the world’s first with AV1, but we were the world’s first with AV1 at what many would call real time. We call it low latency.

We chose to add it because there’s a market pointing to AV1.

Which devices are you targeting? Is it TV? Smart TV? Mobile? The whole gamut?

I would say the whole gamut. That list of devices is steadily growing. I’m trying to think of any devices that we don’t support. Essentially, as long as it’s using the internet, we deliver to it. Any desktop or mobile browser, including iOS as well.

iOS is, basically, the hardest one. If you’re delivering to iOS browsers, they’re all running iOS Safari, and we’re getting the same performance on iOS Safari. And then Apple TV, Google Chromecast, Samsung, LG TVs, and Android TVs. There’s a plethora of different devices that our clients require us to support.

4K? 1080p? HDR? SDR?

Yes, we support all of them. One of the very important things for us is to prove that you can get quality on low latency.

Take a typical client. They’re broadcasting sports, and their viewers are used to watching this on their television, maybe a 77-inch or 85-inch TV. You don’t want that user to get a 720p stream. This is where the configurable latency really comes into play, allowing the client to pick a second of latency, or 800 milliseconds, with 4K maintained at that latency. That is one of the use cases where we shine.

There’s also a huge market for lower qualities as well, where that’s important.

So, you mentioned ABR ladders, and yes, there are markets where you get 600 kilobits per second on the last mile. You need a solution for that as well.

Your system is the delivery side, the encoding side. Which types of encoders did you consider when you chose the encoder to fit into Vindral?

There are actually two steps to consider depending on whether we’re doing it on-prem or off, like a cloud solution. The client often has their own encoders. Many of our clients use Elemental or something similar just to push the material to us. But on the transcoding, where we generate the ladder, unless we’re passing all qualities through (which is also a possibility), there are, of course, different ways and different directions to go for different scenarios. For example, you could take an Intel CPU-based server and use software to encode. That is a viable option in some scenarios, but not in all.

There’s an Nvidia GPU, for example, which you could use in some scenarios since there are many factors coming into play when making that decision.

The highest priority of all is something that our business generally does badly: maintaining business viability. You want to make sure that any client that is using the system can pay and make their business work. Now, if we have channels that are running 24/7, as we do, and if it’s in a region where it’s not impossible to allocate bare metal or co-location space, then that is a fantastic option in many ways.

CPU-based, GPU-based, and ASICs are all different and make up the three different ones that we’ve looked into.

So, how do you differentiate? You talked about software being a good option in some instances. When is it not a good option?

No option is good or bad in a sense, but if you compare them, both the GPU and the ASIC outperform the software encoding when it comes to heavier use.

The software option is useful when you need to spin it up, spin it down, and you need to move things. You need it to be flexible, which is usually in the lower-revenue parts of the market.

When it comes to the big broadcasters and the large rights holders, the use case is heavier, with many channels and large usage over time; then the GPU, and especially the ASIC, make a lot of sense.

You’re talking there about density. What is the quality picture?
A lot of people think software quality is going to be better than ASIC and GPUs. How do they compare?

It might be in some instances. We’ve found that the quality when using ASICs is fantastic. It all depends on what you want to do. Because we need to understand we’re talking about low latency here. We don’t have the option of multi-pass encoding or anything like that. Everything needs to work in real time. Our requirement on encoding is that it takes a frame to encode, and that’s all the time that you get.

You mentioned density, but there are a lot of other things coming into play, quality being one.

If you’re looking at ASICs, you’re comparing that to GPUs. In some scenarios we’ve had for the past two years, the decision could have been based on the availability factor – there’s a chip shortage. What can I get my hands on? In some cases, we’ve had a client banging on the door, and they want to go live right away.

Going back to the density part. That is a huge game changer because the ASIC is unmatched in terms of the number of streams per rack unit. If you just measure that KPI, and you’re willing to do the job of building your CDN in co-location spaces, which not everybody is, then that’s it. You have to ask yourself, though, who’s going to manage this? You don’t want to bloat when you’re managing this type of solution. If you have thousands of channels running, then cost is one thing when it comes to not having to take up a lot of rack space, but also, you don’t want it to bloat too much.

How formal of analysis did you make in choosing between the two hardware alternatives? Did you bring it down to cost per stream and power per stream?
Did you do any of that math? How did you make that decision between those two options?

Well, in a way, yes. But, on that particular metric, we need to look at the two options and say well, this is at a tenth of the cost. So I’m not going to give you the number, because I know it’s so much smaller.

We’re well aware of what costs are involved, but the cost per stream depends on profiles, etc. Just comparing them. We’ve, naturally, looked at things like started encoding streams, especially in AV1. We look at what the actual performance is, how much load there is, and what’s happening on the cards, and how much you can put on them before they start giving in… But then… there’s such a big difference…

Take, for example, a GPU. A great piece of hardware. But it’s also kind of like buying a car for the sound system. Because the GPU… If I’m buying an NVIDIA GPU to encode video, then I might not even be using the actual rendering capabilities. That is the biggest job that the GPU is typically built for. So, that’s one of the comparisons to make, of course.

Take, for example, a GPU. A great piece of hardware. But it's also kind of like buying a car for the sound system.”

What about the power side? How important is power consumption to either you yourself or your customers?

If you look at the energy crisis and how things are evolving, I’d say it is very, very important. The typical offer you’ll be getting from the data center is: we’re going to charge you 2x the electrical bill. And that’s never been something that’s been charged because they don’t even bother. Only now, we’re seeing the first invoices coming in where the electrical bill is part of it. In Germany, the energy price peaked in August at 0.7 euros per kilowatt hour.

Frankfurt, Germany, is one of the major exchanges that is extremely important. If you want performance streaming, you need to have something in Frankfurt. There’s another part of it as well, which is, of course, the environmental aspect of it. One thing is the bill that you’re getting. The other thing is the bill we’re leaving to our children.

It’s kind of contradictory because many of our clients make travel unnecessary. We have a Norwegian company that we’re working with that is doing remote inspections of ships. They were the first company in the world to do that. Instead of flying in an inspector, the ship owner, and two divers to the location, there’s only one operator of an underwater drone that is on the location. Everybody else is just connected. That’s obviously a good thing for the environment. But what are we doing?

Why did you decide to lead with AV1?

That’s a really good question. There are several reasons why we decided to lead with AV1. It is very compelling as soon as you can do it in real time. We had to wait for somebody to make it viable, which we found with NETINT’s ASIC.

Viable as in high quality, with latency and reliability that we could use, and also, of course, with throughput. We don’t have to buy too much hardware to get it working.

We’re seeing markers that our clients are going to want AV1. And there are several reasons why that is the case. One of which is, of course, it’s license free. If you’re a content owner, especially if you’re a content owner with a large crowd with many subscribers to your content, that’s a game-changer. Because the cost of licensing a codec can grow to become a significant part of your business expenses.

Look at what’s happening with FAST, free ad-supported television. There you’re trying to get even more viewers. And you have lower margins, so what you’re doing is creating eyeball minutes. And then, if you have codec and license costs, that’s a bit of an issue. It’s better if it’s free.

Is this what you’re hearing from your customers? Or is this what you’re assuming they’re thinking about?

That’s what we’re hearing from our customers, and that’s why we started implementing it.

For us, there’s also the bandwidth-to-quality aspect, which is great. I believe that it will explode in 2023. For example, if you look at what happened one month ago, Google made hardware decoding mandatory for Android 14 devices. That’s both phones and tablets. It opens so many possibilities.

We were not expecting to get business on it yet, but we are, and I’m happy about that. There are already clients reaching out because of the licensing aspect, as some of them are transmitting petabytes a month. If you can bring down the bandwidth while retaining the quality, that’s a good deal.

You mentioned before that your systems allow the user to dial in the latency and the quality. Could you explain how that works?

It’s important to make a distinction between the user and the broadcaster. Our client is the broadcaster that owns the content, and they can pick the latency.

Vindral’s live CDN doesn’t work on a ‘fetch your file’ basis. The way it works is we’re going to push the file to you, and you’re going to play it out. And this is how much you’re going to buffer. Once you have that setup, and, of course, a lot of sync algorithms and things like that at work, then the stream is not allowed to drift.

A typical use case is live auctions, for example. The typical setup for live auctions is 1080p, and you want below one second of latency because people are bidding. There are also people bidding in the actual auction house, so there’s the fairness aspect of it as well.

What we typically see is they configure maybe a 700-millisecond buffer, and it makes it possible. Even that small of a buffer makes such a huge difference. What we see in our metrics is that, basically, 99% of the viewers are getting the highest quality stream across all markets. That’s a huge deal.

How much does the quality drop off? What’s the lowest latency you support, and how much does the quality drop off at that latency as compared to one or two seconds?

I would say that the lowest that we would maybe recommend somebody to use our system for is 500 milliseconds. That would be about 250 milliseconds slower than a webRTC-based real-time solution. And why do I say that? It’s because other than that, I see no reason to use our approach. If you don’t want a buffer, you may as well use something else.

Actually, we don’t have that many clients trying that out because, for most of them, 500 milliseconds is the lowest anybody sets. And they’ve been like, ‘this is so quick, we don’t need anything more.’ And it retains 4K at that latency.

How does the pitch work against webRTC?
If I’m a potential customer of yours and you come in and talk about your system compared to webRTC, what are the pros and cons of each? It’s an interesting technological decision. I know that webRTC is going to be potentially lower latency, but it might only be one stream, may not come with captioning, and it’s not going to be ABR. It’s interesting to hear how you differentiate.

Let’s look from the perspective of when you should be using which. If you need to have a two-way voice conversation, you should use webRTC. There are actually studies that have been made proving that if you bring the latency up above 200 milliseconds, the conversation starts feeling awkward. If you have half a second, it is possible, but it’s not good. So, if that’s an ultimate requirement, then webRTC all day long.

Both technologies are actually very similar. The main difference I would point out is that we have added this buffer that the platform owner can set. So, the player’s instance is at that buffer level. WebRTC currently does not support that, and even if it did, we might implement that as an option. It might go that way at some point; today, it’s not there.

On the topic of differences, then. If 700 or 600 milliseconds of latency is good for you and quality is still important, then you should be using a buffer and using our solution. When you’re considering different vendors, the feature set, and what you’re actually getting in the package, there are huge differences. For some vendors, on their lower-tier products, ABR is not included. Things like that. Where the obvious thing is – you should be using ABR. Definitely.

You talked about the shortest. What’s the longest latency you see people dialing in?

We’ve actually had one use case in Hong Kong where they chose to set the latency at 3.7 seconds. That was because the television broadcast was at 3.7 seconds.

That’s the other thing. We talk a lot about latency. Latency is a hot topic, but honestly, many of our clients value synchronization even above latency. Not all clients, but some of them.

If you have a game show where you want to react to the chat and have some sort of interactivity… Maybe you have 1.5 seconds. That’s not a big issue if it’s at 1.5 seconds of latency. You will, naturally, get a little bit more stability since you’re increasing the buffer. Some of our clients have chosen to do that.

But around 3.5… That’s actually the only client we’ve had that has done that. But I think there could be more in the future. Especially in sports. If you have the satellite broadcast… It is at seven seconds of latency. We can match it to within hundreds of milliseconds.

Latency is a hot topic, but honestly, many of our clients value synchronization even above latency.”

And the advantage of higher latency is going to be stream stability and quality.
Do you know what the quality difference is going to be?

Definitely. However, as soon as you’re above even one second, the returns are diminishing. It’s not like it unlocks this whole universe of opportunities. In extreme markets, it might, but I would think that if you’re going above two seconds, you’re kind of done. There is no need to go higher. At least our clients have not found that need. The markets are basically from East Asia to South America and South Africa because we’ve expanded our CDN into those parts.

You’ve spoken a couple of times about where you install your equipment, and you’re talking about co-locating and things like that. What does your typical server look like? How many encoders are you putting in it? And what type of density are you expecting from that?

In general, it would be something like one server can do 10 times as many streams if you’re using the ASIC. Then if you’re using GPUs, like Nvidia, for example, it’s probably just the one. I’m not stating any numbers, because my IT guys are going to tell me that I was wrong.

What is the cost of low latency? If I decide to go to the smallest setting, what is that going to cost me? I guess there’s going to be a quality answer, and there’s going to be a stability answer… Is there a hard economic answer?

My hope is that there shouldn’t be a cost difference, depending on regions. The way we’ve chosen to operate is about the design paradigm of the product that you’ve created. We have competitors that are going with one partner. They’ve picked cloud vendor X, and they’re running everything in their cloud. And then what they can do is limited to the deal with that cloud vendor.

For example, we had an AV1 request from Greece. Huge egress for an internet TV channel that I was blown away by, and they mentioned their pricing. They wanted to save costs by cutting their traffic using AV1. What we did with that request is we went out to our partners and vendors and asked them: can you help us match this? And we did. From a business perspective, it might, in some cases, cost more. But there is also a perception of high cost that plagues the low-latency business, and that is because many of these companies have not considered their power consumption or their form factors.

Or being willing to make a CAPEX investment instead of just running in the cloud and paying as you go. Those are the things we’ve chosen to put the time into so that there will not be that big a difference.

Take, for example, Tata Communications, one of our biggest partners, and their pricing. They’re running our software stack in their environments to run their VDM, and it’s on a cost parity. So that’s something that should always be the aim. Then, I’m not going to say it’s always going to be like that, but that’s just a short version when you’re talking about the business implications.

We’re often getting requests where the potential client has this notion that it’s going to be a very high cost. Then they find that this makes sense, and we can build a business.

Are you seeing companies moving away from the cloud towards creating their own co-located servers with encoders and producing that way, as opposed to paying cents per minute to different cloud providers?

I would say I’m seeing the opposite. We’re doing both, just to be clear. I think the way to go is to do a hybrid.

For some clients, they’re going to be broadcasting 20 minutes a month. Cloud is awesome for that. You spin it up when you need it, and you kill it when it’s done. But that’s not always going to cut it. But if you’re asking me what motion I’m seeing in the market? There are more and more of these companies that are deploying across one cloud. And that’s where it resides. There are also types of offerings that you can instance yourself in third-party clouds, which is also an option. But again, it’s the design choice that it’s a cloud service that uses underlying cloud functions. It’s a shame that it’s not more of both. It creates an opportunity for us, though.

What are the big trends that you’re chasing for 2023 and beyond? What are you seeing? What forces are going to impact your business? The new features you’re going to be picking up? What are the big technology directions you’re seeing?

I mean, for us on our roadmap, we have been working hard on our partner strategy, and we’ve been seeing a higher demand for white-label solutions, which is what we’re working on with some partners.

We’ve done a few of those installs, and that’s where we are putting a lot of effort into it because we’re running our own CDN. But we can also enable others to do it, even as a managed service. You have these telcos that have maybe an edge or less offering since before, and they’re sitting on tons of equipment and fiber. So that’s one thing.

If we’re making predictions, there are two things worth a mention. I would expect the sports betting markets, especially in the US, to explode. That’s something we are definitely keeping our eyes on.

Maybe live shopping becomes a thing outside of China. Many of the big players, the big retailers, and even financial companies, are working on their own offerings and live shopping.


The dinosaurs’ agreement?

Have I told you about the dinosaurs’ agreement? It’s comparable to a gentleman’s agreement. This might be provocative to some. And I get that it’s complicated in many cases.

There is, among some of the bigger players and also among independent consultants that have different stakes, a sort of mutual agreement to keep asking the question – do we really need low latency? Or do we really need synchronization?


And while it's a valid question, it's also kind of a self-fulfilling prophecy. Because as long as the bigger brands are not creating the experience that the audience is waiting for them to create, nobody's going to have to move. So that is what I'm calling the dinosaurs here. They're holding on to the thing that they've always been doing. And they're optimizing that, but not moving on to the next generation. And the problem they're going to be facing, hopefully, is that when it reaches critical mass, the viewers are going to start expecting it, and that's when things might start changing.

There are many workflow considerations, of course. There are tech legacy considerations. There are cost considerations and different aspects when it comes to scaling. However, saying that you don’t need low latency is a bit of an excuse.

One thing is the bill that you're getting; the other is the bill we're leaving to our children.

NETINT Quadra vs. NVIDIA T4 – Benchmarking Hardware Encoding Performance

By Jan Ozer

This article is the second in a series about benchmarking hardware encoding performance. In the first article, available here, I delineated a procedure for testing hardware encoders. Specifically, I recommended this three-step procedure:

  1. Identify the most critical quality and throughput-related options for the encoder.
  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput to identify the operating point that delivers the optimum blend of quality and throughput for your application.
  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies.

After laying out this procedure, I applied it to the NETINT Quadra Video Processing Unit (VPU) to find the optimum operating point and the associated quality, cost per stream, and watts per stream. In this article, we perform the same analysis on the NVIDIA T4 GPU-based encoder.

About The NVIDIA T4

The NVIDIA T4 is powered by NVIDIA Turing Tensor Cores and draws 70 watts in operation. Pricing varies by reseller, with $2,299 around the median, which puts it somewhat higher than the $1,500 quoted for the NETINT Quadra T1 VPU in the previous article.

In creating the command line for the NVIDIA encodes, I checked multiple NVIDIA documents, including Video Benchmark Assumptions, the blog post Turing H.264 Video Encoding Speed and Quality, and Using FFmpeg with NVIDIA GPU Hardware Acceleration, which requires a login. I readily admit that I am not an expert on NVIDIA encoding, but the point of this exercise is not absolute quality so much as the range of quality and throughput that the hardware enables. You should check these documents yourself and create your own version of the optimized command string.

While there are many configuration options that impact quality and throughput, we focused our attention on two, lookahead and presets. As discussed in the previous article, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what is coming and can make more intelligent decisions. This improves encoding quality, particularly at and around scene changes, and it can improve bitrate efficiency. But lookahead adds latency equal to the lookahead duration, and it can decrease throughput.

Note that while the NVIDIA documentation recommends a lookahead buffer of twenty frames, I use 15 in my tests because, at 20, the hardware decoder kept crashing. I tested a 20-frame lookahead using software decoding, and the quality differential between 15 and 20 was inconsequential, so this shouldn’t impact the comparative results.
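For orientation, a command string along these lines exercises the two options under test via FFmpeg's NVENC encoder. This is a minimal sketch, not the exact string behind the published results: the filenames and bitrates are placeholders, and note that FFmpeg exposes the NVENC presets as p1 (fastest) through p7 (highest quality), so the preset numbering in the tables below may map differently.

    # Minimal sketch of an NVENC encode exercising the two tested options:
    # the preset and the rate-control lookahead. Filenames and bitrates are
    # placeholders, not the values used for the published results.
    ffmpeg -y -hwaccel cuda -i input_1080p30.mp4 \
      -c:v h264_nvenc -preset p4 -rc-lookahead 15 \
      -b:v 4M -maxrate 8M -bufsize 8M \
      output.mp4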

I also tested using various NVIDIA presets, which, like all encoding presets, trade off quality vs. throughput. To measure quality, I computed the VMAF harmonic mean and low-frame scores, the latter a measure of transient quality. For throughput, I tested the number of simultaneous 1080p30 files the hardware could process at 30 fps. I then divided the card's price and power draw by the stream count to determine cost per stream and watts per stream.
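To make that concrete, here is a rough sketch of the throughput test and the arithmetic, using the $2,299 price and 70-watt draw quoted above; the stream count N and the filenames are placeholders, not results from the tables below.

    # Rough sketch: run N simultaneous 1080p30 encodes and check that each
    # sustains roughly 30 fps in its log. N and filenames are placeholders.
    N=7
    for i in $(seq 1 $N); do
      ffmpeg -y -hwaccel cuda -i input_1080p30.mp4 \
        -c:v h264_nvenc -preset p4 -rc-lookahead 15 \
        -b:v 4M -f null - 2> encode_$i.log &
    done
    wait
    grep -oE 'fps= *[0-9.]+' encode_1.log | tail -1   # should report ~30 fps

    # Divide price and power draw by the sustained stream count.
    echo "scale=2; 2299 / $N" | bc   # cost per stream ($)
    echo "scale=2; 70 / $N" | bc     # watts per stream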

As you can see in Table 1, I tested with a lookahead value of 15 for selected presets 1-9, and then with a 0 lookahead for preset 9. Line two shows the closest x264 equivalent score for perspective.

As the operating point for comparing to the Quadra, I chose the lookahead 15/preset 4 configuration, which yielded twice the throughput of preset 2 with only a minor reduction in VMAF harmonic mean. We will consider low-frame scores in the final comparisons.

In general, the presets worked as they should, with higher quality and lower throughput at the left end, and the reverse at the right end, though LA15/P4 performance was an anomaly since it produced lower quality and higher throughput than LA15/P6. In addition, dropping the lookahead buffer did not produce the performance increase that we saw with Quadra, though it also did not produce a significant quality decrease.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 1
Table 1. H.264 options and results.

Table 2 shows the T4's HEVC results. Though quality was again near the medium x265 preset with several combinations, throughput was very modest at 3 or 4 streams at that quality level. For HEVC, LA15/P4 stands out as the optimal configuration, with at least four times the throughput of the combinations that delivered higher-quality output.

In terms of expected preset behavior, LA15/P4 was again quite the anomaly, producing the highest throughput in the test suite with slightly lower quality than LA15/P6, which should deliver lower quality. Again, switching from LA15 to LA0 produced neither the spike in throughput nor the drop in quality that we saw with the Quadra for both HEVC and H.264.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 2
Table 2. HEVC options and results.

Quadra vs. T4

Now that we have identified the operating points for the Quadra and the T4, let us compare quality, throughput, CAPEX, and OPEX. You can see the data for H.264 in Table 3.

Here, the stream count was the same, so the Quadra's advantages in cost per stream and watts per stream stem from its lower price and more efficient operation. At their respective operating points, the Quadra's VMAF harmonic mean quality was slightly higher, with a more significant advantage in the low-frame score, a predictor of transient quality problems.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 3
Table 3. Comparing Quadra and T4 at H.264 operating points.

Table 4 shows the same comparison for HEVC. Here, the Quadra output 75% more streams than the T4, which widens its cost per stream and watts per stream advantages. VMAF harmonic mean scores were again very similar, though the T4's low-frame score was substantially lower.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 4
Table 4. Comparing Quadra and T4 at HEVC operating points. 

Figure 5 illustrates the low frames and the low-frame differential between the two files. It is the result plot from the Moscow State University Video Quality Measurement Tool (VQMT), which displays the VMAF score, frame by frame, over the entire duration of the two video files analyzed, with the Quadra in red and the T4 in green. The top window shows the VMAF comparison for the two complete files, while the bottom window is a close-up of the highlighted region of the top window, right around the most significant downward spike at frame 1590.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Picture 1
Figure 5. The downward green spikes represent the low-frame scores in the T4 encode.

As you can see in the bottom window of Figure 5, the low-frame region extends for 2-3 frames, which might be borderline noticeable to a discerning viewer. Figure 6 shows a close-up of the lowest-quality frame, Quadra on the left, T4 on the right, and the dramatic difference in VMAF score, 87.95 versus 57, is certainly warranted by the visual difference. Not surprisingly, PSNR and SSIM measurements confirmed these low frames.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Picture 2
Figure 6. Quality comparisons, NETINT Quadra on the left, T4 on the right.

It is useful to track low frames because if they extend beyond 2-3 frames, they become noticeable to viewers and can degrade quality of experience. Mathematically, in a two-minute test file, the impact of even 10-15 terrible frames on the overall score is negligible. That is why it is always useful to visualize the metric scores with a tool like VQMT rather than simply relying on a single score.
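If you don't have VQMT handy, you can approximate the same check by logging per-frame scores with FFmpeg's libvmaf filter and scanning for dips. This is a rough sketch, assuming an FFmpeg build with libvmaf and the jq utility; the filenames and the threshold of 60 are placeholders.

    # Log per-frame VMAF to JSON (distorted file first, reference second),
    # then list any frames scoring below a placeholder threshold of 60.
    ffmpeg -i encoded.mp4 -i reference.mp4 \
      -lavfi libvmaf=log_fmt=json:log_path=vmaf.json -f null -
    jq '.frames[] | select(.metrics.vmaf < 60)
        | {frame: .frameNum, vmaf: .metrics.vmaf}' vmaf.json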

Summing Up

Overall, the most important takeaway from this and the previous article is the testing procedure itself. I am not an expert in encoding with NVIDIA hardware, and results from a single file, or even a limited number of files, can be idiosyncratic.

Do your own research, test your own files, and draw your own conclusions. As stated in the previous article, do not be impressed by quality scores without knowing the throughput, and expect that impressive throughput numbers may be accompanied by a significant drop in quality.

Whenever you test any hardware encoder, identify the most important quality/throughput configuration options, test over the relevant range, and choose the operating point that delivers the best combination of quality and throughput. This gives you the best chance of achieving a meaningful apples-to-apples comparison between hardware encoders, one that incorporates quality, cost per stream, and watts per stream.