NETINT Breaks Into the Streaming Media 100 List 2023

NETINT joins the prestigious Streaming Media 100 List for 2023. Recognized for their pioneering ASIC-based transcoders, celebrated for innovation in live streaming, cloud gaming, and surveillance.

NETINT is proud to be included in the Streaming Media list of the Top 100 Companies in the Streaming Media Universe, which “set themselves apart from the crowd with their innovative approach and their contribution to the expansion and maturation of the streaming media universe.”

The list is compiled by members of Streaming Media Magazine’s inner circle and “foregrounds the industry’s most innovative and influential technology suppliers, service providers, platforms, and media and content companies, as acclaimed by our editorial team. Some are large and established industry standard-bearers, while others are comparably small and relatively new arrivals that are just beginning to make a splash.”

Commenting on the Award, Alex Lui, NETINT CEO said, “Over the last twelve months, video engineers have increasingly recognized the unique value that ASIC-based transcoders deliver to the live streaming, cloud gaming, and surveillance markets, including the lowest cost and power consumption per stream, and the highest density. Our entire company appreciates that insiders at Streaming Media share this assessment.”

“Over the last twelve months, video engineers have increasingly recognized the unique value that ASIC-based transcoders deliver to the live streaming, cloud gaming, and surveillance markets, including the lowest cost and power consumption per stream, and the highest density. Our entire company appreciates that insiders at Streaming Media share this assessment.”

NETINT - Streaming Media 100 in 2023

To learn more about NETINT’s Video Processing Units, access our RESOURCES here or SCHEDULE CONSULTATION with NETINT’s Engineers. 

ON-DEMAND: Building Your Own Live Streaming Cloud

Understanding the Economics of Transcoding

Understanding the Economics of Transcoding

Whether your business model is FAST or subscription-based premium content, your success depends upon your ability to deliver a high-quality viewing experience while relentlessly reducing costs. Transcoding is one of the most expensive production-related costs and the ultimate determinant of video quality, so obviously plays a huge role on both sides of this equation. This article identifies the most relevant metrics for ascertaining the true cost of transcoding and then uses these metrics to compare the relative cost of the available methods for live transcoding.

Economics of Transcoding: Cost Metrics

There are two potential cost categories associated with transcoding: capital costs and operating costs. Capital costs arise when you buy your own transcoding gear, while operating costs apply when you operate this equipment or use a cloud provider. Let’s discuss each in turn.

Economics of Transcoding: CAPEX

The simplest way to compare transcoders is to normalize capital and operating costs using the cost per stream or cost per ladder, which simplifies comparing disparate systems with different costs and throughput. The cost per stream applies to services inputting and delivering a single stream, while the cost per ladder applies to services inputting a single stream and outputting an encoding ladder.

We’ll present real-world comparisons once we introduce the available transcoding options, but for the purposes of this discussion, consider the simple example in Table 1. The top line shows that System B costs twice as much as System A, while line 2 shows that it also offers 250% of the capacity of System A. On a cost-per-stream basis, System B is actually cheaper.

Understanding the Economics of Transcoding - table 1
TABLE 1: A simple cost-per-stream analysis.

The next few lines use this data to compute the number of required systems for each approach and the total CAPEX. Assuming that your service needs 640 simultaneous streams, the total CAPEX for System A dwarfs that of System B. Clearly, just because a particular system costs more than another doesn’t make it the more expensive option.

For the record, the throughput of a particular server is also referred to as density, and it obviously impacts OPEX charges. System B delivers over six times the streams from the same 1RU rack as System A, so is much more dense, which will directly impact both power consumption and storage charges.

Details Matter

Several factors complicate the otherwise simple analysis of cost per stream. First, you should analyze using the output codec or codecs, current and future. Many systems output H.264 quite competently but choke considerably with the much more complex HEVC codec. If AV1 may be in your future plans, you should prioritize a transcoder that outputs AV1 and compare cost per stream against all alternatives.

The second requirement is to use consistent output parameters. Some vendors quote throughput at 30 fps, some at 60 fps. Obviously, you need to use the same value for all transcoding options. As a rough rule of thumb, if a vendor quotes 60 fps, you can double the throughput for 30 fps, so a system that can output 8 1080p60 streams and likely output 16 1080p30 streams. Obviously, you should verify this before buying.

If a vendor quotes in streams and you’re outputting encoding ladders, it’s more complicated. Encoding ladders involve scaling to lower resolutions for the lower-quality rungs. If the transcoder performs scaling on-board, throughput should be greater than systems that scale using the host CPU, and you can deploy a less capable (and less expensive) host system.

The last consideration involves the concept of “operating point,” or the encoding parameters that you would likely use for your production, and the throughput and quality at those parameters. To explain, most transcoders include encoding options that trade off quality vs throughput much like presets do for x264 and x265. Choosing the optimal setting for your transcoding hardware is often a balance of throughput and bandwidth costs. That is, if a particular setting saves 10% bandwidth, it might make economic sense to encode using that setting even if it drops throughput by 10% and raises your capital cost accordingly. So, you’d want to compute your throughput numbers and cost per stream at that operating point.

In addition, many transcoders produce lower throughput when operating in low latency mode. If you’re transcoding for low-latency productions, you should ascertain whether the quoted figures in the spec sheets are for normal or low latency.

For these reasons, completing a thorough comparison requires a two-step analysis. Use spec sheet numbers to identify transcoders that you’d like to consider and acquire them for further testing. Once you have them in your labs you can identify the operating point for all candidates, test at these settings, and compare them accordingly.

Economics of Transcoding: OPEX - Power

Now, let’s look at OPEX, which has two components: power and storage costs. Table 2 continues our example, looking at power consumption.

Unfortunately, ascertaining power consumption may be complicated if you’re buying individual transcoders rather than a complete system. That’s because while transcoding manufacturers often list the power consumption utilized by their devices, you can only run these devices in a complete system. Within the system, power consumption will vary by the number of units configured in the system and the specific functions performed by the transcoder.

Note that the most significant contributor to overall system power consumption is the CPU. Referring back to the previous section, a transcoder that scales onboard will require lower CPU contribution than a system that scales using the host CPU, reducing overall CPU consumption. Along the same lines, a system without a hardware transcoder uses the CPU for all functions, maxing out CPU utilization likely consuming about the same energy as a system loaded with transcoders that collectively might consume 200 watts. 

Again, the only way to achieve a full apples-to-apples comparison is to configure the server as you would for production and measure power consumption directly. Fortunately, as you can see in Table 2, stream throughput is a major determinant of overall power consumption. Even if you assume that systems A and B both consume the same power, System B’s throughput makes it much cheaper to operate over a five year expected life, and much kinder to the environment.

Understanding the Economics of Transcoding - table 2
TABLE 2. Computing the watts per stream of the two systems.

Economics of Transcoding: Storage Costs

Once you purchase the systems, you’ll have to house them. While these costs are easiest to compute if you’re paying for a third-party co-location service, you’ll have to estimate costs even for in-house data centers. Table 3 continues the five year cost estimates for our two systems, and the denser system B proves much cheaper to house as well as power.

Understanding the Economics of Transcoding - table 3
TABLE 3: Computing the storage costs for the two systems.

Economics of Transcoding: Transcoding Options

These are the cost fundamentals, now let’s explore them within the context of different encoding architectures.

There are three general transcoding options: CPU-only, GPU, and ASIC-based. There are also FPGA-based solutions, though these will probably be supplanted by cheaper-to-manufacture ASIC-based devices over time. Briefly,

  • CPU-based transcoding, also called software-based transcoding, relies on the host central processing unit, or CPU, for all transcoding functions.
  • GPU-based transcoding refers to Graphic Processing Units, which are developed primarily for graphics-related functions but may also transcode video. These are added to the server in add-in PCIe cards.
  • ASICs are Application-Specific Integrated Circuits designed specifically for transcoding. These are added to the server as add-in PCIe cards or devices that conform to the U.2 form factor.

Economics of Transcoding: Real-World Comparison

NETINT manufactures ASIC-based transcoders and video processing units. Recently, we published a case study where a customer, Mayflower, rigorously and exhaustively compared these three alternatives, and we’ll share the results here.

By way of background, Mayflower’s use case needed to input 10,000 incoming simultaneous streams and distribute over a million outgoing simultaneous streams worldwide at a latency of one to two seconds. Mayflower hosts a worldwide service available 24/7/365.

Mayflower started with 80-core bare metal servers and tested CPU-based transcoding, then GPU-based transcoding, and then two generations of ASIC-based transcoding. Table 4 shows the net/net of their analysis, with NETINT’s Quadra T2 delivering the lowest cost per stream and the greatest density, which contributed to the lowest co-location and power costs.

RESULTS: COST AND POWER

Understanding the Economics of Transcoding - table 4
TABLE 4. A real-world comparison of the cost per stream and OPEX associated with different transcoding techniques.

As you can see, the T2 delivered an 85% reduction in CAPEX with ~90% reductions in OPEX as compared to CPU-based transcoding. CAPEX savings as compared to the NVIDIA T4 GPU was about 57%, with OPEX savings around ~70%.

Table 5 shows the five-year cost of the Mayflower T-2 based solution using the cost per KWH in Cyprus of $0.335. As you can see, the total is $2,225,241, a number we’ll return to in a moment.

Understanding the Economics of Transcoding - table 5
TABLE 5: Five-year cost of the Mayflower transcoding facility.

Just to close a loop, Tables 1, 2, and 3, compare the cost and performance of a Quadra Video Server equipped with ten Quadra T1U VPUs (Video Processing Units) with CPU-based transcoding on the same server platform. You can read more details on that comparison here.

Table 6 shows the total cost of both solutions. In terms of overall outlay, meeting the transcoding requirements with the Quadra-based System B costs 73% less than the CPU-based system. If that sounds like a significant savings, keep reading. 

TABLE 6: Total cost of the CPU-based System A and Quadra T2-based System B.

Economics of Transcoding: Cloud Comparison

If you’re transcoding in the cloud, all of your costs are OPEX. With AWS, you have two alternatives: producing your streams with Elemental MediaLive or renting EC3 instances and running your own transcoding farm. We considered the MediaLive approach here, and it appears economically unviable for 24/7/365 operation.

Using Mayflower’s numbers, the CPU-only approach required 500 80-core Intel servers running 24/7. The closest CPU in the Amazon ECU pricing calculator was the 64-core c6i.16xlarge, which, under the EC2 Instance Savings plan, with a 3-year commitment and no upfront payment, costs 1,125.84/month.

Understanding the Economics of Transcoding - figure 1
FIGURE 1. The annual cost of the Mayflower system if using AWS.

We used Amazon’s pricing calculator to roll these numbers out to 12 months and 500 simultaneous servers, and you see the annual result in Figure 1. Multiply this by five to get to the five-year cost of $33,775,056, which is 15 times the cost of the Quadra T2 solution, as shown in table 5.

We ran the same calculation on the 13 systems required for the Quadra Video Server analysis shown in Tables 1-3 which was powered by a 32-core AMD CPU. Assuming a c6a.8xlarge CPU with a 3-year commitment and no upfront payment,, this produced an annual charge of $79,042.95, or $395,214.6 for the five-year period, which is about 8 times more costly than the Quadra-based solution.

Understanding the Economics of Transcoding - figure 2
FIGURE 2: The annual cost of an AWS system per the example schema presented in tables 1-3.

Cloud services are an effective means for getting services up and running, but are vastly more expensive than building your own encoding infrastructure. Service providers looking to achieve or enhance profitability and competitiveness should strongly consider building their own transcoding systems. As we’ve shown, building a system based on ASICs will be the least expensive option.

In August, NETINT held a symposium on Building Your Own Live Streaming Cloud. The on-demand version is available for any video engineer seeking guidance on which encoder architecture to acquire, the available software options for transcoding, where to install and run your encoding servers, and progress made on minimizing power consumption and your carbon footprint.

ON-DEMAND: Building Your Own Live Streaming Cloud

Choosing Transcoding Hardware: Deciphering the Superiority of ASIC-based Technology

Which technology reigns supreme in transcoding: CPU-only, GPU, or ASIC-based? Kenneth Robinson’s incisive analysis from the recent symposium makes a compelling case for ASIC-based transcoding hardware, particularly NETINT’s Quadra. Robinson’s metrics prioritized viewer experience, power efficiency, and cost. While CPU-only systems appear initially economical, they falter with advanced codecs like HEVC. NVIDIA’s GPU transcoding offers more promise, but the Quadra system still outclasses both in quality, cost per stream, and power consumption. Furthermore, Quadra’s adaptability allows a seamless switch between H.264 and HEVC without incurring additional costs. Independent assessments, such as Ilya Mikhaelis‘, echo Robinson’s conclusions, cementing ASIC-based transcoding hardware as the optimal choice.

Choosing transcoding hardware

During the recent symposium, Kenneth Robinson, NETINT’s manager of Field Application Engineering, compared three transcoding technologies: CPU-only, GPU, and ASIC-based transcoding hardware. His analysis, which incorporated quality, throughput, and power consumption, is useful as a template for testing methodology and for the results. You can watch his presentation here and download a copy of his presentation materials here.

Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.
Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.

As a preview of his findings, Kenneth found that when producing H.264, ASIC-based hardware transcoding delivered CAPEX savings of 86% and 77% compared to CPU and GPU-based transcoding, respectively. OPEX savings were 95% vs. CPU-only transcoding and 88% compared to GPU.

For the more computationally complex HEVC codec, the savings were even greater. As compared to CPU-based transcoding, ASICs saved 94% on CAPEX and 98% on OPEX. As compared to GPU-based transcoding, ASICs saved 82% on CAPEX and 90% on OPEX. These savings are obviously profound and can make the difference between a successful and profitable service and one that’s mired in red ink.

Let’s jump into Kenneth’s analysis.

Determining Factors

Digging into the transcoding alternatives, Kenneth described the three options. First are CPUs from manufacturers like AMD or Intel. Second are GPUs from companies like NVIDIA or AMD. Third are ASICs, or Application Specific Integrated Circuits, from manufacturers like NETINT. Kenneth noted that NETINT calls its Quadra devices Video Processing Units (VPU), rather than transcoders because they perform multiple additional functions besides transcoding, including onboard scaling, overlay, and AI processing.

He then outlined the factors used to determine the optimal choice, detailing the four factors shown in Figure 2. Quality is the average quality as assessed using metrics like VMAF, PSNR, or subjective video quality evaluations involving A/B comparisons with viewers. Kenneth used VMAF for this comparison. VMAF has been shown to have the highest correlation with subjective scores, which makes it a good predictor of viewer quality of experience.

Choosing transcoding hardware - Determining Factors
Figure 2. How Kenneth compared the technologies.

Low-frame quality is the lowest VMAF score on any frame in the file. This is a predictor for transient quality issues that might only impact a short segment of the file. While these might not significantly impact overall average quality, short, low-quality regions may nonetheless degrade the viewer’s quality of experience, so are worth tracking in addition to average quality.

Server capacity measures how many streams each configuration can output, which is also referred to as throughput. Dividing server cost by the number of output streams produces the cost per stream, which is the most relevant capital cost comparison. The higher the number of output streams, the lower the cost per stream and the lower the necessary capital expenditures (CAPEX) when launching the service or sourcing additional capacity.

Power consumption measures the power draw of a server during operation. Dividing this by the number of streams produced results in the power per stream, the most useful figure for comparing different technologies.

Detailing his test procedures, Kenneth noted that he tested CPU-only transcoding on a system equipped with an AMD Epic 32-core CPU. Then he installed the NVIDIA L4 GPU (a recent release) for GPU testing and NETINT’s Quadra T1U U.2 form factor VPU for ASIC-based testing.

He evaluated two codecs, H.264 and HEVC, using a single file, the Meridian file from Netflix, which contains a mix of low and high-motion scenes and many challenging elements like bright lights, smoke and fog, and very dark regions. If you’re testing for your own deployments, Kenneth recommended testing with your own test footage.

Kenneth used FFmpeg to run all transcodes, testing CPU-only quality using the x264 and x265 codecs using the medium and very fast presets. He used FFmpeg for NVIDIA and NETINT testing as well, transcoding with the native H.264 and H.265 codec for each device.

H.264 Average, Low-Frame, and Rolling Frame Quality

The first result Kenneth presented was average H.264 quality. As shown in Figure 3, Kenneth encoded the Meridian file to four output files for each technology, with encodes at 2.2 Mbps, 3.0 Mbps, 3.9 Mbps, and 4.75 Mbps. In this “rate-distortion curve” display, the left axis is VMAF quality, and the bottom axis is bitrate. In all such displays, higher results are better, and Quadra’s blue line is the best alternative at all tested bitrates, beating NVIDIA and x264 using the medium and very fast presets.

Figure 3. Quadra was tops in H.264 quality at all tested bitrates.
Figure 3. Quadra was tops in H.264 quality at all tested bitrates.

Kenneth next shared the low-frame scores (Figure 4), noting that while the NVIDIA L4’s score was marginally higher than the Quadra’s, the difference at the higher end was only 1%. Since no viewer would notice this differential, this indicates operational parity in this measure.

Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.
Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.

The final H.264 quality finding displayed a 20-second rolling average of the VMAF score. As you can see in Figure 5, the Quadra, which is the blue line, is consistently higher than the NVIDIA L4 or medium or very fast. So, even though the Quadra had a lower single-frame VMAF score compared to NVIDIA, over the course of the entire file, the quality was predominantly superior.

Figure 5. 20-second rolling frame quality over file duration.
Figure 5. 20-second rolling frame quality over file duration.

HEVC Average, Low-Frame, and Rolling Frame Quality

Kenneth then related the same results for HEVC. In terms of average quality (Figure 6), NVIDIA was slightly higher than the Quadra, but the delta was insignificant. Specifically, NVIDIA’s advantage starts at 0.2% and drops to 0.04% at the higher bit rates. So, again, a difference that no viewer would notice. Both NVIDIA and Quadra produced better quality than CPU-only transcoding with x265 and the medium and very fast presets.

Figure 6. Quadra was tops in H.264 quality at all tested bitrates.
Figure 6. Quadra was tops in H.264 quality at all tested bitrates.

In the low-frame measure (Figure 7), Quadra proved consistently superior, with NVIDIA significantly lower, again a predictor for transient quality issues. In this measure, Quadra also consistently outperformed x265 using medium and very fast presets, which is impressive.

Figure 7. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.
Figure 7. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.

Finally, HEVC moving average scoring (Figure 8) again showed Quadra to be consistently better across all frames when compared to the other alternatives. You see NVIDIA’s downward spike around frame 3796, which could indicate a transient quality drop that could impact the viewer’s quality of experience.

Figure 8. 20-second rolling frame quality over file duration.
Figure 8. 20-second rolling frame quality over file duration.

Cost Per Stream and Power Consumption Per Stream - H.264

To measure cost and power consumption per stream, Kenneth first calculated the cost for a single server for each transcoding technology and then measured throughput and power consumption for that server using each technology. Then, he compared the results, assuming that a video engineer had to source and run systems capable of transcoding 320 1080p30 streams.

You see the first step for H.264 in Figure 9. The baseline computer without add-in cards costs $7,100 but can only output fifteen 1080p30 streams using an average of the medium and veryfast presets, resulting in a cost per stream was $473. Kenneth installed two NVIDIA L4 cards in the same system, which boosted the price to $14,214, but more than tripled throughput to fifty streams, dropping cost per stream to $285. Kenneth installed ten Quadra T1U VPUs in the system, which increased the price to $21,000, but skyrocketed throughput to 320 1080p30 streams, and a $65 cost per stream.

This analysis reveals why computing and focusing on the cost per stream is so important; though the Quadra system costs roughly three times the CPU-only system, the ASIC-fueled output is over 21 times greater, producing a much lower cost per stream. You’ll see how that impacts CAPEX for our 320-stream required output in a few slides.

Figure 9. Computing system cost and cost per stream.
Figure 9. Computing system cost and cost per stream.

Figure 10 shows the power consumption per stream computation. Kenneth measured power consumption during processing and divided that by the number of output streams produced. This analysis again illustrates why normalizing power consumption on a per-stream basis is so necessary; though the CPU-only system draws the least power, making it appear to be the most efficient, on a per-stream basis, it’s almost 20x the power draw of the Quadra system.

Figure 10. Computing power per stream for H.264 transcoding.
Figure 10. Computing power per stream for H.264 transcoding.

Figure 11 summarizes CAPEX and OPEX for a 320-channel system. Note that Kenneth rounded down rather than up to compute the total number of servers for CPU-only and NVIDIA. That is, at a capacity of 15 streams for CPU-only transcoding, you would need 21.33 servers to produce 320 streams. Since you can’t buy a fractional server, you would need 22, not the 21 shown. Ditto for NVIDIA and the six servers, which, at 50 output streams each, should have been 6.4, or actually 7. So, the savings shown are underrepresented by about 4.5% for CPU-only and 15% for NVIDIA. Even without the corrections, the CAPEX and OPEX differences are quite substantial.

Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.
Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.

Cost Per Stream and Power Consumption Per Stream - HEVC

Kenneth performed the same analysis for HEVC. All systems cost the same, but throughput of the CPU-only and NVIDIA-equipped systems both drop significantly, boosting their costs per stream. The ASIC-powered Quadra outputs the same stream count for HEVC as for H.264, producing an identical cost per stream.

Figure 12. Computing system cost and cost per stream.
Figure 12. Computing system cost and cost per stream.

The throughput drop for CPU-only and NVIDIA transcoding also boosted the power consumption per stream, while Quadra’s remained the same.

Figure 13. Computing power per stream for H.264 transcoding.
Figure 13. Computing power per stream for H.264 transcoding.

Figure 14 shows the total CAPEX and OPEX for the 320-channel system, and this time, all calculations are correct. While CPU-only systems are tenuous–at best– for H.264, they’re clearly economically untenable with more advanced codecs like HEVC. While the differential isn’t quite so stark with the NVIDIA products, Quadra’s superior quality and much lower CAPEX and OPEX are compelling reasons to adopt the ASIC-based solution.

Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.
Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.

As Kenneth pointed out in his talk, even if you’re producing only H.264 today, if you’re considering HEVC in the future, it still makes sense to choose a Quadra-equipped system because you can switch over to HEVC with no extra hardware cost at any time. With a CPU-only system, you’ll have to more than double your CAPEX spending, while with NVIDIA,  you’ll need to spend another 25% to meet capacity.

The Cost of Redundancy

Kenneth concluded his talk with a discussion of full hardware and geo-redundancy. He envisioned a setup where one location houses two servers (a primary and a backup) for full hardware redundancy. A similar setup would be replicated in a second location for geo-redundancy. Using the Quadra video server, four servers could provide both levels of redundancy, costing a total of $84,000. Obviously, this is much cheaper than any of the other transcoding alternatives.

NETINT’s Quadra VPU proved slightly superior in quality to the alternatives, vastly cheaper than CPU-only transcoding, and very meaningfully more affordable than GPU-based transcoders. While these conclusions may seem unsurprising – an employee at an encoding ASIC manufacturer concludes that his ASIC-based technology is best — you can check Ilya Mikhaelis’ independent analysis here and see that he reached the same result.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

From CPU to GPU to ASIC: Mayflower’s Transcoding Journey

Ilya’s transcoding journey took him from $10 million to under $1.5 million CAPEX while cutting power consumption by over 90%. This analytical deep-dive reveals the trials, errors, and successes of Mayflower’s quest, highlighting a remarkable reduction in both cost and power consumption.

From CPU to GPU to ASIC: The Transcoding Journey

Ilya Mikhaelis

Ilya Mikhaelis is the streaming backend tech lead for Mayflower, which builds and hosts streaming infrastructures for multiple publishers. Mayflower’s infrastructure handles over 10,000 incoming streams and over one million plus outgoing streams at a latency that averages one to two seconds.

Ilya’s challenge was to find the most cost-effective technology to transcode the incoming streams. His journey took him from CPU-based transcoding to GPU and then two generations of ASIC-based transcoding. These transitions slashed total production transcoding costs from $10 million dollars to just under $1.5 million dollars while reducing power consumption by over 90%, from 325,000 watts to 33,820 watts.

Ilya’s rigorous textbook-worthy testing methodology and findings are invaluable to any video engineer seeking the highest quality transcoding technology at the lowest capital cost and most efficient power usage. But let’s start at the beginning.

The Mayflower Internal CDN

As Ilya describes it, “Mayflower is a big company, under which different projects stand. And most of these projects are about high-load, live media streaming. Moreover some of Mayflower resources were included  in the top 50 of the most visited sites worldwide. And all these streaming resources are handled by one internal CDN, which was completely designed and implemented by my team.”

Describing the requirements, Ilya added, “The typical load of this CDN is about 10,000 incoming simultaneous streams and more than one million outgoing simultaneous streams worldwide. In most cases, we target a latency of one to two seconds. We try to achieve a real-time experience for our content consumers, which is why we need a fast and effective transcoding solution.”

To build the CDN, Mayflower used bare metal servers to maximize network and resource utilization and run a high-performance profile to achieve stable stream processing and keep encoder and decoder queues around zero. As shown in Figure 1, the CDN inputs streams via WebRTC and RTMP and delivers with a mix of WebRTC, HLS, and low latency HLS. It uses customized WebRTC inside the CDN to achieve minimum latency between servers.

Figure 1. Mayflower’s Low Latency CDN
Figure 1. Mayflower’s Low Latency CDN .

Ilya’s team minimizes resource wastage by implementing all high-level network protocols, like WebRTC, HLS, and low latency HLS, on their own. They use libav, an FFmpeg component, as a framework for transcoding inside their transcoder servers.

The Transcoding Pipeline

In Mayflower’s transcoding pipeline (Figure 2), the system inputs a single WebRTC stream, which it converts to a five-rung encoding ladder. Mayflower uses a mixture of proprietary and libav filters to achieve a stable frame rate and stable load. The stable frame rate is essential for outgoing streams because some protocols, like low latency HLS or HLS, can’t handle variable frame rates, especially on Apple devices.

Figure 2. Mayflower’s Low Latency CDN.
Figure 2. Mayflower’s Low Latency CDN.

CPU-Only Transcoding - Too Expensive, Too Much Power

After creating the architecture, Ilya had to find a transcoding technology as quickly as possible. Mayflower initially transcoded on a Dell R940, which currently costs around $20,000 as configured for Mayflower. When Ilya’s team first implemented software transcoding, most content creators input at 720p. After a few months, as they became more familiar with the production operation, most switched to 1080p, dramatically increasing the transcoding load.

You see the numbers in Figure 3. Each server could produce only 20 streams, which at a server cost of $20,000 meant a per stream cost of $1,000. At this capacity, scaling up to handle the 10,000 incoming streams would require 500 servers at a total cost of $10,000,000.

Total power consumption would equal 500 x 650, or 325,000 watts. The Dell R940 is a 3RU server; at an estimated monthly cost of $125 for colocation, this would add $750,000 per year. 

Figure 3. CPU-only transcoding was very costly and consumed excessive power.
Figure 3. CPU-only transcoding was very costly and consumed excessive power.

These numbers caused Ilya to pause and reassess. “After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency. So, we started researching and found some articles on companies like Wowza, Xilinx, Google, Twitch, YouTube, and so on. And the first hint was GPU. And when you think GPU, you think NVIDIA, a company all streaming engineers are aware of.”

“After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency.”

GPUs - Better, But Still Too Expensive

Ilya initially considered three NVIDIA products: the Tesla V100, Tesla P100, and Tesla T4. The first two, he concluded, were best for machine learning, leaving the T4 as the most relevant option. Mayflower could install six T4s into each existing Dell server. At a current cost of around $2,000 for each T4, this produced a total cost of $32,000 per server.

Under capacity testing, the T4-enabled system produced 96 streams, dropping the per-stream cost to $333. This also reduced the required number of servers to 105, and the total CAPEX cost to $3,360,000.

With the T4s installed, power consumption increased to 1,070 watts for a total of 112,350 watts. At $125 per month per server, the 105 servers would cost $157,500 annually to house in a colocation facility.

Figure 4. Capacity and costs for an NVIDIA T4-based solution.
Figure 4. Capacity and costs for an NVIDIA T4-based solution.

Round 1 ASICs: The NETINT T432

The NVIDIA numbers were better, but as Ilya commented, “It looked like we found a possible candidate, but we had a strong sense that we needed to further our research. We decided to continue our journey and found some articles about a company named NETINT and their ASIC-based solutions.”

Mayflower first ordered and tested the T432 video transcoder, which contains four NETINT G4 ASICs in a single PCIe card. As detailed by Ilya, “We received the T432 cards, and the results were quite exciting because we produced about 25 streams per card. Power consumption was much lower than NVIDIA, only 27 watts per card, and the cards were cheaper. The whole server produced 150 streams in full HD quality, with a power consumption of 812 watts. For the whole production, we would pay about 2 million, which is much cheaper than NVIDIA solution.”

You see all this data in Figure 5. The total number of T432-powered servers drops to 67, which reduces total power to 54,404 watts and annual colocation to $100,500.

Figure 5. Capacity and costs for the NETINT T432 solution.
Figure 5. Capacity and costs for the NETINT T432 solution.

While costs and power consumption kept improving, Ilya noticed that the CDN’s internal queue started increasing when processing with T432-equipped systems. Initially, Ilya thought the problem was the lack of onboard scaling on the T432, but then he noticed that “even when producing all these ABR ladders, our CPU load was about only 40% during high load hours. The bottleneck was the card’s decoding and encoding capacity, not onboard scaling.”

Finally, he pinpointed the increase in the internal queue to the fact that the T432’s decoder couldn’t maintain 4K60 fps decode for H.264 input. This was unacceptable because it increased stream latency. Ilya went searching one last time; fortunately, the solution was close at hand.

Round 2 ASICs: The NETINT Quadra T2 - The Transcoding Monster

Ilya next started testing with the NETINT Quadra T2 video processing unit, or VPU, which contains two NETINT G5 chips in a PCIe card. As with the other cards, Ilya could install six in each Dell server.

“All those disadvantages were eliminated in the new NETINT card – Quadra…We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Ilya’s team liked what they found. “All those disadvantages were eliminated in the new NETINT card – Quadra. It has a hardware scaler inside with an optimized pipeline: decoder – scaler – encoder in the same VPU. And H264 4K60 decoding is not a problem for it. We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Figure 6 shows the performance and cost numbers. Equipped with the six T2 VPUs, each server could output 270 streams, reducing the number of required servers from 500 for CPU-only to a mere 38. This dropped the per stream cost to $141, less than half of the NVIDIA T4 equipped system, and cut the total CAPEX down to $1,444,000. Total power consumption dropped to 33,820 watts, and annual colocation costs for the 38 3U servers were $57,000.

Figure 6. Capacity and costs for the NETINT Quadra T2 solution.
Figure 6. Capacity and costs for the NETINT Quadra T2 solution.

Cost and Power Summary

Figure 7 presents a summary of costs and power consumption, and the numbers speak for themselves. In Ilya’s words, “It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Figure 7. Summary of costs and power consumption.
Figure 5. Capacity and costs for the NETINT T432 solution.

“It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Ilya also commented on the suitability of the Dell R940 system. “I want to emphasize that the DELL R940 isn’t the best server for VPU and GPU transcoders. It has a small density of PCIe slots and, as a result, a small density of VPU/GPU. Moreover, in the case of  Quadra and even T432, you don’t need such powerful CPUs.”

In terms of other servers to consider, Ilya stated, “Nowadays, you may find platforms on the market with even 16 PCIe slots. In such systems, especially if you use Quadra, you don’t need powerful CPUs inside because everything is done on the VPU. But for us, it was a legacy with which we needed to live.”

Video engineers seeking the optimal transcoding solution can take a lot from Ilya’s transcoding journey: a willingness to test a range of potential solutions, a rigorous focus on cost and power consumption per stream, and extreme attention to detail. At NETINT, we’re confident that this approach will lead you to precisely the same conclusion as Ilya, that the Quadra T2 is “the best transcoding solution on the market today.”

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

From Cloud to Control. Building Your Own Live Streaming Platform

Cloud services are an effective way to begin live streaming. Still, once you reach a particular scale, it’s common to realize that you’re paying too much and can save significant OPEX by deploying transcoding infrastructure yourself. The question is, how to get started?

NETINT’s Build Your Own Live Streaming Platform symposium gathers insights from the brightest engineers and game-changers in the live-video processing industry on how to build and deploy a live-streaming platform.

In just three hours, we’ll cover the following:

  • Hardware options for live transcoding and encoding to cut costs by as much as 80%.
  • Software options for producing, delivering, and playing your live video streams.
  • Co-location selection criteria to achieve cloud-like performance with on-premise affordability.

You’ll also hear from two engineers who will demystify the process of assembling a live-streaming facility, how they identified and solved key hurdles, along with real costs and performance data.

Cloud? Or your own hardware?

It’s clear to many that producing live streams via a public cloud like AWS can be vastly more expensive than owning your hardware. (You can learn more by reading “Cloud or On-Premises? The Streaming Dilemma” and “How to Slash CAPEX, OPEX, and Carbon Emissions Using the NETINT T408 Video Transcoder”). 

To quote serial entrepreneur David Hansson, who recently migrated two SaaS services from the cloud to on-premise, “Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.” 

For those who have only operated in the cloud, there’s fear of the unknown. Fear buying hardware transcoders, selecting the right software, and choosing the best colocation service. So, we decided to fight fear with education and host a symposium to educate streaming engineers on all these topics.  

“Building Your Own Live Streaming Cloud” will uncover how owning your encoding stack can slash operating costs and boost performance with minimal CAPEX.

Learn to select the optimal transcoding hardware, transcoding and packaging software, and colocation facilities. We’ll also discuss strategies to reduce carbon emissions from your transcoding engine. 

This FREE virtual event takes place on August 17th, from 11:00 AM – 2:15 PM EST.

Five issues tackled by nine experts:

Transcoding Hardware Options:

Learn the pros and cons of CPU, GPU, and ASIC-based transcoding via detailed throughput and cost examples shared by Kenneth Robinson, Manager of Field Application Engineers at NETINT Technologies. Then Ilya Mikhaelis, Streaming Backend Tech Lead at Mayflower, will describe his company’s journey from CPU to GPU to ASICs, covering costs, power consumption, latency, and density metrics.

Software Options:

Jan Ozer from NETINT will identify the three categories of transcoding software: multimedia frameworks, media servers, and other tools. Then, you’ll hear from experts in each category, starting with Romain Bouqueau, founder of Motion Spell, who will discuss the capabilities of the GPAC multimedia framework. Barry Owen, Chief Solutions Architect at Wowza, will discuss Wowza Streaming Engine’s suitability for private clouds. Lastly, Adrian Roe, Director at Id3as, developer of Norsk, will demonstrate Norsk’s simple, scripting-based operation, and extensive production and transcoding features.

Housing Options:

Once you select your hardware and software, the next step is finding the right co-location facility to house your live streaming infrastructure. Kyle Faber, with experience in building Edgio’s video streaming infrastructure, will guide you through the essential factors to consider when choosing a co-location facility.

Minimizing the Environmental Impact:

As responsible streaming professionals, it’s essential to address the environmental impact of our operations. Barbara Lange, Secretariat of Greening of Streaming, will outline actionable steps video engineers can take to minimize power consumption when acquiring and deploying transcoding servers.

Pulling it All Together:

Stef van der Ziel, founder of live-streaming pioneer Jet-Stream, will share lessons learned from his experience in creating both Jet-Stream’s private cloud and cloud transcoding solutions for customers. In his closing talk, Stef will demystify the process of choosing hardware, software, and a hosting facility, bringing all the previous discussions together into a cohesive plan.

Full Agenda:

11:00 am. – 11:10 am EST

Introduction (10 minutes):
Mark Donnigan, Head of Strategic Marketing at NETINT Technologies
Welcome, overview, and what you will learn.

 

11:10 am. – 11:40 am EST

Choosing transcoding hardware (30 minutes):
Kenneth Robinson, Manager of Field Application Engineers at NETINT Technologies
You have three basic approaches to transcoding, CPU-only, GPU, and ASICs. Kenneth outlines the pros and cons of each approach with extensive throughput and CAPEX and OPEX examples for each.

 

11:40 am. – 12:00 pm EST

From CPU to GPU to ASIC: Our Transcoding Journey (20 minutes):
Ilya Mikhaelis, Streaming Backend Tech Lead at Mayflower
Charged with supporting very high-volume live transcoding operations, Ilya started with libx264 software transcoding, which consumed massive power but yielded low stream density per server. Then he experimented with GPUs and other hardware and ultimately transitioned to an ASIC-based solution with much lower power consumption and much higher stream density per server. Ilya will detail the costs, power consumption, and density of all options, providing both data and an invaluable evaluation framework.

 

12:00 pm. – 12:10 pm EST

Choosing your live production software (10 minutes): 
Jan Ozer, Senior Director of Video Technology at NETINT Technologies
The core of every live streaming system is transcoding and packaging software. This comes in many shapes and sizes, from open-source software like FFmpeg and GPAC, to streaming servers like Wowza, and production systems like Norsk. Jan discusses these multiple options so you can cohesively and affordably build your own live-streaming ecosystem.

 

12:10 pm. – 1:10 pm EST

Speed Round (60 minutes):
20-minute presentations from GPAC, Wowza, and NORSK.
Speakers from GPAC, Wowza, and NORSK discussing the features, functions, operational paradigms, and cost structure of their live software offering.

Speakers include:

  • Adrian Roe, CEO at id3as, Product: Norsk, Title: Make Live Easy with NORSK SDK
  • Romain Bouqueau, Founder and CEO, Motion Spell (home for GPAC Licensing), Product: GPAC Title of Talk: Deploying GPAC for Transcoding and Packaging
  • Barry Owen, Chief Solutions Architect at Wowza, Title of Talk: Start Streaming in Minutes with Wowza Streaming Engine



1:10 pm. – 1:40 pm EST

Choosing a co-location facility (30 minutes): 
Kyle Faber, Senior Director of Product Management at Edgio.
Once you’ve chosen your hardware and software, you need a place to install them. If you don’t have your own connected data center, you may consider a colocation facility. In his talk, Kyle addresses the key factors to consider when choosing a co-location facility for your live streaming infrastructure.

 

1:40 pm. – 1:55 pm EST

How to Greenify Your Encoding Stack (15 minutes):
Barbara Lange, Secretariat of Greening of Streaming.
Learn how video streaming companies can work to significantly reduce their energy footprint and contribute to a greener streaming industry. Implement hardware and infrastructure optimization using immersion cooling and data center design improvements to maximize energy efficiency in your streaming infrastructure.

 

1:55 pm. – 2:15 pm EST

Closing Keynote (20 minutes):
Stef van der Ziel, Founder Jet-Stream
Jet-stream has delivered streaming solutions since its launch in 1994 and offers its own live streaming platform. One focus has been creating custom transcoding solutions for customers seeking to create their own private cloud for various applications. In his closing talk, Stef will demystify the process of choosing hardware, software, and a hosting facility and wrap a pretty bow around all previous presentations.

Unveiling the Quadra Server: The Epitome of Power and Scalability

The Quadra Server review by Jan Ozer from NETINT Technologies

Streaming engineers face constant pressure to produce more streams at a lower cost per stream and reduced power consumption. However, those considering new transcoding technologies need a solution that integrates with their existing workflows while delivering the quality and flexibility of software with the cost efficiency of ASIC-based hardware.

If this sounds like you, the US $21,000 NETINT Quadra Video Server could be the ideal solution. Combining the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered server hosting ten NETINT Quadra T1U Video Processing Units (VPUs), is a power house. The Quadra server outputs H.264, HEVC, and AV1 streams at normal or low latency, and you can control operation via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional FFmpeg-based software or GPU-based encoding stack.

As you’ll see below, the 1RU form factor server can output up to 20 8Kp30 streams, 80 4Kp30 streams, up to 320 1080p30 streams, and 640 720p30 streams for live and interactive video streaming applications. For ABR production, the server can output over 120 encoding ladders in H.264, HEVC, and AV1 formats. This unparalleled density enables all video engineers to greatly expand capacity while shrinking the number of required servers and the associated power bills.

I’ll start this review with a technical description of the server and transcoding hardware. Then we’ll review some performance results for one-to-one streaming and H.264, HEVC, and AV1 ladder generation and finish with a look at the Quadra server’s AI-based features and output.

The Quadra Server - Quadra video processing server powered by 10 Quadras, ASIC-based VPUs from NETINT
Figure 1. The Quadra Video Server powered by the Codensity G5 ASIC.

Hardware Specs - The Quadra Server

The NETINT Quadra Video Server uses the Supermicro 1114S-WN10RT server platform with a 32-core AMD EPYC 7543P CPU running Ubuntu 20.04.05 LTS. The Quadra server ships with 128 GB of DDR4-3200 RAM and a 400GB M.2 SSD drive with 3x PCIe slots and ten NVME slots that house the Quadra T1U VPUs. NETINT also offers the Quadra server with two other CPUs, the 64-core AMD EPYC 7713P processor ($24,000) for more demanding applications and the economical 8-core AMD EPYC 7232P processor ($19,000) for pure transcoding applications that may not require a 32-core CPU.

Supermicro* is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro* adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards, and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient server to house the NETINT Quadra VPUs.

*We are the leading server and storage vendor that designs, develops, and manufactures the majority of our development in the United States – at our headquarters in San Jose, Calif. Our Quality Management System is certified to ISO 9001:2015 and ISO 13485:2016 standards and our Environmental Management System is certified to ISO 14001:2015 standard. In addition to that, the Supermicro Information Security Managemen

SOURCE: https://www.supermicro.com/en/about

Hardware Specs – Quadra VPUs

The Quadra T1U VPUs are powered by the NETINT Codensity G5 ASIC and packaged in a U.2 form factor that plugs into the NVMe slots in the server and communicate via the ultra-high bandwidth PCIe 4.0 bus. Quadra VPUs can decode H.264, HEVC, and VP9 inputs and encode into the H.264, HEVC, and AV1 standards.

Beyond transcoding, Quadra VPUs house 2D processing engines that can crop, pad, and scale video, and perform video overlay, YUV and RGB conversion, reducing the load on the host CPU to increase overall throughput. These engines can perform xStack operations in hardware, making the Quadra server ideal for conferencing and security applications that combine multiple feeds into a multi-pane output mosaic window.

Each Quadra T1U in the Quadra server includes a 15 TOPS Deep Neural Network Inference Engine that can support models trained with all major deep learning frameworks, including Caffe, TensorFlow, TensorFlow Lite, Keras, Darknet, PyTorch, and ONNX. NETINT supplies several reference models, including a facial detection model that uses region of interest encoding to improve facial quality on security and other highly compressed streams. Another model provides background removal for conferencing applications.

Operational Overview

We tested the Quadra server with FFmpeg and GStreamer. Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the Quadra NVME interface and the FFmpeg/GStreamer software layers. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

Figure 2 - The Quadra Server - software architecture for controlling the Quadra Server
Figure 2. The software architecture for controlling the server.

To allocate jobs to the ten Quadra T1U VPUs, the Quadra device driver software includes a resource management module that tracks Quadra capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available VPUs.

Alternatively, you can manually assign decoding and encoding tasks to different Quadra VPUs in the command line or application and even control which streams are decoded by the host CPU or a Quadra. With these and similar controls, you can most efficiently balance the overall transcoding load between the Quadra and host CPU and maximize throughput. We used auto distribution for all tests.

We tested running FFmpeg v 5.2.3 and GStreamer version 1.18 (with FFmpeg v 4.3.1), and with Quadra release 3.2.0. As you’ll see, we weren’t able to complete all tests in all modes with both software programs, so we presented the results we were able to complete.

In all tests, we configured the Quadra VPUs for maximum throughput as opposed to maximum quality. You can read about the configuration options and their impact on output quality and performance in Benchmarking Hardware Transcoder Performance. While quality will relate to each video and encoding configuration, the configuration used should produce quality at least equal to the veryfast x264 and x265 presets, with quality up to the slow presets available in configurations that optimize quality over throughput.

We tested multiple facets of system performance. The first series of tests involved a single stream in and single stream out, either at the same resolution as the incoming stream or scaled down and output at a lower resolution. Many applications use this mode of operation, including gaming, gambling, and auctions.

The second use case is ABR distribution, where a single input stream is transcoded to a full encoding ladder. Here we supplemented the results with software-only transcodes for comparison purposes. To assess AI-related throughput, we tested region-of-interest transcoding and background removal.

In most modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the Quadra server and delivered the encoded file to RAM.

Same-Resolution Transcoding

Table 1 shows transcoding results for 8K, 4K, 1080p, and 720p in latency tolerant and low-delay modes. The number represents the amount of full frame rate outputs produced by the system at each configuration.

These results are most relevant for interactive gambling and similar applications that input a single stream, transcode the stream at full resolution, and stream it out. You see that 8K streaming is not available in the AV1 format and that H.264 and HEVC are not available in low latency mode with either program. Interestingly, FFmpeg outperformed GStreamer at this resolution while the reverse was true at 1080p.

4K and 720p results were consistent for all input and output codecs and for normal and low delay modes. All output numbers are impressive, but the 640 720p streams for AV1, H.264, or HEVC is remarkable density for a 1RU rack server.

At 1080p there are minor output differences between normal and low-delay mode and the different codecs, though the codec-related differences aren’t that substantial. Interestingly, HEVC throughput is slightly higher than H.264, with AV1 about 16% behind HEVC.

Jan Ozer - the Quadra server review-table-1
Table 1. Same resolution transcoding results.

Table 2 shows a collection of maximum data points (worst case) from the transcoding results presented in Table 1. As you can see, both Max CPU and power consumption track upwards with the number of streams produced. Max latency (decode plus encode) in normal latency mode tracks downward with the stream resolution, becoming quite modest at 720p. Max latency (decode plus encode) in low-delay mode for both decoding and encoding starts and stays under 30.9 milliseconds, which is less than a single frame.

Jan Ozer - the Quadra server review-table-2
Table 2. Maximum CPU, power consumption, and latency data for pure transcoding.

As between FFmpeg and GStreamer, the latter proved more CPU and power efficient than the former, in both normal and low-delay modes. For example, in all tests, GStreamer’s CPU utilization was less than half of FFmpeg, through the power consumption delta was generally under 20%.

At 8K and 4K resolutions, the latency reported was about even between the two programs, but at the lower resolutions in low-delay mode, GStreamer’s latency was often half that of FFmpeg. You can see an example of these two observations in Table 3, reporting 720p HEVC input and output as HEVC. Though the throughput was identical, GStreamer used much less energy and produced much lower latency. As you’ll see in the next section, this dynamic stayed true in transcoding with scaling tests, making GStreamer the superior app for applications involving same-resolution transcoding and transcoding with scaling. 

Quadra Server - Table 3. GStreamer was much more CPU and power efficient and delivered substantially lower latency than FFmpeg in these same resolution transcode tests.
Table 3. GStreamer was much more CPU and power-efficient
and delivered substantially lower latency than FFmpeg
in these same resolution transcode tests.

Transcoding and Scaling

Table 4 shows transcoding while scaling results, first 8K input to 4K output, then 4K to 1080p, and lastly 1080p to 720p. If you compare Table 3 with Table 1, you’ll see that performance tracks the input resolution, not output, which makes sense because decoding is a separate operation that obviously involves its own hardware limits.

Jan Ozer - the Quadra server review-table-4
Table 4. Transcoding while scaling results.

As the Quadra VPUs perform scaling on-board, there was no drop in throughput with the scaling related tests; rather, there was a slight increase in 8K > 4K and 4K > 1080p outputs over the same resolution transcoding reported in Table 1. In terms of throughput, the results were consistent between the codecs and software programs.

Table 5 shows the max CPU and power usage for all the transcodes in Table 3, which increased somewhat from the low-quantity high-resolution transcodes to the high-quantity low-resolution transcodes but was well within the performance envelope for this 32-core server.

The Max latency for all normal encodes was relatively consistent between five and six frames. With low delay engaged, 8K > 4K latency didn’t drop that significantly, though you’d assume that 8K to 4K transcodes are uncommon. Latency dropped to below a single frame in the two lower resolution transcodes.

Jan Ozer - the Quadra server review-table-5
Table 5. Maximum CPU, power consumption, and latency data for transcoding while scaling.

As between FFmpeg and GStreamer we saw the same dynamic as with full resolution transcodes; in most tests, GStreamer consumed significantly less power and produced sharply lower latency. You can see an example of this in Table 6, reporting the results of 1080p incoming HEVC output to AV1 at 720p. 

Table 6. GStreamer was much more CPU and power-efficient
and delivered much lower latency than FFmpeg in this scale then transcode tests.

Encoding Ladder Testing

Table 7 shows the results of full ladder testing with CPU, latency, and power consumption embedded in the output instances. Note that we tested a five-rung ladder for H.264, and four-rung ladders for HEVC and AV1. We didn’t test 4K H.264 output because few services would deploy this configuration. Also, we didn’t test with GSteamer because NETINT’s current GStreamer implementation can’t use Quadra’s internal scalers when producing more than a single file, an issue that the NETINT engineering team will resolve soon. Also, as you can see, low-delay mode wasn’t available for 4K testing. 

This fine print behind us, as with the single file testing, throughput was impressive. The ability to deliver up to 140 HEVC 4-rung ladders from a single 1RU rack, in either normal or low-latency mode, is remarkable.

Jan Ozer - the Quadra server review-table-7
Table 7: Encoding ladder throughput. 

For comparison purposes, we produced the equivalent encoding ladders on the same server using software-only encoding with FFmpeg and the x264, x265, and SVT-AV1 codecs. To match the throughput settings used for Quadra, we used the ultrafast preset for x264 and x265, and preset eleven for SVT-AV1. You see the results in Table 8

Note that these numbers over-represent software-based output since no engineer would produce a live stream with CPU utilization over 60 – 65%, since a sudden spike in CPU usage would crash all the streams. Not only is CPU utilization much lower for the Quadra-driven encodes, minimizing the risk of exceeding CPU capacity, Quadra-based transcoding is much more determinist than CPU-based transcoding, so CPU requirements don’t typically change in midstream.

All that said, Quadra proved much more efficient than software-based encoding for all codecs, particularly HEVC and AV1. In Table 7, the Multiple column shows the number of servers required to produce the same output as the Quadra server, plus the power consumed by all these servers. For H.264, you would need six servers instead of a single Quadra server to produce the 120 instances, and power costs would be nearly six times higher. That’s running each Quadra server at 98.3% CPU utilization. Running at a more reasonable 60% utilization would translate to ten servers and 4,287 watts per hour.

Jan Ozer - the Quadra server review-table-8
Table 8. Ladders, CPU utilization, and power consumed for CPU-only transcoding.

Even without factoring in the 60% CPU-utilization limits, the comparison reaches untenable levels with HEVC and AV1. As the data shows, CPU-based transcoding simply can’t keep up with these more complex codecs, while the ASIC-driven Quadra remains relatively consistent. 

AI-Related Functions

The next two tables benchmark AI-related functions, first region of interest encoding, then background removal. Briefly, region of interest encoding uses AI to search for faces in a stream and then increases the bits assigned to those faces to increase quality. This is useful in surveillance videos or any low-bitrate video environment where facial quality is important. 

We tested 1080p AVC input and output with FFmpeg only, and the system delivered sixty outputs in both normal and low-delay modes, with very modest CPU utilization and power consumption. For more on Quadra’s AI-related functions, and for an example of the region of interest filter, see an Introduction to AI Processing on Quadra.

Jan Ozer - the Quadra server review-table-9
Table 9. Throughput for Region of Interest transcoding via Artificial Intelligence.

Table 10 shows 1080p input/output using the AVC codec with background removal, which is useful in conferencing and other applications to composite participants in a virtual environment (see Figure 2). This task involves considerably more CPU but delivers slightly greater throughput.

Jan Ozer - the Quadra server review-table-10
Table 10. Throughput for background removal and transcoding via Artificial Intelligence.

As you can read about in the Introduction to AI Processing on Quadra, Quadra comes with these and other AI-based applications and can deploy AI-based models developed in most machine learning programs. Over time, AI-based operations will become increasingly integral to video transcoding functions, and the Quadra Video Server provides a future-proof platform for that integration.

Figure 3 -The Quadra Server - Compositing participants in a virtual environment with background removal
Figure 3. Compositing participants in a virtual environment with background removal

Conclusion

While there’s a compelling case for ASIC-based transcoding solely for H.264 production, these tests show that as applications migrate to more complex codecs like HEVC and AV1, CPU-based transcoding is untenable economically and for the environment. Beyond pure transcoding functionality, if there’s anything that the ChatGPT-era has proven, it’s that AI-based transcoding-related functions will become mainstream much sooner than anyone might have thought. With highly efficient ASIC-based transcoding hardware and AI engines, the Quadra Video Server checks all the boxes for a server to strongly consider for all high-volume live streaming applications. 

Build Your Own Streaming Infrastructure – Software

Build Your Own Streaming Infrastructure - Article by Jan Ozer from NETINT Technologies

My assumption is that you’re currently using a cloud-based service like AWS for your live streaming and are seeking to reduce costs by buying your own transcoding hardware, installing the necessary software, and hosting the server on-premises or in a co-location facility. This article covers the software side.

To begin, let’s acknowledge that AWS and other cloud services have created a well-featured and highly integrated ecosystem for live streaming and distribution. The downside is the cost.

To illustrate the potential savings, I’ll refer to this article, which compared the cost of producing 21 H.264 ladders and 27 HEVC ladders via AWS MediaLive and by encoding with NETINT’s recently launched Logan Video Server. As you can see in the table, MediaLive costs around $400K for H.264 and $1.8 million for HEVC, as compared to $11,140 in both cases for the co-located server.

Streaming Infrastructure - Table from article 'cloud or on-prem'
Table 1. Five-year cost comparison . AWS MediaLive pricing compared to the NETINT Server

While there are less expensive options available inside and outside of AWS, whenever you pay for hardware by the minute or hour of production, you’re vastly overpaying as compared to owning your own hardware. Sure, you say, but it’s so easy compared to running your own hardware.

If that’s a concern, here are some comforting words from David Heinemeier Hansson, co-owner, and CTO of software developer 37signals, the developer of the project management platform Basecamp and email service Hey. Recently, Hansson wrote  Why we’re leaving the cloud, a blog that detailed his companies’ decisions to do just that. Here’s the relevant quote.

Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.

My wife has chihuahuas, and given their difficulties with potty training, I seriously doubt they could do it, but you get the point. To paraphrase FDR, all you have to fear is fear itself. The bottom line is that running your own live streaming service should cost relatively little CAPEX, will save significant OPEX, and won’t be nearly as challenging as you might be fearing.

Let’s look at your options for the software required to run your homegrown system.

Transcoding and Packaging Software

Figure 1 shows the minimum software and infrastructure needed for a live-streaming service. Presumably, you’ve already got the live production covered, and since AWS doesn’t offer a player, you have that piece addressed as well. You’ll need a content delivery network to deliver your streaming video, but you can continue to use CloudFront or other CDN. The software that you absolutely have to replace is the live transcoding and packaging component.

Here you have three options; multimedia frameworks, media servers, and “other.” Let’s discuss each in turn.

Multimedia Frameworks

Multimedia frameworks are software libraries, tools, and APIs that provide a set of functionalities and capabilities for multimedia processing, manipulation, and streaming. The best-known framework is FFmpeg, followed by GStreamer and GPAC, and they are all available open source.

Build Your Own Streaming Infrastructure - Software- diagram-2
Figure 1. Netflix uses GPAC for its packaging,
a significant technology endorsement for GPAC
and for multimedia frameworks in general.

Multimedia frameworks excel in projects at both ends of the complexity spectrum. For simple projects, like transcoding an input stream to an encoding ladder, you can create a script that inputs the stream, transcodes, and hands the packaged output streams off to a CDN in a matter of minutes. You can use the script to process thousands of simultaneous jobs, all at no charge.

At the other end of the spectrum, these frameworks also excel at complex jobs with idiosyncratic custom requirements that likely aren’t available in a server or commercial software product. The development, maintenance, and modification costs are considerable, but you get maximum feature flexibility if you’re willing to pay that cost.

What you don’t get with these tools is a user interface or simple configuration options – you start with a blank slate and must program in all desired features. What could be as simple as checking a checkbox in a streaming media server could require dozens or even thousands of lines of code in a multimedia framework.

Which takes us to streaming media servers.

Streaming Media Servers

The next category of products are streaming media servers, and it includes Wowza Streaming Engine, Nimble Streamer, and two open-source servers, Red5 and Ant Media Server. These servers tend to excel for most productions in the middle of the complexity spectrum and offer multiple advantages over multimedia frameworks.

There are several reasons why you might choose to use a streaming server over a multimedia framework, including a simplified setup and configuration. Most streaming servers provide out-of-the-box streaming solutions with pre-configured settings and management interfaces that simplify the setup and configuration process. While not all offer GUIs, those that don’t offer simple option selection in configuration files.

Build Your Own Streaming Infrastructure - Software- diagram-3
Figure 2. Wowza Streaming Engine is a highly regarded streaming server

As mentioned above, streaming servers often offer simpler access to advanced features that you’d have to craft by hand with a multimedia framework. They also offer better integration with third-party services like digital rights management (DRM) and content delivery networks. Between the simplified setup, easier access to features, and improved integration with other services, packaged servers can dramatically accelerate getting your live streaming service up and running.

Once you’re operational, you’ll appreciate management interfaces that monitor the health and performance of your streaming infrastructure, track viewer analytics, manage streaming workflows, and make real-time adjustments. If you’re in a dynamic demand environment, some streaming servers offer built-in scalability features and load balancing to manage the load over multiple hard transcoding resources. You’d have to build all that by hand or with plug-ins if using a multimedia framework.

The two potential downsides of streaming servers are cost and customizability. You’ll have to pay a monthly fee for some versions of these servers, and you may find it complicated or nearly impossible to add what you might consider to be essential features.

Other Streaming-Capable Programs

Most companies building their own live-streaming infrastructures will implement either a multimedia framework or a streaming server, but there are other programs that incorporate the core encoding and packaging functions. One such program is Norsk from id3as. Norsk bills itself as “an SDK that enables developers to easily create amazing, dynamic live video workflows and deploy them at any scale.” As such, it combines both video production and streaming server-related functions

You see this in Figure 3. The top portion shows that Norsk supports the typical codecs and packaging formats deployed by live-streaming producers. At the bottom of the figure, you see that Norsk also offers production-oriented features like multiple camera support, graphics and overlays, and transitions.

Build Your Own Streaming Infrastructure - Software- diagram-4
Figure 3. Norsk offers both production and server-related functions.

Interestingly, Norsk doesn’t have a GUI, instead offering a high-level API to simplify configuration and operation, with a Workflow Visualizer component to view the running state of the application. In this fashion, Norsk attempts to provide the configurability of multimedia frameworks with the ease of operation of scripting-driven streaming media servers.

Finding a program like Norsk that combines transcoding and packaging with other essential streaming-related functions makes a lot of sense; there’s one less vendor to onboard and one less product to learn and support. As remote production becomes more common, we expect more programs like Norsk to become available.

Those are your high-level options. If you’re interested in learning more about these and other programs that can drive encoding and packaging for your live transcoder. You should plan to attend our upcoming symposium; details will be available in the next couple of weeks.

NETINT Video Transcoding Server – ASIC technology at its best

NETINT Video Transcoding Server - quality-speed-density

Many high-volume streaming platforms and services still deploy software-only transcoding, but high energy prices for private data centers and escalating public cloud costs make the OPEX, carbon footprint, and dismal scalability unsustainable. Engineers looking for solutions to this challenge are actively exploring hardware that can integrate with their existing workflows and deliver the quality and flexibility of software with the performance and operational cost efficiency of purpose-built hardware. 

If this sounds like you, the USD $8,900 NETINT Video Transcoding Server could be the ideal solution. The server combines the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered 1RU server with ten NETINT T408 video transcoders that draw just 7 watts each. Encoding HEVC and H.264 at normal or low latency, you can control transcoding operations via  FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional x264 or x265 FFmpeg-based or GPU-powered encoding stack.

NETINT Video Transcoding Server

Due to the performance advantage of ASICs compared to software running on x86 CPUs, the server can perform the equivalent work of roughly 10 separate machines running a typical open-source FFmpeg and x264 or x265 configuration. Specifically,  the server can simultaneously transcode twenty 4Kp30 streams, and up to 80 1080p30 live streams. In ABR mode, the server transcodes up to 30 five-rung H.264 encoding ladders from 1080p to 360p resolution, and up to 28 four-rung HEVC encoding ladders. For engineers delivering UHD, the server can output seven 6-rung HEVC encoding ladders from 4K to 360p resolution, all while drawing less than 325 watts of total power.

This review begins with a technical description of the server and transcoding hardware and the options available to drive the encoders, including the resource manager that distributes jobs among the ten transcoders. Then we’ll review performance results for one-to-one streaming and then H.264 and HEVC ladder generation, and finish with a look at the server’s ultra-efficient power consumption.

NETINT Transcoding Server with 10 T408 Video Transcoders

Hardware Specs

Built on the Supermicro 1114S-WN10RT 1RU server platform, the NETINT Video Transcoding Server features ten NETINT Codensity ASIC-powered T408 video transcoders, and runs Ubuntu 20.04.05 LTSThe server ships with 128 GB of DDR4-3200 RAM and a 400GB M.2 SSD drive with 3x PCIe slots and ten NVME slots to house the ten U.2 T408 video transcoders.

You can buy the server with any of three AMD EPYC processors with 8 to 64 cores. We performed the tests for this review on the 32-core AMD EPYC 7543P CPU that doubles to 64 threads with multithreading.  The server configured with the AMD EPYC 7713P processor with 64-cores and 128-threads sells for USD $11,500, and the economical AMD EPYC 7232P processor-based server with 8-cores and 16-threads lists for USD $7,000.

Regarding the server hardware, Supermicro is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient machine when operated with NETINT video transcoders.

Let’s explore the system - NETINT Video Transcoding Server

With this as background, let’s explore the system. Once up and running in Ubuntu, you can check T408 status via the ni_rsrc_mon_logan command, which reveals the number of T408s installed and their status. Looking at Figure 1, the top table shows the decoder performance of the installed T408s, while the bottom table shows the encoding performance.

Figure 1. Tracking the operation of the T408s, decode on top, encode on the bottom.

About the T408

T408s have been in service since 2019 and are being used extensively in hyper-scale platforms and cloud gaming applications. To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.

The NETINT T408 is powered by the Codensity G4 ASIC technology and is available in both PCIe and U.2 form factors. The T408s installed in the server are the U.2 form factor plugged into ten NVMe bays. The T408 supports close caption passthrough, and EIA CEA-708 encode/decode, along with support for High Dynamic Range in HDR10 and HDR10+ formats.

“To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.” 

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

The T408 decodes and encodes H.264 and HEVC on board but performs all scaling and overlay operations via the host CPU. For one-to-one same-resolution transcoding, users can select an option called YUV Bypass that sends the video transcoded by the T408 directly to the T408 encoder. This eliminates high-bandwidth trips through the bus to and from system memory, reducing the load on the bus and CPU. As you’ll see, in pure 1:1 transcode applications without overlay, CPU utilization is very low, so the T408 and server are very efficient for cloud gaming and other same-resolution, low-latency interactive applications. 

Netint Codensity, ASIC-based T408 Video Transcoder
Figure 2. The T408 is powered by the Codensity G4 ASIC.

Testing Overview

We tested the server with FFmpeg and GStreamer. As you’ll see, in most operations, performance was similar. In some simple transcoding applications, FFmpeg pulled ahead, while in more complex encoding ladder productions, particularly 4K encoding, GStreamer proved more performant, particularly for low-latency output.

Figure 3. The software architecture for controlling the server.  

Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the T408 NVME interface and the FFmpeg software layer. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

To allocate jobs to the ten T408s, the T408 device driver software includes a resource management module that tracks T408 capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available resources.

Alternatively, you can manually assign decoding and encoding tasks to different T408 devices in the command line or application and control which streams are decoded by the host CPU or a T408. With these and similar controls, you can efficiently balance the overall transcoding load between the T408s and host CPU to maximize throughput. We used auto distribution for all tests.

Testing Procedures

We tested using Server version 1.0, running FFmpeg v4.3.1 and GStreamer v1.18 and T408 release 3.2.0. We tested with two use cases in mind. The first is a stream in-single stream out, either at the same resolution as the incoming stream or output at a lower resolution.  This mode of operation is used in many interactive applications like cloud gaming, real-time gaming, and auctions where the absolute lowest latency is required. We also tested scaling performance since many interactive applications scale the input to a lower resolution.

The second use case is ABR, where a single input stream is transcoded to a full encoding ladder. In both modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

Play Video about NETINT Video Transcoding Server - ASIC technology at its best
HARD QUESTIONS ON HOT TOPICS
All you need to know about NETINT Transcoding Server powered by ASICs
Watch the full conversation on YouTube: https://youtu.be/6j-dbPbmejw

One-to-One Performance

Table 1 shows transcoding results for 4K, 1080p, and 720p in latency tolerant and low-delay modes. Instances is the number of full frame rate outputs produced by the system, with CPU utilization shown for reference. These results are most relevant for cloud gaming and similar applications that input a single stream, transcode the stream at full resolution, and distribute it.

As you can see, 4K results peak at 20 streams for all codecs, though results differ by the software program used to generate the streams. The number of 1080p outputs range from 70 – 80, while 720p streams range from 140 to 170. As you would expect, CPU utilization is extremely low for all test cases as the T408s are shouldering the complete decoding/encoding role. This means that performance is limited by T408 throughput, not CPU, and that the 64-core CPU probably wouldn’t produce any extra streams in this use case. For pure encoding operations, the 8-core server would likely suffice, though given the minimal price differential between the 8-core and 32-core systems, opting for the higher-end model is a prudent investment.

Latency

As for latency, in the normal mode, latency averaged around 45 ms for 4K transcoding and 34 ms for 1080p and 720p transcoding. In low delay mode, this dropped to around 24 ms for 4K, 7 ms for 1080p, and 3 ms for 720, all at 30 fps transcoding and measured with FFmpeg. For reference, at 30 fps, each frame is displayed for 33.33 ms. Even in latency-tolerant mode, latency is just over 1.36 frames for 4K and under a single frame for 1080p and 720p. In low delay modes, all resolutions are under a single frame of latency.

It’s worth noting that while software performance would drop significantly from H.264 to HEVC, hardware performance does not. Thus questions of codec performance for more advanced standards like HEVC do not apply when using ASICs. This is good news for engineers adopting HEVC, and those considering HEVC in the future. It means you can buy the server, comfortable in the knowledge that it will perform equally well (if not better) for HEVC encoding or transcoding.

Table 1. Full resolution transcodes with FFmpeg and Gstreamer
in regular and low delay modes.

Table 2 shows the performance when scaling from 4K to 1080p and from 1080p to 720p, again by the different codecs in and out. Since scaling is performed by the host CPU, CPU usage increases significantly, particularly on the higher volume 1080p to 720p output. Still, given that CPU utilization never exceeds 35%, it appears that the gating factor to system performance is T408 throughput. Again, while the 8-core system might be able to produce similar output if your application involves scaling, the 32-core system is probably better advised.

In these tests, latency was slightly higher than pure transcoding. In normal mode, 4K > 1080p latencies topped out at 46 ms and dropped to 39 ms for 1080p > 720p scaling, just over a single frame of latency. In low latency mode, these results dropped to 10 ms for 4K > 1080p and 10 ms for 1080p > 720p. As before, these latency results are for 30fps and were measured with FFmpeg.

Table 2: Performance while scaling from 4K to 1080p and 1080p to 720p.

The final set of tests involves transcoding to the AVC and HEVC encoding ladders shown in Table 3. These results will be most relevant to engineers distributing full encoding ladders in HLS, DASH, or CMAF containers.

Here we see the most interesting discrepancies between FFmpeg and GStreamer, particularly in low delay modes and in 4K results. In the 1080p AVC tests, FFmpeg produced 30 5-rung encoding ladders in normal mode but dropped to nine in low-delay mode. GStreamer produced 30 encoding ladders in both modes using substantially lower CPU resources. You see the same pattern in the 1080p four-rung HEVC output where GStreamer produced more ladders than FFmpeg using lower CPU resources in both modes.

Table 3. Full encoding ladders output in the listed modes.

FFmpeg produced very poor results in 4K testing, particularly in low latency mode, and it was these results that drove the testing with GStreamer. As you can see, GStreamer produced more streams in both modes and CPU utilization again remained very low. As with the previous results, the low CPU utilization means that the results reflect the encoding limits of the T408. For this reason, it’s unlikely that the higher end server would produce more encoding ladders.

In terms of latency, in normal mode, latency was 59 ms for the H.264 ladder, 72 ms for the 4 rung 1080p HEVC ladder, and 52 ms for the 4K HEVC ladder. These numbers dropped to 5 ms, 7 ms, and 9 ms for the respective configurations in low latency mode.

Power Consumption

Power consumption is an obvious concern for all video engineers and operations teams. To assess system power consumption, we tested using the IPMI Tool. When running completely idle, the system consumed 154 watts, while at maximum CPU, the unit averaged 400 watts with a peak of 425 watts.

We measured consumption during the three basic operations tested, pure transcoding, transcoding with scaling, and ladder creation, in each case testing the GStreamer scenario that produced the highest recorded CPU usage. You see the results in Table 4.

When you consider that CPU-only transcoding would yield a fraction of the outputs shown while consuming 25-30% more power, you can see that the T408 is exceptionally efficient when it comes to power consumption. The Watts/Output figure provides a useful comparison for other competitive systems, whether CPU or GPU-based.

Table 4. Power consumption during the specified operation.

Conclusion

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower price model available for pure encoding operations, and a more powerful model for CPU-intensive operations, the NETINT server family meets a broad range of requirements.

ASICs, A Preferred Technology for High Volume Transcoding

The video presented below (and the transcript) is from a talk I gave for the Streaming Video Alliance entitled The Nine Events that Shook the Codec World on March 30, 2023. During the talk, I discussed the events occurring over the previous 12-18 months that impacted codec deployment and utility.

Not surprisingly, number 1 was Google Chrome starting to play HEVC. Number 8 was Meta announcing their own ASIC -based transcoder. Given that both Google and Meta are now using ASICs in their encoding workflows, it was an important signal that ASICs were now the preferred technology for high-volume streaming. 

In this excerpt from the presentation, I discuss the history of ASIC-based encoding from the MPEG-2 days of satellite and cable TV to current-day deployments in cloud gaming and other high-volume live interactive video services. Spend about 4 minutes reading the transcript or watching the video and you’ll understand why ASICs have become the preferred technology for high-volume transcoding. 

Here’s the transcript; the video is below. I will say that I heavily edited the transcript to remove the ums, ahs, and other miscues in the transcript.  

Historically, you can look at ASIC usage in three phases. Back when digital video was primarily deployed on satellite and cable TV in a MPEG-2 format, almost all encoders were ASIC-based. And that was because the CPUs at the time weren’t powerful enough to produce MPEG-2 in real-time. 

Then starting in around 2012 or so and ending around 2018, video processing started moving to the cloud. CPUs were powerful enough to support real-time encoding or transcoding of H.264, and ASIC usage decreased significantly.

Then starting in around 2012 or so, and ending around 2018, video processing started moving to the cloud. CPUs were powerful enough to support real-time encoding or transcoding of H.264, and ASIC usage decreased significantly.

At the time, I was writing for Streaming Media Magazine, Elemental came out and in 2012 or 2013, they really hyped the fact that they had compression-centric hardware appliances for encoding. Later on, discussing the same hardware, they transitioned to what they called software-defined video processing. And that’s how they got bought by AWS. AWS now does most of the encoding with Elemental products with their own Graviton CPUs.

ASICs - the latest phase

Now the latest phase. We’re seeing a lot of high-volume interactive use like gambling, auctions, high-volume UGC and other live videos, and cloud gaming. 

Codecs are also getting more complex. As we move from H.264 to HEVC to AV1 and soon to VVC and perhaps LCEVC and EVC, GPUs and CPUs can’t keep up.

At the same time, power consumption and density are becoming critical factors. Everybody’s talking about cost of power, and power consumption in data centers, and using CPUs and GPUs is just very, very inefficient.

And this is where ASICs emerge as the best solution on a cost-per-stream, watts-per-stream, and density basis. Density means how many streams we can output from a single server.

And we saw this, “Google Replaces Millions of Intel’s CPUs With Its Own Homegrown Chips.” Those homegrown chips were encoding ASICs. And then we saw Meta. 

ASICs - significance.

These deployments legitimize encoding ASICs as the preferred technology for high-volume transcoding, implicitly and explicitly. 

“There are two types of companies in the video business. Those using Video Processing ASICs in their workflows, and those that will”.

– David Ronca

I say explicitly because of the following comments made by David Ronca, who was director of video encoding at Netflix and then moved to Meta, two or three years ago. Announcing Meta’s new ASIC, he said, “There are two types of companies in the video business. Those using Video Processing ASICs in their workflows, and those that will be.”

Usage by Google and Facebook, Meta, gives ASICs a lot more credibility than what you get from me saying it, as obviously, NETINT makes encoding ASICs. And these legitimize our technology. The technologies themselves are different. Meta made their own chips. Google made their own chips. We have our own chips. But the whole technology is legitimized by the usage of these premiere services.


Watch the full presentation on YouTube:
https://youtu.be/-4sJ0We0hro

ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses

ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses

As the title suggests, this post compares CAPEX and OPEX costs for live streaming using ASIC- based transcoding and CPU-based transcoding. The bottom line?

NETINT Transcoding Server with 10 T408 Video Transcoders
Figure 1. The 1 RU Deep Edge Appliance with ten NETINT T408 U.2 transcoders.

Jet-Stream is a global provider of live-streaming services, platforms, and products. One such product is Jet-Stream’s Deep Edge OTT server, an ultra-dense scalable OTT streaming transcoder, transmuxer, and edge cache that incorporates ten NETINT T408 transcoders. In this article, we’ll briefly review how Deep Edge compared financially to a competitive product that provided similar functionality but used CPU-based transcoding.

About Deep Edge

Jet-Stream Deep Edge is an OTT edge transcoder and cache server solution for telcos, cloud operators, compounds, and enterprises. Each Deep Edge appliance converts up to 80 1080p30 television channels to OTT HLS and DASH video streams, with a built-in cache enabling delivery to thousands of viewers without additional caches or CDNs.

Each Deep Edge appliance can run individually, or you can group multiple systems into a cluster, automatically load-balancing input channels and viewers per site without the need for human operation. You can operate and monitor Edge appliances and clusters from a cloud interface for easy centralized control and maintenance. In the case of a backlink outage, the edge will autonomously keep working.

Figure 2. Deep Edge operating schematic.

Optionally, producers can stream access logs in real-time to the Jet-Stream cloud service. The Jet-Stream Cloud presents the resulting analytics in a user-friendly dashboard so producers can track data points like the most popular channels, average viewing time, devices, and geographies in real-time, per day, week, month, and year, per site, and for all the sites.

Deep Edge appliances can also act as a local edge for both the internal OTT channels and Jet-Stream Cloud’s live streaming and VOD streaming Cloud and CDN services. Each Deep Edge appliance or cluster can be linked to an IP-address, IP-range, AS-number, country, or continent, so local requests from a cell tower, mobile network, compound, football stadium, ISP, city, or country to Jet-Stream Cloud are directed to the local edge cache. Each Deep Edge site can be added to a dynamic mix of multiple backup global CDNs, to tune scale, availability, and performance and manage costs.

Under the Hood

Each Deep Edge appliance incorporates ten NETINT T408 transcoders into a 1RU form factor driven by a 32-core CPU with 128 GB of RAM. This ASIC-based acceleration is over 20x more efficient than encoding software on CPUs, decreasing operational cost and CO2 footprint by order of magnitude. For example, at full load, the Deep Edge appliance draws under 240 watts.

The software stack on each appliance incorporates a Kubernetes-based container architecture designed for production workloads in unattended, resource-constrained, remote locations. The architecture enables automated deployment, scaling, recovery, and orchestration to provide autonomous operation and reduced operational load and costs.

The integrated Jet-Stream Maelstrom transcoding software provides complete flexibility in encoding tuning, enabling multi-bit-rate transcoding in various profiles per individual channel.

Each channel is transcoded and transmuxed in an isolated container, and in the event of a crash, affected processes are restarted instantly and automatically.

Play Video about ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses
HARD QUESTIONS ON HOT TOPICS
 ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses
Watch the full conversation on YouTube: https://youtu.be/pXcBXDE6Xnk

Deep Edge Proposal

Recently, Jet-Stream submitted a bid to a company with a contract to provide local streaming services to multiple compounds in the Middle East. The prospective customer was fully transparent and shared the costs associated with a CPU-based solution against which Deep Edge competed.

In producing these projections, Jet-Stream incorporated a cost per kilowatt of € 0.20 Euros and assumed that the software-based server would run at 400 Watts/hour while Deep Edge would run at 220 Watts per hour.  These numbers are consistent with lab testing we’ve performed at NETINT; each T408 draws only 7 watts of power, and because they transcode the incoming signal onboard, host CPU utilization is typically at a minimum.

Jet-Stream produced three sets of comparisons; a single appliance, a two-appliance cluster, and ten sites with two-appliance clusters. Here are the comparisons. Note that the Deep Edge cost includes all software necessary to deliver the functionality detailed above for standard features. In contrast, the CPU-based server cost is hardware-only and doesn’t include the licensing cost of software needed to match this functionality.    

Single Appliance

A single Deep Edge appliance can produce 80 streams, which would require five separate servers for CPU-based transcoding. Considering both CAPEX and OPEX, the five-year savings was €166,800.

ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses - Table 1
Table 1. CAPEX/OPEX savings for a single
Deep Edge appliance over CPU-based transcoding.

A Two-Appliance Cluster

Two Deep Edge appliances can produce 160 streams, which would require nine CPU-based encoding servers to produce. Considering both CAPEX and OPEX, the five-year savings for this scenario was €293,071.

Table 2 CAPEX/OPEX savings for a dual-appliance
Deep Edge cluster over CPU-based transcoding.
.

Ten Sites with Two-Appliance Clusters

Supporting ten sites with 180 channels would require 20 Deep Edge appliances and 90 servers for CPU-based encoding. Over five years, the CPU-based option would cost over € 2.9 million Euros more than Deep Edge.

Table 3. CAPEX/OPEX savings for ten dual-appliance
Deep Edge clusters over CPU-based transcoding.

While these numbers border on unbelievable, they are actually quite similar to what we computed in this comparison, How to Slash CAPEX, OPEX, and Carbon Emissions with T408 Video Transcoder, which compared T408-based servers to CPU-only on-premises and AWS instances.

The bottom line is that if you’re transcoding with CPU-based software, you’re paying way too much for both CAPEX and OPEX, and your carbon footprint is unnecessarily high. If you’d like to explore how many T408s you would need to assume your current transcoding workload, and how long it would take to recoup your costs via lower energy costs, check out our calculators here.

Play Video about ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses
Voices of Video: Building Localized OTT Networks
Watch the full conversation on YouTube: https://youtu.be/xP1U2DGzKRo

NETINT Quadra vs. NVIDIA T4 – Benchmarking Hardware Encoding Performance

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer

This article is the second in a series about benchmarking hardware encoding performance. In the first article, available here, I delineated a procedure for testing hardware encoders. Specifically, I recommended this three-step procedure:

  1. Identify the most critical quality and throughput-related options for the encoder.
  2. Test across a range of configurations from high quality/low throughput to low quality/high throughput to identify the operating point that delivers the optimum blend of quality and throughput for your application.
  3. Compute quality, cost per stream, and watts per stream at the operating point to compare against other technologies.

After laying out this procedure, I applied it to the NETINT Quadra Video Processing Unit (VPU) to find the optimum operating point and the associated quality, cost per stream, and watts per stream. In this article, we perform the same analysis on the NVIDIA T4 GPU-based encoder.

About The NVIDIA T4

The NVIDIA T4 is powered by NVIDIA Turing Tensor Cores and draws 70 watts in operation. Pricing varies by the reseller, with $2,299 around the median price, which puts it slightly higher than the $1,500 quoted for the NETINT Quadra T1  VPU in the previous article.

In creating the command line for the NVIDIA encodes, I checked multiple NVIDIA documents, including a document entitled Video Benchmark Assumptions, this blog post entitled Turing H.264 Video Encoding Speed and Quality, and a document entitled Using FFmpeg with NVIDIA GPU Hardware acceleration that requires a login. I readily admit that I am not an expert on NVIDIA encoding, but the point of this exercise is not absolute quality as much as the range of quality and throughput that all hardware enables. You should check these documents yourself and create your own version of the optimized command string.

While there are many configuration options that impact quality and throughput, we focused our attention on two, lookahead and presets. As discussed in the previous article, the lookahead buffer allows the encoder to look at frames ahead of the frame being encoded, so it knows what is coming and can make more intelligent decisions. This improves encoding quality, particularly at and around scene changes, and it can improve bitrate efficiency. But lookahead adds latency equal to the lookahead duration, and it can decrease throughput.

Note that while the NVIDIA documentation recommends a lookahead buffer of twenty frames, I use 15 in my tests because, at 20, the hardware decoder kept crashing. I tested a 20-frame lookahead using software decoding, and the quality differential between 15 and 20 was inconsequential, so this shouldn’t impact the comparative results.

I also tested using various NVIDIA presets, which like all encoding presets, trade off quality vs. throughput. To measure quality, I computed the VMAF harmonic mean and low-frame scores, the latter a measure of transient quality. For throughput, I tested the number of simultaneous 1080p30 files the hardware could process at 30 fps. I divided the stream count into price and watts/hour to determine cost/stream and watts/stream.

As you can see in Table 1, I tested with a lookahead value of 15 for selected presets 1-9, and then with a 0 lookahead for preset 9. Line two shows the closest x264 equivalent score for perspective.

In terms of operating point for comparing to  Quadra, I choose the lookahead 15/preset 4 configuration, which yielded twice the throughput of preset 2 with only a minor reduction in VMAF Harmonic mean. We will consider low-frame scores in the final comparisons.

In general, the presets worked as they should, with higher quality and lower throughput at the left end, and the reverse at the right end, though LA15/P4 performance was an anomaly since it produced lower quality and higher throughput than LA15/P6. In addition, dropping the lookahead buffer did not produce the performance increase that we saw with Quadra, though it also did not produce a significant quality decrease.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 1
Table 1. H.264 options and results.

Table 2 shows the T4’s HEVC results. Though quality was again near the medium x265 preset with several combinations, throughput was very modest at 3 or 4 streams at that quality level. For HEVC, LA15/P4 stands out as the optimal configuration, with four times or better throughput than other combinations with higher-quality output.

In terms of expected preset behavior, LA15/P4 was again quite the anomaly, producing the highest throughput in the test suite with slightly lower quality than LA15/P6, which should deliver lower quality. Again, switching from LA 15 to LA 0 produced neither the expected spike in throughput nor a drop in quality, as we saw with the Quadra for both HEVC and H.264.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 2
Table 2. HEVC options and results.

Quadra vs. T4

Now that we have identified the operating points for Quadra and the T4, let us compare quality, throughput, CAPEX, and OPEX. You see the data for H.264 in Table 3.

Here, the stream count was the same, so Quadra’s advantage in cost per stream and watts per stream related to its lower cost and more efficient operation. At their respective operating points, the Quadra’s VMAF harmonic mean quality was slightly higher, with a more significant advantage in the low-frame score, a predictor of transient quality problems.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 3
Table 3. Comparing Quadra and T4 at H.264 operating points.

Table 4 shows the same comparison for HEVC. Here, Quadra output 75% more streams than the T4, which increases the cost per stream and watts per stream advantages. VMAF harmonic means scores were again very similar, though the T4’s low frame score was substantially lower.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Table 4
Table 4. Comparing Quadra and T4 at HEVC operating points. 

Figure 5 illustrates the low-frames and low-frame differential between the two files. It is the result plot from the Moscow State University Video Quality Measurement Tool (VQMT), which displays the VMAF score, frame-by-frame, over the entire duration of the two video files analyzed, with Quadra in red and the T4 in green. The top window shows the VMAF comparison for the entire two files, while the bottom window is a close-up of the highlighted region of the top window, right around the most significant downward spike at frame 1590.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Picture 1
Figure 5. The downward green spikes represent the low-frame scores in the T4 encode.

As you can see in the bottom window in Figure 5, the low-frame region extends for 2-3 frames, which might be borderline noticeable by a discerning viewer. Figure 6 shows a close-up of the lowest quality frame, Quadra on the left, T4 on the right, and the dramatic difference in VMAF score, 87.95 to 57, is certainly warranted. Not surprisingly, PSNR and SSIM measurements confirmed these low frames.

Hardware Encoding - Benchmarking Hardware Encoding Performance by Jan Ozer - Picture 2
Figure 6. Quality comparisons, NETINT Quadra on the left, T4 on the right.

It is useful to track low frames because if they extend beyond 2-3 frames, they become noticeable to viewers and can degrade viewer quality of experience. Mathematically, in a two-minute test file, the impact of even 10 – 15 terrible frames on the overall score is negligible. That is why it is always useful to visualize the metric scores with a tool like VQMT, rather than simply relying on a single score.

Summing Up

Overall, you should consider the procedure discussed in this and the previous article as the most important takeaway from these two articles. I am not an expert in encoding with NVIDIA hardware, and the results from a single or even a limited number of files can be idiosyncratic.

Do your own research, test your own files, and draw your own conclusions. As stated in the previous article, do not be impressed by quality scores without knowing the throughput, and expect that impressive throughput numbers may be accompanied by a significant drop in quality.

Whenever you test any hardware encoder, identify the most important quality/throughput configuration options, test over the relevant range, and choose the operating point that delivers the best combination of quality and throughput. This will give the best chance to achieve a meaningful apples vs. apples comparison between different hardware encoders that incorporates quality, cost per stream, and watts per stream.