Hardware Transcoding: What it Is, How it Works, and Why You Care

What is Transcoding?

Like most terms relating to streaming, transcoding is defined more by practice than by a dictionary. In fact, transcoding isn’t in Webster’s or many other dictionaries. That said, it’s generally accepted that transcoding means converting a file from one format to another. More particularly, it’s typically used within the context of a live-streaming application.

As an example, suppose you were watching a basketball game on NBA.tv. Assuming that the game is produced on-site, somewhere in the arena, a video mixer pulls together all video, audio, and graphics. The output would typically be fed into a device that compresses it to a high-bitrate H.264 or another compressed format and sends it to the cloud. You would typically call this live encoding; if the encoder is hardware-based, it would be hardware-based live encoding.

In the cloud, the incoming stream is transcoded to lower-resolution H.264 streams for delivery to mobile and other devices or to HEVC for delivery to a smart TV. This can be done in software but is typically performed using a hardware transcoder because it’s more efficient. More on this below.

Looking further into the production and common uses of streaming terminology, during the event or after, a video editor might create short highlights from the original H.264 video to share on social media. After editing the clip, they would encode it to H.264 or another compressed format to upload to Instagram or Facebook. You would typically call rendering the output from the software editor encoding, not transcoding, even though the software converts the H.264 input file to H.264 output, just like the transcoder.

HARD QUESTIONS ON HOT TOPICS: Transcoding versus Encoding.
Watch the full conversation on YouTube: https://youtu.be/BcDVnoxMBLI

Boiling all this down in terms of common usage:

  • You encode a live stream from video input to send it to the cloud for distribution, using either a hardware or a software live encoder.
  • In the cloud, you transcode the incoming stream to multiple resolutions or different formats using a hardware or software transcoder.
  • When outputting video for video-on-demand (VOD) deployment, you typically call this encoding (and not transcoding), even if you’re working from the same compressed format as the transcoding device.

Hardware Transcoding Alternatives

Anyone who has ever encoded a file knows that it’s a demanding process for your computer. When producing for VOD, time matters, but if the process takes a moment or two longer than planned, no one really notices. Live, of course, is different; if the video stream slows or is interrupted, viewers notice and may click to another website or change channels.

This is why hardware transcoding is typically deployed for high-volume transcoding applications. You can encode with a CPU and software, but CPUs perform multiple functions within the computer and are not optimized for transcoding. This means that a CPU-only server produces fewer streams than a server equipped with hardware transcoders, which translates to higher CAPEX and power consumption.

As the name suggests, hardware-based transcoding uses hardware devices other than the CPU to transcode the video. One alternative is the graphics processing unit (GPU), which is highly optimized for graphics-intensive applications like gaming. Transcoding is supported with dedicated hardware circuits in the GPU, but the vast majority of circuits are for graphics and other non-transcoding functions. While GPUs are more efficient than CPUs for transcoding, they are expensive and consume significant power.

ASIC-Based Transcoding

Which takes us to ASICs. Application-Specific Integrated Circuits (ASICs) are designed for a specific task or application, like video transcoding. Because they’re designed for this task, they are more efficient than CPU- or GPU-based encoding, more affordable, and more power-efficient.

Because they’re designed for this task, Application-Specific Integrated Circuits (ASICs) are more efficient than CPU- or GPU-based encoding, more affordable, and more power-efficient.

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

ASICs are also very compact, so you can pack more ASICs into a server than GPUs or CPUs, increasing the output from that server. This means you need fewer servers to deliver the same number of streams than with GPU- or CPU-based transcoding, which further reduces server storage and maintenance costs.

While we’re certainly biased, if you’re looking for a cost-effective and power-efficient hardware alternative for high-volume transcoding applications, ASIC transcoders are the way to go. Don’t take our word for it; YouTube converted much of its production operation to the ASIC-based Argos VCU (for video compression unit), and Meta recently released its own encoding ASIC. Of course, neither of these is for sale to the public; the primary vendor for ASIC-based transcoders is NETINT.

NETINT Video Transcoding Server – ASIC technology at its best

Many high-volume streaming platforms and services still deploy software-only transcoding, but high energy prices for private data centers and escalating public cloud costs make the OPEX, carbon footprint, and dismal scalability unsustainable. Engineers looking for solutions to this challenge are actively exploring hardware that can integrate with their existing workflows and deliver the quality and flexibility of software with the performance and operational cost efficiency of purpose-built hardware. 

If this sounds like you, the USD $8,900 NETINT Video Transcoding Server could be the ideal solution. The server combines the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered 1RU server with ten NETINT T408 video transcoders that draw just 7 watts each. The T408s encode HEVC and H.264 at normal or low latency, and you can control transcoding operations via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional x264 or x265 FFmpeg-based or GPU-powered encoding stack.

Due to the performance advantage of ASICs compared to software running on x86 CPUs, the server can perform the equivalent work of roughly 10 separate machines running a typical open-source FFmpeg and x264 or x265 configuration. Specifically, the server can simultaneously transcode twenty 4Kp30 streams and up to 80 1080p30 live streams. In ABR mode, the server transcodes up to 30 five-rung H.264 encoding ladders from 1080p to 360p resolution, and up to 28 four-rung HEVC encoding ladders. For engineers delivering UHD, the server can output seven six-rung HEVC encoding ladders from 4K to 360p resolution, all while drawing less than 325 watts of total power.
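To put those capacity figures in per-device terms, here is a minimal Python sketch. It uses only the numbers quoted above; the even split across the ten T408s is an illustrative simplification, not a statement about how the server schedules work.

```python
# Capacity figures quoted in the text. The even per-card split is an
# assumption made for illustration only.
NUM_T408 = 10

one_to_one = {"4Kp30": 20, "1080p30": 80}
ladders = {
    "H.264 1080p ladder, five rungs": (30, 5),
    "HEVC 1080p ladder, four rungs": (28, 4),
    "HEVC 4K ladder, six rungs": (7, 6),
}


def streams_per_card(total_streams: int, cards: int = NUM_T408) -> float:
    """Average number of streams each transcoder carries."""
    return total_streams / cards


def total_encodes(ladder_count: int, rungs: int) -> int:
    """Number of simultaneous encoded outputs across all ladders."""
    return ladder_count * rungs


for name, count in one_to_one.items():
    print(f"{name}: {streams_per_card(count):.0f} streams per T408")
for name, (count, rungs) in ladders.items():
    print(f"{name}: {total_encodes(count, rungs)} simultaneous encodes")
```

Counting rungs rather than ladders makes the ABR numbers easier to compare with the one-to-one results, since each rung is a separate encode.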

This review begins with a technical description of the server and transcoding hardware and the options available to drive the encoders, including the resource manager that distributes jobs among the ten transcoders. Then we’ll review performance results for one-to-one streaming and then H.264 and HEVC ladder generation, and finish with a look at the server’s ultra-efficient power consumption.

NETINT Transcoding Server with 10 T408 Video Transcoders

Hardware Specs

Built on the Supermicro 1114S-WN10RT 1RU server platform, the NETINT Video Transcoding Server features ten NETINT Codensity ASIC-powered T408 video transcoders and runs Ubuntu 20.04.5 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD, with three PCIe slots and ten NVMe slots to house the ten U.2 T408 video transcoders.

You can buy the server with any of three AMD EPYC processors with 8 to 64 cores. We performed the tests for this review on the 32-core AMD EPYC 7543P CPU, which doubles to 64 threads with multithreading. The server configured with the AMD EPYC 7713P processor with 64 cores and 128 threads sells for USD $11,500, and the economical AMD EPYC 7232P processor-based server with 8 cores and 16 threads lists for USD $7,000.

Regarding the server hardware, Supermicro is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient machine when operated with NETINT video transcoders.

Let’s explore the system - NETINT Video Transcoding Server

With this as background, let’s explore the system. Once up and running in Ubuntu, you can check T408 status via the ni_rsrc_mon_logan command, which reveals the number of T408s installed and their status. Looking at Figure 1, the top table shows the decoder performance of the installed T408s, while the bottom table shows the encoding performance.

Figure 1. Tracking the operation of the T408s, decode on top, encode on the bottom.
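If you want to capture that status from a script rather than a terminal, a minimal Python sketch could look like the following. The only name taken from the text is the ni_rsrc_mon_logan command itself; the helper name and the timeout are hypothetical. The sketch passes no options and assumes the command prints its tables and exits.

```python
import shutil
import subprocess
from typing import Optional


def read_t408_status(command: str = "ni_rsrc_mon_logan",
                     timeout_s: float = 10.0) -> Optional[str]:
    """Run the resource monitor once and return its raw text output.

    Returns None if the command is not on PATH, cannot be started, or
    does not exit within timeout_s seconds.
    """
    if shutil.which(command) is None:
        return None  # tool not installed on this machine
    try:
        result = subprocess.run(
            [command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


if __name__ == "__main__":
    status = read_t408_status()
    print(status if status is not None else "ni_rsrc_mon_logan not available")
```

The output is returned unparsed on purpose, since the table layout is not described here beyond what Figure 1 shows.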

About the T408

T408s have been in service since 2019 and are being used extensively in hyper-scale platforms and cloud gaming applications. To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.

The NETINT T408 is powered by the Codensity G4 ASIC technology and is available in both PCIe and U.2 form factors. The T408s installed in the server are the U.2 form factor plugged into ten NVMe bays. The T408 supports closed caption passthrough and EIA/CEA-708 encode/decode, along with support for High Dynamic Range in HDR10 and HDR10+ formats.

“To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.” 

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

The T408 decodes and encodes H.264 and HEVC on board but performs all scaling and overlay operations via the host CPU. For one-to-one same-resolution transcoding, users can select an option called YUV Bypass that sends the video decoded by the T408 directly to the T408 encoder. This eliminates high-bandwidth trips through the bus to and from system memory, reducing the load on the bus and CPU. As you’ll see, in pure 1:1 transcode applications without overlay, CPU utilization is very low, so the T408 and server are very efficient for cloud gaming and other same-resolution, low-latency interactive applications.
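To see why skipping the trip through system memory matters, it helps to estimate how much raw video would otherwise cross the bus. The sketch below assumes 8-bit 4:2:0 frames, which is an assumption on our part since the pixel format is not stated here; the function name is illustrative.

```python
def raw_yuv420_bytes_per_second(width: int, height: int, fps: float,
                                bits_per_sample: int = 8) -> float:
    """Data rate of uncompressed 4:2:0 video.

    4:2:0 carries one luma sample per pixel plus two chroma planes at
    quarter resolution, which works out to 1.5 samples per pixel.
    """
    samples_per_frame = width * height * 1.5
    return samples_per_frame * (bits_per_sample / 8) * fps


resolutions = {"1080p30": (1920, 1080), "4Kp30": (3840, 2160)}
for label, (w, h) in resolutions.items():
    mb_per_s = raw_yuv420_bytes_per_second(w, h, 30) / 1e6
    print(f"{label}: about {mb_per_s:.0f} MB/s of raw frames per stream, per direction")
```

Under these assumptions, a single 1080p30 stream is roughly 93 MB/s of decoded frames and a 4Kp30 stream roughly 373 MB/s, in each direction, before multiplying by the number of concurrent streams.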

Figure 2. The T408 is powered by the Codensity G4 ASIC.

Testing Overview

We tested the server with FFmpeg and GStreamer. As you’ll see, in most operations, performance was similar. In some simple transcoding applications, FFmpeg pulled ahead, while in more complex encoding ladder productions, especially 4K encoding and low-latency output, GStreamer proved more performant.

Figure 3. The software architecture for controlling the server.  

Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer, which sits between the T408 NVMe interface and the FFmpeg software layer. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.
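As a rough illustration of what "minimal changes" can mean for an FFmpeg-based workflow, the Python sketch below builds the same command twice and swaps only the video encoder name. The flags used (-y, -i, -c:v, -b:v, -c:a copy) are standard FFmpeg options. The hardware encoder name is deliberately left as a placeholder because the exact codec names are not given in this article; the helper name, file names, and bitrate are hypothetical.

```python
from typing import List


def build_transcode_command(src: str, dst: str,
                            video_encoder: str = "libx264",
                            video_bitrate: str = "4M") -> List[str]:
    """Build an FFmpeg argument list for a one-to-one transcode."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", video_encoder,   # the only argument that changes below
        "-b:v", video_bitrate,
        "-c:a", "copy",          # pass audio through untouched
        dst,
    ]


software = build_transcode_command("input.mp4", "output.mp4")
# Placeholder: substitute the encoder name that your FFmpeg build lists
# for the transcoder hardware.
hardware = build_transcode_command("input.mp4", "output.mp4",
                                   video_encoder="<hardware-encoder-name>")

changed = [i for i, (a, b) in enumerate(zip(software, hardware)) if a != b]
print("Arguments that differ:", changed)
```

Hardware decoding would be selected in a similar way, with a -c:v option placed before -i so that it applies to the input rather than the output.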

To allocate jobs to the ten T408s, the T408 device driver software includes a resource management module. It tracks T408 capacity and usage load, reports inventory and status for the available resources, and enables resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available resources.

Alternatively, you can manually assign decoding and encoding tasks to different T408 devices in the command line or application and control which streams are decoded by the host CPU or a T408. With these and similar controls, you can efficiently balance the overall transcoding load between the T408s and host CPU to maximize throughput. We used auto distribution for all tests.
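The idea behind automatic distribution can be sketched with a simple least-loaded policy. To be clear, this is not NETINT's resource manager or its algorithm, which is not described here; the class, the function, and the capacity of eight load units per device (loosely modeled on eight 1080p30 streams per T408) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transcoder:
    name: str
    capacity: int   # abstract load units the device can carry
    load: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.load


def assign_auto(devices: List[Transcoder], job_cost: int) -> Optional[Transcoder]:
    """Place a job on the device with the most free capacity.

    Returns the chosen device, or None if no device can fit the job.
    """
    candidates = [d for d in devices if d.free >= job_cost]
    if not candidates:
        return None
    target = max(candidates, key=lambda d: d.free)
    target.load += job_cost
    return target


devices = [Transcoder(f"t408-{i}", capacity=8) for i in range(10)]
placed = [assign_auto(devices, job_cost=1) for _ in range(80)]
overflow = assign_auto(devices, job_cost=1)  # an 81st job has nowhere to go

print({d.name: d.load for d in devices})
print("81st job placed:", overflow is not None)
```

Manual assignment, as described above, would replace the max() selection with an explicit device choice per job.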

Testing Procedures

We tested using Server version 1.0, running FFmpeg v4.3.1, GStreamer v1.18, and T408 release 3.2.0. We tested with two use cases in mind. The first is single stream in, single stream out, with the output either at the same resolution as the incoming stream or at a lower resolution. This mode of operation is used in many interactive applications like cloud gaming, real-time gaming, and auctions where the absolute lowest latency is required. We also tested scaling performance since many interactive applications scale the input to a lower resolution.

The second use case is ABR, where a single input stream is transcoded to a full encoding ladder. In both modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

HARD QUESTIONS ON HOT TOPICS
All you need to know about NETINT Transcoding Server powered by ASICs
Watch the full conversation on YouTube: https://youtu.be/6j-dbPbmejw

One-to-One Performance

Table 1 shows transcoding results for 4K, 1080p, and 720p in latency tolerant and low-delay modes. Instances is the number of full frame rate outputs produced by the system, with CPU utilization shown for reference. These results are most relevant for cloud gaming and similar applications that input a single stream, transcode the stream at full resolution, and distribute it.

As you can see, 4K results peak at 20 streams for all codecs, though results differ by the software program used to generate the streams. The number of 1080p outputs ranges from 70 to 80, while 720p streams range from 140 to 170. As you would expect, CPU utilization is extremely low for all test cases as the T408s are shouldering the complete decoding/encoding role. This means that performance is limited by T408 throughput, not CPU, and that the 64-core CPU probably wouldn’t produce any extra streams in this use case. For pure encoding operations, the 8-core server would likely suffice, though given the minimal price differential between the 8-core and 32-core systems, opting for the higher-end model is a prudent investment.
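To make the price comparison concrete, the sketch below divides each list price quoted earlier by the peak of 80 1080p outputs. It assumes all three configurations reach the same T408-bound peak in pure transcoding, which is what the results suggest but was only measured on the 32-core system.

```python
# List prices quoted earlier in this review.
PRICES_USD = {
    "EPYC 7232P (8 cores)": 7000,
    "EPYC 7543P (32 cores)": 8900,
    "EPYC 7713P (64 cores)": 11500,
}
# Assumption: every configuration reaches the same T408-bound peak.
PEAK_1080P_STREAMS = 80


def price_per_stream(price_usd: float, streams: int = PEAK_1080P_STREAMS) -> float:
    """Server list price divided by the number of simultaneous outputs."""
    return price_usd / streams


for name, price in PRICES_USD.items():
    print(f"{name}: ${price_per_stream(price):.2f} per 1080p stream")

step_up = PRICES_USD["EPYC 7543P (32 cores)"] - PRICES_USD["EPYC 7232P (8 cores)"]
print(f"Moving from 8 to 32 cores adds ${step_up}")
```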

Latency

As for latency, in normal mode, latency averaged around 45 ms for 4K transcoding and 34 ms for 1080p and 720p transcoding. In low delay mode, this dropped to around 24 ms for 4K, 7 ms for 1080p, and 3 ms for 720p, all at 30 fps transcoding and measured with FFmpeg. For reference, at 30 fps, each frame is displayed for 33.33 ms. Even in latency-tolerant mode, latency is roughly 1.35 frames for 4K (45 ms ÷ 33.33 ms) and about a single frame for 1080p and 720p. In low delay mode, all resolutions are under a single frame of latency.
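The conversion from milliseconds to frames is simple to reproduce. The sketch below uses the rounded latency figures quoted above, so its results can differ slightly from values computed on the unrounded measurements.

```python
def latency_in_frames(latency_ms: float, fps: float = 30.0) -> float:
    """Express a latency as a multiple of the frame duration."""
    frame_duration_ms = 1000.0 / fps
    return latency_ms / frame_duration_ms


# Rounded averages from the text, in milliseconds, measured at 30 fps.
normal_mode = {"4K": 45, "1080p": 34, "720p": 34}
low_delay_mode = {"4K": 24, "1080p": 7, "720p": 3}

for mode, results in (("normal", normal_mode), ("low delay", low_delay_mode)):
    for resolution, ms in results.items():
        print(f"{mode:>9} {resolution:>5}: {ms} ms = {latency_in_frames(ms):.2f} frames")
```

The same helper works for other frame rates; at 60 fps the frame duration halves, so the same latency in milliseconds counts as twice as many frames.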

It’s worth noting that while software performance would drop significantly from H.264 to HEVC, hardware performance does not. Thus questions of codec performance for more advanced standards like HEVC do not apply when using ASICs. This is good news for engineers adopting HEVC, and those considering HEVC in the future. It means you can buy the server, comfortable in the knowledge that it will perform equally well (if not better) for HEVC encoding or transcoding.

Table 1. Full resolution transcodes with FFmpeg and GStreamer in regular and low delay modes.

Table 2 shows the performance when scaling from 4K to 1080p and from 1080p to 720p, again by the different codecs in and out. Since scaling is performed by the host CPU, CPU usage increases significantly, particularly on the higher volume 1080p to 720p output. Still, given that CPU utilization never exceeds 35%, it appears that the gating factor to system performance is T408 throughput. Again, while the 8-core system might be able to produce similar output, if your application involves scaling, the 32-core system is probably the better choice.

In these tests, latency was slightly higher than pure transcoding. In normal mode, 4K > 1080p latencies topped out at 46 ms and dropped to 39 ms for 1080p > 720p scaling, just over a single frame of latency. In low latency mode, these results dropped to 10 ms for both 4K > 1080p and 1080p > 720p. As before, these latency results are for 30 fps and were measured with FFmpeg.

Table 2. Performance while scaling from 4K to 1080p and 1080p to 720p.

The final set of tests involves transcoding to the AVC and HEVC encoding ladders shown in Table 3. These results will be most relevant to engineers distributing full encoding ladders in HLS, DASH, or CMAF containers.

Here we see the most interesting discrepancies between FFmpeg and GStreamer, particularly in low delay modes and in 4K results. In the 1080p AVC tests, FFmpeg produced 30 five-rung encoding ladders in normal mode but dropped to nine in low-delay mode. GStreamer produced 30 encoding ladders in both modes using substantially lower CPU resources. You see the same pattern in the 1080p four-rung HEVC output, where GStreamer produced more ladders than FFmpeg using lower CPU resources in both modes.

Table 3. Full encoding ladders output in the listed modes.

FFmpeg produced very poor results in 4K testing, particularly in low latency mode, and it was these results that drove the testing with GStreamer. As you can see, GStreamer produced more streams in both modes, and CPU utilization again remained very low. As with the previous results, the low CPU utilization means that the results reflect the encoding limits of the T408. For this reason, it’s unlikely that the higher-end server would produce more encoding ladders.

In terms of latency, in normal mode, latency was 59 ms for the H.264 ladder, 72 ms for the four-rung 1080p HEVC ladder, and 52 ms for the 4K HEVC ladder. These numbers dropped to 5 ms, 7 ms, and 9 ms for the respective configurations in low latency mode.

Power Consumption

Power consumption is an obvious concern for all video engineers and operations teams. To assess system power consumption, we tested using the IPMI Tool. When running completely idle, the system consumed 154 watts, while at maximum CPU, the unit averaged 400 watts with a peak of 425 watts.

We measured consumption during the three basic operations tested: pure transcoding, transcoding with scaling, and ladder creation. In each case, we tested the GStreamer scenario that produced the highest recorded CPU usage. You see the results in Table 4.

When you consider that CPU-only transcoding would yield a fraction of the outputs shown while consuming 25-30% more power, you can see that the T408 is exceptionally efficient when it comes to power consumption. The Watts/Output figure provides a useful comparison for other competitive systems, whether CPU or GPU-based.
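If you want to compute a Watts/Output figure for your own comparisons, a minimal sketch follows. The 325-watt ceiling, the 400-watt maximum-CPU average, and the 80-stream peak come from this review. The CPU-only output count of 8 is purely illustrative, loosely derived from the earlier statement that the server does the work of roughly 10 software-only machines. It is not a measured result.

```python
def watts_per_output(total_watts: float, outputs: int) -> float:
    """Power draw divided by the number of simultaneous outputs."""
    if outputs <= 0:
        raise ValueError("outputs must be positive")
    return total_watts / outputs


def annual_kwh(total_watts: float, hours_per_day: float = 24.0) -> float:
    """Energy used over a 365-day year at a constant power draw."""
    return total_watts * hours_per_day * 365 / 1000.0


# Upper bound from the text: under 325 W while producing 80 1080p30 streams.
asic_server = watts_per_output(325, 80)
# Illustrative only: 400 W at maximum CPU with a hypothetical 8 outputs.
cpu_only = watts_per_output(400, 8)

print(f"ASIC-based server: at most {asic_server:.2f} W per output")
print(f"CPU-only (hypothetical): {cpu_only:.2f} W per output")
print(f"325 W around the clock: {annual_kwh(325):.0f} kWh per year")
```

Multiplying the annual kWh figure by your local energy price turns the same numbers into an OPEX estimate.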

Table 4. Power consumption during the specified operation.

Conclusion

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower price model available for pure encoding operations, and a more powerful model for CPU-intensive operations, the NETINT server family meets a broad range of requirements.

ASICs – The Time is Now

A brief review of the history of encoding ASICs reveals why they have become the technology of choice for high-volume video streaming services and cloud-gaming platforms.

As in all markets, there will be new entrants that loudly announce products for maximum PR effect, promising delivery at some time in the future. But, to date, outside of Google’s internal YouTube ASIC project called Argos and the recent Meta (Facebook) ASIC, also for internal use only, NETINT is the only commercial company building ASIC-based transcoders for immediate delivery.

“ASICs are the future of high-volume video transcoding as NETINT, Google, and Meta have proven. NETINT is the only vendor that offers its product for sale and immediate delivery making the T408 and Quadra safe bets.”

Delaying a critical technology decision always carries risk. The risk is that you miss an opportunity or that your competitors move ahead of you. However, waiting to consider an announced and not yet shipping product means that you ALSO assume the manufacturing, technology, and supply chain risk of THAT product.

What if you delay only to find out that the announced delivery date was optimistic at best? Or, what if the vendor actually delivers, only for you to find out that their performance claims were not real? There are so many “what ifs” when you wait that it is rarely the right decision to delay when there is a viable product available.

Now let’s review the rebirth of ASICs for video encoding and see how they’ve become the technology of choice for high-volume transcoding operations.  

The Rebirth of ASICs for Video Encoding

An ASIC is an application-specific integrated circuit, purpose-built to do a small number of tasks with high efficiency. The history of video encoding ASICs can be traced back to the initial applications of digital video and the adoption of the MPEG-2 standard for satellite and cable transmission.

Most production MPEG-2 encoders were ASIC-based.

As is the case for most new codec standards, the first implementation of MPEG-2 compression was CPU-based. Given the cost of using commodity servers and software, dedicated hardware is always necessary to handle the processing requirements of high-quality video encoding cost-effectively.

This led to the development and application of video encoding ASICs, which are specialized integrated circuits designed to perform the processing tasks required for video encoding. Encoding ASICs provide the necessary processing power to handle the demands of high-quality video encoding while being more cost-effective than CPU-based solutions.

With the advent of the internet, the demand for digital video continued to increase. The rise of on-demand and streaming video services, such as YouTube and Netflix, led to a shift towards CPU-based encoding solutions. This was due in part to the fact that streaming video required a more flexible approach to encoding including implementation agility with the cloud and an ability to adjust encoding parameters based on the available bandwidth and device capabilities.

As the demand for live streaming services increased, the limitations of CPU-based encoding solutions became apparent. Live services such as cloud gaming and real-time interactive video like conferencing require processing millions of live interactive streams simultaneously. This has led to a resurgence in the use of encoding ASICs for live-streaming applications. Thus, the rebirth of ASICs is upon us, and it’s a technology trend that should not be ignored even if you are working in a more traditional entertainment streaming environment.

NETINT: Leading the Resurgence

NETINT has been at the forefront of the ASIC resurgence. In 2019, the company introduced its Codensity T408 ASIC-based transcoder. This device was designed to handle 8 simultaneous HEVC or H.264 1080p video streams, making it ideal for live-streaming applications.

The T408 was well-received by the market, and NETINT continued to innovate. In 2021, the company introduced its Quadra series. These devices can handle up to 32 simultaneous 1080p video streams, making them even more powerful than the T408, and they add support for the anticipated AV1 codec.
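As a back-of-the-envelope density check, the sketch below estimates how many devices a given number of 1080p streams would need, using the nominal per-device figures above (8 for the T408, up to 32 for Quadra). It ignores headroom, scaling load, and codec differences, so treat it as a rough sizing aid rather than a capacity plan.

```python
import math

# Nominal simultaneous 1080p streams per device, as quoted in the text.
STREAMS_PER_DEVICE = {"T408": 8, "Quadra": 32}


def devices_needed(target_streams: int, device: str) -> int:
    """Smallest number of devices whose nominal capacity covers the target."""
    if target_streams < 0:
        raise ValueError("target_streams must not be negative")
    return math.ceil(target_streams / STREAMS_PER_DEVICE[device])


for device in STREAMS_PER_DEVICE:
    print(f"1000 streams on {device}: {devices_needed(1000, device)} devices")
```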

“NETINT has racked up a number of major wins including major names such as ByteDance, Baidu, Tencent, Alibaba, Kuaishou, and a US-based global entertainment service.”

As described by Dylan Patel, editor of the Semianalysis blog, in his article Meet NETINT: The Startup Selling Datacenter VPUs To ByteDance, Baidu, Tencent, Alibaba, And More, “NETINT has racked up a number of major wins including major names such as ByteDance, Baidu, Tencent, Alibaba, Kuaishou, and a similar sized US-based global platform.”

NETINT Quadra T1U Video Processing Unit
– NETINT’s second-generation of shipping ASIC-based transcoders.

Patel also reported that using the HEVC codec, NETINT video transcoders and VPUs crushed Nvidia’s T4 GPU, which is widely assumed to be the default choice when moving to a hardware encoder for the data center. The density and power consumption that can be achieved with a video ASIC is unmatched compared to CPUs and GPUs.

Patel commented further, “The comparison using AV1 is even more powerful… NETINT is the leader in merchant video encoding ASICs.”

ASIC Advantages

ASICs are designed to perform a specific task, such as encoding video, with a high degree of efficiency and speed. CPUs and GPUs are designed to perform a wide range of general-purpose computing tasks. As evidence of this fact, today, the primary application for GPUs has nothing to do with video encoding. In fact, just 5-10% of the silicon real estate on some of the most popular GPUs in the market is dedicated to video encoding or processing. Highly compute-intensive tasks like AI inferencing are the most common workload for GPUs today.

The key advantage of ASICs for video encoding is that they are optimized for this specific task, with a much higher percentage of gates on the chip dedicated to encoding than CPUs and GPUs. ASICs can encode much faster and with higher quality than CPUs and GPUs, while using less power and generating less heat.

“ASICs can encode much faster and with higher quality than CPUs and GPUs while using less power and generating less heat.”

-Dylan Patel

Additionally, because ASICs are designed for a specific task, they can be more easily customized and optimized for specific use cases. Though some assume that ASICs are inflexible, in reality, with a properly designed ASIC, the function it’s designed for may be tuned more highly than if the function were run on a general-purpose computing platform. This can lead to even greater efficiency gains and improved performance.

The key takeaway is that ASICs are a superior choice for video encoding due to their application-specific design, which allows for faster and more efficient processing compared to general-purpose CPUs and GPUs.

Confirmation from Google and Meta

Recent industry announcements from Google and Meta confirm these conclusions. When Google announced the ASIC-based Argos VCU (Video Coding Unit) in 2021, the trade press rightfully applauded. CNET announced that “Google supercharges YouTube with a custom video chip.” Ars Technica reported that Argos brought “up to 20-33x improvements in compute efficiency compared to… software on traditional servers.” SemiAnalysis reported that Argos “Replaces 10 Million Intel CPUs.”

Google’s Argos confirms the value of encoding ASICs
(and shipped 2 years after the NETINT T408).

As described in the article “Argos dispels common myths about encoding ASICs” (bit.ly/ASIC_myths), Google’s experience highlights the benefits of ASIC-based transcoders. That is, while many streaming engineers still rely on software-based transcoding, ASIC-based transcoding offers a clear advantage in terms of CAPEX, OPEX, and environmental sustainability. The article goes on to address outdated concerns about the shortcomings of ASICs, including sub-par quality and the lack of upgradeability.

The article discusses several key findings from Google’s presentation on the Argos ASIC-based transcoder at Hot Chips 33, including:

  • Encoding time has grown by 8000% due to increased complexity from higher resolutions and frame rates. ASIC-based transcoding is necessary to keep video services running smoothly.
  • ASICs can deliver near-parity to software-based transcoding quality with properly designed hardware.
  • ASIC quality and functionality can be improved and changed long after deployment.
  • ASICs deliver unparalleled throughput and power efficiency, with Google reporting a 90% reduction in power consumption.

Though much less is known about the Meta ASIC, its announcement prompted Facebook’s Director of Video Encoding, David Ronca, to proclaim, “I propose that there are two types of companies in the video business. Those that are using Video Processing ASICs in their workflows, and those that will.”

Meta proudly announces its encoding ASIC
(3 years after NETINT’s T408 ships).

Unlike the ASICs from Google and Meta, you can actually buy ASIC-based transcoders from NETINT, and in fact tens of thousands of units are operating in some of the largest hyperscaler networks and video streaming platforms today. The fact that two of the biggest names in the tech industry are investing in ASICs for video encoding is a clear indication of the growing trend towards application-specific hardware in the video field. With the increasing demand for high-quality video streaming across a variety of devices and platforms, ASICs provide the speed, efficiency, and customization needed to meet these needs.

Avoiding Shiny New Object Syndrome

The emergence of ASICs as the best method for transcoding high volumes of live video has not gone unnoticed, so you should expect product announcements that point to “availability later this year.” When these occur around prominent trade shows, it can indicate a rushed announcement made for the show, and that the later availability may actually be “much later…”

It’s useful to remember that while waiting for a new product from a third-party supplier to become available, companies face three distinct risks: manufacturing, technology, and supply chain.

Manufacturing Risk:

One of the biggest risks associated with waiting for a new product is the manufacturing risk, which means that the product may have issues in manufacturing. That is, there is always a chance that the manufacturing process may encounter unexpected problems, causing delays and increasing costs. For example, Intel has faced manufacturing issues with its 10nm processors, which resulted in delays for its upcoming processors. As a result, Intel lost market share to competitors such as AMD and NVIDIA, who were able to release their products earlier.

Technology Risk:

Another risk associated with waiting for a new product is technology risk, or that the product may not conform to the expected specifications, leading to performance issues, security concerns, or other problems. For example, NVIDIA’s RTX 2080 Ti graphics card was highly anticipated, but upon release, many users reported issues with its performance, including crashes, artifacting, and overheating. This led to a delay in the release of the RTX 3080, as NVIDIA had to address these issues before releasing the new product. Similarly, AMD’s Radeon RX 7900 XTX graphics card has been plagued with claims of overheating.

Supply Chain Risk:

The third risk associated with waiting for a new product is supply chain risk. This means that the company may be unable to get the product manufactured and shipped on time due to issues in the supply chain. For example, AMD faced supply chain issues with its Radeon RX 6800 XT graphics card, leading to limited availability and higher prices.

The reality is that any company building and launching a cloud gaming or streaming service is assuming its own technology and market risks. Compounding that risk by waiting for a product that “might” deliver minor gains in quality or performance (but equally might not) is a highly questionable decision, particularly in a market where even minor delays in launch dates can tank a new service before it’s even off the ground.

Clearly, ASICs are the future of high-volume video transcoding; NETINT, Google, and Meta have all proven this. NETINT is the only vendor of the three that actually offers its product for sale and immediate delivery; in fast-moving markets like interactive streaming and cloud gaming, this makes NETINT’s shipping transcoders, the T408 and Quadra, the safest bets of all.

ASICs, A Preferred Technology for High Volume Transcoding

The video presented below (and the transcript) is from a talk I gave for the Streaming Video Alliance entitled The Nine Events that Shook the Codec World on March 30, 2023. During the talk, I discussed the events occurring over the previous 12-18 months that impacted codec deployment and utility.

Not surprisingly, number 1 was Google Chrome starting to play HEVC. Number 8 was Meta announcing their own ASIC-based transcoder. Given that both Google and Meta are now using ASICs in their encoding workflows, it was an important signal that ASICs were now the preferred technology for high-volume streaming.

In this excerpt from the presentation, I discuss the history of ASIC-based encoding from the MPEG-2 days of satellite and cable TV to current-day deployments in cloud gaming and other high-volume live interactive video services. Spend about 4 minutes reading the transcript or watching the video and you’ll understand why ASICs have become the preferred technology for high-volume transcoding. 

Here’s the transcript; the video is below. I will say that I heavily edited the transcript to remove the ums, ahs, and other miscues.

Historically, you can look at ASIC usage in three phases. Back when digital video was primarily deployed on satellite and cable TV in an MPEG-2 format, almost all encoders were ASIC-based. And that was because the CPUs at the time weren’t powerful enough to produce MPEG-2 in real-time.

Then starting in around 2012 or so and ending around 2018, video processing started moving to the cloud. CPUs were powerful enough to support real-time encoding or transcoding of H.264, and ASIC usage decreased significantly.

At the time, I was writing for Streaming Media Magazine. Elemental came out in 2012 or 2013, and they really hyped the fact that they had compression-centric hardware appliances for encoding. Later on, discussing the same hardware, they transitioned to what they called software-defined video processing. And that’s how they got bought by AWS. AWS now does most of its encoding with Elemental products with its own Graviton CPUs.

ASICs - the latest phase

Now the latest phase. We’re seeing a lot of high-volume interactive use like gambling, auctions, high-volume UGC and other live videos, and cloud gaming. 

Codecs are also getting more complex. As we move from H.264 to HEVC to AV1 and soon to VVC and perhaps LCEVC and EVC, GPUs and CPUs can’t keep up.

At the same time, power consumption and density are becoming critical factors. Everybody’s talking about cost of power, and power consumption in data centers, and using CPUs and GPUs is just very, very inefficient.

And this is where ASICs emerge as the best solution on a cost-per-stream, watts-per-stream, and density basis. Density means how many streams we can output from a single server.

And we saw this, “Google Replaces Millions of Intel’s CPUs With Its Own Homegrown Chips.” Those homegrown chips were encoding ASICs. And then we saw Meta. 

ASICs - significance.

These deployments legitimize encoding ASICs as the preferred technology for high-volume transcoding, implicitly and explicitly. 

I say explicitly because of the following comments made by David Ronca, who was director of video encoding at Netflix and then moved to Meta, two or three years ago. Announcing Meta’s new ASIC, he said, “There are two types of companies in the video business. Those using Video Processing ASICs in their workflows, and those that will be.”

Usage by Google and Meta gives ASICs a lot more credibility than what you get from me saying it, as obviously, NETINT makes encoding ASICs. And these deployments legitimize our technology. The technologies themselves are different. Meta made their own chips. Google made their own chips. We have our own chips. But the whole technology is legitimized by the usage of these premier services.


Watch the full presentation on YouTube:
https://youtu.be/-4sJ0We0hro

Cloud Gaming Economic Factors and Technical Considerations

Cloud Gaming Economic Factors

The gaming industry has come a long way. In 2022 it played host to an estimated 3.2 billion players worldwide, generating a total revenue of $184.4 billion, according to Newzoo.

One of the most remarkable developments in recent years has been the accessibility and affordability of gaming. Players can now enjoy gameplay on almost any device connected to the Internet via subscription services in addition to traditional PC and console games.

Game publishers have made great strides in adopting the latest graphics and hardware technologies. However, a delay in moving to cloud gaming from console-based approaches could open the door for disruption from subscription video platforms like Netflix. Just as Netflix disrupted the home entertainment rental ecosystem with its always-available subscription streaming service, it could do the same with gaming.

Cloud gaming platforms operate in a highly competitive environment with narrow margins. In the United States, popular cloud gaming platforms like Amazon Luna start at $4.99 per month. This makes choosing the right GPU for game graphics rendering and the right video encoder essential for profitability and competitiveness. Cloud gaming platforms specifying video encoders should consider four key factors: CAPEX, OPEX, quality, and not funding their competitors.

Lowest Cost Per Stream

For a cloud gaming platform, the cost per stream represents the initial investment required to set up the platform, including the cost of servers and encoders, divided across the streams it can deliver. For a managed service like a cloud gaming platform, the cost per stream drives profitability to the point of determining whether the entire business model is viable.

ASICs are the secret to making a cloud gaming service viable. With an ASIC-based encoder like the NETINT Quadra T2 VPU (Video Processing Unit), coupled with a GPU from AMD, a single server can deliver as many as 200 simultaneous 720p60 gameplay sessions. This performance beats the previous high-water mark of 48 gameplay sessions using eight GPUs in a single server chassis.

Lowest Possible OPEX Per Stream

OPEX (Operating Expense) represents the ongoing costs of running the platform, including electricity, bandwidth, and maintenance. Energy (electricity) costs are a significant part of OPEX, and they are increasing in many regions. This makes power consumption a key consideration when choosing an encoder.

NETINT VPUs are the ideal hedge against rising energy costs, ensuring the platform remains viable despite uncertain energy and economic conditions.

Compared to CPU-based software encoding, the Quadra T2 VPU consumes 10 to 20 times less energy, drawing only 40 watts while delivering the same throughput. Depending on the host server configuration, as many as ten VPUs may be installed, making each server the functional equivalent of ten to twenty high-end servers.

Rack space requirements should also be considered. With colocation prices ranging from $50 – $300 per month, the additional servers needed in a software only implementation would cost up to an extra $5,700 per month for 200 gamers (co-location costs only). While costs may be less if housed in your own facility, you still need racks, cooling, and maintenance for 20 servers compared to one.
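The colocation figure above can be reproduced with a quick calculation. This is a minimal sketch that assumes the $50 to $300 range is a monthly fee per server and that the software-only setup needs 20 servers against one ASIC-equipped server, as the paragraph implies; the function name is illustrative only.

```python
def extra_colocation_cost(software_servers, asic_servers, monthly_fee_per_server):
    """Monthly colocation cost of the additional servers a software-only setup needs."""
    extra_servers = software_servers - asic_servers
    return extra_servers * monthly_fee_per_server


# 20 CPU-only servers vs. one ASIC-equipped server for 200 gamers
low = extra_colocation_cost(20, 1, 50)    # low end of the colocation range
high = extra_colocation_cost(20, 1, 300)  # high end of the colocation range
print(f"Extra colocation cost: ${low:,} to ${high:,} per month")
```

At the high end of the range, the 19 additional servers account for the $5,700 per month cited above.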

With subscriber rates starting at $4.99 per month and in some cases lower, margins are razor thin making high-density transcoding and efficient power usage essential to profitability. This should put ASIC-based transcoders on the short list of all cloud gaming services.

Quality Considerations

A long-lingering misconception about ASICs is that their quality cannot match that of software encoders. Obviously, video quality depends upon the configuration options and the operating mode of the encoder. Internal tests show that the HEVC output quality of NETINT VPUs is quite competitive with software and other hardware transcoders, especially when run in their lowest latency state. See Table 1.

For example, the Quadra VPU produced better output quality than both NVENC, the popular encoder available on NVIDIA’s more recent GPUs, and x265 at presets up to medium. The x265 medium preset produces quality close to that of VOD encoding, but it is not commonly used for live streaming because of the computing power it requires.

Most live streaming engineers use the x265 veryfast or superfast presets. When compared to the x265 superfast preset, the Quadra VPU produced the same quality at a 25% lower bitrate, which translates to significant savings.

Table 1. BD-Rate PSNR quality comparisons between Quadra, x265, and the NVIDIA RTX 3090 encoder in low latency settings.

At the extreme right, you see that Quadra was able to match the quality of the NVIDIA RTX 3090 HEVC encoder at up to an 11.57% bitrate reduction. ASICs producing quality that rivals software encoding is not unusual. As discussed here, Google has achieved near-software quality with their ASIC-based ARGOS transcoder as well. This shows that you do not need to compromise on quality to achieve the density and efficiency benefits of ASIC-based transcoding.

HARD QUESTIONS ON HOT TOPICS
Cloud Gaming Economic Factors and Technical Considerations
Watch the full conversation on YouTube: https://youtu.be/PM5Ts9Ko7DA

Hidden Costs of GPU

Evaluating the cost of hardware is relatively straightforward if the primary factors are easily understood and defined. However, with GPUs, there are hidden costs that are not always recognized or acknowledged. For example, as tech platforms expand their offerings, cloud gaming platforms could find that they are funding potential competitors.

As an illustration, the US Federal Trade Commission is attempting to block Microsoft’s acquisition of Activision, partly because the Azure cloud platform gives Microsoft a cost advantage over cloud gaming platforms without similar infrastructure.

Presumably, Amazon, with AWS, has the same advantage. Similarly, this article describes the cost advantage that NVIDIA derives from other services that buy its GPUs for game rendering.

Another hidden cost can be found in the complexity of the procurement process for GPUs. Due to the supply chain issues triggered by COVID, and the incredible demand spike for GPUs, simply having the opportunity to buy the amount needed was far from certain. Still, your negotiation strength could have significant sway on the price or delivery schedule that you received. Put simply, for anyone needing to buy GPUs in the quantities needed by a cloud gaming platform, it cannot be assumed all that is needed is a P.O.

Finally, there’s a significant loss of negotiating leverage once a gaming platform chooses a GPU vendor, and this is particularly true when the GPU performs double duty in rendering frames and encoding them for streaming. Once a platform chooses a GPU vendor, their technical architecture is essentially locked with that selection, so they can’t switch to another GPU vendor without significant development time and cost. This puts the platform at a disadvantage when negotiating with the selected vendor as they have limited bargaining power.

Often, GPU vendors abuse this leverage by charging expensive license/API costs or refusing to make improvements for their customers. In other cases, this lack of bargaining power could lock platforms into using a GPU-based encoder that delivers uncompetitive quality as compared to third-party options. Some GPU vendors may even refuse to undertake enhancements that would enable the use of third-party transcoders, even if this would improve throughput and quality and reduce OPEX for the game platform.

By implementing a dedicated transcoding unit separate from the GPU, a cloud gaming platform can decouple its design into standalone GPU and VPU modules. This makes it simpler for the platform to switch vendors, providing significant leverage in negotiations with all vendors.

The Cloud Gaming Opportunity

According to Newzoo, cloud gaming is one of the fastest-growing gaming industry segments, with a CAGR of 50.9% from 2020 to 2023, accounting for 49% of the global gaming market. Cloud gaming is a benefit to players in all regions and it opens up new entertainment experiences for many people without access to expensive consoles or who cannot afford the newest games.

For others, access to high-quality gaming is a way to extend the entertainment experience outside of the home. Also, it offers a way for mobile gamers to access games that they may be unable to play on their mobile devices due to hardware limitations.

The business and market outlook for cloud gaming makes it a growth driver that should not be ignored. With NETINT VPUs, you can profitably deliver a premium experience. Reach out, and we’ll happily show you how to move forward on this exciting trend.

The Components That Make Cloud Gaming Production Affordable (or Not)

CPUs, GPUs, and ASICs - major cost elements of cloud gaming platforms with commercial examples of hardware combinations and stream output. Normalizing comparisons on a single form factor is essential.

If you’ve made it past the title, you know that cloud gaming platforms operate in a highly competitive environment with narrow margins. This makes the purchase and operating costs per stream critical elements to system success.

This brief article will lay out the major cost elements of cloud gaming platforms and cite some commercial examples of hardware combinations and stream output. We’ve created a table you can use to collect the critical data points while looking at potential solutions around the NAB show, or if you’re simply browsing around the web. If you are at NAB, come by and see us at booth W1672 to discuss the NETINT solution shown in the table.

At their cores, cloud gaming production systems perform three functions: game logic, graphics rendering, and video encoding (Figure 1). Most systems process the game logic on a CPU and the graphics on a GPU. Encoding can be performed via the host CPU, the GPU, or a separate transcoder like NETINT’s ASIC-based Quadra, which outputs H.264, HEVC, and AV1.

Figure 1. The three core functions of a cloud gaming system.

Given the different components and configurations, identifying the cost per stream is critical to comparison analysis. Obviously, a $25,000 system that outputs 200 720p60 streams (cost/stream = $125) is more affordable than a $10,000 system that outputs 25 720p60 streams (cost/stream = $400).
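The comparison above is a single division, sketched here with the two example systems from the paragraph. The function name is illustrative only.

```python
def cost_per_stream(system_price, stream_count):
    """Purchase price divided by the number of simultaneous streams delivered."""
    if stream_count <= 0:
        raise ValueError("stream_count must be positive")
    return system_price / stream_count


system_a = cost_per_stream(25_000, 200)  # $25,000 system, 200 720p60 streams
system_b = cost_per_stream(10_000, 25)   # $10,000 system, 25 720p60 streams
print(f"System A: ${system_a:.0f} per stream")
print(f"System B: ${system_b:.0f} per stream")
```

The cheaper system ends up costing more than three times as much per stream, which is why the sticker price alone is a poor basis for comparison.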

Power consumption per stream is also a major cost contributor. Assuming a five-year expected life, even a small difference between two systems will be multiplied by 60 months of power bills and will significantly impact TCO, not to mention the environment or regulatory considerations.

Finally, normalizing comparisons on a single form factor, like a 1RU or 2RU server, is also essential. Beyond the power cost of a system, rack space costs money, whether in colocation fees or your own in-house costs. The other side of this coin is system maintenance; it costs less to maintain five servers that deliver 1,000 streams than 20 servers that deliver the same output.

HARD QUESTIONS ON HOT TOPICS
Get the cost per stream with the proper mix of GPU, CPU, and ASIC-based VPU
Watch the full conversation on YouTube: https://youtu.be/xaSRL847eIs

Comparing Systems

Enough talk; let’s compare some systems. Let’s agree up front that any comparison is unavoidably subjective, with results changing with the games tested and game configurations. You’ll almost certainly complete your own tests before buying, and at that point, you can ensure an apples-to-apples comparison. Use this information and the data you collect on your own to form a high-level impression of the value proposition delivered by each hardware configuration.

Table 1 details three systems: a reference design that is in mass production from NETINT, one from an established mobile cloud gaming platform, and one from Supermicro based on an Ampere Arm processor and four NVIDIA A16 GPUs.

Table 1. System configurations.

To compute the pricing information for the systems shown in Table 2, we priced each component on the web and grabbed maximum power consumption data from each manufacturer. Pricing and power consumption shown are for the components listed, not the entire system. The number of 720p outputs is from each manufacturer, including NETINT.

Table 2. Component cost and power usage, total and on a cost-per-stream basis.

From there, it’s simple math; divide the cost and total watts by the 720p stream count to determine the cost per stream and watts per stream. Again, this is only for the core components identified, but the computer and other components should be relatively consistent irrespective of the CPU, GPU, and VPU that you use. 
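As a rough sketch of that math, the snippet below sums component prices and maximum wattage and divides both by the stream count. The component prices, wattages, quantities, and stream count are placeholders invented for illustration, not values from Table 2, and the `Component` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    unit_price: float  # USD per unit
    max_watts: float   # maximum power draw per unit
    quantity: int = 1


def per_stream_metrics(components, streams_720p):
    """Return (cost per stream, watts per stream) for the listed core components."""
    total_cost = sum(c.unit_price * c.quantity for c in components)
    total_watts = sum(c.max_watts * c.quantity for c in components)
    return total_cost / streams_720p, total_watts / streams_720p


# Placeholder numbers only; substitute the data you collect for each system.
example_system = [
    Component("CPU", 1_000.0, 200.0),
    Component("GPU", 2_000.0, 300.0, quantity=2),
    Component("VPU", 1_500.0, 40.0, quantity=2),
]
cost, watts = per_stream_metrics(example_system, streams_720p=100)
print(f"${cost:.2f} per stream, {watts:.1f} watts per stream")
```

Running the same function over every candidate system keeps the comparison on one consistent basis.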

ASIC-based transcoders plus GPUs are the most cost-effective configuration to deliver a profitable and high-quality game streaming experience.
We are happy to share our data and sources so you can confirm independently.

As you walk the NAB show floor, or check proposed solutions on the web, beware of bespoke architectures built on a single vendor’s proprietary solutions (e.g. all Intel, all NVIDIA, all AMD). Each company has demos that showcase its technology, but not operational competitiveness. None of these systems can meet the OPEX or CAPEX needed for a competitive and profitable cloud gaming solution.

We challenge you to get your own numbers and compare them!
Download the printable TABLE HERE

ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses

As the title suggests, this post compares CAPEX and OPEX costs for live streaming using ASIC-based transcoding and CPU-based transcoding. The bottom line?

NETINT Transcoding Server with 10 T408 Video Transcoders
Figure 1. The 1 RU Deep Edge Appliance with ten NETINT T408 U.2 transcoders.

Jet-Stream is a global provider of live-streaming services, platforms, and products. One such product is Jet-Stream’s Deep Edge OTT server, an ultra-dense scalable OTT streaming transcoder, transmuxer, and edge cache that incorporates ten NETINT T408 transcoders. In this article, we’ll briefly review how Deep Edge compared financially to a competitive product that provided similar functionality but used CPU-based transcoding.

About Deep Edge

Jet-Stream Deep Edge is an OTT edge transcoder and cache server solution for telcos, cloud operators, compounds, and enterprises. Each Deep Edge appliance converts up to 80 1080p30 television channels to OTT HLS and DASH video streams, with a built-in cache enabling delivery to thousands of viewers without additional caches or CDNs.

Each Deep Edge appliance can run individually, or you can group multiple systems into a cluster, automatically load-balancing input channels and viewers per site without the need for human operation. You can operate and monitor Edge appliances and clusters from a cloud interface for easy centralized control and maintenance. In the case of a backlink outage, the edge will autonomously keep working.

Figure 2. Deep Edge operating schematic.

Optionally, producers can stream access logs in real-time to the Jet-Stream cloud service. The Jet-Stream Cloud presents the resulting analytics in a user-friendly dashboard so producers can track data points like the most popular channels, average viewing time, devices, and geographies in real-time, per day, week, month, and year, per site, and for all the sites.

Deep Edge appliances can also act as a local edge for both the internal OTT channels and Jet-Stream Cloud’s live streaming and VOD streaming Cloud and CDN services. Each Deep Edge appliance or cluster can be linked to an IP-address, IP-range, AS-number, country, or continent, so local requests from a cell tower, mobile network, compound, football stadium, ISP, city, or country to Jet-Stream Cloud are directed to the local edge cache. Each Deep Edge site can be added to a dynamic mix of multiple backup global CDNs, to tune scale, availability, and performance and manage costs.

Under the Hood

Each Deep Edge appliance incorporates ten NETINT T408 transcoders into a 1RU form factor driven by a 32-core CPU with 128 GB of RAM. This ASIC-based acceleration is over 20x more efficient than encoding software on CPUs, decreasing operational cost and CO2 footprint by an order of magnitude. For example, at full load, the Deep Edge appliance draws under 240 watts.

The software stack on each appliance incorporates a Kubernetes-based container architecture designed for production workloads in unattended, resource-constrained, remote locations. The architecture enables automated deployment, scaling, recovery, and orchestration to provide autonomous operation and reduced operational load and costs.

The integrated Jet-Stream Maelstrom transcoding software provides complete flexibility in encoding tuning, enabling multi-bit-rate transcoding in various profiles per individual channel.

Each channel is transcoded and transmuxed in an isolated container, and in the event of a crash, affected processes are restarted instantly and automatically.

HARD QUESTIONS ON HOT TOPICS
 ASIC vs. CPU-Based Transcoding: A Comparison of Capital and Operating Expenses
Watch the full conversation on YouTube: https://youtu.be/pXcBXDE6Xnk

Deep Edge Proposal

Recently, Jet-Stream submitted a bid to a company with a contract to provide local streaming services to multiple compounds in the Middle East. The prospective customer was fully transparent and shared the costs associated with a CPU-based solution against which Deep Edge competed.

In producing these projections, Jet-Stream incorporated an electricity cost of €0.20 per kilowatt-hour and assumed that the software-based server would draw 400 watts while Deep Edge would draw 220 watts. These numbers are consistent with lab testing we’ve performed at NETINT; each T408 draws only 7 watts of power, and because they transcode the incoming signal onboard, host CPU utilization is typically at a minimum.
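To give a feel for the energy portion of these projections, here is a minimal sketch of a five-year electricity cost per server. It treats the €0.20 rate as a price per kilowatt-hour and assumes continuous operation at the stated power draw; Jet-Stream’s actual model may include other factors, and the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumes 24/7 operation; leap days ignored


def energy_cost_eur(watts, years, eur_per_kwh, servers=1):
    """Electricity cost for running `servers` machines continuously at `watts` each."""
    kwh = watts / 1000 * HOURS_PER_YEAR * years * servers
    return kwh * eur_per_kwh


cpu_server = energy_cost_eur(400, years=5, eur_per_kwh=0.20)
deep_edge = energy_cost_eur(220, years=5, eur_per_kwh=0.20)
print(f"CPU-based server: EUR {cpu_server:,.2f} over five years")
print(f"Deep Edge:        EUR {deep_edge:,.2f} over five years")
```

Under these assumptions, the per-server gap is modest; the larger difference comes from the number of CPU-based servers needed to match a single appliance, which the `servers` parameter lets you model.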

Jet-Stream produced three sets of comparisons: a single appliance, a two-appliance cluster, and ten sites with two-appliance clusters. Here are the comparisons. Note that the Deep Edge cost includes all software necessary to deliver the functionality detailed above for standard features. In contrast, the CPU-based server cost is hardware-only and doesn’t include the licensing cost of software needed to match this functionality.

Single Appliance

A single Deep Edge appliance can produce 80 streams, which would require five separate servers for CPU-based transcoding. Considering both CAPEX and OPEX, the five-year savings was €166,800.

Table 1. CAPEX/OPEX savings for a single Deep Edge appliance over CPU-based transcoding.

A Two-Appliance Cluster

Two Deep Edge appliances can produce 160 streams, which would require nine CPU-based encoding servers to produce. Considering both CAPEX and OPEX, the five-year savings for this scenario was €293,071.

Table 2. CAPEX/OPEX savings for a dual-appliance Deep Edge cluster over CPU-based transcoding.

Ten Sites with Two-Appliance Clusters

Supporting ten sites with 180 channels would require 20 Deep Edge appliances and 90 servers for CPU-based encoding. Over five years, the CPU-based option would cost over €2.9 million more than Deep Edge.

Table 3. CAPEX/OPEX savings for ten dual-appliance Deep Edge clusters over CPU-based transcoding.

While these numbers border on unbelievable, they are actually quite similar to what we computed in this comparison, How to Slash CAPEX, OPEX, and Carbon Emissions with T408 Video Transcoder, which compared T408-based servers to CPU-only on-premises and AWS instances.

The bottom line is that if you’re transcoding with CPU-based software, you’re paying way too much for both CAPEX and OPEX, and your carbon footprint is unnecessarily high. If you’d like to explore how many T408s you would need to assume your current transcoding workload, and how long it would take to recoup your costs via lower energy costs, check out our calculators here.

Voices of Video: Building Localized OTT Networks
Watch the full conversation on YouTube: https://youtu.be/xP1U2DGzKRo

The Evolution of Video Codecs: AV1 and HEVC Take the Lead

HEVC and AV1 - The Evolution of Codecs

For years, H.264 has remained dominant because it plays everywhere; but as videos grow larger, faster, and deeper in color, the cost of distributing H.264 has become too high.

AV1 has leap-frogged VP9 in the so-called “open-source” horse race, while HEVC is the clear successor to H.264 in standards-based codecs, at least for the next 3-4 years as VVC slowly matures.

AV1 and HEVC have had their well-known Achilles heels, AV1 in the living room and on Apple devices, and HEVC in browsers. The last few months have seen critical movement and new data in all these platforms that will fundamentally change how we use them.

AV1 in the Living Room

HEVC has dominated Smart TVs and OTT dongles since 4K and High Dynamic Range (HDR) became must-haves for premium content producers. However, in late 2021, Netflix began distributing AV1 video to this market, and device support has burgeoned since then. As Bitmovin reported in this blog post, AV1 runs on smart TVs running Android TV and Google TV operating systems, including Sony Google TV models from 2021 and forward and many Amazon Fire TV models as far back as 2020. Starting in late 2020, most Samsung TVs have hardware AV1 decoders, with LG extending support to some TVs.

Figure 1. Netflix started the migration of living room content towards AV1. 

Regarding OTT dongles, the Amazon Fire TV Stick 4K Max, the Roku Streaming Stick 4K, and other Roku models support AV1 playback, as do the PlayStation 4 Pro and Xbox One.

The one caveat is that AV1 support for dynamic metadata is nascent. The HDR10+ AV1 Metadata Handling Specification was finalized on December 7, 2022, so it will take a while for encoders and decoders to fully and reliably support it. Dolby Vision still supports only H.264 and HEVC, and because Google’s Project Caviar is proposing a royalty-free alternative to Dolby Vision, it may never support AV1.

To be clear, YouTube supports HDR with AV1, so it’s technically feasible today. But standards like the HDR10+ AV1 Metadata Handling Specification promote the broad playback compatibility that most publishers need before they will support it. For example, when Netflix first started streaming AV1 video to smart TV sets in 2021, it was Standard Dynamic Range only, and that’s still the case. Besides, if you’re already encoding your video to HEVC for living room delivery in HDR, it may not make economic sense to reencode to AV1 for slightly more efficient delivery to a market that you’re already serving.

HARD QUESTIONS ON HOT TOPICS – EVOLUTION OF VIDEO CODECS – WHEN IS AV1 READY?
Watch the full conversation on YouTube: https://youtu.be/wbMojTl_cpA

HEVC Plays in Chrome

Browser playback has been a traditional strength of AV1 since it first launched. Not surprising, given that all major browser developers are members of the Alliance for Open Media. For the same reason, it’s also no surprise that browsers like Chrome and Firefox never supported HEVC, even when hardware or software on the computer or device did support HEVC playback.

This changed in September 2022, when Google “fixed a bug” and enabled HEVC support when hardware HEVC playback was available on the system. As the story goes, the lack of HEVC playback was reported by Bitmovin as a bug in 2015. Some six years later, on September 19, 2022, Google responded, “Enabled by default on all releases.” Within weeks, browser support for HEVC, as reported in CanIUse, jumped from the low 20s to 86.49%, well ahead of AV1 at around 73%.

This could be a massive benefit to streaming sites that deliver primarily to computers and mobile devices and have avoided HEVC because of the lack of Chrome playback. In a straightforward bugfix, Google enabled HEVC playback on all supported platforms with existing decoders, including Windows, Mac, iOS, and Android.

A caveat exists here, as well, specifically that “HEVC with Widevine DRM is not supported at this point.” This obviously limits the benefit of Chrome support for premium content producers.

Apple May Start Supporting AV1

Apple has a checkered history with the Alliance for Open Media. When Apple joined in 2018, they big footed their way in as a “founding member,” even though the organization was formed over two years earlier. Despite this aggressive posturing, Apple has never supported AV1 playback in its operating systems or browsers and was a massive supporter of HEVC.

Figure 2. Apple is now supporting AV1 playback in Safari 16.4.

At least respecting AV1, this may be about to change. With Safari 16.4, Apple added AV1 support in the media capabilities API and WebRTC support for hardware AV1 decoding on supported device configurations. It turns out that the software AV1 decoder dav1d is already included in the updated WebKit engine used in Apple Safari Technology Preview 161.

Apple is dipping its toes in the AV1 waters; this could mean that it intends to support AV1 playback via software in the short term or that it may unlock previously unannounced hardware playback capabilities in existing CPUs. It could also mean hardware AV1 support will be added in future CPUs. Whatever the strategy, it’s probably safe to assume that Safari will play AV1 at some point in the future, hopefully sooner than later.

That said, the major data point that recently surfaced was a ScientiaMobile report indicating that while 86.60% of smartphones had HEVC hardware support, only 2.52% had AV1 support. Since hardware support guarantees full frame rate playback at minimal power draw, HEVC will likely remain the format of choice for mobile devices for the next 12-24 months.

Figure 3. HEVC currently enjoys much greater hardware support in mobile devices than AV1.

Whether you decide to stay with H.264 for your live transcodes, or transition to AV1 or HEVC, NETINT has you covered. Our G4-based line of products (T408, T432) transcode to H.264 and HEVC, while the G5-based Quadra line (T1, T1A, T2A) support H.264, HEVC, and AV1. All products deliver competitive video quality, market-leading density, a highly affordable cost per stream, and the lowest possible power consumption and OPEX.

Region of Interest Encoding for Cloud Gaming: A Survey of Approaches

As cloud gaming use cases expand, we are studying even more ways to deliver high-quality video with low latency and efficient bitrates.

Region of Interest (ROI) encoding is one way to enhance video quality while reducing bandwidth. This post will discuss three ROI-based techniques recently proposed in research papers that may soon be adopted in cloud gaming encoding workflows.

This blog is meant to be informative. If I missed any important papers or methods, feel free to contact me HERE.

Region of Interest (ROI) Encoding

ROI encoding allows encoders to prioritize frame quality in the critical regions most closely scrutinized by the viewer and is an established technique for improving viewer Quality of Experience. For example, NETINT’s Quadra video processing unit (VPU) uses artificial intelligence (AI) to detect faces in videos and then ROI encoding to improve facial quality. The NETINT T408/T432 also supports ROI encoding, but the specific regions must be manually defined in the command string.

ROI encoding is particularly relevant to cloud gaming, where viewers prefer fast-moving action, high resolutions, and high frame rates, but also want to play at low bitrates on wireless or cellular networks with ultra-low latency. These factors make cloud gaming a challenging compression environment.

Whether for real-world videos or cloud gaming, the challenge with ROI encoding lies in identifying the most relevant regions of interest. As you’ll see, the three papers described below all take a markedly different approach.

In the paper “Content-aware Video Encoding for Cloud Gaming” (2019), researchers from Simon Fraser University and Advanced Micro Devices propose using metadata provided by the game developer to identify the crucial regions. As the article states,

“Identifying relevant blocks is straightforward for game developers, because they know the logic and semantics of the game. Thus, they can expose this information as metadata with the game that can be accessed via APIs... Using this information, one or more regions of interest (ROIs) are defined as bounding boxes containing objects of importance to the task being achieved by the player.”

The authors label their proposed method CAVE, for Content-Aware Video Encoding. Architecturally, CAVE sits between the game process and the encoder, as shown in Figure 1. Then, “CAVE uses information about the game’s ROIs and computes various encoding parameters to optimize the quality. It then passes these parameters to the Video Encoder, which produces the encoded frames sent to the client.”

Figure 1. The CAVE encoding method is implemented between the game process and encoder.

The results were promising. The technique “achieves quality gains in ROIs that can be translated to bitrate savings between 21% and 46% against the baseline HEVC encoder and between 12% and 89% against the closest work in the literature.”

Additionally, the processing overhead introduced by CAVE was less than 1.21%, which the authors felt would be reduced even further with parallelization, though implementing the process in silicon could completely eliminate the additional CPU loading.
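
To make the bounding-box idea concrete, here is a minimal Python sketch that maps ROI rectangles onto a grid of encoder blocks and lowers the QP inside them. It is not CAVE itself: the paper computes its encoding parameters in its own way, and the 64-pixel block size, the fixed offset values, the `(x, y, w, h)` box format, and the function name are all assumptions made for illustration.

```python
def roi_boxes_to_qp_offsets(width, height, boxes, block=64,
                            roi_offset=-6, background_offset=2):
    """Return a 2D list of per-block QP offsets for one frame.

    boxes: iterable of (x, y, w, h) rectangles in pixels. This is a
    hypothetical format for the ROI metadata exposed by the game.
    Negative offsets mean finer quantization (higher quality).
    """
    cols = (width + block - 1) // block   # ceiling division
    rows = (height + block - 1) // block
    grid = [[background_offset] * cols for _ in range(rows)]

    for x, y, w, h in boxes:
        # Clamp the box to the frame and skip boxes that end up empty.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width), min(y + h, height)
        if x1 <= x0 or y1 <= y0:
            continue
        # Mark every block that the box touches as part of the ROI.
        for r in range(y0 // block, (y1 - 1) // block + 1):
            for c in range(x0 // block, (x1 - 1) // block + 1):
                grid[r][c] = roi_offset
    return grid
```

A real encoder would take such a map through its own ROI or QP-map interface; how the map is passed in depends on the encoder and is outside this sketch.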

ROI from Gaze Tracking

Another ROI-based approach was studied in the paper “Cloud Gaming With Foveated Video Encoding” by researchers from Aalto University in Finland and Politecnico di Torino in Italy. In this study, the region of interest was detected by a Tobii 4C Eye Tracker. This data was sent to the server, which used it to identify the ROI and adjust the Quantization Parameter (QP) values for the affected blocks accordingly.

Figure 2. Using region of interest data from a gaze tracker.

The term ‘foveation’ in the paper’s title refers to a “non-uniform sampling response to visual stimuli” that’s inherent to the human visual system. By incorporating foveation, the encoder can allocate QP values more effectively to the regions of interest and their surrounding areas, and blend them seamlessly with other regions within the frame.

As stated in the paper, to compute the quality of each macroblock, “the gaze location is translated to a macroblock based coordinate system. The macroblock corresponding to the current gaze location is assigned the lowest QO, while the QO of macroblocks away from the gaze location increases progressively with distance from the gaze macroblock.” 
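
The quoted description can be sketched in a few lines of Python. The quote only says that the QO grows progressively with distance from the gaze macroblock; the Gaussian falloff, the maximum offset of 10, the width parameter `sigma_mb`, and the function name below are assumptions chosen for illustration rather than values taken from the paper.

```python
import math


def foveated_offsets(mb_cols, mb_rows, gaze_x, gaze_y,
                     mb_size=16, max_offset=10.0, sigma_mb=8.0):
    """Return a per-macroblock grid of quantization offsets (QO).

    gaze_x, gaze_y: gaze position in pixels. The macroblock under the
    gaze gets offset 0 (best quality); offsets rise smoothly toward
    max_offset as the distance from that macroblock increases.
    """
    # Translate the pixel position to macroblock coordinates, clamped
    # so that a gaze point outside the frame still maps to a valid block.
    gx = min(max(gaze_x // mb_size, 0), mb_cols - 1)
    gy = min(max(gaze_y // mb_size, 0), mb_rows - 1)

    offsets = []
    for row in range(mb_rows):
        line = []
        for col in range(mb_cols):
            dist_sq = (col - gx) ** 2 + (row - gy) ** 2
            falloff = math.exp(-dist_sq / (2.0 * sigma_mb ** 2))
            line.append(max_offset * (1.0 - falloff))
        offsets.append(line)
    return offsets
```

For a 1280x720 frame (80 by 45 macroblocks of 16 pixels), `foveated_offsets(80, 45, 640, 360)` gives an offset of zero at the centre of the frame and progressively larger offsets toward the edges.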

The researchers performed extensive testing and analysis and ultimately concluded that “[o]ur evaluation results suggest that its potential to reduce bandwidth consumption is significant, as expected.” Regarding latency, the paper reports that “user study establishes the feasibility of FVE for FPS games, which are the most demanding latency wise.”

Obviously, any encoding solution tied to a gaze tracker has limited applicability, but the authors saw a much broader horizon ahead: “[w]e intend to attempt eliminating the need for specialized hardware for eye tracking by employing web cameras for the purpose. Using web cameras, which are ubiquitous in modern consumer computing devices like netbooks and mobile devices, would enable widespread adoption of foveated streaming for cloud gaming.”

Detecting ROI from Machine Learning

Finally, “DeepGame: Efficient Video Encoding for Cloud Gaming” was published in October 2021 by researchers from Simon Fraser University and Advanced Micro Devices, including three authors of the first paper mentioned above.

As detailed in the introduction, the authors propose “a new video encoding pipeline, called DeepGame, for cloud gaming to deliver high-quality game streams without codec or game modifications…DeepGame takes a learning-based approach to understand the player contextual interest within the game, predict the regions of interest (ROIs) across frames, and allocate bits to different regions based on their importance.”

At a high level, DeepGame is implemented in three stages:

  1. Scene analysis to gather data
  2. ROI prediction, and
  3. Encoding parameters calculation

Regarding the last stage, these encoding parameters are passed to the encoder via “a relatively straightforward set of APIs” so it’s not necessary to modify the encoder source code.

The authors describe their learning-based approach as follows: “DeepGame learns the player’s contextual interest in the game and the temporal correlation of that interest using a spatiotemporal deep neural network.” The schema for this operation is shown in Figure 3.

In essence, this learning-based approach means that some game-specific training is required beforehand and some processing during gameplay to identify ROIs in real time. The obvious questions are: how much latency does this process add, and how much bandwidth does the approach save?

Figure 3. DeepGame’s neural network-based schema for detecting region of interest.

Regarding latency, model training is performed offline and only once per game (and for major upgrades). Running the inference on the model is performed during each gaming session. During their testing, the researchers ran the inference model on every third frame and concluded that “ROI prediction time will not add any processing delays to the pipeline.”
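
Running inference on every third frame can be pictured with the short Python sketch below. It assumes that the most recent prediction is simply reused for the frames in between, which is a guess about how the gap is handled and not something the paper excerpt states. `predict_roi` is a hypothetical stand-in for the trained model.

```python
def rois_for_stream(frames, predict_roi, interval=3):
    """Yield (frame, roi) pairs, running the predictor on every
    `interval`-th frame and reusing its result for the frames between.

    predict_roi: any callable that takes a frame and returns an ROI
    description (a placeholder for the trained model).
    """
    last_roi = None
    for index, frame in enumerate(frames):
        if index % interval == 0:
            last_roi = predict_roi(frame)   # the expensive step
        yield frame, last_roi
```

With `interval=3`, the predictor runs on frames 0, 3, 6, and so on, so the model has three frame periods to finish each prediction before the next one is needed.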

The researchers trained and tested four games: FIFA 20 (soccer), CS:GO (a first-person shooter), NBA Live 19 (basketball), and NHL 19 (hockey), and performed multiple analyses. First, they compared their predicted ROIs to actual ROIs detected using a Gazepoint GP3 eye-tracking device. Here, accuracy scores ranged from a high of 85.95% for FIFA 20 to a low of 73.96% for NHL 19.

Then, the researchers compared the quality in the ROI regions with an unidentified “state-of-the-art H.265 video encoder” using SSIM and PSNR. BD-Rate savings for SSIM ranged from 33.01% to 20.80%, and from 35.06% to 19.11% for PSNR. They also compared overall frame quality using VMAF, which yielded nearly identical scores, indicating that DeepGame didn’t degrade overall quality despite the bandwidth savings and the improved quality within regions of interest.

The authors also performed a subjective study with the FIFA 20 and CS:GO games using x264 with and without DeepGame inputs. The mean opinion scores incorporated the entire game experience, including lags, distortions, and artifacts. In these tests, DeepGame improved the Mean Opinion Scores by up to 33% over the base encoder.

HARD QUESTIONS ON HOT TOPICS – CLOUD OR ON PREMISES, HOW TO DO THE MATH?
Watch the full conversation on YouTube: https://youtu.be/KIaYFS54QNY

Summary

All approaches have their pros and cons. The CAVE approach should be most accurate in identifying ROIs but requires metadata from game developers. The gaze tracker approach can work with any game but requires hardware that many gamers don’t have and is unproven for webcams. Meanwhile, DeepGame can work with any game but requires per-game training beforehand and running inference on the model during gameplay.

All appear to be viable approaches for improving QoE and reducing bandwidth and latency while working with existing codecs and encoders. Unfortunately, none of the three proposals described seems to have progressed towards implementation. This makes ROI encoding for cloud gaming a technology worth watching, if not yet available for implementation.

All You Need to Know About the NETINT Product Line

This article will introduce you to the NETINT product line and Codensity ASIC generations. We will focus primarily on the hardware differences, since all products share a common software architecture and feature set, which are briefly described at the end of the article.

Codensity G4-Powered Video Transcoder Products

The Codensity G4 was the first encoding ASIC developed by NETINT. There are two G4-based transcoders: the T408 (Figure 1), which is available in a U.2 form factor and as an add-in card, and the T432 (Figure 2), which is available as an add-in card. The T408 contains a single G4 ASIC and draws 7 watts under full load, while the T432 contains four G4 ASICs and draws 27 watts.

The T408 costs $400 in low volumes, while the T432 costs $1,500. The T432 delivers 4x the raw performance of the T408.

Figure 1. The NETINT T408 is powered by a single Codensity G4 ASIC.

T408 and T432 decode and encode H.264 and HEVC on the device but perform all scaling, overlay, and deinterlacing on the host CPU.

If you’re buying your own host, the selected CPU should reflect the extent of processing that it needs to perform and the overhead requirements of the media processing framework that is running the transcode function. 

When transcoding inputs without scaling, as in a cloud gaming or conferencing application, a modest CPU can suffice. If you are creating standard encoding ladders, deinterlacing multiple streams, or frequently scaling incoming videos, you’ll need a more capable CPU. For a turn-key solution, check out the NETINT Logan Video Server options.

Figure 2. The NETINT T432 includes four Codensity G4 ASICs.

The T408 and T432 run on multiple versions of Ubuntu and CentOS; see here for more detail about those versions and recommendations for configuring your server.

The NETINT Logan Video Server

The NETINT Video Transcoding Server includes ten T408 U.2 transcoders. It is targeted for high-volume transcoding applications as an affordable turn-key replacement for existing hardware transcoders or where a drop-in solution to a software-based transcoder is preferred.

The lowest-priced model costs $7,000 and is built on the Supermicro 1114S-WN10RT server platform, powered by an AMD EPYC 7232P processor with eight CPU cores and 16 threads and running Ubuntu 20.04.5 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD drive, with 3x PCIe slots and ten NVMe slots that house the ten T408 transcoders. At full transcoding capacity, the server draws 220 watts while encoding or transcoding up to ten 4Kp60 streams or as many as 160 720p60 video streams.

The server is also offered with two more powerful CPUs, the AMD EPYC 7543P Server Processor (32-cores/64-threads, $8,900) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, $11,500). Other than the CPU, the hardware specifications are identical.

Figure 3. The NETINT Video Transcoding Server.

All Codensity G4-based products support HDR10 and HDR10+ for H.264 and H.265 encode and decode, as well as EIA CEA-708 closed captions for H.264 and H.265 encode and decode. In low-latency mode, all products support sub-frame latency. Other features include region-of-interest encoding, a customizable GOP structure with eight presets, and forced IDR frame inserts at any location.

The T408, T432, and NETINT Server are targeted toward high-volume interactive applications that require inexpensive, low-power, and high-density transcoding using the H.264 and HEVC codecs.

Codensity G5-Powered Live Transcoder Products

The Codensity G5 is our second-generation ASIC. In addition to roughly quadrupling the H.264 and HEVC throughput of the Codensity G4, it adds AV1 encode support, VP9 decode support, onboard scaling, cropping, padding, graphical overlay, and an 18 TOPS (Trillions of Operations Per Second) artificial intelligence engine that runs the most common frameworks, all natively in silicon.

Codensity G5 also includes audio DSP engines for encoding and decoding audio codecs such as MP3, AAC-LC, and HE-AAC. All this on-board activity minimizes the role of the CPU, allowing Quadra products to operate effectively in systems with modest CPUs.

Where the G4 ASIC is primarily a transcoding engine, the G5 incorporates much more onboard processing for even greater video processing acceleration. For this reason, NETINT labels Codensity G4-based products as Video Transcoders and Codensity G5-based products as Video Processing Units or VPUs.

The Codensity G5 is available in three products (Figure 4): the U.2-based Quadra T1 and PCIe-based Quadra T1A, which include one Codensity G5 ASIC, and the PCIe-based Quadra T2, which includes two Codensity G5 ASICs. Pricing for the T1 starts at $1,500.

In terms of power consumption, the T1 draws 17 Watts, the T1A 20 Watts, and the T2 draws 40 Watts.

Figure 4. The Quadra line of Codensity G5-based products.

All Codensity G5-based products provide the same HDR and closed caption support as the Codensity G4-based products. They have also been tested on Windows, macOS, Linux, and Android, with support for virtual machine and container virtualization, including Single Root I/O Virtualization (SR-IOV).

From a quality perspective, the Codensity G4-based transcoder products offer no configuration options to optimize quality vs. throughput. Quadra Codensity G5-powered VPUs offer features like lookahead and rate-distortion optimization that allow users to customize quality and throughput for their particular applications.

HARD QUESTIONS ON HOT TOPICS – WHAT DO YOU NEED TO UNDERSTAND ABOUT NETINT PRODUCTS LINE
Watch the full conversation on YouTube: https://youtu.be/qRtnwjGD2mY

AI-Based Video Processing

Beyond VP9 ingest and AV1 output, and superior on-board processing, the Codensity G5 AI engine is a game changer for many current and future video processing applications. Each Codensity G5 ASIC includes two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU.

Today, in early 2023, the AI-enabled processing market is nascent, but Quadra already supports several applications, such as an AI-based region of interest filter and background removal (see Quadra App Note APPS553). Additional features under development include automatic facial ID for video conferencing, license plate detection and OCR for security, object detection for a range of applications, and voice-to-text.

Quadra includes an AI Toolchain workflow that enables importing models from AI tools like Caffe, TensorFlow, Keras, and Darknet for deployment on Quadra. So, in addition to the basic models that NETINT provides, developers can design their own applications and easily implement them on Quadra.

Like NETINT’s Codensity G4 based products, Quadra VPUs are ideal for interactive applications that require low CAPEX and OPEX. Quadra VPUs offer increased onboard processing that enables lower-cost host systems and the ability to customize throughput and quality, deliver AV1 output, and deploy AI video applications.

The NETINT Quadra 100 Video Server

The NETINT Quadra 100 Video Server includes ten Quadra T1 U.2 VPUs and is targeted for ultra high-volume transcoding applications and for services seeking to deliver AV1 stream output.  

The Quadra 100 Video Server costs $20,000 and is built on the Supermicro 1114S-WN10RT server platform, powered by an AMD EPYC 7543P Server Processor (32-cores/64-threads) running Ubuntu 20.04.5 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD drive, with 3x PCIe slots and ten NVMe slots that house the ten T1 U.2 VPUs. At full transcoding capacity, the server draws around 500 watts while encoding or transcoding up to 20 8Kp30 streams or as many as 640 720p30 video streams.
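
One way to compare the two servers is to divide price and power by the maximum 720p stream count. The Python sketch below does that arithmetic using only the figures quoted in this article (the $7,000 T408-based server and the $20,000 Quadra 100 server). The class and field names are illustrative, and the comparison is rough, since the T408 figure is for 720p60 and the Quadra figure is for 720p30.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    price_usd: float
    watts: float          # draw at full transcoding capacity
    streams_720p: int     # maximum 720p streams quoted in the article
    frame_rate: int       # frame rate those streams were quoted at

    @property
    def usd_per_stream(self) -> float:
        return self.price_usd / self.streams_720p

    @property
    def watts_per_stream(self) -> float:
        return self.watts / self.streams_720p


SERVERS = [
    Server("T408 server (EPYC 7232P)", 7_000, 220, 160, 60),
    Server("Quadra 100 server", 20_000, 500, 640, 30),
]

for s in SERVERS:
    print(f"{s.name}: ${s.usd_per_stream:.2f} and "
          f"{s.watts_per_stream:.2f} W per 720p{s.frame_rate} stream")
```

On these numbers, the T408 server works out to $43.75 and about 1.4 watts per 720p60 stream, and the Quadra 100 server to $31.25 and about 0.78 watts per 720p30 stream.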

The Quadra server is also offered with two different CPUs, the AMD EPYC 7232P Server Processor (8-cores/16-threads, price TBD) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, price TBD). Other than the CPU, the hardware specifications are identical.

Media Processing Frameworks - Driving NETINT Hardware

In addition to SDKs for both hardware generations, NETINT offers highly efficient FFmpeg and GStreamer SDKs that allow operators to apply an FFmpeg/libavcodec or GStreamer patch to complete the integration.

In the FFmpeg implementation, the libavcodec patch on the host server functions between the NETINT hardware and FFmpeg software layer, allowing existing FFmpeg-based video transcoding applications to control hardware operation with minimal changes.
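
In practice, “minimal changes” for an FFmpeg-based application often means little more than selecting different codec names. The Python sketch below builds an FFmpeg command line under that assumption. The options used (`-c:v`, `-b:v`, `-c:a`, `-i`, `-y`) are standard FFmpeg options, but the hardware decoder and encoder names are deliberately left as parameters: the actual names registered by the NETINT patch are not given in this article and should be taken from NETINT’s documentation.

```python
def build_transcode_command(src, dst, decoder, encoder, bitrate="3M"):
    """Build an ffmpeg argument list that selects codecs by name.

    decoder / encoder: codec names as registered with libavcodec.
    For NETINT hardware these come from the patched build; the values
    are supplied by the caller because they are not specified here.
    """
    return [
        "ffmpeg", "-y",
        "-c:v", decoder,    # input option: decoder for the source video
        "-i", src,
        "-c:v", encoder,    # output option: encoder for the result
        "-b:v", bitrate,    # target video bitrate
        "-c:a", "copy",     # pass audio through untouched
        dst,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Switching an existing software workflow to hardware would then amount to changing the two codec arguments.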

The NETINT hardware device driver software includes a resource management module that tracks hardware capacity and usage load to present inventory and status on available resources and enable resource distribution. User applications can build their own resource management schemes on top of this resource pool or let the NETINT server automatically distribute the decoding and encoding tasks.

In automatic mode, users simply launch multiple transcoding jobs, and the device driver automatically distributes the decode/encode/processing tasks among the available resources. Alternatively, users can assign different hardware tasks to different NETINT devices, and even control which streams are decoded by the host CPU or NETINT hardware. With these and similar controls, users can most efficiently balance the overall transcoding load between the NETINT hardware and host CPU and maximize throughput.
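
For readers who want to build their own scheme on top of the resource pool, the Python sketch below shows one simple policy: send each new job to the device with the lowest current load. This is an illustration of the general idea only, not NETINT’s driver logic. The device names, the numeric job costs, and the function name are hypothetical.

```python
import heapq


def distribute_jobs(device_names, job_costs):
    """Assign each job to the currently least-loaded device.

    device_names: list of device identifiers (hypothetical labels).
    job_costs: relative load of each job, in arrival order.
    Returns a dict mapping device name to a list of job indices.
    """
    # Min-heap of (current_load, device_name); ties break by name.
    heap = [(0.0, name) for name in device_names]
    heapq.heapify(heap)
    assignment = {name: [] for name in device_names}

    for job_index, cost in enumerate(job_costs):
        load, name = heapq.heappop(heap)      # least-loaded device
        assignment[name].append(job_index)
        heapq.heappush(heap, (load + cost, name))
    return assignment
```

A production scheduler would also release load when jobs finish and track decode, encode, and scaling capacity separately; those details are omitted here.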

In all interfaces, the syntax and command structure are similar for T408s and Quadra units, which simplifies migrating from G4-based products to Quadra hardware. It is also possible to operate T408 and Quadra hardware together in the same system.

That’s the overview. For more information on any product, please check the product pages.