All You Need to Know About the NETINT Product Line


This article will introduce you to the NETINT product line and Codensity ASIC generations. We will focus primarily on the hardware differences, since all products share a common software architecture and feature set, which are briefly described at the end of the article.


Codensity G4-Powered Video Transcoder Products

The Codensity G4 was the first encoding ASIC developed by NETINT. There are two G4-based transcoders: the T408 (Figure 1), which is available in a U.2 form factor and as an add-in card, and the T432 (Figure 2), which is available as an add-in card. The T408 contains a single G4 ASIC and draws 7 watts under full load, while the T432 contains four G4 ASICs and draws 27 watts.

The T408 costs $400 in low volumes, while the T432 costs $1,500. The T432 delivers 4x the raw performance of the T408.

Figure 1. The NETINT T408 is powered by a single Codensity G4 ASIC.

T408 and T432 decode and encode H.264 and HEVC on the device but perform all scaling, overlay, and deinterlacing on the host CPU.

If you’re buying your own host, the selected CPU should reflect the extent of processing that it needs to perform and the overhead requirements of the media processing framework that is running the transcode function. 

When transcoding inputs without scaling, as in a cloud gaming or conferencing application, a modest CPU can suffice. If you are creating standard encoding ladders, deinterlacing multiple streams, or frequently scaling incoming videos, you’ll need a more capable CPU. For a turn-key solution, check out the NETINT Logan Video Server options.

Figure 2. The NETINT T432 includes four Codensity G4 ASICs.

The T408 and T432 run on multiple versions of Ubuntu and CentOS; see here for more detail about those versions and recommendations for configuring your server.

The NETINT Logan Video Server

The NETINT Video Transcoding Server includes ten T408 U.2 transcoders. It is targeted at high-volume transcoding applications, either as an affordable turn-key replacement for existing hardware transcoders or as a drop-in alternative where a software-based transcoder is currently used.

The lowest-priced model costs $7,000 and is built on the Supermicro 1114S-WN10RT server platform powered by an AMD EPYC 7232P CPU Series Processor with eight CPU cores and 16 threads running Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD drive with 3x PCIe slots and ten NVMe slots that house the ten T408 transcoders. At full transcoding capacity, the server draws 220 watts while encoding or transcoding up to ten 4Kp60 streams or as many as 160 720p60 video streams.

The server is also offered with two more powerful CPUs, the AMD EPYC 7543P Server Processor (32-cores/64-threads, $8,900) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, $11,500). Other than the CPU, the hardware specifications are identical.

FIGURE 3. The NETINT Video Transcoding Server.

All Codensity G4-based products support HDR10 and HDR10+ for H.264 and H.265 encode and decode, as well as EIA CEA-708 closed captions for H.264 and H.265 encode and decode. In low-latency mode, all products support sub-frame latency. Other features include region-of-interest encoding, a customizable GOP structure with eight presets, and forced IDR frame inserts at any location.

The T408, T432, and NETINT Server are targeted toward high-volume interactive applications that require inexpensive, low-power, and high-density transcoding using the H.264 and HEVC codecs.

Codensity G5-Powered Live Transcoder Products

The Codensity G5 is our second-generation ASIC. In addition to roughly quadrupling the H.264 and HEVC throughput of the Codensity G4, it adds AV1 encode support, VP9 decode support, onboard scaling, cropping, padding, graphical overlay, and an 18 TOPS (Trillions of Operations Per Second) artificial intelligence engine that runs the most common frameworks, all natively in silicon.

Codensity G5 also includes audio DSP engines for encoding and decoding audio codecs such as MP3, AAC-LC, and HE AAC. All this on-board activity minimizes the role of the CPU, allowing Quadra products to operate effectively in systems with modest CPUs.

Where the G4 ASIC is primarily a transcoding engine, the G5 incorporates much more onboard processing for even greater video processing acceleration. For this reason, NETINT labels Codensity G4-based products as Video Transcoders and Codensity G5-based products as Video Processing Units or VPUs.

The Codensity G5 is available in three products (Figure 4): the U.2-based Quadra T1 and PCIe-based Quadra T1A, which each include one Codensity G5 ASIC, and the PCIe-based Quadra T2, which includes two Codensity G5 ASICs. Pricing for the T1 starts at $1,500.

In terms of power consumption, the T1 draws 17 Watts, the T1A 20 Watts, and the T2 draws 40 Watts.

Figure 4. The Quadra line of Codensity G5-based products.

All Codensity G5-based products provide the same HDR and closed caption support as the Codensity G4-based products. They have also been tested on Windows, macOS, Linux, and Android, with support for virtual machine and container virtualization, including Single Root I/O Virtualization (SR-IOV).

From a quality perspective, the Codensity G4-based transcoder products offer no configuration options to optimize quality vs. throughput. Quadra Codensity G5-powered VPUs offer features like lookahead and rate-distortion optimization that allow users to customize quality and throughput for their particular applications.

HARD QUESTIONS ON HOT TOPICS – WHAT DO YOU NEED TO UNDERSTAND ABOUT NETINT PRODUCTS LINE
Watch the full conversation on YouTube: https://youtu.be/qRtnwjGD2mY

AI-Based Video Processing

Beyond VP9 ingest and AV1 output, and superior on-board processing, the Codensity G5 AI engine is a game changer for many current and future video processing applications. Each Codensity G5 ASIC includes two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU.

Today, in early 2023, the AI-enabled processing market is nascent, but Quadra already supports several applications, including an AI-based region-of-interest filter and background removal (see Quadra App Note APPS553). Additional features under development include automatic facial ID for video conferencing, license plate detection and OCR for security, object detection for a range of applications, and voice-to-text.

Quadra includes an AI Toolchain workflow that enables importing models from AI tools like Caffe, TensorFlow, Keras, and Darknet for deployment on Quadra. So, in addition to the basic models that NETINT provides, developers can design their own applications and easily implement them on Quadra.

Like NETINT’s Codensity G4 based products, Quadra VPUs are ideal for interactive applications that require low CAPEX and OPEX. Quadra VPUs offer increased onboard processing that enables lower-cost host systems and the ability to customize throughput and quality, deliver AV1 output, and deploy AI video applications.

The NETINT Quadra 100 Video Server

The NETINT Quadra 100 Video Server includes ten Quadra T1 U.2 VPUs and is targeted for ultra high-volume transcoding applications and for services seeking to deliver AV1 stream output.  

The Quadra 100 Video Server costs $20,000 and is built on the Supermicro 1114S-WN10RT server platform powered by an AMD EPYC 7543P Server Processor (32-cores/64-threads) running Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD drive with 3x PCIe slots and ten NVMe slots that house the ten T1 U.2 VPUs. At full transcoding capacity, the server draws around 500 watts while encoding or transcoding up to 20 8Kp30 streams or as many as 640 720p30 video streams.

The Quadra server is also offered with two different CPUs, the AMD EPYC 7232P Server Processor (8-cores/16-threads, price TBD) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, price TBD). Other than the CPU, the hardware specifications are identical.
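As a rough way to compare the two servers, the published figures can be reduced to watts and dollars per 720p stream. The sketch below uses only the entry prices, power draws, and stream counts quoted in this article; the `ServerSpec` class and its field names are illustrative. Note that the T408 server figure is for 720p60 while the Quadra figure is for 720p30, so the two results are not strictly like for like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Published figures for one server (class and field names are illustrative)."""
    name: str
    price_usd: float        # price quoted in the article
    watts_full_load: float  # power draw at full transcoding capacity
    streams_720p: int       # maximum 720p streams quoted in the article

    def watts_per_stream(self) -> float:
        return self.watts_full_load / self.streams_720p

    def usd_per_stream(self) -> float:
        return self.price_usd / self.streams_720p


# T408-based server: $7,000, 220 W, 160 streams at 720p60.
t408_server = ServerSpec("T408 server (720p60)", 7_000, 220, 160)
# Quadra 100 server: $20,000, about 500 W, 640 streams at 720p30.
quadra_server = ServerSpec("Quadra 100 server (720p30)", 20_000, 500, 640)

for server in (t408_server, quadra_server):
    print(f"{server.name}: {server.watts_per_stream():.2f} W/stream, "
          f"${server.usd_per_stream():.2f}/stream")
```

The same two ratios can be recomputed for the other CPU options by swapping in their prices once they are known.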

Media Processing Frameworks - Driving NETINT Hardware

In addition to SDKs for both hardware generations, NETINT offers highly efficient FFmpeg and GStreamer SDKs that allow operators to apply an FFmpeg/libavcodec or GStreamer patch to complete the integration.

In the FFmpeg implementation, the libavcodec patch on the host server functions between the NETINT hardware and FFmpeg software layer, allowing existing FFmpeg-based video transcoding applications to control hardware operation with minimal changes.

The NETINT hardware device driver software includes a resource management module that tracks hardware capacity and usage load to present inventory and status on available resources and enable resource distribution. User applications can build their own resource management schemes on top of this resource pool or let the NETINT server automatically distribute the decoding and encoding tasks.

In automatic mode, users simply launch multiple transcoding jobs, and the device driver automatically distributes the decode/encode/processing tasks among the available resources. Alternatively, users can assign different hardware tasks to different NETINT devices, and even control which streams are decoded by the host CPU or NETINT hardware. With these and similar controls, users can efficiently balance the overall transcoding load between the NETINT hardware and host CPU and maximize throughput.
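To make the idea of automatic distribution concrete, here is a minimal sketch of one way a load-aware scheduler could spread jobs across devices: each new job goes to the device with the most free capacity. This is an illustration only, not NETINT's driver logic or API; the `Device` class, the `assign_job` function, and the abstract capacity units are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical transcoder device with a fixed capacity budget."""
    name: str
    capacity: int                  # abstract capacity units (assumption)
    load: int = 0                  # units currently in use
    jobs: list = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - self.load


def assign_job(devices, job_name, cost):
    """Place a job on the device with the most free capacity.

    Returns the chosen device, or None if no device can fit the job.
    """
    candidates = [d for d in devices if d.free >= cost]
    if not candidates:
        return None
    target = max(candidates, key=lambda d: d.free)
    target.load += cost
    target.jobs.append(job_name)
    return target


devices = [Device("dev0", capacity=100), Device("dev1", capacity=100)]
for i, cost in enumerate([60, 30, 30, 50]):
    chosen = assign_job(devices, f"job{i}", cost)
    print(f"job{i} ({cost} units) -> {chosen.name if chosen else 'rejected'}")
```

An application that builds its own scheme on top of the resource pool could replace the "most free capacity" rule with any other policy, for example pinning decode jobs to one device and encode jobs to another.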

In all interfaces, the syntax and command structure are similar for T408s and Quadra units, which simplifies migrating from G4-based products to Quadra hardware. It is also possible to operate T408 and Quadra hardware together in the same system.

That’s the overview. For more information on any product, please check the following product pages (click the image below to see product page). 

PRODUCT GALLERY. Click the product image to visit product page

Reducing Power Consumption in Data Centers: A Response to the European Energy Crisis


Encoding technology refreshes are seldom CFO-driven. For European data centers, they may need to be over the next few years, as reducing power consumption becomes a primary focus.

Few European consumers or businesses need to be reminded that they are in the midst of a power crisis. But a recent McKinsey & Company article entitled “Four themes shaping the future of the stormy European power market” provides interesting insights into the causes of the crisis and its expected duration. Engineering and technical leaders, don’t stop reading, because this crisis will impact the architecture and technology decisions you may be making.

The bottom line, according to McKinsey? Buckle up, Europe, “With the frequency of high-intensity heat waves expected to increase, additional outages of nuclear facilities planned in 2023, and further expected reductions in Russian gas imports, we expect that wholesale power prices may not reduce substantially (defined as returning to three times higher than pre-crisis levels) until at least 2027.” If you haven’t been thinking about steps your organization should take to reduce power consumption and carbon emissions, now is the time.

HARD QUESTIONS ON HOT TOPICS – EUROPEAN ENERGY CRISIS AS PER MCKINSEY REPORT
WATCH THE FULL CONVERSATION ON YOUTUBE: https://youtu.be/yiYSoUB4yXc

The Past

The war in Ukraine is the most obvious contributor to the energy crisis, but McKinsey identifies multiple additional contributing factors. Significantly, even before the war, Europe was in the midst of “structural challenges” caused by its transition from carbon-emitting fossil fuels to cleaner and more sustainable sources like wind, solar, and hydroelectric.

Then, in 2022, the shock waves began. Prior to the invasion of Ukraine in February, Russia supplied 30 percent of Europe’s natural gas, which dropped by as much as 50% in 2022, and is expected to decline further. This was exacerbated by a drop of 19% in hydroelectric power caused by drought and a 14% drop in nuclear power caused by required maintenance that closed 32 of France’s 56 reactors. As a result, “wholesale prices of both electricity and natural gas nearly quadrupled from previous records in the third quarter of 2022 compared with 2021, creating concerns for skyrocketing energy costs for consumers and businesses.”

Figure 1. As most European consumers and businesses know, prices skyrocketed in 2022 and are expected to remain high through 2027 and beyond.

Four key themes

Looking ahead, McKinsey identifies four key themes it expects to shape the market’s evolution over the next five years.

  • Increase in Required Demand

McKinsey sees power usage increasing from 2,900 terawatt-hours (TWh) in 2021 to 3,700 TWh in 2030, driven by multiple factors. For example, the switch to electric cars and other modes of transportation will increase power consumption by 14% annually. In addition, the manufacturing sector, which needs power for electrolysis, will increase to 200 TWh by 2030.

  • The Rise of Intermittent Renewable Energy Sources

By 2030, wind and solar power will provide 60% of Europe’s energy, double the share in 2021. This will require significant new construction but could also face challenges like supply chain issues, material shortages, and a scarcity of suitable land and talent.

  • Balancing Intermittent Energy Sources

McKinsey sees the energy market diverging into two types of sources: intermittent sources like solar, wind, and hydroelectric, and dispatchable sources like coal, natural gas, and nuclear that can be turned on and off to meet peak requirements. Over the next several years, McKinsey predicts that “a gap will develop between peak loads and the dispatchable power capacity that can be switched on to meet it.”

To close the gap, Europe has been aggressively developing clean energy sources of dispatchable capacity, including utility-scale battery systems, biomass, and hydrogen. In particular, hydrogen is set to play a key role in Europe’s energy future, as a source of dispatchable power and as a means to store energy from renewable sources.

All these sources must be further implemented and massively scaled, with “build-outs remaining highly uncertain due to a reliance on supportive regulations, the availability of government incentives, and the need for raw materials that are in short supply, such as lithium ion.”

  • New and evolving markets and rules

Beyond temporary measures designed to reduce costs for energy consumers, European policymakers are considering several options to reform how the EU energy market operates. These include:

  • A central buyer model: A single EU or national regulatory agency would purchase electricity from dispatchable sources at fixed prices under long-term contracts and sell it to the market at average cost prices.
  • Decoupled day-ahead markets: Separate zero marginal cost energy resources (wind, solar) and marginal cost resources (coal) into separate markets to prioritize dispatching of renewables.
  • Capacity remuneration mechanism: Grid operator provides subsidies to producers based on forecast cost of keeping power capacity in the market to ensure a steady supply of dispatchable electricity and protect consumers.

McKinsey closes on a positive note, “Although the European power market is experiencing one of its most challenging periods, close collaboration among stakeholders (such as utilities, suppliers, and policy makers) can enable Europe’s green-energy transition to continue while ensuring a stable supply of power.”

The future of the European power market is complex and subject to many challenges, but policymakers and stakeholders are working to address them and find solutions to ensure a stable and affordable energy system for consumers and businesses.

In the meantime, the mandate for data centers isn’t new: video engineers are being asked to reduce power consumption to save OPEX, reduce their carbon footprint so the company hits its ESG metrics, and minimize the potential disruption of energy instability.

If you’re in this mode, NETINT’s ASIC-based transcoders can help by offering the lowest available power draw of any silicon solution (CPU, GPU, FPGA), and thus the highest possible density.

Cloud or on-premise – streaming publisher’s dilemma


Processing your media in the cloud or on-premises is one of the most critical decisions facing a streaming video service. Two recent articles provide strong opinions and insights on this decision and are worthy of review. Our take? Do the math and make your own decision.

The first article is “Why we’re leaving the cloud,” by David Heinemeier Hansson.

By way of background, Hansson is co-owner and CTO of software developer 37signals, the developer of the project management platform Basecamp and the premium email service Hey.

After running the two platforms on AWS for a number of years, Hansson commented that “renting computers is (mostly) a bad deal for medium-sized companies like ours with stable growth. The savings promised in reduced complexity never materialized.” As an overview, he asserts that the cloud excels at two ends of the spectrum: 1) simple and low-traffic applications and 2) highly irregular load with wild swings or towering peaks in usage.

When Hey first launched, running in AWS allowed the new service to seamlessly onboard the 300,000 users that signed up in the first three weeks, wildly exceeding the forecast of 30,000 in six months. However, since then, Hansson reported, these capacity spikes never reoccurred, and by “continuing to operate in the cloud, we’re paying an at times almost absurd premium for the possibility that [they] could.”

In abandoning the cloud, Hansson had to stare down two common beliefs. The first is that the cloud simplifies systems and computer management. As it relates to his own businesses, he reports that “anyone who thinks running a major service like HEY or Basecamp in the cloud is ‘simple’ has clearly never tried. Some things are simpler, others more complex, but on the whole, I’ve yet to hear of organizations at our scale being able to materially shrink their operations team, just because they moved to the cloud.”

He also tackles perceptions regarding the complexity of running equipment on-premise. “Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.”


In “Media Processing in the Cloud or On-Prem—Which Is Right for You?” , Alex Emmermann, Director of Business Development for Cloud Products at Telestream, takes a more moderate view (as you would expect).

Emmermann starts by pointing out where the cloud makes sense, zeroing in on the same capacity swings as Hansson. “A typical painful example is when capacity requirements shift underneath you, such as a service becoming more popular than you had initially allocated resources for. For example, when running a media services operation, there are many situations that can stress systems... In media processing, full-catalog licenses, mergers, or content migrations can cause enormous capacity requirements for transcoding and QC.”

Emmermann also introduces the concept of hybrid operations. “For many companies, a wholesale move may feel too risky, so a hybrid approach works well by allowing excess capacity requirements to burst into the cloud as required. This allows run rate systems to continue functioning while taking immediate advantage of cloud scaling when and if required. Depending on the needs of the service, a hybrid setup could continue to run indefinitely and very cost-effectively if on-prem CapEx resources have already been spent and the resources are in place to keep them running.”

In terms of companies that should operate on premises, Emmermann cites two examples. First are companies with significant CAPEX investments in encoding gear. “For the many thousands of busy on-premises servers processing run-rate media workflows throughout the world, they’re efficiently and cheaply doing what they need to do and will no doubt continue to do so for a long time.” He also mentions that inexpensive and reliable connectivity is an absolute requirement, and “there are certain places on the planet that may not have reliable interconnectivity to a cloud provider.”

All told, Emmermann concludes, “There’s no question that any media company investing in new services or wanting to have the capacity to say yes to any customer request will want to do this with a public cloud provider… On the other hand, any steady-state, on-premises service that is happily functioning as designed and only occasionally requires a small capital refresh will be happy to stay the course.”

Our Take? Do the Math

HARD QUESTIONS ON HOT TOPICS – CLOUD OR ON PREMISES, HOW TO DO THE MATH?
Watch the full conversation on YouTube: https://youtu.be/GSQsa4oQmCA

Anyone who has ever provisioned an EC2 instance from AWS and paid the hourly rate has wondered, “how does that compare to buying your own system?” We’re certainly not immune.

Given the impetus of these articles, we decided to put pencil to paper, or keyboard to spreadsheet. We recently launched the NETINT Video Transcoding Server, which costs $7,000 and includes ten T408 transcoders that can output H.264 and HEVC. In benchmarking, the entry-level system produced 21 five-rung H.264 ladders and 27 four-rung HEVC ladders. What would it cost to produce the same number of streams in AWS?

We checked the MediaLive price list here and confirmed it with the pricing calculator estimate here (Figure 3 for HEVC). Though a single hour of H.264 live streaming costs only about $0.46, this adds up to $4,004.17 per year. This jumps to $1.527 per hour for HEVC, or $13,375.55 per year. Both figures are for a single ladder.

Figure 3. Yearly cost for streaming a single five-rung HEVC encoding ladder.

To compare this to our streaming server, we multiplied each ladder by the number of ladders the server could produce, and extended all calculations out to five years. This translates to a five-year cost of $420,441 for H.264 and a staggering $1,805,712 for HEVC.

To compute the same five-year cost for the server, we added $69/month for colocation charges to the $7,000 base price. This came to $11,140 for either format.
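The arithmetic behind these totals is easy to reproduce. The sketch below uses the yearly per-ladder MediaLive costs, ladder counts, server price, and colocation fee quoted above; it assumes the 27-ladder figure applies to HEVC, which is what the five-year totals imply. Function and variable names are illustrative. The cloud results land within a few dollars of the totals quoted in this article, presumably because of rounding in the yearly figures.

```python
YEARS = 5

# Yearly MediaLive cost for one ladder, and ladders the server can produce.
CLOUD = {
    "H.264": {"usd_per_ladder_year": 4_004.17, "ladders": 21},
    "HEVC": {"usd_per_ladder_year": 13_375.55, "ladders": 27},  # assumed HEVC
}

SERVER_PRICE_USD = 7_000
COLOCATION_USD_PER_MONTH = 69


def cloud_cost(codec: str, years: int = YEARS) -> float:
    """Cost of running every ladder in the cloud, 24/7, for `years` years."""
    entry = CLOUD[codec]
    return entry["usd_per_ladder_year"] * entry["ladders"] * years


def server_cost(years: int = YEARS) -> float:
    """Purchase price plus colocation fees over the same period."""
    return SERVER_PRICE_USD + COLOCATION_USD_PER_MONTH * 12 * years


on_prem = server_cost()
print(f"Server, {YEARS} years: ${on_prem:,.0f}")
for codec in CLOUD:
    total = cloud_cost(codec)
    print(f"{codec} in the cloud, {YEARS} years: ${total:,.0f} "
          f"({total / on_prem:.0f}x the server cost)")
```

Changing `YEARS`, the colocation fee, or the ladder counts shows how sensitive the comparison is to each assumption, which is the point of doing the math for your own workload.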

Table 1. Five-year cost comparison, AWS MediaLive pricing compared to the NETINT server.

This comparison brought to mind Hansson’s comment that “Amazon, in particular, is printing profits renting out servers at obscene margins.” Surely, no streaming publisher is using MediaLive for 24/7 365 operations.

Taking a step back, it’s tough not to agree with the key points from both authors. The cloud does make the most sense when you need instant capacity for peak encoding. For steady-state operations, owning your own gear is always going to be cheaper.

All that said, run the numbers no matter what you’re doing in the cloud. While the results probably won’t be as startling as those shown in Table 1, you won’t know until you do the math.

Maximizing Cloud Gaming Performance with ASICs


Ask ten cloud gamers what an acceptable level of latency is for cloud gaming, and you’ll get ten different answers. However, they will all agree that lower latency is better.

At NETINT, we understand. As a supplier of encoders to the cloud gaming market, our role is to supply the lowest possible latency at the highest possible quality and the greatest encoding density with the lowest possible power consumption. While this sounds like a tall order, because our technology is ASIC based, it’s what we do for cloud gaming and high-volume video streaming workloads of all types.

In this article, we’ll take a quick look at the technology stack for cloud gaming and the role of compression. Then we’ll discuss the performance of the NETINT Quadra VPU (video processing unit) series using the four measuring sticks of latency, density, video quality, and power consumption.

The Cloud Gaming Technology Stack

Figure 1 illustrates the different elements of the cloud gaming technology stack, particularly how the various transfer, compute, rendering, and encoding activities contribute to overall latency.

At the heart of every cloud gaming center is a game engine that typically runs the operating system native to the game, usually Android or Windows, though Linux and macOS are not uncommon (see here for Meta’s dual OS architecture).

Since most games rely on GPUs for rendering, all cloud gaming data centers have a healthy dose of GPU resources. These functions are incorporated in the cloud compute and graphics engine shown on the left, which creates the frames sent to the encode function for encoding and transmission to the gamer.

As illustrated in Figure 1, Nokia budgets 100 ms for total latency. Inside the data center, which is shown on the left, Nokia allows 15 ms to receive the data, 40 ms to process the input and render the frame, 5 ms to encode the frame, and 15 ms to return it to the remote player. That’s a lot to do in the time it takes a sound wave to travel just 100 feet.

Figure 1. Cloud gaming latency budget from Nokia.
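To see how little of the budget the encoder is allowed, the four data-center stages can be added up against the 100 ms total. This is a small sketch using the figures cited from Nokia; it treats the return leg as 15 ms, in line with the other entries, and the stage labels are our own shorthand.

```python
TOTAL_BUDGET_MS = 100

# Stage budgets as cited above (return leg assumed to be milliseconds).
data_center_stages_ms = {
    "receive input": 15,
    "process input and render frame": 40,
    "encode frame": 5,
    "return frame to player": 15,
}

used_ms = sum(data_center_stages_ms.values())
remaining_ms = TOTAL_BUDGET_MS - used_ms
encode_share = data_center_stages_ms["encode frame"] / TOTAL_BUDGET_MS

print(f"Listed stages use {used_ms} ms of the {TOTAL_BUDGET_MS} ms budget")
print(f"Left for everything else in the chain: {remaining_ms} ms")
print(f"Encoding is allotted {encode_share:.0%} of the total budget")
```

Any encode time above the 5 ms allotment has to be recovered from one of the other stages, which is why per-frame encoder latency matters so much in this application.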

NETINT’s Quadra VPU series is ideal for the standalone encode function. All Quadra VPUs are powered by the NETINT Codensity G5 ASIC. It’s called a video processing unit because in addition to H.264, HEVC, and VP9 decode, and H.264, HEVC, and AV1 encode, Quadra VPUs offer onboard scaling, overlay, and an 18 TOPS AI engine (per chip).

Quadra is available in several single-chip solutions (T1 and T1A) and a dual-chip solution (T2) and starts at $1,500 in low quantities. Depending upon the configuration that you purchase, you can install up to ten Quadra VPUs in a single 1RU server and twenty Quadra VPUs in a 2RU server.

Cloud Gaming Latency and Density

Table 1 reports latency and density for a single Quadra VPU. As you would expect, latency depends on video resolution by way of the available network bandwidth and, to a much lesser degree, the number of jobs being processed.

Game producers understand the resolution/latency tradeoff and design the experience around this. So, a cloud gaming vendor might deliver a first-person shooter game at 720p to minimize latency while providing a better UX on medium bandwidth connections and a slower-paced role-playing or strategy game at larger resolutions to optimize the visual experience. As you can see, a single Quadra VPU can service both scenarios, with 4K latency under 20 ms and 720p latency around 4 ms at extremely high stream counts.

Table 1. Quadra throughput and average latency for AVC and HEVC.

In terms of density, the jobs shown in Table 1 are for a single Quadra VPU. Though multiple units won’t scale linearly, performance will increase substantially as you install additional units into a server. Because the Quadra is focused solely on video processing and encoding operations, it outperforms most general-purpose GPUs, CPUs, and even FPGA-based encoders from a density perspective.

Quadra Output Quality

From a quality perspective, hardware transcoders are typically benchmarked against the x264 and x265 codecs running in FFmpeg. Though FFmpeg’s throughput is orders of magnitude lower, these codecs represent well known and accepted quality levels. NETINT recently compared Quadra quality against x264 and x265 in a low latency configuration using a CGI-based data set.

Table 2 shows the results for H.264, with Rate-Distortion Optimized Quantization (RDOQ) enabled and disabled. Enabling RDOQ increases quality slightly but decreases throughput. Quadra exceeded x264 quality in both configurations using the veryfast preset, which is typical for live streaming.

Table 2. The NETINT Quadra VPU series delivers better H.264 quality than the x264 codec using the veryfast preset.

For HEVC, Table 3 shows the equivalent x265 preset with RDOQ disabled (the high-throughput, lower-quality option) at three Rate Distortion Optimization (RDO) levels, which also trade off quality for throughput. Even with RDOQ disabled and with RDO set to 1 (low quality, high throughput), Quadra delivers the equivalent of x265 Medium quality. Note that most live streaming engineers use superfast or ultrafast to produce even a modest number of HEVC streams in a software-only encoding scenario.

Table 3. The NETINT Quadra VPU series delivers better quality than the x265 codec using the medium preset.

Low Power Transcoding for Cloud Gaming

At full power, the Quadra T1 draws 17 watts. Though some GPUs offer similar power consumption, they typically deliver far fewer streams.

In this comparison with the NVIDIA T4, the Quadra T1 drew 0.71 watts per 1080p stream, about 81% less than the 3.7 watts per stream required by the T4. This translates to a reduction of about 81% in energy costs and carbon emissions per stream. In terms of CAPEX, Quadra costs $53.57 per 1080p stream, 63% less than the T4’s $144 per stream.
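The percentage savings follow directly from the per-stream figures quoted above, and the short sketch below recomputes them. The helper name `percent_reduction` is our own, and the implied stream count assumes the per-stream cost is based on the $1,500 starting price mentioned earlier.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


# Per-1080p-stream figures quoted in the comparison.
QUADRA_WATTS, T4_WATTS = 0.71, 3.7
QUADRA_USD, T4_USD = 53.57, 144.0

power_saving = percent_reduction(T4_WATTS, QUADRA_WATTS)
capex_saving = percent_reduction(T4_USD, QUADRA_USD)

# Assumption: the per-stream cost is derived from the $1,500 starting price.
implied_streams = 1_500 / QUADRA_USD

print(f"Power per stream: {power_saving:.1f}% lower")
print(f"CAPEX per stream: {capex_saving:.1f}% lower")
print(f"Implied 1080p streams per Quadra T1: {implied_streams:.0f}")
```

Multiplying the per-stream wattage by the implied stream count also gives a quick sanity check on the card-level power draw.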

When it comes to gameplay, most gamers prioritize latency and quality. In addition to delivering these two key QoE elements, cloud gaming vendors must also focus on CAPEX, OPEX, and sustainability. By all these metrics, the ASIC-based Quadra is an ideal encoder for any cloud gaming production workflow.

Mobile cloud gaming and technology suppliers


Video games are a huge market segment, projected to reach US$221.4 billion in 2023, expanding to an estimated US$285 billion by 2027. Of that, cloud gaming grossed an estimated US$3 billion+ in 2022 and is projected to produce over US$12 billion in revenue by 2026.

While the general video game market generates minimal revenue from encoder sales, cloud gaming is the perfect application for ASIC-based transcoding. NETINT products were designed, in part, for cloud gaming and are extensively deployed in cloud gaming overseas. We expect to announce some high-profile domestic design wins in 2023.

If you’re not a gamer, you may not be familiar with what cloud gaming is and how it’s different from PC or console-based gaming. This is the first of several introductory articles to get you up to speed on what cloud gaming is, how it works, who the major players are, and why it’s projected to grow so quickly. 

What is cloud gaming

Figure 1, from this article, illustrates the difference between PC/console gaming and cloud gaming. On top is traditional gaming, where the gamer needs an expensive, high-performance console or game computer to process the game logic and render the output. To the extent that there is a cloud component, say for multiple players, the online server tracks and reports the interactions, but all computational and rendering heavy lifting is performed locally.

Figure 1. The difference between traditional and cloud gaming. From this article.

On the bottom is cloud gaming. As you can see, all you need on the consumer side is a screen and game controller. All of the game logic and rendering are performed in the cloud, along with encoding for delivery to the consumer.

Cloud gaming workflow

Figure 2 shows a high-level cloud workflow – we’ll dig deeper into the cloud gaming technology stack in future articles, but this should help you grasp the concept. As shown, the gamer’s inputs are sent to the cloud, where a virtual instance of the game interprets, executes, and renders the input. The resultant frames are captured, encoded, and transmitted back to the consumer, where the frames are decoded and displayed. 

Figure 2. A high-level view of the cloud side of cloud gaming from this seminal article.

Cloud gaming and consumers' benefits

Cloud gaming services incorporate widely different business models, pricing levels, available games, performance envelopes, and compatible devices. In most cases, however, consumers benefit because:

  • They don’t need a high-performance PC or game console to play games – they can play on most connected devices. This includes some Smart TVs for a true, big-screen experience.
  • They don’t need to download, install, or maintain games on their game platform.
  • They don’t need to buy expensive games to get started.
  • They can play the same game on multiple platforms, from an expensive gaming rig or console to a smartphone or tablet, with all ongoing game information stored in the cloud so they can immediately pick up where they left off.

Publishers benefit because they get instant access to users on all platforms, not just the native platforms the games were designed for. So, console and PC-based games are instantly accessible to all players, even those without the native hardware. Since games aren’t downloaded during cloud gaming, there’s no risk of piracy, and the cloud negates the performance advantages long-held by those with the fastest hardware, leveling the playing field for game play.

Gaming experience

Speaking of performance, what’s necessary to achieve a traditional local gameplay experience? Most cloud platforms recommend a 10 Mbps download speed at a minimum for mobile, with a wired Ethernet connection recommended for computers and smart TVs. As you would expect, your connection speed dictates performance, with 4K ultra-high frame rate games requiring faster connection speeds than 1080p@30fps gameplay.

As mentioned at the top, cloud gaming is expected to capture an increasing share of overall gameplay revenue going forward, both from existing gamers who want to play new games on new platforms and new gamers. Given the revenue numbers involved, this makes cloud gaming a critical market for all related technology suppliers. 

Argos dispels common myths about encoding ASICs

Even in 2023, many high-volume streaming producers continue to rely on software-based transcoding, despite the clear CAPEX, OPEX, and environmental benefits of ASIC-based transcoding. Part of the inertia relates to outdated concerns about the shortcomings of ASICs, including sub-par quality and lack of flexibility to add features or codec enhancements.

As a parent, I long ago concluded that there were no words that could come out of my mouth that would change my daughter’s views on certain topics. As a marketer, I feel some of that same dynamic, that no words can come out of my keyboard that would shake the negative beliefs about ASICs from staunch software-encoding supporters.

So, don’t take our word that these beliefs are outdated; consider the results from the world’s largest video producer, YouTube. The following slides and observations are from a Google presentation by Aki Kuusela and Clint Smullen on the Argos ASIC-based transcoder at Hot Chips 33 back in August 2021. The slides are available here, and the video here.

In the presentation, the speakers discussed why YouTube developed its own ASIC and the performance and power efficiency achieved during the first 16 months of deployment. Their comments go a long way toward dispelling the myths identified above and make for interesting reading.

Advanced Codecs Mean Encoding Time Has Grown 8,000-Fold Since H.264

In discussing why Google created its own encoder, Kuusela explained that video was getting harder to compress, not only from a codec perspective but from a resolution and frame rate perspective. Here’s Kuusela (all quotes grabbed from the YouTube video and lightly edited for readability).

“In order to sustain the higher resolutions and frame rate requirements of video, we have to develop better video compression algorithms with improved compression efficiency. However, this efficiency comes with greatly increased complexity. For example, if we compare the VP9 from 2013 to the decade older H.264, the time to encode videos in software has grown to 10x. The more recent AV1 format from 2018 is already 200 times more time-consuming than the H.264 standard.

If we further compound this effect with the increase in resolution and frame rate for top-quality video, we can see that the time to encode a video from 2003 to 2018 has grown eight thousand-fold. It is very obvious that the CPU performance improvement has not kept up with this massive complexity growth, and to keep our video services running smoothly, we had to consider warehouse scale acceleration. We also knew things would not get any better with the next generation of compression.”

Figure 1. Google moved to hardware to address skyrocketing encoding times.

Reviewing Figure 1, it should be noted that though few engineers use VP9 as extensively as YouTube, if you swap HEVC for VP9, the complexity difference relative to H.264 is about the same. Beyond the higher resolutions and frame rates engineers must support to remain competitive, the need for hardware becomes even more apparent when you consider the demands of live production.
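The compounding Kuusela describes is simple multiplication. As a rough sketch, the snippet below takes the two figures from the quote (AV1 at 200 times the software encoding time of H.264, and an eight thousand-fold overall increase) and backs out the growth attributable to resolution and frame rate. The variable names are illustrative, and the implied factor is derived arithmetic that assumes the two effects multiply; it is not a number stated in the presentation.

```python
# Figures quoted from the Hot Chips presentation.
CODEC_FACTOR_AV1_VS_H264 = 200    # AV1 software encode time relative to H.264
TOTAL_GROWTH_2003_TO_2018 = 8000  # overall growth in time to encode a video

# Assumption: codec complexity and pixel rate multiply, so whatever is left
# after removing the codec factor comes from resolution and frame rate.
implied_pixel_rate_factor = TOTAL_GROWTH_2003_TO_2018 / CODEC_FACTOR_AV1_VS_H264

# An N-fold increase expressed as a percentage increase is (N - 1) * 100.
percent_increase = (TOTAL_GROWTH_2003_TO_2018 - 1) * 100

print(f"Implied resolution/frame-rate factor: {implied_pixel_rate_factor:.0f}x")
print(f"Overall increase: {percent_increase:,}%")
```

Under that assumption, resolution and frame rate together account for roughly a 40x increase on top of the codec itself.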

“Near Parity” with Software Encoding Quality

One consistent concern about ASICs has been quality, which admittedly lagged in early hardware generations. However, Google’s comparison shows that properly designed hardware can deliver near-parity to software-based transcoding.

Kuusela doesn’t spend a lot of time on the slide shown in Figure 2, merely stating that “we also wanted to be able to optimize the compression efficiency of the video encoder based on the real-time requirements and time available for each encoder and to have full access to all quality control algorithms such as bitrate allocation and group of picture selection. So, we could get near parity to software-based encoding quality with our no-compromises implementation.”

Figure 2. Argos delivers “near-parity” with software encoders.

NETINT’s data more than supports this claim. For example, Table 1 compares the NETINT Quadra VPU with various x265 presets. Depending upon the test configuration, Quadra delivers quality on par with the x265 medium preset. When you consider that software-based live production often necessitates using the veryfast or ultrafast preset to achieve marginal throughput, Quadra’s quality far exceeds that of software-based transcoding.

Table 1. Quadra HEVC quality compared to x265 in high-quality latency tolerant configuration.

ASIC Performance Can Improve After Deployment

Another concern about ASIC-based transcoders is that they can’t be upgraded and therefore become obsolete quickly. Proper design allows ASICs to balance encoding tasks between hardware, firmware, and control software to ensure continued upgradeability.

Figure 3 shows how the bitrate of VP9 and H.264 continued to improve compared to software in the months after the product launch, even without changes to the firmware or kernel driver. The second Google presenter, Clint Smullen, attributed this to a hybrid hardware/software design, commenting that “Using a software approach was critical both to supporting the quality and feature development in the video core as well as allowing customer teams to iteratively improve quality and performance.”

Figure 3. Argos continued to improve after deployment without changes to firmware or the kernel driver.

The NETINT Codensity G4 ASIC included in the T408 and the NETINT Codensity G5 ASIC that powers our Quadra family of VPUs both use a hybrid design that distributes critical functions between the ASIC, driver software, and firmware.

We optimize ASIC design to maximize functional longevity, as explained here on the role of firmware in ASIC implementations: “The functions implemented in the hardware are typically the lower-level parts of a video codec standard that do not change over time, so the hardware does not need to be updated. The higher-level parts of the video codecs are in firmware and driver software and can still be changed.”

As Google’s experience and NETINT’s data show, well-designed ASICs can continue improving in quality and functionality long after deployment. 

90% Reduction in Power Consumption

Few engineers question the throughput and power efficiency of ASICs, and Google’s data bears this out. Commenting on Figure 4, Smullen stated, “For H.264 transcoding a single VCU matches the speed of the baseline system while using about one-tenth of the system level power. For VP9, a single 20 VCU machine replaces multiple racks of CPU-only systems.”

Figure 4. Throughput and comparative efficiency of Argos vs software-only transcoding.

NETINT ASICs deliver similar results. For example, a single T408 transcoder (H.264 and HEVC) delivers roughly the same throughput as a 16-core computer encoding with software and draws only about 7 watts compared to 250+ watts for the computer. NETINT Quadra draws 20 watts and delivers roughly 4x the performance of the T408 for H.264, HEVC, and AV1. In one implementation, a single 1RU server with ten Quadras can deliver 320 1080p streams or 200 720p cloud gaming sessions, which, like Argos, replaces multiple racks of CPUs.
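To put those figures side by side, here is a minimal sketch of the arithmetic using only the numbers quoted above. It treats the computer’s “250+” watts as exactly 250 (so the saving is a lower bound) and counts card power only for the ten-Quadra server; host CPU, fans, and power supply losses are excluded. The function and variable names are illustrative.

```python
def percent_reduction(baseline, candidate):
    """Percentage by which candidate is lower than baseline."""
    return (1 - candidate / baseline) * 100

T408_WATTS = 7
CPU_SERVER_WATTS = 250     # "250+" in the text, so treated as a lower bound
QUADRA_WATTS = 20
QUADRAS_PER_SERVER = 10
STREAMS_1080P = 320

# T408 vs. a 16-core software encoder at roughly equal throughput.
t408_savings = percent_reduction(CPU_SERVER_WATTS, T408_WATTS)

# Card-only power per 1080p stream in the ten-Quadra 1RU configuration.
card_watts_per_stream = QUADRA_WATTS * QUADRAS_PER_SERVER / STREAMS_1080P

print(f"T408 power reduction vs. CPU: at least {t408_savings:.1f}%")
print(f"Quadra card power per 1080p stream: {card_watts_per_stream:.3f} W")
```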

Time to Reconsider?

As Google’s experience with YouTube and Argos shows, ASICs deliver unparalleled throughput and power efficiency in high-volume publishing workflows. If you haven’t considered ASICs for your workflow, it’s time for another look.

How Scaling Method and Technique Impacts Quality and Throughput

The thing about FFmpeg is that there are almost always multiple ways to accomplish the same basic function. In this post, we look at four approaches to scaling to reveal how the scaling method and techniques used impact quality and throughput.

We found that if you’re scaling using the default -s option (-s 1280x720), you’re leaving a bit of quality on the table compared to other methods. How much depends upon the metric you prefer; about ten percent if you’re a VMAF (hand raised here) or SSIM fan, much less if you still bow to the PSNR gods. More importantly, if you’re chasing throughput via cascaded scaling with fast scaling algorithms (flags=fast_bilinear), you’re probably losing quality without a meaningful throughput increase.

That’s the TL;DR; here’s the backstory.

The Backstory

NETINT sells ASIC-based hardware transcoders. One key advantage over software-only/CPU-based encoding is throughput, so we perform lots of hardware vs. software benchmarking. Fairness dictates that we use the most efficient FFmpeg command string when deriving the command string for software-only encoding.

In addition, the NETINT T408 transcoder scales in software using the host CPU, so we are vested in techniques that increase throughput for T408 transcodes. In contrast, the NETINT Quadra scales and performs overlays in hardware and provides an AI engine, which is why it’s designated a Video Processing Unit (VPU) rather than a transcoder.

One proposed scaling technique for accelerating both software-only and T408 processing is cascading scaling, where you create a filter complex that starts at full resolution, scales to the next lower resolution, then uses the lower resolution to scale to the next lower resolution. Here’s an example.

-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]"

So, rather than performing multiple scales from full resolution to the target (4K > 2K, 4K > 1080p, 4K > 720p, 4K > 360p), you’re performing multiple scales from lower resolution sources (4K > 2K > 1080p > 720p > 360p). The theory was that this would reduce CPU cycles and improve throughput, particularly when coupled with a fast scaling algorithm. Even assuming a performance increase (which turned out to be a bad assumption), the obvious concern is quality; how much does quality degrade because the lower-resolution transcodes are working from a lower-resolution source?
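Cascaded filtergraphs like this are tedious to write by hand, so it can help to generate them. The following is a minimal sketch, not part of the original test setup: a hypothetical helper, cascade_filtergraph, that rebuilds the filter string for a list of rungs. The pad labels (out4k, in1440p, and so on) follow the naming used in the article’s command strings.

```python
def cascade_filtergraph(rungs, flags, source_label="4k"):
    """Build a cascaded -filter_complex string.

    rungs: list of (width, height, label, n_outputs), ordered from the
    largest resolution to the smallest. Each rung is scaled from the
    previous rung rather than from the full-resolution source.
    """
    # The source is split once: one copy is encoded as-is, one feeds the cascade.
    parts = [f"[0:v]split=2[out{source_label}][in{source_label}]"]
    prev = source_label
    for index, (width, height, label, n_outputs) in enumerate(rungs):
        is_last = index == len(rungs) - 1
        # Output pads for this rung: [out720p], plus [out720p2], ... if several
        # encodes share the same resolution.
        pads = [f"[out{label}]"] + [f"[out{label}{k}]" for k in range(2, n_outputs + 1)]
        if not is_last:
            pads.append(f"[in{label}]")  # feeds the next, smaller rung
        chain = f"[in{prev}]scale={width}:{height}:flags={flags}"
        if len(pads) > 1:
            chain += f",split={len(pads)}"
        parts.append(chain + "".join(pads))
        prev = label
    return ";".join(parts)

LADDER = [
    (2560, 1440, "1440p", 1),
    (1920, 1080, "1080p", 2),  # two 1080p rungs at different bitrates
    (1280, 720, "720p", 1),
    (640, 360, "360p", 1),
]

graph = cascade_filtergraph(LADDER, "fast_bilinear")
print(graph)
```

Passing "lanczos" instead of "fast_bilinear" yields the Lanczos variant of the same cascade.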

In contrast, if you’ve read this far, you know that the typical scaling technique used by most beginning FFmpeg producers is the -s option (-s 1280x720). For all rungs below 4K, FFmpeg scales the source footage down to the target resolution using the bicubic scaling algorithm.

So, we had two proposed methods which I expanded to four, as follows.

  • Default (-s 1280x720)
  • Cascade using fast bilinear
  • Cascade using Lanczos
  • Video filter using Lanczos (-vf scale=1280x720 -sws_flags lanczos)

I tested the following encoding ladder using the HEVC codec.

  • 4K @ 12 Mbps
  • 2K @ 7 Mbps
  • 1080p @ 3.5 Mbps
  • 1080p @ 1.8 Mbps
  • 720p @ 1 Mbps
  • 360p @ 500 kbps

I encoded two 3-minute 4Kp30 files, excerpts from the Netflix Meridian and Harmonic Football test clips using the x265 codec and ultrafast preset. You can see full command strings at the end of the article. I measured throughput in frames per second and measured the 2K to 360p rung quality with VMAF, PSNR, and SSIM, compiling the results into BD-Rate comparisons in Excel.

I tested on a Dell Precision 7820 tower driven by two 2.9 GHz Intel Xeon Gold (6226R) CPUs running Windows 10 Pro for Workstations with 64 GB of RAM. I tested with FFmpeg 5.0, a version downloaded from www.gyan.dev on December 15, 2022.

Performance

TABLE 1. FPS BY SCALING METHOD

Table 1 shows that cascading delivered negligible performance benefits with the two test files and the selected encoding parameters. I asked the engineer who suggested the cascading scaling approach why we saw no throughput increase. Here’s a brief exchange. 

Engineer: It’s not going to make any performance difference in your example anyways but it does reduce the scaling load

       Me: Why wouldn’t it make a performance difference if it reduces the scaling load?

Engineer: Because, as your example has shown, the x265 encoding load dominates. It would make a very small difference

       Me: Ah, so the slowest, most CPU-intensive process controls overall performance.

Engineer: Yes, when you compare 1000+1 with 1000+10 there is not too much difference.
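The engineer’s “1000+1 versus 1000+10” point can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: the cost units are arbitrary, and the second call uses a made-up encoder cost to show how the picture changes when encoding is cheap relative to scaling.

```python
def throughput_gain(encode_cost, scale_cost_before, scale_cost_after):
    """Percent throughput gain from cheaper scaling, for a fixed encode cost.

    Throughput is modeled as 1 / (encode_cost + scale_cost), so the gain
    depends on how large scaling is relative to the total work per frame.
    """
    before = encode_cost + scale_cost_before
    after = encode_cost + scale_cost_after
    return (before / after - 1) * 100

# The engineer's example: encoding dominates, so cheaper scaling barely helps.
x265_like = throughput_gain(1000, 10, 1)

# Hypothetical fast encoder whose cost is comparable to the scaling cost.
fast_encoder = throughput_gain(10, 10, 1)

print(f"Encode-dominated pipeline: {x265_like:.1f}% faster")
print(f"Scaling-heavy pipeline:    {fast_encoder:.1f}% faster")
```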

What this means, of course, is that these results may vary by the codec. If you’re encoding with H.264, which is much faster, cascading scaling might increase throughput. If you’re encoding with AV1 or VVC, almost certainly not.

Given that the T408 transcoder is multiple times faster than real-time, I’m now wondering if cascaded scaling might increase throughput when producing with the T408. You probably wouldn’t attempt this approach if quality suffered, but what if cascaded scaling improved quality? Sound far-fetched? Read on.

Quality Results

Table 2 shows the combined VMAF results for the two clips. Read this by choosing a row and moving from column to column. As you would suspect, green is good, and red is bad. So, for the Default row, that technique produces the same quality as Cascade – Fast Bilinear with a bitrate reduction of 18.55%. However, you’d have to boost the bitrate by 12.89% and 11.24%, respectively, to produce the same quality as Cascade – Lanczos and  Video Filter – Lanczos.

Table 2. BD-Rate comparisons for the four techniques using the VMAF metric.
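If BD-Rate figures are new to you, the sketch below shows one way to read the Default row. It converts each percentage into the bitrate the Default technique would need to match the other technique at a given reference bitrate. Note the assumption: BD-Rate is an average over the whole tested bitrate range, so applying it to a single rung (here the 3.5 Mbps 1080p rung) is only an approximation. The helper name matching_bitrate is hypothetical.

```python
def matching_bitrate(reference_kbps, bd_rate_percent):
    """Bitrate needed to match the reference quality, given a BD-Rate.

    Negative BD-Rate means the same quality at a lower bitrate;
    positive means a higher bitrate is required.
    """
    return reference_kbps * (1 + bd_rate_percent / 100)

# Default row of the VMAF table: BD-Rate of Default relative to each technique.
DEFAULT_ROW = {
    "Cascade - Fast Bilinear": -18.55,
    "Cascade - Lanczos": 12.89,
    "Video Filter - Lanczos": 11.24,
}

REFERENCE_KBPS = 3500  # the 1080p @ 3.5 Mbps rung

needed = {name: matching_bitrate(REFERENCE_KBPS, bd) for name, bd in DEFAULT_ROW.items()}
for name, kbps in needed.items():
    print(f"Default needs ~{kbps:.0f} kbps to match {name} at {REFERENCE_KBPS} kbps")
```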

From a quality perspective, the Cascade approach combined with the fast bilinear algorithm was the clear loser, particularly compared to either method using the Lanczos algorithm. Even if there were a substantial performance increase, which there wasn’t, it’s hard to see a relevant use case for this algorithm.

The most interesting takeaway was that cascading scaling with the Lanczos algorithm produced the best results, slightly higher than using a video filter with Lanczos. The same pattern emerged for PSNR, where Cascade – Lanc was green in all three columns, indicating the highest-quality approach. 

Table 3. BD-Rate comparisons for the four techniques using the PSNR metric.

Ditto for SSIM.

Table 4. BD-Rate comparisons for the four techniques using the SSIM metric.

The cascading approach delivering better quality than the video filter was an anomaly. Not surprisingly, the engineer noted:

Engineer: It is odd that cascading with Lanczos has better quality than direct scaling. I’m not sure why that would be.

       Me: Makes absolutely no sense. Is anything funky in the two command strings?

Engineer: Nothing obvious but I can look some more.

Later analysis yielded no epiphanies. Perhaps they can come from a reader.

The Net Net

First, the normal caveats; your mileage may vary by codec and content. My takeaways are:

  • Try cascading scaling with Lanczos with the T408.
  • For software encodes, never use -s again.
  • Use cascade or the simpler video filter approach.
  • With most software-based encoders, faster scaling methods may not deliver performance increases but could degrade quality.

Further, as we all know, there are several, if not dozens, of additional approaches to scaling; if you have meaningful results that prove one is substantially better, please share them with me via THIS email.

Finally, taking a macro view, it’s worth remembering that a $12,000+ workstation could only produce 25 fps when producing a live 4K ladder to HEVC using x265’s ultrafast preset. Sure, there are faster software encoders available. Still, hardware encoding is the best answer for affordable live 4K transcoding from both an OPEX and CAPEX perspective.

Command Strings:

Default:

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 ^
-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_12M_default.mp4 ^
-s 2560x1440 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_7M_default.mp4 ^
-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_3_5M_default.mp4 ^
-s 1920x1080 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_default.mp4 ^
-s 1280x720 -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_default.mp4 ^
-s 640x360 -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_default.mp4

Cascade – Fast Bilinear

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 ^
-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=fast_bilinear,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=fast_bilinear,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=fast_bilinear,split=2[out720p][in720p];[in720p]scale=640:360:flags=fast_bilinear[out360p]" ^
-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_fast_bi.mp4 ^
-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_fast_bi.mp4 ^
-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_fast_bi.mp4 ^
-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_fast_bi.mp4 ^
-map [out720p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_fast_bi.mp4 ^
-map [out360p] -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_8_bit_cascade_500K_fast_bi.mp4

Cascade – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 ^
-filter_complex "[0:v]split=2[out4k][in4k];[in4k]scale=2560:1440:flags=lanczos,split=2[out1440p][in1440p];[in1440p]scale=1920:1080:flags=lanczos,split=3[out1080p][out1080p2][in1080p];[in1080p]scale=1280:720:flags=lanczos,split=2[out720p][in720p];[in720p]scale=640:360:flags=lanczos[out360p]" ^
-map [out4k] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_8_bit_cascade_12M_lanc.mp4 ^
-map [out1440p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_8_bit_cascade_7M_lanc.mp4 ^
-map [out1080p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_3_5M_lanc.mp4 ^
-map [out1080p2] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_8_bit_cascade_1_8M_lanc.mp4 ^
-map [out720p] -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_8_bit_cascade_1M_lanc.mp4 ^
-map [out360p] -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_cascade_500K_lanc.mp4

Video Filter – Lanczos

c:\ffmpeg\bin\ffmpeg -y -i football_4K30_all_264_short.mp4 ^
-c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 12M -maxrate 12M -bufsize 24M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_4K_12M_filter_lanc.mp4 ^
-vf scale=2560x1440 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 7M -maxrate 7M -bufsize 14M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_2K_7M_filter_lanc.mp4 ^
-vf scale=1920x1080 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 3.5M -maxrate 3.5M -bufsize 7M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_3_5M_filter_lanc.mp4 ^
-vf scale=1920x1080 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1.8M -maxrate 1.8M -bufsize 3.6M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_1080p_1_8M_filter_lanc.mp4 ^
-vf scale=1280x720 -sws_flags lanczos -c:v libx265 -an -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v 1M -maxrate 1M -bufsize 2M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 Fball_x265_720p_1M_filter_lanc.mp4 ^
-vf scale=640x360 -sws_flags lanczos -c:v libx265 -force_key_frames expr:gte^(t,n_forced*2^) -tune psnr -b:v .5M -maxrate .5M -bufsize 1M -preset ultrafast -x265-params open-gop=0:b-adapt=0:aq-mode=0:rc-lookahead=16 -report Fball_x265_360p_500K_filter_lanc.mp4

Is power consumption your company’s priority?

Power consumption is a priority for NETINT customers and a passion for NETINT engineers and technicians. Matthew Ariho, a system engineer in SoC Engineering at NETINT, recently answered some questions about:

  • How to test power consumption
  • Which computer components draw the most power
  • Why using older computers is bad for your power bills, and
  • The best way for video-centric data centers to reduce power consumption.

What are the different ways to test power consumption (and cost)?

Matthew Ariho

There are software and hardware-based solutions to this problem. I use one of each as a means of confirming any results.

One software tool is the IPMItool Linux package, which provides a simple command-line interface to IPMI-enabled devices through a Linux kernel driver. This tool polls the instantaneous, average, peak, and minimum power draw of the system over a sampling period.
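As a rough sketch of how such readings could be collected in a script, the snippet below parses the text printed by the ipmitool dcmi power reading subcommand. This is an assumption: the interview does not say which ipmitool command is used, the sample text and its wattages are made up for illustration, and the exact output layout can vary by ipmitool version and BMC. The helper names are hypothetical.

```python
import re
import subprocess

# Matches lines such as "Instantaneous power reading:   215 Watts".
WATTS_LINE = re.compile(
    r"^[ \t]*([^:\n]+):[ \t]+(\d+)[ \t]+Watts[ \t]*$", re.MULTILINE
)

def parse_power_reading(text):
    """Return {label: watts} for every line that reports a value in Watts."""
    return {label.strip(): int(watts) for label, watts in WATTS_LINE.findall(text)}

def read_power():
    """Query the BMC (requires ipmitool, an IPMI-enabled host, and privileges)."""
    result = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_reading(result.stdout)

# Illustrative sample only; the values are invented, not measured.
SAMPLE = """
    Instantaneous power reading:                   215 Watts
    Minimum during sampling period:                148 Watts
    Maximum during sampling period:                221 Watts
    Average power reading over sample period:      190 Watts
    Power reading state is:                   activated
"""

readings = parse_power_reading(SAMPLE)
for label, watts in readings.items():
    print(f"{label}: {watts} W")
```

Calling read_power() in a loop and logging the results is one way to capture average and steady-state draw over a test run.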

On the hardware side of things, you can use different forms of multimeters; the Kill-A-Watt meter and a 208VAC power bar are examples of such devices available in our lab.

What are their pros and cons (and accuracy)?

Matthew Ariho

The IPMItool is great because it provides a lot of information. It is fairly simple to set up and use. There is a question of reliability, however: because it is software-based, it depends on readings whose source I’m not familiar with.

The multimeters (like the Kill-A-Watt meter), while also simple to use, do not have any logging capabilities which makes measurements like average or steady state power draw difficult to measure. Both methods have a resolution of 1W which is not ideal but more than sufficient for our use cases.

What activities do you run when you test power consumption?

Matthew Ariho

We run multiple instances that mimic streaming workloads, but only to the point that each of those instances still performs up to our standards (for example, 30 fps).

What’s the range of power consumption you’ve seen?

Matthew Ariho

I’ve seen reports of power consumption of up to 450 watts, but personally never tested a unit that drew that much. Typically, without any load on the T408 devices, the power consumption hovers around 150W, which increases to 210 to 220W during peak periods.

What’s the difference between Power Supply rating and actual power consumption (and are they related)?

Matthew Ariho

Power supplies take in 120VAC or 208VAC and convert to various DC voltages (12V, 5V, 3.3V) to power different devices in a computer. This conversion process inherently has several inefficiencies. The extent of these inefficiencies depends on the make of the power supply and the quality of components used.

Power supplies are offered with an efficiency rating that certifies how efficiently a power supply will function at different loads. Because of these conversion losses, power consumption measured at the wall will always be greater than the power delivered to the components within a computer.
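The relationship between the two numbers is a single division. Here is a minimal sketch using hypothetical values (a 200 W DC load and an 80% efficient supply); real efficiency varies with load and with the make of the power supply.

```python
def wall_power(dc_load_watts, efficiency):
    """AC power drawn at the wall for a given DC load and PSU efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return dc_load_watts / efficiency

DC_LOAD_WATTS = 200   # hypothetical power delivered to components
EFFICIENCY = 0.80     # hypothetical PSU efficiency at this load

at_wall = wall_power(DC_LOAD_WATTS, EFFICIENCY)
conversion_loss = at_wall - DC_LOAD_WATTS

print(f"Measured at the wall: {at_wall:.0f} W")
print(f"Lost in conversion:   {conversion_loss:.0f} W")
```

This is also why a meter at the wall and a reading taken inside the system will not agree exactly.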

What are the hidden sources of excessive power that most people don’t know about?

Matthew Ariho

The operating system of a computer can consume a lot of power performing background tasks, though this has become less of a problem with more efficient CPUs on the market. Another source of excessive power draw is bloatware: usually unnecessary programs that run in the background.

What distinguishes a power-hungry computer from an efficient one – what should the reader look for?

Matthew Ariho

The power supply rating is something to watch. Small variations in the power supply rating make significant differences in efficiency. The difference between a PSU rated at 80 PLUS and a PSU rated at 80 PLUS Bronze is about 2% to 5% depending on the load. This number only grows with better rated PSUs.

Other factors include the components of the computer. Recently, newer devices (CPUs, GPUs, and motherboards) have delivered significant generational improvements in efficiency. A top-of-the-line computer from 3 years ago simply cannot compete with some mid-range computers today in terms of either power efficiency or performance. So, while sourcing older but cheaper components may have been a good decision in the past, nowadays it’s not as clear-cut.

Which components draw the most power?

Matthew Ariho

CPUs and GPUs. Even consumer CPUs can draw over 200W sustained. GPUs on the lower end consume around 150W, and more recent models can draw over 400W.

How does the number of cores in a computer impact power usage?

Matthew Ariho

I’m really not an expert on server components, and it is hard to say without having examples. There are too many options to draw a conclusion about a clear trend. There are AMD 64-core server CPUs that pull about 250 to 270 W and 12- to 38-core Intel server CPUs that do about the same. Ultimately, architectural advantages and features determine performance and efficiency when comparing CPUs across manufacturers or even CPUs from the same manufacturer.

You can't manage what you don't measure.

One famous quote attributed to Peter Drucker is that you can’t manage what you don’t measure. As power consumption becomes increasingly important, it’s incumbent upon all of us to both measure and manage it.

Insights from the Bitmovin Video Developer Report

The Bitmovin Video Developer Report, now in its 6th edition, is one of the most far-reaching and useful documents available to streaming professionals (now with no registration required). It’s a report that I happily download each December and generally refer to frequently during the next twelve months.

Like the proverbial elephant, what you find important in the report depends upon your interests. I typically zero in on video codec usage, encoding practices, and the most important problems and opportunities facing streaming developers. As discussed below, this year’s edition has some surprises, like the fact that more respondents are currently working with H.266/VVC than AV1.

Beyond this, the report also tracks details on development frameworks, content distribution, monetization practices, DRM, video analytics, and many other topics. This makes it extraordinarily valuable to anyone needing a finger on the pulse of streaming industry practices.

Let’s start with some details about how Bitmovin compiles the data and then jump to what I found most interesting.

Gathering the Data

Bitmovin collected the data between June and September 2022. A total of 424 respondents from over 80 countries answered the survey. Geographically, EMEA led the charge with 43%, followed by North America (34%), APAC (14%), and Latin America (8%). Regarding job function, 34% of respondents were manager/CEO/VP level, 23% developer/engineer, 14% technical manager, 10% product manager, 9% architect/consultant, 7% in R&D, and 3% in sales and marketing.

A quarter of respondents worked in OTT streaming services, 21% in online video platforms, 15% for broadcasters, 12% for integrators, 7% for publishers, 6% for telcos, 5% for social media sites, with 10% other. In terms of company size, 35% worked in companies with 300+ employees, 17% with 101-300, 19% with 51-100, and 29% with 1-50. In other words, a very useful cross-section of geography, industry, job function, and company size.

To be clear, the results are not actual data from Bitmovin’s cloud encoding facility, which would be useful in its own right. Rather, the respondents answered questions about their current practices and future plans in each of the listed topics.

Current and Planned Codec Usage

Figure 1 shows current and planned codec usage for live encoding, with current usage in blue and planned usage in red. The numbers exceed 100% (of course) because most respondents use multiple codecs.

It’s always a surprise to see H.264 at less than 100%, but there it is at 78%, clear as day. Even given the breadth of industries that responded to the survey, it’s tough to imagine any publisher not supporting H.264.

Figure 1. Answers to the question, “Which streaming formats are you using in production for distribution and which ones are you planning to introduce within the next year?”

HEVC was next at 40%, with AV1 in fifth at 18%, bracketed by VP8 (19%) and VP9 (17%), presumably more for WebRTC than OTT. These are the codecs most likely to be used to actually publish video in 2022. Other codecs, presumably implemented by infrastructure providers, were H.266/VVC, a surprising third at 19%, with LCEVC and EVC both at 16%.

Looking ahead, HEVC looks most likely to succeed in 2023, with 43% of respondents planning to implement it, followed by AV1 at 34%, H.264/AVC at 33%, and VVC at 20%. Given that CanIUse lists AV1 support at 73% while VVC isn’t even listed, you’d have to assume that actual AV1 deployments in the near term will dwarf H.266/VVC, but you can’t ignore the interest this standards-based codec is receiving from the industry. VOD encoding tracks these results fairly closely for both current and planned usage.

Video Quality Related Findings

Quality is a constant concern for video professionals, and quality-related data appeared in several questions. In terms of challenges faced by respondents, “finding the root cause of quality issues” ranked fifth with 23%, while “quality of experience” ranked ninth with 19%.

Interestingly, in response to the question, “For which of the following video use cases do you expect to use machine learning (ML) or artificial intelligence (AI) to improve the video experience for your viewers,” 33% cited “video quality optimization,” which ranked third, while 30% cited “quality of experience (QoE),” which ranked fourth.

With so many respondents looking for futuristic means to improve quality, it was ironic that so many ignored content-aware encoding (CAE), a proven method of improving both quality and quality of experience. Specifically, only 33% of respondents were currently using CAE, with 35% planning to implement CAE within the next 12 months. If you’re not in either of these camps, consider yourself scolded.

Live Encoding Practices

Lastly, I focused on live encoding practices, finding that 53% of respondents used commercial encoders, which presumably include both hardware and software. In comparison, 34% encode via open source, which is all software. What’s interesting is how poorly this group dovetails with both the most significant challenge faced by respondents and the largest opportunity for innovation perceived by respondents.

Figure 2. Answers to the question, “Where do you encode video?”

Specifically, controlling cost was the most significant challenge in the report, selected by 33% of respondents. On a cost-per-stream basis, considering both CAPEX and OPEX, software encoding is far more expensive than encoding with hardware, particularly ASICs.

The most significant opportunity for innovation reported by respondents was live streaming at scale, again at 33%. In this regard, the same lack of throughput that makes CPU-driven open-source encoding the most expensive solution makes it the least scalable. Simply stated, publishers currently encoding with CPU-driven open-source codecs can help address both their biggest challenge and their most significant opportunity by switching to ASIC-based transcoding.

Figure 3. Responses to the question, “Where do you see the most opportunity for innovation in your service?”
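To make the cost-per-stream comparison concrete, here is a minimal sketch of how such a figure might be computed from CAPEX and power-related OPEX. Every number in it (purchase prices, power draw, stream counts, electricity price, and amortization period) is a hypothetical placeholder rather than a measurement or a product specification, and the class and method names are invented for illustration. Substitute your own figures.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class EncoderOption:
    name: str
    capex_usd: float    # purchase price of the hardware
    power_watts: float  # sustained power draw under load
    streams: int        # simultaneous streams it can encode

    def annual_cost_per_stream(self, usd_per_kwh: float, years: float) -> float:
        """Straight-line amortized CAPEX plus electricity, divided by streams."""
        capex_per_year = self.capex_usd / years
        energy_kwh = self.power_watts * HOURS_PER_YEAR / 1000
        opex_per_year = energy_kwh * usd_per_kwh
        return (capex_per_year + opex_per_year) / self.streams


# Hypothetical placeholder inputs, not benchmark results.
options = [
    EncoderOption("software-only server", capex_usd=10_000, power_watts=500, streams=20),
    EncoderOption("server with ASIC transcoders", capex_usd=12_000, power_watts=300, streams=100),
]

for option in options:
    cost = option.annual_cost_per_stream(usd_per_kwh=0.15, years=3)
    print(f"{option.name}: ${cost:,.2f} per stream per year")
```

The structure matters more than the numbers: throughput sits in the denominator, so a solution that encodes more streams per server lowers both the CAPEX and the OPEX share of each stream.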

Curious? Download our white paper, How to Slash CAPEX, OPEX, and Carbon Emissions Using the NETINT T408 Video Transcoder here. Or, compute how long it will take to recoup your investment in ASIC-based encoding through reduced power costs via calculators available here.

And don’t forget to download the Bitmovin Video Developer Report, here.

How NETINT enables ASIC upgradeability with Software


ASICs provide tremendous energy efficiency, yet they suffer from being fixed-function devices with limited programmability. This was a core engineering challenge that we addressed in the development of the Codensity ASIC family. Our answer is upgradeable firmware, which can be used for a variety of purposes, including adding new features and improving coding performance and functionality.

To explore these capabilities, we spoke with two members of the NETINT development team: Neil Gunn, NETINT’s Video Firmware Tech Lead, and Savio Lam, a firmware engineer. In this short discussion, they describe how firmware allows Codensity video transcoders and VPUs to evolve and improve long after leaving the foundry.

This conversation focuses mainly on our Codensity G4 ASIC; however, the capability to upgrade firmware applies to all of our ASIC platforms, including the Codensity G5.

What do you do with NETINT?

Neil Gunn

I am a firmware architect and also develop the firmware and, to a lesser extent, the host-side software (libxcoder and FFmpeg) for NETINT transcoding ASICs. I started at NETINT in 2018 working on T408 (Codensity G4-based) firmware development. Then I moved to Quadra (Codensity G5-based) as a software architect and firmware/software developer. I continue to support T408 in the background.

Savio Lam

I am a firmware engineer working on our video transcoding products.

What did you do on the T408?

Neil Gunn

I implemented a number of video features in the firmware, such as 10-bit transcoding, closed captions, HDR10, HDR10+, HLG10, Dolby Vision, HRD, Region of Interest, encoder parameter change, etc. I also worked on bug fixes and customer issues.

Savio Lam

I worked on the system design and integration. I mainly developed code that controls how video data comes in and out of our transcoder in the most efficient and reliable way.

What is firmware in an ASIC?

Neil Gunn

The firmware is software that runs on embedded CPUs within the ASIC. The firmware provides a high-level interface to the low-level encoding and decoding hardware. The firmware does a lot of the high-level bitstream processing, such as creating VPS, SPS, and PPS headers, and SEI processing, leaving the ASIC hardware to do the low-level number crunching. Functions that consume a lot of processing and are likely not to change are implemented in hardware.
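For readers unfamiliar with the headers Neil mentions, the following sketch shows how VPS, SPS, PPS, and SEI units can be recognized in an HEVC Annex B byte stream. It is an illustration written for this article, not NETINT firmware code: the function names are hypothetical, and the sample stream contains only start codes and two-byte NAL headers with no payloads. The type numbers (32, 33, 34, 39, 40) come from the H.265 specification.

```python
# HEVC nal_unit_type values for the parameter sets and SEI messages.
HEVC_NAL_NAMES = {
    32: "VPS",
    33: "SPS",
    34: "PPS",
    39: "PREFIX_SEI",
    40: "SUFFIX_SEI",
}

START_CODE = b"\x00\x00\x01"


def hevc_nal_type(nal_header: bytes) -> int:
    """Extract nal_unit_type (6 bits) from the two-byte HEVC NAL unit header."""
    if len(nal_header) < 2:
        raise ValueError("HEVC NAL unit header is two bytes long")
    return (nal_header[0] >> 1) & 0x3F


def list_nal_units(stream: bytes) -> list[str]:
    """Find each 00 00 01 start code and name the NAL unit that follows it."""
    names = []
    pos = stream.find(START_CODE)
    while pos != -1:
        header = stream[pos + 3 : pos + 5]
        if len(header) == 2:
            nal_type = hevc_nal_type(header)
            names.append(HEVC_NAL_NAMES.get(nal_type, f"type {nal_type}"))
        pos = stream.find(START_CODE, pos + 3)
    return names


# Toy stream: headers only, for illustration (a real stream carries payloads).
sample = (
    b"\x00\x00\x00\x01\x40\x01"  # VPS
    b"\x00\x00\x00\x01\x42\x01"  # SPS
    b"\x00\x00\x00\x01\x44\x01"  # PPS
    b"\x00\x00\x01\x4e\x01"      # prefix SEI
    b"\x00\x00\x01\x26\x01"      # type 19, an IDR slice
)
print(list_nal_units(sample))
```

Parameter sets and SEI messages are small and describe the stream rather than the pixels, which is part of why this layer is a natural fit for firmware while slice data is handled by the encoding hardware.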

Savio Lam

To add to what Neil has already described, the firmware in our T408 ASIC manages several significant functions. For example, it comprises code responsible for the NVMe protocol, which allows us to efficiently receive and return up to 8GB/s of video input and output data. To properly consume and process the video data, the firmware sets up and schedules tasks to the appropriate hardware blocks.

Our firmware is also the brain that oversees the bigger picture part of the rate control. In this role, it’s part of a feedback loop that inputs subpicture data from low-level hardware blocks and uses that data to make better decisions that improve picture quality.

To sum up, the firmware is the brain that controls all the hardware blocks in the ASIC and gives instructions to each of them to perform their tasks as efficiently as possible.
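As a rough picture of the kind of feedback loop Savio describes, the sketch below adjusts a quantization parameter (QP) based on how many bits the previous frame consumed. This is a generic, textbook-style controller written for illustration, not NETINT's rate control algorithm; the function name, the 10% deadband, and the frame sizes are all assumptions.

```python
def next_qp(qp: int, frame_bits: int, target_bits: int,
            deadband: float = 0.10, qp_min: int = 0, qp_max: int = 51) -> int:
    """Raise QP when the last frame overshot the target, lower it on undershoot."""
    ratio = frame_bits / target_bits
    if ratio > 1 + deadband:
        qp += 1  # too many bits spent: quantize more coarsely
    elif ratio < 1 - deadband:
        qp -= 1  # bits to spare: quantize more finely
    return max(qp_min, min(qp_max, qp))


# Hypothetical per-frame bit counts reported back by the encoding hardware.
qp = 30
target = 40_000
for bits in [60_000, 52_000, 45_000, 41_000, 30_000]:
    qp = next_qp(qp, bits, target)
    print(f"frame used {bits} bits -> next QP {qp}")
```

Because the decision logic is ordinary software, thresholds and step sizes like these can be tuned in a later release, while the hardware blocks that produce the measurements stay unchanged.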

How is firmware different from the gates burned into the chip?

Neil Gunn

Firmware, like all software, can be changed, unlike actual gates in a chip. It’s called firmware because it’s a little harder to change than software. Firmware is stored in Flash memory, which can be reprogrammed through an upgrade process. A T408 firmware release typically consists of new host-side software and firmware that must be version-matched for proper operation. Software provided to our customers with the release simplifies the upgrade for one or more T408s in a system.

Savio Lam

There is logic in our T408 ASIC that could have been designed as part of the hardware for better performance. However, that would significantly limit our ability to add and improve certain product features to suit different customer needs. We believe we have found the right balance in deciding what should be implemented in the firmware or hardware.

What functions can you adjust and/or improve within firmware?

Neil Gunn

Things like the codec headers, SEIs, and rate control, to a certain extent, can be adjusted and/or improved within the firmware. Some lower-level rate control features are fixed in the hardware. Lower-level parts of the encoding standard are fixed in the hardware, as these require a lot of processing and are unlikely to change.

Savio Lam

As Neil said, we are quite flexible when it comes to adding or improving support for different video metadata. And as we both explained earlier, since the firmware is also part of the brain that operates the picture rate control for encoding, we can continue to improve quality to a certain degree post-ASIC development.

Do you have any examples of significant improvements with the T408?

Neil Gunn

We significantly reduced codec delay on both the encoder and decoder. Our low delay mode removes all frame buffering and encodes and decodes a single frame at a time. Our encoder uses a low delay GOP and sets flags in the bitstream appropriately so that another decoder knows that it doesn’t need to add any delay while decoding.

Savio Lam

Based on feedback from different customers, we have made several rate control improvements and fixes through firmware updates, which improved or resolved some of the video quality-related problems they had encountered.

When you hear people say ASICs are obsolete the day they come out of the foundry, what’s your response?

Neil Gunn

It’s not true. It is true that the hardware is fixed in an ASIC. Still, the functions implemented in the hardware are typically the lower-level parts of a video codec standard that do not change over time, so the hardware does not need to be updated. The higher-level parts of the video codecs are in firmware and driver software and can still be changed. For example, the T408 encoder hardware is designed for H.264 and H.265. We cannot add new codecs to the T408, but we can add new features to the existing codecs.

Savio Lam

There is a fine balance between what needs to be implemented in hardware for performance and what needs to be implemented in the firmware for flexibility (programmability). We think we struck the perfect balance with the Codensity G4 which is what makes it a great ASIC.
