Build Your Own Streaming Infrastructure – Software

Build Your Own Streaming Infrastructure - Article by Jan Ozer from NETINT Technologies

My assumption is that you’re currently using a cloud-based service like AWS for your live streaming and are seeking to reduce costs by buying your own transcoding hardware, installing the necessary software, and hosting the server on-premises or in a co-location facility. This article covers the software side.

To begin, let’s acknowledge that AWS and other cloud services have created a well-featured and highly integrated ecosystem for live streaming and distribution. The downside is the cost.

To illustrate the potential savings, I’ll refer to this article, which compared the cost of producing 21 H.264 ladders and 27 HEVC ladders via AWS MediaLive and by encoding with NETINT’s recently launched Logan Video Server. As you can see in the table, MediaLive costs around $400K for H.264 and $1.8 million for HEVC, as compared to $11,140 in both cases for the co-located server.

Streaming Infrastructure - Table from article 'cloud or on-prem'
Table 1. Five-year cost comparison: AWS MediaLive pricing compared to the NETINT Server.

While there are less expensive options available inside and outside of AWS, whenever you pay for hardware by the minute or hour of production, you’re vastly overpaying as compared to owning your own hardware. Sure, you say, but it’s so easy compared to running your own hardware.

If that’s a concern, here are some comforting words from David Heinemeier Hansson, co-owner and CTO of 37signals, the developer of the project management platform Basecamp and the email service Hey. Recently, Hansson wrote Why we’re leaving the cloud, a blog post that detailed his company’s decision to do just that. Here’s the relevant quote.

Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.

My wife has chihuahuas, and given their difficulties with potty training, I seriously doubt they could do it, but you get the point. To paraphrase FDR, all you have to fear is fear itself. The bottom line is that running your own live streaming service should cost relatively little CAPEX, will save significant OPEX, and won’t be nearly as challenging as you might be fearing.

Let’s look at your options for the software required to run your homegrown system.

Transcoding and Packaging Software

Figure 1 shows the minimum software and infrastructure needed for a live-streaming service. Presumably, you’ve already got the live production covered, and since AWS doesn’t offer a player, you have that piece addressed as well. You’ll need a content delivery network to deliver your streaming video, but you can continue to use CloudFront or another CDN. The software that you absolutely have to replace is the live transcoding and packaging component.

Here you have three options: multimedia frameworks, media servers, and “other.” Let’s discuss each in turn.

Multimedia Frameworks

Multimedia frameworks are software libraries, tools, and APIs that provide a set of functionalities and capabilities for multimedia processing, manipulation, and streaming. The best-known framework is FFmpeg, followed by GStreamer and GPAC, and all three are available as open source.

Figure 1. Netflix uses GPAC for its packaging, a significant technology endorsement for GPAC and for multimedia frameworks in general.

Multimedia frameworks excel in projects at both ends of the complexity spectrum. For simple projects, like transcoding an input stream to an encoding ladder, you can create a script that inputs the stream, transcodes, and hands the packaged output streams off to a CDN in a matter of minutes. You can use the script to process thousands of simultaneous jobs, all at no charge.
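
To give a sense of how little scripting a simple ladder takes, here’s a minimal sketch in Python that assembles an FFmpeg command line for a three-rung H.264 ladder packaged as HLS. The rung values, the input URL, and the output path are placeholders I made up for illustration, and the command uses the software x264 encoder rather than any particular hardware transcoder. Treat it as a starting point under those assumptions, not a production recipe.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    """One rung of a hypothetical encoding ladder."""
    height: int       # output height in pixels
    video_kbps: int   # target video bitrate in kbps


# Illustrative values only; pick rungs that suit your own content.
LADDER = (Rung(1080, 5000), Rung(720, 3000), Rung(360, 800))


def build_ffmpeg_command(input_url, out_dir, ladder=LADDER):
    """Return an ffmpeg argument list that turns one input into an HLS ladder."""
    count = len(ladder)
    # Decode once, split the video, and scale each copy to its rung height.
    split = f"[0:v]split={count}" + "".join(f"[v{i}]" for i in range(count))
    scales = [f"[v{i}]scale=-2:{r.height}[out{i}]" for i, r in enumerate(ladder)]
    cmd = ["ffmpeg", "-i", input_url, "-filter_complex", ";".join([split, *scales])]

    for i, rung in enumerate(ladder):
        cmd += [
            "-map", f"[out{i}]", "-map", "0:a:0",
            f"-c:v:{i}", "libx264",
            f"-b:v:{i}", f"{rung.video_kbps}k",
            f"-c:a:{i}", "aac", f"-b:a:{i}", "128k",
        ]

    # Fixed 2-second GOP (assumes a 30 fps source) so segments align across rungs.
    cmd += ["-g", "60", "-sc_threshold", "0"]
    # Pair each video rendition with its audio copy in the master playlist.
    stream_map = " ".join(f"v:{i},a:{i}" for i in range(count))
    cmd += [
        "-f", "hls", "-hls_time", "6",
        "-master_pl_name", "master.m3u8",
        "-var_stream_map", stream_map,
        f"{out_dir}/stream_%v.m3u8",
    ]
    return cmd


if __name__ == "__main__":
    # Print the command rather than run it; the URL and path are placeholders.
    print(shlex.join(build_ffmpeg_command("rtmp://localhost/live/input", "/var/www/hls")))
```

The function only builds the argument list, so you can inspect or log the command before handing it to a process runner. Swapping in a hardware encoder would mean changing the codec name and its options, which depend on the vendor’s FFmpeg integration.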

At the other end of the spectrum, these frameworks also excel at complex jobs with idiosyncratic custom requirements that likely aren’t available in a server or commercial software product. The development, maintenance, and modification costs are considerable, but you get maximum feature flexibility if you’re willing to pay that cost.

What you don’t get with these tools is a user interface or simple configuration options – you start with a blank slate and must program in all desired features. What could be as simple as checking a checkbox in a streaming media server could require dozens or even thousands of lines of code in a multimedia framework.

Which takes us to streaming media servers.

Streaming Media Servers

The next category is streaming media servers, which includes Wowza Streaming Engine, Nimble Streamer, and two open-source servers, Red5 and Ant Media Server. These servers tend to excel for most productions in the middle of the complexity spectrum and offer multiple advantages over multimedia frameworks.

There are several reasons why you might choose to use a streaming server over a multimedia framework, including a simplified setup and configuration. Most streaming servers provide out-of-the-box streaming solutions with pre-configured settings and management interfaces that simplify the setup and configuration process. Not all offer GUIs, but those that don’t typically offer simple option selection through configuration files.

Figure 2. Wowza Streaming Engine is a highly regarded streaming server

As mentioned above, streaming servers often offer simpler access to advanced features that you’d have to craft by hand with a multimedia framework. They also offer better integration with third-party services like digital rights management (DRM) and content delivery networks. Between the simplified setup, easier access to features, and improved integration with other services, packaged servers can dramatically accelerate getting your live streaming service up and running.

Once you’re operational, you’ll appreciate management interfaces that monitor the health and performance of your streaming infrastructure, track viewer analytics, manage streaming workflows, and make real-time adjustments. If you’re in a dynamic demand environment, some streaming servers offer built-in scalability features and load balancing to manage the load over multiple hardware transcoding resources. You’d have to build all that by hand or with plug-ins if using a multimedia framework.

The two potential downsides of streaming servers are cost and customizability. You’ll have to pay a monthly fee for some versions of these servers, and you may find it complicated or nearly impossible to add what you might consider to be essential features.

Other Streaming-Capable Programs

Most companies building their own live-streaming infrastructures will implement either a multimedia framework or a streaming server, but there are other programs that incorporate the core encoding and packaging functions. One such program is Norsk from id3as. Norsk bills itself as “an SDK that enables developers to easily create amazing, dynamic live video workflows and deploy them at any scale.” As such, it combines both video production and streaming server-related functions.

You see this in Figure 3. The top portion shows that Norsk supports the typical codecs and packaging formats deployed by live-streaming producers. At the bottom of the figure, you see that Norsk also offers production-oriented features like multiple camera support, graphics and overlays, and transitions.

Figure 3. Norsk offers both production and server-related functions.

Interestingly, Norsk doesn’t have a GUI, instead offering a high-level API to simplify configuration and operation, with a Workflow Visualizer component to view the running state of the application. In this fashion, Norsk attempts to provide the configurability of multimedia frameworks with the ease of operation of scripting-driven streaming media servers.

Finding a program like Norsk that combines transcoding and packaging with other essential streaming-related functions makes a lot of sense; there’s one less vendor to onboard and one less product to learn and support. As remote production becomes more common, we expect more programs like Norsk to become available.

Those are your high-level options. If you’re interested in learning more about these and other programs that can drive encoding and packaging for your live transcoder, you should plan to attend our upcoming symposium; details will be available in the next couple of weeks.

What Can a VPU Do for You?

For Cloud-Gaming, a VPU can deliver 200 simultaneous 720p30 game sessions from a single 2RU server.

When you encode using a Video Processing Unit (VPU) rather than the built-in GPU encoder, you will decrease your cost per concurrent user (CCU) by 90%, enabling profitability at a much lower subscription price. How is this technically feasible? Two technology enablers make this possible. First, extraordinarily capable encoding hardware, known as a VPU (video processing unit), dedicated to the task of high-quality video encoding and processing. And second, peer-to-peer direct memory access (DMA) that enables video frames to be delivered at the speed of memory compared to the much slower NVMe bus between the GPU and VPU. Let’s discuss these in reverse order.

Peer-to-Peer Direct Memory Access (DMA)

Within a cloud gaming architecture, the primary role of the GPU is to render frames from the game engine output. These frames are then encoded into a standard codec that is easily decoded on a wide cross section of devices. Generally this is H.264 or HEVC, though AV1 is becoming of interest to those with a broader Android user base. Encoding on the GPU is efficient from a data transfer standpoint because the rendering and encoding occur on the same silicon die; there’s no transfer of the rendered YUV frame to a separate transcoder over the slower PCIe or NVMe buses. However, since encoding requires substantial GPU resources, this dramatically reduces the overall throughput of the system. Interestingly, it’s the encoder that is often at full capacity, and thus the bottleneck, not the rendering engine. Modern GPUs are built for general-purpose graphical operations, so more silicon real estate is devoted to these than to video encoding.

With a dedicated video encoder installed in the system and traditional data transfer techniques, the host CPU can easily manage the transfer of the YUV frame from the GPU to the transcoder. But as the number of concurrent game sessions increases, the probability of dropped frames or corrupted data makes this technique unusable.

NETINT, working with AMD, enabled peer-to-peer direct memory access (DMA) to overcome this problem. Peer-to-peer DMA lets devices within a system exchange data in memory directly, so the GPU can send frames straight to the VPU, which prevents the bus from becoming congested as the concurrent session count increases above 48 720p streams.

The Benefits of Peer-to-Peer DMA

Peer-to-peer DMA delivers multiple benefits. First, by eliminating the need for CPU involvement in data transfers, peer-to-peer DMA significantly reduces latency, which translates to a more responsive and immersive gaming experience for end-users. NETINT VPUs feature latencies as low as 8ms in fully loaded and sustained operation.

In addition, peer-to-peer DMA relieves the CPU of the burden of managing inter-device data transfers. This frees up valuable CPU cycles, allowing the CPU to focus on other critical tasks, such as game logic and physics calculations, optimizing overall system performance and producing a smoother gaming experience.

By leveraging peer-to-peer communications, data can be transferred at greater speeds and efficiency than CPU-managed transfers. This improves productivity and scalability for cloud gaming production workflows.

These factors combine to produce higher throughput without the need for additional costly resources. This cost-effectiveness translates to improved return on investment (ROI) and a major competitive advantage.

Extraordinarily Capable VPUs

Peer-to-peer DMA has little value unless the encoding hardware is equally capable. With NETINT VPUs, it is.

The reference system that produces 200 720p30 cloud gaming sessions is built on the Supermicro AS-2015CS-TNR server platform with a single GPU and two Quadra T2A VPUs. This server supports AV1, HEVC, and H.264 video game streaming at up to 8K and 60fps, though as may be predicted, the simultaneous stream counts will be reduced as you increase framerate or resolution.

Quadra T2A is the most capable of the Quadra VPU line, the world’s first dedicated hardware to support AV1. With its embedded AI and 2D engines, the Quadra T2A can support AI-enhanced video encoding, region of interest, and content-adaptive encoding. Quadra T2A, coupled with a P2P DMA-enabled GPU, allows cloud gaming providers to achieve unprecedented throughput with ultra-low latency.

Quadra T2A is an AIC (HHHL, half-height half-length) form-factor video processing unit with two Codensity G5 ASICs. It operates in x86 or Arm-based servers and requires just 40 watts at maximum load, enabling cloud gaming platforms to transition from software or GPU-only encoding with up to a 40x reduction in the total cost of ownership.

What Can A VPU Do For You?

It makes Cloud Gaming profitable, finally.

Peer-to-peer DMA is a game-changing technology that reduces latency and increases system throughput. When paired with an extraordinarily capable VPU like the NETINT Quadra T2A, now you can deliver an immersive gaming experience at a CCU that cannot be matched by any competing architecture.

Key Cloud Gaming Concepts with Blacknut’s Olivier Avaro

Recently, our Mark Donnigan interviewed Olivier Avaro, the CEO of Blacknut, the world’s leading pure-player cloud gaming service. As an emerging market, cloud gaming is new to many, and the interview covered a comprehensive range of topics with clarity and conciseness. For this reason, we decided to summarize some of the key concepts and include them in this post. If you’d like to listen to the complete interview, and we recommend you do, click here. Otherwise, you can read a lightly edited summary of the key topics below.

For perspective, Avaro founded Blacknut in 2016, and the company offers consumers over seven hundred premium titles for a monthly subscription, with service available across Europe, Asia, and North America on a wide range of devices, including mobiles, set-top-boxes, and Smart TVs. Blacknut also distributes through ISPs, device manufacturers, OTT services, and media companies, offering a turnkey service, including infrastructure and games that allow businesses to instantly offer their own cloud gaming service.

Cloud Gaming Primer - the key points covered in the interview

The basic cloud gaming architecture is simple.

The architecture of cloud gaming is simple. You take games, you put them on the server in the cloud, and you virtualize and stream it in the form of a video stream so that you don’t have to download the game on the client side. When you interact with the game, you send a command back to the server, and you interact with the game this way.

Of course, bandwidth needs to be sufficient, let’s say six megabits per second. Latency needs to be good, let’s say less than 80 milliseconds. And, of course, you need to have the right infrastructure on the server that can run games. This means a mixture of CPU, GPU, storage, and all this needs to work well.

But cost control is key.

We passed the technology inflection point where the service actually becomes feasible. It is technically feasible, and the experience is good enough for the mass market. Now, the issue is the unit economics: how much it costs to stream and deliver games in an efficient manner so that it is affordable for the mass market.

Public Cloud is great for proof of concept.

We started deploying the service based on the public cloud because this allowed us to test the different metrics, how people were playing the service, and how many hours. And this was actually very fast to launch and to scale…That’s great, but they are quite expensive.

But you need your own infrastructure to become profitable.

So, to optimize the economics, we built what we call the hybrid cloud for cloud gaming, which is a combination of both the public cloud and private cloud. So, we must install our own servers based on GPUs, CPUs, and so on so we can improve the overall performance and the unit economics of the system.

Cost per concurrent user (CCU) is the key metric.

The ultimate measure is the cost per concurrent user that you can get on a specific bill of material. If you have a CPU plus GPU architecture, the game is going to slice the GPU in different pieces in a more dynamic manner and in a more appropriate manner so that you can run different games and as many games as possible.

GPU-only architectures deliver high CCUs, which decreases profitability.

There are some limits on how much you can slice the GPU and still be efficient and so there are some limits in this architecture because it all relies on the GPU. We are investigating different architectures using a VPU, like NETINT’s, that will offload the GPU of the task of encoding and streaming the video so that we can augment the density.

VPU-augmented architectures decrease CCU by a factor of ten.

I think in terms of some big games, because they rely much more on the GPU, you will probably not augment the density that much. But we think that overall, we can probably gain a factor of ten on the number of games that you can run on this kind of architecture. So, passing from a max of 20, 24 games to running two hundred games on an architecture of this kind.

Which radically increases profitability.

So, augmenting the density by a factor of ten means also, of course, diminishing the cost per CCU by a factor of ten. So, if you pay $1 currently, you will pay ten cents, and that makes a whole difference. Because let’s assume basic gamers will play 10 hours per month or 30 hours per month; if this costs $1 per hour, this is $30, right? If this is ten cents, then costs are from $1 to $3, which I think makes the math work on the subscription, which is between 5 to 15 euros per month.
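
Author’s note: the arithmetic in that quote is easy to reproduce. Here’s a small Python sketch that uses only the numbers from the interview ($1 per hour today, a tenfold density gain, and 10 to 30 hours of play per month). The function and variable names are mine, and the sketch assumes that cost scales inversely with density, as Avaro describes.

```python
def monthly_cost(cost_per_hour, hours_per_month):
    """Streaming cost for one subscriber over a month."""
    return cost_per_hour * hours_per_month


DENSITY_GAIN = 10               # "a factor of ten" from the interview
GPU_ONLY_COST_PER_HOUR = 1.00   # the $1 example from the interview

# Assumption: ten times the sessions on the same hardware means one tenth the cost.
vpu_cost_per_hour = GPU_ONLY_COST_PER_HOUR / DENSITY_GAIN

for hours in (10, 30):
    before = monthly_cost(GPU_ONLY_COST_PER_HOUR, hours)
    after = monthly_cost(vpu_cost_per_hour, hours)
    print(f"{hours} hours/month: ${before:.2f} GPU-only, ${after:.2f} with VPU")
```

At ten cents per hour, the monthly streaming cost lands between $1 and $3 per subscriber, which is the range Avaro compares against a subscription of 5 to 15 euros.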

The secret sauce is peer-to-peer DMA.

[Author’s note: These comments, explaining how NETINT VPUs deliver a 10x performance advantage over GPUs, are from Mark Donnigan.]

For anybody who understands basic server architecture, it’s not difficult to think, wait a second, isn’t there a bottleneck inside the machine? What NETINT did was create peer-to-peer sharing using DMA (Direct Memory Access). So, the GPU will output a rendered frame, and it’s transferred inside memory, so that the VPU can pick that up, encode it, and there’s effectively zero latency because it’s happening in the memory buffer.

5G is key to successful gameplay in emerging markets.

[Back to Olivier] What we’ve been doing with Ericsson is using 5G networks and defining specific characteristics of what is a slice in the 5G network. So, we can tune the 5G network to make it fit for gaming and to optimize the delivery of gaming with 5G.

So, we think that 5G is going to get much faster in those regions where actually the internet is not so great. We’ve been deploying the Blacknut service in Thailand, Singapore, Malaysia, now in the Philippines. And this has allowed us to reach people in regions where there is no cable or bandwidth with fiber.

Latency needs to be eighty milliseconds or less (much less for first-person shooter games).

You can get a reasonably good experience at 80 milliseconds for most games. But for first-person shooter games, you need to be close to frame accuracy, which is very difficult in cloud gaming. You need to go down to thirty milliseconds and lower, right?
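
Author’s note: to see why 30 milliseconds counts as “close to frame accuracy,” it helps to express the latency targets in frames. The sketch below is plain arithmetic on the numbers quoted above; the 60 fps frame rate is my assumption for illustration.

```python
def latency_in_frames(latency_ms, fps):
    """Convert a latency budget in milliseconds into a number of video frames."""
    frame_duration_ms = 1000.0 / fps
    return latency_ms / frame_duration_ms


# Assumed frame rate for illustration; the latency targets come from the interview.
FPS = 60
for target_ms in (80, 30):
    frames = latency_in_frames(target_ms, FPS)
    print(f"{target_ms} ms at {FPS} fps is about {frames:.1f} frames")
```

At 60 fps, a frame lasts roughly 16.7 milliseconds, so an 80-millisecond budget spans nearly five frames while a 30-millisecond budget leaves less than two.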

That’s only feasible with the optimal network infrastructure.

And that’s only feasible if you have a network that allows for it. Because it’s not only about the encoding part, the server side, and the client side; it’s also about where the packets are going through the networks. You need to make sure that there is some form of CDN for cloud gaming in place that makes the experience optimal.

Edge servers reduce latency.

We are putting a server at the edge of the network. So, inside the carrier’s infrastructure, the latency is super optimized. So that’s one thing that is key for the service. We started with a standard architecture, with CPU and GPU. And now, with the current VPU architecture, we are putting whole servers consisting of AMD GPU and NETINT VPU. We build the whole package so that we put this in the infrastructure of the carrier, and we can deploy the Blacknut cloud gaming on top of it.

The best delivery resolution is device dependent.

The question is, again, the cost and the experience. Okay? Streaming 4K on a mobile device does not really make sense. The screen is smaller, so you can stream a smaller resolution and that’s sufficient. On a TV, likely you need to have a bigger resolution. Even if there is a great upscale available on most TV sets, we stream 720p on Samsung devices, and that’s super great, right? But of course, scaling up to 1080p will provide a much better experience. So, on TVs and for the game that requires it, I think we’re indeed streaming the service at about 1080p.

Frame rates must match game speed.

When playing a first-person shooter, if you have the choice and you cannot stream 1080p, you would probably stream 720p at 60 FPS rather than 1080p at 30 FPS. But if you have different games with elaborate textures, where the resolution is more important, then maybe you will actually select 1080p at 30 FPS.

What we build is fully adaptable. Ultimately, you should not forget that there is a network in between. And even if technically you can stream 4K or 8K, the networks may not sustain it. Okay? And then you’ll have a worse experience streaming 4K than at 1080p 60 FPS resolution.
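
Author’s note: the trade-off Avaro describes (frame rate first for fast-paced games, resolution first for detailed ones, all within what the network can sustain) can be sketched in a few lines of Python. This is not Blacknut’s logic. The profile list and, in particular, the bandwidth figure attached to each profile are hypothetical values I chose purely to make the example run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    label: str
    height: int
    fps: int
    mbps: float   # hypothetical bandwidth needed; not from the interview


PROFILES = (
    Profile("1080p60", 1080, 60, 12.0),
    Profile("1080p30", 1080, 30, 8.0),
    Profile("720p60", 720, 60, 6.0),
    Profile("720p30", 720, 30, 4.0),
)


def pick_profile(fast_paced, available_mbps, profiles=PROFILES):
    """Pick the best profile the network can sustain for this kind of game."""
    candidates = [p for p in profiles if p.mbps <= available_mbps]
    if not candidates:
        # Nothing fits, so fall back to the cheapest profile.
        return min(profiles, key=lambda p: p.mbps)
    if fast_paced:
        # First-person shooters and similar: frame rate matters most.
        return max(candidates, key=lambda p: (p.fps, p.height))
    # Detailed, slower games: resolution matters most.
    return max(candidates, key=lambda p: (p.height, p.fps))


print(pick_profile(fast_paced=True, available_mbps=9.0).label)
print(pick_profile(fast_paced=False, available_mbps=9.0).label)
```

With the assumed bitrates and 9 Mbps available, the fast-paced game gets 720p at 60 fps and the slower game gets 1080p at 30 fps, which mirrors the choice described in the interview.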

Revolutionizing Online Media Distribution and Delivery

Advancements in Streaming

Streaming technologies have revolutionized the digital media landscape, transforming how content is distributed and delivered to audiences worldwide. One pioneering figure in this field is Alex Zambelli, whose career at Microsoft has been closely intertwined with the rise of streaming as the dominant digital media distribution method. Zambelli’s work with NBC Sports, particularly during the 2008 Beijing Olympics and subsequent events, was pivotal in advancing online streaming capabilities and earning industry recognition. This article, based on Jan Ozer’s conversation with Alex during Voices of Video, explores Zambelli’s contributions to streaming technologies, the implementation of multi-view camera angles in Sunday Night Football, and key considerations in livestreaming from insights gained during Olympic events.

Evolution of Streaming Technologies

Alex Zambelli’s career at Microsoft has coincided with the transition from physical media to streaming as the dominant method of distributing digital media. Around 2007, streaming started gaining momentum, gradually overtaking CDs, DVDs, and Blu-rays. Zambelli’s focus on streaming technologies led him to work on Microsoft’s Silverlight, a competitor to Flash, which facilitated the creation of rich web experiences and premium media delivery, including digital rights management. This technology was a significant milestone in the evolution of streaming.

Zambelli’s collaboration with NBC Sports began with the 2008 Beijing Olympics, where they aimed to pioneer online streaming of all Olympics content. Initially, they utilized Windows Media and Silverlight, incorporating adaptive streaming capabilities. The subsequent transition to Microsoft’s Smooth Streaming technology for the 2010 Vancouver Olympics marked a significant advancement. This technology offered on-demand and live streams in high definition, providing viewers with an immersive and seamless experience. These groundbreaking endeavors earned Zambelli and the team recognition from the industry, including nominations for Sports Emmys.

Multi-View Camera Angles in Sunday Night Football

The implementation of Smooth Streaming technology played a crucial role in enabling the seamless transition between camera angles in Sunday Night Football broadcasts. By utilizing a single manifest that contained all four camera angles, switching between views became as smooth as switching between bitrates in modern streaming protocols like DASH or HLS. This technology, developed by the broadcast team, allowed viewers to simultaneously watch multiple camera angles, enhancing the overall viewing experience.

Key Considerations in Livestreaming: Insights from Olympic Events

Livestreaming presents unique challenges compared to on-demand streaming due to its real-time nature. Issues such as packet loss, segment loss, blackouts, and ad insertions demand immediate attention and resolution. Unlike on-demand streaming, where there is some leeway to address content or delivery chain issues over time, livestreaming requires constant vigilance. Even a brief interruption or technical problem can significantly impact the viewer experience.

Successful livestreaming events often involve collaborative efforts from multiple companies, including Microsoft, NBC, Akamai, and iStreamPlanet. These events require dedicated teams ready to address and resolve any issues that arise in real time. The nature of livestreaming necessitates a higher level of focus and attention compared to on-demand streaming. It is crucial to prioritize and allocate sufficient resources to ensure the seamless execution of live events. The potential for unexpected issues or failures makes constant monitoring and immediate troubleshooting essential, as even a minor disruption can have significant consequences.

VOICES OF VIDEO
Scalable distribution in the age of DRM: Key Challenges and Implications.
Watch the full conversation on YouTube: https://youtu.be/s_afoa71muM
 

Evolution of Video Codecs and Streaming Protocols

The evolution of video codecs and streaming protocols has played a vital role in shaping the streaming landscape. In the late 2000s, the popular video codecs for streaming were VC-1 (supported by Silverlight) and H.264 (supported by Flash). However, the introduction of HTML5 posed challenges for streaming solutions, as the HTML specification lacked the necessary APIs to provide the required level of control and functionality for streaming.

Silverlight and Flash emerged as proprietary plugins that advanced streaming technology beyond what HTML could offer at the time. They provided opportunities to overcome HTML’s limitations and introduced features such as media stream sources and content protection (DRM) to enhance the streaming experience. Silverlight’s media stream source concept, which later influenced HTML’s media source extensions, allowed developers to handle their own segment downloading and parsing, passing the video and audio streams to a media buffer for decoding and rendering. Content protection was a crucial aspect addressed by Silverlight and Flash, as HTML lacked a robust solution for DRM.

Around 2011-2012, Silverlight and Flash began to be phased out as HTML5 matured, offering the necessary APIs for implementing streaming protocols like DASH, HLS, and Smooth Streaming within the browser while incorporating DRM capabilities. HTML5 overcame initial growing pains and established itself as the predominant platform for streaming. By 2014-2015, HTML5 had evolved sufficiently to support basic streaming functionalities and content protection with DRM.

Optimizing Encoding Quality and Cost

Achieving optimal encoding quality while considering cost is a crucial concern for content creators and distributors. At Warner Bros. Discovery, the x264 and x265 encoders are commonly used for transcoding, employing the slow or slower presets to achieve higher quality outputs. This approach balances encoding cost with desired video quality.

Recent discussions within the organization have prompted exploration into the idea of customizing presets based on specific resolutions and content complexities. The focus is on optimizing encoding efficiency by adjusting presets according to the intricacy of the content and the resolution being processed. Different resolutions have varying encoding requirements, and applying the very slow preset to all resolutions may result in unnecessary computational overhead for lower resolutions. Similarly, content complexity plays a role in determining the appropriate preset, as not all content requires the very slow preset. Customizing presets based on resolution and content characteristics allows for more efficient allocation of computational resources.

The popularity and viewership of specific content also factor into the choice of preset. Content with a larger audience may benefit from the slower preset due to potential CDN savings resulting from improved video quality. On the other hand, smaller-scale content with fewer viewers may not necessitate the same level of complexity in encoding. Balancing encoding quality and cost requires thoughtful consideration of these factors.
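
As a rough illustration of that idea, here is a minimal Python sketch that picks an x264-style preset from the output resolution and the expected audience. The preset names (medium, slow, slower, veryslow) are real x264 presets, but the resolution breakpoints and the viewer threshold are hypothetical values chosen for the example; they are not Warner Bros. Discovery’s rules.

```python
# Real x264 preset names, ordered from fastest to slowest.
PRESETS = ("medium", "slow", "slower", "veryslow")

# Hypothetical threshold: above this audience size, spend more on encoding.
POPULAR_VIEWER_THRESHOLD = 100_000


def choose_preset(height, expected_viewers):
    """Pick a preset from output height and expected audience size."""
    # Lower resolutions gain less from slow presets, so start them faster.
    if height < 720:
        index = 0
    elif height < 1080:
        index = 1
    else:
        index = 2
    # Popular content is delivered many times, so extra encoding effort
    # can pay for itself in CDN savings; step one preset slower.
    if expected_viewers >= POPULAR_VIEWER_THRESHOLD:
        index = min(index + 1, len(PRESETS) - 1)
    return PRESETS[index]


for height, viewers in ((360, 5_000), (1080, 5_000), (1080, 2_000_000)):
    print(height, viewers, choose_preset(height, viewers))
```

A real implementation would presumably also weigh content complexity, which this sketch leaves out because measuring it requires analyzing the source.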

Adaptive Encoding Ladders: Variations, Frame Rates, and Device Considerations

Adaptive encoding ladders play a crucial role in delivering content based on source resolution and frame rate. At Warner Bros. Discovery, these encoding ladders consist of approximately six to eight different variations, allowing flexibility in content delivery. The source resolution determines the stopping point within the UHD ladder, minimizing the need for multiple permutations of the ladders themselves.

Variations in frame rates necessitate different encoding ladders. The introduction of high frame rates, especially with reality TV content, requires separate encoding ladders to preserve the temporal resolution. Encoding ladders also differ for SDR and HDR content, with distinctions made between HDR10 and Dolby Vision Profile 5, offering specific encoding settings for each.

While currently the same encoding ladders are used for all devices, specific subsets of the ladder may be delivered to certain devices to accommodate their capabilities. Device differentiation is particularly important for high frame rates or resolutions above 1080p. By intentionally capping the manifest delivered to devices that cannot handle certain capabilities, compatibility and optimal viewing experiences can be ensured. Differentiating encoding ladders for various devices is essential for maintaining consistent quality across different devices.
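
Capping the manifest can be pictured as a simple filter over the ladder. The sketch below is a minimal Python illustration; the ladder, the device capability fields, and the example device are all hypothetical, and a real service would apply this while generating the HLS or DASH manifest for each playback request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    height: int   # vertical resolution of the rung
    fps: int      # frame rate of the rung


@dataclass(frozen=True)
class DeviceCaps:
    max_height: int   # tallest resolution the device decodes reliably
    max_fps: int      # highest frame rate the device handles


# Hypothetical ladder shared by all devices.
LADDER = (
    Variant(2160, 60),
    Variant(1080, 60),
    Variant(1080, 30),
    Variant(720, 30),
    Variant(360, 30),
)


def variants_for_device(ladder, caps):
    """Return only the rungs this device can play, preserving ladder order."""
    return [
        v for v in ladder
        if v.height <= caps.max_height and v.fps <= caps.max_fps
    ]


# Example: an older device limited to 1080p at 30 fps.
older_device = DeviceCaps(max_height=1080, max_fps=30)
for variant in variants_for_device(LADDER, older_device):
    print(f"{variant.height}p{variant.fps}")
```

Because the ladder itself is unchanged, nothing has to be re-encoded per device; only the list of rungs advertised to the player differs.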

VBR Control, Per-Title Encoding, and DRM Considerations in Video Encoding

Video encoding involves crucial considerations such as VBR control, per-title encoding, and DRM integration. At Warner Bros. Discovery, the x264 and x265 encoders use CRF (Constant Rate Factor) rate control with a bitrate and buffer cap for VBR (Variable Bit Rate) encoding. This approach ensures control over codec levels, peak rates, and overall encoding quality.

VBR control is achieved by using the VBV (Video Buffering Verifier) buffer size and VBV max rate parameters. These parameters cap the bitrate the encoder can sustain over the buffer window, while CRF brings the average bitrate below the specified max rate in most cases. This method enables per-title encoding, achieving CDN savings without compromising quality. Differentiating encoding ladders based on resolutions, frame rates, and HDR formats is essential to conform to content licensing agreements and compatibility requirements.
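
In FFmpeg terms, capped CRF with x264 comes down to three options: -crf for the quality target, and -maxrate and -bufsize for the VBV cap. The Python sketch below assembles those arguments. The specific CRF value, cap, and buffer length are placeholder assumptions for illustration, not the settings used at Warner Bros. Discovery.

```python
def capped_crf_args(crf, max_kbps, buffer_seconds=2.0, preset="slow"):
    """Build ffmpeg video arguments for capped-CRF encoding with libx264."""
    # The VBV buffer is sized as the max rate times the buffer duration.
    bufsize_kbps = int(max_kbps * buffer_seconds)
    return [
        "-c:v", "libx264",
        "-preset", preset,
        "-crf", str(crf),                  # quality target; sets the average
        "-maxrate", f"{max_kbps}k",        # VBV max rate; caps the peaks
        "-bufsize", f"{bufsize_kbps}k",    # VBV buffer size
    ]


# Placeholder values for a hypothetical 1080p rung.
print(" ".join(capped_crf_args(crf=22, max_kbps=6000)))
```

Easy-to-encode titles finish well under the cap because CRF stops spending bits once the quality target is met, while complex titles are held to the max rate. That difference is where the per-title savings come from.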

DRM has a significant impact on the encoding ladder. Licensing agreements often demand different security levels for various resolutions, necessitating the assignment of different encryption keys and playback policies to different security groups. The use of hardware-backed DRM, such as Widevine L1 and PlayReady SL3000, is often required for higher resolutions. The trend in the industry is moving towards increased use of DRM across the entire encoding ladder, with a focus on stricter requirements for HDR content. Content licensing agreements are evolving to require comprehensive DRM implementation for improved content protection.
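To make the idea of security groups concrete, here is a small Python sketch that assigns ladder rungs to groups by resolution, so that each group can receive its own encryption key and playback policy. The thresholds, group names, and the hardware-DRM flags are hypothetical; real values come from the relevant licensing agreements.

```python
# Hypothetical policy table: (max height, group name, hardware DRM required).
POLICY = [
    (576, "SD", False),
    (1080, "HD", True),
    (2160, "UHD", True),
]


def security_group(height):
    """Return (group, requires_hardware_drm) for a rung of this height."""
    for max_height, group, needs_hardware in POLICY:
        if height <= max_height:
            return group, needs_hardware
    raise ValueError(f"no policy covers a height of {height}")


def group_rungs(heights):
    """Group rung heights so each group can be given its own key."""
    groups = {}
    for height in heights:
        group, _ = security_group(height)
        groups.setdefault(group, []).append(height)
    return groups


if __name__ == "__main__":
    print(group_rungs([360, 720, 1080, 2160]))
```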

Exploring Hardware and Software DRM: Implementation and Impact on Video Streaming

The choice between hardware and software DRM implementations has implications for video streaming security and performance. Hardware DRM involves integrating DRM clients into the secure video path of the system, tightly coupling with the hardware decoder. This ensures secure decoding and decryption of video streams, preventing unauthorized access to the content. Hardware-based DRM establishes a secure video path or secure media path, where the decrypted and decoded bits cannot be retrieved or accessed by applications. This level of security is achieved through close integration with the hardware decoder, ensuring protection throughout the entire decoding process.

On the other hand, software DRM performs decoding and decryption in software, introducing a potential vulnerability where the decoded bits could be compromised or accessed by unauthorized parties. Software DRM lacks the same level of hardware integration and security provided by hardware-based DRM.

The limitations of software-based DRM can impact the resolution of premium content when viewing it on certain platforms or browsers without hardware support. For example, Chrome’s support for Widevine DRM is limited to L3, the software-based implementation. This can result in inferior video quality compared to browsers like Edge or Safari, which support hardware DRM, allowing for a more secure video path and higher quality streaming.

Unifying Packaging Formats: HLS, DASH, and CMAF in Video Streaming

Standardizing packaging formats is crucial for compatibility and interoperability in video streaming. Warner Bros. Discovery and Hulu have been utilizing both HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) for content distribution. HLS is predominantly used for Apple devices, while DASH is employed for other devices.

The commonality between HLS and DASH lies in their utilization of the CMAF (Common Media Application Format) standard. CMAF serves as a standardized version of fragmented MP4 (fMP4), specifying the necessary boxes and encryption application for fMP4 media segments used in HLS and DASH. CMAF is not a streaming protocol itself but encompasses two components.

Firstly, it defines a refined version of fMP4 for HLS and DASH, establishing a more precise set of guidelines for compatibility. Many existing HLS and DASH implementations using fMP4 media segments are already CMAF-compliant.

Secondly, CMAF specifies a hypothetical logical media presentation model, outlining the relationship between tracks, segments, fragments, and chunks. This model closely resembles HLS or DASH without explicitly using those terms. It provides a framework for addressing different levels of the media presentation.
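One way to picture this logical model is as nested containers. The Python sketch below is a simplification written for illustration only: the class names mirror the CMAF terms, but the fields and the strict track, segment, fragment, chunk nesting are assumptions, and the specification defines these objects in far more detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    duration_s: float  # smallest unit in this simplified model


@dataclass
class Fragment:
    chunks: List[Chunk] = field(default_factory=list)

    @property
    def duration_s(self):
        return sum(chunk.duration_s for chunk in self.chunks)


@dataclass
class Segment:
    fragments: List[Fragment] = field(default_factory=list)

    @property
    def duration_s(self):
        return sum(fragment.duration_s for fragment in self.fragments)


@dataclass
class Track:
    name: str
    segments: List[Segment] = field(default_factory=list)

    @property
    def duration_s(self):
        return sum(segment.duration_s for segment in self.segments)


if __name__ == "__main__":
    # Two 2-second segments, each one fragment split into four chunks.
    def make_segment():
        return Segment([Fragment([Chunk(0.5) for _ in range(4)])])

    video = Track("video-1080p", [make_segment(), make_segment()])
    print(video.name, video.duration_s, "seconds")
```

An HLS playlist or a DASH manifest then acts as the concrete addressing scheme for the same units.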

HLS and DASH can be considered the physical implementations of the logical media presentation model described by CMAF. The HLS-DASH interoperability specification, CTA-5005, relies heavily on CMAF as a unifying model and describes how both HLS and DASH integrate with it. This unification allows for similar concepts to be described across both formats, enhancing compatibility and simplifying the streaming ecosystem.

Streamlining Content Publishing: The CTA and Device Compatibility

The streaming industry faces challenges related to content publishing and compatibility across diverse platforms and devices. The Consumer Technology Association (CTA) plays a crucial role in addressing these challenges and streamlining content publishing processes. The CTA is actively working to enhance interoperability within the streaming industry, allowing publishers to focus primarily on content development rather than compatibility concerns.

The CTA’s WAVE initiative serves as a platform for fostering efforts to streamline content publishing and compatibility. One major challenge in the streaming landscape is the presence of numerous application development platforms. For example, Warner Bros. Discovery uses roughly a dozen to 16 different application development platforms for its streaming service, with some overlap between certain platforms such as Android TV and Fire TV.

Developers often encounter the unique scenario of building multiple versions of the same application in various programming languages using different platform APIs. This complexity arises due to the diversity of devices and platforms requiring tailored applications. This situation is unparalleled compared to other industries where typically a web app, iOS app, and Android app cover the majority of development needs.

The multitude of application development platforms poses challenges in areas such as encoding and packaging. Determining device capabilities becomes arduous without a standardized specification or set of APIs that can provide consistent and reliable information across different platforms.

The standardization of device media capabilities detection APIs is a crucial step towards enhancing compatibility in the streaming industry. Efforts within the World Wide Web Consortium (W3C) to define these APIs in HTML are underway. However, it is important to note that not all platforms utilize HTML, necessitating the presence of similar APIs across all platforms. Once standardized APIs for media capabilities detection are established, developing a standardized method for signaling these capabilities to servers becomes essential. This facilitates targeting specific devices based on their capabilities and enables actions such as manifest filtering.

Standardization efforts are vital for simplifying content publishing and enhancing compatibility in the streaming industry. By establishing standardized specifications and APIs, the industry can overcome compatibility challenges and streamline the development and distribution of streaming content.

The Leverage Is Imperative

The evolution of streaming technologies has brought about significant advancements in digital media distribution and delivery. Pioneers like Alex Zambelli have played a crucial role in driving innovation and pushing the boundaries of what is possible in online streaming. The implementation of multi-view camera angles, considerations in livestreaming, advancements in video codecs and streaming protocols, and optimization of encoding quality and cost are key areas that shape the streaming landscape. Standardization efforts, hardware and software DRM implementations, and the role of organizations like the CTA further contribute to enhancing compatibility and simplifying content publishing in the streaming industry. As the streaming industry continues to evolve, leveraging these advancements and best practices is imperative to deliver high-quality, seamless streaming experiences to audiences worldwide.

Parking Lot Rules, B-Frames, and Ultra Low-Latency Encoding

One of my sweetest memories of bringing up our two daughters was weekly trips to the grocery store. Each got a $5.00 bribe for accompanying their father, which they happily invested in various tchotchkes that seldom lasted the week. When we exited the car, “parking lot rules” always applied, which meant that each daughter held one of Daddy’s hands for the walk to the store. Two girls, two hands, no running around the busy parking lot.

Parking lot rules came to mind as we debugged a decoding latency issue when testing a new server product. Initial tests revealed a decoding latency of up to 200 milliseconds in some high-volume configurations. Given that the encoding latency was under 20 milliseconds, the decoding numbers were uncomfortably high.

Eliminate B-Frames from the Origination Stream

After raising the issue, our testing team implemented a fix, which dropped latency to under 20 milliseconds, and decreased encoding latency as well. The change is the parking-lot-rules corollary for live streamers, which is “for ultra-low latency, eliminate B-frames from your live streaming workflow.” With H.264, this means using the baseline profile, which eliminates B-frames. With H.265, you’ll have to use a GOP structure that does the same.
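For readers producing the origination stream with FFmpeg's software encoders, the sketch below shows one way to disable B-frames: `-bf 0` for libx264 and `bframes=0` passed through `-x265-params` for libx265. It is an illustration under those assumptions only; hardware encoders, including NETINT's, expose their own options, and the file names are placeholders.

```python
import shlex


def no_bframe_args(src, dst, codec="libx264"):
    """Build an ffmpeg argument list that encodes without B-frames."""
    args = ["ffmpeg", "-i", src, "-c:v", codec]
    if codec == "libx264":
        args += ["-bf", "0"]                       # zero B-frames
    elif codec == "libx265":
        args += ["-x265-params", "bframes=0"]      # zero B-frames
    else:
        raise ValueError(f"unsupported codec in this sketch: {codec}")
    return args + ["-c:a", "copy", dst]


if __name__ == "__main__":
    # Print the commands rather than run them, so ffmpeg is not required.
    print(shlex.join(no_bframe_args("camera.ts", "contribution.ts")))
    print(shlex.join(no_bframe_args("camera.ts", "contribution.ts", "libx265")))
```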

A quick glance at Figure 1 reveals why B-frames blow up decoding latency (shoutout to OTTverse, where we grabbed the image). B-frames, of course, incorporate redundancies from frames before and after the frame being encoded. They are packed and decoded out of order. Any frame decoded out of order adds latency – the further they are out of order, the greater the latency.
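The reordering can be shown with a short Python sketch. It assumes a simplified model in which every run of B-frames depends on the next I- or P-frame in display order, and it ignores hierarchical B-frame structures. The function names are made up for this example.

```python
def decode_order(display_types):
    """Map a display-order string such as "IBBP" to decode order.

    Returns a list of (display_index, frame_type) tuples. Each run of
    B-frames is emitted after the reference frame that follows it.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            order.append((index, frame_type))
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)  # trailing B-frames with no later reference
    return order


def reorder_depth(display_types):
    """Largest number of frames that any frame is sent ahead of its turn."""
    order = decode_order(display_types)
    return max(
        display_index - position
        for position, (display_index, _) in enumerate(order)
    )


if __name__ == "__main__":
    for gop in ("IPPP", "IBPBP", "IBBP"):
        indices = [i for i, _ in decode_order(gop)]
        print(gop, indices, "reorder depth:", reorder_depth(gop))
```

A stream with no B-frames has a reorder depth of zero, so the decoder can output each frame as soon as it arrives.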

Figure 1. B-frames are packed out of order and can increase decode latency.

Will eliminating B-frames (or the Baseline H.264 profile) reduce the quality of the incoming stream? Only minimally, if at all. These streams are typically produced at a relatively high bit rate, so B-frames or higher-quality profiles deliver minimal additional quality. It’s even less likely that any decrease in quality would be noticeable in the output stream (see here).

Let’s pause for a moment and reflect on the bigger picture. Figure 2 shows the typical live streaming workflow. We’ve been talking about B-frames in the on-premises encode impacting the decoding latency in the transcoding server. What about B-frames in the transcoding server when encoding streams for delivery to viewers?

Predictably, the result is the same. B-frames introduce the same latency during encoding for delivery, for the same reason–packing frames out of order introduces delays. This is why, when implementing low-latency mode with the NETINT Quadra Video Processing Unit and T408 transcoder, you must use a GOP preset that encodes with consecutive frames.

When you get things right – incoming streams without B-frames and outgoing streams without B-frames, the results are transformative. Let’s have a look.

True Low Latency Transcoding

Table 1 below shows actual testing results. This use case involves scaling 1080p AVC input down to 720p for delivery, which is common for interactive gaming, auction sites, and conferencing, and the server can produce 320 streams while encoding AVC, HEVC, and AV1. I don’t have the original data for the input file with B-frames, but as I recall, decoder latency averaged 150 – 200 ms, a noticeable break in a live conversation. Even worse, unlike encoder latency, it didn’t drop significantly in low-delay mode.

As you see in the table, after the fix, total latency is around 160 ms for all outputs in normal (latency-tolerant) mode. Working with the input file without B-frames, and outputting streams without B-frames, combined encoder and decoder latency plummets to around 22 ms, well under a single frame (which for 30 fps video takes 33 ms to display). That’s low enough for even the most latency-sensitive applications.

Table 1. Encode/decode latency in normal and low-delay mode
(with a properly formatted input file).
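The comparison against frame duration is simple arithmetic, worked through below in Python using the approximate figures quoted in the text (about 160 ms in normal mode, about 22 ms in low-delay mode, 30 fps video). The helper names are illustrative.

```python
def frame_duration_ms(fps):
    """Time that one frame stays on screen, in milliseconds."""
    return 1000.0 / fps


def latency_in_frames(latency_ms, fps):
    """Express a latency as a number of frame durations."""
    return latency_ms / frame_duration_ms(fps)


if __name__ == "__main__":
    fps = 30
    print(f"frame duration: {frame_duration_ms(fps):.1f} ms")
    for label, latency_ms in (("normal", 160), ("low-delay", 22)):
        frames = latency_in_frames(latency_ms, fps)
        print(f"{label}: {latency_ms} ms = {frames:.2f} frames")
```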

How much will the lack of B-frames impact quality in the output encoding ladder? Once again, B-frames have delivered surprisingly little value in the tests that I’ve performed. You can read a good article on the subject here, and access updated data here (see page 22), which show less than a 1% quality difference between streams with and without B-frames. The bottom line, of course, is that if your application needs ultra-low latency, you have to prioritize that over any potential quality loss, though it’s good to know that few, if any, viewers will notice it.  

Returning to the thoughts that prompted this article, when my daughters have their kids, my enduring wish is that they implement parking lot rules on all relevant shopping trips. Given their progress to date, this may not occur in my lifetime. If you’re a live-streaming engineer, you have no similar excuse to ignore the corollary. If latency is critical, make sure you eliminate B-frames from your live-streaming workflows.

PS. The server referenced is the Quadra Video 100 Server, which combines ten Quadra video processing units (VPUs) with a Supermicro chassis driven by a 32-core CPU. Total cost should be around $20,000 in this configuration. Stay tuned for more details or message us.

Unlocking the Potential of Cloud Gaming with VPUs


In this interview, Olivier Avaro, the CEO of Blacknut, discusses the emergence and potential of cloud gaming. Blacknut aims to bring the joy of gaming to the mass market by offering a large catalog of games through cloud-based distribution. Avaro highlights the maturity of both users and technology, making cloud gaming a feasible and attractive option. The interview explores the transition from physical discs to streaming, the importance of cost-effectiveness in delivery, and the architectural advancements in cloud gaming systems.

Avaro emphasizes the potential of hybrid cloud infrastructure and the role of GPU and VPU in maximizing the number of concurrent players and reducing costs. He acknowledges the challenge of making cloud gaming affordable for a wider range of consumers, including those in emerging markets. However, he emphasizes that the cost of delivering the service can be kept within a reasonable range, with subscription prices ranging from $5 to $15 per month, depending on the economic conditions of the region.

The technical infrastructure of cloud gaming is explored in detail. Avaro explains the basic architecture, where games are stored on cloud servers and streamed to users’ devices, eliminating the need for downloads. The key requirements for a seamless experience include sufficient bandwidth, low latency, and a well-equipped server infrastructure comprising CPUs, GPUs, and storage. Initially deployed on public cloud platforms for scalability, Blacknut has devised a hybrid cloud approach to optimize the economics of the service. This involves the incorporation of private cloud servers, allowing for improved performance and cost efficiency.

The interview addresses an innovative architectural aspect of Blacknut’s system. Avaro discusses the decision to offload video encoding from the GPU to a dedicated video processor unit (VPU) provided by NETINT.

This approach increases the density of concurrent game sessions, enabling up to 200 players on a single server. This breakthrough in density enhances the economic viability of cloud gaming platforms by significantly reducing costs.

These insights offer valuable perspectives on the advancements in cloud gaming, the importance of cost considerations, and the technological infrastructure that underpins its success.

Avaro also addresses challenges related to unstable internet connectivity in certain regions, discussing collaborations with Ericsson to leverage 5G networks and optimize network characteristics for gaming. While geographical limitations exist, Blacknut is actively expanding its presence to provide global access to its gaming service.

Voices of Video - Cloud Gaming being Real

Cloud Gaming being Real. A conversation with the CEO of Blacknut
Watch the full conversation on YouTube: https://youtu.be/w9Pho6G_bdM
 

Mark Donnigan:
So we are at the top of the hour, and looks like we should get started. Olivier, are you ready to talk about cloud gaming?

Olivier Avaro:
Absolutely ready.

Mark Donnigan:
Excellent, excellent. Well, welcome to those who are joining us live. This is the May edition of Voices of Video. And if you haven’t joined us before, Voices of Video is a conversation, or some might say a real dialogue. Not a podcast, I guess a videocast. We go live on LinkedIn and also a lot of other platforms. And we are talking each month with innovators in the video space. And so this month I am super excited to have Olivier Avaro, who is the CEO of a company called Blacknut. And we are talking about cloud gaming. I will let Olivier tell us all about what his company does. But welcome to Voices of Video, Olivier.

Olivier Avaro:
Look, thanks a lot, Mark, for the nice introduction. So my name is Olivier Avaro, I’m the CEO of Blacknut, which in short is doing to games what Spotify did for music, right? So we are distributing games from the cloud, a large catalog of games, more than 700 games so far, and this for a simple subscription fee, right? I was a gamer for a long time. I enjoyed it a lot when I was a teenager. I enjoyed it a lot with friends, with my family, later with my kids. And I started Blacknut in 2016 with the big ambition to actually bring this joy of gaming, this good emotion, and also all the positive value of playing together to the mass market. We developed the tech for about three years. I think cloud gaming does require a bit of technology to work efficiently. Then we started to deploy it all over the world and this is where we are today.

Is the Blacknut CEO a gamer himself?

Mark Donnigan:
I love it. So I have to ask the question, sometimes when we’re building advanced technologies, we get so into the technology, we don’t get to do the thing that we originally set out to do, like play games. So are you still a gamer? Set aside time each day to play?

Olivier Avaro:
I set aside each time to play a little bit. That’s true. And I have to say that I was a… The first game I played was on the Commodore 64 machine, it was named Boulder Dash, right? The older of the audience will know about it. Now I’m still, I’ve been playing with my kid of course on the Wii, all the Nintendo games. And Mario and Super Mario Kart and Super Mario Galaxy, right? And to be truly honest, I’m still playing a bit with my kid, but mostly I’m touching a bit Pokemon Go sometimes to still get a conversation with my wife on gaming.

Mark Donnigan:
That’s good. That’s good. Well, I am really excited for this conversation. And I was just thinking back as I was making some notes for what I thought we should talk about. And in 2007 I had the distinct privilege, and I really do consider it to be a privilege, to be a part of a company, one of the early, early innovators of streaming what we call now OTT, and at the time it was transactional VOD. The company still exists, it’s called Vudu. And we had this crazy idea to take the Blockbuster, those who have been around for a little while will remember Blockbuster video stores in the US. Other countries, they had the equivalent. And eventually I think Blockbuster did expand outside the US. But you’d go to the video store, you’d rent a disc, DVD, and then eventually Blu-ray, and you would drive home so excited for the family to join around the TV and watch it.

And I can remember how shocking it was to have built this amazing experience where every title was in stock. And those of us who remember the video store, remember that that was part of the challenge, on new release day you had to rush down to the store to be the first in line so you could even get the movie, because they only had so many copies. And then of course you had to worry about did I return it, did I return it by the deadline or do I have to pay for a second day. There was a lot about the experience that actually wasn’t so great. And yet we were shocked at how many people said, “Why would I want to stream over the internet? DVD is great. This is amazing. Look at the quality. No one’s going to want to replace the DVD.” Well, 15 years later, obviously that sounds absolutely crazy, as now the entire world is streaming and we can’t even imagine a world without it.

But as I was thinking about cloud gaming, it feels like maybe we’re a little bit further than we were in 2007, but still not everybody’s convinced. And I’m even surprised by major publishers that I’m coming across; it’s not a foregone conclusion that the console is going to be replaced with streaming. And so let’s start there. Olivier, I have to imagine that a lot of what you’re spending time doing, aside from building the technology, is making the case for why internet delivery of a game experience is going to be better and is ultimately better than something that’s installed on a PC, downloaded or a console. So what insights do you have to share about where we are in this transition from consoles and discs to streaming for games?

Olivier Avaro:
And Mark, I think the analogy with Blockbuster is very relevant. And I feel that first, in terms of market maturity for the end user, we are probably at that point where people would question, “Why should I do that? I can download a game, why should I actually stream it? Why do something different?” Right? And when I created Blacknut, actually a person that I highly respect told me, “Wow, people will not use it because they can download it,” right? Now, if you look at where we are right now with people now consuming all the media, like audio and video and music and books in a streaming manner, it seems that definitely having those people accessing games the same way is the right idea or the right next step, right?

And I do think that there is a bit more maturity of people actually willing to access games this way. Now, there has been probably an inflection point in terms of technology maturity. I think the technology, meaning basically the hardware you can have on the cloud, the bandwidth you have available at your home, the kind of device you have to run it and so on, is good enough to provide actually a great experience. And I do think that we are at the time here where we’re passing this inflection point, and that probably years ago it was not sufficient. And we have seen a lot of companies trying to do this, but actually failing and failing really badly. But actually learning a lot from these failures.

So I think we’re at a very exciting time now where we have this maturity in terms of technology. We have the maturity of the end user, because they are used to consuming this kind of media with audio, video, eBooks and so on. So probably they’re craving to get access to games, and more and more people are gaming. And we have also the maturity of the content owner and the publisher. So I think we’re at a very, very good time in the market.

Deliver at ultra low latency. Possible?

Mark Donnigan:
Well, I definitely agree that we are much further advanced than we were. I think of some of the things that we had to do, Vudu in 2007 actually required an appliance, a device with a hard drive in it that we could download the first 30 seconds, maybe a minute of every single title in the library in it. At that time, the library was not as big as what the libraries are today. But just because streaming bandwidth was 768 kilobits. Maybe 1.5 megabits was really fast. If you were really lucky you had 5 megabits. My, how we’ve grown. So it’s definitely we’re in a better position.

Before we get into the technology, because that’s where we’re going to spend the bulk of our time today. But something that I think also you’re in a really good position to address is the cost side. So certainly, we’re at a place today with the cloud that you can deliver anything, really anywhere via the cloud. So the notion that you can do cloud gaming, i.e., it’s possible to deliver an ultra low latency, very high quality experience from the cloud. I don’t think anybody conceivably would say, “Oh, I don’t believe that. That’s not possible.” But there is a real issue of the cost. And so why don’t you address where we’re at in terms of just delivery cost, and I’m speaking of OpEx. Where are we at? I mean, is this possible but not affordable, or is this possible and affordable, even for someone who might not be able to charge their consumer a whole lot of money? Not all markets are the US or Western Europe, or some of these regions where consumers are willing to pay $10, $15, $20 a month.

Olivier Avaro:
No, that really is a key issue, Mark. Because, as you mentioned, I think we passed the technology inflection point where actually the service becomes feasible. Technically feasible, the experience is good. We think it’s good enough for the mass market. I am sure that some people will be unhappy with it. Really, core gamers will say, “Well…”

Mark Donnigan:
Sure.

Olivier Avaro:
Probably the same people that when the DVD came they say, “Well, I still want to listen to my vinyl on my turntable because this is what I’m using to listen to my music. And you will not beat that quality with digital sound.” Right? But for the mass market, I think we got to the point where the feasibility is here. Of course we need good bandwidth, stable, very low jitter, so low variation of the latency. But we are here, right?

Now, the issue is indeed on the unit economics and how much it costs to actually stream and deliver games in an efficient manner, so that it is affordable basically for the mass market. And one thing here is I think the game is not done. Okay? There are some challenges. As you know, the cost of streaming depends on the number of hours per month, let’s say, that you stream. We think that we got at least some maturity where it’s becoming available so that you get to a price point which is what people expect, which is between $5 and $15, depending on the economic conditions of the country. So we think this is realistic. But of course, it depends on the intensity of the player, how much they play. And if you want somehow to really sustain and to have great economics, there is still some improvement to be done. Okay? And I would say we have the baseline architecture that allows the service to be profitable, to make it really work, really scale. There is still some margin of improvement. And we have ways actually to improve these unit economics.

Technical infrastructure

Mark Donnigan:
So you’re saying right now that to the end user, which means that the actual cost to deliver the service has to be less. But to the end user, about $5 a month to $15 a month is a target that is possible to reach?

So $5 a month, even in more emerging markets where maybe subscription prices cannot be what they are say in the US, feels like that’s doable. So that’s actually good to hear. Tell us what is the technical… Let’s talk now about what the technical infrastructure looks like and what it takes to deliver. How have you built your system? And then we will get to the broader architecture of Blacknut and what exactly you’re offering. But let’s start with what is your system built on? What does it look like? What are you deploying? Is this a cloud service? Is it run all on prem?

Olivier Avaro:
So basically, the architecture of cloud gaming is somehow simple. You take games, you put them on the server in the cloud and you’re going basically to virtualize it and stream it in the form of a video stream or in some other format so that you don’t have to download the game on the client side, and you can play it as you are playing a video stream. And when you interact with the game, you send a command back to the server and then you interact with the game this way. And so of course bandwidth needs to be sufficient, let’s say 6 megabits per second. Latency needs to be good, let’s say less than 80 milliseconds. And of course you need to have the right infrastructure on the server that can run games. Now, games mean a mixture of CPU, GPU, storage, and all this needs to work well.

We started deploying the service based on public cloud, because this allowed us to test the different metrics, how people were playing the service, how many hours. And this was actually very fast to launch and to scale. So this is what the public clouds, the hyperscalers, GCP, and so on provide. That’s great, but they are quite expensive as you know. So to optimize the economics, we actually built and invented in Blacknut what we call the hybrid cloud for cloud gaming, which is a combination of both the public cloud and private cloud. So we have to install our own servers based on GPUs, CPUs and so on, either directly in Blacknut or with some partners like Radian Arc so that we can improve the overall performance and the unit economics of the system. That I think allowed us to build a profitable service. I think if you just match basically the public cloud currently, I think this is super hard to get something which is viable. But with this kind of hybrid cloud, I think it’s actually very doable.

Mark Donnigan:
And these are standard x86, commercial, off-the-shelf, Intel, AMD machines. I mean, there’s nothing special required or have you gone to a purpose-built design?

Olivier Avaro:
No, the current design is basically definitely specific for the private cloud, but it’s based on standard x86. And for GPU we use AMD or NVIDIA. Okay? We have a mixture of different providers, but basically this is, I would say, a reasonably standard architecture, with a mix of CPU, GPU and storage.

Cloud gaming use case

Mark Donnigan:
The cloud gaming use case is a primary one and that’s obviously why we got introduced. And you are using NETINT, which we will get to. But kind of the key measure from a technology perspective, and it maps directly back to cost, for a cloud gaming installation is the number of concurrent sessions per server. Obviously, just stands to reason that the more concurrent sessions or players that you can get on a server, well, it’s going to be less expensive to operate and to run. So that’s not too difficult to understand.

One of the things that’s really interesting is, and I’d like for you to talk about this architecture where you have the GPU rendering the game, but you’re actually not doing the video encoding on the GPU. So what does that look like? And also, talk to us about the evolution, because that’s not where you started. And most cloud gaming platforms today are attempting to keep everything on the GPU, which has some advantages, but it has some very distinct disadvantages and trade-offs. And the disadvantage is you just can’t get the density, which means that your cost per stream likely cannot meet that economic bar where you can really affordably deliver to a wider number of players. I.e., you can’t drive your cost down so you have to charge more, and there’s people who will say, “Well that’s too expensive.” But talk to us about this architecture.

Olivier Avaro:
So that’s correct, Mark. I think the ultimate measure is the cost per CCU, right? The cost per concurrent user that you can get on a specific bill of material. If you have a CPU plus GPU architecture, the game is to actually slice the GPU in different pieces in the most dynamic and most appropriate manner so that you can run different games and as many games as possible. Right? So typically if you get on the standard GPU, you can run probably a big game, like a large game, and you can cut the GPU in four pieces. If you run a medium game, you can run it maybe in 6 or 8 pieces. And if you run a smaller game, then maybe you can get to, I don’t know, 20 pieces, right?

There are some limits on how much you can slice the GPU for the GPU to be still efficient. And, for example, NVIDIA essentially allows you to slice one GPU into 24 pieces, but that’s it, right? And so there are some limits in this architecture because it all relies on the GPU. We are indeed investigating different architectures where we are using a VPU, like the video processor NETINT is providing, that will somehow offload from the GPU the task of encoding and streaming the video so that we can augment the density. And we see it, in terms of full architecture, as something which will be a bit more flexible. I think in terms of number of big games, because they rely much more on the GPU, probably you will not augment the density that much. But we think that overall, probably we can gain a factor of 10 on the number of games that you can overall run on this kind of architecture. So passing from a max of 20, 24 games to 10 times that, right? Running 200 games on an architecture of this kind.

Mark Donnigan:
Yeah, that’s really remarkable. And just in case somebody isn’t doing the quick math here, what you’re saying is that with this CPU plus GPU plus VPU architecture, where the VPU is the ASIC-based video encoder, all in the same chassis, so the same server, we’re not talking about different servers, you can get up to 200 game players simultaneously, so concurrent players. Which just radically changes the economics. And in our experience, working with publishers and with cloud gaming platforms, nearly everybody has said that without that, it’s literally not even economical to build the platform. In other words, you end up having to charge your customer so much that, for the experience, it’s not viable.

Oliver Avaro:
That’s correct.

Mark Donnigan:
Yeah, that’s important.

Oliver Avaro:
And for certain categories of games, you can definitely reach this level. So augmenting the density by a factor of 10 also means, of course, diminishing the cost per CCU by a factor of 10. So if you pay $1 currently, you will pay 10 cents, and that makes a whole difference. Because let’s assume basic gamers will play 10 hours per month or 30 hours per month: if this is $1, this is $30, right? If this is 10 cents, then you go to $1 to $3, which I think makes the math work on the subscription, which is between 5 and 15 euros per month.
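The arithmetic in this answer can be checked with a short sketch. It reads the $1 figure as a cost per player-hour, which the conversation implies but does not state outright; the function and variable names are illustrative assumptions, not anything from Blacknut or NETINT.

```python
# Sketch of the cost-per-CCU arithmetic from the conversation.
# Names are illustrative; $1 is read as a cost per player-hour (an assumption).

def monthly_cost_per_player(cost_per_hour: float, hours_per_month: float) -> float:
    """Infrastructure cost to serve one player for a month."""
    return cost_per_hour * hours_per_month

baseline_cost_per_hour = 1.00   # dollars per player-hour in the GPU-only example
density_gain = 10               # the factor-of-10 density gain discussed above
vpu_cost_per_hour = baseline_cost_per_hour / density_gain

for hours in (10, 30):
    before = monthly_cost_per_player(baseline_cost_per_hour, hours)
    after = monthly_cost_per_player(vpu_cost_per_hour, hours)
    print(f"{hours} h/month: ${before:.2f} -> ${after:.2f}")
```

Under that reading, a heavy player at 30 hours per month drops from $30 to $3 of infrastructure cost, which is what brings the figure under a 5 to 15 euro subscription.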

Is hardware super expensive?

Mark Donnigan:
One of the questions that comes up, and I know we’ve had this conversation with you, is how is this possible? Because for anybody who understands basic server architecture, it’s not difficult to think, well, wait a second, isn’t there a bottleneck inside the machine? And this must require a really super hot-rodded machine. So maybe the cost savings is offset by super expensive hardware. And I think it’s important to note that the reason why this is possible is, first of all, the VPU is built on NVMe architecture. So it’s using the exact same storage protocol as your hard drive, as the SSDs that are in the machine. And what we have done, what NETINT has done, is actually create peer-to-peer sharing inside the DMA. So basically the GPU will output a frame, a rendered frame, and it’s transferred literally inside memory, so that the VPU can then pick that up and encode it, and there’s effectively zero latency, or at least the latency is so low because it’s happening in the memory buffer.

And so if anybody’s listening and raising an eyebrow wondering, “Well wait a second, surely there’s a bottleneck.” And especially if you’re talking 60 frame per second, which by the way, our benchmarks are generally always at 60 frames per second. Because unless it’s real casual games, you need that frame rate to really deliver a great experience. Even above resolution in some cases, it’s better to get the frame rate up than to increase the size of the frame.

Oliver Avaro:
Absolutely. Absolutely.

Mark Donnigan:
Yeah. Let me just pause here and say that we would love to have questions. And so feel free, on whatever platform, if you’re on YouTube or LinkedIn or wherever watching us right now, just type in and I will try and pick those up. It looks like we already have one. I think this is actually a really good one. I’m going to pick this up right here. But feel free to enter questions in the chat. So Oliver, the question is, “I live in a country where stable internet is not always available.” And by the way, I would say that this isn’t only a country issue, internet varies, right? And more and more, users don’t want to think about the fact that I’m in a car, I happen to be in an area where there’s great coverage, but seven miles down the road that changes, right? They want to keep playing and keep enjoying this great experience.

So the question is, “I live in a country where stable internet is not always available. How will this affect the gaming experience?” And yeah, I mean, that’s the question. So what’s your experience and how are you guys solving for this?

Oliver Avaro:
You see, in Netflix or Spotify, you can actually buffer content so that even if your bandwidth is a bit clumsy, you can store that content in the CDN and keep the experience good enough, right? Or you can download the video and make it work. So you definitely have some ways to solve that problem in, I would say, cold media, right? Media that you can encode once, then stream later. In games, this is completely different.

Mark Donnigan:
Yeah, you can’t do that.

Oliver Avaro:
Because we have to encode, stream, deliver, and then take the player’s interaction into account right away. So if your bandwidth is not enough, if the quality of the bandwidth is not enough, not only in terms of the size of the bandwidth but also in terms of its characteristics, the latency, how stable this latency is, and so on, then the experience will not be great, right?

So what we’ve been doing with Ericsson, okay, is to use 5G networks and to define the specific characteristics of a slice in the 5G network. So we can tune the 5G network to make it fit for gaming, and basically optimize the delivery of gaming with 5G. So we think that 5G is going to arrive much faster in those regions where the internet is not so great. We’ve been deploying the Blacknut service in Thailand, in Singapore, in Malaysia, now in the Philippines, and so on. And this has allowed us to actually reach people in regions where there is no cable, no fiber bandwidth, and this kind of thing. So look, I’m not going to solve a problem where bandwidth is not available, but maybe bandwidth will come faster with 5G and that could be the solution.

Mark Donnigan:
Yeah, I want to make a comment there, and thank you for the answer. We are seeing, so it’s very interesting, and I’ll use India as an example. So for years in video streaming, the Indian market was used as an example of where it was very difficult to deliver high quality, especially if you wanted to deliver, say, 720p, and for a certain period of time it was almost assumed that 1080p was not even possible. Because the network capacity and the speeds were just so low.

What has happened is, and India’s a great case study here, but it’s really almost all regions of the world, as these infrastructures, these wireless infrastructures have been upgraded, they leapfrogged literally from 3G or in some cases even 2.5G and before, and just went all the way to 5G. And so in the last five years there has been such a fundamental shift in bandwidth availability that in some of these regions of the world, not only is it definitely no longer true that they’re slow, they’re faster than some of the more developed countries. So I do want to make that statement there. One question, Oliver: can you talk about whether this is WebRTC? What protocols are you using? There’s a lot of talk right now about QUIC. And I think that would be interesting for some of the listeners who might be wondering what protocols you’re using.

Oliver Avaro:
So we use standard codecs, to start with the bottom line. We have not invented our own codecs. We have been in the audio and video standardization industry for quite some years, and I think you have great experts there doing great technology. And this technology is actually embedded into the chipset, into the hardware, so you can rely on hardware encoding and decoding capabilities. So we do think standard codecs are basically a must-have, right? Of course you need to configure them the right way because you have to encode in real time. Okay? So you cannot use particular techniques that wait for a couple of frames or more, so you have to optimize this. But basically we use standard codecs.

Then on the protocols on top of this, we actually have a large variety of protocols. It depends on the device to which you are streaming. So it can go from a fully proprietary protocol that we have invented and patented at Blacknut, to standard WebRTC. Okay? So if you look at devices like Samsung and LG, which are basically the top manufacturers, I think the service has been launched on LG. We are going to announce, I think, our launch with Samsung in a very short time. And these devices support WebRTC, and that basically is the only way to implement and support the cloud gaming solution efficiently. So short answer, we use a wide range of protocols, always the one that is the most appropriate and provides the best experience to the end user. We’re of course looking at new protocols and new standards and experimenting with them. But I would say for the mainstream solution, we use our own solution plus WebRTC. It’s the only… that they’re there.

The end-to-end latency targets

Mark Donnigan:
The end-to-end latency targets, I think previously you made the comment about 80 milliseconds. But give us some guidelines, what is, obviously the answer is as low as possible, but what’s the upper limit where the game experience just falls apart? It’s just not playable?

Oliver Avaro:
You know that the limit for conventional video is about 150 milliseconds. For playing games, this is much lower, probably half of it. So I think you can get a reasonably good experience at 80 milliseconds for most games that do not require this kind of fast reaction. But then if you want to go to FPS or this kind of thing, which really needs to be reactive at nearly frame accuracy, which is of course very difficult in cloud gaming, you need to go down to 30 milliseconds and lower, right? And then I think it’s only feasible if you have a network that allows for it. Because it’s not only about the encoding part, the server side and the client side, it’s also about where the packets are going through the networks. Okay?

Because you can have the most efficient systems in terms of encoding latency and decoding latency, but if your packets, instead of going directly from the server to the end user, go here and there and transit through many places, then your experience will be crappy. And Mark, this is actually a real issue, because we, for example, had a great demonstration with Ericsson in Barcelona at the Mobile World Congress. And we had servers in Madrid, but when we ran the first test, we discovered that the packets were going from Madrid to Paris, and back to Barcelona, right? So this needs a bit of intelligence and technology to make this connection as efficient as possible.
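To put rough numbers on the Madrid, Paris, Barcelona detour, the sketch below compares great-circle distances. The coordinates are approximate city centres, and the figure of about 200 km per millisecond for light in optical fiber is a common rule of thumb, used here as an assumption. Real fiber routes are longer than great circles, so treat the result as a lower bound.

```python
import math

# Approximate city-centre coordinates (latitude, longitude) in degrees.
CITIES = {
    "Madrid": (40.4168, -3.7038),
    "Paris": (48.8566, 2.3522),
    "Barcelona": (41.3874, 2.1686),
}
EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # assumption: light covers roughly 200 km per ms in fiber

def great_circle_km(a: str, b: str) -> float:
    """Haversine distance between two named cities."""
    lat1, lon1 = map(math.radians, CITIES[a])
    lat2, lon2 = map(math.radians, CITIES[b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

direct = great_circle_km("Madrid", "Barcelona")
detour = great_circle_km("Madrid", "Paris") + great_circle_km("Paris", "Barcelona")
extra_ms = (detour - direct) / FIBER_KM_PER_MS  # one-way propagation only

print(f"direct {direct:.0f} km, via Paris {detour:.0f} km, "
      f"extra one-way delay of at least {extra_ms:.1f} ms")
```

The detour more than triples the path length, and that extra delay is paid in both directions before any router queuing is counted.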

Mark Donnigan:
Tell us about Blacknut, what exactly you guys deliver?

Oliver Avaro:
We provide basically a cloud gaming service, which, let’s say, we categorize as games as a service. Okay? This means that for a subscription fee per month you get access to the real stuff. You get access to 700 games. We are adding 10 to 15 new games per month, which is I think the fastest pace of catalog growth on the market. And we provide this experience on every single device that can actually receive a video. Okay? So that’s what we do. And we distribute this service either B2C, so direct to the consumer. So if you go on the Blacknut webpage, you can subscribe, you can access the games. But we also distribute it through carriers, so telecommunication carriers, operators all over the world. We currently have about 20 agreements with carriers that are actually live. More than 40 are signed, and we are signing and delivering one to two new carriers per month. So that’s the pace where we are at Blacknut. And the choice to use carriers here is for the reason I explained to you, that it’s good to have.

Mark Donnigan:
Optimization of the network.

Oliver Avaro:
You need to know where the packets are going. You need to make sure that there is some form of CDN for cloud gaming that is in place here that makes the experience optimal.

Mark Donnigan:
Yeah, it completely makes sense to me, especially because you mentioned the 5G optimization. And obviously carriers, yeah, they’ve been investing now for years in building out their 5G networks. But they’re always looking for reasons to drive more value and to really extract the full potential off the 5G or out of the 5G investment. So yeah, it really makes sense.

Oliver Avaro:
That’s the kind of thing we’re doing as well with our partner Radian Arc, and we are putting a server at the edge of the network. So inside the carrier’s infrastructure so that the latency is really super optimized. So that’s one thing that is key for the service.

The architecture

Mark Donnigan:
What is the architecture of that edge server? What’s in it? What CPU, GPU, VPU. Describe that.

Oliver Avaro:
We started with a standard architecture, with CPU and GPU. And now with the current VPU architecture, we are actually putting in whole servers consisting of an AMD GPU and a NETINT VPU. And basically we build the whole package so that we put this in the infrastructure of the carrier and we can deploy the Blacknut cloud gaming on top of it.

Mark Donnigan:
And are you delivering to only a handful of fixed resolutions? If I was on a TV for example, do I get 4K or do you limit to 1080p or how do you handle that?

Oliver Avaro:
Again, great question. Okay? We can actually handle multiple resolutions. I think we can stream from 720p up to 4K. The technology basically has no limits for it, right? And streaming 4K or even 8K is a problem that has somehow been solved already, from a technical standpoint. The question is, again, the cost and the experience. Okay? Streaming 4K on a mobile device does not really make sense. I think the screen is a bit small, so you can stream a smaller resolution and that’s sufficient. On a TV you likely need to have a bigger resolution. Even so, there is great upscaling available on most TV sets; we stream 720p on Samsung devices and that’s super great, right? But of course scaling up to 1080p will provide a much better experience. So on TVs and for the games that require it, I think we’re indeed streaming the service at about 1080p.

Mark Donnigan:
Do you also find that frame rate is almost more important than resolution?

Oliver Avaro:
For certain games, absolutely. But again, it is game dependent. Of course-

Mark Donnigan:
It’s game, yeah.

Oliver Avaro:
If you are on a FPS, you probably, if you have the choice and you cannot stream 1080p, you would probably stream 720p at 60 FPS rather than 1080p 30 FPS, right?

Mark Donnigan:
Yes.

Oliver Avaro:
If you have to make some trade-off. But if you have different games where the textures, the resolution is more important, then maybe you will actually select 1080p at 30 FPS instead. And what we build is actually fully adaptable. Ultimately, you should not forget that there is a network in between. And even if technically you can stream 4K or 8K, the network may not sustain it. Okay? And then you’ll actually have a worse experience streaming 4K than 1080p at 60 FPS.
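One way to see why 720p60 and 1080p30 are natural alternatives is to compare raw pixel throughput. The sketch below is only a rough proxy for encoder and network load: it ignores codec efficiency and content complexity, and the function name is illustrative.

```python
# Raw pixel throughput for the two options mentioned in the conversation.
# This ignores codec efficiency, so it is a rough proxy for load only.

def pixels_per_second(width: int, height: int, fps: int) -> int:
    return width * height * fps

options = {
    "720p60": pixels_per_second(1280, 720, 60),
    "1080p30": pixels_per_second(1920, 1080, 30),
}
for name, rate in options.items():
    print(f"{name}: {rate / 1e6:.1f} Mpixels/s")
```

The two land within about 12 percent of each other, so the choice is mostly about whether a given game benefits more from motion smoothness or from spatial detail.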

Gaming anywhere where you live?

Mark Donnigan:
Okay. I see a question just came in and it is how do we know where the service is available or is it available anywhere you live? And so I think you can answer that question, but why don’t you also explain are there geographical limitations? Is your content available anywhere? And then as an extension, I don’t think you actually talked about how many publishers you have. You did talk about every month you’re onboarding I think 10 or 12 new games. But yeah, so are there geographical restrictions? How can someone access this?

Oliver Avaro:
Great. Let’s start with content. Okay? Indeed, we have more than 700 games right now, with 10 to 15 new games per month. And we actually try not to have geographical limitations on the content. Okay? So the content we have in the catalog is, from a licensing point of view, available worldwide. So that’s basically what we do. And we do have exceptions, as usual. But basically, a large part of the catalog is available worldwide. Now, deploying this catalog in different regions: we are available in more than 45 countries. We definitely need to have servers that are close enough to the end user so that the streaming experience is good enough. And we think that a radius of between 750 and 1,500 kilometers is probably the maximum. So I think we will actually put some points of presence in those geographical areas so that the latency, limited by the speed of light, does not harm the service.

So of course if you look at it, we have Europe very much covered. We have the US and Canada very much covered. We have a large portion of Southeast Asia, Korea, and Japan very much covered. We are now expanding in Latin America, which is a bit harder. We have a strong presence now as well in the Middle East, with partners like STC in the region. And of course we have some zones that are less covered. Africa is not well covered at all. South Africa is, but basically the rest of Africa is a bit harder to reach.
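The 750 to 1,500 kilometer radius mentioned above can be turned into a best-case propagation delay. The sketch assumes a straight fiber path and about 200 km per millisecond for light in fiber, both simplifying assumptions; real routes add distance and switching delay.

```python
FIBER_KM_PER_MS = 200.0  # assumption: roughly two-thirds of c in optical fiber

def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over a straight fiber path."""
    return 2 * distance_km / FIBER_KM_PER_MS

for radius_km in (750, 1500):
    print(f"{radius_km} km -> {round_trip_ms(radius_km):.1f} ms round trip")
```

Against the 30 millisecond target quoted earlier for fast-reaction games, a server 1,500 km away spends about half the budget on propagation alone, before encoding, decoding, and routing.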

Mark Donnigan:
By the way, what is the website? Why don’t you give out the URL there?

Oliver Avaro:
www.blacknut.com
I think try the service. We’ll be very happy to support and give feedback. I’m very interested in the feedback as well.

Mark Donnigan:
It’s super exciting. And as I said in the beginning, for me personally, having been really in the very early stages of the transition from physical entertainment delivery, I’m talking about movies specifically, like DVDs, to streaming. I’m just super excited to also now, 15 years later, be there with games. And there’s a lot of work to be done. And as you pointed out, the experience is absolutely not exactly mapped. We can’t throw out the console yet. But the opportunity to bring really the gaming experience to a much wider audience is really enabled with streaming. So by the way, so I think there’s a follow on question here. Do you have infrastructure in South Africa? You mentioned Africa’s not covered as well, but…

Oliver Avaro:
Yes, we do have the capacity to deploy the service in South Africa, absolutely.

Mark Donnigan:
To deploy in South Africa. Okay, great. Great. Well, we’re right up against time and thank you for everyone who joined us live. Really appreciate it. And thank you, Oliver. It’s amazing what you’ve built. And we’re super excited to be working with Blacknut.

Oliver Avaro:
Thank you everyone. Thanks, Mark.

Video Transcoder vs. Video Processing Unit (VPU)

When choosing a product for live stream processing, half the battle is knowing what to search for. Do you want a live transcoder, a video processing unit (VPU), a video coding unit (VCU), Scalable Video Processor (SVP) or something else? If you’re not quite sure what these terms mean and how they relate, this short article will educate you in four minutes or less.  

In the Beginning, There Were Transcoders

Simply stated, a transcoder is any technology, software or hardware, that can input a compressed stream (decode) and output a compressed stream (encode). FFmpeg is a transcoder, and for video-on-demand applications, it works fine in most low-volume applications.
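As a minimal illustration of the decode-then-encode definition above, the sketch below builds a generic FFmpeg command line from Python. The file names and the 3M bitrate are placeholders, and the helper function is hypothetical. The flags are standard FFmpeg options.

```python
import shlex

def build_transcode_command(src: str, dst: str, bitrate: str = "3M") -> list[str]:
    """Return an FFmpeg argument list that re-encodes video to H.264."""
    return [
        "ffmpeg",
        "-i", src,              # compressed input, which FFmpeg decodes
        "-c:v", "libx264",      # re-encode the video with the x264 encoder
        "-b:v", bitrate,        # target video bitrate (placeholder value)
        "-c:a", "copy",         # pass the audio stream through untouched
        dst,
    ]

cmd = build_transcode_command("input.mp4", "output.mp4")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True), with FFmpeg installed.
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.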

For live applications, particularly high-volume live interactive applications (think Twitch), you’ll probably need a hardware transcoder to achieve the necessary cost per stream (CAPEX), operating cost per stream (OPEX), and density.

For example, the NETINT Video Transcoding Server, a single 1RU server with ten NETINT T408 Video Transcoders, can deliver up to 80 H.264/HEVC 1080p30 streams while drawing under 250 watts. Performed in software using only the CPU, this same output could take up to ten separate 1RU servers, each drawing well over 250 watts.
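The power comparison above can be expressed per stream. The sketch uses only the figures quoted in the paragraph and treats them as bounds, since the text says "under" and "well over" 250 watts; the variable names are illustrative.

```python
# Power per 1080p30 stream, using the figures quoted above as bounds.
streams = 80                    # 1RU server with ten T408s
asic_server_watts = 250         # "under 250 watts", so this is a ceiling
software_servers = 10           # "up to ten" CPU-only 1RU servers
software_server_watts = 250     # "well over 250 watts" each, so this is a floor

asic_watts_per_stream = asic_server_watts / streams
software_watts_per_stream = software_servers * software_server_watts / streams

print(f"ASIC server: under {asic_watts_per_stream:.1f} W per stream")
print(f"CPU-only, ten servers: over {software_watts_per_stream:.1f} W per stream")
```

With ten software servers, the per-stream power is at least ten times higher, which compounds into rack space and cooling as well.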

Netint Codensity, ASIC-based T408 Video Transcoder
The NETINT T408 Video Transcoder.

Speaking of the T408, if Websters defined a transcoder (it doesn’t), it might have a picture of the T408 as the perfect example of a transcoder. Based on custom transcoding ASICs, the T408 is inexpensive ($400), capable (4K @ 60 FPS or 4x 1080p60 streams), flexible (H.264 and HEVC), and exceptionally efficient (only 7 watts).

What doesn’t the T408 do? Well, that leads us to the difference between a transcoder and a VPU.

The difference between a transcoder and a Video Processing Unit (VPU)

First, the T408 doesn’t scale video. If you’re building a full encoding ladder from a high-resolution source, all the scaling for the lower rungs is performed by the host CPU. In addition, the T408 doesn’t perform overlay in hardware. So, if you insert a logo or other bug over your videos, again, the CPU does the heavy lifting.

Finally, the T408 was launched in 2019, the first ASIC-based transcoder to ship in quite a long time. So, it’s not surprising that it doesn’t incorporate any artificial intelligence processing capabilities.

What is a Video Processing Unit (VPU)?

What’s a Video Processing Unit? A hardware device that does all that extra stuff, scaling, overlay, and AI. You see this in the transcoding pipeline shown below, which is for the NETINT Quadra.

When it came to labeling the Quadra, you see the problem: it does much more than a video transcoder. Not only does it outperform the T408 by a factor of four, it adds AV1 output and all the additional hardware functionality. It’s much more than a simple video transcoder; it’s a video processing unit (VPU).

As much as we’d like to lay claim to the acronym, it actually existed before we applied it to the Quadra. It’s not surprising. It follows the terminology for CPU (central processing unit) and GPU (graphics processing unit). And, if Websters defined VPU (it doesn’t). Oh, you get the point. Here’s the required Quadra glamour shot.

Netint Codensity, ASIC-based Quadra T1A Video Processing Unit
The NETINT Quadra Video Processing Unit.

VCUs and M(SVP)

While NETINT was busy developing ASIC-based transcoders and VPUs for the mass market, large video publishers like YouTube and Meta produced their own ASICs to achieve similar benefits (and produce more acronyms). In 2021, when Google shipped their own ASIC-based transcoder called Argos, they labeled it a Video Coding Unit, or VCU.

Like the T408 and Quadra, the benefits of this ASIC-based technology are profound; as reported by CNET, “Argos handles video 20 to 33 times more efficiently than conventional servers when you factor in the cost to design and build the chip, employ it in Google’s data centers, and pay YouTube’s colossal electricity and network usage bills.” Interestingly, despite YouTube’s heavy usage of the AV1 codec, Argos encodes only H.264 and VP9, not AV1.

In May 2023, Meta released their own ASIC, which, like Argos, outputs H.264 and VP9, but not AV1. Called the Meta Scalable Video Processor (MSVP), the unit delivered impressive results, including “a throughput gain of ~9x for H.264 when compared against libx264 SW encoding…[and] a throughput gain of ~50x when compared with libVPX speed 2 preset.” Meta also noted that the unit drew only 10 watts of power, which is skimpy but also about 43% higher than the T408.

Of course, neither Google nor Meta sells their ASIC to third parties, so if you want the CAPEX and OPEX efficiencies that ASIC-based VPUs deliver, you’ll have to buy from NETINT. The bottom line is that whether you call it a transcoder, VPU, VCU, or MSVP, you’ll get the highest throughput and lowest power consumption if it’s powered by an ASIC.

HARD QUESTIONS ON HOT TOPICS:
ASIC-based Video Transcoder versus Video Processing Unit (VPU)
Watch the full conversation on YouTube: https://youtu.be/iO7ApppgJAg

Which AWS CPU is Best for FFmpeg – AMD, Graviton, or Intel?

If you encode with FFmpeg on AWS, you probably know that you have three CPU options: AMD, Graviton, and Intel. Which delivers the most bang for the buck?

For those in a hurry, it’s Graviton for x264 and AMD for x265, often by a significant margin. But the devil is always in the details, and if you want to learn how we tested and how big a difference your CPU selection makes, you can follow the narrative or hopscotch through the fancy charts below. We conclude with a look at the optimal core count for those encoding with AMD CPUs.

Testing the AWS CPUs

Let me start by saying that this was my first foray into CPU testing on AWS, and while it appears straightforward, some unconsidered complexity may have skewed the results. If you see any errors or other factors worth considering, please drop me a note at jan.ozer@netint.com.

Second, your source clip and command string may produce different results than those shown below. If you’re spending big to encode with FFmpeg on AWS, don’t consider my results the final word; instead, consider them as evidence that your CPU choice really does matter and as motivation to perform your own tests. 

Those caveats aside, let’s dig into the testing.

Codecs/Configurations/Command Strings

I tested three test cases.

  • 8-bit 1080p30 with x264
  • 8-bit 1080p30 with x265
  • 10-bit 4K60p with x265

I present the command strings at the bottom of this article. Note that I used the veryslow preset for x264, slower for x265 at 1080p30, and slow for the 4K60 HEVC encodes. Why such demanding presets? Because based upon a total cost of distribution (encoding and bandwidth), the optimal economic decision when view counts will exceed 10,000 views is to use a high-quality preset.

Remember, presets don’t determine quality; your quality expectations do. Most compressionists target a VMAF score of between 93 and 95 for the top rung of their encoding ladders. Using the veryslow preset, you might achieve that at, say, 3 Mbps. Using ultrafast, you might need a bit rate of as much as 5 Mbps to achieve the same quality. Ultrafast might cut your encoding time/cost by 90%, but you only pay that once, while you pay bandwidth costs for each video view. Even at a cost per GB of $0.02, it takes fewer than 10,000 views for the veryslow preset to break even based on lower bandwidth costs.
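The break-even logic can be sketched as follows. The 3 Mbps and 5 Mbps bitrates, the 90% encoding saving, and the $0.02 per GB price come from the paragraph above. The one-hour duration and the $100 veryslow encoding cost are placeholders chosen for illustration, not measured values.

```python
def bandwidth_cost_per_view(mbps: float, hours: float, usd_per_gb: float) -> float:
    """Delivery cost for one full view; 1 GB is taken as 8,000 megabits."""
    gigabytes = mbps * hours * 3600 / 8000
    return gigabytes * usd_per_gb

def break_even_views(extra_encode_cost: float, saving_per_view: float) -> float:
    """Views needed for bandwidth savings to repay the extra encoding cost."""
    return extra_encode_cost / saving_per_view

hours = 1.0                     # assumption: a one-hour video
usd_per_gb = 0.02               # from the text
veryslow_encode_cost = 100.0    # assumption: placeholder, not a measured price
ultrafast_encode_cost = veryslow_encode_cost * 0.10   # "cut ... by 90%"

saving = (bandwidth_cost_per_view(5, hours, usd_per_gb)
          - bandwidth_cost_per_view(3, hours, usd_per_gb))
views = break_even_views(veryslow_encode_cost - ultrafast_encode_cost, saving)
print(f"saving per view: ${saving:.3f}, break-even at {views:.0f} views")
```

The break-even point scales linearly with the encoding cost. Under these assumptions, any veryslow encoding cost below $200 per hour of video breaks even in fewer than 10,000 views.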

Instances and Pricing

I tested using the 8-core instances and on-demand pricing shown in Table 1. I tested all systems running Ubuntu version 22.04. Note that the cost delta between Intel and AMD is ten percent, a number I’ll refer to below.

Table 1:  Instances and on-demand pricing tested.

Encoding Procedure

As you’ll see in the charts below, I started encoding a single FFmpeg instance and kept adding simultaneous encodes until the cost per stream began to increase, indicating that spinning up another instance was more cost effective than adding additional encodes to the same system.
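The procedure above can be sketched as a sweep over simultaneous encode counts. This is one plausible way to compute cost per hour of output video, not necessarily the calculation behind the charts. The instance price and the frames-per-second measurements are hypothetical placeholders.

```python
def cost_per_stream_hour(instance_usd_per_hour: float, encodes: int,
                         fps_per_encode: float, source_fps: float = 30.0) -> float:
    """Cost to produce one hour of output video at the measured speed."""
    video_hours_per_hour = encodes * fps_per_encode / source_fps
    return instance_usd_per_hour / video_hours_per_hour

# Hypothetical measurements: simultaneous encodes -> average fps of each encode.
measured_fps = {1: 4.0, 2: 3.6, 4: 2.4, 6: 1.7, 8: 1.25, 10: 0.95}
price = 0.40  # assumption: placeholder on-demand price in USD per hour

costs = {n: cost_per_stream_hour(price, n, fps) for n, fps in measured_fps.items()}
best = min(costs, key=costs.get)
for n, cost in costs.items():
    print(f"{n:2d} encodes: ${cost:.2f} per hour of video")
print(f"lowest cost at {best} simultaneous encodes")
```

Once the cost starts rising past the minimum, adding encodes to the same machine only slows each job down, so a second instance is the better use of money.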

FFmpeg Versions

Here’s where things get a bit complicated. My premise was that I would produce the optimal results using FFmpeg versions compiled specifically for each CPU tested. I downloaded builds for Graviton, AMD, and Intel from https://johnvansickle.com/ffmpeg/ and happily contributed via PayPal. However, I was also in touch with MulticoreWare, who requested that I test with an advanced version of their x265 codec that was optimized for Graviton.

Figure 1. I tested with CPU-specific versions of FFmpeg 6.0 from https://johnvansickle.com/ffmpeg/.

Before testing, I compared the performance of the stock version of FFmpeg (Version 4.4) with the CPU-specific versions from Vansickle on the AMD and Intel platforms and for x264 on Graviton. In all cases, the Vansickle version produced the same or better throughput with identical quality.

Note that in other tests on different AMD instances with core counts ranging from 2 – 32, the Vansickle version was not always the best performer. So, if you try the Vansickle versions or your own CPU-specific compiled versions, you should verify that it outperforms the native version in all relevant use cases.

Note that the MulticoreWare version of FFmpeg performed much better on the Graviton system than the generic version of 4.4 or the Vansickle version, though still far behind Intel and particularly AMD. As you’ll see clearly below, if you’re running x265 on a Graviton system using high quality presets, you’re missing a great opportunity to shave your costs.

For the record, I tried upgrading the stock version of FFmpeg on the Ubuntu system to version 6.0 but ran into multiple issues that ultimately corrupted the system and forced me to start back at ground zero. Unfortunately, Ubuntu operation and maintenance are not core strengths of mine, but since I ran all tests using Version 6.0, whether supplied by Vansickle or MulticoreWare, the results should be representative.

Table 2 shows the different versions of FFmpeg that I ran on the three systems for the three test cases.

Table 2. The FFmpeg versions deployed on the three systems for the three test cases.

Results

Here are the results for the three test cases.

1080p x264

Figure 2 shows the cost per hour to produce a 1080p30 stream using FFmpeg and the x264 codec. One of the more interesting testing results was that the combination of FFmpeg and Ubuntu handled multiple instances of FFmpeg with minimal overhead, particularly on the Graviton CPU. You see this with the cost per hour for Graviton remaining consistent through twelve instances, while it increased slightly for Intel after 10 instances and AMD after 12.

In all cases, you see the cost per instance drop significantly when moving from single to multiple simultaneous encodes. If you’re performing a single 1080p x264 encode on an 8-core system, you’re probably wasting money.

On the other hand, once each CPU hits the lowest cost per hour, it’s time to consider adding another instance. The cost per stream will remain the same, but your encoding speed will double. So, if you’re encoding on a Graviton system, your encoding time will double if you perform twelve simultaneous encodes as opposed to six, but your cost per hour will be almost exactly the same. If you spin up another 8-core system and encode six simultaneous encodes on the two systems, your cost will be almost identical, but your throughput will double.

Figure 2. Cost per hour to produce a single 1080p stream using the x264 codec and FFmpeg. Graviton is clearly the most cost-effective.

1080p x265

What a difference a codec makes. Where Graviton was the clear leader for x264, it’s the clear laggard for x265. Again, I produced the Graviton results shown in Figure 3 using a version of FFmpeg supplied by x265 developer MulticoreWare; the results would have been much worse with either the Vansickle version or the stock version. As you may know, Graviton is an Arm-based CPU that uses a different instruction set than Intel or AMD CPUs. While the x264 codec was Arm-friendly, the x265 codec was decidedly the reverse, at least using the high-quality presets that I used in my tests.

Interestingly, for both Intel and AMD, we realized the lowest cost per stream at relatively low simultaneous stream counts, two for Intel and two and three for AMD. If your testing confirms this, you should consider adding instances once you achieve this threshold rather than adding additional encodes to existing instances.

Figure 3. Cost per hour to produce a single 1080p stream using the x265 codec and FFmpeg.

Comparing the lowest-cost Intel result ($6.60) to the lowest-cost AMD result ($5.49) shows a cost delta of about 17%. As shown in Table 1, 10% of this relates to pricing, leaving about a 7% performance delta.
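The split between pricing and performance can be reproduced with a few lines. The two cost figures and the 10% pricing difference come from the text. The simple subtraction mirrors the article's approximation and is not an exact decomposition.

```python
intel_cost = 6.60     # lowest Intel cost per hour, from the text
amd_cost = 5.49       # lowest AMD cost per hour, from the text
pricing_delta = 0.10  # instance price difference, from Table 1

total_delta = (intel_cost - amd_cost) / intel_cost
performance_delta = total_delta - pricing_delta  # simple subtraction, as in the text

print(f"total: {total_delta:.1%}, pricing: {pricing_delta:.0%}, "
      f"performance: {performance_delta:.1%}")
```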

For the record, note that an Amazon engineer ran similar tests here and found that Graviton was faster for both x264 and x265. Note, however, that the author used the ultrafast preset, while I used higher quality presets for the stated reasons. Have a look and draw your own conclusions.

4K60 x265

In 4K60p testing, the Graviton was clearly overwhelmed from both a cost and performance aspect, unable to complete even three simultaneous encodes. The overall cost delta between Intel and AMD narrowed slightly, dropping to 13.7% overall, with 10% relating to pricing. The actual throughput delta between the two in these tests is 3.7%.

Figure 4. Cost per hour to produce a single 4K60p stream using the x265 codec and FFmpeg.

This 4K60 test stressed memory usage much more so than the 1080p tests, limiting successful simultaneous transcodes to two for Graviton and four for AMD and Intel. Interestingly, in these tests, AMD produced the lowest cost per stream while running a single encode, and Intel did so at two. With these challenging encodes, you may want to spin up new machines after only one or two encodes rather than attempting more simultaneous encodes. Or, perhaps, try a machine with more cores. Hold that thought until the last section.

For reference, Table 3 summarizes the lowest cost per hour for the three test cases.

Table 3. Cost per hour for the three test cases on the three tested CPUs.

Which leads us to the last section.

What’s the Optimal Number of Cores for FFmpeg?

AWS offers multiple core counts in all three CPU flavors: what’s the optimal core count? To evaluate this, I ran tests on multiple AMD CPUs for all three test cases and present the results below.

Let’s talk about expectations first. AWS charges linearly for the machine cores, so an 8-core system costs twice as much as a 4-core system and a quarter of a 32-core system. Given the results presented above, where FFmpeg/Ubuntu proved highly efficient when processing multiple instances, I expected a similar cost per hour for all CPUs. The results were close.
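The expectation can be expressed as a quick sketch. The per-core hourly rate is a made-up placeholder, not an AWS price. With linear pricing, a larger instance only matches the smaller one on cost per stream if it sustains proportionally more simultaneous encodes at the same speed.

```python
def instance_price(cores, price_per_core_hour):
    """Linear pricing: hourly price scales directly with core count."""
    return cores * price_per_core_hour


def break_even_streams(small_cores, small_streams, large_cores):
    """Streams the larger instance must sustain, at the same per-stream speed,
    to match the smaller instance's cost per stream."""
    return small_streams * large_cores / small_cores


RATE = 0.05  # hypothetical dollars per core-hour

print(instance_price(4, RATE), instance_price(8, RATE), instance_price(32, RATE))
# If an 8-core system runs six encodes, a 32-core system needs this many:
print(break_even_streams(8, 6, 32))
```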

With x264, 2-core and 8-core systems were slightly more affordable than 16-core, though a 32-core system finally caught up at 32 simultaneous transcodes. If you’re going to run a 32-core system for 1080p30/x264 encodes, you need to be running quite a few simultaneous encodes to achieve the optimal cost per stream.

Figure 5. x264 encoding cost for the CPU core counts shown.

With x265 encoding at 1080p, the results were closer to what I expected, though again, the 2-core and 8-core systems were slightly more affordable. Unlike x264, the 32-core system became slightly more expensive as the number of simultaneous encodes increased, making eight simultaneous streams the most affordable.

Figure 6. x265 encoding cost for 1080p30 encodes and the CPU core counts shown.

When encoding 4K videos, the phrase “go big or go home” comes to mind. Here, 32 cores delivered the lowest cost, though only by a fraction, and only at four simultaneous encodes. After that, the cost per hour increases slightly through eight encodes and then starts a more serious climb.

Figure 7. x265 encoding cost for 4K60 encodes and the CPU core counts shown.

As you can see, all these results are highly codec and source material specific. The most important takeaway from this article should not be that Graviton is best for x264 and AMD best for x265. It should be that real differences exist between the performance of the CPUs, and these differences may translate to significant cost differentials. If you’re spending even a few thousand dollars a month on AWS for FFmpeg encoding, it makes sense to run tests like these to identify the most cost-effective CPU and core-count.

Test Strings

1080p30 x264:

ffmpeg -y -i Orchestra.mp4 -c:v libx264 -profile:v high  -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0  -b:v 4200k -pass 1  -f mp4 /dev/null

ffmpeg -y -i Orchestra.mp4 -c:v libx264  -preset veryslow -g 60 -keyint_min 60 -sc_threshold 0  -b:v 4200k -maxrate 8400k -bufsize 8400k -pass 2  orchestra_x264_output.mp4

1080p30 x265:

ffmpeg  -y -i Football_short.mp4 -c:v libx265 -preset slower -x265-params keyint=60:min-keyint=60:scenecut=0:bitrate=3500:pass=1  -f mp4 /dev/null

ffmpeg  -y -i Football_short.mp4 -c:v libx265 -preset slower -x265-params keyint=60:min-keyint=60:scenecut=0:bitrate=3500:vbv-maxrate=7000:vbv-bufsize=7000:pass=2  Football_x265_HD_output.mp4

4K60 x265:

ffmpeg -y -i Football_4K60.mp4 -c:v libx265 -preset slow -x265-params keyint=120:min-keyint=120:scenecut=0:bitrate=12500K:pass=1  -f mp4 /dev/null

ffmpeg -y -i Football_4K60.mp4 -c:v libx265 -preset slow -x265-params keyint=120:min-keyint=120:scenecut=0:bitrate=12500K:vbv-maxrate=25000K:vbv-bufsize=25000K:pass=2  Football_4K_output.mp4 
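If you want to reproduce this kind of test, the following is a minimal Python sketch that launches several simultaneous two-pass x264 encodes and times them. It is not the harness used for this article. The function names, the output file names, and the choice to run all first passes together before all second passes are assumptions. The sketch also adds FFmpeg's -passlogfile option, which is not in the test strings above, so that simultaneous jobs do not overwrite each other's first-pass statistics.

```python
import subprocess
import time


def x264_two_pass_commands(source, output, job_id):
    """Build pass-1 and pass-2 argument lists modeled on the x264 test strings."""
    shared = ["-c:v", "libx264", "-preset", "veryslow", "-g", "60",
              "-keyint_min", "60", "-sc_threshold", "0", "-b:v", "4200k",
              "-passlogfile", f"passlog_{job_id}"]  # per-job stats file (added here)
    first = (["ffmpeg", "-y", "-i", source, "-profile:v", "high"] + shared
             + ["-pass", "1", "-f", "mp4", "/dev/null"])
    second = (["ffmpeg", "-y", "-i", source] + shared
              + ["-maxrate", "8400k", "-bufsize", "8400k", "-pass", "2", output])
    return first, second


def run_batch(commands):
    """Start every command at once, wait for all, and return the exit codes."""
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL) for cmd in commands]
    return [proc.wait() for proc in procs]


def time_simultaneous_encodes(count, source):
    """Run `count` simultaneous two-pass encodes; return elapsed seconds."""
    jobs = [x264_two_pass_commands(source, f"out_{i}.mp4", i) for i in range(count)]
    start = time.perf_counter()
    for pass_index in (0, 1):
        exit_codes = run_batch([job[pass_index] for job in jobs])
        if any(exit_codes):
            raise RuntimeError(f"pass {pass_index + 1} failed: {exit_codes}")
    return time.perf_counter() - start
```

Calling time_simultaneous_encodes(6, "Orchestra.mp4") on each instance type, then multiplying the elapsed hours by the instance's hourly price and dividing by six, gives a cost-per-stream figure of the kind charted above.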

HARD QUESTIONS ON HOT TOPICS: AMD, Graviton, and Intel
– three CPU options to encode with FFmpeg on AWS
 
Watch the full conversation on YouTube: https://youtu.be/BOZZuiemMAU

World’s First AV1 Live Streaming CDN powered by VPUs

RealSprint’s vision for Vindral, its live-streaming CDN, is to deliver the quality of HLS and the latency of WebRTC. Early trials revealed that CPU-only transcoding lacked scalability, and GPUs used excessive power and proved challenging to configure.

Implementing NETINT’s ASIC-based Quadra delivered the required quality and latency in a low-power, simple-to-configure package with H.264, HEVC, and AV1 output. As a result, Quadra became a “preferred component” of the Vindral setup.

The RealSprint Story

RealSprint is a tech company founded in 2013 and based in Umeå, Sweden. Since its inception, RealSprint has delivered industry-defining solutions that drive real business value. Its flagship solution, Vindral live CDN, combines ultra-low latency streaming with 4K support, sync, and absolute stability. The latest addition, Composer, streamlines the setup for live video compositing, effects, and encoding.

In explaining RealSprint’s goals to Streaming Media Magazine, RealSprint CEO Daniel Alinder stated that part of the company’s goal is “to disrupt, spur innovation, and ensure high-end streaming experiences.” This focus, and RealSprint’s painstaking execution, has brought customers like Sotheby’s, Hong Kong Jockey Club, and IcelandAir into RealSprint’s client roster.

Figure 1. Check out this Vindral demo at https://demo.vindral.com/?4k

Finding the Ideal Transcoder for Vindral

The Vindral live CDN is transforming the landscape for live streaming, offering high-quality streaming at low latency and synchronized playout. As a result, Vindral is highly optimized for verticals such as live sports, iGaming, live auctions, and entertainment markets with a desired latency of around one second and where stability is imperative, even at high video quality.

Alinder explains, “It is, of course, possible to configure for 0.5-second latency as well, but none of our clients has chosen to go that low. More common focus areas are image quality and synchronized playout. A game show with host-crowd interaction does not require real-time latency. Keeping all viewers in sync, around 1 second, while maintaining full-HD quality is a common request that we see.”

Elaborating on Alinder’s comments, Niclas Åström, founder and Chief Product Officer at RealSprint, adds, “we call it the Sweet Spot. Vindral is built to put clients in charge of their own sweet spot in terms of buffer and quality. While we are highly impressed by technologies such as WebRTC, we aim to pave the way for a new mainstream in which latency is only one of the parameters.”

Expanding upon Vindral’s target use cases, Alinder details, “A typical use case is live auctions. The usual setup for live auctions is 1080P, and you want below one second of latency because people are bidding online. There are also people bidding in the actual auction house, so there’s the fairness aspect of it as well.”

“Clients typically configure around a 700-millisecond buffer, and even that small of a buffer makes such a huge difference in quality and reliability. What we see in our metrics is that, basically, 99% of the viewers watch the highest quality stream across all markets. That’s a huge deal.”

HARD QUESTIONS ON HOT TOPICS:
World’s first AV1 live streaming CDN powered by NETINT’s Quadra VPU
Watch on YouTube: https://youtu.be/Qhe6wuJoOX0

Exploring Transcoder Options

To provide this flexible latency, Vindral depends upon a transcoder to produce the streams with minimal latency, and a vendor-agnostic hybrid content delivery network (CDN) to deliver the streams. To explain, the transcoder inputs the incoming stream from the live source and produces multiple outputs to deliver to viewers watching on different devices and connections.

Choosing the transcoder is obviously a critical decision for Vindral and RealSprint. When exploring its transcoder options, RealSprint considered multiple criteria, including cost per stream, power, output quality, format support, latency, and density.

According to CTO Per Mafrost, “We started using only CPUs but quickly concluded that we needed better scalability. We moved on to using GPUs, but the hardware setups got a bit more troublesome and more energy-demanding. A year back, we got in touch with NETINT to test their ASICs and were pleased with our findings.”

Figure 2. The NETINT Quadra T2 VPU.

Quadra Fills the Gap

Specifically, Vindral implemented NETINT’s Quadra Video Processing Unit (VPU), which is driven by the Codensity G5 ASIC (Application-Specific Integrated Circuit). In terms of transcoding, Quadra inputs H.264, HEVC, and VP9 video and outputs H.264, HEVC, and AV1, all at sub-frame latencies, which translate to under 0.033 seconds for a 30-fps input stream. Quadra is called a VPU rather than a transcoder because, in addition to audio and video transcoding, it also offers onboard scaling and overlay and houses two Deep Neural Network engines capable of 18 Trillion Operations per Second (TOPS).

According to Alinder, Quadra delivers both top quality and the necessary low latency. “We’ve found that the quality when using ASICs is fantastic. It’s all depending on what you want to do. Because we need to understand we’re talking about low latency here. Everything needs to work in real time. Our requirement on encoding is that it takes a frame to encode, and that’s all the time that you get.”

Quadra’s AV1 output was another key consideration. As Alinder explained, “we’re seeing markers that our clients are going to want AV1. And there are several reasons why that is the case. One of which is, of course, it’s license free. If you’re a content owner, especially if you’re a content owner with a large crowd with many subscribers to your content, that’s a game-changer. Because the cost of licensing a codec can grow to become a significant part of your business expenses.”

Density and Power Consumption

Density refers to the number of streams a device or server can output. Because ASICs are purpose-built for video transcoding, they’re extremely efficient transcoders that provide maximum density along with very low power consumption. Speaking to Quadra’s density, Alinder commented, “That is a huge game changer because ASICs are unmatched in terms of the number of streams per rack unit.”

Of course, power consumption is also critical, particularly in Europe. As Alinder detailed, “If you look at the energy crisis and how things are evolving, I’d say [power consumption] is very, very important. The typical offer you’ll be getting from the data center is: we’re going to charge you 2x the electrical bill. In Germany, the energy price peaked in August 2022 at 0.7 Euros per kilowatt hour.”
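To put those numbers in perspective, here is a rough sketch of the annual electricity charge for a continuously running load, using the 2x data center markup and the 0.7 euro peak price from the quote. The 1,000-watt load is a hypothetical placeholder, and peak prices would not apply year-round, so treat the result as an upper-bound illustration.

```python
def annual_power_cost_eur(load_watts, price_per_kwh, markup=2.0, hours=24 * 365):
    """Yearly electricity charge for a constant load, including the data center markup."""
    kwh = load_watts / 1000 * hours
    return kwh * price_per_kwh * markup


# Hypothetical 1,000 W load billed at the quoted peak of 0.70 EUR/kWh, doubled.
print(round(annual_power_cost_eur(1000, 0.70)))
```

Under these assumptions the charge comes to about 12,264 euros per year for each kilowatt of continuous draw, which is why per-stream power consumption matters so much when sizing a transcoding fleet.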

To be clear, in some instances, Vindral can reduce power consumption and other carbon emissions by making travel unnecessary. As Alinder explained, “We have a Norwegian company that we’re working with that is doing remote inspections of ships. They were the first company in the world to do that. Instead of flying in an inspector, the ship owner, and two divers to the location, there’s only one operator of an underwater drone that is on the location. Everybody else is just connected. That’s obviously a good thing for the environment.”

Linear Load

One final characteristic that set Quadra apart was a predictable “linear load” pattern. As described by CTO Mafrost, “in choosing between different alternatives, the usual suspects such as cost, power, quality, and density were our main criteria. But another seldom mentioned topic set NETINT ASICs apart from CPUs and many GPUs: linear load. Specifically, it was relatively easy to create a solution where we could feel safe when calculating the load and expected capacity for transcoder nodes. The density, cost/stream, and quality are bonuses.”

RealSprint began deploying NETINT Quadra VPUs in 2022. As Mafrost concluded, “Since then, ASICs have started to be a preferred component of our setup.”

Figure 3. NETINT Quadra has become a “preferred component” of Vindral.

The NETINT View

NETINT Technologies is an innovator of ASIC-based video processing solutions for low-latency video transcoding. Users of NETINT solutions realize a 10X increase in encoding density and a 20X reduction in carbon emissions compared to CPU-based software encoding solutions. NETINT makes it seamless to move from software to hardware-based video encoding so that hyper-scale services and platforms can unlock the full potential in their computing infrastructure.

Regarding Vindral’s use of Quadra, NETINT’s COO Alex Liu commented, “Live streaming video platforms demand more efficient and cost-effective video encoding solutions due to the emergence of new interactive video applications which can only be met with ASIC hardware encoding. Vindral, the industry’s first 4K AV1 streaming platform and powered with NETINT’s Quadra T2 real-time, low-latency 4K AV1 encoder, is a game changer. We are really excited about the amazing video experiences that Vindral users will bring to their customers as a result of this breakthrough in latency and quality.”

Figure 4. Streaming Media Magazine discussing Vindral with RealSprint CEO Daniel Alinder. https://youtu.be/xJ2Zfo2r7SM

The Industry Takes Notice

The potent combination of Vindral and Quadra has the industry taking notice. For example, in this Streaming Media interview, respected contributing editor Tim Siglin interviewed Alinder about Vindral, summarizing “the fact that [Quadra] is an ASIC that does more transcodes at a lower power consumption means that it gives you a better viability.” 

NETINT was the first company to ship AV1-based ASIC transcoders and has shipped tens of thousands of transcoders and VPUs, producing over 200 billion streams in 2022. In fact, NETINT has shipped more ASIC-based transcoders than any other supplier to the cloud gaming, broadcast, and similar live-streaming markets.

Validating NETINT’s approach, Google launched its own ASIC-based transcoder, called Argos, in 2021, and Meta followed with its own in 2022. Both products are used exclusively in-house by the respective companies.

The best way to leverage the benefits of encoding ASICs is to contact NETINT.

Hardware Transcoding: What it Is, How it Works, and Why You Care

What is Transcoding?

Like most terms relating to streaming, transcoding is defined more by practice than by a dictionary. In fact, transcoding isn’t in Webster’s or many other dictionaries. That said, it’s generally accepted that transcoding means converting a file from one format to another. More particularly, it’s typically used within the context of a live-streaming application.

As an example, suppose you were watching a basketball game on NBA.tv. Assuming that the game is produced on-site, somewhere in the arena, a video mixer pulls together all video, audio, and graphics. The output would typically be fed into a device that compresses it to a high-bitrate H.264 or another compressed format and sends it to the cloud. You would typically call this live encoding; if the encoder is hardware-based, it would be hardware-based live encoding.

In the cloud, the incoming stream is transcoded to lower resolution H.264 streams for delivery to mobile and other devices or HEVC for delivery to a smart TV. This can be done in software but is typically performed using a hardware transcoder because it’s more efficient. More on this below.

Looking further into the production and common uses of streaming terminology, during the event or after, a video editor might create short highlights from the original H.264 video to share on social media. After editing the clip, they would encode it to H.264 or another compressed format to upload to Instagram or Facebook. You would typically call rendering the output from the software editor encoding, not transcoding, even though the software converts the H.264 input file to H.264 output, just like the transcoder.

HARD QUESTIONS ON HOT TOPICS: Transcoding versus Encoding.
Watch the full conversation on YouTube: https://youtu.be/BcDVnoxMBLI

Boiling all this down in terms of common usage:

  • You encode a live stream from video input, in software or in hardware, to send it to the cloud for distribution. You use a live encoder, either hardware or software, for this.
  • In the cloud, you transcode the incoming stream to multiple resolutions or different formats using a hardware or software transcoder.
  • When outputting video for video-on-demand (VOD) deployment, you typically call this encoding (and not transcoding), even if you’re working from the same compressed format as the transcoding device.

Hardware Transcoding Alternatives

Anyone who has ever encoded a file knows that it’s a demanding process for your computer. When producing for VOD, time matters, but if the process takes a moment or two longer than planned, no one really notices. Live, of course, is different; if the video stream slows or is interrupted, viewers notice and may click to another website or change channels.

This is why hardware transcoding is typically deployed for high-volume transcoding applications. You can encode with a CPU and software, but CPUs perform multiple functions within the computer and are not optimized for transcoding. This means that a single server can produce fewer streams than hardware transcoders, which translates to higher CAPEX and power consumption.

As the name suggests, hardware-based transcoding uses hardware devices other than the CPU to transcode the video. One alternative is the graphics processing unit (GPU), which is highly optimized for graphics-intensive applications like gaming. Transcoding is supported with dedicated hardware circuits in the GPU, but the vast majority of circuits are for graphics and other non-transcoding functions. While GPUs are more efficient than CPUs for transcoding, they are expensive and consume significant power.

ASIC-Based Transcoding

Which takes us to ASICs. Application-Specific Integrated Circuits (ASICs) are designed for a specific task or application, like video transcoding. Because they’re designed for this task, they are more efficient than CPU or GPU-based encoding, more affordable, and more power-efficient.

ASICs are also very compact, so you can pack more ASICs into a server than GPUs or CPUs, increasing the output from that server. This means that fewer servers can deliver the same number of streams than with GPU or CPU-based transcoding, which saves additional server storage cost and maintenance.

While we’re certainly biased, if you’re looking for a cost-effective and power-efficient hardware alternative for high-volume transcoding applications, ASIC transcoders are the way to go. Don’t take our word for it; you can read here how YouTube converted much of their production operation to the ASIC-based Argos VCU (for video compression unit). Meta also recently released its own encoding ASIC. Of course, neither of these is for sale to the public; the primary vendor for ASIC-based transcoders is NETINT.

NETINT Video Transcoding Server – ASIC technology at its best

Many high-volume streaming platforms and services still deploy software-only transcoding, but high energy prices for private data centers and escalating public cloud costs make the OPEX, carbon footprint, and dismal scalability unsustainable. Engineers looking for solutions to this challenge are actively exploring hardware that can integrate with their existing workflows and deliver the quality and flexibility of software with the performance and operational cost efficiency of purpose-built hardware. 

If this sounds like you, the USD $8,900 NETINT Video Transcoding Server could be the ideal solution. The server combines the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered 1RU server with ten NETINT T408 video transcoders that draw just 7 watts each. The server encodes HEVC and H.264 at normal or low latency, and you can control transcoding operations via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional x264 or x265 FFmpeg-based or GPU-powered encoding stack.

Due to the performance advantage of ASICs compared to software running on x86 CPUs, the server can perform the equivalent work of roughly 10 separate machines running a typical open-source FFmpeg and x264 or x265 configuration. Specifically, the server can simultaneously transcode twenty 4Kp30 streams, and up to 80 1080p30 live streams. In ABR mode, the server transcodes up to 30 five-rung H.264 encoding ladders from 1080p to 360p resolution, and up to 28 four-rung HEVC encoding ladders. For engineers delivering UHD, the server can output seven 6-rung HEVC encoding ladders from 4K to 360p resolution, all while drawing less than 325 watts of total power.
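A quick sketch shows how these server-level figures break down per transcoder. The stream counts, ladder counts, card count, and 7-watt draw come from the text; the 3840x2160 frame size for 4K is an assumption, and the per-card division assumes work is spread evenly across the ten T408s.

```python
CARDS = 10            # T408 transcoders in the server
WATTS_PER_CARD = 7    # per-T408 power draw quoted in the text

streams_1080p30 = 80  # server-level maximums from the text
streams_4kp30 = 20

per_card_1080p30 = streams_1080p30 // CARDS
per_card_4kp30 = streams_4kp30 // CARDS

# Assumed frame sizes: 4K taken as 3840x2160, 1080p as 1920x1080.
pixel_ratio = (3840 * 2160) // (1920 * 1080)

# Total rungs (individual output streams) in each ABR scenario.
h264_ladder_outputs = 30 * 5
hevc_ladder_outputs = 28 * 4
uhd_ladder_outputs = 7 * 6

transcoder_watts = CARDS * WATTS_PER_CARD

print(per_card_1080p30, per_card_4kp30, pixel_ratio)
print(h264_ladder_outputs, hevc_ladder_outputs, uhd_ladder_outputs)
print(transcoder_watts)
```

The one-to-one figures are consistent with each other: a 4K frame carries four times the pixels of a 1080p frame, and each card handles one quarter as many 4Kp30 streams as 1080p30 streams.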

This review begins with a technical description of the server and transcoding hardware and the options available to drive the encoders, including the resource manager that distributes jobs among the ten transcoders. Then we’ll review performance results for one-to-one streaming and then H.264 and HEVC ladder generation, and finish with a look at the server’s ultra-efficient power consumption.

Hardware Specs

Built on the Supermicro 1114S-WN10RT 1RU server platform, the NETINT Video Transcoding Server features ten NETINT Codensity ASIC-powered T408 video transcoders and runs Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400GB M.2 SSD drive with 3x PCIe slots and ten NVMe slots to house the ten U.2 T408 video transcoders.

You can buy the server with any of three AMD EPYC processors with 8 to 64 cores. We performed the tests for this review on the 32-core AMD EPYC 7543P CPU that doubles to 64 threads with multithreading. The server configured with the AMD EPYC 7713P processor with 64 cores and 128 threads sells for USD $11,500, and the economical AMD EPYC 7232P processor-based server with 8 cores and 16 threads lists for USD $7,000.

Regarding the server hardware, Supermicro is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient machine when operated with NETINT video transcoders.

With this as background, let’s explore the system. Once up and running in Ubuntu, you can check T408 status via the ni_rsrc_mon_logan command, which reveals the number of T408s installed and their status. Looking at Figure 1, the top table shows the decoder performance of the installed T408s, while the bottom table shows the encoding performance.

Figure 1. Tracking the operation of the T408s, decode on top, encode on the bottom.

About the T408

T408s have been in service since 2019 and are being used extensively in hyper-scale platforms and cloud gaming applications. To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.

The NETINT T408 is powered by the Codensity G4 ASIC technology and is available in both PCIe and U.2 form factors. The T408s installed in the server are the U.2 form factor plugged into ten NVMe bays. The T408 supports closed caption passthrough and EIA/CEA-708 encode/decode, along with support for High Dynamic Range in HDR10 and HDR10+ formats.

The T408 decodes and encodes H.264 and HEVC on board but performs all scaling and overlay operations via the host CPU. For one-to-one same-resolution transcoding, users can select an option called YUV Bypass that sends the video decoded by the T408 directly to the T408 encoder. This eliminates high-bandwidth trips through the bus to and from system memory, reducing the load on the bus and CPU. As you’ll see, in pure 1:1 transcode applications without overlay, CPU utilization is very low, so the T408 and server are very efficient for cloud gaming and other same-resolution, low-latency interactive applications.

Figure 2. The T408 is powered by the Codensity G4 ASIC.

Testing Overview

We tested the server with FFmpeg and GStreamer. As you’ll see, in most operations, performance was similar. In some simple transcoding applications, FFmpeg pulled ahead, while in more complex encoding ladder productions, particularly 4K encoding, GStreamer proved more performant, particularly for low-latency output.

Figure 3. The software architecture for controlling the server.  

Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the T408 NVMe interface and the FFmpeg software layer. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

To allocate jobs to the ten T408s, the T408 device driver software includes a resource management module that tracks T408 capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available resources.

Alternatively, you can manually assign decoding and encoding tasks to different T408 devices in the command line or application and control which streams are decoded by the host CPU or a T408. With these and similar controls, you can efficiently balance the overall transcoding load between the T408s and host CPU to maximize throughput. We used auto distribution for all tests.
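To make the idea of automatic distribution concrete, here is a purely illustrative sketch of a least-loaded allocation policy. This is not NETINT's resource manager or its actual algorithm; the function name, the percentage-based load model, and the assumption that a 4Kp30 job consumes half of one card (inferred from twenty 4Kp30 streams across ten T408s) are all hypothetical.

```python
def assign_job(device_loads, job_load, capacity=100.0):
    """Place a job on the least-loaded device that still has room for it.

    Returns the chosen device index, or None if no device can take the job.
    Hypothetical policy for illustration only.
    """
    candidates = [i for i, load in enumerate(device_loads)
                  if load + job_load <= capacity]
    if not candidates:
        return None
    target = min(candidates, key=lambda i: device_loads[i])
    device_loads[target] += job_load
    return target


# Ten idle devices; each hypothetical 4Kp30 job is assumed to use 50% of a card.
loads = [0.0] * 10
placements = [assign_job(loads, 50.0) for _ in range(21)]

print(placements)
print(loads)
```

In this model the first twenty jobs spread evenly across the ten devices and the twenty-first is refused, which mirrors the kind of predictable capacity ceiling that makes planning straightforward.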

Testing Procedures

We tested using Server version 1.0, running FFmpeg v4.3.1 and GStreamer v1.18 and T408 release 3.2.0. We tested with two use cases in mind. The first is single stream in, single stream out, either at the same resolution as the incoming stream or output at a lower resolution. This mode of operation is used in many interactive applications like cloud gaming, real-time gaming, and auctions where the absolute lowest latency is required. We also tested scaling performance since many interactive applications scale the input to a lower resolution.

The second use case is ABR, where a single input stream is transcoded to a full encoding ladder. In both modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

HARD QUESTIONS ON HOT TOPICS
All you need to know about NETINT Transcoding Server powered by ASICs
Watch the full conversation on YouTube: https://youtu.be/6j-dbPbmejw

One-to-One Performance

Table 1 shows transcoding results for 4K, 1080p, and 720p in latency tolerant and low-delay modes. Instances is the number of full frame rate outputs produced by the system, with CPU utilization shown for reference. These results are most relevant for cloud gaming and similar applications that input a single stream, transcode the stream at full resolution, and distribute it.

As you can see, 4K results peak at 20 streams for all codecs, though results differ by the software program used to generate the streams. The number of 1080p outputs ranges from 70 to 80, while 720p streams range from 140 to 170. As you would expect, CPU utilization is extremely low for all test cases as the T408s are shouldering the complete decoding/encoding role. This means that performance is limited by T408 throughput, not CPU, and that the 64-core CPU probably wouldn’t produce any extra streams in this use case. For pure encoding operations, the 8-core server would likely suffice, though given the minimal price differential between the 8-core and 32-core systems, opting for the higher-end model is a prudent investment.

Latency

As for latency, in the normal mode, latency averaged around 45 ms for 4K transcoding and 34 ms for 1080p and 720p transcoding. In low delay mode, this dropped to around 24 ms for 4K, 7 ms for 1080p, and 3 ms for 720p, all at 30 fps transcoding and measured with FFmpeg. For reference, at 30 fps, each frame is displayed for 33.33 ms. Even in latency-tolerant mode, latency is about 1.35 frames for 4K and roughly a single frame for 1080p and 720p. In low delay mode, all resolutions are under a single frame of latency.
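The conversion from milliseconds to frames is simple enough to sketch. The values below are the rounded averages quoted above, so the unrounded measurements may differ slightly; the function name is just for illustration.

```python
def latency_in_frames(latency_ms, fps=30):
    """Express a latency in frame intervals at the given frame rate."""
    return latency_ms * fps / 1000.0


# Rounded averages quoted in the text, in milliseconds.
normal_mode = {"4K": 45, "1080p": 34, "720p": 34}
low_delay_mode = {"4K": 24, "1080p": 7, "720p": 3}

for mode_name, results in (("normal", normal_mode), ("low delay", low_delay_mode)):
    for resolution, latency_ms in results.items():
        frames = latency_in_frames(latency_ms)
        print(f"{mode_name:9s} {resolution:5s} {latency_ms:3d} ms = {frames:.2f} frames")
```

Every low delay result lands well below one frame interval, which is the property that matters for interactive applications.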

It’s worth noting that while software performance would drop significantly from H.264 to HEVC, hardware performance does not. Thus questions of codec performance for more advanced standards like HEVC do not apply when using ASICs. This is good news for engineers adopting HEVC, and those considering HEVC in the future. It means you can buy the server, comfortable in the knowledge that it will perform equally well (if not better) for HEVC encoding or transcoding.

Table 1. Full resolution transcodes with FFmpeg and GStreamer in regular and low delay modes.

Table 2 shows the performance when scaling from 4K to 1080p and from 1080p to 720p, again by the different codecs in and out. Since scaling is performed by the host CPU, CPU usage increases significantly, particularly on the higher volume 1080p to 720p output. Still, given that CPU utilization never exceeds 35%, it appears that the gating factor to system performance is T408 throughput. Again, while the 8-core system might be able to produce similar output, the 32-core system is probably the safer choice if your application involves scaling.

In these tests, latency was slightly higher than in pure transcoding. In normal mode, 4K > 1080p latencies topped out at 46 ms and dropped to 39 ms for 1080p > 720p scaling, just over a single frame of latency. In low-delay mode, these results dropped to 10 ms for both 4K > 1080p and 1080p > 720p. As before, these latency results are for 30 fps and were measured with FFmpeg.

Table 2. Performance while scaling from 4K to 1080p and 1080p to 720p.

The final set of tests involves transcoding to the AVC and HEVC encoding ladders shown in Table 3. These results will be most relevant to engineers distributing full encoding ladders in HLS, DASH, or CMAF containers.

Here we see the most interesting discrepancies between FFmpeg and GStreamer, particularly in low-delay modes and in 4K results. In the 1080p AVC tests, FFmpeg produced 30 five-rung encoding ladders in normal mode but dropped to nine in low-delay mode. GStreamer produced 30 encoding ladders in both modes using substantially lower CPU resources. You see the same pattern in the 1080p four-rung HEVC output, where GStreamer produced more ladders than FFmpeg using lower CPU resources in both modes.

Table 3. Full encoding ladders output in the listed modes.

FFmpeg produced very poor results in 4K testing, particularly in low-delay mode, and it was these results that drove the testing with GStreamer. As you can see, GStreamer produced more streams in both modes, and CPU utilization again remained very low. As with the previous results, the low CPU utilization means that the results reflect the encoding limits of the T408. For this reason, it’s unlikely that the higher-end server would produce more encoding ladders.

In terms of latency, in normal mode, latency was 59 ms for the H.264 ladder, 72 ms for the four-rung 1080p HEVC ladder, and 52 ms for the 4K HEVC ladder. These numbers dropped to 5 ms, 7 ms, and 9 ms for the respective configurations in low-delay mode.

Power Consumption

Power consumption is an obvious concern for all video engineers and operations teams. To assess system power consumption, we tested using the IPMI Tool. When running completely idle, the system consumed 154 watts, while at maximum CPU, the unit averaged 400 watts with a peak of 425 watts.

We measured consumption during the three basic operations tested: pure transcoding, transcoding with scaling, and ladder creation. In each case, we tested the GStreamer scenario that produced the highest recorded CPU usage. You see the results in Table 4.

When you consider that CPU-only transcoding would yield a fraction of the outputs shown while consuming 25-30% more power, you can see that the T408 is exceptionally efficient when it comes to power consumption. The Watts/Output figure provides a useful comparison for other competitive systems, whether CPU or GPU-based.
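The Watts/Output metric is simply system power divided by the number of simultaneous outputs. The sketch below illustrates the calculation using only figures quoted in the text: the 154-watt idle draw, the 425-watt peak, and the 80-stream upper end of the 1080p range. These are bounding values chosen for illustration, not the measured Table 4 results, and the function name is hypothetical.

```python
def watts_per_output(system_watts: float, outputs: int) -> float:
    """Return system power draw divided by the number of simultaneous outputs."""
    if outputs <= 0:
        raise ValueError("outputs must be a positive integer")
    return system_watts / outputs


IDLE_WATTS = 154.0   # idle draw reported in the text
PEAK_WATTS = 425.0   # peak draw at maximum CPU reported in the text
OUTPUTS_1080P = 80   # upper end of the reported 1080p stream range (assumed workload)

# Actual draw under load falls somewhere between idle and peak, so these
# two values only bracket the real Watts/Output figure for this workload.
lower_bound = watts_per_output(IDLE_WATTS, OUTPUTS_1080P)
upper_bound = watts_per_output(PEAK_WATTS, OUTPUTS_1080P)

print(f"Watts/Output for {OUTPUTS_1080P} x 1080p: "
      f"between {lower_bound:.2f} and {upper_bound:.2f}")
```

To compare against a CPU-only or GPU-based system, the same function can be applied to that system's measured power draw and output count.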

Table 4. Power consumption during the specified operation.

Conclusion

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower price model available for pure encoding operations, and a more powerful model for CPU-intensive operations, the NETINT server family meets a broad range of requirements.