Simplify Building Your Own Streaming Cloud with Wowza

Transcoding and packaging software is a key component of any live-streaming cloud, and one of the most functional and flexible programs available is the Wowza Streaming Engine. During the symposium, Barry Owen, Chief Solutions Architect at Wowza, detailed how to create a scalable streaming infrastructure using the Wowza Streaming Engine (WSE).

He started by discussing Wowza’s history, from its formation in 2005 to its recent acquisition of FlowPlayer. After defining the typical live streaming production pipeline, Barry detailed how WSE can serve as an origin server, transcoder, and packager, ensuring optimal viewer experience. He discussed WSE’s adaptability, including its ability to scale through GPU- and VPU-based transcoding, and emphasized WSE’s deployment options, which range from on-premises to cloud-based infrastructures. He then outlined Wowza’s infrastructure for distributing to audiences large and small.

Barry concluded by delivering on the session title, getting WSE up and running in under five minutes using Docker in a demo that you can watch at the end of this article.


Start Streaming in Minutes with Wowza Streaming Engine

The focus of Barry’s talk was how to create a highly scalable streaming infrastructure with Wowza Streaming Engine (WSE). He began by recounting Wowza’s history. Established in 2005, the company launched its inaugural product, the Wowza Media Server, in 2007, and complemented it with the Wowza Cloud, a SaaS solution, in 2013. Since its inception, Wowza has grown to support over 6,000 customers in 170 countries and boasts more than 35,000 streaming implementations. Its products are responsible for 38 million video transcoding hours each month. Recently, the company acquired FlowPlayer, adding a premier video player to its product lineup.

Barry emphasized Wowza’s commitment to providing streaming solutions that are reliable, scalable, and adaptable. He noted the importance of customization in the streaming sector and highlighted the company’s robust support team and services, which are designed to ensure customer success.

Wowza Streaming Engine Functionality

Barry then moved to the heart of his talk, which he set up by illustrating the streaming pipeline that begins with video capture from sources like cameras, encoders, or mobile devices (Figure 1). Within this pipeline, WSE serves as a comprehensive media server capable of functioning as an origin server, transcoder, and packager in a single system.

In this role, WSE offers real-time encoding and transcoding, producing multiple bitrate streams for an optimal viewer experience. It also performs real-time packaging into formats like HLS and DASH to ensure compatibility across devices, along with ancillary functions like DRM, captions, ad insertion, and metadata handling. Once processed, the stream is ready for delivery to a vast audience through one or multiple CDNs, depending on the desired scale and workflow.

Figure 1. The role WSE plays in the streaming pipeline.

Then Barry dug deeper into the capabilities of the Wowza Streaming Engine, emphasizing its comprehensive nature as an end-to-end media server. These capabilities include:

  • Input Protocols: The Streaming Engine can ingest almost any input protocol, including RTSP, RTMP, SRT, WebRTC, HLS, and more.
  • Transcoding: WSE offers just-in-time, real-time transcoding with minimal latency. It also supports features like compositing and overlays, preparing the stream for packaging.
  • Packaging: WSE supports commonly used formats like HLS and DASH, as well as more specialized formats such as WebRTC, RTSP, and MPEG-TS.
  • Delivery: Wowza supports both push and pull models for stream delivery. It can integrate with multiple CDN vendors, including its own, and allows syndication to platforms like Facebook and LinkedIn.
  • Extensibility: A significant feature of the Streaming Engine is its flexibility. It offers a complete Java API for custom processing and a REST API for system command and control. WSE’s user interface (Streaming Engine Manager) is built on this REST API, demonstrating its functionality (a sample REST call appears after this list).
  • Configuration and Control: The Streaming Engine Manager allows users to manage one or more Streaming Engine instances from one web interface. Advanced users can also programmatically edit configurations to integrate with their systems.
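
To give a flavor of that REST API, here is a hypothetical request that lists the applications configured on a server. The path structure follows Wowza’s documented v2 REST API, but the host and port are placeholders, and you may need to supply credentials if API authentication is enabled.

curl -X GET -H "Accept:application/json" \
  http://localhost:8087/v2/servers/_defaultServer_/vhosts/_defaultVHost_/applications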

Barry underscored WSE’s adaptability, highlighting its ability to cater to custom workflows, from complex ad insertions to machine learning applications. He also mentioned the availability of GitHub libraries with examples and encouraged exploring the Streaming Engine Manager for system configuration and monitoring.

Deploying Wowza Streaming Engine
Figure 2. WSE deployment options.

Barry next discussed the deployment options for the Wowza Streaming Engine. These include:

  • On-Premises: WSE can be deployed on-premises, offering a cost-effective and efficient solution, especially in high-density scenarios or when you have access to your own data center.
  • Managed Hardware Platforms: WSE can be set up on platforms like Linode, providing access to bare metal in a managed environment.
  • Public Clouds: Pre-built images are available for major cloud platforms, allowing quick setup. Users can choose from marketplace images or standard ones, where they bring their own license key. Pre-configurations for common use cases are also provided.
  • Docker: Wowza offers Docker images for users, emphasizing its significance in automating deployment, scaling, and ensuring high availability in modern infrastructure setups.

Barry emphasized WSE’s adaptability to various deployment needs, from traditional setups to modern cloud-based infrastructures.

Scaling Wowza Streaming Engine
Figure 3. Scaling stream processing with GPUs and VPUs (ASICs).

Barry shifted the discussion to scaling and stream processing, emphasizing the different approaches and addressing their pros and cons. For stream processing, WSE supports CPU-, GPU-, and VPU-based transcoding. Here’s a brief discussion of each option.

CPU-Based Transcoding:

Barry highlighted the traditional approach of using software CPU-based transcoding. The Wowza Streaming Engine can efficiently leverage the processing power of CPUs to handle video streams. This method is straightforward and can be scaled by adding more servers or opting for higher-capacity CPUs.

He shared that CPU-based transcoding offers a wide range of adaptability, allowing for various encoding and decoding combinations. Given that CPUs are a standard component in servers, there’s no need for specialized hardware. On the other hand, he pointed out CPUs aren’t the best option for achieving high density or low power consumption.

GPU-Based Transcoding:

Regarding GPU-based transcoding, Barry stated that GPUs can handle a significant number of streams and take the heavy lifting off the CPU, ensuring smoother operation. However, GPUs are expensive and not exclusively designed for video processing, which can lead to higher power consumption.

VPU-Based Transcoding:

Barry expressed considerable enthusiasm for the capabilities of Video Processing Units (VPUs), or ASIC-based transcoders. Unlike general-purpose CPUs and GPUs, VPUs are purpose-built for video processing, which allows them to handle video streams with remarkable efficiency. In recent years, VPUs have emerged as a promising solution, especially when it comes to achieving high-density streaming. Barry noted that these units not only offer a competitive price per channel but also boast minimal power consumption.

The Evolution Towards Specialization:

Drawing from his insights, Barry seemed to suggest a trend in the streaming industry: a move towards more specialized solutions. While CPUs and GPUs have been stalwarts in the industry, the rise of VPUs indicates a shift towards tools and technologies tailored specifically for streaming. This specialization promises not only enhanced performance but also greater efficiency in terms of cost and energy consumption.

Distributing Your Streams

Barry concluded his talk by discussing the distribution options available from Wowza. He emphasized the importance of adaptability when it comes to scaling outputs, especially given the diverse audience sizes that streaming services might cater to. WSE offers multiple distribution options to ensure that content reaches its intended audience efficiently, regardless of its size.

On-Premises Scaling:

One of the primary methods Barry discussed was scaling on-premises. By simply adding more servers to the existing infrastructure, streaming services can handle a larger load. This method is particularly useful for organizations that already have a significant on-premises setup and are looking to leverage that infrastructure.

CDN (Content Delivery Network):

For those expecting a vast number of viewers, Barry recommended using a content delivery network, or CDN. CDNs are designed to handle large-scale content delivery, distributing the content across a network of servers to ensure smooth and efficient delivery to a global audience. By offloading the streaming to a CDN, services can ensure that their content reaches viewers without any hitches, even during peak times.

Hybrid Approaches:

Barry found the hybrid model particularly intriguing. This approach combines the strengths of both on-premises scaling and CDNs. For instance, an organization could use its on-premises setup for regular streaming to a smaller audience. However, during events or times when a larger audience is expected, they could “burst” to the cloud, leveraging the power of CDNs to handle the increased load. This model offers both cost efficiency and scalability, ensuring that services are not overextending their resources during regular times but are also prepared for peak events.

In essence, Barry underscored the importance of flexibility in scaling. The ability to choose between on-premises, CDN, or a hybrid approach ensures that streaming services can adapt to meet any audience size.

Figure 4. Options for distributing to various audience sizes.

Start Streaming in Minutes with WSE: The Demonstration

Figure 5. Click the image to run Barry’s demo.

Barry then ran a recorded demonstration, which you can watch below, to illustrate the simplicity of setting up the Wowza Streaming Engine using Docker. He ran the demo using Docker Desktop and Docker Compose, and the objective was to launch two containers: one for the Wowza Streaming Engine and another for its manager.

He began by activating the services using the command docker compose up. Since he recorded the demo on an M1 Mac, he noted that the process might be slightly slower due to the Rosetta translation layer. As the services initialized, Barry explained the YAML file he used to provision these services. The file contained configurations for both the Streaming Engine and its Manager, detailing aspects like image sources, environment variables, and port settings.
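
For readers who want to try something similar, here’s a minimal sketch of such a setup. The image name matches Wowza’s published Docker image and the environment variables follow Wowza’s Docker documentation, but the ports, credentials, and single-service layout are illustrative; Barry’s actual file ran the engine and its manager as separate services.

# Write a minimal, hypothetical docker-compose.yml, then start it
cat > docker-compose.yml <<'EOF'
services:
  wse:
    image: wowzamedia/wowza-streaming-engine-linux:latest
    ports:
      - "1935:1935"   # RTMP ingest
      - "8087:8087"   # REST API
      - "8088:8088"   # Streaming Engine Manager UI
    environment:
      WSE_LIC: "XXXX-XXXX-XXXX"     # your trial or paid license key
      WSE_MGR_USER: "admin"
      WSE_MGR_PASS: "changeme"
EOF
docker compose up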

With the services up and running, Barry navigated to Docker Desktop to monitor the performance of the two launched services, observing metrics like CPU and memory usage. He then accessed the Streaming Engine Manager via a web browser. Barry highlighted the versatility of Docker Compose, mentioning that it can manage multiple service instances, which can be beneficial for scalability, high availability, or clustering.
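
As an example of that versatility, launching additional engine instances from the sketch above is a single flag. Note that you would also need to drop the fixed host-port mappings (or front the instances with a load balancer) so the copies don’t collide on ports.

# Run three instances of the hypothetical wse service
docker compose up --scale wse=3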

Upon accessing the manager, Barry logged in to view the server’s health snapshot, providing insights into its status. He then navigated to a pre-configured application named ‘live’ to stream content. Using Open Broadcaster Software (OBS) on his system, Barry set it up to stream to the server, pointing out the server’s recognition of the incoming stream and its subsequent packaging.

Returning to the manager, Barry verified the incoming stream’s presence and details. He then extracted the HLS URL for the stream, which he opened in a Safari browser tab to demonstrate live playback. The stream played seamlessly, underscoring the efficiency and ease of the entire process.
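
For context, the URLs in this part of the demo follow Wowza’s standard patterns; the host and stream name below are placeholders.

# OBS stream settings for the pre-configured 'live' application
Server:      rtmp://your-server-ip:1935/live
Stream Key:  myStream

# Resulting HLS playback URL served by WSE
http://your-server-ip:1935/live/myStream/playlist.m3u8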

The demo showcased how, in a matter of minutes, you can configure, initiate, and stream using the Wowza Streaming Engine. You can get started yourself by downloading a trial version of WSE here.

ON-DEMAND:
Barry Owen, Start Streaming in Minutes with Wowza Streaming Engine

Simplify Building Your Own Streaming Cloud with GPAC

Romain Bouqueau is CEO of Motion Spell and one of the principal architects of the GPAC open-source software, one of the three software alternatives presented in the symposium. He spoke about the three challenges facing his typical customers: features, cost, and flexibility, and identified how GPAC delivers on each challenge.

Then, he illustrated these concepts with three impressive case studies: Synamedia/Quortex, Instagram, and Netflix. Overall, Romain made a strong case for GPAC as the transcoding/packaging element of your live streaming cloud.


Romain began his talk with an excellent summary of the situation facing many live-streaming engineers. “It’s a pleasure to discuss the challenges of building your own live-streaming cloud. Cloud services are convenient, but once you scale, you may realize that you’re paying too much and you are not as flexible as you’d like to be. I hope to convince you that the cost of customization that you have when using GPAC is actually an investment with a very interesting ROI if you make the right choices. That’s what we’re going to talk about.”

Figure 1. About Romain, GPAC, and Motion Spell.

Then, he briefly described his background as a principal architect of the GPAC open-source software, which he has contributed to for over 15 years. In this role, Romain is known for his advocacy of open source and open standards and as a media streaming entrepreneur. His primary focus has been on GPAC, a multimedia framework recognized for its emphasis on modularity and standards compliance.

He explained that GPAC offers tools for media content processing, inspection, packaging, streaming, playback, and interaction. Unlike many multimedia frameworks that cater to 2D TV-like experiences, GPAC is characterized by versatility, controlled latency, and the ability to support various scenarios, including hybrid broadcast broadband setups, interactivity, scripting, virtual reality, and 3D scenes.

Romain’s notable achievements include his work streamlining the MPEG ISO Base Media File Format (ISOBMFF) that underlies MP4, CMAF, DASH, and HLS, work recognized with a Technology & Engineering Emmy Award. To facilitate the wider use of GPAC, Romain established Motion Spell, which serves as a bridge between GPAC and its practical applications. Motion Spell provides consulting, support, and training, acting as the exclusive commercial licenser of GPAC.

During his introduction, Romain discussed challenges faced by companies in choosing between commercial solutions and open source for video encoding and packaging. He posited that many companies often lack the confidence and necessary skills to fully implement GPAC but emphasized that despite this, the implementation process is both achievable and simpler than commonly assumed.

He shared that his customers face three major challenges: features, cost, and flexibility. He addressed each in turn.

Features

Figure 2. The three challenges facing those building their live streaming cloud.

The first challenge Romain highlighted relates to features and capabilities. He advised the audience to create a comprehensive list that encompasses the needed capabilities, including codecs, formats, containers, DRMs, captions, and metadata management.

He also underscored the importance of seamless integration with the broader ecosystem, which involves interactions with external players, analytics probes, and specific content protocols. Romain noted that while some solutions offer user-friendly graphical interfaces, deeper configuration details often need to be addressed to accommodate diverse codecs, parameters, and use cases, especially at scale.

Highlighting Netflix’s usage of GPAC, Romain emphasized that GPAC is well-equipped to handle features and innovation, given its research and standardization foundation. He acknowledged that while GPAC is often a step ahead in the industry, it cannot implement everything alone. Thus, sponsorship and contributions from the industry are crucial for the continued development of this open-source software.

Romain explained that GPAC’s compatibility with the ecosystem is a result of its broad availability. Its role as a reference implementation, driven by standardization efforts, makes it a favored choice. Additionally, he mentioned that Motion Spell’s efforts have led to GPAC becoming part of numerous plugin systems across the industry.

Cost

The second challenge highlighted by Romain is cost optimization. He explained that costs are typically divided into Capital Expenditure (CAPEX) and Operational Expenditure (OPEX). He noted that GPAC, written in C, benefits from rigorous scrutiny from the open-source community, making it highly efficient. He acknowledged that while GPAC offers various features, each use case varies, leading to questions about resource allocation. Romain encouraged considerations like whether all channels need CDNs and whether all content needs premium encoders.

Regarding CAPEX, Romain mentioned integration costs associated with open-source software, emphasizing that some costs might be challenging to evaluate, such as error handling. He referenced the Synamedia/Quortex architecture as an example of efficient error management. Romain also addressed the misconception that open source implies free software, referencing a seminar he participated in that compared the costs of different options.

He shared an example of a broadcaster with a catalog of 100,000 videos and 500 concurrent streams. The CAPEX for packaging ranged from $100,000 to $200,000, depending on factors like developer rates and location, with running costs being relatively low compared to transcoding costs.

Romain revealed that, based on his research, open source consistently ranked as the most cost-efficient option or a close competitor across different use cases. He concluded that combining GPAC with Motion Spell’s professional services and efficient encoding appliances like NETINT’s aligns well with the industry’s efficiency challenges.

Flexibility

The final challenge discussed by Romain was flexibility, emphasizing the importance of moving swiftly in a fast-paced environment. He described how Netflix successfully transitioned from SVOD to AVOD, adapted from on-demand to live streaming, switched from H.264 to newer codecs, and consolidated multiple containers into one over short time frames, contributing to their profitability. Romain underlined the potential for others to achieve similar success using GPAC.

He introduced a new application within GPAC called “gpac”, designed to build customized media pipelines. In contrast to historical GPAC applications that offered fixed media pipelines, the new “gpac” application enables users to create tailored pipelines to address specific requirements, including transcoding, packaging, content protection, networking, and, in general, any feature you need for your private cloud.
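
For a flavor of that filter-based design, here’s a minimal gpac invocation of the kind shown in GPAC’s documentation: you declare a source and a destination, and the framework assembles the pipeline of filters in between (filenames are illustrative).

# Repackage an already-encoded MP4 as DASH; gpac inserts the needed filters
gpac -i input.mp4 -o output.mpd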

The Synamedia/Quortex “just-in-time everything” paradigm

Figure 3. Motion Spell’s work with Quortex, which was acquired by Synamedia.

Romain then moved on to the Synamedia/Quortex use case, which illustrated GPAC’s ability to supply comprehensive features. He described Quortex’s innovative “just-in-time everything” paradigm for media pipelines.

Unlike the traditional 24/7 transcoder that is designed to never fail and requires backup solutions for seamless switching, Quortex divides the media pipeline into small components that can fail and be relaunched when necessary. This approach is particularly effective for live streaming scenarios, offering low latency.

Romain highlighted that the Quortex approach is highly adaptable as it can run on various instances, including cloud instances that are cost-effective but might experience interruptions. The system generates content on-demand, meaning that when a user wants to watch specific content on a device, it’s either cached, already generated, or created just-in-time. This includes packaging, transcoding, and other media processing tasks.

Romain attributed the success of the development project to Quortex’s vision and talented teams, as well as the strategic partnership with Motion Spell. He also shared that after project completion, Synamedia acquired Quortex.

Instagram

Figure 4. GPAC helped Instagram cut compute times by 94%.

The second use case addressed the challenge of cost and involved Instagram, a member of the Meta Group. According to Romain, Instagram utilized GPAC’s MP4Box to reduce video compute times by an impressive 94%. This strategic decision helped prevent a capacity shortage within just twelve months, ensuring the platform’s ability to provide video uploads for all users.

Romain presented Instagram’s approach as noteworthy because it emphasizes the importance of optimizing costs based on content usage patterns. The platform decided to prioritize transmission and packaging of content over transcoding, recognizing that a significant portion of Instagram’s content is watched only a few times. In this scenario, the cost of transcoding outweighs the savings on distribution expenses. As Romain explained, “It made more sense for them to package and transmit most content instead of transcoding it, because most of Instagram’s content is watched only a few times. The cost of transcoding, in their case, outweighs the savings on the distribution cost.”

According to Romain, this strategy aligns with the broader efficiency trend in the media tech industry. By adopting a combined approach, Instagram used lower quality and color profiles for less popular content, while leveraging higher quality encoders for content requiring better compression. This optimization was possible because Instagram controls its own encoding infrastructure, which underscores the value of open-source solutions in providing control and flexibility to organizations.

The computational complexity of GPAC’s packaging is close to a bit-for-bit copy, contributing to the 94% reduction in compute times. Romain felt that Instagram’s successful outcome exemplifies how open-source solutions like GPAC can empower organizations to make significant efficiency gains while retaining control over their systems.
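
To make the packaging-versus-transcoding point concrete, here is a representative MP4Box command that repackages an already-encoded file for DASH delivery without touching the video data; the segment duration and filenames are illustrative.

# Segment for DASH in 4-second chunks starting at random access points; no transcode
MP4Box -dash 4000 -rap -profile live input.mp4 -out output.mpd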

Netflix

Figure 5. GPAC helped Netflix transition from SVOD to AVOD, from On-Demand to live, and from H.264 to newer codecs.

The final use case addresses the challenge of flexibility and involves a significant collaboration between GPAC, Motion Spell, and Netflix. According to Romain, this collaboration had a profound impact on Netflix’s video encoding and packaging platform, and contributed to an exceptional streaming experience for millions of viewers globally.

At the NAB Streaming Summit, Netflix and Motion Spell took the stage to discuss the successful integration of GPAC’s open-source software into Netflix’s content operations. During the talk, Netflix highlighted the ubiquity of the ISO Base Media File Format (ISOBMFF) in its workflows and emphasized its commitment to open standards and innovation. The alignment between GPAC and Netflix’s goals allowed Netflix to leverage GPAC’s innovations for free, thanks to sponsorships and prior implementations.

Romain explained how Netflix’s transformation from SVOD to AVOD, from On-Demand to live, and from H.264 to newer codecs was facilitated by GPAC’s ease of integration and efficiency in operations. In this fashion, he asserted, the collaboration between Motion Spell and Netflix exemplifies the capacity of open-source solutions to drive innovation and adaptability.

Romain further described how GPAC’s rich feature set, rooted in research and standardization, offers capabilities beyond most publishers’ current needs. The unified “gpac” executable simplifies deployment, making it accessible for service implementation. Leveraging open-source principles, GPAC proves to be cost-competitive and easy to integrate. Motion Spell’s role in helping organizations maximize GPAC’s potential, as demonstrated with Netflix, underscores the practical benefits of the collaboration.

Romain summarized how GPAC’s flexibility empowers organizations to optimize and differentiate themselves rapidly. Examples like Netflix’s interactive Bandersnatch, intelligent previews, exceptional captioning, and accessibility enhancements showcase GPAC’s adaptability to evolving demands. Looking forward, Romain described how user feedback continues to shape GPAC’s evolution, ensuring its continued improvement and relevance in the media tech landscape.

With a detailed description of GPAC’s features and capabilities, underscored by very relevant case studies, Romain clearly demonstrated how GPAC can help live streaming publishers overcome any infrastructure-related challenge. And for those who would like to learn more, or need support or assistance integrating GPAC into their workflows, he invited them to contact him directly.


ON-DEMAND:
Romain Bouqueau, Deploying GPAC for Transcoding and Packaging

Simplify Building Your Own Streaming Cloud with Norsk SDK

Adrian Roe from id3as discussed Norsk, a technology designed to simplify the building of large-scale media workflows. id3as, originally a consultancy-led organization, works with major clients like DAZN and Nasdaq and is now pivoting to concentrate on Norsk, which it sells as an SDK. The technology underlying Norsk is responsible for delivering hundreds of thousands of live events annually and offers extensive expertise in low-latency and early-adoption technologies.

Adrian emphasized the company’s commitment to reliability, especially during infrastructure failures, and its initiatives in promoting energy-efficient streaming, including founding the Greening of Streaming organization. He also highlighted that about half of their deployments are cloud-based, suitable for fluctuating workloads, while the other half are on-premises or hybrid models, often driven by the need for high density at low cost and low energy consumption.


Encoding Infrastructure is Simpler and Cheaper than Ever Before

The focus of the symposium was creating your own encoding infrastructure, and Adrian next focused on how new technologies were simplifying this and making it more affordable. For example, Adrian mentioned that advancements like NETINT’s Quadra video processing units (VPU) are changing the game, allowing some clients to consider shifting back to on-premises solutions.

Then, he described a recent server purchase to highlight the advancements in computing hardware capabilities. The server, which is readily available off-the-shelf and not particularly expensive, boasts impressive specs with 256 physical cores, 512 logical cores, and room for 24 U.2 cards like NETINT’s T408 or Quadra T1U.

Adrian then shared that during load testing, the server’s CPU profile was so extensive that it exceeded the display capacity of his screen, and he joked that it gave him an excuse to file an expense report for a new monitor. This anecdote emphasized the enormous processing capacity now available in relatively affordable hardware setups. The server occupies just 2U of rack space, and Adrian speculated that it could potentially deliver hundreds of channels in a fully loaded configuration, showcasing the leaps in efficiency and power in modern server hardware.


Figure 2. Infrastructure is getting cheaper and more capable.

Why Norsk?

Adrian then shifted his focus to Norsk. He emphasized that Norsk is designed to cater to large broadcasters and systems integrators who require more than just off-the-shelf solutions. These clients often need specialized functionalities, like the ability to make automated decisions such as switching to a backup source if the primary one fails, without the need for human intervention.

They may also require complex multi-camera setups and dynamic overlays of scores or graphics during live events. Norsk is engineered to simplify these historically challenging tasks, enabling clients to easily put together sophisticated streaming solutions.

Figure 3. Why Norsk in a nutshell.

He also pointed out that while some existing solutions may offer these features out of the box, creating such capabilities from scratch usually requires a significant engineering effort and demands professionals with advanced skills and a deep understanding of media technology, including intricate details of different video and container formats and how to handle them.

According to Adrian, Norsk eliminates these complexities, making it easier for companies to implement advanced streaming functionalities without the need for specialized knowledge. In short, Norsk fills the gap in the market for large broadcasters and systems integrators who require customized, automated decision-making capabilities in their streaming solutions.

Norsk In Action

Adrian then began demonstrating Norsk’s operation. He started by showing Figure 4 as an example of an output that Norsk might produce. This involved multiple inputs and overlays of scores or graphics that might need to update dynamically.

Figure 4. A typical production with multiple inputs and overlays that needed to change dynamically.

Figure 5 shows, in its entirety, the code Norsk uses to produce this output via its “low-code” approach. Parsing through the code, in the top section, you see the inputs, outputs, and transformation nodes. In this example, Norsk ingests RTMP and SRT (and a logo from a file) and publishes the output over WHEP (WebRTC-HTTP Egress Protocol). However, with Norsk it is easy to accommodate any of the common formats; for example, to change the output to (Low Latency) HLS, you would simply replace the “whep” output with HLS, and you’d be done.

Figure 5. Norsk’s low-code approach to the production shown in Figure 4.

The next section of code directs how the media flows between the nodes. Compose takes the video from the various inputs, while the audio mixer combines the audio from inputs 1 and 2. Finally, the WHEP output subscribes to the outputs of the audio mixer and compose nodes. That’s all the code needed to create a complex picture-in-picture.

Adrian then went over the building blocks from which Norsk solutions are constructed, starting with an example of a pass-through setup where an RTMP input is published as a WebRTC output (Figure 6). With Norsk, all that’s needed is to specify that the output should get its audio and video from a particular input node, in this case, RTMP.

He then shared that Norsk is designed to be format-agnostic so that if the input node changes to another format like SRT or SDI, everything else in the setup will continue to function seamlessly. This ease of use allows for the quick development of sophisticated streaming solutions without requiring deep technical expertise.

Figure 6. A simple example of an RTMP input published as WebRTC.

Adrian then described how Norsk handles potential incompatibilities that might arise in a workflow. In the above example, he noted that WebRTC supports only the Opus audio codec, which is not supported by RTMP.

In these cases, Norsk automatically identifies the incoming audio codec (in this case, AAC) and transcodes it to Opus for compatibility with WebRTC. It also changes the encapsulation of the H.264 video for WebRTC compatibility. These automated adjustments showcase Norsk’s ability to simplify complex streaming workflows by making intelligent decisions to ensure compatibility and functionality.


Figure 7. Norsk will automatically adjust your workflow to make it work; in this case, converting AAC to Opus and encapsulating the H.264 encoded video for WebRTC output.

Next in the quick tour of “building blocks,” Adrian showed how easy it is to build a source switcher, allowing the user to switch dynamically between two camera inputs (Figure 8). He explained how id3as’ low-code approach makes it easy and natural to extend this setup, for example, to handle an unknown number of sources that might come and go during a live event.

Figure 8. A simple production with two cameras, a source switcher, and WebRTC output.

According to Adrian, this simplicity allows engineers building solutions with Norsk to focus on the user experience they want to deliver to their customers as well as how to automate and simplify operations. They can focus on the intended result, not on the highly complex media technology required to deliver that result. This puts their logic into a very transparent context and simplifies building an application that delivers what’s intended.

Visualizing Productions

To better manage and control operations, Norsk supports visualizations via an OpenTelemetry API, which enables real-time data retrieval and input into a decisioning system for monitoring. In addition to simple integration with such monitoring tools, Norsk includes the visualizer shown in Figure 9, which renders this data as an easy-to-understand flow of media between nodes. You’ll see two more examples of this below.

Figure 9. Norsk’s workflow visualization makes it simple to understand the media flow within an application.

Adrian then returned to the picture-in-picture application shown earlier to illustrate how the effect was created. It’s very easy to position, size, and control each of the three elements, so the engineer can focus on the desired output rather than the underlying media manipulation.

Figure 10. Integrating three production inputs into a picture-in-picture presentation in Norsk.

Adrian highlighted the convenience and flexibility of Norsk’s low-code approach by describing how the system handles dynamic updates and configurations using code. He emphasized that the entire process of making configuration changes, like repositioning embedded areas or switching sources, involves just a few lines of code. This approach allows users to easily build complex functionalities like a video mixer with minimal engineering efforts.

Additionally, Adrian described how overlays are seamlessly integrated into the workflow. He explained that a browser overlay is treated as just another source which can be transformed and composed alongside other sources. By combining and outputting these elements, a sophisticated output with overlays can be achieved with minimal code.

Adrian emphasized that the features he demonstrated are sufficient to build a comprehensive live production system with Norsk, like the one shown in Figure 11. With Norsk’s low-code approach, he asserted, no additional complex calls are required to achieve the level of sophistication demonstrated. With Norsk, he reiterated, engineers building media applications can focus on creating the desired user experience rather than dealing with intricate technical details.

Figure 11. Norsk enables productions like this with just a few lines of code.

Taking a big-picture view of how productions are created and refined, Adrian shared how the entire process of describing media requirements and building proof of concepts is streamlined with Norsk’s approach. With just a few lines of code, proof of concepts can be developed in a matter of hours or days. This leads to shorter feedback cycles with potential users, enabling quicker validation of whether the solution meets their needs. In this manner, Adrian noted that Norsk enables rapid feature development and allows for quicker feature launches to the market.

Integrating Encoding Hardware

Adrian then shifted his focus to integrations with encoding hardware, noting that many customers have production hardware that utilizes transcoders and VPUs like those supplied by NETINT to achieve high-scale performance. However, the development teams might not have the same production setup for testing and development purposes. Norsk addresses this challenge by providing an easy way for developers to work productively on their applications without requiring the exact production hardware.

Figure 12 shows an example in which developers configure different settings for different environments. For instance, in a production or QA environment, the output settings could be configured for 1080p at 60 frames per second with specific Quadra configurations.

In contrast, in a development environment, the output settings might be configured for the x264 codec outputting 720p with different parameters, like using the ultrafast preset and zero latency. This approach allows engineers to have a productive development experience while not requiring the same processing power or hardware as the production setup.

Figure 12. Norsk can use one set of transcoding parameters for development (on the right), and another for production.

Adrian then described how Norsk takes full advantage of the acceleration capabilities of third-party transcoders to optimize performance, sharing that with the NETINT cards, Norsk outperformed FFmpeg. For example, when using hardware transcoders, it’s generally more efficient to keep the processing on the hardware as much as possible to avoid unnecessary data transfers.

Adrian provided a comparison between scenarios where hardware acceleration is used and scenarios where it’s not. In one example, he showed how a NETINT T408 was used for hardware decoding, but some manipulations like picture-in-picture and resizing weren’t natively supported by the hardware. In this case, Norsk pulled the content to the CPU, performed the necessary manipulations, and then sent it back to the hardware for encoding (Figure 13).

Figure 13. Working with the T408, Norsk had to scale and overlay via the host CPU.

In contrast, with a Quadra card that does support onboard scaling and overlay, Norsk performed these functions on the hardware, remarkably using the same exact application code as for the T408 version (Figure 14). This way, Adrian emphasized, Norsk maximized the efficiency of the hardware transcoder and optimized overall system performance.

Figure 14. Norsk was able to scale an overlay on the Quadra using the same code as on the T408.

Adrian also highlighted the practicality of using Norsk by offering trial licenses for users to experience its capabilities. The trial license allows users to explore Norsk’s features and benefits, showcasing how it leverages emerging hardware technologies in the market to deliver high-density, high-availability, and energy-efficient media experiences. He noted that the trial software was fully capable, though no single session can exceed 20 minutes in duration.

Adrian then took a question from the audience, addressing Norsk’s support for SCTE-35. Adrian highlighted that Norsk is capable of SCTE-35 insertion to signal events such as ad insertion and program switching. Additionally, he noted that Norsk allows the insertion of tags into HLS and DASH manifest files, which can trigger specific events in downstream systems. This functionality enables seamless integration and synchronization with various parts of the media distribution workflow.

Adrian also mentioned that Norsk offers integration with digital rights management (DRM) providers. This means that after content is processed and formatted, it can be securely packaged to ensure that only authorized viewers have access to it. Norsk’s background in the broadcast industry has enabled it to incorporate these capabilities that are essential for delivering content to the right audiences while maintaining content protection and security.

For more information about Norsk, contact the company via their website or request a meeting. And if you’ll be at IBC, you can set up a meeting with them HERE.

ON-DEMAND: Adrian Roe, CEO at id3as | Make Live Easy with NORSK SDK

Beyond Traditional Transcoding: NETINT’s Pioneering Technology for Today’s Streaming Needs

Welcome to our here’s-what’s-new-since-last-IBC-so-you-should-schedule-a-meeting-with-us blog post. I know you’ve got many of these to wade through, so I’ll be brief.

First, a brief introduction. We’re NETINT, the ASIC-based transcoding company. We sell standalone products like our T408 video transcoder and Quadra VPUs (video processing units), as well as servers with ten of either device installed. All offer exceptional throughput at an industry-low cost per stream and power consumption per stream. Our products are denser, leaner, and greener than any competitive technology.
They’re also more innovative. The first-generation T408 was the first new ASIC-based hardware transcoder to ship in at least a decade, and the second-generation Quadra was the first hardware transcoder with AV1 and AI processing. Quadra shipped before Google and Meta shipped their first-generation ASIC-based transcoders, and theirs still don’t support AV1.
That’s us; here’s what’s new.

Capped CRF Encoding

We’ve added capped CRF encoding to our Quadra products for H.264, HEVC, and AV1, with capped CRF coming for the T408 and T432 (H.264/HEVC). By way of background, with the wide adoption of content-adaptive encoding (CAE) techniques, constant rate factor (CRF) encoding with a bitrate cap has gained popularity as a lightweight form of CAE: it reduces the bitrate of easy-to-encode sequences, saving delivery bandwidth, while delivering CBR-like quality on hard-to-encode sequences. Capped CRF is a mode that we expect many of our customers to use.

Figure 1 shows capped CRF operation on a theoretical football clip. The relevant switches in the command string would look something like this:

-crf 21 -maxrate 6M

This directs FFmpeg to deliver at least the quality of CRF 21, which for H.264 typically equals around a 95 VMAF score. However, the maxrate switch ensures that the bitrate never exceeds 6 Mbps.
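
Here’s what a complete command might look like, using FFmpeg’s software x264 encoder for illustration; NETINT’s hardware encoders expose equivalent controls through their FFmpeg patches. The bufsize value, which governs how strictly the cap is enforced, is a typical choice rather than a requirement.

# Capped CRF: CRF 21 quality, but never more than ~6 Mbps
ffmpeg -i input.mp4 -c:v libx264 -crf 21 -maxrate 6M -bufsize 12M -c:a copy output.mp4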

As shown in the figure, in operation, the Quadra VPU transcodes the easy-to-encode sideline shots at CRF 21 quality, producing a bitrate of around 2 Mbps. Then, during actual high-motion game footage, the 6 Mbps cap takes over, and the VPU delivers the same quality as CBR. In this fashion, capped CRF saves bandwidth on easy-to-encode scenes while delivering CBR-equivalent quality on hard-to-encode scenes.

Figure 1. Capped CRF in operation. Relatively low-motion sideline shots are encoded to CRF 21 quality (~95 VMAF), while the 6 Mbps bitrate cap controls during high-motion game footage.

By deploying capped CRF, engineers can efficiently deliver high-quality video streams, enhance viewer experiences, and reduce operational expenses. As the demand for video streaming continues to grow, capped CRF emerges as a game-changer for engineers striving to stay at the forefront of video delivery optimization.

You can read more about capped CRF operation and performance in Get Free CAE on NETINT VPUs with Capped CRF.

Peer-to-Peer Direct Memory Access (DMA) for Cloud Gaming

Peer-to-peer DMA is a feature that makes the NETINT Quadra VPU ideal for cloud gaming. By way of background, in a cloud-gaming workflow, the GPU is primarily used to render frames from the game engine output. Once rendered, these frames are encoded with codecs like H.264 and HEVC.

Many GPUs can render frames and transcode to these codecs, so it might seem most efficient to perform both operations on the same GPU. However, encoding demands a significant chunk of the GPU’s resources, which in turn reduces overall system throughput. It’s not the rendering engine that’s stretched to its limits but the encoder.

What happens when you introduce a dedicated video transcoder into the system using normal techniques? The host CPU manages the frame transfer between the GPU and the transcoder, which can create a bottleneck and slow system performance.

Figure 2. Peer-to-peer DMA enables up to 200 720p60 game streams from a single 2RU server.

In contrast, peer-to-peer DMA allows the GPU to send frames directly to the transcoder, eliminating CPU involvement in data transfers (Figure 2). With peer-to-peer DMA enabled, the Quadra supports latencies as low as 8ms, even under heavy loads. It also unburdens the CPU from managing inter-device data transfers, freeing it to handle other essential tasks like game logic and physics calculations. This optimization enhances the overall system performance, ensuring a seamless gaming experience.

Some NETINT customers are using Quadra and peer-to-peer DMA to produce 200 720p60 game streams from a single 2RU server, and that number will increase to 400 before year-end. If you’re currently assembling an infrastructure for cloud gaming, come see us at IBC.

Logan Video Server

NETINT started selling standalone PCIe and U.2 transcoding devices, which our customers installed into servers. In late 2022, customers started requesting a prepackaged solution comprised of a server with ten transcoders installed. The Logan Video Server is our first response.

Logan refers to NETINT’s first-generation G4 ASIC, which transcodes to H.264 and HEVC. The Logan Video Server, which launched in the first quarter of 2023, includes a SuperMicro server with a 32-core AMD CPU running Ubuntu 20.04 LTS and ten NETINT T408 U.2 transcoder cards (which cost $300 each) for $8,900. There’s also a 64-core option available for $11,500 and an 8-core option for $7,000.

The value proposition is simple. You get a break on price because of volume commitments and don’t have to install the individual cards, which is generally simple but still can take an hour or two. And the performance with ten installed cards is stunning, given the price tag.

You can read about the performance of the 32-core server model in my review here, which also discusses the software architecture and operation. We’ll share one table, which shows one-to-one transcoding of 4K, 1080p, and 720p inputs with FFmpeg and GStreamer.

At $8,900, the server delivers a cost per stream as low as $445 for 4K (that’s 20 simultaneous 4K streams), $111.25 for 1080p (80 streams), and just over $50 for 720p, at normal and low latency. Since each T408 draws only 7 watts and CPU utilization is so low, power consumption is also exceptionally low.

Table 1. One-to-one transcoding performance for 4K, 1080p, and 720p.

With impressive density, low power consumption, and multiple integration options, the Logan Video Server is the new standard to beat for live streaming applications. With a lower-priced model available for pure encoding operations and a more powerful model for CPU-intensive operations, the NETINT Logan server family meets a broad range of requirements.

Quadra Video Server

Once the Logan Video Server became available, customers started asking about a similarly configured server for NETINT’s Quadra line of video processing units (VPUs), which adds AV1 output, onboard scaling and overlay, and two AI processing engines. So, we created the Quadra Video Server.

This model uses the same Supermicro chassis as the Logan Video Server and the same Ubuntu operating system but comes with ten Quadra T1U U.2 form factor VPUs, which retail for $1,500 each. Each T1U offers roughly four times the throughput of the T408, performs on-board scaling and overlay, and can output AV1 in addition to H.264 and HEVC.

The CPU options are the same as the Logan server, with the 8-core unit costing $19,000, the 32-core unit costing $21,000, and the 64-core model costing $24,000. That’s 4X the throughput at just over 2x the price.

You can read my review of the 32-core Quadra Video Server here. I’ll again share one table, this time reporting encoding ladder performance at 1080p for H.264 (120 ladders), HEVC (140), and AV1 (120), and 4K for HEVC (40) and AV1 (30).

In comparison, running FFmpeg using only the CPU, the 32-core system only produced nineteen H.264 1080p ladders, five HEVC 1080p ladders, and six AV1 1080p ladders. Given this low-volume throughput at 1080p, we didn’t bother trying to duplicate the 4K results with CPU-only transcoding.

Table 2. Encoding ladder performance of the Quadra Video Server.

Beyond sheer transcoding performance, the review also details AI-based operations and performance for tasks like region of interest transcoding, which can preserve facial quality in security and other relatively low-quality videos, and background removal for conferencing applications.

Where the Logan Video Server is your best low-cost option for high volume H.264 and HEVC transcoding, the Quadra Video Server quadruples these outputs, adds AV1 and onboard scaling and overlay, and makes AI processing available.

Come See Us at the Show

We now return to our normally scheduled IBC pitch. We’ll be in Stand 5.A86 and you can book a meeting by clicking here.

Figure 3. Book a meeting.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

NETINT Buyer’s Guide. Choosing the Right VPU & Server for Your Workflow.

This guide is designed to help you choose the optimum NETINT Video Processing Unit (VPU) for your encoding workflow.

As an overview, note that all NETINT hardware products (VPUs and transcoders) run the same basic software controlled via FFmpeg and GStreamer patches or an SDK. This includes load balancing of all encoding resources in a server. In addition, both generations are similar in terms of latency and HDR support.
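
As an illustration of what that control layer looks like in practice, a transcode through the FFmpeg patch is a standard FFmpeg command with a NETINT encoder selected. The encoder name below follows the pattern used in NETINT’s published Quadra examples, but treat it as an assumption and verify the exact name against your SDK version.

# Hypothetical 1080p H.264 transcode on a Quadra VPU via the FFmpeg patch
ffmpeg -i input.mp4 -c:v h264_ni_quadra_enc -b:v 5M output.mp4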

Question 1. Which ASIC Architecture: Codensity G4 (Logan) or Codensity G5 (Quadra)?

Tables 1 and 2 show the similarities and differences between Codensity G4 ASIC-powered products (T408 and T432) and Codensity G5-based products (Quadra T1U, T1A, T2A). Both architectures are available in either the U.2 or AIC form factor, the latter all half-height half-length (HHHL) configurations.

From a codec perspective, the main difference is that G5-based products support AV1 encoding and VP9 decoding. In terms of throughput, G5-based products offer four times the throughput but cost roughly three times more than G4, making the cost per output stream similar but with greater stream densities per host server. G5 power consumption is roughly 3x higher per ASIC than G4, but the throughput is 4x, making power consumption per stream lower.

Table 1. Codec support, throughput, and power consumption.

Table 2 covers other hardware features. From an encoding perspective, G5-based products enable tuning of quality and throughput to match your applications, while quality and throughput are fixed for G4-based products. The G5’s quality ceiling is higher than the G4’s, at the cost of throughput, and its quality floor is lower, with an option for higher throughput.

G5-based products are much more capable hardware-wise, performing scaling, overlay, and audio compression, and offering AI processing of 15 TOPS for the T1U and 18 TOPS for the T1A (36 TOPS for the T2A). In contrast, G4-based products scale, overlay, and encode audio via the host CPU and offer no AI processing. You can read about Quadra’s AI capability here.

Peer-to-peer DMA, available only on G5-based products, allows them to communicate directly with certain GPUs, which is particularly useful in cloud gaming. Learn about peer-to-peer DMA here.

Note that G4 and G5-based devices can co-exist on the same server, so you can add G5 devices to a server with G4 devices already installed and vice versa.

Table 2. Advanced hardware functionality.

Observations:

  • Codensity G4 and G5-based VPUs offer similar cost-per-stream, with Quadra slightly more efficient on a watts-per-stream basis. Both products transcode to H.264 and HEVC formats (G5 encodes to AV1 and decodes VP9).

  • Choose G4-based products for:
    • The absolute lowest overall cost
    • Compatibility with existing G4-based encoding stacks
    • Interactive same-resolution-in/out productions (minimal scaling and overlay)

  • Choose G5-based products for:
    • AV1 output
    • AI integration
    • Applications that need quality and throughput tuning
    • Applications that involve scaling and overlay
    • Maximum throughput from a single server
    • Cloud gaming

Question 2: Which G4-based Product?

This section discusses your G4-based options shown in Figure 1, with the U.2-based T408 in the background and AIC-form factor T432 in the foreground. These products are designated as Transcoders since this is their primary hardware-based function.

Figure 1. The NETINT T408 in the back, T432 in the front.

Table 3 identifies the key differences between NETINT’s two G4-based VPUs, the T408, which includes a single G4 ASIC in a U.2 form factor, and the T432, which includes four G4 ASICS in an AIC half-height half-length configuration.

Table 3. NETINT’s two G4-based products.

Observations:

  • The U.2-based T408 offers the best available density for installing units into a 1RU server.
  • The AIC-based T432 is the best option for computers without U.2 connections and for maximum server chassis density.

Question 3: Which G5-based Product?

Figure 2 identifies the three Quadra G5-based products, with the U.2-based T1U in the back, the AIC-based T1A in the middle, and the AIC-based T2A in the front. These products are designated Video Processing Units, or VPUs, because their hardware functionality extends far beyond simple transcoding.

Choosing the Right VPU & Server - Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.
Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.

Table 4 identifies the key differences between NETINT’s three G5-based VPUs:

  • The T1U includes a single G5 ASIC in a U.2 form factor.
  • The T1A includes a single G5 ASIC in an AIC half-height half-length configuration.
  • The T2A includes two G5 ASICs in an AIC half-height half-length configuration.

Choosing the Right VPU & Server - Table 4. NETINT’s three G5-based products.
Table 4. NETINT’s three G5-based products.

Observations:

  • The U.2-based Quadra T1U offers the best density for installing in a 1RU server.
  • The Quadra T2A offers the best density for AIC-based installation and is ideal for cloud gaming servers that need peer-to-peer DMA communication with GPUs.
  • The AIC-based Quadra T1A is the most affordable AIC option for installs that don’t need maximum density.

Question 4: VPU or Server?

NETINT offers two video servers that use the same Supermicro 1114S-WN10RT server chassis; the Logan Video Server contains ten T408 U.2 VPUs, while the Quadra Video Server contains ten Quadra T1U VPUs. Servers offer a turnkey option for fast and simple deployment.

An advantage of buying a NETINT Video Server is all components, including CPU, RAM, hard drive, OS, and software versions, have been extensively tested for compatibility, stability, and performance, making them the easiest and fastest way to transition from software to hardware encoding.

As for the choice between servers, your answer to question 1 should guide your selection.

If you have any questions about any products, please contact us here.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Choosing Transcoding Hardware: Deciphering the Superiority of ASIC-based Technology

Which technology reigns supreme in transcoding: CPU-only, GPU, or ASIC-based? Kenneth Robinson’s incisive analysis from the recent symposium makes a compelling case for ASIC-based transcoding hardware, particularly NETINT’s Quadra. Robinson’s metrics prioritized viewer experience, power efficiency, and cost. While CPU-only systems appear initially economical, they falter with advanced codecs like HEVC. NVIDIA’s GPU transcoding offers more promise, but the Quadra system still outclasses both in quality, cost per stream, and power consumption. Furthermore, Quadra’s adaptability allows a seamless switch between H.264 and HEVC without incurring additional costs. Independent assessments, such as Ilya Mikhaelis’, echo Robinson’s conclusions, cementing ASIC-based transcoding hardware as the optimal choice.

Choosing transcoding hardware

During the recent symposium, Kenneth Robinson, NETINT’s manager of Field Application Engineering, compared three transcoding technologies: CPU-only, GPU, and ASIC-based transcoding hardware. His analysis, which incorporated quality, throughput, and power consumption, is useful as a template for testing methodology and for the results. You can watch his presentation here and download a copy of his presentation materials here.

Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.
Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.

As a preview of his findings, Kenneth found that when producing H.264, ASIC-based hardware transcoding delivered CAPEX savings of 86% and 77% compared to CPU and GPU-based transcoding, respectively. OPEX savings were 95% vs. CPU-only transcoding and 88% compared to GPU.

For the more computationally complex HEVC codec, the savings were even greater. As compared to CPU-based transcoding, ASICs saved 94% on CAPEX and 98% on OPEX. As compared to GPU-based transcoding, ASICs saved 82% on CAPEX and 90% on OPEX. These savings are obviously profound and can make the difference between a successful and profitable service and one that’s mired in red ink.

Let’s jump into Kenneth’s analysis.

Determining Factors

Digging into the transcoding alternatives, Kenneth described the three options. First are CPUs from manufacturers like AMD or Intel. Second are GPUs from companies like NVIDIA or AMD. Third are ASICs, or Application Specific Integrated Circuits, from manufacturers like NETINT. Kenneth noted that NETINT calls its Quadra devices Video Processing Units (VPU), rather than transcoders because they perform multiple additional functions besides transcoding, including onboard scaling, overlay, and AI processing.

He then outlined the four decision factors shown in Figure 2. Quality is the average quality as assessed using metrics like VMAF, PSNR, or subjective video quality evaluations involving A/B comparisons with viewers. Kenneth used VMAF for this comparison because VMAF has been shown to have the highest correlation with subjective scores, making it a good predictor of viewer quality of experience.

Choosing transcoding hardware - Determining Factors
Figure 2. How Kenneth compared the technologies.

Low-frame quality is the lowest VMAF score on any frame in the file. This is a predictor for transient quality issues that might only impact a short segment of the file. While these might not significantly impact overall average quality, short, low-quality regions may nonetheless degrade the viewer’s quality of experience, so are worth tracking in addition to average quality.
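If you want to reproduce this kind of analysis, FFmpeg’s libvmaf filter computes both average and per-frame VMAF scores. A minimal sketch, assuming an FFmpeg build with libvmaf enabled and that the encode and source share resolution and frame rate:

ffmpeg -i encoded.mp4 -i source.mp4 -lavfi libvmaf=log_path=vmaf.json:log_fmt=json -f null -

The JSON log contains a score for every frame, so the low-frame value is simply the minimum entry.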

Server capacity measures how many streams each configuration can output, which is also referred to as throughput. Dividing server cost by the number of output streams produces the cost per stream, which is the most relevant capital cost comparison. The higher the number of output streams, the lower the cost per stream and the lower the necessary capital expenditures (CAPEX) when launching the service or sourcing additional capacity.

Power consumption measures the power draw of a server during operation. Dividing this by the number of streams produced results in the power per stream, the most useful figure for comparing different technologies.
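Both metrics reduce to simple division (the 5 W example below uses invented round numbers purely for illustration):

cost per stream = server cost / concurrent output streams
power per stream = measured server draw / concurrent output streams
Example: a server drawing 500 W while producing 100 streams consumes 5 W per stream.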

Detailing his test procedures, Kenneth noted that he tested CPU-only transcoding on a system equipped with an AMD Epyc 32-core CPU. Then he installed the NVIDIA L4 GPU (a recent release) for GPU testing and NETINT’s Quadra T1U U.2 form factor VPU for ASIC-based testing.

He evaluated two codecs, H.264 and HEVC, using a single file, the Meridian file from Netflix, which contains a mix of low and high-motion scenes and many challenging elements like bright lights, smoke and fog, and very dark regions. If you’re testing for your own deployments, Kenneth recommended testing with your own test footage.

Kenneth used FFmpeg to run all transcodes, testing CPU-only quality with the x264 and x265 codecs using the medium and veryfast presets. He also used FFmpeg for the NVIDIA and NETINT testing, transcoding with each device’s native H.264 and H.265 codecs.

H.264 Average, Low-Frame, and Rolling Frame Quality

The first result Kenneth presented was average H.264 quality. As shown in Figure 3, Kenneth encoded the Meridian file to four output files for each technology, with encodes at 2.2 Mbps, 3.0 Mbps, 3.9 Mbps, and 4.75 Mbps. In this “rate-distortion curve” display, the left axis is VMAF quality, and the bottom axis is bitrate. In all such displays, higher results are better, and Quadra’s blue line is the best alternative at all tested bitrates, beating NVIDIA and x264 using the medium and very fast presets.

Figure 3. Quadra was tops in H.264 quality at all tested bitrates.
Figure 3. Quadra was tops in H.264 quality at all tested bitrates.

Kenneth next shared the low-frame scores (Figure 4), noting that while the NVIDIA L4’s score was marginally higher than the Quadra’s, the difference at the higher end was only 1%. Since no viewer would notice this differential, this indicates operational parity in this measure.

Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.
Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.

The final H.264 quality finding displayed a 20-second rolling average of the VMAF score. As you can see in Figure 5, the Quadra, which is the blue line, is consistently higher than the NVIDIA L4 and x264 using the medium and veryfast presets. So, even though the Quadra had a lower single-frame VMAF score than NVIDIA, over the course of the entire file, its quality was predominantly superior.

Figure 5. 20-second rolling frame quality over file duration.
Figure 5. 20-second rolling frame quality over file duration.

HEVC Average, Low-Frame, and Rolling Frame Quality

Kenneth then related the same results for HEVC. In terms of average quality (Figure 6), NVIDIA was slightly higher than the Quadra, but the delta was insignificant. Specifically, NVIDIA’s advantage starts at 0.2% and drops to 0.04% at the higher bit rates. So, again, a difference that no viewer would notice. Both NVIDIA and Quadra produced better quality than CPU-only transcoding with x265 and the medium and very fast presets.

Figure 6. HEVC quality at all tested bitrates, with NVIDIA and Quadra nearly identical.
Figure 6. HEVC quality at all tested bitrates, with NVIDIA and Quadra nearly identical.

In the low-frame measure (Figure 7), Quadra proved consistently superior, with NVIDIA significantly lower, again a predictor for transient quality issues. In this measure, Quadra also consistently outperformed x265 using medium and very fast presets, which is impressive.

Figure 7. Quadra proved consistently superior in HEVC low-frame testing.
Figure 7. Quadra proved consistently superior in HEVC low-frame testing.

Finally, HEVC moving-average scoring (Figure 8) again showed Quadra to be consistently better across all frames compared to the other alternatives. You see NVIDIA’s downward spike around frame 3796, which could indicate a transient quality drop that might impact the viewer’s quality of experience.

Figure 8. 20-second rolling frame quality over file duration.
Figure 8. 20-second rolling frame quality over file duration.

Cost Per Stream and Power Consumption Per Stream - H.264

To measure cost and power consumption per stream, Kenneth first calculated the cost for a single server for each transcoding technology and then measured throughput and power consumption for that server using each technology. Then, he compared the results, assuming that a video engineer had to source and run systems capable of transcoding 320 1080p30 streams.

You see the first step for H.264 in Figure 9. The baseline computer without add-in cards costs $7,100 but can only output fifteen 1080p30 streams using an average of the medium and veryfast presets, resulting in a cost per stream of $473. Kenneth installed two NVIDIA L4 cards in the same system, which boosted the price to $14,214 but more than tripled throughput to fifty streams, dropping the cost per stream to $285. He then installed ten Quadra T1U VPUs in the system, which increased the price to $21,000 but skyrocketed throughput to 320 1080p30 streams, cutting the cost per stream to $65.

This analysis reveals why computing and focusing on the cost per stream is so important; though the Quadra system costs roughly three times the CPU-only system, the ASIC-fueled output is over 21 times greater, producing a much lower cost per stream. You’ll see how that impacts CAPEX for our 320-stream required output in a few slides.

Figure 9. Computing system cost and cost per stream.
Figure 9. Computing system cost and cost per stream.
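As a quick check, dividing each system cost in Figure 9 by its stream count reproduces the per-stream figures, with small deltas from rounding in the slides:

$7,100 / 15 streams ≈ $473 per stream (CPU-only)
$14,214 / 50 streams ≈ $284 per stream (NVIDIA L4, shown as $285)
$21,000 / 320 streams ≈ $66 per stream (Quadra, shown as $65)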

Figure 10 shows the power consumption per stream computation. Kenneth measured power consumption during processing and divided that by the number of output streams produced. This analysis again illustrates why normalizing power consumption on a per-stream basis is so necessary; though the CPU-only system draws the least power, making it appear to be the most efficient, on a per-stream basis, it’s almost 20x the power draw of the Quadra system.

Figure 10. Computing power per stream for H.264 transcoding.
Figure 10. Computing power per stream for H.264 transcoding.

Figure 11 summarizes CAPEX and OPEX for a 320-channel system. Note that Kenneth rounded down rather than up to compute the total number of servers for CPU-only and NVIDIA. That is, at a capacity of 15 streams for CPU-only transcoding, you would need 21.33 servers to produce 320 streams. Since you can’t buy a fractional server, you would need 22, not the 21 shown. Ditto for NVIDIA and the six servers, which, at 50 output streams each, should have been 6.4, or actually 7. So, the savings shown are underrepresented by about 4.5% for CPU-only and 15% for NVIDIA. Even without the corrections, the CAPEX and OPEX differences are quite substantial.

Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.
Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.

Cost Per Stream and Power Consumption Per Stream - HEVC

Kenneth performed the same analysis for HEVC. All systems cost the same, but throughput of the CPU-only and NVIDIA-equipped systems both drop significantly, boosting their costs per stream. The ASIC-powered Quadra outputs the same stream count for HEVC as for H.264, producing an identical cost per stream.

Figure 12. Computing system cost and cost per stream.
Figure 12. Computing system cost and cost per stream.

The throughput drop for CPU-only and NVIDIA transcoding also boosted the power consumption per stream, while Quadra’s remained the same.

Figure 13. Computing power per stream for HEVC transcoding.
Figure 13. Computing power per stream for HEVC transcoding.

Figure 14 shows the total CAPEX and OPEX for the 320-channel system, and this time, all calculations are correct. While CPU-only systems are tenuous at best for H.264, they’re clearly economically untenable with more advanced codecs like HEVC. While the differential isn’t quite so stark with the NVIDIA products, Quadra’s superior quality and much lower CAPEX and OPEX are compelling reasons to adopt the ASIC-based solution.

Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.
Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.

As Kenneth pointed out in his talk, even if you’re producing only H.264 today, if you’re considering HEVC in the future, it still makes sense to choose a Quadra-equipped system because you can switch over to HEVC at any time with no extra hardware cost. With a CPU-only system, you’ll have to more than double your CAPEX spending, while with NVIDIA, you’ll need to spend another 25% to meet capacity.

The Cost of Redundancy

Kenneth concluded his talk with a discussion of full hardware and geo-redundancy. He envisioned a setup where one location houses two servers (a primary and a backup) for full hardware redundancy. A similar setup would be replicated in a second location for geo-redundancy. Using the Quadra video server, four servers could provide both levels of redundancy, costing a total of $84,000. Obviously, this is much cheaper than any of the other transcoding alternatives.

NETINT’s Quadra VPU proved slightly superior in quality to the alternatives, vastly cheaper than CPU-only transcoding, and very meaningfully more affordable than GPU-based transcoders. While these conclusions may seem unsurprising (an employee at an encoding ASIC manufacturer concludes that his ASIC-based technology is best), you can check Ilya Mikhaelis’ independent analysis here and see that he reached the same result.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

From CPU to GPU to ASIC: Mayflower’s Transcoding Journey

Ilya’s transcoding journey took Mayflower from $10 million to under $1.5 million in CAPEX while cutting power consumption by over 90%. This analytical deep dive reveals the trials, errors, and successes of that quest.

From CPU to GPU to ASIC: The Transcoding Journey

Ilya Mikhaelis

Ilya Mikhaelis is the streaming backend tech lead for Mayflower, which builds and hosts streaming infrastructures for multiple publishers. Mayflower’s infrastructure handles over 10,000 incoming streams and more than one million outgoing streams at a latency that averages one to two seconds.

Ilya’s challenge was to find the most cost-effective technology to transcode the incoming streams. His journey took him from CPU-based transcoding to GPU and then two generations of ASIC-based transcoding. These transitions slashed total production transcoding costs from $10 million to just under $1.5 million while reducing power consumption by over 90%, from 325,000 watts to 33,820 watts.

Ilya’s rigorous, textbook-worthy testing methodology and findings are invaluable to any video engineer seeking the highest quality transcoding technology at the lowest capital cost and most efficient power usage. But let’s start at the beginning.

The Mayflower Internal CDN

As Ilya describes it, “Mayflower is a big company, under which different projects stand. And most of these projects are about high-load, live media streaming. Moreover, some of Mayflower’s resources were included in the top 50 most visited sites worldwide. And all these streaming resources are handled by one internal CDN, which was completely designed and implemented by my team.”

Describing the requirements, Ilya added, “The typical load of this CDN is about 10,000 incoming simultaneous streams and more than one million outgoing simultaneous streams worldwide. In most cases, we target a latency of one to two seconds. We try to achieve a real-time experience for our content consumers, which is why we need a fast and effective transcoding solution.”

To build the CDN, Mayflower used bare metal servers to maximize network and resource utilization and run a high-performance profile to achieve stable stream processing and keep encoder and decoder queues around zero. As shown in Figure 1, the CDN inputs streams via WebRTC and RTMP and delivers with a mix of WebRTC, HLS, and low latency HLS. It uses customized WebRTC inside the CDN to achieve minimum latency between servers.

Figure 1. Mayflower’s Low Latency CDN
Figure 1. Mayflower’s Low Latency CDN.

Ilya’s team minimizes resource wastage by implementing all high-level network protocols, like WebRTC, HLS, and low latency HLS, on their own. They use libav, an FFmpeg component, as a framework for transcoding inside their transcoder servers.

The Transcoding Pipeline

In Mayflower’s transcoding pipeline (Figure 2), the system inputs a single WebRTC stream, which it converts to a five-rung encoding ladder. Mayflower uses a mixture of proprietary and libav filters to achieve a stable frame rate and stable load. The stable frame rate is essential for outgoing streams because some protocols, like low latency HLS or HLS, can’t handle variable frame rates, especially on Apple devices.

Figure 2. Mayflower’s transcoding pipeline.
Figure 2. Mayflower’s transcoding pipeline.
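Mayflower implements this conditioning with proprietary and libav filters, but the basic idea can be sketched with stock FFmpeg, where the fps filter forces a constant output frame rate. A purely illustrative command, not Mayflower’s actual pipeline:

ffmpeg -i input -vf fps=30 -c:v libx264 -b:v 3000k -maxrate 3000k -bufsize 6000k output.mp4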

CPU-Only Transcoding - Too Expensive, Too Much Power

After creating the architecture, Ilya had to find a transcoding technology as quickly as possible. Mayflower initially transcoded on a Dell R940, which currently costs around $20,000 as configured for Mayflower. When Ilya’s team first implemented software transcoding, most content creators input at 720p. After a few months, as they became more familiar with the production operation, most switched to 1080p, dramatically increasing the transcoding load.

You see the numbers in Figure 3. Each server could produce only 20 streams, which at a server cost of $20,000 meant a per stream cost of $1,000. At this capacity, scaling up to handle the 10,000 incoming streams would require 500 servers at a total cost of $10,000,000.

Total power consumption would equal 500 x 650, or 325,000 watts. The Dell R940 is a 3RU server; at an estimated monthly cost of $125 for colocation, this would add $750,000 per year. 

Figure 3. CPU-only transcoding was very costly and consumed excessive power.
Figure 3. CPU-only transcoding was very costly and consumed excessive power.

These numbers caused Ilya to pause and reassess. “After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency. So, we started researching and found some articles on companies like Wowza, Xilinx, Google, Twitch, YouTube, and so on. And the first hint was GPU. And when you think GPU, you think NVIDIA, a company all streaming engineers are aware of.”

“After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency.”

GPUs - Better, But Still Too Expensive

Ilya initially considered three NVIDIA products: the Tesla V100, Tesla P100, and Tesla T4. The first two, he concluded, were best for machine learning, leaving the T4 as the most relevant option. Mayflower could install six T4s into each existing Dell server. At a current cost of around $2,000 for each T4, this produced a total cost of $32,000 per server.

Under capacity testing, the T4-enabled system produced 96 streams, dropping the per-stream cost to $333. This also reduced the required number of servers to 105, and the total CAPEX cost to $3,360,000.

With the T4s installed, power consumption increased to 1,070 watts for a total of 112,350 watts. At $125 per month per server, the 105 servers would cost $157,500 annually to house in a colocation facility.

Figure 4. Capacity and costs for an NVIDIA T4-based solution.
Figure 4. Capacity and costs for an NVIDIA T4-based solution.
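The arithmetic behind Figure 4, using only the numbers quoted above:

$32,000 / 96 streams ≈ $333 per stream
10,000 incoming streams / 96 per server = 104.2, rounded up to 105 servers
105 servers × $32,000 = $3,360,000 CAPEX
105 servers × 1,070 W = 112,350 W
105 servers × $125 per month × 12 = $157,500 per year in colocation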

Round 1 ASICs: The NETINT T432

The NVIDIA numbers were better, but as Ilya commented, “It looked like we found a possible candidate, but we had a strong sense that we needed to further our research. We decided to continue our journey and found some articles about a company named NETINT and their ASIC-based solutions.”

Mayflower first ordered and tested the T432 video transcoder, which contains four NETINT G4 ASICs in a single PCIe card. As detailed by Ilya, “We received the T432 cards, and the results were quite exciting because we produced about 25 streams per card. Power consumption was much lower than NVIDIA, only 27 watts per card, and the cards were cheaper. The whole server produced 150 streams in full HD quality, with a power consumption of 812 watts. For the whole production, we would pay about $2 million, which is much cheaper than the NVIDIA solution.”

You see all this data in Figure 5. The total number of T432-powered servers drops to 67, which reduces total power to 54,404 watts and annual colocation to $100,500.

Figure 5. Capacity and costs for the NETINT T432 solution.
Figure 5. Capacity and costs for the NETINT T432 solution.

While costs and power consumption kept improving, Ilya noticed that the CDN’s internal queue started increasing when processing with T432-equipped systems. Initially, Ilya thought the problem was the lack of onboard scaling on the T432, but then he noticed that “even when producing all these ABR ladders, our CPU load was about only 40% during high load hours. The bottleneck was the card’s decoding and encoding capacity, not onboard scaling.”

Finally, he pinpointed the increase in the internal queue to the fact that the T432’s decoder couldn’t maintain 4K60 fps decode for H.264 input. This was unacceptable because it increased stream latency. Ilya went searching one last time; fortunately, the solution was close at hand.

Round 2 ASICs: The NETINT Quadra T2 - The Transcoding Monster

Ilya next started testing with the NETINT Quadra T2 video processing unit, or VPU, which contains two NETINT G5 chips in a PCIe card. As with the other cards, Ilya could install six in each Dell server.

“All those disadvantages were eliminated in the new NETINT card – Quadra…We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Ilya’s team liked what they found. “All those disadvantages were eliminated in the new NETINT card – Quadra. It has a hardware scaler inside with an optimized pipeline: decoder – scaler – encoder in the same VPU. And H264 4K60 decoding is not a problem for it. We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Figure 6 shows the performance and cost numbers. Equipped with the six T2 VPUs, each server could output 270 streams, reducing the number of required servers from 500 for CPU-only to a mere 38. This dropped the per stream cost to $141, less than half of the NVIDIA T4 equipped system, and cut the total CAPEX down to $1,444,000. Total power consumption dropped to 33,820 watts, and annual colocation costs for the 38 3U servers were $57,000.

Figure 6. Capacity and costs for the NETINT Quadra T2 solution.
Figure 6. Capacity and costs for the NETINT Quadra T2 solution.

Cost and Power Summary

Figure 7 presents a summary of costs and power consumption, and the numbers speak for themselves. In Ilya’s words, “It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Figure 7. Summary of costs and power consumption.
Figure 7. Summary of costs and power consumption.

“It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Ilya also commented on the suitability of the Dell R940 system. “I want to emphasize that the DELL R940 isn’t the best server for VPU and GPU transcoders. It has a small density of PCIe slots and, as a result, a small density of VPUs/GPUs. Moreover, in the case of Quadra and even the T432, you don’t need such powerful CPUs.”

In terms of other servers to consider, Ilya stated, “Nowadays, you may find platforms on the market with even 16 PCIe slots. In such systems, especially if you use Quadra, you don’t need powerful CPUs inside because everything is done on the VPU. But for us, it was a legacy with which we needed to live.”

Video engineers seeking the optimal transcoding solution can take a lot from Ilya’s transcoding journey: a willingness to test a range of potential solutions, a rigorous focus on cost and power consumption per stream, and extreme attention to detail. At NETINT, we’re confident that this approach will lead you to precisely the same conclusion as Ilya, that the Quadra T2 is “the best transcoding solution on the market today.”

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Seamless Client Onboarding – Hardware and Software Synergy – interview with Kenneth Robinson

A crucial aspect of NETINT’s value proposition is its proactive and holistic customer support, from the pre-purchase phase to onboarding and the post-purchase journey. NETINT streamlines this transition with seamless hardware installation facilitated by compliance with U.2 and PCIe standards and intuitive software integration via tools like FFmpeg and GStreamer, and an SDK.

A recent conversation with Kenneth Robinson, NETINT’s Manager of Field Application Engineering, detailed how he and his team support NETINT customers through the buying, onboarding and implementation process and beyond. By way of background, Robinson joined NETINT in January 2023 and brings substantial expertise from his prior tenure at a video gateway development company. During the conversation, he described how his team’s adeptness with scripting and debugging simplifies and accelerates customer deployments.

The discussion also spotlights the efficiency of NETINT’s transcoder management, GStreamer’s increased usage among NETINT customers due to its hyperthreaded efficiency, and several strategic recommendations for potential server buyers. Robinson’s insights solidify NETINT’s reputation as a client-centric enterprise, leveraging both its technological prowess and dedicated human capital.

From Jan Ozer

This interview is with Kenneth Robinson, NETINT’s manager of field application engineering. We discussed how Kenneth and his team help get NETINT customers up and running, including hardware and software installation and the operation of software like GStreamer and FFmpeg.

Seamless Client Onboarding - Hardware and Software Synergy - Kenneth Robinson from NETINT

Jan:
Kenneth, tell us a little bit about yourself. What’s your background, and how long have you been with NETINT?

Kenneth:
I’ve been with NETINT since January of this year (2023). Prior to that, I worked for a company that developed video gateways for big MSOs for installation in hotels and other uses. I ran a team of quality engineers and managed the support team there as well.


Jan:
So, you’re comfortable with video and video-related technologies?

Kenneth:
Oh yes. And familiar with a lot of different ways to deliver video, like streaming and multicast.


Jan:
What’s the typical skillset of your FAE team?

Kenneth:
They are software people. They understand software and debugging, and they write scripts to help customers test or debug different issues. They’re also very good communicators. They work with our customers to make sure that NETINT cards benefit them in the way they are supposed to.


Jan:
What do you see as your role in the company?

Kenneth:
I see it as ensuring that our customers get the support they need in a timely manner and making sure the transition from their current transcoders to NETINT transcoders happens smoothly, quickly, and efficiently. And that any roadblocks are removed in a very timely manner for them.

Supporting New Customer Installations


Jan:
How’s the typical process work? Do you start when customers are evaluating NETINT products, or after they decide to purchase and deploy them?

Kenneth:
Both situations. Often the sales team will include me in a customer call to learn exactly how they want to use our products and to make sure we can deliver what they need. And then the other half is usually after a customer buys one of our products.


Jan:
How does that work? When a customer buys a product, what happens? It gets shipped, and they receive it. How do they get the software and documentation?

Kenneth:

We know they’ve received the product based on the tracking number. Then we’ll reach out to the customer and send links to our documentation portal with the software SDK. This has the installation guide, integration guides, application notes, and everything they need to install and get up and running. And then we’ll usually follow up every couple of weeks or so just to make sure the process is going smoothly.

But, if at any point the customer has a question, they can reach out to us, and we will be happy to help them.

Hardware Installation

Figure 1. NETINT offers products in two form factors, U.2 and PCIe.

Jan:
What’s the hardware installation like?

Kenneth:

So, the hardware is very simple. We have two form factors. We have the PCIe form factor, which is just like any network card or GPU that you just install. And then there’s the U.2 form factor, which is the same as a hard drive. So, there’s nothing special required or special tools or knowledge; if you’ve worked on a computer before, you should be able to install either form factor.
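For readers installing cards themselves, a quick post-install check on Linux might look like the following sketch; the exact vendor string in lspci output may differ, and U.2 devices enumerate as NVMe block devices:

lspci | grep -i netint      # AIC/PCIe cards appear on the PCIe bus
ls /dev/nvme*               # U.2 transcoders enumerate like NVMe drives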

 


Jan:
In the nine months you’ve been here, what types of incompatibilities have you seen with the servers in the field?

Kenneth:

We haven’t seen any incompatibilities. Our products have worked on every server that we’ve tried because we follow the different standards for the U.2 and PCIe form factors.

Software Installation and Operation

Figure 2 - The Quadra Server - software architecture for controlling the Quadra Server
Figure 2. You can control all transcoders with FFmpeg, GStreamer, or the API (libxcoder).

Jan:
So, the hardware installation is straightforward. What’s the software installation like?

Kenneth:

The software is relatively easy. We work with FFmpeg and GStreamer, but our software code is not pushed into the repository. So, part of our SDK is a patch that you apply and then compile FFmpeg or GStreamer, though we have installation scripts that will automate that process for you. If you just want to run a quick test, the installation scripts are very good and will get you up and running in a matter of minutes.

We also have an API, so the customer can access the cards directly and not rely on FFmpeg or GStreamer.
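As a rough outline of the patch-and-build flow Kenneth describes, the steps might look like this; the archive, directory, and patch names are hypothetical placeholders, since the real ones ship with the NETINT SDK:

tar xzf netint_sdk.tar.gz && cd netint_sdk        # hypothetical archive name
patch -p1 -d ffmpeg-src < ffmpeg_netint.patch     # hypothetical patch name
cd ffmpeg-src && ./configure && make -j"$(nproc)" && sudo make install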


Jan:
If you install multiple cards, how does the software distribute jobs among those cards?

Kenneth:

There are two ways. You can specify the exact card you want to use as the encoder or decoder. Or, you can allow a resource manager to manage that, and it will send each job to whichever decoder or encoder has the capacity.

FFmpeg, Gstreamer, or API?


Jan:
In terms of software control, what’s the typical customer doing? We’ve got GStreamer, FFmpeg, and the API. What percentage are using each alternative?

Kenneth:

The majority is FFmpeg and, after that, the API. Then there’s a small number that use GStreamer, although GStreamer is slowly getting more popular.


Jan:
Why is that?

Kenneth:

We found that when FFmpeg scales multiple outputs simultaneously, like when creating an encoding ladder, it would sometimes bottleneck. While the capacity was good, it wasn’t great. When we tried GStreamer, the capacity increased significantly enough that it made sense to use GStreamer for that workflow.

Server vs. Individual Cards

Figure 3. NETINT offers two servers populated with ten Quadras or T408s.

Jan:
Let’s switch gears a bit. What’s your experience with the servers? When would you advise someone to buy a server fully loaded with Quadras or T408s versus buying the cards and installing them themselves?

Kenneth:

If you need a custom architecture, like adding GPUs for cloud gaming, you should buy the cards and install them yourself. If you intend to perform high-volume file-based transcoding or live streaming, you should consider either server.


Jan:
So, if you’ve got a set application and you just want to get a device in and start working, the servers are a good option. If you’re going to customize your servers, buy the cards.

Kenneth:

Yes, that’s correct.


Jan:
That’s all I have. Thanks for taking the time today.

Kenneth:

Thanks for having me.

Watch on-demand: Symposium on Building Your Live Streaming Cloud

Cloud services are an effective way to begin live streaming, but once you reach a particular scale, you may realize that you’re paying too much and can save significant OPEX by deploying your own transcoding infrastructure. The question is, how to get started? 

The Build Your Own Live Streaming Cloud symposium was a huge hit, with many insights from industry insiders on how to build a live streaming cloud. Here are replays of the event. (For the best viewing experience, please watch from your desktop.)

From Cloud to Local Transcoding For Minimum Latency and Maximum Quality

From Cloud to Local Transcoding

Over the last ten years or so, most live productions have migrated towards a workflow that sends a contribution stream from the venue into the cloud for transcoding and delivery. For live events that need absolute minimum latency and maximum quality, it may be time to rethink that workflow, particularly if you’ve got multiple sharable inputs at the venue.

So says Bart Snoeks, Account & Partnership Director of THEO Technologies (“THEO”). By way of background, THEO invented and has commercially implemented the High-Efficiency Streaming Protocol (HESP), an adaptive HTTP-based video streaming protocol that enables sub-second end-to-end latency. You see how HESP compares to other low-latency protocols in the table shown in Figure 1, from the HESP Alliance website, the organization focused on promoting and further advancing HESP.

Figure 1. HESP compared to other low latency protocols.

THEO has productized HESP as a real-time streaming service called THEOlive, which targets applications like live sports and betting, casino igaming, live auctions, and other events that require high-quality video at exceptionally low latency with delivery at scale. For example, in the case of in-play betting, cutting latency from 8 to 10 seconds (HLS) to under one second expands the betting window during the critical period just before the event.

When streaming casino games, ultra-low latency promotes fluent interactions between the players and ensures that all players see the turn of the cards in real time. When latency is lower, players can bet more quickly, increasing the number of hands that can be played.

According to Snoeks, a live streaming workflow that sends a contribution stream to the cloud for transcoding will always increase latency and can degrade quality because re-transcoding is needed. It’s especially poorly suited for stadium venues with multiple camera locations that want to enhance the attendee experience with multiple live feeds. In those latency-critical use cases, you are adding network latency with a round trip to and from the cloud. Instead, it makes much more sense to create your encoding ladder and package on-site, pulling streams directly from the origin to a private CDN for delivery.

Let’s take a step back and examine these two workflows.

Live Streaming Workflows

As stated at the top, most live-streaming productions encode a single contribution stream on-site and send that into the cloud for transcoding to a full ladder, packaging, and delivery. You see this workflow in Figure 2.

Figure 2. Encoding a contribution stream on-site and sending it to the cloud for transcoding, packaging, and delivery.

This schema has multiple advantages. First, you’re sending a single stream to the cloud, lowering bandwidth requirements. Second, you’re centralizing your transcoding assets in a single location in the cloud, which typically enables better utilization.

According to Snoeks, however, this workflow will add 200 to 500 milliseconds of latency at a minimum, depending on the encoding speed, quality, and contribution protocol. In addition, though high-quality contribution encoders can minimize generational loss from the contribution stream, lower-quality transcoders can noticeably degrade the quality of the final output. You also need a contribution encoder for each camera, which can jack up hardware costs in high-volume igaming and similar applications.

Instead, for some specific use cases, you should consider the workflow shown in Figure 3. Here, you transcode on-site and send the full encoding ladder to a public CDN for external delivery and to a private CDN or equivalent for local viewing. This decreases latency to a minimum and produces absolute top quality as you avoid the additional transcoding step.

From Cloud to Local Transcoding - Figure 3
Figure 3. Encoding and packaging the encoding ladder on site and transmitting the streams to a public CDN for external viewers and a private CDN for local viewers.

This schema is particularly useful for venues that want to enhance the in-stadium experience with multiple camera feeds. Imagine a stock car race where an attendee only sees his driver on the track once every minute or so. Encoding on-site might allow attendees to watch the camera view from inside their favorite driver’s car with near real-time latency. It might let golf fans follow multiple groups while parked at a hole or following their favorite player.

If you’re encoding input from many cameras, say in a casino or even a racetrack environment, the cost of on-site encoding might be less than the cost of the individual contribution encoders. So, you get the best of all worlds: lower cost per stream, lower latency, higher quality, and a better in-person experience where applicable.

If you’re interested in learning about your transcoding options, check out our symposium Building Your Own Live Streaming Cloud, where you can hear from multiple technology experts discussing transcoding options like CPU-only, GPU, and ASIC-based transcoding and their respective costs, throughput, and density.

If you’re interested in learning more about HESP, THEO in general, or THEOlive, watch for an upcoming episode of Voices of Video, where I interview Pieter-Jan Speelman, CTO of THEO Technologies. We’ll discuss HESP’s history and evolution, the power of THEOlive real-time streaming technology, and how to use it in your live production stack. Make sure you don’t miss it!

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Get Free CAE on NETINT VPUs with Capped CRF

Capped CRF

NETINT recently added capped CRF to the rate control mechanism across our Video Processing Unit (VPU) product lines. With the wide adoption of content-adaptive encoding techniques (CAE), constant rate factor (CRF) encoding with a bit rate cap gained popularity as a lightweight form of CAE to reduce the bitrate of easy-to-encode sequences, saving delivery bandwidth with constant video quality. It’s a mode that we expect many of our customers to use, and this document will explain what it is, how it works, and how to get the most use from the feature.

In addition to working with H.264, HEVC, and AV1 on the Quadra VPU line, capped CRF works with H.264 and HEVC on the T408 and T432 video transcoders. This document details how to encode with capped CRF using the H.264 and HEVC codecs on Quadra VPUs, though most application scenarios apply to all codecs across the NETINT VPU lines.

What is Capped CRF and How Does it Work?

Capped CRF is a bitrate control technique that combines constant rate factor (CRF) encoding with a bit rate cap. Multiple codecs and software encoders support it, including x264 and x265 within FFmpeg. In contrast to CBR and VBR encoding, which encode to a specified target bitrate (and ignore output quality), CRF encodes to a specified quality level and ignores the bitrate.

CRF values range from 0-51, with lower numbers delivering higher quality at higher bitrates (less savings) and higher values delivering lower quality at lower bitrates (more savings). Many encoding engineers use values between 21 and 23. Which is right for you? As you will read below, the balance you want between quality and bitrate savings determines the best value for your use case.

For example, with the x264 codec, if you transcode to CRF 23, the encoder typically outputs a file with a VMAF quality of 93-95. If that file is a 4K60 soccer match, the bitrate might be 30 Mbps. If it’s a 1080p talking head, it might be 1.2 Mbps. Because CRF delivers a known quality level, it’s ideal for creating archival copies of videos. However, since there’s no bitrate control, in most instances, CRF alone is unusable for streaming delivery.

When you combine CRF with a bitrate cap, you get the best of both worlds: a bitrate reduction with consistent quality for easy-to-encode clips, and quality and bitrate similar to CBR for more complex clips.
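For context, stock FFmpeg expresses capped CRF with x264 using the standard CRF and VBV flags, for example:

ffmpeg -i input.mp4 -c:v libx264 -crf 23 -maxrate 6M -bufsize 6M output.mp4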

Here’s how capped CRF could be used with the Quadra VPU:

ffmpeg -i input.mp4 -c:v h264_ni_quadra_enc -xcoder-params "crf=23:vbvBufferSize=1000:bitrate=6000000" output.mp4

The relevant elements are:

  • crf=23 – sets the quality target at around 95 VMAF

  • vbvBufferSize=1000 – sets the VBV buffer to one second (1000 ms)

  • bitrate=6000000 – caps the bitrate at 6 Mbps.

These commands would produce a file that targets close to 95 VMAF quality but, in all cases, peaks at around 6 Mbps.

For a simple-to-encode talking head clip, Quadra produced a file with an average bitrate of 1,274 kbps and a VMAF score of 95.14. Figure 1 shows this output in a program called Bitrate Viewer. Since the entire file is under the 6 Mbps cap, the CRF value controls the bitrate throughout.

Encoding this clip with Quadra using CBR at 6 Mbps produced a file with a bitrate of 5.4 Mbps and a VMAF score of 97.50. Multiple studies have found that VMAF scores above 95 are not perceptible by viewers, so the extra 2.36 VMAF points don’t improve the viewer’s quality of experience (QoE). In this case, capped CRF reduces your bandwidth cost by 76% without impacting QoE.

Figure 1. Capped CRF encoding a simple-to-encode video in Bitrate Viewer.

You see this in Figure 2, which shows the capped CRF frame with a VMAF score of 94.73 on the left and the CBR frame with a VMAF score of 97.2 on the right. The video on the right has a bitrate more than 4 Mbps higher than the video on the left, but the viewer wouldn’t notice the difference.

Figure 2. Frames from the talking head clip. Capped CRF at 1.23 Mbps on the left,
CBR at 5.4 Mbps on the right. No viewer would notice the difference.

Figure 3 shows capped CRF operation with a hard-to-encode American football clip. The average bitrate is 5900 kbps, and the VMAF score is 94.5. You see that the bitrate for most of the file is pushing against the 6 Mbps cap, which means that the cap is the controlling element. In the two regions where there are slight dips, the CRF setting controls the quality.

Figure 3. Capped CRF encoding a hard-to-encode video in Bitrate Viewer.

In contrast, the CBR encode of the football clip produced a bitrate of 6,013 kbps and a VMAF score of 94.73. Netflix has stated that most viewers won’t notice a VMAF differential under 6 points, so a viewer would not perceive the 0.25 VMAF delta between the CBR and capped CRF files. In this case, capped CRF reduced delivery bandwidth by about 2% without impacting QoE.

Of course, as shown in Figure 3, the two-minute segment tested was almost all high motion. The typical sports broadcast contains many lower-motion sequences, including commercials, cuts to the broadcasters, and timeouts and penalty calls. In most cases, you would expect many more dips like those shown in Figure 3 and more substantial savings.

So, the benefits of capped CRF are as follows:

  • You can use a single ladder for all your content, automatically saving bitrate on easy-to-encode clips and delivering the equivalent QoE on hard-to-encode clips.
  • Even if you modify your ladder by type of content, you should save bandwidth on easy-to-encode regions within all broadcasts without impacting QoE.
  • You get the benefits of CAE without added integration complexity or extra technology licensing costs; capped CRF is free across all NETINT VPU and video transcoder products.

Producing Capped CRF

Using the NETINT Quadra VPU series, the following commands for H.264 capped CRF will optimize video quality and deliver a file or stream with a fully compliant VBV buffer. As noted previously, this command string with the appropriate modifications to codec value will work across the entire NETINT product line. For example, to output HEVC, change -c:v h264_ni_quadra_enc to -c:v h265_ni_quadra_enc.

Here’s the command string.

ffmpeg -y -i input.mp4 -c:v h264_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=0:crf=23:intraPeriod=120:lookAheadDepth=10:cuLevelRCEnable=1:vbvBufferSize=1000:bitrate=6000000:tolCtbRcInter=0:tolCtbRcIntra=0:zeroCopyMode=0" output.mp4

Here’s a brief explanation of the encoding-related switches.

  • -c:v h264_ni_quadra_enc -xcoder-params – Selects Quadra’s H.264 encoder and passes the codec parameters listed below.

  • gopPresetIdx=5 – this chooses the Group of Pictures (GOP) pattern, or the mixture of B-frames and P-frames within each GOP. You should be able to adjust this without impacting capped CRF performance.

  • RcEnable=0 – this disables rate control. You must use this setting to enable capped CRF.

  • crf=23 – this chooses the CRF value. You must include a CRF value within your command string to enable capped CRF.

  • intraPeriod=120 – This sets the GOP size to four seconds, which we used for all tests. You can adjust this setting to your normal target without impacting CRF operation.

  • lookAheadDepth=10 – This sets the lookahead to 10 frames. You can adjust this setting to your normal target without impacting CRF operation.

  • cuLevelRCEnable=1 – this enables coding unit-level rate control. Do not adjust this setting without verifying output quality and VBV compliance.

  • vbvBufferSize=1000 – This sets the VBV buffer size. You must set this to trigger capped CRF operation.

  • bitrate=6000000 – This sets the bitrate. You must set this to trigger capped CRF operation. You can adjust this setting to your target without impacting CRF operation.

  • tolCtbRcInter=0 – This defines the tolerance of CU-level rate control for P-frames and B-frames. Do not adjust this setting without verifying output quality and VBV compliance.

  • tolCtbRcIntra=0 – This sets the tolerance of CU level rate control for I-frames. Do not adjust this setting without verifying output quality and VBV compliance.

  • zeroCopyMode=0 – this enables or disables the libxcoder zero copy feature. Do not adjust this setting without verifying output quality and VBV compliance.

You can access additional information about these controls in the Quadra Integration and Programming Guide.

Choosing the CRF Value and Bitrate Cap – H.264

Deploying capped CRF involves two significant decisions, choosing the CRF value and setting the bitrate cap. Choosing the CRF value is the most critical decision, so let’s begin there.

Table 1 shows the bitrate and VMAF quality of ten files encoded with the H.264 codec using the CRF values shown with a 6 Mbps cap and using CBR encoding with a 6 Mbps cap. The table presents the easy-to-encode files on top, showing clip-specific results and the average value for the category. The Delta from CBR shows the bitrate and VMAF differential from the CBR score. Then the table does the same for hard-to-encode clips, showing clip-specific results and the average value for the category. The bottom two rows present the overall average bitrate and VMAF values and the overall savings and quality differential from CBR.

Capped CRF - Table 1. CBR and capped CRF bitrates and VMAF scores for H.264 encoded clips.
Table 1. CBR and capped CRF bitrates and VMAF scores for H.264 encoded clips.

As mentioned, with CRF, lower values produce higher quality. In the table, CRF 19 produces the highest quality (and lowest bitrate savings), and CRF 27 delivers the lowest quality (and highest bitrate savings). What’s the right CRF value? The one that delivers the target VMAF score for your typical clips for your target audience.

For the test clips shown, CRF 19 produces an average quality of well over 95; as mentioned above, VMAF scores beyond 95 aren’t perceivable by the average viewer, so the extra bandwidth needed to deliver these files is wasted. Premium services should choose CRF values between 21 and 23 to achieve top-rung quality of around 95 VMAF. These deliver more significant bandwidth savings than CRF 19 while preserving the desired quality level. In contrast, commodity services should experiment with higher values like 25-27 to deliver slightly lower VMAF scores while achieving more significant bandwidth savings.

What bitrate cap should you select? CRF sets quality, while the bitrate cap sets the budget. In most cases, you should consider using your existing cap. As we’ve seen, with easy-to-encode clips, capped CRF should deliver about the same quality of experience with the potential for bitrate savings. For hard-to-encode clips, capped CRF should deliver the same QoE with the potential for some bitrate savings on easy-to-encode sections of your broadcast.

Note that identifying the optimal CRF value will vary according to the complexity of your video files, as well as frame rate, resolution, and bitrate cap. If you plan to implement capped CRF with Quadra or any encoder, you should run similar tests on your standard test clips using your encoding parameters and draw your own conclusions.

Now let’s examine capped CRF and HEVC.

Choosing the CRF Value and Bitrate Cap – HEVC

Table 2 shows the results of HEVC encodes using CBR at 4.5 Mbps and the specified CRF values with a cap of 4.5 Mbps. With these test clips and encoding parameters, Quadra’s CRF values produce nearly the same result, with CRF values of 21-23 appropriate for premium services and 25-27 a good choice for UGC content.

Capped CRF - Table 2. CBR and capped CRF bitrates and VMAF scores for HEVC encoded clips.
Table 2. CBR and capped CRF bitrates and VMAF scores for HEVC encoded clips.

Again, the cap is yours to set; we arbitrarily reduced the H.264 bitrate cap of 6 Mbps by 25% to determine the 4.5 Mbps cap for HEVC.
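Putting the pieces together, the HEVC version of the command from the Producing Capped CRF section, with the codec swap described there and the 4.5 Mbps cap used here, looks like this:

ffmpeg -y -i input.mp4 -c:v h265_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=0:crf=23:intraPeriod=120:lookAheadDepth=10:cuLevelRCEnable=1:vbvBufferSize=1000:bitrate=4500000:tolCtbRcInter=0:tolCtbRcIntra=0:zeroCopyMode=0" output.mp4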

Capped CRF Performance

Note that as currently tested, capped CRF comes with a modest performance hit, as shown in Table 3. Specifically, in CBR mode, Quadra output twenty 1080p30 H.264-encoded streams. This dropped to sixteen using capped CRF, a reduction of 20%.

For HEVC, throughput dropped from twenty-three to eighteen 1080p30 streams, a reduction of about 22%. We performed all tests using CRF 21, with a 6 Mbps cap for H.264 and 4.5 Mbps for HEVC. Note that these are early days in the CRF implementation, and it may be that this performance delta is reduced or even eliminated over time.

Capped CRF - Table 3. 1080p30 outputs produced using the techniques shown.
Table 3. 1080p30 outputs produced using the techniques shown.

We installed the Quadra in a workstation powered by a 3.6 GHz AMD Ryzen 5 5600X 6-Core processor running Ubuntu 18.04.6 LTS with 16 GB of RAM. As you can see in the table, we also tested output for the x264 codec in FFmpeg using the medium and veryfast presets, producing two and five 1080p30 outputs, respectively. For x265, we tested using the medium and ultrafast presets; the workstation produced one and three 1080p30 streams, respectively.

Even at the reduced throughput, Quadra’s CRF output dwarfs the CPU-only output. When you consider that the NETINT Quadra Video Server packs ten Quadra VPUs into a single 1RU form factor, you get a sense of how VPUs offer unparalleled density and the industry’s lowest cost per stream and power consumption per stream.

Bandwidth is one of the most significant costs for all live-streaming productions. In many applications, capped CRF with the NETINT Quadra delivers a real opportunity to reduce bandwidth cost with no perceived impact on viewer quality of experience.