Beyond Traditional Transcoding: NETINT’s Pioneering Technology for Today’s Streaming Needs

Welcome to our here’s-what’s-new-since-last-IBC-so-you-should-schedule-a-meeting-with-us blog post. I know you’ve got many of these to wade through, so I’ll be brief.

First, a brief introduction. We’re NETINT, the ASIC-based transcoding company. We sell standalone products like our T408 video transcoder and Quadra VPUs ( for video transcoding units) and servers with ten of either device installed. All offer exceptional throughput at an industry-low cost per stream and power consumption per stream. Our products are denser, leaner, and greener than any competitive technology.
They’re also more innovative. The first-generation T408 was the first ASIC-based hardware transcoder available for at least a decade, and the second-generation Quadra was the first hardware transcoder with AV1 and AI processing. Our Quadra shipped before Google and Meta shipped their first generation ASIC-based transcoders and they still don’t support AV1.
That’s us; here’s what’s new.

Capped CRF Encoding

We’ve added capped CRF encoding to our Quadra products for H.264, HEVC, and AV1, with capped CRF coming for the T408 and T432 (H.264/HEVC). By way of background, with the wide adoption of content-adaptive encoding techniques (CAE), constant rate factor (CRF) encoding with a bit rate cap gained popularity as a lightweight form of CAE to reduce the bitrate of easy-to-encode sequences, saving delivery bandwidth and delivering CBR-like quality on hard-to-encode sequences. Capped CRF encoding is a mode that we expect many of our customers to use.

Figure 1 shows capped CRF operation on a theoretical football clip. The relevant switches in the command string would look something like this:

-crf 21  -maxrate 6MB

This directs FFmpeg to deliver at least the quality of CRF 21, which for H.264 typically equals around a 95 VMAF score. However, the maxrate switch ensures that the bitrate never exceeds 6 Mbps.

As shown in the figure, in operation, the Quadra VPU transcodes the easy-to-encode sideline shots at CRF 21 quality, producing a bitrate of around 2 Mbps. Then, during actual high-motion game footage, the 6MB cap would control, and the VPU would deliver the same quality as CBR. In this fashion, capped CRF saves bandwidth with easy-to-encode scenes while delivering equivalent to CBR quality with hard-to-encode scenes.

Figure 1. Capped CRF in operation. Relatively low-motion sideline shots are encoded to CRF 21 quality (~95 VMAF), while the 6 Mbps bitrate cap controls during high-motion game footage. Transcoding.
Figure 1. Capped CRF in operation. Relatively low-motion sideline shots are encoded to CRF 21 quality (~95 VMAF), while the 6 Mbps bitrate cap controls during high-motion game footage.

By deploying capped CRF, engineers can efficiently deliver high-quality video streams, enhance viewer experiences, and reduce operational expenses. As the demand for video streaming continues to grow, Capped CRF emerges as a game-changer for engineers striving to stay at the forefront of video delivery optimization.

You can read more about capped CRF operation and performance in Get Free CAE on NETINT VPUs with Capped CRF.

Peer-to-Peer Direct Memory Access (DMA) for Cloud Gaming

Peer-to-peer DMA is a feature that makes the NETINT Quadra VPU ideal for cloud gaming. By way of background, in a cloud-gaming workflow, the GPU is primarily used to render frames from the game engine output. Once rendered, these frames are encoded with codecs like H.264 and HEVC.

Many GPUs can render frames and transcode to these codecs, so it might seem most efficient to perform both operations on the same GPU. However, encoding demands a significant chunk of the GPU’s resources, which in turn reduces overall system throughput. It’s not the rendering engine that’s stretched to its limits but the encoder.

What happens when you introduce a dedicated video transcoder into the system using normal techniques? The host CPU manages the frame transfer between the GPU and the transcoder, which can create a bottleneck and slow system performance.

Figure 2. Peer-to-peer DMA enables up to 200 720p60 game streams from a single 2RU server. Transcoding.
Figure 2. Peer-to-peer DMA enables up to 200 720p60 game streams from a single 2RU server.

In contrast, peer-to-peer DMA allows the GPU to send frames directly to the transcoder, eliminating CPU involvement in data transfers (Figure 2). With peer-to-peer DMA enabled, the Quadra supports latencies as low as 8ms, even under heavy loads. It also unburdens the CPU from managing inter-device data transfers, freeing it to handle other essential tasks like game logic and physics calculations. This optimization enhances the overall system performance, ensuring a seamless gaming experience.

Some NETINT customers are using Quadra and peer-to-peer DMA to produce 200 720p60 game streams from a single 2RU server, and that number will increase to 400 before year-end. If you’re currently assembling an infrastructure for cloud gaming, come see us at IBC.

Logan Video Server

NETINT started selling standalone PCIe and U.2 transcoding devices, which our customers installed into servers. In late 2022, customers started requesting a prepackaged solution comprised of a server with ten transcoders installed. The Logan Video Server is our first response.

Logan refers to NETINT’s first-generation G4 ASIC, which transcodes to H.264 and HEVC. The Logan Video Server, which launched in the first quarter of 2023, includes a SuperMicro server with a 32-core AMD CPU running Ubuntu 20.04 LTS and ten NETINT T408 U.2 transcoder cards (which cost $300 each) for $8,900. There’s also a 64-core option available for $11,500 and an 8-core option for $7,000.

The value proposition is simple. You get a break on price because of volume commitments and don’t have to install the individual cards, which is generally simple but still can take an hour or two. And the performance with ten installed cards is stunning, given the price tag.

You can read about the performance of the 32-core server model in my review here, which also discusses the software architecture and operation. We’ll share one table, which shows one-to-one transcoding of 4K, 1080p, and 720p inputs with FFmpeg and GStreamer.

At the $8,900 cost, the server delivers a cost per stream as low as $445 for 4K, $111.25 for 1080p, and just over $50 for 720p at normal and low latency. Since each T408 only draws 7 watts and CPU utilization is so low, power consumption is also exceptionally low.

Meet NETINT at IBC - Transcoding - Table-1
Table 1. One-to-one transcoding performance for 4K, 1080p, and 720p.

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower-priced model available for pure encoding operations and a more powerful model for CPU-intensive operations, the NETINT Logan server family meets a broad range of requirements.

Quadra Video Server

Once the Logan Video Server became available, customers started asking about a similarly configured server for NETINT’s Quadra line of video transcoding units (VPUs), which adds AV1 output, onboard scaling and overlay, and two AI processing engines. So, we created the Quadra Video Server.

This model uses the same Supermicro chassis as the Logan Video Server and the same Ubuntu operating system but comes with ten Quadra T1U U.2 form factor VPUs, which retail for $1,500 each. Each T1U offers roughly four times the throughput of the T408, performs on-board scaling and overlay, and can output AV1 in addition to H.264 and HEVC.

The CPU options are the same as the Logan server, with the 8-core unit costing $19,000, the 32-core unit costing $21,000, and the 64-core model costing $24,000. That’s 4X the throughput at just over 2x the price.

You can read my review of the 32-core Quadra Video Server here. I’ll again share one table, this time reporting encoding ladder performance at 1080p for H.264 (120 ladders), HEVC (140), and AV1 (120), and 4K for HEVC (40) and AV1 (30).

In comparison, running FFmpeg using only the CPU, the 32-core system only produced nineteen H.264 1080p ladders, five HEVC 1080p ladders, and six AV1 1080p ladders. Given this low-volume throughput at 1080p, we didn’t bother trying to duplicate the 4K results with CPU-only transcoding.

Figure 2. Encoding ladder performance of the Quadra Video Server.
Table 2. Encoding ladder performance of the Quadra Video Server.

Beyond sheer transcoding performance, the review also details AI-based operations and performance for tasks like region of interest transcoding, which can preserve facial quality in security and other relatively low-quality videos, and background removal for conferencing applications.

Where the Logan Video Server is your best low-cost option for high volume H.264 and HEVC transcoding, the Quadra Video Server quadruples these outputs, adds AV1 and onboard scaling and overlay, and makes AI processing available.

Come See Us at the Show

We now return to our normally scheduled IBC pitch. We’ll be in Stand 5.A86 and you can book a meeting by clicking here.

Figure 3. Book a meeting.
.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

NETINT Buyer’s Guide. Choosing the Right VPU & Server for Your Workflow.

This guide is designed to help you choose the optimum NETINT Video Processing Unit (VPU) for your encoding workflow.

As an overview, note that all NETINT hardware products (VPUs and transcoders) run the same basic software controlled via FFmpeg and GStreamer patches or an SDK. This includes load balancing of all encoding resources in a server. In addition, both generations are similar in terms of latency and HDR support.

Question 1. Which ASIC Architecture: Codensity G4 (Logan) or Codensity G5 (Quadra)?

Tables 1 and 2 show the similarities and differences between Codensity G4 ASIC-powered products (T408 and T432) and Codensity G5-based products (Quadra T1U, T1A, T2A). Both architectures are available in either the U.2 or AIC form factor, the latter all half-height half-length (HHHL) configurations.

From a codec perspective, the main difference is that G5-based products support AV1 encoding and VP9 decoding. In terms of throughput, G5-based products offer four times the throughput but cost roughly three times more than G4, making the cost per output stream similar but with greater stream densities per host server. G5 power consumption is roughly 3x higher per ASIC than G4, but the throughput is 4x, making power consumption per stream lower.

Choosing the Right VPU & Server - Table 1. Codec support, throughput, and power consumption.
Table 1. Codec support, throughput, and power consumption.

Table 2 covers other hardware features. From an encoding perspective, G5-based products enable tuning of quality and throughput to match your applications, while quality and throughput are fixed for G4-based products. The G5’s quality ceiling is higher than the G4, at the cost of throughput, and the quality floor is lower, with an option for higher throughput.

G5-based products are much more capable hardware-wise, performing scaling, overlay, and audio compression and offer AI processing of 15 TOPS for T1U and 18 TOPS for T1A (36 TOPS T2A). In contrast, G4-based products scale, overlay, and encode audio via the host CPU and offer no AI processing. You can read about Quadra’s AI capability here.

Peer-to-peer DMA is a feature that allows G5-based products to communicate directly with some specific GPUs, which is particularly useful in cloud gaming. This is only available on G5-based products. Learn about peer-to-peer DMA here.

Note that G4 and G5-based devices can co-exist on the same server, so you can add G5 devices to a server with G4 devices already installed and vice versa.

Choosing the Right VPU & Server - Table 2 Advanced hardware functionality.
Table 2. Advanced hardware functionality.

Observations:

  • Codensity G4 and G5-based VPUs offer similar cost-per-stream, with Quadra slightly more efficient on a watts-per-stream basis. Both products transcode to H.264 and HEVC formats (G5 encodes to AV1 and decodes VP9).

  • Choose G4-based products for:
    • The absolute lowest overall cost
    • Compatibility with existing G4-based encoding stacks
    • Interactive same resolution-in/out productions (minimum scaling and overlay)

  • Choose G5-based products for:
    • AV1 output
    • AI integration
    • Applications that need quality and throughput tuning
    • Applications that involve scaling and overlay
    • Maximum throughput from a single server
    • Cloud gaming

Question 2: Which G4-based Product?

This section discusses your G4-based options shown in Figure 1, with the U.2-based T408 in the background and AIC-form factor T432 in the foreground. These products are designated as Transcoders since this is their primary hardware-based function.

Choosing the Right VPU & Server - Figure 1. The NETINT T408 in the back, T432 in the front.
Figure 1. The NETINT T408 in the back, T432 in the front.

Table 3 identifies the key differences between NETINT’s two G4-based VPUs, the T408, which includes a single G4 ASIC in a U.2 form factor, and the T432, which includes four G4 ASICS in an AIC half-height half-length configuration.

Choosing the Right VPU & Server - Table 3. NETINT’s two G4-based products.
Table 3. NETINT’s two G4-based products.

Observations:

  • The U.2-based T408 offers the best available density for installing units into a 1RU server.
  • The AIC-based T432 is the best option for computers without U.2 connections and for maximum server chassis density.

Question 3: Which G5-based Product?

Figure 2 identifies the three Quadra G5-based products, with the U.2-based T1U in the back, the AIC-based T1A in the middle, and the AIC-based T2A in the front. These products are designated Video Processing Units, or VPUs, because their hardware functionality extends far beyond simple transcoding.

Choosing the Right VPU & Server - Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.
Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.

Table 3 identifies the key differences between NETINT’s three G5-based VPUs:

  • The T1U includes a single G5 ASIC in a U.2 form factor.
  • The T1A includes a single G5 ASIC in an AIC half-height half-length configuration.
  • The T2A includes two G5 ASICs in an AIC half-height half-length configuration.
Choosing the Right VPU & Server - Table 4. NETINT’s two G4-based products.
Table 4. NETINT’s two G4-based products.

Observations:

  • The U.2-based Quadra T1U offers the best density for installing in a 1RU server.
  • The Quadra T2A offers the best density for AIC-based installation and is ideal for cloud gaming servers that need peer-to-peer DMA communication with GPUs.
  • The AIC-based Quadra T1A is the most affordable AIC option for installs that don’t need maximum density.

Question 4: VPU or Server?

NETINT offers two video servers that use the same Supermicro 1114S-WN10RT server chassis; the Logan Video Server contains ten T408 U.2 VPUs, while the Quadra Video Server contains ten Quadra T1U VPUs. Servers offer a turnkey option for fast and simple deployment.

An advantage of buying a NETINT Video Server is all components, including CPU, RAM, hard drive, OS, and software versions, have been extensively tested for compatibility, stability, and performance, making them the easiest and fastest way to transition from software to hardware encoding.

As for the choice between servers, your answer to question 1 should guide your selection.

If you have any questions about any products, please contact us here.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud