Unveiling the Quadra Server: The Epitome of Power and Scalability

The Quadra Server review by Jan Ozer from NETINT Technologies

Streaming engineers face constant pressure to produce more streams at a lower cost per stream and reduced power consumption. However, those considering new transcoding technologies need a solution that integrates with their existing workflows while delivering the quality and flexibility of software with the cost efficiency of ASIC-based hardware.

If this sounds like you, the US $21,000 NETINT Quadra Video Server could be the ideal solution. Combining the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered server with ten NETINT Quadra T1U Video Processing Units (VPUs), it is a powerhouse. The Quadra server outputs H.264, HEVC, and AV1 streams at normal or low latency, and you can control operation via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional FFmpeg-based software or GPU-based encoding stack.

As you’ll see below, the 1RU form factor server can output up to 20 8Kp30 streams, 80 4Kp30 streams, up to 320 1080p30 streams, and 640 720p30 streams for live and interactive video streaming applications. For ABR production, the server can output over 120 encoding ladders in H.264, HEVC, and AV1 formats. This unparalleled density enables all video engineers to greatly expand capacity while shrinking the number of required servers and the associated power bills.
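
To put these density figures in per-device terms, the short sketch below divides the quoted server totals by the ten VPUs and by the US $21,000 list price mentioned above. The variable and function names are illustrative, and the cost figure covers hardware list price only (it ignores power, hosting, and software costs).

```python
# Server-level maximums quoted in the review (30 fps outputs).
SERVER_PRICE_USD = 21_000
VPU_COUNT = 10
MAX_STREAMS = {"8Kp30": 20, "4Kp30": 80, "1080p30": 320, "720p30": 640}

def per_vpu_and_cost(max_streams, vpu_count, price_usd):
    """Return {format: (streams per VPU, hardware list price per stream in USD)}."""
    return {
        fmt: (total / vpu_count, price_usd / total)
        for fmt, total in max_streams.items()
    }

for fmt, (per_vpu, cost) in per_vpu_and_cost(MAX_STREAMS, VPU_COUNT, SERVER_PRICE_USD).items():
    print(f"{fmt}: {per_vpu:g} streams per VPU, ${cost:,.2f} per stream")
```

These are simple averages; the review does not state that every VPU carries an identical share of the load.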

I’ll start this review with a technical description of the server and transcoding hardware. Then we’ll review some performance results for one-to-one streaming and H.264, HEVC, and AV1 ladder generation and finish with a look at the Quadra server’s AI-based features and output.

Figure 1. The Quadra Video Server powered by the Codensity G5 ASIC.

Hardware Specs - The Quadra Server

The NETINT Quadra Video Server uses the Supermicro 1114S-WN10RT server platform with a 32-core AMD EPYC 7543P CPU running Ubuntu 20.04.5 LTS. The Quadra server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD, with three PCIe slots and ten NVMe slots that house the Quadra T1U VPUs. NETINT also offers the Quadra server with two other CPUs, the 64-core AMD EPYC 7713P processor ($24,000) for more demanding applications and the economical 8-core AMD EPYC 7232P processor ($19,000) for pure transcoding applications that may not require a 32-core CPU.

Supermicro* is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro* adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards, and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient server to house the NETINT Quadra VPUs.

*We are the leading server and storage vendor that designs, develops, and manufactures the majority of our development in the United States – at our headquarters in San Jose, Calif. Our Quality Management System is certified to ISO 9001:2015 and ISO 13485:2016 standards and our Environmental Management System is certified to ISO 14001:2015 standard. In addition to that, the Supermicro Information Security Managemen

SOURCE: https://www.supermicro.com/en/about

Hardware Specs – Quadra VPUs

The Quadra T1U VPUs are powered by the NETINT Codensity G5 ASIC and packaged in a U.2 form factor that plugs into the NVMe slots in the server and communicates via the ultra-high bandwidth PCIe 4.0 bus. Quadra VPUs can decode H.264, HEVC, and VP9 inputs and encode into the H.264, HEVC, and AV1 standards.

Beyond transcoding, Quadra VPUs house 2D processing engines that can crop, pad, and scale video, and perform video overlay, YUV and RGB conversion, reducing the load on the host CPU to increase overall throughput. These engines can perform xStack operations in hardware, making the Quadra server ideal for conferencing and security applications that combine multiple feeds into a multi-pane output mosaic window.

Each Quadra T1U in the Quadra server includes a 15 TOPS Deep Neural Network Inference Engine that can support models trained with all major deep learning frameworks, including Caffe, TensorFlow, TensorFlow Lite, Keras, Darknet, PyTorch, and ONNX. NETINT supplies several reference models, including a facial detection model that uses region of interest encoding to improve facial quality on security and other highly compressed streams. Another model provides background removal for conferencing applications.

Operational Overview

We tested the Quadra server with FFmpeg and GStreamer. Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the Quadra NVME interface and the FFmpeg/GStreamer software layers. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

Figure 2. The software architecture for controlling the server.

To allocate jobs to the ten Quadra T1U VPUs, the Quadra device driver software includes a resource management module that tracks Quadra capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available VPUs.

Alternatively, you can manually assign decoding and encoding tasks to different Quadra VPUs in the command line or application and even control which streams are decoded by the host CPU or a Quadra. With these and similar controls, you can most efficiently balance the overall transcoding load between the Quadra and host CPU and maximize throughput. We used auto distribution for all tests.
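
The review does not describe how auto mode chooses a VPU, so the following is only a generic least-loaded policy, shown to illustrate what automatic distribution across ten devices could look like. It is not NETINT's resource manager; the function names and the abstract "cost" unit are hypothetical.

```python
def pick_least_loaded(load_by_device):
    """Return the device id with the lowest current load (ties go to the lowest id)."""
    return min(sorted(load_by_device), key=lambda dev: load_by_device[dev])

def distribute(job_costs, device_count=10):
    """Assign each job to whichever device is least loaded at that moment.

    job_costs: iterable of abstract per-job costs (hypothetical unit).
    Returns (placement list, final load per device).
    """
    load = {dev: 0.0 for dev in range(device_count)}
    placement = []
    for cost in job_costs:
        dev = pick_least_loaded(load)
        load[dev] += cost
        placement.append(dev)
    return placement, load

placement, load = distribute([1.0] * 20)
print(placement)
```

With equal-cost jobs this policy degenerates to round-robin; with mixed resolutions it keeps heavier jobs from piling onto one device.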

We tested with FFmpeg v5.2.3 and GStreamer v1.18 (with FFmpeg v4.3.1), using Quadra release 3.2.0. As you’ll see, we weren’t able to complete all tests in all modes with both programs, so we present the results we were able to complete.

In all tests, we configured the Quadra VPUs for maximum throughput as opposed to maximum quality. You can read about the configuration options and their impact on output quality and performance in Benchmarking Hardware Transcoder Performance. While quality will relate to each video and encoding configuration, the configuration used should produce quality at least equal to the veryfast x264 and x265 presets, with quality up to the slow presets available in configurations that optimize quality over throughput.

We tested multiple facets of system performance. The first series of tests involved a single stream in and single stream out, either at the same resolution as the incoming stream or scaled down and output at a lower resolution. Many applications use this mode of operation, including gaming, gambling, and auctions.

The second use case is ABR distribution, where a single input stream is transcoded to a full encoding ladder. Here we supplemented the results with software-only transcodes for comparison purposes. To assess AI-related throughput, we tested region-of-interest transcoding and background removal.

In most modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the Quadra server and delivered the encoded file to RAM.

Same-Resolution Transcoding

Table 1 shows transcoding results for 8K, 4K, 1080p, and 720p in latency-tolerant and low-delay modes. Each number is the count of full-frame-rate outputs produced by the system in that configuration.

These results are most relevant for interactive gambling and similar applications that input a single stream, transcode the stream at full resolution, and stream it out. You can see that 8K streaming is not available in the AV1 format and that 8K H.264 and HEVC are not available in low-latency mode with either program. Interestingly, FFmpeg outperformed GStreamer at 8K, while the reverse was true at 1080p.

4K and 720p results were consistent for all input and output codecs and for normal and low delay modes. All output numbers are impressive, but the 640 720p streams for AV1, H.264, or HEVC is remarkable density for a 1RU rack server.

At 1080p there are minor output differences between normal and low-delay mode and the different codecs, though the codec-related differences aren’t that substantial. Interestingly, HEVC throughput is slightly higher than H.264, with AV1 about 16% behind HEVC.

Table 1. Same resolution transcoding results.

Table 2 shows a collection of maximum data points (worst case) from the transcoding results presented in Table 1. As you can see, both Max CPU and power consumption track upwards with the number of streams produced. Max latency (decode plus encode) in normal latency mode tracks downward with the stream resolution, becoming quite modest at 720p. Max latency (decode plus encode) in low-delay mode for both decoding and encoding starts and stays under 30.9 milliseconds, which is less than a single frame.
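
As a quick check on the "less than a single frame" comparison, the sketch below converts a latency in milliseconds into frame durations. It assumes the 30 fps frame rate of the p30 outputs discussed in this review; the function name is illustrative.

```python
def latency_in_frames(latency_ms, fps=30.0):
    """Express a latency as a multiple of one frame's display time."""
    frame_duration_ms = 1000.0 / fps  # about 33.33 ms at 30 fps
    return latency_ms / frame_duration_ms

worst_case = latency_in_frames(30.9)
print(f"30.9 ms is {worst_case:.3f} frames at 30 fps")
```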

Table 2. Maximum CPU, power consumption, and latency data for pure transcoding.

Between FFmpeg and GStreamer, the latter proved more CPU and power efficient than the former, in both normal and low-delay modes. For example, in all tests, GStreamer’s CPU utilization was less than half that of FFmpeg, though the power consumption delta was generally under 20%.

At 8K and 4K resolutions, the latency reported was about even between the two programs, but at the lower resolutions in low-delay mode, GStreamer’s latency was often half that of FFmpeg. You can see an example of these two observations in Table 3, reporting 720p HEVC input and output as HEVC. Though the throughput was identical, GStreamer used much less energy and produced much lower latency. As you’ll see in the next section, this dynamic stayed true in transcoding with scaling tests, making GStreamer the superior app for applications involving same-resolution transcoding and transcoding with scaling. 

Table 3. GStreamer was much more CPU and power-efficient and delivered substantially lower latency than FFmpeg in these same resolution transcode tests.

Transcoding and Scaling

Table 4 shows transcoding while scaling results, first 8K input to 4K output, then 4K to 1080p, and lastly 1080p to 720p. If you compare Table 4 with Table 1, you’ll see that performance tracks the input resolution, not output, which makes sense because decoding is a separate operation that involves its own hardware limits.

Table 4. Transcoding while scaling results.

As the Quadra VPUs perform scaling on-board, there was no drop in throughput with the scaling related tests; rather, there was a slight increase in 8K > 4K and 4K > 1080p outputs over the same resolution transcoding reported in Table 1. In terms of throughput, the results were consistent between the codecs and software programs.

Table 5 shows the max CPU and power usage for all the transcodes in Table 4, which increased somewhat from the low-quantity high-resolution transcodes to the high-quantity low-resolution transcodes but was well within the performance envelope for this 32-core server.

The Max latency for all normal encodes was relatively consistent between five and six frames. With low delay engaged, 8K > 4K latency didn’t drop that significantly, though you’d assume that 8K to 4K transcodes are uncommon. Latency dropped to below a single frame in the two lower resolution transcodes.

Table 5. Maximum CPU, power consumption, and latency data for transcoding while scaling.

As between FFmpeg and GStreamer we saw the same dynamic as with full resolution transcodes; in most tests, GStreamer consumed significantly less power and produced sharply lower latency. You can see an example of this in Table 6, reporting the results of 1080p incoming HEVC output to AV1 at 720p. 

Table 6. GStreamer was much more CPU and power-efficient and delivered much lower latency than FFmpeg in this scale-then-transcode test.

Encoding Ladder Testing

Table 7 shows the results of full ladder testing with CPU, latency, and power consumption embedded in the output instances. Note that we tested a five-rung ladder for H.264, and four-rung ladders for HEVC and AV1. We didn’t test 4K H.264 output because few services would deploy this configuration. Also, we didn’t test with GStreamer because NETINT’s current GStreamer implementation can’t use Quadra’s internal scalers when producing more than a single file, an issue that the NETINT engineering team will resolve soon. Also, as you can see, low-delay mode wasn’t available for 4K testing.

This fine print behind us, as with the single file testing, throughput was impressive. The ability to deliver up to 140 HEVC 4-rung ladders from a single 1RU rack, in either normal or low-latency mode, is remarkable.

Table 7: Encoding ladder throughput. 

For comparison purposes, we produced the equivalent encoding ladders on the same server using software-only encoding with FFmpeg and the x264, x265, and SVT-AV1 codecs. To match the throughput settings used for Quadra, we used the ultrafast preset for x264 and x265, and preset eleven for SVT-AV1. You see the results in Table 8.
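
For readers who want to try a comparable software-only run, here is a minimal sketch that builds an FFmpeg command for a five-rung x264 ladder with the ultrafast preset. The review does not publish its command lines, so the rung heights, bitrates, file names, and the helper function are all assumptions for illustration; only standard FFmpeg options (`-filter_complex`, `split`, `scale`, `-map`, `-c:v`, `-preset`, `-b:v`, `-an`) are used.

```python
import shlex

# Hypothetical five-rung ladder as (output height, video bitrate).
# These values are illustrative, not the ladder used in the review.
LADDER = [(1080, "4500k"), (720, "3000k"), (540, "2000k"), (432, "1200k"), (360, "800k")]

def build_ladder_command(src, ladder, codec="libx264", preset="ultrafast"):
    """Build an FFmpeg argument list that scales one input into every rung."""
    n = len(ladder)
    # Split the decoded video into n branches, then scale each branch.
    split = f"[0:v]split={n}" + "".join(f"[s{i}]" for i in range(n))
    scales = [f"[s{i}]scale=-2:{h}[v{i}]" for i, (h, _) in enumerate(ladder)]
    cmd = ["ffmpeg", "-y", "-i", src, "-filter_complex", ";".join([split] + scales)]
    for i, (h, bitrate) in enumerate(ladder):
        # Options placed before an output file apply to that output only.
        cmd += ["-map", f"[v{i}]", "-c:v", codec, "-preset", preset,
                "-b:v", bitrate, "-an", f"out_{h}p.mp4"]
    return cmd

print(shlex.join(build_ladder_command("input.mp4", LADDER)))
```

Passing `codec="libx265"` keeps the same preset name, while `codec="libsvtav1", preset="11"` would correspond to the SVT-AV1 setting mentioned above, assuming an FFmpeg build that includes those encoders.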

Note that these numbers over-represent software-based output since no engineer would produce a live stream with CPU utilization over 60 – 65%, since a sudden spike in CPU usage would crash all the streams. Not only is CPU utilization much lower for the Quadra-driven encodes, minimizing the risk of exceeding CPU capacity, but Quadra-based transcoding is also much more deterministic than CPU-based transcoding, so CPU requirements don’t typically change in midstream.

All that said, Quadra proved much more efficient than software-based encoding for all codecs, particularly HEVC and AV1. In Table 8, the Multiple column shows the number of software-only servers required to produce the same output as the Quadra server, plus the power consumed by all these servers. For H.264, you would need six servers instead of a single Quadra server to produce the 120 instances, and power costs would be nearly six times higher. That’s with each software-only server running at 98.3% CPU utilization. Running at a more reasonable 60% utilization would translate to ten servers and 4,287 watts.
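
The step from six servers to ten follows from simple scaling, sketched below. It assumes software throughput scales linearly with CPU utilization, which is a simplification; the function name is illustrative, and the inputs are the figures quoted above.

```python
import math

def servers_at_target_utilization(servers_measured, measured_util_pct, target_util_pct):
    """Scale a server count linearly from the measured CPU utilization to a target."""
    return math.ceil(servers_measured * measured_util_pct / target_util_pct)

servers = servers_at_target_utilization(6, 98.3, 60.0)  # 6 * 98.3 / 60 = 9.83, rounded up
watts_per_server = 4287 / servers                       # implied average draw per server
print(servers, round(watts_per_server, 1))
```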

Table 8. Ladders, CPU utilization, and power consumed for CPU-only transcoding.

Even without factoring in the 60% CPU-utilization limits, the comparison reaches untenable levels with HEVC and AV1. As the data shows, CPU-based transcoding simply can’t keep up with these more complex codecs, while the ASIC-driven Quadra remains relatively consistent. 

AI-Related Functions

The next two tables benchmark AI-related functions, first region of interest encoding, then background removal. Briefly, region of interest encoding uses AI to search for faces in a stream and then increases the bits assigned to those faces to increase quality. This is useful in surveillance videos or any low-bitrate video environment where facial quality is important. 
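
To make the idea concrete, the sketch below turns face bounding boxes into a per-block quantizer offset map, where a negative offset means finer quantization and therefore more bits for that block. This is a generic illustration of region of interest encoding, not NETINT's implementation; the 16-pixel block size, the offset of -6, and the function name are assumptions.

```python
def roi_qp_offsets(width, height, faces, block=16, face_offset=-6):
    """Return a rows x cols grid of QP offsets, negative where a face overlaps.

    faces: list of (x, y, w, h) pixel boxes, e.g. from a face detector.
    """
    cols = (width + block - 1) // block   # round up to cover partial blocks
    rows = (height + block - 1) // block
    grid = [[0] * cols for _ in range(rows)]
    for (x, y, w, h) in faces:
        # Clamp the box to the frame and convert pixel bounds to block indices.
        c0 = max(x // block, 0)
        r0 = max(y // block, 0)
        c1 = min((x + w - 1) // block, cols - 1)
        r1 = min((y + h - 1) // block, rows - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = face_offset
    return grid

grid = roi_qp_offsets(1920, 1080, [(32, 16, 32, 32)])
print(sum(v != 0 for row in grid for v in row), "blocks boosted")
```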

We tested 1080p AVC input and output with FFmpeg only, and the system delivered sixty outputs in both normal and low-delay modes, with very modest CPU utilization and power consumption. For more on Quadra’s AI-related functions, and for an example of the region of interest filter, see an Introduction to AI Processing on Quadra.

Table 9. Throughput for Region of Interest transcoding via Artificial Intelligence.

Table 10 shows 1080p input/output using the AVC codec with background removal, which is useful in conferencing and other applications to composite participants in a virtual environment (see Figure 3). This task involves considerably more CPU but delivers slightly greater throughput.

Table 10. Throughput for background removal and transcoding via Artificial Intelligence.

As you can read about in the Introduction to AI Processing on Quadra, Quadra comes with these and other AI-based applications and can deploy AI-based models developed in most machine learning programs. Over time, AI-based operations will become increasingly integral to video transcoding functions, and the Quadra Video Server provides a future-proof platform for that integration.

Figure 3. Compositing participants in a virtual environment with background removal

Conclusion

While there’s a compelling case for ASIC-based transcoding solely for H.264 production, these tests show that as applications migrate to more complex codecs like HEVC and AV1, CPU-based transcoding is untenable economically and for the environment. Beyond pure transcoding functionality, if there’s anything that the ChatGPT-era has proven, it’s that AI-based transcoding-related functions will become mainstream much sooner than anyone might have thought. With highly efficient ASIC-based transcoding hardware and AI engines, the Quadra Video Server checks all the boxes for a server to strongly consider for all high-volume live streaming applications. 

Hardware Transcoding: What it Is, How it Works, and Why You Care

What is Transcoding?

Like most terms relating to streaming, transcoding is defined more by practice than by a dictionary. In fact, transcoding isn’t in Webster’s or many other dictionaries. That said, it’s generally accepted that transcoding means converting a file from one format to another. More particularly, it’s typically used within the context of a live-streaming application.

As an example, suppose you were watching a basketball game on NBA.tv. Assuming that the game is produced on-site, somewhere in the arena, a video mixer pulls together all video, audio, and graphics. The output would typically be fed into a device that compresses it to a high-bitrate H.264 or another compressed format and sends it to the cloud. You would typically call this live encoding; if the encoder is hardware-based, it would be hardware-based live encoding.

In the cloud, the incoming stream is transcoded to lower resolution H.264 streams for delivery to mobile and other devices or HEVC for delivery to a smart TV. This can be done in software but is typically performed using a hardware transcoder because it’s more efficient. More on this below.

Looking further into the production and common uses of streaming terminology, during the event or after, a video editor might create short highlights from the original H.264 video to share on social media. After editing the clip, they would encode it to H.264 or another compressed format to upload to Instagram or Facebook. You would typically call rendering the output from the software editor encoding, not transcoding, even though the software converts the H.264 input file to H.264 output, just like the transcoder.

HARD QUESTIONS ON HOT TOPICS: Transcoding versus Encoding.
Watch the full conversation on YouTube: https://youtu.be/BcDVnoxMBLI

Boiling all this down in terms of common usage:

  • You encode a live stream from video input, in software or in hardware, to send it to the cloud for distribution. You use a live encoder, either hardware or software, for this.
  • In the cloud, you transcode the incoming stream to multiple resolutions or different formats using a hardware or software transcoder.
  • When outputting video for video-on-demand (VOD) deployment, you typically call this encoding (and not transcoding), even if you’re working from the same compressed format as the transcoding device.

Hardware Transcoding Alternatives

Anyone who has ever encoded a file knows that it’s a demanding process for your computer. When producing for VOD, time matters, but if the process takes a moment or two longer than planned, no one really notices. Live, of course, is different; if the video stream slows or is interrupted, viewers notice and may click to another website or change channels.

This is why hardware transcoding is typically deployed for high-volume transcoding applications. You can encode with a CPU and software, but CPUs perform multiple functions within the computer and are not optimized for transcoding. This means that a CPU-only server produces fewer streams than a server equipped with hardware transcoders, which translates to higher CAPEX and power consumption.

Like the name suggests, hardware-based transcoding uses hardware devices other than the CPU to transcode the video. One alternative is the graphics processing unit (GPU), which is highly optimized for graphic-intensive applications like gaming. Transcoding is supported with dedicated hardware circuits in the GPU, but the vast majority of circuits are for graphics and other non-transcoding functions. While GPUs are more efficient than CPUs for transcoding, they are expensive and consume significant power.

ASIC-Based Transcoding

Which takes us to ASICs. Application-Specific Integrated Circuits (ASICs) are designed for a specific task or application, like video transcoding. Because they’re designed for this task, they are more efficient than CPU or GPU-based encoding, more affordable, and more power-efficient.

“Because they’re designed for this task, Application-Specific Integrated Circuits (ASICs) are more efficient than CPU or GPU-based encoding, more affordable, and more power-efficient.”

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

ASICs are also very compact, so you can pack more ASICs into a server than GPUs or CPUs, increasing the output from that server. This means that fewer servers can deliver the same number of streams as GPU or CPU-based transcoding, which saves additional server storage cost and maintenance.

While we’re certainly biased, if you’re looking for a cost-effective and power-efficient hardware alternative for high-volume transcoding applications, ASIC transcoders are the way to go. Don’t take our word for it; you can read how YouTube converted much of their production operation to the ASIC-based Argos VCU (for video compression unit). Meta also recently released their own encoding ASIC. Of course, neither of these is for sale to the public; the primary vendor for ASIC-based transcoders is NETINT.

NETINT Video Transcoding Server – ASIC technology at its best


Many high-volume streaming platforms and services still deploy software-only transcoding, but high energy prices for private data centers and escalating public cloud costs make the OPEX, carbon footprint, and dismal scalability unsustainable. Engineers looking for solutions to this challenge are actively exploring hardware that can integrate with their existing workflows and deliver the quality and flexibility of software with the performance and operational cost efficiency of purpose-built hardware. 

If this sounds like you, the USD $8,900 NETINT Video Transcoding Server could be the ideal solution. The server combines the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered 1RU server with ten NETINT T408 video transcoders that draw just 7 watts each. The server encodes HEVC and H.264 at normal or low latency, and you can control transcoding operations via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional x264 or x265 FFmpeg-based or GPU-powered encoding stack.


Due to the performance advantage of ASICs compared to software running on x86 CPUs, the server can perform the equivalent work of roughly 10 separate machines running a typical open-source FFmpeg and x264 or x265 configuration. Specifically,  the server can simultaneously transcode twenty 4Kp30 streams, and up to 80 1080p30 live streams. In ABR mode, the server transcodes up to 30 five-rung H.264 encoding ladders from 1080p to 360p resolution, and up to 28 four-rung HEVC encoding ladders. For engineers delivering UHD, the server can output seven 6-rung HEVC encoding ladders from 4K to 360p resolution, all while drawing less than 325 watts of total power.
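
The power figures above can be combined into a rough budget, sketched below. It assumes the "less than 325 watts" ceiling also applies when the server is producing 80 1080p30 streams, which the text implies but does not state outright; the variable names are illustrative.

```python
T408_COUNT = 10
T408_WATTS_EACH = 7        # per-transcoder draw quoted above
SERVER_WATTS_MAX = 325     # upper bound on total server draw quoted above
STREAMS_1080P30 = 80

transcoder_watts = T408_COUNT * T408_WATTS_EACH               # all ten T408s together
transcoder_share = transcoder_watts / SERVER_WATTS_MAX        # fraction of the ceiling
watts_per_stream_upper = SERVER_WATTS_MAX / STREAMS_1080P30   # upper bound per 1080p30 stream

print(f"T408s: {transcoder_watts} W ({transcoder_share:.1%} of the {SERVER_WATTS_MAX} W ceiling)")
print(f"At most {watts_per_stream_upper:.2f} W per 1080p30 stream")
```

Under these assumptions, the transcoders themselves account for a minority of the power budget, with the remainder going to the host CPU, memory, storage, and fans.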

This review begins with a technical description of the server and transcoding hardware and the options available to drive the encoders, including the resource manager that distributes jobs among the ten transcoders. Then we’ll review performance results for one-to-one streaming and then H.264 and HEVC ladder generation, and finish with a look at the server’s ultra-efficient power consumption.


Hardware Specs

Built on the Supermicro 1114S-WN10RT 1RU server platform, the NETINT Video Transcoding Server features ten NETINT Codensity ASIC-powered T408 video transcoders and runs Ubuntu 20.04.5 LTS. The server ships with 128 GB of DDR4-3200 RAM and a 400 GB M.2 SSD, with three PCIe slots and ten NVMe slots to house the ten U.2 T408 video transcoders.

You can buy the server with any of three AMD EPYC processors with 8 to 64 cores. We performed the tests for this review on the 32-core AMD EPYC 7543P CPU that doubles to 64 threads with multithreading. The server configured with the AMD EPYC 7713P processor with 64 cores and 128 threads sells for USD $11,500, and the economical AMD EPYC 7232P processor-based server with 8 cores and 16 threads lists for USD $7,000.

Regarding the server hardware, Supermicro is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient machine when operated with NETINT video transcoders.

Let’s explore the system - NETINT Video Transcoding Server

With this as background, let’s explore the system. Once up and running in Ubuntu, you can check T408 status via the ni_rsrc_mon_logan command, which reveals the number of T408s installed and their status. Looking at Figure 1, the top table shows the decoder performance of the installed T408s, while the bottom table shows the encoding performance.

Figure 1. Tracking the operation of the T408s, decode on top, encode on the bottom.
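
If you want to capture this status from a script, a minimal sketch follows. It calls ni_rsrc_mon_logan with no arguments, as described above, and returns None when the tool is not installed. The wrapper name, the timeout, and the assumption that the command prints its tables and exits are not from the review.

```python
import shutil
import subprocess

def t408_status(command="ni_rsrc_mon_logan", timeout_s=10):
    """Return the monitor's text output, or None if unavailable or unresponsive."""
    if shutil.which(command) is None:
        return None  # tool not on PATH, e.g. on a machine without the NETINT software
    try:
        result = subprocess.run(
            [command], capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        return None  # assumption: treat a run that does not exit as "no status"
    return result.stdout

status = t408_status()
print(status if status is not None else "ni_rsrc_mon_logan not available")
```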

About the T408

T408s have been in service since 2019 and are being used extensively in hyper-scale platforms and cloud gaming applications. To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.

The NETINT T408 is powered by the Codensity G4 ASIC technology and is available in both PCIe and U.2 form factors. The T408s installed in the server are the U.2 form factor plugged into ten NVMe bays. The T408 supports closed caption passthrough and EIA/CEA-708 encode/decode, along with support for High Dynamic Range in HDR10 and HDR10+ formats.

“To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.” 

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

The T408 decodes and encodes H.264 and HEVC on board but performs all scaling and overlay operations via the host CPU. For one-to-one same-resolution transcoding, users can select an option called YUV Bypass that sends the video decoded by the T408 directly to the T408 encoder. This eliminates high-bandwidth trips through the bus to and from system memory, reducing the load on the bus and CPU. As you’ll see, in pure 1:1 transcode applications without overlay, CPU utilization is very low, so the T408 and server are very efficient for cloud gaming and other same-resolution, low-latency interactive applications.

Figure 2. The T408 is powered by the Codensity G4 ASIC.

Testing Overview

We tested the server with FFmpeg and GStreamer. As you’ll see, in most operations, performance was similar. In some simple transcoding applications, FFmpeg pulled ahead, while in more complex encoding ladder productions, particularly 4K encoding, GStreamer proved more performant, particularly for low-latency output.

Figure 3. The software architecture for controlling the server.  

Operationally, both GStreamer and FFmpeg communicate with the libavcodec layer that functions between the T408 NVME interface and the FFmpeg software layer. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

To allocate jobs to the ten T408s, the T408 device driver software includes a resource management module that tracks T408 capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available resources.

Alternatively, you can manually assign decoding and encoding tasks to different T408 devices in the command line or application and control which streams are decoded by the host CPU or a T408. With these and similar controls, you can efficiently balance the overall transcoding load between the T408s and host CPU to maximize throughput. We used auto distribution for all tests.

Testing Procedures

We tested using Server version 1.0, running FFmpeg v4.3.1, GStreamer v1.18, and T408 release 3.2.0. We tested with two use cases in mind. The first is a single stream in, single stream out, either at the same resolution as the incoming stream or output at a lower resolution. This mode of operation is used in many interactive applications like cloud gaming, real-time gaming, and auctions where the absolute lowest latency is required. We also tested scaling performance since many interactive applications scale the input to a lower resolution.

The second use case is ABR, where a single input stream is transcoded to a full encoding ladder. In both modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

HARD QUESTIONS ON HOT TOPICS
All you need to know about NETINT Transcoding Server powered by ASICs
Watch the full conversation on YouTube: https://youtu.be/6j-dbPbmejw

One-to-One Performance

Table 1 shows transcoding results for 4K, 1080p, and 720p in latency tolerant and low-delay modes. Instances is the number of full frame rate outputs produced by the system, with CPU utilization shown for reference. These results are most relevant for cloud gaming and similar applications that input a single stream, transcode the stream at full resolution, and distribute it.

As you can see, 4K results peak at 20 streams for all codecs, though results differ by the software program used to generate the streams. The number of 1080p outputs ranges from 70 to 80, while the number of 720p streams ranges from 140 to 170. As you would expect, CPU utilization is extremely low for all test cases because the T408s shoulder the complete decoding and encoding load. This means that performance is limited by T408 throughput, not the CPU, and that the 64-core CPU probably wouldn’t produce any extra streams in this use case. For pure encoding operations, the 8-core server would likely suffice, though given the minimal price differential between the 8-core and 32-core systems, opting for the higher-end model is a prudent investment.
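To put these totals in per-device terms, the short calculation below divides the quoted stream counts by the ten T408s in the server. It assumes the load is spread evenly across devices, which the auto mode aims for but which we did not measure per device. The variable names are ours.

```python
NUM_VPUS = 10  # the server houses ten T408s

# Stream-count ranges quoted in the text (lowest, highest) across codecs and tools.
streams = {"4K": (20, 20), "1080p": (70, 80), "720p": (140, 170)}

# Average streams carried by each device, assuming an even spread.
per_device = {res: (lo / NUM_VPUS, hi / NUM_VPUS) for res, (lo, hi) in streams.items()}

for res, (lo, hi) in per_device.items():
    print(f"{res}: {lo:g} to {hi:g} streams per T408")
```

That works out to two 4K streams, seven to eight 1080p streams, or 14 to 17 720p streams per T408.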

Latency

As for latency, in normal mode, latency averaged around 45 ms for 4K transcoding and 34 ms for 1080p and 720p transcoding. In low delay mode, this dropped to around 24 ms for 4K, 7 ms for 1080p, and 3 ms for 720p, all at 30 fps transcoding and measured with FFmpeg. For reference, at 30 fps, each frame is displayed for 33.33 ms. Even in latency-tolerant mode, latency is about 1.35 frames for 4K and roughly a single frame for 1080p and 720p. In low delay mode, all resolutions are under a single frame of latency.
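The conversion from milliseconds to frames is easy to reproduce. The sketch below uses the approximate FFmpeg latencies quoted above and the 30 fps frame duration. The dictionary layout and names are ours, and because the inputs are rounded averages, the results are approximate as well.

```python
FPS = 30
FRAME_MS = 1000 / FPS  # about 33.33 ms per frame at 30 fps

# Approximate FFmpeg latencies from the text, in milliseconds.
latency_ms = {
    "normal": {"4K": 45, "1080p": 34, "720p": 34},
    "low delay": {"4K": 24, "1080p": 7, "720p": 3},
}

# Express each latency as a number of 30 fps frame intervals.
latency_frames = {
    mode: {res: ms / FRAME_MS for res, ms in values.items()}
    for mode, values in latency_ms.items()
}

for mode, values in latency_frames.items():
    for res, frames in values.items():
        print(f"{mode:9s} {res:5s} {frames:.2f} frames")
```

To evaluate a 60 fps workflow, change FPS and substitute latencies measured at that frame rate. The figures here apply only to 30 fps.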

It’s worth noting that while software performance would drop significantly from H.264 to HEVC, hardware performance does not. Thus questions of codec performance for more advanced standards like HEVC do not apply when using ASICs. This is good news for engineers adopting HEVC, and those considering HEVC in the future. It means you can buy the server, comfortable in the knowledge that it will perform equally well (if not better) for HEVC encoding or transcoding.

Table 1. Full resolution transcodes with FFmpeg and GStreamer in regular and low delay modes.

Table 2 shows the performance when scaling from 4K to 1080p and from 1080p to 720p, again by the different codecs in and out. Since scaling is performed by the host CPU, CPU usage increases significantly, particularly on the higher volume 1080p to 720p output. Still, given that CPU utilization never exceeds 35%, it appears that the gating factor to system performance is T408 throughput. Again, while the 8-core system might be able to produce similar output, the 32-core system is probably the better choice if your application involves scaling.

In these tests, latency was slightly higher than pure transcoding. In normal mode, 4K > 1080p latencies topped out at 46 ms and dropped to 39 ms for 1080p > 720p scaling, or roughly 1.4 and 1.2 frames, respectively. In low latency mode, these results dropped to 10 ms for both 4K > 1080p and 1080p > 720p. As before, these latency results are for 30 fps and were measured with FFmpeg.

Table 2. Performance while scaling from 4K to 1080p and 1080p to 720p.

The final set of tests involves transcoding to the AVC and HEVC encoding ladders shown in Table 3. These results will be most relevant to engineers distributing full encoding ladders in HLS, DASH, or CMAF containers.

Here we see the most interesting discrepancies between FFmpeg and GStreamer, particularly in low delay modes and in 4K results. In the 1080p AVC tests, FFmpeg produced 30 five-rung encoding ladders in normal mode but dropped to nine in low-delay mode. GStreamer produced 30 encoding ladders in both modes using substantially lower CPU resources. You see the same pattern in the 1080p four-rung HEVC output, where GStreamer produced more ladders than FFmpeg using lower CPU resources in both modes.

Table 3. Full encoding ladders output in the listed modes.
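Ladder counts understate how much simultaneous encoding is happening, so it can help to convert them to individual renditions. The calculation below does this for the 1080p AVC results quoted above, assuming each rung is one encode. The names are ours.

```python
RUNGS = 5  # rungs in the 1080p AVC ladder

# Ladder counts from the 1080p AVC tests described in the text.
ladders = {
    ("FFmpeg", "normal"): 30,
    ("FFmpeg", "low delay"): 9,
    ("GStreamer", "normal"): 30,
    ("GStreamer", "low delay"): 30,
}

# Total simultaneous encodes, assuming one encode per rung.
encodes = {key: count * RUNGS for key, count in ladders.items()}

for (tool, mode), total in encodes.items():
    print(f"{tool:9s} {mode:9s} {total} simultaneous encodes")
```

Thirty five-rung ladders amount to 150 simultaneous encodes, while FFmpeg's nine low-delay ladders amount to 45.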

FFmpeg produced very poor results in 4K testing, particularly in low latency mode, and it was these results that drove the testing with GStreamer. As you can see, GStreamer produced more streams in both modes, and CPU utilization again remained very low. As with the previous results, the low CPU utilization means that the results reflect the encoding limits of the T408. For this reason, it’s unlikely that the higher-end server would produce more encoding ladders.

In terms of latency, in normal mode, latency was 59 ms for the H.264 ladder, 72 ms for the four-rung 1080p HEVC ladder, and 52 ms for the 4K HEVC ladder. These numbers dropped to 5 ms, 7 ms, and 9 ms for the respective configurations in low latency mode.

Power Consumption

Power consumption is an obvious concern for all video engineers and operations teams. To assess system power consumption, we tested using the IPMI Tool. When running completely idle, the system consumed 154 watts, while at maximum CPU, the unit averaged 400 watts with a peak of 425 watts.

We measured consumption during the three basic operations tested, pure transcoding, transcoding with scaling, and ladder creation, in each case testing the GStreamer scenario that produced the highest recorded CPU usage. You see the results in Table 4.

When you consider that CPU-only transcoding would yield a fraction of the outputs shown while consuming 25-30% more power, you can see that the T408 is exceptionally efficient when it comes to power consumption. The Watts/Output figure provides a useful comparison for other competitive systems, whether CPU or GPU-based.

Table 4. Power consumption during the specified operation.
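If you want to compute a Watts/Output figure for your own measurements, the arithmetic is simple division. The sketch below is only an illustration. It takes the 400-watt maximum-CPU average quoted above as a stand-in for system draw and pairs it with the peak one-to-one stream counts from earlier in the review. It does not reproduce the measured values in Table 4, and the function and variable names are ours.

```python
def watts_per_output(total_watts, outputs):
    """Average system power attributed to each output stream."""
    if outputs <= 0:
        raise ValueError("outputs must be positive")
    return total_watts / outputs


MAX_CPU_WATTS = 400  # average draw at maximum CPU, from the text; illustrative input

# Peak one-to-one stream counts quoted earlier in the review.
peak_outputs = {"4K": 20, "1080p": 80, "720p": 170}

illustrative = {res: watts_per_output(MAX_CPU_WATTS, n) for res, n in peak_outputs.items()}

for res, watts in illustrative.items():
    print(f"{res}: {watts:.2f} W per output at {MAX_CPU_WATTS} W total")
```

Substitute the wattage you measure during each operation to get figures comparable to Table 4.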

Conclusion

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower-priced model available for pure encoding operations and a more powerful model for CPU-intensive operations, the NETINT server family meets a broad range of requirements.