Quadra encoding for UGC

Most NETINT customers use our ASIC-based devices for live transcoding. But as this article shows, if you’re encoding large volumes of UGC video to H.264 output, NETINT Quadra can significantly reduce CAPEX and OPEX while delivering equal or better quality than x264.

The predominant use for ASIC-based transcoders like NETINT’s Quadra Video Processing Unit (VPU) has been live transcoding for cloud gaming, interactive video applications, and other high-volume applications. In this document, we explore using Quadra VPUs for large-scale VOD transcoding for applications like user-generated content sites and social media.

Because transcoding the volume of video that social platforms must handle would require millions of processing cores in software, Meta and Google have developed their own encoding ASICs, with Google’s Argos ASIC reportedly replacing over 10 million CPUs dedicated to CPU-based transcoding. While few, if any, services shoulder the same encoding load as YouTube, our tests reveal that a fully loaded Quadra video server ($21,000) can replace up to 46 servers transcoding via CPUs.

About NETINT, ASICs, and VPUs

For those unfamiliar with NETINT technology and products, we were founded in 2015 and launched our first ASIC-powered transcoder in 2018.

ASIC stands for Application-Specific Integrated Circuit, a chip purpose-built for transcoding. As compared to general-purpose CPUs and GPUs, which perform many more functions and devote less real estate to transcoding, NETINT’s transcoding ASICs are less expensive, more efficient, deliver much greater throughput, and consume much less power.

NETINT’s most recent ASIC generation, named Quadra, adds video processing capabilities that led us to create a new category called the Smart VPU (video processing unit). In addition to transcoding, Quadra provides scaling and overlay functions and up to 36 TOPS of AI processing power. Quadra VPUs are available in three form factors and in the Quadra Video Server, an integrated Ubuntu-based server with ten Quadra VPUs.

All NETINT products are designed as drop-in replacements or for expansion of existing hardware or CPU-based transcoding systems. As such, you can control transcoding operations via FFmpeg, GStreamer, or an SDK with load management provided in the basic software stack.

Back to Our Analysis

Early hardware transcoders of all designs (CPUs, FPGAs, and ASICs) were rightfully criticized for subpar quality. While it’s true that software-based transcoders are capable of producing higher quality than most hardware, they require heavy server configurations and, for advanced codec support, may need multiple machines to encode a single stream in real time or for fast-turnaround applications. In contrast, NETINT’s ASICs are used by several premium OTT brands worldwide and for other quality-sensitive applications. As you’ll see in this analysis, Quadra’s output quality is quite competitive, making the adoption of ASICs feasible for many services.

FIGURE 1.  Picture of the Quadra T1U tested in this analysis.

Test Procedure Using YouTube Test Clips

To assess this, we downloaded 27 random test clips from YouTube’s UGC Dataset, described as “A large scale dataset containing YouTube User Generated Content intended for video compression and quality assessment research.” This allowed us to evaluate suitability for UGC workflows with actual UGC video files, each 20 seconds long. (Of note, a leading short video sharing social platform has deployed NETINT VPUs into production as of this writing.)

FIGURE 2. We tested UGC clips from this YouTube database.

Many (if not most) large-scale services use some form of content-adaptive encoding to choose the target bitrates for their encodes. Given the range of content we were evaluating, a single target bitrate for all 1080p encodes made no sense. To find a target appropriate for the UGC quality levels, we encoded all files with x264 using a CRF value of 27, which delivered a VMAF score of around 90-91. This provided a data rate target.
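As a rough illustration of how a CRF encode turns into a data rate target, here is a minimal Python sketch. It assumes the target is simply the average bitrate of the CRF 27 output, estimated from file size and duration. The function name and the sample file size are hypothetical, and container overhead is ignored; the tool actually used to read the data rate is not stated here.

```python
def average_bitrate_kbps(size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate of a file in kilobits per second (1 kbit = 1000 bits)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_seconds / 1000


# Hypothetical example: a 20-second CRF 27 encode that came out at 8,000,000 bytes.
target_kbps = average_bitrate_kbps(8_000_000, 20)
print(f"{target_kbps:.0f} kbps")
```

The resulting figure would then be used as the target bitrate for both the x264 and Quadra encodes of that clip.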

For x264, I created a single-pass command string that targeted the same bitrate but used 200% constrained VBR. A sample command string looked like this:

ffmpeg  -y -i Anim_1.mp4 -b:v 3200k -maxrate 6400k -bufsize 6400k  -g 60 -rc-lookahead 40 -an -c:v libx264 Anim_1_x264_3200.mp4

I’m aware that 40 is the default value for lookahead with x264 but wanted to make sure that the command string matched that used by Quadra as closely as possible. I assumed that most services producing UGC use the x264 medium preset, which is the default, to achieve high throughput and low cost per stream. Since I didn’t specify otherwise in the command string, that’s the preset used by FFmpeg.
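If you are scripting this across many clips, the 200% constrained VBR relationship can be expressed in a few lines. The Python sketch below rebuilds the same command from a per-clip target; the helper name is hypothetical, and the flags are exactly those in the command string above.

```python
import shlex


def x264_command(source: str, target_kbps: int, output: str) -> list[str]:
    """Build the single-pass x264 command with a 200% constrained VBR cap."""
    cap_kbps = 2 * target_kbps  # maxrate and bufsize are both twice the target
    return [
        "ffmpeg", "-y", "-i", source,
        "-b:v", f"{target_kbps}k",
        "-maxrate", f"{cap_kbps}k",
        "-bufsize", f"{cap_kbps}k",
        "-g", "60",             # keyframe every 60 frames
        "-rc-lookahead", "40",  # matches the Quadra lookahead depth
        "-an",                  # no audio
        "-c:v", "libx264",      # preset not set, so x264 uses medium
        output,
    ]


cmd = x264_command("Anim_1.mp4", 3200, "Anim_1_x264_3200.mp4")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the encode, assuming FFmpeg is on the path.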

For the Quadra, I created a single-pass command string that used a 2-second VBV buffer and a 40-frame lookahead and enabled Rate-Distortion Optimization, a technique that improves quality slightly (for H.264) but also decreases throughput slightly. The Quadra command string is this:

ffmpeg -y -c:v h264_ni_quadra_dec -xcoder-params "out=hw" -i Anim_1.mp4 -y -c:v h264_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=1:intraPeriod=60:lookaheadDepth=40:EnableRdoQuant=1:rdoLevel=1:vbvBufferSize=2000:bitrate=3200000:zeroCopyMode=0" Anim_1_Qua_H264_3200.mp4
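For batch work, the encoder parameter string can be generated per clip. This Python sketch only reassembles the parameters shown in the command above; the helper name is hypothetical, and it assumes (based on that command) that `bitrate` is given in bits per second and that a `vbvBufferSize` of 2000 corresponds to the 2-second buffer.

```python
def quadra_encoder_params(target_kbps: int) -> str:
    """Assemble the -xcoder-params value used for the Quadra H.264 encodes."""
    params = {
        "gopPresetIdx": 5,
        "RcEnable": 1,             # rate control on
        "intraPeriod": 60,         # keyframe every 60 frames
        "lookaheadDepth": 40,      # matches x264's rc-lookahead
        "EnableRdoQuant": 1,
        "rdoLevel": 1,
        "vbvBufferSize": 2000,     # assumed to be milliseconds (2 seconds)
        "bitrate": target_kbps * 1000,  # assumed to be bits per second
        "zeroCopyMode": 0,
    }
    return ":".join(f"{name}={value}" for name, value in params.items())


print(quadra_encoder_params(3200))
```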

I used single pass because I assumed that most services want to minimize costs and would use a single-pass technique like capped CRF for encoding. Though both the Quadra and x264 support capped CRF, matching the bitrates was essential to a fair comparison, and this would be nearly impossible using capped CRF for both. As evaluated, the bitrates of each pair of encodes were within 4% of each other.

Quality Results

During our testing and VMAF comparisons, which I performed in the Moscow State University Video Quality Measurement Tool, I noticed that the first two seconds or so of many of the x264 encoded videos suffered from low quality. You can see this in Figure 3, the VQMT Results Plot that shows the VMAF score for each frame over both 20-second files; x264 in red and Quadra in green.

As you can see, the red x264 scores suffer a drop at the start and then recover at around 60 frames. Since this occurred in many clips, I excluded the first 60 frames from the VMAF calculation.

FIGURE 3. Since the first two seconds of the x264 encodes had low quality in many clips, we excluded the first 60 frames from the VMAF calculation for both files.

Table 1 shows the overall results in all test categories, each of which included one to four clips. As you can see, the bitrate of the Quadra clips is within 98-102% of the bitrate of the x264 clips. I computed the VMAF score using the harmonic mean method, which incorporates quality variations (see here). As you can see, rather than the typical premium content target of 94-95 VMAF points, we successfully mimicked the YouTube targets for H.264 encoded video.
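To make the scoring concrete, here is a minimal Python sketch of the pooling described above: drop the first 60 frames, then take the harmonic mean of the remaining per-frame VMAF scores. It uses the plain harmonic mean from Python's standard library; measurement tools may apply a slightly different formula, and the function name and synthetic scores are hypothetical.

```python
from statistics import harmonic_mean, mean


def pooled_vmaf(frame_scores: list[float], skip_frames: int = 60) -> float:
    """Harmonic mean of per-frame VMAF scores after skipping the first frames."""
    kept = frame_scores[skip_frames:]
    if not kept:
        raise ValueError("no frames left after skipping")
    return harmonic_mean(kept)


# Synthetic 20-second, 30 fps clip: a weak first two seconds, then steady quality.
scores = [50.0] * 60 + [90.0] * 540
print(round(mean(scores), 2))         # arithmetic mean over all frames
print(round(pooled_vmaf(scores), 2))  # harmonic mean without the first 60 frames

# The harmonic mean penalizes variation: two frames at 80 and 100 pool below 90.
print(round(harmonic_mean([80.0, 100.0]), 2))
```

Because low-scoring frames pull the harmonic mean down more than the arithmetic mean, it rewards consistent quality, which is why it is a reasonable single number for comparing encoders.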

TABLE 1. Comparative bitrates and VMAF scores using the harmonic mean.

Figure 4 shows the same data in graphic form. Not a huge difference, but Quadra clearly holds its own as compared to the x264 medium preset when encoding UGC content to UGC quality levels, which was the object of the exercise.

FIGURE 4. VMAF quality comparisons over the UGC clips evaluated.

Mission accomplished from a quality perspective. Now, let’s look at throughput.

Comparative Throughput

I compared throughput on a Dell Server equipped with a 2.2 GHz AMD Ryzen 5 5600x 6-core/12-thread CPU running Ubuntu 20.04.3 LTS with 16 GB of RAM. The Quadra VPU is a T1U device. I performed all tests using FFmpeg, using version 5.0 to drive the Quadra transcoder, and version 6.0 for x264. I used a 12-minute 1080p30 file as the source for all tests.

To compare throughput, I first needed to determine the number of simultaneous jobs that consumed about 100% of available resources. Any less than 100% and I’d be wasting resources; adding jobs after 100% might actually decrease throughput because the OS would have to juggle more tasks.

Quadra has a utility that displays system load on Quadra’s four main hardware components: decoder, encoder, scaler, and AI cores. You see this in Figure 5, with the encoder load at 99%, achieved when transcoding four simultaneous files (INST is short for instance). In this configuration, the system transcoded 640 frames per second. When I evaluated five simultaneous files, the throughput dropped to 632 frames per second. Accordingly, four simultaneous jobs were the most efficient configuration, and the Quadra encoded the four 12-minute source files in 2:15 (min:sec).

FIGURE 5. Quadra achieved the highest throughput when encoding four simultaneous files.
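The 640 frames per second figure follows directly from the job size and the elapsed time. A quick Python check, using only the numbers quoted above (the function name is hypothetical):

```python
def aggregate_fps(files: int, minutes_per_file: int, source_fps: int,
                  wall_seconds: int) -> float:
    """Total frames transcoded across all jobs divided by elapsed time."""
    total_frames = files * minutes_per_file * 60 * source_fps
    return total_frames / wall_seconds


# Four 12-minute 1080p30 files finished in 2:15.
print(aggregate_fps(4, 12, 30, 2 * 60 + 15))
```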

Figure 6 shows CPU utilization during this processing. The workstation has twelve threads, so the theoretical capacity is 1200%. You see that each Quadra instance of FFmpeg required about 5.5%, totaling just over 22%, or under 2% of the available 1200%. That’s because Quadra decodes the incoming H.264 file and encodes the output file with on-board hardware, using minimal system resources. This low CPU usage will become relevant in a few moments.

FIGURE 6. In use, Quadra’s four FFmpeg instances consumed only 22% of the available 1200% of CPU in this 12-thread server.

Contrast this with the CPU utilized by FFmpeg when encoding using the x264 codec, as shown in Figure 7. Here, CPU utilization totaled 1195%, or 99.6% of the available CPU resources. While the most efficient configuration for x264 was three simultaneous files, just one less than Quadra, the encoding time jumped from 2:15 with Quadra to 7:45 for FFmpeg and x264.

FIGURE 7. With three simultaneous transcodes with x264, FFmpeg consumed 99.6% of the available resources.

This data feeds the calculations shown in Table 2. Quadra produced 48 minutes of video, or 80% of an hour, in 2:15, or 135 seconds. This means that Quadra can produce an hour of video in just under 169 seconds. As there are 86,400 seconds each day, this translates to 512 hours of encoded video per day.

TABLE 2. Throughput in hours per day with Quadra and CPU-only with x264.

In contrast, encoding with the CPU-only and x264, FFmpeg produced 36 minutes of video, 60% of an hour, in 7:45, or 465 seconds. This translates to an hour of encoded video every 775 seconds, or 111 hours of encoded video per day. This means that Quadra produces about 4.6x the throughput of CPU-only transcoding with a single Quadra in the server. Just wait till you see what can happen when you put ten Quadra VPUs in that same server.
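The hours-per-day arithmetic for both configurations can be reproduced with a short Python sketch. All inputs are the figures quoted above; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def hours_per_day(files: int, minutes_per_file: int, wall_seconds: int) -> float:
    """Hours of encoded video per day at the measured batch speed."""
    hours_encoded = files * minutes_per_file / 60       # e.g. 48 minutes = 0.8 hours
    seconds_per_encoded_hour = wall_seconds / hours_encoded
    return SECONDS_PER_DAY / seconds_per_encoded_hour


quadra = hours_per_day(4, 12, 2 * 60 + 15)  # four files in 2:15
x264 = hours_per_day(3, 12, 7 * 60 + 45)    # three files in 7:45

print(round(quadra), round(x264), round(quadra / x264, 1))
```

The ratio of the two results is the 4.6x throughput advantage cited above for a single Quadra.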

Table 3 translates these figures into a three-year financial comparison. Here are the assumptions:

  • Server cost – $5,000
  • Quadra cost – $1,500
  • Annual power cost at 500 watts draw at $0.25 equals $1,200
  • Monthly co-location cost for 1RU rack is $75.

The top line shows that Quadra costs $12,800 over three years but produces 560,640 hours of video for a cost per encoded hour of $0.0228. The second line shows that the CPU-only system costs $1,500 less but outputs only 122,075 hours of video, for a cost of $0.0926.

The third line shows the hardware cost for the CPU-only systems to match the Quadra output, assuming that you could buy 4.6 systems, which obviously you can’t. This drives the total spend to $51,896, though the cost per hour obviously stays the same.

So, to produce 560,640 hours of encoded video over three years, you’d spend $12,800 for the Quadra-based system or $51,896 to produce via CPU-only. At these transcoding levels, Quadra delivers a 75% savings. And yet, the savings stack up even more when you use the Quadra Video Server.

TABLE 3. Three-year cost comparison, Quadra vs. CPU-only.
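For readers who want to plug in their own costs, the three-year comparison can be sketched in Python as follows. The constants are the assumptions listed above and the measured hours per day; the variable and function names are hypothetical.

```python
YEARS = 3
SERVER_COST = 5_000
QUADRA_COST = 1_500
ANNUAL_POWER = 1_200
MONTHLY_COLO = 75


def three_year_cost(hardware: float) -> float:
    """Hardware plus three years of power and co-location."""
    return hardware + YEARS * ANNUAL_POWER + YEARS * 12 * MONTHLY_COLO


quadra_cost = three_year_cost(SERVER_COST + QUADRA_COST)
cpu_cost = three_year_cost(SERVER_COST)

quadra_hours = 512 * 365 * YEARS             # 512 hours per day with Quadra
cpu_hours = 86_400 / 775 * 365 * YEARS       # one encoded hour every 775 seconds

quadra_per_hour = quadra_cost / quadra_hours
cpu_per_hour = cpu_cost / cpu_hours

# Scale the CPU-only spend up to match the Quadra output.
cpu_cost_to_match = cpu_cost * quadra_hours / cpu_hours
savings = 1 - quadra_cost / cpu_cost_to_match

print(f"Quadra:   ${quadra_cost:,.0f} total, ${quadra_per_hour:.4f} per hour")
print(f"CPU-only: ${cpu_cost_to_match:,.0f} total, ${cpu_per_hour:.4f} per hour")
print(f"Savings:  {savings:.0%}")
```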

Boosting Capacity with the Quadra Video Server

Table 4 shows a higher-end use case where the publisher buys a Quadra server with ten T1Us installed. As we discussed around Figure 6, since the CPU required for each Quadra FFmpeg instance is so low, a server can easily support ten T1Us or even more without reducing the throughput of each T1U. This means a system with ten T1Us can produce 10x the throughput in the same 1RU footprint and with the same three-year power and co-location costs. The Quadra Video Server costs $21,000 and has a three-year cost total of $27,300, with a cost per hour of $0.0049.

FIGURE 8. The Quadra Video Server delivers 10x the performance of a single T1U.

To match this with a CPU-only system, you’d have to purchase 45.93 systems, which we’ll assume you can do to keep the numbers clean. Your cost per hour is the same as Table 3, but your three-year spending jumps to $518,963, or $491,663 more than the Quadra system. At these load levels, the Quadra Video Server delivers about 95% savings.

TABLE 4.  Comparing the Quadra server with CPU-only transcoding.
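The same arithmetic extends to the ten-VPU server. This Python sketch assumes, as stated above, that throughput scales linearly to ten T1Us and that power and co-location costs are unchanged; the names are hypothetical.

```python
YEARS = 3
OPEX = YEARS * 1_200 + YEARS * 12 * 75   # three years of power and co-location

server_cost = 21_000 + OPEX              # Quadra Video Server
cpu_system_cost = 5_000 + OPEX           # one CPU-only server

server_hours = 10 * 512 * 365 * YEARS    # ten T1Us at 512 hours per day each
cpu_hours = 86_400 / 775 * 365 * YEARS   # one CPU-only server

cpu_systems_needed = server_hours / cpu_hours
cpu_total = cpu_systems_needed * cpu_system_cost
savings = 1 - server_cost / cpu_total

print(f"Quadra server: ${server_cost:,} total, ${server_cost / server_hours:.4f} per hour")
print(f"CPU-only: {cpu_systems_needed:.2f} systems, ${cpu_total:,.0f} total")
print(f"Difference: ${cpu_total - server_cost:,.0f}, savings {savings:.0%}")
```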

In short, Quadra delivers slightly higher quality than x264 medium in the tested configuration while saving as much as 95% in high-volume use cases. Now you can see why YouTube and Meta produced their own ASICs.

That said, regretfully, neither company sells its ASIC-based transcoders on the open market. But the good news is that if you are a high-volume UGC site, you can achieve similar benefits deploying NETINT Quadra VPUs as standalone device(s) in your current infrastructure, or with Quadra Video Servers.

Jan Ozer is Senior Director of Video Marketing at NETINT.

Jan is also a contributing editor to Streaming Media Magazine, writing about codecs and encoding tools. He has written multiple authoritative books on video encoding, including ‘Video Encoding by the Numbers: Eliminate the Guesswork from your Streaming Video’ and ‘Learn to Produce Video with FFmpeg: In Thirty Minutes or Less’ and has produced multiple training courses relating to streaming media production.
