If you’re encoding UGC social videos – VOD or live – Quadra delivers significant cost savings with no loss in quality.
Virtually all video-oriented social media sites must now transcode user generated videos captured primarily with smartphones and uploaded or live-streamed at multiple resolutions, from 360p to 1080p and even 4K. This post compares the quality and throughput of the x264 and x265 codecs with the H.264 and HEVC output from the NETINT Quadra VPU (video processing unit).
What’s the punch line? A single Quadra VPU (most servers can accommodate up to 10 or more VPUs) delivers the same or better quality than either open-source codec with a ~4.5x increase in throughput for H.264 and ~9.4x increase for HEVC over a 12-core CPU running x264 or x265. This translates to very significant CAPEX and OPEX savings at even small video ingest volumes. The savings balloon to wild proportions at the scale of an X (Twitter), Meta (Facebook and Instagram), or Snap.
For perspective, ASICs in high-volume video workflows are not new. YouTube and Meta have built their own ASICs precisely so that their platforms could be sustainable from an economic and environmental perspective. The good news is that if a custom ASIC isn’t in your R&D budget, we were the first commercial solution in the market, and our second-generation Quadra line is now in deployment.
Note that we explored similar quality and throughput comparisons for general UGC content in this article entitled Slash UGC Transcoding Costs with ASICs. Testing performed for this article is completely new and focused solely on vertical videos. In the interest of time and space, we will refer to some workflow analysis produced in the earlier article.
About NETINT, ASICs, and VPUs
NETINT was founded in 2015 and launched its first ASIC-based transcoder in 2018. ASIC stands for Application-Specific Integrated Circuit, which in this case means a chip purpose-built for transcoding. General-purpose CPUs and GPUs, by comparison, perform general computing functions and devote less real estate to transcoding. As a result, NETINT’s transcoding ASICs are less expensive, more efficient, deliver much greater throughput, and consume much less power. In real-world use, large platforms report operational cost savings of 80% and an even greater reduction in estimated carbon emissions.
NETINT’s current generation ASIC-based VPU is named Quadra. In addition to transcoding, it performs scaling and overlay and has 15 TOPS of AI processing. Quadra is available as a standalone product with three form factors (U.2 and PCIe) and comes in single and dual chip configurations. For those looking for a turnkey solution, the Quadra Video Server includes ten Quadra T1U VPUs and is available with AMD and Ampere CPUs from 8 cores all the way to 96 cores.
H.264 Test Description
To run these tests, we downloaded 29 vertical video clips in 360p (4), 480p (5), 720p (9), and 1080p (11) resolutions from YouTube’s UGC Dataset, a “large scale dataset containing YouTube User Generated Content intended for video compression and quality assessment research.” This allowed us to evaluate suitability for UGC with actual UGC video files, each 20 seconds long.
We produced all clips and throughput tests on a Dell Server equipped with a 2.2 GHz AMD Ryzen 5 5600x 6-core/12-thread CPU running Ubuntu 20.04.3 LTS with 16 GB of RAM. The Quadra VPU is a T1U device. I performed all tests with FFmpeg, using version 5.0 to drive the Quadra and version 6.0 for x264.
Our goal was to compare the encoders at the quality level typically used by a UGC site such as a social network. Given the range of content from the YouTube dataset, this meant encoding each clip at a custom bitrate. To find this, we encoded each clip using x264 at a CRF value of 23 and measured the VMAF score. Then, we adjusted this bitrate up or down to produce the desired quality level.
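The "adjust the bitrate up or down" step can be automated. The sketch below is one possible way to do it, not the procedure we actually scripted: it bisects over bitrate until the measured VMAF lands within a tolerance of the target. The function name, the tolerance, and the `measure_vmaf` callback (which would wrap an encode plus a VMAF measurement) are all hypothetical, and the search assumes VMAF rises monotonically with bitrate.

```python
from typing import Callable


def find_bitrate(measure_vmaf: Callable[[int], float], target: float,
                 low_kbps: int, high_kbps: int, tolerance: float = 0.5) -> int:
    """Bisect for a bitrate (kbps) whose VMAF is within `tolerance` of `target`.

    `measure_vmaf` is a caller-supplied (hypothetical) function that encodes
    the clip at the given bitrate and returns its VMAF score.
    Assumes VMAF increases monotonically with bitrate.
    """
    while high_kbps - low_kbps > 1:
        mid = (low_kbps + high_kbps) // 2
        score = measure_vmaf(mid)
        if abs(score - target) <= tolerance:
            return mid
        if score < target:
            low_kbps = mid   # quality too low: search higher bitrates
        else:
            high_kbps = mid  # quality too high: search lower bitrates
    # No bitrate hit the tolerance window; return the upper bound of the bracket.
    return high_kbps
```

Each call to `measure_vmaf` costs a full encode, so bisection keeps the number of encodes per clip small compared with stepping the bitrate by a fixed amount.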
We encoded the x264 videos using the following command string, which used 200% constrained VBR encoding with a 40-frame lookahead. Since we didn’t specify a preset, FFmpeg used the default medium preset. For simplicity, I used a 60-frame GOP size for all files.
ffmpeg -y -i input.mp4 -b:v 4500k -maxrate 9000k -bufsize 9400k -g 60 -rc-lookahead 40 -an -c:v libx264 output.mp4
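Because every clip gets its own bitrate, it helps to generate the command rather than edit it by hand. Here is a minimal sketch that builds the x264 argument list for a given target bitrate, deriving the 200% maxrate from it. The function and parameter names are illustrative, the `ffmpeg -y -i` prefix mirrors the x265 command later in this post, and the buffer size is passed explicitly because the command above uses 9400k rather than a simple multiple of the target.

```python
def x264_args(src, dst, bitrate_kbps, bufsize_kbps,
              maxrate_factor=2.0, gop=60, lookahead=40):
    """Build an ffmpeg argument list for constrained VBR x264 encoding."""
    maxrate_kbps = int(bitrate_kbps * maxrate_factor)  # 200% constrained VBR
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264",          # no preset given, so medium is used
        "-b:v", f"{bitrate_kbps}k",
        "-maxrate", f"{maxrate_kbps}k",
        "-bufsize", f"{bufsize_kbps}k",
        "-g", str(gop),             # GOP size in frames
        "-rc-lookahead", str(lookahead),
        "-an",                      # drop audio
        dst,
    ]


# Reproduces the settings of the example command above.
example = x264_args("input.mp4", "output.mp4", 4500, 9400)
```

The list can be handed directly to `subprocess.run(example, check=True)`.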
The corresponding command string for Quadra was:
ffmpeg -y -c:v h264_ni_quadra_dec -xcoder-params "out=sw" -i input.mp4 -y -c:v h264_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=1:intraPeriod=60:lookAheadDepth=40:vbvBufferSize=2000:bitrate=4500000:zeroCopyMode=0" output.mp4
- -c:v h264_ni_quadra_dec -xcoder-params "out=sw" – decodes the incoming H.264 video using Quadra’s hardware decoder, which reduces the decode load on the CPU and improves throughput.
- -c:v h264_ni_quadra_enc – calls the Quadra’s H.264 encoder.
- gopPresetIdx=5 – sets three B-frames between each P-frame in the GOP.
- RcEnable=1 – sets the bitrate control to average bitrate mode.
- intraPeriod=60 – sets the GOP size to 60 frames.
- lookAheadDepth=40 – sets the lookahead to 40 frames.
- vbvBufferSize=2000 – sets the VBV buffer to 2000 milliseconds, or 2 seconds.
- bitrate=4500000 – sets the bitrate.
- zeroCopyMode=0 – reduces overhead by minimizing data transfer load to the Quadra.
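The -xcoder-params value is simply a colon-separated list of key=value pairs, so it is easy to assemble from a dictionary when the bitrate changes per clip. A minimal sketch follows; the helper name is hypothetical, and the keys and values are exactly the ones listed above.

```python
def xcoder_params(settings: dict) -> str:
    """Join encoder settings into the key=value:key=value form shown above."""
    return ":".join(f"{key}={value}" for key, value in settings.items())


# The settings from the H.264 command above, in the same order.
h264_settings = {
    "gopPresetIdx": 5,
    "RcEnable": 1,
    "intraPeriod": 60,
    "lookAheadDepth": 40,
    "vbvBufferSize": 2000,   # milliseconds
    "bitrate": 4500000,      # bits per second
    "zeroCopyMode": 0,
}

h264_param_string = xcoder_params(h264_settings)
```

Keeping the settings in a dictionary makes it straightforward to override only `bitrate` for each clip while leaving the rest fixed.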
As you’ll see with the HEVC encodes, Quadra has a function called Rate Distortion Optimization that video engineers can use to balance quality and throughput. Where Quadra offers three RDO settings for HEVC (RDO1, RDO2, RDO3), there’s only one for H.264 (RDO1). We did not enable RDO1 for H.264 for these tests to best match the performance of the medium x264 preset.
To ensure a fair comparison, we checked that all Quadra encodes were within 3% of the bitrate of the x264 (and x265) encodes, re-encoding as needed. Then we compared average VMAF scores computed using the harmonic mean method by the Moscow State University Video Quality Measurement Tool.
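The 3% bitrate check is simple to express in code. The sketch below treats the x264 or x265 bitrate as the reference and measures the difference relative to it, which is our reading of "within 3%"; the function name and the sample values in the comments are illustrative only.

```python
def within_tolerance(candidate_kbps: float, reference_kbps: float,
                     tolerance: float = 0.03) -> bool:
    """True if the candidate bitrate is within `tolerance` of the reference."""
    return abs(candidate_kbps - reference_kbps) <= tolerance * reference_kbps


# Against a 4500 kbps reference, the allowed window is 4365 to 4635 kbps.
# A 4600 kbps encode passes; a 4700 kbps encode would need to be redone.
```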
H.264 Quality Comparisons
As I discussed in detail in this article (Slash UGC Transcoding Costs with ASICs), most x264 and x265 encodes showed significant quality deficits for the first 50 frames or so. If your typical clip is 20 seconds long, this may be significant, both in metric scoring and in the potential impact on viewer Quality of Experience (QoE). If your clips are much longer, the impact of these first 50 frames is likely irrelevant to both scoring and QoE.
In the first article, we decided to exclude the first 60 frames from our metric calculations, and we did the same here. If you’re running your own tests, see if your x264/x265 encodes produce the same deficits, and draw your own conclusions as to the impact on QoE.
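If you export per-frame VMAF scores from your measurement tool, excluding the opening frames and taking the harmonic mean looks roughly like this. This is a sketch that uses the textbook harmonic mean from Python’s standard library, not the MSU tool’s own implementation, and the function name is hypothetical.

```python
from statistics import harmonic_mean


def clip_vmaf(per_frame_scores, skip_frames=60):
    """Harmonic-mean VMAF for a clip, ignoring the first `skip_frames` frames."""
    kept = per_frame_scores[skip_frames:]
    if not kept:
        raise ValueError("clip has no frames left after skipping")
    return harmonic_mean(kept)
```

The harmonic mean weights low-scoring frames more heavily than the arithmetic mean does, which is why a short run of poor frames at the start of a 20-second clip can noticeably pull down the overall score. Calling `clip_vmaf(scores, skip_frames=0)` on the same data shows how much the opening frames matter for your own encodes.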
Table 1 shows the H.264 quality results. As you can see, Quadra delivered nearly identical quality to x264 medium at all resolutions and on average.
TABLE 1. Quality comparisons between x264 Medium and Quadra H.264.
I’ll present throughput results after discussing HEVC quality.
HEVC Test Description
We used the same test clips for HEVC but adjusted the bitrate down by 40% to achieve a similar target VMAF score. We used the following x265 command string to implement 200% constrained VBR with a 60-frame GOP size and 40-frame lookahead. Again, since we didn’t specify a preset, FFmpeg used the default medium preset.
ffmpeg -y -i input.mp4 -b:v 2800k -maxrate 5600k -bufsize 5600k -x265-params keyint=60:rc-lookahead=40 -an -c:v libx265 output.mp4
Below is the Quadra command string, followed by a description of what changed from the H.264 command.
ffmpeg -y -c:v h264_ni_quadra_dec -xcoder-params "out=sw" -i VV_1080P_1.mp4 -y -c:v h265_ni_quadra_enc -xcoder-params "gopPresetIdx=5:RcEnable=1:intraPeriod=60:lookAheadDepth=40:vbvBufferSize=2000:bitrate=2250000:zeroCopyMode=0:EnableRdoQuant=1:rdoLevel=1" Quad_HEVC_RDO_1_VV_1080P_1.mp4
- -c:v h265_ni_quadra_enc – calls the Quadra’s H265/HEVC encoder.
- EnableRdoQuant=1 – enables Rate Distortion Optimization (RDO).
- rdoLevel=1 – sets the RDO level at 1, which is the lowest setting.
Like x264 presets, RDO adjusts quality and throughput, and a setting of RDO3 will increase the VMAF score by 1-3 points with some loss of throughput. We chose RDO1 to best match the medium x265 preset.
HEVC Quality Comparisons
Again, we assessed four resolutions, with the results shown in Table 2. Here, Quadra delivered better quality than x265 medium at every resolution, with an average advantage of about 1 VMAF point. Since a just noticeable difference for VMAF requires 3-6 points, this wouldn’t be a visible difference, but it pegs Quadra’s quality at RDO1 as slightly higher than x265 using the medium preset.
TABLE 2. Quality comparisons between x265 Medium and Quadra HEVC.
Now, let’s look at throughput.
Again, we tested on a 12-core Dell Server running Ubuntu 20.04.3 LTS with 16 GB of RAM. The Quadra VPU used was the T1U model. I used a 12-minute 1080×1920 30p file as the source for all performance tests.
As discussed in more detail here, before assessing throughput, I tested with each codec/encoder to ascertain how many simultaneous encodes produced the highest overall throughput.
For x264 and x265, three simultaneous encodes produced the highest throughput; with Quadra, the magic number was four for both codecs.
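To find that sweet spot on your own hardware, you can launch N encodes at once and time how long it takes for all of them to finish. The sketch below shows one way to do that with Python’s standard library; it is not the harness we used, and the function name is hypothetical. Each entry in `commands` is a full argument list, such as an ffmpeg command split into its arguments.

```python
import subprocess
import time


def run_concurrently(commands):
    """Start every command at once, wait for all, return (seconds, exit codes)."""
    start = time.perf_counter()
    processes = [subprocess.Popen(command) for command in commands]
    exit_codes = [process.wait() for process in processes]
    elapsed = time.perf_counter() - start
    return elapsed, exit_codes
```

Repeating the measurement with two, three, four, and more simultaneous encodes, and converting each elapsed time to total frames per second, shows where throughput peaks.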
Table 3 shows the results. For x264, it took 6:31 (min:sec) to produce three files, which translated to 166 frames per second. Quadra produced four files in 1:53, or 765 frames per second, which is 4.61x faster than x264.
As you would expect, x265 was much slower, processing three encodes in 14:35, or 74 frames per second. Quadra was slightly slower producing HEVC than H.264, which is the result of using RDO1 for this encode. Otherwise, HEVC would be as fast or even faster than H.264. Nonetheless, Quadra produced HEVC at 697 frames per second, or 9.41x faster than x265.
TABLE 3. Comparative throughput for x264, x265, and Quadra
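The frames-per-second figures follow directly from the elapsed times. Here is the arithmetic as a short sketch, assuming the source file is exactly 12 minutes at 30 frames per second (21,600 frames per encode); the function and variable names are illustrative.

```python
def throughput_fps(num_files, clip_seconds, clip_fps, elapsed_seconds):
    """Total frames encoded across all simultaneous jobs, per second of wall time."""
    return num_files * clip_seconds * clip_fps / elapsed_seconds


CLIP_SECONDS = 12 * 60  # 12-minute source file
CLIP_FPS = 30           # 30p

x264_fps = throughput_fps(3, CLIP_SECONDS, CLIP_FPS, 6 * 60 + 31)          # 6:31
quadra_h264_fps = throughput_fps(4, CLIP_SECONDS, CLIP_FPS, 1 * 60 + 53)   # 1:53
x265_fps = throughput_fps(3, CLIP_SECONDS, CLIP_FPS, 14 * 60 + 35)         # 14:35
quadra_hevc_fps = 697  # reported above; the elapsed time is not given in the text

h264_speedup = quadra_h264_fps / x264_fps
hevc_speedup = quadra_hevc_fps / x265_fps
```

Rounded, these reproduce the 166, 765, and 74 frames per second and the 4.61x and 9.41x multiples quoted above.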
To put these numbers in perspective, a Quadra T1U costs around $1,500, while a 12-core server in a 1RU form factor will cost $2,000 or more. Reproducing the test system used here would require one server and one Quadra, costing about $3,500 for both H.264 and HEVC (Table 4). You would have to buy five CPUs to match Quadra’s H.264 output and ten to match HEVC output.
You can see the savings Quadra delivers in each scenario in the bottom line of the table. Obviously, this is CAPEX-only savings; in operation, you’d have to fund 5x the power and co-location/housing costs for H.264 and 10x for HEVC.
TABLE 4. The comparative cost for low-volume transcoding.
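The low-volume comparison can be reproduced from the prices quoted above. This sketch assumes $2,000 per CPU server and $1,500 per Quadra T1U, and rounds the speedup up to a whole number of CPU servers. The dollar results are derived from those stated prices rather than copied from the table, so treat them as approximations; the function names are illustrative.

```python
import math


def cpu_servers_needed(speedup):
    """Whole CPU servers required to match one Quadra's throughput."""
    return math.ceil(speedup)


def capex_savings(speedup, server_cost=2000, quadra_cost=1500):
    """CAPEX of a CPU-only build minus CAPEX of one server plus one Quadra."""
    cpu_only = cpu_servers_needed(speedup) * server_cost
    with_quadra = server_cost + quadra_cost
    return cpu_only - with_quadra


h264_savings = capex_savings(4.61)  # 5 servers ($10,000) vs. $3,500
hevc_savings = capex_savings(9.41)  # 10 servers ($20,000) vs. $3,500
```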
To meet higher quantities of demand with Quadra, you can simply add additional Quadra units to the server or buy a server with ten Quadras for $21,000 for turnkey operation. Since Quadra operation is almost completely self-contained, installing ten Quadras in a server will 10x the throughput. You can read more about the Quadra Video Server here and here.
With CPU-only transcoding, you have to buy more CPUs. To match the Quadra Video Server’s H.264 output, you would need 46 CPUs, and for HEVC, 94 CPUs. You can see the math comparisons in Table 5; again, this is only CAPEX, not OPEX. Power, storage, and environmental impact are all separate.
TABLE 5. The comparative cost for server-level transcoding.
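The server-level CPU counts come from scaling the single-VPU speedups by ten. The sketch below assumes throughput scales linearly across the ten Quadras and rounds to the nearest whole CPU server, which reproduces the 46 and 94 quoted above. The dollar figures are derived from the $2,000 and $21,000 prices stated earlier rather than taken from the table, and the names are illustrative.

```python
def equivalent_cpu_servers(speedup_per_vpu, vpus=10):
    """CPU servers matching a multi-VPU server, assuming linear scaling."""
    return round(speedup_per_vpu * vpus)


def server_level_savings(speedup_per_vpu, vpus=10,
                         cpu_server_cost=2000, quadra_server_cost=21000):
    """CAPEX of the equivalent CPU-only fleet minus one Quadra Video Server."""
    cpu_fleet_cost = equivalent_cpu_servers(speedup_per_vpu, vpus) * cpu_server_cost
    return cpu_fleet_cost - quadra_server_cost


h264_servers = equivalent_cpu_servers(4.61)  # matches the 46 CPUs quoted above
hevc_servers = equivalent_cpu_servers(9.41)  # matches the 94 CPUs quoted above
```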
In conclusion: These numbers are striking, but you don’t have to take our word for it. Check out this case study for live interactive video and this article for VOD transcoding. You’ll learn that an increasing number of video engineers are discovering just what we prove above: if you’re currently transcoding at high volumes with CPU-only encoders, you can produce the same or better quality at a much lower cost with Quadra VPUs.