Real-time streaming experiences like live events, interactive video, cloud gaming, video communications, and virtual worlds are seeing massive consumer adoption. Meeting this demand with CPU-based codecs like x264 is expensive and inefficient, unnecessarily boosting CAPEX, OPEX, and carbon emissions.
The trend for large platforms like YouTube is to build custom Application Specific Integrated Circuits, or ASICs, like Google’s Argos Video Coding Unit (VCU), which, according to one report, has replaced over 10 million Intel CPUs in YouTube alone.
While most companies can’t build their own ASIC, NETINT’s Codensity ASIC-powered T408 video transcoder can deliver the same benefits for producers that encode, transcode, and process massive quantities of live or interactive streams.
This How-To Guide compares the output quality, CAPEX, OPEX and carbon emissions for three production scenarios, as follows:
- Encoding with FFmpeg using x264 on a 32-core AWS instance.
- Encoding with FFmpeg using x264 on a 32-core server.
- Encoding with ten NETINT T408 video transcoders on the same 32-core server.
Then it briefly addresses implementation details like factors to consider when buying a server to house the T408s and software options for managing transcoding activities.
Table 1 shows a three-year financial summary of the three approaches.
Table 1. Three-year cost summary for producing 100 H.264 live encoding ladders for 24/7 operation.
About the T408
Briefly, NETINT designs, develops, and sells ASIC-powered transcoders (also referred to as VPUs) like the Codensity T408, which is a video transcoder in a U.2 form-factor containing a single Codensity G4 ASIC (Figure 1). Operating in x86 and Arm-based servers, T408 transcoders utilize ASIC-based video processors to output H.264 or HEVC at up to 4Kp60, or 4x 1080p60 streams per T408 module. As you’ll see in the test results below, at lower resolutions, the T408 can produce even more simultaneous streams. The density numbers are shocking, but real. Users always like what they see – it’s the magic of silicon when designed right.
By offloading complex encode/decode processing to the ASIC, T408 video transcoders minimize host CPU utilization. The result is a significant improvement in real-time transcoding density compared to any software or even GPU-based transcoding solution.
For operation, NETINT offers highly efficient FFmpeg and GStreamer SDKs that allow operators to apply an FFmpeg/libavcodec or GStreamer patch to complete the integration. We performed all tests for this How-To Guide using the FFmpeg integration.
Figure 1. NETINT Codensity T408 in a U.2 form factor.
In terms of power, each T408 U.2 module consumes just 7W of power at full load while delivering encoding output that equals or exceeds a 1RU server that consumes 250W of power during software-based H.264 encoding. This efficiency allows T408-equipped systems to deliver massive reductions in CAPEX, OPEX, and carbon emissions.
To compute the costs detailed in this guide, we assumed that a producer needed to encode 100 simultaneous H.264 encoding ladders for 24/7 operation. Each ladder contains the following five rungs.
- 1080p @ 5 Mbps
- 1080p @ 3.5 Mbps
- 720p @ 2 Mbps
- 540p @ 1 Mbps
- 360p @ 600 kbps
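To make the scale of the workload concrete, the ladder above can be written down as data and totaled. This is a minimal sketch; the names `LADDER`, `TARGET_LADDERS`, and `ladder_kbps` are illustrative and do not come from the test setup.

```python
# The five rungs listed above: (label, frame height in pixels, bitrate in kbps).
LADDER = [
    ("1080p @ 5 Mbps", 1080, 5000),
    ("1080p @ 3.5 Mbps", 1080, 3500),
    ("720p @ 2 Mbps", 720, 2000),
    ("540p @ 1 Mbps", 540, 1000),
    ("360p @ 600 kbps", 360, 600),
]
TARGET_LADDERS = 100  # simultaneous ladders assumed in this guide


def ladder_kbps(ladder):
    """Total video bitrate of one ladder, in kbps."""
    return sum(kbps for _label, _height, kbps in ladder)


per_ladder_kbps = ladder_kbps(LADDER)
total_encodes = len(LADDER) * TARGET_LADDERS
total_mbps = per_ladder_kbps * TARGET_LADDERS / 1000

print(f"Per ladder: {per_ladder_kbps / 1000} Mbps across {len(LADDER)} rungs")
print(f"Target workload: {total_encodes} encodes, {total_mbps} Mbps of output")
```

In other words, the 100-ladder target amounts to 500 simultaneous encodes, with 12.1 Mbps of video output per ladder.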
We tested three scenarios. To assess the cost of producing on AWS, testing was performed using a C6g.8xlarge instance, which was the instance recommended when we used the AWS Pricing Calculator to project AWS costs.
Then we produced the files on an AMD machine, first using only the CPU, and then using ten installed T408s. Specifically, we tested on an AMD EPYC 7351P 16-core, 32-thread CPU-based workstation running Ubuntu, which presents 32 total (logical) cores, with 64 GB of RAM.
We ran all tests using a 3-minute excerpt from the Netflix test clip Meridian, which we first converted to 1080p30 @ 90 Mbps.
We drove all systems remotely via SSH, retrieving the same test file from a RAM drive and writing the output files to RAM, removing disk I/O from the equation, and simulating live operation. We tested three encoding schemes:
- x264 using the medium preset (on AWS and the AMD system)
- x264 using the veryfast preset (on AWS and the AMD system)
- NETINT H264 encoder using the default settings and decoding the incoming H.264 stream using the onboard T408 decoder (on the AMD system)
We produced all streams with a two-second GOP size using CBR bitrate control. All command strings are shown in Appendix I. FFmpeg 4.3.1 was used for all tests.
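As a rough illustration of these settings, and not a reproduction of the command strings in Appendix I, the sketch below builds an FFmpeg argument list for one x264 rung. The two-second GOP at 30 fps becomes a 60-frame keyframe interval, and the constant bitrate is approximated by setting `-b:v` and `-maxrate` to the same value. The file paths, the one-second `-bufsize`, the `scale` filter, and the function names are assumptions made for the example.

```python
import shlex

FPS = 30
GOP_SECONDS = 2


def x264_rung_args(height, kbps, preset):
    """FFmpeg output options for one rung: fixed 2-second GOP, capped bitrate."""
    gop = FPS * GOP_SECONDS  # 60 frames
    rate = f"{kbps}k"
    return [
        "-vf", f"scale=-2:{height}",       # scale on the host CPU, keep aspect ratio
        "-c:v", "libx264", "-preset", preset,
        "-b:v", rate, "-maxrate", rate,     # target and cap at the same bitrate
        "-bufsize", rate,                   # assumed one-second VBV buffer
        "-g", str(gop), "-keyint_min", str(gop),
        "-sc_threshold", "0",               # no extra keyframes at scene changes
    ]


def x264_command(src, dst, height, kbps, preset="veryfast"):
    """Full command line for a single-rung encode."""
    return ["ffmpeg", "-y", "-i", src, *x264_rung_args(height, kbps, preset), dst]


# Hypothetical RAM-drive paths, for illustration only.
cmd = x264_command("/mnt/ramdisk/source.mp4", "/mnt/ramdisk/out_720p.mp4", 720, 2000)
print(shlex.join(cmd))
```

A production ladder would normally generate all five rungs from one FFmpeg process, so the source is decoded only once.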
The goal of each test was to determine how many 30 fps encoding ladders each system could produce. To test this, we opened multiple instances and started encoding ladders until one or more dropped below 30 fps.
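The procedure can be sketched as a simple loop. We ran the tests by hand in multiple terminal windows, so the code below is only an illustration. `measure_fps` is a hypothetical callback that starts `n` concurrent ladders and returns the frame rate each one sustains, and `fake_measure` stands in for a real measurement.

```python
from typing import Callable, List

TARGET_FPS = 30.0


def max_realtime_ladders(measure_fps: Callable[[int], List[float]],
                         limit: int = 64) -> int:
    """Largest n for which all n concurrent ladders hold TARGET_FPS."""
    best = 0
    for n in range(1, limit + 1):
        if min(measure_fps(n)) < TARGET_FPS:
            break  # one or more ladders fell below real time
        best = n
    return best


def fake_measure(n: int) -> List[float]:
    """Stand-in host that sustains three ladders and collapses at four."""
    return [30.0] * n if n <= 3 else [24.0] * n


print(max_realtime_ladders(fake_measure))  # 3
```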
As an example, Figure 2 shows three ladders encoding at 30 fps using x264 and the veryfast preset on the AMD-based system. When we started encoding an additional ladder in the fourth window, all ladders dropped below 30 fps and never recovered.
Figure 2. Three ladders encoding at 30 fps using x264 and the veryfast preset.
A look at the CPU utilization shown in the Top utility on the bottom of Figure 2 reveals why. With 32 total cores, the system has an available 3200% CPU. The combined utilization of the three instances shown in the figure exceeded 2600%, or just under nine cores each, to produce the three 30 fps ladders. With only about six cores left, the system lacked the resources to produce another ladder at the full 30 fps.
Using this procedure, the AMD-based computer produced a single ladder with x264 and the medium preset, and the three ladders shown in Figure 2 with x264 and the veryfast preset. The AWS instance produced one ladder with the medium preset and four ladders with the veryfast preset. In contrast, when utilizing the ten T408s installed in the system, the AMD-based system produced 23 simultaneous ladders.
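The headroom arithmetic works out as follows. The 2600% figure is the approximate combined utilization read from Top, so the results are estimates, and the variable names are illustrative.

```python
TOTAL_CORES = 32
TOTAL_CPU_PCT = TOTAL_CORES * 100   # Top reports 100% per core

used_pct = 2600                     # approximate combined load of three ladders
ladders_running = 3

per_ladder_pct = used_pct / ladders_running   # about 867%, just under nine cores
remaining_pct = TOTAL_CPU_PCT - used_pct      # 600%, about six cores
fits_another = remaining_pct >= per_ladder_pct

print(f"Per ladder: {per_ladder_pct:.0f}% (~{per_ladder_pct / 100:.1f} cores)")
print(f"Remaining: {remaining_pct}% (~{remaining_pct / 100:.0f} cores)")
print(f"Room for a fourth ladder: {fits_another}")
```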
Figure 3 shows decoder (top) and encoder (bottom) utilization of the T408s when producing this output. During this trial, CPU utilization per FFmpeg instance averaged about 42% for each encoding ladder, for a total load of under 1000% (23 × 42% = 966%). Since 32 cores (3200%) were available on this system, this left sufficient resources for file I/O and other encoding-related operations.
We discuss more about the operational aspects of the T408-based encoding farm and how to configure the server that houses it in the implementation section below.
Figure 3. T408 utilization when producing 23 simultaneous ladders.
Once we ascertained the number of simultaneous ladders each encoding scheme produced, we divided 100 by this number and rounded up to compute the number of servers required to produce the 100 simultaneous ladders. This yielded the data shown in Table 2.
Table 2. The number of servers needed to produce 100 simultaneous streams using the three encoding techniques.
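The server counts follow from dividing the 100-ladder target by each configuration's capacity and rounding up, as this short sketch shows. The dictionary labels are illustrative. The ladder counts come from the test results above.

```python
import math

TARGET_LADDERS = 100

# Simultaneous 30 fps ladders measured per server or instance.
ladders_per_server = {
    "x264 medium, AMD server": 1,
    "x264 medium, AWS instance": 1,
    "x264 veryfast, AMD server": 3,
    "x264 veryfast, AWS instance": 4,
    "Ten T408s, AMD server": 23,
}


def servers_needed(per_server, target=TARGET_LADDERS):
    """Whole servers required to reach the target ladder count."""
    return math.ceil(target / per_server)


for name, capacity in ladders_per_server.items():
    print(f"{name}: {servers_needed(capacity)} servers")

# Sensitivity check: a T408 server that manages only 20 ladders still needs five.
print(servers_needed(20))  # 5
```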
Note that the actual number of ladders that each production scheme can produce at 30 fps will vary based upon the encoding ladder and the content, including differences in resolution, the number of rungs, frame rate, and the encoding complexity of the video itself. Still, these factors should impact the three schemes similarly, and encoding even three fewer simultaneous ladders with the T408-based solution would still enable five servers to produce the 100 target ladders, so there would be no change in the CAPEX, OPEX, and carbon emissions reported below.
Let’s take a quick look at output quality. By way of background, when hardware-based H.264 encoding first debuted, output quality trailed that produced by software encoding techniques, often by a significant margin. Today, at least as it relates to the T408, quality won’t be an issue. You see this in Table 3, which shows that the T408-encoded stream rated best over both x264 medium and x264 veryfast with the 1080p Meridian test clip used for this study.
Table 3. VMAF and PSNR values for the top rung of the encoding ladder (1080p @ 5 Mbps).
As with throughput, comparative quality will vary based upon multiple factors. We’re preparing a more detailed comparison for future presentations demonstrating that NETINT’s ASIC-based transcoders deliver quality equivalent to or better than the software-based codecs/presets typically used for live encoding and transcoding.
CAPEX and OPEX Comparisons
Looking at Table 2, it’s clear that few producers will attempt to produce live streams with software-only encoding using x264 and the medium preset – it’s simply too expensive. So, we focused our economic comparison on the difference between producing the 100 ladders with x264 and the veryfast preset and the T408 video transcoder.
For the x264/veryfast production, we priced two alternatives: encoding via AWS, and buying servers and running them in a colocation facility. For the T408 production, we priced buying the necessary servers and installing them in a colocation facility. In all three instances, we computed the three-year cost total for CAPEX and OPEX.
x264/Veryfast - AWS
Again, the C6g.8xlarge instance tested produced four simultaneous encoding ladders with FFmpeg using the x264 codec and the veryfast preset. Accordingly, you would need 25 instances to produce the 100 simultaneous target streams.
AWS offers an estimator at https://calculator.aws/#/addService/EC2. To estimate the cost, you enter the number of cores (32), the number of servers (25), estimated utilization (100%), and the commitment period (three years), and AWS provides a monthly estimate. As shown in Table 4, this was about $8,653 (rounded), which we multiplied by 12 for the yearly cost, and by three for the three-year total of $311,490.
Table 4. Three-year cost for producing 100 simultaneous encoding ladders with the x264 codec/veryfast preset using AWS.
x264/Veryfast - Buy Servers and Colocate
The next option required buying 34 servers, which we assumed would cost $8000 each, and running them from a colocation facility that would charge $69/month per server. We priced this option using a colocation facility rather than on-premises because colocation costs are widely available and reasonably consistent, while internal housing costs vary by company, location, and accounting method and practices.
As shown in Table 5, these price and cost estimates produced a CAPEX charge of $272,000, and a three-year OPEX charge of $84,456, totaling $356,456 for the three-year period. Overall, this was the most expensive option.
Table 5. Three-year cost for producing 100 simultaneous encoding ladders with the x264 codec/veryfast preset by buying and colocating the servers.
T408 - Buy Servers and Colocate
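The buy-and-colocate figures reduce to a small calculation, sketched below with the prices assumed in this guide. The function and parameter names are illustrative.

```python
MONTHS = 36  # three-year period


def three_year_cost(servers, price_per_server, colo_per_server_month, months=MONTHS):
    """Return (CAPEX, OPEX, total) in dollars for a colocated server fleet."""
    capex = servers * price_per_server
    opex = servers * colo_per_server_month * months
    return capex, opex, capex + opex


# 34 servers at $8,000 each, colocated at $69 per server per month.
capex, opex, total = three_year_cost(34, 8000, 69)
print(f"CAPEX ${capex:,}  OPEX ${opex:,}  Total ${total:,}")
# CAPEX $272,000  OPEX $84,456  Total $356,456
```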
Table 6 shows the three-year cost of purchasing five 32-core servers with ten T408 cards each and running them from a colocation facility. This produced CAPEX of $55,000, OPEX of $12,420, and a three-year total of $67,420.
Table 6. Three-year cost for producing 100 simultaneous encoding ladders by purchasing and colocating five T408-equipped servers.
Table 7 compares the three-year cost for all three schemes. The T408-based option represents a 78% cost reduction over encoding with AWS and an 81% savings over purchasing and colocating servers.
Table 7. Three-year cost comparison.
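The savings percentages can be reproduced from the three-year totals reported above. The dictionary keys and the `pct_saving` helper are illustrative names.

```python
# Three-year totals (CAPEX + OPEX) reported in this guide, in dollars.
totals = {
    "x264 veryfast, AWS": 311_490,
    "x264 veryfast, colocated": 356_456,
    "T408, colocated": 55_000 + 5 * 69 * 36,  # CAPEX + five servers' colocation fees
}


def pct_saving(baseline, alternative):
    """Cost reduction of `alternative` relative to `baseline`, in whole percent."""
    return round(100 * (1 - alternative / baseline))


t408 = totals["T408, colocated"]
print(f"T408 three-year total: ${t408:,}")                                  # $67,420
print(f"Saving vs AWS: {pct_saving(totals['x264 veryfast, AWS'], t408)}%")  # 78%
print(f"Saving vs colocated x264: "
      f"{pct_saving(totals['x264 veryfast, colocated'], t408)}%")           # 81%
```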
Now we turn our attention to carbon emissions, which are presented in Table 8. In the table, the AMD – CPU only and AMD – T408 server power draw figures are actual measurements taken on the test system during operation. To estimate the AWS server power draw, we reduced the CPU-only AMD number by 60%, which is the savings that Amazon claims that Graviton3 CPUs provide over other CPUs. In all three cases, we multiplied this by the number of servers, then by hours, days, and years, to compute the three-year power consumption total.
Table 8. Three-year carbon emission comparisons.
To compute metric tons of CO2, we used the EPA Greenhouse Gas Equivalencies Calculator. To estimate the CO2 emissions, you enter the total kilowatt-hours used, and the calculator displays the metric tons equivalent of greenhouse gas emissions. The T408-based option represents a 50% savings over AWS (assuming Amazon’s 60% savings estimate is accurate) and an 85% reduction as compared to purchasing and running 34 servers.
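The power calculation described above can be sketched as follows. The measured wattages appear only in Table 8, so `CPU_ONLY_WATTS` below is a hypothetical placeholder rather than our measurement. The ratio between the AWS and colocated x264 totals does not depend on that value.

```python
HOURS_3Y = 24 * 365 * 3  # hours in three years of 24/7 operation


def three_year_kwh(watts_per_server, servers):
    """Three-year energy use in kilowatt-hours for a fleet of servers."""
    return watts_per_server * servers * HOURS_3Y / 1000


CPU_ONLY_WATTS = 250.0                     # HYPOTHETICAL placeholder, not measured
aws_watts = CPU_ONLY_WATTS * (1 - 0.60)    # Amazon's claimed 60% reduction

colo_kwh = three_year_kwh(CPU_ONLY_WATTS, 34)   # 34 purchased x264 servers
aws_kwh = three_year_kwh(aws_watts, 25)         # 25 AWS instances

print(f"Colocated x264: {colo_kwh:,.0f} kWh")
print(f"AWS x264:       {aws_kwh:,.0f} kWh")
print(f"AWS / colocated ratio: {aws_kwh / colo_kwh:.2f}")
```

Each kilowatt-hour total is what you would enter into the EPA calculator to obtain metric tons of CO2.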
Choosing a CPU/Server for the T408s
We’ll conclude with a look at choosing a server for the T408s and how to interface with and operate the transcoders. In terms of operating system, the T408 transcoding software supports the following:
- OS: Ubuntu 16.04.3 LTS; kernel: 4.10.0-28-generic
- OS: Ubuntu 16.04.3 LTS; kernel: 4.15.0-64-generic
- OS: Ubuntu 18.04.2 LTS; kernel: 4.15.0-45-generic
- OS: CentOS 7.2.1511; kernel: 3.10.0-327.el7.x86_64
- OS: CentOS 7.5.1804; kernel: 3.10.0-862.11.6.el7.x86_64
- OS: CentOS 7.6.1810; kernel: 3.10.0-957.el7.x86_64
The minimum requirements for the system housing the T408s are an Intel i5 CPU or equivalent with 4 GB of DDR3 or DDR4 memory. However, you may need a more powerful CPU depending upon your specific transcoding application.
To explain, the T408 can decode incoming H.264/HEVC streams via onboard decoders and performs all encoding onboard. However, scaling the source video to lower resolutions for different rungs on the encoding ladder is performed by the host CPU.
For this reason, your selection of the host CPU(s) for the server should consider the specific tasks the system will perform. In a gaming or other interactive environment, where you are inputting a single stream and outputting a single stream at the same resolution, host CPU requirements should be limited, and a modest CPU should perform well.
In contrast, if your application involves creating full encoding ladders from HD or particularly 4K source videos, a more powerful CPU will increase overall system throughput. For example, we tested the same encoding ladder on a system with 64 cores and ten T408s, and the system produced 30 simultaneous ladders.
Note that codec selection will also impact throughput and performance, though nowhere near as much as the difference between x264 and x265 in CPU-based encoding. Specifically, as compared to H.264, you should expect a drop in throughput for HEVC transcoding of 2% to 5%, though again, this is task and CPU-dependent.
The bottom line is that it’s hard to predict which CPU configuration will perform best in your T408 host. During your pre-deployment testing, you should plan to test different CPUs to find the optimal configuration.
Operating the T408 System
As mentioned above, NETINT offers highly efficient FFmpeg and GStreamer SDKs that allow operators to apply an FFmpeg/libavcodec or GStreamer patch to complete the integration.
Figure 4. The T408 transcoding architecture.
In the FFmpeg implementation, the libavcodec patch on the host server functions between the T408 NVMe interface and the FFmpeg software layer, allowing existing FFmpeg-based video transcoding applications to control T408 operation with minimal changes.
The T408 device driver software includes a resource management module that tracks T408 capacity and usage load to present inventory and status on available resources and enable resource distribution. User applications can build their own resource management schemes on top of this resource pool or let the NETINT server automatically distribute the decoding and encoding tasks.
In this mode, users simply launch multiple transcoding jobs, and the device driver will automatically distribute the decode/encode tasking among the available resources. We used this mode of automatic distribution to produce the 23 ladders reported for the T408-based system.
Alternatively, users can assign different decoding and encoding tasks to different T408 devices and even control which streams are decoded by the host CPU or a T408. With these and similar controls, users can most efficiently balance the overall transcoding load between the T408s and host CPU and maximize throughput. Note that these resource management functions have been integrated into FFmpeg to simplify operation.
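To illustrate the idea of load-based distribution in general terms, the sketch below places each new job on the least-loaded device. This is a toy model and not NETINT's resource manager or its API. The function name, the unit job cost, and the tie-breaking rule are all assumptions made for the example.

```python
def assign_least_loaded(job_loads, num_devices):
    """Greedy placement: each job goes to the device with the lowest current load.

    Returns (placement, loads), where placement[i] is the device index chosen
    for job i and loads[d] is the final load on device d.
    """
    loads = [0.0] * num_devices
    placement = []
    for job in job_loads:
        device = min(range(num_devices), key=loads.__getitem__)  # ties: lowest index
        loads[device] += job
        placement.append(device)
    return placement, loads


# 23 equal-cost ladders spread across ten devices.
placement, loads = assign_least_loaded([1.0] * 23, 10)
print(placement)
print(loads)
```

With equal-cost jobs this reduces to round-robin placement, leaving every device with two or three ladders.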
Summary and Conclusion
Overall, the T408 delivers dramatic reductions in CAPEX and OPEX and significantly cuts carbon emissions as compared to other production alternatives, all while producing quality similar to CPU-only transcoding. With a highly functional resource management schema and FFmpeg and GStreamer integrations, implementing a T408-based solution should be fast and simple for most streaming producers.