Beyond Traditional Transcoding: NETINT’s Pioneering Technology for Today’s Streaming Needs

Welcome to our here’s-what’s-new-since-last-IBC-so-you-should-schedule-a-meeting-with-us blog post. I know you’ve got many of these to wade through, so I’ll be brief.

First, a brief introduction. We’re NETINT, the ASIC-based transcoding company. We sell standalone products like our T408 video transcoder and Quadra VPUs (video processing units), as well as servers with ten of either device installed. All offer exceptional throughput at an industry-low cost per stream and power consumption per stream. Our products are denser, leaner, and greener than any competitive technology.
They’re also more innovative. The first-generation T408 was the first new ASIC-based hardware transcoder to ship in at least a decade, and the second-generation Quadra was the first hardware transcoder with AV1 and AI processing. Quadra shipped before Google and Meta released their first-generation ASIC-based transcoders, and those devices still don’t support AV1.
That’s us; here’s what’s new.

Capped CRF Encoding

We’ve added capped CRF encoding to our Quadra products for H.264, HEVC, and AV1, with capped CRF coming to the T408 and T432 (H.264/HEVC). By way of background, as content-adaptive encoding (CAE) techniques gained wide adoption, constant rate factor (CRF) encoding with a bitrate cap gained popularity as a lightweight form of CAE: it reduces the bitrate of easy-to-encode sequences, saving delivery bandwidth, while delivering CBR-like quality on hard-to-encode sequences. Capped CRF is a mode that we expect many of our customers to use.

Figure 1 shows capped CRF operation on a theoretical football clip. The relevant switches in the command string would look something like this:

-crf 21 -maxrate 6M

This directs FFmpeg to target CRF 21 quality, which for H.264 typically translates to a VMAF score of around 95, while the maxrate switch ensures that the bitrate never exceeds 6 Mbps.
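To make that concrete, here’s a minimal sketch of a complete capped CRF command line. It uses the software x264 encoder purely for illustration (the encoder name for a Quadra device depends on NETINT’s FFmpeg integration and SDK release), and the file names are placeholders:

ffmpeg -i football_source.mp4 -c:v libx264 -preset veryfast -crf 21 -maxrate 6M -bufsize 12M -c:a copy football_capped.mp4

The -bufsize value sets the VBV buffer that enforces the cap; one to two times the maxrate is a common starting point.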

As shown in the figure, in operation the Quadra VPU transcodes the easy-to-encode sideline shots at CRF 21 quality, producing a bitrate of around 2 Mbps. Then, during actual high-motion game footage, the 6 Mbps cap takes over, and the VPU delivers the same quality as CBR. In this fashion, capped CRF saves bandwidth on easy-to-encode scenes while delivering CBR-equivalent quality on hard-to-encode scenes.

Figure 1. Capped CRF in operation. Relatively low-motion sideline shots are encoded to CRF 21 quality (~95 VMAF), while the 6 Mbps bitrate cap controls during high-motion game footage.

By deploying capped CRF, engineers can efficiently deliver high-quality video streams, enhance viewer experiences, and reduce operational expenses. As the demand for video streaming continues to grow, capped CRF emerges as a game-changer for engineers striving to stay at the forefront of video delivery optimization.

You can read more about capped CRF operation and performance in Get Free CAE on NETINT VPUs with Capped CRF.

Peer-to-Peer Direct Memory Access (DMA) for Cloud Gaming

Peer-to-peer DMA is a feature that makes the NETINT Quadra VPU ideal for cloud gaming. By way of background, in a cloud-gaming workflow, the GPU is primarily used to render frames from the game engine output. Once rendered, these frames are encoded with codecs like H.264 and HEVC.

Many GPUs can both render frames and transcode to these codecs, so it might seem most efficient to perform both operations on the same GPU. However, encoding demands a significant chunk of the GPU’s resources, which reduces overall system throughput; it’s typically the encoder, not the rendering engine, that becomes the bottleneck.

What happens when you introduce a dedicated video transcoder into the system using normal techniques? The host CPU manages the frame transfer between the GPU and the transcoder, which can create a bottleneck and slow system performance.

Figure 2. Peer-to-peer DMA enables up to 200 720p60 game streams from a single 2RU server.

In contrast, peer-to-peer DMA allows the GPU to send frames directly to the transcoder, eliminating CPU involvement in data transfers (Figure 2). With peer-to-peer DMA enabled, the Quadra supports latencies as low as 8ms, even under heavy loads. It also unburdens the CPU from managing inter-device data transfers, freeing it to handle other essential tasks like game logic and physics calculations. This optimization enhances the overall system performance, ensuring a seamless gaming experience.

Some NETINT customers are using Quadra and peer-to-peer DMA to produce 200 720p60 game streams from a single 2RU server, and that number will increase to 400 before year-end. If you’re currently assembling an infrastructure for cloud gaming, come see us at IBC.

Logan Video Server

NETINT started by selling standalone PCIe and U.2 transcoding devices, which our customers installed into their own servers. In late 2022, customers began requesting a prepackaged solution comprising a server with ten transcoders installed. The Logan Video Server is our first response.

Logan refers to NETINT’s first-generation G4 ASIC, which transcodes to H.264 and HEVC. The Logan Video Server, which launched in the first quarter of 2023, includes a SuperMicro server with a 32-core AMD CPU running Ubuntu 20.04 LTS and ten NETINT T408 U.2 transcoder cards (which cost $300 each) for $8,900. There’s also a 64-core option available for $11,500 and an 8-core option for $7,000.

The value proposition is simple. You get a break on price because of volume commitments, and you don’t have to install the individual cards, a generally simple task that can still take an hour or two. And the performance with ten installed cards is stunning, given the price tag.

You can read about the performance of the 32-core server model in my review here, which also discusses the software architecture and operation. We’ll share one table, which shows one-to-one transcoding of 4K, 1080p, and 720p inputs with FFmpeg and GStreamer.

At $8,900, the server delivers a cost per stream as low as $445 for 4K, $111.25 for 1080p, and just over $50 for 720p at normal and low latency. Since each T408 draws only 7 watts and CPU utilization is so low, power consumption is also exceptionally low.

Table 1. One-to-one transcoding performance for 4K, 1080p, and 720p.

With impressive density, low power consumption, and multiple integration options, the NETINT Logan Video Server is the new standard to beat for live streaming applications. With a lower-priced model available for pure encoding operations and a more powerful model for CPU-intensive operations, the NETINT Logan server family meets a broad range of requirements.

Quadra Video Server

Once the Logan Video Server became available, customers started asking about a similarly configured server for NETINT’s Quadra line of video processing units (VPUs), which adds AV1 output, onboard scaling and overlay, and two AI processing engines. So, we created the Quadra Video Server.

This model uses the same Supermicro chassis as the Logan Video Server and the same Ubuntu operating system but comes with ten Quadra T1U U.2 form factor VPUs, which retail for $1,500 each. Each T1U offers roughly four times the throughput of the T408, performs on-board scaling and overlay, and can output AV1 in addition to H.264 and HEVC.

The CPU options are the same as the Logan server, with the 8-core unit costing $19,000, the 32-core unit costing $21,000, and the 64-core model costing $24,000. That’s 4x the throughput at just over 2x the price.

You can read my review of the 32-core Quadra Video Server here. I’ll again share one table, this time reporting encoding ladder performance at 1080p for H.264 (120 ladders), HEVC (140), and AV1 (120), and 4K for HEVC (40) and AV1 (30).

In comparison, running FFmpeg using only the CPU, the 32-core system only produced nineteen H.264 1080p ladders, five HEVC 1080p ladders, and six AV1 1080p ladders. Given this low-volume throughput at 1080p, we didn’t bother trying to duplicate the 4K results with CPU-only transcoding.

Table 2. Encoding ladder performance of the Quadra Video Server.

Beyond sheer transcoding performance, the review also details AI-based operations and performance for tasks like region of interest transcoding, which can preserve facial quality in security and other relatively low-quality videos, and background removal for conferencing applications.

Where the Logan Video Server is your best low-cost option for high volume H.264 and HEVC transcoding, the Quadra Video Server quadruples these outputs, adds AV1 and onboard scaling and overlay, and makes AI processing available.

Come See Us at the Show

We now return to our normally scheduled IBC pitch. We’ll be in Stand 5.A86 and you can book a meeting by clicking here.

Figure 3. Book a meeting.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Choosing Transcoding Hardware: Deciphering the Superiority of ASIC-based Technology

Which technology reigns supreme in transcoding: CPU-only, GPU, or ASIC-based? Kenneth Robinson’s incisive analysis from the recent symposium makes a compelling case for ASIC-based transcoding hardware, particularly NETINT’s Quadra. Robinson’s metrics prioritized viewer experience, power efficiency, and cost. While CPU-only systems appear initially economical, they falter with advanced codecs like HEVC. NVIDIA’s GPU transcoding offers more promise, but the Quadra system still outclasses both in quality, cost per stream, and power consumption. Furthermore, Quadra’s adaptability allows a seamless switch between H.264 and HEVC without incurring additional costs. Independent assessments, such as Ilya Mikhaelis’, echo Robinson’s conclusions, cementing ASIC-based transcoding hardware as the optimal choice.


During the recent symposium, Kenneth Robinson, NETINT’s manager of Field Application Engineering, compared three transcoding technologies: CPU-only, GPU, and ASIC-based transcoding hardware. His analysis, which incorporated quality, throughput, and power consumption, is useful both as a template for testing methodology and for the results themselves. You can watch his presentation here and download a copy of his presentation materials here.

Figure 1. Overall savings from ASIC-based transcoding (Quadra) over GPU (NVIDIA) and CPU.

As a preview of his findings, Kenneth found that when producing H.264, ASIC-based hardware transcoding delivered CAPEX savings of 86% and 77% compared to CPU and GPU-based transcoding, respectively. OPEX savings were 95% vs. CPU-only transcoding and 88% compared to GPU.

For the more computationally complex HEVC codec, the savings were even greater. As compared to CPU-based transcoding, ASICs saved 94% on CAPEX and 98% on OPEX. As compared to GPU-based transcoding, ASICs saved 82% on CAPEX and 90% on OPEX. These savings are obviously profound and can make the difference between a successful and profitable service and one that’s mired in red ink.

Let’s jump into Kenneth’s analysis.

Determining Factors

Digging into the transcoding alternatives, Kenneth described the three options. First are CPUs from manufacturers like AMD or Intel. Second are GPUs from companies like NVIDIA or AMD. Third are ASICs, or Application Specific Integrated Circuits, from manufacturers like NETINT. Kenneth noted that NETINT calls its Quadra devices Video Processing Units (VPU), rather than transcoders because they perform multiple additional functions besides transcoding, including onboard scaling, overlay, and AI processing.

He then outlined the factors used to determine the optimal choice, detailing the four factors shown in Figure 2. Quality is the average quality as assessed using metrics like VMAF, PSNR, or subjective video quality evaluations involving A/B comparisons with viewers. Kenneth used VMAF for this comparison. VMAF has been shown to have the highest correlation with subjective scores, which makes it a good predictor of viewer quality of experience.
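If you want to reproduce this kind of measurement, here’s a minimal sketch using FFmpeg’s libvmaf filter. It assumes an FFmpeg build compiled with --enable-libvmaf and placeholder file names; the transcode and the source must share the same resolution and frame rate, so scale the transcode back up first if necessary:

ffmpeg -i transcode.mp4 -i source.mp4 -lavfi libvmaf -f null -

The first input is the processed file, the second is the reference, and the VMAF score prints to the console when the run completes.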

Figure 2. How Kenneth compared the technologies.

Low-frame quality is the lowest VMAF score on any frame in the file. This is a predictor for transient quality issues that might only impact a short segment of the file. While these might not significantly impact overall average quality, short, low-quality regions may nonetheless degrade the viewer’s quality of experience, so are worth tracking in addition to average quality.

Server capacity measures how many streams each configuration can output, which is also referred to as throughput. Dividing server cost by the number of output streams produces the cost per stream, which is the most relevant capital cost comparison. The higher the number of output streams, the lower the cost per stream and the lower the necessary capital expenditures (CAPEX) when launching the service or sourcing additional capacity.

Power consumption measures the power draw of a server during operation. Dividing this by the number of streams produced results in the power per stream, the most useful figure for comparing different technologies.

Detailing his test procedures, Kenneth noted that he tested CPU-only transcoding on a system equipped with a 32-core AMD Epyc CPU. Then he installed the NVIDIA L4 GPU (a recent release) for GPU testing and NETINT’s Quadra T1U U.2 form factor VPU for ASIC-based testing.

He evaluated two codecs, H.264 and HEVC, using a single file, the Meridian file from Netflix, which contains a mix of low and high-motion scenes and many challenging elements like bright lights, smoke and fog, and very dark regions. If you’re testing for your own deployments, Kenneth recommended testing with your own test footage.

Kenneth used FFmpeg to run all transcodes, testing CPU-only quality with the x264 and x265 codecs using the medium and very fast presets. He also used FFmpeg for the NVIDIA and NETINT testing, transcoding with each device’s native H.264 and H.265 encoders.
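As an illustration of the CPU-only leg of a test like this (Kenneth’s exact rate-control settings aren’t listed, so treat the switches and file names below as assumptions), a single 1080p data point at one of the tested bitrates might be produced like this:

ffmpeg -i meridian.mp4 -c:v libx264 -preset medium -b:v 3000k -maxrate 3000k -bufsize 6000k -an meridian_x264_3000k.mp4

Swapping -c:v libx264 for libx265, and repeating with -preset veryfast, would cover the other CPU-only data points.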

H.264 Average, Low-Frame, and Rolling Frame Quality

The first result Kenneth presented was average H.264 quality. As shown in Figure 3, Kenneth encoded the Meridian file to four output files for each technology, with encodes at 2.2 Mbps, 3.0 Mbps, 3.9 Mbps, and 4.75 Mbps. In this “rate-distortion curve” display, the left axis is VMAF quality, and the bottom axis is bitrate. In all such displays, higher results are better, and Quadra’s blue line is the best alternative at all tested bitrates, beating NVIDIA and x264 using the medium and very fast presets.

Figure 3. Quadra was tops in H.264 quality at all tested bitrates.

Kenneth next shared the low-frame scores (Figure 4), noting that while the NVIDIA L4’s score was marginally higher than the Quadra’s, the difference at the higher end was only 1%. Since no viewer would notice this differential, this indicates operational parity in this measure.

Figure 4. NVIDIA’s L4 and the Quadra achieve relative parity in H.264 low-frame testing.

The final H.264 quality finding displayed a 20-second rolling average of the VMAF score. As you can see in Figure 5, the Quadra, which is the blue line, is consistently higher than the NVIDIA L4 or x264 using the medium or very fast presets. So, even though the Quadra had a lower single-frame VMAF score than NVIDIA, over the course of the entire file, its quality was predominantly superior.

Figure 5. 20-second rolling frame quality over file duration.

HEVC Average, Low-Frame, and Rolling Frame Quality

Kenneth then related the same results for HEVC. In terms of average quality (Figure 6), NVIDIA was slightly higher than the Quadra, but the delta was insignificant. Specifically, NVIDIA’s advantage starts at 0.2% and drops to 0.04% at the higher bit rates. So, again, a difference that no viewer would notice. Both NVIDIA and Quadra produced better quality than CPU-only transcoding with x265 and the medium and very fast presets.

Figure 6. HEVC average quality at all tested bitrates.

In the low-frame measure (Figure 7), Quadra proved consistently superior, with NVIDIA significantly lower, again a predictor for transient quality issues. In this measure, Quadra also consistently outperformed x265 using medium and very fast presets, which is impressive.

Figure 7. Quadra consistently outperformed the alternatives in HEVC low-frame testing.

Finally, HEVC moving average scoring (Figure 8) again showed Quadra to be consistently better across all frames when compared to the other alternatives. You see NVIDIA’s downward spike around frame 3796, which could indicate a transient quality drop that could impact the viewer’s quality of experience.

Figure 8. 20-second rolling frame quality over file duration.

Cost Per Stream and Power Consumption Per Stream - H.264

To measure cost and power consumption per stream, Kenneth first calculated the cost for a single server for each transcoding technology and then measured throughput and power consumption for that server using each technology. Then, he compared the results, assuming that a video engineer had to source and run systems capable of transcoding 320 1080p30 streams.

You see the first step for H.264 in Figure 9. The baseline computer without add-in cards costs $7,100 but can only output fifteen 1080p30 streams using an average of the medium and very fast presets, for a cost per stream of $473. Kenneth installed two NVIDIA L4 cards in the same system, which boosted the price to $14,214 but more than tripled throughput to fifty streams, dropping the cost per stream to $285. Kenneth then installed ten Quadra T1U VPUs in the system, which increased the price to $21,000 but boosted throughput to 320 1080p30 streams, dropping the cost per stream to $65.

This analysis reveals why computing and focusing on the cost per stream is so important; though the Quadra system costs roughly three times the CPU-only system, the ASIC-fueled output is over 21 times greater, producing a much lower cost per stream. You’ll see how that impacts CAPEX for our 320-stream required output in a few slides.
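The arithmetic behind Figure 9 is easy to sanity-check from the command line; the inputs are the system prices and stream counts above, and the slide presents rounded figures:

echo "scale=2; 7100/15" | bc     # CPU-only: cost per 1080p30 stream
echo "scale=2; 14214/50" | bc    # two NVIDIA L4s
echo "scale=2; 21000/320" | bc   # ten Quadra T1U VPUs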

Figure 9. Computing system cost and cost per stream.

Figure 10 shows the power consumption per stream computation. Kenneth measured power consumption during processing and divided that by the number of output streams produced. This analysis again illustrates why normalizing power consumption on a per-stream basis is so necessary; though the CPU-only system draws the least power, making it appear to be the most efficient, on a per-stream basis, it’s almost 20x the power draw of the Quadra system.

Figure 10. Computing power per stream for H.264 transcoding.

Figure 11 summarizes CAPEX and OPEX for a 320-channel system. Note that Kenneth rounded down rather than up when computing the total number of servers for CPU-only and NVIDIA. That is, at a capacity of 15 streams for CPU-only transcoding, you would need 21.33 servers to produce 320 streams. Since you can’t buy a fractional server, you would need 22, not the 21 shown. Ditto for NVIDIA: at 50 output streams per server, the six servers shown should have been 6.4, or actually 7. So, the savings shown are understated by about 4.5% for CPU-only and 15% for NVIDIA. Even without the corrections, the CAPEX and OPEX differences are quite substantial.
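A quick integer-ceiling check in the shell confirms those corrected server counts (320 is the required stream count; the divisors are the per-server capacities from Figure 9):

echo $(( (320 + 15 - 1) / 15 ))    # CPU-only servers needed: 22
echo $(( (320 + 50 - 1) / 50 ))    # NVIDIA L4 servers needed: 7
echo $(( (320 + 320 - 1) / 320 ))  # Quadra servers needed: 1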

Figure 11. CAPEX and OPEX for 320 H.264 1080p30 streams.

Cost Per Stream and Power Consumption Per Stream - HEVC

Kenneth performed the same analysis for HEVC. All systems cost the same, but throughput of the CPU-only and NVIDIA-equipped systems both drop significantly, boosting their costs per stream. The ASIC-powered Quadra outputs the same stream count for HEVC as for H.264, producing an identical cost per stream.

Figure 12. Computing system cost and cost per stream.

The throughput drop for CPU-only and NVIDIA transcoding also boosted the power consumption per stream, while Quadra’s remained the same.

Figure 13. Computing power per stream for HEVC transcoding.

Figure 14 shows the total CAPEX and OPEX for the 320-channel system, and this time, all calculations are correct. While CPU-only systems are tenuous at best for H.264, they’re clearly economically untenable with more advanced codecs like HEVC. While the differential isn’t quite so stark with the NVIDIA products, Quadra’s superior quality and much lower CAPEX and OPEX are compelling reasons to adopt the ASIC-based solution.

Figure 14. CAPEX and OPEX for 320 1080p30 HEVC streams.

As Kenneth pointed out in his talk, even if you’re producing only H.264 today, if you’re considering HEVC in the future, it still makes sense to choose a Quadra-equipped system because you can switch over to HEVC with no extra hardware cost at any time. With a CPU-only system, you’ll have to more than double your CAPEX spending, while with NVIDIA, you’ll need to spend another 25% to meet capacity.

The Cost of Redundancy

Kenneth concluded his talk with a discussion of full hardware and geo-redundancy. He envisioned a setup where one location houses two servers (a primary and a backup) for full hardware redundancy. A similar setup would be replicated in a second location for geo-redundancy. Using the Quadra video server, four servers could provide both levels of redundancy, costing a total of $84,000. Obviously, this is much cheaper than any of the other transcoding alternatives.

NETINT’s Quadra VPU proved slightly superior in quality to the alternatives, vastly cheaper than CPU-only transcoding, and very meaningfully more affordable than GPU-based transcoders. While these conclusions may seem unsurprising (an employee of an encoding ASIC manufacturer concludes that his ASIC-based technology is best), you can check Ilya Mikhaelis’ independent analysis here and see that he reached the same result.

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

From CPU to GPU to ASIC: Mayflower’s Transcoding Journey

Ilya’s transcoding journey took Mayflower from $10 million to under $1.5 million in CAPEX while cutting power consumption by over 90%. This analytical deep-dive reveals the trials, errors, and successes of Mayflower’s quest, highlighting a remarkable reduction in both cost and power consumption.

From CPU to GPU to ASIC: The Transcoding Journey

Ilya Mikhaelis

Ilya Mikhaelis is the streaming backend tech lead for Mayflower, which builds and hosts streaming infrastructures for multiple publishers. Mayflower’s infrastructure handles more than 10,000 incoming streams and more than one million outgoing streams at a latency that averages one to two seconds.

Ilya’s challenge was to find the most cost-effective technology to transcode the incoming streams. His journey took him from CPU-based transcoding to GPU and then to two generations of ASIC-based transcoding. These transitions slashed total production transcoding costs from $10 million to just under $1.5 million while reducing power consumption by over 90%, from 325,000 watts to 33,820 watts.

Ilya’s rigorous, textbook-worthy testing methodology and findings are invaluable to any video engineer seeking the highest quality transcoding technology at the lowest capital cost and most efficient power usage. But let’s start at the beginning.

The Mayflower Internal CDN

As Ilya describes it, “Mayflower is a big company, under which different projects stand. And most of these projects are about high-load, live media streaming. Moreover some of Mayflower resources were included in the top 50 of the most visited sites worldwide. And all these streaming resources are handled by one internal CDN, which was completely designed and implemented by my team.”

Describing the requirements, Ilya added, “The typical load of this CDN is about 10,000 incoming simultaneous streams and more than one million outgoing simultaneous streams worldwide. In most cases, we target a latency of one to two seconds. We try to achieve a real-time experience for our content consumers, which is why we need a fast and effective transcoding solution.”

To build the CDN, Mayflower used bare metal servers to maximize network and resource utilization and run a high-performance profile to achieve stable stream processing and keep encoder and decoder queues around zero. As shown in Figure 1, the CDN inputs streams via WebRTC and RTMP and delivers with a mix of WebRTC, HLS, and low latency HLS. It uses customized WebRTC inside the CDN to achieve minimum latency between servers.

Figure 1. Mayflower’s Low Latency CDN.

Ilya’s team minimizes resource wastage by implementing all high-level network protocols, like WebRTC, HLS, and low latency HLS, on their own. They use libav, an FFmpeg component, as a framework for transcoding inside their transcoder servers.

The Transcoding Pipeline

In Mayflower’s transcoding pipeline (Figure 2), the system inputs a single WebRTC stream, which it converts to a five-rung encoding ladder. Mayflower uses a mixture of proprietary and libav filters to achieve a stable frame rate and stable load. The stable frame rate is essential for outgoing streams because some protocols, like low latency HLS or HLS, can’t handle variable frame rates, especially on Apple devices.
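Mayflower implements its pipeline directly on libav with proprietary filters, so the following is only a simplified command-line analogue, not their code: three rungs instead of five, a file placeholder instead of a live WebRTC input, audio omitted, and the software x264 encoder standing in for the hardware encoder. It shows the single-input, multi-rung pattern with a forced constant output frame rate:

ffmpeg -i input.mp4 -filter_complex \
  "[0:v]split=3[v1][v2][v3];[v2]scale=-2:720[v720];[v3]scale=-2:480[v480]" \
  -map "[v1]" -c:v libx264 -b:v 4500k -r 30 out_1080p.mp4 \
  -map "[v720]" -c:v libx264 -b:v 2500k -r 30 out_720p.mp4 \
  -map "[v480]" -c:v libx264 -b:v 1200k -r 30 out_480p.mp4

The -r 30 on each output forces a constant frame rate, mirroring Mayflower’s requirement for stable frame rates on HLS outputs.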

Figure 2. Mayflower’s transcoding pipeline.

CPU-Only Transcoding - Too Expensive, Too Much Power

After creating the architecture, Ilya had to find a transcoding technology as quickly as possible. Mayflower initially transcoded on a Dell R940, which currently costs around $20,000 as configured for Mayflower. When Ilya’s team first implemented software transcoding, most content creators input at 720p. After a few months, as they became more familiar with the production operation, most switched to 1080p, dramatically increasing the transcoding load.

You see the numbers in Figure 3. Each server could produce only 20 streams, which at a server cost of $20,000 meant a per stream cost of $1,000. At this capacity, scaling up to handle the 10,000 incoming streams would require 500 servers at a total cost of $10,000,000.

Total power consumption would equal 500 x 650, or 325,000 watts. The Dell R940 is a 3RU server; at an estimated monthly cost of $125 for colocation, this would add $750,000 per year. 
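The fleet-level numbers follow directly from the per-server figures (20 streams and 650 watts per $20,000 server, $125 per month for colocation), as a quick shell check shows:

echo $(( 10000 / 20 ))       # servers required: 500
echo $(( 500 * 20000 ))      # CAPEX: $10,000,000
echo $(( 500 * 650 ))        # total power draw: 325,000 watts
echo $(( 500 * 125 * 12 ))   # annual colocation: $750,000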

Figure 3. CPU-only transcoding was very costly and consumed excessive power.

These numbers caused Ilya to pause and reassess. “After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency. So, we started researching and found some articles on companies like Wowza, Xilinx, Google, Twitch, YouTube, and so on. And the first hint was GPU. And when you think GPU, you think NVIDIA, a company all streaming engineers are aware of.”

“After all these calculations, we understood that if we wanted to play big, we would need to find a cheaper transcoding solution than CPU-only with higher density per server, while maintaining low latency.”

GPUs - Better, But Still Too Expensive

Ilya initially considered three NVIDIA products: the Tesla V100, Tesla P100, and Tesla T4. The first two, he concluded, were best for machine learning, leaving the T4 as the most relevant option. Mayflower could install six T4s into each existing Dell server. At a current cost of around $2,000 for each T4, this produced a total cost of $32,000 per server.

Under capacity testing, the T4-enabled system produced 96 streams, dropping the per-stream cost to $333. This also reduced the required number of servers to 105, and the total CAPEX cost to $3,360,000.

With the T4s installed, power consumption increased to 1,070 watts for a total of 112,350 watts. At $125 per month per server, the 105 servers would cost $157,500 annually to house in a colocation facility.

Figure 4. Capacity and costs for an NVIDIA T4-based solution.

Round 1 ASICs: The NETINT T432

The NVIDIA numbers were better, but as Ilya commented, “It looked like we found a possible candidate, but we had a strong sense that we needed to further our research. We decided to continue our journey and found some articles about a company named NETINT and their ASIC-based solutions.”

Mayflower first ordered and tested the T432 video transcoder, which contains four NETINT G4 ASICs in a single PCIe card. As detailed by Ilya, “We received the T432 cards, and the results were quite exciting because we produced about 25 streams per card. Power consumption was much lower than NVIDIA, only 27 watts per card, and the cards were cheaper. The whole server produced 150 streams in full HD quality, with a power consumption of 812 watts. For the whole production, we would pay about 2 million, which is much cheaper than NVIDIA solution.”

You see all this data in Figure 5. The total number of T432-powered servers drops to 67, which reduces total power to 54,404 watts and annual colocation to $100,500.

Figure 5. Capacity and costs for the NETINT T432 solution.

While costs and power consumption kept improving, Ilya noticed that the CDN’s internal queue started increasing when processing with T432-equipped systems. Initially, Ilya thought the problem was the lack of onboard scaling on the T432, but then he noticed that “even when producing all these ABR ladders, our CPU load was about only 40% during high load hours. The bottleneck was the card’s decoding and encoding capacity, not onboard scaling.”

Finally, he pinpointed the increase in the internal queue to the fact that the T432’s decoder couldn’t maintain 4K60 fps decode for H.264 input. This was unacceptable because it increased stream latency. Ilya went searching one last time; fortunately, the solution was close at hand.

Round 2 ASICs: The NETINT Quadra T2 - The Transcoding Monster

Ilya next started testing with the NETINT Quadra T2 video processing unit, or VPU, which contains two NETINT G5 chips in a PCIe card. As with the other cards, Ilya could install six in each Dell server.

“All those disadvantages were eliminated in the new NETINT card – Quadra…We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Ilya’s team liked what they found. “All those disadvantages were eliminated in the new NETINT card – Quadra. It has a hardware scaler inside with an optimized pipeline: decoder – scaler – encoder in the same VPU. And H264 4K60 decoding is not a problem for it. We have already tested this card and have added servers with Quadra to our production. It really seems to be a transcoding monster.”

Figure 6 shows the performance and cost numbers. Equipped with the six T2 VPUs, each server could output 270 streams, reducing the number of required servers from 500 for CPU-only to a mere 38. This dropped the per stream cost to $141, less than half of the NVIDIA T4 equipped system, and cut the total CAPEX down to $1,444,000. Total power consumption dropped to 33,820 watts, and annual colocation costs for the 38 3U servers were $57,000.

Figure 6. Capacity and costs for the NETINT Quadra T2 solution.

Cost and Power Summary

Figure 7 presents a summary of costs and power consumption, and the numbers speak for themselves. In Ilya’s words, “It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Figure 7. Summary of costs and power consumption.

“It is obvious that Quadra T2 dominates by all characteristics, and according to our team experience, it is the best transcoding solution on the market today.”

Ilya also commented on the suitability of the Dell R940 system. “I want to emphasize that the DELL R940 isn’t the best server for VPU and GPU transcoders. It has a small density of PCIe slots and, as a result, a small density of VPU/GPU. Moreover, in the case of Quadra and even T432, you don’t need such powerful CPUs.”

In terms of other servers to consider, Ilya stated, “Nowadays, you may find platforms on the market with even 16 PCIe slots. In such systems, especially if you use Quadra, you don’t need powerful CPUs inside because everything is done on the VPU. But for us, it was a legacy with which we needed to live.”

Video engineers seeking the optimal transcoding solution can take a lot from Ilya’s transcoding journey: a willingness to test a range of potential solutions, a rigorous focus on cost and power consumption per stream, and extreme attention to detail. At NETINT, we’re confident that this approach will lead you to precisely the same conclusion as Ilya, that the Quadra T2 is “the best transcoding solution on the market today.”

Now ON-DEMAND: Symposium on Building Your Live Streaming Cloud

Cloud or on-premise – streaming publisher’s dilemma


Processing your media in the cloud or on-premises is one of the most critical decisions facing a streaming video service. Two recent articles provide strong opinions and insights on this decision and are worthy of review. Our take? Do the math and make your own decision.

The first article is “Why we’re leaving the cloud,” by David Heinemeier Hansson.

By way of background, Hansson is co-owner and CTO of 37signals, the developer of the project management platform Basecamp and the premium email service Hey.

After running the two platforms on AWS for a number of years, Hansson commented that “renting computers is (mostly) a bad deal for medium-sized companies like ours with stable growth. The savings promised in reduced complexity never materialized.” As an overview, he asserts that the cloud excels at two ends of the spectrum: 1) simple and low-traffic applications and 2) highly irregular loads with wild swings or towering peaks in usage.

When Hey first launched, running in AWS allowed the new service to seamlessly onboard the 300,000 users that signed up in the first three weeks, wildly exceeding the forecast of 30,000 in six months. However, since then, Hansson reported, these capacity spikes never recurred, and by “continuing to operate in the cloud, we’re paying an at times almost absurd premium for the possibility that [they] could.”

In abandoning the cloud, Hansson had to stare down two common beliefs. The first is that the cloud simplifies systems and computer management. As it relates to his own businesses, he reports that “anyone who thinks running a major service like HEY or Basecamp in the cloud is “simple” has clearly never tried. Some things are simpler, others more complex, but on the whole, I’ve yet to hear of organizations at our scale being able to materially shrink their operations team, just because they moved to the cloud.”

He also tackles perceptions regarding the complexity of running equipment on-premise. “Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.”

“Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.”

In “Media Processing in the Cloud or On-Prem—Which Is Right for You?”, Alex Emmermann, Director of Business Development for Cloud Products at Telestream, takes a more moderate view (as you would expect).

Emmermann starts by pointing out where the cloud makes sense, zeroing in on the same capacity swings as Hansson. “A typical painful example is when capacity requirements shift underneath you, such as a service becoming more popular than you had initially allocated resources for. For example, when running a media services operation, there are many situations that can stress systems... In media processing, full-catalog licenses, mergers, or content migrations can cause enormous capacity requirements for transcoding and QC.”

Emmermann also introduces the concept of hybrid operations. “For many companies, a wholesale move may feel too risky, so a hybrid approach works well by allowing excess capacity requirements to burst into the cloud as required. This allows run rate systems to continue functioning while taking immediate advantage of cloud scaling when and if required. Depending on the needs of the service, a hybrid setup could continue to run indefinitely and very cost-effectively if on-prem CapEx resources have already been spent and the resources are in place to keep them running.”

In terms of companies that should operate on premises, Emmermann cites two examples. First are companies with significant CAPEX investments in encoding gear. “For the many thousands of busy on-premises servers processing run-rate media workflows throughout the world, they’re efficiently and cheaply doing what they need to do and will no doubt continue to do so for a long time.” He also mentions that inexpensive and reliable connectivity is an absolute requirement, and “there are certain places on the planet that may not have reliable interconnectivity to a cloud provider.”

All told, Emmermann concludes, “There’s no question that any media company investing in new services or wanting to have the capacity to say yes to any customer request will want to do this with a public cloud provider… On the other hand, any steady-state, on-premises service that is happily functioning as designed and only occasionally requires a small capital refresh will be happy to stay the course.”

Our Take? Do the Math

HARD QUESTIONS ON HOT TOPICS – CLOUD OR ON PREMISES, HOW TO DO THE MATH?
Watch the full conversation on YouTube: https://youtu.be/GSQsa4oQmCA

Anyone who has ever provisioned an EC2 instance from AWS and paid the hourly rate has wondered, “how does that compare to buying your own system?” We’re certainly not immune.

Given the impetus of this article, we decided to put pencil to paper, or keyboard to spreadsheet. We recently launched the NETINT Video Transcoding Server, which costs $7,000 and includes ten T408 transcoders that can output H.264 and HEVC. In benchmarking the entry-level system, it produced 21 five-rung H.264 ladders and 27 four-rung HEVC ladders. What would it cost to produce the same number of streams in AWS?

We checked the MediaLive price list here and confirmed it with the pricing calculator estimate here (Figure 3 for HEVC). While a single hour of H.264 live streaming costs only $0.46, this adds up to $4,004.17 per year. The rate jumps to $1.527 per hour for HEVC, or $13,375.55 per year. Both figures are for a single encoding ladder.

Figure 3. Yearly cost for streaming a single five-rung HEVC encoding ladder.

To compare this to our streaming server, we multiplied the yearly cost per ladder by the number of ladders the server could produce and extended all calculations out to five years. This translates to a five-year cost of $420,441 for H.264 and a staggering $1,805,712 for HEVC.

To compute the same five-year cost for the server, we added $69/month for colocation charges to the $7,000 base price. This came to $11,140 for either format.
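For anyone who wants to check the math, the five-year totals reproduce from the yearly MediaLive figures and ladder counts above (to within a few dollars of rounding in the hourly rates), while the server-side total is exact:

echo "scale=2; 4004.17 * 21 * 5" | bc    # five-year MediaLive cost, 21 H.264 ladders
echo "scale=2; 13375.55 * 27 * 5" | bc   # five-year MediaLive cost, 27 HEVC ladders
echo $(( 7000 + 69 * 60 ))               # server price plus 60 months of colocation: $11,140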

Table 1. Five-year cost comparison, AWS MediaLive pricing compared to the NETINT server.

This comparison brought to mind Hansson’s comment that “Amazon, in particular, is printing profits renting out servers at obscene margins.” Surely, no streaming publisher is using MediaLive for 24/7/365 operations.

Taking a step back, it’s tough not to agree with the key points from both authors. The cloud does make the most sense when you need instant capacity for peak encoding. For steady-state operations, owning your own gear is always going to be cheaper.

All that said, run the numbers no matter what you’re doing in the cloud. While the results probably won’t be as startling as those shown in Table 1, you won’t know until you do the math.