NETINT Buyer’s Guide. Choosing the Right VPU & Server for Your Workflow.

This guide is designed to help you choose the optimum NETINT Video Processing Unit (VPU) for your encoding workflow.

As an overview, note that all NETINT hardware products (VPUs and transcoders) run the same basic software controlled via FFmpeg and GStreamer patches or an SDK. This includes load balancing of all encoding resources in a server. In addition, both generations are similar in terms of latency and HDR support.

Question 1. Which ASIC Architecture: Codensity G4 (Logan) or Codensity G5 (Quadra)?

Tables 1 and 2 show the similarities and differences between Codensity G4 ASIC-powered products (T408 and T432) and Codensity G5-based products (Quadra T1U, T1A, T2A). Both architectures are available in either the U.2 or AIC form factor, the latter in half-height, half-length (HHHL) configurations.

From a codec perspective, the main difference is that G5-based products support AV1 encoding and VP9 decoding. In terms of throughput, G5-based products deliver roughly four times the output of G4-based products but cost roughly three times more, so the cost per output stream is similar while stream density per host server is much greater. G5 power consumption is roughly 3x higher per ASIC than G4, but with 4x the throughput, power consumption per stream is lower.
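To make that tradeoff concrete, here is a quick normalized calculation using only the ratios just cited (roughly 4x throughput, 3x cost, and 3x power per ASIC); the values are relative to a G4 baseline, not price-list or datasheet numbers.

```python
# Illustrative arithmetic only: relative cost and power per stream for G4 vs. G5,
# normalized to a single G4 ASIC using the ratios cited above.
g4 = {"streams": 1.0, "cost": 1.0, "watts": 1.0}   # normalized G4 baseline
g5 = {"streams": 4.0, "cost": 3.0, "watts": 3.0}   # ~4x throughput, ~3x cost, ~3x power

for name, asic in (("G4", g4), ("G5", g5)):
    print(f"{name}: cost/stream = {asic['cost'] / asic['streams']:.2f}, "
          f"watts/stream = {asic['watts'] / asic['streams']:.2f}")

# G4 -> 1.00 / 1.00, G5 -> 0.75 / 0.75: similar cost per stream, somewhat lower
# power per stream, and four times the stream density per device.
```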

Table 1. Codec support, throughput, and power consumption.

Table 2 covers other hardware features. From an encoding perspective, G5-based products let you tune quality and throughput to match your application, while quality and throughput are fixed for G4-based products. The G5’s quality ceiling is higher than the G4’s, at the cost of throughput, and its quality floor is lower, with the benefit of higher throughput.

G5-based products are much more capable in hardware, performing scaling, overlay, and audio compression on board, and they offer AI processing: 15 TOPS for the T1U, 18 TOPS for the T1A, and 36 TOPS for the T2A. In contrast, G4-based products scale, overlay, and encode audio via the host CPU and offer no AI processing. You can read about Quadra’s AI capability here.

Peer-to-peer DMA, available only on G5-based products, allows the VPU to communicate directly with certain GPUs, which is particularly useful in cloud gaming. Learn about peer-to-peer DMA here.

Note that G4 and G5-based devices can co-exist on the same server, so you can add G5 devices to a server with G4 devices already installed and vice versa.

Table 2. Advanced hardware functionality.

Observations:

  • Codensity G4 and G5-based VPUs offer similar cost-per-stream, with Quadra slightly more efficient on a watts-per-stream basis. Both products transcode to H.264 and HEVC formats (G5 encodes to AV1 and decodes VP9).

  • Choose G4-based products for:
    • The absolute lowest overall cost
    • Compatibility with existing G4-based encoding stacks
    • Interactive same resolution-in/out productions (minimal scaling and overlay)

  • Choose G5-based products for:
    • AV1 output
    • AI integration
    • Applications that need quality and throughput tuning
    • Applications that involve scaling and overlay
    • Maximum throughput from a single server
    • Cloud gaming

Question 2: Which G4-based Product?

This section discusses your G4-based options shown in Figure 1, with the U.2-based T408 in the background and AIC-form factor T432 in the foreground. These products are designated as Transcoders since this is their primary hardware-based function.

Figure 1. The NETINT T408 in the back, T432 in the front.

Table 3 identifies the key differences between NETINT’s two G4-based VPUs: the T408, which includes a single G4 ASIC in a U.2 form factor, and the T432, which includes four G4 ASICs in an AIC half-height, half-length configuration.

Table 3. NETINT’s two G4-based products.

Observations:

  • The U.2-based T408 offers the best available density for installing units into a 1RU server.
  • The AIC-based T432 is the best option for computers without U.2 connections and for maximum server chassis density.

Question 3: Which G5-based Product?

Figure 2 identifies the three Quadra G5-based products, with the U.2-based T1U in the back, the AIC-based T1A in the middle, and the AIC-based T2A in the front. These products are designated Video Processing Units, or VPUs, because their hardware functionality extends far beyond simple transcoding.

Figure 2. The Quadra T1U in the back, T1A in the middle, and T2A in front.

Table 4 identifies the key differences between NETINT’s three G5-based VPUs:

  • The T1U includes a single G5 ASIC in a U.2 form factor.
  • The T1A includes a single G5 ASIC in an AIC half-height half-length configuration.
  • The T2A includes two G5 ASICs in an AIC half-height half-length configuration.
Table 4. NETINT’s three G5-based VPUs.

Observations:

  • The U.2-based Quadra T1U offers the best density for installing in a 1RU server.
  • The Quadra T2A offers the best density for AIC-based installation and is ideal for cloud gaming servers that need peer-to-peer DMA communication with GPUs.
  • The AIC-based Quadra T1A is the most affordable AIC option for installs that don’t need maximum density.

Question 4: VPU or Server?

NETINT offers two video servers that use the same Supermicro 1114S-WN10RT server chassis; the Logan Video Server contains ten T408 U.2 VPUs, while the Quadra Video Server contains ten Quadra T1U VPUs. Servers offer a turnkey option for fast and simple deployment.

An advantage of buying a NETINT Video Server is that all components, including CPU, RAM, hard drive, OS, and software versions, have been extensively tested for compatibility, stability, and performance, making the servers the easiest and fastest way to transition from software to hardware encoding.

As for the choice between servers, your answer to question 1 should guide your selection.

If you have any questions about any products, please contact us here.


Video Transcoder vs. Video Processing Unit (VPU)

When choosing a product for live stream processing, half the battle is knowing what to search for. Do you want a live transcoder, a video processing unit (VPU), a video coding unit (VCU), a scalable video processor (SVP), or something else? If you’re not quite sure what these terms mean and how they relate, this short article will educate you in four minutes or less.

In the Beginning, There Were Transcoders

Simply stated, a transcoder is any technology, software or hardware, that can input a compressed stream (decode) and output a compressed stream (encode). FFmpeg is a transcoder, and for video-on-demand, it works fine in most low-volume applications.

For live applications, particularly high-volume live interactive applications (think Twitch), you’ll probably need a hardware transcoder to achieve the necessary cost per stream (CAPEX), operating cost per stream, and density.

For example, the NETINT Video Transcoding Server, a single 1RU server with ten NETINT T408 Video Transcoders, can deliver up to 80 H.264/HEVC 1080p30 streams while drawing under 250 watts. Performed in software using only the CPU, this same output could take up to ten separate 1RU servers, each drawing well over 250 watts.
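For a rough sense of what that consolidation means in power terms, the back-of-the-envelope calculation below uses only the round numbers from the paragraph above (80 streams, roughly 250 watts per server, ten software servers for equivalent output); treat the result as an order-of-magnitude comparison, not a measurement.

```python
# Back-of-the-envelope power-per-stream comparison using the round numbers above.
streams = 80                   # 1080p30 outputs from one NETINT Video Transcoding Server
asic_server_watts = 250        # "under 250 watts" for the 1RU server with ten T408s
sw_servers = 10                # software-only 1RU servers for the same output
sw_server_watts = 250          # "well over 250 watts" each; treated as exactly 250 here

asic_w_per_stream = asic_server_watts / streams
sw_w_per_stream = (sw_servers * sw_server_watts) / streams

print(f"ASIC server:   {asic_w_per_stream:.2f} W per stream")
print(f"Software only: {sw_w_per_stream:.2f} W per stream")
print(f"Approximate power ratio: {sw_w_per_stream / asic_w_per_stream:.0f}x")
```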

The NETINT T408 Video Transcoder.

Speaking of the T408, if Webster’s defined a transcoder (it doesn’t), it might have a picture of the T408 as the perfect example. Based on a custom transcoding ASIC, the T408 is inexpensive ($400), capable (4K @ 60 FPS or 4x 1080p60 streams), flexible (H.264 and HEVC), and exceptionally efficient (only 7 watts).

What doesn’t the T408 do? Well, that leads us to the difference between a transcoder and a VPU.

The difference between a transcoder and a Video Processing Unit (VPU)

First, the T408 doesn’t scale video. If you’re building a full encoding ladder from a high-resolution source, all the scaling for the lower rungs is performed by the host CPU. In addition, the T408 doesn’t perform overlay in hardware. So, if you insert a logo or other bug over your videos, again, the CPU does the heavy lifting.

Finally, the T408 was launched in 2019, the first ASIC-based transcoder to ship in quite a long time. So, it’s not surprising that it doesn’t incorporate any artificial intelligence processing capabilities.

What is a Video Processing Unit (VPU)?

What’s a Video Processing Unit? A hardware device that does all that extra stuff: scaling, overlay, and AI. You can see this in the transcoding pipeline shown below, which is for the NETINT Quadra.

When it came to labeling the Quadra, you can see the problem: it does much more than a video transcoder. Not only does it outperform the T408 by a factor of four, it adds AV1 output and all the additional hardware functionality. It’s much more than a simple video transcoder; it’s a video processing unit (VPU).

As much as we’d like to lay claim to the acronym, it actually existed before we applied it to the Quadra. That’s not surprising; it follows the terminology for CPU (central processing unit) and GPU (graphics processing unit). And if Webster’s defined VPU (it doesn’t)... oh, you get the point. Here’s the required Quadra glamour shot.

The NETINT Quadra Video Processing Unit.

VCUs and the MSVP

While NETINT was busy developing ASIC-based transcoders and VPUs for the mass market, large video publishers like YouTube and Meta produced their own ASICs to achieve similar benefits (and produce more acronyms). In 2021, when Google shipped their own ASIC-based transcoder called Argos, they labeled it a Video Coding Unit, or VCU.

Like the T408 and Quadra, the benefits of this ASIC-based technology are profound; as reported by CNET, “Argos handles video 20 to 33 times more efficiently than conventional servers when you factor in the cost to design and build the chip, employ it in Google’s data centers, and pay YouTube’s colossal electricity and network usage bills.” Interestingly, despite YouTube’s heavy usage of the AV1 codec, Argos encodes only H.264 and VP9, not AV1.

In May 2023, Meta released their own ASIC, which, like Argos, outputs H.264 and VP9, but not AV1. Called the Meta Scalable Video Processor (MSVP), the unit delivered impressive results, including “a throughput gain of ~9x for H.264 when compared against libx264 SW encoding…[and] a throughput gain of ~50x when compared with libVPX speed 2 preset.” Meta also noted that the unit drew only 10 watts of power, which is skimpy but also about 43% higher than the T408.

Of course, neither Google nor Meta sells their ASICs to third parties, so if you want the CAPEX and OPEX efficiencies that ASIC-based VPUs deliver, you’ll have to buy from NETINT. The bottom line is that whether you call it a transcoder, VPU, VCU, or MSVP, you’ll get the highest throughput and lowest power consumption if it’s powered by an ASIC.

HARD QUESTIONS ON HOT TOPICS:
ASIC-based Video Transcoder versus Video Processing Unit (VPU)
Watch the full conversation on YouTube: https://youtu.be/iO7ApppgJAg

NETINT Video Transcoding Server – ASIC technology at its best


Many high-volume streaming platforms and services still deploy software-only transcoding, but high energy prices for private data centers and escalating public cloud costs make the OPEX, carbon footprint, and limited scalability of that approach unsustainable. Engineers looking for solutions to this challenge are actively exploring hardware that can integrate with their existing workflows and deliver the quality and flexibility of software with the performance and operational cost efficiency of purpose-built hardware.

If this sounds like you, the USD $8,900 NETINT Video Transcoding Server could be the ideal solution. The server combines the Supermicro 1114S-WN10RT AMD EPYC 7543P-powered 1RU server with ten NETINT T408 video transcoders that draw just 7 watts each. The server encodes HEVC and H.264 at normal or low latency, and you can control transcoding operations via FFmpeg, GStreamer, or a low-level API. This makes the server a drop-in replacement for a traditional x264 or x265 FFmpeg-based or GPU-powered encoding stack.


Due to the performance advantage of ASICs compared to software running on x86 CPUs, the server can perform the equivalent work of roughly 10 separate machines running a typical open-source FFmpeg and x264 or x265 configuration. Specifically,  the server can simultaneously transcode twenty 4Kp30 streams, and up to 80 1080p30 live streams. In ABR mode, the server transcodes up to 30 five-rung H.264 encoding ladders from 1080p to 360p resolution, and up to 28 four-rung HEVC encoding ladders. For engineers delivering UHD, the server can output seven 6-rung HEVC encoding ladders from 4K to 360p resolution, all while drawing less than 325 watts of total power.

This review begins with a technical description of the server and transcoding hardware and the options available to drive the encoders, including the resource manager that distributes jobs among the ten transcoders. Then we’ll review performance results for one-to-one streaming and then H.264 and HEVC ladder generation, and finish with a look at the server’s ultra-efficient power consumption.


Hardware Specs

Built on the Supermicro 1114S-WN10RT 1RU server platform, the NETINT Video Transcoding Server features ten NETINT Codensity ASIC-powered T408 video transcoders and runs Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM, a 400GB M.2 SSD, three PCIe slots, and ten NVMe bays that house the ten U.2 T408 video transcoders.

You can buy the server with any of three AMD EPYC processors with 8 to 64 cores. We performed the tests for this review on the 32-core AMD EPYC 7543P CPU, which doubles to 64 threads with multithreading. The server configured with the AMD EPYC 7713P processor (64 cores/128 threads) sells for USD $11,500, and the economical server based on the AMD EPYC 7232P processor (8 cores/16 threads) lists for USD $7,000.

Regarding the server hardware, Supermicro is a leading server and storage vendor that designs, develops, and manufactures primarily in the United States. Supermicro adheres to high-quality standards, with a quality management system certified to the ISO 9001:2015 and ISO 13485:2016 standards and an environmental management system certified to the ISO 14001:2015 standard. Supermicro is also a leader in green computing and reducing data center footprints (see the white paper Green Computing: Top Ten Best Practices for a Green Data Center). As you’ll see below, this focus has resulted in an extremely power-efficient machine when operated with NETINT video transcoders.

Let’s explore the system

With this as background, let’s explore the system. Once up and running in Ubuntu, you can check T408 status via the ni_rsrc_mon_logan command, which reveals the number of T408s installed and their status. Looking at Figure 1, the top table shows the decoder performance of the installed T408s, while the bottom table shows the encoding performance.

Figure 1. Tracking the operation of the T408s, decode on top, encode on the bottom.

About the T408

T408s have been in service since 2019 and are being used extensively in hyper-scale platforms and cloud gaming applications. To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.

The NETINT T408 is powered by Codensity G4 ASIC technology and is available in both PCIe and U.2 form factors. The T408s installed in the server are the U.2 form factor, plugged into ten NVMe bays. The T408 supports closed caption passthrough and EIA/CEA-708 encode/decode, along with support for High Dynamic Range in HDR10 and HDR10+ formats.

“To date, more than 200 billion viewer minutes of live video have been encoded using the T408. This makes it one of the bestselling ASIC-based encoders on the market.” 

ALEX LIU, Co-Founder,
COO at NETINT Technologies Inc.

The T408 decodes and encodes H.264 and HEVC on board but performs all scaling and overlay operations via the host CPU. For one-to-one, same-resolution transcoding, users can select an option called YUV Bypass that sends the video decoded by the T408 directly to the T408 encoder. This eliminates high-bandwidth trips through the bus to and from system memory, reducing the load on the bus and CPU. As you’ll see, in pure 1:1 transcode applications without overlay, CPU utilization is very low, so the T408 and server are very efficient for cloud gaming and other same-resolution, low-latency interactive applications.

Figure 2. The T408 is powered by the Codensity G4 ASIC.

Testing Overview

We tested the server with FFmpeg and GStreamer. As you’ll see, in most operations, performance was similar. In some simple transcoding applications, FFmpeg pulled ahead, while in more complex encoding ladder productions, particularly 4K encoding, GStreamer proved more performant, particularly for low-latency output.

Figure 3. The software architecture for controlling the server.  

Operationally, both GStreamer and FFmpeg communicate with a libavcodec layer that functions between the T408 NVMe interface and the FFmpeg software layer. This allows existing FFmpeg and GStreamer-based transcoding applications to control server operation with minimal changes.

To allocate jobs to the ten T408s, the T408 device driver software includes a resource management module that tracks T408 capacity and usage load to present inventory and status on available resources and enable resource distribution. There are several modes of operation, including auto, which automatically distributes the work among the available resources.

Alternatively, you can manually assign decoding and encoding tasks to different T408 devices in the command line or application and control which streams are decoded by the host CPU or a T408. With these and similar controls, you can efficiently balance the overall transcoding load between the T408s and host CPU to maximize throughput. We used auto distribution for all tests.

Testing Procedures

We tested using Server version 1.0, running FFmpeg v4.3.1, GStreamer v1.18, and T408 release 3.2.0. We tested with two use cases in mind. The first is single stream in, single stream out, either at the same resolution as the incoming stream or output at a lower resolution. This mode of operation is used in many interactive applications like cloud gaming, real-time gaming, and auctions where the absolute lowest latency is required. We also tested scaling performance since many interactive applications scale the input to a lower resolution.

The second use case is ABR, where a single input stream is transcoded to a full encoding ladder. In both modes, we tested normal and low-latency performance. To simulate live streaming and minimize file I/O as a drag on system performance, we retrieved the source file from a RAM drive on the server and delivered the encoded file to RAM.

HARD QUESTIONS ON HOT TOPICS
All you need to know about NETINT Transcoding Server powered by ASICs
Watch the full conversation on YouTube: https://youtu.be/6j-dbPbmejw

One-to-One Performance

Table 1 shows transcoding results for 4K, 1080p, and 720p in latency tolerant and low-delay modes. Instances is the number of full frame rate outputs produced by the system, with CPU utilization shown for reference. These results are most relevant for cloud gaming and similar applications that input a single stream, transcode the stream at full resolution, and distribute it.

As you can see, 4K results peak at 20 streams for all codecs, though results differ by the software program used to generate the streams. The number of 1080p outputs ranges from 70 to 80, while 720p streams range from 140 to 170. As you would expect, CPU utilization is extremely low for all test cases because the T408s shoulder the complete decoding/encoding load. This means that performance is limited by T408 throughput, not the CPU, and that the 64-core CPU probably wouldn’t produce any extra streams in this use case. For pure encoding operations, the 8-core server would likely suffice, though given the minimal price differential between the 8-core and 32-core systems, opting for the higher-end model is a prudent investment.

Latency

As for latency, in normal mode, latency averaged around 45 ms for 4K transcoding and 34 ms for 1080p and 720p transcoding. In low delay mode, this dropped to around 24 ms for 4K, 7 ms for 1080p, and 3 ms for 720p, all at 30 fps and measured with FFmpeg. For reference, at 30 fps, each frame is displayed for 33.33 ms. Even in latency-tolerant mode, 4K latency is only about 1.35 frames, and 1080p and 720p are under a single frame. In low delay mode, all resolutions are under a single frame of latency.
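To translate those millisecond averages into frame counts, the quick conversion below divides each value by the 33.33 ms frame interval at 30 fps; the inputs are the rounded averages quoted above, so the frame figures are approximate.

```python
# Convert the reported average latencies into frames at 30 fps (1 frame ≈ 33.33 ms).
FRAME_MS = 1000 / 30

latencies_ms = {
    "4K normal": 45, "1080p normal": 34, "720p normal": 34,
    "4K low delay": 24, "1080p low delay": 7, "720p low delay": 3,
}

for label, ms in latencies_ms.items():
    print(f"{label:16}: {ms:3d} ms = {ms / FRAME_MS:.2f} frames")
```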

It’s worth noting that while software performance would drop significantly from H.264 to HEVC, hardware performance does not. Thus questions of codec performance for more advanced standards like HEVC do not apply when using ASICs. This is good news for engineers adopting HEVC, and those considering HEVC in the future. It means you can buy the server, comfortable in the knowledge that it will perform equally well (if not better) for HEVC encoding or transcoding.

Table 1. Full resolution transcodes with FFmpeg and GStreamer in regular and low delay modes.

Table 2 shows the performance when scaling from 4K to 1080p and from 1080p to 720p, again by the different codecs in and out. Since scaling is performed by the host CPU, CPU usage increases significantly, particularly on the higher volume 1080p to 720p output. Still, given that CPU utilization never exceeds 35%, it appears that the gating factor to system performance is T408 throughput. Again, while the 8-core system might be able to produce similar output, if your application involves scaling, the 32-core system is probably the better choice.

In these tests, latency was slightly higher than pure transcoding. In normal mode, 4K > 1080p latencies topped out at 46 ms and dropped to 39 ms for 1080p > 720p scaling, just over a single frame of latency. In low latency mode, these results dropped to 10 ms for 4K > 1080p and 10 ms for 1080p > 720p. As before, these latency results are for 30fps and were measured with FFmpeg.

Table 2: Performance while scaling from 4K to 1080p and 1080p to 720p.

The final set of tests involves transcoding to the AVC and HEVC encoding ladders shown in Table 3. These results will be most relevant to engineers distributing full encoding ladders in HLS, DASH, or CMAF containers.

Here we see the most interesting discrepancies between FFmpeg and GStreamer, particularly in low delay modes and in 4K results. In the 1080p AVC tests, FFmpeg produced 30 5-rung encoding ladders in normal mode but dropped to nine in low-delay mode. GStreamer produced 30 encoding ladders in both modes using substantially lower CPU resources. You see the same pattern in the 1080p four-rung HEVC output where GStreamer produced more ladders than FFmpeg using lower CPU resources in both modes.

Table 3. Full encoding ladders output in the listed modes.

FFmpeg produced very poor results in 4K testing, particularly in low latency mode, and it was these results that drove the testing with GStreamer. As you can see, GStreamer produced more streams in both modes and CPU utilization again remained very low. As with the previous results, the low CPU utilization means that the results reflect the encoding limits of the T408. For this reason, it’s unlikely that the higher end server would produce more encoding ladders.

In terms of latency, in normal mode, latency was 59 ms for the H.264 ladder, 72 ms for the 4 rung 1080p HEVC ladder, and 52 ms for the 4K HEVC ladder. These numbers dropped to 5 ms, 7 ms, and 9 ms for the respective configurations in low latency mode.

Power Consumption

Power consumption is an obvious concern for all video engineers and operations teams. To assess system power consumption, we tested using the IPMI tool. When running completely idle, the system consumed 154 watts, while at maximum CPU utilization, the unit averaged 400 watts with a peak of 425 watts.

We measured consumption during the three basic operations tested, pure transcoding, transcoding with scaling, and ladder creation, in each case testing the GStreamer scenario that produced the highest recorded CPU usage. You see the results in Table 4.

When you consider that CPU-only transcoding would yield a fraction of the outputs shown while consuming 25-30% more power, you can see that the T408 is exceptionally efficient when it comes to power consumption. The Watts/Output figure provides a useful comparison for other competitive systems, whether CPU or GPU-based.

Table 4. Power consumption during the specified operation.

Conclusion

With impressive density, low power consumption, and multiple integration options, the NETINT Video Transcoding Server is the new standard to beat for live streaming applications. With a lower price model available for pure encoding operations, and a more powerful model for CPU-intensive operations, the NETINT server family meets a broad range of requirements.

ASICs – The Time is Now

A brief review of the history of encoding ASICs reveals why they have become the technology of choice for high-volume video streaming services and cloud-gaming platforms.

As in all markets, there will be new entrants that make loud announcements for maximum PR effect, promising delivery at some point in the future. But, to date, outside of Google’s internal YouTube ASIC project called Argos and the recent Meta (Facebook) ASIC, also for internal use only, NETINT is the only commercial company building ASIC-based transcoders for immediate delivery.

“ASICs are the future of high-volume video transcoding as NETINT, Google, and Meta have proven. NETINT is the only vendor that offers its product for sale and immediate delivery making the T408 and Quadra safe bets.”

Delaying a critical technology decision always carries risk. The risk is that you miss an opportunity or that your competitors move ahead of you. However, waiting to consider an announced and not yet shipping product means that you ALSO assume the manufacturing, technology, and supply chain risk of THAT product.

What if you delay only to find out that the announced delivery date was optimistic at best? Or, what if the vendor actually delivers, only for you to find out that their performance claims were not real? There are so many “what ifs” when you wait that delaying is rarely the right decision when a viable product is already available.

Now let’s review the rebirth of ASICs for video encoding and see how they’ve become the technology of choice for high-volume transcoding operations.  

The Rebirth of ASICs for Video Encoding

An ASIC is an application specific integrated circuit that is designed to do a small number of tasks with high efficiency. ASICs are purpose-built for a specific function. The history of video encoding ASICs can be traced back to the initial applications of digital video and the adoption of the MPEG-2 standard for satellite and cable transmission.

Most production MPEG-2 encoders were ASIC-based.

As is the case for most new codec standards, the first implementations of MPEG-2 compression were CPU-based. But given the cost of running compression on commodity servers and software, dedicated hardware is necessary to handle the processing requirements of high-quality video encoding cost-effectively.

This led to the development and application of video encoding ASICs, which are specialized integrated circuits designed to perform the processing tasks required for video encoding. Encoding ASICs provide the necessary processing power to handle the demands of high-quality video encoding while being more cost-effective than CPU-based solutions.

With the advent of the internet, the demand for digital video continued to increase. The rise of on-demand and streaming video services, such as YouTube and Netflix, led to a shift towards CPU-based encoding solutions. This was due in part to the fact that streaming video required a more flexible approach to encoding including implementation agility with the cloud and an ability to adjust encoding parameters based on the available bandwidth and device capabilities.

As the demand for live streaming services increased, the limitations of CPU-based encoding solutions became apparent. Live streaming services, such as cloud gaming and real-time interactive video like gaming or conferencing, require the processing of millions of live interactive streams simultaneously at scale. This has led to a resurgence in the use of encoding ASICs for live-streaming applications. Thus, the rebirth of ASICs is upon us and it’s a technology trend that should not be ignored even if you are working in a more traditional entertainment streaming environment.

NETINT: Leading the Resurgence

NETINT has been at the forefront of the ASIC resurgence. In 2019, the company introduced its Codensity T408 ASIC-based transcoder. This device was designed to handle 8 simultaneous HEVC or H.264 1080p video streams, making it ideal for live-streaming applications.

The T408 was well-received by the market, and NETINT continued to innovate. In 2021, the company introduced its Quadra series. These devices can handle up to 32 simultaneous 1080p video streams, making them even more powerful than the T408, while also adding the much-anticipated AV1 codec.

“NETINT has racked up a number of major wins including major names such as ByteDance, Baidu, Tencent, Alibaba, Kuaishou, and a US-based global entertainment service.”

As described by Dylan Patel, editor of the Semianalysis blog, in his article Meet NETINT: The Startup Selling Datacenter VPUs To ByteDance, Baidu, Tencent, Alibaba, And More, “NETINT has racked up a number of major wins including major names such as ByteDance, Baidu, Tencent, Alibaba, Kuaishou, and a similar sized US-based global platform.”

The NETINT Quadra T1U Video Processing Unit – NETINT’s second generation of shipping ASIC-based transcoders.

Patel also reported that using the HEVC codec, NETINT video transcoders and VPUs crushed Nvidia’s T4 GPU, which is widely assumed to be the default choice when moving to a hardware encoder for the data center. The density and power consumption that can be achieved with a video ASIC is unmatched compared to CPUs and GPUs.

Patel commented further, “The comparison using AV1 is even more powerful… NETINT is the leader in merchant video encoding ASICs.”

“The comparison using AV1 is even more powerful…NETINT is the leader in video encoding ASICs.”

-Dylan Patel

ASIC Advantages

ASICs are designed to perform a specific task, such as encoding video, with a high degree of efficiency and speed. CPUs and GPUs are designed to perform a wide range of general-purpose computing tasks. As evidence, today the primary application for GPUs has nothing to do with video encoding. In fact, just 5-10% of the silicon real estate on some of the most popular GPUs on the market is dedicated to video encoding or processing. Highly compute-intensive tasks like AI inferencing are the most common workload for GPUs today.

The key advantage of ASICs for video encoding is that they are optimized for this specific task, with a much higher percentage of gates on the chip dedicated to encoding than CPUs and GPUs. ASICs can encode much faster and with higher quality than CPUs and GPUs, while using less power and generating less heat.

“ASICs can encode much faster and with higher quality than CPUs and GPUs while using less power and generating less heat.”

-Dylan Patel

Additionally, because ASICs are designed for a specific task, they can be more easily customized and optimized for specific use cases. Though some assume that ASICs are inflexible, in reality, with a properly designed ASIC, the function it’s designed for can be tuned far more precisely than if it ran on a general-purpose computing platform. This can lead to even greater efficiency gains and improved performance.

The key takeaway is that ASICs are a superior choice for video encoding due to their application-specific design, which allows for faster and more efficient processing compared to general-purpose CPUs and GPUs.

Confirmation from Google and Meta

Recent industry announcements from Google and Meta confirm these conclusions. When Google announced the ASIC-based Argos VCU (Video Coding Unit) in 2021, the trade press rightfully applauded. CNET announced that “Google supercharges YouTube with a custom video chip.” Ars Technica reported that Argos brought “up to 20-33x improvements in compute efficiency compared to… software on traditional servers.” SemiAnalysis reported that Argos “Replaces 10 Million Intel CPUs.”

Google’s Argos confirms the value of encoding ASICs
(and shipped 2 years after the NETINT T408).

As described in the article “Argos dispels common myths about encoding ASICs” (bit.ly/ASIC_myths), Google’s experience highlights the benefits of ASIC-based transcoders. That is, while many streaming engineers still rely on software-based transcoding, ASIC-based transcoding offers a clear advantage in terms of CAPEX, OPEX, and environmental sustainability benefits. The article goes on to address outdated concerns about the shortcomings of ASICs, including sub-par quality and the lack of upgradeability.

The article discusses several key findings from Google’s presentation on the Argos ASIC-based transcoder at Hot Chips 33, including:

  • Encoding time has grown by 8000% due to increased complexity from higher resolutions and frame rates. ASIC-based transcoding is necessary to keep video services running smoothly.
  • ASICs can deliver near-parity to software-based transcoding quality with properly designed hardware.
  • ASIC quality and functionality can be improved and changed long after deployment.
  • ASICs deliver unparalleled throughput and power efficiency, with Google reporting a 90% reduction in power consumption.

Though much less is known about the Meta ASIC, its announcement prompted Facebook’s Director of Video Encoding, David Ronca, to proclaim, “I propose that there are two types of companies in the video business. Those that are using Video Processing ASICs in their workflows, and those that will.”

“…there are two types of companies in the video business. Those that are using Video Processing ASICs in their workflows, and those that will.”

Meta proudly announces its encoding ASIC
(3 years after NETINT’s T408 ships).

Unlike the ASICs from Google and Meta, you can actually buy ASIC-based transcoders from NETINT, and in fact tens of thousands of units are operating in some of the largest hyperscaler networks and video streaming platforms today. The fact that two of the biggest names in the tech industry are investing in ASICs for video encoding is a clear indication of the growing trend toward application-specific hardware in the video field. With the increasing demand for high-quality video streaming across a variety of devices and platforms, ASICs provide the speed, efficiency, and customization needed to meet these needs.

Avoiding Shiny New Object Syndrome

The rise of ASICs as the best method for transcoding high volumes of live video has not gone unnoticed, so you should expect product announcements pointing to “availability later this year.” When these occur around prominent trade shows, it can indicate a rushed announcement timed for the show, and that “later” may actually mean much later.

It’s useful to remember that while waiting for a new product from a third-party supplier to become available, companies face three distinct risks: manufacturing, technology, and supply chain.

Manufacturing Risk:

One of the biggest risks associated with waiting for a new product is manufacturing risk: there is always a chance that the manufacturing process encounters unexpected problems, causing delays and increasing costs. For example, Intel faced manufacturing issues with its 10nm process, which resulted in delays for its upcoming processors. As a result, Intel lost market share to competitors such as AMD and NVIDIA, who were able to release their products earlier.

Technology Risk:

Another risk associated with waiting for a new product is technology risk, or that the product may not conform to the expected specifications, leading to performance issues, security concerns, or other problems. For example, NVIDIA’s RTX 2080 Ti graphics card was highly anticipated, but upon release, many users reported issues with its performance, including crashes, artifacting, and overheating. This led to a delay in the release of the RTX 3080, as NVIDIA had to address these issues before releasing the new product. Similarly, AMD’s Radeon RX7900 XTX graphics card has been plagued with claims of overheating. 

Supply Chain Risk:

The third risk associated with waiting for a new product is supply chain risk. This means that the company may be unable to get the product manufactured and shipped on time due to issues in the supply chain. For example, AMD faced supply chain issues with its Radeon RX 6800 XT graphics card, leading to limited availability and higher prices.

The reality is that any company building and launching a cloud gaming or streaming service is assuming its own technology and market risks. Compounding that risk by waiting for a product that “might” deliver minor gains in quality or performance (but equally might not) is a highly questionable decision, particularly in a market where even minor delays in launch dates can tank a new service before it’s even off the ground.

Clearly, ASICs are the future of high-volume video transcoding; NETINT, Google, and Meta have all proven this. NETINT is the only vendor of the three that actually offers its product for sale and immediate delivery; in fast-moving markets like interactive streaming and cloud gaming, this makes NETINT’s shipping transcoders, the T408 and Quadra, the safest bets of all.

All You Need to Know About the NETINT Product Line


This article will introduce you to the NETINT product line and Codensity ASIC generations. We will focus primarily on the hardware differences, since all products share a common software architecture and feature set, which are briefly described at the end of the article.


Codensity G4-Powered Video Transcoder Products

The Codensity G4 was the first encoding ASIC developed by NETINT. There are two G4-based transcoders: the T408 (Figure 1), which is available in a U.2 form factor and as an add-in card, and the T432 (Figure 2), which is available as an add-in card. The T408 contains a single G4 ASIC and draws 7 watts under full load, while the T432 contains four G4 ASICs and draws 27 watts.

The T408 costs $400 in low volumes, while the T432 costs $1,500. The T432 delivers 4x the raw performance of the T408.

Figure 1. The NETINT T408 is powered by a single Codensity G4 ASIC.

T408 and T432 decode and encode H.264 and HEVC on the device but perform all scaling, overlay, and deinterlacing on the host CPU.

If you’re buying your own host, the selected CPU should reflect the extent of processing that it needs to perform and the overhead requirements of the media processing framework that is running the transcode function. 

When transcoding inputs without scaling, as in a cloud gaming or conferencing application, a modest CPU can suffice. If you are creating standard encoding ladders, deinterlacing multiple streams, or frequently scaling incoming videos, you’ll need a more capable CPU. For a turn-key solution, check out the NETINT Logan Video Server options.

Figure 2. The NETINT T432 includes four Codensity G4 ASICs.

The T408 and T432 run on multiple versions of Ubuntu and CentOS; see here for more detail about those versions and recommendations for configuring your server.

The NETINT Logan Video Server

The NETINT Video Transcoding Server includes ten T408 U.2 transcoders. It is targeted at high-volume transcoding applications as an affordable turnkey replacement for existing hardware transcoders, or wherever a drop-in alternative to a software-based transcoder is preferred.

The lowest-priced model costs $7,000 and is built on the Supermicro 1114S-WN10RT server platform powered by an AMD EPYC 7232P processor with eight cores and 16 threads, running Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM, a 400GB M.2 SSD, three PCIe slots, and ten NVMe slots that house the ten T408 transcoders. At full transcoding capacity, the server draws 220 watts while encoding or transcoding up to ten 4Kp60 streams or as many as 160 720p60 video streams.

The server is also offered with two more powerful CPUs, the AMD EPYC 7543P Server Processor (32-cores/64-threads, $8,900) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, $11,500). Other than the CPU, the hardware specifications are identical.

Figure 3. The NETINT Video Transcoding Server.

All Codensity G4-based products support HDR10 and HDR10+ for H.264 and H.265 encode and decode, as well as EIA CEA-708 closed captions for H.264 and H.265 encode and decode. In low-latency mode, all products support sub-frame latency. Other features include region-of-interest encoding, a customizable GOP structure with eight presets, and forced IDR frame inserts at any location.

The T408, T432, and NETINT Server are targeted toward high-volume interactive applications that require inexpensive, low-power, and high-density transcoding using the H.264 and HEVC codecs.

Codensity G5-Powered Live Transcoder Products

In addition to roughly quadrupling the H.264 and HEVC throughput of the Codensity G4, the Codensity G5 is our second-generation ASIC that adds AV1 encode support, VP9 decode support, onboard scaling, cropping, padding, graphical overlay, and an 18 TOPS (Trillions of Operations Per Second) artificial intelligence engine that runs the most common frameworks all natively in silicon.

Codensity G5 also includes audio DSP engines for encoding and decoding audio codecs such as MP3, AAC-LC, and HE-AAC. All this on-board activity minimizes the role of the host CPU, allowing Quadra products to operate effectively in systems with modest CPUs.

Where the G4 ASIC is primarily a transcoding engine, the G5 incorporates much more onboard processing for even greater video processing acceleration. For this reason, NETINT labels Codensity G4-based products as Video Transcoders and Codensity G5-based products as Video Processing Units or VPUs.

The Codensity G5 is available in three products (Figure 4): the U.2-based Quadra T1 and the PCIe-based Quadra T1A, which each include one Codensity G5 ASIC, and the PCIe-based Quadra T2A, which includes two Codensity G5 ASICs. Pricing for the T1 starts at $1,500.

In terms of power consumption, the T1 draws 17 watts, the T1A 20 watts, and the T2A 40 watts.

Figure 4. The Quadra line of Codensity G5-based products.

All Codensity G5-based products provide the same HDR and closed caption support as the Codensity G4-based products. They have also been tested on Windows, macOS, Linux, and Android, with support for virtual machine and container virtualization, including Single Root I/O Virtualization (SR-IOV).

From a quality perspective, the Codensity G4-based transcoder products offer no configuration options to optimize quality vs. throughput. Quadra Codensity G5-powered VPUs offer features like lookahead and rate-distortion optimization that allow users to customize quality and throughput for their particular applications.

HARD QUESTIONS ON HOT TOPICS – WHAT DO YOU NEED TO UNDERSTAND ABOUT THE NETINT PRODUCT LINE
Watch the full conversation on YouTube: https://youtu.be/qRtnwjGD2mY

AI-Based Video Processing

Beyond VP9 ingest and AV1 output, and superior on-board processing, the Codensity G5 AI engine is a game changer for many current and future video processing applications. Each Codensity G5 ASIC includes two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU.

Today, in early 2023, the AI-enabled processing market is nascent, but Quadra already supports several applications, such as an AI-based region-of-interest filter and background removal (see Quadra App Note APPS553). Additional features under development include automatic facial ID for video conferencing, license plate detection and OCR for security, object detection for a range of applications, and voice-to-text.

Quadra includes an AI Toolchain workflow that enables importing models from AI tools like Caffe, TensorFlow, Keras, and Darknet for deployment on Quadra. So, in addition to the basic models that NETINT provides, developers can design their own applications and easily implement them on Quadra.

Like NETINT’s Codensity G4 based products, Quadra VPUs are ideal for interactive applications that require low CAPEX and OPEX. Quadra VPUs offer increased onboard processing that enables lower-cost host systems and the ability to customize throughput and quality, deliver AV1 output, and deploy AI video applications.

The NETINT Quadra 100 Video Server

The NETINT Quadra 100 Video Server includes ten Quadra T1 U.2 VPUs and is targeted for ultra high-volume transcoding applications and for services seeking to deliver AV1 stream output.  

The Quadra 100 Video Server costs $20,000 and is built on the Supermicro 1114S-WN10RT server platform powered by an AMD EPYC 7543P processor (32 cores/64 threads) running Ubuntu 20.04.05 LTS. The server ships with 128 GB of DDR4-3200 RAM, a 400GB M.2 SSD, three PCIe slots, and ten NVMe slots that house the ten T1 U.2 VPUs. At full transcoding capacity, the server draws around 500 watts while encoding or transcoding up to 20 8Kp30 streams or as many as 640 720p30 video streams.

The Quadra server is also offered with two different CPUs, the AMD EPYC 7232P Server Processor (8-cores/16-threads, price TBD) and the AMD EPYC 7713P Server Processor (64-cores/128-threads, price TBD). Other than the CPU, the hardware specifications are identical.

Media Processing Frameworks - Driving NETINT Hardware

In addition to SDKs for both hardware generations, NETINT offers highly efficient FFmpeg and GStreamer support; operators apply an FFmpeg/libavcodec or GStreamer patch to complete the integration.

In the FFmpeg implementation, the libavcodec patch on the host server functions between the NETINT hardware and FFmpeg software layer, allowing existing FFmpeg-based video transcoding applications to control hardware operation with minimal changes.

The NETINT hardware device driver software includes a resource management module that tracks hardware capacity and usage load to present inventory and status on available resources and enable resource distribution. User applications can build their own resource management schemes on top of this resource pool or let the NETINT server automatically distribute the decoding and encoding tasks.

In automatic mode, users simply launch multiple transcoding jobs, and the device driver automatically distributes the decode/encode/processing tasks among the available resources. Alternatively, users can assign different hardware tasks to different NETINT devices and even control which streams are decoded by the host CPU or NETINT hardware. With these and similar controls, users can efficiently balance the overall transcoding load between the NETINT hardware and host CPU to maximize throughput.

In all interfaces, the syntax and command structure are similar for T408 and Quadra units, which simplifies migrating from G4-based products to Quadra hardware. It is also possible to operate T408 and Quadra hardware together in the same system.

That’s the overview. For more information on any product, please check the following product pages (click the image below to see product page). 

PRODUCT GALLERY. Click the product image to visit product page

Computing Payback Period on T408s


One of the most power-hungry processes performed in data centers is software-based live transcoding, which can be performed much more efficiently with ASIC-based transcoders. With power costs soaring and carbon emissions an ever-increasing concern, data centers that perform high-volume live transcoding should strongly consider switching to ASIC-based transcoders like the NETINT T408. Computing the Payback Period is easy with this calculator.

To assist in this transition, NETINT recently published two online calculators that measure the cost savings and payback period for replacing software-based transcoders with T408s. This article describes how to use these calculators and shows that data centers can recover their investment in T408 transcoders in just a few months, or even less if they can repurpose servers previously used for encoding. Most of the data shown are from a white paper that you can access here.

About the T408

Briefly, NETINT designs, develops, and sells ASIC-powered transcoders like the T408, which is a video transcoder in a U.2 form factor containing a single ASIC. Operating in x86 and ARM-based servers, T408 transcoders output H.264 or HEVC at up to 4Kp60 or 4x 1080p60 streams per T408 module and draw only 7 watts.

Simply stated, a single T408 can produce roughly the same output as a 32-core workstation encoding in software while drawing 250 to 500 watts of power. You can install up to 24 T408s in a single workstation, which essentially replaces 20 to 24 standalone encoding workstations, slashing power costs and the associated carbon emissions.
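The consolidation math is simple enough to sketch. The figures below are the round numbers from the paragraph above (7 watts per T408, 24 T408s per chassis, 250 to 500 watts per software workstation), and the host server's own power draw is deliberately excluded, so treat the output as a rough estimate.

```python
# Rough consolidation estimate using the round numbers above (host server draw excluded).
t408_watts = 7
t408s_per_server = 24                  # up to 24 T408s in a single chassis
sw_workstation_watts = (250, 500)      # range cited for a 32-core software encoder
replaced_workstations = 24             # one workstation replaced per T408

asic_total = t408s_per_server * t408_watts
sw_low, sw_high = (replaced_workstations * w for w in sw_workstation_watts)

print(f"T408s only:        {asic_total} W")
print(f"Software encoding: {sw_low}-{sw_high} W across {replaced_workstations} workstations")
```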

In a nutshell, these savings are why large video publishers like YouTube and Meta are switching to ASICs. By deploying NETINT’s T408s, you can achieve the same benefits without the associated R&D and manufacturing costs. The new calculators will help you quantify the savings.

Determining the Required Number of T408s

The first calculator, available here, computes the number of T408s required for your production. There are two steps. First, enter the rungs of your encoding ladder into the table as shown. If you don’t know the details of your ladder, you can click the Insert Sample HD or 4K Ladder buttons to insert sample ladders.

After entering your ladder information, insert the number of encoding ladders that you need to produce simultaneously, which in the table is 100. Then press the Compute button (not shown in the Figure but obvious on the calculator).

Calculator 1: Computing the number of required T408 transcoders.

This yields a total of 41 T408s. For perspective, the calculator should be very accurate for streams that don’t require scaling, like 1080p inputs output to 1080p. However, while the T408 decodes and transcodes in hardware, it relies on the host CPU for scaling. If you’re processing full encoding ladders, as we are in this example, throughput will be impacted by the power of the host CPU.

As designed, the calculator assumes that your T408 server is driven by a 32-core host CPU. On an 8-16 core system, expect perhaps 5 – 10% lower throughput. On a 64-core system, throughput could increase by 15 – 20%. Accordingly, please consider the output from this calculator as a good rough estimate accurate to about plus or minus 20%.
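If you want to reason about the sizing yourself, here is a simplified sketch of the logic, not the calculator's actual model: it converts each ladder rung into "1080p30-equivalent" pixel throughput and divides by an assumed per-T408 capacity. The capacity constant and the sample rung list are placeholder assumptions chosen only for illustration (the real calculator returned 41 T408s for this example), so use the online calculator for actual planning.

```python
import math

PIXELS_1080P = 1920 * 1080

def t408s_needed(ladder, simultaneous_ladders, per_t408_1080p30_equiv=5.0):
    """ladder: list of (width, height, fps) output rungs.
    per_t408_1080p30_equiv is an assumed capacity constant, not a NETINT spec."""
    equiv_per_ladder = sum(w * h * fps / (PIXELS_1080P * 30) for w, h, fps in ladder)
    return math.ceil(equiv_per_ladder * simultaneous_ladders / per_t408_1080p30_equiv)

# Illustrative 5-rung HD ladder, 1080p down to 360p at 30 fps.
hd_ladder = [(1920, 1080, 30), (1280, 720, 30), (960, 540, 30),
             (768, 432, 30), (640, 360, 30)]

print(t408s_needed(hd_ladder, simultaneous_ladders=100))   # ~40 with these assumptions
```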

To compute the payback period, click the Compute Payback Period button shown in Figure 1. To restart the calculation, refresh your browser.

Computing Payback Period

Computing the payback period requires significantly more information, which is summarized in the following graphic.

Calculator 2: Information needed to compute the payback period.

Step by step

  1. Choose your currency in the drop-down list.

  2. Enter your current cost per kWh. The $0.25/kWh shown is the approximate UK cost as of March 2022 from this source, which you can also access by clicking the information button to the right of this field. This information button also contains a link to US power costs here.

  3. Enter the number of encoders currently transcoding your live streams. In the referenced white paper, 34 was the number of required servers needed to produce 100 H.264 encoding ladders.

  4. Enter the power consumption per encoding server. The 289 watts shown were the actual power consumption measured for the referenced white paper. If you don’t know your power consumption, click the Info button for some suggested values.

  5. Enter the number of encoding servers that can be repurposed. The T408s will dramatically improve encoding density; for example, in the white paper, it took 34 servers transcoding with software to produce the same streams as five servers with ten T408s each. Since you won’t need as many encoding servers, you can shift them to other applications, which has an immediate economic benefit. If you won’t be able to repurpose any existing servers for some reason, enter 0 here.

  6. Enter the current cost of the encoding servers that can be repurposed. This number will be used to compute the economic benefit of repurposing servers for other functions rather than buying new servers for those functions. You should use the current replacement cost for these servers rather than the original price.

  7. Enter the number of T408s required. If you start with the first calculator, this number will be auto-filled.

  8. Enter your cost for the T408s. $400 is the retail price of the T408 in low quantities. To request pricing for higher volumes, please check with a NETINT sales representative. You can arrange a meeting HERE. 

  9. Enter the power consumption for each T408. The T408 draws 7 watts of power which should be auto-filled.

  10. Enter the number of computers needed to host the T408s. You can deploy up to ten T408s in a 1RU server and up to 24 T408s in a 2RU server. We assumed that you would deploy using the first option (10 T408s in a single 1RU) and auto-filled this entry with that calculation. If the actual number is different, enter the number of computers you anticipate buying for the T408s.

  11. Enter the price for computers purchased to run T408s (USD). If you need to purchase new computers to house the T408, enter the cost here. Note that since the T408 decodes incoming H.264 and HEVC streams and transcodes on-board to those formats, most use cases work fine on workstations with 8-16 cores, though you’ll need a U.2 expansion chassis to house the T408s. Check this link for more information about choosing a server to house the T408s. We assumed $3,000 because that was the cost for the server used in the white paper.

    If you’re repurposing existing hardware, enter the current cost, similar to number 6.

 

  12. Enter power consumption for the servers (watts). As mentioned, you won’t need a very powerful computer to run the T408s, and CPU utilization and power consumption should be modest because the T408s are doing most of the work. This number is the base power consumption of the computer itself; the power utilized by the T408s will be added separately.

When you’ve entered all the data, press the Calculate button.
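If you’d like to sanity-check the calculator’s arithmetic offline, the short Python sketch below approximates the power-savings math behind the inputs above. It is our simplified reading of the calculation, not the calculator’s actual code; the 24/7 duty cycle, the variable names, and the example values (taken from the steps above and the white paper) are assumptions.

# Minimal sketch of the power-savings arithmetic implied by the inputs above.
# Not the calculator's code; the 24/7 duty cycle and example values are assumptions.
HOURS_PER_MONTH = 24 * 365 / 12      # assume continuous (24/7) operation

cost_per_kwh = 0.25                  # step 2: power cost
current_servers = 34                 # step 3: servers transcoding in software
watts_per_server = 289               # step 4: measured in the white paper

t408_count = 50                      # step 7: 5 hosts x 10 T408s each
watts_per_t408 = 7                   # step 9
t408_hosts = 5                       # step 10
watts_per_host = 289                 # step 12: base host power, assumed here

def monthly_power_cost(watts):
    # Convert a continuous load in watts to a monthly power cost
    return watts / 1000 * HOURS_PER_MONTH * cost_per_kwh

before = monthly_power_cost(current_servers * watts_per_server)
after = monthly_power_cost(t408_hosts * watts_per_host + t408_count * watts_per_t408)
print(f"Monthly power savings: ${before - after:,.0f}")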

Interpreting the Results

The calculator computes the payback period under three assumptions:

  • Simple: Payback Period on T408 Purchases
  • Simple: Payback Period on T408 + New Computers
  • Comprehensive: Consider all costs

Figure 3. Simple payback on T408 purchases.

This result divides the cost of the T408 purchases by the monthly savings and shows a payback period of around 11 months. That said, since five servers with T408s essentially replaced 34 servers, unless you’re discarding the other 29 servers, the third result is probably a more accurate reflection of the actual economic impact.

Figure 4. Simple: payback period on T408 + new computers.

This result includes the cost of the servers necessary to run the T408s, which extends the payback period to about 20.5 months. Again, however, if you can reallocate existing encoding servers to other roles, the third calculation is a more accurate reflection.

Figure 5. Comprehensive: consider all costs.

This result incorporates all economic factors. In this case, the value of the repurposed computers ($145,000) exceeds the costs of the T408s and the computers necessary to house them ($103,600), so you’re ahead the day you make the switch.
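Continuing the earlier sketch, the three results differ only in which costs and credits they include. The snippet below shows our simplified interpretation; because the calculator’s own example uses different inputs (and may include factors not modeled here), these illustrative figures won’t match the screenshots exactly.

# Simplified interpretation of the three payback calculations; illustrative only.
monthly_savings = 1466.0        # monthly power savings from the previous sketch

t408_cost = 50 * 400            # step 8: 50 T408s at $400 each
new_server_cost = 5 * 3000      # step 11: five host servers at $3,000 each
repurposed_value = 145000       # step 6: value of repurposed servers (article example)

print(f"T408s only:      {t408_cost / monthly_savings:.1f} months")
print(f"T408s + servers: {(t408_cost + new_server_cost) / monthly_savings:.1f} months")

net_outlay = t408_cost + new_server_cost - repurposed_value
if net_outlay <= 0:
    print("Comprehensive:   paid back immediately (repurposed value exceeds the spend)")
else:
    print(f"Comprehensive:   {net_outlay / monthly_savings:.1f} months")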

However you run the numbers, data centers driving high-volume live transcoding operations will find that ASIC-based transcoders will pay for themselves in a matter of months. If power costs keep rising, the payback period will obviously shrink even further.

Introduction to AI Processing on Quadra


The intersection of video processing and artificial intelligence (AI) delivers exciting new functionality, from real-time quality enhancement for video publishers to object detection and optical character recognition for security applications. One key feature of NETINT’s Quadra Video Processing Units is two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU. As you’ll learn in this post, this architecture makes Quadra the ideal platform for executing video-related AI applications.

This post introduces the reader to what AI is, how it works, and how you deploy AI applications on NETINT Quadra. Along the way, we’ll explore one Quadra-supported AI application, Region of Interest (ROI) encoding.

About AI

Let’s start by defining some terms and concepts. Artificial intelligence refers to a program that can sense, reason, act, and adapt. One AI subset that’s a bit easier to grasp is called machine learning, which refers to algorithms whose performance improves as they are exposed to more data over time.

Machine learning involves the five steps shown in the figure below. Let’s assume we’re building an application that can identify dogs in a video stream. The first step is to prepare your data. You might start with 100 pictures of dogs and then extract features, which are mathematical representations of the characteristics that identify them as dogs: four legs, whiskers, two ears, two eyes, and a tail. So far, so good.

Figure 1. The high-level AI workflow (from Escon Info Systems)

To train the model, you apply your dog-finding algorithm to a picture database of 1,000 animals, only to find that rats, cats, possums, and small ponies are also identified as dogs. As you evaluate and further train the model, you extract new features from all the other animals that disqualify them from being dogs, along with more dog-like features that help identify true canines. This is the “machine learning” that improves the algorithm.

As you train and evaluate your model, at some point it achieves the desired accuracy rate and it’s ready to deploy.
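To make those five steps concrete, here is a toy classification example in Python using scikit-learn. It has nothing to do with Quadra’s tooling; the feature values are made up purely to show prepare, extract, train, evaluate, and deploy in miniature.

# Toy illustration of the five machine-learning steps (prepare data, extract
# features, train, evaluate, deploy). Made-up data; unrelated to NETINT tooling.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1-2. Prepare the data and extract features: [legs, ears, tail_cm, weight_kg]
train_features = [[4, 2, 30, 20], [4, 2, 25, 8], [4, 2, 20, 4], [4, 2, 5, 0.3]]
train_labels = ["dog", "dog", "cat", "rat"]

# 3. Train the model
model = DecisionTreeClassifier().fit(train_features, train_labels)

# 4. Evaluate on held-out animals; in practice you loop back and refine the
#    features and data until the accuracy is acceptable
test_features = [[4, 2, 28, 18], [4, 2, 4, 0.4]]
test_labels = ["dog", "rat"]
print("accuracy:", accuracy_score(test_labels, model.predict(test_features)))

# 5. Deploy: classify a new, unseen animal
print(model.predict([[4, 2, 26, 15]]))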

The NETINT AI Tool Chain

Then it’s time to run the model. Here, you export the model for deployment on an AI-capable hardware platform like the NETINT Quadra. What makes Quadra ideal for video-related AI applications is the power of its Neural Processing Units (NPUs) and the proximity of the video to those NPUs. That is, since the video is processed entirely within Quadra, there are no transfers to a CPU or GPU, which minimizes latency and enables faster performance. More on this below.

Figure 2 shows the NETINT AI Toolchain workflow for creating and running models on Quadra. On the left are third-party tools for creating and training AI-related models. Once these models are complete, you use the free NETINT AI Toolkit to import them and translate, export, and run them on the Quadra NPUs; you’ll see an example of how that’s done in a moment. On the NPUs, the models perform the functions for which they were created and trained, like identifying dogs in a video stream.

Figure 2. The NETINT AI Tool Chain.

Quadra Region of Interest (ROI) Filter

Let’s look at a real-world example. One AI function supplied with Quadra is an ROI filter, which analyzes the input video to detect faces and generate Region of Interest (ROI) data to improve the encoding quality of the faces. Specifically, when the AI Engine identifies a face, it draws a box around the face and sends the box’s coordinates to the encoder, with encoding instructions specific to the box.

Technically, Quadra identifies the face using what’s called a YOLOv4 object detection model. YOLO stands for You Only Look Once, a technique that requires only a single pass of the image (or one look) for object detection. By way of background, YOLO is a highly regarded family of “deep learning” object detection models. The original versions of YOLO are implemented using the DARKNET framework, which you see as an input to the NETINT AI Toolkit in Figure 2.

Deep learning differs from the traditional machine learning described above in that it uses large datasets, rather than human intervention, to create the model. To create the model deployed in the ROI filter, we trained the YOLOv4 model in DARKNET using hundreds of thousands of publicly available labeled images, where the labels are bounding boxes around people’s faces. This produced a highly accurate model with minimal manual input, which is faster and cheaper than traditional machine learning. Obviously, where relevant training data is available, deep learning is a better alternative than traditional machine learning.
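As a rough illustration of how a Darknet-format YOLO model is typically consumed outside of Quadra, the snippet below runs a single-pass detection on the host CPU with OpenCV’s DNN module. The file names are placeholders, and on Quadra the converted model executes on the on-board NPUs via the AI Toolkit rather than on the host.

# Rough illustration only: running a Darknet-format YOLO model on the host CPU
# with OpenCV's DNN module. File names are placeholders; on Quadra the converted
# model runs on the on-board NPUs, not on the host.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4_head.cfg", "yolov4_head.weights")
image = cv2.imread("frame.png")

# YOLO takes "one look": a single forward pass over the whole frame
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row holds normalized box coordinates, objectness, and class scores
for detection in (row for out in outputs for row in out):
    if detection[4] > 0.5:          # objectness threshold
        cx, cy, w, h = detection[:4]
        print("face at", cx, cy, "size", w, h)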

Using the ROI Function

Most users will access the ROI function via FFmpeg, where it’s presented as a video filter with the filter-specific command string shown below. To execute the function, you call the filter (ni_quadra_roi), enter the name and location of the model (yolov4_head.nb), and set a QP offset to adjust the quality within each box (qpoffset=-0.6). Negative values increase video quality, while positive values decrease it, so this command string increases the quality of the faces by approximately 60% over other regions in the video.

-vf 'ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6'
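For context, the filter string above is only one part of a complete command line. The Python sketch below assembles one possible invocation; the encoder name and file names are assumptions about a NETINT-patched FFmpeg build, so verify them against ffmpeg -encoders on your own system.

# One possible way to wrap the ROI filter in a complete FFmpeg invocation.
# The encoder name and file names are assumptions about a NETINT-patched FFmpeg
# build; check `ffmpeg -encoders` on your system before relying on them.
import subprocess

cmd = [
    "ffmpeg", "-y",
    "-i", "input.mp4",                                   # placeholder source
    "-vf", "ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6",
    "-c:v", "h264_ni_quadra_enc",                        # assumed Quadra H.264 encoder name
    "-b:v", "700k",                                      # illustrative bitrate
    "output.mp4",
]
subprocess.run(cmd, check=True)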

Obviously, the video shown in Figure 3 is highly compressed. In a surveillance video, the ROI filter could preserve facial quality for face detection; in a gambling or similar video compressed at a higher bitrate, it could ensure that the players’ or performers’ faces look their best.

Figure 3. The region of interest filter at work; original on the left, ROI filter on the right.

In terms of performance, a single Quadra unit can process about 200 frames per second or at least six 30fps streams. This would allow a single Quadra to detect faces and transcode streams from six security cameras or six player inputs in an interactive gambling application, along with other transcoding tasks performed without region of interest detection.

Figure 4 shows the processing workflow within the Quadra VPU. Here we see the face detection operating within Quadra’s NPUs, with the location and processing instructions passing directly from the NPU to the encoder. As mentioned, since all instructions are processed on Quadra, there are no memory transfers outside the unit, reducing latency to a minimum and improving overall throughput and performance. This architecture represents the ideal execution environment for any video-related AI application.

Figure 4. Quadra’s on-board AI and encoding processing.

NETINT offers several other AI functions, including background removal and replacement, with others like optical character recognition, video enhancement, camera video quality detection, and voice-to-text on the long-term drawing board. Of course, via the NETINT Tool Chain, Quadra should be able to run most models created in any machine learning platform.

Here in late 2022, we’re only touching the surface of how AI can enhance video, whether by improving visual quality, extracting data, or any number of as-yet unimagined applications. Looking ahead, the NETINT AI Tool Chain should ensure that any AI model that you build will run on Quadra. Once deployed, Quadra’s integrated video processing/AI architecture should ensure highly efficient and extremely low-latency operation for that model.