Ask ten cloud gamers what an acceptable level of latency is for cloud gaming, and you’ll get ten different answers. However, they will all agree that lower latency is better.
At NETINT, we understand. As a supplier of encoders to the cloud gaming market, our role is to supply the lowest possible latency at the highest possible quality and the greatest encoding density with the lowest possible power consumption. While this sounds like a tall order, because our technology is ASIC based, it’s what we do for cloud gaming and high-volume video streaming workloads of all types.
In this article, we’ll take a quick look at the technology stack for cloud gaming and the role of compression. Then we’ll discuss the performance of the NETINT Quadra VPU (video processing unit) series using the four measuring sticks of latency, density, video quality, and power consumption.
The Cloud Gaming Technology Stack
Figure 1 illustrates the different elements of the cloud gaming technology stack, particularly how the various transfer, compute, rendering, and encoding activities contribute to overall latency.
At the heart of every cloud gaming center is a game engine that typically runs the operating system native to the game, usually Android or Windows, though Linux and macOS are not uncommon (see here for Meta’s dual OS architecture).
Since most games rely on GPUs for rendering, all cloud gaming data centers have a healthy dose of GPU resources. These functions are incorporated in the cloud compute and graphics engine shown on the left, which creates the frames sent to the encode function for encoding and transmission to the gamer.
As illustrated in Figure 1, Nokia budgets 100 ms for total latency. Inside the data center, which is shown on the left, Nokia allows 15 ms to receive the data, 40 ms to process the input and render the frame, 5 ms to encode the frame, and 15 ms to return it to the remote player. That’s a lot to do in the time it takes a sound wave to travel just 100 feet.
Figure 1. Cloud gaming latency budget from Nokia.
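To make the budget concrete, here is a minimal sketch that adds up the four data center stages, taking each figure in milliseconds. The variable names are illustrative, and the article does not itemize the time left over after these four stages, so treating the remainder as headroom for everything outside the data center is an assumption.

```python
# Latency budget arithmetic for the data center stages described above.
# All values are in milliseconds; names are illustrative, not Nokia's.
TOTAL_BUDGET_MS = 100

stages_ms = {
    "receive_input": 15,
    "process_and_render": 40,
    "encode": 5,
    "return_to_player": 15,
}

# Time consumed by the four itemized stages.
data_center_ms = sum(stages_ms.values())

# Whatever is left of the 100 ms budget (not itemized in the article).
remaining_ms = TOTAL_BUDGET_MS - data_center_ms

# Fraction of the total budget available to the encoder.
encode_share = stages_ms["encode"] / TOTAL_BUDGET_MS

print(f"Itemized stages: {data_center_ms} ms")
print(f"Remaining budget: {remaining_ms} ms")
print(f"Encode share of total: {encode_share:.0%}")
```

The encoder gets only 5 of the 100 ms, which is why per-frame encode latency matters so much for this workload.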
NETINT’s Quadra VPU series is ideal for the standalone encode function. All Quadra VPUs are powered by the NETINT Codensity G5 ASIC. It’s called a video processing unit because in addition to H.264, HEVC, and VP9 decode, and H.264, HEVC, and AV1 encode, Quadra VPUs offer onboard scaling, overlay, and an 18 TOPS AI engine (per chip).
Cloud Gaming Latency and Density
Table 1 reports latency and density for a single Quadra VPU. As you would expect, latency depends primarily on video resolution and, to a much lesser degree, the number of jobs being processed.
Game producers understand the resolution/latency tradeoff and design the experience around it. So, a cloud gaming vendor might deliver a first-person shooter game at 720p to minimize latency while providing a better UX on medium bandwidth connections, and a slower-paced role-playing or strategy game at higher resolutions to optimize the visual experience. As you can see, a single Quadra VPU can service both scenarios, with 4K latency under 20 ms and 720p latency around 4 ms at extremely high stream counts.
Table 1. Quadra throughput and average latency for AVC and HEVC.
In terms of density, the jobs shown in Table 1 are for a single Quadra VPU. Though multiple units won’t scale linearly, performance will increase substantially as you install additional units into a server. Because the Quadra is focused solely on video processing and encoding operations, it outperforms most general-purpose GPUs, CPUs, and even FPGA-based encoders from a density perspective.
Quadra Output Quality
From a quality perspective, hardware transcoders are typically benchmarked against the x264 and x265 codecs running in FFmpeg. Though FFmpeg’s throughput is orders of magnitude lower, these codecs represent well-known and accepted quality levels. NETINT recently compared Quadra quality against x264 and x265 in a low-latency configuration using a CGI-based data set.
Table 2 shows the results for H.264, with Rate-Distortion Optimized Quantization (RDOQ) enabled and disabled. Enabling RDOQ increases quality slightly but decreases throughput. Quadra exceeded the quality of x264 using the veryfast preset, which is typical for live streaming, in both configurations.
Table 2. The NETINT Quadra VPU series delivers better H.264 quality than the x264 codec using the veryfast preset.
For HEVC, Table 3 shows the x265 preset that Quadra matches in quality with RDOQ disabled (the high-throughput, lower-quality option) at three Rate-Distortion Optimization (RDO) levels, which also trade off quality for throughput. Even with RDOQ disabled and RDO set to 1 (low quality, high throughput), Quadra delivers the equivalent of x265 medium quality. Note that most live streaming engineers use superfast or ultrafast to produce even a modest number of HEVC streams in a software-only encoding scenario.
Table 3. The NETINT Quadra VPU series delivers better quality than the x265 codec using the medium preset.
Low Power Transcoding for Cloud Gaming
At full power, the Quadra T1 draws 70 watts. Though some GPUs offer similar power consumption, they typically deliver far fewer streams.
In this comparison with the NVIDIA T4, the Quadra T1 drew 0.71 watts per 1080p stream, about 81% less than the 3.7 watts per stream required by the T4. This translates to a roughly 81% reduction in energy costs and carbon emissions per stream. In terms of CAPEX, Quadra costs $53.57 per 1080p stream, 63% less than the T4’s $144 per stream.
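The per-stream savings can be checked directly from the quoted figures. The sketch below is only that arithmetic; the helper name percent_reduction is hypothetical, and the inputs are the watts-per-stream and dollars-per-stream values given above.

```python
def percent_reduction(baseline: float, candidate: float) -> float:
    """Return how much lower candidate is than baseline, as a percentage."""
    return (baseline - candidate) / baseline * 100


# Figures quoted in the comparison above (per 1080p stream).
T4_WATTS, QUADRA_WATTS = 3.7, 0.71
T4_COST_USD, QUADRA_COST_USD = 144.0, 53.57

power_saving = percent_reduction(T4_WATTS, QUADRA_WATTS)
cost_saving = percent_reduction(T4_COST_USD, QUADRA_COST_USD)

print(f"Power per stream: {power_saving:.1f}% lower")  # 80.8% lower
print(f"CAPEX per stream: {cost_saving:.1f}% lower")   # 62.8% lower
```

These work out to roughly 81% less power and 63% less CAPEX per stream.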
When it comes to gameplay, most gamers prioritize latency and quality. In addition to delivering these two key QoE elements, cloud gaming vendors must also focus on CAPEX, OPEX, and sustainability. By all these metrics, the ASIC-based Quadra is an ideal encoder for any cloud gaming production workflow.