Region of Interest Encoding for Cloud Gaming: A Survey of Approaches

Region of Interest (ROI) Encoding for-Cloud Gaming

As cloud gaming use cases expand, we are studying even more ways to deliver high-quality video with low latency and efficient bitrates.

Region of Interest Encoding (ROI) is one way to enhance video quality while reducing bandwidth. This post will discuss three ROI-based techniques recently proposed in research papers that may soon be adopted in cloud gaming encoding workflows.

This blog is meant to be informative. If I missed any important papers or methods, feel free to contact me HERE.

Region of Interest (ROI) Encoding

ROI encoding allows encoders to prioritize frame quality in critical regions most closely scrutinized by the viewer and is an established technique for improving viewer Quality of Experience. For example, NETINT’s Quadra video processing unit (VCU) uses artificial intelligence (AI) to detect faces in videos and then ROI encoding to improve facial quality. The NETINT T408/T432 also supports ROI encoding, but the specific regions must be manually defined in the command string.

ROI encoding is particularly relevant to cloud gaming, where viewers prefer fast-moving action, high resolutions, and high frame rates, but also want to play at low bitrates on wireless or cellular networks with ultra-low latency. These factors make cloud gaming a challenging compression environment.

Whether for real word videos or cloud gaming, the challenge with ROI encoding lies in identifying the most relevant regions of interest. As you’ll see, the three papers described below all take a markedly different approach. 

In the paper “Content-aware Video Encoding for Cloud Gaming” (2019), researchers from Simon Fraser University and Advanced Micro Devices propose using metadata provided by the game developer to identify the crucial regions. As the article states,

“Identifying relevant blocks is straightforward for game developers, because they know the logic and semantics of the game. Thus, they can expose this information as metadata with the game that can be accessed via APIs... Using this information, one or more regions of interest (ROIs) are defined as bounding boxes containing objects of importance to the task being achieved by the player.”

The authors label their proposed method CAVE, for Content-Aware Video Encoding. Architecturally, CAVE sits between the game process and the encoder, as shown in Figure 1. Then, “CAVE uses information about the game’s ROIs and computes various encoding parameters to optimize the quality. It then passes these parameters to the Video Encoder, which produces the encoded frames sent to the client.”

Region of Interest Encoding for Cloud Gaming: A Survey of Approaches
Figure 1. The CAVE encoding method is implemented between the game process and encoder.

The results were promising. The technique “achieves quality gains in ROIs that can be translated to bitrate savings between 21% and 46% against the baseline HEVC encoder and between 12% and 89% against the closest work in the literature.”

Additionally, the processing overhead introduced by CAVE was less than 1.21%, which the authors felt would be reduced even further with parallelization, though implementing the process in silicon could completely eliminate the additional CPU loading.

ROI from Gaze Tracking

Another ROI-based approach was studied in the paper “Cloud Gaming With Foveated Video Encoding” by researchers from Aalto University in Finland and Politecnico di Torino in Italy. In this study, the region of interest was detected by a Tobii 4C Eye Tracker. This data was sent to the server, which used it to identify the ROI and adjust the Quantization Parameter (QP) values for the affected blocks accordingly.

Region of Interest Encoding for Cloud Gaming: A Survey of Approaches
Figure 2. Using region of interest data from a gaze tracker.

Referring to the title of this paper, the term ‘foveation’ refers to a “non-uniform sampling response to visual stimuli” that’s inherent to the human visual system. By incorporating the concept of foveation, the encoder can most effectively allocate QP values to the regions of interest and surrounding frames, and seamlessly blend them with other regions within the frame.

As stated in the paper, to compute the quality of each macroblock, “the gaze location is translated to a macroblock based coordinate system. The macroblock corresponding to the current gaze location is assigned the lowest QO, while the QO of macroblocks away from the gaze location increases progressively with distance from the gaze macroblock.” 

The researchers performed extensive testing and analysis and ultimately concluded that “[o]ur evaluation results suggest that its potential to reduce bandwidth consumption is significant, as expected.” 
Regarding latency, the paper reports that “user study establishes the feasibility of FVE for FPS games, which are the most demanding latency wise.”

Obviously, any encoding solution tied to a gaze tracker has limited applicability, but the authors saw a much broader horizon ahead. “[w]e intend to attempt eliminating the need for specialized hardware for eye tracking by employing web cameras for the purpose. Using web cameras, which are ubiquitous in modern consumer computing devices like netbooks and mobile devices, would enable widespread adoption of foveated streaming for cloud gaming.”

Detecting ROI from Machine Learning

Finally, DeepGame: Efficient Video Encoding for Cloud Gaming was published in October 2021 by researchers from Simon Fraser University and Advanced Micro Devices, including three authors of the first paper mentioned above.

As detailed in the introduction, the authors propose “a new video encoding pipeline, called DeepGame, for cloud gaming to deliver high-quality game streams without codec or game modifications…DeepGame takes a learning-based approach to understand the player contextual interest within the game, predict the regions of interest (ROIs) across frames, and allocate bits to different regions based on their importance.”

At a high level, DeepGame is implemented in three stages:

  1. Scene analysis to gather data
  2. ROI prediction, and
  3. Encoding parameters calculation

Regarding the last stage, these encoding parameters are passed to the encoder via “a relatively straightforward set of APIs” so it’s not necessary to modify the encoder source code.

The authors describe their learning-based approach as follows; “DeepGame learns the player’s contextual interest in the game and the temporal correlation of that interest using a spatiotemporal deep neural network.” The schema for this operation is shown in Figure 3.

In essence, this learning-based approach means that some game-specific training is required beforehand and some processing during gameplay to identify ROIs in real time. The obvious questions are, how much latency does this process add, and how much bandwidth does the approach save.

Region of Interest Encoding for Cloud Gaming: A Survey of Approaches
Figure 3. DeepGame’s neural network-based schema for detecting region of interest.

Regarding latency, model training is performed offline and only once per game (and for major upgrades). Running the inference on the model is performed during each gaming session. During their testing, the researchers ran the inference model on every third frame and concluded that “ROI prediction time will not add any processing delays to the pipeline.”

The researchers trained and tested four games, FIFA 20, a soccer game, CS:GO, a first-person shooter game, and NBA Live 19 and NHL 19, and performed multiple analyses. First, they compared their predicted ROIs to actual ROIs detected using a Gazepoint GP3 eye-tracking device. Here, accuracy scores ranged from a high of 85.95% for FIFA 20 to a low of 73.96% for NHL 19.

Then, the researchers compared the quality in the ROI regions with an unidentified “state-of-the-art H.265 video encoder” using SSIM and PSNR. BD-Rate savings for SSIM ranged from 33.01% to 20.80%, and from 35.06% to 19.11% for PSNR. They also compared overall frame quality using VMAF, which yielded nearly identical scores, proving that DeepGame didn’t degrade overall quality despite the bandwidth savings and improved quality with regions of interest.

The authors also performed a subjective study with the FIFA 20 and CS:GO games using x264 with and without DeepGame inputs. The mean opinion scores incorporated the entire game experience, including lags, distortions, and artifacts. In these tests, DeepGame improved the Mean Opinion Scores by up to 33% over the base encoder.

Play Video about Region of Interest Encoding for Cloud Gaming: A Survey of Approaches - thumbnail
Watch the full conversation on YouTube:


All approaches have their pros and cons. The CAVE approach should be most accurate in identifying ROIs but requires metadata from game developers. The gaze tracker approach can work with any game but requires hardware that many gamers don’t have and is unproven for webcams. Meanwhile, DeepGame can work with any game but requires pre-game training and involves ingame running of reference models.

All appear to be very viable approaches for improving QoE and reducing bandwidth and latency while working with existing codecs and encoders. Unfortunately, none of the three proposals described seem to have progressed towards implementation. This makes ROI encoding for cloud gaming a technology worth watching, if not yet available for implementation.

Maximizing Cloud Gaming Performance with ASICs

Maximizing Cloud Gaming Performance with ASICs

Ask ten cloud gamers what an acceptable level of latency is for cloud gaming, and you’ll get ten different answers. However, they will all agree that lower latency is better.

At NETINT, we understand. As a supplier of encoders to the cloud gaming market, our role is to supply the lowest possible latency at the highest possible quality and the greatest encoding density with the lowest possible power consumption. While this sounds like a tall order, because our technology is ASIC based, it’s what we do for cloud gaming and high-volume video streaming workloads of all types.

In this article, we’ll take a quick look at the technology stack for cloud gaming and the role of compression. Then we’ll discuss the performance of the NETINT Quadra VPU (video processing unit) series using the four measuring sticks of latency, density, video quality, and power consumption.

The Cloud Gaming Technology Stack

Figure 1 illustrates the different elements of the cloud gaming technology stack, particularly how the various transfer, compute, rendering, and encoding activities contribute to overall latency.

At the heart of every cloud gaming center is a game engine that typically runs the operating system native to the game, usually Android or Windows, though Linux and macOS is not uncommon. (see here for Meta’s dual OS architecture)

Since most games rely on GPU for rendering, all cloud gaming data centers have a healthy dose of GPU resources. These functions are incorporated in the cloud compute and graphics engine shown on the left, which creates the frames sent to the encode function for encoding and transmission to the gamer.

As illustrated in Figure 1, Nokia budgets 100 ms for total latency. Inside the data center, which is shown on the left, Nokia allows 15 ms to receive the data, 40 ms to process the input and render the frame, 5 ms to encode the frame, and 15 seconds to return it to the remote player. That’s a lot to do in the time it takes a sound wave to travel just 100 feet.

Maximizing Cloud Gaming Performance with ASICs - figure 1
Figure 1. Cloud gaming latency budget from Nokia.

NETINT’s Quadra VPU series is ideal for the standalone encode function. All Quadra VPUs are powered by the NETINT Codensity G5 ASIC. It’s called a video processing unit because in addition to H.264, HEVC, and VP9 decode, and H.264, HEVC, and AVI encode, Quadra VPUs offer onboard scaling, overlay, and an 18 TOPS AI engine (per chip).

Quadra is available in several single-chip solutions (T1 and T1A) and a dual-chip solution (T2) and starts at $1,500 in low quantities. Depending upon the configuration that you purchase, you can install up to ten Quadra VPUs in a single 1RU server and twenty Quadra VPUs in a 2RU server.

Cloud Gaming Latency and Density

Table 1 reports latency and density for a single Quadra VPU. As you would expect, latency depends on video resolution by way of the available network bandwidth and, to a much lesser degree, the number of jobs being processed.

Game producers understand the resolution/latency tradeoff and design the experience around this. So, a cloud gaming vendor might deliver a first-person shooter game at 720p to minimize latency while providing a better UX on medium bandwidth connections and a slower-paced role-playing or strategy game at larger resolutions to optimize the visual experience. As you can see, a single Quadra VPU can service both scenarios, with 4K latency under 20 ms and 720p latency around 4 ms at extremely high stream counts.

Maximizing Cloud Gaming Performance with ASICs - table 1
Table 1. Quadra throughput and average latency for AVC and HEVC.

In terms of density, the jobs shown in Table 1 are for a single Quadra VPU. Though multiple units won’t scale linearly, performance will increase substantially as you install additional units into a server. Because the Quadra is focused solely on video processing and encoding operations, it outperforms most general-purpose GPUs, CPUs, and even FPGA-based encoders from a density perspective.

Quadra Output Quality

From a quality perspective, hardware transcoders are typically benchmarked against the x264 and x265 codecs running in FFmpeg. Though FFmpeg’s throughput is orders of magnitude lower, these codecs represent well known and accepted quality levels. NETINT recently compared Quadra quality against x264 and x265 in a low latency configuration using a CGI-based data set.

Table 2 shows the results for H.264, with Rate-Distortion Optimization Quantization enabled and disabled. Enabling RDOQ increases quality slightly but decreases throughput. Quadra exceeded x264 quality in both configurations using the veryfast preset, typical for live streaming.

Maximizing Cloud Gaming Performance with ASICs - table 2
Table 2. The NETINT Quadra VPU series delivers better H.264 quality
than the x264 codec using the veryfast preset.

For HEVC, Table 3 shows the equivalent x265 preset with RDOQ disabled (the high throughput, lower-quality option) at three Rate Distortion Optimization levels, which also trade-off quality for throughput. Even with RDOQ disabled and with RDO set to 1 (low quality. high throughput) Quadra delivers the equivalent of x265 Medium quality. Note that most live streaming engineers use superfast or ultrafast to produce even a modest number of HEVC streams in a software-only encoding scenario.

Table 3. The NETINT Quadra VPU series delivers better quality
than the x265 codec using the medium preset.

Low Power Transcoding for Cloud Gaming

At full power, Quadra T1 draws 70 watts. Though some GPUs offer similar power consumption, they typically deliver much fewer streams.

In this comparison with the NVIDIA T4, the Quadra T1 drew .71 watts per 1080p stream, about 84% less than the 3.7 watts per stream required by the T4. This obviously translates to an 84% reduction in energy costs and carbon emissions per stream. In terms of CAPEX, Quadra costs $53.57 per 1080p stream, 63% cheaper than the T4’s $144/stream.

When it comes to gameplay, most gamers prioritize latency and quality. In addition to delivering these two key QoE elements, cloud gaming vendors must also focus on CAPEX, OPEX, and sustainability.  By all these metrics, the ASIC-based Quadra is the most ideal encoder for any cloud gaming production workflow. 

Mobile cloud gaming and technology suppliers

Cloud gaming is the perfect application for ASIC-based transcoding. NETINT products are extensively deployed in cloud gaming overseas. High-profile domestic...

Video games are a huge market segment, projected to reach US$221.4 billion in 2023, expanding to an estimated US$285 billion by 2027. Of that, cloud gaming grossed an estimated US$3 billion+ in 2022 and is projected to produce over US$12 billion in revenue by 2026.

While the general video game market generates minimal revenue from encoder sales, cloud gaming is the perfect application for ASIC-based transcoding. NETINT products were designed, in part, for cloud gaming and are extensively deployed in cloud gaming overseas. We expect to announce some high-profile domestic design wins in 2023.

If you’re not a gamer, you may not be familiar with what cloud gaming is and how it’s different from PC or console-based gaming. This is the first of several introductory articles to get you up to speed on what cloud gaming is, how it works, who the major players are, and why it’s projected to grow so quickly. 

What is cloud gaming

Figure 1, from this article, illustrates the difference between PC/console gaming and cloud gaming. On top is traditional gaming, where the gamer needs an expensive, high-performance console or game computer to process the game logic and render the output. To the extent that there is a cloud component, say for multiple players, the online server tracks and reports the interactions, but all computational and rendering heavy lifting is performed locally.

Mobile cloud gaming and technology suppliers - figure 1
Figure 1. The difference between traditional and cloud gaming. From this article.

On the bottom is cloud gaming. As you can see, all you need on the consumer side is a screen and game controller. All of the game logic and rendering are performed in the cloud, along with encoding for delivery to the consumer.

Cloud gaming workflow

Figure 2 shows a high-level cloud workflow – we’ll dig deeper into the cloud gaming technology stack in future articles, but this should help you grasp the concept. As shown, the gamer’s inputs are sent to the cloud, where a virtual instance of the game interprets, executes, and renders the input. The resultant frames are captured, encoded, and transmitted back to the consumer, where the frames are decoded and displayed. 

Figure 2. A high-level view of the cloud side of cloud gaming from this seminal article.

Cloud gaming and consumers' benefits

Cloud gaming services incorporate widely different business models, pricing levels, available games, performance envelopes, and compatible devices. In most cases, however, consumers benefit because:

  • They don’t need a high performant PC or game console to play games – they can play on most connected devices. This includes some Smart TVs for a true, big-screen experience.
  • They don’t need to download, install, or maintain games on their game platform.
  • They don’t need to buy expensive games to get started.
  • They can play the same game on multiple platforms, from an expensive gaming rig or console to a smartphone or tablet, with all ongoing game information stored in the cloud so you can immediately pick up where you left off.

Publishers benefit because they get instant access to users on all platforms, not just the native platforms the games were designed for. So, console and PC-based games are instantly accessible to all players, even those without the native hardware. Since games aren’t downloaded during cloud gaming, there’s no risk of piracy, and the cloud negates the performance advantages long-held by those with the fastest hardware, leveling the playing field for game play.

Gaming experience

Speaking of performance, what’s necessary to achieve a traditional local gameplay experience? Most cloud platforms recommend a 10 Mbps download speed at a minimum for mobile, with a wired Ethernet connection recommended for computers and smart TVs. As you would expect, your connection speed dictates performance, with 4K ultra-high frame rate games requiring faster connection speeds than 1080p@30fps gameplay.

As mentioned at the top, cloud gaming is expected to capture an increasing share of overall gameplay revenue going forward, both from existing gamers who want to play new games on new platforms and new gamers. Given the revenue numbers involved, this makes cloud gaming a critical market for all related technology suppliers.