Region of Interest Encoding for Cloud Gaming: A Survey of Approaches

As cloud gaming use cases expand, we are studying even more ways to deliver high-quality video with low latency and efficient bitrates.

Region of Interest (ROI) encoding is one way to enhance video quality while reducing bandwidth. This post discusses three ROI-based techniques recently proposed in research papers that may soon be adopted in cloud gaming encoding workflows.

This blog is meant to be informative. If I missed any important papers or methods, feel free to contact me HERE.

Region of Interest (ROI) Encoding

ROI encoding allows encoders to prioritize frame quality in the critical regions most closely scrutinized by the viewer and is an established technique for improving viewer Quality of Experience (QoE). For example, NETINT’s Quadra video processing unit (VPU) uses artificial intelligence (AI) to detect faces in videos and then ROI encoding to improve facial quality. The NETINT T408/T432 also supports ROI encoding, but the specific regions must be manually defined in the command string.

ROI encoding is particularly relevant to cloud gaming, where viewers prefer fast-moving action, high resolutions, and high frame rates, but also want to play at low bitrates on wireless or cellular networks with ultra-low latency. These factors make cloud gaming a challenging compression environment.

Whether for real-world videos or cloud gaming, the challenge with ROI encoding lies in identifying the most relevant regions of interest. As you’ll see, the three papers described below take markedly different approaches.

In the paper “Content-aware Video Encoding for Cloud Gaming” (2019), researchers from Simon Fraser University and Advanced Micro Devices propose using metadata provided by the game developer to identify the crucial regions. As the article states,

“Identifying relevant blocks is straightforward for game developers, because they know the logic and semantics of the game. Thus, they can expose this information as metadata with the game that can be accessed via APIs... Using this information, one or more regions of interest (ROIs) are defined as bounding boxes containing objects of importance to the task being achieved by the player.”

The authors label their proposed method CAVE, for Content-Aware Video Encoding. Architecturally, CAVE sits between the game process and the encoder, as shown in Figure 1. Then, “CAVE uses information about the game’s ROIs and computes various encoding parameters to optimize the quality. It then passes these parameters to the Video Encoder, which produces the encoded frames sent to the client.”

Figure 1. The CAVE encoding method is implemented between the game process and encoder.
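To make the bounding-box idea concrete, here is a minimal sketch of how ROI boxes might be turned into a per-block map of QP offsets that an encoder could consume. This is not CAVE's actual parameter computation, which the paper ties to its own rate control. The `Roi` class, the 64-pixel block size, and the offset values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roi:
    """Bounding box in pixels (hypothetical stand-in for developer metadata)."""
    x: int
    y: int
    width: int
    height: int


def qp_offset_map(frame_w, frame_h, rois, block=64,
                  roi_offset=-6, background_offset=4):
    """Return a rows x cols grid of QP offsets, one per block.

    Blocks touched by any ROI get roi_offset (lower QP, higher quality);
    every other block gets background_offset. All values are illustrative.
    """
    cols = -(-frame_w // block)  # ceiling division
    rows = -(-frame_h // block)
    grid = [[background_offset] * cols for _ in range(rows)]
    for roi in rois:
        if roi.width <= 0 or roi.height <= 0:
            continue  # ignore empty boxes
        # Clamp the box to the frame, then convert pixels to block indices.
        first_col = max(roi.x // block, 0)
        last_col = min((roi.x + roi.width - 1) // block, cols - 1)
        first_row = max(roi.y // block, 0)
        last_row = min((roi.y + roi.height - 1) // block, rows - 1)
        for row in range(first_row, last_row + 1):
            for col in range(first_col, last_col + 1):
                grid[row][col] = roi_offset
    return grid
```

A real implementation would also need to keep the overall bitrate on target, which is the harder part of the problem and the part this sketch leaves out.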

The results were promising. The technique “achieves quality gains in ROIs that can be translated to bitrate savings between 21% and 46% against the baseline HEVC encoder and between 12% and 89% against the closest work in the literature.”

Additionally, the processing overhead introduced by CAVE was less than 1.21%, which the authors felt would be reduced even further with parallelization, though implementing the process in silicon could completely eliminate the additional CPU loading.

ROI from Gaze Tracking

Another ROI-based approach was studied in the paper “Cloud Gaming With Foveated Video Encoding” by researchers from Aalto University in Finland and Politecnico di Torino in Italy. In this study, the region of interest was detected by a Tobii 4C Eye Tracker. This data was sent to the server, which used it to identify the ROI and adjust the Quantization Parameter (QP) values for the affected blocks accordingly.

Figure 2. Using region of interest data from a gaze tracker.

The term ‘foveation’ in the paper’s title refers to a “non-uniform sampling response to visual stimuli” that’s inherent to the human visual system. By incorporating the concept of foveation, the encoder can allocate QP values most effectively to the regions of interest and surrounding areas, and seamlessly blend them with other regions within the frame.

As stated in the paper, to compute the quality of each macroblock, “the gaze location is translated to a macroblock based coordinate system. The macroblock corresponding to the current gaze location is assigned the lowest QO, while the QO of macroblocks away from the gaze location increases progressively with distance from the gaze macroblock.” 
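The quoted description can be sketched in a few lines. The example below assumes that QO means a per-macroblock quantizer offset, that macroblocks are 16x16 pixels, and that the offset grows with a Gaussian falloff around the gaze macroblock. The quote only says the value "increases progressively with distance," so the Gaussian shape and the `max_offset` and `sigma` defaults are illustrative assumptions.

```python
import math

MB_SIZE = 16  # assumed macroblock size in pixels


def gaze_to_macroblock(gaze_x, gaze_y, frame_w, frame_h):
    """Translate a gaze point in pixels to (col, row) macroblock coordinates."""
    cols = math.ceil(frame_w / MB_SIZE)
    rows = math.ceil(frame_h / MB_SIZE)
    col = min(max(int(gaze_x) // MB_SIZE, 0), cols - 1)
    row = min(max(int(gaze_y) // MB_SIZE, 0), rows - 1)
    return col, row


def foveated_offsets(gaze_x, gaze_y, frame_w, frame_h,
                     max_offset=10.0, sigma=8.0):
    """Return a rows x cols grid of quantizer offsets.

    The gaze macroblock gets offset 0 (best quality); offsets rise toward
    max_offset with distance. sigma is the falloff width in macroblocks.
    """
    cols = math.ceil(frame_w / MB_SIZE)
    rows = math.ceil(frame_h / MB_SIZE)
    gaze_col, gaze_row = gaze_to_macroblock(gaze_x, gaze_y, frame_w, frame_h)
    offsets = []
    for row in range(rows):
        line = []
        for col in range(cols):
            dist_sq = (col - gaze_col) ** 2 + (row - gaze_row) ** 2
            falloff = math.exp(-dist_sq / (2.0 * sigma ** 2))
            line.append(max_offset * (1.0 - falloff))
        offsets.append(line)
    return offsets
```

A wider `sigma` keeps more of the frame at high quality, which costs bitrate but is more forgiving of gaze-tracking error and latency.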

The researchers performed extensive testing and analysis and ultimately concluded that “[o]ur evaluation results suggest that its potential to reduce bandwidth consumption is significant, as expected.” Regarding latency, the paper reports that the “user study establishes the feasibility of FVE for FPS games, which are the most demanding latency wise.”


Obviously, any encoding solution tied to a gaze tracker has limited applicability, but the authors saw a much broader horizon ahead: “[w]e intend to attempt eliminating the need for specialized hardware for eye tracking by employing web cameras for the purpose. Using web cameras, which are ubiquitous in modern consumer computing devices like netbooks and mobile devices, would enable widespread adoption of foveated streaming for cloud gaming.”

Detecting ROI from Machine Learning

Finally, “DeepGame: Efficient Video Encoding for Cloud Gaming” was published in October 2021 by researchers from Simon Fraser University and Advanced Micro Devices, including three authors of the first paper mentioned above.

As detailed in the introduction, the authors propose “a new video encoding pipeline, called DeepGame, for cloud gaming to deliver high-quality game streams without codec or game modifications…DeepGame takes a learning-based approach to understand the player contextual interest within the game, predict the regions of interest (ROIs) across frames, and allocate bits to different regions based on their importance.”

At a high level, DeepGame is implemented in three stages:

  1. Scene analysis to gather data
  2. ROI prediction, and
  3. Encoding parameters calculation

Regarding the last stage, these encoding parameters are passed to the encoder via “a relatively straightforward set of APIs” so it’s not necessary to modify the encoder source code.

The authors describe their learning-based approach as follows: “DeepGame learns the player’s contextual interest in the game and the temporal correlation of that interest using a spatiotemporal deep neural network.” The schema for this operation is shown in Figure 3.

In essence, this learning-based approach means that some game-specific training is required beforehand, along with some processing during gameplay to identify ROIs in real time. The obvious questions are how much latency this process adds and how much bandwidth the approach saves.

Figure 3. DeepGame’s neural network-based schema for detecting region of interest.

Regarding latency, model training is performed offline and only once per game (and for major upgrades). Running the inference on the model is performed during each gaming session. During their testing, the researchers ran the inference model on every third frame and concluded that “ROI prediction time will not add any processing delays to the pipeline.”
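As a rough illustration of running inference on only every third frame, the sketch below pairs each frame with the most recent ROI prediction. The source says only that inference ran on every third frame; reusing the last prediction for the frames in between is my assumption, and `predict` is a hypothetical callable standing in for the trained model.

```python
def rois_with_reuse(frames, predict, every=3):
    """Yield (frame, rois) pairs, calling predict() once per `every` frames.

    Frames between predictions reuse the most recent result, so the
    encoder always has an ROI map without waiting on the model.
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    last_rois = None
    for index, frame in enumerate(frames):
        if index % every == 0:
            last_rois = predict(frame)  # run the (hypothetical) model
        yield frame, last_rois
```

In a real pipeline the prediction would more likely run asynchronously alongside the encoder, but the reuse pattern would be the same.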

The researchers trained and tested on four games: FIFA 20 (a soccer game), CS:GO (a first-person shooter), NBA Live 19, and NHL 19, and performed multiple analyses. First, they compared their predicted ROIs to actual ROIs detected using a Gazepoint GP3 eye-tracking device. Here, accuracy scores ranged from a high of 85.95% for FIFA 20 to a low of 73.96% for NHL 19.

Then, the researchers compared the quality in the ROI regions with an unidentified “state-of-the-art H.265 video encoder” using SSIM and PSNR. BD-Rate savings for SSIM ranged from 33.01% to 20.80%, and from 35.06% to 19.11% for PSNR. They also compared overall frame quality using VMAF, which yielded nearly identical scores, indicating that DeepGame didn’t degrade overall quality despite the bandwidth savings and improved quality within the regions of interest.

The authors also performed a subjective study with the FIFA 20 and CS:GO games using x264 with and without DeepGame inputs. The mean opinion scores incorporated the entire game experience, including lags, distortions, and artifacts. In these tests, DeepGame improved the mean opinion scores by up to 33% over the base encoder.



All approaches have their pros and cons. The CAVE approach should be the most accurate in identifying ROIs but requires metadata from game developers. The gaze tracker approach can work with any game but requires hardware that many gamers don’t have and is unproven for webcams. Meanwhile, DeepGame can work with any game but requires per-game training beforehand and running the inference model during gameplay.

All three appear to be viable approaches for improving QoE and reducing bandwidth and latency while working with existing codecs and encoders. Unfortunately, none of the three proposals seems to have progressed towards implementation. This makes ROI encoding for cloud gaming a technology worth watching, if not yet available for implementation.
