Region of Interest Encoding for Cloud Gaming: A Survey of Approaches


As cloud gaming use cases expand, we continue to study new ways to deliver high-quality video at low latency and efficient bitrates.

Region of Interest (ROI) encoding is one way to enhance video quality while reducing bandwidth. This post discusses three ROI-based techniques recently proposed in research papers that may soon be adopted in cloud gaming encoding workflows.

This post is meant to be informative; if I missed any important papers or methods, feel free to contact me.

Region of Interest (ROI) Encoding

ROI encoding allows encoders to prioritize frame quality in the critical regions most closely scrutinized by the viewer and is an established technique for improving viewer Quality of Experience (QoE). For example, NETINT’s Quadra video processing unit (VPU) uses artificial intelligence (AI) to detect faces in videos and then applies ROI encoding to improve facial quality. The NETINT T408/T432 also supports ROI encoding, but the regions must be manually defined in the command string.
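
If you want to experiment with the basic idea on commodity software, FFmpeg’s generic addroi filter marks a rectangle as a region of interest for encoders that honor ROI side data, such as libx264. A minimal example (assuming a reasonably recent FFmpeg build) that boosts quality in the center quarter of the frame looks like this:

-vf 'addroi=x=iw/4:y=ih/4:w=iw/2:h=ih/2:qoffset=-0.4'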

ROI encoding is particularly relevant to cloud gaming, where players expect fast-moving action, high resolutions, and high frame rates, but also want to play over wireless or cellular networks at low bitrates and with ultra-low latency. These factors make cloud gaming a challenging compression environment.

Whether for real-world video or cloud gaming, the challenge with ROI encoding lies in identifying the most relevant regions of interest. As you’ll see, the three papers described below each take a markedly different approach.

In the paper “Content-aware Video Encoding for Cloud Gaming” (2019), researchers from Simon Fraser University and Advanced Micro Devices propose using metadata provided by the game developer to identify the crucial regions. As the article states,

“Identifying relevant blocks is straightforward for game developers, because they know the logic and semantics of the game. Thus, they can expose this information as metadata with the game that can be accessed via APIs... Using this information, one or more regions of interest (ROIs) are defined as bounding boxes containing objects of importance to the task being achieved by the player.”
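
It’s easy to imagine what such developer-exposed metadata could look like. The sketch below is purely illustrative, not an API from the paper; the game_state object and its fields are hypothetical:

from dataclasses import dataclass

@dataclass
class ROI:
    x: int          # top-left corner, in pixels
    y: int
    width: int
    height: int
    weight: float   # relative importance, used later when allocating bits

def rois_for_frame(game_state) -> list[ROI]:
    """Hypothetical per-frame callback a game could expose through its API."""
    return [ROI(obj.x, obj.y, obj.w, obj.h, obj.importance)
            for obj in game_state.important_objects]   # e.g., crosshair, ball, nearby enemies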

The authors label their proposed method CAVE, for Content-Aware Video Encoding. Architecturally, CAVE sits between the game process and the encoder, as shown in Figure 1. Then, “CAVE uses information about the game’s ROIs and computes various encoding parameters to optimize the quality. It then passes these parameters to the Video Encoder, which produces the encoded frames sent to the client.”

Figure 1. The CAVE encoding method is implemented between the game process and encoder.

The results were promising. The technique “achieves quality gains in ROIs that can be translated to bitrate savings between 21% and 46% against the baseline HEVC encoder and between 12% and 89% against the closest work in the literature.”

Additionally, the processing overhead introduced by CAVE was less than 1.21%, which the authors felt could be reduced further through parallelization; implementing the process in silicon could eliminate the additional CPU load entirely.

ROI from Gaze Tracking

Another ROI-based approach was studied in the paper “Cloud Gaming With Foveated Video Encoding” by researchers from Aalto University in Finland and Politecnico di Torino in Italy. In this study, the region of interest was detected by a Tobii 4C Eye Tracker. This data was sent to the server, which used it to identify the ROI and adjust the Quantization Parameter (QP) values for the affected blocks accordingly.

Figure 2. Using region of interest data from a gaze tracker.

The ‘foveated’ in the paper’s title refers to foveation, a “non-uniform sampling response to visual stimuli” that’s inherent to the human visual system. By incorporating foveation, the encoder can allocate the lowest QP values to the region of interest, progressively higher values to the surrounding areas, and seamlessly blend them with the other regions in the frame.

As stated in the paper, to compute the quality of each macroblock, “the gaze location is translated to a macroblock based coordinate system. The macroblock corresponding to the current gaze location is assigned the lowest QP, while the QP of macroblocks away from the gaze location increases progressively with distance from the gaze macroblock.”
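
In other words, QP becomes a function of each macroblock’s distance from the gaze point. The sketch below is my own illustration of that mapping, not the paper’s code; it assumes 16x16 macroblocks and a simple linear ramp between a minimum and maximum QP:

import math

MB_SIZE = 16                # macroblock size in pixels
QP_MIN, QP_MAX = 20, 40     # best quality at the gaze point, lowest at the frame edges

def qp_for_macroblock(mb_x, mb_y, gaze_x, gaze_y, frame_w, frame_h):
    """Return a QP for the macroblock at column mb_x, row mb_y, given a gaze point in pixels."""
    gaze_mb_x, gaze_mb_y = gaze_x // MB_SIZE, gaze_y // MB_SIZE
    dist = math.hypot(mb_x - gaze_mb_x, mb_y - gaze_mb_y)
    max_dist = math.hypot(frame_w / MB_SIZE, frame_h / MB_SIZE)
    return round(QP_MIN + (QP_MAX - QP_MIN) * min(dist / max_dist, 1.0))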

The researchers performed extensive testing and analysis and ultimately concluded that “[o]ur evaluation results suggest that its potential to reduce bandwidth consumption is significant, as expected.” Regarding latency, the paper reports that the “user study establishes the feasibility of FVE [foveated video encoding] for FPS games, which are the most demanding latency wise.”

Obviously, any encoding solution tied to a gaze tracker has limited applicability, but the authors saw a much broader horizon ahead: “[W]e intend to attempt eliminating the need for specialized hardware for eye tracking by employing web cameras for the purpose. Using web cameras, which are ubiquitous in modern consumer computing devices like netbooks and mobile devices, would enable widespread adoption of foveated streaming for cloud gaming.”

Detecting ROI with Machine Learning

Finally, “DeepGame: Efficient Video Encoding for Cloud Gaming” was published in October 2021 by researchers from Simon Fraser University and Advanced Micro Devices, including three authors of the first paper discussed above.

As detailed in the introduction, the authors propose “a new video encoding pipeline, called DeepGame, for cloud gaming to deliver high-quality game streams without codec or game modifications…DeepGame takes a learning-based approach to understand the player contextual interest within the game, predict the regions of interest (ROIs) across frames, and allocate bits to different regions based on their importance.”

At a high level, DeepGame is implemented in three stages:

  1. Scene analysis to gather data
  2. ROI prediction, and
  3. Encoding parameter calculation

Regarding the last stage, these encoding parameters are passed to the encoder via “a relatively straightforward set of APIs,” so it’s not necessary to modify the encoder source code.
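
Taken together, one iteration of such a pipeline might look like the sketch below. This is a paraphrase of the stages described in the paper, not DeepGame’s actual code, and every callable here is supplied by the caller rather than defined by the paper:

def encode_frame(frame, frame_index, analyze, predict_rois, to_qp_map, encoder):
    """One DeepGame-style iteration: analysis, ROI prediction, parameter calculation, encode."""
    features = analyze(frame)                    # Stage 1: scene analysis
    rois = predict_rois(features, frame_index)   # Stage 2: ROI prediction from the learned model
    qp_map = to_qp_map(rois, frame)              # Stage 3: per-region encoding parameters
    return encoder.encode(frame, qp_map)         # parameters reach an unmodified encoder via its API

In practice, predict_rois would not need to run its neural network on every frame; as noted below, the researchers ran inference on every third frame.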

The authors describe their learning-based approach as follows: “DeepGame learns the player’s contextual interest in the game and the temporal correlation of that interest using a spatiotemporal deep neural network.” The schema for this operation is shown in Figure 3.

In essence, this learning-based approach requires some game-specific training beforehand and some processing during gameplay to identify ROIs in real time. The obvious questions are: how much latency does this process add, and how much bandwidth does it save?

Figure 3. DeepGame’s neural network-based schema for detecting region of interest.

Regarding latency, model training is performed offline, only once per game (and again for major upgrades), while inference on the trained model runs during each gaming session. During their testing, the researchers ran the inference model on every third frame and concluded that “ROI prediction time will not add any processing delays to the pipeline.”

The researchers trained and tested four games: FIFA 20 (a soccer game), CS:GO (a first-person shooter), NBA Live 19, and NHL 19, and performed multiple analyses. First, they compared their predicted ROIs to actual ROIs detected with a Gazepoint GP3 eye-tracking device. Accuracy ranged from a high of 85.95% for FIFA 20 to a low of 73.96% for NHL 19.

Then, the researchers compared quality in the ROI regions against an unidentified “state-of-the-art H.265 video encoder” using SSIM and PSNR. BD-Rate savings ranged from 20.80% to 33.01% for SSIM and from 19.11% to 35.06% for PSNR. They also compared overall frame quality using VMAF, which yielded nearly identical scores, showing that DeepGame didn’t degrade overall quality despite the bandwidth savings and the improved quality within the regions of interest.

The authors also performed a subjective study with the FIFA 20 and CS:GO games using x264 with and without DeepGame inputs. The mean opinion scores incorporated the entire game experience, including lags, distortions, and artifacts. In these tests, DeepGame improved the Mean Opinion Scores by up to 33% over the base encoder.

Summary

All approaches have their pros and cons. The CAVE approach should be the most accurate at identifying ROIs but requires metadata from game developers. The gaze-tracker approach works with any game but requires hardware that many gamers don’t have and is unproven with webcams. DeepGame also works with any game but requires per-game training beforehand and runs inference models during gameplay.

All appear to be viable approaches for improving QoE and reducing bandwidth and latency while working with existing codecs and encoders. Unfortunately, none of the three proposals appears to have progressed toward commercial deployment, which makes ROI encoding for cloud gaming a technology worth watching, even if it’s not yet ready for implementation.

Introduction to AI Processing on Quadra


The intersection of video processing and artificial intelligence (AI) delivers exciting new functionality, from real-time quality enhancement for video publishers to object detection and optical character recognition for security applications. One key feature of NETINT’s Quadra Video Processing Units is a pair of onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU. As you’ll learn in this post, this architecture makes Quadra the ideal platform for executing video-related AI applications.

This post introduces the reader to what AI is, how it works, and how you deploy AI applications on NETINT Quadra. Along the way, we’ll explore one Quadra-supported AI application, Region of Interest (ROI) encoding.

About AI

Let’s start by defining some terms and concepts. Artificial intelligence refers to a program that can sense, reason, act, and adapt. One AI subset that’s a bit easier to grasp is machine learning, which refers to algorithms whose performance improves as they are exposed to more data over time.

Machine learning involves the five steps shown in the figure below. Let’s assume we’re building an application that can identify dogs in a video stream. The first step is to prepare your data. You might start with 100 pictures of dogs and then extract features, mathematical representations of the characteristics that identify them as dogs: four legs, whiskers, two ears, two eyes, and a tail. So far, so good.

Figure 1. The high-level AI workflow (from Escon Info Systems)

To train the model, you apply your dog-finding algorithm to a picture database of 1,000 animals, only to find that rats, cats, possums, and small ponies are also identified as dogs. As you evaluate and further train the model, you extract new features from all the other animals that disqualify them from being a dog, along with more dog-like features that help identify true canines. This is the “machine learning” that improves the algorithm.

As you continue to train and evaluate the model, at some point it achieves the desired accuracy and is ready to deploy.
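
As a toy illustration of those steps (prepare data, extract features, train, evaluate, deploy), the sketch below uses scikit-learn on made-up feature vectors. It is not how Quadra models are built; it simply shows the workflow in miniature:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Prepare data: each row is a hand-crafted feature vector (legs, ears, tail length in cm)
X = [[4, 2, 30], [4, 2, 8], [2, 2, 0], [4, 2, 25], [4, 2, 10], [2, 2, 4]]
y = ["dog", "cat", "bird", "dog", "cat", "bird"]

# Hold back some examples so the model can be evaluated on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Train, then evaluate; in practice you iterate on features and data until accuracy is acceptable
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

# Deploy: persist the trained model and run it on new, unlabeled inputs
print("prediction:", model.predict([[4, 2, 28]]))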

The NETINT AI Tool Chain

Then it’s time to run the model. Here, you export the model for deployment on an AI-capable hardware platform like the NETINT Quadra. What makes Quadra ideal for video-related AI applications is the power of the Neural Processing Units (NPUs) and the proximity of the video to the NPUs. That is, since the video is processed entirely in Quadra, there are no transfers to a CPU or GPU, which minimizes latency and enables faster performance. More on this below.

Figure 2 shows the NETINT AI Toolchain workflow for creating and running models on Quadra. On the left are third-party tools for creating and training AI-related models. Once these models are complete, you use the free NETINT AI Toolkit to input the models and translate, export, and run them on the Quadra NPUs – you’ll see an example of how that’s done in a moment. On the NPUs, they perform the functions for which they were created and trained, like identifying dogs in a video stream.

Figure 2. The NETINT AI Tool Chain.

Quadra Region of Interest (ROI) Filter

Let’s look at a real-world example. One AI function supplied with Quadra is an ROI filter, which analyzes the input video to detect faces and generate Region of Interest (ROI) data to improve the encoding quality of the faces. Specifically, when the AI Engine identifies a face, it draws a box around the face and sends the box’s coordinates to the encoder, with encoding instructions specific to the box.

Technically, Quadra identifies faces using a YOLOv4 object detection model. YOLO stands for You Only Look Once, a technique that requires only a single pass over the image (one “look”) to detect objects. By way of background, YOLO is a highly regarded family of deep learning object detection models. The original versions of YOLO are implemented using the DARKNET framework, which you see as an input to the NETINT AI Toolkit in Figure 2.
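
If you’re curious what a DARKNET-trained YOLO model does before it is compiled into the .nb format that runs on Quadra’s NPUs, OpenCV’s DNN module can run the original cfg/weights files on a CPU. The snippet below is a generic YOLO inference sketch; the file names are hypothetical, and this is not the NETINT toolchain:

import cv2

net = cv2.dnn.readNetFromDarknet("yolov4_head.cfg", "yolov4_head.weights")  # hypothetical files
frame = cv2.imread("frame.png")
h, w = frame.shape[:2]

# YOLO looks at the whole (resized) image once and outputs candidate boxes with confidences
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

for output in net.forward(net.getUnconnectedOutLayersNames()):
    for det in output:            # each row: center x, center y, width, height, objectness, class scores
        if det[4] > 0.5:          # keep confident detections only
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print("face box:", int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))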

Deep learning differs from the traditional machine learning described above in that it relies on large datasets, rather than human intervention, to create the model. To create the model deployed in the ROI filter, we trained the YOLOv4 model in DARKNET using hundreds of thousands of publicly available labeled images (where the labels are bounding boxes around people’s faces). This produced a highly accurate model with minimal manual input, which is faster and cheaper than traditional machine learning. Obviously, where relevant training data is available, deep learning is a better alternative to traditional machine learning.

Using the ROI Function

Most users will access the ROI function via FFmpeg, where it’s presented as a video filter with the filter-specific command string shown below. To execute the function, you call the filter (ni_quadra_roi), enter the name and location of the model (yolov4_head.nb), and set a QP offset to adjust the quality within each box (qpoffset=-0.6). Negative values increase video quality, while positive values decrease it, so this command string would increase the quality of the detected faces by approximately 60% relative to other regions in the video.

-vf 'ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6'
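
In a complete command line, the filter sits in the usual -vf position ahead of the encoder. The example below is hypothetical; the input file, bitrate, and output file are placeholders, and you should check NETINT’s documentation for the exact Quadra encoder name to use with -c:v:

ffmpeg -i input.mp4 -vf 'ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6' -c:v <quadra_encoder> -b:v 2M output.mp4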

Obviously, the video shown in Figure 3 is highly compressed. In a surveillance application, the ROI filter could preserve facial quality for face detection; in a gambling or similar video compressed at a higher bitrate, it could ensure that the players’ or performers’ faces look their best.

Figure 3. The region of interest filter at work; original on the left, ROI filter on the right.

In terms of performance, a single Quadra unit can process about 200 frames per second or at least six 30fps streams. This would allow a single Quadra to detect faces and transcode streams from six security cameras or six player inputs in an interactive gambling application, along with other transcoding tasks performed without region of interest detection.

Figure 4 shows the processing workflow within the Quadra VPU. Here we see the face detection operating within Quadra’s NPUs, with the location and processing instructions passing directly from the NPU to the encoder. As mentioned, since all instructions are processed on Quadra, there are no memory transfers outside the unit, reducing latency to a minimum and improving overall throughput and performance. This architecture represents the ideal execution environment for any video-related AI application.

Figure 4. Quadra’s on-board AI and encoding processing.

NETINT offers several other AI functions, including background removal and replacement, with others like optical character recognition, video enhancement, camera video quality detection, and voice-to-text on the long-term drawing board. Of course, via the NETINT Tool Chain, Quadra should be able to run most models created in any machine learning platform.

Here in late 2022, we’re only scratching the surface of how AI can enhance video, whether by improving visual quality, extracting data, or enabling any number of as-yet unimagined applications. Looking ahead, the NETINT AI Tool Chain should ensure that any AI model you build will run on Quadra. Once deployed, Quadra’s integrated video processing/AI architecture should ensure highly efficient, extremely low-latency operation for that model.