For cloud gaming, a VPU can deliver 200 simultaneous 720p30 game sessions from a single 2RU server.
When you encode using a video processing unit (VPU) rather than the GPU’s built-in encoder, you decrease your cost per concurrent user (CCU) by 90%, enabling profitability at a much lower subscription price. How is this technically feasible? Two technology enablers make it possible. The first is extraordinarily capable encoding hardware, the VPU, dedicated to the task of high-quality video encoding and processing. The second is peer-to-peer direct memory access (DMA), which delivers video frames from the GPU to the VPU at the speed of memory rather than through much slower CPU-managed transfers over the bus. Let’s discuss these in reverse order.
Peer-to-Peer Direct Memory Access (DMA)
Within a cloud gaming architecture, the primary role of the GPU is to render frames from the game engine output. These frames are then encoded into a standard codec that is easily decoded on a wide cross section of devices. Generally this is H.264 or HEVC, though AV1 is becoming of interest to those with a broader Android user base. Encoding on the GPU is efficient from a data transfer standpoint because rendering and encoding occur on the same silicon die; there’s no transfer of the rendered YUV frame to a separate transcoder over the slower PCIe or NVMe interfaces. However, since encoding requires substantial GPU resources, it dramatically reduces the overall throughput of the system. Interestingly, it’s often the encoder, not the rendering engine, that is at full capacity and thus the bottleneck. Modern GPUs are built for general-purpose graphical operations, so more silicon real estate is devoted to these than to video encoding.
With a dedicated video encoder installed in the system and traditional data transfer techniques, the host CPU can easily manage the transfer of YUV frames from the GPU to the transcoder at low session counts. But as the number of concurrent game sessions increases, the rising probability of dropped frames or corrupted data makes this technique unusable.
NETINT, working with AMD, enabled peer-to-peer direct memory access (DMA) to overcome this limitation. Peer-to-peer DMA lets devices within a system exchange data directly, without CPU involvement in the transfer: the GPU sends frames straight to the VPU, which keeps the bus from becoming clogged as the concurrent session count rises above 48 720p streams.
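To get a feel for how much raw video has to move between the GPU and the VPU, the sketch below estimates the uncompressed data rate at the session counts mentioned in this article. It assumes 8-bit YUV 4:2:0 frames (1.5 bytes per pixel), which the article does not specify, and the function names are hypothetical. Treat it as a back-of-the-envelope estimate rather than a measured figure.

```python
def raw_frame_bytes(width: int, height: int, bytes_per_pixel: float = 1.5) -> int:
    """Size of one uncompressed frame.

    The 1.5 bytes/pixel default is an assumption: 8-bit YUV 4:2:0.
    """
    return int(width * height * bytes_per_pixel)


def raw_stream_rate(width: int, height: int, fps: int, sessions: int = 1) -> int:
    """Bytes per second of uncompressed video for the given number of sessions."""
    return raw_frame_bytes(width, height) * fps * sessions


per_session = raw_stream_rate(1280, 720, 30)
at_48 = raw_stream_rate(1280, 720, 30, sessions=48)
at_200 = raw_stream_rate(1280, 720, 30, sessions=200)

print(f"one 720p30 session: {per_session / 1e6:.1f} MB/s")
print(f"48 sessions:        {at_48 / 1e9:.2f} GB/s")
print(f"200 sessions:       {at_200 / 1e9:.2f} GB/s")
```

Under these assumptions, one session is about 41.5 MB/s, 48 sessions about 2 GB/s, and 200 sessions about 8.3 GB/s of raw frames. A CPU-managed path typically stages each frame in host memory, so it moves every frame twice; a peer-to-peer transfer moves it once.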

The Benefits of Peer-to-Peer DMA
Peer-to-peer DMA delivers multiple benefits. First, by eliminating the need for CPU involvement in data transfers, peer-to-peer DMA significantly reduces latency, which translates to a more responsive and immersive gaming experience for end users. NETINT VPUs feature latencies as low as 8 ms in fully loaded and sustained operation.
In addition, peer-to-peer DMA relieves the CPU of the burden of managing inter-device data transfers. This frees up valuable CPU cycles, allowing the CPU to focus on other critical tasks, such as game logic and physics calculations, optimizing overall system performance and producing a smoother gaming experience.
Peer-to-peer communication also transfers data faster and more efficiently than CPU-managed transfers, which improves productivity and scalability for cloud gaming production workflows.
These factors combine to produce higher throughput without the need for additional costly resources. This cost-effectiveness translates to improved return on investment (ROI) and a major competitive advantage.
Extraordinarily Capable VPUs
Peer-to-peer DMA has no value if the encoding hardware is not equally capable. NETINT VPUs meet that bar.
The reference system that produces 200 720p30 cloud gaming sessions is built on the Supermicro AS-2015CS-TNR server platform with a single GPU and two Quadra T2A VPUs. This server supports AV1, HEVC, and H.264 video game streaming at up to 8K and 60fps, though, as you would expect, simultaneous stream counts drop as you increase frame rate or resolution.
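As a rough illustration of how stream counts might fall at higher resolutions and frame rates, the sketch below scales the 200-session 720p30 reference point by pixel rate. The key assumption, which is not a vendor figure, is that total encoder pixel throughput stays constant across formats. It ignores GPU rendering limits, codec choice, and encoder settings, so real numbers may differ considerably. The function name is hypothetical.

```python
# Reference point taken from the article: 200 sessions at 1280x720, 30 fps.
REF_SESSIONS, REF_W, REF_H, REF_FPS = 200, 1280, 720, 30

# Assumption: total encoder pixel throughput is constant across formats.
PIXEL_BUDGET = REF_SESSIONS * REF_W * REF_H * REF_FPS  # pixels per second


def estimated_sessions(width: int, height: int, fps: int) -> int:
    """Rough session count under the constant-pixel-rate assumption."""
    return PIXEL_BUDGET // (width * height * fps)


formats = {
    "720p30": (1280, 720, 30),
    "720p60": (1280, 720, 60),
    "1080p30": (1920, 1080, 30),
    "1080p60": (1920, 1080, 60),
}

for label, (w, h, fps) in formats.items():
    print(f"{label}: ~{estimated_sessions(w, h, fps)} sessions")
```

Under this assumption, doubling the frame rate halves the session count, and moving from 720p to 1080p (2.25 times the pixels) cuts it to roughly 44% of the reference.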
Quadra T2A is the most capable of the Quadra VPU line, the world’s first dedicated hardware to support AV1. With its embedded AI and 2D engines, the Quadra T2A can support AI-enhanced video encoding, region of interest, and content-adaptive encoding. Coupled with a P2P DMA-enabled GPU, the Quadra T2A allows cloud gaming providers to achieve unprecedented throughput with ultra-low latency.
Quadra T2A is a video processing unit in an add-in card (AIC) half-height, half-length (HHHL) form factor with two Codensity G5 ASICs. It operates in x86 or Arm-based servers and requires just 40 watts at maximum load. It enables cloud gaming platforms to transition from software or GPU-only encoding with up to a 40x reduction in the total cost of ownership.
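The figures quoted for the reference system (two Quadra T2A cards, two ASICs per card, 40 watts per card at maximum load, 200 720p30 sessions) allow a quick estimate of the encoding power per session. The sketch below assumes both cards run at maximum load and share the sessions evenly. It covers the VPUs only, not the GPU, CPU, or the rest of the server.

```python
VPU_COUNT = 2        # Quadra T2A cards in the reference server
ASICS_PER_VPU = 2    # Codensity G5 ASICs per T2A
VPU_MAX_WATTS = 40   # per-card power at maximum load
SESSIONS = 200       # 720p30 sessions on the reference server

# Assumption: both cards at maximum load, sessions split evenly.
encode_watts = VPU_COUNT * VPU_MAX_WATTS
watts_per_session = encode_watts / SESSIONS
sessions_per_asic = SESSIONS // (VPU_COUNT * ASICS_PER_VPU)

print(f"VPU encode power:   {encode_watts} W")
print(f"power per session:  {watts_per_session:.1f} W")
print(f"sessions per ASIC:  {sessions_per_asic}")
```

On these numbers the encoding stage draws at most 80 W in total, or about 0.4 W per session, with each ASIC handling 50 sessions.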

What Can A VPU Do For You?
It makes cloud gaming profitable, finally.
Peer-to-peer DMA is a game-changing technology that reduces latency and increases system throughput. Paired with an extraordinarily capable VPU like the NETINT Quadra T2A, it lets you deliver an immersive gaming experience at a cost per CCU that no competing architecture can match.