Optimizing NVIDIA L4 AI Processing with Quadra Video Decoding
If you’re designing a workstation for video-related AI around the NVIDIA L4 Tensor Core GPU, you may find it tempting to deploy the L4 for video decoding and AI. However, our tests show that the L4’s AI-related performance drops by as much as 50% or more when decoding 90 streams of 1080p video.
You can buy another L4 to restore full AI performance, or you can buy a Quadra T2A to handle the decoding load for less than half the cost. As you'll see from the testing and analysis below, a hybrid Quadra/L4 system should deliver the highest performance at the lowest cost.
The purpose of these tests was to evaluate the performance of TensorRT under varying loads of decoding on an L4 platform. The test was divided into two main components: an AI Test and a Decoding Test.
In the AI Test, we ran three different models with batch sizes of 1 and 8, assessing performance in both fp16 (half-precision) and int8 (quantized) modes. This helped us understand how TensorRT performed under different computational workloads and precision modes.
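As a rough illustration of how throughput is measured per configuration, the loop below sketches a QPS measurement; `run_inference` is a hypothetical stand-in for the actual TensorRT engine execution call, not part of any real API.

```python
import time

def measure_qps(run_inference, batch_size, duration_s=10.0):
    """Measure throughput (queries per second) for one configuration.

    run_inference(batch_size) is a hypothetical stand-in for a TensorRT
    engine execution call; it is assumed to process one batch per call.
    """
    queries = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        run_inference(batch_size)
        queries += batch_size
    elapsed = time.perf_counter() - start
    return queries / elapsed

# Dummy inference call standing in for a real engine, batch size 8;
# in the actual test this would be repeated per model and per precision.
qps = measure_qps(lambda bs: time.sleep(0.001), batch_size=8, duration_s=1.0)
print(f"{qps:.0f} QPS")
```

The same harness runs unchanged for fp16 and int8 engines; only the engine handed to it differs.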
The Decoding Test involved decoding 90 instances of video content. Each instance ran at approximately 36 frames per second (fps). We monitored the fps as more decoding instances were added, noting any decrease in performance.
It's important to highlight that the L4 started encountering memory allocation errors after reaching the 90-instance mark. Decoding was executed with the following FFmpeg command, using hardware acceleration for H.264 decoding:
ffmpeg -loglevel info -vsync 0 -c:v h264_cuvid -f concat -safe 0 -i file.h264.list -f null -
This test aimed to assess how TensorRT handled concurrent decoding tasks and whether it could maintain stable AI performance in the presence of a decoding load.
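For reproducibility, the concurrent decode load can be generated with a small launcher; a minimal sketch, assuming the FFmpeg command and file.h264.list concat playlist shown above:

```python
import subprocess

# The hardware-decode command from the test; file.h264.list is the
# concat playlist referenced above.
CMD = [
    "ffmpeg", "-loglevel", "info", "-vsync", "0",
    "-c:v", "h264_cuvid", "-f", "concat", "-safe", "0",
    "-i", "file.h264.list", "-f", "null", "-",
]

def launch_decoders(n=90, cmd=CMD):
    """Start n concurrent decode processes and wait for all to exit."""
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    return [p.wait() for p in procs]

# launch_decoders(90)  # spawns the full 90-instance load used in the test
```

In the test itself, instances were added incrementally while monitoring per-instance fps, rather than launched all at once.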
Table 1 shows AI fp16 performance numbers with and without decoding instances running in the background. As you can see, the average drop in AI performance when running with decoding instances was 50.67%.
TABLE 1. AI fp16 performance numbers with and without decoding instances running in the background.
Note the substantial decrease in performance.
Table 2 shows AI int8 performance numbers with and without decoding instances running in the background. Note the 57.52% drop in performance.
TABLE 2. AI int8 performance numbers with and without decoding instances running in the background. Note the substantial performance decrease.
These results show a substantial drop in AI performance when running AI workloads concurrently with 90 instances of 1080p video decoding on an L4 platform. The performance decreases for both fp16 and int8 modes are significant, with an average slowdown of approximately 50.67% for fp16 and 57.52% for int8. This performance degradation will have a detrimental impact on the efficiency and responsiveness of AI setups involving video processing, as evidenced by the lower throughput (queries per second, QPS) achieved when decoding instances are added.
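The percentage figures are straightforward relative drops in throughput; a sketch of the arithmetic (the QPS values below are illustrative placeholders, not the measured numbers):

```python
def slowdown_pct(baseline_qps, loaded_qps):
    """Relative AI throughput drop when decoding runs concurrently."""
    return (baseline_qps - loaded_qps) / baseline_qps * 100.0

# Illustrative placeholder values, not measured data: an engine falling
# from 1000 QPS to 493.3 QPS shows the ~50.67% average fp16 drop reported.
assert round(slowdown_pct(1000.0, 493.3), 2) == 50.67
```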
Offloading Decoding is Cost Effective
In this context, adding a dedicated decoder card like the Quadra should improve performance while minimizing overall cost. Here’s why:
Offloading Decoding Work: The primary advantage of the Quadra is that it offloads the video decoding tasks from the GPU, allowing it to dedicate all resources to AI processing. The provided results clearly show that running both AI and decoding on the same platform leads to significant performance degradation. By using the Quadra for decoding, you free up valuable computational resources for AI, resulting in higher AI throughput.
Cost-Effectiveness: The Quadra T2A costs $2,750 (less in high quantities), while the L4 sells for $5,600. A single Quadra T2A can decode well over 90 streams, eliminating the roughly 50% performance reduction caused by decoding on the L4. Achieving the same boost with NVIDIA hardware alone would require a second L4 at $5,600, more than twice the price.
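The comparison reduces to simple arithmetic; a sketch using the list prices quoted above:

```python
L4_PRICE = 5600   # list price of an additional NVIDIA L4, USD
T2A_PRICE = 2750  # list price of a Quadra T2A, USD

# Both options recover the AI throughput lost to decoding on the L4;
# the difference is purely what that recovery costs.
savings = L4_PRICE - T2A_PRICE
print(f"Quadra T2A saves ${savings} per L4 ({savings / L4_PRICE:.0%} less)")
# prints: Quadra T2A saves $2850 per L4 (51% less)
```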
Scalability: Adding a dedicated decoder card can also improve the scalability of your AI setup. As your workload grows and you require more decoding instances, you can easily scale by adding more Quadra cards, which is a cost-effective way to handle increased video processing demands while maintaining AI performance.
If PCIe slots are in short supply, the Quadra T1U offers about half the performance of the T2A in a server-friendly U.2 form factor for $1,500. Combining two T1Us for each L4 in a server architecture would dramatically increase the AI-compute density of the system, saving additional CAPEX and conserving valuable data center real estate.
Summary and Conclusion
These results emphasize the importance of segregating video decoding from AI workloads to maintain optimal performance. Adding a decoder card like the Quadra is the most affordable option for boosting overall system performance, making it a smart choice for system architects looking to enhance AI setups involving video footage and the NVIDIA L4.