Build Your Own Streaming Infrastructure – Software

Build Your Own Streaming Infrastructure - Article by Jan Ozer from NETINT Technologies

My assumption is that you’re currently using a cloud-based service like AWS for your live streaming and are seeking to reduce costs by buying your own transcoding hardware, installing the necessary software, and hosting the server on-premises or in a co-location facility. This article covers the software side.

To begin, let’s acknowledge that AWS and other cloud services have created a well-featured and highly integrated ecosystem for live streaming and distribution. The downside is the cost.

To illustrate the potential savings, I’ll refer to this article, which compared the cost of producing 21 H.264 ladders and 27 HEVC ladders via AWS MediaLive and by encoding with NETINT’s recently launched Logan Video Server. As you can see in the table, MediaLive costs around $400K for H.264 and $1.8 million for HEVC, as compared to $11,140 in both cases for the co-located server.

Streaming Infrastructure - Table from article 'cloud or on-prem'
Table 1. Five-year cost comparison . AWS MediaLive pricing compared to the NETINT Server

While there are less expensive options available inside and outside of AWS, whenever you pay for hardware by the minute or hour of production, you’re vastly overpaying as compared to owning your own hardware. Sure, you say, but it’s so easy compared to running your own hardware.

If that’s a concern, here are some comforting words from David Heinemeier Hansson, co-owner, and CTO of software developer 37signals, the developer of the project management platform Basecamp and email service Hey. Recently, Hansson wrote  Why we’re leaving the cloud, a blog that detailed his companies’ decisions to do just that. Here’s the relevant quote.

Up until very recently, everyone ran their own servers, and much of the progress in tooling that enabled the cloud is available for your own machines as well. Don’t let the entrenched cloud interests dazzle you into believing that running your own setup is too complicated. Everyone and their dog did it to get the internet off the ground, and it’s only gotten easier since.

My wife has chihuahuas, and given their difficulties with potty training, I seriously doubt they could do it, but you get the point. To paraphrase FDR, all you have to fear is fear itself. The bottom line is that running your own live streaming service should cost relatively little CAPEX, will save significant OPEX, and won’t be nearly as challenging as you might be fearing.

Let’s look at your options for the software required to run your homegrown system.

Transcoding and Packaging Software

Figure 1 shows the minimum software and infrastructure needed for a live-streaming service. Presumably, you’ve already got the live production covered, and since AWS doesn’t offer a player, you have that piece addressed as well. You’ll need a content delivery network to deliver your streaming video, but you can continue to use CloudFront or other CDN. The software that you absolutely have to replace is the live transcoding and packaging component.

Here you have three options; multimedia frameworks, media servers, and “other.” Let’s discuss each in turn.

Multimedia Frameworks

Multimedia frameworks are software libraries, tools, and APIs that provide a set of functionalities and capabilities for multimedia processing, manipulation, and streaming. The best-known framework is FFmpeg, followed by GStreamer and GPAC, and they are all available open source.

Build Your Own Streaming Infrastructure - Software- diagram-2
Figure 1. Netflix uses GPAC for its packaging,
a significant technology endorsement for GPAC
and for multimedia frameworks in general.

Multimedia frameworks excel in projects at both ends of the complexity spectrum. For simple projects, like transcoding an input stream to an encoding ladder, you can create a script that inputs the stream, transcodes, and hands the packaged output streams off to a CDN in a matter of minutes. You can use the script to process thousands of simultaneous jobs, all at no charge.

At the other end of the spectrum, these frameworks also excel at complex jobs with idiosyncratic custom requirements that likely aren’t available in a server or commercial software product. The development, maintenance, and modification costs are considerable, but you get maximum feature flexibility if you’re willing to pay that cost.

What you don’t get with these tools is a user interface or simple configuration options – you start with a blank slate and must program in all desired features. What could be as simple as checking a checkbox in a streaming media server could require dozens or even thousands of lines of code in a multimedia framework.

Which takes us to streaming media servers.

Streaming Media Servers

The next category of products are streaming media servers, and it includes Wowza Streaming Engine, Nimble Streamer, and two open-source servers, Red5 and Ant Media Server. These servers tend to excel for most productions in the middle of the complexity spectrum and offer multiple advantages over multimedia frameworks.

There are several reasons why you might choose to use a streaming server over a multimedia framework, including a simplified setup and configuration. Most streaming servers provide out-of-the-box streaming solutions with pre-configured settings and management interfaces that simplify the setup and configuration process. While not all offer GUIs, those that don’t offer simple option selection in configuration files.

Build Your Own Streaming Infrastructure - Software- diagram-3
Figure 2. Wowza Streaming Engine is a highly regarded streaming server

As mentioned above, streaming servers often offer simpler access to advanced features that you’d have to craft by hand with a multimedia framework. They also offer better integration with third-party services like digital rights management (DRM) and content delivery networks. Between the simplified setup, easier access to features, and improved integration with other services, packaged servers can dramatically accelerate getting your live streaming service up and running.

Once you’re operational, you’ll appreciate management interfaces that monitor the health and performance of your streaming infrastructure, track viewer analytics, manage streaming workflows, and make real-time adjustments. If you’re in a dynamic demand environment, some streaming servers offer built-in scalability features and load balancing to manage the load over multiple hard transcoding resources. You’d have to build all that by hand or with plug-ins if using a multimedia framework.

The two potential downsides of streaming servers are cost and customizability. You’ll have to pay a monthly fee for some versions of these servers, and you may find it complicated or nearly impossible to add what you might consider to be essential features.

Other Streaming-Capable Programs

Most companies building their own live-streaming infrastructures will implement either a multimedia framework or a streaming server, but there are other programs that incorporate the core encoding and packaging functions. One such program is Norsk from id3as. Norsk bills itself as “an SDK that enables developers to easily create amazing, dynamic live video workflows and deploy them at any scale.” As such, it combines both video production and streaming server-related functions

You see this in Figure 3. The top portion shows that Norsk supports the typical codecs and packaging formats deployed by live-streaming producers. At the bottom of the figure, you see that Norsk also offers production-oriented features like multiple camera support, graphics and overlays, and transitions.

Build Your Own Streaming Infrastructure - Software- diagram-4
Figure 3. Norsk offers both production and server-related functions.

Interestingly, Norsk doesn’t have a GUI, instead offering a high-level API to simplify configuration and operation, with a Workflow Visualizer component to view the running state of the application. In this fashion, Norsk attempts to provide the configurability of multimedia frameworks with the ease of operation of scripting-driven streaming media servers.

Finding a program like Norsk that combines transcoding and packaging with other essential streaming-related functions makes a lot of sense; there’s one less vendor to onboard and one less product to learn and support. As remote production becomes more common, we expect more programs like Norsk to become available.

Those are your high-level options. If you’re interested in learning more about these and other programs that can drive encoding and packaging for your live transcoder. You should plan to attend our upcoming symposium; details will be available in the next couple of weeks.

Parking Lot Rules, B-Frames, and Ultra Low-Latency Encoding

One of my sweetest memories of bringing up our two daughters was weekly trips to the grocery store. Each got a $5.00 bribe for accompanying their father, which they happily invested in various tchotchkes that seldom lasted the week. When we exited the car, “parking lot rules” always applied, which meant that each daughter held one of Daddy’s hands for the walk to the store. Two girls, two hands, no running around the busy parking lot.

Parking lot rules came to mind as we debugged a decoding latency issue when testing a new server product. Initial tests revealed a decoding latency of up to 200 milliseconds in some high-volume configurations. Given that the encoding latency was under 20 milliseconds, the decoding numbers were uncomfortably high.

Eliminate B-Frames from the Origination Stream

After raising the issue, our testing team implemented a fix, which dropped latency to under 20 milliseconds, and decreased encoding latency as well. The change is the parking-lot-rules corollary for live streamers, which is “for ultra-low latency, eliminate B-frames from your live streaming workflow.” With H.264, this means using the baseline profile, which eliminates B-frames. With H.265, you’ll have to use a GOP structure that does the same.

A quick glance at Figure 1 reveals why B-frames blow-up decoding latency (shoutout to OTTverse, where we grabbed the image). B-frames, of course, incorporate redundancies from frames before and after the frame being encoded. They are packed and decoded out of order. Any frame decoded out of order adds latency – the further they are out of order, the greater the latency.

Figure 1. B-frames are packed out of order and can increase decode latency.

Will eliminating B-frames (or the Baseline H.264 profile) reduce the quality of the incoming stream? Only minimally, if at all. These streams are typically produced at a relatively high bit rate, so B-frames or higher-quality profiles deliver minimal additional quality. It’s even less likely that any decrease in quality would be noticeable in the output stream (see here).

Let’s pause for a moment and reflect on the bigger picture. Figure 2 shows the typical live streaming workflow. We’ve been talking about B-frames in the on-premise encode impacting the decoding latency in the transcoding server. What about B-frames in the transcoding server when encoding streams for delivery to viewers?

Predictably, the result is the same. B-frames introduce the same latency during encoding for delivery, for the same reason–packing frames out of order introduces delays. This is why, when implementing low-latency mode with the NETINT Quadra Video Processing Unit and T408 transcoder, you must use a GOP preset that encodes with consecutive frames.

When you get things right – incoming streams without B-frames and outgoing streams without B-frames, the results are transformative. Let’s have a look.

Tue Low Latency Transcoding

Table 1 below shows actual testing results. This use case involves scaling 1080p AVC input down to 720p for delivery, which is common for interactive gaming, auction sites, and conferencing, and the server can produce 320 streams while encoding AVC, HEVC, and AV1. I don’t have the original data for the input file with B-frames, but as I recall, decoder latency averaged 150 – 200 ms, a noticeable break in a live conversation. Even worse, unlike encoder latency, it didn’t drop significantly in low-delay mode.

As you see in the table, after the fix, total latency is around 160 ms for all outputs in normal, (latency-tolerant) mode. Working with the input file without B-frames, and outputting streams without B-frames, combined encoder and decoder latency plummets to around 22 ms, well under a single frame (which for 30 fps video takes 33 ms to display). That’s low enough for even the most latency-sensitive applications.

Table 1. Encode/decode latency in normal and low-delay mode
(with a properly formatted input file).

How much will the lack of B-frames impact quality in the output encoding ladder? Once again, B-frames have delivered surprisingly little value in the tests that I’ve performed. You can read a good article on the subject here, and access updated data here (see page 22), which show less than a 1% quality difference between streams with and without B-frames. The bottom line, of course, is that if your application needs ultra-low latency, you have to prioritize that over any potential quality loss, though it’s good to know that few, if any, viewers will notice it.  

Returning to the thoughts that prompted this article, when my daughters have their kids, an endearing wish is that they implement parking lot rules in all relevant shopping trips. Given their progress to date, this may not occur in my lifetime. If you’re a live-streaming engineer, you have no similar excuse to ignore the corollary. If latency is critical, make sure you eliminate B-frames from your live-streaming workflows.

PS. The server referenced is the Quadra Video 100 Server, which combines ten Quadra video processing units (VPUs) with a SuperMicro chassis driven by a 32-core CPU. Total cost should be around $20,000 in this configuration. Stay tuned for more details or message us.