Low-latency video is a priority for many streaming engineers. This article reviews three schemas for producing low-latency video, identifies their pros and cons, and summarizes the questions engineers should consider when choosing a low-latency technology and/or service provider.
From an architectural perspective, let’s start with what we all know: HTTP Live Streaming (HLS) and Dynamic Adaptive Streaming over HTTP (DASH) are the de facto stream delivery protocols in most regions of the world. Technically, HLS is an IETF draft specification largely controlled by Apple and is now in its 12th version. In contrast, MPEG-DASH is a true international standard for streaming media, ratified by the Moving Picture Experts Group (MPEG). Both enjoy nearly universal support from encoders, packagers, DRM providers, players, and other streaming infrastructure providers.
By way of background, HLS and DASH were originally developed to enable streaming video delivery without a streaming server. Before they existed, media servers like Adobe’s Flash Media Server maintained a persistent connection with each player to deliver media to the Flash Player via RTMP. Since Adobe charged a license fee for the servers, and each server could manage only a finite number of connections, large-scale streaming events were expensive to produce.
HLS and DASH were the two most significant HTTP-based Adaptive Streaming (HAS) techniques to supplant the Adobe servers (Adobe’s HTTP Dynamic Streaming and Microsoft’s Smooth Streaming are other HAS technologies that some operators still use). Like all HAS technologies, HLS and DASH operate using a combination of media segments and metadata files stored on standard HTTP web servers, as shown in Figure 1. During playback, players retrieve the metadata files to identify the location of the media segments and then download the segments via HTTP as needed to play the video. All logic resides in the player; the server just stores the files.
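To make the division of labor concrete, here is a minimal Python sketch of the HAS model. It assumes a hypothetical origin modeled as a dictionary of stored files and a tiny HLS media playlist with made-up segment names. The player fetches the playlist, parses the `#EXTINF` entries, and then requests each listed segment, while the "server" does nothing but return files.

```python
# A toy model of HAS delivery. The "origin" is just a dictionary of stored
# files; all playback logic lives in the player functions below. Paths and
# payloads are hypothetical placeholders, not real media.
ORIGIN = {
    "/live/stream.m3u8": (
        "#EXTM3U\n"
        "#EXT-X-VERSION:3\n"
        "#EXT-X-TARGETDURATION:10\n"
        "#EXT-X-MEDIA-SEQUENCE:0\n"
        "#EXTINF:10.0,\n"
        "seg0.ts\n"
        "#EXTINF:10.0,\n"
        "seg1.ts\n"
    ).encode("utf-8"),
    "/live/seg0.ts": b"<media bytes 0>",
    "/live/seg1.ts": b"<media bytes 1>",
}


def http_get(path: str) -> bytes:
    """Stand-in for an HTTP GET: the server only returns stored files."""
    return ORIGIN[path]


def parse_media_playlist(text: str) -> list[tuple[float, str]]:
    """Return (duration, uri) pairs from a simple HLS media playlist."""
    entries = []
    duration = 0.0
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXTINF:"):
            # Format: #EXTINF:<duration>,<optional title>
            duration = float(line[len("#EXTINF:"):].split(",", 1)[0])
        elif line and not line.startswith("#"):
            # Any non-blank line that isn't a tag or comment is a segment URI.
            entries.append((duration, line))
            duration = 0.0
    return entries


def play(playlist_path: str) -> list[bytes]:
    """Player logic: fetch the metadata file, then each segment it lists."""
    base = playlist_path.rsplit("/", 1)[0]
    playlist = http_get(playlist_path).decode("utf-8")
    return [http_get(f"{base}/{uri}") for _, uri in parse_media_playlist(playlist)]


segments = play("/live/stream.m3u8")  # [b"<media bytes 0>", b"<media bytes 1>"]
```

A real player would also choose among ABR renditions via a multivariant playlist and, for live streams, reload the playlist periodically; both are omitted here.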
Figure 1. All HAS technologies operate using metadata files and media segments.
From a latency perspective, most HAS deployments used ten-second segments, and most players buffered three segments before starting playback. Do the math, and you get a minimum of thirty seconds of latency, long enough for your neighbor watching via satellite to see the goal, cheer the goal, hug his or her significant other, and grab another beverage from the fridge, leaving you uncomfortably wondering “what just happened?”
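The arithmetic is simple enough to sketch in a few lines of Python. This assumes the player’s startup buffer is the only delay, which understates real-world latency; the helper name `buffer_latency` is illustrative.

```python
def buffer_latency(segment_seconds: float, segments_buffered: int = 3) -> float:
    """Seconds of delay added by the player's startup buffer alone.

    Encoding, packaging, CDN, and network delays are ignored, so real
    glass-to-glass latency is higher than this number.
    """
    return segment_seconds * segments_buffered


for seconds in (10, 2, 1):
    # Prints 30 s, 6 s, and 3 s respectively.
    print(f"{seconds:>2} s segments x 3 buffered = {buffer_latency(seconds):g} s")
```

The same formula shows why one- to two-second segments bring the buffer component down to roughly three to six seconds.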
While the latency is awful, consider all the benefits that HAS techniques deliver. Because they transfer via HTTP, HAS technologies are supported by all browsers and virtually all video players in smart TVs, OTT dongles, smartphones, tablets, and other devices. Players retrieve ordinary files over standard HTTP, so delivery doesn’t require a special media server, is firewall-friendly, and can use normal CDNs, just like other web content.
Adaptive bitrate (ABR) delivery, with multiple renditions at different bitrates so each client can play the one its bandwidth supports, is standard, as are captions and multiple-language support, all of which promote a very high quality of experience (QoE). Advertising insertion is available, as well as studio-quality digital rights management (DRM) via techniques like Widevine, PlayReady, FairPlay, and the newest entrant, Huawei WisePlay.
So, except for latency, HAS techniques are quite effective for both the viewer and the publisher. Engineers seeking to reduce latency from thirty seconds to five seconds or so can cut segment durations to one or two seconds, but going below this can start to degrade video quality, since each segment typically begins with a keyframe and more frequent keyframes reduce compression efficiency. To reduce latency below three seconds or so, you must switch to a low-latency variant of HLS or DASH.
Table 1. Feature sets of the different technology options.
HLS, DASH, AND CMAF
Let’s briefly explore what CMAF is and how it relates to DASH and HLS. Where DASH and HLS are adaptive streaming protocols, CMAF, which stands for the Common Media Application Format, is a container format for streaming media. In practice, CMAF lets a single set of media files be referenced by multiple sets of manifest files, so one group of files can serve multiple targets.
Today, many engineers use CMAF to combine one set of media files with separate DASH and HLS manifests. This saves encoding costs since you produce one set of files rather than two and halves the footprint on the streaming server. Technically, you don’t deliver via CMAF, you deliver via HLS or DASH using media packaged in a CMAF container.
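As a rough illustration of the “one set of media, two manifests” idea, the Python sketch below writes an HLS media playlist and a DASH MPD that both reference the same CMAF files. The file names, four-second segment duration, bitrate, resolution, and codec string are hypothetical placeholders, and a production packager would emit many more attributes (multiple renditions, audio, detailed timing).

```python
# Hypothetical CMAF output: one initialization segment plus numbered fMP4
# media segments. Names, duration, bitrate, and codec string are placeholders.
INIT = "init.mp4"
SEGMENT_SECONDS = 4
SEGMENTS = [f"seg_{n}.m4s" for n in range(1, 4)]


def hls_media_playlist() -> str:
    """HLS media playlist pointing at the shared CMAF files."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:7",
        f"#EXT-X-TARGETDURATION:{SEGMENT_SECONDS}",
        "#EXT-X-MEDIA-SEQUENCE:1",
        f'#EXT-X-MAP:URI="{INIT}"',  # CMAF header, shared with DASH
    ]
    for uri in SEGMENTS:
        lines += [f"#EXTINF:{SEGMENT_SECONDS:.1f},", uri]
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


def dash_mpd() -> str:
    """Minimal static DASH MPD pointing at the same CMAF files."""
    total = SEGMENT_SECONDS * len(SEGMENTS)
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
     profiles="urn:mpeg:dash:profile:isoff-live:2011"
     mediaPresentationDuration="PT{total}S" minBufferTime="PT{SEGMENT_SECONDS}S">
  <Period>
    <AdaptationSet mimeType="video/mp4" segmentAlignment="true">
      <Representation id="video" bandwidth="3000000" codecs="avc1.64001f"
                      width="1280" height="720">
        <!-- $Number$ takes the values 1, 2, 3, matching seg_1.m4s and so on -->
        <SegmentTemplate initialization="{INIT}" media="seg_$Number$.m4s"
                         startNumber="1" timescale="1" duration="{SEGMENT_SECONDS}"/>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
"""
```

Both functions point at exactly the same `init.mp4` and `seg_N.m4s` files, which is where the storage and encoding savings come from.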
Obviously, if you can package one set of media files for delivery via DASH or HLS, the two protocols must work similarly. That’s why I’m bundling HLS and DASH into a single column in Table 1. Since CMAF is a container format, not a streaming protocol, I’ve left it out of the table, though I could just as easily have included it in the HLS/DASH column.
Figure 2. LL DASH uses chunks to reduce latency. (www.theoplayer.com/blog/low-latency-dash).
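Low-latency HLS and DASH divide each segment into smaller chunks that players can retrieve as soon as they’re produced, rather than waiting for the complete segment. Instead of buffering three complete segments before playing, the LL DASH or LL HLS player typically buffers three to four chunks. In operation, this can reduce latency to perhaps three to five seconds, often more, depending on multiple factors.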
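In LL HLS, these chunks are called partial segments, and the playlist tells the player how far behind the live edge to start. The Python sketch below builds the two relevant playlist tags for an assumed half-second part duration; the function name and defaults are illustrative. Per the HLS specification, `PART-HOLD-BACK` must be at least twice `PART-TARGET` and should be at least three times it, which mirrors the three-to-four-chunk buffer described in the paragraph above.

```python
def ll_hls_server_tags(part_target: float, parts_held_back: float = 3.0) -> list[str]:
    """Build the LL HLS tags that set the part size and startup distance.

    PART-HOLD-BACK (in seconds) tells players how far behind the live edge
    to start; the spec requires >= 2x PART-TARGET and recommends >= 3x.
    """
    if parts_held_back < 2:
        raise ValueError("PART-HOLD-BACK must be at least twice PART-TARGET")
    hold_back = part_target * parts_held_back
    return [
        f"#EXT-X-PART-INF:PART-TARGET={part_target:.3f}",
        f"#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK={hold_back:.3f}",
    ]


for tag in ll_hls_server_tags(0.5):
    print(tag)
```

With half-second parts, the startup buffer is about 1.5 seconds instead of the 6 seconds that three full two-second segments would require; encoding, packaging, and delivery delays come on top of that.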
In terms of compatibility, most, but not all, current players support LL HLS/DASH, so if you’re going to deploy either technology, verify player compatibility first. Both technologies are backward compatible, so legacy players that don’t support low latency simply play the streams at normal latency. Otherwise, LL HLS/DASH retains all the other benefits of HAS delivery, including standard CDNs, ABR delivery, DRM, captions, advertising, and multiple-language support.
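Backward compatibility works in HLS because clients are required to ignore tags they don’t recognize. The Python sketch below uses an abbreviated, hypothetical LL HLS playlist: a legacy-style parser that skips all tags sees only the complete segments (normal latency), while an LL-aware parser also reads the partial segments listed in `#EXT-X-PART` tags. File names and durations are made up for illustration.

```python
import re

# Abbreviated, hypothetical LL HLS media playlist: two complete two-second
# segments, then the first parts of a third that is still being produced.
LL_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:2
#EXT-X-PART-INF:PART-TARGET=0.5
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.5
#EXT-X-MEDIA-SEQUENCE:100
#EXT-X-MAP:URI="init.mp4"
#EXTINF:2.0,
seg100.m4s
#EXT-X-PART:DURATION=0.5,URI="seg101.0.m4s",INDEPENDENT=YES
#EXT-X-PART:DURATION=0.5,URI="seg101.1.m4s"
#EXT-X-PART:DURATION=0.5,URI="seg101.2.m4s"
#EXT-X-PART:DURATION=0.5,URI="seg101.3.m4s"
#EXTINF:2.0,
seg101.m4s
#EXT-X-PART:DURATION=0.5,URI="seg102.0.m4s",INDEPENDENT=YES
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg102.1.m4s"
"""


def legacy_segment_uris(playlist: str) -> list[str]:
    """A pre-LL player: keeps only full-segment URI lines."""
    uris = []
    for raw in playlist.splitlines():
        line = raw.strip()
        # Tags and comments start with '#'; the LL tags are simply skipped.
        if line and not line.startswith("#"):
            uris.append(line)
    return uris


def ll_part_uris(playlist: str) -> list[str]:
    """An LL-aware player: also reads partial segments from EXT-X-PART tags."""
    uris = []
    for line in playlist.splitlines():
        if line.startswith("#EXT-X-PART:"):
            match = re.search(r'URI="([^"]+)"', line)
            if match:
                uris.append(match.group(1))
    return uris
```

The legacy parser returns only `seg100.m4s` and `seg101.m4s`, so it plays at normal latency, while the LL-aware parser can start on parts as small as half a second.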
THEO High-Efficiency Streaming Protocol (HESP)
THEO Technologies invented HESP, which is now managed by the HESP Alliance. The Alliance includes multiple members that provide infrastructure support for HESP, including EZDRM and BuyDRM (digital rights management), MediaMelon (optimization and analytics), Ceeblue (transcoding), and multiple turnkey service providers. Like Apple’s HLS, HESP has been published as a draft informational specification through the Internet Engineering Task Force (IETF).
Technically, HESP is a HAS technique that offers lower latency while retaining all the benefits of HAS delivery. HESP works by creating two streams for each live event (Figure 3).
First is the initialization stream, which contains only keyframes. This allows the player to start playback on any frame, not just at the start of a group of pictures (GOP), which is often one to two seconds long. The second is the continuation stream, which is encoded normally and is the stream the player displays after startup.
When a viewer joins the stream, the player first loads a single frame from the initialization stream, which plays immediately and can reduce latency to under a second. The player then retrieves subsequent frames from the continuation stream. In Figure 3, the player starts by retrieving a keyframe from the initialization stream (C1), then continues with frames from the continuation stream (d1 and so on).
The same thing happens when the viewer switches between videos. The player grabs a frame from the initialization stream of the second video (G2) to enable the immediate switch, then retrieves additional frames from that video’s continuation stream (h2, i2, j2) to deliver normal quality.
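The join and switch behavior can be modeled with frame labels matching Figure 3. This Python sketch is a simplification, not HESP’s actual packaging or player logic: uppercase labels stand for initialization-stream keyframes and lowercase labels for continuation-stream frames, and `frames_until_next_keyframe` models how long a conventional player, with no initialization stream, would wait for the next GOP to start. The function names and the 30-frame GOP in the usage lines are assumptions.

```python
def frame_labels(video: int, count: int) -> tuple[list[str], list[str]]:
    """Label frames as in Figure 3: uppercase = initialization stream
    (every frame a keyframe), lowercase = continuation stream."""
    letters = [chr(ord("a") + i) for i in range(count)]
    init = [f"{c.upper()}{video}" for c in letters]
    cont = [f"{c}{video}" for c in letters]
    return init, cont


def hesp_playout(init: list[str], cont: list[str], join: int, frames: int) -> list[str]:
    """Frames shown after joining (or switching) at position `join`:
    one keyframe from the initialization stream, then continuation frames."""
    return [init[join]] + cont[join + 1 : join + frames]


def frames_until_next_keyframe(join: int, gop_length: int) -> int:
    """Simplified model of a player WITHOUT an initialization stream:
    it must wait for the next GOP boundary before it can start decoding."""
    return (-join) % gop_length


init1, cont1 = frame_labels(1, 10)
init2, cont2 = frame_labels(2, 10)
print(hesp_playout(init1, cont1, join=2, frames=3))   # ['C1', 'd1', 'e1']
print(hesp_playout(init2, cont2, join=6, frames=4))   # ['G2', 'h2', 'i2', 'j2']
print(frames_until_next_keyframe(2, gop_length=30))   # 28
```

At 30 frames per second, waiting 28 frames for the next keyframe costs nearly a second, which is the delay the initialization stream avoids.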
As shown in Figure 4, you can implement HESP using standard production techniques, a standard encoder, and a regular CDN, but you need a HESP-compatible packager and player. Interestingly, the HESP continuation stream is backward compatible with LL HLS and LL DASH, so viewers without a HESP player fall back to LL HLS or LL DASH latency. Once implemented, HESP delivers all the benefits of HAS technologies, including ABR delivery, DRM support, and advertising support.
Unlike all the other technologies discussed, HESP is royalty-bearing, with royalties assessed on usage volume, subscriptions, and hardware devices. The annual cap for software and subscriptions is $2.5 million, with a $25 million annual cap on devices sold.
To be fair, MPEG LA did attempt to form a patent pool on DASH-related technology in 2015, but it closed in 2019. Later in 2019, two companies sued Showtime, Vudu, and Crackle over the same patents, though the patent at issue was found invalid in 2022.
WebRTC
Google created WebRTC to enable real-time audio/video communications between browsers without plug-ins. Today, WebRTC is an open-source project and standard published by both the IETF and W3C. These standards mean near-universal support by computer and mobile browsers, though support in smart TVs, OTT dongles, and game devices is not nearly as pervasive as support for HLS and DASH.
Originally used for simple peer-to-peer communications, WebRTC was later used to power conferencing applications and has been extended into one-to-many live streaming applications, primarily because it offers sub-500 ms latency. It’s useful to consider WebRTC from two perspectives: what you get out of the box and what product and service providers can build around it to make it more useful. Let’s explore how WebRTC works and then return to this observation.
Figure 5 shows the basic WebRTC operating schema in peer-to-peer mode. Two peers wishing to connect meet through a signaling server. Once connected, they exchange audio/video data directly, in real time, using the User Datagram Protocol (UDP) rather than HTTP. These architectural differences from HAS technologies have multiple implications.
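The signaling step can be illustrated with a toy relay. This Python sketch is not WebRTC itself: the class and peer names are hypothetical, and the placeholder strings stand in for the SDP offers/answers (and ICE candidates) that real peers exchange. It shows the key architectural point: the server only brokers the introduction, and media then flows peer to peer.

```python
from collections import defaultdict


class SignalingServer:
    """Toy signaling relay: it only forwards messages between named peers.

    In a real WebRTC application these messages are SDP offers/answers and
    ICE candidates, typically carried over WebSockets or HTTP; the strings
    used here are placeholders. Media never passes through this server.
    """

    def __init__(self) -> None:
        self._mailboxes: defaultdict[str, list[str]] = defaultdict(list)

    def send(self, to_peer: str, message: str) -> None:
        self._mailboxes[to_peer].append(message)

    def receive(self, peer: str) -> list[str]:
        # pop() does not trigger defaultdict's factory, so unknown peers get [].
        return self._mailboxes.pop(peer, [])


def connect(server: SignalingServer) -> tuple[str, str]:
    """Offer/answer exchange between two hypothetical peers, alice and bob."""
    server.send("bob", "offer-from-alice")       # alice proposes a session
    offer = server.receive("bob")[0]
    server.send("alice", f"answer-to:{offer}")  # bob accepts it
    answer = server.receive("alice")[0]
    # From here on, audio/video would flow directly between the peers over UDP.
    return offer, answer
```

Because every viewer needs this exchange, and large events also need media relay servers, server capacity grows with the audience, which is the scaling cost discussed next.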
First, streaming directly over UDP rather than delivering chunks and segments via HTTP means lower latency than HAS technologies. That’s the object of the exercise, so that’s a good thing.
However, because WebRTC relies on servers for signaling and, in one-to-many deployments, for relaying media, WebRTC events can’t scale beyond a certain size without adding servers, which adds to the cost. UDP delivery also means that the audio/video data can’t be delivered by standard CDNs, adding further cost and potentially limiting your reach. Finally, UDP packets may be blocked by some corporate firewalls, limiting access for some viewers, though several techniques can mitigate this risk.
Because WebRTC was originally designed for browser-to-browser communications, it has gained a reputation for low-quality video: most traditional applications involve low-quality webcam video compressed by a browser’s low-quality built-in encoder. WebRTC has also been rightly criticized for its lack of true adaptive streaming, studio DRM, and features like advertising insertion and captions. If you were to develop your own WebRTC application for large-scale live event streaming from scratch, you would have to work around all these limitations.
Fortunately, you don’t have to start from scratch. Multiple vendors like Ant Media, Red5, Wowza, and others offer servers that can ingest and transcode high-quality audio/video streams and automatically add servers and WebRTC-capable CDN capacity to meet viewer demand.
For those seeking a more turnkey experience, multiple vendors like Ant Media, Dolby.io, Wowza, Phenix, and many others also offer WebRTC packages that you can deploy by providing a live feed and writing a check.
Though different providers offer different feature sets, several WebRTC-based services provide true ABR delivery and other HAS-like features, though you may have to deploy a service-specific player. In addition, EZDRM and Castlabs have shown working DRM implementations for WebRTC, so if studio DRM is essential, that’s now available.
Going forward, two new protocols, WHIP and WHEP, may further simplify deploying WebRTC for large-scale live streaming. WHIP stands for the WebRTC-HTTP Ingestion Protocol, and it simplifies high-quality, ultra-low-latency ingest into WebRTC, much like RTMP does for typical live streaming. WHEP stands for the WebRTC-HTTP Egress Protocol, and it standardizes how players request and receive WebRTC streams from servers and CDNs, so any WHEP-compatible player can play streams from any WHEP-compatible service.
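To show how thin WHIP’s HTTP layer is, the Python sketch below builds, but does not send, the two requests a WHIP client makes: an HTTP POST carrying the SDP offer as `application/sdp` (a successful server replies 201 Created with the SDP answer and a `Location` header naming the session) and an HTTP DELETE to that session URL to stop publishing. The endpoint URLs, token, and SDP text are placeholders, the helper names are hypothetical, and the Bearer-token header reflects WHIP’s optional token authentication.

```python
import urllib.request
from typing import Optional


def build_whip_request(endpoint: str, sdp_offer: str,
                       token: Optional[str] = None) -> urllib.request.Request:
    """Build (not send) the WHIP publish request carrying the SDP offer.

    On success a WHIP server replies 201 Created, with the SDP answer in the
    body and a Location header giving the URL of the new session.
    """
    headers = {"Content-Type": "application/sdp"}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(endpoint, data=sdp_offer.encode("utf-8"),
                                  headers=headers, method="POST")


def build_whip_teardown(session_url: str) -> urllib.request.Request:
    """Build the HTTP DELETE that ends a WHIP session."""
    return urllib.request.Request(session_url, method="DELETE")


# Placeholder endpoint, token, and (truncated) SDP offer.
publish = build_whip_request("https://example.com/whip/live", "v=0\r\n", token="secret")
```

Sending `publish` with `urllib.request.urlopen` against a real WHIP endpoint would return the SDP answer; the media itself then flows over WebRTC, not HTTP.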
WebRTC clearly offers lower latency than HAS-based technologies. The only questions are how much extra you’ll pay to achieve that latency and what features, if any, you’ll have to give up.
Other Low Latency Technologies
There are other technologies used for low latency. For example, Nanocosmos offers a low-latency service called nanoStream Cloud that delivers via WebSockets, alongside LL HLS, rather than WebRTC. Like WebRTC, WebSockets offers lower latency than any HAS approach but, as a pure technology, can’t match the features of LL HLS/DASH. However, as with WebRTC, developers like Nanocosmos have built around WebSockets to create fully featured services and technologies worth considering for many types of projects.
Figure 6. Nanocosmos’ nanoStream Cloud service is built around WebSockets.
Questions to Ask
If you’re choosing a turnkey service, you probably care more about performance, features, and cost than the technology underlying the service. Here’s a list of questions to ask when choosing your technology and/or service provider.
- What’s the maximum latency you can tolerate? The lower the latency, the greater the cost. Lower latency also means decreased playback robustness since little, if any, of the stream is pre-buffered.
- What’s the overall projected cost of your event? If WebRTC or another low-latency service increases the cost, do the economics of the project justify that increased price?
- What latency will the service deliver at your scale and geographic distribution?
- What devices are supported (computers, mobile, and living room)?
- What are the critical features that you need, and can the service deliver them? Captions? Advertising? Studio quality DRM? Multiple language support?
- Can the system keep all viewers synchronized to support auctions and sports gambling?
- What codecs are available, and what bandwidths can the system support?
- Does the system support dynamic adaptive streaming, which means that it switches streams during the event to adjust to changing bandwidth conditions (rather than sending a single quality stream)?
- If the system transmits UDP packets, what features are available to avoid firewall blocking?
- Can you use your own player, or do you need to deploy a custom player? If custom, what are the design and branding opportunities?
- What encoders does the system support, and at what quality level?
NETINT’s Low Latency VPUs
All low-latency productions start with low-latency transcoding. For a look at the latency produced by NETINT’s Quadra Video Processing Unit in normal and low-latency modes, check out this review: Unveiling the Quadra Server: The Epitome of Power and Scalability.