WebRTC

πŸš€ How does WebRTC work?​

WebRTC allows browsers to communicate directly with each other in real-time, enabling features like video chatting, voice calls, and file sharing without the need for any additional software or plugins.

Key Components:​

  • Client Devices: These are the devices, such as computers, smartphones, or tablets, that users use to communicate with each other.
  • STUN Server (Session Traversal Utilities for NAT): This server helps in establishing a connection between client devices, especially when they are behind firewalls or NAT (Network Address Translation). It helps clients discover their public IP addresses and assess network conditions.
  • Signaling Server: This server helps clients exchange information necessary to establish a connection. It doesn't transmit media but facilitates negotiation between clients.
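The signaling server's role is deliberately minimal: it only relays opaque messages between peers and never inspects or transports media. A toy in-memory sketch of that relay logic (transport-agnostic; a real deployment would typically run this over WebSockets, and all names here are illustrative):

```javascript
// Minimal in-memory signaling relay sketch (illustrative, not production code).
// Clients register a message handler; relay() forwards a message between them.
class SignalingRelay {
  constructor() {
    this.clients = new Map(); // clientId -> message handler
  }
  register(clientId, onMessage) {
    this.clients.set(clientId, onMessage);
  }
  relay(fromId, toId, message) {
    const handler = this.clients.get(toId);
    if (!handler) throw new Error(`unknown client: ${toId}`);
    // The relay never inspects the payload: SDP offers/answers and ICE
    // candidates are opaque data as far as the signaling server is concerned.
    handler({ from: fromId, payload: message });
  }
}
```

Note that WebRTC intentionally does not standardize the signaling transport; WebSockets, HTTP polling, or even email would all work.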

Step-by-Step Process:​

  • Initialization: Two client devices (let's call them Client A and Client B) initiate a connection by exchanging information through a signaling server.
  • SDP Offer: Client A sends a Session Description Protocol (SDP) offer to Client B via the signaling server. This SDP contains details about the media (like video and audio) that Client A wants to share and its network capabilities.
  • SDP Answer: Client B receives the SDP offer from Client A, generates its own SDP answer containing its media details and network capabilities, and sends it back to Client A through the signaling server.
  • ICE (Interactive Connectivity Establishment): Both clients use ICE to establish the best connection path between them. ICE helps in traversing firewalls and NATs by discovering reachable IP addresses and exchanging connectivity checks.
  • Establishing Peer Connection: Once both clients have exchanged SDP offers and answers and established the best connection path using ICE, they can directly communicate with each other. They exchange media streams (video, audio, data) directly without routing through any intermediate server.
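The ordering of the offer/answer steps above can be made explicit with a small simulation. This uses plain objects in place of the real browser `RTCPeerConnection` API, purely to show who stores what at each step:

```javascript
// Schematic simulation of the SDP offer/answer handshake.
// Plain objects stand in for the real RTCPeerConnection API.
function createPeer(name) {
  return { name, localDescription: null, remoteDescription: null };
}

// Client A creates an offer describing the media it wants to share.
function createOffer(peer) {
  peer.localDescription = { type: "offer", from: peer.name };
  return peer.localDescription;
}

// Client B stores A's offer and produces a matching answer.
function createAnswer(peer, offer) {
  peer.remoteDescription = offer;
  peer.localDescription = { type: "answer", from: peer.name };
  return peer.localDescription;
}

// Back at A: store B's answer; the session is now negotiated.
function acceptAnswer(peer, answer) {
  peer.remoteDescription = answer;
}
```

In a real browser, `RTCPeerConnection`'s `createOffer`, `createAnswer`, `setLocalDescription`, and `setRemoteDescription` play these roles, and ICE candidates are exchanged in parallel over the same signaling channel.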

πŸ‘¨β€πŸ‘©β€πŸ‘¦β€πŸ‘¦ Is it Possible with Multiple Peers?​

πŸ”₯ 1. Mesh Model​

In the mesh model, each peer connects directly to every other peer, so no media server is needed. However, every peer must upload its media once per other participant, and with 6 or more users performance degrades significantly.
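The scaling problem is easy to quantify: with n peers, a full mesh needs n(n-1)/2 connections, and each peer uploads its media n-1 times. A quick sketch:

```javascript
// Connection and upload counts in a full-mesh call with n peers.
function meshConnections(n) {
  return (n * (n - 1)) / 2; // one direct connection per pair of peers
}
function uploadsPerPeer(n) {
  return n - 1; // each peer sends its own stream once per other participant
}
```

At 6 participants that is already 15 connections in total and 5 simultaneous uploads per device, which is why mesh calls degrade quickly on ordinary connections.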

πŸ–₯️ 2. MCU – Multipoint Control Units​

  • MCUs are also referred to as Multipoint Conferencing Units; whichever way you spell it out, the basic functionality is the same.
  • Each peer in the group call establishes a connection with the MCU server to send up its video and audio. The MCU, in turn, makes a composite video and audio stream containing all of the video/audio from each of the peers, and sends that back to everyone.
  • Regardless of the number of participants in the call, the MCU makes sure that each participant gets only one set of video and audio. This means the participants’ computers don’t have to do nearly as much work. The tradeoff is that the MCU is now doing that work instead, so as your calls and applications grow, you will need bigger servers in an MCU-based architecture than in an SFU-based architecture. But your participants can access the streams reliably, and you won’t bog down their devices.
  • Media servers that implement MCU architectures include Kurento (which Twilio Video is based on), Frozen Mountain, and FreeSWITCH.
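The MCU's job can be sketched as: receive one stream per participant, mix everything into a single composite, and send that one composite back to each participant. A toy model with streams as plain objects (the real work of decoding, mixing, and re-encoding media is elided):

```javascript
// Toy MCU: mixes all inbound streams into one composite per participant.
// Real MCUs decode, mix, and re-encode media; here a composite is just
// a record of which sources went into it.
function mcuComposite(streams) {
  return { type: "composite", sources: streams.map((s) => s.from) };
}

function mcuDistribute(streams) {
  const composite = mcuComposite(streams);
  // Every participant receives exactly one stream, regardless of call size.
  const out = new Map();
  for (const s of streams) out.set(s.from, composite);
  return out;
}
```

This is why client-side load stays flat as calls grow, while the server's decode/mix/encode cost grows with every participant.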

πŸ”Š 3. SFU – Selective Forwarding Units​

SFU stands for Selective Forwarding Unit; it is also known in the specifications as SFM (Selective Forwarding Middlebox).

At times the term describes a specific type of video-routing device, while at other times it refers to the routing technique itself rather than a particular device. An SFU is a media server component that receives multiple media streams and decides which of those streams should be forwarded to which participants. Its main use is in supporting group calls and live streaming/broadcast scenarios.

  • In this case, each participant still sends just one set of video and audio up to the SFU, like with our MCU. However, the SFU doesn’t create any composite streams; instead, it sends a separate stream down for each other user. For example, in a call with 5 people, each participant receives 4 streams.
  • The good thing about this is that it’s still less work for each participant than a mesh peer-to-peer model, because each participant establishes only one upload connection (to the SFU) instead of one to every other participant. But it can be more bandwidth-intensive than an MCU, because each participant downloads multiple streams rather than a single composite.
  • The nice thing for participants about receiving separate streams is that they can do whatever they want with them. They are not bound to the layout or UI decisions of an MCU. If you have been in a conference call where the conferencing tool let you choose a different layout (i.e., which speaker’s video is most prominent, or how the videos are arranged on screen), then that tool was using an SFU.
  • Media servers which implement an SFU architecture include Jitsi and Janus.
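The forwarding behavior described above can be sketched in a few lines: the SFU receives one stream per participant and forwards to each participant every stream except their own. (The "selective" part in real SFUs also covers dropping streams or simulcast layers based on bandwidth, which this toy version omits.)

```javascript
// Toy SFU: forwards each inbound stream to every participant except its sender.
// Real SFUs also apply bandwidth-based selection (e.g. simulcast layers).
function sfuForward(streams) {
  const out = new Map();
  for (const recipient of streams) {
    out.set(
      recipient.from,
      streams.filter((s) => s.from !== recipient.from) // no echo of own stream
    );
  }
  return out;
}
```

With 5 participants, each one receives 4 separate streams, which the client UI is then free to lay out however it likes.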

🫧 SFU performance​

WebRTC SFUs are the most common media server architecture today for large group meetings and live streaming services, because they give the best return on investment. You will find SFU implementations in most video conferencing and group video meeting applications. In audio-only use cases they are a bit less popular, though some applications use them there as well.

SFUs don’t process the media but rather route it around. As such, they consume considerably less CPU than their MCU alternative. Their performance relies heavily on network throughput.

When deploying SFU servers, it is recommended to place them as close as possible to the users that need to connect to them, spreading them geographically across the globe.
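One common (simplified) way to realize this placement advice is to assign each user to the region with the lowest measured round-trip time. A minimal sketch, assuming RTT measurements are already available (real deployments would also weigh server load, capacity, and where the other participants of the same call are connected):

```javascript
// Pick the SFU region with the lowest measured round-trip time for a user.
// rttByRegion: { regionName: rttInMilliseconds, ... } (hypothetical input shape)
function pickRegion(rttByRegion) {
  let best = null;
  for (const [region, rtt] of Object.entries(rttByRegion)) {
    if (best === null || rtt < best.rtt) best = { region, rtt };
  }
  if (best === null) throw new Error("no regions available");
  return best.region;
}
```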