How to build the features inside Clubhouse and Twitter spaces?

Aditya Kumar
·May 22, 2022·

11 min read

Live audio applications such as Clubhouse saw a big boom in adoption during the COVID-19 lockdown. This sudden growth inspired many other companies to introduce their own live audio/video products or features. Within a few months, live audio became one of the most requested features not only in social media and gaming apps but also in workplace communication apps like Slack. The typical list of features inside live rooms (or spaces or game rooms) in such apps includes -

  • Live comments: Participating users can comment during live audio discussions. They can also pin comments etc.
  • Live reactions: These emojis fly on your screen at a blinding speed, depicting what the participating audience feels during the live discussion.
  • Tracking speakers, listeners, moderators, etc., in real time: The list of users changes almost instantly. The total count of the participating audience and the list of people attending the room as listeners are essential for the speakers.
  • Muting/unmuting audio instantly - The first level of moderation in such live room-based systems is muting the participant speaking out of turn.
  • Invite other people to become speakers (or move them to listeners) - This has to be as fast and smooth as possible.
  • Raise hands - People in the audience can "raise hands" to express their intent to become a speaker.
  • Live polls - Hosts and moderators can start a poll, and the audience can vote on various options.
  • Other moderation features - like kicking people out, blocking users, etc.

The names and product terminologies of these features may change according to your app. But, if you dig down deep, it's the same set of functionalities, more or less. I am sure there are many other features possible, but I think you get the picture.

However, one requirement matters more than any other for the success of such a product, and that is - Speed.

Everything must be ultra-fast and should feel instantaneous to delight the participating users. Otherwise, we would have been fine with Skype, Zoom, or Google Meet style video meeting products. Right?

Before I go further, let me be absolutely clear. This blog post details the architecture for building features inside a live audio (or video) system. I will not talk about the system design of the live audio streaming part in this post. There are plenty of tutorials (both text and video) on that. But none of them talk about the other real-time features inside the live room. And that's usually the point of failure where the user experience gets messed up.

This brings us to the question - How to design a system that can deliver this ultra-fast experience inside live rooms?

I had a similar challenge while designing the same system for Leher App. It's a community-based social network (similar to Clubhouse, but much older and with more functionality). Our primary audience is people in India's tier 2/tier 3 cities. At the time of writing this post, we have served millions of such live room experiences. Each day, thousands of live rooms run concurrently, and each works at an ultra-fast speed. Let's talk about the high-level product requirements I had when designing the solution.

High-level product requirements

Apart from the list of features mentioned in the introduction section above, I had the following list of additional product requirements -

  • Ability to hold thousands of concurrent listeners - We have powered live rooms with more than 7k concurrent users at a time.
  • Ability to support thousands of comments and reactions - We have powered several rooms with 50k+ reactions and 10k+ comments.
  • Ability to schedule rooms for the future - Most apps call them upcoming events, but they are called "scheduled rooms" in our app. Anyone can schedule a room for the future and invite thousands of people even before the room starts. They can all mark the room as interested or raise hands, etc., even before the room is live.
  • Ability to display the summary of the room - We show the total number of comments, listeners, reactions, duration of the room, etc.
  • Ability to do data analysis of the live room data - It was a requirement from day one to make sound business decisions. We needed the ability to run different types of analysis on live room data and view it from multiple dimensions.

These were the high-level product requirements. Now, let's talk about the high-level engineering requirements (or the engineering problems to solve in such a system)

Engineering problems to solve

Apart from the product requirements, I had to consider many engineering problems. Following are a few of them -

  • High scalability in terms of reads and writes - As mentioned earlier, the app powers thousands of public and private live rooms at any given point in time. These live audio rooms have hundreds (sometimes thousands) of listeners firing tens of thousands of live actions (reactions, comments, etc.) simultaneously.
  • Scaling the WebSocket connections - Keeping track of WebSocket connection pools and scaling them on demand.
  • Cost-effectiveness - Do I need to explain this? It's the de facto requirement for every system in every startup. We needed to ensure that our costs don't grow exponentially as we scale.
  • Observability and security - Protecting our socket communication without affecting the speed was one of the biggest challenges. We also needed to keep track of all data movements.

Now that you have an idea of the complexity of the problem statement in such a system, let me present my solution. To keep the explanation as beginner-friendly and straightforward as possible, I will only explain the high-level architecture, the data flow, and the tools used in the execution. I will not go into the detailed implementation of each service. But if you are working on such a system and face problems, my DMs are always open. Please find me on LinkedIn and Twitter.

Shameless plug

Hi there. Before presenting the solution, I decided to do a shameless plug. If you are looking to learn more about building such highly performant, cost-effective distributed systems, I have created a dedicated learning community called cloudeasy.club just for you. You can learn about system design, software architecture, and distributed systems by joining this club. You can find us on Discord or in our WhatsApp group. Also, feel free to ping me on LinkedIn or Twitter if you have any questions.

The solution: High-level design of our live room system

[Figure: system architecture of the live room system]

Working of the system - Flow of data

The entire system uses Redis as its primary database. This allows us to ensure the fastest possible experience on the user's end. Let's understand the step-by-step flow of data.

  • Our mobile app (client) connects to our WebSocket endpoint using a client certificate and authorization token. The client certificate authentication mechanism ensures that no unauthorized clients can connect to this endpoint. Read more about it on this link.
  • The client sends and receives data (and operations like creating a comment, creating a reaction, etc.) through WebSocket events only.
  • The socket handling service is a monolith that handles all the real-time functionalities inside the live room. We intentionally kept all the socket-related functionalities in one service to reduce complexity and get the best possible performance. It talks directly to our Redis cluster using DB reads/writes. This service is written in Node.js for simplicity and because of the mature WebSocket ecosystem in JavaScript.
  • When a live room ends, we dump all the data we need to store permanently into a separate data warehouse, Google BigQuery (explained later). The way this happens is - when the creator of the live audio room ends it, we trigger a pub-sub event to mark the end of the room. This event is handled by a service called analytics-write-service. It reads the necessary data from our Redis cluster and stores it in Google BigQuery. We use Google Cloud Pub/Sub for sending messages to this analytics-write-service. The service itself is written in Go.
  • All the authorization inside our socket handling service is done using a dedicated auth service (explained later). The auth service is also written in Go.
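The end-of-room step in the flow above can be sketched in a few lines. This is a minimal, in-memory Python illustration, not the actual Go service: the `FakeInfra` class, the `room:<id>:<part>` key names, the event fields, and the shape of the summary row are all assumptions made for the example. A plain dict stands in for Redis, a list for the Pub/Sub topic, and another list for a BigQuery table.

```python
from dataclasses import dataclass, field


@dataclass
class FakeInfra:
    """In-memory stand-ins (all hypothetical): `redis` for the Redis cluster,
    `topic` for the Pub/Sub topic, `warehouse` for a BigQuery table."""
    redis: dict = field(default_factory=dict)
    topic: list = field(default_factory=list)
    warehouse: list = field(default_factory=list)


def end_room(infra, room_id, ended_at):
    """Socket service side: the creator ended the room, so publish an event."""
    infra.topic.append({"type": "room_ended", "room_id": room_id, "ended_at": ended_at})


def handle_room_ended(infra, event):
    """Analytics-write side: read the room data, store a summary row, clean up."""
    room_id = event["room_id"]
    prefix = f"room:{room_id}:"
    comments = infra.redis.get(prefix + "comments", [])
    reactions = infra.redis.get(prefix + "reactions", {})
    listeners = infra.redis.get(prefix + "listeners", set())
    meta = infra.redis.get(prefix + "meta", {})
    row = {
        "room_id": room_id,
        "comment_count": len(comments),
        "reaction_count": sum(reactions.values()),
        "listener_count": len(listeners),
        "duration_seconds": event["ended_at"] - meta.get("started_at", event["ended_at"]),
    }
    infra.warehouse.append(row)
    # Assumption: live keys are removed only after the row is stored.
    for key in [k for k in infra.redis if k.startswith(prefix)]:
        del infra.redis[key]
    return row


def drain(infra):
    """Deliver every pending event to the analytics handler, oldest first."""
    while infra.topic:
        handle_room_ended(infra, infra.topic.pop(0))
```

The point of the sketch is the decoupling: the socket service only publishes a small event and moves on, while the slower read-aggregate-store work happens in a separate consumer.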

The above architecture and data flow enable us to solve all the product and engineering problems I mentioned earlier. Now that you understand the basic data flow inside the system, let's talk about the individual problem statements and their solutions one by one.

Problem 1 - Maintaining socket session information

At the time of writing this blog post, our socket events handling service has 50+ Kubernetes pods (instances of the application) running in our Kubernetes cluster. To ensure that the session information doesn't get mixed up, we use the Socket.IO Redis adapter. It stores state information in our Redis cluster. You can read more about how it works on this link. I will be honest. I was skeptical about the scalability of Socket.IO and this adapter in general. But so far, they have worked like a charm for us. We use Redis pub-sub to power all the real-time functionalities inside the live room.
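To see why a shared pub-sub layer matters when clients of the same room are spread across many pods, here is a conceptual Python sketch. It is not how the Socket.IO Redis adapter is implemented internally; the `Broker` and `Pod` classes, the `room:<id>` channel naming, and the per-user inbox lists are hypothetical stand-ins for Redis pub-sub, a service instance, and open socket connections.

```python
from collections import defaultdict


class Broker:
    """Stand-in for Redis pub-sub: channel name -> subscriber callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self.subscribers[channel]):
            callback(message)


class Pod:
    """One instance of the socket service holding its own client connections."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.local_clients = defaultdict(dict)  # room_id -> {user_id: inbox}

    def join(self, room_id, user_id):
        first_local_client = not self.local_clients[room_id]
        self.local_clients[room_id][user_id] = []
        if first_local_client:
            # Subscribe once per room on this pod, not once per client.
            self.broker.subscribe(f"room:{room_id}", lambda msg: self._deliver(room_id, msg))

    def _deliver(self, room_id, message):
        """Push a broker message to every client of the room connected to this pod."""
        for inbox in self.local_clients[room_id].values():
            inbox.append(message)

    def emit(self, room_id, message):
        """Called when a local client sends an event; fan out through the broker."""
        self.broker.publish(f"room:{room_id}", message)
```

For example, if `alice` is connected to pod `a` and `bob` to pod `b`, calling `a.emit("r1", ...)` reaches both of them, because each pod subscribed to `room:r1` when its first local client joined. No pod needs to know which other pods exist.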

Problem 2 - Modelling complex schemas in Redis

It's 2022, and if you are still using Redis only as a cache, you are yet to discover its true potential. Redis provides sub-millisecond latency even under heavy concurrent load. It has advanced expiry and indexing capabilities, and it now supports the most common database use cases through specialized modules such as RediSearch, RedisJSON, RedisGraph, RedisTimeSeries, RedisBloom, etc.

However, in our case, we decided to rely more on the data structures that Redis provides by default. The idea was to avoid any vendor lock-in. So, we did the heavy lifting on the application side. Redis data structures, if used well, can help you map any object-based database use case. I will explain this in more detail in another blog post. Understanding it will require a good command of all the Redis data structures.

Please note - As soon as a live room ends, we dump all the relevant information into BigQuery and delete the room's data from our Redis databases. This helps us save a ton of costs. Also, we set the expiry of all the keys related to a particular live room to 72 hours. This ensures that even if the cleanup fails for any reason whatsoever, the keys expire automatically.
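As a rough illustration of mapping live room objects onto plain Redis data structures, the Python sketch below builds the commands that could be issued for a few room actions. The actual schema is not described in this post, so the key names, the choice of structure per feature, and the helper functions are assumptions. Only the 72-hour expiry comes from the text, and the commands themselves (SADD, RPUSH, HINCRBY, ZADD, EXPIRE) are standard Redis commands. The functions return command tuples instead of calling a client, so the example runs without a Redis server.

```python
ROOM_TTL_SECONDS = 72 * 60 * 60  # 72-hour safety-net expiry mentioned above


def room_key(room_id, part):
    """Build a namespaced key such as 'room:42:comments' (naming is hypothetical)."""
    return f"room:{room_id}:{part}"


def with_expiry(key, command):
    """Pair a write with an EXPIRE so the key cannot outlive a failed cleanup."""
    return [command, ("EXPIRE", key, ROOM_TTL_SECONDS)]


def add_listener(room_id, user_id):
    # Set: unique members, and SCARD gives the live listener count.
    key = room_key(room_id, "listeners")
    return with_expiry(key, ("SADD", key, user_id))


def add_comment(room_id, comment_json):
    # List: keeps comments in arrival order.
    key = room_key(room_id, "comments")
    return with_expiry(key, ("RPUSH", key, comment_json))


def add_reaction(room_id, emoji):
    # Hash: one counter field per emoji.
    key = room_key(room_id, "reactions")
    return with_expiry(key, ("HINCRBY", key, emoji, 1))


def raise_hand(room_id, user_id, raised_at):
    # Sorted set: score is the timestamp, so hands sort by who raised first.
    key = room_key(room_id, "raised_hands")
    return with_expiry(key, ("ZADD", key, raised_at, user_id))
```

Refreshing the expiry on every write is one possible design. Setting it once when the room is created would also satisfy the 72-hour rule, with fewer commands per action.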

Problem 3 - Analytics

At the time of designing the system, it was absolutely clear to me that irrespective of which database I chose to support the functionalities, the same database wouldn't be able to handle all the analytics queries we wanted to run at scale. That is the reason we went for Google BigQuery (https://cloud.google.com/bigquery). It's a petabyte-scale cloud data warehouse solution that offers advanced query capabilities with standard SQL. It also offers many cool features like automatic visualization (spreadsheets, graphs, you name it), machine learning, observability, etc. But the main reason for choosing Google BigQuery was that it's serverless and hence dirt cheap. There is no provisioning required. You only pay for storage and for the queries you run, and the pricing is suitable for small startups like us. You can read more about it on the official page.

But the idea I am trying to convey here is the separation of concerns. Analytics was a separate business requirement that had nothing to do with the features we were serving to the user. Hence, we decided to solve the two problems using different databases. My recommendation is to always assess your OLAP and OLTP needs separately. In my experience, trying to mix them up often results in bad decisions related to system architecture and, specifically, databases.

Problem 4 - Fast Authorization

We needed the quickest possible way to authorize the user. Our auth system is a centralized system that caters to the auth needs of all the microservices inside our cluster. We didn't design it just for this live room functionality; it's the same system used by every other feature on the app, like creating communities, payments, networking, discovery, etc. And it's super fast for two reasons -

  • We use gRPC calls for communicating with this auth service.
  • The service is native to our Kubernetes cluster. That means the service has no public access. It's accessible only inside our cluster through Kubernetes DNS. It enables us to perform auth checks in less than two milliseconds, even with millions of operations per second.

I will write a separate blog post explaining this auth system design. But I hope you got the idea. Cut down the network time, use fast communication protocols like gRPC, and you can ensure the fastest end-user experience possible without compromising on the security of the system.

Scalability and future challenges

We are at a decent scale, and the system hasn't given us any problems. According to our estimates, the parts of the architecture that will need tweaking are -

  • The session management of socket connections - WhatsApp and Discord faced this problem at scale, and I am sure it will become trouble for us too when we need to deal with millions of concurrent connections.
  • Pub-sub: Even though Redis pub-sub has scaled well for us, I am not sure it is battle-tested for very high concurrency. So, I am not exactly sure how long it will serve us. My hope is "forever" because of its simplicity and intuitiveness.

But, to be very honest, we are not expecting WhatsApp or Discord levels of concurrency anytime soon, and I am sure most businesses will never reach that level. So, if you are not at that scale, this system architecture will serve your needs and will be light on your pockets.

Have questions?

I hope the above system design helps you in the applications you are building at your current organization. You can pick up the solutions to the problems and apply them in different contexts. And if you are developing a live audio application, you may copy and paste the entire architecture without thinking twice. Feel free to ping me on Twitter or LinkedIn if you have any questions. Join my community on Discord or in our WhatsApp group for more content.

 