
Load Testing Reports - Scaling Conversations and Feeds

1. Introduction

We at LikeMinds wanted to evaluate how our Chat and Feed services perform when users interact concurrently. With anticipated user growth and increasing reliance on real-time communication and content sharing, this stress test was key to understanding where our backend infrastructure shines—and where it strains.


2. How We Structured the Test

We tested three different environments with increasing infrastructure strength. Below is a detailed comparison of the infrastructure, including CPU, memory, and service layout:

| Component | Setup 1 (Baseline) | Setup 2 (Optimized DB Compute) | Setup 3 (Feed Handling Optimized) |
| --- | --- | --- | --- |
| API Gateway (Kettle) | 1 pod, 1Gi RAM, CPU: 2 vCPU | 1 pod, 1Gi RAM, CPU: 2 vCPU | 1 pod, 1Gi RAM, CPU: 2 vCPU |
| Chat Service (Caravan) | 2 pods, 3Gi RAM each, CPU: 5 vCPU total | 2 pods, 3Gi RAM each, CPU: 5 vCPU total | 1 pod, 3Gi RAM, CPU: 4.5 vCPU |
| Background Worker | 1 Celery pod, 1Gi RAM, CPU: 1–4 vCPU dynamic | 1 Celery pod, 1Gi RAM, CPU: 1–4 vCPU dynamic | 1 Swarm Worker pod, 1Gi RAM, CPU: 1–3 vCPU |
| Feed Service (Swarm) | 2 pods, 1Gi RAM each, CPU: 2 vCPU | 2 pods, 1Gi RAM each, CPU: 2 vCPU | 5 pods, 3Gi RAM each, CPU: 2–5 vCPU |
| Rate Limiter (Skulk) | 1 pod, 1Gi RAM, CPU: 1 vCPU | 1 pod, 1Gi RAM, CPU: 1 vCPU | 1 pod, 1Gi RAM, CPU: 1 vCPU |
| Database Type | PostgreSQL | PostgreSQL | Azure Cosmos DB (MongoDB vCore) |
| Database Spec | 2 vCores, 8GiB RAM | 4 vCores, 16GiB RAM | 2 instances, 4Gi RAM each (replicated MongoDB cluster) |
| CPU Observed | Peak: 8.5%, Avg: ~5.3% | Peak: 5.62%, Avg: ~1.8% | Peak: 16%, Avg: ~5.2% |
| Memory Usage | Baseline: ~25%, stable under load | Baseline: ~32%, stable under load | Baseline: ~28%, slight increase during peak loads |

We simulated 100, 500, and 1000 users clicking, commenting, liking, and chatting—mimicking a real-life scenario where users are active concurrently.
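For illustration, this kind of mixed chat-and-feed workload can be expressed with a load-testing tool such as Locust. The report does not name the tool we used, and the endpoints, task weights, and wait times below are placeholders rather than our actual API routes; treat this as a minimal sketch of the approach, not the exact script:

```python
# Minimal Locust sketch of a mixed chat/feed workload (hypothetical endpoints and weights).
# Example run: locust -f loadtest.py --users 1000 --spawn-rate 50 --host https://api.example.com
from locust import HttpUser, task, between

class ChatFeedUser(HttpUser):
    # Each simulated user pauses 1-5 s between actions, mimicking real browsing behaviour.
    wait_time = between(1, 5)

    @task(3)
    def view_chatrooms(self):
        self.client.get("/chatrooms", name="View Chatrooms")

    @task(2)
    def like_a_post(self):
        self.client.post("/posts/123/likes", name="Like a Post")

    @task(1)
    def comment_on_post(self):
        self.client.post("/posts/123/comments", json={"text": "Nice!"}, name="Comment on a Post")

    @task(1)
    def create_post(self):
        self.client.post("/posts", json={"text": "Load test post"}, name="Create a Post")
```

Scaling the same script to 100, 500, and 1000 users is then only a matter of changing the `--users` flag.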


3. Performance Metrics and Observations

Chatroom Performance

| Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
| --- | --- | --- | --- |
| View Chatrooms | 1000 | 4955 | 99.2 |
| Check Member Status | 1000 | 4543 | 120.8 |
| DM Status Lookup | 1000 | 4618 | 82.4 |

Upgrading to Setup 2 showed modest improvement in speed and reduced CPU pressure. Memory stayed stable throughout.
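As a quick sanity check, the latency and throughput figures hang together if we assume a closed-loop test where each virtual user pauses between requests (Little's law: users ≈ throughput × (response time + think time)). The think time below is inferred from the table, not something we measured directly:

```python
# Back-of-envelope check for the "View Chatrooms" row (closed-loop assumption).
users = 1000            # concurrent virtual users
resp_time_s = 4.955     # avg. response time from the table, in seconds
throughput_rps = 99.2   # observed throughput, requests/sec

# Little's law: users = throughput * (response_time + think_time)
think_time_s = users / throughput_rps - resp_time_s
print(f"Implied per-user think time: {think_time_s:.1f} s")  # roughly 5 s between requests
```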

Chat Features (Join, Leave, Mute, etc.)

| Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
| --- | --- | --- | --- |
| Join Chatroom | 1000 | 4743 | 81.6 |
| Leave Chatroom (Private) | 1000 | 4627 | 82.9 |
| View Participants | 1000 | 4755 | 80.8 |
| Mute Notifications | 1000 | 6305 | 75.6 |

Mute and topic-related actions tended to be slower, suggesting additional backend processing.

Feed Handling (Content Posting and Interaction)

| Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
| --- | --- | --- | --- |
| Create a Post | 1000 | 57535 | 13.9 |
| Like a Post | 1000 | 35877 | 19.5 |
| Comment on a Post | 1000 | 32873 | 20.2 |

Content-related actions were the most demanding. At the highest concurrency, nearly half of the post-creation requests timed out.
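One way to count timed-out requests in such a run is to mark any response that exceeds a latency budget as a failure. The sketch below uses Locust's catch_response for this; the 30-second budget and the endpoint are illustrative assumptions, not values taken from the report:

```python
# Sketch: flag post-creation requests that blow an assumed latency budget.
from locust import HttpUser, task, between

TIMEOUT_BUDGET_S = 30  # assumed cutoff; the report does not state the actual value

class FeedWriter(HttpUser):
    wait_time = between(1, 5)

    @task
    def create_post(self):
        with self.client.post("/posts", json={"text": "load test"},
                              name="Create a Post", catch_response=True) as resp:
            if not resp.status_code or resp.status_code >= 400:
                resp.failure(f"HTTP error ({resp.status_code})")
            elif resp.elapsed.total_seconds() > TIMEOUT_BUDGET_S:
                resp.failure(f"exceeded {TIMEOUT_BUDGET_S}s latency budget")  # counted in failure stats
            else:
                resp.success()
```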


4. Unique Users Simulation

To better replicate a production environment, we re-ran the tests using different users each time. For instance, 1000 different people each created 1 distinct post (for a total of 1000 posts). The results were revealing:

| Action | Users | Avg. Response Time (ms) | Throughput (requests/sec) |
| --- | --- | --- | --- |
| Create a Post | 1000 | 27074 | 20.7 |
| Like a Post | 1000 | 2769 | 26.0 |
| Comment on a Post | 1000 | 52242 | 15.6 |

For post creation and likes, spreading the load across distinct users reduced response times and improved throughput, underlining the value of user-level optimizations; commenting was the exception and slowed under this pattern, marking it as a candidate for further work.
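A hedged sketch of how this unique-user pattern can be driven from the load generator, with each virtual user authenticating as a different account and creating exactly one distinct post; the credentials file, endpoint, and header names are placeholders, not our actual setup:

```python
# Sketch: one distinct account per virtual user, each contributing exactly one post.
# users.csv is a placeholder file with one pre-provisioned API token per line.
import itertools
from locust import HttpUser, task, between

with open("users.csv") as f:
    _tokens = [line.strip() for line in f if line.strip()]
_token_pool = itertools.cycle(_tokens)  # hand a different token to each spawned user

class UniqueUser(HttpUser):
    wait_time = between(1, 5)

    def on_start(self):
        # Each virtual user picks up its own credentials before the test loop starts.
        self.token = next(_token_pool)
        self.posted = False

    @task
    def create_one_post(self):
        if self.posted:
            return  # each user contributes exactly one post, as in the 1000-posts run
        self.client.post(
            "/posts",
            json={"text": f"post from {self.token[:8]}"},
            headers={"Authorization": f"Bearer {self.token}"},
            name="Create a Post",
        )
        self.posted = True
```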


5. Key Learnings

  • Service Stability: The services handled up to 500 concurrent users very comfortably. At 1000, higher latencies were observed, especially for content-heavy operations, yet the services still managed the load effectively.
  • Infrastructure Scaling: Doubling database resources significantly lowered CPU load without increasing memory usage.
  • Caching Impact: When data wasn't cached (i.e., accessed for the first time), latencies were higher. Services relying on cached data performed more predictably (a generic illustration of the pattern follows this list).
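For context on the caching observation, a read-through cache behaves like the sketch below: the first access misses and pays the full database cost, while later accesses within the TTL are served from the cache. This is a generic illustration of the pattern, not LikeMinds' actual caching layer; the Redis client, key names, and TTL are assumptions:

```python
# Generic read-through cache sketch (illustrative only).
import json
import redis  # assumes a reachable Redis instance

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_S = 60  # assumed TTL

def get_chatroom(chatroom_id: str, load_from_db) -> dict:
    key = f"chatroom:{chatroom_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)        # warm path: predictable, low latency
    room = load_from_db(chatroom_id)     # cold path: first access hits the database
    cache.setex(key, CACHE_TTL_S, json.dumps(room))
    return room
```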

6. What’s Next

  • Further Optimization: Post creation and commenting need more attention to prevent timeouts.
  • Smart Load Distribution: Adding smarter load balancing mechanisms may help maintain performance beyond 1000 users.
  • Monitoring and Auto-scaling: Real-time infrastructure scaling during traffic surges could prevent resource exhaustion.

Final Thoughts

While the services held strong under realistic load conditions, the results provided clear indicators on where we should optimize further. By testing various usage scenarios and infrastructure combinations, we now have a clearer roadmap to scale our platform sustainably.