Load Testing Reports - Scaling Conversations and Feeds
1. Introduction
We at LikeMinds wanted to evaluate how our Chat and Feed services perform when users interact concurrently. With anticipated user growth and increasing reliance on real-time communication and content sharing, this stress test was key to understanding where our backend infrastructure shines—and where it strains.
2. How We Structured the Test
We tested three environments with progressively stronger infrastructure. Below is a detailed comparison of the setups, covering CPU, memory, and service layout:
Component | Setup 1 (Baseline) | Setup 2 (Optimized DB Compute) | Setup 3 (Feed Handling Optimized) |
---|---|---|---|
API Gateway (Kettle) | 1 pod, 1 GiB RAM, 2 vCPU | 1 pod, 1 GiB RAM, 2 vCPU | 1 pod, 1 GiB RAM, 2 vCPU |
Chat Service (Caravan) | 2 pods, 3 GiB RAM each, 5 vCPU total | 2 pods, 3 GiB RAM each, 5 vCPU total | 1 pod, 3 GiB RAM, 4.5 vCPU |
Background Worker | 1 Celery pod, 1 GiB RAM, 1–4 vCPU (dynamic) | 1 Celery pod, 1 GiB RAM, 1–4 vCPU (dynamic) | 1 Swarm Worker pod, 1 GiB RAM, 1–3 vCPU |
Feed Service (Swarm) | 2 pods, 1 GiB RAM each, 2 vCPU | 2 pods, 1 GiB RAM each, 2 vCPU | 5 pods, 3 GiB RAM each, 2–5 vCPU |
Rate Limiter (Skulk) | 1 pod, 1 GiB RAM, 1 vCPU | 1 pod, 1 GiB RAM, 1 vCPU | 1 pod, 1 GiB RAM, 1 vCPU |
Database Type | PostgreSQL | PostgreSQL | Azure Cosmos DB (MongoDB vCore) |
Database Spec | 2 vCores, 8 GiB RAM | 4 vCores, 16 GiB RAM | 2 instances, 4 GiB RAM each (replicated MongoDB cluster) |
CPU Observed | Peak: 8.5%, Avg: ~5.3% | Peak: 5.62%, Avg: ~1.8% | Peak: 16%, Avg: ~5.2% |
Memory Usage | Baseline: ~25%, Stable under load | Baseline: ~32%, Stable under load | Baseline: ~28%, Slight increase during peak loads |
We simulated 100, 500, and 1000 users clicking, commenting, liking, and chatting—mimicking a real-life scenario where users are active concurrently.
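To make the workload concrete, here is a minimal sketch of how such a concurrent-user mix could be scripted with Locust, shown purely for illustration. The endpoint paths, task weights, and payloads are placeholders, not our actual routes:

```python
from locust import HttpUser, task, between


class CommunityUser(HttpUser):
    """Simulated member who browses chatrooms and interacts with the feed."""
    wait_time = between(1, 3)  # think time between actions

    @task(3)
    def view_chatrooms(self):
        # Placeholder route standing in for the Chat service's listing endpoint
        self.client.get("/chatrooms")

    @task(2)
    def like_a_post(self):
        self.client.post("/feed/posts/123/like")  # placeholder post ID and route

    @task(1)
    def comment_on_post(self):
        self.client.post("/feed/posts/123/comments", json={"text": "Nice post!"})

    @task(1)
    def create_post(self):
        self.client.post("/feed/posts", json={"text": "Load-test post"})
```

A run such as `locust -f loadtest.py --host https://<staging-host> -u 1000 -r 50` would ramp up to 1000 concurrent users at 50 users per second; the 100- and 500-user runs only change the `-u` value.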
3. Performance Metrics and Observations
Chatroom Performance
Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
---|---|---|---|
View Chatrooms | 1000 | 4955 | 99.2 |
Check Member Status | 1000 | 4543 | 120.8 |
DM Status Lookup | 1000 | 4618 | 82.4 |
Upgrading to Setup 2 brought a modest improvement in response times and reduced CPU pressure; memory usage stayed stable throughout.
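For clarity on how the two columns in these tables relate, here is one straightforward way to derive average response time and throughput from raw per-request samples. The tuple layout is an assumption about the test client's output, not a fixed format:

```python
from statistics import mean


def summarize(samples):
    """samples: list of (start_timestamp_s, elapsed_ms) pairs from the test client."""
    latencies = [elapsed for _, elapsed in samples]
    starts = [start for start, _ in samples]
    duration_s = max(starts) - min(starts) or 1  # avoid div-by-zero on tiny runs
    return {
        "avg_response_ms": round(mean(latencies), 1),
        "throughput_rps": round(len(samples) / duration_s, 1),
    }
```

Throughput here simply counts completed requests over the active test window.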
Chat Features (Join, Leave, Mute, etc.)
Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
---|---|---|---|
Join Chatroom | 1000 | 4743 | 81.6 |
Leave Chatroom (Private) | 1000 | 4627 | 82.9 |
View Participants | 1000 | 4755 | 80.8 |
Mute Notifications | 1000 | 6305 | 75.6 |
Mute and topic-related actions tended to be slower, suggesting additional backend processing.
Feed Handling (Content Posting and Interaction)
Action | Concurrent Users | Avg. Response Time (ms) | Throughput (requests/sec) |
---|---|---|---|
Create a Post | 1000 | 57535 | 13.9 |
Like a Post | 1000 | 35877 | 19.5 |
Comment on a Post | 1000 | 32873 | 20.2 |
The most demanding actions were content-related. Creating posts at high volume led to nearly half of the requests timing out, especially under the heaviest load.
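Because timeouts dominate this scenario, it helps when the test client counts them separately from other failures. A minimal sketch, assuming a 30-second client-side timeout (the actual threshold in our runs may differ), a placeholder route, and a `collections.Counter` for tallying:

```python
from collections import Counter

import requests

TIMEOUT_S = 30  # assumed client-side limit; the threshold used in our runs may differ


def create_post(session, base_url, payload, stats: Counter):
    """Attempt one post creation and classify the outcome for the report."""
    try:
        # Placeholder route for the Feed service's create-post endpoint
        resp = session.post(f"{base_url}/feed/posts", json=payload, timeout=TIMEOUT_S)
        stats["ok" if resp.ok else "error"] += 1
    except requests.Timeout:
        stats["timeout"] += 1  # tallied separately so the timeout rate stays visible
    except requests.RequestException:
        stats["error"] += 1
```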
4. Unique Users Simulation
To better replicate a production environment, we re-ran the tests with a distinct user for each request. For example, 1000 different users each created one post (1000 posts in total). The results were revealing:
Action | Users | Avg. Response Time (ms) | Throughput (requests/sec) |
---|---|---|---|
Create a Post | 1000 | 27074 | 20.7 |
Like a Post | 1000 | 2769 | 26.0 |
Comment on a Post | 1000 | 52242 | 15.6 |
Spreading the load across distinct users substantially reduced response times and improved throughput for post creation and likes; commenting, however, was slower in this run and remains a clear optimization target.
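Operationally, the only difference in the unique-user runs is that each virtual user acts under its own identity. A rough sketch of pre-provisioning distinct test accounts before a run; the registration endpoint and response field are hypothetical placeholders:

```python
import uuid

import requests


def provision_test_users(base_url, count=1000):
    """Create `count` distinct test accounts and return one auth token per user."""
    tokens = []
    for _ in range(count):
        resp = requests.post(
            f"{base_url}/auth/register",  # hypothetical registration endpoint
            json={
                "username": f"loadtest-{uuid.uuid4().hex[:8]}",
                "password": "load-test-only",
            },
            timeout=10,
        )
        resp.raise_for_status()
        tokens.append(resp.json()["token"])  # assumed response field
    return tokens
```

Each virtual user then authenticates with its own token, so the 1000 posts, likes, and comments are attributed to 1000 different accounts rather than one shared identity.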
5. Key Learnings
- Service Stability: The services handled up to 500 concurrent users comfortably. At 1000, latencies rose, especially for content-heavy operations, but the services still managed the load effectively.
- Infrastructure Scaling: Doubling database resources significantly lowered CPU load without increasing memory usage.
- Caching Impact: When data wasn't cached (i.e., it was being accessed for the first time), latencies were higher; services relying on cached data performed more predictably (see the read-through sketch below).
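The first-access penalty noted in the last point is typical of read-through caching. A generic, Redis-backed sketch of the pattern, shown for illustration rather than as our production code:

```python
import json

import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_S = 300  # assumed TTL; tune per data type


def get_chatroom(chatroom_id, load_from_db):
    """Return a chatroom, serving from Redis when possible (read-through)."""
    key = f"chatroom:{chatroom_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # fast path: already cached
    chatroom = load_from_db(chatroom_id)   # slow path: first access hits the DB
    cache.setex(key, CACHE_TTL_S, json.dumps(chatroom))
    return chatroom
```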
6. What’s Next
- Further Optimization: Post creation and commenting need more attention to prevent timeouts.
- Smart Load Distribution: Adding smarter load balancing mechanisms may help maintain performance beyond 1000 users.
- Monitoring and Auto-scaling: Real-time infrastructure scaling during traffic surges could prevent resource exhaustion.
7. Final Thoughts
While the services held strong under realistic load conditions, the results gave clear indications of where we should optimize further. By testing various usage scenarios and infrastructure combinations, we now have a clearer roadmap for scaling our platform sustainably.