In modern engineering organizations, service owners don't just build services; they are also responsible for their uptime and performance. Ideally, each new feature is thoroughly tested in development and staging environments before going live. Load tests that simulate real users and traffic patterns are run to baseline the capacity of the stack. For significant events like product launches, demand is forecast and resources are allocated to handle it. However, the real world is unpredictable. Despite the best-laid plans, below is a brief glimpse of what could still go wrong (and often does):
- Traffic surges: Virality (Slashdot effect) or sales promotions can trigger sudden and intense traffic spikes, overloading the infrastructure.
- Heavy-hitters and scrapers: A few outlier users or automated scrapers can consume a disproportionate share of a service's capacity, starving regular user requests (see the sketch after this list).
- Unexpected API usage: APIs can occasionally be used in ways that weren't initially anticipated. Such unexpected usage can uncover bugs in the end-client code and expose the system to vulnerabilities, such as application-level DDoS attacks.
- Expensive queries: Certain queries are resource-intensive because of their complexity or poor optimization. They can trigger unexpected edge cases that degrade system performance, or push the system to its vertical scaling limits.
- Infrastructure changes: Routine updates, especially to databases, can sometimes lead to unexpected outcomes, like a reduction in database capacity, creating bottlenecks.
- External API quotas: A backend service might rely on external APIs or third-party services that impose usage quotas. End users are affected when these quotas are exceeded.
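
To make the heavy-hitter scenario above concrete, here is a minimal sketch of a per-client token-bucket limiter, one common way to keep a single client from consuming a disproportionate share of capacity. It assumes requests can be attributed to a client identifier (such as an API key); the `Limiter`/`Allow` names and the rate/burst parameters are illustrative, not part of any specific system described here.

```go
package ratelimit

import (
	"sync"
	"time"
)

// bucket tracks the remaining request budget for a single client.
type bucket struct {
	tokens   float64
	lastSeen time.Time
}

// Limiter is a per-client token bucket: each client may burst up to
// `burst` requests and sustain `rate` requests per second thereafter.
type Limiter struct {
	mu      sync.Mutex
	rate    float64 // tokens refilled per second
	burst   float64 // maximum bucket size
	buckets map[string]*bucket
}

func NewLimiter(rate, burst float64) *Limiter {
	return &Limiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether a request from clientID should be served.
// Requests beyond the client's budget are rejected, so one heavy-hitter
// cannot starve everyone else.
func (l *Limiter) Allow(clientID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.buckets[clientID]
	if !ok {
		b = &bucket{tokens: l.burst, lastSeen: now}
		l.buckets[clientID] = b
	}

	// Refill tokens proportionally to the time elapsed since the last request.
	b.tokens += now.Sub(b.lastSeen).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.lastSeen = now

	if b.tokens < 1 {
		return false // over budget: reject (or deprioritize) this request
	}
	b.tokens--
	return true
}
```

In practice the rejection branch is where policy lives: a service might return HTTP 429, queue the request at lower priority, or serve a degraded response, depending on how much it wants to protect regular traffic.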