Web services must be able to anticipate and manage unpredictable traffic surges, a crucial requirement for businesses with a growing online presence. Failing to do so degrades the user experience and can cost revenue over time. Users expect seamless, reliable service, and disruptions or downtime can severely damage a business's reputation.
By taking a proactive approach to managing unpredictable traffic surges, web services can ensure the ongoing satisfaction of their users and the long-term success of their business. It is crucial for businesses to prioritize user journeys and invest in solutions that can help navigate these challenges.
In this blog post, we will start by exploring the impact of sudden traffic surges on businesses that we know, love, and use. We will then discuss how FluxNinja Aperture can adaptively throttle traffic on the API layer based on saturation signals from the database.
Chess.com’s 2023 Server Overload
In early 2023, traffic on Chess.com crossed the 100M mark. The factors behind these record-high numbers fall into two broad categories:
- Planned - Popular chess tournaments, the acquisition of another popular online chess platform, a new and powerful bot named Mittens that everyone wanted to play against, and the holiday season
- Unplanned - Chess being featured in a wildly popular ad, a major cheating controversy, and a sudden burst of new content around chess (partly because of the factors mentioned above)
The planned factors can be handled by provisioning enough resources, testing thoroughly, and keeping the relevant teams on high alert. But how do you prepare for the unplanned ones? Chess.com saw a boom in its overall user base and number of active users during early 2023, but its servers were struggling. During peak traffic, some users were kicked out of their ongoing games, others were shown “502 Bad Gateway” and “Database Overloaded” errors, and others were simply unable to log in to their favorite chess platform. Understandably, users were not happy.
Screenshot of “Database Overloaded” message as seen by a user trying to log in. Source: Chess.com Forum.
While these were exciting times for the chess fans and the platform alike, new users and existing paying customers of the service were left feeling frustrated. Chess.com continued to see growth over the entirety of the first quarter of 2023 and hurriedly put out another statement regarding the state of their servers and the actions they were taking to alleviate the situation, underscoring the importance of robust infrastructure.
Disney's Ticketing Dilemma
People of all ages enjoy Disney’s theme parks. When sales of Annual Passes (and other seasonal passes) go live, Disney's systems face immense traffic. This overwhelms their servers, and potential customers are put into a “virtual waiting room”.
Screenshot of “virtual waiting room” as seen by users during Disney’s annual ticket sales. Source: WDWMAGIC.
Users have also reported long wait times while purchasing annual passes, purchasing tickets, and making reservations during resort discount releases. More likely than not, statically configured wait times are inadequate and deter users from completing their action. Managing traffic properly and adaptively adjusting the rate at which users are admitted (based on signals from the backend services and databases) is vital to prevent customer dissatisfaction and potential PR issues.
Duolingo's Traffic Bottleneck
Duolingo, a popular language learning application, was no stranger to high traffic surges. During the COVID-19 pandemic, as many countries mandated isolation at home, Duolingo’s usage spiked to new highs. This growth presented unprecedented problems related to scale and storage as their servers started experiencing bottlenecks.
Duolingo users, both regular and paying, reported long loading times when opening new lessons and delays in getting responses. This highlights the broader industry need for more advanced traffic management solutions.
Solution in Action
To demonstrate how Aperture can help in cases like the ones mentioned above, we will use the Playground. The demo setup contains a Go service that reads from a PostgreSQL database upon receiving a request. The Playground includes the Aperture Agent installed as a DaemonSet, and the Go application comes with an Envoy sidecar proxy configured to communicate with the Aperture Agent. You can read more about it here.
The Aperture policy for this demo is built using the load-scheduling/postgresql blueprint with the following values:
Load Scheduling Policy for PostgreSQL
- agent_group: default
The policy watches the percentage of connections in use inside PostgreSQL to detect overloads. A rising value means the load on the database is increasing. If it remains high for a long time, the database is not able to keep up with client requests and returns connection errors.
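As a rough illustration of the signal the policy watches, the sketch below computes the percentage of connections in use and compares it to a setpoint. The function names and the 70% setpoint are assumptions made for this example, not values from the policy; how Aperture actually collects the metric is not shown here. In PostgreSQL, the two inputs could be read with `SELECT count(*) FROM pg_stat_activity` and `SHOW max_connections`.

```python
def connections_used_pct(active: int, max_connections: int) -> float:
    """Return the share of PostgreSQL connections in use, as a percentage."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * active / max_connections


def is_overloaded(active: int, max_connections: int, setpoint_pct: float) -> bool:
    """True when connection usage is above the setpoint (hypothetical helper)."""
    return connections_used_pct(active, max_connections) > setpoint_pct


if __name__ == "__main__":
    # Assumed numbers for illustration: 80 of 100 connections in use,
    # compared against an assumed setpoint of 70%.
    print(connections_used_pct(80, 100))
    print(is_overloaded(80, 100, setpoint_pct=70.0))
```

A usage above the setpoint is the condition that would trigger throttling; a usage at or below it would let traffic recover.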
Aperture throttles traffic when the percentage of connections used is above the setpoint. Throttling means that Aperture controls the rate of requests coming into the service. Any requests in excess of this rate are queued in Aperture. This policy uses the AIMD (Additive Increase/Multiplicative Decrease) strategy. During overload, the policy progressively throttles traffic by 20% every 10 seconds. Once the system has recovered, the policy progressively allows 5% more traffic every 10 seconds.
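The throttle and recovery schedule can be sketched as a small simulation. This is a minimal illustration, not Aperture's implementation: it reads the 20% cut as a multiplicative step and the 5% recovery as an additive step, and the `LoadMultiplier` class, the 70% setpoint, and the 5% floor are assumptions introduced for the example. Each call to `tick` stands for one 10-second interval.

```python
from dataclasses import dataclass


@dataclass
class LoadMultiplier:
    """Fraction of incoming traffic admitted (1.0 means no throttling)."""
    value: float = 1.0
    decrease_factor: float = 0.8   # overload: admit 20% less per interval
    increase_step: float = 0.05    # recovery: admit 5 points more per interval
    floor: float = 0.05            # assumed lower bound so traffic never stops

    def tick(self, connections_used_pct: float, setpoint_pct: float) -> float:
        """Advance one interval and return the new admitted fraction."""
        if connections_used_pct > setpoint_pct:
            self.value = max(self.floor, self.value * self.decrease_factor)
        else:
            self.value = min(1.0, self.value + self.increase_step)
        return self.value


def simulate(samples, setpoint_pct=70.0):
    """Run the controller over a list of connection-usage samples."""
    controller = LoadMultiplier()
    return [round(controller.tick(s, setpoint_pct), 3) for s in samples]


if __name__ == "__main__":
    # Two overloaded intervals followed by two recovered intervals.
    print(simulate([90, 90, 50, 50]))
```

The asymmetry is deliberate in this kind of controller: cutting quickly protects the database, while restoring slowly avoids pushing it straight back into overload.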
In addition to throttling, the policy prioritizes requests while they wait in Aperture's queue. It uses the user-type header to prioritize incoming requests, ensuring that subscriber users are prioritized over other users during overload.
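To make header-based prioritization concrete, here is a minimal sketch that uses a strict-priority queue as a stand-in. Aperture's actual scheduler may work differently; the `PriorityRequestQueue` class, the priority numbers, and the request IDs are hypothetical. Only the user-type header and the subscriber value come from the policy description above.

```python
import heapq
import itertools

# Lower number = served first. Only "subscriber" is named in the policy;
# every other user type falls back to the default priority (an assumption).
PRIORITIES = {"subscriber": 0}
DEFAULT_PRIORITY = 1


class PriorityRequestQueue:
    """Queue that releases subscriber requests before all others."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per class

    def enqueue(self, request_id: str, headers: dict) -> None:
        user_type = headers.get("user-type", "")
        priority = PRIORITIES.get(user_type, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._order), request_id))

    def dequeue(self) -> str:
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = PriorityRequestQueue()
    queue.enqueue("r1", {})
    queue.enqueue("r2", {"user-type": "subscriber"})
    queue.enqueue("r3", {"user-type": "other"})
    queue.enqueue("r4", {"user-type": "subscriber"})
    while len(queue):
        print(queue.dequeue())
```

A strict-priority queue like this one can starve lower-priority requests under sustained overload, which is one reason production schedulers often use weighted approaches instead.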
To see the policy in action, we start a wavepool traffic generator that constantly sends requests to the Go service. It sends subscriber and non-subscriber requests in a 1:1 ratio.
In this blog post, we explored the impact of sudden traffic surges on businesses, and how FluxNinja Aperture can help manage overload scenarios and protect user journeys by adaptively throttling and prioritizing requests based on saturation signals from the database.
As digital services continue to grow, addressing scalability becomes a core concern. It's not just about technology, but also about consistently meeting user expectations. FluxNinja Aperture can help position services to effectively navigate these challenges.
For a more in-depth understanding of Aperture, feel free to explore our Documentation and GitHub repository. Additionally, join our Slack community to discuss best practices, ask questions, and participate in discussions.