· 9 min read
Jai Desai
Sudhanshu Prajapati

At a Glance:

  • The blog aims to demystify the process of coupling HashiCorp Consul, a widely adopted service mesh, with FluxNinja Aperture, a platform specializing in observability-driven load management.
  • HashiCorp Consul and FluxNinja Aperture's technical teams collaborated to enable seamless integration, facilitated through Consul's Envoy extension system and features like external authorization and OpenTelemetry access logging.
  • By integrating these two platforms, the service reliability and performance of networked applications can be significantly improved. The synergy offers adaptive rate limiting, workload prioritization, global quota management, and real-time monitoring, turning a previously manual, siloed process of traffic adjustments into an automated, real-time operation.

In the dynamic world of software, modern applications are undergoing constant transformation, breaking down into smaller, nimble units. This metamorphosis accelerates both development and deployment, a boon for businesses eager to innovate. Yet, this evolution isn't without its challenges. Think of hurdles like service discovery, ensuring secure inter-service communication, achieving clear observability, and the intricacies of network automation. That's where HashiCorp Consul comes into play.

· 11 min read
Harjot Gill
Tanveer Gill

Over the last decade, significant investments have been made in large-scale observability systems, accompanied by the widespread adoption of the discipline of Site Reliability Engineering (SRE). Yet, an over-reliance on observability alone has led us to a plateau, where we are witnessing diminishing returns in overall reliability posture. This is evidenced by persistent and prolonged app failures, even at well-resourced companies that follow observability best practices.

Furthermore, the quest for reliability is forcing companies to spend ever more on observability, rivaling the costs of running the services they aim to monitor. Commercial SaaS solutions are even more expensive, as the unpredictable pricing models can quickly skyrocket the observability bill. The toll isn't only monetary; it extends to the burden shouldered by developers implementing observability and operators tasked with maintaining a scalable observability stack.

· 10 min read
Sudhanshu Prajapati

In modern engineering organizations, service owners don't just build services, but are also responsible for their uptime and performance. Ideally, each new feature is thoroughly tested in development and staging environments before going live. Load tests designed to simulate users and traffic patterns are performed to baseline the capacity of the stack. For significant events like product launches, demand is forecasted, and resources are allocated to handle it. However, the real world is unpredictable. Despite the best-laid plans, below is a brief glimpse of what could still go wrong (and often does):

  • Traffic surges: Virality (Slashdot effect) or sales promotions can trigger sudden and intense traffic spikes, overloading the infrastructure.
  • Heavy-hitters and scrapers: Outlier users can consume a significant portion of a service's capacity, starving regular user requests.
  • Unexpected API usage: APIs can occasionally be utilized in ways that weren't initially anticipated. Such unexpected usage can uncover bugs in the end-client code. Additionally, it can expose the system to vulnerabilities, such as application-level DDoS attacks.
  • Expensive queries: Certain queries can be resource-intensive due to their complexity or lack of optimization. These expensive queries can lead to unexpected edge cases that degrade system performance. Additionally, these queries could push the system to its vertical scaling boundaries.
  • Infrastructure changes: Routine updates, especially to databases, can sometimes lead to unexpected outcomes, like a reduction in database capacity, creating bottlenecks.
  • External API quotas: A backend service might rely on external APIs or third-party services, which might impose usage quotas. End users get impacted when these quotas are exceeded.

· 11 min read
Tanveer Gill

Imagine a bustling highway system, a complex network of roads, bridges, tunnels, and intersections, each designed to handle a certain amount of traffic. Now, consider the events that lead to traffic jams - accidents, road work, or a sudden influx of vehicles. These incidents cause traffic to back up, and often, a jam in one part of the highway triggers a jam in another. A bottleneck on a bridge, for instance, can lead to a jam on the road leading up to it. Congestion creates many complications, from delays and increased travel times to frustrated drivers and wasted fuel. These disruptions don't just hurt drivers; they hit the whole economy. Goods are delayed and services are disrupted as employees arrive late (and angry) at work.

But highway systems are not left to the mercy of these incidents. Over the years, they have evolved to incorporate a multitude of strategies to handle such failures and unexpected events. Emergency lanes, traffic lights, and highway police are all part of the larger traffic management system. When congestion occurs, traffic may be re-routed to alternate routes. During peak hours, on-ramps are metered to control the influx of vehicles. If an accident occurs, the affected lanes are closed, and traffic is diverted to other lanes. Despite their complexities and occasional hiccups, these strategies aim to manage traffic as effectively as possible.

· 10 min read
Sudhanshu Prajapati

We’ve been hearing about rate limiting quite a lot these days, as popular services like Twitter and Reddit implement it. Companies are finding it increasingly important to curb the abuse of their services and keep costs under control.

Before I started working as a developer advocate, I built quite a few things, including integrations and services that catered to specific business needs. One constant while building integrations was the need to be aware of rate limits when making calls to third-party services; I had to make sure my integrations didn't abuse the third-party APIs. On the other hand, third-party services also implement their own rate-limiting rules at the edge to avoid being overwhelmed. But how does all this actually work? How do we set it up? What are the benefits of rate limiting? We’ll cover these topics, and then move on to the reasons why adaptive rate limiting is necessary.
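As a primer on how edge rate limiting works under the hood, here is a minimal token-bucket limiter sketched in Python. This is an illustrative sketch only, not Aperture's or any third-party service's actual implementation, and all names are hypothetical.

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: caller should reject or retry later

# A burst of 15 calls against a 5 req/s limiter with a burst capacity of 10:
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
```

The first ten calls drain the burst capacity and are admitted; subsequent calls are rejected until tokens refill at the configured rate. Services typically key one such bucket per user or API key.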

· 8 min read
Marta Rogala

Have you ever tried to buy a ticket online for a concert and had to wait or refresh the page every three seconds when an unexpected error appeared? Or have you ever tried to purchase something during Black Friday and experienced moments of anxiety because the loader just kept on… well… loading, and nothing appeared? We all know it, and we all get frustrated when errors occur and we don't know what's wrong with the website and why we can't buy the ticket we want.

· 18 min read
Sudhanshu Prajapati

Even thirty years since its inception, PostgreSQL continues to gain traction, thriving in an environment of rapidly evolving open source projects. While some technologies appear and vanish swiftly, others, like PostgreSQL, prove their longevity and withstand the test of time. It has become the preferred choice of many organizations for data storage, from general-purpose workloads to an asteroid-tracking database, with companies running PostgreSQL clusters holding petabytes of data.

· 3 min read
Sudhanshu Prajapati
Karanbir Sohi

San Francisco — FluxNinja is thrilled to announce the General Availability of its innovative open source tool, Aperture. This cutting-edge solution is designed to enable prioritized load shedding driven by observability and graceful degradation of non-critical services, effectively preventing total system collapse. Furthermore, Aperture intelligently auto-scales essential resources only when necessary, resulting in significant infrastructure cost savings.

· 13 min read
Sudhanshu Prajapati

In today's world, the internet is the most widely used technology. Everyone, from individuals to businesses, seeks to establish a strong online presence. This has led to a significant increase in users accessing various online services, resulting in a surge of traffic to websites and web applications.

Because of this surge in user traffic, companies now prioritize estimating the number of potential users when launching new products or websites, since capacity constraints can lead to website downtime. For instance, after the announcement of ChatGPT 3.5, there was a massive influx of traffic and interest from people all around the world. In such situations, it is essential to have load management in place to avoid possible business loss.

· 7 min read
Sudhanshu Prajapati

Graceful degradation and managing failures in complex microservices are critical topics in modern application architecture. Failures are inevitable and can cause chaos and disruption. However, prioritized load shedding can help preserve critical user experiences and keep services healthy and responsive. This approach can prevent cascading failures and allow for critical services to remain functional, even when resources are scarce.
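As a rough illustration of the idea (not Aperture's actual policy engine), the sketch below admits a request only when its priority clears a threshold that rises with observed system load; the priority tiers and load cutoffs are hypothetical.

```python
from enum import IntEnum

class Priority(IntEnum):
    BACKGROUND = 0  # batch jobs, prefetching
    DEFAULT = 1     # regular user traffic
    CRITICAL = 2    # checkout, login, payments

class LoadShedder:
    """Admit a request only if its priority clears the current threshold."""

    def __init__(self):
        self.threshold = Priority.BACKGROUND  # no shedding at low load

    def update(self, load: float):
        # Hypothetical mapping from observed load (0.0-1.0) to a threshold.
        if load > 0.9:
            self.threshold = Priority.CRITICAL   # shed everything non-critical
        elif load > 0.7:
            self.threshold = Priority.DEFAULT    # shed background work only
        else:
            self.threshold = Priority.BACKGROUND

    def admit(self, priority: Priority) -> bool:
        return priority >= self.threshold

shedder = LoadShedder()
shedder.update(load=0.95)  # system under heavy load: only critical traffic admitted
```

Under heavy load, background and default traffic are shed while critical flows continue, which is what keeps the user-facing experience alive during an incident.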

To help navigate this complex topic, Tanveer Gill, CTO of FluxNinja, presented at Chaos Carnival 2023 (March 15-16), a virtual conference with pre-recorded sessions. Even so, attendees could interact with the speakers, who were present throughout their sessions.