· 9 min read
Jai Desai
Sudhanshu Prajapati

At a Glance:

  • The blog aims to demystify the process of coupling HashiCorp Consul, a widely adopted service mesh, with FluxNinja Aperture, a platform specializing in observability-driven load management.
  • HashiCorp Consul and FluxNinja Aperture's technical teams collaborated to enable seamless integration, facilitated through Consul's Envoy extension system and features like external authorization and OpenTelemetry access logging.
  • By integrating these two platforms, the service reliability and performance of networked applications can be significantly improved. The synergy offers adaptive rate limiting, workload prioritization, global quota management, and real-time monitoring, turning a previously manual, siloed process of traffic adjustments into an automated, real-time operation.

In the dynamic world of software, modern applications are undergoing constant transformation, breaking down into smaller, nimble units. This metamorphosis accelerates both development and deployment, a boon for businesses eager to innovate. Yet, this evolution isn't without its challenges. Think of hurdles like service discovery, ensuring secure inter-service communication, achieving clear observability, and the intricacies of network automation. That's where HashiCorp Consul comes into play.

· 11 min read
Harjot Gill
Tanveer Gill

Over the last decade, significant investments have been made in large-scale observability systems, accompanied by the widespread adoption of the discipline of Site Reliability Engineering (SRE). Yet, an over-reliance on observability alone has led us to a plateau, where we are witnessing diminishing returns in overall reliability posture. This is evidenced by persistent and prolonged app failures, even at well-resourced companies that follow observability best practices.

Furthermore, the quest for reliability is forcing companies to spend ever more on observability, rivaling the costs of running the services they aim to monitor. Commercial SaaS solutions are even more expensive, as the unpredictable pricing models can quickly skyrocket the observability bill. The toll isn't only monetary; it extends to the burden shouldered by developers implementing observability and operators tasked with maintaining a scalable observability stack.

· 10 min read
Sudhanshu Prajapati

In modern engineering organizations, service owners don't just build services, but are also responsible for their uptime and performance. Ideally, each new feature is thoroughly tested in development and staging environments before going live. Load tests designed to simulate users and traffic patterns are performed to baseline the capacity of the stack. For significant events like product launches, demand is forecasted, and resources are allocated to handle it. However, the real world is unpredictable. Despite the best-laid plans, below is a brief glimpse of what could still go wrong (and often does):

  • Traffic surges: Virality (Slashdot effect) or sales promotions can trigger sudden and intense traffic spikes, overloading the infrastructure.
  • Heavy-hitters and scrapers: Outlier users can consume a significant portion of a service's capacity, starving regular user requests.
  • Unexpected API usage: APIs can occasionally be utilized in ways that weren't initially anticipated. Such unexpected usage can uncover bugs in the end-client code. Additionally, it can expose the system to vulnerabilities, such as application-level DDoS attacks.
  • Expensive queries: Certain queries can be resource-intensive due to their complexity or lack of optimization. These expensive queries can lead to unexpected edge cases that degrade system performance. Additionally, these queries could push the system to its vertical scaling boundaries.
  • Infrastructure changes: Routine updates, especially to databases, can sometimes lead to unexpected outcomes, like a reduction in database capacity, creating bottlenecks.
  • External API quotas: A backend service might rely on external APIs or third-party services, which might impose usage quotas. End users get impacted when these quotas are exceeded.

· 11 min read
Tanveer Gill

Imagine a bustling highway system, a complex network of roads, bridges, tunnels, and intersections, each designed to handle a certain amount of traffic. Now, consider the events that lead to traffic jams - accidents, road work, or a sudden influx of vehicles. These incidents cause traffic to back up, and often, a jam in one part of the highway triggers a jam in another. A bottleneck on a bridge, for instance, can lead to a jam on the road leading up to it. Congestion creates many complications, from delays and increased travel times to frustrated drivers and wasted fuel. These disruptions don't just hurt drivers; they hit the whole economy. Goods are delayed and services are disrupted as employees arrive late (and angry) at work.

But highway systems are not left to the mercy of these incidents. Over the years, they have evolved to incorporate a multitude of strategies to handle such failures and unexpected events. Emergency lanes, traffic lights, and highway police are all part of the larger traffic management system. When congestion occurs, traffic may be re-routed to alternate routes. During peak hours, on-ramps are metered to control the influx of vehicles. If an accident occurs, the affected lanes are closed, and traffic is diverted to other lanes. Despite their complexities and occasional hiccups, these strategies aim to manage traffic as effectively as possible.

· 10 min read
Sudhanshu Prajapati

We’ve been hearing about rate limiting quite a lot these days, as popular services like Twitter and Reddit implement it. Companies are finding it increasingly important to curb the abuse of their services and keep costs under control.

Before I started working as a developer advocate, I built quite a few things, including integrations and services that catered to specific business needs. One constant while building integrations was the need to be aware of rate limits when making calls to third-party services; I had to make sure my integrations didn't abuse the third-party APIs. On the other hand, third-party services also implement their own rate-limiting rules at the edge to avoid being overwhelmed. But how does all this actually work? How do we set it up? What are the benefits of rate limiting? We’ll cover these topics, and then move on to the reasons why adaptive rate limiting is necessary.
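As a primer on how edge rate limiting works under the hood, here is a minimal token-bucket limiter sketched in Python. This is an illustrative sketch only, not Aperture's or any third-party service's actual implementation, and all names are hypothetical.

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: caller should reject or retry later

# A burst of 15 calls against a 5 req/s limiter with a burst capacity of 10:
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
```

The first ten calls drain the burst capacity and are admitted; subsequent calls are rejected until tokens refill at the configured rate. Services typically key one such bucket per user or API key.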

· 8 min read
Marta Rogala

Have you ever tried to buy a ticket online for a concert and had to wait or refresh the page every three seconds when an unexpected error appeared? Or have you ever tried to purchase something during Black Friday and experienced moments of anxiety because the loader just kept on… well… loading, and nothing appeared? We all know it, and we all get frustrated when errors occur and we don't know what's wrong with the website and why we can't buy the ticket we want.

· 18 min read
Sudhanshu Prajapati

Even thirty years since its inception, PostgreSQL continues to gain traction, thriving in an environment of rapidly evolving open source projects. While some technologies appear and vanish swiftly, others, like PostgreSQL, prove their longevity and withstand the test of time. It has become the preferred choice of many organizations for data storage, from general-purpose workloads to an asteroid-tracking database, with companies running PostgreSQL clusters holding petabytes of data.

· 3 min read
Sudhanshu Prajapati
Karanbir Sohi

San Francisco — FluxNinja is thrilled to announce the General Availability of its innovative open source tool, Aperture. This cutting-edge solution is designed to enable prioritized load shedding driven by observability and graceful degradation of non-critical services, effectively preventing total system collapse. Furthermore, Aperture intelligently auto-scales essential resources only when necessary, resulting in significant infrastructure cost savings.

· 13 min read
Sudhanshu Prajapati

In today's world, the internet is the most widely used technology. Everyone, from individuals to businesses, seeks to establish a strong online presence. This has led to a significant increase in users accessing various online services, resulting in a surge of traffic to websites and web applications.

Because of this surge in user traffic, companies now prioritize estimating the number of potential users when launching new products or websites, since capacity constraints can lead to website downtime. For instance, after the announcement of ChatGPT 3.5, there was a massive influx of traffic and interest from people all around the world. In such situations, it is essential to have load management in place to avoid possible business loss.

· 7 min read
Sudhanshu Prajapati

Graceful degradation and managing failures in complex microservices are critical topics in modern application architecture. Failures are inevitable and can cause chaos and disruption. However, prioritized load shedding can help preserve critical user experiences and keep services healthy and responsive. This approach can prevent cascading failures and allow for critical services to remain functional, even when resources are scarce.
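As a rough illustration of the idea (not Aperture's actual policy engine), the sketch below admits a request only when its priority clears a threshold that rises with observed system load; the priority tiers and load cutoffs are hypothetical.

```python
from enum import IntEnum

class Priority(IntEnum):
    BACKGROUND = 0  # batch jobs, prefetching
    DEFAULT = 1     # regular user traffic
    CRITICAL = 2    # checkout, login, payments

class LoadShedder:
    """Admit a request only if its priority clears the current threshold."""

    def __init__(self):
        self.threshold = Priority.BACKGROUND  # no shedding at low load

    def update(self, load: float):
        # Hypothetical mapping from observed load (0.0-1.0) to a threshold.
        if load > 0.9:
            self.threshold = Priority.CRITICAL   # shed everything non-critical
        elif load > 0.7:
            self.threshold = Priority.DEFAULT    # shed background work only
        else:
            self.threshold = Priority.BACKGROUND

    def admit(self, priority: Priority) -> bool:
        return priority >= self.threshold

shedder = LoadShedder()
shedder.update(load=0.95)  # system under heavy load: only critical traffic admitted
```

Under heavy load, background and default traffic are shed while critical flows continue, which is what keeps the user-facing experience alive during an incident.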

To help navigate this complex topic, Tanveer Gill, CTO of FluxNinja, presented at Chaos Carnival 2023 (March 15-16), a virtual conference with pre-recorded sessions. Even so, attendees could interact with the speakers, who were present throughout their sessions.