Efficient Feature Flagging Platform

Published at

August 6, 2021

Status

In progress

The Problem

As an experimentation team, we use feature flags as the primary means to split traffic between two feature variants (control / treatment). We own a lot of the wrappers around our vendor to help collect exposure events through our organization's analytics pipeline. However, we've had to build abstractions over our vendor's products due to some of their constraints.

There are three categories of problems we have encountered and wanted to address. The first is rigid targeting, which requires additional infrastructure to set up automations to schedule rollouts and create custom release strategies for the organization. Un-configurable bucketing, the mechanism used to derive the served variation deterministically, is another challenge. Our vendor uses a fixed hash function, so we cannot specify our own, even though different hash functions have different distribution properties. We also cannot change the variant selection strategy; nearly all vendors just serve the first matched rule, whereas other strategies include the most matched rules and the last matched rule. Finally, using the same salt for bucketing across products doesn't result in the same distribution characteristics. As a result, services end up sharing our vendor's projects and fetch a lot of flags, which can cause performance issues, among other problems.

The second category of problems we've encountered is unreliable transport. We've commonly seen initialization failures on projects with lots of flags or large rulesets (rules that reference 30k IDs, for example). Furthermore, our vendor's transport mode (SSE) is fixed when the client is initialized and cannot change afterwards, which makes it unreliable on unstable networks (e.g., mobile networks).

Finally, we've encountered vendor lock-in. Once an application is using our vendor's SDKs, it becomes challenging to migrate to another vendor. Additionally, we are unable to provision an instance in our own cloud environment. Proxies are available, but they still send sensitive data (e.g., evaluation context key ~> user trait keys, etc.) to our vendor. This is a pain when trying to make our product FedRAMP compliant.

Ideally, we wanted our vendor to address some of these issues, but they've outgrown us and are reluctant to support our use cases. Unfortunately, we didn't get funding as a team to build an internal feature management platform.

Introducing Flagbase

To recap, the problems we faced with our vendor's platform led us to explore an internal solution. After evaluating our needs, we identified three categories of problems: rigid targeting, unreliable transport, and vendor lock-in.

For targeting, we found that our vendor's platform requires additional infrastructure to set up automations for scheduling rollouts or creating custom release strategies for our organization. Additionally, we were unable to specify a hash function or change the variant selection strategy, resulting in fixed bucketing with different distribution properties across products. Our vendor's platform also used the same salt for bucketing, making it hard to run cross-product experiments.

Regarding transport, we noticed initialization failures, especially on projects with large rulesets that reference 30k IDs or more, and unreliable behaviour on unstable connections such as mobile networks. The latter is because the transport mode is fixed when the client is initialized and cannot adapt afterwards.

Finally, our vendor's platform posed issues related to vendor lock-in. It was difficult to migrate to another vendor once an application was using a specific vendor's SDKs. We were also unable to provision an instance in our own cloud environment. Proxies were available, but they still sent sensitive data to our vendor, making it hard to make our product FedRAMP compliant.

Given these challenges, we decided to develop our own feature management platform, which we named Flagbase. Flagbase is an open-source platform that addresses the problems we encountered with our vendor's platform. It features advanced targeting capabilities: targeting rules can be scoped to specific time windows, and the hash function, variation selection strategy, and bucketing salt are all customisable.

In terms of transport, Flagbase has a hybrid transport system that allows SDKs to switch dynamically between Server-Sent Events (SSE) and HTTP polling modes, depending on network conditions. It also minimizes unnecessary traffic, as only flag or rule changes are propagated through the network.

Moreover, Flagbase ensures that there is no vendor lock-in. It offers OpenFeature-compliant SDKs, which makes it easy to switch to another vendor that is also OpenFeature compliant. Flagbase is also self-hosted, which means it doesn't rely on specific technologies from a particular cloud vendor. It consists of a single executable (flagbased) that can run in multiple modes and has minimal dependencies (database: Postgres; cache: Redis; telemetry: OpenTelemetry, exported to Prometheus).

High Level Overview

Now that we have discussed the problems that can arise with feature flags and the benefits of using an open-source platform like Flagbase, let's take a closer look at how it works.

At a high level, the application uses the Flagbase SDK to pass in the evaluation context. The SDK then gets the feature flags from the Flagbase service. There are two types of SDKs: client-side and server-side.

A client-side SDK evaluates flags on the Flagbase service, whereas a server-side SDK evaluates flags within the SDK itself. The evaluation function determines the feature variant, given a set of rules and an evaluation context. For example, the evaluation context could be user attributes like age, gender, and whether the user is a paid user.

The Flagbase service uses advanced targeting capabilities that allow targeting rules to be scoped to specific time windows. Additionally, Flagbase provides a customizable hash function, variation selection strategy, and bucketing salt to ensure that the feature flag evaluation is deterministic and consistent.
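To make the bucketing idea concrete, here is a minimal sketch of deterministic bucketing with a pluggable hash function and salt. It is an illustration under stated assumptions, not Flagbase's actual implementation: the names `HASH_FUNCTIONS`, `bucket`, and `pick_variation`, the two hash choices, and the 10,000-bucket resolution are all hypothetical.

```python
import hashlib
import zlib
from typing import Callable, Dict, List, Tuple

# A hash function maps bytes to a non-negative integer.
HashFn = Callable[[bytes], int]

# Hypothetical registry of selectable hash functions (illustrative choices only).
HASH_FUNCTIONS: Dict[str, HashFn] = {
    "sha1": lambda data: int.from_bytes(hashlib.sha1(data).digest()[:8], "big"),
    "crc32": lambda data: zlib.crc32(data),
}

BUCKETS = 10_000  # assumed resolution: each bucket is 0.01% of traffic


def bucket(salt: str, flag_key: str, identifier: str, hash_name: str = "sha1") -> int:
    """Map an identifier to a stable bucket in [0, BUCKETS)."""
    data = f"{salt}:{flag_key}:{identifier}".encode("utf-8")
    return HASH_FUNCTIONS[hash_name](data) % BUCKETS


def pick_variation(bucket_value: int, weights: List[Tuple[str, int]]) -> str:
    """Walk cumulative weights (which must sum to BUCKETS) to pick a variation."""
    upper = 0
    for variation, weight in weights:
        upper += weight
        if bucket_value < upper:
            return variation
    raise ValueError("weights must sum to BUCKETS")


if __name__ == "__main__":
    split = [("control", 5_000), ("treatment", 5_000)]
    b = bucket("experiment-salt", "new-checkout", "user-123")
    print(b, pick_variation(b, split))
```

Because the bucket depends only on the salt, the flag key, and the identifier, the same user always lands in the same bucket; changing the salt or the hash function reshuffles the assignment.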

Efficient transport is also a key feature of Flagbase. The SDKs can switch between Server-Sent Events (SSE) and HTTP polling modes dynamically during runtime to adapt to network conditions. Furthermore, only flag or rule changes are propagated through the network, which minimizes unnecessary network traffic.

One of the most important benefits of Flagbase is that there is no vendor lock-in. The Flagbase SDKs are OpenFeature compliant, which means applications can switch to other vendors that are also OpenFeature compliant. Additionally, Flagbase is self-hosted and designed in a way that does not rely on specific technologies from a particular cloud vendor. It consists of a single executable (flagbased) that can be run in multiple modes, and it has very few dependencies: Postgres for the database, Redis for caching, and OpenTelemetry for telemetry, which can be exported to Prometheus for monitoring.

Service Design

Continuing from the previous section, let's take a closer look at the design of the Feature Flag SDKs and Service.

Client/Server SDK Design:

The Feature Flag SDK comes in two designs: client-side and server-side.

Client-side SDK

The Client-side SDK has the following components:

Interface: This provides methods such as getVariation('flag-key', 'default-variation') to retrieve flag variations from the SDK.

Feature Store: This is a data store that contains evaluated flags retrieved from the service. It is typically saved in IndexedDB or localStorage for web applications, and in SQLite for mobile SDKs.

Transporter: This sends requests to the service containing the client SDK key for a given project environment and the evaluation context, and retrieves the evaluated flags for that environment via HTTP polling or Server-Sent Events (SSE).
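The client-side components above can be sketched roughly as follows. This is a minimal illustration, assuming an in-memory dictionary in place of IndexedDB or SQLite and leaving the Transporter out; `ClientFeatureStore`, `ClientSdk`, and `replace_all` are hypothetical names, and `get_variation` simply mirrors the `getVariation` method mentioned above.

```python
from typing import Dict, Optional


class ClientFeatureStore:
    """In-memory stand-in for the persisted store; holds already-evaluated flags."""

    def __init__(self) -> None:
        self._variations: Dict[str, str] = {}

    def replace_all(self, evaluated: Dict[str, str]) -> None:
        # Called by the transporter when the service returns evaluated flags.
        self._variations = dict(evaluated)

    def get(self, flag_key: str) -> Optional[str]:
        return self._variations.get(flag_key)


class ClientSdk:
    """Interface layer: reads from the store and never evaluates rules itself."""

    def __init__(self, store: ClientFeatureStore) -> None:
        self._store = store

    def get_variation(self, flag_key: str, default_variation: str) -> str:
        variation = self._store.get(flag_key)
        # Fall back to the caller's default when the flag is unknown,
        # e.g. before the first response from the service has arrived.
        return default_variation if variation is None else variation


if __name__ == "__main__":
    store = ClientFeatureStore()
    sdk = ClientSdk(store)
    print(sdk.get_variation("flag-key", "default-variation"))
    store.replace_all({"flag-key": "treatment"})
    print(sdk.get_variation("flag-key", "default-variation"))
```

The default argument matters on the client: the application keeps working with a known variation even when the store is empty or the network is down.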

Server-side SDK

The Server-side SDK has the following components:

Interface: This provides methods such as getVariation('flag-key', 'control', evaluationContext) to retrieve flag variations from the SDK.

Evaluation Engine: This uses targeting rules for a flag, along with the evaluation context, to derive the served variation.

Feature Store: This is a data store that contains flags with their respective targeting rules. It is usually stored in-memory, but it is possible to create connectors to Redis or Memcached, among others.

Transporter: This sends requests to the service containing the SDK key for a given project environment, and retrieves flags and their respective targeting rules for that environment via HTTP polling or SSE.
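As a rough illustration of what the Evaluation Engine does, the sketch below matches rules against an evaluation context, honours an optional time window on each rule, and applies a configurable selection strategy. Everything here is an assumption made for the example: the `Rule` shape, the equality-only matching, the strategy names, and the reading of "most matched" as "the variation named by the most matching rules".

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class Rule:
    trait_key: str                        # key looked up in the evaluation context
    trait_value: Any                      # matches when context[trait_key] == trait_value
    variation: str                        # variation served when the rule matches
    starts_at: Optional[datetime] = None  # optional window start (inclusive)
    ends_at: Optional[datetime] = None    # optional window end (exclusive)

    def matches(self, context: Dict[str, Any], now: datetime) -> bool:
        if self.starts_at is not None and now < self.starts_at:
            return False
        if self.ends_at is not None and now >= self.ends_at:
            return False
        return self.trait_key in context and context[self.trait_key] == self.trait_value


def evaluate(
    rules: List[Rule],
    context: Dict[str, Any],
    fallthrough: str,
    strategy: str = "first_matched",
    now: Optional[datetime] = None,
) -> str:
    """Derive the served variation from rules plus an evaluation context."""
    now = now or datetime.now(timezone.utc)
    matched = [rule for rule in rules if rule.matches(context, now)]
    if not matched:
        return fallthrough
    if strategy == "first_matched":
        return matched[0].variation
    if strategy == "last_matched":
        return matched[-1].variation
    if strategy == "most_matched":
        # Variation named by the largest number of matching rules.
        return Counter(rule.variation for rule in matched).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")
```

With a paid user in the context, a rule such as `Rule("paid", True, "treatment")` would serve "treatment" under the first-matched strategy, while a rule whose time window has not yet opened is ignored until it does.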

Flagbase is able to run in multiple worker modes, for example, flagbased worker start --mode=poller. These modes include:

API: JSON:API compliant interface to manage resources.

Streamer: Pushes flag updates to SDKs over Server-Sent Events (EventSource).

Poller: HTTP polling worker.

Events: Captures exposure events and custom events.

Relayer: Sits closer to the application environment and proxies flags and/or event data.

Dependencies

Flagbase has a few dependencies, including Postgres and Redis. PubSub is used by the streamer to subscribe to flagsets. Flag updates trigger a push event to subscribers.
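The publish/subscribe flow between flag updates and the streamer can be pictured with a small in-process sketch. This is only an illustration of the pattern: `FlagsetPubSub` and the flagset keys are hypothetical, and a real deployment would rely on an external broker rather than an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Update = Dict[str, object]
Subscriber = Callable[[Update], None]


class FlagsetPubSub:
    """Tiny in-memory pub/sub keyed by flagset (e.g. one project environment)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)

    def subscribe(self, flagset_key: str, subscriber: Subscriber) -> Callable[[], None]:
        """Register a subscriber and return a function that unsubscribes it."""
        self._subscribers[flagset_key].append(subscriber)

        def unsubscribe() -> None:
            self._subscribers[flagset_key].remove(subscriber)

        return unsubscribe

    def publish(self, flagset_key: str, update: Update) -> int:
        """Push an update to every subscriber of the flagset; return how many got it."""
        subscribers = list(self._subscribers.get(flagset_key, []))
        for subscriber in subscribers:
            subscriber(update)
        return len(subscribers)


if __name__ == "__main__":
    bus = FlagsetPubSub()
    unsubscribe = bus.subscribe("env-production", print)
    bus.publish("env-production", {"flag": "new-checkout", "version": 2})
    unsubscribe()
```

In this picture, each open SSE connection would hold one subscription, and a flag change published for a flagset fans out to every connection subscribed to it.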

Further Optimisation

One optimization is the use of diffing to transport only the changes made to the flags and rules. Instead of sending the entire flag configuration each time a change is made, Flagbase calculates the diff between the current flag configuration and the previous one. This is achieved by leveraging the version number of the flag, which is used by the service to calculate and return only the required diff.

This approach significantly reduces the amount of data that needs to be transported between the service and the SDKs, which can be particularly useful in scenarios where network bandwidth is limited or unreliable. In addition, this optimization helps reduce the load on the service, making it more scalable and reliable.
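The version-based diffing described above could look roughly like the sketch below. It assumes a single store-wide version counter and tombstones for deleted flags, neither of which is confirmed by this article; `VersionedFlagStore`, `diff_since`, and `apply_diff` are hypothetical names.

```python
from typing import Dict, Optional, Tuple


class VersionedFlagStore:
    """Service-side store that remembers the version at which each flag last changed."""

    def __init__(self) -> None:
        self._version = 0
        # flag key -> (version of last change, config or None for a deleted flag)
        self._flags: Dict[str, Tuple[int, Optional[dict]]] = {}

    def upsert(self, key: str, config: dict) -> int:
        self._version += 1
        self._flags[key] = (self._version, config)
        return self._version

    def delete(self, key: str) -> int:
        self._version += 1
        self._flags[key] = (self._version, None)  # tombstone
        return self._version

    def diff_since(self, client_version: int) -> dict:
        """Return only what changed after the version the client last saw."""
        changed = {
            key: config
            for key, (version, config) in self._flags.items()
            if version > client_version and config is not None
        }
        removed = sorted(
            key
            for key, (version, config) in self._flags.items()
            if version > client_version and config is None
        )
        return {"version": self._version, "changed": changed, "removed": removed}


def apply_diff(local_flags: dict, diff: dict) -> int:
    """SDK side: patch the local store in place and return the new version."""
    local_flags.update(diff["changed"])
    for key in diff["removed"]:
        local_flags.pop(key, None)
    return diff["version"]
```

An SDK would send the version it last applied with each request; a response with empty `changed` and `removed` collections means it is already up to date.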

Another optimization is the use of hybrid transport. Flagbase SDKs are designed to dynamically switch between Server-Sent Events (SSE) and HTTP polling modes during runtime, depending on the network conditions. SSE allows for real-time updates and reduces the overhead associated with HTTP polling, while HTTP polling is more reliable in scenarios where network connectivity is spotty. By switching between these two modes based on network conditions, Flagbase is able to provide an efficient and reliable transport mechanism.
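One way to picture the switching logic is a small state machine that falls back to polling after repeated stream failures and retries SSE once polling has been stable for a while. The thresholds, the `HybridTransport` class, and its callback names are assumptions made for this sketch; the actual heuristics in the SDKs may differ.

```python
from enum import Enum


class Mode(Enum):
    SSE = "sse"
    POLLING = "polling"


class HybridTransport:
    """Tracks connection health and decides which transport mode to use."""

    def __init__(self, max_stream_failures: int = 3, polls_before_retry: int = 5) -> None:
        self.max_stream_failures = max_stream_failures
        self.polls_before_retry = polls_before_retry
        self.mode = Mode.SSE
        self._stream_failures = 0
        self._successful_polls = 0

    def on_stream_event(self) -> None:
        # Any event received over SSE proves the stream is healthy.
        self._stream_failures = 0

    def on_stream_failure(self) -> None:
        if self.mode is not Mode.SSE:
            return
        self._stream_failures += 1
        if self._stream_failures >= self.max_stream_failures:
            # The stream keeps dropping: fall back to HTTP polling.
            self.mode = Mode.POLLING
            self._successful_polls = 0

    def on_poll_success(self) -> None:
        if self.mode is not Mode.POLLING:
            return
        self._successful_polls += 1
        if self._successful_polls >= self.polls_before_retry:
            # The network looks stable again: try streaming once more.
            self.mode = Mode.SSE
            self._stream_failures = 0

    def on_poll_failure(self) -> None:
        # Require a fresh run of consecutive successes before retrying SSE.
        self._successful_polls = 0
```

Requiring several consecutive successes before switching back avoids flapping between modes on a connection that is only briefly healthy.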