Feature Toggling from the Trenches
The Problem
Imagine you have a software product that is under constant development; that's probably not all that difficult. Imagine now you want to commit your stable but partially-completed feature into your trunk repo, but not make it available until it's ready. Now imagine you are developing a feature that will be a premium feature, only available to a small section of your customers.
NewVoiceMedia had these problems, and more, and solved them - mitigated them at least - by using feature toggling.
A Feature Toggle Mechanism
This story starts about 5 years ago. NewVoiceMedia had a history of implementing naive solutions to the problem of allowing a feature for a specific customer or group of customers. Often the solution consisted of a new control in an existing screen accessed by a NewVoiceMedia administrator, with the setting stored in a bespoke field in the database. Sometimes it came with a whole new admin screen. Sometimes a feature was enabled via a value in a web.config file. These solutions were costly either to implement, in time and effort, or to maintain, which meant that they were not built for every new feature.
The upshot was that most code went to production with no way to shut off a new feature if something unexpected happened, and in spite of manual and automated tests, unexpected things do happen. Some of these unexpected things led to rollbacks of production releases and decreased customer confidence in our product. As we moved towards quicker releases to production, protecting the stability of the product became more and more important, as did our desire to check code in to the trunk/main branch as early as possible.
The first implementation of a toggling system was a list of toggle names and default values (on/off) and a database table that stored overrides of that default value for customers. This was backed by a service to read the values to determine if a feature was enabled. The code containing a toggled feature would look something like:
    foo();
    if (toggles.Enabled(FeatureType.ReticulateSplines, customer))
    {
        splines.Reticulate();
    }
    bar();
This implementation was quick and easy to use from a developer's perspective, but it required scripts to be run by hand to set override values in the database. This quickly became tiresome, so the next evolution was a screen to allow administration of overrides.
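As a rough sketch of that first version (not the actual NewVoiceMedia code), the service might look something like the following. The names ToggleService and SetOverride, the NewDashboard toggle, the integer customer id and the in-memory dictionary standing in for the overrides table are all assumptions made for illustration.

```csharp
using System.Collections.Generic;

// Only ReticulateSplines appears in the article; NewDashboard is hypothetical.
public enum FeatureType { ReticulateSplines, NewDashboard }

public class ToggleService
{
    // The list of toggle names and their default values (on/off).
    private readonly Dictionary<FeatureType, bool> _defaults;

    // Stand-in for the database table of per-customer overrides,
    // keyed by (feature, customer id).
    private readonly Dictionary<(FeatureType, int), bool> _overrides =
        new Dictionary<(FeatureType, int), bool>();

    public ToggleService(Dictionary<FeatureType, bool> defaults)
    {
        _defaults = defaults;
    }

    // Plays the part of the hand-run script (later, the admin screen).
    public void SetOverride(FeatureType feature, int customerId, bool enabled)
    {
        _overrides[(feature, customerId)] = enabled;
    }

    public bool Enabled(FeatureType feature, int customerId)
    {
        // A customer-specific override wins over the default value.
        if (_overrides.TryGetValue((feature, customerId), out var overridden))
        {
            return overridden;
        }

        // Assumption: a toggle with no registered default is treated as off.
        return _defaults.TryGetValue(feature, out var enabled) && enabled;
    }
}
```

With a default of "off" for ReticulateSplines, Enabled returns false for every customer until SetOverride turns it on for one of them; other customers keep the default.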
This is the basic system that's in place today. We've made enhancements and tweaks to the system along the way when the need has arisen, for example:
- Caching of override values
- Tagging of feature toggles, allowing:
  - Visual grouping on the admin screen
  - Clearer separation between development toggles (to enable swift rollback of code, to protect the platform) and customer toggles (to enable a premium feature for a particular account)
- Introducing different levels of granularity for toggles:
  - Platform-wide, to turn a feature on or off for all customers
  - User-specific, to allow a specific user to have access to a feature
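To make the tagging and granularity ideas concrete, here is a minimal sketch of how they could fit together. It is not the production code: every type and method name is hypothetical, ids are plain integers, and the precedence order (user, then customer, then platform-wide, then default) is my assumption, since a team could equally let the platform-wide setting win. Caching is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum FeatureType { ReticulateSplines }

// Hypothetical tags separating the two kinds of toggle.
public enum ToggleTag { Development, Customer }

public class ToggleDefinition
{
    public FeatureType Feature { get; }
    public bool DefaultValue { get; }
    public ToggleTag Tag { get; }

    public ToggleDefinition(FeatureType feature, bool defaultValue, ToggleTag tag)
    {
        Feature = feature;
        DefaultValue = defaultValue;
        Tag = tag;
    }
}

public class LayeredToggleService
{
    private readonly Dictionary<FeatureType, ToggleDefinition> _definitions =
        new Dictionary<FeatureType, ToggleDefinition>();
    private readonly Dictionary<FeatureType, bool> _platform =
        new Dictionary<FeatureType, bool>();
    private readonly Dictionary<(FeatureType, int), bool> _customers =
        new Dictionary<(FeatureType, int), bool>();
    private readonly Dictionary<(FeatureType, int), bool> _users =
        new Dictionary<(FeatureType, int), bool>();

    public void Register(ToggleDefinition definition)
    {
        _definitions[definition.Feature] = definition;
    }

    public void SetPlatform(FeatureType feature, bool enabled)
    {
        _platform[feature] = enabled;
    }

    public void SetCustomer(FeatureType feature, int customerId, bool enabled)
    {
        _customers[(feature, customerId)] = enabled;
    }

    public void SetUser(FeatureType feature, int userId, bool enabled)
    {
        _users[(feature, userId)] = enabled;
    }

    // Assumed precedence: the most specific override wins.
    public bool Enabled(FeatureType feature, int customerId, int userId)
    {
        if (_users.TryGetValue((feature, userId), out var byUser)) return byUser;
        if (_customers.TryGetValue((feature, customerId), out var byCustomer)) return byCustomer;
        if (_platform.TryGetValue(feature, out var byPlatform)) return byPlatform;
        return _definitions.TryGetValue(feature, out var definition) && definition.DefaultValue;
    }

    // Lets an admin screen group toggles by tag.
    public IEnumerable<ToggleDefinition> WithTag(ToggleTag tag)
    {
        return _definitions.Values.Where(d => d.Tag == tag);
    }
}
```

Checking the most specific level first keeps the lookup to a short chain of dictionary reads, and the tag on each definition is enough to drive both the grouping and the development/customer split.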
Toggle lifecycle
Introducing this mechanism means that it's very easy to add a developer toggle to every new feature. As you can imagine, that leads to a large number of toggles, with their related checks dotted around the code. When a developer toggle outlives its usefulness, it becomes technical debt. This means that each developer needs to be disciplined about removing toggles once they're no longer relevant. The lifecycle of a development toggle at NewVoiceMedia generally runs like this:
- Introduce a toggle, with its default value set to "off"
- Develop the feature behind the toggle, extending the appropriate automated tests so that the code is executed with the toggle both on and off
- Enable the toggle for test customers in a test environment. This allows the feature to be tested while it's being developed (for early feedback)
- Turn the toggle on in production for selected clients
- After feedback is gathered, change the default value of the toggle to "on", leaving code and tests in place for toggle on and off
- Remove the unit tests testing the toggle "off" code
- Remove the toggle "off" code
- Remove the toggle itself
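The testing steps in this lifecycle can be sketched as follows. This is an illustration only: the IToggles interface, the FixedToggles test double and the SplineJob class are hypothetical, and the checks are plain methods rather than tests written for any particular test framework.

```csharp
using System;
using System.Collections.Generic;

public enum FeatureType { ReticulateSplines }

public interface IToggles
{
    bool Enabled(FeatureType feature, int customerId);
}

// Test double that gives the same answer for every toggle and customer.
public class FixedToggles : IToggles
{
    private readonly bool _value;

    public FixedToggles(bool value)
    {
        _value = value;
    }

    public bool Enabled(FeatureType feature, int customerId)
    {
        return _value;
    }
}

// Hypothetical code under test; it records the steps it performs.
public class SplineJob
{
    private readonly IToggles _toggles;

    public SplineJob(IToggles toggles)
    {
        _toggles = toggles;
    }

    public List<string> Run(int customerId)
    {
        var steps = new List<string> { "foo" };
        if (_toggles.Enabled(FeatureType.ReticulateSplines, customerId))
        {
            steps.Add("reticulate");
        }
        steps.Add("bar");
        return steps;
    }
}

public static class SplineJobTests
{
    // Kept for the whole life of the feature.
    public static void RunsNewStepWhenToggleIsOn()
    {
        var steps = new SplineJob(new FixedToggles(true)).Run(1);
        Check(string.Join(",", steps) == "foo,reticulate,bar");
    }

    // Removed during clean-up, once the default has changed to "on".
    public static void SkipsNewStepWhenToggleIsOff()
    {
        var steps = new SplineJob(new FixedToggles(false)).Run(1);
        Check(string.Join(",", steps) == "foo,bar");
    }

    private static void Check(bool condition)
    {
        if (!condition) throw new InvalidOperationException("test failed");
    }
}
```

Injecting the toggle service behind an interface is what makes both paths cheap to test. The clean-up steps then amount to deleting the "off" test, the if statement around the new step, and finally the toggle's entry in the feature list.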
Observations
If I were writing a toggle mechanism again from scratch, I would use the patterns that I've learnt along the way on the NewVoiceMedia journey. One thing I would do differently is avoid giving toggles a choice of default value; all toggles would be off unless an override exists. This would simplify the logic for calculating the value of a toggle.
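A minimal sketch of what that simplification could look like, assuming per-customer overrides only and hypothetical names throughout: with no default values, a toggle is on exactly when an entry exists for it.

```csharp
using System.Collections.Generic;

public enum FeatureType { ReticulateSplines }

public class OverrideOnlyToggles
{
    // Presence in this set is the override; absence means "off".
    private readonly HashSet<(FeatureType, int)> _enabledFor =
        new HashSet<(FeatureType, int)>();

    public void Enable(FeatureType feature, int customerId)
    {
        _enabledFor.Add((feature, customerId));
    }

    public void Disable(FeatureType feature, int customerId)
    {
        _enabledFor.Remove((feature, customerId));
    }

    // No default to consult: a single membership check decides the value.
    public bool Enabled(FeatureType feature, int customerId)
    {
        return _enabledFor.Contains((feature, customerId));
    }
}
```

The lookup no longer has to combine a default with an override, so there is only one place where a toggle's value can come from.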
I would also consider using a third-party library for toggle management. This page lists a few of the libraries available: https://featureflags.io/dotnet-feature-flags/
You can read more about feature toggling as a concept on Martin Fowler's site: https://martinfowler.com/articles/feature-toggles.html