Mono Environment
I made a Twitter commitment to write this up, and I know it's going to sound insane if you're a shop with somewhere between 3 and 8 environments, but hear me out.
So first off, what is a mono-environment? Call it a uni-environment if that helps, but it means the same thing, one, one environment, the art of just having production. 😱
Before we go on this journey, I want you to admit you've maybe got some baggage. I think we all have: how many times have you rocked up to a site where, at the interview, they were talking about 100% test coverage and what their test shape looked like (U-I-E2E, ice cream cone, pyramid, upside down pyramid), but on that first day it turns out to be much worse, with next to no coverage!
Those experiences more than likely have you thinking I'm a bit of a crackpot, and will have already reinforced your view that the level of testing and proactive monitoring required to sustain one environment without the safety of another env isn't possible, but stay with me and let's unpick it piece by piece.
This concept isn't one size fits all. You might have a problem where it fundamentally doesn't work, either because the problem is too complex or because you have regulatory or process constraints. I'm not suggesting it's the path you should take, but if you can devolve your challenge and the shoe fits, great!
There are a bunch of things that we need to work through in order to get the value of a new feature to manifest itself to a business or our users. We've gotta build something, test it, ship it, deploy it and probably change it a few times without creating more defects.
Now generally the argument for an environment is that we need somewhere isolated from production to do something, usually test, sometimes it's system integration where we bring together some partners and integrate their code and see if everything works out.
Testing
What I'm proposing is that with thorough testing we can eliminate the need for these environments.
Let's take a new POST endpoint for example, what will we test in code?
- Payload validation: does the endpoint accept a valid payload?
- Negative payload validation: do we get an error when we submit a payload with a string that's too long or that contains invalid characters, or even a payload with malformed JSON?
- Does the endpoint return extraneous information such as the 'x-powered-by' header, something that would give the red team the upper hand?
- Does the code try to call the upstream service if there is one?
- How does it handle an upstream non-200 or a malformed payload/truncated response?
- Does the endpoint return the correct error codes when something fails, and does it obfuscate or suppress them when they could be used to probe for weakness?
- Does the endpoint return the proper response?
Seem like a lot? That's just proper testing on the HTTP part of an endpoint, but with this sort of testing you can be sure not only that the endpoint is doing what you expect but also that its contracts with dependent microservices remain intact.
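To make that checklist a bit more concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the `handle_post` handler, the `name` field, the 50 character limit and the `FakeUpstream` test double are all assumptions of mine, not a real framework or service. The point is only that each bullet above maps onto a small, boring test.

```python
import json

MAX_NAME_LENGTH = 50  # hypothetical limit, purely for illustration
GENERIC_400 = {"error": "invalid request"}       # deliberately vague
GENERIC_502 = {"error": "upstream unavailable"}  # hides upstream detail


def handle_post(raw_body, call_upstream):
    """Hypothetical POST handler returning (status, headers, body)."""
    headers = {"Content-Type": "application/json"}  # no X-Powered-By here
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, headers, GENERIC_400
    name = payload.get("name") if isinstance(payload, dict) else None
    if not isinstance(name, str) or not name.isalnum() or len(name) > MAX_NAME_LENGTH:
        return 400, headers, GENERIC_400
    try:
        status, body = call_upstream(payload)
        created = json.loads(body) if status == 200 else None
    except (ConnectionError, json.JSONDecodeError):
        return 502, headers, GENERIC_502
    if not isinstance(created, dict) or "id" not in created:
        return 502, headers, GENERIC_502
    return 201, headers, {"id": created["id"]}


class FakeUpstream:
    """Test double that records calls and replays a canned response."""

    def __init__(self, status=200, body='{"id": 7}'):
        self.status, self.body, self.calls = status, body, []

    def __call__(self, payload):
        self.calls.append(payload)
        return self.status, self.body


def test_valid_payload_is_accepted_and_upstream_called():
    upstream = FakeUpstream()
    status, _, body = handle_post('{"name": "bree"}', upstream)
    assert (status, body) == (201, {"id": 7})
    assert upstream.calls == [{"name": "bree"}]


def test_invalid_payloads_are_rejected_without_calling_upstream():
    # Too long, invalid characters, malformed JSON.
    bad_bodies = ['{"name": "' + "x" * 51 + '"}', '{"name": "<script>"}', '{"name": ']
    for raw in bad_bodies:
        upstream = FakeUpstream()
        status, _, body = handle_post(raw, upstream)
        assert (status, body) == (400, GENERIC_400)
        assert upstream.calls == []


def test_no_extraneous_headers():
    _, headers, _ = handle_post('{"name": "bree"}', FakeUpstream())
    assert "x-powered-by" not in {key.lower() for key in headers}


def test_upstream_failures_become_a_generic_502():
    # An upstream non-200, then a truncated upstream response.
    for upstream in (FakeUpstream(status=500), FakeUpstream(body='{"id": ')):
        status, _, body = handle_post('{"name": "bree"}', upstream)
        assert (status, body) == (502, GENERIC_502)
```

The test functions follow the naming pytest picks up by default, so dropping this in a file and running pytest against it should be enough to try it out.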
Feature Flags
Thinking about the nature of a safe release or code deployment in an always-on, single environment, we have a couple of challenges to think about: how do we transition traffic from one version of the application to another, and back again if necessary? We still want the ability to validate that something works manually as well as automatically, and feature flags cover both of those necessities.
Feature flags can be implemented in any number of ways, from something more flexible and SaaS-based like LaunchDarkly to something more homebrew like a JSON file containing booleans that are evaluated before a component is loaded.
We can stand up different versions of a code path and switch using feature flags, switching the production default to our new feature once we're ready for user traffic.
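Here's roughly what the homebrew end of that spectrum could look like, as a minimal sketch. The flag name `new_submission`, the two `submit_v*` code paths and the per-request `overrides` set are all hypothetical names I've invented for the example; a SaaS product would give you a lot more than this.

```python
import json


def load_flags(raw_json):
    """Parse a JSON object of booleans; anything other than true counts as off."""
    return {name: value is True for name, value in json.loads(raw_json).items()}


def is_enabled(flags, name, overrides=frozenset()):
    """A flag is on if it's the production default or this request opts in."""
    return name in overrides or flags.get(name, False)


def submit_v1(payload):
    # Existing code path, still the production default.
    return {"handler": "v1", "payload": payload}


def submit_v2(payload):
    # New code path, deployed to production but dark until the flag flips.
    return {"handler": "v2", "payload": payload}


def submit(payload, flags, overrides=frozenset()):
    handler = submit_v2 if is_enabled(flags, "new_submission", overrides) else submit_v1
    return handler(payload)


flags = load_flags('{"new_submission": false}')
print(submit({"name": "bree"}, flags)["handler"])                      # v1
print(submit({"name": "bree"}, flags, {"new_submission"})["handler"])  # v2
```

The override is what lets you poke at the new path by hand in production before anyone else sees it. Switching the default is then a one-line change to the JSON, and so is switching it back.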
Contracts
On the subject of contracts, Pact Broker is a fantastic tool if you have dependent microservices and you need to protect yourself from contract drift. With excellent monitoring you would pretty quickly see your funnel affected by an endpoint that's not doing what it should do, or an unexplained spike in non-200s. Lots of these implements should give you warning signs that something has broken.
Things will go wrong in a mono-environment, for sure, but if you do your postmortems, I think you will quickly figure out if there is work to be done on mitigating or preventing, or if generally there is a communication issue somewhere.
Judgement and Traffic Splitting
Tools like Netflix's Kayenta, which is now standalone, can help you automate the decision-making process to release: what does the baseline error rate look like, and does this new release match or beat the baseline? Make a call to proceed or not.
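To show the shape of that decision, here's a deliberately naive sketch. This is not how Kayenta works internally, it's just the question from the paragraph above written down as code. The `tolerance` and `min_requests` values are numbers I've picked out of the air and you'd need to tune them for your own traffic.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there's no traffic yet."""
    return errors / requests if requests else 0.0


def judge_canary(baseline, canary, tolerance=0.005, min_requests=100):
    """Compare (errors, requests) tuples for the baseline and the new release.

    Returns "wait" until the canary has seen enough traffic to judge,
    "proceed" if its error rate is within tolerance of the baseline,
    and "rollback" otherwise.
    """
    if canary[1] < min_requests:
        return "wait"
    if error_rate(*canary) <= error_rate(*baseline) + tolerance:
        return "proceed"
    return "rollback"


print(judge_canary(baseline=(10, 1000), canary=(1, 50)))     # wait
print(judge_canary(baseline=(10, 1000), canary=(10, 1000)))  # proceed
print(judge_canary(baseline=(10, 1000), canary=(50, 1000)))  # rollback
```

A real judgement would look at more than one metric and treat the numbers statistically, which is exactly the sort of thing you'd lean on a dedicated tool for.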
We can also do something similar using Istio Route Rules in Kubernetes and drive a small percentage of traffic through a new version of a container before we're ready to deprecate the old one and promote the new one.
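In a mesh like Istio the split lives in routing config rather than in your application, so treat the following as a toy model of the idea and nothing more. The `pick_version` function and the "stable"/"canary" labels are my own invention for the sketch.

```python
import random


def pick_version(canary_percent, rng=random):
    """Send roughly canary_percent of requests to the new version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if rng.random() * 100 < canary_percent else "stable"


# Simulate 10,000 requests with 10% of traffic on the new container.
rng = random.Random(42)  # seeded so the simulation is repeatable
hits = sum(pick_version(10, rng) == "canary" for _ in range(10_000))
print(f"{hits} of 10000 requests went to the canary")
```

Promotion is then just walking that percentage up towards 100 while you watch the error rates, and dropping it back to 0 if anything looks off.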
Just to bring the point home, we're still doing all this in one environment, the level of testing is a change in the way we engineer and a difference in how we build products: a great product owner will understand there is more than just shipping code to development, we own the whole damn thing as a team, Build, Security, Test, Release, Support.
On CI
See, the mono-environment makes release super easy because you don't have to worry about promoting assets. If you're running trunk-based development, we merge to master and go straight to production.
One of the things I've always hated is the conflation of arbitrary jobs and production of assets in CI. Lately, I've been working on collapsing 8 environments down to just two, and what's been great is the VSTS Release Management feature.
I've been able to stand up trunk-based development, use the development environment as the destination for a release once we merge to master, and then use Release Management to flow those artefacts produced on merge into the downstream environments using either automation or gated decision points. One artefact -> multiple environments.
If you're running a mono-environment, it's no secret I'm a massive fan of CircleCI. Version two is just phenomenal: the steps of your pipeline execute in a Docker container, so you get to accelerate your build by having a container set up with all the build tools already installed and waiting for you. We can also do some cool stuff like pull all your code, then fan out in the pipeline, run tests in parallel and then fan back in when you're ready to build an artefact.
Monitoring, Logging & Tracing
I know I speak about the value of ChatOps quite often, but one of the double-edged swords that it brings is the always-connected nature of real-time chat. If you're practising mono-environment then you're going to need real-time feedback on things being broken, with a direct route to the people who can triage and fix it where possible.
We can cover off a spike in errors by reporting them in real time, but what about actively testing our happy paths? For that we're looking for an external testing service, ideally with a way we can script those tests and maintain them in code. I've used New Relic Synthetics in the past with success, although the promotion of code from dev to live can be a little clunky unless you're willing to do something creative. I'd say this is quite critical to get right: anything that is clunky or causes friction will quickly get pushed to the side or ignored, and that means we lose the value in having it.
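If you wanted to see what a scripted happy-path check boils down to, here's a minimal sketch using only the Python standard library. It is not the New Relic Synthetics API. The 201 status, the `{"name": "synthetic"}` payload and the two second budget are assumptions of mine, and whatever URL you pass in is up to you.

```python
import json
import time
import urllib.error
import urllib.request


def http_post(url, payload, timeout=5.0):
    """POST JSON and return (status, body) without raising on 4xx/5xx."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        return error.code, error.read().decode("utf-8")


def check_happy_path(url, post=http_post, max_seconds=2.0):
    """Run one synthetic submission and report whether it looks healthy."""
    started = time.monotonic()
    try:
        status, _ = post(url, {"name": "synthetic"})
    except OSError as error:  # DNS failures, refused connections, timeouts
        return {"ok": False, "reason": f"request failed: {error}"}
    elapsed = time.monotonic() - started
    if status != 201:
        return {"ok": False, "reason": f"unexpected status {status}"}
    if elapsed > max_seconds:
        return {"ok": False, "reason": f"too slow: {elapsed:.2f}s"}
    return {"ok": True, "reason": "happy path healthy"}
```

Passing `post` in as a parameter means the check itself can be unit tested with a fake, and because it's just code it can live in the same repo and go through the same review as the endpoint it's watching. Run it on a schedule and push any `ok: False` result into chat.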
Recap
Taking a single endpoint change as an example:
- We alter an existing endpoint to accept a new JSON payload.
- We alter the schema validation to check the incoming payload.
- We write the tests to validate the new payload.
- We write a test stubbing out the upstream and making sure it gets called.
- We write any new tests required to hit 100% coverage.
- We write any tests that might bolster our security.
- Strong chance that this endpoint needs to dual run because we haven't updated the front-end, so let's write a test to make sure both payloads work.
- We know we have a synthetic testing the old payload, so maybe we should write a new synthetic to test submission of the new payload, but we can hide that under a feature flag for now.
- Let's write a test to make sure you can only ever get our new endpoint by supplying the relevant feature flag.
Finally, we can be confident we can push this out under its new feature flag: we have monitoring in place, we know it's not accessible without the correct flag, and we will get alerted if the live synth breaks because it's testing the existing submission while our new synth is testing the one under a feature flag.
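The dual-run and flag-gating steps from the list are the ones people tend to skip, so here's a small sketch of what those two tests might look like. The `X-Feature-Flags` header, the flag name and both payload shapes are hypothetical; swap in whatever your own endpoint actually uses.

```python
FLAG_HEADER = "X-Feature-Flags"  # hypothetical header name
INVALID = (400, {"error": "invalid request"})


def parse_flags(headers):
    """Read a comma-separated flag list from the (hypothetical) header."""
    raw = headers.get(FLAG_HEADER, "")
    return {flag.strip() for flag in raw.split(",") if flag.strip()}


def submit(headers, payload):
    """Dual-running endpoint: old payload always works, new one needs the flag."""
    if set(payload) == {"name"}:  # existing payload shape
        return 201, {"version": "v1"}
    if set(payload) == {"first_name", "last_name"}:  # new payload shape
        if "new_submission" in parse_flags(headers):
            return 201, {"version": "v2"}
        # Without the flag the new shape looks like any other bad payload,
        # so nothing about the dark feature leaks out.
        return INVALID
    return INVALID


OLD = {"name": "bree"}
NEW = {"first_name": "bree", "last_name": "example"}
FLAGGED = {FLAG_HEADER: "new_submission"}


def test_both_payloads_work_while_dual_running():
    assert submit({}, OLD) == (201, {"version": "v1"})
    assert submit(FLAGGED, OLD) == (201, {"version": "v1"})
    assert submit(FLAGGED, NEW) == (201, {"version": "v2"})


def test_new_payload_is_unreachable_without_the_flag():
    assert submit({}, NEW) == INVALID
    assert submit({FLAG_HEADER: "something_else"}, NEW) == INVALID
```

A plain dict lookup like this is case-sensitive, which real HTTP headers are not, so in a proper handler you'd lean on your framework's header handling instead.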
Naturally, when we're ready to promote this to live, we can use the Istio Route Rules to begin to slide traffic across to a new version of maybe the frontend container that will cause submissions to the new endpoint, keep an eye on error rates in Slack, and make the call to increase manually or again use something like Kayenta to make a judgment.
This isn't a blueprint for a mono-environment; you will have your own challenges and design issues to overcome. You need to decide if the benefits of one environment and more durability are inside your acceptable risk profile.
I don't think anything I've mentioned is extreme or specific to operating a mono-environment, I think it's just good clean development practices. There are loads of great tools and technologies out there you could add into the mix. Each part of your stack will have specific challenges, and they might be tech or just merely hygiene and ensuring other members of the team follow the process.
As always, let me know your thoughts
Bree xoxo