There are many open questions about architecting Infrastructure as Code (IaC) and IaC pipelines. Adopting cloud and automation tools eases the complexity of infrastructure changes. However, consistency and reliability do not come out of the box with the tooling. It takes architects to think through how they will use the tools, and to design the systems, processes, and discipline needed to use them effectively.
A common problem with IaC design is that, as we introduce more and more components into the stack, our IaC unintentionally becomes a monolith. I have seen companies put their best people and enormous effort into coding IaC templates and scripts for 50+ modules, only to realize that change-set management, and the risk of touching the same massive code base for every minor change, had become unmanageable. A badly designed IaC certainly makes life messier.
Here I want to share my experience and a few important lessons learned while designing and optimizing many IaC solutions. Though I have tried to keep the points generic, I will mention a few AWS tools and services as practical examples:
1. Follow Layered IaC Design Approach
Categorize stack components into Hardly Changed, Infrequently Changed, and Frequently Changed layers, and decide on an appropriate deployment strategy and tool for each layer.
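As a rough illustration, the layering can be written down as plain data before any tool is chosen. The sketch below is only one way to do it: the component names and the strategy text are hypothetical placeholders, not a recommendation for a specific stack.

```python
from collections import defaultdict
from enum import Enum


class Layer(Enum):
    HARDLY_CHANGED = 1
    INFREQUENTLY_CHANGED = 2
    FREQUENTLY_CHANGED = 3


# Hypothetical component-to-layer assignment; adjust to your own stack.
COMPONENT_LAYERS = {
    "network": Layer.HARDLY_CHANGED,
    "iam-baseline": Layer.HARDLY_CHANGED,
    "database": Layer.INFREQUENTLY_CHANGED,
    "message-queue": Layer.INFREQUENTLY_CHANGED,
    "app-service": Layer.FREQUENTLY_CHANGED,
}

# Hypothetical deployment strategy per layer.
LAYER_STRATEGY = {
    Layer.HARDLY_CHANGED: "separate stack, manual approval, rare release window",
    Layer.INFREQUENTLY_CHANGED: "separate stack, reviewed change set",
    Layer.FREQUENTLY_CHANGED: "automated pipeline on every merge",
}


def plan_by_layer(components):
    """Group components by layer, ordered from least to most frequently changed."""
    grouped = defaultdict(list)
    for name in components:
        grouped[COMPONENT_LAYERS[name]].append(name)
    return [
        (layer, LAYER_STRATEGY[layer], sorted(grouped[layer]))
        for layer in sorted(grouped, key=lambda item: item.value)
    ]


for layer, strategy, names in plan_by_layer(["app-service", "network", "database"]):
    print(f"{layer.name}: {names} -> {strategy}")
```

Keeping this mapping explicit makes it easier to see when a component is sitting in the wrong layer and dragging a stable stack into frequent deployments.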
2. Keep Loose cross References
Instead of using tightly coupled references, i.e., referring to the output of the layer 1 stack directly in the layer 2 stack, it's better to push the layer 1 output to a shared store such as Vault or AWS Systems Manager Parameter Store. That way, we are not bound to use the same tool for layers 1 and 2. We can also change parameters manually in an unavoidable situation or during a Severity 0 issue.
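A minimal sketch of this decoupling is shown below. It uses an in-memory stand-in so that it runs locally; with AWS, the two operations would correspond to writing and reading a parameter in Parameter Store. The class, the function names, and the `/env/layer/key` naming scheme are all assumptions made for illustration.

```python
class InMemoryStore:
    """Stand-in for Vault or Parameter Store so the sketch runs without a cloud account."""

    def __init__(self):
        self._data = {}

    def put(self, name, value):
        self._data[name] = value

    def get(self, name):
        return self._data[name]


def publish_outputs(store, env, layer, outputs):
    """Called at the end of a layer's deployment: push its outputs to the shared store."""
    for key, value in outputs.items():
        store.put(f"/{env}/{layer}/{key}", value)


def resolve_input(store, env, layer, key):
    """Called by a dependent layer: read a value by name instead of importing a stack output."""
    return store.get(f"/{env}/{layer}/{key}")


store = InMemoryStore()

# Layer 1 (deployed with any tool) publishes what it created.
publish_outputs(store, "dev", "layer1", {"vpc_id": "vpc-123"})

# Layer 2 (possibly a different tool) only knows the parameter name.
print(resolve_input(store, "dev", "layer1", "vpc_id"))

# In an emergency the value can be overridden by hand, without redeploying layer 1.
store.put("/dev/layer1/vpc_id", "vpc-999")
print(resolve_input(store, "dev", "layer1", "vpc_id"))
```

The only contract between the layers is the parameter name, which is why either side can be rewritten or moved to another tool independently.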
3. Use Public Cloud Provider's native tools wherever possible
“When it comes to IaC, at times, being "cloud agnostic" is an overvalued concept.”
There is no simple way to write a cloud-agnostic deployment template, so it is often better to use the cloud provider's native tool. E.g., the AWS CDK offers three levels of constructs off the shelf: L1, L2, and L3. L1 constructs map directly to CloudFormation resources, L2 constructs are curated ones that encapsulate L1 resources, and L3 constructs (patterns) create an entire architecture for a particular use case. Using L2 and L3 constructs removes much of the difficulty of managing complex cross-referencing by providing simplified, curated resources.
4. Use Nested templates with Modular Approach
“Managing the entire IaC in a single file is inefficient.”
Being modular helps in easy updates where any part can be changed without the risk of touching others.
Other development considerations:
- Environment-specific inputs should be saved outside the template and passed in as configuration files.
- Use a unique environment suffix in every resource name. This is mainly for proper tagging, and it also helps avoid any region-specific unique-naming restrictions.
- Secrets should strictly stay outside IaC templates and the repo. We can use Vault, AWS Secrets Manager, or a similar service to manage secrets.
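The three considerations above can be sketched together in a few lines. Everything here is illustrative: the config keys, the `db_secret_ref` field, and the helper names are assumptions, and the config is inlined as a string only so the example is self-contained. Note that the config carries a reference to a secret, never the secret value itself.

```python
import json

# Hypothetical config file content; in practice this lives outside the template.
EXAMPLE_CONFIG = """
{
  "dev":  {"instance_type": "t3.micro", "db_secret_ref": "dev/db/password"},
  "prod": {"instance_type": "m5.large", "db_secret_ref": "prod/db/password"}
}
"""


def load_env_config(raw_json, env):
    """Return the settings for one environment from a JSON config document."""
    config = json.loads(raw_json)
    if env not in config:
        raise KeyError(f"no configuration for environment '{env}'")
    return config[env]


def resource_name(base, env):
    """Append the environment suffix so names stay unique and easy to tag."""
    return f"{base}-{env}"


settings = load_env_config(EXAMPLE_CONFIG, "dev")
print(resource_name("orders-bucket", "dev"), settings["instance_type"])
# The template receives only the reference; the secret is resolved at deploy time
# by the secrets service, not stored in the repo.
print(settings["db_secret_ref"])
```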
5. Validate and Test before Execution
“If Infra is version controlled and managed as code, testing and validating the code can't be overlooked.”
Tools like cfn-lint and AWS TaskCat help with template validation.
6. Use Deploy Only Pipelines
Maintain well-controlled Deploy Only Pipelines for Production to avoid arbitrary infrastructure changes.
7. Run regular jobs to catch Drifts (if any)
Yes, there will be drifts, as it's impossible to handle every P0 production infra issue through IaC changes. We must be flexible enough to allow quick fixes, but immediately add a story to bring the IaC up to date. Implement the IaC changes, test them in lower environments, revert the manual change, and roll out the same fix via IaC in the following deployment window.
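At its core, a drift check is a comparison between what the IaC declares and what is actually running. The sketch below shows that comparison on plain dictionaries; how the two snapshots are collected (for example, from the provider's own drift detection or from API calls) is left out, and the resource names and settings are hypothetical.

```python
def find_drift(declared, actual):
    """Compare declared (IaC) and actual (live) settings per resource.

    Returns {resource: {setting: (declared_value, actual_value)}} for mismatches.
    A setting missing on one side is reported with None on that side.
    """
    drift = {}
    for resource in sorted(set(declared) | set(actual)):
        want = declared.get(resource, {})
        have = actual.get(resource, {})
        diffs = {
            key: (want.get(key), have.get(key))
            for key in sorted(set(want) | set(have))
            if want.get(key) != have.get(key)
        }
        if diffs:
            drift[resource] = diffs
    return drift


# Hypothetical snapshots: someone opened a port by hand during a P0 incident.
declared = {"sg-web": {"port": 443}, "bucket": {"versioning": True}}
actual = {"sg-web": {"port": 8443}, "bucket": {"versioning": True}}

for resource, diffs in find_drift(declared, actual).items():
    print(resource, diffs)
```

Run on a schedule, a report like this becomes the input for the story that brings the manual fix back into IaC.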