By Malia Oreta
Effectively orchestrating cloud infrastructure at scale is arguably one of our most important considerations when developing new products and services on the web. Without a finely tuned, well-architected infrastructure plan, digital products won’t be able to scale to their full potential, and teams won’t be able to deliver meaningful features in an agile way.
But infrastructure maintenance and configuration typically happen “behind the scenes”, and can be a difficult cost to justify, especially on projects where stakeholders equate “progress” with “visibility”. At the same time, delivering user-facing features without a solid backend architecture to support them usually results in slower workflows, increased labor costs, and potentially unhappy clients. When you can’t rely on your underlying architecture, your team ends up in a cycle of constantly putting out fires instead of spending time refining things like monitoring, security, automation, and resource optimization.
So how do we solve this?
For AWS cloud servers, a great way to avoid these kinds of technical constraints on your team is to build a “Golden AMI pipeline” that establishes a baseline configuration for acceptable EC2 instances for each of your organization’s business units. This is especially useful when your organization has several development teams that don’t communicate with each other but need to comply with the same security and compliance standards. With this pipeline in place, individual teams can move more quickly and autonomously.
More specifically, a robust Golden AMI pipeline acts as a “security guardrail”: your DevOps team doesn’t “approve” deployments so much as “enable” them. When consistency is maintained throughout your infrastructure, there are fewer surprises and less time spent addressing unexpected issues. By handling these configurations centrally, you can be confident that each business unit is complying with your baseline acceptance criteria. It also helps free the DevOps team from manual, tedious maintenance tasks, like patching several environments after an OS security update, handling SSH user management, or confirming that dependencies are injected uniformly across environments.
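To make the guardrail idea concrete, a minimal compliance check might compare each instance’s AMI against the approved golden AMI for its business unit. This is a hypothetical sketch: the `GOLDEN_AMIS` mapping, the `business-unit` tag, and the record fields are illustrative assumptions, not part of any real pipeline’s schema.

```python
# Hypothetical sketch: flag EC2 instances whose AMI is not the approved
# golden image for their business unit. Tag names, AMI IDs, and record
# fields are illustrative assumptions.

# Approved golden AMI per business unit (placeholder IDs).
GOLDEN_AMIS = {
    "payments": "ami-0aaa111",
    "analytics": "ami-0bbb222",
}

def non_compliant(instances, golden=GOLDEN_AMIS):
    """Return (instance_id, unit, ami_id) for instances running an AMI
    other than their unit's approved golden image."""
    findings = []
    for inst in instances:
        unit = inst.get("tags", {}).get("business-unit")
        approved = golden.get(unit)
        if approved and inst["ami_id"] != approved:
            findings.append((inst["instance_id"], unit, inst["ami_id"]))
    return findings

if __name__ == "__main__":
    fleet = [
        {"instance_id": "i-1", "ami_id": "ami-0aaa111",
         "tags": {"business-unit": "payments"}},
        {"instance_id": "i-2", "ami_id": "ami-0old999",
         "tags": {"business-unit": "payments"}},
    ]
    print(non_compliant(fleet))  # [('i-2', 'payments', 'ami-0old999')]
```

In practice the instance records would come from `DescribeInstances`, and the findings could feed an alert rather than a report, but the guardrail logic stays this simple.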
Our team currently provides DevOps processes to manage numerous customer-owned AWS accounts under a central organization. Each account has its own auto-scaling group requirements but maintains a common EC2 instance configuration. Relatively simple but tedious tasks, such as updating the AMI, can take several hours to perform manually: a typical maintenance window for full system updates runs 4–5 hours, while an automated AMI distribution can be executed in under 2.5 hours.
AWS has great docs for general-purpose maintenance, but we found that many of our custom solutions needed a more tailored approach to these types of DevOps tasks. We felt that documenting our process would be helpful both as an internal reference point for us and for other DevOps teams looking to improve their workflows in similar ways.
At a high level, the manual process for building and distributing new AMIs is as follows:

1. Launch a new EC2 instance from the current base AMI.
2. Wait for the instance to finish initializing.
3. Apply OS updates and any configuration changes.
4. Create a new AMI from the configured instance.
5. Distribute the new AMI to each managed account.
6. Roll out the AMI to the auto-scaling groups and ECS services, and verify the deployments.
While this isn't the most difficult process, it is time-consuming, and it requires hefty context switching and focus during builds and installs. For example, a new EC2 server can take anywhere from 2–5 minutes to spin up, depending on its AMI. Attempting to start configuration before it's ready can disrupt startup scripts and require manual remediation (which sometimes means blowing the instance away and patiently starting over). Lastly, rolling out the AMI across all managed accounts can take hours depending on server usage, such as rolling out and testing multiple ECS application deployments.
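That “wait until it's actually ready” step is exactly the kind of thing worth automating rather than eyeballing. In practice a boto3 waiter (e.g. `instance_status_ok`) handles this for EC2, but the general pattern is a simple polling loop; the helper below is our own illustrative sketch, not an AWS API, with the clock and sleep injectable so it can be tested without waiting.

```python
import time

def wait_for(check, timeout=600, interval=15,
             clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Illustrative helper only: for a real instance, `check` would call
    DescribeInstanceStatus (or you would use a boto3 waiter instead).
    `clock` and `sleep` are injectable to keep the helper testable.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        sleep(interval)
    raise TimeoutError("resource did not become ready in time")
```

Wrapping configuration steps behind a helper like this is what removes the “start too early, break the startup scripts, rebuild from scratch” failure mode.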
Because we manage our infrastructure as code with HashiCorp Terraform, we were also using HashiCorp Packer to automate AMI creation. However, as automating the AMI distribution and rollout became a higher priority, we migrated to AWS EC2 Image Builder instead.
Using a mix of AWS services, the process described above can be automated and run on a defined schedule, ad hoc, or in response to service-provided updates such as security vulnerability patches or changes to Image Builder components, and deployed via CD.
A high-level overview of the automated process is as follows:

- The Image Builder pipeline is triggered on a schedule, ad hoc, or by a dependency update such as a changed component.
- A build instance is launched, updates and components are applied, and a new AMI is created and tested.
- The new AMI is distributed to each managed account.
- Auto-scaling groups and ECS services are updated to use the new AMI.
- If an ECS deployment cannot reach a stable state, an error is thrown for manual review.
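For the ad-hoc case, a small wrapper around the Image Builder API can kick off a pipeline run from a deploy script. `start_image_pipeline_execution` is a real boto3 operation, but the wrapper and the pipeline ARN below are illustrative sketches, not our production tooling; the client is passed in so the logic can be exercised with a stub.

```python
def run_pipeline(client, pipeline_arn):
    """Start an ad-hoc EC2 Image Builder pipeline execution and return
    the ARN of the image build it kicked off.

    `client` is expected to behave like boto3.client("imagebuilder").
    """
    resp = client.start_image_pipeline_execution(imagePipelineArn=pipeline_arn)
    return resp["imageBuildVersionArn"]

# Real usage (requires AWS credentials; the ARN is a placeholder):
#   import boto3
#   build_arn = run_pipeline(
#       boto3.client("imagebuilder"),
#       "arn:aws:imagebuilder:us-east-1:123456789012:image-pipeline/golden-base",
#   )
```

Returning the build-version ARN lets the CD job poll that specific build for completion before moving on to distribution.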
If the process fails for a group of instances, it can be re-run to target only those specific instances using either resource tags or IDs. External accounts can be added to the maintenance cycle by simply adding the account ID to the list of targets in the execution command.
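Building that retry target list is simple set logic. A hypothetical helper (the record fields and tag semantics are assumptions for illustration) that narrows a fleet down to specific instances by ID or by tag might look like:

```python
def select_targets(instances, ids=None, tags=None):
    """Filter instance records down to a retry set of instance IDs.

    `ids` is a collection of instance IDs; `tags` is a dict that must be
    a subset of an instance's tags. An instance matches if it satisfies
    either criterion. Field names here are illustrative.
    """
    ids = set(ids or ())
    tags = tags or {}
    selected = []
    for inst in instances:
        by_id = inst["instance_id"] in ids
        by_tag = bool(tags) and all(
            inst.get("tags", {}).get(k) == v for k, v in tags.items()
        )
        if by_id or by_tag:
            selected.append(inst["instance_id"])
    return selected
```

In a real re-run, the resulting IDs would be passed to the execution command as its target list, which is also where an extra account ID can be appended to fold an external account into the maintenance cycle.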
Previously, all steps were triggered manually. Now, the only manual steps are the QA/remediation efforts and the production rollout trigger, both of which could also be automated with the right tests and procedures in place. At a minimum, half of the time spent rolling out AMI updates is now saved, along with all of the context switching and focus required to do it manually across every managed AWS account. Our engineers have more peace of mind managing consistent server configurations across all projects, without being blindsided by outdated dependencies or security vulnerabilities as individual project updates become less frequent.
Automated, centralized server maintenance and configuration can save both time and sanity. It forces engineers to think about these systems through both a detailed and a high-level lens, which can shed light on use cases and error flows that might otherwise be overlooked during manual, ad-hoc updates across different projects. Consistent configuration and automation scripts also tend to double as a basic level of documentation for the target systems.
Now that we've shared a bit about how we've automated our server maintenance processes, follow along to read about how we manage our centralized infrastructure configuration as code using Terraform.