Cloud Engineering weeknotes, 10 December 2021

It passed me by that it was our first anniversary two weeks ago. It’s been quite the year, and although it feels like not a lot has happened over the last week, I’m not sure that any of us would have forecast the progress we have made in the last 54 weeks. I remember a workshop earlier this year, thinking that we had so much to do; it would be interesting to run that again to see how much we’ve moved on.

Although it’s good that we’ve been able to do so much for colleagues across the council, what I’m most proud of is the team we’ve built. Actually agile, focusing on value, flat, and dedicated to learning. We have a camaraderie that has sustained us, a culture that is healthy and welcoming. Long may this last.

Over this last week, the team has got some good stuff done, mostly to support other teams. We’ve built a VPN to connect to Servelec’s back end to enable a data migration; built some EC2s; set up networking for Social Care Finance; and are investigating ways to enable colleagues in HR to receive data from our payroll provider.

Some of the other work is being used as a catalyst, or maybe a test bed, for things we knew we’d need to do in future anyway. For example, we know that external services will need ingress access to our environment, so we have used requests from the Academy application manager and from the Data Platform team to work out how best to do this in a secure way, and how to automate it.

There’s been some progress on account migrations. The GIS apps are talking to each other but we need to set up a connection to the Addresses API as the final step. We’ve agreed a way forward on the Housing accounts with the Housing Finance team as well, and that should be unblocked early next week. Progress on the websites migrations has also come on strongly, with an AMI built in Packer and an ALB configured. We’ll migrate the first site in the staging environment shortly, to make sure it all works as expected.
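The Packer-built AMI mentioned above can be sketched roughly as below. This is an illustrative HCL2 template only: the region, instance type, base image filter, and provisioning step are all assumptions, not our actual configuration.

```hcl
# Illustrative Packer template; names and filters are placeholders.
packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "website" {
  region        = "eu-west-2"
  instance_type = "t3.small"
  ami_name      = "website-base-{{timestamp}}"

  # Start from the latest Amazon Linux 2 image (assumed base).
  source_ami_filter {
    filters = {
      name                = "amzn2-ami-hvm-*-x86_64-gp2"
      virtualization-type = "hvm"
      root-device-type    = "ebs"
    }
    owners      = ["amazon"]
    most_recent = true
  }

  ssh_username = "ec2-user"
}

build {
  sources = ["source.amazon-ebs.website"]

  # Bake prerequisites into the image so instances launch ready to serve.
  provisioner "shell" {
    inline = ["sudo yum -y update"]
  }
}
```

Building the image once and launching it behind the ALB means every instance in the target group starts from an identical, pre-hardened base.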

Cloud Engineering weeknotes, 3 December 2021

Guest post by one of the team’s engineers

Our second week of the sprint carries on without our delivery manager, who is taking some well-earned rest. It’s a testament to the trust he has in us that we’ve been left to run our own show and tell and retro without the need for a substitute DM. It’s also a testament to how well the team works that we’re all aware of our roles and what needs to be done.

Account migrations have moved on a little. Advanced e5 was onboarded successfully with assistance from Advanced and the finance team, and GIS apps are being migrated to a dedicated account. The housing staging and production accounts are dependent on some development work being completed. The API accounts are all dependent on resources being migrated to their own accounts and we expect movement on that over the next few weeks.

In Globalprotect we are now looking to set up role-based access so that users can only see “their” applications. So that we can apply additional controls, we are looking to implement two entry points (portals), one for internal and one for external applications. As part of the security work we have added a set of scripts that create applications via the API, which lets us restrict access to the management console.

A further security improvement is AWS Federation for GitHub actions, which would allow us to remove some credentials that are currently being stored in the Infrastructure repo. This will also allow for dynamic role assumption based on repo, tag or person running a workflow in the future.
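Federation of this kind is typically set up by registering GitHub’s OIDC provider in AWS and trusting tokens scoped to a repository. A minimal Terraform sketch follows; the role name, repository, and thumbprint are illustrative assumptions, not our actual Infrastructure repo configuration.

```hcl
# Sketch of GitHub Actions OIDC federation; names are placeholders.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Only workflows in this one repo may assume the role; the subject
    # claim can be narrowed further to a branch or tag.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:*"]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name               = "github-actions-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}
```

Because the workflow exchanges a short-lived OIDC token for temporary credentials at run time, no long-lived AWS keys need to live in repository secrets.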

We have started developing Ansible for the various Hackney websites. The idea is that, much as we write reusable Terraform modules, we can create Ansible roles as reusable blocks of configuration. These can then be used to build our custom AMIs, which conform to stricter security standards and have any prerequisite customisations baked in. Additionally, we can use them to configure an EC2 automatically after it has been deployed. The intention is to deploy all the various websites and Backstage in this manner, creating a pattern for any future applications that require hosting on EC2.
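A playbook composed of such roles might look something like the sketch below. The role and host group names are purely illustrative assumptions; the point is that the same role can be reused both at AMI bake time and for post-deploy configuration.

```yaml
# Illustrative playbook only; role and host names are assumptions.
- name: Configure a Hackney website host
  hosts: website_servers
  become: true
  roles:
    - common_hardening   # shared security baseline, reused across all AMIs
    - wordpress          # site-specific configuration layered on top
```

Keeping the hardening logic in one role means every new application inherits the same baseline without copy-pasting configuration.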

Cloud Engineering weeknotes, 26 November 2021

This week we’ve been looking to the future a fair bit. The first part of this was in reviewing our roadmap and planning the larger pieces of work for the next 5-6 months. A lot has changed since we put this roadmap together back in March, and we have a much better understanding of what is needed and how we work with other teams. 

We have removed some items now deemed unnecessary, some items that run counter to our working philosophy, and some items that are low-value. If they are important enough, they will come back in the future. 

We’re concentrating on a smaller number of items, doing fewer things better, and we will start with testing our disaster recovery capability. We’ve had a couple of dry runs and will build on that; the idea is to see how we react and then iterate to build on strengths and identify weaknesses. 

Looking to the future has also included starting to plan for when our contractor colleagues roll off. This will be phased, with at least one person remaining till January or February. We are now starting handovers, with documentation and coaching. 

Part of this effort is a lot of work done in the last week or so to put as much automation into the firewalls as possible. We should soon be in a position where nobody has to go into the control panel; everything should be managed by Panorama and any changes made via Terraform. We have high availability, and changes to things like routing should be very rare. 
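Terraform-managed firewall changes of this kind generally go through the panos provider rather than the control panel. A hedged sketch, with an entirely illustrative object:

```hcl
# Sketch using the panos Terraform provider; the address object shown
# is an invented example, not an actual rule from our firewalls.
resource "panos_address_object" "qlik_server" {
  name        = "qlik-server"
  value       = "10.0.0.10/32"
  description = "Managed in Terraform; no manual console edits"
}
```

With every object defined in code, the console becomes read-only in practice and changes are reviewable in version control before they reach Panorama.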

This extends to Globalprotect, which is part of the firewall software. We’re piloting a new way to create an application in Globalprotect using its API, rather than needing direct access to the control panel, and it’s gone well. Be on the lookout for more Globalprotect changes in the next week or so. 

We’ve made some progress on the account migrations this week. The work to migrate Advanced e5 is prepared, and the change should be made imminently. We’ve also migrated Housing-Dev, and work on the Websites account is progressing well. Unfortunately we’re blocked on the other Housing accounts, which in turn blocks the final set of migrations. We are talking daily to the MTFH team about getting this unblocked. 

The next few weeks may be frantic, but the light at the end of the tunnel is getting brighter. 

Cloud Engineering weeknotes, 19 November 2021

More documentation this week, including a draft “Team ways of working” document that has really made me think. When writing this, I looked back to our first show & tell a year ago, when we set out our principles and values; I really do believe that we have held true to these. The fact that we’re doing this unconsciously is a good sign that the principles and values really do describe who we are as a team. 

We are tantalisingly close to finishing off some huge pieces of work. Our new firewalls are in, Panorama is being deployed for central management, and we have a host of improvements to Globalprotect lined up. One significant change coming in the near future will be new, and separate, URLs depending on where the application is hosted. The majority of applications will be on gp-apps.hackney.gov.uk and any applications hosted in our own AWS, like Qlik, will be on gp-vpn.hackney.gov.uk.

Unfortunately, there’s been no real progress on account migrations. We are ready to go on the Advanced e5 account, with a new VPN, but delays at Advanced mean that this will now not happen before next Tuesday. We are also still dealing with competing priorities in MTFH, but are meeting with the lead developers later today to unblock that. Until the Housing accounts are moved, we cannot move the API accounts. 

However, we are able to clean the API accounts up. The last significant group of apps to be moved to a new account is the GIS apps, such as Earthlight and LLPG. We have five EC2 instances and an RDS database to move, and the infrastructure to do so is just about ready. 

This week, we’ve noticed some issues in the platform, and have taken steps, or will take steps, to fix them. For example, we noticed that our Backups module wasn’t operating as expected – none of the backups older than 30 days had been deleted. We identified a missing line in the code, which has been fixed, and all the old snapshots have been purged. Those surplus snapshots carried an associated cost, so S3 costs should fall a bit next month. 
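The missing piece was a retention rule of the kind sketched below. This is illustrative only: the plan name, vault, and schedule are assumptions, not our actual Backups module.

```hcl
# Sketch of a backup plan with the retention rule that was missing.
resource "aws_backup_plan" "daily" {
  name = "daily-backups"

  rule {
    rule_name         = "daily"
    target_vault_name = "default"
    schedule          = "cron(0 1 * * ? *)"

    # Without a lifecycle block, recovery points accumulate forever
    # and storage costs grow unbounded.
    lifecycle {
      delete_after = 30
    }
  }
}
```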

Costs have been on my agenda this week, so we ran a report to identify over- or under-provisioned EC2 instances, and the recommendations have been shared. Some of the recommended changes don’t save a lot individually, but if we accepted all of them, we could save in the region of $10,000 per month (based on 24/7 usage). And that’s before Savings Plans, which we’re talking to AWS about. 

EC2 cost is now our single most expensive line item. Please make sure that your non-prod EC2s are powered down overnight by enabling the scheduler tags in Terraform. 
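Enabling the scheduler is just a matter of tagging the instance in Terraform. The tag key and value below are assumptions based on common scheduler setups (such as the AWS Instance Scheduler solution); check the Playbook for the exact values our scheduler expects.

```hcl
# Illustrative only: tag key/value depend on the scheduler in use.
resource "aws_instance" "app_dev" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"

  tags = {
    Name     = "app-dev"
    Schedule = "office-hours"  # picked up by the scheduler to stop overnight
  }
}
```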

Finally, we are ripping up our roadmap next week, the second time we’ve done this since starting. We now have a much better understanding of where we are and what is needed, and some of the things we originally envisaged are either no longer necessary, or not possible. We would welcome input into this, and feedback once drafted, so if there’s anything you think should be included, please let us know. 

Cloud Engineering weeknotes, 11 November 2021

Some weeks in this project definitely have themes to them, and for me this week it feels that the theme has been documentation. As I said last week, we know there are gaps in our documentation: most of it exists, but in our heads and practices rather than codified in a single, easily available document. So this week, as people have had spare time, we’ve been writing documentation for the Playbook. I’ve even done some of it myself. There is a lot more to do yet, but you can check our progress. Thank you, Stuart, for the inspiration and motivation. 

We are rapidly (yet it feels oddly slowly, at times) reaching v1.0 of our platform. Following the rebuild of the firewalls, we have been able to deploy Panorama to manage them. This will hopefully be the last major piece of work on the firewalls for a little while. There is one ongoing piece of work though, to set up authentication groups on Globalprotect to give more granular permissions. This is proving a little more difficult than expected, but we have a solution to test. 

Account migrations rumble on; we had of course expected to have finished this work a couple of weeks ago. There’s not been much progress since last week, though we have started looking at the Websites account, and working out if it would be better to host the WordPress instances in containers instead of on EC2s. 

We have the necessary changes for e5 lined up with Advanced, and should be able to move that account and attach its new VPN next week. However, we are stuck on the Housing accounts due to competing priorities in MTFH. We anticipate that that will be resolved next week, and this will in turn unblock the move of the API accounts. 

And then… we start iterating. 

For now, though, there’s also been a lot of support work this week. We’ve supported the Security Assurance team with their work, and created an EC2 for the new Canon Uniflow scanning service. We continue to iterate our GitHub policies, and are providing advice and guidance to several teams. We’ve also just enabled something called AWS Compute Optimizer, which scans our entire estate to identify any compute resources that are over- or under-provisioned. 

Dare I say things are starting to settle and mature?