Cloud Engineering weeknotes, 26 November 2021

This week we’ve been looking to the future a fair bit. The first part of this was reviewing our roadmap and planning the larger pieces of work for the next 5-6 months. A lot has changed since we put the roadmap together back in March, and we now have a much better understanding of what is needed and of how we work with other teams.

We have removed some items now deemed unnecessary, some items that run counter to our working philosophy, and some items that are low-value. If they are important enough, they will come back in the future. 

We’re concentrating on a smaller number of items, doing fewer things better, and we will start by testing our disaster recovery capability. We’ve had a couple of dry runs already; the idea is to see how we react, then iterate to build on our strengths and address our weaknesses.

Looking to the future has also included starting to plan for when our contractor colleagues roll off. This will be phased, with at least one person remaining till January or February. We are now starting handovers, with documentation and coaching. 

Part of this effort has been a lot of work over the last week or so to automate as much of the firewall management as possible. We should soon be in a position where nobody has to go into the control panel; everything should be managed by Panorama, with any changes made via Terraform. We have high availability, and changes to things like routing should be very rare.

This extends to Globalprotect, which is part of the firewall software. We’re piloting a new way to create an application in Globalprotect using its API, rather than needing direct access to the control panel, and it’s gone well. Be on the lookout for more Globalprotect changes in the next week or so. 
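For anyone curious what “using its API” looks like in practice, here’s a minimal sketch of the pattern in Python against the PAN-OS XML API, rather than the web console. The hostname, xpath, and address object below are illustrative placeholders, not our actual GlobalProtect configuration, and a real change would also need a commit step afterwards.

```python
"""Illustrative sketch only: making a PAN-OS configuration change through the
XML API instead of the web console. The host, xpath, and element are
placeholders, not our real GlobalProtect configuration."""
import os
import xml.etree.ElementTree as ET

import requests

FIREWALL = "https://firewall.example.internal"  # placeholder hostname


def get_api_key(username: str, password: str) -> str:
    """Exchange credentials for a PAN-OS API key (type=keygen)."""
    resp = requests.get(
        f"{FIREWALL}/api/",
        params={"type": "keygen", "user": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return ET.fromstring(resp.text).find(".//key").text


def set_config(api_key: str, xpath: str, element: str) -> str:
    """Apply a configuration change (type=config, action=set) at the given xpath."""
    resp = requests.get(
        f"{FIREWALL}/api/",
        params={
            "type": "config",
            "action": "set",
            "xpath": xpath,
            "element": element,
            "key": api_key,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return ET.fromstring(resp.text).get("status")  # "success" or "error"


if __name__ == "__main__":
    key = get_api_key(os.environ["PAN_USER"], os.environ["PAN_PASSWORD"])
    # Hypothetical example: add an address object for a newly published app.
    print(set_config(
        key,
        xpath=(
            "/config/devices/entry[@name='localhost.localdomain']"
            "/vsys/entry[@name='vsys1']/address/entry[@name='new-app']"
        ),
        element="<ip-netmask>10.0.0.10/32</ip-netmask>",
    ))
```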

We’ve made some progress on the account migrations this week. The work to migrate Advanced e5 is prepared, and the change should be made imminently. We’ve also migrated Housing-Dev, and work on the Websites account is progressing well. Unfortunately we’re blocked on the other Housing accounts, which in turn blocks the final set of migrations. We are talking daily to the MTFH team about getting this unblocked. 

The next few weeks may be frantic, but the light at the end of the tunnel is getting brighter.

Cloud Engineering weeknotes, 19 November 2021

More documentation this week, including a draft “Team ways of working” document that has really made me think. When writing it, I looked back to our first show & tell a year ago, when we set out our principles and values; I really do believe that we have held true to these. The fact that we’re doing so without consciously referring back to them is a good sign that the principles and values really do describe who we are as a team.

We are tantalisingly close to finishing off some huge pieces of work. Our new firewalls are in, Panorama is being deployed for central management, and we have a host of improvements to Globalprotect lined up. One significant change coming in the near future will be new, and separate, URLs depending on where the application is hosted. The majority of applications will be on gp-apps.hackney.gov.uk and any applications hosted in our own AWS, like Qlik, will be on gp-vpn.hackney.gov.uk.

Unfortunately, there’s been no real progress on account migrations. We are ready to go on the Advanced e5 account, with a new VPN, but delays at Advanced mean that this will now not happen before next Tuesday. We are also still dealing with competing priorities in MTFH, but are meeting with the lead developers later today to unblock that. Until the Housing accounts are moved, we cannot move the API accounts. 

However, we are able to clean the API accounts up. The last significant group of apps to be moved to a new account is the GIS apps, such as Earthlight and LLPG. We have five EC2 instances and an RDS database to move, and the infrastructure to do so is just about ready. 

This week we’ve noticed some issues in the platform, and have taken steps, or will take steps, to fix them. For example, we noticed that our Backups module wasn’t operating as expected: none of the backups older than 30 days had been deleted. We identified a missing line in the code, which has been fixed, and all the old snapshots have been purged. Those snapshots carry a storage cost, so S3 costs should fall a bit next month.
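For illustration, the retention check that was missing boils down to something like this boto3 sketch. It isn’t our actual module, which is defined in Terraform; it just shows the shape of the logic.

```python
"""Illustrative retention sweep: delete our own EBS snapshots once they are
more than 30 days old. Not the real Backups module, which is defined in
Terraform; this just shows the shape of the check that was missing."""
from datetime import datetime, timedelta, timezone

import boto3

RETENTION_DAYS = 30


def prune_old_snapshots(dry_run: bool = True) -> None:
    ec2 = boto3.client("ec2")
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

    # Walk every snapshot owned by this account and delete anything past the cutoff.
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        for snapshot in page["Snapshots"]:
            if snapshot["StartTime"] < cutoff:
                print(f"Deleting {snapshot['SnapshotId']} from {snapshot['StartTime']:%Y-%m-%d}")
                if not dry_run:
                    ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])


if __name__ == "__main__":
    prune_old_snapshots(dry_run=True)  # flip to False to actually delete
```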

Costs have been on my agenda this week, so we ran a report to identify over- or under-provisioned EC2 instances, and the recommendations have been shared. Some of the recommended changes don’t save a lot individually, but if we accepted all of them we could save in the region of $10,000 per month (based on 24/7 usage). And that’s before Savings Plans, which we’re talking to AWS about.

EC2 cost is now our single most expensive line item. Please make sure that your non-prod EC2s are powered down overnight by enabling the scheduler tags in Terraform. 
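If you’re wondering what those tags actually drive, the scheduler effectively does something like the sketch below: find running instances carrying a schedule tag and stop them outside office hours. The tag key and value here are hypothetical; the Terraform module defines the real ones to use.

```python
"""Rough illustration of an overnight scheduler: stop running EC2 instances
that carry a scheduling tag. The tag key/value are hypothetical; see the
Terraform module for the real tags to set on your non-prod instances."""
import boto3


def stop_tagged_instances(tag_key: str = "Scheduler", tag_value: str = "office-hours") -> list:
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    print("Stopped:", stop_tagged_instances())
```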

Finally, we are ripping up our roadmap next week, the second time we’ve done this since starting. We now have a much better understanding of where we are and what is needed, and some of the things we originally envisaged are either no longer necessary, or not possible. We would welcome input into this, and feedback once drafted, so if there’s anything you think should be included, please let us know. 

Cloud Engineering weeknotes, 11 November 2021

Some weeks in this project definitely have themes to them, and for me this week the theme has been documentation. As I said last week, we know there are gaps in our documentation; most of it exists, but it lives in our heads and practices rather than being codified in a single, easily available document. So this week, as people have had spare time, we’ve been writing documentation for the Playbook. I’ve even done some of it myself. There is a lot more to do yet, but you can check our progress. Thank you, Stuart, for the inspiration and motivation.

We are rapidly (yet it feels oddly slowly, at times) reaching v1.0 of our platform. Following the rebuild of the firewalls, we have been able to deploy Panorama to manage them. This will hopefully be the last major piece of work on the firewalls for a little while. There is one ongoing piece of work though, to set up authentication groups on Globalprotect to give more granular permissions. This is proving a little more difficult than expected, but we have a solution to test. 

Account migrations rumble on; we had of course expected to have finished this work a couple of weeks ago. There’s not been much progress since last week, though we have started looking at the Websites account, and working out if it would be better to host the WordPress instances in containers instead of on EC2s. 

We have the necessary changes for e5 lined up with Advanced, and should be able to move that account and attach its new VPN next week. However, we are stuck on the Housing accounts due to competing priorities in MTFH. We anticipate that this will be resolved next week, which will in turn unblock the move of the API accounts.

And then… we start iterating. 

For now, though, there’s also been a lot of support work this week. We’ve supported the Security Assurance team with their work, and created an EC2 for the new Canon Uniflow scanning service. We continue to iterate our GitHub policies, and are providing advice and guidance to several teams. We’ve also just enabled something called Compute Optimiser, which scans our entire estate to identify any Compute resources that are over- or under-provisioned. 
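As a side note, the findings can be pulled programmatically as well as viewed in the console. A minimal sketch with boto3, assuming Compute Optimiser has been enabled in the account (the field names follow the get_ec2_instance_recommendations response):

```python
"""Minimal sketch: summarise Compute Optimiser findings for EC2 via boto3.
Assumes Compute Optimiser is already enabled in the account; no
account-specific details are baked in here."""
import boto3


def summarise_ec2_recommendations() -> None:
    client = boto3.client("compute-optimizer")
    response = client.get_ec2_instance_recommendations()

    for rec in response["instanceRecommendations"]:
        options = rec.get("recommendationOptions", [])
        suggested = options[0]["instanceType"] if options else "n/a"
        # finding is e.g. OVER_PROVISIONED, UNDER_PROVISIONED or OPTIMIZED
        print(f"{rec['instanceArn']}: {rec['finding']} "
              f"({rec['currentInstanceType']} -> {suggested})")


if __name__ == "__main__":
    summarise_ec2_recommendations()
```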

Dare I say things are starting to settle and mature?

Cloud Engineering weeknotes, 5 November 2021

I was off last week; as I got back into things this week, I had a chance to reflect on how far we’ve come over the last 10 months. 

The catalyst for this was a chat with Stuart, our new (permanent!) Senior Engineer. Considering the circumstances of our formation, as a team we have built a stable, secure platform; we’ve coached each other and upskilled a number of Hackney colleagues who had no previous experience in AWS; we’ve made a welcoming, productive, and expert team which often puts the needs of others ahead of our own. 

And I’m proud of that. In my absence, the team kept going, did all the right things in the right way, and welcomed a new member as if he’d always been here. But we just don’t give ourselves enough credit for what we’ve done, or for the circumstances we’ve done it in. As I’ve said before, nobody would ever choose to do a cloud migration in our circumstances, and we have much to be proud of.

Stuart’s arrival has given us a useful outsider’s view on what we’ve done and what’s missing. He’s given us a brain dump of the documentation he’d expect as a new starter, which we will work through over the next couple of sprints. Almost all of it exists already; we just need to publish it in the Playbook so that it’s all in one place. He’s also started work on formalising our change and release processes so that we can avoid repeating some of the mistakes we’ve made in the last couple of months.

The account migrations proceed, though slowly. The work needed to move e5 and the Housing accounts is lined up, and we’ve started decommissioning unused resources (with the data backed up). There is a definite chain of events – Manage Arrears needs to be updated so that we can move Housing, which will enable us to move APIs, which will allow us to clean up those accounts and move things to more appropriate homes. 

We’ve made some additional security improvements this sprint. We have a module to automate much of the Windows Server patching, which we spoke about in our lunch & learn. We’ve also made some changes to the GitHub repo to restrict who can approve PRs and enabled Branch Protection. 
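For reference, branch protection can also be applied through the GitHub REST API rather than by clicking through the repo settings. A minimal sketch follows, with placeholder org and repo names and a token that needs admin rights on the repository:

```python
"""Minimal sketch: enable branch protection on a repository via the GitHub
REST API. The owner/repo names are placeholders and GITHUB_TOKEN must belong
to someone with admin rights on the repo."""
import os

import requests

OWNER = "example-org"   # placeholder organisation
REPO = "example-repo"   # placeholder repository
BRANCH = "main"


def protect_branch() -> None:
    url = f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection"
    payload = {
        "required_status_checks": None,
        "enforce_admins": True,
        "required_pull_request_reviews": {
            "required_approving_review_count": 1,
            "dismiss_stale_reviews": True,
        },
        "restrictions": None,  # optionally limit who can push to the branch
    }
    resp = requests.put(
        url,
        json=payload,
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Branch protection enabled on {BRANCH}")


if __name__ == "__main__":
    protect_branch()
```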

The firewalls have been completely overhauled in the last two weeks. We’ve adopted a new licensing model that saves a lot of money, and the revised Terraform allows for faster redeployments. Importantly, we’re now able to use Panorama to manage the full suite of firewalls, which means we only need to make a configuration change once; Panorama manages deploying that change to all the other devices.

Thank you for your patience while we’ve rebuilt the firewalls; we know there have been a lot of outages and cancellations, but that does neatly illustrate why we need to tighten up our own change and release processes!