Cloud Engineering weeknotes, 25 March 2022

We did something new this week: We got together in the HSC. Well, most of us, as a few team members have been sick this week. But for the first time, the majority of the team were in the same place, at the same time, and it was magic. We did our show & tell with an actual audience and proved to our colleagues that we do, in fact, have legs. 

We’ve done some actual work as well. The StagingAPIs account was migrated to the Hub on Thursday night with no significant issues, and it’s working well. This just leaves the Production account and a legacy account that is mostly unused, but needs to be cleaned up. This has been an absolute labour of love, and although the migrations have taken significantly longer than planned, better to do it slowly and safely than to break things for everyone. 

Our work supporting the Mosaic restoration project is drawing to a close. This week we built the reporting server, and because of the limits of RDP connections we have proposed using AppStream for the data analysts to do their work. There is a plan B if that doesn’t work, but AppStream is (for once) likely to be more cost-effective. We’re working with the AppStream team on this now.

We’ve also provided support this week to the Document Migration team, helping them set up additional EC2s to process the vast number of recovered eDocs into Google Drive. We’ve made sure that these have been set up in a cost-effective way, and the team will ensure that the instances are shut down when not in use.

We’ve been able to get some platform work done this week, which is good. Some of this has been simply cleaning up code – I say simply, but it’s often harder to make your code clean – and we’ve also improved how some of our components work. One example is a new CircleCI module which automates the IAM setup for CircleCI in new accounts. This had previously been done by hand, so the module is in line with our policy of automating as much as we can.
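
For the curious, the sketch below shows roughly the kind of IAM resources such a module manages in each account. It’s illustrative only – the user name and attached policy are placeholders, not the module’s actual interface.

    # Rough sketch only: the user name and attached policy are placeholders,
    # not the CircleCI module's actual interface.
    resource "aws_iam_user" "circleci" {
      name = "circleci-deployment"

      tags = {
        ManagedBy = "Terraform"
      }
    }

    resource "aws_iam_user_policy_attachment" "circleci" {
      user       = aws_iam_user.circleci.name
      policy_arn = "arn:aws:iam::aws:policy/PowerUserAccess"
    }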

Centralised logging has not been deployed due to sickness absence, but we’ve started some work to set up Athena on the CloudTrail logs. This will make it much easier to query the logs, especially if we need to audit actions taken by individuals. As we’ve said before, a secure platform is a good platform.
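
As a rough illustration, the Terraform for this looks something like the sketch below. The database, bucket and query are all placeholders, and it assumes a CloudTrail table has already been defined over the log bucket.

    # Illustrative sketch: names are placeholders, and a CloudTrail table is
    # assumed to already exist in the database.
    resource "aws_athena_database" "cloudtrail" {
      name   = "cloudtrail_logs"
      bucket = "example-athena-query-results" # hypothetical results bucket
    }

    resource "aws_athena_named_query" "actions_by_user" {
      name     = "actions-by-user"
      database = aws_athena_database.cloudtrail.name
      query    = <<-SQL
        SELECT eventtime, eventname, eventsource
        FROM cloudtrail_logs.cloudtrail
        WHERE useridentity.arn LIKE '%example.user%'
        ORDER BY eventtime DESC
        LIMIT 100;
      SQL
    }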

Cloud Engineering weeknotes, 18 March 2022

Our good friend WordPress paid another visit this week, with more troubles on the Intranet. We’ve applied a manual fix for now, but we have been able to make progress on the long-term fix of better infrastructure. We’ve also had a meeting with a WordPress specialist who is doing a discovery into the estate, which was useful. 

Steady progress has been a theme this week, despite a few curveballs. A significant problem with Webreg in GlobalProtect was reported to us. We thought using the desktop client wasn’t an option because our users have Chromebooks, but a little digging revealed that Palo Alto has actually released a Chrome OS desktop client. We’ve tested it, and it works. We’re working with our Google and Devices teams to get this rolled out, which should happen early next week.

We have also migrated the DevelopmentAPIs account onto the Hub. This went without a hitch, and leaves just the Staging and Production accounts to go, which should follow soon. Both accounts will need some clean-up as they have applications that should be elsewhere, but that can be done later. The main applications are the GIS applications, and we have an agreed plan for that in April. 

Our support for the Mosaic restoration will draw to a close soon, but the essential infrastructure for their go-live is in a good place. We need to help with the reporting server but that can wait a few days, and won’t take long. 

On top of all this, there’s been some platform iteration as well. With thanks to the Social Care team, the backup service now includes DocumentDB instances. It’s fantastic to see a product team iterating one of our modules (reviewed by us) – this is the sort of thing we always envisaged from the start of this work way back when. 

We’ve also been tidying up the code for the HSCN connection, and have been working on a centralised logging service. The latter collates all the different logs in a single place, making it easier for the Security team to review and inspect them. This links nicely with some work on permissions we have planned for next month, which will start with a new permission set specifically for that team. A secure platform is a good platform.

Cloud Engineering weeknotes, 11 March 2022

It was Groundhog week for WordPress issues on the website, but after a lot of cross-team cooperation they have now been resolved. WordPress as a whole will hopefully become less of a strain on the team, as we have a meeting with a WordPress specialist to go over the current state and work out a way forward.

Some work has been done to start centralising CloudWatch logs from every account, aggregating them into a single logging account. Control Tower does some of this for us, but there are areas it doesn’t cover. Centralising the logs in this way will give us a base from which to eventually provide centralised monitoring and alerting across the whole platform, so it is a useful building block for the future.
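
In Terraform terms, the spoke-account side of this is roughly a subscription filter per log group pointing at a destination exposed by the logging account. The sketch below is illustrative only – the log group, region, account ID and destination name are all placeholders, and the central destination (and the stream behind it) is assumed to exist.

    # Spoke-account side only; the logging account's destination and the
    # stream behind it are assumed to already exist.
    resource "aws_cloudwatch_log_subscription_filter" "to_central_logging" {
      name            = "ship-to-logging-account"
      log_group_name  = "/aws/lambda/example-app" # placeholder log group
      filter_pattern  = ""                        # empty pattern = send everything
      destination_arn = "arn:aws:logs:eu-west-2:111122223333:destination:central-logs"
    }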

Some housekeeping has been done on the Palo Alto firewalls, removing configuration that was no longer needed or was left over from testing. More work has been done to make access to the firewalls’ management dashboard far more secure, putting it behind a VPN and further tightening the security around a vital part of the AWS platform. Finally in firewall-land, more work has been done to get Panorama up and running, which will let us manage our three firewall environments in one place whilst also giving us faster disaster recovery options.

Work is still ongoing to migrate the GIS systems to their own dedicated account. This work is important to bring some order to how our accounts are organised, whilst also making it possible for us to finish off the work to migrate our “legacy” accounts to our final network architecture.

We’re still collaborating closely with the Security team, having had a very productive meeting about potential vulnerabilities in our setup and how we can give them more direct access to AWS so they can see for themselves where we have issues.

Beyond that, there has been the usual support and guidance that we provide around AWS access requests, 1Password requests (another orphaned service we’ve taken on) and cross-account communication for some of the ongoing data recovery activities.

Cloud Engineering weeknotes, 4 March 2022

Our WordPress woes have continued into this week, with the pipeline between WordPress and Netlify falling over. This has meant that content on the main website couldn’t be updated. After investigation, the issue proved to be a plugin error, but one which was having knock-on effects and wasn’t easily fixable. We now believe the error is on the front end, and we can’t do anything there; the Dev team will be approached for help. 

The WordPress instances have proven to be a significant drain in recent months. We’ll be taking this up with DMT again as it’s outside our remit. The new infrastructure will help, but providing this level of life support is actually blocking us from moving the instances over to it. 

In happier news, we’ve worked with the Data Recovery team to remove five EC2 instances and reduce their EBS volumes by 40TB. Resizing the volumes will save over $4500 per month, and it’s already being reflected in our cost forecast for March. On the costs front, we’ve worked with the Data Platform team to move the Budgets module into the Infrastructure repo, so it’s available for all teams to use. We’d encourage it. 
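
For teams wondering what the Budgets module gives them: it will look something along the lines of the underlying AWS Budgets resource below. The amounts, thresholds and email address are illustrative, not the module’s real interface.

    # Illustrative example of the underlying resource a Budgets module would
    # typically wrap; the amount, threshold and email address are placeholders.
    resource "aws_budgets_budget" "monthly_cost" {
      name         = "monthly-account-budget"
      budget_type  = "COST"
      limit_amount = "5000"
      limit_unit   = "USD"
      time_unit    = "MONTHLY"

      notification {
        comparison_operator        = "GREATER_THAN"
        threshold                  = 80
        threshold_type             = "PERCENTAGE"
        notification_type          = "FORECASTED"
        subscriber_email_addresses = ["cloud-engineering@example.org"]
      }
    }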

The first of the three business grants applications has been moved out of the APIs accounts, prior to those accounts being migrated to the Hub. It was a good learning experience, and the steps have been documented. We’ve agreed to pause the other two applications for now. They may be decommissioned soon, and they’re not blocking the account migration. 

Our networking support for other projects is going well. The MESH client for Mosaic has been configured and can connect to Servelec. As the HSCN connection is only in Production, we need to repeat the exercise there. We’ve also finally cleared up some confusion over the requirements for a product for Repairs, setting a red line for the supplier, as the original design on their side would have presented a security risk for us.

We’ve also finally started the migration from standalone firewall configurations to managing them all with Panorama. We’re starting in the Dev environment, and there may be short-term outages as the Hubs rebuild. We’ll give notice of this, and Production will of course be done out of hours.

Cloud Engineering weeknotes, 25 February 2022

It feels like a lot has been done this week, which is good. It helps that all team members have been available this week, not that having time off is a bad thing!

We started the week with our old friend WordPress. We found out late on Sunday night that two Hackney sites had fallen over. Emergency action was taken to restore both sites and to amend the EC2s they run on, so that this shouldn’t happen again. The real solution, though, is to complete the move to the new infrastructure. This has been going slowly due to an issue in the load balancers, but we think we’ve resolved that now. 

Migration work has picked up again; the first grants application is nearly finished after being paused for a few days for urgent support work. We will review whether to move the other two now or later, in view of this week’s government announcements. At account level, we are almost in a position to move the DevelopmentAPIs account, but have discovered that it has only one subnet – which the move process tries to destroy. We’re investigating a workaround. 

We’ve been supporting the Mosaic project a good bit this week. The EC2 box they need has been built and the client installed, and we are just waiting for Servelec to amend the VPN configuration on their side. We’ve also now obtained a trial licence for Host Access to test with Plus5, and need to set up a VPN for that. Finally on the networking front, we had a really constructive discussion with Data Platform about making access to Qlik as secure as possible, given the desire to take it out of AppStream and the knowledge that it’s not compatible with GlobalProtect. A paper will be discussed at TDA next week.

We reviewed our roadmap this week. Most of the changes were to break existing items down into smaller chunks, but we have added new tasks to create new Terraform modules (as a result of the Backstage paper mentioned last week) and to iterate existing ones. One iteration pulled forward is to change the default setting in the Overnight Shutdown module to TRUE, so that all EC2s will shut down overnight unless explicitly changed. This change will be merged soon and will apply when you next update your Terraform, so be aware and take necessary action to ensure that any EC2s needed overnight remain powered up.
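
For teams whose instances genuinely need to stay up, opting out will look something like the snippet below. The module path and variable name here are illustrative – check the module’s README for the real ones.

    # Illustrative only: the source path and variable name may differ from the
    # real Overnight Shutdown module.
    module "overnight_shutdown" {
      source = "../modules/overnight-shutdown"

      # The default is changing to true; set this explicitly if your EC2s
      # must keep running overnight.
      shutdown_overnight = false
    }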