Data Platform Weeknote 19: 07.02.2022

About the Data Platform

The vision for the Data Platform is to rebuild a better data infrastructure to deliver a secure, scalable and reusable cloud-based platform that brings together the council’s key data assets. This will enable us to democratise access to data (where appropriate), use technology to enable deeper insight, and derive greater value from our data to improve the lives of residents in Hackney. For more information, see our Playbook.

We’re now at a point where more analysts (mostly within Data & Insight for now) are beginning to use the platform more for their BAU work. As such, we’re trying to make a clearer distinction between what we’re doing to build the platform (the focus of the project) and how people are using the platform. For example, we’ve started using separate ‘Work In Progress’ streams on our Jira board and structured our Show & Tell to reflect building vs. using, and we’re thinking about what other changes might be helpful.

How we’ve been building the platform this week:

  • Creating a reusable process to ingest data from our in-house applications through event-streaming so that we can make housing data available for reporting: We’ve been frustratingly blocked on getting the housing data that we’ve received in Kafka into S3. There is a managed connector, but we haven’t been able to get the configuration to work, and the AWS documentation we’ve found suggests that we need to develop our own. We’ve been trying to avoid this because a custom-built connector would be much more difficult for a Hackney team to maintain in future. Next week we have a new Data Engineer with Kafka experience joining the team, as well as a meeting with AWS, which we’re hoping will help solve the problem.
  • Creating a reusable process to ingest data from an MS SQL database so that we can make Council Tax data available for reporting and reuse: We began a tech spike to determine the best way to ingest Academy data (via their Insight database) into the platform. We’re exploring how to get the data most efficiently and have been testing some approaches, but we haven’t been able to focus on this as much as we planned because the team needed to come together to try to unblock the Kafka issues.
  • Setting up AWS cost alerting so that we can better monitor our usage and spend: We’re creating a reusable process to alert the team if we have a spike in our spending. The alerting process is in place. Now we just need to sort out the month-on-month comparison so that we can distinguish between ‘normal’ increases as the platform is used more and unusual spikes.
  • Improving the analyst experience of prototyping scripts: Analysts in parking have had difficulty switching between the different SQL dialects used by the tools they use to prototype (Athena) and to schedule and run scripts (Glue). We used our collab session to discuss how a notebooking solution would help meet this need, as well as others (e.g. analysts will need a Python prototyping environment that can run on a Chromebook rather than in Docker). However, we agreed that we could meet their needs either through a notebook or by changing our ‘orchestration’ (i.e. how we schedule workflows and trigger jobs), and we want to test out basic versions of these to see which approach better meets their needs.
  • Migrating Qlik into the platform infrastructure: We’ve been working with the DevOps team to complete the required setup to move Qlik from the Production API account into the Data Platform Production account. To perform this migration we’ve had to set up a few bits of infrastructure around Qlik again. Specifically, we’ve been configuring a security certificate, which is used to encrypt the connection to Qlik, and setting up AppStream, which will initially allow users to connect to Qlik inside the Data Platform account.

Alongside this work we’ve been developing an alternative method of connecting to Qlik in the form of the Global Protect VPN, which allows users to gain access to the private network of the AWS accounts without exposing the devices held in it to the public network. This will allow users to connect to the Qlik server without having to expose Qlik publicly.

There are two main issues around the implementation of Global Protect. The first is that Google Single Sign-on isn’t working correctly when accessed through the Global Protect Portal, because the portal loads the Google pages on your behalf, which breaks some of the Google code. The second is that while using the Production Global Protect desktop application you do not have access to the internet. This restriction wasn’t present in the staging environment but is the intended setup in production. We’re working with the Cloud Engineering team to address these issues, but may need to explore other options for providing secure access to Qlik if we can’t address them.
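
To make the month-on-month comparison in the cost alerting item above more concrete, here is a minimal sketch in Python of one way it could work. It is not the team's implementation: the function names, the 25-percentage-point tolerance and the idea of using the median of earlier growth rates as the 'normal' baseline are all assumptions made for illustration.

```python
from statistics import median


def month_on_month_changes(monthly_costs):
    """Fractional change between each pair of consecutive months.

    Assumes every monthly total is greater than zero.
    """
    return [
        (current - previous) / previous
        for previous, current in zip(monthly_costs, monthly_costs[1:])
    ]


def is_unusual_spike(monthly_costs, tolerance=0.25):
    """Return True if the latest month grew much faster than usual.

    'Usual' is the median of the earlier month-on-month changes, so steady
    growth from increased platform use is not flagged. `tolerance` is a
    hypothetical threshold (0.25 = 25 percentage points above usual growth).
    """
    changes = month_on_month_changes(monthly_costs)
    if len(changes) < 2:
        return False  # not enough history to know what 'normal' looks like
    *history, latest = changes
    return latest > median(history) + tolerance


# Steady 10% growth each month: treated as a 'normal' increase.
steady = [1000, 1100, 1210, 1331]
# Same history, but the latest month roughly doubles: treated as a spike.
spiky = [1000, 1100, 1210, 2400]
print(is_unusual_spike(steady), is_unusual_spike(spiky))
```

Comparing against the platform's own recent growth, rather than a fixed spending limit, means the threshold does not need raising every time usage legitimately increases.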

How analysts have been using the platform this week:

  • We fixed a bug for Parking analysts which was preventing their dashboards from getting the latest data. Their Liberator tables were getting so big that the ingestion process was creating new partitions (essentially creating two folders for the same day, rather than putting all the files in the same folder) so we’ve updated the process so it’s all going to the same place.
  • In preparation for getting housing data into the platform, we’ve mapped the data we need to recreate a tenant and leaseholder list against the platform API entities and shared this with the Manage My Home team so they know what further work is needed on their end.
  • We’re developing a set of refined bulky waste tables that present key information such as collection date and cancellation date. This requires a fair bit of data transformation because of the way this data is currently held within the Liberator tables in the raw zone. The refined tables will make it much easier to use this data in a dashboard.
  • We’ve added a Python library that enables us to convert British National Grid coordinates to latitude and longitude. This will make it easier for Qlik to map this data, and can be reused for other datasets.
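
For readers curious about what the coordinate conversion in the last item involves, the sketch below shows one way to do it in plain Python. It is not the library we added (which isn't named here). It follows Ordnance Survey's published Transverse Mercator formulas and then applies the standard seven-parameter Helmert datum shift, which is only accurate to a few metres. The function names are made up for illustration.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants (Ordnance Survey).
AIRY_A, AIRY_B = 6377563.396, 6356256.909
F0 = 0.9996012717                                    # central meridian scale
LAT0, LON0 = math.radians(49.0), math.radians(-2.0)  # true origin
E0, N0 = 400000.0, -100000.0                         # grid coords of origin
AIRY_E2 = 1 - AIRY_B**2 / AIRY_A**2
# GRS80 ellipsoid, used for WGS84/ETRS89 latitude and longitude.
GRS_A, GRS_B = 6378137.0, 6356752.3141
GRS_E2 = 1 - GRS_B**2 / GRS_A**2


def _meridional_arc(lat):
    n = (AIRY_A - AIRY_B) / (AIRY_A + AIRY_B)
    d, s = lat - LAT0, lat + LAT0
    return AIRY_B * F0 * (
        (1 + n + 1.25 * n**2 + 1.25 * n**3) * d
        - (3 * n + 3 * n**2 + 2.625 * n**3) * math.sin(d) * math.cos(s)
        + (1.875 * n**2 + 1.875 * n**3) * math.sin(2 * d) * math.cos(2 * s)
        - (35 / 24) * n**3 * math.sin(3 * d) * math.cos(3 * s)
    )


def _grid_to_osgb36(easting, northing):
    """Inverse Transverse Mercator: grid metres -> OSGB36 lat/lon (radians)."""
    lat = LAT0 + (northing - N0) / (AIRY_A * F0)
    while abs(northing - N0 - _meridional_arc(lat)) >= 1e-5:
        lat += (northing - N0 - _meridional_arc(lat)) / (AIRY_A * F0)
    sin2 = math.sin(lat) ** 2
    nu = AIRY_A * F0 / math.sqrt(1 - AIRY_E2 * sin2)
    rho = AIRY_A * F0 * (1 - AIRY_E2) / (1 - AIRY_E2 * sin2) ** 1.5
    eta2 = nu / rho - 1
    t, sec, de = math.tan(lat), 1 / math.cos(lat), easting - E0
    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu**3) * (5 + 3 * t**2 + eta2 - 9 * t**2 * eta2)
    ix = t / (720 * rho * nu**5) * (61 + 90 * t**2 + 45 * t**4)
    x = sec / nu
    xi = sec / (6 * nu**3) * (nu / rho + 2 * t**2)
    xii = sec / (120 * nu**5) * (5 + 28 * t**2 + 24 * t**4)
    xiia = sec / (5040 * nu**7) * (61 + 662 * t**2 + 1320 * t**4 + 720 * t**6)
    lat_out = lat - vii * de**2 + viii * de**4 - ix * de**6
    lon_out = LON0 + x * de - xi * de**3 + xii * de**5 - xiia * de**7
    return lat_out, lon_out


def _osgb36_to_wgs84(lat, lon):
    """Approximate datum shift using a seven-parameter Helmert transformation."""
    nu = AIRY_A / math.sqrt(1 - AIRY_E2 * math.sin(lat) ** 2)
    x = nu * math.cos(lat) * math.cos(lon)
    y = nu * math.cos(lat) * math.sin(lon)
    z = (1 - AIRY_E2) * nu * math.sin(lat)
    arcsec = math.pi / (180 * 3600)
    rx, ry, rz = 0.1502 * arcsec, 0.2470 * arcsec, 0.8421 * arcsec
    s = 1 - 20.4894e-6
    x2 = 446.448 + s * x - rz * y + ry * z
    y2 = -125.157 + rz * x + s * y - rx * z
    z2 = 542.060 - ry * x + rx * y + s * z
    # Cartesian -> latitude/longitude on GRS80, by fixed-point iteration.
    p = math.hypot(x2, y2)
    lat2 = math.atan2(z2, p * (1 - GRS_E2))
    for _ in range(10):
        nu2 = GRS_A / math.sqrt(1 - GRS_E2 * math.sin(lat2) ** 2)
        lat2 = math.atan2(z2 + GRS_E2 * nu2 * math.sin(lat2), p)
    return lat2, math.atan2(y2, x2)


def bng_to_lat_lon(easting, northing):
    """Convert British National Grid metres to (latitude, longitude) degrees."""
    lat, lon = _osgb36_to_wgs84(*_grid_to_osgb36(easting, northing))
    return math.degrees(lat), math.degrees(lon)
```

Calling `bng_to_lat_lon(534900, 184700)`, for example, converts a grid reference in the Hackney area into a latitude and longitude pair that a mapping tool can plot. Where survey-grade accuracy matters, a maintained library that uses Ordnance Survey's OSTN15 grid shift would be the better choice.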

Up Next:

  • Onboarding a new senior data engineer into the team
  • Hopefully unblocking our Kafka issues, or if not reassessing our approach
  • Completing the tech spike on Academy
  • Supporting analysts to keep using the platform to complete their BAU tasks