For more information about the HackIT Data Platform project please have a look at this weeknote on the HackIT blog.
Improving the Data Platform playbook
We have spent some time this week rethinking the structure of our Data Platform playbook. Despite being happy with a lot of the content that has already been added to the playbook, we felt that we needed to spend some time looking at particular user journeys and how to make the experience of using the playbook as comprehensive and accessible as possible.
We have thought about the particular needs of data scientists, analysts, engineers and managers and mapped the common and unique parts of the playbook that they may need to access. We are in the process of restructuring the playbook so that it uses a much clearer navigation menu for these users.
Collaborating with Manage My Home
We’ve met with the Modern Tools for Housing team to agree on a reusable process for streaming data from their application. We hope that the Data Platform's data engineers will be able to help the Manage My Home team with this process.
We have put together a document which explores the benefits of using Kafka over an S3 bucket for the data streaming process. Kafka is open source software which provides a framework for storing, reading and analysing streaming data. We also need to consider how Kafka would work with the current .NET-based API architecture, and whether that architecture imposes any limitations.
Simplifying creating Glue jobs in code
A Glue ‘job’ refers to the business logic that performs the extract, transform, and load (ETL) work in AWS (Amazon Web Services) Glue. When you start a job, AWS Glue runs a script that extracts data from sources, transforms the data, and loads it into targets.
This week we have worked to ensure that the process for creating Glue jobs is clearly documented so analysts can easily create them. We are also working on some templates for Terraform (a tool for managing the entire lifecycle of infrastructure using infrastructure as code) that can be reused by copying and pasting, requiring no prior knowledge of Terraform. We want to organise the code so that an analyst working for a specific department can easily see where to put code relating to them.
Up next we will be testing the process with analysts to make sure our work meets their needs and to decide whether further refinement is required.
Creating a refined (Tascomi) planning data set with a daily snapshot
Now that Tascomi data is getting into the platform every day, we are in a position to refine our workflow. This means that only the data increments will be processed each day (through parsing and refinement). Each daily increment will then be merged with the previous version of the dataset to create a full daily snapshot. We’re still testing this approach and we’re hoping it will make Tascomi data (both current and historic) easy to access for planning analysts.
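To illustrate the idea, here is a minimal sketch in plain Python of how a daily increment could be merged with the previous snapshot. It is not the code running on the platform: the function name `build_snapshot` and the field names `id` and `last_updated` are assumptions made for the example, and the real Tascomi tables may use different keys.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def build_snapshot(
    previous: Iterable[Record],
    increment: Iterable[Record],
    key: str = "id",                      # hypothetical primary key column
    updated_field: str = "last_updated",  # hypothetical modification date column
) -> List[Record]:
    """Merge a daily increment into the previous full snapshot.

    For each key, the record with the most recent `updated_field` wins.
    When dates are equal, the record from the increment replaces the older one.
    """
    latest: Dict[object, Record] = {}
    # Previous snapshot first, then the increment, so later rows can override.
    for record in list(previous) + list(increment):
        current = latest.get(record[key])
        if current is None or record[updated_field] >= current[updated_field]:
            latest[record[key]] = record
    return sorted(latest.values(), key=lambda r: r[key])


# Illustrative data only.
previous_snapshot = [
    {"id": 1, "status": "received", "last_updated": "2021-11-01"},
    {"id": 2, "status": "received", "last_updated": "2021-11-01"},
]
daily_increment = [
    {"id": 2, "status": "decided", "last_updated": "2021-11-02"},
    {"id": 3, "status": "received", "last_updated": "2021-11-02"},
]
snapshot = build_snapshot(previous_snapshot, daily_increment)
```

In this sketch, unchanged records carry over from the previous snapshot, changed records are replaced by their newer version, and new records are added, so analysts always query one complete table.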
Using Redshift with Tascomi data
Redshift is a cloud-based data warehouse product designed for large-scale data set storage and analysis. Exposing the Tascomi data in the Redshift cluster means that we now have daily loads into Qlik (a business analytics platform), with only the latest version of the data stored in tables for analysts to use. We have started to create a data model that pre-builds the associations between the tables for easier interrogation. Analysts can also use Redshift connectors in Google Data Studio.
Next Show & Tell – Friday 12th November
Our next Show & Tell is on the 12th of November from 12 to 12.30pm. Come along to find out more about what we are up to and invite others who may be interested (the calendar invite is open). Email email@example.com if you need any help. If you can’t make it, don’t worry: we will record the session and post it on Currents/Slack afterwards.