Archiving Crossrail’s Data
Author: Dr Isao Matsumoto
Publication Date: 13/03/2018
This micro-report outlines the innovative approach adopted by Crossrail for archiving its data at the end of the project.
The paper will be of interest to Information Governance and IT professionals working on large programmes.
Crossrail is Europe’s largest construction project, tasked with delivering the new Elizabeth line, which will bring an extra 1.5 million people to within 45 minutes of central London. The line runs from Shenfield and Abbey Wood in the east to Heathrow and Reading in the west.
A scalable IT setup was required to address the rapid growth of the business as it moved from the design phase to the construction phase of the project, and its downsizing at the end of the project. Supporting the changing needs of the business at each phase of the Crossrail project led to the adoption of a purpose-built, largely stand-alone IT infrastructure and application capability, unrestricted by legacy systems and dedicated to the task of building the new railway.
To support the delivery of the construction project, Crossrail had a dedicated IT team of about twenty-five people, supported by three outsourcing relationships for IT, CAD and Finance. At its peak the team supported a diverse user base of approximately 3,000 users at over forty sites around London.
Crossrail’s Archiving Challenge
Through the course of the Crossrail project numerous IT systems were used to support and deliver different aspects of the project (see Data Architecture Strategy). However, once the Elizabeth line (operating name of the new service) is handed over to Rail for London (RfL) to operate and all the Crossrail contracts have been closed, none of these IT systems will be needed for the ongoing operation of the Elizabeth line.
Whilst the IT systems are not needed for operational reasons, the information contained in these systems still needs to be retained. Transport for London (TfL), who will have overall responsibility for the new Elizabeth line, therefore required all relevant Crossrail information to be handed over for corporate and regulatory compliance (for example HR records and Finance records need to be retained for 7 years and Asbestos records need to be retained for 41 years). The retention of information was further complicated due to the temporary nature of the Crossrail company, whose sole purpose was to deliver the Elizabeth line, which meant there was no direct path for these systems to be reused on other projects.
The requirement to retain information, and the fact that the systems would not be reused, presented the business with the challenge of finding an efficient and cost-effective way of maintaining access to the data and retaining knowledge of the data. To address these issues the Crossrail IT team took the ambitious and innovative decision to develop a purpose-built archive system to retain the data from its systems.
Archiving Options Considered & Selected
Where Crossrail and TfL use the same systems, the decision was made to move the information contained in the Crossrail system directly to the equivalent TfL system for retention. Examples of this approach include the Financial Management System, CAD and a number of Procurement Management Systems. However, for the remaining 24 systems used by Crossrail that did not exist in TfL, an alternative approach for retaining the information contained in these systems had to be found.
The initial option considered for retaining the information within these 24 systems was to maintain each of them (specific examples included the document/contract management system, the programme planning system and the email archiving system). However, the forecast ongoing infrastructure, licensing, support and maintenance costs quickly became prohibitive. Another challenge with maintaining legacy systems is that, although each system could be kept running, the knowledge of how these systems were set up and how they worked would slowly be lost, making it difficult, if not impossible, to find the relevant information over time. The alternative option, which was adopted, was to extract the information from each Crossrail system and add it to a single Crossrail Archive.
Crossrail Archive Solution
A single Crossrail Archive creates a digital representation of the physical and functional characteristics of the Elizabeth line delivered by Crossrail, that is, a Building Information Model (BIM). A key benefit of this was that most technical, commercial and/or legal questions could be addressed from a single system, rather than by requesting information held in multiple systems owned by multiple departments in TfL.
In deciding how the Crossrail Archive should be developed two alternatives were considered:
- To extract the data from each system and populate a Structured Query Language (SQL) relational database.
- To extract the data from each system and use a NoSQL (non-relational database) approach.
The SQL approach was discounted because of the complexity of accommodating each of the bespoke tables, and the relationships between tables, needed to build up a picture of a single transaction for each system.
Converting the proprietary data schemas into a simple, human-readable NoSQL set of key-value pairs (label and value) ensured the data was easy to search, portable and in a future-proof format. Another advantage of the NoSQL approach was that it removed a considerable amount of business application logic that, over time, would be extremely difficult to retain and understand.
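As an illustration of the conversion described above, the following minimal Python sketch resolves the joins between three tables once, at export time, and emits a single flat set of label/value pairs. The table contents, field names and labels are all hypothetical; the report does not describe the actual schemas or export tooling.

```python
import json

# Hypothetical rows as they might sit in three relational tables.
documents = [{"doc_id": 101, "title": "Station fit-out drawing", "contract_id": 7}]
contracts = {7: {"contract_ref": "C-0007", "supplier": "Example Contractor Ltd"}}
revisions = [
    {"doc_id": 101, "rev": "A", "status": "Issued"},
    {"doc_id": 101, "rev": "B", "status": "Approved"},
]


def to_archive_object(doc):
    """Resolve the table relationships and return flat label/value pairs."""
    obj = {
        "Source System": "document management (hypothetical)",
        "Document ID": doc["doc_id"],
        "Title": doc["title"],
    }
    # Foreign key lookup: copy the contract details into the object itself.
    contract = contracts[doc["contract_id"]]
    obj["Contract Reference"] = contract["contract_ref"]
    obj["Supplier"] = contract["supplier"]
    # One-to-many relationship: number the revisions instead of nesting them.
    doc_revisions = [r for r in revisions if r["doc_id"] == doc["doc_id"]]
    for n, revision in enumerate(doc_revisions, start=1):
        obj[f"Revision {n}"] = revision["rev"]
        obj[f"Revision {n} Status"] = revision["status"]
    return obj


archive_objects = [to_archive_object(d) for d in documents]
print(json.dumps(archive_objects[0], indent=2))
```

The resulting JSON object can be read without knowledge of the original tables or the application logic that joined them, which is the property the NoSQL approach was chosen for.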
Another key consideration in the development of the Archive was the technology to be used. Cloud-based technology was selected as this provided the greatest number of benefits and aligned with TfL’s cloud-first strategy. Removing the physical infrastructure and using a Platform as a Service (PaaS) setup removed or minimised a number of further support cost items (e.g. physical hosting, infrastructure, support and maintenance costs). In addition, the storage cost model proved to be very cost-effective, based on the assumption that access to the archive system would, over time, reduce to the point where it would be rarely accessed and could be moved to cold storage.
Taking all of these aspects into consideration, the system architecture chosen was based on the following cloud PaaS components:
- object storage, to store all the files associated with each of the JSON Objects,
- key management service, to ensure all data is encrypted at rest and in transit,
- web app service, for the end users to access, search and manage the Crossrail Archive; and,
- search indexing, to search all JSON Objects and associated files (word, excel, powerpoint, pdf, …).
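To show how a search index over flat JSON objects can work in principle, here is a toy in-memory inverted index in Python. It is only a sketch: the Crossrail Archive used a PaaS search indexing service whose product and configuration are not named in the report, and the object ids, labels and file paths below are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(objects):
    """Map each token to the ids of the objects whose labels or values contain it."""
    index = defaultdict(set)
    for obj_id, obj in objects.items():
        for label, value in obj.items():
            for token in tokenize(label) + tokenize(value):
                index[token].add(obj_id)
    return index


def search(index, query):
    """Return the ids of objects that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical archive objects, each pointing at a file in object storage.
objects = {
    "doc-101": {"Title": "Station fit-out drawing", "Status": "Approved",
                "File": "files/doc-101.pdf"},
    "doc-102": {"Title": "Tunnel ventilation report", "Status": "Issued",
                "File": "files/doc-102.docx"},
}
index = build_index(objects)
print(search(index, "approved drawing"))
```

A production search service would also extract and index the text inside the associated Word, Excel, PowerPoint and PDF files, which this sketch does not attempt.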
What Do We Need to Keep?
To confirm whether the data in each system needed to be retained, a systematic review of the data in each system was undertaken (see Figure 1 below). This review established whether the data needed to be retained and, if so, why and for how long. A Data Protection Impact Assessment (DPIA) was also undertaken during this review process.
Figure 1 – Data review process
For each system identified for archiving the following import process was undertaken (see Figure 2 below).
Figure 2 – Data import process
Other key considerations in designing the archive system were:
Chain of Custody: Documenting how the data was collected in the original systems and how it was transferred to the archive was done in two parts. The first part was documented in the policies and procedures in place at the time the data was collected, and the second part was documented in the process followed when moving data from each system into the archive (as outlined in Figure 2 above). Both sets of documentation were then stored within the archive for reference. Finally, the backups of each of the 24 systems were also kept, so the systems could be restored if needed.
Data Security: with over 50TB of sensitive data in the system and with minimal usage over time, monitoring access to the archive needed to form part of a wider ongoing and supported process. To achieve this, standard TfL policies, procedures and tools used to encrypt the data, authenticate users and monitor access and usage were applied to the Crossrail Archive. Additionally an audit trail of all searches taking place was created.
General Data Protection Regulation (GDPR) [1]: With stronger personal data protection regulation coming into force, documenting where personal data was held was a key consideration of the overall archiving project. After completing the DPIA, the decision was taken to avoid bringing unnecessary personal data into the archive in the first place and, where personal data was brought across, to be very clear as to why the information needed to be retained.
Unstructured data: With little, and often inconsistent, structure around the information contained on Network Drives and SharePoint sites, determining what information needed to be retained, and whether the data contained personal information, was particularly challenging. Making an informed assessment of the data contained on the Network Drives and SharePoint sites required finding individuals with both the time and the knowledge of the history of their usage. To minimise this issue, clear guidance on where to store each type of data should be provided up front and built into each of the systems, and data that is no longer needed should be regularly deleted.
Links to related documents: To minimise the duplication of information across systems and to maintain a single source of the truth, a number of systems cross-referenced the documents, events and assets held in the document/contract management system using hyperlinks. The document/contract management system itself also made extensive use of cross-references between documents, events and assets, building up a mesh of interrelated data.
Although creating links between systems, documents, events and assets is central to BIM, it creates two issues. The first is that when the document/contract management system is replaced or archived, the links need to be updated if they are to continue working. The second is that the document/contract management system cannot be disposed of at the end of its own retention period if other systems still reference its data, so the timing of its disposal can be determined by those other systems.
To address the first issue, where these links were clearly identifiable, the hyperlinks in the system records were updated as part of the migration process to maintain the dynamic relationship between the systems and the document/contract management system records. To address the second issue, the decision was taken to ensure the document/contract management system had the longest retention period required.
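Both mitigations can be sketched in a few lines of Python. The URL patterns, system names and retention periods below are hypothetical, chosen only to make the example run; the report does not give the real link formats or the retention schedule for each system.

```python
import re

# Hypothetical link formats for the old system and the archive.
OLD_LINK = re.compile(r"https://dms\.example\.org/documents/(?P<doc_id>\d+)")
ARCHIVE_LINK = "https://archive.example.org/objects/dms-{doc_id}"


def rewrite_links(text, migrated_ids):
    """Point recognised links at the archive; leave all other links untouched."""
    def replace(match):
        doc_id = match.group("doc_id")
        if doc_id in migrated_ids:
            return ARCHIVE_LINK.format(doc_id=doc_id)
        return match.group(0)  # not migrated, keep the original link
    return OLD_LINK.sub(replace, text)


def longest_retention(retention_years, referenced_system, references):
    """Retention needed so the referenced system outlives everything linking to it."""
    linked = [s for s, targets in references.items() if referenced_system in targets]
    return max(retention_years[s] for s in linked + [referenced_system])


record = ("See https://dms.example.org/documents/101 and "
          "https://dms.example.org/documents/999.")
print(rewrite_links(record, {"101"}))

# Illustrative retention periods in years, and which systems link to which.
retention_years = {"dms": 12, "risk": 7, "assets": 41}
references = {"risk": {"dms"}, "assets": {"dms"}}
print(longest_retention(retention_years, "dms", references))
```

Only links matching the known pattern are rewritten, mirroring the caveat in the text that links were updated where they were clearly identifiable.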
Security over time: the change in the nature of the system, from an operational system to an archive system, means user access levels need to be considered to ensure archive users are able to find relevant information. To this end, four broad levels of access security were added:
- Access to search a specific Crossrail system in the archive,
- Access to search personal data,
- Access to search Confidential, Restricted or Unrestricted data; and,
- Access to view the files associated with a search result (where a user does not have access to view the files, they can add these to a request basket that is then reviewed by the archive administrator).
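The four access levels listed above could be modelled along the following lines. This is a minimal sketch under assumed names: the `ArchiveUser` class, the object fields and the basket structure are hypothetical and are not taken from the Crossrail Archive itself.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveUser:
    name: str
    systems: set = field(default_factory=set)          # level 1: searchable systems
    personal_data: bool = False                        # level 2: may search personal data
    classifications: set = field(default_factory=set)  # level 3: allowed sensitivity labels
    view_files: bool = False                           # level 4: may open associated files


def can_search(user, obj):
    """Apply the first three access levels to one archive object."""
    if obj["system"] not in user.systems:
        return False
    if obj["personal_data"] and not user.personal_data:
        return False
    return obj["classification"] in user.classifications


def request_files(user, obj, basket):
    """Return the files if permitted, otherwise queue a request for the administrator."""
    if not can_search(user, obj):
        raise PermissionError(f"{user.name} may not search {obj['id']}")
    if user.view_files:
        return obj["files"]
    basket.append({"user": user.name, "object": obj["id"]})
    return []


record = {"id": "doc-101", "system": "dms", "personal_data": False,
          "classification": "Restricted", "files": ["doc-101.pdf"]}
analyst = ArchiveUser("analyst", systems={"dms"},
                      classifications={"Unrestricted", "Restricted"})
basket = []
print(can_search(analyst, record))
print(request_files(analyst, record, basket))
print(basket)
```

Here the analyst can find the record but not open its file, so the request lands in the basket for the archive administrator to review.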
Although password protected files provide an extra level of security that can be managed at user level, as people leave these files become inaccessible.
Geographical and other non-text information: most of the systems archived as JSON objects only require a basic viewer. However, a few systems contain geospatial and other non-text information for which the basic viewer is not sufficient. For the geospatial information, a map viewer was added to the Crossrail Archive. There were two systems holding other non-text information: Oracle Primavera P6, used by planners to manage the overall Master Operational and Handover Schedule, and the Underground Construction Information Management System (UCIMS), used to capture ground movement across the Crossrail route during construction using both sensors and satellite data. The former was exported as XER files. For the latter, the decision was made to forgo the graphical representation of the settlement information; if this was needed, the data would be bulk exported and addressed as part of a separate project.
Recommendations for Future Projects
- Consider how data will be archived when commissioning new systems.
- When storing data, document whether it contains personal data.
- When data is stored, ensure the sensitivity of the data (Unrestricted, Restricted or Classified) is documented.
- Be very clear about why you are keeping information and for how long.
- Be clear about where information is kept in line with the above.
- Avoid mixing different usage requirements in the same system.
- Minimise the use of unstructured data storage (network drives).
- Avoid the use of password protected files and, if needed, make sure a process is in place for managing these passwords.
- Regularly delete data that is no longer required.
[1] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance)
Isao is the Senior Project & Portfolio Manager, responsible for managing the IT PMO and Project Management team. He draws on over 20 years’ experience in the construction industry, focusing on optimising the end-to-end IT project delivery process and ensuring appropriate governance structures and processes are in place and followed. This experience has allowed him to support the IT team and the business to review and prioritise IT projects in line with the strategic business goals and efficiently deliver approved IT projects on time and on budget with relevant and consistent reporting metrics.