Big data technologies for meteorological and climate data [funded by DFG]
The volume of globally available climate data is projected to reach 350 petabytes (PB) by 2030, with data from climate models expected to be the largest and fastest-growing segment of the archive. The Meteorological Archival and Retrieval System (MARS) of the European Centre for Medium-Range Weather Forecasts (ECMWF) has grown beyond 200 PB and thus constitutes the world’s largest archive of meteorological data (ECMWF 2017). The Copernicus programme, Europe’s flagship Earth observation programme, aims to provide timely access to environmental information for improved environmental management, climate change mitigation and civil security. Copernicus brings a new wealth of openly available environmental information and, at the same time, a new range of data users. However, an open data policy and improved environmental information at higher spatial and temporal resolution are only prerequisites. Data systems also have to evolve from pure download services to on-demand access to, and processing of, large volumes of data. This project brings together technical developments and application needs. On the one hand, it aims to explore and evaluate the applicability of big data technologies to meteorological and climate data structures, and to examine how cloud services can be exploited for new on-demand data systems. On the other hand, it aims to showcase how new data systems can benefit new environmental applications and analyses, e.g. machine learning applications.
The project is divided into three work packages: (i) system requirements definition, (ii) system definition, setup and evaluation, and (iii) system application. The first step (i) aims to identify the requirements that different users of meteorological and climate data have for current and future cloud-based data systems. A qualitative study, comprising an online user survey and expert interviews, is conducted to identify these requirements. Based on the outcomes of the user requirements analysis, the second step (ii) aims to set up and evaluate prototypes of a data service based on different big data technologies, e.g. Apache Spark and Dask/xarray. The data services will be cloud-based, so the project further explores how geospatial data services can be set up on cloud infrastructures such as Google Cloud Platform (GCP) or Amazon Web Services (AWS). In the third step (iii), the established prototypes will be applied to a use case, for example a machine learning or parallel computation task, to showcase how users could benefit from the established data services.
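To make the idea of on-demand processing with Dask/xarray more concrete, the following is a minimal sketch rather than one of the project's prototypes. It assumes that the `numpy`, `pandas`, `xarray` and `dask` packages are installed. The synthetic temperature field, the variable name `t2m`, the helper functions and the bounding box are all hypothetical and chosen purely for illustration. The point it shows is that a user request (a regional subset followed by an aggregation) can be described lazily on chunked data and computed only when the result is needed.

```python
import numpy as np
import pandas as pd
import xarray as xr


def build_synthetic_temperature(n_days=365, n_lat=36, n_lon=72, seed=0):
    """Create a hypothetical daily temperature field (not real climate data)."""
    rng = np.random.default_rng(seed)
    time = pd.date_range("2020-01-01", periods=n_days, freq="D")
    lat = np.linspace(-87.5, 87.5, n_lat)   # ascending, 5-degree spacing
    lon = np.linspace(0.0, 355.0, n_lon)    # 5-degree spacing
    data = 280.0 + 10.0 * rng.standard_normal((n_days, n_lat, n_lon))
    return xr.DataArray(
        data,
        coords={"time": time, "lat": lat, "lon": lon},
        dims=("time", "lat", "lon"),
        name="t2m",
    )


def regional_monthly_mean(da, lat_min, lat_max, lon_min, lon_max):
    """Subset a bounding box and average per calendar month.

    Note: this is a plain (unweighted) spatial mean; a real analysis
    would normally weight grid cells by area.
    """
    region = da.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))
    spatial_mean = region.mean(dim=("lat", "lon"))
    return spatial_mean.groupby("time.month").mean()


# Split the array into Dask chunks along time; operations become lazy.
t2m = build_synthetic_temperature().chunk({"time": 30})

# Describe the request; nothing is computed yet.
lazy_result = regional_monthly_mean(t2m, 35, 70, 0, 40)

# Trigger the computation only when the result is actually needed.
monthly = lazy_result.compute()
print(monthly.sizes)
```

In a cloud-based setting the in-memory array would typically be replaced by a dataset opened from object storage, while the subsetting and aggregation code would stay largely the same. That is one reason this style of access is attractive compared with downloading complete files first.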