ExtremeEarth

1. Defining the user requirements for the food security use case

Anja Rösel - VISTA, April 2019

Food security, especially in a changing Earth environment, is one of the most challenging issues of this century. Population growth, increased food consumption, and the challenges of climate change will persist over the coming decades. To deal with these, both regional and global measures are necessary. Biomass production needs to increase, which means increasing yield in a sustainable way. It is important to minimize the risk of yield loss even under more extreme environmental conditions, while making sure not to deplete or damage the available resources.

Irrigation, one of the most important measures for food production, requires reliable water resources in the area being farmed, either from groundwater or surface water. Planning irrigation and choosing the right measures at the right time require detailed and reliable information on the status and prospects of water availability.

One of our first tasks within the ExtremeEarth project is to define the user requirements. To start, we have to ask ourselves the following questions:

Who are the users?
Users can come from numerous areas with different backgrounds. The most obvious users are, of course, the farmers themselves, or rather representatives of large agricultural companies. Irrigation companies and consulting agencies might also be very interested in the information we will gain through this project. Perhaps the most important users are the stakeholders and decision makers from communal, federal and national governments, or even the EU authorities. We did some research to find the relevant key persons in each category and invited them to our first User’s Workshop, which we organised in March in Munich.
What are their requirements or their needs in terms of food security?
After intense discussions with a group of potential demo users at our User’s Workshop, we also created a questionnaire to identify additional user needs. The evaluation of this questionnaire will give us more information about technical requirements like resolution, format and accessibility of the final product, as well as thematic requirements like areas of interest and main crops.
Why do we focus on irrigation?
To sustain global food security, two practices are of high importance: irrigation and fertilisation. Fertilisation is a bio-chemical process that can be controlled and optimized through agricultural management. It relies mainly on industrial goods, and the resources for it can be transported if the necessary infrastructure is available.

Water availability is, in contrast to fertilisation, highly variable and often uncertain. Limited water availability can be an issue for many farmers, industries and governments. A large portion of the world’s fresh water is linked to snowfall, snow storage and the seasonal release of the water. All these components are subject to increased variability due to climate change, which might result in an increase in extreme events.

With the use of Earth Observation data, modelling and in-situ measurements of the snow cover, the necessary information on water availability can be obtained, especially now that sophisticated deep learning techniques can handle the large data volume of the Copernicus archive. With this approach, VISTA will be able to combine, for example, seasonal information about water storage in the Alps with the highly dynamic water demand in agricultural areas such as the Danube area, and give large-scale recommendations on sustainable water usage to private farmers as well as national governments.

2. How ExtremeEarth Brings Large-scale AI to the Earth Observation Community with Hopsworks, the Data-intensive AI Platform

Theofilos Kakantousis, Tianze Wang and Sina Sheikholeslami - LogicalClocks/KTH, April 2020

In recent years, unprecedented volumes of data have been generated in various domains. Copernicus, a European Union flagship programme, produces more than three petabytes (PB) of Earth Observation (EO) data annually from Sentinel satellites. This data is made readily available to researchers, who use it, among other things, to develop Artificial Intelligence (AI) algorithms, in particular using Deep Learning (DL) techniques that are suitable for Big Data. One of the greatest challenges that researchers face, however, is the lack of tools that can help them unlock the potential of this data deluge and develop predictive and classification AI models.

ExtremeEarth is an EU-funded project that aims to develop use cases that demonstrate how researchers can apply Deep Learning to make use of Copernicus data in the various EU Thematic Exploitation Platforms (TEPs). A main differentiator of ExtremeEarth from other such projects is the use of Hopsworks, a Data-Intensive AI software platform for scalable Deep Learning. Hopsworks is being extended as part of ExtremeEarth to bring specialized AI tools to EO data and the EO data community in general.

Hopsworks, Earth Observation Data and AI in one Platform

Hopsworks is a Data-Intensive AI platform which brings a collaborative data science environment to researchers who need a horizontally scalable solution to developing AI models using Deep Learning. Collaborative means that users of the platform get access to different workspaces, called projects, where they can share data and programs with their colleagues, hence improving collaboration and increasing productivity. The Python programming language has become the lingua franca amongst data scientists and Hopsworks is a Python-first platform, as it provides all the tools needed to get started programming with Python and Big Data. Hopsworks integrates with Apache Spark and PySpark, a popular distributed processing framework.

Hopsworks brings to the Copernicus program and the EO data community essential features required for developing AI applications at scale, such as distributed Deep Learning with Graphics Processing Units (GPUs) on multiple servers, as demanded by the Copernicus volumes of data. Hopsworks provides services that facilitate Deep Learning experiments, all the way from feature engineering with the Feature Store to developing Deep Learning models with the Experiments and Models services, which allow users to manage and monitor Deep Learning artifacts such as experiments and models, with automated code versioning and more. The Hopsworks storage and metadata layer is built on top of HopsFS, the award-winning, highly scalable distributed file system, which enables Hopsworks to meet the extreme storage and computational demands of the ExtremeEarth project.

Hopsworks brings horizontally scalable Deep Learning for EO data close to where the data lives, as it can be deployed on the Data and Information Access Services (DIAS). The DIAS provide centralised access to Copernicus data and information; combined with the AI capabilities for EO data that Hopsworks brings, this makes an unparalleled data science environment available to researchers and data scientists of the EO data community.

Challenges of Deep Learning with EO Data

Recent years have witnessed the performance leaps of Deep Learning (DL) models thanks to the availability of big datasets (e.g. ImageNet) and the improvement of computation capabilities (e.g., GPUs and cloud environments). Hence, with the massive amount of data coming from earth observation satellites such as the Sentinel constellation, DL models can be used for a variety of EO-related tasks. Examples of these tasks are sea-ice classification, monitoring of water flows, and calculating vegetation indices.

However, together with the performance gains come many challenges for applying DL to EO tasks, including, but not limited to:

Labeled datasets for training: While collecting raw Synthetic-Aperture Radar (SAR) images from the satellites is one thing, labeling those images to make them suitable for supervised DL is a far more time-consuming task. Should we seek help from unsupervised or semi-supervised learning approaches to eliminate the need for labeled datasets? Or should we start building tools to make annotating the datasets easier?
Interpretable and human-understandable models for EO tasks: Given enough labeled data, we can probably build a model with satisfactory performance. But how can we justify why the model makes certain predictions given certain inputs? While we can extract the intermediate predictions for given outputs, can we reach interpretations that can be better understood by humans?
Management of very large datasets: Managing terabytes (TB) of data that can still fit into a single machine is one thing, but managing petabytes (PB) of data, which requires distributed storage and must serve the DL algorithms fast enough not to slow down training and serving, is a totally different challenge. To complicate matters further, what about partial failures in the distributed file system? How shall we handle them?
Heterogeneous data sources and modalities (e.g., SAR images from satellites, sensor readings from ground weather stations): How can we build models that effectively use multi-modalities? For example, how can we utilize the geo-location information in an image classification model?
DL architectures and learning algorithms for spectral, spatial, and temporal data: While we might be able to perform preprocessing and design model architectures for RGB image classification, how do these apply to SAR images? Can we use the same model architectures? How to extract useful information from multi-spectral images?
Training and fine-tuning (hyperparameter optimizations) of DL models: Hyperparameters are those parameters of the training process (e.g., the learning rate of the optimizer, or the size of the convolution windows) that should be manually set before training. How can we effectively train models and tune the hyperparameters? Should we change the code manually? Or can we use frameworks to provide some kind of automation?
The real time requirements for serving DL models: Once the training is done, we want to use our trained model to predict outcomes based on the newly observed data. Often these predictions have to be made in real-time or near-real-time to make quick decisions. For example, we want to update the ice charts of the shipping routes every hour. How to serve our DL models online to meet these real-time requirements?
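
The hyperparameter tuning question in the list above can be illustrated with a minimal random-search sketch. This is not the API of Hopsworks or any tuning framework; the function names, the search space and the stand-in objective `toy_validation_score` are assumptions made for illustration. In practice the objective would train a model and return a validation metric.

```python
import random

def toy_validation_score(learning_rate, kernel_size):
    """Hypothetical stand-in for 'train a model, return validation accuracy'.

    Peaks at learning_rate=0.01 and kernel_size=3; a real objective would
    run an actual training job instead.
    """
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(kernel_size - 3) * 0.05

def random_search(objective, search_space, num_trials, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        # Draw one value per hyperparameter from its list of candidates.
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Assumed search space: optimizer learning rate and convolution window size.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "kernel_size": [3, 5, 7],
}
best_config, best_score = random_search(toy_validation_score, search_space, num_trials=50)
print(best_config, round(best_score, 3))
```

Because the search loop only needs a callable objective, the same structure applies whether trials run sequentially, as here, or are distributed across machines by a framework.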

Deep Learning Pipelines for EO Data with Hopsworks

A Data Science application in the domain of Big Data typically consists of a set of stages that form a Deep Learning pipeline. This pipeline is responsible for managing the lifecycle of data that comes into the platform and is to be used for developing machine learning models. In the EO data domain in particular, these pipelines need to scale to the petabyte-scale data that is available within the Copernicus program. Hopsworks provides data scientists with all the required tools to build and orchestrate each stage of the pipeline, depicted in the following diagram.

In detail, a typical Deep Learning pipeline would consist of:

Data Ingestion: The first step is to collect and insert data into the AI platform where the pipeline is to be run. A great variety of data sources can be used such as Internet of Things (IoT) devices, web-service APIs etc. In ExtremeEarth, the data typically resides on the DIAS which can be directly accessed from Hopsworks.
Data Validation: Tools such as Apache Spark that can cope with Big Data are typically employed to validate incoming data that is to be used in later stages. For example, data might need to be parsed and cleaned of duplicate or missing values, or an alphanumeric field might need a simple transformation into a numeric one.
Feature Engineering: Before making use of the validated data to develop DL models, the features that will be used to develop such models need to be defined, computed and persisted. Hopsworks Feature Store is the service that data engineers and data scientists use for such tasks, as it provides rich APIs, scalability and elasticity to cope with varying data volumes and complex data types and relations. For example users can create groups of features or compute new features such as aggregations of existing ones.
Model development (Training): Data scientists can greatly benefit from a rich experiment API provided by Hopsworks to run their machine learning code, whether it be TensorFlow, Keras, PyTorch or another framework with a Python API. In addition, Hopsworks manages GPU allocation across the entire cluster and facilitates distributed training, which makes use of multiple machines with multiple GPUs per machine in order to train bigger models faster.
Model Serving & Monitoring: Typically the output of the previous stage is a DL model. To make use of it, users can submit inference requests to the Hopsworks built-in elastic model serving infrastructure for TensorFlow and scikit-learn, two popular machine learning frameworks. Models can also be exported in the previous stage and downloaded from Hopsworks directly to be embedded into external applications, such as iceberg detection and water availability detection in food crops. Hopsworks also provides infrastructure for model monitoring, that is, continuously monitoring the requests submitted to the model and its responses. Users can then apply their own business logic to decide which actions to take depending on how the monitoring metrics change over time.
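
The Data Validation stage in the list above can be sketched on a single machine with plain Python; at Copernicus scale the same logic would be expressed in a framework such as Apache Spark. The record layout (`station_id`, `temperature`) and the helper `validate_records` are hypothetical and only serve the illustration.

```python
def validate_records(records):
    """Drop duplicates and incomplete rows, and cast a text field to a number."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip rows with missing values.
        if record.get("station_id") is None or record.get("temperature") in (None, ""):
            continue
        # Transform the alphanumeric field into a numeric one; skip unparsable values.
        try:
            temperature = float(record["temperature"])
        except ValueError:
            continue
        key = (record["station_id"], temperature)
        if key in seen:  # Skip exact duplicates.
            continue
        seen.add(key)
        cleaned.append({"station_id": record["station_id"], "temperature": temperature})
    return cleaned

# Invented sample input covering each validation rule.
raw = [
    {"station_id": "A1", "temperature": "3.5"},
    {"station_id": "A1", "temperature": "3.5"},   # duplicate
    {"station_id": "B2", "temperature": None},    # missing value
    {"station_id": "C3", "temperature": "n/a"},   # not a number
    {"station_id": "D4", "temperature": "-1"},
]
print(validate_records(raw))
```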

Example Use Case: Iceberg Classification with Hopsworks

Drifting icebergs pose major threats to the safety of navigation in areas where they might appear, e.g., the Northern Sea Route and the North-West Passage. Currently, domain experts manually produce what is known as an “ice chart” on a daily basis and send it to ships and vessels. This is a time-consuming and repetitive task, and automating it using DL models for iceberg classification would result in more accurate and more frequent ice charts, which in turn would lead to safer navigation on the routes concerned.

Iceberg classification is concerned with telling whether a given SAR image patch contains an iceberg or not. The details of the classification depend on the dataset that is used. For example, given the Statoil/C-CORE Iceberg Classifier Challenge dataset, the main task is to train a DL model that can predict whether an image contains a ship or an iceberg (binary classification).

The steps we took to develop and serve the model were the following:

The first step is preprocessing. We read the data, which is stored in JSON format, and create a new feature which is the average of the satellite image bands.
The second step is inserting the data into the Feature Store, which provides APIs for managing feature groups and creating training and test datasets. In this case, we created the training and test datasets in TFRecord format after scaling the images, as we are using TensorFlow for training.
The third step is building and training our DL model on Hopsworks. Since the dataset is not very complicated and we have a binary classification task, a DL model that is very similar to LeNet-5 yields 87% accuracy on the validation set after 20 epochs of training, which takes 60 seconds on an Nvidia GTX1080. This step also includes hyperparameter tuning. Ablation studies, in which we remove different components (e.g., different convolutional layers, or dataset features), can also be employed to gain more insights about the model. Hopsworks provides efficient and easy support for hyperparameter tuning and ablation studies through a Python-based framework called Maggy. Finally, to further increase the training speed, the distributed training strategy provided in Hopsworks can be used.

The final step is to export and serve the model. The model is exported and saved into the Hopsworks “Models” dataset. Then we use the Hopsworks elastic model serving infrastructure to host TensorFlow Serving, which can scale with the number of inference requests.
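
The preprocessing step described above can be sketched as follows. The field names `band_1`, `band_2` and `is_iceberg` and the helper `add_band_average` are assumptions of this sketch about the JSON layout, and the tiny in-memory records stand in for the real file, whose patches hold many more pixels.

```python
import json

def add_band_average(record):
    """Return a copy of the record with a per-pixel average of the two bands."""
    band_1, band_2 = record["band_1"], record["band_2"]
    if len(band_1) != len(band_2):
        raise ValueError("bands must have the same number of pixels")
    enriched = dict(record)
    enriched["band_avg"] = [(a + b) / 2.0 for a, b in zip(band_1, band_2)]
    return enriched

# Tiny in-memory stand-in for the JSON file (invented values).
raw_json = json.dumps([
    {"id": "p1", "band_1": [-20.0, -22.0], "band_2": [-28.0, -30.0], "is_iceberg": 1},
    {"id": "p2", "band_1": [-15.0, -17.0], "band_2": [-25.0, -27.0], "is_iceberg": 0},
])

records = [add_band_average(r) for r in json.loads(raw_json)]
print(records[0]["band_avg"])
```

The enriched records could then be written to a feature group, with the averaged band treated as one more feature next to the original bands.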

Conclusion

In this blog post we described how the ExtremeEarth project brings new tools and capabilities with Hopsworks to the EO data community and the Copernicus program. We also showed how we have developed a practical use case by using Copernicus data and Hopsworks. We keep developing Hopsworks to make it even more akin to the tools and processes used by researchers across the entire EO community and we continue development of our use cases with more sophisticated models using even more advanced distributed Deep Learning training techniques.

3. Regular Monitoring of Agricultural Areas using Sentinel 2 Image Time Series and Deep Learning Techniques

Claudia Paris and Lorenzo Bruzzone - UNITN, April 2020

With the advent of the European Copernicus programme, full, open and free remote sensing data are constantly acquired at a global scale. These data are extremely useful for regularly monitoring the Earth’s surface. In particular, the multispectral high-resolution optical images acquired by Sentinel 2 are suited for the monitoring of agricultural areas, which has to be repeated frequently since cultivations change their spectral and textural appearance according to the growth cycle of their crop type. Sentinel 2 is characterized by specific bands in the Red-Edge spectral range dedicated to the study of vegetation, which allow the characterization of the phenological parameters of different crop types. In this context, ExtremeEarth aims to define a system architecture able to fully exploit the long time series of Sentinel 2 images to perform accurate and regular monitoring of agricultural areas at large scale.

This particular classification problem requires the definition of a multitemporal approach that accurately characterizes the different crop types, which have their own phenological characteristics and development times. Instead of using a pre-trained network, an ad-hoc deep network architecture tailored to the specific spatial, temporal and spectral nature of dense time series of Sentinel 2 images will be defined. Given the complexity of the problem, an automatic system architecture based on recurrent deep neural networks will be considered, focusing on the Long Short-Term Memory (LSTM). These deep learning models are able to capture the temporal correlation of sequential data by storing evidence from earlier time steps and making decisions in the current temporal context.
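
For reference, one common formulation of the LSTM cell is given below. The notation is an assumption of this sketch and not taken from the project: x_t is the input at time step t (for instance, the spectral values of one Sentinel 2 acquisition), h_t the hidden state, c_t the cell state, W, U and b the learned weights and biases, \sigma the logistic sigmoid and \odot the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t and input gate i_t decide how much of the previous cell state is kept and how much new evidence is written, which is how the cell carries information across the acquisitions of a time series.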

The first experimental analysis will be carried out on the whole Danube catchment, which is characterized by a considerable extent (801,463 km²) and heterogeneous environmental conditions. At such a large scale, many challenges have to be addressed. First, the deep learning architecture has to handle the fact that the spectral signatures of the land-cover classes are characterized by high spatial variability. Due to physical factors (e.g., acquisition conditions, soil properties), the same crop types may present different spectral properties in different regions of the Danube catchment. Then, it is important to take into account the presence of highly unbalanced classes, as well as to properly model the high inter-class and low intra-class variance, to guarantee a consistent classification product. Moreover, for the reliable training of deep networks, millions of annotated samples are required. While in computer vision many large databases of labelled images are available, in remote sensing it is not feasible to collect in-situ data at such a scale. Hence, a large training database of "weak" labelled samples will be defined in an unsupervised and automatic way.
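
One common way to account for highly unbalanced classes during training is to weight each class inversely to its frequency. The sketch below shows that generic technique; it is not stated to be the project's method, and the helper name and the sample label counts are invented for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}

# Invented label distribution: one frequent, one medium and one rare crop type.
labels = ["Maize"] * 6 + ["Winter Wheat"] * 3 + ["Soy"] * 1
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights a misclassified sample of the rare class contributes more to a weighted loss than one of the frequent class, so the rare class is not simply ignored.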

Example of crop type map that will be produced for the whole Danube Catchment. The 17 crop types that will be identified are reported.

The system will generate crop type and crop boundary maps at 10 m spatial resolution, the highest resolution achieved by Sentinel 2. The crop type maps will be characterized by 17 different main cultivations, namely, Grassland, Forage, Oat, Potato, Beet, Spring Wheat, Winter Wheat, Permanent Plantations, Maize, Legumes, Rapeseed, Rye, Spring Barley, Winter Barley, Soy, and Flowering Legumes. The preliminary test will be performed using the time series of Sentinel 2 images acquired in 2018. To validate the results obtained, the 2018 European Land Use and Coverage Area Frame Survey (LUCAS) database will be used. The LUCAS survey, coordinated by the Statistical Office of the European Commission (Eurostat), aims to collect harmonized land cover/land use, agro-environmental and soil data by field observation of geographically referenced points.
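
Validation against reference points typically boils down to comparing the predicted class at each point with the surveyed class. The following is a minimal sketch of that comparison, assuming the reference and predicted labels have already been extracted at the same points; the helper names and the five sample points are invented and do not come from LUCAS.

```python
from collections import Counter

def overall_accuracy(reference, predicted):
    """Fraction of reference points whose predicted class matches."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)

def confusion_counts(reference, predicted):
    """Count (reference class, predicted class) pairs."""
    return Counter(zip(reference, predicted))

# Invented labels at five reference points.
reference = ["Maize", "Maize", "Rye", "Soy", "Rye"]
predicted = ["Maize", "Rye", "Rye", "Soy", "Rye"]

print(overall_accuracy(reference, predicted))
print(confusion_counts(reference, predicted))
```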

4. New sea ice and iceberg datasets created within the polar use case

Åshild Kiærbech - METNO, April 2020

Within ExtremeEarth, the polar use case will develop automatic methods for classifying sea ice which can be used in the production of ice charts. This will streamline existing ice charting workflows and allow ice analysts to focus on products which are currently out of the reach of machine learning techniques.

In this blog post we focus on an important aspect of the development of automatic methods: the training data used to train the algorithms. Supervised machine learning requires labelled data to train the algorithms to be well-performing decision-makers.

New sea ice datasets

These new sea ice datasets are publicly available with the DOI. The plan is to use them as common datasets for the project, so that methods can be compared on the same data. The three datasets represent each of the three use cases within the polar use case: ice edge, ice types, and icebergs:

The ice edge dataset is a binary dataset with the two classes ice and sea.
The ice type dataset contains segmented images, where each segment is labelled with one of several specific ice types, or sea.
The iceberg dataset contains outlines of all icebergs observed within one Sentinel-2 image.

The March 2018 image from the binary ice edge dataset. The outlined box is the satellite image extent, wherein white represents the ice class, blue the sea class, and the grey areas are land.

The March 2018 image from the ice type dataset. The outlined box is the satellite image extent, wherein the blue lines are the outlines of the ice type segments. The RGB satellite image is shown as background.

The iceberg dataset. The red outlines are marked icebergs. The blue lines are the segments drawn for the March image for the ice type dataset.

All images used in all datasets cover the same geographical area in the seas east of Greenland and were acquired in 2018. The first two datasets each consist of twelve images, one for each month of 2018, and the third dataset is based on one Sentinel-2 image from March 2018.

Why do we need labelled data?

Labelled data are made by assigning labels to data entries. The sea ice classification and detection algorithms we are working with use satellite images as input data. That could for instance be radar images from the synthetic aperture radar Sentinel-1, or an optical image from the Sentinel-2 optical satellite. A satellite image over the Arctic areas may contain different types of sea ice, open water, and also pieces of land. Thus the image may contain multiple different surface type labels.

A closer look at the iceberg dataset with the Sentinel-2 image in the background. The red outlines are marked icebergs. The blue lines are the segment outlines drawn for the March image in the ice type dataset.

A majority of the automatic deep learning algorithms need a ground truth to compare the input data with during training. The machine needs to see the connection between different kinds of image features and the connected label name. When it has seen many such connections, it may be able to make a good decision on what it sees in a new image.
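
The idea of learning the connection between features and labels can be shown with a deliberately simple classifier. The sketch below uses a nearest-centroid rule rather than deep learning, and the two-value feature vectors and the labels are invented for illustration; they are not taken from the datasets described here.

```python
from collections import defaultdict
import math

def fit_centroids(features, labels):
    """Average the feature vectors of each label (the 'training' step)."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, vector):
    """Assign the label whose centroid is closest to the new feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))

# Invented labelled examples: bright, textured patches are ice; dark ones are sea.
features = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = ["ice", "ice", "sea", "sea"]

centroids = fit_centroids(features, labels)
print(predict(centroids, [0.7, 0.75]))
print(predict(centroids, [0.05, 0.3]))
```

Without the labels there would be nothing to average per class, which is the role the ground truth plays in supervised training, whatever the model.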

5. Distributed geospatial analytics over linked data in ExtremeEarth

Dimitris Bilidas - UoA, June 2020

The information produced by the deep learning pipelines of the ExtremeEarth project will contain useful knowledge that the end users should be able to access. This information will be transformed into linked RDF data over a vocabulary familiar to the users, in the form of ontological classes and properties, and will be interlinked with other useful datasets. A system that will store and query all these massive linked spatial RDF datasets on the HOPS platform is under development as part of ExtremeEarth. This system will provide the user with the capability to explore the datasets, obtain useful information and perform rich spatial analytics using the GeoSPARQL query language. The system is built using the popular distributed in-memory processing framework Spark and its extension GeoSpark for spatial partitioning, indexing and processing, and it works by translating an input GeoSPARQL query into a series of Spark jobs.
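
To give an impression of the kind of query such a system receives, the sketch below builds a GeoSPARQL query as a Python string. The `geo:` and `geof:` prefixes and the `geof:sfWithin` function come from the GeoSPARQL standard; the `ex:` namespace, the class name `MaizeField`, the helper `fields_within_query` and the polygon are hypothetical. The helper does no input sanitization and is meant only as an illustration.

```python
GEOSPARQL_PREFIXES = """\
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex: <http://example.org/ontology#>
"""

def fields_within_query(crop_class, area_wkt):
    """Build a GeoSPARQL query selecting features of a class inside a polygon."""
    return GEOSPARQL_PREFIXES + f"""
SELECT ?field ?wkt WHERE {{
  ?field a ex:{crop_class} ;
         geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt, "{area_wkt}"^^geo:wktLiteral))
}}
"""

# Hypothetical class and area of interest.
query = fields_within_query("MaizeField", "POLYGON((16 48, 17 48, 17 49, 16 49, 16 48))")
print(query)
```

A query of this shape combines a thematic part (the class of the feature) with a spatial filter, which are the two aspects a distributed engine has to plan for.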

Apart from the spatial partitioning of the geometries contained in the dataset, the system also employs advanced partitioning techniques for the thematic RDF data, aiming to provide efficient execution by minimizing both the data transfer across the network and the amount of data that needs to be accessed for each query. The system has already been tested on the HOPS platform with a spatial RDF dataset of more than 1 billion triples and 100 million complex geometries, which already exceeds the data size that centralized state-of-the-art geospatial RDF stores can handle on a single server with up to 128 GB of memory.

Several improvements are currently under development, including specialized compression techniques like dictionary or prefix-based encoding for the thematic RDF data, and an efficient cost-based query optimizer that will take into consideration both thematic and spatial query selectivity in order to decide about the execution order of the query operators and the usage of the spatial access methods of the underlying data (partitioning and indexing schemes).
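
Dictionary encoding, one of the compression techniques mentioned above, replaces each repeated RDF term with a compact integer. The sketch below shows the general idea on a single machine; it is not the system's implementation, and the function names and sample triples are invented.

```python
def dictionary_encode(triples):
    """Replace every RDF term with a compact integer ID."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            # Assign the next free ID the first time a term is seen.
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id)
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return encoded, term_to_id

def dictionary_decode(encoded, term_to_id):
    """Map integer IDs back to the original terms."""
    id_to_term = {i: term for term, i in term_to_id.items()}
    return [tuple(id_to_term[i] for i in triple) for triple in encoded]

# Invented triples with repeated subjects, predicates and objects.
triples = [
    ("ex:field1", "rdf:type", "ex:MaizeField"),
    ("ex:field2", "rdf:type", "ex:MaizeField"),
    ("ex:field1", "ex:locatedIn", "ex:Bavaria"),
]
encoded, dictionary = dictionary_encode(triples)
print(encoded)
```

Since long IRIs are stored once in the dictionary and triples become small integer tuples, both storage and the data shuffled between workers shrink.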

6. JedAI: An open-source library for state-of-the-art large scale data integration

George Papadakis - UoA, June 2020

The Java gEneric DAta Integration system (JedAI) is an open-source, highly scalable toolkit that offers out-of-the-box solutions for any data integration task, e.g., Record Linkage, Entity Resolution and Link Discovery. At its core lies a set of domain-independent, state-of-the-art techniques that apply to data of any structuredness: from the structured data of relational databases and CSV files to the semi-structured data of SPARQL endpoints and RDF dumps and to unstructured (i.e., free text) Web data. Any combination of these data types is also possible.

Before ExtremeEarth, JedAI’s latest release, version 2.0, implemented a single end-to-end workflow that was based on batch, schema-agnostic blocking techniques [1].

The batch, blocking-based end-to-end workflow of JedAI.
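
To illustrate what schema-agnostic blocking means, the sketch below implements token blocking, a simple technique of this family: every token that appears in any attribute value defines a block, regardless of the attribute name. This is an illustrative sketch and not JedAI's code; the function names and the sample entities are invented.

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(entities):
    """Group entity IDs by every token found in any attribute value."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():  # attribute names are ignored
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # Blocks with a single entity yield no comparisons, so drop them.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks):
    """Collect the distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Invented entities with different schemas describing overlapping things.
entities = {
    "e1": {"name": "John Smith", "city": "Athens"},
    "e2": {"fullName": "Smith John", "location": "Athens Greece"},
    "e3": {"name": "Maria Rossi", "city": "Trento"},
}
blocks = token_blocking(entities)
print(sorted(candidate_pairs(blocks)))
```

Only pairs that co-occur in a block are compared later, so e1 and e2 become candidates even though their attribute names differ, while e3 is never compared with them.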

This workflow was implemented by JedAI-core, the back-end of the system, while JedAI-gui provided a front-end that is suitable for both expert and lay users. JedAI-gui actually is a desktop application with a wizard-like interface that allows users to select among the available methods per workflow step so as to form end-to-end pipelines. The relation of the two modules is depicted in the following figure.

The system architecture of JedAI version 2.

In the context of ExtremeEarth, we upgraded JedAI to versions 2.1 [2] and 3.0 [3], improving it substantially in all respects.

More specifically, its back-end has been enriched with the following three end-to-end workflows:

A batch, schema-based workflow that is based on string similarity joins, as shown in the figure. Compared to the schema-agnostic, blocking-based workflow, it offers lower effectiveness for much higher efficiency, provided that there are reliable attributes (i.e., predicates) for matching entities according to the similarity of their values (e.g., the title of publications in bibliographic datasets).

The batch, schema/join-based end-to-end workflow of JedAI.
A progressive schema-agnostic workflow that extends the corresponding batch workflow in Figure 1 with a Prioritization step, as shown in Figure 4. The goal of this step is to schedule the processing of entities, blocks or comparisons according to the likelihood that they involve duplicates. This enables JedAI to operate in a pay-as-you-go fashion that detects matches early on and terminates the entire processing on time, i.e., when the likelihood of detecting additional duplicates is negligible. As a result, we can minimize the run-time of applications that can operate on partial results or have access to limited computational resources.

The budget-aware (progressive), blocking-based end-to-end workflow of JedAI. The shaded workflow steps are optional, as some progressive methods can be applied directly to the input entities, while self-loops indicate steps that can be repeated, using a different method each time.
A progressive schema-based workflow that involves the same steps as its budget-agnostic (batch) counterpart in Figure 3. The only difference is that Top-k Similarity Join is used for detecting the candidate matches in non-increasing order of estimated similarity – either globally, across the entire dataset, or locally, for individual entities.
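
The pay-as-you-go idea behind the progressive workflows above can be sketched by ranking candidate pairs by similarity and processing only as many as a budget allows. The sketch uses Jaccard similarity over title tokens and scores all pairs exhaustively, which a real top-k similarity join would avoid; the function names, the `budget` parameter and the sample titles are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def prioritized_pairs(titles, budget):
    """Return the `budget` most similar pairs in non-increasing order of similarity."""
    scored = [
        (jaccard(titles[x], titles[y]), x, y)
        for x, y in combinations(sorted(titles), 2)
    ]
    # Sort by similarity descending, then by IDs for a deterministic order.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored[:budget]

# Invented publication titles keyed by entity ID.
titles = {
    "p1": "entity resolution for structured data",
    "p2": "entity resolution for semi-structured data",
    "p3": "deep learning for sea ice charts",
}
for similarity, x, y in prioritized_pairs(titles, budget=2):
    print(x, y, round(similarity, 2))
```

The most likely duplicates are emitted first, so an application that stops after its budget is spent has still seen the most promising comparisons.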

Additionally, JedAI version 3.0 adapts all algorithms of JedAI-core to the Apache Spark framework and implements them in Scala. Through Apache Spark’s massive parallelization, JedAI is able to minimize the run-time of any end-to-end ER workflow, a goal that is in line with ExtremeEarth’s target to tackle huge volumes of data. The code is available here.

Overall, these new features equip JedAI with support for three-dimensional Entity Resolution, as it is defined in Figure 5. The first dimension corresponds to schema-awareness, distinguishing between schema-based and schema-agnostic workflows. The second dimension corresponds to budget-awareness, distinguishing between budget-agnostic (i.e., batch) and budget-aware (i.e., progressive) workflows. The third dimension corresponds to the execution mode, which distinguishes between serial and massively parallel processing. JedAI supports all possible combinations of the three dimensions, which is a unique feature across all open-source data integration tools, as they are typically restricted to schema-based, batch workflows (grey area in the figure).

The solution space of the end-to-end ER pipelines that can be constructed by JedAI version 3.

Another advantage of JedAI is its extended coverage of the state-of-the-art techniques for Entity Resolution. Every workflow step in all end-to-end workflows conveys a series of state-of-the-art approaches. Thus, unlike the other open-source tools, which typically offer a limited number of methods, JedAI enables users to build millions of different end-to-end pipelines.

JedAI version 3 also improves the user interface in three ways:

A command line interface that offers the same functionalities as the desktop application.
A Python wrapper based on pyjnius, through which JedAI can be seamlessly used in a Jupyter Notebook.
A novel Web application with a wizard-like interface that provides unified access to local execution (on a single machine) and to remote execution (over a cluster through Apache Livy). This Web application can be easily deployed through a Docker image. A video demonstrating its use and capabilities is available here.

The new system architecture in Figure 6 shows how all these improvements have been integrated into JedAI version 3. A technical report that describes all components in detail and evaluates their performance over a series of established datasets is available here.

Figure 6: The system architecture for JedAI version 3.

At the moment, we are also working on extending JedAI with techniques that are suitable for geospatial interlinking. We are adding support for all topological relations of the DE-9IM model, such as contains, crosses, covers and disjoint. To this end, we have integrated the state-of-the-art blocking-based methods of Silk [4] and Limes [5] (namely Radon [6]) into JedAI. We have also developed new techniques that improve on the existing ones by facilitating their adaptation to the Apache Spark framework. We are performing a thorough experimental analysis to verify the high scalability of our novel techniques and to demonstrate that they are capable of handling tens of millions of geometries.
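To make the DE-9IM relations more concrete, the following minimal Python sketch checks an intersection matrix against the standard DE-9IM patterns for a few relations. It is an illustration only and not JedAI's implementation: the function names are hypothetical, the matrices are written by hand rather than computed from geometries, and crosses is left out because its pattern depends on the dimensions of the two geometries. The touches relation is added as a further standard example.

```python
# A DE-9IM matrix is a 9-character string, read row by row:
# (Interior, Boundary, Exterior) of A against (Interior, Boundary, Exterior) of B.
# Each character is 'F' (empty intersection) or the dimension '0', '1' or '2'.

# Standard patterns: '*' matches anything, 'T' any non-empty intersection.
PATTERNS = {
    "contains": ["T*****FF*"],
    "covers": ["T*****FF*", "*T****FF*", "***T**FF*", "****T*FF*"],
    "disjoint": ["FF*FF****"],
    "touches": ["FT*******", "F**T*****", "F***T****"],
}


def matches(matrix, pattern):
    """Return True if a DE-9IM matrix satisfies a single pattern."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must have 9 characters")
    for m, p in zip(matrix, pattern):
        if p == "*":
            continue
        if p == "T":
            if m == "F":
                return False
        elif m != p:
            return False
    return True


def relations(matrix):
    """Return the names of all relations in PATTERNS that hold for the matrix."""
    return sorted(
        name
        for name, patterns in PATTERNS.items()
        if any(matches(matrix, p) for p in patterns)
    )


# Hand-written matrices for three pairs of polygons (A, B):
inside = "212FF1FF2"       # B lies strictly inside A
apart = "FF2FF1212"        # A and B share no point
shared_edge = "FF2F11212"  # A and B share only a boundary segment

print(relations(inside))
print(relations(apart))
print(relations(shared_edge))
```

In a blocking-based approach, a check of this kind would only be applied to the candidate pairs that survive the blocking step, which is what keeps the number of expensive geometry comparisons manageable.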

References

[1] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data". Proc. VLDB Endow. 11(12): 1950-1953 (2018)

[2] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI". SIGMOD Rec. 48(4): 30-36 (2019)

[3] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, Nikiforos Pittaras, Giovanni Simonini, Dimitrios Skoutas, Paul Isaris, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "JedAI3: beyond batch, blocking-based Entity Resolution". EDBT 2020: 603-606

[4] Panayiotis Smeros, Manolis Koubarakis. "Discovering Spatial and Temporal Links among RDF Data". LDOW@WWW 2016

[5] Axel-Cyrille Ngonga Ngomo, Sören Auer. "LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data". IJCAI 2011: 2312-2317

[6] Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, Axel-Cyrille Ngonga Ngomo. "Radon - Rapid Discovery of Topological Relations". AAAI 2017: 175-181

7. AI Software Architecture for Copernicus Data with Hopsworks

Theofilos Kakantousis (LC), Desta Haileselassie Hagos (KTH), July 2021

AI Software Architecture for Copernicus Data with Hopsworks

TLDR: Hopsworks, the data-intensive AI platform with a feature store, brings support for scale-out AI with Copernicus data to the H2020 ExtremeEarth project. Hopsworks is integrated with the Polar and Food Security Thematic Exploitation Platforms (TEPs) on the CREODIAS infrastructure. Two use cases, polar and food security, have been developed by making use of the scale-out distributed deep learning support of Hopsworks and the PBs of data made available by CREODIAS and processed by Hopsworks and the TEPs.

This article is based on the paper “The ExtremeEarth software Architecture for Copernicus Earth Observation Data” included in the Proceedings of the 2021 conference on Big Data from Space (BiDS 2021) [1].

Introduction

In recent years, unprecedented volumes of data have been generated in various domains. Copernicus, a European Union flagship programme, produces more than three petabytes (PB) of Earth Observation (EO) data annually from Sentinel satellites [2]. However, current AI architectures that use deep learning in remote sensing struggle to scale well enough to fully utilize this abundance of data.

ExtremeEarth is an EU-funded project that aims to develop use cases that demonstrate how researchers can apply deep learning in order to make use of Copernicus data in the various European Space Agency (ESA) TEPs. A main differentiator of ExtremeEarth from other such projects is the use of Hopsworks, a data-intensive AI software platform with a feature store and tooling for horizontally scalable learning. Hopsworks has successfully been extended as part of the ExtremeEarth project to bring specialized AI tools to the EO data community, with two use cases already developed on the platform and more to come in the near future.

Bringing together a number of cutting-edge technologies, which range from storing extremely large volumes of data all the way to running scalable machine learning and deep learning algorithms in a distributed manner, and having them operate over the same infrastructure poses some unprecedented challenges. These challenges include, in particular, the integration of ESA TEPs and Data and Information Access Service (DIAS) platforms with a data platform (Hopsworks) that enables scalable data processing, machine learning and deep learning on Copernicus data, and the development of very large training datasets for deep learning architectures targeting the classification of Sentinel images.

In this blog post, we describe both the software architecture of the ExtremeEarth project, with Hopsworks as its AI platform centerpiece, and the integration of Hopsworks with the other services and platforms of ExtremeEarth that together make up a complete experience for AI with EO data.

AI with Copernicus data - Software Architecture

The overall architecture comprises several components, the main ones being the following.

Hopsworks. An open-source data-intensive AI platform with a feature store. Hopsworks can scale to the petabytes of data required by the ExtremeEarth project and provides tooling to build horizontally scalable end-to-end machine learning and deep learning pipelines. Data engineers and data scientists utilize Hopsworks’ client SDKs that facilitate AI data management, machine learning experiments, and productionizing serving of machine learning models.

Thematic Exploitation Platforms (TEPs). These are collaborative, virtual work environments providing access to EO data and the tools, processors, and Information and Communication Technology (ICT) resources required to work with various themes, through one coherent interface. TEPs address the coastal, forestry, hydrology, geohazards, polar, urban, and food security themes. ExtremeEarth in particular is concerned with the polar and food security TEPs, from which the use cases also stem. These use cases include building machine learning models for sea ice classification to improve maritime traffic, as well as for food crops and irrigation classification.

Data and Information Access Service (DIAS). To facilitate and standardize access to data, the European Commission has funded the deployment of five cloud-based platforms. They provide centralized access to Copernicus data and information, as well as to processing tools. These platforms are known as the DIAS, or Data and Information Access Services [2]. The ExtremeEarth software architecture is built on CREODIAS, a cloud infrastructure platform adapted to the processing of large amounts of EO data, including an EO data storage cluster and a dedicated Infrastructure-as-a-Service (IaaS) cloud infrastructure for the platform’s users. The EO data repository contains Sentinel-1, 2, 3, and 5-P, Landsat-5, 7, 8, Envisat, and many Copernicus Services data.

Figure 1: ExtremeEarth software architecture.

Figure 1 provides a high-level overview of the integration of the different components with each other. The components can be classified into four main categories (layers):

Product layer. The product layer provides a collaborative virtual work environment, through TEPs, that operates in the cloud and enables access to the products, tools, and relevant EO and non-EO data.
Processing layer. This layer provides the Hopsworks data-intensive Artificial Intelligence (AI) platform. Hopsworks is installed within the CREODIAS OpenStack cloud infrastructure and operates alongside the TEPs. Also, Hopsworks has direct access to the data layer and the different data products provided by the TEPs.
Data layer. The data layer offers a cloud-based platform that facilitates and standardizes access to EO data through a Data and Information Access Service (DIAS). It also provides centralized access to Copernicus data and information, as well as to processing tools. TEPs are installed and run on a DIAS infrastructure, which in the case of ExtremeEarth is CREODIAS.
Physical layer. It contains the cloud environment’s compute, storage, networking resources, and hardware infrastructures.

To provide a coherent environment for AI with EO data to application users and data scientists, the goal of the architecture presented here is to make most components transparent and to simplify developer access by using well-defined APIs and commonly used interfaces such as RESTful APIs. As a result, a key part of the overall architecture is how these different components can be integrated to provide a coherent whole. The APIs used for the full integration of the ExtremeEarth components via the inter-layer interfaces of the software platform are described below and are also illustrated in Figure 1:

Raw EO data. DIASes, and CREODIAS in particular, provide access to Copernicus data, which includes the downstream of Copernicus data as it is generated by satellites. At an infrastructure level, this data is persisted in an object store with an S3 object interface, managed by CREODIAS.
Pre-processed data. TEPs provide the developers of the deep learning pipelines with pre-processed EO data which forms the basis for creating training and testing datasets. Developers of the deep learning pipeline can also define and execute their own pre-processing, if the pre-processed data is not already available.
Object storage. CREODIAS provides object storage used for storing data produced and consumed by the TEPs services and applications. In ExtremeEarth, this object store is used primarily for storing training data required by the Polar and Food Security use cases. This training data is provided as input to the deep learning pipelines.
EO Data Hub Catalog. This service is provided and managed by CREODIAS. It provides various protocols including OGC WFS and a REST API as interfaces to the EO data.
TEP-Hopsworks EO data access. Users can directly access raw EO data from their applications running on Hopsworks. Multiple methods, e.g., an object data access API (SWIFT/S3) and a filesystem interface, are provided for accessing Copernicus and other EO data available on CREODIAS.
TEP-Hopsworks infrastructure integration. Hopsworks and both the Polar and Food Security TEPs are installed and operated on CREODIAS. Its administrative tools enable TEPs to spawn and manage virtual machines and storage by using CloudFerro [3], which provides an OpenStack-based cloud platform to TEPs. Hopsworks is then installed on this platform and can access compute and storage resources provided by the TEPs.
TEP-Hopsworks API integration. Hopsworks is provided within the TEP environment as a separate application and is mainly used as a development platform for the deep learning pipelines and applications of the Polar and Food Security use cases. These applications are exposed in the form of processors to the TEP users. Practically, a processor is a TEP abstraction that uses machine learning models that have previously been developed by the data scientists of the ExtremeEarth use cases in Hopsworks.
Hopsworks-TEPs datasets. TEPs provide users with access to data from various sources to be served as input to processors. Such sources include the data provided by CREODIAS and external services that the TEP can connect to. The pre-processed data is stored in object storage provided by CREODIAS and is made available to Hopsworks users by exchanging authentication information. Hopsworks users can also upload their own data to be used for training or serving. Hopsworks provides a REST API for clients to work with model serving, and authentication is done with API keys that Hopsworks manages on a per-user basis. External clients can therefore use these keys to authenticate against the Hopsworks REST API. There are two ways in which the trained model can be served via the TEP: (i) the model can be exported from Hopsworks and embedded into the TEP processor; (ii) the model can be served online on Hopsworks, in which case a processor on the TEP submits inference requests to the model serving instance on Hopsworks and returns the results. In method (i), once the machine learning model is developed, it can be transferred from Hopsworks to the Polar TEP by using the Hopsworks REST API and Python SDK. TEP users can integrate the Hopsworks Python SDK into the processor workflow to further automate the machine learning pipeline lifecycle. In method (ii), the TEP submits inference requests to the model being served by the online model serving infrastructure, which runs on Kubernetes and is hosted on Hopsworks. Figure 2 illustrates this approach for the Polar use case.
Linked Data tools. Linked data applications are deployed as Hopsworks jobs using Apache Spark. A data scientist can work with big geospatial data using ontologies and submit GeoSPARQL queries using tools developed and extended within ExtremeEarth, namely GeoTriples [4], Strabo2, JedAI-Spatial, and Semagrow.

Figure 2: ExtremeEarth software architecture for the Polar Use Case.
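The second serving method described above, in which a TEP processor sends inference requests to a model served on Hopsworks and authenticates with an API key, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the project's processor code: the host name, project id, model name and input values are placeholders, and the endpoint path, the `ApiKey` authorization scheme and the `instances` payload layout are assumptions that should be checked against the documentation of the Hopsworks version in use. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_inference_request(host, project_id, model_name, api_key, instances):
    """Build (but do not send) an inference request for a served model.

    ASSUMPTION: the URL path and the 'ApiKey' authorization scheme below are
    illustrative; verify them against your Hopsworks installation.
    """
    url = (
        f"https://{host}/hopsworks-api/api/project/{project_id}"
        f"/inference/models/{model_name}:predict"
    )
    # ASSUMPTION: the serving instance accepts a JSON body with an 'instances' list.
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, purely hypothetical.
request = build_inference_request(
    host="hopsworks.example.org",
    project_id=1,
    model_name="sea_ice_classifier",
    api_key="REPLACE_WITH_API_KEY",
    instances=[[0.1, 0.2, 0.3]],
)

print(request.get_method(), request.full_url)
# A processor would then send the request and parse the JSON response, e.g.:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```

Keeping the request construction separate from the network call makes it straightforward to unit-test a processor without access to a running Hopsworks cluster.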

Application users that interact with the TEPs are effectively the users of the AI products generated by the machine learning and deep learning pipelines that the data scientists develop in Hopsworks. Having described the integration of the various components, we now turn to the flow of events within this architecture, which is depicted in Figure 3:

EO data scientists log in to Hopsworks.
Read and pre-process raw EO data in Hopsworks, in the TEP, or on a local machine.
Create training datasets based on the intermediate pre-processed data.
Develop deep learning pipelines.
Perform linked data transformation, interlinking, and storage.
Log in to the Polar or Food Security TEP applications.
Select Hopsworks as the TEP processor. The processor either starts the model serving in Hopsworks via the REST API, or downloads the model from Hopsworks via the REST API so that serving is done within the TEP application.
Submit federated queries with Semagrow and use the semantic catalogue built into Hopsworks.

Figure 3: ExtremeEarth software architecture flow of events