JedAI: An open-source library for state-of-the-art large scale data integration
The Java gEneric DAta Integration system (JedAI) is an open-source, highly scalable toolkit that offers out-of-the-box solutions for any data integration task, e.g., Record Linkage, Entity Resolution and Link Discovery. At its core lies a set of domain-independent, state-of-the-art techniques that apply to data of any structuredness: from the structured data of relational databases and CSV files to the semi-structured data of SPARQL endpoints and RDF dumps and to unstructured (i.e., free text) Web data. Any combination of these data types is also possible.
Before ExtremeEarth, JedAI’s latest release, version 2.0, implemented a single end-to-end workflow that was based on batch, schema-agnostic blocking techniques [1].
Figure 1: The batch, blocking-based end-to-end workflow of JedAI.
This workflow was implemented by JedAI-core, the back-end of the system, while JedAI-gui provided a front-end that is suitable for both expert and lay users. JedAI-gui is a desktop application with a wizard-like interface that lets users select among the available methods per workflow step so as to form end-to-end pipelines. The relation of the two modules is depicted in the following figure.
Figure 2: The system architecture of JedAI version 2.
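To make the idea of schema-agnostic blocking more concrete, the following is a minimal Python sketch of Token Blocking, one well-known technique of this family. It is an illustration only and not JedAI's code: the function names and the toy records are hypothetical. Every token found in any attribute value defines a block, and only entities that share a block become candidate matches, so attribute names never need to be aligned.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(entities):
    """Group entity ids by every token that appears in any attribute value.

    `entities` maps an id to a dict of attribute -> value. Attribute names are
    ignored on purpose: that is what makes the blocking schema-agnostic.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Return the distinct pairs of entities that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records with heterogeneous schemas.
entities = {
    "e1": {"name": "Jane Smith", "city": "Athens"},
    "e2": {"fullName": "Smith, Jane", "location": "Athens, Greece"},
    "e3": {"title": "Paris travel guide"},
}

blocks = token_blocking(entities)
pairs = candidate_pairs(blocks)
print(pairs)
```

In this toy input, only `e1` and `e2` share tokens, so they form the single candidate pair, while `e3` is never compared with anything. A matching step would then decide whether each candidate pair is a true duplicate.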
In the context of ExtremeEarth, we upgraded JedAI to versions 2.1 [2] and 3.0 [3], improving it substantially in all respects.
More specifically, its back-end has been enriched with the following three end-to-end workflows:
- A batch, schema-based workflow that is based on string similarity joins, as shown in Figure 3. Compared to the schema-agnostic, blocking-based workflow, it trades some effectiveness for much higher efficiency, provided that there are reliable attributes (i.e., predicates) for matching entities according to the similarity of their values (e.g., the title of publications in bibliographic datasets).
Figure 3: The batch, schema/join-based end-to-end workflow of JedAI.
- A progressive schema-agnostic workflow that extends the corresponding batch workflow in Figure 1 with a Prioritization step, as shown in Figure 4. The goal of this step is to schedule the processing of entities, blocks or comparisons according to the likelihood that they involve duplicates. This enables JedAI to operate in a pay-as-you-go fashion that detects matches early on and terminates the entire processing in time, i.e., when the likelihood of detecting additional duplicates is negligible. As a result, we can minimize the run-time of applications that can operate on partial results or have access to limited computational resources.
Figure 4: The budget-aware (progressive), blocking-based end-to-end workflow of JedAI. The shaded workflow steps are optional, as some progressive methods can be applied directly to the input entities, while self-loops indicate steps that can be repeated, using a different method each time.
- A progressive schema-based workflow that involves the same steps as its budget-agnostic (batch) counterpart in Figure 3. The only difference is that Top-k Similarity Join is used for detecting the candidate matches in non-increasing order of estimated similarity, either globally, across the entire dataset, or locally, for individual entities.
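The difference between the batch and the progressive schema-based workflows can be sketched in a few lines of Python. This is a deliberately naive illustration, not JedAI's implementation: it compares all pairs, whereas real similarity joins rely on filtering techniques to avoid that, and the function names, the Jaccard measure on whitespace tokens and the toy records are all assumptions. The batch join returns every pair above a threshold, while the progressive join emits pairs globally in non-increasing order of similarity, so the caller can stop once its budget is spent.

```python
import heapq
from itertools import combinations, islice


def jaccard(a, b):
    """Jaccard similarity of the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def threshold_join(records, attribute, threshold):
    """Batch variant: return every pair whose similarity reaches the threshold."""
    return [
        (id_a, id_b)
        for (id_a, rec_a), (id_b, rec_b) in combinations(sorted(records.items()), 2)
        if jaccard(rec_a[attribute], rec_b[attribute]) >= threshold
    ]


def progressive_join(records, attribute):
    """Progressive variant: yield pairs in non-increasing order of similarity."""
    scored = [
        (-jaccard(rec_a[attribute], rec_b[attribute]), id_a, id_b)
        for (id_a, rec_a), (id_b, rec_b) in combinations(sorted(records.items()), 2)
    ]
    heapq.heapify(scored)  # min-heap on negated similarity = max-heap on similarity
    while scored:
        neg_similarity, id_a, id_b = heapq.heappop(scored)
        yield id_a, id_b, -neg_similarity


# Hypothetical bibliographic records; "title" plays the role of the reliable attribute.
records = {
    "p1": {"title": "entity resolution for big data"},
    "p2": {"title": "Entity Resolution for Big Data"},
    "p3": {"title": "entity resolution on the web of data"},
    "p4": {"title": "a survey of graph databases"},
}

batch_matches = threshold_join(records, "title", threshold=0.8)

budget = 2  # pay-as-you-go: only the two most promising comparisons are consumed
top_pairs = list(islice(progressive_join(records, "title"), budget))
print(batch_matches, top_pairs)
```

Because `progressive_join` is a generator, no work is spent on emitting pairs beyond the budget; a local variant would instead keep a separate top-k list per entity.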
Additionally, JedAI version 3.0 adapts all algorithms of JedAI-core to the Apache Spark framework and implements them in Scala. Through Apache Spark’s massive parallelization, JedAI can minimize the run-time of any end-to-end Entity Resolution (ER) workflow, a goal that is in line with ExtremeEarth’s target to tackle huge volumes of data. The code is available here.
Overall, these new features equip JedAI with support for three-dimensional Entity Resolution, as defined in Figure 5. The first dimension corresponds to schema-awareness, distinguishing between schema-based and schema-agnostic workflows. The second dimension corresponds to budget-awareness, distinguishing between budget-agnostic (i.e., batch) and budget-aware (i.e., progressive) workflows. The third dimension corresponds to the execution mode, which distinguishes between serial and massively parallel processing. JedAI supports all possible combinations of the three dimensions, which is a unique feature among open-source data integration tools, as they are typically restricted to schema-based, batch workflows (grey area in the figure).
Figure 5: The solution space of the end-to-end ER pipelines that can be constructed by JedAI version 3.
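Since each of the three dimensions has two values, the solution space contains 2 × 2 × 2 = 8 types of workflow. The short Python sketch below simply enumerates them; the dictionary keys and labels are taken from the description above, while the data structure itself is only an illustration and does not correspond to any JedAI configuration format.

```python
from itertools import product

# The three dimensions of Entity Resolution and their two values each.
DIMENSIONS = {
    "schema_awareness": ("schema-based", "schema-agnostic"),
    "budget_awareness": ("budget-agnostic (batch)", "budget-aware (progressive)"),
    "execution_mode": ("serial", "massively parallel"),
}

# Every combination of one value per dimension is one type of workflow.
workflow_types = [
    dict(zip(DIMENSIONS, combination))
    for combination in product(*DIMENSIONS.values())
]

for workflow in workflow_types:
    print(workflow)
```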
Another advantage of JedAI is its extensive coverage of the state-of-the-art techniques for Entity Resolution. Every workflow step in all end-to-end workflows offers a series of state-of-the-art approaches. Thus, unlike the other open-source tools, which typically offer a limited number of methods, JedAI enables users to build millions of different end-to-end pipelines.
Version 3 of JedAI also improves the user interface in three ways:
- By adding a command line interface that offers the same functionalities as the desktop application.
- By adding a Python wrapper based on pyjnius, which allows JedAI to be used seamlessly in a Jupyter Notebook.
- By adding a novel Web application with a wizard-like interface that provides unified access to local execution (on a single machine) and to remote execution (over a cluster through Apache Livy). This Web application can be easily deployed through a Docker image. A video demonstrating its use and capabilities is available here.
The new system architecture in Figure 6 shows how all these improvements have been integrated into JedAI version 3. A technical report that describes all components in detail and evaluates their performance over a series of established datasets is available here.
Figure 6: The system architecture for JedAI version 3.
At the moment, we are also working on extending JedAI with techniques that are suitable for geospatial interlinking. We are adding support for all topological relations of the DE-9IM model, such as contains, crosses, covers and disjoint. To this end, we have integrated the state-of-the-art blocking-based methods of Silk [4] and Limes [5] (namely Radon [6]) into JedAI. We have also developed new techniques that improve on the existing ones by facilitating their adaptation to the Apache Spark framework. We are performing a thorough experimental analysis to verify the high scalability of our novel techniques, demonstrating that they are capable of handling tens of millions of geometries.
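The general idea of blocking for geospatial interlinking can be sketched as follows. This is a generic grid-based example in Python under strong simplifying assumptions, and it is not the algorithm of Silk, Radon or JedAI: geometries are reduced to axis-aligned bounding boxes `(min_x, min_y, max_x, max_y)`, only three coarse relations are distinguished instead of the full DE-9IM model, and all names and values are hypothetical. Geometries are indexed by the grid cells they overlap, and only pairs that share a cell are verified.

```python
from collections import defaultdict
from itertools import product
from math import floor


def grid_cells(box, cell_size):
    """Return the (column, row) grid cells overlapped by a bounding box."""
    min_x, min_y, max_x, max_y = box
    columns = range(floor(min_x / cell_size), floor(max_x / cell_size) + 1)
    rows = range(floor(min_y / cell_size), floor(max_y / cell_size) + 1)
    return product(columns, rows)


def candidate_pairs(source, target, cell_size):
    """Blocking: pair source and target geometries that share a grid cell."""
    index = defaultdict(set)
    for source_id, box in source.items():
        for cell in grid_cells(box, cell_size):
            index[cell].add(source_id)
    pairs = set()
    for target_id, box in target.items():
        for cell in grid_cells(box, cell_size):
            pairs.update((source_id, target_id) for source_id in index.get(cell, ()))
    return pairs


def relation(a, b):
    """Classify two boxes as 'disjoint', 'contains' (a contains b) or 'intersects'."""
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]:
        return "contains"
    return "intersects"


# Hypothetical bounding boxes of source and target geometries.
source = {"s1": (0, 0, 4, 4), "s2": (10, 10, 12, 12)}
target = {"t1": (1, 1, 2, 2), "t2": (3, 3, 6, 6), "t3": (20, 20, 21, 21)}

pairs = candidate_pairs(source, target, cell_size=5)
# Verification: keep only the candidates that are actually related.
links = {
    pair: relation(source[pair[0]], target[pair[1]])
    for pair in pairs
    if relation(source[pair[0]], target[pair[1]]) != "disjoint"
}
print(links)
```

Pairs that never share a cell, such as `s2` and `t3`, are skipped without any geometric test, which is where the savings of blocking come from; pairs that share a cell but do not touch are filtered out during verification.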
References
[1] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data". Proc. VLDB Endow. 11(12): 1950-1953 (2018)
[2] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI". SIGMOD Rec. 48(4): 30-36 (2019)
[3] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, Nikiforos Pittaras, Giovanni Simonini, Dimitrios Skoutas, Paul Isaris, George Giannakopoulos, Themis Palpanas, Manolis Koubarakis. "JedAI3: beyond batch, blocking-based Entity Resolution". EDBT 2020: 603-606
[4] Panayiotis Smeros, Manolis Koubarakis. "Discovering Spatial and Temporal Links among RDF Data". LDOW@WWW 2016
[5] Axel-Cyrille Ngonga Ngomo, Sören Auer. "LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data". IJCAI 2011: 2312-2317
[6] Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, Axel-Cyrille Ngonga Ngomo. "Radon - Rapid Discovery of Topological Relations". AAAI 2017: 175-181