Konzeption und Umsetzung von RDF-Summarization-Cubes in SPARK-SQL für das Profiling von Schema.org-Daten

Author: R. Buschberger
Master Thesis: MT1804 (November, 2018)
Supervised by: o. Univ.-Prof. Dr. Michael Schrefl
Instructed by: Dr. Bernd Neumayr
Accomplished at: University Linz, Institute of Business Informatics - Data & Knowledge Engineering

Abstract (German)

Das von den führenden Suchmaschinenanbietern ins Leben gerufene schema.org-Vokabular erlaubt die semantische Beschreibung einer Web-Seite in maschinenlesbarer Form. Um im Rahmen von Data-Profiling die tatsächliche Verwendung des schema.org-Vokabulars untersuchen zu können, wird in dieser Arbeit ein entsprechender Entwurf erarbeitet und umgesetzt.

Für die Analyse der Verwendung des schema.org-Vokabulars werden die von Web-Data-Commons veröffentlichten, aus den Korpussen von Common-Crawl extrahierten strukturierten Web-Daten herangezogen. Aus den Rohdaten werden Primärfakten erstellt, welche die schema.org-Klassenhierarchie als semantische Dimensionen enthalten. Ausgehend von den Primärfakten werden die Umsetzungsvarianten „Cube“ und „Star“ (zusammengefasst als RDF-Summarization-Cubes) erzeugt, welche die entsprechenden schema.org-Analysen ermöglichen. Die Umsetzung wird in PySpark-SQL implementiert und in einem Hadoop-Cluster einer Proof-of-Concept-Umgebung entwickelt und getestet. Die ersten Analysen in Bezug auf die strukturierten Web-Daten-Formate Microdata, JSON-LD und RDFa auf Basis eines relativ kleinen Ausschnittes von Web-Data-Commons haben ergeben, dass in JSON-LD am öftesten schema.org-Klassen verwendet werden. Um die Skalierbarkeit der Umsetzung zu überprüfen, wurde der Proof-of-Concept-Prototyp mithilfe der Plattform Databricks in der Microsoft-Azure-Cloud auf eine größere Datenmenge ausgeführt.

Abstract (English)

The schema.org vocabulary was developed by the leading search engine operators to facilitate the semantic annotation of web pages in a machine-readable and consistent way. This thesis designs and implements a data profiling approach for analyzing how the schema.org vocabulary is actually used. The analysis is based on the structured web data published by Web Data Commons, which extracts it from the Common Crawl corpora. From the raw data, primary facts are created that contain the schema.org class hierarchy as semantic dimensions. Starting from the primary facts, the implementation variants “Cube” and “Star” (collectively referred to as RDF-Summarization-Cubes) are generated to enable the corresponding schema.org analyses. The implementation is written in PySpark SQL and was developed and tested on a Hadoop cluster in a proof-of-concept environment. A first analysis of the structured web data formats Microdata, JSON-LD and RDFa, based on a relatively small excerpt of Web Data Commons, shows that schema.org classes are used most frequently in JSON-LD. To test the scalability of the implementation, the proof-of-concept prototype was also deployed in the Microsoft Azure cloud using the Databricks platform and executed on a larger excerpt of the Web Data Commons corpus.
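
The idea of summarizing primary facts along the schema.org class hierarchy can be illustrated with a minimal sketch. It is written in plain Python rather than PySpark SQL so that it runs without a cluster, and it is not the thesis implementation. The class excerpt is simplified to a single parent per class, and the facts, the names `PARENT`, `FACTS`, `ALL` and the function `summarize` are all invented for illustration; the counts are not results from the thesis.

```python
from collections import Counter

# Simplified, single-parent excerpt of the schema.org class hierarchy
# (assumption: the real hierarchy allows multiple superclasses).
PARENT = {
    "Thing": None,
    "Organization": "Thing",
    "LocalBusiness": "Organization",
    "Product": "Thing",
}

# Invented primary facts: one (format, most specific class) pair per entity.
FACTS = [
    ("JSON-LD", "LocalBusiness"),
    ("JSON-LD", "Product"),
    ("Microdata", "Product"),
    ("RDFa", "Organization"),
]

ALL = "ALL"  # marker for the roll-up over all formats


def ancestors_and_self(cls):
    """Yield cls followed by each of its superclasses up to the root."""
    while cls is not None:
        yield cls
        cls = PARENT[cls]


def summarize(facts):
    """Count entities per (format, class), rolled up along the class
    hierarchy and additionally aggregated over all formats."""
    counts = Counter()
    for fmt, cls in facts:
        for level in ancestors_and_self(cls):
            counts[(fmt, level)] += 1
            counts[(ALL, level)] += 1
    return counts


cube = summarize(FACTS)
for key in sorted(cube):
    print(key, cube[key])
```

Each entity is counted once for its own class and once for every superclass, so a query at the level of `Organization` also covers `LocalBusiness` entities. In a Spark setting the same roll-up would more naturally be expressed as a join against a hierarchy table followed by a grouped aggregation.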