SCIENS - Structure and Content Indexing with Extensible, Nestable Structures

Project duration
January 2006 - January 2008
Project type
PhD Thesis Project (Katharina Grün)


Project team
Katharina Grün (DKE)

Publications
K. Grün:
A Generic Framework for Querying and Updating Secondary XML Index Structures
In: Proceedings of the SIGMOD 2007 Ph.D. Workshop on Innovative Database Research (IDAR 2007), Beijing, China, June 10, 2007, pp. 27-32, 2007.
K. Grün, M. Schrefl:
Exploiting the Structure of Update Fragments for Efficient XML Index Maintenance
In: Guozhu Dong, Xuemin Lin, Wei Wang, Yun Yang, Jeffrey Xu Yu (Eds.): Advances in Data and Web Management, Proceedings of the Joint 9th Asia-Pacific Web Conference (APWeb 2007) and the 8th International Conference on Web-Age Information Management (WAIM 2007), HuangShan, China, June 16-18, 2007, Springer Verlag Deutschland, Lecture Notes in Computer Science (LNCS) series, Vol. 4505, ISBN 978-3-540-72483-4, pp. 471-478, 2007.
K. Grün, M. Schrefl:
Extensible Indexing in XML Databases
Institute report 08.01, August 2008.
P. Lasinger:
Indexing Encrypted XML Documents in the SemCrypt Database Management System
(Master Thesis, 2006)
Diploma thesis, supervisor: o. Univ.-Prof. Dr. Michael Schrefl, under the guidance of Mag. Katharina Grün and Mag. Michael Karlinger, carried out at the University of Linz, Institute of Business Informatics - Data & Knowledge Engineering, July 2006.
K. Grün:
Flexible and Selective Indexing in XML Databases
(PhD Thesis, 2008)

Motivation
XML, the eXtensible Markup Language, is increasingly popular not only as a data exchange format on the Web but also as a data format in database applications. The emerging trend towards XML applications creates the need for persistent storage of XML documents in databases. To query documents efficiently, XML databases require indices on the content and structure of frequently queried document fragments. Current XML databases still fail to offer flexible and selective indexing support for the specific requirements of the hierarchical, semi-structured XML data model.

Challenges
Flexibility in indexing refers to supporting arbitrary queries on the content and/or structure of XML documents. Providing flexibility poses the following challenges: How can indices represent and process the hierarchical document structure? Which index structures are necessary to support arbitrary queries on the document content and structure?
Selectivity refers to indexing frequently queried document fragments instead of entire documents. Providing selectivity raises the following research questions: How can a database management system process indices that refer to arbitrary document fragments? How can it keep arbitrary indices consistent with updates on documents?

Description
The indexing approach SCIENS (Structure and Content Indexing with Extensible, Nestable Structures) provides flexible and selective indexing for XML databases. To represent and process the document structure, SCIENS uses a labeling scheme that encodes structural relationships into labels. While existing XML indexing approaches propose proprietary index structures, SCIENS adapts existing index structures to the XML data model. By extending and nesting index structures, it provides indices for arbitrary query workloads. The index framework enables SCIENS to process arbitrary indices based on an index model.
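As an illustration of how a labeling scheme can encode structural relationships into labels, the following minimal sketch assigns prefix-based (Dewey-style) labels to the elements of a small document. This page does not state which labeling scheme SCIENS uses, so the Dewey-style scheme is an assumption chosen for illustration, and the function names (assign_labels, is_ancestor, is_parent, precedes) are hypothetical and not taken from SCIENS or SemCrypt.

```python
import xml.etree.ElementTree as ET


def assign_labels(root):
    """Return (label, tag) pairs in document order.

    Assumed scheme: the root gets label (1,); the i-th child of a node
    with label L gets label L + (i,).
    """
    labeled = []

    def visit(elem, label):
        labeled.append((label, elem.tag))
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labeled


def is_ancestor(a, b):
    """a is an ancestor of b if a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a, b):
    """a is the parent of b if a is b without its last component."""
    return len(b) == len(a) + 1 and b[:len(a)] == a


def precedes(a, b):
    """Document order corresponds to lexicographic order of the labels."""
    return a < b


doc = ET.fromstring(
    "<library><book><title/><author/></book><book><title/></book></library>"
)
labels = assign_labels(doc)
for label, tag in labels:
    print(".".join(map(str, label)), tag)
```

With such labels, structural predicates like ancestor, parent, and document order can be decided by comparing two labels alone, without navigating the document itself.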
To keep indices consistent with document updates, the maintenance algorithm of SCIENS exploits the document fragments being updated to extract relevant index updates. All concepts have been implemented and integrated into the SemCrypt database management system.

Relevance
Flexible indices enable the definition of those indices that best match the query workload. Selectivity reduces index size and accelerates index traversal. Compared to existing XML indexing approaches, SCIENS requires only a small number of existing index structures but can support a wider range of queries. The index framework guarantees that querying and updating documents remains unaffected by the specific indices used. By exploiting the structure of update fragments, the maintenance algorithm can process updates more efficiently than existing approaches.

Resources
Slides (pptx, pdf)