Date of Award
11-21-2022
Document Type
Dissertation
Publisher
Santa Clara : Santa Clara University, 2022
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science and Engineering
First Advisor
Ahmed Amer
Abstract
Digital technology makes it easy to generate and distribute large volumes of data. However, it has also complicated the process of verifying and validating sources of data and their derivatives, risking obfuscation of truth amidst the deluge of data. To address this issue, I trace and develop an approach based on data provenance tracking. Specifically, I make it possible to deeply trace the origins and lineages of data by applying state-of-the-art data provenance technologies, which I extend beyond traditional data provenance applications. In this dissertation, I demonstrate that with the right data infrastructure it is feasible to grant greater agency and integrity in representing and preserving data provenance, and that it is possible to do so in broader domains than those for which it is currently captured and represented.
I propose a general, flexible, and extensible data provenance framework, titled MetaScriptura, that treats provenance data as primary data in and of itself, and considers the importance of being able to represent a wide range of data structures atop suitably robust and secure storage infrastructure for capturing and representing richer semantic metadata as completely and accurately as possible. Specifically, I illustrate the efficiency and feasibility of this provenance metadata model with relevant ontologies in three different domains: multi-varied classical translated texts and commentaries; trusted citations of scholarly publications; and complementing and enhancing the process of maintaining interrelated legal records. MultiVerse is the name of the first implementation of MetaScriptura, applied to classical literary works and their translations. MetaScribe is another use case that extends the general framework to capture the intent and sentiment of citations in scholarly publications, where links between sources are already well established, but which can be greatly enhanced with richer metadata. MetaLex is a proposal for legal record verification systems that require a high degree of trust in the completeness of the provenance and data tracking.
These applications, with enabled annotation features for recording rich provenance data, demonstrate the feasibility of both representation of domain knowledge and retrieval of subtle yet important underlying causal relationships between linked data items (via trustworthy provenance repositories). While these systems may accrue more provenance metadata to enhance completeness of data’s contextual relevance, system performance indicators remain scalable in the face of increasing volumes of data.
MetaScriptura aims to go beyond capturing simple data provenance information and instead to capture both the content and nature of data, and the varied kinds of commentary or manipulations that can be performed upon it. When used with the translation of a text, it can allow future revisions to the meaning deemed most likely to be accurate, since no alternative translation need be abandoned. When used with commentaries on a scholarly work, it can elevate the value of the commentary beyond simple critique of a body of work and into an integral part of the work as referenced by future scholars. When used in its most technical and computing-specific form, it can allow the construction of a richer and more meaningful recording of realistic user behavior in relation to data storage devices. As long as an artifact can be expressed as digital data, the manipulation or commentary upon it can be realized, and thus, the goal of MetaScriptura is to make sure that none of it is lost, and yet all of it remains (from a human perspective) manageable.
Because data provenance extends beyond technical computing fields, touching upon legal and ethical questions, I give due consideration to the broader technological, philosophical, and digital hermeneutical-epistemological questions prompted by the proposed framework. I especially discuss issues of trust and authenticity within the context of an AI-enhanced infosphere (e.g., I discuss the implications of a cyber-librarian that could be enabled by the proposed framework).
I believe that MetaScriptura, a general data provenance framework, opens vistas of opportunities for knowledge representation and interpretations in a wide spectrum of domains, and allows deeper knowledge discovery, by arming its adopters with the ability to maintain and analyze richer contextual semantic provenance metadata.
Recommended Citation
Israel, Maria Joseph, "MetaScriptura: A General Data Provenance Framework" (2022). Engineering Ph.D. Theses. 44.
https://scholarcommons.scu.edu/eng_phd_theses/44