Infer-Proven-ence: Capturing Provenance Information with Minimal Intrusion

Project overview

Commercial and government decisions are driven by data. Provenance is the record of how data and processes were created, modified and used. It supports data quality assessment, traceability and the identification of possible system intrusions, among other uses. Unfortunately, all of these uses require that provenance information be captured by each system within a system of systems. This capture problem is costly and does not scale. To date, only applications that have a high value to scientists have been provenance capture-enabled [9, 17]. Instead, we seek to build observation points external to any pre-built system that will create partial, or inferred, provenance that can be reused across any system that uses the same architectural components. To facilitate the adoption of provenance within enterprise systems, each built from a heterogeneous software stack unique to its organization, the Infer-Proven-ence project is researching the underlying feasibility of this approach and creating a toolbox of techniques that will reduce the number of applications that must be provenance-enabled. Unlike a provenance-enabled application, which can report observed provenance, inferred provenance carries only a probability of reflecting what actually happened. Different provenance inference techniques need to be available depending on the overall architecture: an inference technique that works within a database and its limited set of transformations will not work over streaming data. This work is establishing the theoretical underpinnings for two provenance-capture inference mechanisms that work within common architectures, and it will create implementations of each technique that can be evaluated within real-world scenarios. The Infer-Proven-ence approach will be evaluated across two distinct architectures: one for stream processing, and one for data analytics.
While architectures exist that combine all of these components, we intentionally split them into the smallest representative unit with respect to data flow and application type. With this in mind, Infer-Proven-ence will be evaluated across two distinct architectures: one for stream processing of sensor data, and one for data analytics. The evaluation will consider the ability to correctly infer provenance, the accuracy of the inferred provenance, the cost of implementation within the given architecture, the scalability of the approach, and the utility of the inferred provenance for a use case specific to each problem domain. For the first technique, we will work with partners at Roke Manor Research and their autonomous vehicle program, in which data from disparate sensors is streamed through a set of microprocessors and driving decisions are made. Provenance within this use case will be used to highlight anomalies and likely sources of decision errors. For the second technique, we will work within a data analytic architecture in which source data is transformed and manipulated during the process of analysis. Provenance within this use case will be used to reproduce the analytic results. In addition to the real-world evaluation, we will work closely with the UK's Software Sustainability Institute (SSI), which promotes sustainable software technologies, in order to build software that can be transitioned and reused by others. SSI will help ensure that Infer-Proven-ence is generalizable and relevant to any discipline, based only on the architecture required by that discipline. Finally, Infer-Proven-ence will produce a roadmap for further research, taking stock of the work done and identifying future opportunities.
Infer-Proven-ence also builds partnerships across several institutions including Southampton's Cyber Security Research Centre, the University of Massachusetts Amherst, the Software Sustainability Institute and Roke Manor Research in order to investigate provenance inference in real-world situations.

Staff

Lead researchers

Professor Age Chapman

Professor of Computer Science

Connect with Age

Email: Adriane.Chapman@soton.ac.uk

Research outputs

Deep learning provenance data integration: a practical approach

Débora Pina, Adriane Chapman, Daniel De Oliveira & Marta Mattoso, 2023

DOI: 10.1145/3543873.3587561

Type: conference

Toward a common standard for data and specimen provenance in life sciences

Rudolf Wittner, Petr Holub, Cecilia Mascia, Francesca Frexia, Heimo Müller, Markus Plass, Clare Allocca, Fay Betsou, Tony Burdett, Ibon Cancio, Adriane Chapman, Martin Chapman, Mélanie Courtot, Vasa Curcin, Johan Eder, Mark Elliot, Katrina Exter, Carole Goble, Martin Golebiewski, Bron Kisler, Andreas Kremer, Simone Leo, Sheng Lin-Gibson, Anna Marsano, Marco Mattavelli, Josh Moore, Hiroki Nakae, Isabelle Perseil, Ayat Salman, James Sluka, Stian Soiland-Reyes, Caterina Strambio-De-Castillia, Michael Sussman, Jason R. Swedlow, Kurt Zatloukal & Jörg Geiger, 2023, Learning Health Systems

DOI: 10.1002/lrh2.10365

Type: letter/editorial

DPDS: assisting data science with data provenance

Adriane Chapman, Luca Lauro, Paolo Missier & Riccardo Torlone, 2022, Proceedings of the VLDB Endowment, 15(12), 3614–3617

DOI: 10.14778/3554821.3554857

Type: article

Data provenance, curation and quality in metrology

James Cheney, Age Chapman, Joy Davidson & Alistair Forbes, 2022

DOI: 10.1142/9789811242380_0009

Type: book chapter

Enabling personal consent in databases

Georgios Konstantinidis, Jet Holt & Age Chapman, 2021, Proceedings of the VLDB Endowment, 15(2), 375–387