The demands on today’s geoscientist are many, not least in the pursuit of high-quality geological insight. Mankind’s interactions with our planet are often at scales that carry risk: infrastructure development, hydrocarbon and mineral extraction, and societal development in changing environments. Geological insights are consumed by end users far removed from the geoscientific tribe, and these users bring differing perspectives and perceptions to geoscientific output. The most fundamental question that can be asked of data and insights produced by a geologist is, ‘how much can they be trusted?’
Trustworthy data allow geological models to be constrained and risks to be calculated. Trust in data is delivered by three components:
- The numerical error in a measurement, or the looseness of a subjective description. Is it 10 m? 10.00 m? 10.0 ± 0.1 m? The first is unhelpful; the second suggests a potential for high numerical precision, with four significant figures; but only the third gives any feeling for how certain the user should be. Similarly, describing samples from outcrop and core demands a high level of linguistic precision if the description is to be of use to anyone else at a time and place far removed from the initial inspection.
- The error in locating a measurement in space and time. With the advent of GPS and atomic clocks, locating something in space and time should be easy. It is trivial for a field team to keep a GPS logger switched on all day and to drop a waymarker at each sampling location. Similarly, all imagery collected should include information on the field of view, whether of a small outcrop or a complete vista; this allows simple integration of the data into a virtual world or geospatial mash-up. Unfortunately, the best attempts are often thwarted by the diversity of coordinate reference systems, projections, and reference ellipsoids used to locate data. Any dataset or report providing spatial data should state the datum, reference ellipsoid, and coordinate system, to give absolute confidence in location.
- Information and data about the measurement — often called metadata. The value of this aspect of data quality is often unclear until you need it, and then you usually wish more had been collected. It could be information about instrumentation, such as calibration settings; about the sampling environment — the weather, the sea state; about people — who was on the field team; or about processing — what version of which application was used to refine the data or perform an interpretation.
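The three components above can be combined in a single record. The sketch below is a hypothetical illustration only — the class and field names are assumptions, not an established schema — but it shows how a measurement, its uncertainty, an explicit spatial reference, and open-ended metadata can travel together:

```python
from dataclasses import dataclass, field

@dataclass
class Location:
    latitude: float   # decimal degrees
    longitude: float  # decimal degrees
    crs: str          # coordinate reference system, e.g. an EPSG code
    datum: str        # geodetic datum the coordinates are referenced to

@dataclass
class Measurement:
    value: float        # the measured quantity
    uncertainty: float  # numerical error, in the same units as value
    units: str
    location: Location
    timestamp: str      # ISO 8601, in UTC
    metadata: dict = field(default_factory=dict)  # instrument, environment, people...

    def __str__(self):
        # Always report the value together with its uncertainty
        return f"{self.value} ± {self.uncertainty} {self.units}"

# A bed-thickness measurement of 10.0 ± 0.1 m (all values are illustrative)
m = Measurement(
    value=10.0,
    uncertainty=0.1,
    units="m",
    location=Location(60.4720, 8.4689, crs="EPSG:4326", datum="WGS 84"),
    timestamp="2013-08-14T10:32:00Z",
    metadata={"instrument": "tape measure",
              "observer": "field team A",
              "weather": "overcast"},
)
print(m)  # 10.0 ± 0.1 m
```

Making the datum and CRS mandatory fields, rather than optional notes, is one way to guarantee they are never omitted from a dataset.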
These components are elements of a framework known as a data provenance architecture. There are organisations and groups (NASA, academic and government consortia, integrated oil companies) that perform geoscientific enquiry at massive scales and rates, collecting data from many domains: geological, geophysical, oceanographic, engineering, even societal. Their workflows are complex and lengthy, and are typically revisited as new data become available for assimilation and re-evaluation. Data provenance architectures are emerging as a formal framework for preserving data quality and trustworthiness through such long and complex workflows. Implemented properly, such an architecture allows an end user to understand how and where the data were collected and, equally importantly, who has touched them, and with what processes, as they move through a workflow. In the upstream oil industry many of these concepts are still aspirational, but it is clear that the chain of data provenance is only as strong as its weakest link.
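The ‘weakest link’ idea can be made concrete with a minimal sketch, under stated assumptions: each workflow step appends a record of who touched the data and with what process, hashed together with the previous entry so that alteration anywhere in the chain is detectable. The function names and step details below are illustrative, not part of any named standard:

```python
import hashlib
import json

def add_step(chain, actor, process, version):
    """Append a provenance entry that is hash-linked to the previous one."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    entry = {"actor": actor, "process": process,
             "version": version, "prev_hash": prev_hash}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(entry)
    return chain

def verify(chain):
    """Recompute every hash and link; any altered entry breaks the chain."""
    for i, entry in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else "0" * 64
        if entry["prev_hash"] != expected_prev:
            return False
        payload = {k: v for k, v in entry.items() if k != "hash"}
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if digest != entry["hash"]:
            return False
    return True

# A workflow touched by several hands (all step details are hypothetical)
chain = []
add_step(chain, "field team A", "core description", "n/a")
add_step(chain, "J. Smith", "depth-shift correction", "v2.1")
add_step(chain, "processing group", "seismic tie", "v5.0")
print(verify(chain))  # True

chain[1]["actor"] = "unknown"   # tamper with one link...
print(verify(chain))  # False — the chain fails at its weakest link
```

The point of the sketch is not the hashing itself but the discipline it enforces: every actor and process in the workflow leaves a verifiable trace.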
Current exemplary projects that address data provenance include:
- NASA’s data management plan guidelines: ageo.co/18H93wf
- The USGS data management resources: ageo.co/14rkIvR
- The UK NERC data management plans: ageo.co/1dwr6EL
- UK scientific and computing academic communities: ageo.co/16vtyGD, especially the ADMIRe program: ageo.co/1aZZhXS
Every geologist should be mindful of the part they play in the provenance chain and should carry out their science accordingly. They should strive to strengthen that chain in their daily work, as part of their corporate culture, and when procuring software and systems that mediate their interaction with data.