The emergent insights when high-density data replaces low-frequency sampling

Data quality.  Data accuracy.  Technology differences.

In the era of data-driven insights, what is the role of high-density data for soil and groundwater monitoring?  High-density data is far more than a surge in information volume.  It delivers the opportunity for greater precision in your modeling and decision making.

High-Density Data

High-density data is not simply more of the same data you are used to.  That is counter-intuitive but true.  Let us look at a simple example.

You are measuring carbon dioxide flux from an ecosystem.  Typically, this is measured using a flux chamber, which records how much carbon dioxide accumulates in a closed volume over a short period of time.  From that, you estimate how much carbon dioxide is leaving the ecosystem.  Before high-density data, we might collect such estimates once a week for a few weeks a year.  Fast forward to 2023: you are using real-time, remote monitoring systems, collecting a carbon dioxide flux estimate every 30 minutes, 365 days a year.  At first glance, you might think this is just more of the same.  You would be wrong.
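As a sketch of what each of those 30-minute estimates involves, here is the chamber calculation in Python.  The chamber dimensions, readings, and units below are all made up for illustration; the method (flux from the slope of concentration versus time during closure) is the standard one.

```python
import numpy as np

# Hypothetical chamber readings: CO2 concentration (micromol/m^3) logged
# every 10 s during a 2-minute chamber closure.  Values are illustrative.
t = np.arange(0, 120, 10)                               # seconds since closure
c = 16500 + 0.8 * t + np.random.default_rng(0).normal(0, 5, t.size)

# Accumulation rate = slope of a linear fit of concentration vs time.
slope, intercept = np.polyfit(t, c, 1)                  # micromol m^-3 s^-1

# Scale by chamber volume over footprint area to get a flux.
V, A = 0.03, 0.09                                       # m^3 and m^2 (assumed)
flux = slope * V / A                                    # micromol CO2 m^-2 s^-1
print(round(flux, 3))
```

One such number, every 30 minutes, all year: that is the raw material of a high-density stream.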

High-density data has emergent properties that low-density data does not.

For example, low-density data does not tell you how reliable your estimate is.  Technically, that reliability is described by the probability distribution of the data.  Typically, in the environmental sciences we assume a normal or log-normal distribution, and we estimate reliability from that assumption.  High-density data requires no such assumption.  The data density is such that we can reliably fit a more complex distribution, such as a gamma distribution, or even a binomial-gamma hybrid, directly to the data stream.
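To make this concrete, here is a minimal sketch using SciPy.  The stream itself is simulated (17,520 half-hourly values, drawn from a skewed distribution, as flux data often is); the point is that with this much data you can fit the distribution directly instead of assuming normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated high-density stream: one value every 30 minutes for a year.
flux = rng.gamma(shape=2.0, scale=1.5, size=17_520)

# Fit a gamma distribution directly to the stream (no normality assumption).
shape, loc, scale = stats.gamma.fit(flux, floc=0)

# Compare the 95th percentile under a normal assumption with the fitted gamma.
p95_normal = stats.norm(flux.mean(), flux.std()).ppf(0.95)
p95_gamma = stats.gamma(shape, loc=loc, scale=scale).ppf(0.95)
print(f"normal: {p95_normal:.2f}, gamma: {p95_gamma:.2f}")
```

For skewed data like this, the normal assumption understates the upper tail; the fitted gamma recovers it.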

Why should you care?  Forecasts based on high-density data are not only more accurate but also more robust to outlier events, because low-probability events such as floods and fires can be readily incorporated into such distribution functions.  More importantly, we can transparently and easily communicate the uncertainty of both current and future estimates.
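Once a distribution is fitted, both of these claims become one-liners.  A sketch, using an illustrative fitted gamma distribution (the shape, scale, and threshold below are assumed, not from any real site):

```python
from scipy import stats

# Suppose a gamma distribution with shape 2.0 and scale 1.5 was fitted to
# a high-density stream (illustrative values).
fitted = stats.gamma(2.0, scale=1.5)

# Probability of exceeding an extreme, hypothetical "outlier event" level,
# read straight off the fitted tail rather than a normal approximation.
threshold = 12.0
p_exceed = fitted.sf(threshold)          # survival function = P(X > threshold)

# A transparent uncertainty statement: the central 90% interval.
lo, hi = fitted.ppf([0.05, 0.95])
print(f"P(X > {threshold}) = {p_exceed:.4f}; 90% interval = [{lo:.2f}, {hi:.2f}]")
```

That exceedance probability and interval are exactly the kind of plain-language uncertainty statement stakeholders can act on.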

High-density data alters risk management.

Risk management models that estimate contaminant fate, transport, and risk all incorporate complex physical phenomena.  But when you only have low-density data, such as a single groundwater measurement every year, risk management models need to collapse complex processes into a single, easy parameter.  It is a best-guess estimate based on limited data.  This simplification is how you end up with leaching models or vapour intrusion models based on a single criterion (often linked to soil texture).  We know that the fate of soil contamination is not dictated by a single factor.  It is complex.  High-density data makes these simplifying assumptions unnecessary.

A test of a model is how useful it is, not whether it is true.

With low-density data, no effort is made to adjust the model or to estimate how valid it is for a given environment.  Instead, a generic model is used, because we prefer to eliminate interpretation bias rather than improve accuracy.  The cost of this decision is that stakeholders do not trust the models, because the models do not conform to their experience.  High-density data flips this entire process on its head.

A complex model is set up with many parameters left unknown.  High-density data is then fed into the model, and the output is evaluated using a data assimilation process such as Model-Independent Parameter Estimation (PEST).  In this process, model parameters are automatically adjusted to conform to the incoming data stream, so the calibrated model can often reproduce past observations.  These models can readily identify unknown unknowns and show which of them you should focus your efforts on.  In other words, high-density data allows your models to identify the important data gaps in your site characterization.
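The core loop of this calibration step can be sketched in a few lines.  This is not PEST itself, only the idea reduced to its simplest form: a model with unknown parameters, a stream of observations, and an optimizer that adjusts the parameters until the model reproduces the data.  The first-order decay model, its true parameters, and the simulated stream are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Simulated high-density stream: daily concentrations (mg/L) for a year,
# generated from a first-order decay model plus noise (illustrative only).
rng = np.random.default_rng(1)
t = np.arange(0.0, 365.0)
obs = 50.0 * np.exp(-0.01 * t) + rng.normal(0, 0.5, t.size)

def residuals(params, t, obs):
    """Misfit between the model prediction and the incoming data stream."""
    c0, k = params
    return c0 * np.exp(-k * t) - obs

# Start far from the truth; the calibration pulls the unknown parameters
# toward values that reproduce the observations.
fit = least_squares(residuals, x0=[30.0, 0.05], args=(t, obs))
c0_est, k_est = fit.x
print(f"c0 = {c0_est:.1f} mg/L, k = {k_est:.4f} /day")
```

Real assimilation tools do much more (parameter uncertainty, regularization, many parameters at once), but the mechanism is this one: let the data stream adjust the model.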

High-density data forces users to confront an uncomfortable truth of environmental data: all data are merely imperfect estimates of site conditions.

Here is an example.  A typical contaminated site is 2,000 m3.  Five 5 m-deep borehole cores will recover 0.3 m3 of soil, of which 0.008 m3 is sent to the laboratory, where 5 x 0.000004 m3 is extracted for analysis.  Thus, your gold-standard low-density data provides five independent estimates, each representing roughly one billionth of your site volume, and on that basis you must make site decisions.

In contrast, a single high-density sensor typically samples 12.6 m3 every 30 minutes, and a typical installation samples 188 m3 every 30 minutes.  Over a year, high-density data provides you with roughly 262,000 estimates covering about one tenth of your site volume.  Based on that, you can make better site decisions.
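The arithmetic behind those two fractions, using the figures above, works out like this:

```python
site = 2_000.0                  # site volume, m^3 (figure from the text)

# Low-density route: volume of one analysed aliquot vs the whole site.
aliquot = 0.000004              # m^3 extracted for a single laboratory analysis
frac_low = aliquot / site       # fraction of the site behind one estimate

# High-density route: volume swept by the sensor network per reading.
sensed = 188.0                  # m^3 sampled per 30-minute interval
frac_high = sensed / site       # fraction of the site behind one reading

print(f"{frac_low:.1e}")        # 2.0e-09 -- billionths of the site
print(f"{frac_high:.3f}")       # 0.094  -- about one tenth of the site
```

Eight orders of magnitude separate the two: that is the gap the rest of this piece is about.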