library(tidyverse)
library(lubridate)
library(kableExtra)
theme_set(theme_classic())
Background & Data
Dataset description:
Urban lakes are heavily impacted by human activities and climate variability, and they provide many ecosystem services to residents. The Minneapolis-St. Paul Area Long Term Ecological Research Program (MSP LTER) is studying long term changes in urban lake water quality. The goal of this dataset is to understand how land-use change, management, and climate have impacted urban lake biogeochemistry over time. This dataset includes parameters characterizing the long term (> 5 years) surface water quality and chemistry of 294 lakes and ponds in the Minneapolis-Saint Paul Seven County Metropolitan Area, Minnesota, USA.
Main research question:
What’s the relationship between turbidity and nutrients in Minnesota’s urban lakes?
lakes <- read_csv(here::here('data', "Timeseries_v6_forEDI.csv"))
head(lakes)
# A tibble: 6 × 7
DOW sampleDate parameter result gtlt dataSource county
<chr> <date> <chr> <dbl> <dbl> <chr> <chr>
1 02000300 1974-07-03 secchi.m 1.22 0 MPCA Anoka
2 02000300 1974-07-10 secchi.m 1.37 0 MPCA Anoka
3 02000300 1974-07-17 secchi.m 1.52 0 MPCA Anoka
4 02000300 1974-07-24 secchi.m 1.07 0 MPCA Anoka
5 02000300 1974-07-31 secchi.m 0.91 0 MPCA Anoka
6 02000300 1974-08-07 secchi.m 1.07 0 MPCA Anoka
| Variables | Description |
|---|---|
| DOW | DNR Lake Number |
| sampleDate | Date on which the sample was collected |
| parameter | Water quality parameter |
| result | Value of parameter |
| gtlt | Greater-than/less-than flag indicating whether values are above or below detection limits |
| dataSource | Source of original data |
| county | County in which the lake is located |
| Parameters | Description |
|---|---|
| chla.nonPcorr.ug_L | Chlorophyll a, not corrected for pheophytin |
| chla.Pcorr.ug_L | Chlorophyll a, corrected for pheophytin |
| Cl.mg_L | Chloride |
| NH4.mg_L | Ammonium |
| nitrate.mg_L | Nitrate |
| nitrite.mg_L | Nitrite |
| NOX.mg_L | Nitrate & Nitrite |
| secchi.m | Secchi depth |
| spCond.uS_cm | Specific conductivity |
| TKN.mg_L | Total Kjeldahl nitrogen |
| TN.mg_L | Total nitrogen |
| TP.mg_L | Total phosphorus |
ggplot(lakes, aes(result)) +
geom_histogram() +
facet_wrap(~parameter, scales = "free")
lakes_wide <- lakes |>
  pivot_wider(names_from = parameter, values_from = result)
DAG
graph TD
  Date[Sample Date]
  County[County]
  Secchi[Secchi Depth]
  TP[Total Phosphorus]
  TN[Total Nitrogen]
  Date --> TN
  Date --> TP
  Date --> Secchi
  County --> TN
  County --> TP
  Secchi --> TP
  Secchi --> TN
  classDef whitebox fill:#EDEDED;
  class Date,County,Secchi,TN,TP whitebox;
Statistical Model
Model thoughts:
Using a Gamma distribution seems like the best choice for pollution concentrations as a response.
- Right-skewed, positive, continuous data
Model in statistical notation:
\[\text{PConcentration} \sim \text{Gamma}(\mu, \phi)\]
\[\log(\mu) = \beta_0 + \beta_1 * \text{SecchiDepth}\]
Response Family: Gamma
Link Function: log()
Predictor: Secchi Disk Depth (m)
Response: Total Phosphorus (mg/L)
Hypothesis:
Since turbidity is known to indicate water quality and pollutant concentrations, we expect a relationship between Secchi depth as a turbidity measurement (predictor) and total phosphorus (response).
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
Fit model to simulated data
Traditionally, the Gamma distribution is parameterized as follows:
Shape: \(\alpha\)
Scale: \(\theta\)
Mean: \(\mu = \alpha\theta\)
Variance: \(\sigma^2 = \alpha\theta^2\)
The parameterization can also be defined in terms of \(\phi\) as:
Shape: \(\phi\)
Scale: \(\mu / \phi\)
Mean: \(\mu = \alpha\theta = \phi*(\frac{\mu}{\phi}) = \mu\)
Variance: \(\sigma^2 = \alpha\theta^2=\phi(\frac{\mu}{\phi})^2=\mu^2 / \phi\)
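As a quick sanity check of these moment formulas, we can simulate many draws with shape \(\phi\) and scale \(\mu / \phi\) and compare the sample mean and variance against \(\mu\) and \(\mu^2 / \phi\). This is just a sketch; the value of mu here is an arbitrary choice for the check.

```r
# Verify that shape = phi, scale = mu / phi gives mean mu and variance mu^2 / phi
set.seed(123)
mu  <- 2   # target mean (arbitrary choice for this check)
phi <- 3   # shape parameter, matching the simulation that follows

draws <- rgamma(1e6, shape = phi, scale = mu / phi)

mean(draws)  # should be close to mu = 2
var(draws)   # should be close to mu^2 / phi = 4/3
```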
When picking initial parameters to simulate, we also define \(\phi\) in addition to our \(\beta\)s.
n <- 200
beta0 <- 1
beta1 <- -0.7 # Simulate log response decrease as predictor increases
phi <- 3
To confirm our understanding of the model and system, we create simulated data with pre-defined coefficients, run the model, and expect the fitted coefficients to approximately recover the values we set.
Generating simulated data:
The predictor:
- This is the independent variable.
rgamma() was used to pick random values from a Gamma distribution with a shape of 1 and a scale of 0.5, giving the predictor a mean of 0.5 (Eq. 3). A Gamma distribution was chosen to add realism, since the real pollution concentration data in this post are also Gamma distributed.
# Simulated predictor with a mean of 0.5
predictor <- rgamma(n = n, shape = 1, scale = 0.5)
The response:
- This is the dependent variable.
rgamma() was again used to pick random values from a Gamma distribution. However, the scale parameter is set to exp(beta0 + beta1 * predictor) / phi, which sets the relationship between our predictor and response. Combined with the shape phi, this scale gives each observation a mean that changes with the predictor, as defined in the initial model notation:
\[\log(\mu)=\beta_0 + \beta_1 * \text{predictor}\] Exponentiating both sides gives:
\[\mu = e^{\beta_0 + \beta_1 * \text{predictor}}\] which was used in the scale argument as defined in Eq. 6.
# Simulated response with changing mean
response <- rgamma(n = n, shape = phi, scale = exp(beta0 + beta1 * predictor) / phi)
Plotting simulated data:
# Combine into dataframe
sim_df <- data.frame(predictor, response)
# Create scatterplot
ggplot(sim_df, aes(predictor, response)) +
geom_point()
# Run model on simulated data
sim_model <- glm(response ~ predictor, data = sim_df, family = Gamma(link = "log"))
# Create grid (one predictor for now)
sim_grid <- tibble(predictor = seq(0, 4, length.out = 500))
# Generate predictions with standard error in link space
sim_se <- predict(sim_model, newdata = sim_grid, type = "link", se.fit = TRUE)
# Undo link & add confidence intervals
sim_ci <- sim_grid |>
mutate(log_response = sim_se$fit,
log_response_se = sim_se$se.fit,
log_response_lower = qnorm(0.025, log_response, log_response_se),
log_response_upper = qnorm(0.975, log_response, log_response_se),
# Mean
predicted_mean = exp(log_response),
# Mean CI
mean_lower = exp(log_response_lower),
         mean_upper = exp(log_response_upper))
# Return simulated data coefficients
sim_sum <- summary(sim_model)
coef(sim_sum) |>
  kbl(digits = 2, align = 'c')
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.89 | 0.06 | 14.14 | 0 |
| predictor | -0.58 | 0.09 | -6.53 | 0 |
# Plot simulated data with model fit
ggplot() +
geom_point(data = sim_df, aes(predictor, response), alpha = 0.3) +
geom_line(data = sim_ci, aes(predictor, predicted_mean), color = 'red', linewidth = 1) +
geom_ribbon(data = sim_ci, aes(predictor, ymin = mean_lower, ymax = mean_upper), alpha = 0.5)
Fit model to real data
# Raw scatterplot
ggplot(lakes_wide, aes(secchi.m, TP.mg_L)) + geom_point(alpha = 0.1)
# Run model on real data
real_model <- glm(TP.mg_L ~ secchi.m, data = lakes_wide, family = Gamma(link = "log"))
# Get real model summary
real_sum <- summary(real_model)
# Create prediction grid (one predictor for now)
pred_grid <- tibble(secchi.m = seq(min(lakes_wide$secchi.m, na.rm = TRUE),
max(lakes_wide$secchi.m, na.rm = TRUE),
length.out = 100))
# Generate predictions with standard error in link space
pred_se <- predict(real_model, newdata = pred_grid, type = "link", se.fit = TRUE)
# Undo link & add confidence intervals
pred_ci <- pred_grid |>
mutate(log_TP = pred_se$fit,
log_TP_se = pred_se$se.fit,
log_TP_lower = qnorm(0.025, log_TP, log_TP_se),
log_TP_upper = qnorm(0.975, log_TP, log_TP_se),
# Mean
pred_TP = exp(log_TP),
# Mean CI
TP_lower = exp(log_TP_lower),
TP_upper = exp(log_TP_upper))
# Plot raw data with mean response and mean CI
ggplot() +
geom_point(data = lakes_wide, aes(secchi.m, TP.mg_L), alpha = 0.3) +
geom_line(data = pred_ci, aes(secchi.m, pred_TP), color = 'red', linewidth = 1) +
  geom_ribbon(data = pred_ci, aes(x = secchi.m, ymin = TP_lower, ymax = TP_upper), alpha = 0.5)
# Display coefficient table
coef(real_sum) |>
  kbl(digits = 3, align = 'c')
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -2.123 | 0.008 | -278.933 | 0 |
| secchi.m | -0.358 | 0.003 | -111.140 | 0 |
# Calculate 95% CI bounds in link space
b0_ci_lower <- qnorm(0.025, mean = coef(real_sum)[1], sd = real_sum$coefficients[1, 2])
b0_ci_upper <- qnorm(0.975, mean = coef(real_sum)[1], sd = real_sum$coefficients[1, 2])
b1_ci_lower <- qnorm(0.025, mean = coef(real_sum)[2], sd = real_sum$coefficients[2, 2])
b1_ci_upper <- qnorm(0.975, mean = coef(real_sum)[2], sd = real_sum$coefficients[2, 2])
# Create CI table
ci_table <- data.frame(link_estimate = c(coef(real_sum)[1], coef(real_sum)[2]),
response_estimate = c(exp(coef(real_sum)[1]), exp(coef(real_sum)[2])),
lower_ci = c(exp(b0_ci_lower), exp(b1_ci_lower)),
upper_ci = c(exp(b0_ci_upper), exp(b1_ci_upper)))
# Create rownames
rownames(ci_table) <- c("(Intercept)", "secchi.m")
# Display table
ci_table |> kbl(col.names = c("Link Estimate", "Response Estimate", "CI Lower Bound", "CI Upper Bound"), digits = 3, align = 'c')
|  | Link Estimate | Response Estimate | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| (Intercept) | -2.123 | 0.120 | 0.118 | 0.122 |
| secchi.m | -0.358 | 0.699 | 0.694 | 0.703 |
Extend model to other parameters
response_vars <- lakes_wide |>
select(NOX.mg_L:TN.mg_L) |>
  colnames()
response_df <- map_df(response_vars, ~{
# Remove zeros for each response before modeling
  model_data <- lakes_wide |>
    filter(.data[[.x]] > 0)
# Run model
model <- glm(as.formula(paste(.x, "~ secchi.m")),
data = model_data,
family = Gamma(link = "log"))
# Get model summary
model_sum <- summary(model)
# Create tibble of all responses
tibble(
response = .x,
term = names(coef(model)),
estimate = coef(model),
p_val = coef(model_sum)[, 4]
  )
})
response_df |>
filter(term == "secchi.m") |>
select(-term) |>
  kbl(col.names = c("Response", "Estimate", "p-value"), digits = 3, align = 'c')
| Response | Estimate | p-value |
|---|---|---|
| NOX.mg_L | -0.142 | 0.000 |
| TKN.mg_L | -0.230 | 0.000 |
| TP.mg_L | -0.358 | 0.000 |
| chla.nonPcorr.ug_L | -0.647 | 0.000 |
| spCond.uS_cm | -0.011 | 0.000 |
| Cl.mg_L | -0.061 | 0.000 |
| NH4.mg_L | -0.154 | 0.000 |
| chla.Pcorr.ug_L | -0.703 | 0.000 |
| nitrate.mg_L | -0.180 | 0.017 |
| nitrite.mg_L | -0.109 | 0.284 |
| TN.mg_L | -0.167 | 0.000 |
Inference
Hypothesis
There is a relationship between Secchi depth as a turbidity measurement (predictor) and total phosphorus (response).
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
In response space (\(e^{-0.358}\)), the coefficient for secchi.m (\(\beta_1\)) is about \(0.70 \pm 0.004\) at 95% confidence. This means that for every 1 m increase in Secchi depth (clearer water), total phosphorus (TP) is multiplied by about 0.70, i.e., it decreases by roughly 30%. The intercept (\(\beta_0\)) represents expected TP when Secchi depth is 0 m; undoing the link (\(e^{-2.123}\)) gives an estimate of \(0.12 \pm 0.002\) mg/L at 95% confidence. Both p-values are effectively zero, meaning estimates this extreme would be vanishingly unlikely if the true coefficients were zero, so we reject the null hypothesis.
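The percent changes quoted above follow directly from exponentiating the link-scale coefficients. A quick check, using the point estimates from the coefficient table:

```r
# Link-scale estimates from the fitted model's coefficient table
beta1 <- -0.358   # secchi.m slope
beta0 <- -2.123   # intercept

exp(beta1)              # multiplicative effect on TP per meter of Secchi depth, ~0.70
(1 - exp(beta1)) * 100  # percent decrease in TP per meter, ~30%

exp(beta0)              # expected TP (mg/L) at a Secchi depth of 0 m, ~0.12
```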
It’s known that turbid, murky waters usually indicate higher concentrations of pollutants. Whether it’s sediment, nutrients, or other contaminants, these results align with the expectation that turbid waters contain more pollutants. Overall, using our Gamma model on this data reinforces the idea that water clarity is in fact a reliable indicator of nutrient pollution and lake health.
Citation
@online{loo2025,
author = {Loo, Zach},
title = {Gamma {Models} for {Pollution} {Concentrations}},
date = {2025-12-03},
url = {https://zachyyy700.github.io/posts/2025-12-3-stats-final/},
langid = {en}
}