library(tidyverse)
library(lubridate)
library(kableExtra)
theme_set(theme_classic())
Background & Data
Dataset description:
Urban lakes are heavily impacted by human activities and climate variability, and they provide many ecosystem services to residents. The Minneapolis-St. Paul Area Long Term Ecological Research Program (MSP LTER) is studying long term changes in urban lake water quality. The goal of this dataset is to understand how land-use change, management, and climate have impacted urban lake biogeochemistry over time. This dataset includes parameters characterizing the long term (> 5 years) surface water quality and chemistry of 294 lakes and ponds in the Minneapolis-Saint Paul Seven County Metropolitan Area, Minnesota, USA.
Main research question:
What’s the relationship between turbidity and nutrients in Minnesota’s urban lakes?
lakes <- read_csv(here::here('data', "Timeseries_v6_forEDI.csv"))
head(lakes)
# A tibble: 6 × 7
DOW sampleDate parameter result gtlt dataSource county
<chr> <date> <chr> <dbl> <dbl> <chr> <chr>
1 02000300 1974-07-03 secchi.m 1.22 0 MPCA Anoka
2 02000300 1974-07-10 secchi.m 1.37 0 MPCA Anoka
3 02000300 1974-07-17 secchi.m 1.52 0 MPCA Anoka
4 02000300 1974-07-24 secchi.m 1.07 0 MPCA Anoka
5 02000300 1974-07-31 secchi.m 0.91 0 MPCA Anoka
6 02000300 1974-08-07 secchi.m 1.07 0 MPCA Anoka
| Variables | Description |
|---|---|
| DOW | DNR Lake Number |
| sampleDate | Date on which the sample was collected |
| parameter | Water quality parameter |
| result | Value of parameter |
| gtlt | Greater-than/less-than flag indicating whether values are above or below detection limits |
| dataSource | Source of original data |
| county | County in which the lake is located |
| Parameters | Description |
|---|---|
| chla.nonPcorr.ug_L | Chlorophyll a, not corrected for pheophytin |
| chla.Pcorr.ug_L | Chlorophyll a, corrected for pheophytin |
| Cl.mg_L | Chloride |
| NH4.mg_L | Ammonium |
| nitrate.mg_L | Nitrate |
| nitrite.mg_L | Nitrite |
| NOX.mg_L | Nitrate & Nitrite |
| secchi.m | Secchi depth |
| spCond.uS_cm | Specific conductivity |
| TKN.mg_L | Total Kjeldahl nitrogen |
| TN.mg_L | Total nitrogen |
| TP.mg_L | Total phosphorus |
ggplot(lakes, aes(result)) +
geom_histogram() +
facet_wrap(~parameter, scales = "free")
lakes_wide <- lakes |>
  pivot_wider(names_from = parameter, values_from = result)
DAG
graph TD
  Date[Sample Date]
  County[County]
  Secchi[Secchi Depth]
  TP[Total Phosphorus]
  TN[Total Nitrogen]
  Date --> TN
  Date --> TP
  Date --> Secchi
  County --> TN
  County --> TP
  Secchi --> TP
  Secchi --> TN
  classDef whitebox fill:#EDEDED;
  class Date,County,Secchi,TN,TP whitebox;
Statistical Model
Model thoughts:
Using a Gamma distribution seems like the best choice for pollution concentrations as a response.
- Right-skewed, positive, continuous data
Model in statistical notation:
\[\text{PConcentration} \sim \text{Gamma}(\mu, \phi)\]
\[\log(\mu) = \beta_0 + \beta_1 * \text{SecchiDepth}\]
Response Family: Gamma
Link Function: log()
Predictor: Secchi Disk Depth (m)
Response: Total Phosphorus (mg/L)
Hypothesis:
Since turbidity is known to indicate water quality and pollutant concentrations, we expect a relationship between Secchi depth as a turbidity measurement (predictor) and total phosphorus (response).
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
Fit model to simulated data
Traditionally, the Gamma distribution is parameterized as follows:
Shape: \(\alpha\)
Scale: \(\theta\)
Mean: \(\mu = \alpha\theta\)
Variance: \(\sigma^2 = \alpha\theta^2\)
The parameterization can also be defined in terms of \(\phi\) as:
Shape: \(\phi\)
Scale: \(\mu / \phi\)
Mean: \(\mu = \alpha\theta = \phi*(\frac{\mu}{\phi}) = \mu\)
Variance: \(\sigma^2 = \alpha\theta^2=\phi(\frac{\mu}{\phi})^2=\mu^2 / \phi\)
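As a quick sanity check of these moment formulas, we can simulate many draws with shape \(\phi\) and scale \(\mu / \phi\) and compare the sample mean and variance against \(\mu\) and \(\mu^2 / \phi\). This is just a sketch; the value of mu here is an arbitrary choice for the check.

```r
# Verify that shape = phi, scale = mu / phi gives mean mu and variance mu^2 / phi
set.seed(123)
mu  <- 2   # target mean (arbitrary choice for this check)
phi <- 3   # shape parameter, matching the simulation that follows

draws <- rgamma(1e6, shape = phi, scale = mu / phi)

mean(draws)  # should be close to mu = 2
var(draws)   # should be close to mu^2 / phi = 4/3
```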
When picking initial parameters to simulate, we also define \(\phi\) in addition to our \(\beta\)s.
n <- 200
beta0 <- 1
beta1 <- -0.7 # Simulate log response decrease as predictor increases
phi <- 3
To confirm our understanding of the model and system, we create simulated data with pre-defined coefficients, run the model, and expect the fitted coefficients to approximately recover the values we set.
Generating simulated data:
The predictor:
- This is the independent variable.
rgamma() was used to pick random values from a Gamma distribution with a shape of 1 and a scale of 0.5, giving the predictor a mean of 0.5 (Eq. 3). A Gamma distribution was chosen to add realism, since the real pollution concentration data in this post are also Gamma distributed.
# Simulated predictor with a mean of 0.5
predictor <- rgamma(n = n, shape = 1, scale = 0.5)
The response:
- This is the dependent variable.
rgamma() was again used to pick random values from a Gamma distribution. However, the scale parameter is set to exp(beta0 + beta1 * predictor) / phi, which sets the relationship between our predictor and response. Combined with the shape phi, this scale gives each observation a mean that changes with the predictor, as defined in the initial model notation:
\[\log(\mu)=\beta_0 + \beta_1 * \text{predictor}\] Exponentiating both sides gives:
\[\mu = e^{\beta_0 + \beta_1 * \text{predictor}}\] which was used in the scale argument as defined in Eq. 6.
# Simulated response with changing mean
response <- rgamma(n = n, shape = phi, scale = exp(beta0 + beta1 * predictor) / phi)
Plotting simulated data:
# Combine into dataframe
sim_df <- data.frame(predictor, response)
# Create scatterplot
ggplot(sim_df, aes(predictor, response)) +
geom_point()
# Run model on simulated data
sim_model <- glm(response ~ predictor, data = sim_df, family = Gamma(link = "log"))
# Create grid (one predictor for now)
sim_grid <- tibble(predictor = seq(0, 4, length.out = 500))
# Generate predictions with standard error in link space
sim_se <- predict(sim_model, newdata = sim_grid, type = "link", se.fit = TRUE)
# Undo link & add confidence intervals
sim_ci <- sim_grid |>
mutate(log_response = sim_se$fit,
log_response_se = sim_se$se.fit,
log_response_lower = qnorm(0.025, log_response, log_response_se),
log_response_upper = qnorm(0.975, log_response, log_response_se),
# Mean
predicted_mean = exp(log_response),
# Mean CI
mean_lower = exp(log_response_lower),
         mean_upper = exp(log_response_upper))
# Return simulated data coefficients
sim_sum <- summary(sim_model)
coef(sim_sum) |>
  kbl(digits = 2, align = 'c')
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.89 | 0.06 | 14.14 | 0 |
| predictor | -0.58 | 0.09 | -6.53 | 0 |
# Plot simulated data with model fit
ggplot() +
geom_point(data = sim_df, aes(predictor, response), alpha = 0.3) +
geom_line(data = sim_ci, aes(predictor, predicted_mean), color = 'red', linewidth = 1) +
geom_ribbon(data = sim_ci, aes(predictor, ymin = mean_lower, ymax = mean_upper), alpha = 0.5)
Fit model to real data
# Raw scatterplot
ggplot(lakes_wide, aes(secchi.m, TP.mg_L)) + geom_point(alpha = 0.1)
# Run model on real data
real_model <- glm(TP.mg_L ~ secchi.m, data = lakes_wide, family = Gamma(link = "log"))
# Get real model summary
real_sum <- summary(real_model)
# Create prediction grid (one predictor for now)
pred_grid <- tibble(secchi.m = seq(min(lakes_wide$secchi.m, na.rm = TRUE),
max(lakes_wide$secchi.m, na.rm = TRUE),
length.out = 100))
# Generate predictions with standard error in link space
pred_se <- predict(real_model, newdata = pred_grid, type = "link", se.fit = TRUE)
# Undo link & add confidence intervals
pred_ci <- pred_grid |>
mutate(log_TP = pred_se$fit,
log_TP_se = pred_se$se.fit,
log_TP_lower = qnorm(0.025, log_TP, log_TP_se),
log_TP_upper = qnorm(0.975, log_TP, log_TP_se),
# Mean
pred_TP = exp(log_TP),
# Mean CI
TP_lower = exp(log_TP_lower),
TP_upper = exp(log_TP_upper))
# Plot raw data with mean response and mean CI
ggplot() +
geom_point(data = lakes_wide, aes(secchi.m, TP.mg_L), alpha = 0.3) +
geom_line(data = pred_ci, aes(secchi.m, pred_TP), color = 'red', linewidth = 1) +
  geom_ribbon(data = pred_ci, aes(x = secchi.m, ymin = TP_lower, ymax = TP_upper), alpha = 0.5)
# Display coefficient table
coef(real_sum) |>
  kbl(digits = 3, align = 'c')
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -2.123 | 0.008 | -278.933 | 0 |
| secchi.m | -0.358 | 0.003 | -111.140 | 0 |
# Calculate 95% CI bounds in link space
b0_ci_lower <- qnorm(0.025, mean = coef(real_sum)[1], sd = real_sum$coefficients[1, 2])
b0_ci_upper <- qnorm(0.975, mean = coef(real_sum)[1], sd = real_sum$coefficients[1, 2])
b1_ci_lower <- qnorm(0.025, mean = coef(real_sum)[2], sd = real_sum$coefficients[2, 2])
b1_ci_upper <- qnorm(0.975, mean = coef(real_sum)[2], sd = real_sum$coefficients[2, 2])
# Create CI table
ci_table <- data.frame(link_estimate = c(coef(real_sum)[1], coef(real_sum)[2]),
response_estimate = c(exp(coef(real_sum)[1]), exp(coef(real_sum)[2])),
lower_ci = c(exp(b0_ci_lower), exp(b1_ci_lower)),
upper_ci = c(exp(b0_ci_upper), exp(b1_ci_upper)))
# Create rownames
rownames(ci_table) <- c("(Intercept)", "secchi.m")
# Display table
ci_table |> kbl(col.names = c("Link Estimate", "Response Estimate", "CI Lower Bound", "CI Upper Bound"), digits = 3, align = 'c')
|  | Link Estimate | Response Estimate | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| (Intercept) | -2.123 | 0.120 | 0.118 | 0.122 |
| secchi.m | -0.358 | 0.699 | 0.694 | 0.703 |
Extend model to other parameters
response_vars <- lakes_wide |>
select(NOX.mg_L:TN.mg_L) |>
  colnames()
response_df <- map_df(response_vars, ~{
# Remove zeros for each response before modeling
  model_data <- lakes_wide |>
    filter(.data[[.x]] > 0)
# Run model
model <- glm(as.formula(paste(.x, "~ secchi.m")),
data = model_data,
family = Gamma(link = "log"))
# Get model summary
model_sum <- summary(model)
# Create tibble of all responses
tibble(
response = .x,
term = names(coef(model)),
estimate = coef(model),
p_val = coef(model_sum)[, 4]
  )
})
response_df |>
filter(term == "secchi.m") |>
select(-term) |>
  kbl(col.names = c("Response", "Estimate", "p-value"), digits = 3, align = 'c')
| Response | Estimate | p-value |
|---|---|---|
| NOX.mg_L | -0.142 | 0.000 |
| TKN.mg_L | -0.230 | 0.000 |
| TP.mg_L | -0.358 | 0.000 |
| chla.nonPcorr.ug_L | -0.647 | 0.000 |
| spCond.uS_cm | -0.011 | 0.000 |
| Cl.mg_L | -0.061 | 0.000 |
| NH4.mg_L | -0.154 | 0.000 |
| chla.Pcorr.ug_L | -0.703 | 0.000 |
| nitrate.mg_L | -0.180 | 0.017 |
| nitrite.mg_L | -0.109 | 0.284 |
| TN.mg_L | -0.167 | 0.000 |
Inference
Hypothesis
There is a relationship between Secchi depth as a turbidity measurement (predictor) and total phosphorus (response).
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
In response space (\(e^{-0.358}\)), the coefficient for secchi.m (\(\beta_1\)) is about \(0.70 \pm 0.004\) at 95% confidence. This means that for every 1 m increase in Secchi depth (clearer water), total phosphorus (TP) is multiplied by about 0.70, i.e., it decreases by roughly 30%. The intercept (\(\beta_0\)) represents expected TP when Secchi depth is 0 m; undoing the link (\(e^{-2.123}\)) gives an estimate of \(0.12 \pm 0.002\) mg/L at 95% confidence. Both p-values are effectively zero, meaning estimates this extreme would be vanishingly unlikely if the true coefficients were zero, so we reject the null hypothesis.
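The percent changes quoted above follow directly from exponentiating the link-scale coefficients. A quick check, using the point estimates from the coefficient table:

```r
# Link-scale estimates from the fitted model's coefficient table
beta1 <- -0.358   # secchi.m slope
beta0 <- -2.123   # intercept

exp(beta1)              # multiplicative effect on TP per meter of Secchi depth, ~0.70
(1 - exp(beta1)) * 100  # percent decrease in TP per meter, ~30%

exp(beta0)              # expected TP (mg/L) at a Secchi depth of 0 m, ~0.12
```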
It’s known that turbid, murky waters usually indicate higher concentrations of pollutants. Whether it’s sediment, nutrients, or other contaminants, these results align with the expectation that turbid waters contain more pollutants. Overall, using our Gamma model on this data reinforces the idea that water clarity is in fact a reliable indicator of nutrient pollution and lake health.
Citation
@online{loo2025,
author = {Loo, Zach},
title = {Gamma {Models} for {Pollution} {Concentrations}},
date = {2025-12-03},
url = {https://zachyyy700.github.io/posts/2025-12-3-stats-final/},
langid = {en}
}