--- title: "Spatial Regression Models" output: rmarkdown::html_vignette: mathjax: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" vignette: > %\VignetteIndexEntry{Spatial Regression Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} header-includes: - \def\T{{ \scriptstyle \top }} - \newcommand{\thetasp}{{\theta_{\text{sp}}}} - \newcommand{\GP}{\mathsf{GP}} - \newcommand{\N}{\mathsf{N}} - \newcommand{\EF}{\mathsf{EF}} - \newcommand{\Norm}{\mathsf{N}} - \newcommand{\GCMc}{\mathsf{GCM}_c} - \newcommand{\GCM}{\mathsf{GCM}} - \newcommand{\calL}{\mathcal{L}} - \newcommand{\IG}{\mathsf{IG}} - \newcommand{\IW}{\mathsf{IW}} - \newcommand{\given}{\mid} bibliography: refs.bib link-citations: true --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In this article, we discuss the following functions - - `spLMexact()` - `spLMstack()` - `spGLMexact()` - `spGLMstack()` These functions can be used to fit Gaussian and non-Gaussian spatial point-referenced data. ```{r} set.seed(1729) ``` ## Bayesian Gaussian spatial regression models In this section, we thoroughly illustrate our method on synthetic Gaussian as well as non-Gaussian spatial data and provide code to analyze the output of our functions. We start by loading the package. ```{r setup} library(spStack) library(ggplot2) library(patchwork) ``` Some synthetic spatial data are lazy-loaded which includes synthetic spatial Gaussian data `simGaussian`, Poisson data `simPoisson`, binomial data `simBinom` and binary data `simBinary`. One can use the function `sim_spData()` to simulate spatial data. We will be applying our functions on these datasets. ### Using fixed hyperparameters We first load the data `simGaussian` and set up the priors. Supplying the priors is optional. See the documentation of `spLMexact()` to learn more about the default priors. Besides, setting the priors, we also fix the values of the spatial process parameters (spatial decay $\phi$ and smoothness $\nu$) and the noise-to-spatial variance ratio ($\delta^2$). ```{r} data("simGaussian") dat <- simGaussian[1:200, ] # work with first 200 rows muBeta <- c(0, 0) VBeta <- cbind(c(1E4, 0.0), c(0.0, 1E4)) sigmaSqIGa <- 2 sigmaSqIGb <- 2 phi0 <- 3 nu0 <- 0.5 noise_sp_ratio <- 0.8 prior_list <- list(beta.norm = list(muBeta, VBeta), sigma.sq.ig = c(sigmaSqIGa, sigmaSqIGb)) nSamples <- 1000 ``` We define the spatial model using a `formula`, similar to that in the widely used `lm()` function in the `stats` package. Here, the formula `y ~ x1` corresponds to the spatial linear model $$y(s) = \beta_0 + \beta_1 x_1(s) + z(s) + \epsilon(s)\;,$$ where the `y` corresponds to the response variable $y(s)$, which is regressed on the predictor `x1` given by $x_1(s)$. The intercept is automatically considered within the model, and hence `y ~ x1` is functionally equivalent to `y ~ 1 + x1`. Moreover, a spatial random effect is inherent in the model, where the spatial correlation matrix is governed by the spatial correlation function specified by the argument `cor.fn`. Supported correlation functions are `"exponential"` and `"matern"`. The exponential covariogram is specified by the hyperparameter $\phi$ and the Matern covariogram is specified by the hyperparameters $\phi$ and $\nu$. Fixed values of these hyperparameters are supplied through the argument `spParams`. In addition, the noise-to-spatial variance ration is also fixed through the argument `noise_sp_ratio`. If interested in calculation of leave-one-out predictive densities (LOO-PD), `loopd` must be set `TRUE` (the default is `FALSE`). Method of LOO-PD calculation can be also set by the option `loopd.method` which support the keywords `"exact"` and `"psis"`. The option `"exact"` exploits the analytically available expressions of the predictive density and implements an efficient row-deletion Cholesky factor update for fast calculation and avoids refitting the model $n$ times, where $n$ is the sample size. On the other hand, `"psis"` implements Pareto-smoothed importance sampling and finds approximate LOO-PD and is much faster than `"exact"`. We pass these arguments into the function `spLMexact()`. ```{r spLMexactLOO_exact} mod1 <- spLMexact(y ~ x1, data = dat, coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", priors = prior_list, spParams = list(phi = phi0, nu = nu0), noise_sp_ratio = noise_sp_ratio, n.samples = nSamples, loopd = TRUE, loopd.method = "exact", verbose = TRUE) ``` Next, we can summarize the posterior samples of the fixed effects as follows. ```{r} post_beta <- mod1$samples$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod1$X.names print(summary_beta) ``` ### Leave-one-out predictive densities using PSIS Out of curiosity, we find the LOO-PD for the same model using the approximate method that uses Pareto-smoothed importance sampling, or PSIS. See @LOOCV_vehtari17 for details. ```{r spLMexactLOO_PSIS} mod2 <- spLMexact(y ~ x1, data = dat, coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", priors = prior_list, spParams = list(phi = phi0, nu = nu0), noise_sp_ratio = noise_sp_ratio, n.samples = nSamples, loopd = TRUE, loopd.method = "PSIS", verbose = FALSE) ``` Subsquently, we compare the LOO-PD obtained by the two methods. ```{r fig.align='center'} loopd_exact <- mod1$loopd loopd_psis <- mod2$loopd loopd_df <- data.frame(exact = loopd_exact, psis = loopd_psis) library(ggplot2) plot1 <- ggplot(data = loopd_df, aes(x = exact)) + geom_point(aes(y = psis), size = 1, alpha = 0.5) + geom_abline(slope = 1, intercept = 0, color = "red", alpha = 0.5) + xlab("Exact") + ylab("PSIS") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) plot1 ``` ### Using predictive stacking Next, we move on to the Bayesian spatial stacking algorithm for Gaussian data. We supply the same prior list and provide candidate models constructed using `candidateModels()`. ```{r spLMstack} cand.mod <- candidateModels(list(phi = c(1.5, 3, 5), nu = c(0.5, 1, 1.5), noise_sp_ratio = c(0.5, 1.5)), "cartesian") mod3 <- spLMstack(y ~ x1, data = dat, coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", priors = prior_list, candidate.models = cand.mod, n.samples = 1000, loopd.method = "exact", parallel = FALSE, verbose = TRUE) ``` The user can check the solver status and runtime by issuing the following. ```{r} print(mod3$solver) print(mod3$solver.status) print(mod3$run.time) ``` ### Analyzing samples from the stacked posterior To sample from the stacked posterior, the package provides a helper function called `stackedSampler()`. Subsequent inference proceeds from these samples obtained from the stacked posterior. ```{r} post_samps <- stackedSampler(mod3) ``` We then collect the samples of the fixed effects and summarize them as follows. ```{r} post_beta <- post_samps$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod3$X.names print(summary_beta) ``` The synthetic data `simGaussian` was simulated using the true value $\beta = (2, 5)^\T$. We notice that the stacked posterior is concentrated around the truth. ```{r, fig.align='center', fig.height=3.5, fig.width=7, fig.alt="Posterior distributions of the fixed effects"} library(tidyr) library(dplyr) post_beta_df <- as.data.frame(post_beta) post_beta_df <- post_beta_df %>% mutate(row = paste0("beta", row_number()-1)) %>% pivot_longer(-row, names_to = "sample", values_to = "value") # True values of beta0 and beta1 truth <- data.frame(row = c("beta0", "beta1"), true_value = c(2, 5)) ggplot(post_beta_df, aes(x = value)) + geom_density(fill = "lightblue", alpha = 0.6) + geom_vline(data = truth, aes(xintercept = true_value), color = "red", linetype = "dashed", linewidth = 0.5) + facet_wrap(~ row, scales = "free") + labs(x = "", y = "Density") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) ``` Furthermore, we compare the posterior samples of the spatial random effects with their corresponding true values. ```{r fig.align='center', fig.alt="Comparison of stacked posterior with the true values"} post_z <- post_samps$z post_z_summ <- t(apply(post_z, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) z_combn <- data.frame(z = dat$z_true, zL = post_z_summ[, 1], zM = post_z_summ[, 2], zU = post_z_summ[, 3]) plotz <- ggplot(data = z_combn, aes(x = z)) + geom_point(aes(y = zM), size = 0.75, color = "darkblue", alpha = 0.5) + geom_errorbar(aes(ymin = zL, ymax = zU), width = 0.05, alpha = 0.15, color = "skyblue") + geom_abline(slope = 1, intercept = 0, color = "red") + xlab("True z") + ylab("Stacked posterior of z") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) plotz ``` The package also provides helper functions to plot interpolated spatial surfaces in order for visualization purposes. The function `surfaceplot()` creates a single spatial surface plot, while `surfaceplot2()` creates two side-by-side surface plots. We are using the later to visually inspect the interpolated spatial surfaces of the true spatial effects and their posterior medians. ```{r fig.align='center', fig.height=3.5, fig.width=7, fig.alt="Comparison of the interploated spatial surfaces of the true random effects and the posterior medians."} postmedian_z <- apply(post_z, 1, median) dat$z_hat <- postmedian_z plot_z <- surfaceplot2(dat, coords_name = c("s1", "s2"), var1_name = "z_true", var2_name = "z_hat") patchwork::wrap_plots(plot_z) + patchwork::plot_layout(guides = "collect") & ggplot2::theme(legend.position = "right") ``` ### Analysis of spatial non-Gaussian data We also offer functions for Bayesian analysis of spatially point-referenced Poisson, binomial count, and binary data. #### Spatial Poisson count data We first load and plot the point-referenced Poisson count data. ```{r fig.align='center'} data("simPoisson") dat <- simPoisson[1:200, ] # work with first 200 observations ggplot(dat, aes(x = s1, y = s2)) + geom_point(aes(color = y), alpha = 0.75) + scale_color_distiller(palette = "RdYlGn", direction = -1, label = function(x) sprintf("%.0f", x)) + guides(alpha = 'none') + theme_bw() + theme(axis.ticks = element_line(linewidth = 0.25), panel.background = element_blank(), panel.grid = element_blank(), legend.title = element_text(size = 10, hjust = 0.25), legend.box.just = "center", aspect.ratio = 1) ``` #### Under fixed hyperparameters Next, we demonstrate the function `spGLMexact()` which delivers posterior samples of the fixed effects and the spatial random effects. The option `family` must be specified correctly while using this function. For instance, in the following example, the formula `y ~ x1` and `family = "poisson"` corresponds to the spatial regression model $$y(s) \sim \mathsf{Poisson} (\lambda(s)), \quad \log \lambda(s) = \beta_0 + \beta_1 x_1(s) + z(s)\;.$$ We provide fixed values of the spatial process parameters and a boundary adjustment parameter, given by the argument `boundary`, which if not supplied, defaults to 0.5. For details on the priors and its default value, see function documentation. ```{r spGLMexact_Pois} mod1 <- spGLMexact(y ~ x1, data = dat, family = "poisson", coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", spParams = list(phi = phi0, nu = nu0), priors = list(nu.beta = 5, nu.z = 5), boundary = 0.5, n.samples = 1000, verbose = TRUE) ``` We next collect the samples of the fixed effects and summarize them. The true value of the fixed effects with which the data was simulated is $\beta = (2, -0.5)$ (for more details, see the documentation of the data `simPoisson`). ```{r} post_beta <- mod1$samples$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod1$X.names print(summary_beta) ``` #### Posterior recovery of scale parameters The analytic tractability of the posterior distribution under the $\GCM$ framework is enabled by marginalizing out the scale parameters $\sigma^2_\beta$ and $\sigma^2_z$ associated with the fixed effects $\beta$ and the spatial random effects $z$, respectively. However, posterior samples of $\sigma^2_\beta$ and $\sigma^2_z$ can be recovered using the function `recoverGLMscale()`. ```{r} mod1 <- recoverGLMscale(mod1) ``` We visualize the posterior distributions of $\sigma_\beta$ and $\sigma_z$ through histograms. ```{r fig.align='center', fig.height=3.5, fig.width=7, fig.alt="Posterior distributions of the scale parameters of the fixed and the random effects."} post_scale_df <- data.frame(value = sqrt(c(mod1$samples$sigmasq.beta, mod1$samples$sigmasq.z)), group = factor(rep(c("sigma.beta", "sigma.z"), each = length(mod1$samples$sigmasq.beta)))) ggplot(post_scale_df, aes(x = value)) + geom_density(fill = "lightblue", alpha = 0.6) + facet_wrap(~ group, scales = "free") + labs(x = "", y = "Density") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) ``` #### Using predictive stacking Next, we move on to the function `spGLMstack()` that will implement our proposed stacking algorithm. The argument `loopd.controls` is used to provide details on what algorithm to be used to find LOO-PD. Valid options for the tag `method` is `"exact"` and `"CV"`. We use $K$-fold cross-validation by assigning `method = "CV"`and `CV.K = 10`. The tag `nMC` decides the number of Monte Carlo samples to be used to find the LOO-PD. ```{r spGLMstack_Pois} cand.mod <- candidateModels(list(phi = c(3, 7, 10), nu = c(0.5, 1.5), boundary = c(0.5, 0.6)), "cartesian") mod2 <- spGLMstack(y ~ x1, data = dat, family = "poisson", coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", candidate.models = cand.mod, n.samples = 1000, priors = list(mu.beta = 5, nu.z = 5), loopd.controls = list(method = "CV", CV.K = 10, nMC = 1000), parallel = FALSE, verbose = TRUE) ``` We can extract information on solver status and runtime by the following. ```{r} print(mod2$solver) print(mod2$solver.status) print(mod2$run.time) ``` Further, we can recover the posterior samples of the scale parameters by passing the output obtained by running `spGLMstack()` once again through `recoverGLMscale()`. ```{r} mod2 <- recoverGLMscale(mod2) ``` #### Sampling from stacked posterior We first obtain final posterior samples by sampling from the stacked sampler. ```{r} post_samps <- stackedSampler(mod2) ``` Subsequently, we summarize the posterior samples of the fixed effects. ```{r} post_beta <- post_samps$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod3$X.names print(summary_beta) ``` The synthetic data `simPoisson` was simulated using $\beta = (2, -0.5)^\T$. ```{r, fig.align='center', fig.height=3.5, fig.width=7, fig.alt="Posterior distributions of the fixed effects"} post_beta_df <- as.data.frame(post_beta) post_beta_df <- post_beta_df %>% mutate(row = paste0("beta", row_number()-1)) %>% pivot_longer(-row, names_to = "sample", values_to = "value") # True values of beta0 and beta1 truth <- data.frame(row = c("beta0", "beta1"), true_value = c(2, -0.5)) ggplot(post_beta_df, aes(x = value)) + geom_density(fill = "lightblue", alpha = 0.6) + geom_vline(data = truth, aes(xintercept = true_value), color = "red", linetype = "dashed", linewidth = 0.5) + facet_wrap(~ row, scales = "free") + labs(x = "", y = "Density") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) ``` Finally, we analyze the posterior samples of the spatial random effects. ```{r fig.align='center'} post_z <- post_samps$z post_z_summ <- t(apply(post_z, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) z_combn <- data.frame(z = dat$z_true, zL = post_z_summ[, 1], zM = post_z_summ[, 2], zU = post_z_summ[, 3]) plotz <- ggplot(data = z_combn, aes(x = z)) + geom_point(aes(y = zM), size = 0.75, color = "darkblue", alpha = 0.5) + geom_errorbar(aes(ymin = zL, ymax = zU), width = 0.05, alpha = 0.15, color = "skyblue") + geom_abline(slope = 1, intercept = 0, color = "red") + xlab("True z") + ylab("Stacked posterior of z") + theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank(), aspect.ratio = 1) plotz ``` We can also compare the interpolated spatial surfaces of the true spatial effects with that of their posterior median. ```{r fig.align='center', fig.height=3.5, fig.width=7} postmedian_z <- apply(post_z, 1, median) dat$z_hat <- postmedian_z plot_z <- surfaceplot2(dat, coords_name = c("s1", "s2"), var1_name = "z_true", var2_name = "z_hat") patchwork::wrap_plots(plot_z) + patchwork::plot_layout(guides = "collect") & ggplot2::theme(legend.position = "right") ``` ### Spatial binomial count data This will follow the same workflow as Poisson data with the exception that the structure of `formula` that defines the model will also contain the total number of trials at each location. Here, we present only the `spGLMexact()` function for brevity. ```{r} data("simBinom") dat <- simBinom[1:200, ] # work with first 200 rows mod1 <- spGLMexact(cbind(y, n_trials) ~ x1, data = dat, family = "binomial", coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", spParams = list(phi = 3, nu = 0.5), boundary = 0.5, n.samples = 1000, verbose = FALSE) ``` Similarly, we collect the posterior samples of the fixed effects and summarize them. The true value of the fixed effects with which the data was simulated is $\beta = (0.5, -0.5)$. ```{r} post_beta <- mod1$samples$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod1$X.names print(summary_beta) ``` ### Spatial binary data Finally, we present only the `spGLMexact()` function for spatial binary data to avoid repetition. In this case, unlike the binomial model, almost nothing changes from that of in the case of spatial Poisson data. ```{r} data("simBinary") dat <- simBinary[1:200, ] mod1 <- spGLMexact(y ~ x1, data = dat, family = "binary", coords = as.matrix(dat[, c("s1", "s2")]), cor.fn = "matern", spParams = list(phi = 4, nu = 0.4), boundary = 0.5, n.samples = 1000, verbose = FALSE) ``` Similarly, we collect the posterior samples of the fixed effects and summarize them. The true value of the fixed effects with which the data was simulated is $\beta = (0.5, -0.5)$. ```{r} post_beta <- mod1$samples$beta summary_beta <- t(apply(post_beta, 1, function(x) quantile(x, c(0.025, 0.5, 0.975)))) rownames(summary_beta) <- mod1$X.names print(summary_beta) ``` ## References {-}