| Title: | Generalization Error Minimization in SubSampling for Gaussian Processes |
|---|---|
| Description: | Implements the Generalization Error Minimization in SubSampling (GEMSS) algorithm for sequential subdata selection in large-scale Gaussian process modeling (Chang, Hua, and Wu, 2026) <doi:10.1080/00401706.2026.2670596>. The method selects data points by a criterion consisting of predictive and space-filling parts, enabling efficient surrogate modeling for massive datasets. |
| Authors: | Sheng-Zhan Hua [aut, cre] |
| Maintainer: | Sheng-Zhan Hua <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-05-28 08:16:52 UTC |
| Source: | https://github.com/cran/GEMSS |
Calculates the correlation matrix between two sets of locations using stationary kernels. This function supports Gaussian, Matern 5/2, and Matern 3/2 correlation structures in a multivariate product form.
compute_kernel(X1, X2 = NULL, theta, type)compute_kernel(X1, X2 = NULL, theta, type)
X1 |
a matrix of input locations, where each row represents an observation. |
X2 |
a matrix of design locations. If |
theta |
a |
type |
a character string specifying the kernel type. Must be one of
" |
For multivariate inputs, the correlation function is defined as the product
of univariate kernels across all dimensions:
The available univariate correlation functions (where ) are:
"Gaussian":
"Matern5_2":
"Matern3_2":
a numeric matrix of dimensions nrow(X1) x nrow(X2).
Implements a backward elimination process to sequentially remove redundant data points from a Gaussian Process model using the GEMSS criterion.
gemss_remove( X, Y, X_val, Y_val, n_remove, covtype, c1 = NULL, threshold = 0.01, verbose = TRUE )gemss_remove( X, Y, X_val, Y_val, n_remove, covtype, c1 = NULL, threshold = 0.01, verbose = TRUE )
X |
an |
Y |
an |
X_val, Y_val
|
validation data used to evaluate predictive performance.
|
n_remove |
an integer specifying the maximum number of data points to sequentially remove. |
covtype |
a character string specifying the covariance kernel type. Must be one of "Gaussian", "Matern5_2", or "Matern3_2". |
c1 |
a numeric value between 0 and 1 representing the weight of the first
term in the GEMSS criterion ( |
threshold |
a numeric value specifying the minimum required value of predictive
R-squared to justify removing points. Default is |
verbose |
a logical value; if |
In each iteration, the leave-one-out GEMSS criterion is computed for every point in the training set. The point that yields the smallest criterion value is removed. The predictive R-squared is then calculated on the validation set to evaluate the performance of the reduced dataset.
If there exists a reduced dataset with a predictive R-squared value
greater than the specified threshold, the algorithm reports
the subset that maximizes the R-squared. Otherwise, no removal
is suggested.
a list containing:
index: an integer vector of the indices of the retained training data points.
remove: an integer vector of the indices of the removed data points.
eval_matrix: a matrix detailing the removal sequence, including the step j,
the removed_id, and the resulting out-of-sample R2_pred.
Chang, M. C., Hua, S. Z., & Wu, C. F. J. (2026). GEMSS-Driven Subsampling for Information Extraction and Redundancy Elimination. Technometrics, 1–20. doi:10.1080/00401706.2026.2670596
# Generate 1D data with intentionally clustered (redundant) points set.seed(123) X_reg <- seq(0, 1, length.out = 20) X_cluster <- rnorm(10, mean = 0.5, sd = 0.01) # Redundant points tightly packed around x=0.5 X <- as.matrix(c(X_reg, X_cluster)) Y <- sin(2 * pi * X) + rnorm(30, sd = 0.05) # Generate validation data X_val <- as.matrix(seq(0.05, 0.95, length.out = 20)) Y_val <- sin(2 * pi * X_val) + rnorm(20, sd = 0.05) # Run GEMSS removal res <- gemss_remove(X, Y, X_val, Y_val, n_remove = 10, covtype = "Matern3_2", verbose = TRUE) # View the indices of the removed points print(res$remove) print(res$eval_matrix) # Plot the removal data points (red) plot(X, Y, main = "GEMSS Data Removal") points(X[res$remove], Y[res$remove], pch = 19, col = 2) # Compare GP predictions before and after removal x_seq <- as.matrix(seq(0, 1, 0.02)) gp_bf <- hetGP::mleHomGP(X, Y, covtype = "Matern3_2") lines(x_seq, predict(gp_bf, x_seq)$mean, col = 1, lty = 2) gp_af <- hetGP::mleHomGP(X[-res$remove, , drop = FALSE], Y[-res$remove], covtype = "Matern3_2") lines(x_seq, predict(gp_af, x_seq)$mean, col = 3)# Generate 1D data with intentionally clustered (redundant) points set.seed(123) X_reg <- seq(0, 1, length.out = 20) X_cluster <- rnorm(10, mean = 0.5, sd = 0.01) # Redundant points tightly packed around x=0.5 X <- as.matrix(c(X_reg, X_cluster)) Y <- sin(2 * pi * X) + rnorm(30, sd = 0.05) # Generate validation data X_val <- as.matrix(seq(0.05, 0.95, length.out = 20)) Y_val <- sin(2 * pi * X_val) + rnorm(20, sd = 0.05) # Run GEMSS removal res <- gemss_remove(X, Y, X_val, Y_val, n_remove = 10, covtype = "Matern3_2", verbose = TRUE) # View the indices of the removed points print(res$remove) print(res$eval_matrix) # Plot the removal data points (red) plot(X, Y, main = "GEMSS Data Removal") points(X[res$remove], Y[res$remove], pch = 19, col = 2) # Compare GP predictions before and after removal x_seq <- as.matrix(seq(0, 1, 0.02)) gp_bf <- hetGP::mleHomGP(X, Y, covtype = "Matern3_2") lines(x_seq, predict(gp_bf, x_seq)$mean, col = 1, lty = 2) gp_af <- hetGP::mleHomGP(X[-res$remove, , drop = FALSE], Y[-res$remove], covtype = "Matern3_2") lines(x_seq, predict(gp_af, x_seq)$mean, col = 3)
Implements the Generalized Error-Minimizing Subsampling (GEMSS) algorithm to select subdata for Gaussian Process models.
gemss_select( X, Y, ns, covtype, parameters = NULL, X_val = NULL, Y_val = NULL, c1 = NULL, n_srs = NULL, n_top = NULL, verbose = TRUE )gemss_select( X, Y, ns, covtype, parameters = NULL, X_val = NULL, Y_val = NULL, c1 = NULL, n_srs = NULL, n_top = NULL, verbose = TRUE )
X |
an |
Y |
an |
ns |
target size of the final subdata. |
covtype |
covariance kernel type, either "Gaussian", "Matern5_2", or "Matern3_2". |
parameters |
a list of hyperparameters: |
X_val, Y_val
|
optional validation data for calculating the predictive R-squared.
If omitted, a random sample of size |
c1 |
value between 0 and 1 of the first term in GEMSS criterion ( |
n_srs, n_top
|
optional values to determine the candidate set for each iteration:
|
verbose |
logical; if |
The GEMSS criterion is defined as:
where is the squared prediction error and is the
posterior variance. The weights and control the
balance between prediction accuracy and space-filling properties.
The default value for is ,
which is derived by minimizing the variance of the GEMSS criterion. Setting
results in a space-filling subdata
The algorithm initializes with a small random sample of size 2 and
sequentially adds the point from the candidate set that maximizes
until the subdata reaches size ns.
The candidate set in each iteration is composed of two parts: a random
sample of size n_srs, and the n_top points that yielded the highest
values in the previous iteration.
All GP hyperparameters except are estimated from an initial
pilot sample of size and remain fixed throughout the selection process.
a list containing:
index: the indices of the selected subdata.
r_sq: a data.frame containing the subdata size and the corresponding predictive R-square.
parameters: a list of hyperparameters containing beta0, theta, g, and sigma2 (the plug-in estimator of the variance).
Chang, M. C., Hua, S. Z., & Wu, C. F. J. (2026). GEMSS-Driven Subsampling for Information Extraction and Redundancy Elimination. Technometrics, 1–20. doi:10.1080/00401706.2026.2670596
# --- Example 1: 1D Regression --- fx <- function(x) cos((x - 0.8) * 2 * pi)^7 * sin(x) + 5 * sin(x) * (sin(x^2))^10 X <- matrix(runif(200), 200, 1) Y <- apply(X, 1, fx) + rnorm(nrow(X), 0, 0.05) # Select 20 points using GEMSS res <- gemss_select(X, Y, ns = 20, covtype = "Matern5_2") # Predict on a grid x.grid <- as.matrix(seq(0, 1, 0.01)) y.pred <- gp_predict(x.grid, X[res$index, ], Y[res$index], "Matern5_2", res$parameters) # Visualize 1D Results plot(X, Y, col = "grey", main = "Selected Subdata and GP Prediction") points(X[res$index, ], Y[res$index], col = "red", pch = 19) lines(x.grid, y.pred$mean, col = "red", lwd = 2) # --- Example 2: 2D Surface (Michalewicz function) --- michalewicz <- function(xx) { x1 <- xx[1] * pi; x2 <- xx[2] * pi; m <- 10 - (sin(x1) * (sin(x1^2 / pi))^(2 * m) + sin(x2) * (sin(2 * x2^2 / pi))^(2 * m)) } X2D <- matrix(runif(2000), 1000, 2) Y2D <- apply(X2D, 1, michalewicz) + rnorm(nrow(X2D), 0, 0.005) res2D <- gemss_select(X2D, Y2D, ns = 80, covtype = "Matern5_2") # Visualization using ContourFunctions if available if (requireNamespace("ContourFunctions", quietly = TRUE)) { ngrid <- 50 # Reduced grid size for faster example execution gx <- seq(0, 1, len = ngrid) grid <- as.matrix(expand.grid(x1 = gx, x2 = gx)) # plot the contour of Michalewicz function y_grid <- apply(grid, 1, michalewicz) ContourFunctions::cf_grid(gx, gx, matrix(y_grid, ngrid, ngrid), bar = TRUE, main = "Michalewicz Function") prep <- gp_predict(grid, X2D[res2D$index, ], Y2D[res2D$index], "Matern5_2", res2D$parameters) ContourFunctions::cf_grid(gx, gx, matrix(prep$mean, ngrid, ngrid), bar = TRUE, main = "Prediction Using Selected Subdata", afterplotfunc = function() { points(X2D[res2D$index, ], col = 'blue', pch = 1, cex = 1.5, lwd = 2) }) }# --- Example 1: 1D Regression --- fx <- function(x) cos((x - 0.8) * 2 * pi)^7 * sin(x) + 5 * sin(x) * (sin(x^2))^10 X <- matrix(runif(200), 200, 1) Y <- apply(X, 1, fx) + rnorm(nrow(X), 0, 0.05) # Select 20 points using GEMSS res <- gemss_select(X, Y, ns = 20, covtype = "Matern5_2") # Predict on a grid x.grid <- as.matrix(seq(0, 1, 0.01)) y.pred <- gp_predict(x.grid, X[res$index, ], Y[res$index], "Matern5_2", res$parameters) # Visualize 1D Results plot(X, Y, col = "grey", main = "Selected Subdata and GP Prediction") points(X[res$index, ], Y[res$index], col = "red", pch = 19) lines(x.grid, y.pred$mean, col = "red", lwd = 2) # --- Example 2: 2D Surface (Michalewicz function) --- michalewicz <- function(xx) { x1 <- xx[1] * pi; x2 <- xx[2] * pi; m <- 10 - (sin(x1) * (sin(x1^2 / pi))^(2 * m) + sin(x2) * (sin(2 * x2^2 / pi))^(2 * m)) } X2D <- matrix(runif(2000), 1000, 2) Y2D <- apply(X2D, 1, michalewicz) + rnorm(nrow(X2D), 0, 0.005) res2D <- gemss_select(X2D, Y2D, ns = 80, covtype = "Matern5_2") # Visualization using ContourFunctions if available if (requireNamespace("ContourFunctions", quietly = TRUE)) { ngrid <- 50 # Reduced grid size for faster example execution gx <- seq(0, 1, len = ngrid) grid <- as.matrix(expand.grid(x1 = gx, x2 = gx)) # plot the contour of Michalewicz function y_grid <- apply(grid, 1, michalewicz) ContourFunctions::cf_grid(gx, gx, matrix(y_grid, ngrid, ngrid), bar = TRUE, main = "Michalewicz Function") prep <- gp_predict(grid, X2D[res2D$index, ], Y2D[res2D$index], "Matern5_2", res2D$parameters) ContourFunctions::cf_grid(gx, gx, matrix(prep$mean, ngrid, ngrid), bar = TRUE, main = "Prediction Using Selected Subdata", afterplotfunc = function() { points(X2D[res2D$index, ], col = 'blue', pch = 1, cex = 1.5, lwd = 2) }) }
Performs Gaussian Process prediction for a given set of new design locations, using hyperparameters provided in the parameters list.
gp_predict(X_new, X, Y, covtype, parameters)gp_predict(X_new, X, Y, covtype, parameters)
X_new |
a |
X |
a |
Y |
a |
covtype |
a character string specifying the covariance kernel type. Must be
one of " |
parameters |
a list of hyperparameters containing |
The function uses the standard GP prediction formulas:
a list containing:
mean: a m x 1 numeric vector of predicted means at X_new.
sd2: a m x 1 numeric vector of predicted variances at X_new.