Matching observations based on generalized propensity scores involves
tuning multiple hyperparameters. Running separate workflows with
different parameter combinations can be tedious, and the effects of some
parameters are not always predictable. To streamline this process,
vecmatch provides an automated optimization workflow using a random
search algorithm. The function optimize_gps() is
implemented with multithreading to leverage computational resources
efficiently.
Step 1: Define the Formula, Data, and Optimization Space
In this example, we use the built-in cancer dataset and
focus on two predictors: the categorical variable sex and
the continuous variable age. We first specify the model
formula. Note that data and formula are fixed
inputs; if you want to compare different formulas, you must run the
workflow separately for each.
Next, we define the search space for the hyperparameters. The helper
function make_opt_args() validates your inputs and computes
the Cartesian product of all specified values, reporting the total
number of parameter combinations.
opt_args <- make_opt_args(
  data            = cancer,
  formula         = formula_cancer,
  reference       = c("control", "adenoma", "crc_beningn", "crc_malignant"),
  gps_method      = c("m1", "m7", "m8"),
  matching_method = c("fullopt", "nnm"),
  caliper         = seq(0.01, 5, 0.01),
  cluster         = 1:3,
  ratio           = 1:3,
  min_controls    = 1:3,
  max_controls    = 1:3
)
opt_args
#> Optimization Argument Set (class: opt_args)
#> ---------------------------------------- 
#> gps_method      : m1, m7, m8
#> reference       : control, adenoma, crc_beningn, crc_malignant
#> matching_method : fullopt, nnm
#> caliper         : [500 values]
#> order           : desc, asc, original, random
#> cluster         : 1, 2, 3
#> replace         : TRUE, FALSE
#> ties            : TRUE, FALSE
#> ratio           : 1, 2, 3
#> min_controls    : 1, 2, 3
#> max_controls    : 1, 2, 3
#> ---------------------------------------- 
#> Total combinations: 1512000The print() method for make_opt_args()
provides a clear summary of the search space, including each
hyperparameter’s values and the total number of combinations.
With the search space defined, we can launch the optimization. The
optimize_gps() function performs a random search across the
parameter grid and returns a summary table containing key quality
metrics for each tested combination. You control the number of
iterations via the n_iter argument.
The function uses multithreading (via the future
package) to parallelize work. As a guideline, aim for at least 1000–1500
iterations per core for reliable search coverage. Monitor your system’s
memory usage, since the parallel backend can consume substantial
RAM.
The function automatically registers a multisession backend and,
after computing and matching the GPS with foreach and
doRNG, it cleans up the parallel workers to free memory. In
a future release, we plan to allow users to select alternative backends,
such as an external compute cluster, for greater flexibility.
By default, optimize_gps() preserves the global random
seed. For reproducibility, set a seed before calling the optimizer.
set.seed(167894)
seed_before <- .Random.seed
opt_results <- optimize_gps(
  data     = cancer,
  formula  = formula_cancer,
  opt_args = opt_args,
  n_iter   = 1500
)
#> Best Optimization Results by SMD Group
#> ======================================
#> 
#> ------------------------------------------------------- 
#> | smd_group  | unique_configs |  smd   | perc_matched |
#> ------------------------------------------------------- 
#> |0-0.05      |               1|   0.038|         79.07|
#> |0.05-0.10   |               1|   0.086|          79.4|
#> |0.10-0.15   |               1|   0.145|          91.2|
#> |0.15-0.20   |               2|   0.191|         95.68|
#> |0.20-0.25   |               1|   0.213|         93.19|
#> |0.25-0.30   |               1|    0.25|         73.39|
#> |>0.30       |               1|   0.359|         62.81|
#> ------------------------------------------------------- 
#> 
#> Optimization Summary
#> --------------------
#> Total combinations tested  : 1500
#> Total optimization time [s]: 91.09We ran the optimization on a single core with
n_iter = 1500; on our test machine this required 91.09
seconds. Given the size of the parameter grid, increasing
n_iter would improve the search’s coverage, but here we
limited iterations to keep the vignette’s build time reasonable.
When you print opt_results, it summarizes the entire
search by grouping parameter sets into bins defined by their maximum
standardized mean difference (SMD), and within each bin it highlights
the combination that achieves the highest proportion of matched
observations.
After optimization, select_opt() helps you choose
parameter combinations that meet your specific objectives. For example,
you may wish to maximize the number of matched samples for certain
treatment groups while minimizing imbalance on key covariates.
In this example, we aim to: * Retain the largest possible number of
observations in the adenoma and crc_malignant
groups. * Minimize the standardized mean difference (SMD) for the
age variable.
We can achieve this by specifying arguments in
select_opt():
select_results <- select_opt(opt_results,
  perc_matched = c("adenoma", "crc_malignant"),
  smd_variables = "age",
  smd_type = "max"
)
#> 
#> Optimization Selection Summary
#> ====================
#> 
#> ------------------------------------------------------- 
#> | smd_group  | unique_configs |  smd   | perc_matched |
#> ------------------------------------------------------- 
#> |0-0.05      |               1|   0.043|       162.957|
#> |0.05-0.10   |               1|   0.099|       176.715|
#> |0.10-0.15   |               1|   0.145|       189.474|
#> |0.15-0.20   |               2|   0.191|       194.611|
#> |0.20-0.25   |               1|   0.213|       190.179|
#> |0.25-0.30   |               1|    0.25|       148.874|
#> |0.30-0.35   |               1|   0.321|        99.426|
#> |0.35-0.40   |               1|   0.356|        94.214|
#> |0.40-0.45   |               1|   0.429|       113.729|
#> |0.45-0.50   |               1|   0.475|       111.038|
#> |>0.50       |               1|   0.551|       127.861|
#> -------------------------------------------------------The output shows the SMD bins and highlights the combination within
each bin that best meets our criteria. Suppose the configuration in the
0.10–0.15 SMD bin offers a desirable balance of matched
samples; we can extract its parameters for manual refitting.
To inspect the matched dataset and detailed balance summaries, we
rerun the standard vecmatch workflow using the selected
parameters.
param_df <- attr(select_results, "param_df")
subset(param_df, smd_group == "0.10-0.15")
#>   iter_ID  gps_model method_match caliper order kmeans_cluster replace ties
#> 6   ID565 estimate_4          nnm    4.29   asc              1    TRUE TRUE
#>   ratio min_controls max_controls reference p_control p_adenoma.x p_crc_beningn
#> 6     2           NA           NA   adenoma  87.22045         100      85.23985
#>   p_crc_malignant.x method_gps              link      age overall_stat
#> 6          89.47368   multinom generalized_logit 0.144529     0.144529
#>   smd_group p_adenoma.y p_crc_malignant.y
#> 6 0.10-0.15         100          89.47368estimate_gps() and
match_gps(). For example:# estimating gps
gps_mat <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  link = "generalized_logit",
  reference = "adenoma"
)# csr with refitting
csr_mat <- csregion(gps_mat)
#> 
#> Rectangular CSR Borders Evaluation 
#> ==================================
#> 
#> Treatment       | Lower CSR limit | Upper CSR limit | Number excluded 
#> -------------------------------------------------------------------- 
#> adenoma         | 0.2185785       | 0.3541997       | 10              
#> control         | 0.1710161       | 0.3226619       | 12              
#> crc_beningn     | 0.1766004       | 0.2903609       | 13              
#> crc_malignant   | 0.1384517       | 0.2641614       | 11              
#> 
#> ===================================================
#> The total number of excluded observations is:     20 
#> Note: You can view the summary of the CSR calculation using the  `attr()` function.
# matching
matched_df <- match_gps(csr_mat,
  method = "nnm",
  caliper = 4.29,
  reference = "adenoma",
  ratio = 2,
  order = "asc",
  replace = TRUE,
  ties = TRUE,
  kmeans_cluster = 1
)balqual(matched_df,
  formula = formula_cancer,
  type = "smd",
  statistic = "max",
  round = 3,
  cutoffs = 0.2
)
#> 
#> Matching Quality Evaluation
#> ================================================================================ 
#> 
#> Count table for the treatment variable:
#> -------------------------------------------------- 
#> Treatment                 | Before     | After      
#> -------------------------------------------------- 
#> adenoma                   | 373        | 373        
#> control                   | 313        | 273        
#> crc_beningn               | 271        | 231        
#> crc_malignant             | 247        | 221        
#> -------------------------------------------------- 
#> 
#> 
#> Matching summary statistics:
#> ---------------------------------------- 
#> Total n before matching:  1204 
#> Total n after matching:       1098 
#> % of matched observations:    91.20 %
#> Total  maximal   SMD value:   0.145 
#> 
#> 
#> Maximal values :
#> -------------------------------------------------------------------------------- 
#> Variable                  | Coef  | Before       | After        | Quality      
#> -------------------------------------------------------------------------------- 
#> age                       | SMD   | 0.215        | 0.145        | Balanced     
#> sexF                      | SMD   | 0.148        | 0.117        | Balanced     
#> sexM                      | SMD   | 0.148        | 0.117        | Balanced     
#> age:sexF                  | SMD   | 0.155        | 0.122        | Balanced     
#> age:sexM                  | SMD   | 0.159        | 0.136        | Balanced     
#> --------------------------------------------------------------------------------
all.equal(seed_before, .Random.seed)
#> [1] TRUE