Simulates data sets under a first-difference correlated random walk (DCRW) state-space model for Study 2 in the paper. This study examines how increasing contamination levels and varying outlier distances affect model performance. All arguments have default values matching the simulation setup used in the paper.

Usage

simulate_data_study2(
  samples = 100,
  n = 200,
  max_contamination = 0.2,
  distances = c(1, 3, 5, 7, 9),
  n_oos = 20,
  phi_coef = 0.8,
  sigma2_w_lon = 0.1,
  sigma2_w_lat = 0.1,
  sigma2_v_lon = 0.4,
  sigma2_v_lat = 0.4,
  initial_state = c(0, 0, 0, 0),
  seed = NA
)

Arguments

samples

Number of simulated data sets per combination of contamination rate and outlier distance. Default is 100.

n

Number of in-sample timesteps. Default is 200.

max_contamination

Maximum proportion of contaminated (outlying) observations. Default is 0.2.

distances

Vector of distances for the additive outliers. Must be of length 5. Default is c(1, 3, 5, 7, 9).

n_oos

Number of out-of-sample (future) timesteps. Default is 20.

phi_coef

Autocorrelation parameter in the DCRW transition matrix. Default is 0.8.

sigma2_w_lon, sigma2_w_lat

State noise variances (longitude and latitude). Default is 0.1 each.

sigma2_v_lon, sigma2_v_lat

Observation noise variances (longitude and latitude). Default is 0.4 each.

initial_state

Initial state vector of length 4. Default is c(0, 0, 0, 0).

seed

Optional random seed for reproducibility. Default is NA. Use seed = 205 to reproduce the same data as in the paper.
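To make the roles of the model arguments concrete, the following base R sketch simulates a single clean track and its noisy observations. It is not the package's implementation: the state layout (current longitude and latitude followed by their previous values), the form of the transition matrix `Phi`, the observation matrix `A`, and the Gaussian noise are all assumptions based on the usual DCRW formulation, and outlier contamination is left out entirely.

```r
# Hypothetical sketch of one DCRW track; not the code used by simulate_data_study2().
set.seed(1)

n <- 200
phi_coef <- 0.8
sigma2_w <- c(lon = 0.1, lat = 0.1)   # state noise variances
sigma2_v <- c(lon = 0.4, lat = 0.4)   # observation noise variances
initial_state <- c(0, 0, 0, 0)

# Assumed state layout: (lon_t, lat_t, lon_{t-1}, lat_{t-1}).
# Position update: x_t = x_{t-1} + phi * (x_{t-1} - x_{t-2}) + w_t
Phi <- rbind(
  c(1 + phi_coef, 0, -phi_coef, 0),
  c(0, 1 + phi_coef, 0, -phi_coef),
  c(1, 0, 0, 0),
  c(0, 1, 0, 0)
)

# Assumed observation matrix: only the current position is observed.
A <- cbind(diag(2), matrix(0, 2, 2))

states <- matrix(NA_real_, nrow = n, ncol = 4)
obs <- matrix(NA_real_, nrow = n, ncol = 2)

x <- initial_state
for (t in seq_len(n)) {
  # State noise enters the current position only; the lagged entries are copies.
  w <- c(rnorm(2, mean = 0, sd = sqrt(sigma2_w)), 0, 0)
  x <- as.vector(Phi %*% x) + w
  states[t, ] <- x
  obs[t, ] <- as.vector(A %*% x) + rnorm(2, mean = 0, sd = sqrt(sigma2_v))
}
```

Under these assumptions, `states[, 1:2]` plays the role of the clean data and `obs` the noisy observations; the function itself may organise these quantities differently.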

Value

A tibble containing the simulated data sets. Each row corresponds to a simulated data set and includes fields for contamination rate, distance, outliers, clean data, noisy observations, and out-of-sample values.

Details

Five equally spaced contamination rates from 0 to max_contamination are constructed internally. For example, if max_contamination = 0.2, the contamination rates are c(0, 0.05, 0.1, 0.15, 0.2). The contamination rates and distances are not fully crossed. Instead, the middle contamination rate (e.g. 0.1) is paired with all five distances, and the middle distance (e.g. 5) is paired with all five contamination rates. This gives \(5 + 5 = 10\) combinations, but the combination of the middle rate and the middle distance is counted twice, so there are \(10 - 1 = 9\) unique combinations. Because samples data sets are simulated for each combination, the returned tibble has \(9\times\) samples rows, each row being one simulated data set.
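The design described above can be reproduced in a few lines of base R. This is only an illustration of the counting argument; the names `rates`, `design`, `contamination` and `distance` are chosen here and need not match the columns of the returned tibble.

```r
# Illustrative reconstruction of the Study 2 design grid (names are hypothetical).
max_contamination <- 0.2
distances <- c(1, 3, 5, 7, 9)
samples <- 100

# Five equally spaced contamination rates from 0 to max_contamination.
rates <- seq(0, max_contamination, length.out = 5)

mid_rate <- rates[3]       # middle contamination rate, 0.1 with the defaults
mid_dist <- distances[3]   # middle distance, 5 with the defaults

# Middle rate paired with all distances, plus middle distance paired with all rates;
# unique() drops the one combination that appears in both halves.
design <- unique(rbind(
  data.frame(contamination = mid_rate, distance = distances),
  data.frame(contamination = rates, distance = mid_dist)
))

n_combinations <- nrow(design)         # 9 unique combinations
n_rows <- n_combinations * samples     # expected rows in the returned tibble
```

With the default samples = 100, this gives 900 rows.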

Examples

# Simulate 5 data sets per design combination (9 combinations, so 45 rows in total)
data_study2 = simulate_data_study2(samples = 5, seed = 456)