Performs a permutation or randomization test to compare the distributions of
two independent or paired data samples.
-- Function File: PVAL = randtest2 (A, B)
-- Function File: PVAL = randtest2 (A, B, PAIRED)
-- Function File: PVAL = randtest2 (A, B, PAIRED, NREPS)
-- Function File: PVAL = randtest2 (A, B, PAIRED, NREPS)
-- Function File: PVAL = randtest2 (A, B, PAIRED, NREPS, FUNC)
-- Function File: PVAL = randtest2 (A, B, PAIRED, NREPS, FUNC, SEED)
-- Function File: PVAL = randtest2 ([A, GA], [B, GB], ...)
-- Function File: [PVAL, STAT] = randtest2 (...)
-- Function File: [PVAL, STAT, FPR] = randtest2 (...)
-- Function File: [PVAL, STAT, FPR, PERMSTAT] = randtest2 (...)
'PVAL = randtest2 (A, B)' performs a randomization (or permutation) test
to ascertain whether data samples A and B come from populations with
the same distribution. Distributions are compared using the Wasserstein
metric [1,2], which is the area of the difference between the empirical
cumulative distribution functions of A and B. The data in A and B should
be column vectors that represent measurements of the same variable. The
value returned is a 2-tailed p-value against the null hypothesis computed
using the absolute values of the test statistics.
'PVAL = randtest2 (A, B, PAIRED)' specifies whether A and B should be
treated as independent (unpaired) or paired samples. PAIRED accepts a
logical scalar:
o false (default): As above. The rows of samples A and B combined are
permuted or randomized.
o true: Performs a randomization or permutation test to ascertain
whether paired or matched data samples A and B come from
populations with the same distribution. The vectors A and B
must each contain the same number of rows, where each row
across A and B corresponds to a pair of matched observations.
Within each pair, the allocation of data to samples A or B is
permuted or randomized [3].
'PVAL = randtest2 (A, B, PAIRED, NREPS)' specifies the number of resamples
without replacement to take in the randomization test. By default, NREPS
is 5000. If the number of possible permutations is smaller than NREPS, the
test becomes exact. For example, if the number of sampling units across
two independent samples is 6, then the number of possible permutations is
factorial (6) = 720, so NREPS will be truncated at 720 and sampling will
systematically evaluate all possible permutations. If the number of
sampling units in each paired sample is 12, then the number of possible
permutations is 2^12 = 4096, so NREPS will be truncated at 4096 and
sampling will systematically evaluate all possible permutations.
'PVAL = randtest2 (A, B, PAIRED, NREPS, FUNC)' also specifies a custom
function calculated on the original samples, and the permuted or
randomized resamples. Note that FUNC must compute a difference statistic
between samples A and B, and should either be a:
o function handle or anonymous function,
o string of function name, or
o a cell array where the first cell is one of the above function
definitions and the remaining cells are (additional) input arguments
to that function (other than the data arguments).
See the built-in demos for example usage with the mean [3], or vaiance.
'PVAL = randtest2 (A, B, PAIRED, NREPS, FUNC, SEED)' initialises the
Mersenne Twister random number generator using an integer SEED value so
that the results of 'randtest2' results are reproducible when the test
is approximate (i.e. when using randomization if not all permutations
can be evaluated systematically).
'PVAL = randtest2 ([A, GA], [B, GB], ...)' also specifies the sampling
units (i.e. clusters) using consecutive positive integers in GA and GB
for A and B respectively. Defining the sampling units has applications
for clustered resampling, for example in the cases of nested experimental
designs. If PAIRED is false, numeric identifiers in GA and GB must be
unique (e.g. 1,2,3 in GA, 4,5,6 in GB) - resampling of clusters then
occurs across the combined sample of A and B. If PAIRED is true, numeric
identifiers in GA and GB must by identical (e.g. 1,2,3 in GA, 1,2,3 in
GB) - resampling is then restricted to exchange of clusters between A
and B only where the clusters have the same identifier. Note that when
sampling units contain different numbers of values, function evaluations
after sampling cannot be vectorized. If the parallel computing toolbox
(Matlab) or Parallel package (Octave) is installed and loaded, then the
function evaluations will be automatically accelerated by parallel
processing on platforms with multiple processors.
'[PVAL, STAT] = randtest2 (...)' also returns the test statistic.
'[PVAL, STAT, FPR] = randtest2 (...)' also returns the minimum false
positive risk (FPR) calculated for the p-value, computed using the
Sellke-Berger approach.
'[PVAL, STAT, FPR, PERMSTAT] = randtest2 (...)' also returns the
statistics of the permutation distribution.
Bibliography:
[1] Dowd (2020) A New ECDF Two-Sample Test Statistic. arXiv.
https://doi.org/10.48550/arXiv.2007.01360
[2] https://en.wikipedia.org/wiki/Wasserstein_metric
[3] Hesterberg, Moore, Monaghan, Clipson, and Epstein (2011) Bootstrap
Methods and Permutation Tests (BMPT) by in Introduction to the Practice
of Statistics, 7th Edition by Moore, McCabe and Craig.
randtest2 (version 2024.04.17)
Author: Andrew Charles Penn
https://www.researchgate.net/profile/Andrew_Penn/
Copyright 2019 Andrew Charles Penn
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/
The following code
% Mouse data from Table 2 (page 11) of Efron and Tibshirani (1993)
treatment = [94 197 16 38 99 141 23]';
control = [52 104 146 10 51 30 40 27 46]';
% Randomization test comparing the distributions of observations from two
% independent samples (assuming i.i.d and exchangeability) using the
% Wasserstein metric
pval = randtest2 (control, treatment, false, 5000)
% Randomization test comparing the difference in means between two
% independent samples (assuming i.i.d and exchangeability)
pval = randtest2 (control, treatment, false, 5000, ...
@(A, B) mean (A) - mean (B))
% Randomization test comparing the ratio of variances between two
% independent samples (assuming i.i.d and exchangeability). (Note that
% the log transformation is necessary to make the p-value two-tailed)
pval = randtest2 (control, treatment, false, 5000, ...
@(A, B) log (var (A) ./ var (B)))
Produces the following output
pval = 0.3668 pval = 0.2698 pval = 0.30905
The following code
% Example data from: % https://www.biostat.wisc.edu/~kbroman/teaching/labstat/third/notes18.pdf A = [117.3 100.1 94.5 135.5 92.9 118.9 144.8 103.9 103.8 153.6 163.1]'; B = [145.9 94.8 108 122.6 130.2 143.9 149.9 138.5 91.7 162.6 202.5]'; % Randomization test comparing the distributions of observations from two % paired or matching samples (assuming i.i.d and exchangeability) using the % Wasserstein metric pval = randtest2 (A, B, true, 5000) % Randomization test comparing the difference in means between two % paired or matching samples (assuming i.i.d and exchangeability) pval = randtest2 (A, B, true, 5000, @(A, B) mean (A) - mean (B), 1) % Note that this is equivalent to: pval = randtest1 (A - B, 0, 5000, @mean, 1) % Randomization test comparing the ratio of variances between two % paired or matching samples (assuming i.i.d and exchangeability). (Note % that the log transformation is necessary to make the p-value two-tailed) pval = randtest2 (A, B, true, 5000, @(A, B) log (var (A) ./ var (B)))
Produces the following output
pval = 0.12891 pval = 0.037109 pval = 0.037109 pval = 0.51172
The following code
A = [21,26,33,22,18,25,26,24,21,25,35,28,32,36,38]'; GA = [1,1,1,1,2,2,2,2,2,2,3,3,3,3,3]'; B = [26,34,27,38,44,34,45,38,31,41,34,35,38,46]'; GB = [4,4,4,5,5,5,5,5,6,6,6,6,6,6]'; % Randomization test comparing the distributions of observations from two % independent samples (assuming i.i.d) using the Wasserstein metric pval = randtest2 (A, B, false, 5000) % Randomization test comparing the distributions of clustered observations % from two independent samples using the Wasserstein metric pval = randtest2 ([A GA], [B GB], false, 5000)
Produces the following output
pval = 0.00042414 pval = 0.2
The following code
A = [21,26,33,22,18,25,26,24,21,25,35,28,32,36,38]'; GA = [1,1,1,1,2,2,2,2,2,2,3,3,3,3,3]'; B = [26,34,27,38,44,34,45,38,31,41,34,35,38,46,36]'; GB = [1,1,1,1,2,2,2,2,2,2,3,3,3,3,3]'; % Randomization test comparing the distributions of observations from two % paired or matched samples (assuming i.i.d) using the Wasserstein metric pval = randtest2 (A, B, true, 5000) % Randomization test comparing the distributions of clustered observations % from two paired or matched using the Wasserstein metric pval = randtest2 ([A GA], [B GB], true, 5000)
Produces the following output
pval = 0.0024 pval = 0.25
The following code
% Load example data from CSV file
data = csvread ('demo_data.csv');
trt = data(:,1); % Predictor: 0 = no treatment; 1 = treatment
grp = data(:,2); % Cluster IDs
val = data(:,3); % Values measured of the outcome
A = val(trt==0); GA = grp(trt==0);
B = val(trt==1); GB = grp(trt==1);
% Randomization test comparing the distributions of clustered observations
% from two independent samples using the Wasserstein metric
pval = randtest2([A, GA], [B, GB], false)
Produces the following output
pval = 0.0694
Package: statistics-resampling