Resampling#

Module for resampling statistics, including bootstrapping and permutations.

This module provides classes for bootstrapping and permutation tests on two series or samples. It supports tests on means and medians for bootstrapping, and additional t-tests and Mann-Whitney U for permutations. Results include observed statistics, distributions, confidence intervals, and p-values.

BootResult

Dataclass for bootstrap results.

PermResult

Dataclass for permutation results.

TwoSeriesBootstrap

Class for bootstrapping two series.

TwoSampleBootstrap

Class for bootstrapping with a boolean grouping variable.

TwoSeriesPermutation

Class for permutation tests on two series.

TwoSamplePermutation

Class for permutation tests with a boolean grouping variable.

Note that while bootstrapped hypothesis tests are possible, they are much more complicated to perform than permutation hypothesis tests. Bootstrapped hypothesis tests and p-values are a WIP. While bootstrapped CIs are reliable and preferred, currently (and probably generally), permutation tests should be preferred for hypothesis tests. To avoid accidentally using an unreliable or biased bootstrapped p-value in actual practice, displaying of bootstrapped p-values is currently controlled by the boot_p_value Boolean parameter, which currently defaults to False; setting it to True will allow access to the WIP bootstrapped p-value.

Assumes input series are continuous. Handles missing data by dropping NaNs. Some features, like bootstrap p-values, are experimental.

class unistat.resampling.BootResult(obs_stat: float | int, boot_mean: float | int, boot_sem: float | int, boot_ci_lo: float | int, boot_ci_hi: float | int, p: float, boot_ci_pct: float, boot_dist: ndarray)[source]#

Bases: object

Dataclass to store bootstrap resampling results.

obs_stat#

Observed test statistic (difference of means or medians).

Type:

float or int

boot_mean#

Mean of the bootstrap distribution.

Type:

float or int

boot_sem#

Standard error of the mean from the bootstrap distribution.

Type:

float or int

boot_ci_lo#

Lower bound of the bias-corrected accelerated (BCa) confidence interval.

Type:

float or int

boot_ci_hi#

Upper bound of the BCa confidence interval.

Type:

float or int

p#

P-value from bootstrap hypothesis test (experimental).

Type:

float

boot_ci_pct#

Confidence level percentage for the CI.

Type:

float

boot_dist#

Bootstrap distribution array.

Type:

np.ndarray

class unistat.resampling.TwoSeriesBootstrap(test: Series, control: Series, test_type: _BootstrapTestTypes = 'means', alpha_level: float = 0.05, n_resamples: int = 10000, rng: int | None = 519, boot_p_value: bool = False)[source]#

Bases: object

Class for bootstrapping CI & hypothesis tests on two continuous series.

Supports difference of means or medians. Provides bootstrap distribution, CIs, and optional experimental p-value from a bootstrapped hypothesis test.

Currently, by default, no hypothesis tests are performed nor p-values given.

Parameters:
  • test (pd.Series) – Test group data.

  • control (pd.Series) – Control group data.

  • test_type ({'means', 'medians'}, default 'means') – Type of test statistic.

  • alpha_level (float, default 0.05) – Significance level for CI.

  • n_resamples (int, default 10_000) – Number of bootstrap resamples.

  • rng (int, default 519) – Random seed for reproducibility.

  • boot_p_value (bool, default False) – If True, compute experimental bootstrap P-value.

test#

Cleaned test data.

Type:

pd.Series

control#

Cleaned control data.

Type:

pd.Series

alpha#

Significance level.

Type:

float

n_resamples#

Number of resamples.

Type:

int

rng#

Random seed.

Type:

int

boot_result#

Stored bootstrap results.

Type:

BootResult or None

Raises:

ValueError – If test_type is invalid.

Notes

Regarding the current WIP bootstrap hypothesis test, P-values are not currently displayed by default, as controlled by the boot_p_value Boolean parameter.

Under Frequentist statistical philosophy, a P-value represents the probability of getting results as-or-more-extreme than those observed, if the null hypothesis is true (this is even the assumption for classical calculated asymptotic hypothesis tests). In the context of bootstrap statistics, this would mean that a bootstrap distribution should represent the distribution of test statistics (mean, median, t-stat, U-stat) that would be observed under the null hypothesis (H0), and the p-value would be the fraction of test statistic values in that bootstrapped distribution which are at least as extreme as what was actually observed.

In contrast, when creating a bootstrap distribution to estimate CI or SEM (the primary use of bootstrapping), one creates a bootstrap distribution under the assumption that the alternative hypothesis (Ha) is true. Looking at how these bootstrap classes display results, this is apparent: we display bootstrapped means/CIs for the test statistic (difference in group means/medians) between test and control groups. Another way of phrasing a statistically significant difference in means (at the \(\alpha = .05\) significance level), for example, would be to say that the 95% CI of difference in the means of the test and control samples does NOT include zero. Since the null hypothesis (at least in cases relevant to typical biostats) is that the means or medians are equal in the test and control groups (or that the difference in the means/ medians is zero), the fact that the bootstrapped difference in means is not zero here shows that these bootstrap distributions for CI are NOT bootstraps under the null hypothesis. Indeed, when first making this module, that mistake was made, and it was found that even when there was an obvious difference between test vs. control groups, bootstrapped “p-values” taken from the same distribution as bootstrapped CIs returns a p-value near 0.50; to put it another way, this mistaken approach found that the observed difference in means was typically almost exactly the same as the average bootstrapped difference in means – exactly what would be expected in a distribution representing the alternative hypothesis.

While checking for overlapping CIs does indicate statistical significance in a Boolean fashion, it cannot give p-values. Instead, to calculate p-values via a bootstrapped methodology, in addition to the bootstrapped distribution under the Ha (used to estimate CIs), one must also use some method to bootstrap the null distribution in order to calculate p-value. Boos & Stefanski (2013) 6 suggest 2 methods, as summarized in this lecture by Alex Kaizer:

1. Combine all test and control observations (e.g. Hgb value for both blunt & penetrating trauma patients), then sample with replacement for the N in both test and control groups (number of blunt/penetrating patients) to create bootstrapped test and control samples.

  • On average, this approach will create equal means in each group

  • However, the intermingling of observations in each group means that if observed groups do not have equal variance, they will end up with equal variances during resampling

    • This is the same issue encountered in Welch vs. Student t-tests

    • This risks inflating Type I error rate if variances are unequal

2. Create separate observed distributions for test & control, where each value represents how much the original value differed from its group mean. Each groups’ bootstrap sample is sampled with replacement from its own group’s mean-shifted sample.

  • Both test and control groups now have identical group means of 0 (since the mean variation from the mean in a sample is 0, duh)

  • This also preserves unequal variances

  • This seems like a superior strategy, and will ultimately be implemented for this module

  • This method does require performing completely separate bootstrap samples for both test & control groups, if and only if the goal is to perform a hypothesis test and derive a p-value

Due to coding challenges, a hybrid approach has been taken in the interim here, which will later be changed.

  • We take the bootstrap-under-Ha distribution (as used for CIs), and then shift it to be centered about the mean of the bootstrap distribution.

  • From this, we find how many bootstrap samples had a mean at least as extreme as what was observed. This is the p-value.

  • Since the bootstrap samples still allow admixture of test & control observations, this method is accurate, but still destroys any unequal variances between test & control groups

    • For this reason, this approach is not robust for deployment

In contrast to bootstrapping, permutation tests (sampling without replacement, and then randomly assigning to test or control) generates a distribution under the null hypothesis. For this reason, permutation methods more easily generate a p-value, but are not (easily) fit to generating a CI for the test statistic, in the way bootstrapping is. While the acceptability of using both methods (i.e. bootstrap a CI and then permute a p-value), it doesn’t immediately seem like an objectionable choice if both values are truly needed. This would probably be preferred even once bootstrapped hypothesis tests are fully implemented, since even that would require an entire separate resampling run. The only time when bootstrapped hypothesis tests are truly necessary is when the permutation test assumption (that test & control groups are exchangeable; that is, they have the same distributions if H0 is true) is violated. The proposed bootstrapped hypothesis testing method (#2 above) does not make this assumption.

property results#
property summ_stats#
bootstrap(n_resamples: int = None)[source]#

Perform bootstrap resampling.

Parameters:

n_resamples (int, optional) – Number of resamples. Defaults to self.n_resamples.

Returns:

Bootstrap results.

Return type:

BootResult

Notes

Uses BCa method for confidence intervals. P-value is experimental. Updates hypothesis direction in summary stats if p < alpha.

static mean_diff(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0) float | float64[source]#

Compute difference of means.

Parameters:
  • sample_a (VectorLike) – First sample.

  • sample_b (VectorLike) – Second sample.

  • axis ({0, 1, 'rows', 'columns'}, optional) – Axis to compute along. Defaults to 0.

Returns:

Difference of means.

Return type:

float or np.float64

static median_diff(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0) float | float64[source]#

Compute difference of medians.

Parameters:
  • sample_a (VectorLike) – First sample.

  • sample_b (VectorLike) – Second sample.

  • axis ({0, 1, 'rows', 'columns'}, optional) – Axis to compute along. Defaults to 0.

Returns:

Difference of medians.

Return type:

float or np.float64

class unistat.resampling.TwoSampleBootstrap(bool_x: Series, num_y: Series, test_type: _BootstrapTestTypes = 'means', alpha_level: float = 0.05, n_resamples: int = 10000, rng: int | None = 519, x_test_lvl: bool = True, boot_p_value: bool = False)[source]#

Bases: TwoSeriesBootstrap

Class for bootstrapping with a boolean grouping variable.

Extends TwoSeriesBootstrap to split a continuous series by a boolean group.

Parameters:
  • bool_x (pd.Series) – Boolean grouping variable.

  • num_y (pd.Series) – Continuous outcome variable.

  • test_type ({'means', 'medians'}, optional) – Test statistic type. Defaults to ‘means’.

  • alpha_level (float, optional) – Significance level. Defaults to 0.05.

  • n_resamples (int, optional) – Number of resamples. Defaults to 10,000.

  • rng (int, optional) – Random seed. Defaults to 519.

  • x_test_lvl (bool, optional) – Boolean level for test group. Defaults to True.

  • boot_p_value (bool, optional) – Compute experimental p-value. Defaults to False.

x#

Grouping variable.

Type:

pd.Series

y#

Outcome variable.

Type:

pd.Series

Raises:

ValueError – If group names collide or groups are empty after splitting.

Notes

Regarding the current WIP bootstrap hypothesis test, p-values are not currently displayed by default, as controlled by the boot_p_value Boolean parameter.

Under Frequentist statistical philosophy, a p-value represents the probability of getting results as-or-more-extreme than those observed, if the null hypothesis is true (this is even the assumption for classical calculated asymptotic hypothesis tests). In the context of bootstrap statistics, this would mean that a bootstrap distribution should represent the distribution of test statistics (mean, median, t-stat, U-stat) that would be observed under the null hypothesis (H0), and the p-value would be the fraction of test statistic values in that bootstrapped distribution which are at least as extreme as what was actually observed.

In contrast, when creating a bootstrap distribution to estimate CI or SEM (the primary use of bootstrapping), one creates a bootstrap distribution under the assumption that the alternative hypothesis (Ha) is true. Looking at how these bootstrap classes display results, this is apparent: we display bootstrapped means/CIs for the test statistic (difference in group means/medians) between test and control groups. Another way of phrasing a statistically significant difference in means (at the \(\alpha = .05\) significance level), for example, would be to say that the 95% CI of difference in the means of the test and control samples does NOT include zero. Since the null hypothesis (at least in cases relevant to typical biostats) is that the means or medians are equal in the test and control groups (or that the difference in the means/ medians is zero), the fact that the bootstrapped difference in means is not zero here shows that these bootstrap distributions for CI are NOT bootstraps under the null hypothesis. Indeed, when first making this module, that mistake was made, and it was found that even when there was an obvious difference between test vs. control groups, bootstrapped “p-values” taken from the same distribution as bootstrapped CIs returns a p-value near 0.50; to put it another way, this mistaken approach found that the observed difference in means was typically almost exactly the same as the average bootstrapped difference in means – exactly what would be expected in a distribution representing the alternative hypothesis.

While checking for overlapping CIs does indicate statistical significance in a Boolean fashion, it cannot give p-values. Instead, to calculate p-values via a bootstrapped methodology, in addition to the bootstrapped distribution under the Ha (used to estimate CIs), one must also use some method to bootstrap the null distribution in order to calculate p-value. Boos & Stefanski (2013) 6 suggest 2 methods, as summarized in this lecture by Alex Kaizer:

  1. Combine all test and control observations (e.g. Hgb value for both blunt & penetrating trauma patients), then sample with replacement for the N in both test and control groups (number of blunt/penetrating patients) to create bootstrapped test and control samples.

    • On average, this approach will create equal means in each group.

    • However, the intermingling of observations in each group means that if observed groups do not have equal variance, they will end up with equal variances during resampling.

      • This is the same issue encountered in Welch vs. Student t-tests.

      • This risks inflating Type I error rate if variances are unequal.

  2. Create separate observed distributions for test & control, where each value represents how much the original value differed from its group mean. Each groups’ bootstrap sample is sampled with replacement from its own group’s mean-shifted sample.

    • Both test and control groups now have identical group means of 0 (since the mean variation from the mean in a sample is 0).

    • This also preserves unequal variances.

    • This seems like a superior strategy, and will ultimately be implemented for this module.

    • This method does require performing completely separate bootstrap samples for both test & control groups, if and only if the goal is to perform a hypothesis test and derive a p-value.

Due to coding challenges, a hybrid approach has been taken in the interim here, which will later be changed.

  • We take the bootstrap-under-Ha distribution (as used for CIs), and then shift it to be centered about the mean of the bootstrap distribution.

  • From this, we find how many bootstrap samples had a mean at least as extreme as what was observed. This is the p-value.

  • Since the bootstrap samples still allow admixture of test & control observations, this method is accurate, but still destroys any unequal variances between test & control groups.

    • For this reason, this approach is not robust for deployment.

In contrast to bootstrapping, permutation tests (sampling without replacement, and then randomly assigning to test or control) generates a distribution under the null hypothesis. For this reason, permutation methods more easily generate a p-value, but are not (easily) fit to generating a CI for the test statistic, in the way bootstrapping is. While the acceptability of using both methods (i.e. bootstrap a CI and then permute a p-value), it doesn’t immediately seem like an objectionable choice if both values are truly needed. This would probably be preferred even once bootstrapped hypothesis tests are fully implemented, since even that would require an entire separate resampling run. The only time when bootstrapped hypothesis tests are truly necessary is when the permutation test assumption (that test & control groups are exchangeable; that is, they have the same distributions if H0 is true) is violated. The proposed bootstrapped hypothesis testing method (#2 above) does not make this assumption.

class unistat.resampling.PermResult(obs_stat: float | int, p: float, perm_dist: ndarray)[source]#

Bases: object

Dataclass to store permutation test results.

obs_stat#

Observed test statistic.

Type:

float or int

perm_dist#

Permutation null distribution.

Type:

np.ndarray

p#

P-value from permutation test.

Type:

float

class unistat.resampling.TwoSeriesPermutation(test: Series, control: Series, test_type: _PermutationTestTypes = 'means', alpha_level: float = 0.05, n_resamples: int = 10000, rng: int | None = 519)[source]#

Bases: object

Class for permutation hypothesis tests on two continuous series.

Supports difference of means/medians, Welch’s t, or Mann-Whitney U.

Parameters:
  • test (pd.Series) – Test group data.

  • control (pd.Series) – Control group data.

  • test_type ({'means', 'medians', 't_welch', 'mwu'}, optional) – Test statistic type. Defaults to ‘means’.

  • alpha_level (float, optional) – Significance level. Defaults to 0.05.

  • n_resamples (int, optional) – Number of permutations. Defaults to 10,000.

  • rng (int, default 519) – Random seed.

test#

Cleaned test data.

Type:

pd.Series

control#

Cleaned control data.

Type:

pd.Series

test_type#

Selected test type.

Type:

str

alpha#

Significance level.

Type:

float

n_resamples#

Number of resamples.

Type:

int

rng#

Random seed.

Type:

int

perm_result#

Stored permutation results.

Type:

PermResult or None

Raises:

ValueError – If test_type is invalid.

property summ_stats#

Get summary statistics based on test type.

Returns:

Parametric or nonparametric summary.

Return type:

pd.DataFrame

property test_stat_func: Callable#

Get the test statistic function based on test_type.

Returns:

Function to compute test statistic.

Return type:

Callable

permute(n_resamples: int = None)[source]#

Perform permutation test.

Parameters:

n_resamples (int, optional) – Number of permutations. Defaults to self.n_resamples.

Returns:

Permutation results.

Return type:

PermResult

Notes

Uses ‘independent’ permutation type, and does not support paired samples. Updates Ha indicator in summ_stats if p < alpha.

static mean_diff(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0)[source]#

Compute difference of means for permutation.

Parameters:
  • sample_a (VectorLike) – First permuted sample.

  • sample_b (VectorLike) – Second permuted sample.

  • axis ({0, 1, 'rows', 'columns'}, optional) – Axis to compute. Defaults to 0.

Returns:

Difference of means.

Return type:

float or np.float64

static median_diff(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0)[source]#

Compute difference of medians for permutation.

Parameters:
  • sample_a (VectorLike) – First permuted sample.

  • sample_b (VectorLike) – Second permuted sample.

  • axis ({0, 1, 'rows', 'columns'}, optional) – Axis to compute. Defaults to 0.

Returns:

Difference of medians.

Return type:

float or np.float64

static t_welch(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0)[source]#

Calculate Welch’s 2-independent t-stat for a single permuted sample.

Takes in observations that have been randomly permuted into test & control samples, called sample_a & sample_b, respectively (to distinguish from actual observed test/control), and returns the t-statistic.

Parameters:
  • sample_a (pd.Series or array_like) – A permuted set of observations for test or control sample, respectively.

  • sample_b (pd.Series or array_like) – A permuted set of observations for test or control sample, respectively.

  • axis ({0, 1, 'rows', 'columns'}, default=0) – Axis of sample_a & sample_b on which to perform t-test; passed to stats.ttest_ind().

Returns:

Welch’s t-statistic for the permuted sample.

Return type:

float or NumPy float

Notes

Computes an asymptotic Welch’s t-test (equal_var=False; correction for heteroskedasticity), which is repeated for each permuted sample. Since unequal-varianced samples violates the exchangeability assumption for permutation testing, the permuted p-value is not exact for finite N, thus Type I error is not guaranteed. However, Janssen & Pauls (2005) Monte Carlo simulation study found that in samples with unequal variance, permuted Welch t-tests still control Type I error better than both standard asymptotic Welch’s t-test, and better than permuted Student’s t-test.

Of note, passed PermutationMethod instance to method would use stats.permutation_test to compute identical results, since self.permute() just runs stats.permutation_test using self.t_welch() as the test statistic. In later releases, permutation t- & U-tests will be implemented using native SciPy method, but for now, all permutation stats methods are being kept housed in this class, and implementing the native SciPy method would be complex and temporary.

static mwu(sample_a: VectorLike, sample_b: VectorLike, axis: Literal[0, 1, 'rows', 'columns'] = 0)[source]#

Calculate Mann-Whitney U test for a single permuted sample.

Takes in observations that have been randomly permuted into test & control samples, called sample_a & sample_b, respectively (to distinguish from actual observed test/control), and returns the Mann- Whitney U-statistic.

Parameters:
  • sample_a (VectorLike) – permuted observations assigned to test sample

  • sample_b (VectorLike) – permuted observations assigned to control sample

  • axis ({0, 1, 'rows', 'columns'}, default=0) – axis to compute MWU test; passed to stats.mannwhitneyu

Returns:

U-statistic for the permuted sample.

Return type:

float or NumPy float

Notes

Computes a standard non-resampling M-W U-test, which is repeated for each permuted sample. Per SciPy docs, method=’auto’ computes an exact test if 8 or fewer n_obs in either sample_a or sample_b.

Of note, passed PermutationMethod instance to method would use stats.permutation_test to compute identical results, since self.permute() just runs stats.permutation_test using self.mwu() as the test statistic. In later releases, permutation t- & U-tests will be implemented using native SciPy method, but for now, all permutation stats methods are being kept housed in this class, and implementing the native SciPy method here would be complex and temporary.

class unistat.resampling.TwoSamplePermutation(bool_x: Series, num_y: Series, test_type: _PermutationTestTypes = 'means', alpha_level: float = 0.05, n_resamples: int = 10000, rng: int | None = 519, x_test_lvl: bool = True)[source]#

Bases: TwoSeriesPermutation

Class for permutation tests with a boolean grouping variable.

Extends TwoSeriesPermutation to split continuous series by boolean group.

Parameters:
  • bool_x (pd.Series) – Boolean grouping.

  • num_y (pd.Series) – Continuous outcome.

  • test_type ({'means', 'medians', 't_welch', 'mwu'}, default 'means') – Test type.

  • alpha_level (float, default 0.05) – Significance level.

  • n_resamples (int, default 10_000) – Permutations..

  • rng (int, default 519) – Random seed.

  • x_test_lvl (bool, default True) – Test level.

x#

Grouping.

Type:

pd.Series

y#

Outcome.

Type:

pd.Series

Raises:
  • SeriesNameCollisionError – If names collide.

  • ValueError – If groups empty.