Continuous Data

This module provides classes for analyzing relationships involving continuous numerical data, as follows:

| Class | Predictor (IV) | Outcome (DV) |
| --- | --- | --- |
| CorrStats | numeric predictor (implicit; separate Series for each level) | numeric outcome (pd.Series of values for DV column) |
| TwoSeriesStats | binary predictor (implicit; separate DV Series for each group) | numeric outcome (pd.Series of values for DV column) |
| TwoSampleStats | binary predictor (Boolean grouping column for each IV level) | numeric outcome (pd.Series of values for DV column) |
| MultiSeries1WayBGStats | categorical predictor (implicit; separate DV series for each level) | numeric outcome (pd.Series of values for DV column) |
| MultiSample1WayBGStats | categorical predictor (categorical grouping column for each IV level) | numeric outcome (pd.Series of values for DV column) |

All classes support both parametric and non-parametric tests for the same IV/DV structure. For multi-sample tests, an initial omnibus test is performed; if significant, pairwise post-hoc tests with P-value adjustment are also performed.

| Class | Parametric Test | Non-Parametric Test |
| --- | --- | --- |
| CorrStats | Pearson’s correlation | Spearman’s correlation |
| TwoSeriesStats | Welch’s t-test | Mann-Whitney U-test |
| TwoSampleStats | Welch’s t-test | Mann-Whitney U-test |
| MultiSeries1WayBGStats | Omnibus: Welch’s ANOVA; post hoc: pairwise Welch’s t-tests | Omnibus: Kruskal-Wallis; post hoc: pairwise Mann-Whitney U-tests |
| MultiSample1WayBGStats | Omnibus: Welch’s ANOVA; post hoc: pairwise Welch’s t-tests | Omnibus: Kruskal-Wallis; post hoc: pairwise Mann-Whitney U-tests |

Classes assume input Series are appropriately typed (e.g., continuous for numerical tests, boolean/categorical for grouping). Missing data are handled by dropping NaNs.

class unistat.continuous.CorrStats(x: Series, y: Series, parametric: bool = True)

Bases: object

Compute correlation statistics between two continuous series.

Supports Pearson (parametric) or Spearman (nonparametric) correlation.

Parameters:
  • x (pd.Series) – First continuous variable.

  • y (pd.Series) – Second continuous variable.

  • parametric (bool, optional) – If True, compute Pearson correlation; otherwise, Spearman. Defaults to True.

x

First variable after cleaning.

Type:

pd.Series

y

Second variable after cleaning.

Type:

pd.Series

parametric

Whether to use parametric (Pearson) or nonparametric (Spearman) method.

Type:

bool

property result

Compute correlation test result.

Returns:

  • statistic (float) – Test statistic (\(r\) or \(\rho\)).

  • pvalue (float) – P-value for the test.

property n

Sample size after dropping NaNs.

Returns:

Number of observations.

Return type:

int

property stat

Correlation coefficient (r for Pearson, rho for Spearman).

Returns:

Correlation test statistic.

Return type:

float

property p

P-value of the correlation test.

Returns:

P-value.

Return type:

float
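The two coefficients that CorrStats chooses between can be sketched with the standard library alone. This is an illustration of the statistics, not unistat's implementation; the function names pearson_r, midranks, and spearman_rho are hypothetical, and P-values are omitted.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def midranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the midranks."""
    return pearson_r(midranks(x), midranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic in x, but not linear
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

Because y increases monotonically with x, rho reaches 1 while r stays slightly below it, which is the usual reason to prefer the nonparametric option for monotonic but nonlinear relationships.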

class unistat.continuous.TwoSeriesStats(test: Series, control: Series, parametric: bool = True, alpha_level: float = 0.05)

Bases: object

Compare 2 continuous series using parametric or nonparametric tests.

Supports asymptotic t-tests (parametric) or Mann-Whitney U-tests (nonparametric).

Parameters:
  • test (pd.Series) – Test group data.

  • control (pd.Series) – Control group data.

  • parametric (bool, optional) – If True, use t-test; otherwise, Mann-Whitney U. Defaults to True.

  • alpha_level (float, optional) – Significance level for confidence intervals and tests. Defaults to 0.05.

test

Cleaned test group data.

Type:

pd.Series

control

Cleaned control group data.

Type:

pd.Series

parametric

Whether to use parametric or nonparametric methods.

Type:

bool

alpha

Significance level.

Type:

float

See also

TwoSampleStats

Same tests, starting from a Boolean group column (x) and a numeric outcome column (y).

TwoSeriesPermutation

Permutation hypothesis tests for 2-sample statistics, including t-test & Mann-Whitney U-test.

conf_int(pct_ci: float = None, dist: Literal['t', 'normal', 'z'] = 't')

Calculate CI of mean of test and control groups, based on SEM.

pct_ci can be used to custom-define a CI width. By default, 1 - alpha is used, as set in the TwoSeriesStats object.

Parameters:
  • pct_ci (float, optional) – Confidence level (e.g., 0.95 for 95%). Defaults to 1 - alpha.

  • dist ({'t', 'normal', 'z'}, optional) – Distribution used to compute the CI from the SEM. Defaults to 't'.

Returns:

Named tuple with confidence intervals for control and test groups.

Return type:

ControlTestStats

See also

TwoSeriesBootstrap

Bootstrapped CIs. Useful for finding CI around a statistic other than the sample mean (may be useful for median, as in a skewed sample), or for cases of small N.

Notes

SEM is always calculated using sample SD (1 DoF). CI can be calculated using either the normal (Z) distribution or Student’s t-distribution; the latter will give slightly wider CI, all else being equal.

For small N, Gurland & Tripathi (1971) report that CI is too narrow by 25% for N=2, and ~5% for N=6, and provide an equation for calculating CI underestimation. Sokal & Rohlf (1981) give a formula for a correction factor to calculate unbiased CI for N < 20, which is not implemented in this method.

For N > 100, Student’s t-distribution and Gaussian normal distribution are approximately equivalent. Nonetheless, t-distribution is the default form used here.

Rather than considering correction factors, our recommendation for small N, or to define CI about a measure of central tendency other than the sample mean, is to use a bootstrapped CI, as implemented in the resampling module.
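As a rough illustration of an SEM-based CI, the sketch below covers only the normal (Z) case, because the standard library has no t-distribution quantile function. The helper name mean_ci_normal is hypothetical and this is not the conf_int implementation.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci_normal(values, pct_ci=0.95):
    """CI of the mean from the SEM, using normal (Z) critical values."""
    n = len(values)
    sem = stdev(values) / math.sqrt(n)  # stdev() is the sample SD (n - 1 denominator)
    z_crit = NormalDist().inv_cdf(0.5 + pct_ci / 2)  # two-sided critical value
    center = mean(values)
    return center - z_crit * sem, center + z_crit * sem


sample = [4.1, 5.0, 5.6, 4.7, 5.3, 4.9, 5.2, 4.8]
ci_lower, ci_upper = mean_ci_normal(sample, pct_ci=0.95)
```

A t-based interval replaces z_crit with the t critical value for n - 1 degrees of freedom, which is larger for any finite n and therefore gives the wider interval described in the notes.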

parametric_summ_stats(alpha_level: float = None)

Compute parametric summary statistics (mean, std, CI, etc.).

Parameters:

alpha_level (float, optional) – Significance level for CI. Defaults to alpha.

Returns:

Summary statistics for test and control groups.

Return type:

pd.DataFrame

t_test(equal_var: bool = False)

Perform 2-independent-samples t-test (Welch’s or Student’s).

Defaults to Welch’s t-test, which does not assume equal variances. Delacre et al. (2017) recommend using Welch’s test routinely, rather than only after unequal variances have been detected. When variances are equal, Welch’s test loses little power relative to Student’s test; when they are unequal, it substantially reduces the Type I error rate. Establishing homoskedasticity with confidence is also often not simple.

Parameters:

equal_var (bool, optional) – If True, assume equal variances (Student’s t-test); otherwise, use Welch’s. Defaults to False.

Returns:

Test results including t-statistic, degrees of freedom, and p-value.

Return type:

pd.Series
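For reference, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be sketched in a few lines. This is a standard-library illustration rather than the t_test method itself; welch_t is a hypothetical name, the test-minus-control sign convention is an assumption, and the P-value is omitted because it needs a t-distribution CDF.

```python
import math
from statistics import mean, variance


def welch_t(test, control):
    """Return (t, dof) for Welch's unequal-variance t-test."""
    n_t, n_c = len(test), len(control)
    # Squared standard error of each group mean (sample variance / n).
    se2_t = variance(test) / n_t
    se2_c = variance(control) / n_c
    # Assumed sign convention: test mean minus control mean.
    t_stat = (mean(test) - mean(control)) / math.sqrt(se2_t + se2_c)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_t + se2_c) ** 2 / (se2_t**2 / (n_t - 1) + se2_c**2 / (n_c - 1))
    return t_stat, dof


test = [5.1, 6.3, 5.8, 6.9, 6.1]
control = [4.2, 4.4, 4.1, 4.5, 4.3, 4.4]
t_stat, dof = welch_t(test, control)
student_dof = len(test) + len(control) - 2  # DoF that Student's test would use
```

The Welch DoF never exceeds the Student DoF, and it shrinks as the group variances diverge, which is how the test stays conservative under heteroskedasticity.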

nonparametric_summ_stats(alpha_level: float = None)

Compute nonparametric summary statistics (quantiles, IQR).

Parameters:

alpha_level (float, optional) – Significance level for hypothesis direction. Defaults to alpha.

Returns:

Summary statistics with quantiles and IQR, including hypothesis direction.

Return type:

pd.DataFrame
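A minimal sketch of the quartile and IQR part of such a summary, using the standard library. The helper name and the returned keys are hypothetical, and the columns unistat actually reports may differ. The "inclusive" method is chosen here because it interpolates linearly between order statistics, which should correspond to the default behaviour of pandas quantiles.

```python
from statistics import quantiles


def quartile_summary(values):
    """Return the quartiles and interquartile range of a sample."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


summary = quartile_summary([1, 2, 3, 4, 5])
```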

mwu_test()

Perform Mann-Whitney U test.

Returns:

Test results with index=['U-statistic', 'p-value'].

Return type:

pd.Series
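The U statistic itself is simple to state: count, over every (test, control) pair, how often the test value is larger, scoring ties as one half. The sketch below computes it directly; mann_whitney_u is a hypothetical name, the quadratic loop is for clarity rather than speed, and which group's U the mwu_test method reports is not specified here.

```python
def mann_whitney_u(test, control):
    """U statistic for `test`: wins over `control`, counting ties as half."""
    u_test = 0.0
    for t in test:
        for c in control:
            if t > c:
                u_test += 1.0
            elif t == c:
                u_test += 0.5
    return u_test


test = [7, 8, 9, 10]
control = [1, 2, 3, 8]
u_test = mann_whitney_u(test, control)
u_control = mann_whitney_u(control, test)
```

The two U values always sum to the number of pairs, len(test) * len(control), so either one determines the other.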

class unistat.continuous.TwoSampleStats(bool_x: Series, num_y: Series, parametric: bool = True, alpha_level: float = 0.05, x_test_lvl: bool = True)

Bases: TwoSeriesStats

Compare continuous outcome across 2 groups defined by a boolean variable.

Extends TwoSeriesStats to split a continuous series by a boolean grouping variable.

Parameters:
  • bool_x (pd.Series) – Boolean variable defining groups.

  • num_y (pd.Series) – Continuous outcome variable.

  • parametric (bool, optional) – If True, use t-test; otherwise, Mann-Whitney U. Defaults to True.

  • alpha_level (float, optional) – Significance level for tests and CIs. Defaults to 0.05.

  • x_test_lvl (bool, optional) – Boolean level defining the test group. Defaults to True.

x

Boolean grouping variable.

Type:

pd.Series

y

Continuous outcome variable.

Type:

pd.Series

Raises:
  • SeriesNameCollisionError – If bool_x and num_y have the same name.

  • ValueError – If test or control group is empty after splitting.

See also

TwoSeriesStats

Same tests, starting from 2 separate numeric outcome columns.

TwoSamplePermutation

Permutation hypothesis tests for 2-sample statistics, including t-test & Mann-Whitney U-test.
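The relationship between the grouped layout and the two-series layout can be sketched with plain lists. This is an assumption-laden illustration, not the class's code: split_by_bool is a hypothetical helper, missing values are represented as None or NaN, and rows with a missing flag or outcome are assumed to be dropped before splitting.

```python
import math


def split_by_bool(bool_x, num_y, x_test_lvl=True):
    """Split num_y into (test, control) lists using the paired bool_x values."""
    test, control = [], []
    for flag, value in zip(bool_x, num_y):
        # Skip rows where either the group flag or the outcome is missing.
        if flag is None or value is None or math.isnan(value):
            continue
        (test if flag == x_test_lvl else control).append(value)
    if not test or not control:
        raise ValueError("test or control group is empty after splitting")
    return test, control


treated = [True, False, True, False, None, True]
outcome = [5.2, 4.1, float("nan"), 3.9, 4.4, 6.0]
test, control = split_by_bool(treated, outcome)
```

Passing x_test_lvl=False swaps the roles of the two groups, which is useful when False marks the group of interest.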

class unistat.continuous.MultiSeries1WayBGStats(*data: Series | DataFrame, parametric: bool = True, alpha_level: float = 0.05, **named_data: Series | DataFrame)

Bases: object

Compare 3+ continuous series using parametric or nonparametric tests.

Class is intended for data formatted as 3+ pd.Series objects, with a Series of continuous dependent variable (DV) data for each level of the categorical independent variable (IV). This is commonly encountered when there are separate DataFrames for each group/category, each of which has a column for the same numeric outcome.

Supports asymptotic analysis of variance (Welch ANOVA) tests (parametric), or Kruskal-Wallis tests (nonparametric).

Also implements post hoc testing with automatic p-value correction for multiple comparisons (Holm-Bonferroni method by default).

  • parametric post hoc tests via pairwise Welch t-tests.

  • nonparametric post hoc tests via pairwise Mann-Whitney U-tests.

Parameters:
  • *data (pd.Series | pd.DataFrame) – One Series of outcome data is passed for each category/level. Series are converted to DataFrame columns in the order they are passed. Named Series have an index value appended to their names (__0, __1, etc.); unnamed Series are named with their integer index. A single DataFrame can also be passed as an argument, in which case its column names are used as Series names.

  • parametric (bool, default True) – If True, use ANOVA; otherwise, Kruskal-Wallis.

  • alpha_level (float, default 0.05) – Significance level for confidence intervals and tests.

  • **named_data (pd.Series | pd.DataFrame) – Keyword arguments may be used to override automatic, index-based Series names (e.g., new_name1=series1, new_name2=series2, etc.). data is a protected keyword, reserved for passing a single DataFrame as data=df.

data

All passed Series, concatenated as columns, with reassigned unique column names.

Type:

pd.DataFrame

parametric

Whether to use parametric or nonparametric methods.

Type:

bool

alpha

Significance level.

Type:

float

See also

MultiSample1WayBGStats

Same tests, starting from a categorical grouping column (x) and a numeric outcome column (y). Useful when all data is derived from a single DataFrame.

Notes

Series may be passed either as positional arguments (*data), or as named keyword arguments (**named_data), but not both. If all Series are contained in a single DataFrame, the DataFrame may be passed either as a data=df keyword argument (preferred for clarity), or as the lone *data argument. For this reason, data is a protected name for **named_data keyword arguments.

conf_int(pct_ci: float = None, dist: Literal['t', 'normal', 'z'] = 't') → DataFrame

Calculate CI of mean of all columns in data, based on SEM.

pct_ci can be used to custom-define a CI width. By default, 1 - alpha is used, as set in the MultiSeries1WayBGStats object.

Parameters:
  • pct_ci (float, optional) – Confidence level (e.g., 0.95 for 95%). Defaults to 1 - alpha.

  • dist ({'t', 'normal', 'z'}, optional) – Distribution used to compute the CI from the SEM. Defaults to 't'.

Returns:

Column names match data.columns; index=['ci_lower', 'ci_upper'].

Return type:

pd.DataFrame

Notes

SEM is always calculated using sample SD (1 DoF). CI can be calculated using either the normal (Z) distribution or Student’s t-distribution; the latter will give slightly wider CI, all else being equal.

For small N, Gurland & Tripathi (1971) report that CI is too narrow by 25% for N=2, and ~5% for N=6, and provide an equation for calculating CI underestimation. Sokal & Rohlf (1981) give a formula for a correction factor to calculate unbiased CI for N < 20, which is not implemented in this method.

For N > 100, Student’s t-distribution and Gaussian normal distribution are approximately equivalent. Nonetheless, t-distribution is the default form used here.

Rather than considering correction factors, our recommendation for small N, or to define CI about a measure of central tendency other than the sample mean, is to use a bootstrapped CI, as implemented in the resampling module. However, a bootstrapping class has not yet been implemented for multiclass (3+ level) cases.

parametric_summ_stats(alpha_level: float = None) → DataFrame

Compute parametric summary statistics (mean, std, CI, etc.).

Parameters:

alpha_level (float, optional) – Significance level for CI. Defaults to alpha.

Returns:

Summary statistics for each column in data.

Return type:

pd.DataFrame

anova(equal_var: bool = False) → Series

Perform 1-way between-groups analysis of variance (ANOVA).

Defaults to Welch’s ANOVA, which does not assume equal variances. Delacre et al. (2019) recommend using Welch’s F-test routinely, rather than only after unequal variances have been detected. When variances are equal, Welch’s test loses little power relative to Student’s (Fisher’s) test; when they are unequal, it substantially reduces the Type I error rate. Establishing homoskedasticity with confidence is also often not simple.

Parameters:

equal_var (bool, default False) – If True, assume equal variances (Fisher’s ANOVA); otherwise, use Welch’s.

Returns:

Test results including F-statistic, degrees of freedom, and p-value.

Return type:

pd.Series
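The Welch F statistic weights each group by the precision of its mean. The following standard-library sketch shows the textbook calculation under a hypothetical name, welch_anova; it is not the anova method itself, it omits the P-value, and it assumes every group has nonzero variance and at least two observations.

```python
from statistics import mean, variance


def welch_anova(*groups):
    """Return (F, df_between, df_within) for Welch's 1-way ANOVA."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by the precision of its mean: n / sample variance.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) / (k**2 - 1) * lam)
    df_within = (k**2 - 1) / (3 * lam)
    return f_stat, k - 1, df_within


f_stat, df_between, df_within = welch_anova([1, 2, 3], [2, 3, 4], [3, 4, 5])
```

With exactly two groups, this F equals the square of Welch's t statistic and df_within equals the Welch-Satterthwaite DoF, so the omnibus and pairwise tests are consistent with each other.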

nonparametric_summ_stats(alpha_level: float = None) → DataFrame

Compute nonparametric summary statistics (quantiles, IQR).

Parameters:

alpha_level (float, optional) – Significance level for hypothesis direction. Defaults to alpha.

Returns:

Summary statistics with quantiles and IQR for each column in data.

Return type:

pd.DataFrame

kruskal_wallis() → Series

Perform 1-way between-groups Kruskal-Wallis test.

Returns:

Test results including H-statistic, degrees of freedom, and p-value.

Return type:

pd.Series
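The H statistic is computed from the rank sums of the pooled data. Below is a standard-library sketch with the usual tie correction; kruskal_wallis_h is a hypothetical name, the P-value (a chi-squared tail probability with k - 1 DoF) is omitted, and the code assumes the pooled values are not all identical.

```python
def kruskal_wallis_h(*groups):
    """Return (H, dof) for the Kruskal-Wallis test, with tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Midrank of each distinct value: mean of the 1-based positions it occupies.
    rank_of, tie_term, start = {}, 0, 0
    while start < n_total:
        end = start
        while end + 1 < n_total and pooled[end + 1] == pooled[start]:
            end += 1
        rank_of[pooled[start]] = (start + end) / 2 + 1
        t = end - start + 1
        tie_term += t**3 - t
        start = end + 1

    rank_sums = [sum(rank_of[v] for v in g) for g in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r**2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    h /= 1 - tie_term / (n_total**3 - n_total)  # divisor is 1 when there are no ties
    return h, len(groups) - 1


h_stat, dof = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```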

pairwise_t(equal_var: bool = False, p_corr_method: PCorrectionMethod = 'holm') → DataFrame

Perform post hoc pairwise t-tests.

Uses pairwise combinations of input Series, and calculates t-test for each pair. By default, uses Welch t-test rather than Student t-test (see TwoSeriesStats.t_test() for rationale).

P-values for all pairwise tests then undergo correction for multiple comparisons. By default, Holm-Bonferroni correction is used, but any correction method accepted by statsmodels.stats.multitest.multipletests may be used.

Parameters:
  • equal_var (bool, default False) – If False, use Welch t-test; otherwise, use Student t-test.

  • p_corr_method (str, default 'holm') – P-value correction method. Cannot be None since P-value correction is always indicated for multiple pairwise comparisons.

Returns:

DataFrame with a pd.MultiIndex in (control, test) format. Columns give t-statistic, Student DoF, calculated Welch DoF, and uncorrected and corrected P-values.

Return type:

pd.DataFrame
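The default Holm-Bonferroni adjustment is a step-down procedure that can be written in a few lines. The sketch below is a standalone illustration under a hypothetical name, holm_adjust; the method itself delegates to statsmodels as described above, and this is not that code.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment; returns P-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


raw_p = [0.01, 0.04, 0.03]
adj_p = holm_adjust(raw_p)
```

For three pairwise comparisons with raw P-values 0.01, 0.04, and 0.03, the adjusted values are approximately 0.03, 0.06, and 0.06, so only the first comparison stays below an alpha of 0.05.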

pairwise_mwu(p_corr_method: PCorrectionMethod = 'holm') → DataFrame

Perform post hoc pairwise Mann-Whitney U-tests.

Uses pairwise combinations of input Series, and calculates Mann-Whitney U-test for each pair.

P-values for all pairwise tests then undergo correction for multiple comparisons. By default, Holm-Bonferroni correction is used, but any correction method accepted by statsmodels.stats.multitest.multipletests may be used.

Parameters:

p_corr_method (str, default 'holm') – P-value correction method. Cannot be None since P-value correction is always indicated for multiple pairwise comparisons.

Returns:

DataFrame with a pd.MultiIndex in (control, test) format. Since the P-value can reach significance even if medians are the same between groups, the 'Ha' column denotes the direction of effect when uncorrected p is significant. Columns also give U-statistic, DoF, and uncorrected and corrected P-values.

Return type:

pd.DataFrame

class unistat.continuous.MultiSample1WayBGStats(cat_x: Series, num_y: Series, cat_order: list[str] = None, parametric: bool = True, alpha_level: float = 0.05)

Bases: MultiSeries1WayBGStats

Compare continuous outcome across 3+ groups defined by a categorical variable.

Extends MultiSeries1WayBGStats to split a continuous Series by a categorical grouping variable.

Parameters:
  • cat_x (pd.Series) – Categorical variable defining groups. cat_x.dtype should be either categorical (pd.CategoricalDtype), or convertible to categorical. Any predefined category ordering is retained automatically.

  • num_y (pd.Series) – Continuous outcome variable.

  • cat_order (list[str], optional) – Category order to apply when cat_x has no predefined ordering but one is desired in the output. Should be a list of strings containing every unique value that appears in cat_x.

  • parametric (bool, default True) – If True, use ANOVA; otherwise, use Kruskal-Wallis.

  • alpha_level (float, default 0.05) – Significance level for tests and CIs.

data

Resulting DataFrame after removal of observations with nulls.

Type:

pd.DataFrame

parametric

Whether to use parametric or nonparametric methods.

Type:

bool

alpha

Significance level.

Type:

float

See also

MultiSeries1WayBGStats

Same tests; the input style is simpler if data are formatted as separate Series of numeric outcome data (one for each group/level) without a categorical grouping column.
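To show how the grouped layout maps onto the one-series-per-level layout, here is a plain-Python sketch. The helper split_by_category is hypothetical and is not how the class is implemented; it assumes that rows with a missing label or outcome are dropped, and that cat_order, when given, fixes the order of the output levels.

```python
import math


def split_by_category(cat_x, num_y, cat_order=None):
    """Group num_y values by the paired cat_x label, in a chosen level order."""
    groups = {}
    for label, value in zip(cat_x, num_y):
        # Skip rows where either the label or the outcome is missing.
        if label is None or value is None or math.isnan(value):
            continue
        groups.setdefault(label, []).append(value)
    # Without an explicit order, keep levels in order of first appearance.
    levels = cat_order if cat_order is not None else list(groups)
    return {level: groups.get(level, []) for level in levels}


dose = ["low", "high", "mid", "low", "high", "mid", None]
response = [1.2, 3.4, 2.2, 1.0, float("nan"), 2.5, 9.9]
by_dose = split_by_category(dose, response, cat_order=["low", "mid", "high"])
```

Each value of by_dose then plays the role of one per-level Series in the MultiSeries1WayBGStats layout.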