Continuous Data
This module provides classes for analyzing relationships involving continuous numerical data, as follows:
| Class | Predictor (IV) | Outcome (DV) | Notes |
|---|---|---|---|
| CorrStats | | | |
| TwoSeriesStats | | | |
| TwoSampleStats | | | |
| MultiSeries1WayBGStats | | | |
| MultiSample1WayBGStats | | | |
All classes support both parametric and non-parametric tests for the same IV/DV structure. For multi-sample tests, an initial omnibus test is performed; if significant, pairwise post-hoc tests with P-value adjustment are also performed.
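| Class | Parametric Test | Non-Parametric Test |
|---|---|---|
| CorrStats | Pearson’s correlation | Spearman’s correlation |
| TwoSeriesStats | Welch’s t-test | Mann-Whitney U-test |
| TwoSampleStats | Welch’s t-test | Mann-Whitney U-test |
| MultiSeries1WayBGStats | Welch’s ANOVA; post hoc pairwise Welch’s t-tests | Kruskal-Wallis test; post hoc pairwise Mann-Whitney U-tests |
| MultiSample1WayBGStats | Welch’s ANOVA; post hoc pairwise Welch’s t-tests | Kruskal-Wallis test; post hoc pairwise Mann-Whitney U-tests |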
Classes assume input series are appropriately typed (e.g., continuous for numerical tests, boolean/categorical for grouping). Missing data are handled by dropping NaNs.
- class unistat.continuous.CorrStats(x: Series, y: Series, parametric: bool = True)
Bases: object
Compute correlation statistics between two continuous series.
Supports Pearson (parametric) or Spearman (nonparametric) correlation.
- Parameters:
x (pd.Series) – First continuous variable.
y (pd.Series) – Second continuous variable.
parametric (bool, optional) – If True, compute Pearson correlation; otherwise, Spearman. Defaults to True.
- x
First variable after cleaning.
- Type:
pd.Series
- y
Second variable after cleaning.
- Type:
pd.Series
- property result
Compute correlation test result.
- Returns:
statistic (float) – Test statistic (\(r\) or \(\rho\)).
pvalue (float) – P-value for the test.
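The difference between the two coefficients can be sketched in plain Python. This is an illustration of the statistics only, not the class's implementation; the helper names `pearson_r`, `average_ranks`, and `spearman_rho` are hypothetical. Spearman's \(\rho\) is computed here as Pearson's \(r\) applied to average ranks.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's r: covariance scaled by the product of the spreads."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Monotonic but non-linear relationship (y = x ** 3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

Because the relationship is perfectly monotonic, `rho` equals 1, while `r` falls below 1 because the relationship is not linear.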
- property stat
Correlation coefficient (r for Pearson, rho for Spearman).
- Returns:
Correlation test statistic.
- Return type:
- class unistat.continuous.TwoSeriesStats(test: Series, control: Series, parametric: bool = True, alpha_level: float = 0.05)
Bases: object
Compare 2 continuous series using parametric or nonparametric tests.
Supports asymptotic t-tests (parametric) or Mann-Whitney U-tests (nonparametric).
- Parameters:
- test
Cleaned test group data.
- Type:
pd.Series
- control
Cleaned control group data.
- Type:
pd.Series
See also
TwoSampleStats – Same tests, starting from a Boolean group column (x) and a numeric outcome column (y).
TwoSeriesPermutation – Permutation hypothesis tests for 2-sample statistics, including t-test & Mann-Whitney U-test.
- conf_int(pct_ci: float = None, dist: Literal['t', 'normal', 'z'] = 't')
Calculate CI of mean of test and control groups, based on SEM.
pct_ci can be used to custom-define a CI width. By default, 1 - alpha is used, as set in the TwoSeriesStats object.
- Parameters:
pct_ci (float, optional) – Confidence level (e.g., 0.95 for 95%). Defaults to 1 - alpha.
dist ({'t', 'normal', 'z'}, optional) – Parent distribution to use when calculating the CI from the SEM. Defaults to 't'.
- Returns:
Named tuple with confidence intervals for control and test groups.
- Return type:
ControlTestStats
See also
TwoSeriesBootstrap – Bootstrapped CIs. Useful for finding CI around a statistic other than the sample mean (may be useful for median, as in a skewed sample), or for cases of small N.
Notes
SEM is always calculated using sample SD (1 DoF). CI can be calculated using either the normal (Z) distribution or Student’s t-distribution; the latter will give slightly wider CI, all else being equal.
For small N, Gurland & Tripathi (1971) report that CI is too narrow by 25% for N=2, and ~5% for N=6, and provide an equation for calculating CI underestimation. Sokal & Rohlf (1981) give a formula for a correction factor to calculate unbiased CI for N < 20, which is not implemented in this method.
For N > 100, Student’s t-distribution and Gaussian normal distribution are approximately equivalent. Nonetheless, t-distribution is the default form used here.
Rather than considering correction factors, our recommendation for small N, or to define CI about a measure of central tendency other than the sample mean, is to use a bootstrapped CI, as implemented in the resampling module.
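The SEM-based interval described in these notes can be sketched as follows. This is a minimal illustration using NumPy and SciPy, not the method's actual code; the function name `mean_ci` and the sample values are hypothetical.

```python
import numpy as np
from scipy import stats


def mean_ci(values, pct_ci=0.95, dist="t"):
    """Two-sided CI for the mean, built from the SEM (sample SD, ddof=1)."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]              # drop missing observations
    n = arr.size
    sem = arr.std(ddof=1) / np.sqrt(n)     # standard error of the mean
    tail = 1 - (1 - pct_ci) / 2            # e.g. 0.975 for a 95% CI
    if dist == "t":
        crit = stats.t.ppf(tail, df=n - 1)  # Student's t critical value
    else:
        crit = stats.norm.ppf(tail)         # normal (Z) critical value
    center = arr.mean()
    return center - crit * sem, center + crit * sem


sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7]    # illustrative values only
t_lo, t_hi = mean_ci(sample, dist="t")
z_lo, z_hi = mean_ci(sample, dist="normal")
```

For a sample this small (N = 6), the t-based interval is visibly wider than the normal-based one, which is the behaviour the notes describe.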
- parametric_summ_stats(alpha_level: float = None)
Compute parametric summary statistics (mean, std, CI, etc.).
- t_test(equal_var: bool = False)
Perform 2-independent-samples t-test (Welch’s or Student’s).
Defaults to Welch’s t-test for heteroskedastic samples. Delacre et al. (2017) recommend routine use of Welch’s test for unequal variance over Student’s method, rather than selective use of Welch’s test. The loss in power from Welch’s vs. Student’s test in cases of equal variance is minimal, whereas the reduction in Type I error rate from Welch’s in cases of unequal variance is substantial. In addition, establishing homoskedasticity with confidence is often not simple.
- Parameters:
equal_var (bool, optional) – If True, assume equal variances (Student’s t-test); otherwise, use Welch’s. Defaults to False.
- Returns:
Test results including t-statistic, degrees of freedom, and p-value.
- Return type:
pd.Series
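For reference, Welch's t-statistic and the Welch-Satterthwaite degrees of freedom can be computed by hand as below. This is a sketch of the textbook formulas, not the method's code; the function name `welch_t` and the sign convention (test minus control) are assumptions. The p-value is omitted because it needs a t-distribution CDF, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(test, control):
    """Return (t_statistic, welch_dof) for two independent samples."""
    n1, n2 = len(test), len(control)
    # Squared standard errors of each group mean (sample variance, ddof=1).
    se1_sq = variance(test) / n1
    se2_sq = variance(control) / n2
    t_stat = (mean(test) - mean(control)) / sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    return t_stat, dof


# Equal sizes and equal variances: Welch DoF reduces to Student's n1 + n2 - 2.
t_eq, dof_eq = welch_t([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])

# Unequal variances: Welch DoF falls below Student's n1 + n2 - 2.
t_het, dof_het = welch_t([2.0, 4.0, 6.0, 8.0], [3.0, 3.1, 2.9, 3.0])
```

The second call shows why the Welch DoF is reported separately from the Student DoF: with heteroskedastic groups the two can differ considerably.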
- class unistat.continuous.TwoSampleStats(bool_x: Series, num_y: Series, parametric: bool = True, alpha_level: float = 0.05, x_test_lvl: bool = True)
Bases: TwoSeriesStats
Compare continuous outcome across 2 groups defined by a boolean variable.
Extends TwoSeriesStats to split a continuous series by a boolean grouping variable.
- Parameters:
bool_x (pd.Series) – Boolean variable defining groups.
num_y (pd.Series) – Continuous outcome variable.
parametric (bool, optional) – If True, use t-test; otherwise, Mann-Whitney U. Defaults to True.
alpha_level (float, optional) – Significance level for tests and CIs. Defaults to 0.05.
x_test_lvl (bool, optional) – Boolean level defining the test group. Defaults to True.
- x
Boolean grouping variable.
- Type:
pd.Series
- y
Continuous outcome variable.
- Type:
pd.Series
- Raises:
SeriesNameCollisionError – If bool_x and num_y have the same name.
ValueError – If test or control group is empty after splitting.
See also
TwoSeriesStats – Same tests, starting from 2 separate numeric outcome columns.
TwoSamplePermutation – Permutation hypothesis tests for 2-sample statistics, including t-test & Mann-Whitney U-test.
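The split that this class performs can be pictured with plain pandas. The sketch below assumes that rows with a missing value in either column are dropped before splitting; the class's actual cleaning steps may differ, and the column names and values are made up for illustration.

```python
import pandas as pd

bool_x = pd.Series([True, False, True, False, True, False], name="treated")
num_y = pd.Series([5.1, 4.2, None, 3.9, 6.0, 4.4], name="score")
x_test_lvl = True  # which boolean level counts as the test group

# Align the two columns and drop rows with a missing value in either one.
# Distinct Series names are needed here, or the columns would collide.
df = pd.concat([bool_x, num_y], axis=1).dropna()

is_test = df["treated"] == x_test_lvl
test = df.loc[is_test, "score"]
control = df.loc[~is_test, "score"]

# Mirror the documented failure mode: neither group may end up empty.
if test.empty or control.empty:
    raise ValueError("test or control group is empty after splitting")
```

The resulting `test` and `control` Series have the shape that TwoSeriesStats takes as input.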
- class unistat.continuous.MultiSeries1WayBGStats(*data: Series | DataFrame, parametric: bool = True, alpha_level: float = 0.05, **named_data: Series | DataFrame)
Bases: object
Compare 3+ continuous series using parametric or nonparametric tests.
Class is intended for data formatted as 3+ pd.Series objects, with a Series of continuous dependent variable (DV) data for each level of the categorical independent variable (IV). This is commonly encountered when there are separate DataFrames for each group/category, each of which has a column for the same numeric outcome.
Supports asymptotic analysis of variance (Welch ANOVA) tests (parametric), or Kruskal-Wallis tests (nonparametric).
Also implements post hoc testing with automatic p-value correction for multiple comparisons (Holm-Bonferroni method by default):
- Parametric post hoc tests via pairwise Welch t-tests.
- Nonparametric post hoc tests via pairwise Mann-Whitney U-tests.
- Parameters:
*data (pd.Series | pd.DataFrame) – A Series of outcome data is passed for each category/level. Series will be converted to DataFrame columns in the order they are passed. If present, Series names will be appended with an index value (__0, __1, etc.); unnamed Series will be named with their equivalent integer index. A single DataFrame can also be passed as an argument, in which case column names will be used as Series names.
parametric (bool, default True) – If True, use ANOVA; otherwise, Kruskal-Wallis.
alpha_level (float, default 0.05) – Significance level for confidence intervals and tests.
**named_data (pd.Series | pd.DataFrame) – Keyword arguments may be used to override automatic, index-based Series names (e.g., new_name1=series1, new_name2=series2, etc.). data is a protected keyword, reserved for passing a single DataFrame as data=df.
- data
All passed Series, concatenated as columns, with reassigned unique column names.
- Type:
pd.DataFrame
See also
MultiSample1WayBGStats – Same tests, starting from a categorical grouping column (x) and a numeric outcome column (y). Useful when all data is derived from a single DataFrame.
Notes
Series may be passed either as positional arguments (*data), or as named keyword arguments (**named_data), but not both. If all Series are contained in a single DataFrame, the DataFrame may be passed either as a data=df keyword argument (preferred for clarity), or as the lone *data argument. For this reason, data is a protected name for **named_data keyword arguments.
- conf_int(pct_ci: float = None, dist: Literal['t', 'normal', 'z'] = 't') → DataFrame
Calculate CI of mean of all columns in data, based on SEM.
pct_ci can be used to custom-define a CI width. By default, 1 - alpha is used, as set in the MultiSeries1WayBGStats object.
- Parameters:
pct_ci (float, optional) – Confidence level (e.g., 0.95 for 95%). Defaults to 1 - alpha.
dist ({'t', 'normal', 'z'}, optional) – Parent distribution to use when calculating the CI from the SEM. Defaults to 't'.
- Returns:
Column names match data.columns; index=['ci_lower', 'ci_upper'].
- Return type:
pd.DataFrame
Notes
SEM is always calculated using sample SD (1 DoF). CI can be calculated using either the normal (Z) distribution or Student’s t-distribution; the latter will give slightly wider CI, all else being equal.
For small N, Gurland & Tripathi (1971) report that CI is too narrow by 25% for N=2, and ~5% for N=6, and provide an equation for calculating CI underestimation. Sokal & Rohlf (1981) give a formula for a correction factor to calculate unbiased CI for N < 20, which is not implemented in this method.
For N > 100, Student’s t-distribution and Gaussian normal distribution are approximately equivalent. Nonetheless, t-distribution is the default form used here.
Rather than considering correction factors, our recommendation for small N, or to define CI about a measure of central tendency other than the sample mean, is to use a bootstrapped CI, as implemented in the resampling module. However, a bootstrapping class has not yet been implemented for multiclass (3+ level) cases.
- parametric_summ_stats(alpha_level: float = None) → DataFrame
Compute parametric summary statistics (mean, std, CI, etc.).
- anova(equal_var: bool = False) → Series
Perform 1-way between-groups analysis of variance (ANOVA).
Defaults to Welch’s ANOVA for heteroskedastic samples. Delacre et al. (2019) recommend routine use of Welch’s F-test for unequal variance over Student’s (Fisher’s) method, rather than selective use of Welch’s test. The loss in power from Welch’s vs. Student’s test in cases of equal variance is minimal, whereas the reduction in Type I error rate from Welch’s in cases of unequal variance is substantial. In addition, establishing homoskedasticity with confidence is often not simple.
- Parameters:
equal_var (bool, default False) – If True, assume equal variances (Fisher’s ANOVA); otherwise, use Welch’s.
- Returns:
Test results including F-statistic, degrees of freedom, and p-value.
- Return type:
pd.Series
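Welch's F-statistic weights each group mean by the inverse of its squared standard error. The sketch below follows the standard Welch (1951) formulas in plain Python. It illustrates the statistic only and is not this method's code; the function name `welch_anova` is hypothetical, and the p-value is left out because it needs an F-distribution CDF.

```python
from statistics import mean, variance


def welch_anova(*groups):
    """Return (F, dof_between, dof_within) for Welch's 1-way ANOVA."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Precision weights: n_i / s_i^2 (sample variance, ddof=1).
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_total = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    between = sum(
        w * (m - grand_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum(
        (1 - w / w_total) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    )
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    dof_within = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, dof_within


# Three groups with equal variance and equal size (illustrative values).
f_stat, dof1, dof2 = welch_anova(
    [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]
)
```

On these data Fisher's ANOVA would give F = 21 on (2, 6) degrees of freedom, while Welch's version gives F = 18 on (2, 4). That gap is the modest cost of not assuming equal variances that the rationale above refers to.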
- nonparametric_summ_stats(alpha_level: float = None) → DataFrame
Compute nonparametric summary statistics (quantiles, IQR).
- kruskal_wallis() → Series
Perform 1-way between-groups Kruskal-Wallis test.
- Returns:
Test results including H-statistic, degrees of freedom, and p-value.
- Return type:
pd.Series
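As a point of comparison, the same omnibus test is available directly from SciPy. The sketch below uses `scipy.stats.kruskal` and assumes that the degrees of freedom are k - 1 (the usual chi-square approximation); whether this method reports DoF the same way is an assumption. The data are illustrative.

```python
from scipy import stats

groups = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# H-statistic and asymptotic p-value from SciPy.
result = stats.kruskal(*groups)
h_stat, p_value = result.statistic, result.pvalue

# The p-value comes from a chi-square distribution with k - 1 DoF.
dof = len(groups) - 1
p_from_chi2 = stats.chi2.sf(h_stat, dof)
```

With no ties, H can be checked by hand from the rank sums (6, 15, 24 for N = 9), which gives H = 7.2.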
- pairwise_t(equal_var: bool = False, p_corr_method: PCorrectionMethod = 'holm') → DataFrame
Perform post hoc pairwise t-tests.
Uses pairwise combinations of input Series, and calculates a t-test for each pair. By default, uses Welch t-test rather than Student t-test (see TwoSeriesStats.t_test() for rationale).
P-values for all pairwise tests then undergo correction for multiple comparisons. By default, Holm-Bonferroni correction is used, but any correction method supported by statsmodels.stats.multitest.multipletests may be used.
- Parameters:
- Returns:
DataFrame with a pd.MultiIndex in (control, test) format. Columns give t-statistic, Student DoF, calculated Welch DoF, and uncorrected and corrected P-values.
- Return type:
pd.DataFrame
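The two steps described above, enumerating pairs and then adjusting their P-values, can be sketched in plain Python. The Holm-Bonferroni step-down procedure below is a hand-written illustration, not the statsmodels routine the method relies on. The group names and raw P-values are made up, and `holm_adjust` is a hypothetical helper.

```python
from itertools import combinations


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted p never decreases down the list.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


group_names = ["placebo", "low_dose", "high_dose"]  # illustrative names
pairs = list(combinations(group_names, 2))          # (control, test) pairs

raw_p = [0.01, 0.04, 0.03]                          # made-up raw p-values
adj_p = dict(zip(pairs, holm_adjust(raw_p)))
```

Three groups give three pairwise tests. The smallest raw P-value is tripled, and the largest is raised to match the one ranked before it, so the adjusted values never go out of order.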
- pairwise_mwu(p_corr_method: PCorrectionMethod = 'holm') → DataFrame
Perform post hoc pairwise Mann-Whitney U-tests.
Uses pairwise combinations of input Series, and calculates a Mann-Whitney U-test for each pair.
P-values for all pairwise tests then undergo correction for multiple comparisons. By default, Holm-Bonferroni correction is used, but any correction method supported by statsmodels.stats.multitest.multipletests may be used.
- Parameters:
p_corr_method (str, default 'holm') – P-Value correction method. Cannot be None since P-value correction is always indicated for multiple pairwise comparisons.
- Returns:
DataFrame with a pd.MultiIndex in (control, test) format. Since P-value can reach significance even if medians are the same between groups, the 'Ha' column denotes the direction of effect when uncorrected p is significant. Columns also give U-statistic, DoF, and uncorrected and corrected P-values.
- Return type:
pd.DataFrame
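The U-statistic itself is a count of pairwise "wins", which is why it carries directional information independent of the medians. A minimal sketch follows. The function name `mann_whitney_u` is hypothetical, and how the 'Ha' column is actually derived is not shown in this documentation, so the direction rule below is an assumption.

```python
def mann_whitney_u(test, control):
    """Return (U_test, U_control); ties count as half a win for each side."""
    u_test = sum(
        (t > c) + 0.5 * (t == c) for t in test for c in control
    )
    u_control = len(test) * len(control) - u_test
    return u_test, u_control


test = [3.0, 4.0, 5.0]      # illustrative values
control = [1.0, 2.0, 6.0]
u_test, u_control = mann_whitney_u(test, control)

# Assumed direction rule: compare U with its null expectation n1 * n2 / 2.
null_expectation = len(test) * len(control) / 2
direction = "test > control" if u_test > null_expectation else "test <= control"
```

Here the test group wins 6 of the 9 pairwise comparisons, so the effect points toward larger values in the test group.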
- class unistat.continuous.MultiSample1WayBGStats(cat_x: Series, num_y: Series, cat_order: list[str] = None, parametric: bool = True, alpha_level: float = 0.05)
Bases: MultiSeries1WayBGStats
Compare continuous outcome across 3+ groups defined by a categorical variable.
Extends MultiSeries1WayBGStats to split a continuous Series by a categorical grouping variable.
- Parameters:
cat_x (pd.Series) – Categorical variable defining groups. cat_x.dtype should be either pd.Categorical, or convertible to pd.Categorical. Predefined categorical ordering will be retained automatically.
num_y (pd.Series) – Continuous outcome variable.
cat_order (list[str], optional) – Category ordering to use when none is predefined in cat_x but an ordering is desired in the output. Should be a list of strings of all unique values that appear in cat_x.
parametric (bool, default True) – If True, use ANOVA; otherwise, use Kruskal-Wallis.
alpha_level (float, default 0.05) – Significance level for tests and CIs.
- data
Resulting DataFrame after removal of observations with nulls.
- Type:
pd.DataFrame
See also
MultiSeries1WayBGStats – Same tests, input style is simpler if data is formatted as separate Series of numeric outcome data (for each group/level) without a categorical grouping column.
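The relationship between the two input styles can be sketched with plain pandas: a categorical column plus a numeric column is split into one Series per level, in category order. This only illustrates the idea under the assumption that rows with nulls are dropped first. The class's internals may differ, and the names and values below are made up.

```python
import pandas as pd

cat_x = pd.Series(["low", "high", "mid", "low", "mid", "high"], name="dose")
num_y = pd.Series([1.0, 3.2, 2.1, 1.4, None, 2.9], name="response")
cat_order = ["low", "mid", "high"]  # desired ordering of the levels

# Convert to an ordered categorical so the level order is explicit.
ordered_x = cat_x.astype(pd.CategoricalDtype(categories=cat_order, ordered=True))

# Align both columns and drop observations with nulls.
df = pd.concat([ordered_x, num_y], axis=1).dropna()

# One Series of outcome data per level, keyed in category order.
per_level = {
    level: df.loc[df["dose"] == level, "response"].reset_index(drop=True)
    for level in cat_order
}
```

Each value in `per_level` is a Series of the kind MultiSeries1WayBGStats takes as input, so the two classes differ only in how the data arrive.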