Simple Regression

Module for regression statistics.

This module provides an abstract base class and concrete implementations for performing regression analyses using statsmodels. It supports linear regression, logistic regression, and log-binomial regression, with features like variance inflation factor (VIF) calculation, standardized regressions, odds/risk ratios, and formatted output.

Input data are assumed to be pandas Series or DataFrames. Boolean columns are never standardized.

class unistat.regression.RegressionStats(X, y, bool_col_names: list | str | None = None)

Bases: ABC

Abstract base class for regression statistics.

Provides common functionality for regression models, including data preparation, standardization, VIF calculation, and properties for regression results.

Parameters:
  • X (pd.DataFrame or pd.Series) – Independent variables (features).

  • y (pd.Series) – Dependent variable (target).

  • bool_col_names (list[str] or str or None, optional) – Names of boolean columns in X to exclude from standardization.

bool_cols

List of boolean column names.

Type:

list[str] or None

X

Feature DataFrame. NaN values are removed and all columns are converted to float64.

Type:

pd.DataFrame

y

Target DataFrame. NaN values are removed and all columns are converted to float64.

Type:

pd.DataFrame

reg

Fitted regression model.

Type:

statsmodels regression result

X_std

X, with all non-Boolean columns transformed to Z-scores.

Type:

pd.DataFrame

std_reg

Fitted standardized regression model.

Type:

statsmodels regression result

Notes

Observations with any missing data in either X or y are dropped.
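The preparation steps described above (listwise deletion, conversion to float64, and Z-scoring of non-Boolean columns) can be sketched with plain pandas. This is a minimal illustration, not unistat's implementation: the helper name `prepare` is hypothetical, X and y are assumed to share an index, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np
import pandas as pd


def prepare(X, y, bool_col_names=None):
    """Hypothetical helper: listwise deletion, float64 cast, Z-scoring."""
    X = X.to_frame() if isinstance(X, pd.Series) else X
    if isinstance(bool_col_names, str):
        bool_col_names = [bool_col_names]
    bool_cols = list(bool_col_names or [])

    # Drop every observation with a missing value in X or y.
    keep = X.notna().all(axis=1) & y.notna()
    X = X.loc[keep].astype("float64")
    y = y.loc[keep].astype("float64")

    # Z-score only the non-Boolean columns (sample SD is an assumption).
    X_std = X.copy()
    num_cols = [c for c in X.columns if c not in bool_cols]
    if num_cols:
        centered = X[num_cols] - X[num_cols].mean()
        X_std[num_cols] = centered / X[num_cols].std(ddof=1)
    return X, y, X_std


X = pd.DataFrame({
    "age": [30.0, 40.0, np.nan, 50.0, 60.0],
    "smoker": [True, False, True, False, True],
})
y = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0], name="score")

X_clean, y_clean, X_std = prepare(X, y, bool_col_names="smoker")
print(X_clean.shape)
print(X_std)
```

Rows 2 and 3 are dropped because one has a missing predictor and the other a missing outcome; the `smoker` column is cast to 0.0/1.0 but left unscaled.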

vif_matrix()

Calculate variance inflation factors (VIF) for feature DataFrame.

Returns:

VIF values for each feature.

Return type:

pd.Series

Raises:

ValueError – If X has fewer than 2 columns.
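For reference, the VIF of a feature is \(1 / (1 - R^2)\), where \(R^2\) comes from regressing that feature on all the others. The sketch below computes it with NumPy only. The helper name `vif` is hypothetical, and whether unistat includes an intercept in the auxiliary regressions is not stated here; this sketch assumes one.

```python
import numpy as np
import pandas as pd


def vif(X: pd.DataFrame) -> pd.Series:
    """Hypothetical VIF helper; not unistat's implementation."""
    if X.shape[1] < 2:
        raise ValueError("VIF needs at least 2 columns in X.")
    out = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Auxiliary regression of this column on the others, with an
        # intercept (an assumption of this sketch).
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")


# Synthetic data: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a + 0.5 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
vifs = vif(X)
print(vifs)
```

The correlated pair `a` and `b` should show clearly larger values than the independent column `c`, whose VIF stays close to 1.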

class unistat.regression.LogitStats(X, y, bool_col_names: list | str | None = None)

Bases: RegressionStats

Class for logistic regression statistics.

Extends RegressionStats for logistic regression using the Logit model.

Parameters:
  • X (pd.DataFrame or pd.Series) – Predictor observations.

  • y (pd.Series) – Binary outcome observations.

  • bool_col_names (list[str] or str or None, optional) – String names of Boolean columns to exclude from standardization.

logit_or(standardize: bool = False) → DataFrame

Calculate odds ratios by predictor, with 95% confidence intervals.

Parameters:

standardize (bool, optional) – Use standardized model. Defaults to False.

Returns:

Odds ratios with 95% CI.

Return type:

pd.DataFrame

Raises:

ValueError – If all columns are boolean and standardize is True.
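An odds ratio is the exponential of a logistic regression coefficient, and exponentiating the bounds of the coefficient's confidence interval gives the interval for the odds ratio. The sketch below shows that arithmetic. The coefficients and standard errors are made-up illustration values, the column names are hypothetical rather than the labels `logit_or` returns, and the Wald-type interval is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical log-odds coefficients and standard errors (illustration only).
coef = pd.Series({"age": 0.40, "smoker": 1.10})
se = pd.Series({"age": 0.10, "smoker": 0.35})

z = 1.959964  # two-sided 95% quantile of the standard normal

# Exponentiate the coefficient and its Wald interval bounds.
odds_ratios = pd.DataFrame({
    "OR": np.exp(coef),
    "CI_low": np.exp(coef - z * se),
    "CI_high": np.exp(coef + z * se),
})
print(odds_ratios.round(2))
```

With a standardized model, the same calculation gives the odds ratio per 1-SD increase in each non-Boolean predictor instead of per unit.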

pretty_print_or(standardize: bool = False, label: bool = True) → None

Print formatted odds ratios (ORs) and 95% CIs for easy copy-pasting.

Parameters:
  • standardize (bool, optional) – Use standardized model. Defaults to False.

  • label (bool, optional) – Include parameter labels. Defaults to True.

Raises:

ValueError – If all columns are boolean and standardize is True.

class unistat.regression.LinRegStats(X, y, bool_col_names: list | str | None = None)

Bases: RegressionStats

Class for linear regression statistics.

Extends RegressionStats for ordinary least squares (OLS) regression.

Parameters:
  • X (pd.DataFrame or pd.Series) – Predictor observations.

  • y (pd.Series) – Numeric outcome observations.

  • bool_col_names (list[str] or str or None, optional) – Boolean columns to exclude from standardization.

Notes

unistat does NOT standardize the values of y for linear regression. Typically, “standardized regression” refers to a transformation of \(y \sim X\) such that \(\text{SD}\left( y \right) \sim \text{SD}\left( X \right)\); a coefficient \(\beta\) then means that a 1-SD increase in \(X\) confers a \(\beta\)-SD increase in \(y\). We find this difficult to interpret, with no benefit beyond adherence to convention.

Instead, unistat opts for “X-standardized regression”. That is, since only \(X\) is Z-scored, \(y \sim X\) is transformed such that \(y \sim \text{SD}\left( X \right)\). Here, a coefficient \(\beta\) means that a 1-SD increase in \(X\) confers an absolute increase of \(\beta\) units in \(y\). This is easier to interpret, while still allowing comparison of the relative strengths of all \(X\) predictors.
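The relationship between the three scalings can be checked numerically. In this sketch (synthetic data, NumPy only, the helper `ols_slope` is hypothetical), the X-standardized slope equals the raw slope multiplied by \(\text{SD}(X)\), and the conventional fully standardized slope is that value divided again by \(\text{SD}(y)\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=500)       # synthetic predictor
y = 3.0 + 0.2 * x + rng.normal(scale=1.0, size=500)  # synthetic outcome


def ols_slope(x, y):
    """Slope of a one-predictor OLS fit with an intercept."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]


sd_x = x.std(ddof=1)
sd_y = y.std(ddof=1)
z = (x - x.mean()) / sd_x  # Z-scored predictor

b_raw = ols_slope(x, y)                     # y units per unit of x
b_xstd = ols_slope(z, y)                    # y units per 1 SD of x
b_full = ols_slope(z, (y - y.mean()) / sd_y)  # SDs of y per 1 SD of x

print(b_raw, b_xstd, b_full)
```

Because `b_xstd` stays in the units of y, it can be read directly as the expected change in the outcome for a 1-SD shift in the predictor.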

pretty_print_coefs(standardize: bool = False, label: bool = True) → None

Print formatted regression coefficients for easy copy-pasting.

Parameters:
  • standardize (bool, optional) – Use standardized model. Defaults to False.

  • label (bool, optional) – Include parameter labels. Defaults to True.

Raises:

ValueError – If all columns are boolean and standardize is True.

class unistat.regression.LogBinStats(X, y, bool_col_names: list | str | None = None)

Bases: RegressionStats

Class for log-binomial regression statistics (experimental).

Extends RegressionStats for a generalized linear model (GLM) with a binomial family and a log link.

Parameters:
  • X (pd.DataFrame or pd.Series) – Independent variables.

  • y (pd.Series) – Dependent variable (binary).

  • bool_col_names (list[str] or str or None, optional) – Boolean columns to exclude from standardization.

Warning

This class is experimental; use with caution and verify results.

logbin_rr(standardize: bool = False) → DataFrame

Calculate risk ratios by predictor with 95% confidence intervals.

Parameters:

standardize (bool, optional) – Use standardized model. Defaults to False.

Returns:

Risk ratios with 95% CI.

Return type:

pd.DataFrame

Raises:

ValueError – If all columns are boolean and standardize is True.
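One way to verify results from this experimental class is a hand calculation. For a single binary predictor, the exponentiated coefficient of a log-binomial model with an intercept equals the ratio of the observed risks in the two groups, so it can be checked from a 2×2 table. The counts below are hypothetical, and the interval uses the standard log-scale (Katz) formula, which is an assumption about what to compare against rather than a description of unistat's internals.

```python
import math

# Hypothetical counts for illustration only.
events_exposed, n_exposed = 30, 100
events_unexposed, n_unexposed = 10, 100

risk_exposed = events_exposed / n_exposed
risk_unexposed = events_unexposed / n_unexposed
rr = risk_exposed / risk_unexposed

# Standard error of log(RR) from the 2x2 counts.
se_log_rr = math.sqrt(
    1 / events_exposed - 1 / n_exposed
    + 1 / events_unexposed - 1 / n_unexposed
)
z = 1.959964  # two-sided 95% quantile of the standard normal
ci_low = math.exp(math.log(rr) - z * se_log_rr)
ci_high = math.exp(math.log(rr) + z * se_log_rr)

# Odds ratio from the same table, for contrast with the risk ratio.
odds_ratio = (
    (events_exposed / (n_exposed - events_exposed))
    / (events_unexposed / (n_unexposed - events_unexposed))
)

print(f"RR = {rr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}); OR = {odds_ratio:.2f}")
```

The odds ratio is further from 1 than the risk ratio here because the outcome is common, which is the usual motivation for reporting risk ratios from a log-binomial model.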

pretty_print_rr(standardize: bool = False, label: bool = True) → None

Print formatted risk ratios with 95% CIs for easy copy-pasting.

Parameters:
  • standardize (bool, optional) – Use standardized model. Defaults to False.

  • label (bool, optional) – Include parameter labels. Defaults to True.

Raises:

ValueError – If all columns are boolean and standardize is True.