Formula-Based Regression

Module for regression statistics based on patsy-/R-style formulae.

This module provides an abstract base class and concrete implementations for performing regression analyses using statsmodels. It supports linear regression, logistic regression, and log-binomial regression, with features like variance inflation factor (VIF) calculation, standardized regressions, odds/risk ratios, and formatted output.

Assumes input data are pandas Series/DataFrames. Handles boolean columns specially in standardization.

class unistat.formula_regression.FormulaRegression(formula: str, data: DataFrame)

Bases: ABC

Abstract base class for formula-based regression statistics.

Provides common functionality for regression models, including data preparation, standardization, VIF calculation, and properties for regression results.

Uses Wilkinson formulae (akin to R-style formulae), which are implemented in statsmodels via the patsy package.

Parameters:
  • formula (str) – statsmodels/patsy/R-style formula defining the regression model.

  • data (pd.DataFrame) – Data for the model. data must contain (at a minimum) all variables referenced by formula.

data

Input DataFrame after filtering for columns in formula, dropping NaNs, and converting all columns to float.

Type:

pd.DataFrame
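
As a rough illustration of the preparation described for this attribute, the sketch below keeps only the columns a formula needs, drops rows with missing values, and casts the rest to float. The helper name prepare_data and the sample frame are hypothetical; the class's own implementation may differ in detail.

```python
import pandas as pd


def prepare_data(data: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper: select columns, drop NaN rows, cast to float."""
    return data.loc[:, cols].dropna().astype(float)


# Made-up sample data for illustration only.
raw = pd.DataFrame(
    {
        "y": [1, 0, 1, None],
        "x": [0.5, 1.5, None, 2.0],
        "flag": [True, False, True, True],
        "unused": ["a", "b", "c", "d"],  # not referenced by the formula
    }
)

# Columns that a formula such as 'y ~ x + flag' would reference.
clean = prepare_data(raw, ["y", "x", "flag"])
print(clean)
```

After the cast, boolean columns hold 0.0/1.0, and any row with a missing value in a selected column is gone.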

static extract_formula_cols(formula: str) -> FormulaSides

Extract unique column names from a patsy/R formula string.

This function parses the formula using patsy, then processes each factor to identify underlying column names, handling special cases like C(), Q(), I(), and function calls (e.g., np.log).

Parameters:

formula (str) – The patsy-compatible formula string (e.g., 'y ~ x + C(cat) + np.log(z)').

Returns:

Sorted list of unique column names extracted from the formula.

Return type:

list[str]

Examples

>>> FormulaRegression.extract_formula_cols(
...     'mortality_24h ~ ca_grams_4h*C(ca_gluconate_bool) + I(age**2) '
...     '+ np.log(iss) + C(moi_pen) + C(sex_female)'
... )
[
    'age',
    'ca_gluconate_bool',
    'ca_grams_4h',
    'iss',
    'moi_pen',
    'mortality_24h',
    'sex_female'
]
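
To make the extraction idea concrete, here is a simplified, dependency-free sketch. It does not use patsy (which the method above is documented to use); instead it assumes each side of the formula happens to be a valid Python expression and walks it with the standard-library ast module. The function name formula_columns is hypothetical, and patsy-only syntax such as a:b interactions is not handled.

```python
import ast


def formula_columns(formula: str) -> list[str]:
    """Approximate column extraction; a stand-in, not the patsy-based method."""
    cols: set[str] = set()
    for side in formula.split("~"):
        side = side.strip()
        if not side:
            continue  # one-sided formula such as "~ x + z"
        tree = ast.parse(side, mode="eval")
        not_columns = set()  # ids of Name nodes that are functions or modules
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    # C(...), I(...), Q(...) and other bare function calls
                    not_columns.add(id(node.func))
                    if (node.func.id == "Q" and node.args
                            and isinstance(node.args[0], ast.Constant)):
                        # Q("name") quotes a column name as a string
                        cols.add(str(node.args[0].value))
            elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
                # the module part of calls like np.log(...)
                not_columns.add(id(node.value))
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and id(node) not in not_columns:
                cols.add(node.id)
    return sorted(cols)


cols = formula_columns(
    "mortality_24h ~ ca_grams_4h*C(ca_gluconate_bool) + I(age**2) "
    "+ np.log(iss) + C(moi_pen) + C(sex_female)"
)
print(cols)
```

The sketch treats every bare name as a column unless it is being called as a function or used as a module prefix, which is why C, I, and np are excluded.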

vif_matrix(drop_ints: bool = False)

Calculate the variance inflation factor (VIF) for each column of the feature DataFrame.

Returns:

VIF values for each feature.

Return type:

pd.Series

Raises:

ValueError – If X has fewer than 2 columns.
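
For readers unfamiliar with the quantity, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features with an intercept. Equivalently, it is the j-th diagonal entry of the inverse of the feature correlation matrix. The dependency-free sketch below uses that identity; the function names are hypothetical, and whether vif_matrix computes exactly this centered variant is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear columns; VIF is undefined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def vif(columns):
    """Centered VIF per feature; columns maps name -> list of values."""
    names = list(columns)
    if len(names) < 2:
        raise ValueError("need at least 2 columns to compute VIF")
    corr = [[1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
            for a in names]
    out = {}
    for j, name in enumerate(names):
        unit = [1.0 if i == j else 0.0 for i in range(len(names))]
        out[name] = solve(corr, unit)[j]  # j-th diagonal of the inverse
    return out


# Made-up sample data for illustration only.
result = vif({"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 4, 3, 5]})
print(result)
```

With only two features, both VIFs reduce to 1 / (1 - r^2), where r is their correlation. In practice a library routine such as statsmodels' variance_inflation_factor would normally be used instead of hand-rolled linear algebra.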

class unistat.formula_regression.FormulaLogit(formula: str, data: DataFrame)

Bases: FormulaRegression

logit_or(standardize: bool = False)

Odds ratios (ORs) for each feature in X.

Parameters:

standardize (bool, default False) – If True, calculate standardized ORs; if False, calculate crude ORs.

Returns:

DataFrame with columns ['OR', '95% CI lower', '95% CI upper'], indexed by each column in X (or X_std when standardize is True), including the intercept.

Return type:

pd.DataFrame
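
As background on how such a table is conventionally derived, an odds ratio is the exponentiated logistic-regression coefficient, and a Wald confidence interval exponentiates the coefficient plus or minus z times its standard error. The sketch below shows that arithmetic with the standard library only. The function name and the coefficient values are hypothetical, and that logit_or uses Wald intervals is an assumption, not something stated above.

```python
import math
from statistics import NormalDist


def odds_ratio(coef: float, se: float, level: float = 0.95):
    """Return (OR, CI lower, CI upper) from a log-odds coefficient and its SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


# Hypothetical coefficient and standard error, for illustration only.
or_, ci_lower, ci_upper = odds_ratio(coef=0.5, se=0.2)
print(f"OR={or_:.3f}, 95% CI=({ci_lower:.3f}, {ci_upper:.3f})")
```

Because the interval is built on the log-odds scale and then exponentiated, it is asymmetric around the OR.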

class unistat.formula_regression.FormulaLinReg(formula: str, data: DataFrame)

Bases: FormulaRegression