Formula-Based Regression
Module for regression statistics based on patsy-/R-style formulae.
This module provides an abstract base class and concrete implementations for performing regression analyses using statsmodels. It supports linear regression, logistic regression, and log-binomial regression, with features like variance inflation factor (VIF) calculation, standardized regressions, odds/risk ratios, and formatted output.
Assumes input data are pandas Series/DataFrames. Handles boolean columns
specially in standardization.
- class unistat.formula_regression.FormulaRegression(formula: str, data: DataFrame)
Bases:
ABC
Abstract base class for formula-based regression statistics.
Provides common functionality for regression models, including data preparation, standardization, VIF calculation, and properties for regression results.
Uses Wilkinson formulae (akin to R-style formulae), which are implemented in statsmodels via the patsy package.
- Parameters:
formula (str) – statsmodels/patsy/R-style formula defining the regression model.
data (pd.DataFrame) – Data for the model. data must contain (at a minimum) all variables referenced by formula.
- data
Input DataFrame after filtering for columns in formula, dropping NaNs, and converting all columns to float.
- Type:
pd.DataFrame
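The three preparation steps described for data (keep only the formula's columns, drop rows with missing values, cast everything to float) can be illustrated on plain dict rows. This is a minimal sketch, not the class's implementation: prepare_rows and the sample rows are hypothetical, and plain Python is used instead of pandas only so that the example runs without third-party packages.

```python
import math


def prepare_rows(rows, columns):
    """Keep only `columns`, drop rows with missing values, cast values to float.

    Hypothetical helper: a plain-Python analogue of the filtering described
    for the `data` attribute, not code from the library.
    """
    prepared = []
    for row in rows:
        values = [row.get(col) for col in columns]
        # Treat None and float NaN as missing, mirroring a NaN drop.
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            continue
        # float(True) is 1.0 and float(False) is 0.0, so booleans become 0/1.
        prepared.append({col: float(v) for col, v in zip(columns, values)})
    return prepared


rows = [
    {"y": 1, "x": 2.5, "flag": True, "unused": "a"},
    {"y": 0, "x": None, "flag": False, "unused": "b"},
    {"y": 1, "x": float("nan"), "flag": True, "unused": "c"},
    {"y": 0, "x": 4.0, "flag": False, "unused": "d"},
]
clean = prepare_rows(rows, ["y", "x", "flag"])
```

Only the first and last rows survive, and the column unused is discarded because it is not in the requested column list.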
- static extract_formula_cols(formula: str) FormulaSides
Extract unique column names from a patsy/R formula string.
This function parses the formula using patsy, then processes each factor to identify underlying column names, handling special cases like C(), Q(), I(), and function calls (e.g., np.log).
- Parameters:
formula (str) – The patsy-compatible formula string (e.g., ‘y ~ x + C(cat) + np.log(z)’).
- Returns:
Sorted list of unique column names extracted from the formula.
- Return type:
FormulaSides
Examples
>>> extract_columns_from_formula(
...     'mortality_24h ~ ca_grams_4h*C(ca_gluconate_bool) + I(age**2) '
...     '+ np.log(iss) + C(moi_pen) + C(sex_female)'
... )
['age', 'ca_gluconate_bool', 'ca_grams_4h', 'iss', 'moi_pen', 'mortality_24h', 'sex_female']
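The idea of separating column names from wrappers such as C(), I(), and np.log() can be approximated without patsy. The sketch below is a simplified regex-based stand-in, not the parsing the method performs: guess_formula_columns is a hypothetical name, and the heuristic (an identifier immediately followed by an opening parenthesis is a call, not a column) will not handle quoted names inside Q() or keyword arguments.

```python
import re

# Identifier-like tokens, allowing dotted names such as np.log.
# The lookbehind stops a match from starting in the middle of a name or number.
_NAME = re.compile(r"(?<![\w.])[A-Za-z_][\w.]*")


def guess_formula_columns(formula):
    """Return sorted unique names that are not immediately called as functions.

    Hypothetical, regex-only approximation; the documented method parses the
    formula with patsy instead.
    """
    names = set()
    for match in _NAME.finditer(formula):
        rest = formula[match.end():].lstrip()
        if rest.startswith("("):
            continue  # C(...), I(...), np.log(...): a call, not a column
        names.add(match.group())
    return sorted(names)


columns = guess_formula_columns(
    "mortality_24h ~ ca_grams_4h*C(ca_gluconate_bool) + I(age**2) "
    "+ np.log(iss) + C(moi_pen) + C(sex_female)"
)
```

A real parser is the safer choice for anything beyond simple formulae, since it understands quoting and nested expressions that a regex cannot.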
- vif_matrix(drop_ints: bool = False)
Calculate variance inflation factors (VIF) for each column of the feature DataFrame X.
- Returns:
VIF values for each feature.
- Return type:
pd.Series
- Raises:
ValueError – If fewer than 2 columns in X.
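For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. With exactly two features, R_j^2 is the squared Pearson correlation between them, so both features share one VIF. The sketch below covers only that two-feature case in plain Python; pearson_r and vif_two_predictors are hypothetical helpers and do not reflect how vif_matrix is implemented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both features when the model has exactly two features."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


# Uncorrelated features: no variance inflation.
vif_uncorrelated = vif_two_predictors([1, -1, 1, -1], [1, 1, -1, -1])
# Nearly collinear features: strongly inflated variance.
vif_correlated = vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5])
```

A single feature has nothing to be regressed against, which is consistent with the ValueError documented above for fewer than 2 columns.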
- class unistat.formula_regression.FormulaLogit(formula: str, data: DataFrame)
Bases:
FormulaRegression
- logit_or(standardize: bool = False)
Odds ratios (ORs) for each feature in X.
- Parameters:
standardize (bool, default False) – If True, calculate standardized ORs; otherwise calculate crude ORs.
- Returns:
DataFrame with columns=['OR', '95% CI lower', '95% CI upper']; index is each column in X/X_std, including the intercept.
- Return type:
pd.DataFrame
pd.DataFrame
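An odds ratio is the exponential of a logistic-regression coefficient, and a confidence interval on the log-odds scale is exponentiated the same way. The sketch below assumes a Wald interval (coefficient plus or minus z times the standard error); odds_ratio_with_ci and the input values are hypothetical, and whether logit_or computes its interval exactly this way is not confirmed by this page.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(coef, std_err, level=0.95):
    """Return (OR, CI lower, CI upper) for one log-odds coefficient.

    Hypothetical helper assuming a Wald interval on the log-odds scale.
    """
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


# A coefficient of log(2) corresponds to an odds ratio of 2.
or_, lower, upper = odds_ratio_with_ci(coef=math.log(2.0), std_err=0.25)
```

Because the interval is symmetric on the log scale, the bounds are symmetric around the OR in a multiplicative sense: lower * upper equals OR squared.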
- class unistat.formula_regression.FormulaLinReg(formula: str, data: DataFrame)
Bases:
FormulaRegression