Regression

Regression for Analysts: From Comparison to Causal Insight

A comprehensive guide to regression analysis for product and data analysts. Learn when to use linear, logistic, and count regression, how to diagnose problems, interpret coefficients correctly, and avoid common pitfalls that lead to misleading conclusions.

Quick Hits

  • Regression = flexible framework for understanding relationships between variables
  • Linear for continuous outcomes, logistic for binary, Poisson/NB for counts
  • Coefficients answer: "What's the expected change in Y for a unit change in X, holding others constant?"
  • "Holding others constant" is a statistical adjustment, not an experimental control
  • Diagnostics matter more than the regression itself - always check residuals

TL;DR

Regression is a framework for modeling how one or more predictor variables relate to an outcome. Linear regression handles continuous outcomes, logistic regression handles binary outcomes (yes/no), and Poisson/negative binomial regression handles counts. This guide covers when to use each type, how to interpret coefficients correctly, essential diagnostics, and common mistakes that lead analysts astray. The key insight: regression adjusts for included covariates, but "adjusting for X" is not the same as "experimentally controlling X."


Why Regression Matters for Analysts

Regression answers questions that simple comparisons can't:

  1. What's the relationship between X and Y after accounting for Z?

    • Does the marketing campaign increase revenue after controlling for seasonality?
  2. How much does Y change when X changes by one unit?

    • For each additional day of onboarding, how much does retention increase?
  3. Which predictors matter most?

    • Among pricing, features, and support, which drives upgrade decisions?
  4. What would Y be for a new observation with specific X values?

    • What's the expected lifetime value for a user with these characteristics?

The Regression Family

Linear Regression: Continuous Outcomes

Use when your outcome is a continuous number (revenue, time, satisfaction score).

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \epsilon$$

Interpretation: $\beta_1$ = expected change in Y for a one-unit increase in $X_1$, holding other predictors constant.

Example: Revenue = 100 + 5(emails sent) + 20(has premium) + error

  • Each additional email is associated with $5 more revenue
  • Premium users have $20 higher revenue on average
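
A minimal sketch of fitting a model like this with statsmodels; the data and column names (revenue, emails_sent, has_premium) are invented for illustration:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data matching the revenue example above
df = pd.DataFrame({
    "revenue":     [120, 95, 180, 140, 210, 160, 130, 175],
    "emails_sent": [2,   1,   8,   5,  12,   7,   3,   6],
    "has_premium": [0,   0,   1,   0,   1,   1,   0,   1],
})

model = smf.ols("revenue ~ emails_sent + has_premium", data=df).fit()
print(model.params)      # intercept, per-email slope, premium difference
print(model.conf_int())  # 95% confidence intervals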

Logistic Regression: Binary Outcomes

Use when your outcome is yes/no, converted/not, churned/retained.

$$\log\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + ...$$

Interpretation: $e^{\beta_1}$ = odds ratio for a one-unit increase in $X_1$.

Example: Conversion = f(price, has free trial, referral source)

  • $e^{\beta_{price}} = 0.85$: each $1 price increase reduces conversion odds by 15%
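
A comparable sketch for logistic regression on simulated data (converted, price, free_trial are hypothetical names); exponentiating the coefficients gives the odds ratios:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "price": rng.uniform(5, 30, n),        # hypothetical price in dollars
    "free_trial": rng.integers(0, 2, n),   # 1 = had a free trial
})
# Simulate conversions so each $1 of price lowers the log-odds by 0.15
logit = 1.0 - 0.15 * df["price"] + 0.8 * df["free_trial"]
df["converted"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = smf.logit("converted ~ price + free_trial", data=df).fit(disp=0)
print(np.exp(model.params))      # odds ratios; price should land near exp(-0.15) ≈ 0.86
print(np.exp(model.conf_int()))  # 95% CIs on the odds-ratio scale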

Poisson/Negative Binomial: Count Outcomes

Use when your outcome is a count (purchases, logins, support tickets).

$$\log(E[Y]) = \beta_0 + \beta_1 X_1 + ...$$

Interpretation: $e^{\beta_1}$ = multiplicative change in expected count.

Example: Support tickets = f(days since signup, plan type)

  • $e^{\beta_{days}} = 0.98$: each additional day reduces expected tickets by 2%
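
A short sketch fitting a Poisson model to simulated ticket counts (tickets, days_since_signup, plan are hypothetical names); the exponentiated coefficients are the multiplicative rate changes:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "days_since_signup": rng.integers(1, 365, n),
    "plan": rng.choice(["basic", "pro"], n),
})
# Simulate counts whose expected value falls roughly 0.2% per extra day
mu = np.exp(1.0 - 0.002 * df["days_since_signup"] + 0.4 * (df["plan"] == "pro"))
df["tickets"] = rng.poisson(mu)

model = smf.glm("tickets ~ days_since_signup + plan",
                data=df, family=sm.families.Poisson()).fit()
print(np.exp(model.params))  # rate ratios: multiplicative change in expected count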

Coefficient Interpretation: Getting It Right

The Fundamental Statement

For a coefficient $\beta_j$:

"$\beta_j$ is the expected change in Y for a one-unit increase in $X_j$, holding all other predictors in the model constant."

What "Holding Constant" Actually Means

This is statistical adjustment, not experimental control:

Statistical adjustment: "Among people with the same values of other predictors, how does Y vary with $X_j$?"

Experimental control: "If we randomly assigned $X_j$ values while keeping everything else fixed, what would happen?"

These are only equivalent under strong assumptions (no unmeasured confounders, correct model specification).

Example: Education and Salary

Model: Salary = $\beta_0$ + $\beta_1$(Years of Education) + $\beta_2$(Experience)

$\beta_1$ = 5,000 (in dollars)

Correct interpretation: "Among people with the same experience level, each additional year of education is associated with $5,000 higher salary."

Incorrect interpretation: "If you get one more year of education, your salary will increase by $5,000."

The second interpretation assumes no confounding—maybe more educated people have other characteristics (motivation, networks, family resources) that also affect salary.
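
A toy simulation makes the distinction concrete: when an unmeasured trait (labeled motivation here) drives both education and salary, the education coefficient overstates the causal effect even though experience is "held constant". The numbers below are invented for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5000
motivation = rng.normal(0, 1, n)                      # unmeasured confounder
education = 14 + 2 * motivation + rng.normal(0, 1, n)
experience = rng.uniform(0, 20, n)
salary = (30_000 + 3_000 * education + 1_000 * experience
          + 8_000 * motivation + rng.normal(0, 5_000, n))

df = pd.DataFrame({"salary": salary, "education": education,
                   "experience": experience})
fit = smf.ols("salary ~ education + experience", data=df).fit()
# The true causal effect of education is 3,000, but the estimate lands well
# above it because the model cannot adjust for the unmeasured motivation.
print(fit.params["education"])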


Building a Regression Model

Step 1: Define the Question

What relationship are you trying to understand or predict?

  • Descriptive: What factors are associated with higher retention?
  • Predictive: What's the expected revenue for this user?
  • Causal (with caution): Does the new feature increase engagement?

Step 2: Choose the Right Model Type

  • Continuous outcome → Linear regression (R: lm(); Python: statsmodels.OLS)
  • Binary (0/1) → Logistic regression (R: glm(..., family=binomial); Python: statsmodels.Logit)
  • Count → Poisson regression (R: glm(..., family=poisson); Python: statsmodels.Poisson)
  • Overdispersed count → Negative binomial (R: MASS::glm.nb(); Python: statsmodels.NegativeBinomial)
  • Bounded (0-1) → Beta regression (R: betareg::betareg(); Python: statsmodels BetaModel in statsmodels.othermod.betareg)

Step 3: Select Predictors

Include:

  • Variables directly relevant to your research question
  • Known confounders (affect both predictor of interest and outcome)
  • Variables that improve prediction (if prediction is the goal)

Exclude:

  • Variables on the causal path between X and Y (mediators)
  • Variables that could induce collider bias
  • Highly collinear variables (pick one or combine)

Step 4: Fit and Diagnose

Fit the model, then check:

  • Residual plots for assumption violations
  • Influential observations
  • Multicollinearity (VIF)
  • Model fit metrics (R², deviance, AIC/BIC)

Step 5: Interpret Carefully

  • Report effect sizes with confidence intervals
  • Distinguish association from causation
  • Acknowledge limitations and alternative explanations

Diagnostics: The Most Important Part

For Linear Regression

Residuals vs. Fitted Plot (Linearity, Homoscedasticity)

  • Should show random scatter around zero
  • Patterns indicate non-linearity or heteroscedasticity

Q-Q Plot (Normality)

  • Points should follow the diagonal line
  • Deviations at tails suggest non-normality

Scale-Location Plot (Homoscedasticity)

  • Should show horizontal band with even spread
  • Funnel shape indicates heteroscedasticity

Residuals vs. Leverage Plot (Influential Points)

  • Watch for points with high Cook's distance (>1 or >4/n)

For Logistic Regression

ROC Curve and AUC

  • Measures discrimination ability
  • AUC > 0.7 is acceptable, > 0.8 is good, > 0.9 is excellent

Calibration Plot

  • Predicted probabilities should match observed rates
  • Group predictions into bins and compare to actual outcome rates

Hosmer-Lemeshow Test

  • Tests whether predicted probabilities match observed outcomes
  • Non-significant p-value suggests good fit (but sensitive to sample size)

For Count Models

Dispersion Check

  • Poisson assumes mean = variance
  • If variance >> mean, use negative binomial
  • Check: ratio of Pearson chi-square to degrees of freedom ≈ 1

Zero-Inflation Check

  • More zeros than Poisson/NB predicts?
  • Consider zero-inflated models
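
The Python workflow below covers linear and logistic models only, so here is a small sketch (on simulated data) of the dispersion and zero-fraction checks for a Poisson fit; the R section includes the equivalent dispersion check:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
# Simulate overdispersed (negative binomial) counts, then fit them with Poisson
p = 2 / (2 + np.exp(0.5 * df["x"].to_numpy()))
df["y"] = rng.negative_binomial(2, p)

pois = smf.glm("y ~ x", data=df, family=sm.families.Poisson()).fit()
dispersion = pois.pearson_chi2 / pois.df_resid
print(f"dispersion ratio: {dispersion:.2f}")           # >> 1 points to negative binomial
print(f"share of zeros: {(df['y'] == 0).mean():.2%}")  # compare to model-implied zeros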

Code: Complete Regression Workflow

Python

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns


def run_linear_regression(data, formula):
    """
    Complete linear regression workflow with diagnostics.

    Parameters:
    -----------
    data : pd.DataFrame
        Dataset
    formula : str
        Patsy formula (e.g., 'y ~ x1 + x2')

    Returns:
    --------
    dict with model, summary, and diagnostic flags
    """
    # Fit model
    model = smf.ols(formula, data=data).fit()

    # Extract components for diagnostics
    fitted = model.fittedvalues
    residuals = model.resid
    standardized_resid = model.get_influence().resid_studentized_internal
    leverage = model.get_influence().hat_matrix_diag
    cooks_d = model.get_influence().cooks_distance[0]

    # Calculate VIF for multicollinearity
    X = model.model.exog
    vif_data = pd.DataFrame({
        'Variable': model.model.exog_names,
        'VIF': [variance_inflation_factor(X, i) for i in range(X.shape[1])]
    })

    # Diagnostic flags
    flags = []

    # High Cook's distance
    high_cooks = np.sum(cooks_d > 4/len(data))
    if high_cooks > 0:
        flags.append(f"{high_cooks} observations with Cook's d > 4/n")

    # High VIF
    high_vif = vif_data[vif_data['VIF'] > 5]['Variable'].tolist()
    if 'Intercept' in high_vif:
        high_vif.remove('Intercept')
    if high_vif:
        flags.append(f"High VIF (>5): {high_vif}")

    # Heteroscedasticity (Breusch-Pagan)
    from statsmodels.stats.diagnostic import het_breuschpagan
    bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog)
    if bp_p < 0.05:
        flags.append(f"Heteroscedasticity detected (BP p={bp_p:.4f})")

    return {
        'model': model,
        'summary': model.summary(),
        'vif': vif_data,
        'diagnostics': {
            'fitted': fitted,
            'residuals': residuals,
            'standardized_resid': standardized_resid,
            'leverage': leverage,
            'cooks_d': cooks_d
        },
        'flags': flags,
        'r_squared': model.rsquared,
        'adj_r_squared': model.rsquared_adj
    }


def plot_linear_diagnostics(result, figsize=(12, 10)):
    """Create diagnostic plots for linear regression."""
    diag = result['diagnostics']

    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # 1. Residuals vs Fitted
    ax1 = axes[0, 0]
    ax1.scatter(diag['fitted'], diag['residuals'], alpha=0.6)
    ax1.axhline(y=0, color='red', linestyle='--')
    ax1.set_xlabel('Fitted Values')
    ax1.set_ylabel('Residuals')
    ax1.set_title('Residuals vs Fitted')
    # Add lowess smoothing
    from statsmodels.nonparametric.smoothers_lowess import lowess
    smooth = lowess(diag['residuals'], diag['fitted'], frac=0.3)
    ax1.plot(smooth[:, 0], smooth[:, 1], color='red', linewidth=2)

    # 2. Q-Q Plot
    ax2 = axes[0, 1]
    from scipy import stats
    stats.probplot(diag['standardized_resid'], dist="norm", plot=ax2)
    ax2.set_title('Normal Q-Q')

    # 3. Scale-Location
    ax3 = axes[1, 0]
    sqrt_abs_resid = np.sqrt(np.abs(diag['standardized_resid']))
    ax3.scatter(diag['fitted'], sqrt_abs_resid, alpha=0.6)
    ax3.set_xlabel('Fitted Values')
    ax3.set_ylabel('√|Standardized Residuals|')
    ax3.set_title('Scale-Location')
    smooth = lowess(sqrt_abs_resid, diag['fitted'], frac=0.3)
    ax3.plot(smooth[:, 0], smooth[:, 1], color='red', linewidth=2)

    # 4. Residuals vs Leverage
    ax4 = axes[1, 1]
    ax4.scatter(diag['leverage'], diag['standardized_resid'], alpha=0.6)
    ax4.axhline(y=0, color='grey', linestyle='--')
    ax4.set_xlabel('Leverage')
    ax4.set_ylabel('Standardized Residuals')
    ax4.set_title('Residuals vs Leverage')

    # Add Cook's distance contours: D = r^2 * h / (p * (1 - h)), so the
    # contour at D = thresh is r = ±sqrt(thresh * p * (1 - h) / h)
    xlim = ax4.get_xlim()
    ylim = ax4.get_ylim()
    p = result['model'].df_model + 1  # number of estimated parameters
    for thresh in [0.5, 1.0]:
        x = np.linspace(0.001, xlim[1], 50)
        y = np.sqrt(thresh * p * (1 - x) / x)
        ax4.plot(x, y, 'r--', linewidth=0.5)
        ax4.plot(x, -y, 'r--', linewidth=0.5)
    ax4.set_xlim(xlim)  # keep the original view despite the contour lines
    ax4.set_ylim(ylim)

    plt.tight_layout()
    return fig


def run_logistic_regression(data, formula):
    """
    Logistic regression workflow with diagnostics.
    """
    # Fit model
    model = smf.logit(formula, data=data).fit(disp=0)

    # Calculate predicted probabilities
    pred_probs = model.predict()

    # Calculate odds ratios with CIs
    params = model.params
    conf = model.conf_int()
    odds_ratios = pd.DataFrame({
        'Variable': params.index,
        'Coefficient': params.values,
        'Odds Ratio': np.exp(params.values),
        'OR CI Lower': np.exp(conf[0].values),
        'OR CI Upper': np.exp(conf[1].values),
        'P-value': model.pvalues.values
    })

    return {
        'model': model,
        'summary': model.summary(),
        'odds_ratios': odds_ratios,
        'pred_probs': pred_probs,
        'aic': model.aic,
        'bic': model.bic
    }


# Example usage
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    n = 500

    data = pd.DataFrame({
        'x1': np.random.normal(0, 1, n),
        'x2': np.random.normal(0, 1, n),
        'x3': np.random.choice(['A', 'B', 'C'], n)
    })
    data['y'] = 5 + 2*data['x1'] - 1*data['x2'] + \
                np.where(data['x3']=='B', 3, np.where(data['x3']=='C', -2, 0)) + \
                np.random.normal(0, 2, n)

    # Run regression
    result = run_linear_regression(data, 'y ~ x1 + x2 + C(x3)')

    print("Regression Summary")
    print("=" * 60)
    print(result['summary'])
    print("\nVIF Table:")
    print(result['vif'])
    print("\nDiagnostic Flags:")
    for flag in result['flags']:
        print(f"  - {flag}")

R

library(tidyverse)
library(broom)
library(car)  # For VIF
library(performance)  # For comprehensive diagnostics


run_linear_regression <- function(formula, data) {
    #' Complete linear regression workflow

    model <- lm(formula, data = data)

    # Calculate VIF
    vif_values <- tryCatch(
        vif(model),
        error = function(e) NA
    )

    # Diagnostic tests
    # Breusch-Pagan for heteroscedasticity
    bp_test <- lmtest::bptest(model)

    # Shapiro-Wilk for normality (on sample of residuals if n > 5000)
    resids <- residuals(model)
    if (length(resids) > 5000) {
        resids <- sample(resids, 5000)
    }
    normality_test <- shapiro.test(resids)

    # Influential observations
    cooks_d <- cooks.distance(model)
    influential <- sum(cooks_d > 4/nrow(data))

    # Compile flags
    flags <- c()
    if (bp_test$p.value < 0.05) {
        flags <- c(flags, sprintf("Heteroscedasticity detected (BP p=%.4f)", bp_test$p.value))
    }
    if (normality_test$p.value < 0.05) {
        flags <- c(flags, sprintf("Non-normality detected (SW p=%.4f)", normality_test$p.value))
    }
    if (any(vif_values > 5, na.rm = TRUE)) {
        high_vif <- names(vif_values)[vif_values > 5]
        flags <- c(flags, sprintf("High VIF (>5): %s", paste(high_vif, collapse=", ")))
    }
    if (influential > 0) {
        flags <- c(flags, sprintf("%d observations with Cook's d > 4/n", influential))
    }

    list(
        model = model,
        summary = summary(model),
        tidy = tidy(model, conf.int = TRUE),
        glance = glance(model),
        vif = vif_values,
        flags = flags
    )
}


plot_diagnostics <- function(model) {
    #' Create diagnostic plots using base R

    par(mfrow = c(2, 2))
    plot(model)
    par(mfrow = c(1, 1))
}


run_logistic_regression <- function(formula, data) {
    #' Logistic regression with odds ratios

    model <- glm(formula, data = data, family = binomial)

    # Odds ratios with CIs
    or_table <- exp(cbind(
        OR = coef(model),
        confint(model)
    ))

    # Model fit
    null_dev <- model$null.deviance
    resid_dev <- model$deviance
    pseudo_r2 <- 1 - (resid_dev / null_dev)

    list(
        model = model,
        summary = summary(model),
        odds_ratios = or_table,
        tidy = tidy(model, conf.int = TRUE, exponentiate = TRUE),
        aic = AIC(model),
        pseudo_r2 = pseudo_r2
    )
}


run_poisson_regression <- function(formula, data) {
    #' Poisson regression with overdispersion check

    model <- glm(formula, data = data, family = poisson)

    # Overdispersion check
    pearson_resid <- sum(residuals(model, type = "pearson")^2)
    df_resid <- model$df.residual
    dispersion <- pearson_resid / df_resid

    # Rate ratios with CIs
    rr_table <- exp(cbind(
        RR = coef(model),
        confint(model)
    ))

    flags <- c()
    if (dispersion > 1.5) {
        flags <- c(flags, sprintf("Overdispersion detected (%.2f > 1.5). Consider negative binomial.", dispersion))
    }

    list(
        model = model,
        summary = summary(model),
        rate_ratios = rr_table,
        dispersion = dispersion,
        flags = flags
    )
}


# Example usage
set.seed(42)
n <- 500

data <- tibble(
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = sample(c("A", "B", "C"), n, replace = TRUE)
) %>%
    mutate(
        y = 5 + 2*x1 - 1*x2 +
            case_when(x3 == "B" ~ 3, x3 == "C" ~ -2, TRUE ~ 0) +
            rnorm(n, sd = 2)
    )

# Run regression
result <- run_linear_regression(y ~ x1 + x2 + x3, data)

cat("Regression Results\n")
cat(strrep("=", 60), "\n")
print(result$tidy)
cat("\nModel Fit:\n")
print(result$glance)
cat("\nDiagnostic Flags:\n")
for (flag in result$flags) {
    cat("  -", flag, "\n")
}

Common Mistakes and How to Avoid Them

Mistake 1: Interpreting Coefficients as Causal Effects

Wrong: "Our model shows that each additional year of education increases salary by $5,000."

Right: "Our model shows that education is associated with $5,000 higher salary per year, after adjusting for experience. This may reflect causal effects, selection effects, or unmeasured confounders."

Fix: Be precise about what regression can and cannot tell you. Use causal language only for experimental data.

Mistake 2: Including Post-Treatment Variables

If X → M → Y, controlling for M blocks the causal path from X to Y.

Wrong: Testing if a marketing campaign (X) affects purchases (Y) while controlling for website visits (M) that the campaign causes.

Right: Don't control for mediators. Report total effect (without M) and potentially decompose direct/indirect effects if that's your question.
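
A toy simulation (invented numbers) shows the problem: when the campaign's effect on purchases runs through website visits, adjusting for visits makes the campaign look ineffective:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 4000
campaign = rng.integers(0, 2, n)
visits = 5 + 3 * campaign + rng.normal(0, 2, n)      # mediator: campaign drives visits
purchases = 1 + 0.5 * visits + rng.normal(0, 1, n)   # effect on purchases runs via visits

df = pd.DataFrame({"campaign": campaign, "visits": visits, "purchases": purchases})
total = smf.ols("purchases ~ campaign", data=df).fit()
blocked = smf.ols("purchases ~ campaign + visits", data=df).fit()
print(total.params["campaign"])    # ≈ 1.5: the total effect of the campaign
print(blocked.params["campaign"])  # ≈ 0: controlling for the mediator hides it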

Mistake 3: Ignoring Diagnostics

Wrong: Running regression, seeing p < 0.05, declaring success.

Right: Check residual plots before interpreting any coefficients. Violations can make standard errors wrong and coefficients biased.

Mistake 4: Stepwise Selection

Wrong: Using automated stepwise procedures to select variables.

Right: Select variables based on theory and research questions. Use cross-validation for purely predictive models.

Mistake 5: Overfitting

Wrong: Adding many predictors to maximize R².

Right: Report adjusted R². Use AIC/BIC for model comparison. Consider holdout validation for predictive models.

Mistake 6: Ignoring Multicollinearity

Wrong: Including highly correlated predictors and interpreting their coefficients.

Right: Check VIF. If VIF > 5-10, consider combining variables or dropping redundant ones.


Regression vs. Simple Tests: A Unified View

Regression subsumes many common tests:

  • One-sample t-test → intercept-only regression, test β₀
  • Two-sample t-test → y ~ group (dummy-coded group)
  • Paired t-test → y_diff ~ 1 (regression on within-pair differences)
  • One-way ANOVA → y ~ factor (dummy coded)
  • Two-way ANOVA → y ~ factor1 * factor2
  • Correlation test → y ~ x (the slope equals r when both variables are standardized)
  • Chi-square test (2×2) → logistic regression with a binary predictor (or a log-linear Poisson model on cell counts)

Why this matters: Understanding regression as a general framework helps you see when simple tests are appropriate and when you need more flexibility.
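
To see the equivalence concretely, here is a small sketch showing that a pooled two-sample t-test and the regression y ~ group give identical test statistics on simulated data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(10, 2, 80)   # group A
b = rng.normal(11, 2, 80)   # group B

t_stat, p_val = stats.ttest_ind(b, a)   # classic pooled two-sample t-test

df = pd.DataFrame({"y": np.concatenate([a, b]),
                   "group": ["A"] * 80 + ["B"] * 80})
fit = smf.ols("y ~ group", data=df).fit()
print(t_stat, p_val)                                         # from the t-test
print(fit.tvalues["group[T.B]"], fit.pvalues["group[T.B]"])  # identical from regression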


When Regression Isn't Enough

For Causal Inference

Regression alone can't handle unmeasured confounding. Consider:

  • Randomized experiments: Gold standard for causal claims
  • Instrumental variables: When you have a valid instrument
  • Regression discontinuity: When treatment is assigned by a threshold
  • Difference-in-differences: When you have pre/post and treatment/control
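
As a sketch of the last design, difference-in-differences can be estimated as a regression with a treated-by-post interaction (simulated data, invented effect sizes):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 2000
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
y = (10 + 2 * treated + 1 * post     # baseline group and time differences
     + 3 * treated * post            # true treatment effect
     + rng.normal(0, 2, n))
df = pd.DataFrame({"y": y, "treated": treated, "post": post})

did = smf.ols("y ~ treated * post", data=df).fit()
print(did.params["treated:post"])    # ≈ 3: the difference-in-differences estimate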

For Non-Linear Relationships

Standard regression assumes linear relationships (in parameters). For complex patterns:

  • Polynomial terms: y ~ x + x² for U-shaped relationships (see the sketch after this list)
  • Splines: Flexible local fitting
  • GAMs: Generalized additive models for smooth non-linearities
  • Machine learning: Random forests, gradient boosting for prediction
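
A short sketch of the polynomial-term option on simulated U-shaped data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(-3, 3, n)
y = 2 - x + 1.5 * x**2 + rng.normal(0, 1, n)   # U-shaped relationship
df = pd.DataFrame({"x": x, "y": y})

linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(linear.rsquared, quadratic.rsquared)     # the quadratic term fits far better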

For Hierarchical/Clustered Data

When observations aren't independent (students in classrooms, users across sessions):

  • Mixed effects models: Account for group structure (see the sketch after this list)
  • Cluster-robust standard errors: At minimum, adjust SEs
  • GEE: Generalized estimating equations
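
A sketch of two of these options in statsmodels, on simulated classroom data: cluster-robust standard errors on an OLS fit, and a random-intercept mixed model:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
classrooms = np.repeat(np.arange(30), 20)            # 30 classrooms x 20 students
class_effect = rng.normal(0, 2, 30)[classrooms]      # shared classroom-level noise
x = rng.normal(0, 1, classrooms.size)
y = 3 + 0.5 * x + class_effect + rng.normal(0, 1, classrooms.size)
df = pd.DataFrame({"y": y, "x": x, "classroom": classrooms})

# (1) Plain OLS with cluster-robust standard errors
ols_cluster = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["classroom"]})

# (2) Random intercept per classroom (mixed effects)
mixed = smf.mixedlm("y ~ x", data=df, groups=df["classroom"]).fit()
print(ols_cluster.bse["x"], mixed.bse["x"])          # both reflect the clustering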

The Analyst's Regression Checklist

Before Fitting

  • Research question is clear
  • Outcome variable type identified (continuous, binary, count)
  • Predictors selected based on theory/question (not p-values)
  • Sample size adequate for planned predictors
  • Data cleaned (missing values handled, outliers examined)

After Fitting

  • Residuals vs. fitted shows no patterns
  • Q-Q plot shows approximate normality (linear regression)
  • Scale-location plot shows homoscedasticity
  • No highly influential observations (or understood)
  • VIF < 5 for all predictors (or multicollinearity acknowledged)
  • Model makes substantive sense

When Reporting

  • Effect sizes with confidence intervals
  • Clear distinction between association and causation
  • Limitations acknowledged
  • Sensitivity to model choices discussed


Key Takeaway

Regression is the analyst's most versatile tool—it unifies simple comparisons (t-tests, ANOVA) and correlation into one framework that handles continuous predictors, multiple covariates, and different outcome types. But versatility requires vigilance: always check diagnostics, interpret "controlling for" correctly (it's statistical adjustment, not experimental control), and remember that without randomization, coefficients show association, not causation. Master regression, and you'll have a framework for understanding relationships in almost any data.



Frequently Asked Questions

When should I use regression instead of a t-test or ANOVA?
Use regression when you have continuous predictors, multiple predictors, or want to control for covariates. T-tests and ANOVA are special cases of regression with categorical predictors. Regression is more flexible and generalizes better.
What's the difference between linear and logistic regression?
Linear regression predicts a continuous outcome (revenue, time, scores). Logistic regression predicts the probability of a binary outcome (converted/not, churned/retained). Use linear when your outcome is numeric and unbounded, logistic when it's 0/1 or yes/no.
Can regression prove causation?
No. Regression shows association after controlling for included variables. Unobserved confounders can still bias results. For causal claims, you need randomization (experiments), instrumental variables, or quasi-experimental designs with strong assumptions.
How do I know if my regression is valid?
Check assumptions via residual plots: linearity (residuals vs fitted), normality (Q-Q plot), homoscedasticity (scale-location plot), and independence (consider your data structure). Also check for influential points and multicollinearity.

