Logistic Regression
Logistic regression models probability via the log-odds (logit) transformation, assuming a linear relationship between the predictors and the logit of the probability. For beginners: probabilities must be transformed using the sigmoid function because raw linear predictions can fall outside the valid 0-1 probability range—the sigmoid "squeezes" any real number into a valid probability while preserving ranking order.
What is Logistic Regression?
Logistic regression is a statistical method for predicting binary outcomes. Unlike linear regression which predicts continuous values, logistic regression predicts the probability of an event occurring using the logistic (sigmoid) function, which outputs values between 0 and 1.
A critical distinction exists between probability modeling and classification decision thresholding. Logistic regression outputs probability first; classification occurs second by applying a threshold (typically 0.5) to convert probabilities into class predictions. This two-stage process allows analysts to adjust sensitivity based on business costs of false positives versus false negatives.
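The two-stage process can be sketched in a few lines. This is an illustrative snippet, not output from any particular library; the linear-predictor value 0.8 is an arbitrary example:

```python
import math

def sigmoid(z):
    """Stage one helper: map any real-valued linear predictor into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Stage two: apply a decision threshold to convert probability to a class."""
    return 1 if probability >= threshold else 0

p = sigmoid(0.8)           # stage one: the model outputs a probability (≈ 0.69)
print(classify(p))         # stage two: 1 at the default 0.5 threshold
print(classify(p, 0.75))   # 0 under a stricter threshold chosen by the analyst
```

Note that the model itself never changes between the two calls to `classify`; only the business decision rule does.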
Logistic regression belongs to the family of Generalized Linear Models (GLM), extending linear regression to non-normal distributions through link functions. The logit link connects the linear predictor to the binomial distribution of binary outcomes, enabling valid statistical inference for categorical response variables.
Logistic Regression Fundamentals
What Logistic Regression Predicts: The model estimates the probability that an observation belongs to a specific class (typically coded as 1) given its predictor values. Output ranges between 0 and 1 (exclusive), interpreted as estimated event probabilities.
When to Use: Apply logistic regression when predicting binary outcomes (success/failure, yes/no, default/repay), estimating risk probabilities, or ranking observations by likelihood of event occurrence.
Simple Example: A bank wants to predict loan default. Using historical data, they fit a logistic regression with income, credit score, and debt-to-income ratio as predictors. For a new applicant earning $60K with a 700 credit score, the model predicts 0.15 probability of default (15% risk). The bank might auto-approve applications below 0.10 probability, manually review 0.10-0.30, and decline above 0.30, converting probability estimates into actionable business rules.
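The bank's business rule above can be written directly as a small decision function. The cut-offs (0.10 and 0.30) come from the example; in practice they would be tuned to the bank's cost structure:

```python
def loan_decision(default_probability):
    """Convert a predicted default probability into a business action,
    using the illustrative cut-offs from the example above."""
    if default_probability < 0.10:
        return "auto-approve"
    if default_probability <= 0.30:
        return "manual review"
    return "decline"

print(loan_decision(0.15))  # the 15%-risk applicant is routed to manual review
```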
Logistic Regression Equations
The logit transformation maps probabilities in (0, 1) to real numbers in (-∞, +∞), so linear combinations of predictors can never produce invalid probability estimates. Each coefficient represents the change in log-odds per one-unit change in its predictor, holding all other variables constant (ceteris paribus). Exponentiating a coefficient yields its odds ratio (OR): the multiplicative effect on the odds when that specific predictor increases by one unit, assuming the other predictors remain unchanged.
Odds ratio > 1 indicates increased odds; OR < 1 indicates decreased odds.
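Converting coefficients to odds ratios is a single exponentiation. The coefficients below are hypothetical values invented for illustration, not from a fitted model:

```python
import math

# Hypothetical fitted coefficients (change in log-odds per unit change).
coefficients = {"income_10k": -0.25, "credit_score_100": -0.60, "dti_ratio": 0.90}

for name, beta in coefficients.items():
    odds_ratio = math.exp(beta)  # exponentiate log-odds change to get the OR
    direction = "increases" if odds_ratio > 1 else "decreases"
    print(f"{name}: OR = {odds_ratio:.2f} ({direction} odds, others held constant)")
```

Here `dti_ratio` has OR ≈ 2.46 (> 1, higher odds of the event), while the two negative coefficients give ORs below 1.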
Classification Metrics
Classification performance involves inherent trade-offs between sensitivity (true positive rate—capturing actual positives) and specificity (true negative rate—avoiding false alarms). Improving one typically degrades the other as the decision threshold shifts. AUC evaluates ranking quality rather than classification accuracy, measuring the probability that a randomly chosen positive instance ranks higher than a randomly chosen negative instance. Classification threshold selection dramatically affects confusion matrix results—thresholds below 0.5 increase sensitivity at the cost of specificity, while thresholds above 0.5 do the opposite.
Accuracy
Sensitivity
Specificity
AUC-ROC
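The four metrics above can all be computed from a confusion matrix plus the ranking definition of AUC. A minimal, dependency-free sketch on a toy dataset (the labels and scores are made up for illustration):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def auc(y_true, scores):
    """P(randomly chosen positive scores above a random negative); ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]          # model's predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # default 0.5 threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("accuracy:",    (tp + tn) / len(y_true))   # 0.667
print("sensitivity:", tp / (tp + fn))            # 0.667
print("specificity:", tn / (tn + fp))            # 0.667
print("AUC:",         auc(y_true, scores))       # 0.889
```

Note that accuracy, sensitivity, and specificity all depend on the 0.5 threshold, while AUC depends only on how the scores rank the observations.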
Common Applications
Logistic regression supports risk scoring and probability-based decision thresholds across industries. Models must be periodically recalibrated with new data as population characteristics drift over time—models that are not recalibrated degrade in predictive accuracy. Importantly, these models provide decision support rather than deterministic outcomes; they estimate probabilities that inform human judgment rather than replacing it entirely.
Customer Churn Prediction
Predict which customers are likely to cancel their subscription or service.
Credit Risk Assessment
Estimate probability of loan default based on applicant characteristics.
Medical Diagnosis
Predict disease presence based on symptoms and test results.
Quality Pass/Fail
Predict whether a product will pass quality inspection based on process parameters.
Marketing Response
Predict likelihood of customer response to marketing campaigns.
Equipment Failure
Predict probability of equipment failure based on operating conditions.
Logistic Regression Assumptions
Valid logistic regression inference requires specific statistical assumptions. Violations bias coefficients, invalidate standard errors, or reduce predictive accuracy.
Independence of Observations
Observations must be independent—one observation's outcome should not influence another's. Clustered or longitudinal data requires generalized estimating equations (GEE) or mixed-effects models.
Linearity in Log-Odds
The relationship between continuous predictors and the log-odds of the outcome must be linear. Nonlinear relationships require transformations, polynomial terms, or spline functions.
Absence of Multicollinearity
Predictors should not be highly correlated with each other. Perfect multicollinearity prevents model estimation; high multicollinearity inflates standard errors and destabilizes coefficient estimates.
Adequate Sample Size
Traditional rules of thumb suggest approximately 10 events per predictor variable (EPV), though modern research shows acceptable EPV may vary depending on model complexity, effect size, and regularization.
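A quick EPV check is easy to automate. By convention, EPV is computed from the count of the rarer outcome class, not the total sample size; the loan numbers below are invented for illustration:

```python
def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV uses the count of the rarer outcome class, not total sample size."""
    return min(n_events, n_nonevents) / n_predictors

# Example: 1,000 loans, 80 defaults, model with 6 predictors.
epv = events_per_variable(n_events=80, n_nonevents=920, n_predictors=6)
print(round(epv, 1))  # ≈ 13.3, above the traditional EPV ≥ 10 guideline
```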
Correct Outcome Specification
The outcome must be binary (or properly coded for multinomial). Misclassification of outcomes or inclusion of ordinal categories as continuous violates model structure.
Model Limitations
Understanding logistic regression boundaries prevents misinterpretation and guides appropriate methodological choices:
Association vs. Causation
Logistic regression identifies statistical association but does not confirm causation. Confounding variables, reverse causality, or selection bias may explain observed relationships even with significant odds ratios.
Omitted Variable Bias
Models are sensitive to missing variables. Omitting predictors correlated with both included predictors and the outcome biases coefficient estimates and distorts odds ratios.
Nonlinear Relationships
Performance decreases with highly nonlinear predictor relationships. The linear log-odds assumption fails when effects vary across predictor ranges or involve complex interactions.
Interaction Specification
Does not automatically discover interaction effects unless explicitly included in the model specification. Important moderating relationships remain undiscovered unless analysts specifically test interaction terms.
When NOT to Use Logistic Regression
Logistic regression provides inappropriate methodology for specific data structures and analytical objectives:
Continuous Outcomes
Continuous outcome prediction requires linear regression or generalized linear models with identity links, not logistic regression. Using logistic regression for continuous responses produces meaningless probability estimates.
Imbalanced Datasets
Extremely imbalanced datasets (e.g., 99:1 ratios) require resampling, weighting, or threshold adjustment. Without these techniques, logistic regression may produce models biased toward the majority class despite high accuracy.
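One common weighting scheme makes each class contribute equally to the loss by weighting observations inversely to class frequency (the "balanced" heuristic, weight = n_samples / (n_classes × n_class)). A minimal sketch of computing such weights:

```python
def balanced_class_weights(y):
    """Weight each class inversely to its frequency:
    weight_c = n_samples / (n_classes * n_in_class_c)."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

y = [1] * 10 + [0] * 990            # a 99:1 imbalanced outcome
print(balanced_class_weights(y))    # minority class up-weighted ~100x vs majority
```

These weights would then multiply each observation's contribution to the log-likelihood during fitting, so the minority class is not drowned out by the majority.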
Highly Nonlinear Problems
Highly nonlinear classification problems without feature engineering or kernel methods may require decision trees, random forests, or neural networks that capture complex decision boundaries logistic regression cannot represent.
Small Sample, Many Predictors
Small datasets with large predictor counts violate the events-per-variable rule, producing overfit models with unreliable coefficients. Regularization (LASSO, Ridge) or dimensionality reduction should precede modeling.
Interpreting Odds Ratios
Odds ratio interpretation requires contextual understanding beyond statistical significance. OR magnitude must be weighed against baseline probability: an OR of 2.0 has very different practical impact when baseline risk is 1% versus 40%. Statistical significance does not guarantee practical importance—a significant OR of 1.05 may not justify intervention costs despite p < 0.05. When interaction terms are present, main effects cannot be interpreted independently; the effect of one predictor depends on the level of another.
OR = 1
No effect. The predictor has no association with the outcome.
OR > 1
Positive association. Higher predictor values increase odds of positive outcome.
OR < 1
Negative association. Higher predictor values decrease odds of positive outcome.
Confidence Interval
If 95% CI excludes 1, the effect is statistically significant.
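Because the coefficient (not the OR) is approximately normal, the CI is computed on the log-odds scale and then exponentiated. The coefficient and standard error below are hypothetical:

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """95% CI for an odds ratio: exponentiate the CI of the coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error from a fitted model.
or_est, lo, hi = odds_ratio_ci(beta=0.47, se=0.15)
significant = lo > 1 or hi < 1  # significant at the 5% level if the CI excludes 1
print(f"OR = {or_est:.2f}, 95% CI ({lo:.2f}, {hi:.2f}), significant: {significant}")
```

Here the interval is roughly (1.19, 2.15); since it excludes 1, the predictor's association is statistically significant at the 5% level.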
Industry Application Expansion
Logistic regression extends across sectors to support risk-based decision making:
Fraud Detection
Credit card companies use logistic regression to score transaction risk in real-time, calculating probability of fraud based on amount, location, time, and merchant type to trigger authentication requests.
Healthcare Readmission
Hospitals predict 30-day readmission probability using patient demographics, diagnoses, and length of stay to identify high-risk patients for targeted discharge interventions and follow-up care coordination.
Credit Underwriting
Banks estimate default probability using income, credit history, debt ratios, and employment status to automate lending decisions, pricing interest rates according to risk tiers.
Manufacturing Defects
Quality engineers predict defect probability based on process parameters (temperature, pressure, speed) to implement real-time statistical process control and automatic production adjustments.
Marketing Conversion
E-commerce platforms calculate purchase probability based on browsing behavior, cart contents, and demographics to personalize discounts, recommendations, and abandonment recovery campaigns.
Frequently Asked Questions
What is the difference between logistic regression and linear regression?
Linear regression predicts continuous outcomes with unbounded range (-∞ to +∞), assumes normally distributed errors, and models E(Y|X) directly. Logistic regression predicts binary outcomes bounded [0,1], assumes a binomial response distribution, and models log-odds through the logistic link function. Linear regression can predict impossible values (probabilities >1 or <0) for binary outcomes, while logistic regression constrains predictions to valid probability ranges.
What is log-odds interpretation?
Log-odds (logit) is the natural logarithm of the odds p/(1-p). A log-odds of 0 corresponds to 50% probability (odds of 1:1). Positive log-odds indicate probability >50%; negative indicate <50%. Coefficients in logistic regression represent the change in log-odds per unit change in the predictor. Exponentiating coefficients converts log-odds changes to odds ratios.
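These relationships are easy to verify numerically with the logit and its inverse (the sigmoid):

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1 - p))

def inverse_logit(log_odds):
    """Log-odds -> probability (the sigmoid)."""
    return 1 / (1 + math.exp(-log_odds))

print(logit(0.5))          # 0.0: 50% probability corresponds to log-odds of 0
print(inverse_logit(0.0))  # 0.5: and vice versa
print(logit(0.8) > 0)      # True: probability above 50% gives positive log-odds
print(logit(0.2) < 0)      # True: probability below 50% gives negative log-odds
```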
How does classification threshold affect results?
Threshold selection determines the sensitivity-specificity trade-off. Lower thresholds (0.3) classify more observations as positive, increasing sensitivity (catching more actual positives) but decreasing specificity (more false alarms). Higher thresholds (0.7) increase specificity but decrease sensitivity. The "optimal" threshold depends on relative costs of false positives versus false negatives in your specific business context.
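The trade-off is visible in a small sweep. The labels and scores below are invented to illustrate the pattern, not taken from a real model:

```python
def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity at a given classification threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.35, 0.1]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(y_true, scores, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy data, the 0.3 threshold catches every positive (sensitivity 1.00) at the cost of false alarms (specificity 0.33), while 0.7 reverses the trade-off.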
Can logistic regression be used for multi-class classification?
Yes, through multinomial logistic regression (softmax regression) for nominal outcomes with three or more unordered categories, or ordinal logistic regression for ordered categories. Multinomial logistic regression estimates separate log-odds equations for each category relative to a reference baseline and typically assumes independence of irrelevant alternatives (IIA).
What happens if logistic regression assumptions are violated?
Violations produce biased coefficients, incorrect standard errors, or poor predictions. Non-independence requires mixed models or GEE; nonlinearity requires transformations or splines; multicollinearity requires regularization or variable removal; small samples produce unreliable estimates requiring Bayesian methods or exact logistic regression. Model misspecification leads to calibration failure where predicted probabilities don't match observed frequencies.
Predict Binary Outcomes
Run logistic regression analysis with odds ratios, ROC curves, and classification metrics.
Launch Logistic Regression →