Not sure this is the right statistical method? Use the Choose Your StatsTest workflow to select the right method.
What is Simple Logistic Regression?
Simple Logistic Regression is a statistical test used to predict a single binary variable using one other variable. It is also used to quantify the numerical relationship between two such variables. The variable you want to predict should be binary, and your data should meet the other assumptions listed below.
Simple Logistic Regression is sometimes also called Logit Regression or Binary Logistic Regression.
Assumptions for Simple Logistic Regression
Every statistical method has assumptions. Assumptions are properties that your data must satisfy in order for the results of the method to be accurate.
The assumptions for Simple Logistic Regression include:
- Linearity
- No Outliers
- Independence
Let’s dive into each one of these separately.
Linearity
Logistic regression fits a logistic curve to binary data. This curve gives the predicted probability of the outcome at each value of the independent variable. Logistic regression assumes that the relationship between the log odds of the outcome (the natural log of p / (1 - p), where p is that probability) and your predictor variable is linear.
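To make the linearity assumption concrete, here is a minimal sketch in Python. The intercept and slope values are hypothetical and chosen only for illustration; they do not come from any real data. It shows that equal steps in the predictor produce equal steps in the log odds, even though the steps in probability are unequal.

```python
import math

def inverse_logit(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p (0 < p < 1) to log odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
intercept = -3.0
slope = 0.5

xs = [0, 2, 4, 6, 8, 10]
probabilities = [inverse_logit(intercept + slope * x) for x in xs]
log_odds = [logit(p) for p in probabilities]

# Differences between neighbouring values of x.
log_odds_steps = [b - a for a, b in zip(log_odds, log_odds[1:])]
probability_steps = [b - a for a, b in zip(probabilities, probabilities[1:])]

for x, p, lo in zip(xs, probabilities, log_odds):
    print(f"x={x:>2}  probability={p:.3f}  log odds={lo:+.3f}")
```

Each log-odds step equals the slope times the step in x, which is what "linear in the log odds" means. The probability steps are largest in the middle of the curve and shrink toward either end.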
No Outliers
The variables that you care about must not contain outliers. Logistic Regression is sensitive to outliers, or data points that have unusually large or small values. You can check for outliers by plotting your variables and looking for points that lie far from all the others.
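Alongside a plot, a simple numeric screen can help. The sketch below uses the 1.5 x IQR rule, which is a common convention and an assumption here rather than something this method requires. The income values are made up for illustration.

```python
import statistics

def iqr_outliers(values, multiplier=1.5):
    """Return values lying more than multiplier * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical consumer incomes, in thousands.
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 250]

outliers = iqr_outliers(incomes)
print("Flagged as possible outliers:", outliers)
```

A flagged value is only a candidate for a closer look. Whether it is an error or a genuine extreme observation is a judgment you make with knowledge of your data.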
Independence
Each of your observations (data points) should be independent. This means that each value of your variables doesn’t “depend” on any of the others. For example, this assumption is usually violated when there are multiple data points over time from the same unit of observation (e.g. subject/participant/customer/store), because the data points from the same unit of observation are likely to be related or affect one another.
If your data have repeated measures over time from the same units of observation, you should use Mixed Effects Logistic Regression.
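One quick way to spot repeated measures is to count how many rows each unit of observation contributes. This is a minimal sketch with made-up records; the `customer_id` field is a hypothetical identifier, and your data may use a different one.

```python
from collections import Counter

# Hypothetical records: (customer_id, income_in_thousands, purchased).
records = [
    ("c01", 42, 1),
    ("c02", 35, 0),
    ("c03", 58, 1),
    ("c02", 36, 0),  # second row for customer c02
    ("c04", 27, 0),
    ("c03", 61, 1),  # second row for customer c03
]

rows_per_unit = Counter(customer_id for customer_id, _, _ in records)
repeated_units = sorted(unit for unit, count in rows_per_unit.items() if count > 1)

if repeated_units:
    print("Units with repeated observations:", repeated_units)
else:
    print("Each unit appears only once.")
```

This check only catches repeated rows for the same unit. Other sources of dependence, such as customers from the same household, need to be judged from how the data were collected.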
When to use Simple Logistic Regression?
You should use Simple Logistic Regression in the following scenario:
- You want to use one variable in a prediction of another, or you want to quantify the numerical relationship between two variables
- The variable you want to predict (your dependent variable) is binary
- You have one independent variable, or one variable that you are using as a predictor
Let’s clarify these to help you know when to use Simple Logistic Regression.
Prediction
You are looking for a statistical test to predict one variable using another. This is a prediction question. Other types of analyses include examining the strength of the relationship between two variables (correlation) or examining differences between groups (difference).
Binary Dependent Variable
The variable you want to predict must be binary. Binary data have only two possible values. Some examples of binary data include: true/false, purchased the product or not, has the disease or not, etc.
Types of data that are NOT binary include ordered data (such as finishing place in a race, best business rankings, etc.), categorical data with more than two categories (eye color, race, etc.), or continuous data (height, income, etc.).
If your dependent variable is continuous, you should use Simple Linear Regression, and if your dependent variable is categorical, then you should use Multinomial Logistic Regression or Linear Discriminant Analysis.
One Independent Variable
Simple Logistic Regression is used when there is one predictor variable measured at a single point in time.
If you have more than one independent variable, you should use another variant of logistic regression called Multiple Logistic Regression instead. If you have one independent variable but it is measured for the same group at multiple points in time, you should use Mixed Effects Logistic Regression.
Simple Logistic Regression Example
Dependent Variable: Purchase made (Yes/No)
Independent Variable: Consumer income
The null hypothesis, which is statistical lingo for the assumption that there is no effect, is that there is no relationship between consumer income and whether or not a purchase is made. Our test will assess how compatible our data are with this hypothesis.
We gather our data and, after confirming that the assumptions of logistic regression are met, we perform the analysis.
When we run this analysis, we get coefficients and p-values for each term in the model. The coefficient for consumer income is the expected increase/decrease in the log odds of our outcome variable for each unit increase in consumer income.
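To show where such a coefficient comes from, here is a minimal sketch that fits the model by Newton-Raphson in plain Python. The income and purchase values are made up for illustration, and the function `fit_simple_logistic` is a hypothetical helper, not the routine any particular statistics package uses. In practice you would let your statistical software do the fitting.

```python
import math

def sigmoid(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple_logistic(xs, ys, max_iterations=50, tolerance=1e-10):
    """Estimate (intercept, slope) by Newton-Raphson on the log-likelihood."""
    intercept, slope = 0.0, 0.0
    for _ in range(max_iterations):
        probabilities = [sigmoid(intercept + slope * x) for x in xs]
        # Gradient of the log-likelihood with respect to intercept and slope.
        g0 = sum(y - p for y, p in zip(ys, probabilities))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, probabilities))
        # Entries of the negated Hessian (a 2x2 symmetric matrix).
        weights = [p * (1.0 - p) for p in probabilities]
        h00 = sum(weights)
        h01 = sum(w * x for w, x in zip(weights, xs))
        h11 = sum(w * x * x for w, x in zip(weights, xs))
        determinant = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system for the parameter update.
        step0 = (h11 * g0 - h01 * g1) / determinant
        step1 = (h00 * g1 - h01 * g0) / determinant
        intercept += step0
        slope += step1
        if max(abs(step0), abs(step1)) < tolerance:
            break
    return intercept, slope

# Hypothetical data: income in thousands, and whether a purchase was made.
incomes = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
purchased = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_simple_logistic(incomes, purchased)
odds_ratio = math.exp(slope)  # multiplicative change in odds per unit of income

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"odds ratio per unit of income = {odds_ratio:.3f}")
```

The slope is the change in log odds for each one-unit increase in income. Exponentiating it gives an odds ratio, which many readers find easier to interpret: a value above 1 means the odds of a purchase rise as income rises.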
The p-value associated with this coefficient is the chance of seeing results at least as extreme as ours, assuming there is actually no relationship between consumer income and whether or not a purchase is made. A p-value less than or equal to 0.05 is conventionally called statistically significant, meaning that a relationship this strong would be unlikely to appear by chance alone if no real relationship existed.
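As a rough illustration of how such a p-value can be computed, the sketch below uses a Wald test, which divides the coefficient by its standard error and compares the result to a standard normal distribution. The Wald test is one common approach and is an assumption here; your software may report a different test. The coefficient and standard error are hypothetical numbers.

```python
from statistics import NormalDist

# Hypothetical estimates, chosen only for illustration.
coefficient = 0.12     # change in log odds per unit of income
standard_error = 0.05  # standard error reported alongside the coefficient

# Wald z statistic and its two-sided p-value under the null hypothesis.
z_statistic = coefficient / standard_error
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_statistic)))
is_significant = p_value <= 0.05

print(f"z = {z_statistic:.2f}, p-value = {p_value:.4f}")
print("Statistically significant at 0.05:", is_significant)
```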
In addition, this analysis will result in an accuracy measure. Accuracy is the proportion of observations whose binary outcome is correctly predicted by the logistic regression model. In this example, the accuracy would be the proportion of customers that the model correctly classified as making a purchase or not.
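Accuracy can be sketched in a few lines. The example below assumes a cutoff of 0.5 for turning predicted probabilities into yes/no predictions, which is a common default but still a choice. The outcomes and probabilities are made up for illustration.

```python
def accuracy(actual, predicted_probabilities, threshold=0.5):
    """Proportion of observations whose predicted class matches the actual class."""
    predicted = [1 if p >= threshold else 0 for p in predicted_probabilities]
    correct = sum(1 for a, b in zip(actual, predicted) if a == b)
    return correct / len(actual)

# Hypothetical data: 1 = purchase made, 0 = no purchase.
actual = [0, 0, 1, 1, 0, 1, 1, 0]
predicted_probabilities = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70]

model_accuracy = accuracy(actual, predicted_probabilities)
print(f"Accuracy: {model_accuracy:.2f}")
```

Here the model classifies 6 of the 8 customers correctly, so the accuracy is 0.75. Changing the threshold changes which customers are predicted to purchase, and therefore the accuracy.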
Frequently Asked Questions
Q: How do I run Simple Logistic Regression in SPSS, R, SAS, or STATA?
A: This resource is focused on helping you pick the right statistical method every time. There are many resources available to help you figure out how to run this method with your data:
SPSS article: http://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod4/11/index.html
SPSS video: https://www.youtube.com/watch?v=cpWSSJHuT2s
R article: https://stats.idre.ucla.edu/r/dae/logit-regression/
R video: https://www.youtube.com/watch?v=XycruVLySDg
If you still can’t figure something out, feel free to reach out.