Selecting Variables for Regression Models
Building a regression model requires knowing your data well enough to decide which variables belong in it. When you are first starting out, this process may seem overwhelming. The key to building a good regression model is an in-depth understanding of how each variable relates to the outcome you are trying to predict.
When you begin a regression analysis, first check whether anyone has modeled something similar in the past and apply what they learned to your own model. If the data is similar, it is likely that some of the same variables will be key predictors for your model. If no such information is available, performing an exploratory data analysis (EDA) on the variables in the dataset will help you understand how each variable relates to your outcome of interest. During this process you may find that some of the variables in your dataset are only weakly correlated with your outcome. These can usually be dropped from consideration, although a weak linear correlation alone does not rule out a nonlinear relationship or an interaction with another variable.
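As a rough illustration of the screening step described above, the sketch below computes the Pearson correlation between each candidate predictor and the outcome and keeps the ones above a cutoff. The toy data, the variable names (`size`, `noise`) and the 0.3 threshold are all made up for the example and are not recommendations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: one outcome and two candidate predictors.
outcome = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
candidates = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # rises steadily with the outcome
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # little relationship to the outcome
}

# Correlation of each candidate with the outcome.
correlations = {name: pearson(values, outcome) for name, values in candidates.items()}

THRESHOLD = 0.3  # arbitrary cutoff, chosen only for illustration
shortlist = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
print("shortlist:", shortlist)
```

A screen like this only captures linear, one-variable-at-a-time relationships, so it is best treated as a first pass alongside plots and subject knowledge rather than as a final verdict on any variable.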
During your EDA, or simply through common-sense reasoning or knowledge of your dataset, you will likely discover variables that do affect the outcome of interest. Take note of all of these. The next step is to determine whether any of these variables are collinear, meaning they are strongly correlated with one another and give you redundant information. Beyond the individual variables, you should also consider interaction effects. An interaction occurs when one variable affects the outcome differently depending on the value of another variable. Depending on your data, it may also be necessary to include polynomial terms, such as the square of a variable, when the relationship with the outcome is curved rather than a straight line.
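One common way to check for collinearity is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 - R^2). The sketch below does this with a small hand-written least-squares routine so that it runs without any libraries. The data and the names `x1`, `x2`, `x3` are invented for the example, and `x2` is deliberately built to be almost exactly twice `x1`. The last two lines show that interaction and polynomial terms are simply extra columns derived from existing ones.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] * len(y)] + [list(c) for c in columns]  # stored column by column
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in design]
    return solve_linear_system(xtx, xty)

def r_squared(columns, y):
    """Share of the variance in y explained by a linear fit on the given columns."""
    coef = ols(columns, y)
    fitted = [coef[0] + sum(b * col[i] for b, col in zip(coef[1:], columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(predictors):
    """VIF for each predictor: 1 / (1 - R^2) from regressing it on the others."""
    result = {}
    for name, col in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(others, col))
    return result

# Hypothetical predictors: x2 is roughly 2 * x1, x3 is largely unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.8, 6.2, 7.9, 9.9, 12.2, 13.8, 16.1]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0]

vifs = vif({"x1": x1, "x2": x2, "x3": x3})
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")

# Interaction and polynomial terms are new columns built from existing ones.
x1_times_x3 = [p * q for p, q in zip(x1, x3)]  # interaction between x1 and x3
x3_squared = [v ** 2 for v in x3]              # second-order polynomial term
```

In this toy data `x1` and `x2` come out with very large VIFs while `x3` stays close to 1, which suggests dropping one of the first two. A VIF above roughly 5 or 10 is a commonly quoted rule of thumb for concern, not a hard limit. Derived columns such as `x1_times_x3` can themselves be collinear with their parent variables, and centering the parents before multiplying often reduces that.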
If you still find yourself with too many variables, there are a few tools that can help you narrow the list. Although each has its own limitations, forward or backward stepwise selection and LASSO regression can help you identify which variables to include in your final model. The ultimate goal is to find the simplest model that still fits the data adequately.
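To make the LASSO idea concrete, here is a minimal sketch that fits it by coordinate descent on standardized predictors, minimizing (1 / (2n)) * sum of squared residuals + alpha * sum of absolute coefficients. The L1 penalty pushes the coefficients of unhelpful variables to exactly zero, which is what makes it usable for variable selection. The data, the names `a`, `b`, `c`, the generating rule for `y` and the value `alpha = 0.5` are all assumptions made for the example.

```python
def standardize(col):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso(columns, y, alpha, n_iter=200):
    """LASSO coefficients on the standardized scale, fitted by coordinate descent."""
    n = len(y)
    xs = [standardize(c) for c in columns]
    mean_y = sum(y) / n
    residual = [v - mean_y for v in y]  # all coefficients start at zero
    coef = [0.0] * len(xs)
    for _ in range(n_iter):
        for j, xj in enumerate(xs):
            # Covariance of column j with the residual after adding back
            # column j's own current contribution.
            rho = sum(x * (r + coef[j] * x) for x, r in zip(xj, residual)) / n
            new = soft_threshold(rho, alpha)
            residual = [r - (new - coef[j]) * x for x, r in zip(xj, residual)]
            coef[j] = new
    return coef

# Hypothetical data: y depends on a and b; c is a nuisance variable.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0]
c = [4.0, 1.0, 8.0, 5.0, 2.0, 7.0, 3.0, 6.0]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [3 * ai + 2 * bi + e for ai, bi, e in zip(a, b, noise)]

names = ["a", "b", "c"]
coef = dict(zip(names, lasso([a, b, c], y, alpha=0.5)))
selected = [name for name in names if coef[name] != 0.0]

for name in names:
    print(f"{name}: {coef[name]:.3f}")
print("selected:", selected)
```

The coefficients are on the standardized scale and are shrunk toward zero, so they are useful for deciding which variables to keep rather than for interpreting effect sizes. In practice you would more likely use an established implementation such as scikit-learn's `Lasso` and choose `alpha` by cross-validation instead of fixing it by hand.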