Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is widely used by data scientists as a first step in examining a dataset. This initial analysis helps them spot trends, identify missing data, and make sure that stakeholders are asking the right questions.
Before beginning any EDA, it is important to understand what your stakeholders want to learn and what problems they are trying to solve. A thorough understanding of the available data is equally important for supporting them effectively.
Once you have a complete picture of the problems at hand, find all of the available data that could pertain to them. Pull that data into datasets, joining them where relevant. Then look at the missing data first: try to determine whether there are any patterns in what is missing that bear on the problem, and note those in your EDA.
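As a rough illustration of the joining and missing-data checks described above, here is a minimal sketch using pandas. The tables, column names (site_id, region, pressure, yield_pct), and values are all hypothetical and stand in for whatever data the problem actually involves.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two small tables sharing a "site_id" key.
sites = pd.DataFrame({
    "site_id": [1, 2, 3],
    "region": ["north", "south", "south"],
})
readings = pd.DataFrame({
    "site_id": [1, 1, 2, 2, 3, 3],
    "pressure": [101.2, 101.0, 99.8, np.nan, np.nan, np.nan],
    "yield_pct": [92.0, 91.5, np.nan, 88.0, 85.0, 86.5],
})

# Join the tables on their shared key; a left join keeps every reading.
df = readings.merge(sites, on="site_id", how="left")

# Count and fraction of missing values in each column.
missing_counts = df.isna().sum()
missing_fraction = df.isna().mean()

# Check whether missing pressure readings concentrate in particular regions.
missing_by_region = df["pressure"].isna().groupby(df["region"]).mean()

print(missing_counts)
print(missing_fraction)
print(missing_by_region)
```

In this made-up data, every missing pressure reading comes from the south region, which is the kind of pattern worth noting in the EDA and raising with stakeholders.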
The next step is to start looking at the data itself. There are four main kinds of output here: quantitative summaries, quantitative plots, categorical summaries, and categorical plots. Quantitative summaries contain numerical summaries of quantitative variables (such as the mean, median, and standard deviation) as well as correlations between those variables. Quantitative plots include scatter plots, histograms, box plots, and other graphical summaries. Categorical summaries include counts of the values of each categorical variable. Categorical plots generally include bar charts of those counts and plots that explore categorical variables in relation to one another, such as grouped bar charts.
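The four kinds of output above might look something like the following sketch, which assumes pandas and matplotlib are available. The DataFrame and its columns are hypothetical, and the particular plots are just one reasonable selection rather than a required set.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical example data with two categorical and two quantitative columns.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "east"],
    "shift": ["day", "night", "day", "day", "night", "day"],
    "pressure": [101.2, 101.0, 99.8, 100.1, 99.5, 100.6],
    "yield_pct": [92.0, 91.5, 88.0, 88.4, 87.1, 90.2],
})
numeric_cols = ["pressure", "yield_pct"]

# Quantitative summaries: descriptive statistics and pairwise correlations.
numeric_summary = df[numeric_cols].describe()
correlations = df[numeric_cols].corr()

# Categorical summaries: counts per category and a two-way table of counts.
region_counts = df["region"].value_counts()
region_by_shift = pd.crosstab(df["region"], df["shift"])

# Quantitative plots (histogram, scatter plot) and a categorical plot (bar chart).
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(df["yield_pct"], bins=5)
axes[0].set_title("Histogram of yield_pct")
axes[1].scatter(df["pressure"], df["yield_pct"])
axes[1].set_title("pressure vs. yield_pct")
region_counts.plot.bar(ax=axes[2], title="Rows per region")
fig.tight_layout()

print(numeric_summary)
print(correlations)
print(region_by_shift)
```

To view or keep the figure, call fig.savefig with a file name of your choice, or drop the Agg backend line and call plt.show() in an interactive session.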
Once you have explored the data, you should seek to understand any correlations or trends that exist in your datasets. Sometimes there will be a clear reason for a correlation, but other times you may have to dig a little deeper to understand why variables are correlated. Another key thing to look at in your EDA is any outliers in the dataset. You should seek to understand each outlier and whether you can trace it to a root cause. It may be necessary to go back to your stakeholders if you do not have a complete understanding of the processes that could cause outliers in the data. For example, in some high-tech engineering applications, parameters may change because of things you are not even aware of, such as a change in atmospheric pressure due to a storm, or differences in elevation if you are looking at multi-site data.
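One common way to surface candidate outliers for this kind of follow-up is the 1.5 x IQR rule, sketched below. The text above does not prescribe a particular method, so treat this as one illustrative choice among several; the function name flag_iqr_outliers, the column names, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical multi-site pressure readings; one value is far from the rest.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "pressure": [100.9, 101.1, 101.0, 100.8, 101.2, 101.0, 100.9, 94.5],
})


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


df["is_outlier"] = flag_iqr_outliers(df["pressure"])

# Keep the other columns alongside each flagged value so it can be traced
# back to a site (or a date, batch, and so on) when looking for a root cause.
outliers = df[df["is_outlier"]]
print(outliers)
```

A flag like this only identifies rows worth a closer look. Whether a flagged reading is a measurement error or a real effect, such as a site at a different elevation, is a question for the people who know the process.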
Once you have rolled up all of your findings, you should be prepared to present your EDA to your stakeholders. Knowing your audience will help you decide how in-depth the presentation should be. From there, the stakeholders may want to explore particular data in more depth. They may want to use hypothesis testing to determine whether the correlations in the data are statistically significant. They may also choose to design experiments to understand how different variables affect their results.
As you can see, EDA is an important tool for any data scientist exploring large datasets. Understanding your stakeholders' needs and being able to draw on a wide range of data analysis techniques are both key to providing the best support for them.