Little Miss Data

Simple EDA in R with inspectdf

Note

This post was originally published in May 2019 and was updated in September 2021 to include all of the new, exciting features and visuals in more recent package versions.

Exploratory data analysis in R

Previously, I wrote a blog post showing a number of R packages and functions which you could use to quickly explore your data set. Since posting that, I’ve become aware of another exciting EDA package: inspectdf by Alastair Rushworth! As is very often the case, I became aware of this package in a Twitter post by none other than Mara Averick.

Preview of some of the inspectdf output graphs

I like this package because it’s got a lot of functionality and it’s incredibly straightforward to use. In short, it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations.

But one feature that makes it a unique EDA package is that you can run each of these functions on an individual data frame, or use them to compare two data frames.

I liked the inspectdf package so much that in this blog, I’m going to extend my previous EDA tutorial with an overview of the package.


R Basics: Working Environment

For this tutorial, we are going to be using R as our programming language. The entire code is hosted in my GitHub repo, and you can also copy and paste to follow along below. If you are looking to understand your options for an R working environment, I recommend that you check out IBM Watson Studio to run hosted R notebooks, or RStudio.

R Basics: Install and Load Packages

Before we get rolling with the tutorial, we need to get our environment ready. Please remember that if you do not have any of the packages already installed, uncomment the installation line by removing the #.

#First install devtools to allow you to install inspectdf from github
#install.packages("devtools")
library(devtools)

#Install and load the package - https://github.com/alastairrushworth/inspectdf
#devtools::install_github("alastairrushworth/inspectdf")
library(inspectdf)

#install.packages("tidyverse")
library(tidyverse)

#readr is also attached by library(tidyverse); loading it again is harmless
#install.packages("readr")
library(readr)
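
If you would rather not comment and uncomment the install lines by hand, one option is to check which packages are missing first. The following is a minimal sketch using base R only; missing_packages() is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this tutorial that are available from CRAN
to_install <- missing_packages(c("devtools", "tidyverse", "readr"))
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```

inspectdf itself is installed from GitHub in this tutorial, so it is left out of the list above.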

EDA Prep: Download the Data

We are going to be using the survey data from my previous data + art STEAM project. Note that there were some issues with survey gathering and therefore you will see some odd values in the data.

#Download the data set
df <- read_csv('https://raw.githubusercontent.com/lgellis/STEM/master/DATA-ART-1/Data/FinalData.csv', col_names = TRUE)

EDA Prep: Transform the Data Set

We will create three data frames for our tutorial.

  • allGrades is the full data frame with the complete set of survey results

  • oldGrades includes a subset of the survey results for all grades greater than 5. This includes grades 6-8.

  • youngGrades includes a subset of the survey results for all grades less than 6. This includes grades 3-5.

We will use allGrades for the single data frame analysis and oldGrades and youngGrades for the data frame comparisons.

allGrades <- df

oldGrades <- allGrades %>% 
  filter(Grade > 5)

youngGrades <- allGrades %>% 
  filter(Grade < 6)

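#View the distribution of grade to ensure it was split properly
#binwidth = 1 gives one bar per grade and avoids the default 30-bin message
ggplot(oldGrades, aes(x=Grade)) + geom_histogram(binwidth = 1)
ggplot(youngGrades, aes(x=Grade)) + geom_histogram(binwidth = 1)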
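
One thing to keep in mind with this split: filter() drops rows where the condition evaluates to NA, so any survey rows with a missing Grade end up in neither subset. A small sketch with made-up data (the toy data frame is hypothetical, not the survey data) shows a quick way to confirm the split with counts rather than plots.

```r
library(dplyr)

# Made-up stand-in for the survey data, including one missing grade
toy <- data.frame(Grade = c(3, 4, 5, 6, 7, 8, NA))

old_toy   <- toy %>% filter(Grade > 5)
young_toy <- toy %>% filter(Grade < 6)

# Count rows per grade in each subset
print(count(old_toy, Grade))
print(count(young_toy, Grade))

# Number of rows that landed in neither subset (the missing grades)
unassigned <- nrow(toy) - nrow(old_toy) - nrow(young_toy)
print(unassigned)
```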

For each of the functions, we are going to run it first against the full data frame (allGrades) to view the basic functionality. We will then pass two data frames into the function (oldGrades, youngGrades) to see how the data frame comparison works.
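
Before plotting anything, it can help to know that each inspect_ function returns a regular tibble, and show_plot() simply draws it. Here is a minimal sketch of the one-versus-two data frame pattern, assuming inspectdf is installed and using the built-in airquality data set as a stand-in for the survey data (the split by month is arbitrary):

```r
library(inspectdf)

# Built-in data set, split into two arbitrary halves for the comparison
first_half  <- airquality[airquality$Month < 7, ]
second_half <- airquality[airquality$Month >= 7, ]

# Single data frame: the result is a regular tibble
single <- inspect_types(airquality)
print(single)

# Two data frames: the same function returns a comparison tibble
paired <- inspect_types(first_half, second_half)
print(paired)
```

Printing the tibble is handy when you want the numbers behind a plot, or when you want to filter or join the summary yourself.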

Running InspectDf Functions: inspect_types()

We can use the inspect_types() command to very easily see a breakdown of character vs numeric variables.

inspect_types(allGrades)  %>% show_plot()
inspect_types(youngGrades, oldGrades)  %>% show_plot()
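
To get a feel for what is being summarised here, a rough base R equivalent is to look up the class of each column and count the results. This is only a sketch on a made-up data frame, not how inspectdf is implemented.

```r
# Made-up data frame with mixed column types
toy <- data.frame(
  name  = c("a", "b", "c"),
  grade = c(3L, 4L, 5L),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# First class of every column, then a count per type
col_types <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)
print(type_counts)
```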

Running InspectDf Functions: inspect_mem()

The inspect_mem() function will tell us some basic sizing information, including data frame columns, rows, total size and the sizes of each variable.

inspect_mem(allGrades)  %>% show_plot()
inspect_mem(youngGrades, oldGrades)  %>% show_plot()
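
If you are curious where sizes like these come from, base R's object.size() reports the memory used by any object. The sketch below applies it column by column to a made-up data frame; it illustrates the idea and is not inspectdf's own code.

```r
# Made-up data frame with three columns of different types
toy <- data.frame(
  id    = 1:1000,
  score = runif(1000),
  label = sample(c("yes", "no"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Size of each column in bytes, largest first
col_bytes <- sort(
  vapply(toy, function(x) as.numeric(object.size(x)), numeric(1)),
  decreasing = TRUE
)
print(col_bytes)

# Share of the summed column sizes taken up by each column
print(round(100 * col_bytes / sum(col_bytes), 1))

# Total size of the data frame, in a readable unit
print(format(object.size(toy), units = "Kb"))
```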

Running InspectDf Functions: inspect_na()

The inspect_na() function shows us the percentage of NA values for each variable. The comparison view is quite neat as it highlights variables with unequal NA percentages.

inspect_na(allGrades) %>% show_plot()
inspect_na(youngGrades, oldGrades) %>% show_plot()
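
The percentage of missing values per column is also easy to compute by hand, which can be a useful sanity check. A minimal base R sketch on made-up data:

```r
# Made-up data frame with some missing values
toy <- data.frame(
  grade  = c(3, 4, NA, 6),
  colour = c("red", NA, NA, "blue"),
  age    = c(8, 9, 10, 11)
)

# is.na() marks every missing cell; the column means are the missing shares
na_pcnt <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(na_pcnt)
```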

Running InspectDf Functions: inspect_num()

The inspect_num() function shows us the distribution of the numeric variables. The heat plots used for the data frame comparison are very cool. To understand the heat plot comparison a little better, I’m including a description from the package website:

When comparing a pair of dataframes using inspect_num(), the histograms of common numeric features are calculated, using identical bins. The list columns hist_1 and hist_2 contain the histograms of the features in the first and second dataframes. A formal statistical comparison of each pair of histograms is calculated using Fisher’s exact test, the resulting p value is reported in the column fisher_p.

When show_plot = TRUE, heat plot comparisons are returned for each numeric column in each dataframe. Where a column is present in only one of the dataframes, grey cells are shown in the comparison. The significance of Fisher’s test is illustrated by coloured vertical bands around each plot: if the colour is grey, no p value could be calculated, if blue, the histograms are not found to be significantly different otherwise the bands are red.

inspect_num(allGrades) %>% show_plot()
inspect_num(youngGrades, oldGrades) %>% show_plot()
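
To make the comparison described in the quote more concrete, here is a rough base R sketch of the same idea: bin two numeric vectors with identical breaks, then run Fisher's exact test on the bin counts. The values and the bin width are made up, and this is an illustration of the approach rather than the package's actual implementation.

```r
# Made-up numeric values for two groups
group_1 <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6)
group_2 <- c(3, 4, 5, 5, 6, 6, 6, 7, 7, 8)

# Identical bins for both groups
breaks <- seq(0, 8, by = 2)
hist_1 <- table(cut(group_1, breaks = breaks))
hist_2 <- table(cut(group_2, breaks = breaks))

# 2 x 4 table of bin counts, one row per group
counts <- rbind(group_1 = as.vector(hist_1), group_2 = as.vector(hist_2))
colnames(counts) <- names(hist_1)
print(counts)

# Fisher's exact test on the binned counts
fisher_p <- fisher.test(counts)$p.value
print(fisher_p)
```

A small p value suggests that the two binned distributions differ, which corresponds to the red bands mentioned in the quote.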

Running InspectDf Functions: inspect_imb()

Similar to the inspect_num() function, the inspect_imb() function allows us to understand a bit about the value distribution for our categorical variables. It shows the most prevalent value for each variable and displays how prevalent it is.

inspect_imb(allGrades) %>% show_plot()
inspect_imb(youngGrades, oldGrades) %>% show_plot()
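
As a rough illustration of what "most prevalent value" means, the base R sketch below finds the most common value in each column of a made-up data frame along with its share of all rows. top_value() is a hypothetical helper written for this example.

```r
# Made-up categorical data
toy <- data.frame(
  colour = c("red", "red", "red", "blue"),
  pet    = c("dog", "cat", "dog", "fish"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most common value in a column and its share in percent
top_value <- function(x) {
  freq <- sort(table(x), decreasing = TRUE)
  data.frame(
    value = names(freq)[1],
    pcnt  = 100 * as.numeric(freq[1]) / length(x),
    stringsAsFactors = FALSE
  )
}

# One row per column of the data frame
imbalance <- do.call(rbind, lapply(toy, top_value))
print(imbalance)
```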


Running InspectDf Functions: inspect_cat()

A step further from inspect_imb(), inspect_cat() allows us to visualize the full distribution of our categorical values. Note that if there are a lot of unique values in a particular category, it’s not expected that you should see every value. However, it quite nicely surfaces common values.

inspect_cat(allGrades) %>% show_plot()
inspect_cat(youngGrades, oldGrades) %>% show_plot()
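
Where the imbalance view keeps only the top value, the full distribution keeps every level. In base R, a quick way to see the same kind of information for a single variable is a sorted proportion table. The pet vector below is made up for illustration.

```r
# Made-up categorical variable
pet <- c("dog", "cat", "dog", "fish", "dog", "cat")

# Share of each level, most common first
pet_share <- sort(prop.table(table(pet)), decreasing = TRUE)
print(round(pet_share, 2))
```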

Running InspectDf Functions: inspect_cor()

We finish off our review with the inspect_cor() function. It calculates the Pearson correlation coefficient for each pair of numeric variables so that we can see how they may relate to one another. onlinestatbook.com has a great definition of the Pearson correlation coefficient below.

The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable.

inspect_cor(allGrades) %>% show_plot()
inspect_cor(youngGrades, oldGrades) %>% show_plot()
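
If you want to check a single correlation by hand, base R's cor() and cor.test() compute the same Pearson coefficient. A minimal sketch on made-up data (the column names are hypothetical):

```r
# Made-up numeric data: score rises with hours, noise is unrelated
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6),
  score = c(52, 55, 61, 64, 70, 75),
  noise = c(5, 1, 4, 2, 6, 3)
)

# Pearson correlation for every pair of columns
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# Coefficient and confidence interval for one pair
hours_score <- cor.test(toy$hours, toy$score, method = "pearson")
print(hours_score$estimate)
print(hours_score$conf.int)
```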

Thank You for Reviewing EDA with InspectDf with Me!

Thank you for exploring the inspectdf package with me. Please comment below if you enjoyed this blog, have questions, or would like to see something different in the future. Note that the full code is available on my GitHub repo.

If you have trouble downloading the files or cloning the repo from GitHub, please go to the main page of the repo and select "Clone or Download" and then "Download Zip". Alternatively, you can execute the following R commands to download the whole repo through R:

install.packages("usethis")
library(usethis)
use_course("https://github.com/lgellis/MiscTutorial/archive/master.zip")