EDA using Pandas Profiling | EDA part-2
“Pandas profiling gives EDA a boost.”
pandas-profiling is a Python package that is widely used to speed up exploratory data analysis (EDA). It generates profile reports from a pandas DataFrame. DataFrame.describe() is useful but too basic for serious exploratory data analysis; pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column, it presents the following statistics, where they are relevant to the column type, in an interactive HTML report:
- Type inference: detects the types of the columns in a DataFrame.
- Essentials: type, unique values, missing values.
- Quantile statistics: minimum, Q1, median, Q3, maximum, range, interquartile range.
- Most frequent values.
- Correlations: highlights highly correlated variables and provides Spearman, Pearson, and Kendall matrices.
- Missing values: a matrix, count, heatmap, and dendrogram of missing values.
- Text analysis: reports on the categories (uppercase, space), scripts, and blocks of text data.
- File and image analysis: extracts file sizes, creation dates, and dimensions, and scans for truncated images or images containing EXIF information.
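To make a few of these statistics concrete, here is a minimal sketch that computes some of them by hand for a single column, using plain pandas. The small `ages` series is made up for illustration, and `summarize` is a hypothetical helper, not part of pandas_profiling; the package may compute its numbers differently internally. The sketch only shows what the reported values mean.

```python
import pandas as pd

# Made-up values for illustration only; None marks a missing entry.
ages = pd.Series([22, 38, 26, 35, None, 54, 2, 27, 26], name="age")


def summarize(col: pd.Series) -> dict:
    """Hand-computed versions of a few per-column report statistics."""
    # Quantiles ignore missing values by default.
    q1, median, q3 = col.quantile([0.25, 0.5, 0.75])
    return {
        "dtype": str(col.dtype),              # type as inferred by pandas
        "n_unique": col.nunique(),            # distinct non-missing values
        "n_missing": int(col.isna().sum()),   # number of missing values
        "min": col.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": col.max(),
        "range": col.max() - col.min(),
        "iqr": q3 - q1,                       # interquartile range
        "most_frequent": col.mode().iloc[0],  # most frequent value
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A profile report collects values like these for every column at once, which is what makes it quicker than calling each pandas method separately.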
Install and import pandas-profiling
Syntax for installation:
pip install pandas-profiling
Syntax for importing:
from pandas_profiling import ProfileReport
The method used for profiling:
pandas_profiling.ProfileReport(df, **kwargs)
df: DataFrame; the data to be analyzed.
bins: int; the number of bins in each histogram. The default is 10.
check_correlation: bool; whether correlations are checked. The default is True.
correlation_threshold: float; the threshold above which a pair of variables is considered correlated. The default is 0.9.
correlation_overrides: list; names of variables that should never be rejected for being correlated. The default is None.
check_recoded: bool; whether to check for recoded correlation. This is an expensive computation, so the default is False.
pool_size: int; the number of workers in the thread pool. The default is the number of CPUs.
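To illustrate what correlation_threshold means, here is a minimal sketch in plain pandas that flags column pairs whose absolute Pearson correlation exceeds a threshold. The data and the `highly_correlated_pairs` helper are made up for illustration; this is not the package's internal implementation, which may differ in detail.

```python
import pandas as pd

# Made-up columns for illustration: "b" is almost a multiple of "a",
# while "c" has almost no linear relationship with either.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "c": [5, 1, 4, 2, 6, 3],
})


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col1, col2, r) for pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[first, second]
            if abs(r) > threshold:
                pairs.append((first, second, round(float(r), 3)))
    return pairs


pairs = highly_correlated_pairs(df, threshold=0.9)
print(pairs)
```

With the default threshold of 0.9, only the pair ("a", "b") is flagged here; raising the threshold makes the check stricter, and lowering it flags more pairs.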
Now, let’s look at the steps that demonstrate how pandas_profiling can be used for Exploratory Data Analysis(EDA).
Step1: Either create or use any dataset of your choice. Here, I am using the Titanic dataset for Exploratory Data analysis. It consists of 891 rows and 12 columns.
- passengerId: passenger ID
- survival: survival (1 = yes, 0 = no)
- pclass: passenger class (1 = first, 2 = second, 3 = third)
- name: name of the passenger
- sex: sex (male or female)
- age: age of the passenger
- sibsp: number of siblings/spouses aboard
- parch: number of parents/children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin of the passenger
- embarked: port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)
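Several of these columns store codes rather than readable labels. As a small sketch, the codes can be decoded with pandas before or after profiling. The three sample rows are made up for illustration, and the column names follow the list above as an assumption; the headers in your CSV file may be spelled or capitalized differently.

```python
import pandas as pd

# Code-to-label mappings taken from the column descriptions above.
SURVIVAL = {0: "No", 1: "Yes"}
PCLASS = {1: "First", 2: "Second", 3: "Third"}
EMBARKED = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Three made-up rows for illustration; the column names are an assumption.
sample = pd.DataFrame({
    "survival": [0, 1, 1],
    "pclass": [3, 1, 2],
    "embarked": ["S", "C", "Q"],
})

# Replace each code with its label, leaving the original frame untouched.
decoded = sample.assign(
    survival=sample["survival"].map(SURVIVAL),
    pclass=sample["pclass"].map(PCLASS),
    embarked=sample["embarked"].map(EMBARKED),
)
print(decoded)
```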
Step2: Import the pandas and pandas_profiling packages.
import pandas as pd
from pandas_profiling import ProfileReport
Step3: Read the CSV file into a pandas DataFrame.
df = pd.read_csv('titanic.csv')
Step4: Display the DataFrame (df).
df.head()
Step5: Create an object by passing the DataFrame (df) to ProfileReport(), which makes data profiling and the EDA process a breeze.
profile = ProfileReport(df)
Step6: Use the ‘profile’ object created above to save the report as an HTML file.
profile.to_file("titanic.html")
Step7: Open the titanic.html file in any browser to view the pandas profiling report.
Pandas Profile Report of Titanic dataset is given below:
This shows that pandas_profiling lets us carry out exploratory data analysis in just 4-5 lines of code, which is why we can say that it boosts EDA.