EDA using Pandas Profiling | EDA part-2

Aug 28, 2020 EDA , pandas , pandas_profiling , ProfileReport() , dataframe , df , profile.to_file(), 10966 Views

This article demonstrates how pandas-profiling speeds up exploratory data analysis (EDA). With this package we can quickly run EDA on any dataset.

EDA using Pandas Profiling

“Pandas profiling gives EDA a boost.”

Pandas-Profiling:

It is a Python package that is widely used to speed up EDA. It generates profile reports from a pandas DataFrame. DataFrame.describe() is great but too basic for serious exploratory data analysis. Importing pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

The interactive HTML report presents the following for each column, where relevant to the column type:

Type inference: detects the types of the columns in a DataFrame.
Essentials: type, unique values, missing values.
Quantile statistics: minimum, Q1, median, Q3, maximum, range, interquartile range.
Most frequent values.
Histogram.
Correlations: highlights highly correlated variables and shows Spearman, Pearson, and Kendall matrices.
Missing values: matrix, count, heatmap, and dendrogram of missing values.
Text analysis: categories (uppercase, space), scripts, and blocks of text data.
File and image analysis: extracts file sizes, creation dates, and dimensions, and scans for truncated images or those containing EXIF information.
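
To get a feel for what the report automates, a few of these statistics can be computed by hand with plain pandas. The sketch below is not how pandas-profiling works internally, and the DataFrame values are made up for illustration.

```python
import pandas as pd

# Made-up sample values, used only to illustrate the statistics.
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "embarked": ["S", "C", "S", "S", "Q"],
})

# Essentials: inferred type, distinct values, and missing values per column.
essentials = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "n_missing": df.isna().sum(),
})

# Quantile statistics for one numeric column.
q1, median, q3 = df["fare"].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
value_range = df["fare"].max() - df["fare"].min()

# Most frequent value of a categorical column.
most_frequent = df["embarked"].mode().iloc[0]

# Pearson correlation matrix of the numeric columns
# (corr() also accepts method="spearman" and method="kendall").
pearson = df[["age", "fare"]].corr(method="pearson")

print(essentials)
print(q1, median, q3, iqr, value_range, most_frequent)
print(pearson)
```

The profile report gathers all of this, plus the plots, for every column in one pass.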

Prerequisites:

Install and import pandas-profiling.

Syntax for installation:

pip install pandas-profiling

Syntax for importing:

import pandas_profiling

The method used for profiling:

pandas_profiling.ProfileReport(df, **kwargs)

Parameters:

df: DataFrame, the data to be analyzed.

bins: int, the number of bins in each histogram. The default is 10.

check_correlation: bool, whether correlations are checked or not. The default is True.

correlation_threshold: float, the threshold used to decide whether a variable pair is correlated. The default is 0.9.

correlation_overrides: list, variable names that will not be rejected for high correlation in any case. The default is None.

check_recoded: bool, whether to check for recoded variables. This is an expensive computation, so the default is False.

pool_size: int, the number of workers in the thread pool. The default is the number of CPUs.
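
To make the idea behind correlation_threshold concrete, here is a pandas-only sketch that flags variable pairs whose absolute Pearson correlation exceeds 0.9. It is an illustration of the concept, not the package's own implementation; the helper name highly_correlated_pairs and the sample values are hypothetical.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (left, right, correlation) for numeric column pairs above threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="pearson").abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        # Only look at columns to the right, so each pair is reported once.
        for right in cols[i + 1:]:
            value = corr.loc[left, right]
            if value > threshold:
                pairs.append((left, right, round(float(value), 3)))
    return pairs


# Made-up sample: fare_with_tax is an exact multiple of fare, age is unrelated.
sample = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "fare_with_tax": [11.0, 22.0, 33.0, 44.0, 55.0],
    "age": [22.0, 54.0, 26.0, 35.0, 30.0],
})

pairs = highly_correlated_pairs(sample, threshold=0.9)
print(pairs)
```

Lowering the threshold flags more pairs, and raising it flags fewer.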

Now, let’s look at the steps that show how pandas_profiling can be used for Exploratory Data Analysis (EDA).

Step1: Create a dataset or use any dataset of your choice. Here, I am using the Titanic dataset, which consists of 891 rows and 12 columns.

Data description:

passengerId: passenger ID
survival: survival (1 = yes, 0 = no)
pclass: passenger class (1 = first, 2 = second, 3 = third)
name: name of the passenger
sex: sex (male or female)
age: age of the passenger
sibsp: number of siblings/spouses aboard
parch: number of parents/children aboard
ticket: ticket number
fare: passenger fare
cabin: cabin number of the passenger
embarked: port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)
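
The coded columns in this description can be turned into readable labels with pandas. The following is a small sketch with illustrative rows; the capitalised column names (Survived, Pclass, Embarked) are an assumption about the CSV header and may differ in your copy of the file.

```python
import pandas as pd

# Code-to-label mappings taken from the data description.
SURVIVAL = {0: "no", 1: "yes"}
PCLASS = {1: "first", 2: "second", 3: "third"}
EMBARKED = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Illustrative rows only; the column names are assumed, not read from the file.
coded = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Embarked": ["S", "C", "Q"],
})

# Replace each code with its label, leaving the original frame untouched.
decoded = coded.assign(
    Survived=coded["Survived"].map(SURVIVAL),
    Pclass=coded["Pclass"].map(PCLASS),
    Embarked=coded["Embarked"].map(EMBARKED),
)

print(decoded)
```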

Step2: Import pandas and pandas-profiling packages.

import pandas as pd
from pandas_profiling import ProfileReport

Step3: Read the CSV file into a pandas DataFrame.

df = pd.read_csv('titanic.csv')

Step4: Display dataframe (df)

print(df)
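
Before profiling, it can help to check the shape and the missing values directly. The sketch below uses a small in-memory CSV as a stand-in for titanic.csv so that it runs without the file; the rows and the reduced set of columns are illustrative assumptions, not the full dataset.

```python
import io

import pandas as pd

# Stand-in for titanic.csv: a few illustrative rows in a Titanic-like layout.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
1,0,3,male,22,7.25,,S
2,1,1,female,38,71.2833,C85,C
3,1,3,female,26,7.925,,S
4,1,1,female,35,53.1,C123,S
5,0,3,male,,8.05,,S
"""

df = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns.
n_rows, n_cols = df.shape

# Count of missing values in each column.
missing_per_column = df.isna().sum()

print(df.head())
print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
```

With the real file, replace io.StringIO(csv_text) with 'titanic.csv'.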

Step5: Create an object by passing the DataFrame (df) to ProfileReport(), which makes data profiling and the EDA process a breeze.

profile = ProfileReport(df)

Step6: Use the ‘profile’ object created above to save the report as titanic.html.

profile.to_file(output_file='titanic.html')
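
The steps so far can be collected into one helper. This is a minimal sketch rather than a definitive script: the function name write_profile is hypothetical, and the fallback import assumes a newer environment where the package is installed under its later name, ydata-profiling (import name ydata_profiling).

```python
import pandas as pd


def write_profile(csv_path="titanic.csv", output_file="titanic.html"):
    """Read a CSV file, build a profile report, and save it as HTML."""
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        # Assumption: the package is installed under its newer name.
        from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    profile = ProfileReport(df)
    profile.to_file(output_file=output_file)
    return output_file
```

Calling write_profile() with titanic.csv in the working directory should produce titanic.html next to it.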

Step7: Open the titanic.html file in any browser to view the pandas profile report.

Pandas Profile Report of Titanic dataset is given below:

Here, it is clear that we can do Exploratory Data Analysis using pandas_profiling within 4-5 lines of code. Thus, we can say that it speeds up EDA.

