Your data science project step by step: a case study with Python, Part 1

In this article I will help you learn the basic steps and phases of a data science project, from raw data preparation, cleaning and exploration to building a machine learning model and, ultimately, operationalization.

Gathering Business Knowledge:

The most important part of creating a model is to have sound business knowledge of the problem you are trying to solve and to define your goal.

CASE STUDY: Our problem here is to find the true value of a property that we want to sell (predicting a house's sale price).

The quality of your inputs will decide the quality of your outputs.

Data Exploration

Once we have the business knowledge, the next steps are:

  • Identify the data you need for the research

The dataset we will use:

dataset

Data dictionary:

Each dataset should have a data dictionary that lists the dataset's columns and their meaning.

  • price: value of the house

Univariate Analysis and EDD

Uni means one, so we describe each variable on its own and look for patterns in it (central tendency, dispersion, count).

EDD
  • the median is the 50% row (the 50th percentile)
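
The summary statistics mentioned above can be produced with pandas. The following is a minimal sketch on made-up numbers, not the article's real dataset; only the `price` column name comes from the data dictionary.

```python
import pandas as pd

# Hypothetical toy data; only the column name `price` comes from the article.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0, 100.0]})

# describe() reports count, mean, std, min, the quartiles and max.
summary = df["price"].describe()
print(summary)

# The "50%" row of the summary is the median of the column.
median_from_summary = summary["50%"]
print(median_from_summary == df["price"].median())
```

Comparing the mean with the median in this summary is a quick way to spot skew: here the single large value pulls the mean above the median.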
Missing values

Missing values imputation

Many machine learning algorithms cannot handle missing values directly.

Solution:

  • remove them (be careful, because you lose information)
Missing values
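
As a rough illustration of the options, the sketch below counts missing values, then either drops the affected rows or fills the gaps with the column mean. The data and the `feature` column are hypothetical, and mean imputation is only one possible strategy, assumed here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in `feature`.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4],
    "feature": [4.0, np.nan, 6.0, 8.0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: remove every row that contains a missing value (loses information).
dropped = df.dropna()

# Option 2 (assumed strategy): replace missing values with the column mean.
filled = df.fillna({"feature": df["feature"].mean()})
print(filled)
```

Dropping is simplest when only a few rows are affected; imputing keeps every row at the cost of inventing a plausible value.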

A histogram is very useful for seeing how a variable is distributed.
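
To show what a histogram computes, here is a small sketch that bins a hypothetical column with NumPy instead of drawing it. The column name `feature` and the number of bins are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one value far from the rest.
df = pd.DataFrame({"feature": [1, 2, 2, 3, 3, 3, 10]})

# np.histogram returns the number of observations per bin and the bin edges.
counts, edges = np.histogram(df["feature"], bins=3)
print(counts)
print(edges)
```

With matplotlib installed, `df["feature"].hist(bins=3)` draws the same bins as a chart. An almost empty bin far from the others, as here, hints at a possible outlier.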

Outlier handling:

  • An outlier is an observation that lies far away from the rest of the sample and diverges from its overall pattern

Next, I will remove outliers from the n_hots_rooms & rainfull columns in 2 ways: quantiles and IQR (interquartile range)

Removing Outlier

Another way

Now the values are a lot closer than before.
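
The two approaches named above, quantiles and IQR, can be sketched as follows. This is an illustration on made-up numbers, not necessarily the exact code used for the article: the 0.90 quantile and the 1.5 × IQR fences are assumptions, and the column names are written as they appear in the text.

```python
import pandas as pd

# Hypothetical toy data; each column contains one obvious outlier.
df = pd.DataFrame({
    "n_hots_rooms": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0],
    "rainfull": [3.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0, 44.0, 46.0],
})

# Way 1: quantile capping. Values above the chosen quantile are pulled
# down to that quantile (0.90 is an assumed threshold for this toy data).
upper = df["n_hots_rooms"].quantile(0.90)
capped = df["n_hots_rooms"].clip(upper=upper)

# Way 2: IQR rule. Keep only the rows that lie within 1.5 * IQR of the quartiles.
q1 = df["rainfull"].quantile(0.25)
q3 = df["rainfull"].quantile(0.75)
iqr = q3 - q1
in_range = df["rainfull"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_iqr = df[in_range]

print(capped.max(), len(df), len(df_iqr))
```

Capping keeps every row and only changes the extreme values, while the IQR filter deletes the rows, so the choice depends on how much data you can afford to lose.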

Bi-variate Analysis:

The curve looks logarithmic, so we need to transform the variable to get a linear relationship.

before and after transformation
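
A log transformation of this kind might look like the sketch below. The columns `x` and `y` are hypothetical, and `np.log1p` (which computes log(1 + x)) is an assumed choice that also works when the variable contains zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data where y depends on the logarithm of x.
df = pd.DataFrame({"x": [0.0, 1.0, 10.0, 100.0, 1000.0]})
df["y"] = 2.0 * np.log1p(df["x"]) + 1.0

# Linear correlation before the transformation.
corr_before = df["x"].corr(df["y"])

# Transform x; log1p(x) = log(1 + x) is defined at x = 0.
df["x_log"] = np.log1p(df["x"])

# Linear correlation after the transformation.
corr_after = df["x_log"].corr(df["y"])
print(corr_before, corr_after)
```

The correlation with the target rises after the transformation, which is what a scatter plot shows visually when the curve straightens.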

The columns dist1, dist2, dist3 and dist4 hold the same information, so we will replace them with their average distance, which makes more sense, and drop the originals.

new column
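
One way to build the averaged column is shown below, on made-up values. The name `avg_dist` for the new column is an assumption; the four `dist` column names come from the text.

```python
import pandas as pd

# Hypothetical toy rows for the four distance columns.
df = pd.DataFrame({
    "dist1": [4.0, 5.0],
    "dist2": [4.5, 5.5],
    "dist3": [3.5, 4.5],
    "dist4": [4.0, 5.0],
})

dist_cols = ["dist1", "dist2", "dist3", "dist4"]

# Row-wise mean of the four distances (`avg_dist` is a hypothetical name).
df["avg_dist"] = df[dist_cols].mean(axis=1)

# The original columns are now redundant, so drop them.
df = df.drop(columns=dist_cols)
print(df)
```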

We will also drop the bus_ter column: it only holds one value and does not provide any useful information.

Non usable variable

  • variables with single values
dropping a column
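
Single-valued columns can also be detected programmatically instead of by eye. The sketch below uses made-up data; the values in `bus_ter` are hypothetical, only the column name comes from the text.

```python
import pandas as pd

# Hypothetical toy data; `bus_ter` holds the same value in every row.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7],
    "bus_ter": ["YES", "YES", "YES"],
})

# A column with at most one distinct value carries no information.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
print(constant_cols)

df = df.drop(columns=constant_cols)
print(df.columns.tolist())
```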

Handle categorical variables

Regression analysis treats all independent variables as numerical, so we will use dummy variables to incorporate nominal variables.

Dummy variables create a separate column for each category and assign it a value of 0 or 1.

For ordinal variables, we can use the map function to assign 1, 2, 3, ... to the ordered categories.

dummy variables
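
Both encodings can be sketched with pandas as follows. The `color` and `size` columns and their categories are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: `color` is nominal, `size` is ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# Ordinal variable: map each category to its rank.
df["size"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# Nominal variable: one 0/1 column per category.
df = pd.get_dummies(df, columns=["color"], dtype=int)
print(df)
```

For linear regression, passing `drop_first=True` to `pd.get_dummies` drops one category per variable, which avoids dummy columns that are perfectly collinear.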

Correlation

Correlation quantifies the relationship between two variables. If two independent variables are highly correlated (|corr| close to 1), it may lead to a multicollinearity problem.

In our dataset this is the case for air_qual & parks (corr=0.92), so we will delete one of them (del df['parks']).

For more information: https://en.wikipedia.org/wiki/Correlation_and_dependence
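
A correlation check of this kind might look like the sketch below. The numbers are made up and do not reproduce the 0.92 reported above, and the 0.9 cut-off is an assumption; only the column names come from the text.

```python
import pandas as pd

# Hypothetical toy data in which air_qual and parks move together.
df = pd.DataFrame({
    "air_qual": [0.40, 0.45, 0.50, 0.55, 0.60],
    "parks": [0.041, 0.046, 0.049, 0.056, 0.060],
    "price": [30.0, 22.0, 25.0, 18.0, 20.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)

# Drop one variable of a highly correlated pair (0.9 is an assumed threshold).
pair_corr = corr.loc["air_qual", "parks"]
if abs(pair_corr) > 0.9:
    del df["parks"]

print(df.columns.tolist())
```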

Conclusion

In this article we defined our business goal, got our data and did some wrangling so that the data is ready for the next step: modeling, which I will cover in Part 2.

Connect with me on https://www.linkedin.com/in/kaoutar-bentalha/

Data science enthusiast