Your data science project step by step: a case study with Python, Part 1

Kaoutar Bentalha
5 min read · Oct 15, 2020

In this article I will walk you through the basic steps and phases of a data science project, from raw data preparation, cleaning, and exploration to building a machine learning model and, ultimately, operationalization.

Gathering Business Knowledge:

The most important part of creating a model is having sound business knowledge of the problem you are trying to solve and defining your goal.

CASE STUDY: Our problem is to find the true value of a property we want to sell (predicting the sale price of a house).

The quality of your inputs decides the quality of your outputs.

Data Exploration

Once we have the business knowledge, the next steps are:

  • Identify the data you need for the research
  • Plan the data request
  • Run a quality check on the data you receive

The dataset we will use:

dataset

Data dictionary:

Each dataset should come with a data dictionary that lists its columns and their meanings:

  • price: Value of the house
  • crime_rate: Crime rate in that neighborhood
  • resid_area: Proportion of residential area in the town
  • air_qual: Quality of air in that neighborhood
  • room_num: Average number of rooms in houses of that locality
  • age: Age of the house construction in years
  • dist1: Distance from employment hub 1
  • dist2: Distance from employment hub 2
  • dist3: Distance from employment hub 3
  • dist4: Distance from employment hub 4
  • teachers: Number of teachers per thousand population in the town
  • poor_prop: Proportion of poor population in the town
  • airport: Is there an airport in the city? (Yes/No)
  • n_hos_beds: Number of hospital beds per 1000 population in the town
  • n_hot_rooms: Number of hotel rooms per 1000 population in the town
  • waterbody: Type of natural fresh water source in the city (lake/river/both/none)
  • rainfall: The yearly average rainfall in centimeters
  • bus_ter: Is there a bus terminal in the city? (Yes/No)
  • parks: Proportion of land assigned to parks and green areas in the town

Univariate Analysis and EDD

"Uni" means one, so we describe each variable on its own and look for patterns (central tendency, dispersion, count).

EDD
  • the median is the 50% row
  • the count row shows that the n_hos_beds column has 8 missing values; some columns are skewed, like crime_rate (most of its data lies between the 75% and max values); and there are outliers, all of which we need to deal with in the next steps:
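The EDD table shown above is essentially the output of pandas' describe(). A minimal sketch on a toy frame (only the column names come from the data dictionary; the values below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the housing dataset (illustrative values only)
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2],
    "n_hos_beds": [5.5, 7.2, np.nan, 6.1, 8.4],      # one missing value
    "crime_rate": [0.006, 0.027, 0.027, 0.032, 9.5], # heavily right-skewed
})

# count, mean, std, min, 25%, 50% (median), 75%, max for each column
edd = df.describe()
print(edd)

# The count row reveals gaps: n_hos_beds has fewer entries than the others
print(df.isnull().sum())
```

Comparing the count row against the number of rows is the quickest way to spot missing values before any modeling.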
Missing values

Missing value imputation

Most machine learning algorithms do not support missing values.

Solution:

  • remove them (be careful: you are throwing away information)
  • impute them with the mean, median, or 0 for numerical variables, or the mode for categorical variables
Missing values
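A minimal imputation sketch for the two options above, on invented toy data (fillna with the mean for a numerical column, with the mode for a categorical one):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "n_hos_beds": [5.5, 7.2, np.nan, 6.1, 8.4],    # numerical, one gap
    "airport": ["YES", "NO", "YES", None, "YES"],  # categorical, one gap
})

# Numerical column: fill with the mean (the median is safer with outliers)
df["n_hos_beds"] = df["n_hos_beds"].fillna(df["n_hos_beds"].mean())

# Categorical column: fill with the mode (the most frequent value)
df["airport"] = df["airport"].fillna(df["airport"].mode()[0])
```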

A histogram is very useful for visualizing a variable's distribution.
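For example, a quick histogram of a skewed variable (synthetic data standing in for crime_rate; matplotlib assumed):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Log-normal sample: right-skewed, like the crime_rate column
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(data, bins=30)
ax.set_xlabel("crime_rate")
ax.set_ylabel("frequency")
fig.savefig("crime_rate_hist.png")
```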

Outlier handling:

  • An outlier is an observation that lies far away from, and diverges from, the overall pattern of a sample
  • Outliers mostly affect the mean; it is preferable to work with the median when dealing with a few outliers
  • You can detect outliers with data visualization (histograms, scatter plots, box plots) or with a Z-score test (flagging points that fall more than 2 standard deviations from the mean)

Next, I will remove outliers from the n_hot_rooms and rainfall columns in two ways: quantiles and the IQR.

Removing Outlier

Another way

Now the values are a lot closer than before.
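The two approaches can be sketched as follows. The thresholds below (the 99th percentile and the 1.5 × IQR rule) are common defaults, not necessarily the exact ones used in the screenshots, and the data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "n_hot_rooms": [10.1, 11.2, 12.0, 10.5, 101.1],  # one extreme high
    "rainfall": [42.0, 39.0, 3.0, 41.0, 44.0],       # one extreme low
})

# Way 1: quantile capping -- clip extreme highs to an upper quantile
upper = df["n_hot_rooms"].quantile(0.99)
df["n_hot_rooms"] = df["n_hot_rooms"].clip(upper=upper)

# Way 2: IQR rule -- clip points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["rainfall"].quantile([0.25, 0.75])
iqr = q3 - q1
df["rainfall"] = df["rainfall"].clip(lower=q1 - 1.5 * iqr,
                                     upper=q3 + 1.5 * iqr)
```

Clipping (capping) keeps the rows but pulls the extreme values in; alternatively you can drop the offending rows entirely, at the cost of losing information.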

Bivariate Analysis:

The curve looks logarithmic, so we need to transform the variable to get a linear relationship.

before and after transformation
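A sketch of such a transformation, assuming a log(1 + x) transform (np.log1p), which straightens a logarithmic-looking curve and safely handles zero values; the data is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"crime_rate": [0.0, 0.1, 1.0, 10.0, 80.0]})

# log1p(x) = log(1 + x): compresses the long right tail, defined at x = 0
df["crime_rate"] = np.log1p(df["crime_rate"])
```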

The columns dist1, dist2, dist3, and dist4 carry the same information, so we will take the average distance, which makes more sense, and drop them.

new column
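The averaging step can be sketched as (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "dist1": [4.09, 4.97], "dist2": [4.08, 4.99],
    "dist3": [4.10, 5.03], "dist4": [4.09, 5.01],
})

# Collapse the four near-identical distance columns into one feature...
df["avg_dist"] = df[["dist1", "dist2", "dist3", "dist4"]].mean(axis=1)

# ...then drop the originals
df = df.drop(columns=["dist1", "dist2", "dist3", "dist4"])
```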

We will also drop the bus_ter column: it holds only one value and provides no useful information.

Non-usable variables:

  • variables with a single value
  • variables with a low fill rate
  • variables with no business sense
Dropping a column
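Dropping a single-valued column like bus_ter is a one-liner (toy frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7],
    "bus_ter": ["YES", "YES", "YES"],  # constant -> no information
})

# A column with only one distinct value cannot help a model discriminate
if df["bus_ter"].nunique() == 1:
    df = df.drop(columns=["bus_ter"])
```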

Handle categorical variables

Regression analysis treats all independent variables as numerical, so we will use dummy variables to incorporate nominal variables.

A dummy variable creates a separate column for each category and assigns it a 0 or 1 value.

For ordinal variables, we can use the map function and assign them 1, 2, 3, ... respectively.

dummy variables
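A sketch using pandas' get_dummies for the nominal columns and map for an ordinal one (the "quality" column below is invented for illustration; drop_first=True is a common choice to avoid the dummy-variable trap):

```python
import pandas as pd

df = pd.DataFrame({
    "airport": ["YES", "NO", "YES"],
    "waterbody": ["Lake", "None", "River"],
})

# Nominal variables: one 0/1 column per category;
# drop_first=True removes one column per variable to avoid perfect collinearity
df = pd.get_dummies(df, columns=["airport", "waterbody"], drop_first=True)

# Ordinal example (hypothetical column): map ordered labels to integers
quality = pd.Series(["low", "high", "medium"])
quality_num = quality.map({"low": 1, "medium": 2, "high": 3})
```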

Correlation

Correlation quantifies the relationship between two variables; when two predictors are highly correlated (corr close to 1), they can cause a multicollinearity problem.

In our case this happens for air_qual and parks (corr = 0.92), so we will delete one of them (del df['parks']).
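A sketch of the correlation check and the drop. The data below is synthetic and engineered to be highly correlated, and the 0.9 threshold is an assumption, not a universal rule:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
air_qual = rng.normal(size=200)
df = pd.DataFrame({
    "air_qual": air_qual,
    # parks built as a noisy copy of air_qual -> near-perfect correlation
    "parks": air_qual * 0.9 + rng.normal(scale=0.1, size=200),
    "price": rng.normal(size=200),
})

corr = df.corr()  # pairwise Pearson correlation matrix
# Drop one of any pair of highly correlated predictors
if corr.loc["air_qual", "parks"] > 0.9:
    del df["parks"]
```

In practice you would inspect the full matrix (e.g. with a heatmap) and keep whichever variable of each correlated pair has more business meaning.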

For more information: https://en.wikipedia.org/wiki/Correlation_and_dependence

Conclusion

In this article we defined our business goal, got our data, and did some wrangling so that our data is ready for the next step: modeling, which I will cover in Part 2!

Connect with me on https://www.linkedin.com/in/kaoutar-bentalha/
