# Your data science project step by step — case study with Python (Part 1)

In this article I will walk you through the basic steps and phases of a data science project, from raw data preparation, cleaning and exploration to building a machine learning model and, ultimately, operationalization.

**Gathering Business Knowledge:**

The most important part of creating a model is having sound business knowledge of the problem you are trying to solve and defining your goal.

**CASE STUDY:** Our problem here is to find the true value of a property we want to sell (predicting the sale price of a house).

The quality of your inputs will determine the quality of your outputs.

**Data Exploration**

Once we have the business knowledge, the next steps are:

- Identify the data you need for the research
- Plan for and request the data
- Do a quality check on the data received

The dataset we will use:
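A minimal sketch of loading the data with pandas (the file name House_Price.csv is an assumption; point it to wherever your copy of the dataset lives):

```python
import pandas as pd

# Load the dataset (the file name is an assumption, adjust the path to your copy)
df = pd.read_csv("House_Price.csv")

# Quick sanity check: dimensions and first rows
print(df.shape)
print(df.head())
```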

**Data dictionary:**

Every dataset should come with a data dictionary that lists the columns and their meaning:

- price: Value of the house
- crime_rate: Crime rate in that neighborhood
- resid_area: Proportion of residential area in the town
- air_qual: Quality of air in that neighborhood
- room_num: Average number of rooms in houses of that locality
- age: How old the house construction is, in years
- dist1: Distance from employment hub 1
- dist2: Distance from employment hub 2
- dist3: Distance from employment hub 3
- dist4: Distance from employment hub 4
- teachers: Number of teachers per thousand population in the town
- poor_prop: Proportion of poor population in the town
- airport: Is there an airport in the city? (Yes/No)
- n_hos_beds: Number of hospital beds per 1000 population in the town
- n_hot_rooms: Number of hotel rooms per 1000 population in the town
- waterbody: Type of natural fresh water source in the city (lake/river/both/none)
- rainfall: The yearly average rainfall in centimeters
- bus_ter: Is there a bus terminal in the city? (Yes/No)
- parks: Proportion of land assigned as parks and green areas in the town

**Univariate Analysis and EDD**

Uni means one, so here we describe each variable on its own and look for patterns (central tendency, dispersion, count).
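A short sketch of this step with pandas, assuming the dataframe from the loading step is named df:

```python
# EDD: count, mean, std, min, quartiles and max for every numerical column
print(df.describe())

# The categorical columns (airport, waterbody, bus_ter) can be summarized too
print(df.describe(include="object"))
```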

- The median corresponds to the 50% row of the output.
- From the count row we can see that the n_hos_beds column has 8 missing values; several columns are skewed, like crime_rate (most of its data lies between the 75% value and the max); and there are outliers that we need to deal with in the next steps.

**Missing values imputation**

Most machine learning algorithms do not support missing values.

Solutions:

- Remove them (be careful: you are also removing information)
- Impute them with the mean, median or 0 (for numerical variables) or the mode (for categorical variables), as sketched below
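A minimal imputation sketch, assuming the dataframe is named df and that n_hos_beds is the column with missing values identified in the EDD:

```python
# Count missing values per column
print(df.isnull().sum())

# Impute the missing numerical values with the column mean
df["n_hos_beds"] = df["n_hos_beds"].fillna(df["n_hos_beds"].mean())

# For a categorical column, the mode would be used instead, e.g.:
# df["waterbody"] = df["waterbody"].fillna(df["waterbody"].mode()[0])
```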

Histograms are very useful for visualizing the distribution of each variable.
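For example, with matplotlib:

```python
import matplotlib.pyplot as plt

# One histogram per numerical column to inspect distributions and skew
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()
```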

**Outlier handling:**

- An outlier is an observation that appears far away from and diverges from the overall pattern in a sample
- Outliers mostly affect the mean; it is preferable to work with the median when dealing with a few outliers
- You can detect outliers with data visualization (histograms, scatter plots, boxplots) or with the Z-score test (values that fall more than 2 standard deviations from the mean), as sketched after this list
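A quick detection sketch (crime_rate is used here only as an example column):

```python
import numpy as np
import matplotlib.pyplot as plt

# Z-score test: flag observations more than 2 standard deviations from the mean
z = (df["crime_rate"] - df["crime_rate"].mean()) / df["crime_rate"].std()
print(df.loc[np.abs(z) > 2, "crime_rate"])

# A boxplot shows the same outliers visually
df.boxplot(column="crime_rate")
plt.show()
```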

Next, I will remove outliers from the n_hot_rooms and rainfall columns in two ways: quantiles and the IQR.
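First, quantile-based capping of n_hot_rooms (capping at the 99th percentile is an assumed choice, not something fixed by the dataset):

```python
import numpy as np

# Cap the upper tail of n_hot_rooms at its 99th percentile (winsorizing)
upper = np.percentile(df["n_hot_rooms"], 99)
df.loc[df["n_hot_rooms"] > upper, "n_hot_rooms"] = upper

print(df["n_hot_rooms"].describe())
```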

Another way is the IQR (interquartile range) rule:
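A sketch of the IQR rule, applied here to the rainfall column:

```python
# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
# and capped at the nearest fence
q1, q3 = df["rainfall"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df["rainfall"] = df["rainfall"].clip(lower=lower, upper=upper)
print(df["rainfall"].describe())
```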

Now the values are a lot closer than before.

**Bi-variate Analysis:**

The curve looks logarithmic, so we need to transform the variable to get a linear relationship.
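As a sketch, assuming the variable in question is crime_rate (a reasonable guess given its skew noted in the EDD, but an assumption nonetheless):

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatter plot of price against crime_rate: the relationship looks logarithmic
plt.scatter(df["crime_rate"], df["price"])
plt.xlabel("crime_rate")
plt.ylabel("price")
plt.show()

# Log transform to make the relationship roughly linear
# (assumption: crime_rate is the variable being transformed)
df["crime_rate"] = np.log(1 + df["crime_rate"])
```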

The columns dist1, dist2, dist3 and dist4 carry the same information, so we will replace them with their average distance, which makes more sense, and drop the originals.

We will also drop the bus_ter column; it holds only one value and does not provide any useful information.
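A sketch of both steps (the new column name avg_dist is my own choice):

```python
# Replace the four distance columns with their average
df["avg_dist"] = df[["dist1", "dist2", "dist3", "dist4"]].mean(axis=1)
df = df.drop(columns=["dist1", "dist2", "dist3", "dist4"])

# bus_ter holds a single value for every row, so it carries no information
df = df.drop(columns=["bus_ter"])
```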

**Non-usable variables**

- Variables with a single value
- Variables with a low fill rate
- Variables with no business sense

# Handle categorical variables

Regression analysis treats all independent variables as numerical, so we will use dummy variables to incorporate the nominal variables.

Dummy variables create a separate column for each category and assign it a value of 0 or 1.

- For ordinal variables, we can use the map function and assign them 1, 2, 3, ... respectively (see the sketch below)
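A sketch with pandas (the ordinal example is purely illustrative, since this dataset has no ordinal column):

```python
import pandas as pd

# One 0/1 dummy column per category; drop_first avoids a redundant column
df = pd.get_dummies(df, columns=["airport", "waterbody"], drop_first=True)

# For an ordinal variable (hypothetical column "quality", for illustration only):
# df["quality"] = df["quality"].map({"low": 1, "medium": 2, "high": 3})
```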

# Correlation

Correlation quantifies the relationship between two variables; if two independent variables are highly correlated (corr ≈ 1), it may lead to a multicollinearity problem.

In our case this happens for air_qual and parks (corr = 0.92), so we will delete one of them (del df['parks']).
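A sketch of checking the correlations and dropping the redundant column:

```python
# Correlation matrix of the (now fully numerical) dataframe
corr = df.corr()
print(corr["price"].sort_values(ascending=False))

# air_qual and parks are very strongly correlated (~0.92), so keep only one
del df["parks"]
```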

For more information: https://en.wikipedia.org/wiki/Correlation_and_dependence

# Conclusion

In this article we defined our business goal, got our data and did some wrangling, so our data is ready for the next step: modeling, which I will cover in Part 2.

Connect with me on https://www.linkedin.com/in/kaoutar-bentalha/