After 7 years in the insurance industry, Agata wanted her daily work to become more technical. After completing the Fundamentals and Fullstack bootcamps, she was hired by Direct Assurance as a Data Scientist! Here’s the project she led at the end of her Fundamentals bootcamp.

Most cars sold each year around the world are second-hand. They represent **70% of the total number of cars sold in France**, or 5.6 million vehicles in 2016 (according to Le Figaro). The market is very large and largely unregulated, and prices are very often set through negotiation. As such, paying or obtaining a fair price for a car is a concern for most people at some point in their lives.

Thanks to the vast and growing quantities of data collected about our vehicles, we could reduce the doubt over fair valuation by applying Machine Learning. The raw dataset used for this exercise is available on Kaggle. It contains 371,528 records of car sale ads scraped from the German eBay site in March 2016.

## Data Cleaning

Due to the origin of the data as well as its vast quantity, **the dataset is rather “unclean”**: it contains many missing values, false values, outliers, etc. That’s why data cleaning was the most complicated and time-consuming step of the process. It was, however, crucial for the quality of the future predictions.

**Missing Values**

Thousands of missing values were detected across 6 different columns.

I started by removing observations where it was impossible or tricky to impute the missing values. This was the case for these variables:

– **notRepairedDamage**, where it was impossible to know if the car was damaged or not

– **price**, which is the dependent variable that the Machine Learning model is trying to predict.

For vehicleType, gearbox and fuelType, I imputed the missing values instead: each null value was assigned the most common value for the same car model.
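The imputation rule described above could be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: the helper name `impute_with_model_mode`, the `model` column name and the sample values are assumptions made for the example.

```python
import pandas as pd

def impute_with_model_mode(df: pd.DataFrame, column: str, group: str = "model") -> pd.Series:
    """Fill nulls in `column` with the most common value observed for the same `group`."""
    # Most frequent non-null value per car model (ties resolved by taking the first mode).
    modes = (
        df.dropna(subset=[column])
          .groupby(group)[column]
          .agg(lambda values: values.mode().iloc[0])
    )
    # Rows whose model has no known value at all simply stay null.
    return df[column].fillna(df[group].map(modes))

# Hypothetical sample data, for illustration only.
ads = pd.DataFrame({
    "model": ["golf", "golf", "golf", "a4", "a4"],
    "gearbox": ["manuell", "manuell", None, "automatik", None],
})
ads["gearbox"] = impute_with_model_mode(ads, "gearbox")
```

The same helper can be called for `vehicleType` and `fuelType`. Models that have no known value for a column are left untouched, so they would need a separate decision.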

## False values & outliers

After closer inspection, it became apparent that the numerical variables price, yearOfRegistration and powerPS contain false or zero values. To deal with these, and to narrow down the scope of the study, I discarded all records where:

– the price is less than 100 EUR or over 100,000 EUR

– the year of registration is before 1950 or after 2016.
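As a rough sketch, the two range filters could be expressed with boolean masks in pandas. The sample rows are made up for the example, and the bounds are treated as inclusive, which is an assumption about how the thresholds were applied.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
ads = pd.DataFrame({
    "price": [0, 99, 100, 5500, 100000, 250000],
    "yearOfRegistration": [2005, 2010, 1949, 2012, 2016, 2017],
})

# Series.between is inclusive on both ends by default.
valid_price = ads["price"].between(100, 100_000)
valid_year = ads["yearOfRegistration"].between(1950, 2016)

# Keep only the records that pass both checks.
cleaned = ads[valid_price & valid_year]
```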

The false brake horsepower values can be easily replaced by applying a similar technique to the one I used for imputation, which is inserting average BHP (horsepower) values for a given model. I have done this for records where the BHP values were zero or greater than 600.
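One possible way to write this replacement with pandas is shown below. The function name and the `model` column name are assumptions, and so is the choice to compute each model's average from the plausible values only (the text does not say whether the false values were excluded from the average).

```python
import pandas as pd

def replace_false_power(df: pd.DataFrame, column: str = "powerPS",
                        group: str = "model", upper: float = 600) -> pd.Series:
    """Replace zero or implausibly high power values with the average for the same model."""
    power = df[column].astype(float)
    # Turn false values (zero, or above the upper bound) into NaN.
    power = power.mask((power == 0) | (power > upper))
    # Per-model mean of the remaining values; NaN is ignored by "mean".
    model_mean = power.groupby(df[group]).transform("mean")
    return power.fillna(model_mean)

# Hypothetical sample data, for illustration only.
ads = pd.DataFrame({
    "model": ["golf", "golf", "golf", "a4", "a4"],
    "powerPS": [75, 0, 105, 150, 999],
})
ads["powerPS"] = replace_false_power(ads)
```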

I have then narrowed the sample further by keeping only cars registered in the 2000-2016 period. I did this because the valuation and depreciation of classic cars, luxury cars and sports cars follow completely different rules. If kept in the model, those outliers would negatively affect the quality of my predictions.

## Brand segmentation

The total number of brands in the dataset is 40. I have decided to group them by market segment, which would simplify the model by reducing the number of dummy variables. The brands were assigned to three segments: mass market, premium and luxury.
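The grouping can be pictured as a simple lookup followed by one-hot encoding. The brand-to-segment assignments below are hypothetical placeholders (the actual mapping used in the project is not given here), and dropping the first dummy column is an assumption, a common choice to avoid redundant columns in a linear model.

```python
import pandas as pd

# Hypothetical mapping, for illustration only; the real one covers all 40 brands.
SEGMENT_BY_BRAND = {
    "volkswagen": "mass_market",
    "opel": "mass_market",
    "bmw": "premium",
    "audi": "premium",
    "porsche": "luxury",
}

ads = pd.DataFrame({"brand": ["volkswagen", "bmw", "porsche", "opel"]})
ads["segment"] = ads["brand"].map(SEGMENT_BY_BRAND)

# Three segments need only two dummy columns once a reference level is dropped.
dummies = pd.get_dummies(ads["segment"], prefix="segment", drop_first=True)
```

With this encoding the model carries two segment columns rather than up to 39 brand columns.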

## Modelling

The exploratory data analysis suggested a negative correlation between the age and mileage of a vehicle and its price, and a positive correlation between brake horsepower and price.

Based on data visualisation, which showed a linear relationship between age and vehicle price, I selected the Multiple Linear Regression model. The model was trained on 70% of the dataset and yielded the following coefficients:

**We can interpret the above as follows:**

For a starting value of 11,428€,

- every unit of brake horsepower increases the price by 64€

- every 100km driven reduces the price by 60€

- every year since registration reduces the price by 578€

- the presence of any unrepaired damage reduces the price by 2,063€

and so on.
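To make the interpretation concrete, here is a small worked example that plugs the four coefficients quoted above into the linear formula. It is only a partial estimate: the remaining coefficients (the "and so on") are not listed, so they are left out, and the function name and the sample car are made up for the illustration.

```python
# Coefficients as quoted in the text above (in EUR).
INTERCEPT_EUR = 11_428
EUR_PER_PS = 64
EUR_PER_100_KM = -60
EUR_PER_YEAR = -578
EUR_UNREPAIRED_DAMAGE = -2_063

def estimate_price(power_ps: float, kilometres: float, age_years: float,
                   unrepaired_damage: bool) -> float:
    """Partial linear estimate using only the four coefficients quoted in the text."""
    return (
        INTERCEPT_EUR
        + EUR_PER_PS * power_ps
        + EUR_PER_100_KM * kilometres / 100
        + EUR_PER_YEAR * age_years
        + (EUR_UNREPAIRED_DAMAGE if unrepaired_damage else 0)
    )

# Hypothetical car: 100 PS, 10,000 km, 2 years old, no unrepaired damage.
# 11,428 + 6,400 - 6,000 - 1,156 = 10,672
estimate = estimate_price(100, 10_000, 2, unrepaired_damage=False)
```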

## Model evaluation

The coefficient of determination (R-squared) is widely used as an indicator of the performance of regression models. The closer it is to 1, the closer the given model’s predictions are to reality. Our model’s R-squared coefficient is 0.70, indicating a good fit.
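For reference, R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it with NumPy on synthetic data, fitting ordinary least squares on 70% of the rows and scoring on the remaining 30%. The data, the coefficients used to generate it and the choice to score on the held-out rows are all assumptions for the illustration, not the project's dataset or results.

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 1_000
power = rng.uniform(50, 300, n)
age = rng.uniform(0, 16, n)
price = 11_000 + 60 * power - 600 * age + rng.normal(0, 2_000, n)

# Design matrix with an intercept column; rows are already in random order.
X = np.column_stack([np.ones(n), power, age])
split = n * 7 // 10  # 70% for training, 30% held out

# Ordinary least squares fit on the training rows.
coef, *_ = np.linalg.lstsq(X[:split], price[:split], rcond=None)
score = r_squared(price[split:], X[split:] @ coef)
```

A perfect model scores 1, while a model that always predicts the mean price scores 0.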

The difference between the model’s prediction and the true value is known as the error. The histogram below shows that our model’s error is approximately normally distributed and concentrated around zero, which in statistical terms is very satisfactory.

## Conclusion & perspectives

Given the limited scope of this project, a simplified **Multiple Linear Regression** model trained on static, historical data has fulfilled the project’s goals.

Thanks to the great power of Machine Learning, the idea could potentially be taken much further. For example, the model could be made dynamic and use more granular data for even more precise results. Another interesting aspect to explore is whether and how the rate of depreciation differs among car brands and models. Machine Learning could help us choose a car that loses value more slowly than other comparable models.