Introduction
This tutorial is intended to help beginners learn tree-based modeling from scratch. By the end of it, you should be able to use tree-based algorithms to build predictive models.
Explore the data using Pandas
The first thing we'll need to do is load in the dataset. We will use the UK government's land registry data (Price Paid Data). The goal of the tutorial is to predict how much a house will sell for.
You can find and download the data at: (download FULL Price Paid Data-Single file 1995-2017)
A description of the columns in the data is available at: .
We will use the Pandas library to process the data. Pandas is one of the core libraries for Python data science: it provides powerful, easy-to-use data structures and the means to quickly perform operations on them. After loading the data, we'll take a first look at it with the head() function, as below.
import pandas as pd
#Importing the data
#The file is comma-separated (the Pandas default) and has no header row
data = pd.read_csv("~/pp-monthly-update-new-version.csv", header=None, encoding="ISO-8859-1")
data.head()
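Because the file has no header row, pandas labels the columns with integers; the column-index operations used later in the tutorial assume 16 columns numbered 0 to 15. A quick optional check:

#Optional: row/column counts and the integer column labels
print(data.shape)
print(data.columns.tolist())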
Handle missing values
Missing values can arise for several reasons, so let's check whether our data contains any; missing entries show up as NaN or None.
#Checking missing values
missing_values_count = data.isnull().sum()
missing_values_count
There are many ways to handle missing values in a dataset. Here, we'll use one of the most reliable: imputation.
Imputation fills in the missing values with some number. The imputed value won't be exactly right in most cases, but it usually yields more accurate models than dropping the column entirely.
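For contrast, the drop-the-column alternative mentioned above is a one-liner; it is shown here only for illustration, and we stick with imputation below.

#For illustration only: drop every column that contains at least one missing value
data_without_missing = data.dropna(axis=1)

Now back to imputation: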
from sklearn.impute import SimpleImputer

#Impute: most_frequent also works for the text columns (the default, mean, would fail on them)
my_imputer = SimpleImputer(strategy="most_frequent")
#Keep the result as a DataFrame so the integer column labels used below still work
data = pd.DataFrame(my_imputer.fit_transform(data))
Feature selection
Now let's do some feature selection for our prediction. For the sake of simplicity, we won't complicate this task and will pick just three features to predict the house price.
Each row in our data describes a house purchase: the price that was paid, plus the characteristics of the house and the transaction. We will predict the price paid from these features: the lease duration, the property type, and whether or not the property is in London, found in the 4th, 6th, and 11th columns. So let's drop the remaining columns and apply a quick transformation to the 11th, setting it to "T" if the property is in London and "F" otherwise.
#Selecting The Prediction Target
y = data[1]
#Selecting features
X = data.drop(data.columns[[0,1,2,3,5,7,8,9,10,12,13,14,15]], axis=1)
X[11] = X[11].apply(lambda x: "T" if 'LONDON' in x else "F")
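Before moving on, it is worth a quick (optional) look at the new London flag to confirm it behaves as expected:

#Optional check: distribution of the T/F London flag
print(X[11].value_counts())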
Splitting the data
We split our data into a training set and a test set, holding out 20% of the rows for testing.
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
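A quick, optional sanity check that the 80/20 split came out as expected:

#Optional: confirm the split sizes
print(train_X.shape, test_X.shape)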
Convert categorical data using one-hot encoding
The features we've selected are currently categorical. However, for our regression problem, we need to convert them into numerical values.
The standard approach for doing this is one-hot encoding. It creates new binary columns, one per possible value in the original data, indicating whether that value is present in each row.
Conveniently, Pandas offers a method called get_dummies() that does the job, as follows.
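To make that concrete, here is a toy example (hypothetical values, not our data) of what get_dummies() produces for a single categorical column:

#Toy example: one binary column per distinct category ("duration_F", "duration_L")
toy = pd.DataFrame({"duration": ["F", "L", "F"]})
print(pd.get_dummies(toy))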
#One-Hot Encoding
train_X = pd.get_dummies(train_X)
test_X = pd.get_dummies(test_X)
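One caveat that the code above does not handle: encoding the training and test sets separately can yield different columns if a category shows up in only one of them. A minimal safeguard, assuming you want the test set forced onto the training columns, is to align the two frames:

#Align the encoded frames so both have the same columns,
#filling categories missing from the test set with 0
train_X, test_X = train_X.align(test_X, join="left", axis=1, fill_value=0)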
Your first model: Decision tree
We will use the scikit-learn library to create our models. Scikit-learn is easily the most popular library for modeling the kinds of data typically stored in DataFrames.
For our regression problem, scikit-learn offers the DecisionTreeRegressor class. Its fit() method takes the features X and the targets y as input (note: y here holds real-valued prices, not class labels).
Once the model is fitted, we can use it to make predictions and evaluate them with the mean absolute error (MAE), i.e. the average of the absolute differences between actual and predicted prices.
#Building the first decision tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
tree_model = DecisionTreeRegressor(random_state=1)
tree_model.fit(train_X, train_y)
tree_preds = tree_model.predict(test_X)
print(mean_absolute_error(test_y, tree_preds))
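A single unconstrained tree can overfit. As an optional experiment (not part of the original tutorial), we can cap the tree size with max_leaf_nodes and compare the MAE for a few values; get_mae below is a hypothetical helper:

#Optional experiment: how does tree size affect test MAE?
def get_mae(max_leaf_nodes, train_X, test_X, train_y, test_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))

for leaf_nodes in [50, 500, 5000]:
    print(leaf_nodes, get_mae(leaf_nodes, train_X, test_X, train_y, test_y))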
Random Forest
After training our first decision tree, let's try a random forest and compare the results.
A random forest is, broadly speaking, an improvement on a single decision tree. It works on the same principle as bagging (training many trees on random resamples of the data and averaging their predictions), with an additional randomization step in the choice of features considered at each split, which reduces the variance of the resulting estimator. Another nice property of random forest models is that they generally work reasonably well even without much tuning.
Next, we build a random forest model in much the same way as we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.
#Building the random Forest
from sklearn.ensemble import RandomForestRegressor
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(test_X)
print(mean_absolute_error(test_y, forest_preds))
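If you want to experiment further, the two knobs most closely tied to the randomization described above are the number of trees and the number of features considered at each split. A sketch with hypothetical values:

#Optional: a larger forest with explicit per-split feature subsampling
tuned_forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                                     n_jobs=-1, random_state=1)
tuned_forest.fit(train_X, train_y)
print(mean_absolute_error(test_y, tuned_forest.predict(test_X)))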
XGBoost
Now let's look at a more sophisticated technique: XGBoost, which stands for Extreme Gradient Boosting.
This technique has become very successful in Kaggle competitions, as it is fast and efficient: it supports both tree-based and linear base learners and can perform parallel computation on a single machine.
For now, we won't go into the details of how this model works; let's proceed directly to the prediction.
#Building XGboost
from xgboost import XGBRegressor
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
eval_set=[(test_X, test_y)], verbose=False)
xgb_preds = my_model.predict(test_X)
print(mean_absolute_error(test_y, xgb_preds))
XGBoost has a few parameters that can dramatically affect your model's accuracy and training speed; the ones we use in this tutorial are:
- n_estimators: the maximum number of rounds in the modeling cycle, i.e. how many trees get added to the ensemble.
- early_stopping_rounds: stops the model from iterating once the validation score has not improved for that many rounds, even if we haven't reached the hard limit set by n_estimators (5 is a reasonable value). In newer xgboost releases this argument is passed to the constructor rather than to fit(); see the sketch after this list.
- learning_rate: in general, a small learning rate (combined with a large number of estimators) yields more accurate XGBoost models, although training takes longer since the model makes more iterations through the cycle.
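As mentioned above, in recent xgboost releases early_stopping_rounds has moved from fit() to the XGBRegressor constructor. If that applies to your installed version, the equivalent call looks roughly like this (a sketch; verify against your own release):

#Sketch for newer xgboost releases where early_stopping_rounds is a constructor argument
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05,
                        early_stopping_rounds=5)
my_model.fit(train_X, train_y, eval_set=[(test_X, test_y)], verbose=False)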
As you can see, XGBoost does very well compared to both the random forest and the decision tree.
That wraps up this tutorial. Thanks for reading :)