Scikit-Learn ML pipeline for regression

Figure 1: Scikit-Learn ML pipeline with multiple transformations

In this post we will see how to use Scikit-Learn pipelines to transform data and train ML models.
A Scikit-Learn pipeline can have multiple stages, which run in a specific order. Each stage is either a Transformer or an Estimator. More details on Scikit-Learn ML pipelines are available here.
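
As a minimal sketch of this idea, the snippet below chains one Transformer (StandardScaler) and one Estimator (LinearRegression) into a single Pipeline. The toy arrays and the step names 'scaler' and 'regressor' are made up for illustration and are not part of the Auto MPG example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): y = 2 * x + 1, no noise
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Stages run in the order listed: first the Transformer, then the Estimator
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# fit() fits and applies the scaler, then fits the regressor on the scaled data
toy_pipeline.fit(X, y)

# predict() applies the already-fitted scaler before calling the regressor
toy_prediction = toy_pipeline.predict(np.array([[5.0]]))
print(toy_prediction)
```

Because the scaler lives inside the pipeline, new data passed to predict() is scaled with the statistics learned during fit().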

Auto MPG dataset

As the example we will use the Auto MPG dataset, which records the fuel efficiency of cars. The original dataset is available here.
I have pre-processed this data slightly to make it easier to use in this example (adding headers, changing the column delimiter, etc.). The pre-processed dataset can be downloaded from here.

The data contains these columns: MPG, Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year, Origin, Car Name. MPG (miles per gallon) is the fuel efficiency of the car and is the value we will predict.

Reading the data

To read the data, we will use the Pandas library:

import pandas as pd
mpg_data = pd.read_csv('auto-mpg.csv.gz', sep='|')

We can see details of the loaded data, such as the type of each column and its count of non-null values, using:

mpg_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MPG           398 non-null    float64
 1   Cylinders     398 non-null    int64  
 2   Displacement  398 non-null    float64
 3   Horsepower    392 non-null    float64
 4   Weight        398 non-null    float64
 5   Acceleration  398 non-null    float64
 6   Model Year    398 non-null    int64  
 7   Origin        398 non-null    object 
 8   Car Name      398 non-null    object 
dtypes: float64(5), int64(2), object(2)
memory usage: 28.1+ KB

All columns except Origin and Car Name are numeric.
Horsepower is missing 6 values (392 non-null out of 398 rows). We will fill these values using an imputer.
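
The missing-value count can also be read off directly instead of from the info() output. The sketch below uses a small made-up DataFrame (the name toy and its values are hypothetical) rather than the real mpg_data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for mpg_data
toy = pd.DataFrame({
    "Horsepower": [130.0, np.nan, 150.0, np.nan],
    "Weight": [3504.0, 3693.0, 3436.0, 3433.0],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on mpg_data should report the missing Horsepower values, assuming the file loads as shown above.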

Origin is a non-numeric column. Let us see the distinct values in the column:

mpg_data['Origin'].unique()

array(['USA', 'Japan', 'Europe'], dtype=object)

Origin is a categorical column. We will convert it to one-hot encoding.

Similarly, Car Name is a string column. However, it does not have any direct relation with fuel efficiency, so we will drop this column.

mpg_data = mpg_data.drop('Car Name', axis=1)

Visualizing the data

Let us visualize this data using Seaborn. A pairplot renders the pairwise relationships in the dataset.
The pairplot of selected columns can be drawn using:

import seaborn as sns
sns.pairplot(mpg_data[['MPG', 'Cylinders', 'Displacement', 'Weight', 'Horsepower']], diag_kind='kde')

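Figure 2: Pairplot of MPG, Cylinders, Displacement, Weight and Horsepower

To see the pairwise correlation of the numeric features we use a heatmap (numeric_only=True skips the non-numeric Origin column, which recent Pandas versions would otherwise reject):

corr = mpg_data.corr(numeric_only=True)
sns.heatmap(corr, annot=True)

Figure 3: Correlation heatmap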

We can see in the scatter plots and the correlation matrix that MPG decreases as horsepower, weight and displacement increase.
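
To make the idea of a negative correlation concrete, here is a small sketch with made-up numbers (not taken from the dataset) in which one series falls as the other rises:

```python
import numpy as np

# Made-up values where mpg falls as weight rises
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0])
mpg = np.array([35.0, 30.0, 24.0, 20.0, 15.0])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson coefficient
r = np.corrcoef(weight, mpg)[0, 1]
print(round(r, 3))
```

A coefficient close to -1 corresponds to the strongly negative cells of a correlation heatmap.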

Build pipeline and transform the data

We will transform only the features, not the labels. So we split the data into features and labels:

mpg_data_labels = mpg_data['MPG'].copy().to_numpy()
mpg_data_features = mpg_data.drop('MPG', axis=1)

We will build two pipelines to transform the data. The first pipeline is for the numeric columns and uses these Transformers:
SimpleImputer: Since the Horsepower column has missing values, we fill them with the median of the remaining values.
StandardScaler: The features in the dataset are not in the same range. Many machine learning algorithms perform better when the features are on similar scales, so we use StandardScaler to scale each feature to zero mean and unit variance.

This pipeline is defined as below:

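from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_tr_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])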
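
To see what these two stages do, the sketch below runs the same kind of pipeline on a single hypothetical column with one missing value. The array toy_column and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single column with one missing value
toy_column = np.array([[100.0], [np.nan], [150.0], [200.0]])

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_scaled = toy_num_pipeline.fit_transform(toy_column)

# The fitted imputer stores the median of the observed values (100, 150, 200)
learned_median = toy_num_pipeline.named_steps["imputer"].statistics_[0]
print(learned_median)

# After standardization the column has mean ~0 and standard deviation ~1
print(toy_scaled.mean(), toy_scaled.std())
```

The missing entry is replaced by the median before scaling, so it ends up at the center of the scaled column.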

The second pipeline transforms the categorical feature ‘Origin’ to one-hot encoding. This pipeline has the following Transformers:
OrdinalEncoder: Before converting Origin to one-hot encoding, we assign an integer index to each unique value. OrdinalEncoder performs this operation.
OneHotEncoder: Once the OrdinalEncoder stage is complete, this transformation converts the integer indices to one-hot encoding. Recent Scikit-Learn versions of OneHotEncoder also accept string categories directly, so the OrdinalEncoder stage is optional.

This pipeline is defined as below:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cat_tr_pipeline = Pipeline([
        ('ordinal_encoder', OrdinalEncoder()),
        ('one_hot_encoder', OneHotEncoder()),
    ])
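
The two encoding stages can be followed step by step on a handful of values. This is a sketch on a hypothetical toy_origin array, with the encoders applied one at a time rather than inside a Pipeline so that the intermediate result is visible:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical Origin values, shaped as a single column
toy_origin = np.array([["USA"], ["Japan"], ["Europe"], ["USA"]])

# Step 1: map each category to an integer index (categories are sorted alphabetically)
ordinal = OrdinalEncoder()
toy_indices = ordinal.fit_transform(toy_origin)

# Step 2: expand each index into its own 0/1 column
# (fit_transform returns a sparse matrix by default, hence toarray())
one_hot = OneHotEncoder()
toy_one_hot = one_hot.fit_transform(toy_indices).toarray()

print(ordinal.categories_)
print(toy_indices.ravel())
print(toy_one_hot)
```

Each row of the final array has exactly one 1, in the column that corresponds to its Origin value.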

We need to specify which features each pipeline should process. Since all features except Origin are numeric, they are processed by the numeric pipeline.
We define the numeric columns using:

num_attribs = [col for col in mpg_data_features.columns if col != 'Origin']

The Origin column is processed by the categorical pipeline. We define the categorical column:

cat_attribs = ['Origin']

Now we can combine both pipelines with ColumnTransformer and transform the columns:

from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
        ("num_tr_pipeline", num_tr_pipeline, num_attribs),
        ("cat_tr_pipeline", cat_tr_pipeline, cat_attribs),
    ])

mpg_transformed = full_pipeline.fit_transform(mpg_data_features)

mpg_transformed now holds the scaled numeric columns followed by the one-hot encoded Origin columns.
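
A quick way to check the result of a ColumnTransformer is to look at the shape of its output. The sketch below rebuilds the same structure on a hypothetical four-row frame (toy_features and the short step names are made up) so it can be run on its own:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical miniature of mpg_data_features: two numeric columns and Origin
toy_features = pd.DataFrame({
    "Horsepower": [130.0, np.nan, 95.0, 88.0],
    "Weight": [3504.0, 2372.0, 2130.0, 2800.0],
    "Origin": ["USA", "Japan", "Europe", "USA"],
})

toy_num = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_cat = Pipeline([
    ("ordinal_encoder", OrdinalEncoder()),
    ("one_hot_encoder", OneHotEncoder()),
])

toy_full_pipeline = ColumnTransformer([
    ("num", toy_num, ["Horsepower", "Weight"]),
    ("cat", toy_cat, ["Origin"]),
])
toy_transformed = toy_full_pipeline.fit_transform(toy_features)

# 2 scaled numeric columns + 3 one-hot columns (one per distinct Origin)
print(toy_transformed.shape)
```

The row count is unchanged, while the single Origin column is replaced by one column per category.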

Build the model, predict and evaluate the model

Before we build the model, we split the data into train and test data:

from sklearn.model_selection import train_test_split
mpg_train_data, mpg_test_data, mpg_train_labels, mpg_test_labels = train_test_split(mpg_transformed, mpg_data_labels, test_size=0.3, random_state=0)
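
To illustrate what test_size=0.3 does, here is a sketch on ten hypothetical samples. The arrays X and y are made up; only the split sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 30% of the 10 samples go to the test set, the remaining 70% to the training set
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the shuffle, and therefore the split, reproducible between runs.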

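First let us try a simple Linear Regression model. We train the model on the train data and evaluate it on the test data using the root mean squared error (RMSE). mean_squared_error returns the mean squared error by default, so we take its square root:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(mpg_train_data, mpg_train_labels)
mpg_test_predicted = lin_reg.predict(mpg_test_data)
np.sqrt(mean_squared_error(mpg_test_labels, mpg_test_predicted))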

This gives output of:

3.366034587971452
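
The metric computed above is the root mean squared error. As a sketch of what that calculation does, the snippet below works it out by hand with NumPy on made-up true and predicted values:

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# Root mean squared error: back in the same units as the label
rmse = np.sqrt(mse)
print(mse, rmse)
```

Because the RMSE is in the units of the label, the value reported for the model can be read as a typical prediction error in MPG.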

We will build another model using RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=50, random_state=0)
forest_reg.fit(mpg_train_data, mpg_train_labels)
mpg_test_predicted = forest_reg.predict(mpg_test_data)
np.sqrt(mean_squared_error(mpg_test_labels, mpg_test_predicted))

This gives output of:

2.7287667360915995

This is an improvement over Linear Regression: the test RMSE drops from about 3.37 to about 2.73. We can further tune the model parameters or try different models to improve the prediction.
Since the goal of this post is to show the usage of Scikit-Learn ML pipelines, we will stop here.
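
For readers who want to try the parameter tuning mentioned above, one possible approach is a grid search with cross-validation. This is only a sketch: it uses synthetic data from make_regression in place of the transformed MPG features, and the grid values are arbitrary choices, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the transformed MPG features
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

# Hypothetical grid; the values are arbitrary and only for illustration
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_params_ holds the combination with the best cross-validated score
print(search.best_params_)
```

The score is the negated RMSE because GridSearchCV always maximizes its scoring function.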

Complete implementation of Scikit-Learn ML Pipeline for regression

The complete implementation is available as a Jupyter Notebook and as a Python script.