The Dataset

A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans to sell term deposits to their customers such as email marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns still remain one of the most effective way to reach out to people. However, they require huge investment as large call centers are hired to actually execute these campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand so that they can be specifically targeted via call.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

Source


Check Python version. This be useful when creating a production environment

In [1]:
!python -V
Python 3.9.12
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
sklearn.__version__ # note sklearn version too
Out[2]:
'1.0.2'
In [3]:
train_df = pd.read_csv("train.csv", sep=";")
test_df = pd.read_csv("test.csv", sep=";")
In [4]:
print(train_df.shape)
print(test_df.shape)
(45211, 17)
(4521, 17)
In [5]:
test_df.head()
Out[5]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
In [6]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
In [7]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB

DictVectorizer

Encode the categorical columns using DictVectorizer.

In [8]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()

trainX = train_df.drop(columns=["y"])
trainy = train_df["y"]

testX = test_df.drop(columns=["y"])
testy = test_df["y"]

trainy = trainy.replace({"no":0, "yes": 1})
testy = testy.replace({"no":0, "yes": 1})
In [9]:
trainX = dv.fit_transform(trainX.to_dict(orient="records")).toarray()

testX = dv.fit_transform(testX.to_dict(orient="records")).toarray()
In [10]:
train_df.columns.sort_values()
Out[10]:
Index(['age', 'balance', 'campaign', 'contact', 'day', 'default', 'duration',
       'education', 'housing', 'job', 'loan', 'marital', 'month', 'pdays',
       'poutcome', 'previous', 'y'],
      dtype='object')
In [11]:
print(dv.get_feature_names_out())
['age' 'balance' 'campaign' 'contact=cellular' 'contact=telephone'
 'contact=unknown' 'day' 'default=no' 'default=yes' 'duration'
 'education=primary' 'education=secondary' 'education=tertiary'
 'education=unknown' 'housing=no' 'housing=yes' 'job=admin.'
 'job=blue-collar' 'job=entrepreneur' 'job=housemaid' 'job=management'
 'job=retired' 'job=self-employed' 'job=services' 'job=student'
 'job=technician' 'job=unemployed' 'job=unknown' 'loan=no' 'loan=yes'
 'marital=divorced' 'marital=married' 'marital=single' 'month=apr'
 'month=aug' 'month=dec' 'month=feb' 'month=jan' 'month=jul' 'month=jun'
 'month=mar' 'month=may' 'month=nov' 'month=oct' 'month=sep' 'pdays'
 'poutcome=failure' 'poutcome=other' 'poutcome=success' 'poutcome=unknown'
 'previous']

Unlike other encoders, DictVectorizer are more flexible and are very easy to apply to new data.

How it works

  1. Convert the dataframe to a dictionary using .to_dict(orient="records")
  2. Transform the dict dataframe using dv.fit_transform(), where dv is the DictVecortizer object.
  3. Convert to array to avoid errors during model training: to_array.

This encodes columns in the dataframe that are categorical, and retain values of columns that are non-categorical.

The two cells above shows a comparison of the columns in the dataframe vs the columns in the dictvectorizer. You may agree that it looks similar to one-hot-encoding, concatenated to non-catgeorical columns.

Train the Model

In [ ]:
!pip install xgboost
In [13]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
In [14]:
xgb.fit(trainX, trainy)
y_pred = xgb.predict(testX)
In [15]:
from sklearn.metrics import classification_report, f1_score

print(classification_report(y_pred, testy))
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      4092
           1       0.69      0.84      0.76       429

    accuracy                           0.95      4521
   macro avg       0.84      0.90      0.87      4521
weighted avg       0.96      0.95      0.95      4521

In [16]:
print(f1_score(y_pred, testy))
0.7600000000000001

While more feature engineering is encourged to increase the model's performance on class 1, we'll proceed to refactoring the codes and including experiment tracking with MLflow.

1. Put all used libraries into one cell.

In [ ]:
# !pip install xgboost
!pip install mlflow --quiet;
In [18]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score

from xgboost import XGBClassifier

import pickle # to export and load the model

import mlflow

If this is your first time with mlflow, it is one of those tools used in machine learning operations that helps to keep track of data, models, experiment results, etc. This helps you to know what models performed best, their hyperparameters, the time they took to run and names of the data used for the training. You can also choose to take track of the name of the data scientist that trained the model.

To start, one key thing to define is the experiment name.

In [19]:
# define experiment name. Give it any good name
mlflow.set_experiment("term-deposit-exp")
Out[19]:
<Experiment: artifact_location='file:///root/mlops/week4/mlruns/1', experiment_id='1', lifecycle_stage='active', name='term-deposit-exp', tags={}>

This creates a folder in your working directory, called "mlruns"/1 : where 1 represents the experiement serial number. If we create a new experiment it will be number 2

To see this, in your terminal,

  1. Create a virtual enviroment and install mlflow. Remember to make your environment run on python 3.9
  2. cd to this working directory where this particular notebook is located at. In my case, it is week4. Then run the following: mlflow ui, as shown in the image below.

The Commands are actions we can carry out in mlflow. For a quick demo, mlflow ui will be used, but afterwards, we'll be working with mlflow server.

After running mlflow ui in the current working directory, you should have something like this:

We can populate the table once we tell mlflow to track any of our ML activities.

So let's get started. [Back to this Notebook]

In [20]:
# to make mlflow store your model files (artifacts) to a db, we have to define that
# we can work with sqlite

mlflow.set_tracking_uri("sqlite:///mlflow.db")

so now, you can go back to the terminal and run mlflow ui --backend-uri-sqlite:///mlflow.db. This will produce the same mlflow ui except that you can now open the Models tab (beside Experiments).

2. Refactor the codes

In [21]:
def load_data(train_path:str, test_path:str):
    train = pd.read_csv(train_path, sep=";")
    test = pd.read_csv(test_path, sep=";")
    
    return train, test

    
def preprocess_data(train_data, test_data):

    dv = DictVectorizer()

    trainX = train_data.drop(columns=["y"])
    trainy = train_data["y"]

    testX = test_data.drop(columns=["y"])
    testy = test_df["y"]

    trainy = trainy.replace({"no":0, "yes": 1})
    testy = testy.replace({"no":0, "yes": 1})

    trainX = dv.fit_transform(trainX.to_dict(orient="records")).toarray()

    testX = dv.fit_transform(testX.to_dict(orient="records")).toarray()
    
    return dv, trainX, trainy, testX, testy


def train_model(trainX, trainy, testX, testy):

    xgb = XGBClassifier()
    
    xgb.fit(trainX, trainy)
    
    pred = xgb.predict(testX)
    
    return xgb, pred
    

def evaluate(testy, pred):
    
    return (f1_score(testy, pred))
In [22]:
# save the model and dict vectorizer

def save_model(path:str, model, encoder):
    
    with open(path, "wb") as f_out:
        pickle.dump((model, encoder), f_out)
        
    return path

The four functions in the two cells above are a summary of all we've done previously. The last cell is for saving the model and encoder as a pickle file.

We will call all the functions in the next cell, but wrapped under mlflow.start_run().

3. Log the experiment

In [23]:
with mlflow.start_run(): # first start
    
    mlflow.set_tag("developer", "your_name") # tell mlflow to record your name.
    
    mlflow.log_param("train-data", "train.csv") # log the data path if you want
    mlflow.log_param("test-data", "test.csv")
    
    train, test = load_data("train.csv", "test.csv") # call the first function
    dv, trainX, trainy, testX, testy = preprocess_data(train, test) # call the second function
    
    xgb, pred = train_model(trainX, trainy, testX, testy) # call the third function
    
    # you can also log hyperparameters of the model.
    # for example: mlflow.log_params("max_leaves", "4")
    
    score = evaluate(testy, pred) # call the fourth function
    
    print(score)
    
    model_path = save_model("term-deposit.bin", xgb, dv) # save the model
    
    mlflow.log_metric("f1_score", score) # tell mlflow to record the model score
    
    # tell mlflow where to pick the model from and where to store it at:
    mlflow.log_artifact(local_path=model_path, artifact_path="xgb-model") # tell mlflow to store the model too.
    
0.7600000000000001

There is also another way of logging the model to mlflow since we are using xgboost.

So instead of: mlflow.log_artifact(local_path="term-deposit.bin", artifact_path="xgb-model")

Do: mlflow.xgboost.log_model(xgb, artifact_path="xgb-model-version") as shown the cell below.

NOTE: The cell below is an alternative to the one above. You can run both to see the difference, but ensure you use different artifact paths

The latter is better because it stores more information about the model.

In [24]:
with mlflow.start_run(): # first start
    
    mlflow.set_tag("developer", "your_name") # tell mlflow to record your name.
    
    mlflow.log_param("train-data", "train.csv") # log the data path if you want
    mlflow.log_param("test-data", "test.csv")
    
    train, test = load_data("train.csv", "test.csv") # call the first function
    dv, trainX, trainy, testX, testy = preprocess_data(train, test) # call the second function
    
    xgb, pred = train_model(trainX, trainy, testX, testy) # call the third function
    
    # you can also log hyperparameters of the model.
    # for example: mlflow.log_params("max_leaves", "4")
    
    score = evaluate(testy, pred) # call the fourth function
    
    print(score)
    
    model_path = save_model("term-deposit.bin", xgb, dv)
    
    mlflow.log_metric("f1_score", score)
    
    # changes are from here:
    
    # pickle dump just the dict vectoriser
    with open("preprocessor.b", "wb") as f_out:
        pickle.dump(dv, f_out)
    
    # log the dict vectoriser
    mlflow.log_artifact("preprocessor.b", artifact_path="preprocessor")
    
    # log the model directly using .xgboost
    mlflow.xgboost.log_model(xgb, artifact_path="xgb-model-version")
0.7600000000000001
/root/anaconda3/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

After successfully running the cell above, you should now have mlflow.db file in your working directory. To view it from the ui, go back to your terminal and run this now:

mlflow ui --backend-store-uri sqlite:///mlflow.db

Remember to run this in your working directory, as you did previously and also ensure your environment is activated.

Your interface should be looking similar to mine below. Click on the time under the "Start Time" column, to see more details of the model and the artifacts.

Voila! We've come to the end of this note that introduces model experiementation with MLFlow.

THANKS FOR READING.

In [ ]: