How To Evaluate The Business Value Of A Machine Learning Model


This article was published as a part of the Data Science Blogathon

But what if we could answer the business questions without the complex metrics and technical jargon? Well, we might stand a much better chance of getting buy-in from the business. In this blog, we will take a look at a use case where we still build our models but explain them in a different way – the business way.

Approach to Extracting Business Value Using an ML Model

In this blog, we will explore the use of deciles and evaluation plots such as the Cumulative Gain plot and the Lift plot to assess the business value of ML models. This approach helps us explain the predictive power of our ML models and also makes the model outcome simple to interpret. The plots and metrics enable the business to make informed decisions with much more confidence.

We will explore the below topics as we go along in this blog.

Data exploration

Data processing

Model building

Generating deciles and reports

Model comparison

Business scenarios

Conclusion

Getting Started

We will be using the publicly available bank dataset from the UCI Machine Learning Repository. There are four datasets in the zip file, but our interest is in bank-additional-full.csv. All the attribute information can be found at the above URL. The data comes from direct marketing phone calls made to clients to assess whether they are interested in subscribing to a bank term deposit – Yes if they subscribed and No if not. Our interest in this blog is to understand how to evaluate the business value of the ML model(s).

Data Loading & Processing:

 Let us load the data and take a look to get a better understanding.

import wget
import zipfile
import pandas as pd
import numpy as np

# url should point to the bank-additional.zip archive on the UCI ML Repository
wget.download(url)
zf = zipfile.ZipFile('bank-additional.zip')
df = pd.read_csv(zf.open('bank-additional/bank-additional-full.csv'), sep=';')

We could carry out complete EDA, feature engineering, and selection of significant variables before building models, but to keep it simple, we will select just a few variables for model building.

df= df[['y', 'duration', 'campaign', 'pdays', 'previous', 'euribor3m']]

Also, let's explore the data a bit more, convert the target variable to categorical, and encode it.

df.y[df.y == 'yes'] = 'term deposit'
df.y = pd.Categorical(df.y)
df['y'] = df.y.cat.codes
df.info()

Output:

RangeIndex: 41188 entries, 0 to 41187
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   y          41188 non-null  int8
 1   duration   41188 non-null  int64
 2   campaign   41188 non-null  int64
 3   pdays      41188 non-null  int64
 4   previous   41188 non-null  int64
 5   euribor3m  41188 non-null  float64
dtypes: float64(1), int64(4), int8(1)
memory usage: 1.6 MB

df.head()

   y  duration  campaign  pdays  previous  euribor3m
0  0       261         1    999         0      4.857
1  0       149         1    999         0      4.857
2  0       226         1    999         0      4.857
3  0       151         1    999         0      4.857
4  0       307         1    999         0      4.857

df.describe()

                  y      duration      campaign         pdays      previous     euribor3m
count  41188.000000  41188.000000  41188.000000  41188.000000  41188.000000  41188.000000
mean       0.112654    258.285010      2.567593    962.475454      0.172963      3.621291
std        0.316173    259.279249      2.770014    186.910907      0.494901      1.734447
min        0.000000      0.000000      1.000000      0.000000      0.000000      0.634000
25%        0.000000    102.000000      1.000000    999.000000      0.000000      1.344000
50%        0.000000    180.000000      2.000000    999.000000      0.000000      4.857000
75%        0.000000    319.000000      3.000000    999.000000      0.000000      4.961000
max        1.000000   4918.000000     56.000000    999.000000      7.000000      5.045000

Model Building to Extract Business Value

Step 1: Define the independent and target variables

y = df.y
X = df.drop('y', axis = 1)

Step 2: Split the dataset into train/test sets with a test size of 0.2

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2023)

Step 3: Build the logistic regression model

from sklearn.linear_model import LogisticRegression

# Logistic regression model
clf_glm = LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg').fit(X_train, y_train)
prob_glm = clf_glm.predict_proba(X_test)
max_prob_glm = round(pd.DataFrame(np.amax(prob_glm, axis=1), columns = ['prob_glm']), 2)

So, we have built the model and also scored (predicted) it on the test data which gives us predicted probabilities for each of the observations.

Building Deciles

Deciles, simply put, split our data into ten bins. We take all the predicted probabilities, segregate them into ten groups, and rank them so that the highest predicted probabilities fall in decile 1 and the lowest in decile 10. We will use pandas' cut() function to split the data (note that cut() creates ten equal-width probability bins; qcut() would create ten equal-count bins).

The line of code below creates a new column named Decile_rank_glm, which holds the decile rank of each predicted record.

max_prob_glm['Decile_rank_glm'] = pd.cut(max_prob_glm['prob_glm'], 10, labels = np.arange(10, 0, -1))

      prob_glm  Decile_rank_glm
0         0.99                1
1         0.59                9
2         0.96                1
3         0.83                4
4         0.85                4
...        ...              ...
8233      0.98                1
8234      0.98                1
8235      0.99                1
8236      0.99                1
8237      0.93                2

Note: The probability of 0.99 is ranked 1, 0.93 is 2,  0.85 is 4, and 0.59 is 9 in the above decile ranks. We will see the visual representation of this result in the later sections.

Model Evaluation to Extract Business Value

Any model we build has to be compared with a baseline model to see how it fares in its performance. Let us explore this further below.

Random Model: The baseline is a random model – it is equivalent to picking customers by the flip of a coin, so the share of buyers it captures simply matches the share of customers contacted. Our logistic regression model's performance should obviously be better than this.

Wizard Model: This is the other extreme – a perfect model that predicts with nearly 100% accuracy. Such a model should never be used in production or for any business decision, as there is a heavy chance of overfitting.

Logistic Model: Our model should sit somewhere between these two extremes, which gives us enough confidence to make our business decisions.

We will visualize the above models in a cumulative gain plot. This will give us an indication of where the logistic model stands in terms of performance.

import kds

kds.metrics.plot_cumulative_gain(y_test.to_numpy(), prob_glm[:,1])

 

Looks good so far – the plot is along expected lines, and the logistic regression model sits between the two extreme models we discussed.

Insights from the cumulative gain plot: 

If we can select only the top 20% (decile 1 and decile 2) then we have coverage of nearly 80% of the target class.

As this is a cumulative plot, we see that the curve flattens after decile 5, which means deciles 6 to 10 contribute few or no additional target records.

The wizard model hits the 100% mark at decile 2 – we already know this is an idealistic model shown just for reference. If our model starts nearing or resembling either of these two extremes, we need to review it.

We have so far discussed models, deciles, and their performance comparison. Let us explore this further at the decile level to get a better understanding of what is at play and how we can explain the process better. We will carry out our analysis with the help of visuals, which makes it much easier. The kds package has a very handy function that generates all the metric reports in one line of code.

kds.metrics.report(y_test, prob_glm[:,1])

Let us understand each of these plots. Please note that the x-axis of all the plots is Deciles.

Lift Plot: This plot shows how much better the logistic regression model is compared to the random model at each decile. For example, decile 2 gives us a lift of almost 4, meaning we can do 4 times better than the random approach. As we move to higher-numbered deciles the lift drops and eventually meets the random model line; this is because all the high probability scores sit in the top deciles (1 to 3), as we already saw in the cumulative gain plot. The bottom deciles hold probabilities that are low and almost the same as the random model.

Decile-wise Lift Plot: This plot shows the percentage of target-class observations in each decile. Decile 1 has the maximum, the percentage drops as we move to higher-numbered deciles, and after a certain point it even falls below the random model line. This is because the random model distributes target observations evenly across deciles, whereas our model places very few of them in the lower deciles.

Cumulative Gain Plot: We discussed this in the earlier section and also looked into the interpretation of the plot.

KS Statistic Plot: The KS plot compares two distributions – events and non-events – and the KS value is the point where the difference between their cumulative distributions is maximum. In short, it tells us how well the ML model separates the two classes. If the KS score is greater than 40 and the maximum occurs in the top 3 deciles, the model is generally considered good. In our case, we have a score of 68.93 at decile 3.
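To make these plots concrete, here is a minimal pandas sketch (not part of the kds package – the decile_summary helper and its column names are illustrative) that rebuilds the underlying decile table: responders, response rate, lift, cumulative gain, and KS per decile. It assumes y_test and prob_glm from the code above.

import numpy as np
import pandas as pd

def decile_summary(y_true, y_prob, n_bins=10):
    # Decile 1 = highest predicted probabilities, decile 10 = lowest.
    scores = pd.DataFrame({"y": np.asarray(y_true), "prob": np.asarray(y_prob)})
    scores["decile"] = pd.qcut(scores["prob"].rank(method="first"), n_bins,
                               labels=np.arange(n_bins, 0, -1)).astype(int)
    tbl = (scores.groupby("decile")
                 .agg(customers=("y", "size"), responders=("y", "sum"))
                 .sort_index())
    tbl["response_rate"] = tbl["responders"] / tbl["customers"]
    tbl["lift"] = tbl["response_rate"] / scores["y"].mean()           # decile-wise lift
    tbl["cum_gain"] = tbl["responders"].cumsum() / scores["y"].sum()  # cumulative gain
    cum_resp = tbl["responders"].cumsum() / tbl["responders"].sum()
    non_resp = tbl["customers"] - tbl["responders"]
    cum_non = non_resp.cumsum() / non_resp.sum()
    tbl["ks"] = (cum_resp - cum_non) * 100   # KS statistic = max of this column
    return tbl

print(decile_summary(y_test, prob_glm[:, 1]).round(3))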

Let us build one more model with a random forest and see how the results will be.

from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier().fit(X_train, y_train)
prob_rf = pd.DataFrame(np.amax(prob_rf, axis=1), columns = ['prob_rf']) if False else clf_rf.predict_proba(X_test)
max_prob_rf = pd.DataFrame(np.amax(prob_rf, axis=1), columns = ['prob_rf'])
max_prob_rf['Decile_rank_rf'] = pd.cut(max_prob_rf['prob_rf'], 10, labels = np.arange(10, 0, -1))

kds.metrics.plot_cumulative_gain(y_test.to_numpy(), prob_rf[:,1])
kds.metrics.report(y_test, prob_rf[:,1])

Observations:

The random forest model is slightly better than the logistic model.

Decile 2 gives a marginally higher lift, and the KS statistic is 72.18 compared to 68.93 for the logistic model.

Business scenarios

Control Over Recommendations: There are situations where the client has a business requirement that a minimum of X recommendations should always be generated. In such cases, we can generate a larger list by considering the top 3 deciles instead of 2 and still retain granular control over the additional records (a code sketch follows after these scenarios).

Measure Market Response: Post-recommendation analysis and market response become easy to measure. For instance, following from the previous point, we can separately track how the additional recommendations from decile 3 performed. Did the additional push from decile 3 generate any impact, positive or negative?

Optimizing Marketing Spend: By focusing on the top 20-30%, businesses can save time, resources, and money that they would spend on non-responders or targeting the wrong customers.
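As a rough illustration of the scenarios above, here is a hypothetical sketch (not from the original post) that turns the decile ranks into an actual contact list. It assumes prob_glm and X_test from the earlier code; the variable names and the "top 2 plus decile 3" split are assumptions.

import numpy as np
import pandas as pd

scores = pd.DataFrame({"p_subscribe": prob_glm[:, 1]}, index=X_test.index)
scores["decile"] = pd.qcut(scores["p_subscribe"].rank(method="first"), 10,
                           labels=np.arange(10, 0, -1)).astype(int)

# Base campaign: top 2 deciles (~20% of customers); optional push: decile 3,
# tracked separately so its market response can be measured on its own.
base_list  = scores[scores["decile"] <= 2].index
extra_list = scores[scores["decile"] == 3].index
print(len(base_list), len(extra_list))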

Closing Note

Technology has its place and businesses have their say. At the end of the day, it is all about the business value that technology brings. It will always be more effective when these gains are explained in business terms. It not only helps in gaining confidence from the business but also opens up new opportunities to explore.

Please note that we built two classification models but didn't look into the ROC curve, confusion matrix, precision, recall, and the other standard metrics we generally use for such models. It is highly recommended that these metrics be tracked and measured to assess the models' performance before following the decile approach from this blog. Depending on the target audience and the goal, use the pitch that best suits the objective.

Hope you liked the blog. Happy learning!

You can connect with me – Linkedin

You can find the code on Github

 


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Related


Machine Learning Is Growing Significantly In Business

Improvements in technology, the availability of Machine Learning capabilities such as TensorFlow, cloud services like Google Cloud AI, and operational tools such as Talend have helped organisations build skills, adopt Machine Learning, and accelerate the delivery of solutions.

Machine Learning is growing rapidly

The key reasons for this are the improvements in the accessibility and cost of data storage and compute, together with much greater availability of Machine Learning capabilities. This has created the perfect storm for businesses to research how to exploit this field of Data Science. But Machine Learning remains fundamentally about statistical modelling with data – so the data remains crucial. Data Science, and the areas around Machine Learning, are in high demand in most businesses, and this is driving momentum at the practitioner level to educate and empower teams so that alignment with business opportunities and goals can be confirmed.

We are also beginning to see some revolutionary applications of Machine Learning within our client base as several obstacles to its adoption are lowered. From e-commerce sites suggesting next best actions (NBA), to gaming and gambling platforms, to forecasting supply chain demand based on additional measurements like weather and key events, our clients are exploring how Machine Learning initiatives might help improve the consumer and client experience and increase revenue and conversions.

The first example is Bayer CropScience AG, the crop science arm of the global pharmaceutical company Bayer, which used Machine Learning to develop a solution for farmers. Weeds that harm crops have been a problem for farmers since farming began. A suitable solution is to apply a narrow-spectrum herbicide that efficiently kills the specific species in the field with as few undesirable side effects as possible. However, to be able to do so, farmers need to correctly identify the weeds in their own fields.

Using Talend Real-time Big Data, the company was able to build a brand-new application that farmers can download for free. The app uses Machine Learning and Artificial Intelligence to match the weed photographs farmers send in against the photographs in the organisation's database. Available all around the world, the photograph database resides on a private cloud hosted on AWS. It gives growers the chance to precisely predict the effect of their activities, such as the selection of seed variety, the application rate of crop protection products, or crop timing. The outcome is a much more efficient way of farming that increases yield and enables farmers to be more environmentally conscious in their activities.

Possible to reinvent

“This is simply an example of how Machine Learning can transform a company by enabling success more readily and economically than conventional coding-centric approaches. Owing to its open-source, standards-based structure, Machine Learning models can be easily deployed into business applications and bridge the skills gap that typically exists between data scientists and IT developers.”

As accessibility and adoption of this technology increase, Machine Learning will continue to support increasingly sophisticated use cases that help organisations drive new innovations and improved customer experiences. Many people are now beginning to talk about Cognitive Computing as the nirvana of Machine Learning, where systems can learn at scale, reason with purpose, and interact with people more naturally. By imitating the human mind and the way people process and reason about information through thought, experience, and the senses, Cognitive Computing promises to help deliver high-end applications of Machine Learning such as computer vision and recognition, genuinely intelligent chatbots, flexible handwriting recognition, and much more.

Rapid improvements in hardware are helping provide the compute power necessary for this cognitive software, with dedicated processors that help optimise processing and decrease the hardware footprint normally needed to support such applications.

AI and Machine Learning are among the most critical technologies for innovation, but it is widely recognised that the skills are not yet in place to reap the benefits. The skills gap is nothing new, but it will continue to evolve as new technologies become more complex, and it is something that will always be near the top of the agenda and will need to be addressed as the workforce becomes increasingly specialised.

For all the reasons mentioned here, it is apparent that Machine Learning has the capacity to reinvent an assortment of business processes, and we are already seeing some of those applications today. I am really excited to see how Machine Learning adoption grows and effects change in the enterprise.

The Curse Of Dimensionality In Machine Learning!

This article was published as a part of the Data Science Blogathon

What is the curse of dimensionality?

Photo by Tim Carey on Unsplash

It refers to the phenomenon of strange things happening as we try to analyze data in high-dimensional spaces. Let us understand this peculiarity with an example: suppose we are building several machine learning models to analyze the performance of a Formula One (F1) driver. Consider the following cases:

i) Model_1 consists of only two features say the circuit name and the country name.

ii) Model_2 consists of 4 features say weather and max speed of the car including the above two.

iii) Model_3 consists of 8 features say driver’s experience, number of wins, car condition, and driver’s physical fitness including all the above features.

iv) Model_4 consists of 16 features say driver’s age, latitude, longitude, driver’s height, hair color, car color, the car company, and driver’s marital status including all the above features.

v) Model_5 consists of 32 features.

vi) Model_6 consists of 64 features.

vii) Model_7 consists of 128 features.

viii) Model_8 consists of 256 features.

ix) Model_9 consists of 512 features.

x) Model_10 consists of 1024 features.

Assuming the training data remains constant, it is observed that as we increase the number of features the accuracy tends to increase until a certain threshold, after which it starts to decrease. From the above example, the accuracy of Model_1 < accuracy of Model_2 < accuracy of Model_3, but if we try to extrapolate this trend it doesn't hold for the models having more than 8 features. Now you might wonder: if we are providing extra information for the model to learn from, why does the performance start to degrade? My friends, welcome to the curse of dimensionality!

If we think logically, some of the features provided to Model_4 don't actually contribute anything towards analyzing the performance of the F1 driver. For example, the driver's height, hair color, car color, car company, and marital status give useless information for the model to learn from, so the model gets confused by all this extra information and the accuracy starts to go down.

The curse of dimensionality was first termed by Richard E. Bellman when considering problems in dynamic programming.

Curse of dimensionality in various domains

There are several domains where we can see the effect of this phenomenon. Machine Learning is one such domain. Other domains include numerical analysis, sampling, combinatorics, data mining, and databases. As it is clear from the title we will see its effect only in Machine Learning.

How to overcome its effect

This was a general overview of the curse of dimensionality. Now we will go slightly technical in order to understand it completely. In ML, it can be defined as follows: as the number of features or dimensions ‘d’ grows, the amount of data we require to generalize accurately grows exponentially. As the dimensions increase the data becomes sparse and as the data becomes sparse it becomes hard to generalize the model. In order to better generalize the model, more training data is required.

1. Hughes phenomenon

Again, let's take an example under this phenomenon. Assume all the features in a dataset are binary. If the dimensionality is 3, i.e. there are 3 features, then the total number of possible data points is 2^3 = 8. If the dimensionality is 10, i.e. there are 10 features, then the total number of possible data points is 2^10 = 1024. It is clear that as the dimensionality increases, the number of possible data points grows exponentially, and so does the amount of data required for training a machine learning model.

There is a very interesting phenomenon called the Hughes phenomenon which states that for a fixed size dataset the performance of a machine learning model decreases as the dimensionality increases.
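A quick sanity check of the arithmetic above (illustrative snippet only):

# Number of distinct combinations of d binary features grows exponentially with d.
for d in (3, 10, 20, 30):
    print(d, 2 ** d)
# 3 -> 8, 10 -> 1024, 20 -> 1048576, 30 -> 1073741824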

2. Distance functions (especially Euclidean distance)

Let's think of a 1D world where n points are spread randomly between 0 and 1, and we have a point xi.

From the above two figures, it is clear that the Euclidean distance between pairs of points is very close to 0.

Now let me define two terms,

dist_min(xi) = min{euc-dist(xi, xj)} where xi is not equal to xj.

dist_max(xi) = max{euc-dist(xi, xj)} where xi is not equal to xj.

For 1D, 2D and 3D,

From the above figures, we can see how the peaks form as the dimensions increase. KNN works well when pairs of points are close together in a cluster, but at higher dimensions the number of pairs of points that are very close to each other shrinks, and we have many more pairs at distances of 5-10 and 15-20 when d = 100; this only grows as we increase the dimensions further. So we know for sure that KNN will break down in such conditions.

Let me break it down for you even further.

[dist_max(xi) - dist_min(xi)] / dist_min(xi)

The above ratio becomes 0 only when the numerator becomes 0, i.e. dist_max and dist_min are equal, which means that in higher-dimensional spaces every point is almost equally distant from every other point. For example, the distance between xi and xj is almost equal to the distance between xi and xk, and this holds for every pair of points.
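This concentration of distances is easy to verify numerically. The following is a minimal sketch (not from the original article) that spreads random points in the unit hypercube and computes the ratio above for one reference point at several dimensionalities.

import numpy as np

rng = np.random.default_rng(0)

def contrast(d, n=1000):
    pts = rng.random((n, d))                          # n random points in [0, 1]^d
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)  # Euclidean distances to xi = pts[0]
    return (dists.max() - dists.min()) / dists.min()

for d in (1, 2, 10, 100, 1000):
    print(d, round(contrast(d), 3))
# The ratio shrinks towards 0 as d grows: the nearest and farthest neighbours of xi
# end up at almost the same distance, which is the effect described above.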

In high-dimensional spaces, when the distance between any pair of points is about the same as between any other pair, a machine learning model like KNN that depends heavily on Euclidean distance no longer makes sense logically. Hence KNN doesn't work well when the dimensionality increases. Even though this was proven theoretically for n random points, it has also been observed experimentally that KNN doesn't work well in higher-dimensional spaces. So what is the solution?

The solution is very simple: use cosine similarity instead of Euclidean distance, as it is less affected in higher-dimensional spaces. That's why cosine similarity is preferred especially in text problems, where we use bag of words, TF-IDF, word2vec, etc., and the feature space is high-dimensional.

It is important to note that all these observations were made assuming the spread of points is uniform and random. So the very next thing that comes into mind is what if the spread of points are not uniform and random. We can think of this from a different angle i.e.

a) When dimensionality is high and points are dense, the impact of dimensionality is high.

b) When dimensionality is high and points are sparse, the impact of dimensionality is low.

3. Overfitting and Underfitting

There is a relationship between ‘d’ and overfitting which is as follows:

‘d’ is directly proportional to overfitting i.e. as the dimensionality increases the chances of overfitting also increases.

Let’s discuss the solutions to tackle this problem.

a) Model-dependent approach: Whenever we have a large number of features, we can always perform forward feature selection to determine the most relevant features for the prediction.

b) Unlike the above solution which is classification-oriented, we can also perform dimensionality reduction techniques like PCA and t-SNE which do not use the class labels to determine the most relevant features for the prediction.

So it is important to keep in mind that whenever you work with a new dataset that has a large number of features, you can reduce them with techniques like PCA, t-SNE, or forward selection to ensure your model is not affected by the curse of dimensionality.
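For illustration, here is a minimal scikit-learn sketch of the two approaches above on a toy dataset; the dataset and parameter choices are assumptions made for the example, not taken from the article.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Toy data: 30 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# a) Model-dependent: forward feature selection with a simple classifier.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward")
X_fwd = sfs.fit_transform(X, y)

# b) Unsupervised: keep enough principal components to explain 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

print(X.shape, X_fwd.shape, X_pca.shape)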

If you liked my article, you can connect with me via LinkedIn


The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

Related

Deploying Machine Learning Models Using Streamlit – An Introductory Guide To Model Deployment

Overview

Understand the concept of model deployment

Perform model deployment using Streamlit for loan prediction data

Introduction

I believe most of you must have done some form of a data science project at some point in your lives, be it a machine learning project, a deep learning project, or even visualizations of your data. And the best part of these projects is showcasing them to others. This will not only motivate and encourage you about your hard work but will also help you to improve upon your project.

But the question is how will you showcase your work to others? Well, this is where Model Deployment will help you.

I have been exploring the field of Model Deployment for the past few months now. Model Deployment helps you showcase your work to the world and make better decisions with it. But deploying a model can get a little tricky at times. Before deploying the model, a lot of things need to be looked into, such as data storage, pre-processing, model building, and monitoring. This can be a bit confusing, as few tools perform these model deployment tasks efficiently. Enter, Streamlit!

Streamlit is a popular open-source framework used for model deployment by machine learning and data science teams. And the best part is it’s free of cost and purely in python.

In this article, we are going to deep dive into model deployment. We will first build a loan prediction model and then deploy it using Streamlit.

Table of Contents

Overview of Machine Learning Lifecycle

Understanding the Problem Statement: Automating Loan Prediction

Machine Learning model for Automating Loan Prediction

Introduction to Streamlit

Model Deployment of the Loan Prediction model using Streamlit

Overview of Machine Learning Lifecycle

Let’s start with understanding the overall machine learning lifecycle, and the different steps that are involved in creating a machine learning project. Broadly, the entire machine learning lifecycle can be described as a combination of 6 stages. Let me break these stages for you:

Stage 1: Problem Definition

The first and most important part of any project is to define the problem statement. Here, we want to describe the aim or the goal of our project and what we want to achieve at the end.

Stage 2: Hypothesis Generation

Once the problem statement is finalized, we move on to the hypothesis generation part. Here, we try to point out the factors/features that can help us to solve the problem at hand.

Stage 3: Data Collection

After generating hypotheses, we get the list of features that are useful for a problem. Next, we collect the data accordingly. This data can be collected from different sources.

Stage 4: Data Exploration and Pre-processing

After collecting the data, we move on to explore and pre-process it. These steps help us to generate meaningful insights from the data. We also clean the dataset in this step before building the model.

Stage 5: Model Building

Once we have explored and pre-processed the dataset, the next step is to build the model. Here, we create predictive models in order to build a solution for the project.

Stage 6: Model Deployment

Once you have the solution, you want to showcase it and make it accessible for others. And hence, the final stage of the machine learning lifecycle is to deploy that model.

These are the 6 stages of a machine learning lifecycle. The aim of this article is to understand the last stage, i.e. model deployment, in detail using streamlit. However, I will briefly explain the remaining stages and the complete machine learning lifecycle along with their implementation in Python, before diving deep into the model deployment part using streamlit.

So, in the next section, let’s start with understanding the problem statement.

Understanding the Problem Statement: Automating Loan Prediction

The project that I have picked for this particular blog is automating the loan eligibility process. The task is to predict whether the loan will be approved or not based on the details provided by customers. Here is the problem statement for this project:

Automate the loan eligibility process based on customer details provided while filling online application form

Based on the details provided by customers, we have to create a model that can decide whether or not their loan should be approved. This completes the problem definition part, the first stage of the machine learning lifecycle. The next step is to generate hypotheses and point out the factors that will help us to predict whether the loan for a customer should be approved or not.

As a starting point, here are a couple of factors that I think will be helpful for us with respect to this project:

Amount of loan: The total amount of loan applied by the customer. My hypothesis here is that the higher the amount of loan, the lesser will be the chances of loan approval and vice versa.

Income of applicant: The income of the applicant (customer) can also be a deciding factor. A higher income will lead to higher probability of loan approval.

Education of applicant: Educational qualification of the applicant can also be a vital factor to predict the loan status of a customer. My hypothesis is if the educational qualification of the applicant is higher, the chances of their loan approval will be higher.

These are some factors that can be useful to predict the loan status of a customer. Obviously, this is a very small list, and you can come up with many more hypotheses. But, since the focus of this article is on model deployment, I will leave this hypothesis generation part for you to explore further.

Next, we need to collect the data. We know certain features that we want like the income details, educational qualification, and so on. And the data related to the customers and loan is provided at the datahack platform of Analytics Vidhya. You can go to the link, register for the practice problem, and download the dataset from the problem statement tab. Here is a summary of the variables available for this particular problem:

We have some variables related to the loan, like the loan ID, which is the unique ID for each customer, Loan Amount and Loan Amount Term, which tells us the amount of loan in thousands and the term of the loan in months respectively. Credit History represents whether a customer has any previous unclear debts or not. Apart from this, we have customer details as well, like their Gender, Marital Status, Educational qualification, income, and so on. Using these features, we will create a predictive model that will predict the target variable which is Loan Status representing whether the loan will be approved or not.

Now we have finalized the problem statement, generated the hypotheses, and collected the data. Next is the data exploration and pre-processing phase. Here, we will explore the dataset and pre-process it. The common steps in this phase are as follows:

Univariate Analysis

Bivariate Analysis

Missing Value Treatment

Outlier Treatment

Feature Engineering

We explore the variables individually which is called the univariate analysis. Exploring the effect of one variable on the other, or exploring two variables at a time is the bivariate analysis. We also look for any missing values or outliers that might be present in the dataset and deal with them. And we might also create new features using the existing features which are referred to as feature engineering. Again, I will not focus much on these data exploration parts and will only do the necessary pre-processing.

After exploring and pre-processing the data, next comes the model building phase. Since it is a classification problem, we can use any of the classification models like the logistic regression, decision tree, random forest, etc. I have tried all of these 3 models for this problem and random forest produced the best results. So, I will use a random forest as the predictive model for this project.

Till now, I have briefly explained the first five stages of the machine learning lifecycle with respect to the project automating loan prediction. Next, I will demonstrate these steps in Python.

Machine Learning model for Automating Loan Prediction

In this section, I will demonstrate the first five stages of the machine learning lifecycle for the project at hand. The first two stages, i.e. Problem definition and hypothesis generation are already covered in the previous section and hence let’s start with the third stage and load the dataset. For that, we will first import the required libraries and then read the CSV file:

Here are the first five rows from the dataset. We know that machine learning models take only numbers as inputs and can not process strings. So, we have to deal with the categories present in the dataset and convert them into numbers.

Python Code:
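The embedded code is not preserved in this copy of the article, so here is a minimal sketch of the step described below. The file name and category labels are assumptions based on the usual DataHack loan prediction files.

import pandas as pd

# Read the training file downloaded from the DataHack platform (name assumed).
train = pd.read_csv('train.csv')
print(train.head())

# Convert the string categories into numbers using map().
train['Gender']      = train['Gender'].map({'Male': 1, 'Female': 0})
train['Married']     = train['Married'].map({'Yes': 1, 'No': 0})
train['Loan_Status'] = train['Loan_Status'].map({'Y': 1, 'N': 0})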



Here, we have converted the categories present in the Gender, Married, and Loan Status variables into numbers, simply using python's map function. Next, let's check if there are any missing values in the dataset:

So, there are missing values in many variables, including Gender, Married, and LoanAmount. Next, we will remove all the rows that contain any missing values:

Now there are no missing values in the dataset. Next, we will separate the dependent (Loan_Status) and the independent variables:

View the code on Gist.

For this particular project, I have only picked 5 variables that I think are most relevant. These are the Gender, Marital Status, ApplicantIncome, LoanAmount, and Credit_History and stored them in variable X. Target variable is stored in another variable y. And there are 480 observations available. Next, let’s move on to the model building stage.

Here, we will first split our dataset into a training and validation set, so that we can train the model on the training set and evaluate its performance on the validation set.

View the code on Gist.

We have split the data using the train_test_split function from the sklearn library keeping the test_size as 0.2 which means 20 percent of the total dataset will be kept aside for the validation set. Next, we will train the random forest model using the training set:

View the code on Gist.

Here, I have kept the max_depth as 4 for each of the trees of our random forest and stored the trained model in a variable named model. Now, our model is trained, let’s check its performance on both the training and validation set:

View the code on Gist.

The model is 80% accurate on the validation set. Let’s check the performance on the training set too:

View the code on Gist.

Performance on the training set is almost similar to that on the validation set. So, the model has generalized well. Finally, we will save this trained model so that it can be used in the future to make predictions on new observations:

View the code on Gist.

We are saving the model in pickle format. The pickle file stores the trained model, and we will use this file while deploying the model.
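The individual gists are not preserved in this copy, so here is a consolidated, hypothetical sketch of the steps above (feature selection, split, training, evaluation, and saving). It reuses the train DataFrame from the earlier sketch; the file name classifier.pkl and the random_state values are assumptions, while the five features, the 0.2 test size, and max_depth = 4 come from the text.

import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Drop rows with missing values and keep the five features discussed above.
train = train.dropna()
X = train[['Gender', 'Married', 'ApplicantIncome', 'LoanAmount', 'Credit_History']]
y = train['Loan_Status']

# 80/20 train-validation split.
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=10)

# Random forest with shallow trees (max_depth = 4), as described above.
model = RandomForestClassifier(max_depth=4, random_state=10)
model.fit(x_train, y_train)

print('validation accuracy:', accuracy_score(y_cv, model.predict(x_cv)))
print('training accuracy  :', accuracy_score(y_train, model.predict(x_train)))

# Save the trained model so the Streamlit app can load it later.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(model, f)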

This completes the first five stages of the machine learning lifecycle. Next, we will explore the last stage which is model deployment. We will be deploying this loan prediction model so that it can be accessed by others. And to do so, we will use Streamlit which is a recent and the simplest way of building web apps and deploying machine learning and deep learning models.

So, let’s first discuss this tool, and then I will demonstrate how to deploy your machine learning model using it.

Introduction to Streamlit

As per the founders of Streamlit, it is the fastest way to build data apps and share them. It is a recent model deployment tool that simplifies the entire model deployment cycle and lets you deploy your models quickly. I have been exploring this tool for the past couple of weeks and as per my experience, it is a simple, quick, and interpretable model deployment tool.

Here are some of the key features of Streamlit which I found really interesting and useful:

It quickly turns data scripts into shareable web applications. You just have to pass a running script to the tool and it can convert that to a web app.

Everything in Python. The best thing about Streamlit is that everything we do is in Python. Starting from loading the model to creating the frontend, all can be done using Python.

All for free. It is open source and hence no cost is involved. You can deploy your apps without paying for them.

No front-end experience required. Model deployment generally contains two parts, frontend and backend. The backend is generally a working model, a machine learning model in our case, which is built in Python. And the front-end part generally requires some knowledge of other languages like JavaScript, etc. Using Streamlit, we can create this front end in Python itself. So, we need not learn any other programming languages or web development techniques. Understanding Python is enough.

Let’s say we are deploying the model without using Streamlit. In that case, the entire pipeline will look something like this:

Model Building

Creating a python script

Write Flask app

Create front-end: JavaScript

Deploy

We will first build our model and convert it into a python script. Then we will have to create the web app using, let's say, Flask. We will also have to create the front end for the web app, and here we will have to use JavaScript. And then finally, we will deploy the model. So, as you will notice, we will require knowledge of Python to build the model and then a thorough understanding of JavaScript and Flask to build the front end and deploy the model. Now, let's look at the deployment pipeline if we use Streamlit:

Model Building

Creating a python script

Create front-end: Python

Deploy

Here we will build the model and create a python script for it. Then we will build the front-end for the app which will be in python and finally, we will deploy the model. That’s it. Our model will be deployed. Isn’t it amazing? If you know python, model deployment using Streamlit will be an easy journey. I hope you are as excited about Streamlit as I was while exploring it earlier. So, without any further ado, let’s build our own web app using Streamlit.

Model Deployment of the Loan Prediction model using Streamlit

We will start with the basic installations:

View the code on Gist.

We have installed 3 libraries here. pyngrok is a python wrapper for ngrok which helps to open secure tunnels from public URLs to localhost. This will help us to host our web app. Streamlit will be used to make our web app. 

Next, we will have to create a separate session in Streamlit for our app. You can download the session state helper file from here and store it in your current working directory. This will help you to create a session for your app. Finally, we have to create the python script for our app. Let me show the code first and then I will explain it to you in detail:

View the code on Gist.

This is the entire python script which will create the app for us. Let me break it down and explain in detail:

In this part, we are saving the script as a python file and then loading the required libraries, which are pickle to load the trained model and streamlit to build the app. Then we are loading the trained model and saving it in a variable named classifier.

Next, we have defined the prediction function. This function will take the data provided by users as input and make the prediction using the model that we loaded earlier. It will take the customer details like gender, marital status, income, loan amount, and credit history as input, pre-process that input so that it can be fed to the model, and finally make the prediction using the model loaded as classifier. In the end, it will return whether the loan is approved or not based on the output of the model.

And here is the main app. First of all, we are defining the header of the app. It will display “Streamlit Loan Prediction ML App”. To do that, we are using the markdown function from streamlit. Next, we are creating five boxes in the app to take input from the users. These 5 boxes will represent the five features on which our model is trained. 

The first box is for the gender of the user. The user will have two options, Male and Female, and they will have to pick one from them. We are creating a dropdown using the selectbox function of streamlit. Similarly, for Married, we are providing two options, Married and Unmarried and again, the user will pick one from it. Next, we are defining the boxes for Applicant Income and Loan Amount.

Since both these variables will be numeric in nature, we are using the number_input function from streamlit. And finally, for the credit history, we are creating a dropdown which will have two categories, Unclear Debts, and No Unclear Debts. 
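The original gist is not preserved here, so the following is a condensed, hypothetical sketch of such an app script; the widget labels, file name, and the encoding inside prediction() are assumptions, not the author's exact code.

# app.py - minimal Streamlit front end for the saved loan prediction model.
import pickle
import streamlit as st

classifier = pickle.load(open('classifier.pkl', 'rb'))

def prediction(gender, married, applicant_income, loan_amount, credit_history):
    # Encode the inputs the same way the training data was encoded.
    gender = 1 if gender == 'Male' else 0
    married = 1 if married == 'Married' else 0
    credit_history = 0 if credit_history == 'Unclear Debts' else 1
    pred = classifier.predict([[gender, married, applicant_income, loan_amount, credit_history]])
    return 'Approved' if pred[0] == 1 else 'Rejected'

st.markdown('## Streamlit Loan Prediction ML App')

gender = st.selectbox('Gender', ('Male', 'Female'))
married = st.selectbox('Marital Status', ('Unmarried', 'Married'))
applicant_income = st.number_input('Monthly Applicant Income')
loan_amount = st.number_input('Loan Amount (in thousands)')
credit_history = st.selectbox('Credit History', ('Unclear Debts', 'No Unclear Debts'))

if st.button('Predict'):
    result = prediction(gender, married, applicant_income, loan_amount, credit_history)
    st.success('Your loan is {}'.format(result))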

Alright, let’s now host this app to a public URL using pyngrok library.

View the code on Gist.

Here, we are first running the python script. And then we will connect it to a public URL:

View the code on Gist.
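As a rough sketch of this hosting step (assuming the script is named app.py and Streamlit runs on its default port 8501; exact commands depend on your environment, e.g. Colab):

# In a notebook cell, run the app in the background first:
#   !streamlit run app.py &>/dev/null &

from pyngrok import ngrok

public_url = ngrok.connect(8501)   # open a public tunnel to the local Streamlit port
print(public_url)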

This will generate a link something like this:

And it is as simple as this to build and deploy your machine learning models using Streamlit. 

End Notes

Congratulations! We have now successfully completed loan prediction model deployment using Streamlit. I encourage you to first try this particular project, play around with the values as input, and check the results. And then, you can try out other machine learning projects as well and perform model deployment using streamlit. 

The deployment is simple, fast, and most importantly in Python. However, there are a couple of challenges. We have used Google Colab as the backend to build the app, and as you might be aware, a Colab session automatically restarts after 12 hours. Also, if your internet connection breaks, the Colab session breaks. Hence, if we are using Colab as the backend, we have to rerun the entire application once the session expires.


To deal with this, we can change the backend. AWS can be the right option here for the backend and using that, we can host our web app permanently. So, in my next article, I will demonstrate how to integrate AWS with Streamlit and make the model deployment process more efficient.

Related

Announcing The Machine Learning Starter Program!

Ideal Time to Start your Machine Learning Journey!

Picture this – you want to learn all about machine learning but just can't find the time. There's too much to do, whether it's your professional work or exams around the corner. Then suddenly, you have a lot of time on your hands and a once-in-a-lifetime opportunity to learn machine learning and apply it!

That’s exactly the opportunity in front of you right now. We are living in unprecedented times with half the world in complete lockdown and following social distancing protocols. There are two types of people emerging during this lockdown:

Those who are watching movies and surfing the internet to pass the time

Those who are eager to pick up a new skill, learn a new programming language, or apply their machine learning knowledge

If you're in the latter category – we are thrilled to announce the Machine Learning Starter Program!

You can use the code ‘LOCKDOWN’ to enroll in the Machine Learning Starter Program for FREE! You will have access to the course for 14 days from the day of your enrollment. Post this, the fee of the Program will be Rs. 4,999 (or $80).

What is the Machine Learning Starter Program?

The Machine Learning Starter Program is a step-by-step online starter program to learn the basics of Machine Learning, hear from industry experts and data science professionals, and apply your learning in machine learning hackathons!

This is the perfect starting point to ignite your fledgling machine learning career and take a HUGE step towards your dream data scientist role.

The aim of the Machine Learning Starter Program is to:

Help you understand how this field is transforming and disrupting industries

Acquaint you with the core machine learning algorithms

Enhance and complement your learning through competition and hackathon exposure

We believe in a holistic learning approach and that’s how we’ve curated the Machine Learning Starter Program.

What does the Machine Learning Starter Program include?

There are several components in the Machine Learning Starter Program:

Machine Learning Basics Course

Expert Talks on various machine learning topics by industry practitioners

2 awesome machine learning hackathons

E-book on “Machine Learning Simplified”

Let’s explore each offering in a bit more detail.

Machine Learning Basics Course

This course provides you all the tools and techniques you need to apply machine learning to solve business problems. Here’s what you’ll learn in the Machine Learning Basics course:

Understand how Machine Learning and Data Science are disrupting multiple industries today

Linear, Logistic Regression, Decision Tree and Random Forest algorithms for building machine learning models

Understand how to solve Classification and Regression problems in machine learning

How to evaluate your machine learning models and improve them through Feature Engineering

Improve and enhance your machine learning model’s accuracy through feature engineering

Expert Talks

There is no substitute for experience.

This course is an amalgamation of various talks by machine learning experts, practitioners, professionals and leaders who have decades upon decades of learning experience with them. They have already gone through the entire learning process and they showcase their work and thought process in these talks.

This course features rockstar data science experts like Sudalai Rajkumar (SRK), Professor Balaraman Ravindran, Dipanjan Sarkar, Kiran R and many more!

Machine Learning Hackathons

The Machine Learning Starter Program features two awesome hackathons to augment your learning:

JanataHack

Machine Learning Starter Program hackathon

Come, interact with the community, apply your machine learning knowledge, hack, have fun and stay safe.

E-Book on “Machine Learning Simplified”

This e-book aims to provide an overview of machine learning, recent developments and current challenges in Machine Learning. Here’s a quick summary of what’s included:

What is Machine Learning?

Applications of Machine Learning

How do Machines Learn?

Why is Machine Learning getting so much attention?

Steps required to Build a Machine Learning Model

How can one build a career in Machine learning?

and much more!

Who is the Machine Learning Starter Program for?

The Machine Learning Starter Program is for anyone who:

Is a beginner in machine learning

Wants to kick start their machine learning journey

Wants to learn about core machine learning algorithms

Is interested in a practical learning environment

Wants to practice and enhance their existing machine learning knowledge

So, what are you waiting for? Enroll in the Machine Learning Starter Program for FREE using the code ‘LOCKDOWN’ and begin your learning journey today!

Related

How Should A Machine Learning Beginner Get Started On Kaggle?

Kaggle is a great platform to practice and improve your skills. However, if you’re new to Kaggle, the platform can be quite overwhelming to navigate. In this article, you’ll get a quick overview of how ML engineers can make the most of Kaggle. We’ll guide you through the process, from setting up your account to exploring datasets to competing in challenges and collaborating with other data scientists.

Now, before getting started, it is highly recommended to create a professional Kaggle profile, as it can help you get potential opportunities and make your profile credible. Let’s see how different features on Kaggle can help you excel as a machine learning engineer −

Kaggle Courses

As an MLE, one typically has to be skilled in programming languages like Python, as well as in ML libraries and frameworks such as TensorFlow, PyTorch, and Scikit-Learn. Kaggle Learn provides short and concise courses covering topics including Python, ML libraries, SQL, and data analysis and visualization. These are completely free of cost, while also providing you an opportunity to earn certificates.

Kaggle Competitions

Kaggle’s community competitions offer a great chance to sharpen your skills by solving problems based on real-world datasets. This helps you gain practical experience while networking and collaborating with other like-minded enthusiasts. It is important to identify and choose the competition that best matches your set of skills; one can do this by looking at the competition details. The good part is that you can win swag and incentives along the way. But always keep in mind that your prime focus should be on solving the problems first. Each competition has its own set of rules and guidelines to follow to ensure a fair environment, so make sure to review those as well.

Kaggle Discussions

One of the useful features on Kaggle is their discussion section. Data science and ML enthusiasts gather here to discuss different topics. You can ask for help, receive assistance from others, and gain actionable insights from professionals on how to improve your models.

Choosing to participate in conversations is a wise move, as this will help build your credibility online.

Kaggle Kernel and Notebooks

Kaggle’s most potent feature, aside from learning resources and competitions, is the ability to create notebooks on the go. Kaggle notebooks are built over the Jupyter environment, which support programming languages like R and Python and come with pre-installed machine learning packages. These notebooks can be easily integrated with the datasets already available on the platform, allowing users to analyze huge datasets without having to worry about downloading them.

Users can also collaborate via notebooks (e.g., working with a teammate on a Kaggle competition) for different projects. You can also choose to share your notebook with others and explore the notebooks of fellow practitioners as well. To open yourself up to new opportunities, you can post the project links on your professional portfolio.

You may browse the trending notebooks on the Code tab or even conduct a search for notebooks on a particular topic, which is a terrific way to find inspiration for your own portfolio projects. GPUs are also available to train deep neural networks.

Kaggle Datasets

Kaggle datasets are one of the most popular resources when it comes to finding freely available, publicly accessible datasets. Data science and ML enthusiasts can access and use these datasets in their own projects. A vast array of datasets from sources including social, financial, entertainment, cultural, and economic data are available on Kaggle. It is possible to develop models directly on the platform by integrating these datasets with Kaggle notebooks.

Users can also contribute their own datasets, which makes the community even more robust and powerful.

Conclusion

In conclusion, Kaggle provides an amazing opportunity for learners who want to practice and hone their machine learning skills. One can get involved with the platform by learning, taking part in competitions, and exploring datasets, all of which will help you gain practical experience as a beginner.

For ML engineers, Kaggle is a wonderful place to network with other data scientists, learn from the best, receive feedback and validation, and collaborate with them to build their portfolios and get recognition for the work they do. So, get going right away and start building your ML portfolio on Kaggle!
