Classification Model Simulator Application Using Dash In Python


This article was published as a part of the Data Science Blogathon.


Build an app and bring data to life !!!!

Dash is an open-source Python framework for analytic applications. It is built on top of Flask, Plotly.js, and React.js. If you have used Python for data exploration, analysis, visualization, model building, or reporting, you will find it extremely useful for building highly interactive analytic web applications with minimal code. In this article, we will explore some key features, including DCC and DAQ components and Plotly Express for visuals, and build classification models within an app.

Here are various topics that this article explores

1. Quick look at plotly features & widgets

2. Build an interface for the user to experiment with parameters

3. Build models and measure metrics

4. Leverage Pytest for automated testing

5. Logging errors for debugging

6. Conclusion


We will be using Analytics Vidhya’s dataset from the Loan Prediction problem. Let’s create a separate file for loading the data; it creates an object called obj_Data, which is accessible across files within the project. First, let’s look at the data.

Front End – Add DCC & DAQ controls

Before we begin, let’s take a look at what we will build by the end of this blog.



    daq.Slider( id = 'slider', min=0, max=100, value=70, handleLabel={"showCurrentValue": True,"label": "SPLIT"}, step=10 ),


Next, let’s build two dropdowns, one for selecting the target variable and the other for independent variables. The only thing to note here is that the values are being populated from the dataset and not hardcoded.

options=[{'label':x, 'value':x} for x in obj_Data.df_train_dummies.columns],

    html.P("Select Target", className="control_label"),
    dcc.Dropdown(
        id="select_target",  # id as referenced by the callback inputs later in this article
        options=[{'label': x, 'value': x} for x in obj_Data.df_train_dummies.columns],
        multi=False,
        value='Loan_Status',
        clearable=False,
    ),

Numeric Input with DAQ:

We would also like the user to select the number of splits for model building (for more information, refer to K-fold cross-validation). Let’s add a numeric field with min=1 and max=10.

    daq.NumericInput(
        id='id-daq-splits',
        min=1,  # matches the minimum stated in the text above
        max=10,
        size=75,
        value=2
    ),
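To make the meaning of the "splits" value concrete, here is a minimal, dependency-free sketch of how a plain (unshuffled) K-fold split carves up row indices. The function name `k_fold_indices` is hypothetical and is not part of the app's code; it only illustrates the idea, and it rejects fewer than two splits because a single fold cannot yield both a training and a validation part.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for a plain, unshuffled K-fold split."""
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "train rows,", len(val_idx), "validation rows")
```

Every row lands in exactly one validation fold, so each model is scored on data it did not see during that round of training.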

LED Display with DAQ:

It would also be very useful to show certain standard metrics up front, such as the number of records and the number of categorical and numeric fields, as part of the basic information. For this, let’s make use of dash-daq widgets. We can change the font, color, and background depending on the layout/theme.

    daq.LEDDisplay(
        id='records',
        value=0,
        label="Records",
        size=FONTSIZE,
        color=FONTCOLOR,
        backgroundColor=BGCOLOR
    )

We will use the same code snippet to generate a few more such cards/widgets making sure ids are unique. We will look at how to populate the values in the later segment but for now, let’s set the value to zero.

Now that all user enterable fields are covered, we will have placeholders for showcasing some of the model metrics plots such as AUC-ROC which is a standard curve for classification models. We will populate the chart once the model building is completed in a later segment.

html.Div( [dcc.Graph(id="main_graph")], ),

Back End – let’s build models and measure metrics:

There are two aspects to factor in-

1. Build all seven classification models and plot a bar chart based on accuracy. We will code this in a separate file.

2. Automatically select the best performing model and detail the relevant metrics specific to the chosen model. We will code this in a separate file as well.

Classification Model/s:

Let’s start with the first part. We will build seven classification models, namely Logistic Regression, LightGBM, KNN, Decision Tree, AdaBoost Classifier, Random Forest, and Gaussian Naive Bayes. Here is the snippet for LGBM. As the article is about building an analytics app and not about model building, you can refer to the complete model building code for more details.

    ...
    # X_trn: training features (name assumed by analogy with X_val)
    clf = lgb.LGBMClassifier(n_estimators=1000, max_depth=4, random_state=22).fit(X_trn, y_trn)
    predictions = clf.predict(X_val)
    fun_metrics(predictions, y_val)
    fpr, tpr, _ = roc_curve(y_val, predictions)
    fun_metricsPlots(fpr, tpr, "LGBM")
    fun_updateAccuracy(clf, predictions)
    ...

Now, for the second part where we will generate metrics specific to the best model among the seven. Here is the pseudo-code snippet – refer code for more details.

    if bestModel == 'GNB':
        model = GaussianNB()
    elif bestModel == 'LGBM':
        model = lgb.LGBMClassifier()
    elif bestModel == 'Logistic':
        model = LogisticRegression()
    elif bestModel == 'KNN':
        model = KNeighborsClassifier()
    elif bestModel == 'Random Forest':
        model = RandomForestClassifier()
    elif bestModel == 'DT':
        model = tree.DecisionTreeClassifier()
    else:
        model = AdaBoostClassifier()

Measure Model Metrics:

We will track the metrics for the best model – precision, recall, and accuracy, and for this, we will be using sklearn.metrics library for deriving these numbers. These are the numbers that will be populating our dash-daq widgets.

    from sklearn.metrics import roc_curve, roc_auc_score, recall_score, precision_score, accuracy_score

    precision = round(precision_score(testy, yhat), 2)
    recall = round(recall_score(testy, yhat), 2)
    accuracy = round(accuracy_score(testy, yhat) * 100, 1)

testy has the actual value from the test set and yhat has predicted values.
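To see what these three numbers mean, here is a small dependency-free sketch that computes them by hand for a binary target. The helper name `classification_metrics` and the sample labels are made up for illustration; in the app itself the values come from sklearn.metrics as shown above.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return (precision, recall, accuracy %) for binary labels, rounded as in the app."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    # Guard against division by zero when nothing is predicted (or is) positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs)
    return round(precision, 2), round(recall, 2), round(accuracy * 100, 1)


# Illustrative labels only, not taken from the loan dataset.
testy = [1, 0, 1, 1, 0, 1, 0, 0]
yhat = [1, 0, 0, 1, 1, 1, 0, 0]
precision, recall, accuracy = classification_metrics(testy, yhat)
print(precision, recall, accuracy)
```

Precision asks "of everything predicted positive, how much was right?", recall asks "of everything actually positive, how much was found?", and accuracy is simply the share of correct predictions.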

Build an AUC-ROC plot with Plotly Express:

Similarly, build an AUC-ROC curve using plotly express and save it on fig object fig_ROC

    fig_ROC = px.area(
        x=lr_fpr, y=lr_tpr,
        title=f'ROC Curve (AUC={lr_auc:.4f})',
        labels=dict(x='False Positive Rate', y='True Positive Rate')
    )
    # Diagonal reference line: the performance of random guessing
    fig_ROC.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )
    fig_ROC.update_yaxes(scaleanchor="x", scaleratio=1)
    fig_ROC.update_xaxes(constrain='domain')
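For readers who want to see where the false positive rate, true positive rate, and AUC values come from, here is a rough pure-Python sketch. It sweeps a threshold over predicted scores and integrates the curve with the trapezoid rule. The function names and the four sample scores are illustrative assumptions; the app relies on sklearn's roc_curve and roc_auc_score instead.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    # Visit thresholds from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoid rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Illustrative labels and scores only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(list(zip(fpr, tpr)), auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, which is why a value below 0.5 is worth flagging to the user.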

Interaction with callbacks:

We have now designed the front end with widgets and placeholders and, for the back end, written a function that builds the classification models, makes predictions, and generates model metrics. These two should interact with each other every time the user changes an input, and this can be achieved using callbacks. Callbacks are Python functions that are automatically called by Dash whenever an input component’s property changes.

There are 3 sections to callbacks-

1. List of all the outputs (or just a single output)

2. List of all the inputs (or just a single input)

3. Function which takes the input, does the defined processing, and gives back the output.

Note: If there are multiple inputs or multiple outputs then the controls are wrapped under [ ] if not then it can be skipped.

    [
        Output("main_graph", 'figure'),
        Output("recall", 'value'),
    ]
    ....
    [
        Input("select_target", "value"),
        Input("select_independent", "value"),
        ...
    ]
    ....

In the above code snippet for output, the first argument is main_graph, the id that we set during UI design. The second argument is the property to update, which in this case is figure. Similarly, the recall control expects the value property, which in this case is numeric. More information on callbacks can be found here. Bringing all our input/output controls together, the code would look like this.

    @app.callback(
        [
            Output("main_graph", 'figure'),
            Output("individual_graph", 'figure'),
            Output("aggregate_graph", 'figure'),
            Output("slider-output-container", 'children'),
            Output("precision", 'value'),
            Output("recall", 'value'),
            Output("accuracy", 'value'),
            Output("auc", 'value'),
            Output("trainset", 'value'),
            Output("testset", 'value'),
            Output('model-graduated-bar', 'value'),
            Output('id-insights', 'children'),
            Output("model-graphs", 'figure'),
            Output("best-model", 'children'),
            Output("id-daq-switch-model", 'on'),
            Output('auto-toast-model', 'is_open')
        ],
        [
            Input("select_target", "value"),
            Input("select_independent", "value"),
            Input("slider", "value"),
            Input("id-daq-splits", "value"),
            Input("select_models", "value")
        ]
    )
    def measurePerformance(target, independent, slider, splits, selected_models):
        (fig_ROC, Fig_Precision, fig_Threshold, precision, recall, accuracy,
         trainX, testX, auc, fig_model, bestModel) = multiModel.getModels(
            target, independent, slider, splits, selected_models)
        # Open the warning toast when the AUC falls below 0.5
        auc_toast = auc < 0.5
        # The returned values follow the order of the Output list above
        return (
            fig_ROC, Fig_Precision, fig_Threshold,
            'Train / Test split size: {} / {}'.format(slider, 100 - slider),
            precision, recall, accuracy, auc, trainX, testX, auc * 100,
            f'The best performing model is {bestModel} with accuracy of {accuracy}, precision of {precision} and recall of {recall} with Area under curve of {auc}. Try for various K FOLD values to explore further.',
            fig_model,
            f'The top performing model is {bestModel}',
            True,
            auc_toast,
        )

Write some testcases using PyTest:

Writing unit test cases for typical web development is normal but generally, for analytic apps with predictive models and visuals, there is a tendency to skip and just do a sanity check manually at the end. The pytest library makes it easier to configure the test cases, write functions to test for specific inputs & outputs. In short, write it once and keep running the test before pushing code to QA/Prod environment. Refer pytest document for more details.

As an example, let’s write a case to check for Precision value. We can use the same framework and extend it to many more cases – positive, negative, and borderline cases.

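    # pip install pytest
    import pytest

    def test_buildModels():
        # target, independent, slider and selected_models must be defined before this call
        (fig_ROC, fig_precision, fig_threshold, precision, recall, accuracy,
         trainX, testX, lr_auc) = buildModel(target, independent, slider, selected_models)
        # precision is a ratio, so it must lie between 0 and 1 inclusive
        assert 0 <= precision <= 1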

The assert keyword checks that the specified criterion is met and designates the test case as either Pass or Fail.
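As a sketch of what positive, negative, and borderline cases can look like, the example below tests a small hypothetical helper, `split_counts`, which turns the slider percentage into train and test row counts. The helper is an assumption made for illustration and is not part of the app's code; the test functions follow the `test_` naming convention that pytest discovers automatically.

```python
def split_counts(n_rows, train_pct):
    """Hypothetical helper: rows in the train and test sets for a slider value."""
    if not 0 <= train_pct <= 100:
        raise ValueError("train_pct must be between 0 and 100")
    n_train = round(n_rows * train_pct / 100)
    return n_train, n_rows - n_train


def test_typical_split():
    # Positive case: a regular 70/30 split.
    assert split_counts(200, 70) == (140, 60)


def test_borderline_split():
    # Borderline cases: the slider at either end leaves one set empty.
    assert split_counts(200, 0) == (0, 200)
    assert split_counts(200, 100) == (200, 0)


def test_invalid_split_is_rejected():
    # Negative case: an out-of-range percentage must raise an error.
    try:
        split_counts(200, 120)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

With pytest installed, `pytest.raises(ValueError)` is the more idiomatic way to write the negative case; the try/except form is used here only to keep the sketch free of dependencies.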

Configure test cases

Test cases under execution

One test case failed

All test cases passed

Logging errors:

Logging errors and warnings helps us keep track of issues in our code, and for this we will use the logging library. We will create a separate file for it. Logging is not only a good practice to follow but also helps immensely during the debugging process. Some prefer to use print() statements, which write output to the console for reference, but it is recommended to use logging.

Create a file by name ‘model.log’ in your project directory and use the below code for logging errors in this file.

    # logging is part of the Python standard library, so no installation is needed
    import logging

    logging.basicConfig(
        filename='model.log',
        level=logging.DEBUG,
        format='%(asctime)s:%(levelname)s:%(filename)s:%(funcName)s:%(message)s'
    )
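Here is a minimal sketch of how an error could end up in the log file with the same format string. The logger name `model_simulator`, the `safe_divide` function, and the temporary log location are assumptions for illustration; the point is that `logger.exception` records both the message and the traceback.

```python
import logging
import os
import tempfile


def build_logger(log_path):
    """Create a file logger that uses the same format as the configuration above."""
    logger = logging.getLogger("model_simulator")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s:%(levelname)s:%(filename)s:%(funcName)s:%(message)s"))
    logger.addHandler(handler)
    return logger


def safe_divide(logger, a, b):
    """Return a / b, logging the error (with traceback) instead of crashing."""
    try:
        return a / b
    except ZeroDivisionError:
        logger.exception("division failed for a=%s, b=%s", a, b)
        return None


# A temporary directory is used so the sketch does not touch the project folder.
log_path = os.path.join(tempfile.mkdtemp(), "model.log")
logger = build_logger(log_path)
result = safe_divide(logger, 1, 0)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

The log line carries the timestamp, level, file name, and function name, which is usually enough to locate the failing call without rerunning the app.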

The errors can be tracked in the model.log file. Here is a sample error:



Python with plotly Dash can be used to build some very complex analytics applications in a short time. I personally find it useful for rapid prototyping, client demos, proposals, and POC’s. The best part of the whole process is you only need to know the basics of python and you can create the front end, back end, visuals, and predictive models which are core to analytics apps. If you use your creative side and focus on the user experience, then you are sure to impress your team, client, or end-user.

What Next?:

The app can be extended to multi-class classification models, add more visuals & metrics as required, build a login page with user authentication, maybe save data to DB, and much more. Hope you learned something new today.

Happy learnings !!!!

You can connect with me – Linkedin

You can find the code for reference – Github




Classification Algorithms In Python – Heart Attack Prediction And Analysis

Logistic Regression

Decision Trees

Random Forest

K nearest neighbor.

After we build the models using training data, we will test the accuracy of the model with test data and determine the appropriate model for this dataset.

The dataset used is available on Kaggle – Heart Attack Prediction and Analysis

In this article, we will focus only on implementing outlier detection, outlier treatment, training models, and choosing an appropriate model.

Problem Statement:

output: 0= less chance of heart attack 1= more chance of heart attack

Before we start with code, we need to import all the required libraries in Python.

I follow a convention of dedicating one cell in the Notebook only for imports. This is beneficial when we want to add additional import statements. We just need to run the cell which only has imports. It will not affect the remaining ‘code’.

Python Code:

Before proceeding, we will get a basic understanding of our data by using the following command.

Now, we want to understand the number of records and the number of features. This can be achieved by using the following code snippet,

    # Number of records and features in the dataset
    data1.shape

The 303 in the output defines the number of records in the dataset and 14 defines the number of features in the dataset including the ‘target variable’.

Data Cleaning/ Data preprocessing

Before providing data to a model, it is essential to clean the data and treat the nulls, outliers, duplicate data records.

We will begin with checking for duplicate rows with the code snippet,

    # Check duplicate rows in data
    duplicate_rows = data1[data1.duplicated()]
    print("Number of duplicate rows :: ", duplicate_rows.shape)

The data contains 1 duplicate row. We will remove the duplicate row and check for duplicates again.

    # We have one duplicate row.
    # Removing the duplicate row
    data1 = data1.drop_duplicates()
    duplicate_rows = data1[data1.duplicated()]
    # Number of duplicate rows after dropping one duplicate row
    print("Number of duplicate rows :: ", duplicate_rows.shape)

Now, there are 0 duplicate rows in the data. We will check for ‘null’ values in the data.

    # Looking for null values
    print("Null values :: ")
    print(data1.isnull().sum())

    # Check if the other data is consistent
    data1.shape

As there are no ‘null’ values in data, we will go ahead with ‘Outlier Detection‘ using box plots.

We will plot box plots for all features.

    # As there are no null values in data, we can proceed with the next steps.
    # Detecting outliers with box plots (one plot per feature)
    sns.boxplot(x=data1['age'])       # No outliers observed in 'age'
    sns.boxplot(x=data1['sex'])       # No outliers observed in 'sex'
    sns.boxplot(x=data1['cp'])        # No outliers in 'cp'
    sns.boxplot(x=data1['trtbps'])    # Some outliers are observed in 'trtbps'. They will be removed later
    sns.boxplot(x=data1['chol'])      # Some outliers are observed in 'chol'. They will be removed later
    sns.boxplot(x=data1['fbs'])
    sns.boxplot(x=data1['restecg'])
    sns.boxplot(x=data1['thalachh'])  # Outliers present in 'thalachh'
    sns.boxplot(x=data1['exng'])
    sns.boxplot(x=data1['oldpeak'])   # Outliers are present in 'oldpeak'
    sns.boxplot(x=data1['slp'])
    sns.boxplot(x=data1['caa'])       # Outliers are present in 'caa'
    sns.boxplot(x=data1['thall'])

From the box plots, outliers are present in trtbps, chol, thalachh, oldpeak, caa, thall.

The Outliers are removed using two methods,

1. Inter-Quartile Range and

2. Z-score

We will use both methods and check the effect on the dataset.

1. Inter-Quartile Range

In the IQR method, data points higher than the upper limit or lower than the lower limit are considered outliers.

upper limit = Q3 + 1.5 * IQR

lower limit = Q1 – 1.5 * IQR

We find the IQR for all features using the code snippet,

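    # Find the InterQuartile Range
    Q1 = data1.quantile(0.25)
    Q3 = data1.quantile(0.75)
    IQR = Q3 - Q1
    print('*********** InterQuartile Range ***********')
    print(IQR)

    # Remove the outliers using IQR: drop every row with a value outside the limits defined above
    data2 = data1[~((data1 < (Q1 - 1.5 * IQR)) | (data1 > (Q3 + 1.5 * IQR))).any(axis=1)]
    data2.shape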

After removing outliers using IQR, the data contains 228 records.
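To make the two limits concrete, here is a small standard-library sketch that applies the same rule to a single made-up column. The values are invented for illustration and are not from the heart attack dataset; `method="inclusive"` is chosen because it interpolates linearly between data points, which should correspond to the default behaviour of pandas' `quantile`.

```python
import statistics


def iqr_bounds(values):
    """Return (lower limit, upper limit) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up cholesterol-like readings with one extreme value.
readings = [180, 200, 210, 220, 240, 560]
lower, upper = iqr_bounds(readings)
kept = [v for v in readings if lower <= v <= upper]
print(lower, upper, kept)
```

Here Q1 is 202.5 and Q3 is 235, so the limits are 153.75 and 283.75, and the reading of 560 is dropped as an outlier.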

2. Z – Score

If the absolute value of a Z-score is greater than 3, the data point differs markedly from the other data points and hence is treated as an outlier.

    # Removing outliers using Z-score
    z = np.abs(stats.zscore(data1))
    data3 = data1[(z < 3).all(axis=1)]
    data3.shape

After using Z-score to detect and remove outliers, the number of records in the dataset is 287. 

As the number of records available is higher after Z-score, we will proceed with ‘data3’
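The Z-score rule can be sketched for a single column without any third-party library. The values below are made up for illustration; the population standard deviation is used because that is the default in `scipy.stats.zscore`.

```python
import statistics


def remove_by_zscore(values, threshold=3.0):
    """Keep only the values whose absolute Z-score is below the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        return list(values)  # all values are identical, so nothing is an outlier
    return [v for v in values if abs((v - mean) / std) < threshold]


# Twenty ordinary readings plus one extreme value (illustrative only).
values = [48, 49, 50, 51, 52] * 4 + [500]
kept = remove_by_zscore(values)
print(len(values), "->", len(kept))
```

Note that the rule needs enough data points to work: in a very small sample, a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3.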


After removing outliers from data, we will find the correlation between all the features.

Two types of correlation will be used here.

Pearson Correlation

Spearman Correlation
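The difference between the two measures is easiest to see on a relationship that is monotonic but not linear. The sketch below is a plain-Python illustration with made-up numbers (in the article the values come from pandas' `corr`): Pearson measures linear association, while Spearman applies the same formula to the ranks of the values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # grows with x, but not along a straight line
print(round(pearson(x, y), 4), spearman(x, y))
```

Spearman reports a perfect association because y always increases with x, while Pearson stays below 1 because the points do not fall on a line.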


1. Pearson Correlation

    # Finding the correlation between variables
    pearsonCorr = data3.corr(method='pearson')
    spearmanCorr = data3.corr(method='spearman')

    # plt.subplots returns a (figure, axes) pair
    fig, ax = plt.subplots(figsize=(14, 8))
    sns.heatmap(pearsonCorr, vmin=-1, vmax=1, cmap="Greens", annot=True, linewidth=0.1)
    plt.title("Pearson Correlation")

From the heat map, the same values of correlation are repeated twice. To remove this, we will mask the upper half of the heat map and show only the lower half. The same procedure will be carried out for Spearman Correlation.

    # Create mask for both correlation matrices
    # Pearson corr masking
    # Generating mask for upper triangle
    maskP = np.triu(np.ones_like(pearsonCorr, dtype=bool))

    # Adjust mask and correlation (drop the first row and last column, which would be fully masked)
    maskP = maskP[1:, :-1]
    pCorr = pearsonCorr.iloc[1:, :-1].copy()

    # Setting up a diverging palette
    cmap = sns.diverging_palette(0, 200, s=150, l=50, as_cmap=True)
    fig, ax = plt.subplots(figsize=(14, 8))
    sns.heatmap(pCorr, vmin=-1, vmax=1, cmap=cmap, annot=True, linewidth=0.3, mask=maskP)
    plt.title("Pearson Correlation")

2. Spearman Correlation

    fig, ax = plt.subplots(figsize=(14, 8))
    sns.heatmap(spearmanCorr, vmin=-1, vmax=1, cmap="Blues", annot=True, linewidth=0.1)
    plt.title("Spearman Correlation")

After masking the upper half of the heat map,

    # Create mask for both correlation matrices
    # Spearman corr masking
    # Generating mask for upper triangle
    maskS = np.triu(np.ones_like(spearmanCorr, dtype=bool))

    # Adjust mask and correlation
    maskS = maskS[1:, :-1]
    sCorr = spearmanCorr.iloc[1:, :-1].copy()

    # Setting up a diverging palette
    cmap = sns.diverging_palette(0, 250, s=150, l=50, as_cmap=True)
    fig, ax = plt.subplots(figsize=(14, 8))
    sns.heatmap(sCorr, vmin=-1, vmax=1, cmap=cmap, annot=True, linewidth=0.3, mask=maskS)
    plt.title("Spearman Correlation")

From both the heat maps, the features fbs, chol and trtbps have the lowest correlation with output.


Before implementing any classification algorithm, we will divide our dataset into training data and test data. I have used 70% of the data for training and the remaining 30% will be used for testing.

    # From this we observe that the minimum correlation with output is in
    # fbs, trtbps and chol
    x = data3.drop("output", axis=1)
    y = data3["output"]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

We will implement four classification algorithms,

1. Logistic Regression Classifier

2. Decision Trees Classifier

3. Random Forest Classifier

4. K Nearest Neighbours Classifier

1. Logistic Regression Classifier

The code snippet used to build Logistic Regression Classifier is,

    # Building classification models
    names = ['Age', 'Sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
             'exng', 'oldpeak', 'slp', 'caa', 'thall']

    # **************** Logistic Regression *****************
    logReg = LogisticRegression(random_state=0, solver='liblinear').fit(x_train, y_train)

    # Check accuracy of Logistic Regression
    y_pred_logReg = logReg.predict(x_test)
    print("Accuracy of logistic regression classifier :: ",
          metrics.accuracy_score(y_test, y_pred_logReg))

    # Removing the features with low correlation and checking effect on accuracy of model
    low_corr = ["fbs", "trtbps", "chol", "restecg"]
    x_train1 = x_train.drop(columns=low_corr)
    x_test1 = x_test.drop(columns=low_corr)

    logReg1 = LogisticRegression(random_state=0, solver='liblinear').fit(x_train1, y_train)
    y_pred_logReg1 = logReg1.predict(x_test1)
    print("\nAccuracy of logistic regression classifier after removing features:: ",
          metrics.accuracy_score(y_test, y_pred_logReg1))

The accuracy of logistic regression classifier using all features is 85.05%

While the accuracy of logistic regression classifier after removing features with low correlation is 88.5%

2. Decision Tree Classifier

The code snippet used to build a decision tree is,

    # *********************** Decision Tree Classification ***********************
    decTree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(x_train, y_train)
    y_pred_decTree = decTree.predict(x_test)
    print("Accuracy of Decision Trees :: ", metrics.accuracy_score(y_test, y_pred_decTree))

    # Remove fbs, trtbps and chol (low correlation with output), along with age and sex
    dropped = ["fbs", "trtbps", "chol", "age", "sex"]
    x_train_dt = x_train.drop(columns=dropped)
    x_test_dt = x_test.drop(columns=dropped)

    decTree1 = DecisionTreeClassifier(max_depth=6, random_state=0).fit(x_train_dt, y_train)
    y_pred_dt1 = decTree1.predict(x_test_dt)
    print("Accuracy of decision Tree after removing features:: ",
          metrics.accuracy_score(y_test, y_pred_dt1))

The accuracy of the decision tree with all features is 70.11% while accuracy after removing low correlation features is 78.16%

3. Random Forest Classifier

Implement a random forest classifier using the code,

    # Using Random forest classifier
    rf = RandomForestClassifier(n_estimators=500).fit(x_train, y_train)
    y_pred_rf = rf.predict(x_test)
    print("Accuracy of Random Forest Classifier :: ", metrics.accuracy_score(y_test, y_pred_rf))

    # Find the score of each feature in model and drop the features with low scores
    f_imp = rf.feature_importances_
    for i, v in enumerate(f_imp):
        print('Feature: %s, Score: %.5f' % (names[i], v))

The accuracy of the model is 86.20%. Along with accuracy, we also print each feature and its importance in the model. The idea is to eliminate features with low importance, build another classifier, and check the effect on the accuracy of the model. As all the features make some contribution to the model, we will keep all of them.

4. K Nearest Neighbours Classifier

Implement K nearest neighbor classifier and print the accuracy of the model.

    # K Neighbours Classifier
    knc = KNeighborsClassifier().fit(x_train, y_train)
    y_pred_knc = knc.predict(x_test)
    print("Accuracy of K-Neighbours classifier :: ", metrics.accuracy_score(y_test, y_pred_knc))

The accuracy is only 59.77%

Conclusion

    # Models and their accuracy
    print("*****************Models and their accuracy*****************")
    print("Logistic Regression Classifier :: ", metrics.accuracy_score(y_test, y_pred_logReg1))
    print("Decision Tree :: ", metrics.accuracy_score(y_test, y_pred_dt1))
    print("Random Forest Classifier :: ", metrics.accuracy_score(y_test, y_pred_rf))
    print("K Neighbours Classifier :: ", metrics.accuracy_score(y_test, y_pred_knc))

After implementing four classification models and comparing their accuracy, we can conclude that for this dataset Logistic Regression Classifier is the appropriate model to be used.

About Me:

Data Visualization Enthusiast. Business Analytics Student. E&TC Engineer.

Kaggle Notebook link for entire code.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion. 


Autocorrect Feature Using Nlp In Python

This article was published as a part of the Data Science Blogathon.

Natural Language Processing (NLP) is the field of artificial intelligence that connects linguistics with computer science. I am assuming that you already understand the basic concepts of NLP, so we will move ahead. Some NLP applications are: auto spelling correction, sentiment analysis, fake news detection, machine translation, question answering (Q&A), chatbots, and many more.

Introduction to Autocorrect

Have you ever wondered how the autocorrect feature works on a smartphone keyboard? Today almost every smartphone brand, regardless of price, provides an autocorrect feature in its keyboard. The list of things a smartphone is good for would be never-ending, and we are not going to focus on that topic in this blog!

As you can guess from the title, the main purpose of this article is to build an autocorrect feature. It is similar to, but not an exact copy of, the one on the smartphones we use now; it is an implementation of Natural Language Processing on a smaller dataset, such as a book.

Okay, let’s understand how these autocorrect features work. In this article, I am going to take you through “How to build Autocorrect with Python”.

Autocorrect using NLP With Python- How it works?

In machine learning terms, autocorrect is based purely on Natural Language Processing (NLP). As the name suggests, it is programmed to correct spelling errors while you type. So let’s see how it works.

Before I move on to the code, let us understand how autocorrect works. Suppose you have typed a word on your keyboard. If that word exists in the vocabulary of the smartphone, it will assume that you have written the right word, whether it is a name, a noun, or any other word you wanted to type.

If the word exists in the history of the smartphone, it is accepted as a correct word. But what if the word doesn’t exist? If the word you have typed is not in the smartphone’s history, autocorrect is programmed to find the most similar words in that history and suggest them.

So let us understand the algorithm.

There are 4 key steps to building an autocorrect model that corrects spelling errors:

1:- Identify Misspelled Word — Let us consider an example, how would we get to know the word “drea” is spelled incorrectly or correctly? If a word is spelled correctly then the word will be found in a dictionary and if it is not there then it is probably a Misspelled Word. Hence, when a word is not found in a dictionary then we will flag it for correction.

2:- Find ‘n’ Strings Edit distance away — An edit is one of the operations which is performed on a string in order to transform it into another String, and n is nothing but the edit distance that is an edit distance like- 1, 2, 3, so on… which will count the number of edit operations that to be performed. Hence, the edit distance n tells us that how many operations are away from one string to another. Following are the different types of edits:-

Insert (will add a letter)

Delete (will remove a letter)

Switch (it will swap two nearby letters)

Replace (exchange one letter to another one)

With these four edits, we can modify any string. Combining edits lets us build a list of all possible strings that are n edits away.

IMPORTANT Note: For autocorrect, we usually take n between 1 and 3 edits.

3:- Filtering of Candidates — Here we want to consider only correctly spelled real words from our generated candidate list so we can compare the words to a known dictionary (like we did in the first step) and then filter out the words in our generated candidate list that do not appear in the known “dictionary”.

4:- Calculate Probabilities of Words — We can calculate the probabilities of words and then find the most likely word from our generated candidates with our list of actual words. This requires word frequencies that we know and the total number of words in the corpus (also known as dictionary).
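The four steps above can be sketched in a few lines of plain Python for the case n = 1. This is only an illustration of the general recipe: the function names and the tiny word-count dictionary are made up, and the implementation built later in this article ranks candidates by Jaccard similarity rather than by generating edits.

```python
import string


def one_edit_away(word):
    """All strings reachable from `word` with a single edit (step 2, n = 1)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    switches = [left + right[1] + right[0] + right[2:]
                for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + switches + replaces + inserts)


def suggest(word, word_counts):
    """Return the most probable known word one edit away, or the word itself."""
    if word in word_counts:                              # step 1: known words are kept
        return word
    candidates = one_edit_away(word) & set(word_counts)  # step 3: keep real words only
    if not candidates:
        return word
    total = sum(word_counts.values())
    # Step 4: pick the candidate with the highest relative frequency.
    return max(candidates, key=lambda w: word_counts[w] / total)


# Made-up word counts standing in for a real corpus.
word_counts = {"dream": 5, "dread": 2, "drear": 1, "tree": 9}
suggestion = suggest("drea", word_counts)
print(suggestion)
```

Although "tree" is the most frequent word in this toy dictionary, it is two edits away from "drea", so it never becomes a candidate and "dream" is suggested instead.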

Build an Autocorrect Feature using NLP with Python

I hope you are now clear about what autocorrect is and how it works. Now let us see how we can build an autocorrect feature with Python. A smartphone uses past history to check whether a typed word is correct, so here we need a collection of words for our autocorrect to work with.

So I am going to use the text of a book, which you can easily download from here. Now let’s get started with the task of building an autocorrect model with Python.

Note: You can use any kind of text data.


To run this task, we require some libraries. I am going to use libraries that are very common in machine learning, so you should already have all of them installed on your system except one. You need to install a library called textdistance, which can be easily installed using the pip command.

pip install textdistance

Now let us get started with this by importing all the necessary packages, libraries and by reading our text file:


    import pandas as pd
    import numpy as np
    import textdistance
    import re
    from collections import Counter

    words = []
    with open('auto.txt', 'r') as f:
        file_name_data = f.read()
        file_name_data = file_name_data.lower()
        # \w+ matches runs of letters, digits and underscores
        words = re.findall(r'\w+', file_name_data)

    # This is our vocabulary
    V = set(words)
    print(f"Top ten words in the text are:{words[0:10]}")
    print(f"Total Unique words are {len(V)}.")

Output:

    Top ten words in the text are:['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
    Total Unique words are 17140.

In the above code, you can see that we have made a list of words. Now we will build the frequency of those words, which can be easily done using the Counter class from Python’s collections module:


    word_freq = {}
    word_freq = Counter(words)
    print(word_freq.most_common()[0:10])

Output:

    [('the', 14431), ('and', 6430), ('a', 4736), ('to', 4625), ('in', 4172), ('his', 2530), ('it', 2522), ('i', 2127)]

Relative Frequency of words

Now we want the probability of occurrence of each word, which equals the relative frequency of the word:


    probs = {}
    Total = sum(word_freq.values())
    for k in word_freq.keys():
        probs[k] = word_freq[k] / Total

Finding Similar Words

We will rank similar words by Jaccard distance, computed on the two-grams (qval=2) of the words. Then we will return the five most similar words, ordered by similarity and probability:-


    def my_autocorrect(input_word):
        input_word = input_word.lower()
        if input_word in V:
            return 'Your word seems to be correct'
        else:
            # Similarity = 1 - Jaccard distance on two-grams, computed against every known word
            sim = [1 - textdistance.Jaccard(qval=2).distance(v, input_word) for v in word_freq.keys()]
            df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
            df = df.rename(columns={'index': 'Word', 0: 'Prob'})
            df['Similarity'] = sim
            output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
            return output
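To get a feel for what the similarity score measures, here is a simplified sketch of Jaccard similarity on two-grams written without textdistance. It treats the two-grams of each word as a plain set, which is an assumption made for clarity; textdistance may count repeated two-grams, so its values can differ for words that contain the same pair of letters more than once.

```python
def two_grams(word):
    """Set of overlapping two-letter chunks, e.g. 'drea' -> {'dr', 're', 'ea'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_similarity(a, b):
    """Shared two-grams divided by all distinct two-grams of both words."""
    grams_a, grams_b = two_grams(a), two_grams(b)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both words are too short to have any two-gram
    return len(grams_a & grams_b) / len(union)


print(jaccard_similarity("drea", "dream"))
print(jaccard_similarity("drea", "bread"))
```

"drea" and "dream" share three of four distinct two-grams, while "drea" and "bread" share only two of five, so "dream" ranks higher.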

Okay, Now, let us find some similar words by using our autocorrect function:



                   Word      Prob  Similarity
    2209   nevertheless  0.000229    0.750000
    13300      boneless  0.000014    0.416667
    12309      elevates  0.000005    0.416667
    718           never  0.000942    0.400000
    6815          level  0.000110    0.400000

This is how the autocorrect algorithm works here!!

We have taken our words from a book. In the same way, a smartphone already has some words in its vocabulary and records more as the user types on the keyboard.


You can use this feature in real-time applications. I hope you liked this article on how to build an autocorrect feature using NLP with Python.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion


Cricket Shot Classification Using Pose Of The Player Image

This article was published as a part of the Data Science Blogathon.

This article is an ongoing part of a blog that I have already written.

Please check out the below link to have a better understanding of Pose Detection.

Pose detection has many interesting applications and use cases. In this article, we'll discuss one such application and build a model to solve that problem.

The objective of this article is to build a model that can classify the cricket shots using the pose of a player. For this, an image will be input into the model. It will detect the pose of the person in the image and then using the pose that was detected, we will classify what type of shot it was.

Table of Contents

6. Evaluating model performance

!pip install pyyaml==5.1

# install detectron2: !pip install detectron2==0.1.3 -f

We are going to load the dataset, which is saved on Google Drive. To do that, we first mount the drive and then extract the zip file of shot images.

# mount drive
from google.colab import drive
drive.mount('drive/')

The zip file contains the images for the different types of shots. After extracting it, we get the names of the folders, which are the classes (the different types of shots).

# extract files
!unzip 'drive/My Drive/'

We do this using the listdir function of the os library. Printing the folder names shows that we have four folders: pull, cut, drive and sweep.

import os
# specify path
path='shot/'
# list down the folders
folders = os.listdir(path)
print(folders)

Output:-     [‘pull’, ‘cut’, ‘drive’, ‘sweep’]

Next, we read all the images and store them in a list named images. We also store the labels in a second list, where the label of each image is simply the name of the folder in which the image is stored. The process is straightforward: we go through each folder, read the images one by one, and append them to the lists.

# for dealing with images
import cv2
# create lists
images = []
labels = []
# for each folder
for folder in folders:
    # list down image names
    names=os.listdir(path+folder)
    # for each image
    for name in names:
        # read an image
        img=cv2.imread(path+folder+'/'+name)
        # append image to list
        images.append(img)
        # append folder name (type of shot) to list
        labels.append(folder)

Let’s quickly check the number of images using the len function. We can observe that there are 290 images.

# number of images
len(images)

Output:- 290

Now we visualize a few images from the dataset, plotting five randomly chosen images for each type of shot. We use matplotlib to display the images and the random module to select them.

We create a subplot with four rows for the four classes and five columns for the five examples. For each class, we randomly pick five images and read them using the cv2.imread function. Because OpenCV reads images in BGR order, we convert each image to RGB before displaying it.

# visualization library
import matplotlib.pyplot as plt
# for randomness
import random
# create subplots with 4 rows and 5 columns
fig, ax = plt.subplots(nrows=4, ncols=5, figsize=(15,15))
# randomly display 5 images for each shot
# for each folder
for i in range(len(folders)):
    # read image names
    names=os.listdir(path+folders[i])
    # randomly select 5 image names
    names= random.sample(names, 5)
    # for each image
    for j in range(len(names)):
        # read an image
        img = cv2.imread(path+ folders[i]+ '/' +names[j])
        # convert BGR to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # display image
        ax[i, j].imshow(img)
        # set folder name as title
        ax[i, j].set_title(folders[i])
        # Turn off axis
        ax[i, j].axis('off')

                                    Source:- Author

Here you can see a few examples of the images in the dataset. Since we have only a small number of images for training, we'll use data augmentation to increase the size of the training set.

Data Augmentation

To increase our training size, we flip the images horizontally. This helps in two ways. First, players can be either right-handed or left-handed, so flipping the images makes our model more general. Second, it increases the number of images available for training.

We create empty lists to store the augmented images and their corresponding labels. Then, for each image in the dataset, we flip it using the flip function of cv2 and append it to the list.

# image augmentation
aug_images=[]
aug_labels=[]
# for each image in training data
for idx in range(len(images)):
    # fetch an image and label
    img = images[idx]
    label= labels[idx]
    # flip an image
    img_flip = cv2.flip(img, 1)
    # append augmented image to list
    aug_images.append(img_flip)
    # append label to list
    aug_labels.append(label)
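As a rough illustration of what a horizontal flip does to the pixel grid, here is a small sketch that mirrors a tiny image stored as nested lists. It is not how cv2 is implemented. The function name flip_horizontal and the 2x3 image are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image left-to-right; `image` is a list of pixel rows."""
    return [row[::-1] for row in image]

# Hypothetical 2x3 single-channel "image".
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = flip_horizontal(image)
print(flipped)

# Augmenting keeps the original and adds the mirrored copy,
# so the number of samples doubles.
augmented_set = [image, flipped]
print(len(augmented_set))
```

Each row is reversed while the row order stays the same, which is why the label of the image (the type of shot) can be reused unchanged for the flipped copy.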

Next, we are going to visualize a few augmented images along with the original images.

We randomly pick five images and create a subplot as before. For each one, we plot the actual image first and then its augmented version.

Flipping an image does not change the type of shot: a pull shot is still a pull shot even if we flip the image horizontally.

# display actual and augmented image for sample images
# create indices
ind = range(len(aug_images))
# randomly sample indices
ind = random.sample(ind, 5)
# create subplots with 5 rows and 2 columns
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(15,15))
# for each row
for row in range(5):
    # for each column
    for col in range(2):
        # first column for actual image
        if col==0:
            # display actual image
            ax[row, col].imshow(images[ ind[row] ] )
            # set title
            ax[row, col].set_title('Actual')
            # Turn off axis
            ax[row, col].axis('off')
        # second column for augmented image
        else:
            # display augmented image
            ax[row, col].imshow(aug_images[ ind[row] ] )
            # set title
            ax[row, col].set_title('Augmented')
            # Turn off axis
            ax[row, col].axis('off')

                                                 Source:- Author

Now we are combining the actual and the augmented images and checking the number of images.

# combine actual and augmented images & labels
images = images + aug_images
labels = labels + aug_labels
# number of images
len(images)

Output:- 580

Detecting pose using detectron2

Now we have 580 images including both the actual and the augmented images for training. Now our data set is ready. Next, we’ll detect the pose of the players in all of these images using detectron2.

We will use a pre-trained model available in detectron2 to detect these poses. We import a few utilities, choose the model architecture through its config file, and set the path to the weights of the pre-trained model.

After that, we set the score threshold for the detected bounding boxes to 0.8. Finally, we create our predictor. Now the model is ready.

# import some common detectron2 utilities
# to obtain pretrained models
from detectron2 import model_zoo
# set up predictor
from detectron2.engine import DefaultPredictor
# set config
from detectron2.config import get_cfg
# define configure instance
cfg = get_cfg()
# get a model specified by relative path under Detectron2’s official configs/ directory.
cfg.merge_from_file(model_zoo.get_config_file ("COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml"))
# download pretrained model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url ("COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml")
# set threshold for this model
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.8
# create predictor
predictor = DefaultPredictor(cfg)

Let’s visualize a few predictions from the model. We randomly pick five images and, for each image, take the predictions, define the visualizer, draw the predictions on the image, and finally display the result.

# for drawing predictions on images
from detectron2.utils.visualizer import Visualizer
# to obtain metadata
from detectron2.data import MetadataCatalog
# to display an image
from google.colab.patches import cv2_imshow
# randomly select images
for img in random.sample(images,5):
    # make predictions
    outputs = predictor(img)
    # use `Visualizer` to draw the predictions on the image.
    v = Visualizer(img[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1)
    # draw prediction on image
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    # display image
    cv2_imshow(v.get_image()[:, :, ::-1])

Source:- Author

Here are the predictions from the model. You can see the bounding boxes along with the key points predicted for each of these players. The model has even detected some of the people in the background.

Next, we define a function that detects the pose in an image and extracts its key points. The function takes an image as input, makes predictions for it using the pre-trained model, and converts the extracted key points into a numpy array.

An image can contain multiple detected people, so we keep only the key points of the detection with the highest score. Finally, we convert the key points to a 1D array, because we want to build a neural network on top of them and that network takes a one-dimensional input.

We then apply this function to all the images and store the results in a list named keypoints.

Once we have the key points for all the images, we will build a neural network that classifies them into the type of shot.
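The reshaping step is easy to lose track of, so here is a small pure-Python sketch of it. It assumes the keypoint model returns 17 key points per person, each as (x, y, score), which is the COCO keypoint layout. The function name flatten_keypoints and the fake detections are hypothetical, and the None return for empty input is an extra guard added for this sketch.

```python
def flatten_keypoints(instances):
    """Keep the first detected person and drop the per-keypoint score.

    `instances` mimics an array of shape (num_people, 17, 3), where each
    keypoint is (x, y, score). Returns a flat list [x0, y0, x1, y1, ...],
    or None when nobody was detected.
    """
    if not instances:
        return None
    first_person = instances[0]
    return [value for x, y, _score in first_person for value in (x, y)]

# Hypothetical detections: two people, 17 keypoints each.
person_a = [(float(i), float(i) + 0.5, 0.9) for i in range(17)]
person_b = [(100.0 + i, 200.0 + i, 0.4) for i in range(17)]

features = flatten_keypoints([person_a, person_b])
print(len(features))  # 17 keypoints x (x, y)
```

With 17 key points and two coordinates each, every image becomes a vector of 34 numbers, which matches the input size of the first Linear layer used later in the article.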

# progress bar
from tqdm import tqdm
import numpy as np

# define function that extracts the keypoints for an image
def extract_keypoints(img):
    # make predictions
    outputs = predictor(img)
    # fetch keypoints
    keypoints = outputs['instances'].pred_keypoints
    # convert to numpy array
    kp = keypoints.cpu().numpy()
    # fetch keypoints of the person with maximum confidence score
    # (assumes at least one person was detected)
    kp = kp[0]
    # drop the third column (the keypoint score), keeping x and y
    kp = np.delete(kp,2,1)
    # convert 2D array to 1D array
    kp = kp.flatten()
    # return keypoints
    return kp

# create list
keypoints = []
# for every image
for i in tqdm(range(len(images))):
    # extract keypoints
    kp = extract_keypoints(images[i])
    # append keypoints
    keypoints.append(kp)

5. Classifying cricket shot using pose of a player

First of all, we are going to normalize the values of our key points which will eventually speed up the training process.

# for normalization
from sklearn.preprocessing import StandardScaler
# define normalizer
scaler= StandardScaler()
# normalize keypoints
keypoints = scaler.fit_transform(keypoints)
# convert to an array
keypoints = np.array(keypoints)
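For intuition, standardization subtracts the column mean and divides by the column standard deviation. The sketch below reproduces that arithmetic with the standard library on a tiny made-up table. The function name standardize_columns and the sample rows are hypothetical, and it uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Scale each column to zero mean and unit (population) std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical feature table: 3 samples, 2 features on very different scales.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, both columns hold the same values even though their original ranges differed by a factor of ten. Note that the article fits the scaler on all images before the train and validation split; fitting it on the training portion only is the more common practice, since it keeps validation statistics out of training.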

We have now normalized the values of our key points. Next, we convert our target, which is currently text, into numbers using label encoding.

# converting the target categories into numbers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(labels)
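Label encoding simply maps each class name to an integer. The sketch below shows the mapping on a short hypothetical list of labels, assigning integers in sorted (alphabetical) order of the class names, which is also how LabelEncoder orders its classes.

```python
# Hypothetical labels, one per image.
labels = ["pull", "cut", "drive", "sweep", "cut", "pull"]

# Sorted unique class names define the integer codes.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

encoded = [class_to_index[name] for name in labels]
print(class_to_index)
print(encoded)
```

Keeping the classes list around makes it possible to translate a predicted integer back into a shot name later.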

After that, we split our dataset into training and validation sets using the train_test_split function. We set the test size to 0.2, which means 80% of the data is used for training and 20% goes into the validation set.

# for creating training and validation sets
from sklearn.model_selection import train_test_split
# split keypoints and labels in 80:20
x_tr, x_val, y_tr, y_val = train_test_split(keypoints, y, test_size=0.2, stratify=labels, random_state=120)
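The stratify argument keeps the class proportions the same in both sets. Here is a simplified standard-library sketch of that idea. It is not scikit-learn's algorithm, and the function name stratified_split, the seed and the 10-images-per-class data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size=0.2, seed=0):
    """Split so that each class contributes the same fraction to validation."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    return ([items[i] for i in train_idx], [items[i] for i in val_idx],
            [labels[i] for i in train_idx], [labels[i] for i in val_idx])

# Hypothetical dataset: 10 samples for each of the four shot types.
labels = [name for name in ("cut", "drive", "pull", "sweep") for _ in range(10)]
items = list(range(len(labels)))

x_tr, x_val, y_tr, y_val = stratified_split(items, labels)
print(len(x_tr), len(x_val))
```

Without stratification, a small validation set could by chance contain very few images of one shot type, which would make the accuracy estimate less reliable.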

To use the key points and the targets with our model, we must convert them into tensors. Here we convert both the key points and the targets into PyTorch tensors for the training and the validation sets.

# converting the keypoints and target value to tensor
import torch
x_tr = torch.Tensor(x_tr)
x_val = torch.Tensor(x_val)
y_tr = torch.Tensor(y_tr)
y_tr = y_tr.type(torch.long)
y_val = torch.Tensor(y_val)
y_val = y_val.type(torch.long)

Here are the shapes of the training and validation sets: we have 464 images for training and 116 for validation.

# shape of training and validation set
(x_tr.shape, y_tr.shape), (x_val.shape, y_val.shape)

The network takes the 34 key point values as input and has one hidden layer with 64 neurons and a ReLU activation. The output layer has four neurons, one for each class, and uses a softmax activation function so that it returns probabilities.

# importing libraries for defining the architecture of model
from torch.autograd import Variable
from torch.optim import Adam
from torch.nn import Linear, ReLU, Sequential, Softmax, CrossEntropyLoss
# defining the model architecture
model = Sequential(Linear(34, 64),
                   ReLU(),
                   Linear(64, 4),
                   # softmax over the class dimension
                   Softmax(dim=1)
                   )

Next, we define the optimizer as Adam and the loss as cross-entropy, since this is a multi-class classification problem. We then transfer the model to the GPU if one is available.

# define optimizer and loss function
optimizer = Adam(model.parameters(), lr=0.01)
criterion = CrossEntropyLoss()
# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
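One detail is worth a closer look. PyTorch's CrossEntropyLoss applies a log-softmax to its input internally, so it expects raw scores (logits). Because the model above already ends in a Softmax layer, softmax is effectively applied twice. The sketch below redoes the arithmetic with the standard library to show the consequence. The helper names and the example scores are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy on raw scores: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

# Raw scores for 4 classes; the true class is index 0.
confident_logits = [10.0, 0.0, 0.0, 0.0]

# Loss when the raw scores go straight into the loss.
loss_from_logits = cross_entropy(confident_logits, 0)

# Loss when the scores pass through softmax first (as in the model above).
loss_from_probs = cross_entropy(softmax(confident_logits), 0)

# Lowest loss reachable when the input is a probability vector (perfect one-hot).
floor_loss = cross_entropy([1.0, 0.0, 0.0, 0.0], 0)

# Loss for a uniform output over 4 classes.
uniform_loss = cross_entropy([0.25, 0.25, 0.25, 0.25], 0)

print(round(loss_from_logits, 4), round(loss_from_probs, 4))
print(round(floor_loss, 4), round(uniform_loss, 4))
```

With probabilities as input, the loss cannot fall much below 0.74 even for a perfect prediction, and a uniform output gives ln 4, about 1.386. The model still trains in this setup, but a common alternative is to drop the final Softmax layer during training and apply softmax only when probabilities are needed.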

Next, we define a function that will be used to train our model. The function takes the epoch number as input. We first set the model to training mode and initialize the loss to zero. Then we load the training and validation sets as PyTorch Variables.

We transfer the training and validation data to the GPU if one is available, and then clear the gradients of the model parameters. Next, we take the predictions from the model for both the training and the validation sets and store them in separate variables.

We calculate the training and validation loss, and finally we back-propagate the gradients and update the parameters.

Additionally, we print the validation loss at every 10th epoch.

def train(epoch):
    model.train()
    tr_loss = 0
    # getting the training set
    x_train, y_train = Variable(x_tr), Variable(y_tr)
    # getting the validation set
    x_valid, y_valid = Variable(x_val), Variable(y_val)
    # converting the data into GPU format
    if torch.cuda.is_available():
        x_train = x_train.cuda()
        y_train = y_train.cuda()
        x_valid = x_valid.cuda()
        y_valid = y_valid.cuda()
    # clearing the Gradients of the model parameters
    optimizer.zero_grad()
    # prediction for training and validation set
    output_train = model(x_train)
    output_val = model(x_valid)
    # computing the training and validation loss
    loss_train = criterion(output_train, y_train)
    loss_val = criterion(output_val, y_valid)
    # computing the updated weights of all the model parameters
    loss_train.backward()
    optimizer.step()
    if epoch%10 == 0:
        # printing the validation loss
        print('Epoch : ',epoch+1, '\t', 'loss :', loss_val.item())

Now that the function is defined, we use it to train our model for the number of epochs set in n_epochs. The model prints the validation loss at every 10th epoch.

We started with a loss of 1.38 and ended with a loss of 0.97, so the model performance improves as training progresses.

# defining the number of epochs
n_epochs = 100
# training the model
for epoch in range(n_epochs):
    train(epoch)

Evaluating model performance

Let’s evaluate the model performance by checking the accuracy of the model.

We import the accuracy_score function from sklearn and fetch the validation set, both the key points and the target values. We transfer these values to the GPU if one is available and then take predictions from the trained model on the validation images.

Finally, we convert the predicted probabilities to the respective classes using the argmax function.

# to check the model performance
from sklearn.metrics import accuracy_score
# get validation accuracy
x, y = Variable(x_val), Variable(y_val)
if torch.cuda.is_available():
    x_val = x.cuda()
    y_val = y.cuda()
pred = model(x_val)
final_pred = np.argmax(pred.cpu().data.numpy(), axis=1)
accuracy_score(y_val.cpu(), final_pred)

The accuracy of this model comes out to be 0.79, which is approximately 80%.
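To see what the argmax and accuracy steps compute, here is a tiny standard-library sketch on made-up numbers. The probabilities, labels and helper names are hypothetical and unrelated to the actual model output.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical class probabilities for 4 validation images and 4 classes.
probabilities = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
]
true_labels = [0, 1, 2, 0]

predicted = [argmax(row) for row in probabilities]
score = accuracy(true_labels, predicted)
print(predicted, score)
```

The third image is assigned class 3 although its true class is 2, so three of the four predictions are correct.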


In order to improve the accuracy, you can play around with different hyperparameters like increasing the number of hidden layers in the model, changing the optimizer, changing the activation function, increasing the number of epochs, and much more.

About the Author

Hi, I am Kajal Kumari. I have completed my Master’s from IIT(ISM) Dhanbad in Computer Science & Engineering. As of now, I am working as a Machine Learning Engineer in Hyderabad. Here is my Linkedin profile if you want to connect with me.

End Notes

Thanks for reading!

If you want to read my previous blogs, you can read Previous Data Science Blog posts from here.



Twitter Sentiment Analysis Using Python

A Twitter sentiment analysis determines negative, positive, or neutral emotions within the text of a tweet using NLP and ML models. Sentiment analysis or opinion mining refers to identifying as well as classifying the sentiments that are expressed in the text source. Tweets are often useful in generating a vast amount of sentiment data upon analysis. These data are useful in understanding the opinion of people on social media for a variety of topics.

This article was published as a part of the Data Science Blogathon.

What is Twitter Sentiment Analysis?

Twitter sentiment analysis analyzes the sentiment or emotion of tweets. It uses natural language processing and machine learning algorithms to classify tweets automatically as positive, negative, or neutral based on their content. It can be done for individual tweets or a larger dataset related to a particular topic or event.

Why is Twitter Sentiment Analysis Important?

Understanding Customer Feedback: By analyzing the sentiment of customer feedback, companies can identify areas where they need to improve their products or services.

Political Analysis: Sentiment analysis can help political campaigns understand public opinion and tailor their messaging accordingly.

Crisis Management: In the event of a crisis, sentiment analysis can help organizations monitor social media and news outlets for negative sentiment and respond appropriately.

How to Do Twitter Sentiment Analysis?

In this article, we analyze the sentiment of tweets from the Sentiment140 dataset by developing a machine learning pipeline that uses three classifiers (Logistic Regression, Bernoulli Naive Bayes, and SVM) together with Term Frequency-Inverse Document Frequency (TF-IDF) features. The performance of these classifiers is then evaluated using accuracy and F1 scores.

For data preprocessing, we will use NLTK, a Natural Language Processing (NLP) library.

Twitter Sentiment Analysis: Problem Statement

In this project, we try to implement an NLP Twitter sentiment analysis model that helps to overcome the challenges of sentiment classification of tweets. We will be classifying the tweets into positive or negative sentiments. The necessary details regarding the dataset involving the Twitter sentiment analysis project are:

The dataset provided is the Sentiment140 Dataset which consists of 1,600,000 tweets that have been extracted using the Twitter API. The various columns present in this Twitter data are:

target: the polarity of the tweet (0 = negative, 4 = positive)

ids: Unique id of the tweet

date: the date of the tweet

flag: It refers to the query. If no such query exists, then it is NO_QUERY.

user: It refers to the name of the user that tweeted

text: It refers to the text of the tweet

Twitter Sentiment Analysis: Project Pipeline

The various steps involved in the Machine Learning Pipeline are:

Import Necessary Dependencies

Read and Load the Dataset

Exploratory Data Analysis

Data Visualization of Target Variables

Data Preprocessing

Splitting our data into Train and Test sets.

Transforming Dataset using TF-IDF Vectorizer

Function for Model Evaluation

Model Building

Model Evaluation

Let’s get started,

Step-1: Import the Necessary Dependencies

# utilities
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

Step-2: Read and Load the Dataset

# Importing the dataset
DATASET_COLUMNS=['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('Project_Data.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
df.sample(5)


Step-3: Exploratory Data Analysis

3.1: Five top records of data



3.2: Columns/features in data



Index(['target', 'ids', 'date', 'flag', 'user', 'text'], dtype='object')

3.3: Length of the dataset

print('length of data is', len(df))


length of data is 1048576

3.4: Shape of data

df.shape


(1048576, 6)

3.5: Data information


3.6: Datatypes of all columns



target     int64
ids        int64
date      object
flag      object
user      object
text      object
dtype: object

3.7: Checking for null values




3.8: Rows and columns in the dataset

print('Count of columns in the data is: ', len(df.columns))
print('Count of rows in the data is: ', len(df))


Count of columns in the data is: 6
Count of rows in the data is: 1048576

3.9: Check unique target values



array([0, 4], dtype=int64)

3.10: Check the number of target values



2

Step-4: Data Visualization of Target Variables

# Plotting the distribution for dataset.
ax = df.groupby('target').count().plot(kind='bar', title='Distribution of data',legend=False)
ax.set_xticklabels(['Negative','Positive'], rotation=0)
# Storing data in lists.
text, sentiment = list(df['text']), list(df['target'])


import seaborn as sns
sns.countplot(x='target', data=df)


Step-5: Data Preprocessing

Before training the model, we perform several pre-processing steps on the dataset, mainly removing stopwords and special characters such as emojis and hashtags. The text is also converted to lowercase for better generalization.

Next, the punctuation is cleaned and removed, reducing unnecessary noise in the dataset. After that, we remove repeating characters from the words and also remove URLs, as they carry no significant information.

Finally, we perform stemming (reducing words to their stems) and lemmatization (reducing derived words to their root form, known as the lemma) for better results.

5.1: Selecting the text and Target column for our further analysis


5.2: Replacing the values to ease understanding (replacing the positive sentiment label 4 with 1)

data['target'] = data['target'].replace(4,1)

5.3: Printing unique values of target variables



array([0, 1], dtype=int64)

5.4: Separating positive and negative tweets

data_pos = data[data['target'] == 1]
data_neg = data[data['target'] == 0]

5.5: Taking a subset of the data (the first 20,000 tweets of each class) so we can run it on our machine easily

data_pos = data_pos.iloc[:int(20000)]
data_neg = data_neg.iloc[:int(20000)]

5.6: Combining positive and negative tweets

dataset = pd.concat([data_pos, data_neg])

5.7: Making statement text in lowercase

dataset['text']=dataset['text'].str.lower()
dataset['text'].tail()


5.8: Defining set containing all stopwords in English.

stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an', 'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do', 'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once', 'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're','s', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such', 't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was', 'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom', 'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre", "youve", 'your', 'yours', 'yourself', 'yourselves']

5.9: Cleaning and removing the above stop words list from the tweet text

STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))
dataset['text'].head()


5.10: Cleaning and removing punctuations

import string
english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
dataset['text']= dataset['text'].apply(lambda x: cleaning_punctuations(x))
dataset['text'].tail()


5.11: Cleaning and removing repeating characters

def cleaning_repeating_char(text):
    # collapse any run of a repeated character into a single character
    return re.sub(r'(.)\1+', r'\1', text)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_repeating_char(x))
dataset['text'].tail()


5.12: Cleaning and removing URLs

def cleaning_URLs(data): dataset['text'] = dataset['text'].apply(lambda x: cleaning_URLs(x)) dataset['text'].tail()


5.13: Cleaning and removing numeric numbers

def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_numbers(x))
dataset['text'].tail()
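The cleaning steps so far can be chained into one function. The sketch below does this with the standard library on a single made-up tweet. The stopword set is a tiny hypothetical subset, the URL pattern is an assumption rather than the article's own, and URL removal is deliberately moved before punctuation removal so that the URL is still recognizable when it is matched.

```python
import re
import string

# Tiny hypothetical subset of the stopword list.
STOPWORDS = {"a", "am", "an", "and", "at", "i", "is", "the", "to"}
# Assumed URL pattern, for illustration only.
URL_PATTERN = re.compile(r"(https?://\S+)|(www\.\S+)")

def clean_tweet(text):
    """Lowercase, strip URLs, stopwords, punctuation, repeats and digits."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of a repeated character ("sooooo" -> "so").
    text = re.sub(r"(.)\1+", r"\1", text)
    text = re.sub(r"[0-9]+", "", text)
    # Tokenize on word characters.
    return re.findall(r"\w+", text)

tokens = clean_tweet("I am sooooo happy!!! Check https://example.com at 10 AM")
print(tokens)
```

Note that collapsing every repeated character also shortens legitimate double letters, so "happy" becomes "hapy". That follows the regular expression used in step 5.11 and is a trade-off of this simple rule.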


5.14: Getting tokenization of tweet text

from nltk.tokenize import RegexpTokenizer
# split the text into runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
dataset['text'] = dataset['text'].apply(tokenizer.tokenize)
dataset['text'].head()


5.15: Applying stemming

import nltk
st = nltk.PorterStemmer()
def stemming_on_text(data):
    text = [st.stem(word) for word in data]
    # return the stemmed tokens
    return text
dataset['text']= dataset['text'].apply(lambda x: stemming_on_text(x))
dataset['text'].head()


5.16: Applying lemmatizer

lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    text = [lm.lemmatize(word) for word in data]
    # return the lemmatized tokens
    return text
dataset['text'] = dataset['text'].apply(lambda x: lemmatizer_on_text(x))
dataset['text'].head()


5.17: Separating input feature and label


5.18: Plot a cloud of words for negative tweets

data_neg = data['text'][:800000]
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800, collocations=False).generate(" ".join(data_neg))
plt.imshow(wc)


5.19: Plot a cloud of words for positive tweets

data_pos = data['text'][800000:]
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800, collocations=False).generate(" ".join(data_pos))
plt.figure(figsize = (20,20))
plt.imshow(wc)


Step-6: Splitting Our Data Into Train and Test Subsets

# Separating the 95% data for training data and 5% for testing data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.05, random_state =26105111)

Step-7: Transforming the Dataset Using TF-IDF Vectorizer

7.1: Fit the TF-IDF Vectorizer

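vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
# learn the vocabulary and IDF weights from the training text
vectoriser.fit(X_train)
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
print('No. of feature_words: ', len(vectoriser.get_feature_names_out()))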


No. of feature_words: 500000
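To see what the vectorizer computes for each tweet, here is a simplified standard-library sketch of TF-IDF. It assumes raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1 and L2-normalized rows, which are the scikit-learn defaults. It uses single words split on whitespace only, whereas the article also uses two-word n-grams. The function name tfidf and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2-normalised rows (unigrams only)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical mini-corpus of already cleaned tweets.
docs = ["good day", "bad day", "good good mood"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, "bad" appears in only one tweet while "day" appears in two, so "bad" receives the larger weight. This is how TF-IDF emphasizes words that are distinctive for a tweet.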

7.2: Transform the data using TF-IDF Vectorizer

X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)

Step-8: Function for Model Evaluation

After training each model, we apply evaluation measures to check how it is performing. We use the following evaluation measures:

Accuracy Score

Confusion Matrix with Plot
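As a quick reference for how these numbers relate to each other, here is a standard-library sketch that derives the confusion-matrix counts and the usual metrics from a handful of made-up labels. The helper names and the data are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tn, fp, fn, tp

def summary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical true and predicted labels (0 = negative, 1 = positive).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
metrics = summary_metrics(*counts)
print(counts, metrics)
```

The order (tn, fp, fn, tp) matches the order of the flattened 2x2 confusion matrix for labels 0 and 1, which is also the order of the group names used in the plotting function.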


def model_Evaluate(model):
    # Predict values for Test dataset
    y_pred = model.predict(X_test)
    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    # put the group name and its percentage on separate lines in each cell
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues',fmt = '',
                xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values" , fontdict = {'size':14}, labelpad = 10)
    plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)

Step-9: Model Building

For this problem, we use three different models:

Bernoulli Naive Bayes Classifier

SVM (Support Vector Machine)

Logistic Regression

The idea behind choosing these models is that we want to try all the classifiers on the dataset ranging from simple ones to complex models, and then try to find out the one which gives the best performance among them.

8.1: Model-1

BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)


8.2: Plot the ROC-AUC Curve for model-1

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred1)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")


8.3: Model-2:

SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)


8.4: Plot the ROC-AUC Curve for model-2

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred2)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")


8.5: Model-3

LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)


8.6: Plot the ROC-AUC Curve for model-3

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred3)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
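The ROC curves above are computed from hard 0/1 predictions rather than from scores, so each curve has only three points: (0, 0), (FPR, TPR) and (1, 1). The standard-library sketch below shows that in this case the area reduces to (1 + TPR - FPR) / 2. The function name and the labels are hypothetical.

```python
def roc_auc_from_labels(y_true, y_pred):
    """AUC of the three-point ROC curve obtained from 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoidal area under (0, 0) -> (fpr, tpr) -> (1, 1).
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area, tpr, fpr

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

auc_value, tpr, fpr = roc_auc_from_labels(y_true, y_pred)
print(auc_value, (1 + tpr - fpr) / 2)
```

Passing continuous scores instead, for example from decision_function for LinearSVC or predict_proba for the other two models, would give a curve with many thresholds and usually a more informative AUC.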


Step-10: Model Evaluation

Upon evaluating all the models, we can conclude the following:

Accuracy: As far as the accuracy of the model is concerned, Logistic Regression performs better than SVM, which in turn performs better than Bernoulli Naive Bayes.

AUC Score: All three models have the same ROC-AUC score.

We, therefore, conclude that the Logistic Regression is the best model for the above-given dataset.

In our problem statement, Logistic Regression follows the principle of Occam’s Razor, which states that when there is no reason to assume a more complex structure in the data, the simplest model tends to work best. Since we make no special assumptions about our dataset and Logistic Regression is a simple model, the principle holds true for the above-mentioned dataset.


We hope that through this article you got a basic understanding of how Sentiment Analysis is used to understand the public emotions behind people’s tweets. As you’ve read in this article, Twitter Sentiment Analysis involves preprocessing the data (tweets) using different methods and feeding it into ML models to get the best accuracy.

Key Takeaways

Twitter Sentiment Analysis is used to identify and classify the sentiments expressed in the text source.

Logistic Regression, SVM, and Naive Bayes are some of the ML algorithms that can be used for Twitter Sentiment Analysis.



Deploying Machine Learning Models Using Streamlit – An Introductory Guide To Model Deployment


Understand the concept of model deployment

Perform model deployment using Streamlit for loan prediction data


I believe most of you have done some form of a data science project at some point in your lives, be it a machine learning project, a deep learning project, or even visualizations of your data. And the best part of these projects is showcasing them to others. This will not only motivate and encourage you about your hard work but will also help you improve your project.

But the question is how will you showcase your work to others? Well, this is where Model Deployment will help you.

I have been exploring the field of Model Deployment for the past few months now. Model Deployment helps you showcase your work to the world and make better decisions with it. But deploying a model can get a little tricky at times. Before deploying the model a lot of things need to be looked into, such as data storage, pre-processing, model building, and monitoring. This can be a bit confusing as the number of tools that perform these model deployment tasks efficiently is few. Enter, Streamlit!

Streamlit is a popular open-source framework used for model deployment by machine learning and data science teams. And the best part is that it is free of cost and purely in Python.

In this article, we are going to deep dive into model deployment. We will first build a loan prediction model and then deploy it using Streamlit.

Table of Contents

Overview of Machine Learning Lifecycle

Understanding the Problem Statement: Automating Loan Prediction

Machine Learning model for Automating Loan Prediction

Introduction to Streamlit

Model Deployment of the Loan Prediction model using Streamlit

Overview of Machine Learning Lifecycle

Let’s start with understanding the overall machine learning lifecycle, and the different steps that are involved in creating a machine learning project. Broadly, the entire machine learning lifecycle can be described as a combination of 6 stages. Let me break these stages for you:

Stage 1: Problem Definition

The first and most important part of any project is to define the problem statement. Here, we want to describe the aim or the goal of our project and what we want to achieve at the end.

Stage 2: Hypothesis Generation

Once the problem statement is finalized, we move on to the hypothesis generation part. Here, we try to point out the factors/features that can help us to solve the problem at hand.

Stage 3: Data Collection

After generating hypotheses, we get the list of features that are useful for a problem. Next, we collect the data accordingly. This data can be collected from different sources.

Stage 4: Data Exploration and Pre-processing

After collecting the data, we move on to explore and pre-process it. These steps help us to generate meaningful insights from the data. We also clean the dataset in this step, before building the model.

Stage 5: Model Building

Once we have explored and pre-processed the dataset, the next step is to build the model. Here, we create predictive models in order to build a solution for the project.

Stage 6: Model Deployment

Once you have the solution, you want to showcase it and make it accessible for others. And hence, the final stage of the machine learning lifecycle is to deploy that model.

These are the 6 stages of a machine learning lifecycle. The aim of this article is to understand the last stage, i.e. model deployment, in detail using streamlit. However, I will briefly explain the remaining stages and the complete machine learning lifecycle along with their implementation in Python, before diving deep into the model deployment part using streamlit.

So, in the next section, let’s start with understanding the problem statement.

Understanding the Problem Statement: Automating Loan Prediction

The project that I have picked for this particular blog is automating the loan eligibility process. The task is to predict whether the loan will be approved or not based on the details provided by customers. Here is the problem statement for this project:

Automate the loan eligibility process based on customer details provided while filling online application form

Based on the details provided by customers, we have to create a model that can decide whether or not their loan should be approved. This completes the problem definition part of the first stage of the machine learning lifecycle. The next step is to generate hypotheses and point out the factors that will help us to predict whether the loan for a customer should be approved or not.

As a starting point, here are a couple of factors that I think will be helpful for us with respect to this project:

Amount of loan: The total amount of loan applied for by the customer. My hypothesis here is that the higher the loan amount, the lower the chances of loan approval, and vice versa.

Income of applicant: The income of the applicant (customer) can also be a deciding factor. A higher income will lead to higher probability of loan approval.

Education of applicant: Educational qualification of the applicant can also be a vital factor to predict the loan status of a customer. My hypothesis is if the educational qualification of the applicant is higher, the chances of their loan approval will be higher.

These are some factors that can be useful to predict the loan status of a customer. Obviously, this is a very small list, and you can come up with many more hypotheses. But, since the focus of this article is on model deployment, I will leave this hypothesis generation part for you to explore further.

Next, we need to collect the data. We know certain features that we want like the income details, educational qualification, and so on. And the data related to the customers and loan is provided at the datahack platform of Analytics Vidhya. You can go to the link, register for the practice problem, and download the dataset from the problem statement tab. Here is a summary of the variables available for this particular problem:

We have some variables related to the loan, like the loan ID, which is the unique ID for each customer, Loan Amount and Loan Amount Term, which tells us the amount of loan in thousands and the term of the loan in months respectively. Credit History represents whether a customer has any previous unclear debts or not. Apart from this, we have customer details as well, like their Gender, Marital Status, Educational qualification, income, and so on. Using these features, we will create a predictive model that will predict the target variable which is Loan Status representing whether the loan will be approved or not.

Now we have finalized the problem statement, generated the hypotheses, and collected the data. Next is the data exploration and pre-processing phase. Here, we will explore the dataset and pre-process it. The common steps in this phase are as follows:

Univariate Analysis

Bivariate Analysis

Missing Value Treatment

Outlier Treatment

Feature Engineering

We explore the variables individually, which is called univariate analysis. Exploring the effect of one variable on another, or exploring two variables at a time, is bivariate analysis. We also look for any missing values or outliers that might be present in the dataset and deal with them. And we might also create new features from the existing ones, which is referred to as feature engineering. Again, I will not focus much on these data exploration parts and will only do the necessary pre-processing.

After exploring and pre-processing the data, next comes the model building phase. Since it is a classification problem, we can use any of the classification models like the logistic regression, decision tree, random forest, etc. I have tried all of these 3 models for this problem and random forest produced the best results. So, I will use a random forest as the predictive model for this project.

Till now, I have briefly explained the first five stages of the machine learning lifecycle with respect to the project automating loan prediction. Next, I will demonstrate these steps in Python.

Machine Learning model for Automating Loan Prediction

In this section, I will demonstrate the first five stages of the machine learning lifecycle for the project at hand. The first two stages, i.e. Problem definition and hypothesis generation are already covered in the previous section and hence let’s start with the third stage and load the dataset. For that, we will first import the required libraries and then read the CSV file:

Here are the first five rows from the dataset. We know that machine learning models take only numbers as inputs and can not process strings. So, we have to deal with the categories present in the dataset and convert them into numbers.


Here, we have converted the categories present in the Gender, Married and the Loan Status variable into numbers, simply using the map function of python. Next, let’s check if there are any missing values in the dataset:

So, there are missing values in many variables, including Gender, Married, and LoanAmount. Next, we will remove all the rows which contain any missing values:
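The steps described so far (mapping categories to numbers, counting missing values, and dropping incomplete rows) can be sketched as follows. The sample rows are hypothetical and only mimic the column names of the loan dataset, and the numeric codes chosen for each category are an assumption.

```python
import io

import pandas as pd

# Hypothetical sample rows (assumption) that mimic the loan dataset's columns.
csv_text = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Credit_History,Loan_Status
LP001,Male,No,5849,,1,Y
LP002,Male,Yes,4583,128,1,N
LP003,Female,Yes,3000,66,1,Y
LP004,,No,2583,120,0,N
LP005,Female,No,6000,141,1,Y
"""
train = pd.read_csv(io.StringIO(csv_text))

# Convert categories to numbers with map; the exact codes are an assumption.
train["Gender"] = train["Gender"].map({"Male": 0, "Female": 1})
train["Married"] = train["Married"].map({"No": 0, "Yes": 1})
train["Loan_Status"] = train["Loan_Status"].map({"N": 0, "Y": 1})

# Count missing values per column.
missing_before = train.isnull().sum()
print(missing_before)

# Drop every row that contains at least one missing value.
train = train.dropna()
missing_after = train.isnull().sum()
print(missing_after)
```

Note that map leaves a missing category as a missing value, so the rows with an empty Gender or LoanAmount are still caught by dropna.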

Now there are no missing values in the dataset. Next, we will separate the dependent (Loan_Status) and the independent variables:


For this particular project, I have only picked the 5 variables that I think are most relevant: Gender, Married, ApplicantIncome, LoanAmount, and Credit_History. These are stored in the variable X, and the target variable is stored in another variable, y. There are 480 observations available. Next, let’s move on to the model building stage.

Here, we will first split our dataset into a training and validation set, so that we can train the model on the training set and evaluate its performance on the validation set.
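A minimal sketch of selecting the features and splitting the data is shown below. The DataFrame is synthetic (an assumption, sized to the 480 observations mentioned above), and the variable names x_train, x_cv, y_train, y_cv and the random_state value are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned loan data (assumption): 480 rows.
rng = np.random.default_rng(0)
n_rows = 480
train = pd.DataFrame({
    "Gender": rng.integers(0, 2, n_rows),
    "Married": rng.integers(0, 2, n_rows),
    "ApplicantIncome": rng.integers(1500, 20000, n_rows),
    "LoanAmount": rng.integers(20, 500, n_rows),
    "Credit_History": rng.integers(0, 2, n_rows),
    "Loan_Status": rng.integers(0, 2, n_rows),
})

# Independent variables (five features) and the target.
X = train[["Gender", "Married", "ApplicantIncome", "LoanAmount", "Credit_History"]]
y = train["Loan_Status"]

# Hold out 20% of the rows for validation; random_state is an arbitrary choice.
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=10)

print(x_train.shape, x_cv.shape)
```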


We have split the data using the train_test_split function from the sklearn library keeping the test_size as 0.2 which means 20 percent of the total dataset will be kept aside for the validation set. Next, we will train the random forest model using the training set:


Here, I have kept the max_depth as 4 for each of the trees of our random forest and stored the trained model in a variable named model. Now that our model is trained, let’s check its performance on both the training and validation set:
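Training and evaluation can be sketched like this. The data comes from make_classification rather than the loan dataset (an assumption made so the example runs on its own), so the printed accuracies will not match the figures quoted in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 480 rows and 5 features.
X, y = make_classification(
    n_samples=480, n_features=5, n_informative=3, n_redundant=0, random_state=10
)
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=10)

# Each tree is limited to a depth of 4, as described above.
model = RandomForestClassifier(max_depth=4, random_state=10)
model.fit(x_train, y_train)

# Accuracy on the validation set and on the training set.
cv_accuracy = accuracy_score(y_cv, model.predict(x_cv))
train_accuracy = accuracy_score(y_train, model.predict(x_train))
print(f"validation accuracy: {cv_accuracy:.3f}")
print(f"training accuracy:   {train_accuracy:.3f}")
```

A large gap between the two numbers would suggest overfitting; limiting max_depth is one way to keep that gap small.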


The model is 80% accurate on the validation set. Let’s check the performance on the training set too:


Performance on the training set is close to that on the validation set. So, the model has generalized well. Finally, we will save this trained model so that it can be used in the future to make predictions on new observations:


We are saving the model to a file in pickle format. This stores the trained model, and we will use this file while deploying the model.
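As a rough sketch of this save-and-reload step, the example below pickles a small stand-in model and loads it back. The file name loan_model.pkl is hypothetical, and the stand-in model is trained on synthetic data (both are assumptions, not the article's actual file or model).

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (assumption) so the example runs on its own.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
model = RandomForestClassifier(max_depth=4, random_state=0).fit(X, y)

# Hypothetical file name; a temporary directory keeps the example side-effect free.
model_path = os.path.join(tempfile.mkdtemp(), "loan_model.pkl")

# Serialize the trained model to disk.
with open(model_path, "wb") as pickle_out:
    pickle.dump(model, pickle_out)

# Later (for example inside the web app), load it back and use it.
with open(model_path, "rb") as pickle_in:
    classifier = pickle.load(pickle_in)

same_predictions = (classifier.predict(X) == model.predict(X)).all()
print(same_predictions)
```

Only unpickle files you trust, and load the model with the same scikit-learn version that was used to save it where possible.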

This completes the first five stages of the machine learning lifecycle. Next, we will explore the last stage, which is model deployment. We will be deploying this loan prediction model so that it can be accessed by others. To do so, we will use Streamlit, which is a recent and simple way of building web apps and deploying machine learning and deep learning models.

So, let’s first discuss this tool, and then I will demonstrate how to deploy your machine learning model using it.

Introduction to Streamlit

As per the founders of Streamlit, it is the fastest way to build data apps and share them. It is a recent model deployment tool that simplifies the entire model deployment cycle and lets you deploy your models quickly. I have been exploring this tool for the past couple of weeks and as per my experience, it is a simple, quick, and interpretable model deployment tool.

Here are some of the key features of Streamlit which I found really interesting and useful:

It quickly turns data scripts into shareable web applications. You just have to pass a running script to the tool and it can convert that to a web app.

Everything in Python. The best thing about Streamlit is that everything we do is in Python. Starting from loading the model to creating the frontend, all can be done using Python.

All for free. It is open source and hence no cost is involved. You can deploy your apps without paying for them.

No front-end experience required. Model deployment generally contains two parts, the frontend and the backend. The backend is generally a working model, a machine learning model in our case, which is built in Python. The front-end part generally requires some knowledge of other languages like JavaScript. Using Streamlit, we can create this front end in Python itself. So, we need not learn any other programming languages or web development techniques. Understanding Python is enough.

Let’s say we are deploying the model without using Streamlit. In that case, the entire pipeline will look something like this:

Model Building

Creating a python script

Write Flask app

Create front-end: JavaScript


We will first build our model and convert it into a Python script. Then we will have to create the web app using, let’s say, Flask. We will also have to create the front end for the web app, and here we will have to use JavaScript. And then finally, we will deploy the model. So, as you may notice, we need knowledge of Python to build the model and then a thorough understanding of JavaScript and Flask to build the front end and deploy the model. Now, let’s look at the deployment pipeline if we use Streamlit:

Model Building

Creating a python script

Create front-end: Python


Here we will build the model and create a python script for it. Then we will build the front-end for the app which will be in python and finally, we will deploy the model. That’s it. Our model will be deployed. Isn’t it amazing? If you know python, model deployment using Streamlit will be an easy journey. I hope you are as excited about Streamlit as I was while exploring it earlier. So, without any further ado, let’s build our own web app using Streamlit.

Model Deployment of the Loan Prediction model using Streamlit

We will start with the basic installations:


We have installed 3 libraries here. pyngrok is a python wrapper for ngrok which helps to open secure tunnels from public URLs to localhost. This will help us to host our web app. Streamlit will be used to make our web app. 

Next, we will have to create a separate session in Streamlit for our app. For this, download the required session file and store it in your current working directory. This will help you to create a session for your app. Finally, we have to create the Python script for our app. Let me show the code first and then I will explain it to you in detail:


This is the entire python script which will create the app for us. Let me break it down and explain in detail:

In this part, we are saving the script to a file and then loading the required libraries: pickle to load the trained model and streamlit to build the app. Then we are loading the trained model and saving it in a variable named classifier.

Next, we have defined the prediction function. This function will take the data provided by users as input and make the prediction using the model that we have loaded earlier. It will take the customer details like the gender, marital status, income, loan amount, and credit history as input, pre-process that input so that it can be fed to the model, and finally make the prediction using the model loaded as classifier. In the end, it will return whether the loan is approved or not based on the output of the model.
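A prediction function along these lines might look like the sketch below. The StandInClassifier class is a hypothetical placeholder for the pickled random forest, the numeric codes for each dropdown value are assumptions, and the classifier is passed in as an argument here purely to keep the example self-contained.

```python
class StandInClassifier:
    """Hypothetical stand-in for the pickled model; approves when credit history is clean."""

    def predict(self, rows):
        # Each row is [gender, married, income, loan_amount, credit_history].
        return [1 if row[4] == 1 else 0 for row in rows]


def prediction(classifier, gender, married, applicant_income, loan_amount, credit_history):
    # Encode the dropdown choices as numbers (these codes are assumptions).
    gender_code = 0 if gender == "Male" else 1
    married_code = 0 if married == "Unmarried" else 1
    credit_code = 0 if credit_history == "Unclear Debts" else 1

    # The model expects a 2-D input: one row with the five features.
    features = [[gender_code, married_code, applicant_income, loan_amount, credit_code]]
    outcome = classifier.predict(features)[0]

    return "Approved" if outcome == 1 else "Rejected"


classifier = StandInClassifier()
print(prediction(classifier, "Male", "Married", 5000, 120, "No Unclear Debts"))
print(prediction(classifier, "Female", "Unmarried", 3000, 200, "Unclear Debts"))
```

Whatever encoding or scaling was applied to the training data has to be repeated here in exactly the same way, otherwise the model receives inputs it was never trained on.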

And here is the main app. First of all, we are defining the header of the app. It will display “Streamlit Loan Prediction ML App”. To do that, we are using the markdown function from streamlit. Next, we are creating five boxes in the app to take input from the users. These 5 boxes will represent the five features on which our model is trained. 

The first box is for the gender of the user. The user will have two options, Male and Female, and will have to pick one of them. We are creating a dropdown using the selectbox function of streamlit. Similarly, for Married, we are providing two options, Married and Unmarried, and again the user will pick one of them. Next, we are defining the boxes for Applicant Income and Loan Amount.

Since both these variables will be numeric in nature, we are using the number_input function from streamlit. And finally, for the credit history, we are creating a dropdown which will have two categories, Unclear Debts, and No Unclear Debts. 
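The layout described above could be sketched roughly as follows, using st.markdown, st.selectbox, and st.number_input. The widget labels, the Predict button, the success message, and the predict_stub helper are assumptions for illustration; in the real app the prediction function and the pickled model would take the stub's place.

```python
# Option lists for the dropdowns described above.
GENDER_OPTIONS = ("Male", "Female")
MARRIED_OPTIONS = ("Unmarried", "Married")
CREDIT_OPTIONS = ("Unclear Debts", "No Unclear Debts")


def predict_stub(gender, married, applicant_income, loan_amount, credit_history):
    # Hypothetical placeholder for the real prediction function and pickled model.
    return "Approved" if credit_history == "No Unclear Debts" else "Rejected"


def main():
    # Imported here so that the helpers above can be used without Streamlit installed.
    import streamlit as st

    # Header of the app.
    st.markdown("## Streamlit Loan Prediction ML App")

    # Five input widgets, one per model feature.
    gender = st.selectbox("Gender", GENDER_OPTIONS)
    married = st.selectbox("Marital Status", MARRIED_OPTIONS)
    applicant_income = st.number_input("Applicant's monthly income")
    loan_amount = st.number_input("Total loan amount")
    credit_history = st.selectbox("Credit History", CREDIT_OPTIONS)

    # Run the prediction only when the button is clicked.
    if st.button("Predict"):
        result = predict_stub(gender, married, applicant_income, loan_amount, credit_history)
        st.success(f"Your loan is {result}")


# When this is saved as a script and started with `streamlit run`, call main() at the bottom.
```

Streamlit reruns the whole script on every interaction, which is why the widgets can simply be read top to bottom without any explicit callbacks.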

Alright, let’s now host this app at a public URL using the pyngrok library.


Here, we are first running the python script. And then we will connect it to a public URL:


This will generate a link something like this:

And it is as simple as this to build and deploy your machine learning models using Streamlit. 

End Notes

Congratulations! We have now successfully completed loan prediction model deployment using Streamlit. I encourage you to first try this particular project, play around with the values as input, and check the results. And then, you can try out other machine learning projects as well and perform model deployment using streamlit. 

The deployment is simple, fast, and most importantly in Python. However, there are a couple of challenges with it. We have used Google Colab as the backend to build the app and, as you might be aware, the Colab session automatically restarts after 12 hours. Also, if your internet connection breaks, the Colab session breaks. Hence, if we are using Colab as the backend, we have to rerun the entire application once the session expires.


To deal with this, we can change the backend. AWS can be the right option here for the backend and using that, we can host our web app permanently. So, in my next article, I will demonstrate how to integrate AWS with Streamlit and make the model deployment process more efficient.

