Autocorrect Feature Using NLP in Python

This article was published as a part of the Data Science Blogathon.

Natural Language Processing (NLP) is the field of artificial intelligence that connects linguistics and computer science. I am assuming that you have understood the basic concepts of NLP, so we will move ahead. Some NLP applications are as follows: auto spelling correction, sentiment analysis, fake news detection, machine translation, question answering (Q&A), chatbots, and many more…

Introduction to Autocorrect

Have you ever wondered how the autocorrect feature on a smartphone keyboard works? Almost every smartphone brand today, regardless of price, provides an autocorrect feature in its keyboard. The uses of smartphones would make a never-ending list, but that is not the topic we are going to focus on in this blog!

The main purpose of this article, as you can guess from the title, is to build an autocorrect feature. It is similar to, though not an exact copy of, the one on the smartphone you are using now; rather, it is an implementation of Natural Language Processing on a smaller dataset, such as a book.

Okay, let’s understand how these autocorrect features work. In this article, I am going to take you through “How to build Autocorrect with Python”.

Autocorrect Using NLP With Python: How Does It Work?

In the backdrop of machine learning, autocorrect is purely based on Natural Language Processing (NLP). As the name suggests, it is programmed to correct spellings and errors while typing text. So let's see how it works.

Before I move ahead into the coding, let us understand how autocorrect works. Assume that you have typed a word on your keyboard; if that word exists in the vocabulary of our smartphone, it will assume that you have written the right word. It does not matter whether it is a name, a noun, or any other word you wanted to type.

Understood this scenario? If the word exists in the smartphone's history, it will treat it as a correct word. But what if the word doesn't exist? If the word you have typed does not exist in the smartphone's history, then autocorrect is programmed to find the most similar words in that history and suggest them.

So let us understand the algorithm.

There are 4 key steps to building an autocorrect model that corrects spelling errors:

1:- Identify Misspelled Word — Let us consider an example: how would we get to know whether the word "drea" is spelled correctly or incorrectly? If a word is spelled correctly, it will be found in a dictionary; if it is not there, it is probably a misspelled word. Hence, when a word is not found in a dictionary, we flag it for correction.

2:- Find 'n' Strings Edit Distance Away — An edit is an operation performed on a string to transform it into another string, and n is the edit distance (such as 1, 2, 3, and so on), which counts the number of edit operations to be performed. Hence, the edit distance n tells us how many operations away one string is from another. The following are the different types of edits:

Insert (will add a letter)

Delete (will remove a letter)

Switch (it will swap two nearby letters)

Replace (exchange one letter to another one)

With these four edits, we are able to modify any string. So the combination of edits allows us to find a list of all possible strings that are n edits away.

IMPORTANT Note: For autocorrect, we usually take n between 1 and 3 edits.
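To make these four edit operations concrete, here is a minimal sketch (assuming a lowercase English alphabet; this helper is not part of the article's code) that generates every string one edit away from a given word:

Code:

def edits_one(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return all strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    switches = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + switches + replaces + inserts)

print(len(edits_one("drea")))  # a few hundred candidate strings

Even a short word produces a few hundred candidates, which is why the next two steps filter them against the dictionary and rank them.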

3:- Filtering of Candidates — Here we want to consider only correctly spelled real words from our generated candidate list, so we compare the words to a known dictionary (as we did in the first step) and then filter out the words in our candidate list that do not appear in that known "dictionary".

4:- Calculate Probabilities of Words — We calculate the probabilities of words and then find the most likely word from our generated candidates. This requires the word frequencies that we know and the total number of words in the corpus (also known as the dictionary).
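Putting steps 2 to 4 together, a rough sketch of the selection logic could look like the following (here word_freq is assumed to be a Counter of corpus words and edits_one is the helper sketched above; this is an illustration, not the exact implementation we build below):

Code:

from collections import Counter

def correction(word, word_freq, total_words):
    # P(w) = count(w) / total number of words in the corpus
    def prob(w):
        return word_freq[w] / total_words
    if word in word_freq:                           # step 1: known word, nothing to do
        return word
    candidates = edits_one(word) & set(word_freq)   # steps 2 and 3: generate, then filter
    return max(candidates, key=prob, default=word)  # step 4: pick the most probable candidate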

Build an Autocorrect Feature using NLP with Python

I hope you are now clear about what autocorrect is and how it works. Now let us see how we can build an autocorrect feature with Python. Our smartphone uses its past history to check whether a typed word is correct or not, so here we also need some text to provide the vocabulary for our autocorrect.

So I am going to use the text from a book, which you can easily download from here, to understand it practically. Now let's get started with the task of building an autocorrect model with Python.

Note: You can use any kind of text data.


To run this task, we require some libraries. I am going to use libraries that are very common for machine learning, so you should already have all of them installed in your system except one. You need to install a library known as "textdistance", which can be easily installed using the pip command.

pip install textdistance

Now let us get started with this by importing all the necessary packages, libraries and by reading our text file:

Code:

import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

words = []
with open('auto.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    words = re.findall(r'\w+', file_name_data)

# This is our vocabulary
V = set(words)
print(f"Top ten words in the text are: {words[0:10]}")
print(f"Total Unique words are {len(V)}.")

Output:

Top ten words in the text are: ['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
Total Unique words are 17140.

In the above code, you can see that we have made a list of words, and now we will build the frequency of those words, which can be easily done using the Counter class from Python's collections module:

Code:

word_freq = Counter(words)
print(word_freq.most_common()[0:10])

Output:

[('the', 14431), ('and', 6430), ('a', 4736), ('to', 4625), ('in', 4172), ('his', 2530), ('it', 2522), ('i', 2127)]

Relative Frequency of Words

Now we want to get the probability of occurrence of each word, which equals the relative frequency of the words:

Code:

probs = {}
Total = sum(word_freq.values())
for k in word_freq.keys():
    probs[k] = word_freq[k] / Total

Finding Similar Words

So we will sort similar words according to the Jaccard distance, calculated on the 2-grams (Q-grams with Q=2) of the words. Then we will return the five most similar words, ordered by similarity and probability:

Code:

def my_autocorrect(input_word):
    input_word = input_word.lower()
    if input_word in V:
        return('Your word seems to be correct')
    else:
        sim = [1 - (textdistance.Jaccard(qval=2).distance(v, input_word)) for v in word_freq.keys()]
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index': 'Word', 0: 'Prob'})
        df['Similarity'] = sim
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return(output)
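As a quick illustration of what the Jaccard similarity on 2-grams measures, here is a hand-worked sketch (not part of the article's code) for the misspelling we test next:

Code:

def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

a, b = bigrams("neverteless"), bigrams("nevertheless")
print(len(a & b) / len(a | b))  # intersection over union of the 2-gram sets = 0.75

This agrees with the 0.75 similarity reported for "nevertheless" in the output below.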

Okay, Now, let us find some similar words by using our autocorrect function:

Code:

my_autocorrect('neverteless')

Word Prob Similarity

2209 nevertheless 0.000229 0.750000

13300 boneless 0.000014 0.416667

12309 elevates 0.000005 0.416667

718 never 0.000942 0.400000

6815 level 0.000110 0.400000

This is how the autocorrect algorithm works here!!

Here we have taken our words from a book. In the same way, a smartphone has some words already present in its vocabulary, and it records other words while the user types on the keyboard.

Conclusion

You can extend this feature to a real-time implementation. I hope you liked this article on how to build an autocorrect feature using NLP with Python.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion


Classification Model Simulator Application Using Dash In Python

This article was published as a part of the Data Science Blogathon.

Introduction

Build an app and bring data to life !!!!

Dash is an open-source Python framework for analytic applications. It is built on top of Flask, Plotly.js, and React.js. If you have used Python for data exploration, analysis, visualization, model building, or reporting, then you will find it extremely useful for building highly interactive analytic web applications with minimal code. In this article, we will explore some key features, including DCC & DAQ components and Plotly Express for visuals, and build classification models within an app.

Here are various topics that this article explores

1. Quick look at plotly features & widgets

2. Build an interface for the user to experiment with parameters

3. Build models and measure metrics

4. Leverage Pytest for automated testing

5. Logging errors for debugging

6. Conclusion

  Data

We will be using Analytics Vidhya's dataset from the Loan Prediction problem. Let's create a separate file, 'definition.py', for loading the data, in which we create an object called obj_Data that is accessible across files within the project. Firstly, let's look at the data.

Front End – Add DCC & DAQ controls

Before we begin, let’s take a look at what we will build by the end of this blog.

 

Slider:

daq.Slider(
    id='slider',
    min=0,
    max=100,
    value=70,
    handleLabel={"showCurrentValue": True, "label": "SPLIT"},
    step=10
),

Dropdowns:

Next, let’s build two dropdowns, one for selecting the target variable and the other for independent variables. The only thing to note here is that the values are being populated from the dataset and not hardcoded.

options=[{'label':x, 'value':x} for x in obj_Data.df_train_dummies.columns],

html.P("Select Target", className="control_label"),
dcc.Dropdown(
    id='select_target',  # id assumed from the callback inputs shown later
    options=[{'label': x, 'value': x} for x in obj_Data.df_train_dummies.columns],
    multi=False,
    value='Loan_Status',
    clearable=False,
),

Numeric Input with DAQ:

We also would like to have the user select a number of splits for model building. For more information refer KFOLD. let’s add a numeric field with min=1 and max=10.

daq.NumericInput(
    id='id-daq-splits',
    min=0,
    max=10,
    size=75,
    value=2
),

LED Display with DAQ:

It would also be very useful to have certain standard metrics upfront, like the number of records, the number of categorical & numeric fields, etc., as part of the basic information. For this, let's make use of dash-daq widgets. We can change the font, color, and background depending on the layout/theme.

daq.LEDDisplay(
    id='records',
    #label="Default",
    value=0,
    label="Records",
    size=FONTSIZE,
    color=FONTCOLOR,
    backgroundColor=BGCOLOR
)

We will use the same code snippet to generate a few more such cards/widgets making sure ids are unique. We will look at how to populate the values in the later segment but for now, let’s set the value to zero.
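One way to generate those extra cards without repeating the snippet is a small helper that reuses the FONTSIZE/FONTCOLOR/BGCOLOR constants from the snippet above; the function name and the extra ids below are made up for illustration and are not from the original code:

import dash_daq as daq

def make_led(card_id, label):
    # Same LEDDisplay settings as above; only the id and label change.
    return daq.LEDDisplay(
        id=card_id,
        value=0,
        label=label,
        size=FONTSIZE,
        color=FONTCOLOR,
        backgroundColor=BGCOLOR,
    )

cards = [
    make_led("records", "Records"),
    make_led("categorical", "Categorical Fields"),
    make_led("numeric", "Numeric Fields"),
]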

Now that all user-enterable fields are covered, we will add placeholders for showcasing some of the model metrics plots, such as AUC-ROC, which is a standard curve for classification models. We will populate the chart once the model building is completed in a later segment.

html.Div( [dcc.Graph(id="main_graph")], ),

Back End – let’s build models and measure metrics:

There are two aspects to factor in-

1. Build all seven classification models and plot a bar chart based on accuracy. We will code this in a separate file named multiModel.py

2. Automatically select the best performing model and detail the relevant metric specific to the chosen model. We will code this in file models.py

Classification Model/s:

Let's start with the first part. We will build seven classification models, namely Logistic Regression, LightGBM, KNN, Decision Tree, AdaBoost Classifier, Random Forest, and Gaussian Naive Bayes. Here is the snippet for LightGBM. As the article is about building an analytics app and not about model building, you can refer to the complete model-building code for more details.

...
...
clf = lgb.LGBMClassifier(n_estimators=1000, max_depth=4, random_state=22)
clf.fit(X_trn, y_trn)
predictions = clf.predict(X_val)
fun_metrics(predictions, y_val)
fpr, tpr, _ = roc_curve(y_val, predictions)
fun_metricsPlots(fpr, tpr, "LGBM")
fun_updateAccuracy(clf, predictions)
....
....

Now, for the second part where we will generate metrics specific to the best model among the seven. Here is the pseudo-code snippet – refer code for more details.

if bestModel == 'GNB':
    model = GaussianNB()
elif bestModel == 'LGBM':
    model = lgb.LGBMClassifier()
elif bestModel == 'Logistic':
    model = LogisticRegression()
elif bestModel == 'KNN':
    model = KNeighborsClassifier()
elif bestModel == 'Random Forest':
    model = RandomForestClassifier()
elif bestModel == 'DT':
    model = tree.DecisionTreeClassifier()
else:
    model = AdaBoostClassifier()

Measure Model Metrics:

We will track the metrics for the best model (precision, recall, and accuracy), and for this we will use the sklearn.metrics library to derive these numbers. These are the numbers that will populate our dash-daq widgets.

from sklearn.metrics import roc_curve, roc_auc_score, recall_score, precision_score, accuracy_score

precision = round(precision_score(testy, yhat), 2)
recall = round(recall_score(testy, yhat), 2)
accuracy = round(accuracy_score(testy, yhat) * 100, 1)

testy has the actual value from the test set and yhat has predicted values.

Build an AUC-ROC plot with Plotly Express:

Similarly, build an AUC-ROC curve using Plotly Express and save it in the figure object fig_ROC.

fig_ROC = px.area(
    x=lr_fpr, y=lr_tpr,
    title=f'ROC Curve (AUC={lr_auc:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate')
)
fig_ROC.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)
fig_ROC.update_yaxes(scaleanchor="x", scaleratio=1)
fig_ROC.update_xaxes(constrain='domain')

Interaction with callbacks:

We have now designed the front end with widgets and placeholders, and, for the back end, written a function for building classification model(s) that does the prediction and also generates model metrics. These two should interact with each other every time the user changes an input, and this can be achieved using callbacks. Callbacks are Python functions that are automatically called by Dash whenever an input component's property changes.

There are 3 sections to callbacks-

1. List of all the outputs (or just a single output)

2. List of all the inputs (or just a single input)

3. Function which takes the input, does the defined processing, and gives back the output.

Note: If there are multiple inputs or multiple outputs then the controls are wrapped under [ ] if not then it can be skipped.

[
    Output("main_graph", 'figure'),
    Output("recall", 'value'),
]
....
[
    Input("select_target", "value"),
    Input("select_independent", "value"),
    ...
]
....

In the above code snippet, the first argument of Output is main_graph, which we set during the UI design. The second argument is the object type, which in this case is figure. Similarly, the recall control expects an object of type value, which in this case is numeric. More information on callbacks can be found here. Bringing all our input/output controls together, the code would look like this.

@app.callback(
    [
        Output("main_graph", 'figure'),
        Output("individual_graph", 'figure'),
        Output("aggregate_graph", 'figure'),
        Output("slider-output-container", 'children'),
        Output("precision", 'value'),
        Output("recall", 'value'),
        Output("accuracy", 'value'),
        Output("auc", 'value'),
        Output("trainset", 'value'),
        Output("testset", 'value'),
        Output('model-graduated-bar', 'value'),
        Output('id-insights', 'children'),
        Output("model-graphs", 'figure'),
        Output("best-model", 'children'),
        Output("id-daq-switch-model", 'on'),
        Output('auto-toast-model', 'is_open')
    ],
    [
        Input("select_target", "value"),
        Input("select_independent", "value"),
        Input("slider", "value"),
        Input("id-daq-splits", "value"),
        Input("select_models", "value")
    ]
)
def measurePerformance(target, independent, slider, splits, selected_models):
    fig_ROC, Fig_Precision, fig_Threshold, precision, recall, accuracy, trainX, testX, auc, fig_model, bestModel = multiModel.getModels(target, independent, slider, splits, selected_models)
    auc_toast = True if auc < 0.5 else False
    return fig_ROC, Fig_Precision, fig_Threshold, 'Train / Test split size: {} / {}'.format(slider, 100 - slider), precision, recall, accuracy, auc, trainX, testX, auc * 100, f'The best performing model is {bestModel} with accuracy of {accuracy}, precision of {precision} and recall of {recall} with Area under curve of {auc}. Try for various K FOLD values to explore further.', fig_model, f'The top performing model is {bestModel}', True, auc_toast

Write some testcases using PyTest:

Writing unit test cases for typical web development is normal, but for analytic apps with predictive models and visuals there is generally a tendency to skip them and just do a manual sanity check at the end. The pytest library makes it easier to configure test cases and write functions that test for specific inputs & outputs. In short, write them once and keep running the tests before pushing code to the QA/Prod environment. Refer to the pytest documentation for more details.

As an example, let’s write a case to check for Precision value. We can use the same framework and extend it to many more cases – positive, negative, and borderline cases.

# pip install pytest
import pytest

def test_buildModels():
    fig_ROC, fig_precision, fig_threshold, precision, recall, accuracy, trainX, testX, lr_auc = buildModel(target, independent, slider, selected_models)
    assert precision < 1

The assert keyword ensures that the specified criterion is met and designates the test case as either Pass or Fail.

Configure test cases

Test cases under execution

One test case failed

All test cases passed

Logging errors:

Logging errors and warnings helps us keep track of issues in our code, and for this we will use the logging library. Logging is not only a good practice to follow but also helps immensely during the debugging process. Some prefer to use print() statements, which log output to the console for their reference, but it is recommended to use logging instead.

Create a file by name ‘model.log’ in your project directory and use the below code for logging errors in this file.

# logging is part of the Python standard library, so no installation is needed
import logging

logging.basicConfig(
    filename='model.log',
    level=logging.DEBUG,
    format='%(asctime)s:%(levelname)s:%(filename)s:%(funcName)s:%(message)s'
)

The errors can be tracked in the model.log file. Here is a sample error:

Conclusion


Python with Plotly Dash can be used to build some very complex analytics applications in a short time. I personally find it useful for rapid prototyping, client demos, proposals, and POCs. The best part of the whole process is that you only need to know the basics of Python, and you can create the front end, back end, visuals, and predictive models which are core to analytics apps. If you use your creative side and focus on the user experience, then you are sure to impress your team, client, or end user.

What Next?:

The app can be extended to multi-class classification models, add more visuals & metrics as required, build a login page with user authentication, maybe save data to DB, and much more. Hope you learned something new today.

Happy learnings !!!!

You can connect with me – Linkedin

You can find the code for reference – Github


Twitter Sentiment Analysis Using Python

A Twitter sentiment analysis determines negative, positive, or neutral emotions within the text of a tweet using NLP and ML models. Sentiment analysis or opinion mining refers to identifying as well as classifying the sentiments that are expressed in the text source. Tweets are often useful in generating a vast amount of sentiment data upon analysis. These data are useful in understanding the opinion of people on social media for a variety of topics.

This article was published as a part of the Data Science Blogathon.

What is Twitter Sentiment Analysis?

Twitter sentiment analysis analyzes the sentiment or emotion of tweets. It uses natural language processing and machine learning algorithms to classify tweets automatically as positive, negative, or neutral based on their content. It can be done for individual tweets or a larger dataset related to a particular topic or event.

Why is Twitter Sentiment Analysis Important?

Understanding Customer Feedback: By analyzing the sentiment of customer feedback, companies can identify areas where they need to improve their products or services.

Political Analysis: Sentiment analysis can help political campaigns understand public opinion and tailor their messaging accordingly.

Crisis Management: In the event of a crisis, sentiment analysis can help organizations monitor social media and news outlets for negative sentiment and respond appropriately.

How to Do Twitter Sentiment Analysis?

In this article, we aim to perform Twitter sentiment analysis on the tweets provided in the Sentiment140 dataset by developing a machine learning pipeline involving three classifiers (Logistic Regression, Bernoulli Naive Bayes, and SVM) along with Term Frequency-Inverse Document Frequency (TF-IDF) features. The performance of these classifiers is then evaluated using accuracy and F1 scores.

For data preprocessing, we will be using Natural Language Processing’s (NLP) NLTK library.

Twitter Sentiment Analysis: Problem Statement

In this project, we try to implement an NLP Twitter sentiment analysis model that helps to overcome the challenges of sentiment classification of tweets. We will be classifying the tweets into positive or negative sentiments. The necessary details regarding the dataset involving the Twitter sentiment analysis project are:

The dataset provided is the Sentiment140 Dataset which consists of 1,600,000 tweets that have been extracted using the Twitter API. The various columns present in this Twitter data are:

target: the polarity of the tweet (positive or negative)

ids: Unique id of the tweet

date: the date of the tweet

flag: It refers to the query. If no such query exists, then it is NO QUERY.

user: It refers to the name of the user that tweeted

text: It refers to the text of the tweet

Twitter Sentiment Analysis: Project Pipeline

The various steps involved in the Machine Learning Pipeline are:

Import Necessary Dependencies

Read and Load the Dataset

Exploratory Data Analysis

Data Visualization of Target Variables

Data Preprocessing

Splitting our data into Train and Test sets.

Transforming Dataset using TF-IDF Vectorizer

Function for Model Evaluation

Model Building

Model Evaluation

Let’s get started,

Step-1: Import the Necessary Dependencies

# utilities
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

Step-2: Read and Load the Dataset

# Importing the dataset
DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('Project_Data.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
df.sample(5)

Output:

Step-3: Exploratory Data Analysis

3.1: Five top records of data

df.head()

Output:

3.2: Columns/features in data

df.columns

Output:

Index(['target', 'ids', 'date', 'flag', 'user', 'text'], dtype='object')

3.3: Length of the dataset

print('length of data is', len(df))

Output:

length of data is 1048576

3.4: Shape of data

df.shape

Output:

(1048576, 6)

3.5: Data information

df.info()

Output:

3.6: Datatypes of all columns

df.dtypes

Output:

target int64 ids int64 date object flag object user object text object dtype: object

3.7: Checking for null values

np.sum(df.isnull().any(axis=1))

Output:

0

3.8: Rows and columns in the dataset

print('Count of columns in the data is: ', len(df.columns)) print('Count of rows in the data is: ', len(df))

Output:

Count of columns in the data is: 6 Count of rows in the data is: 1048576

3.9: Check unique target values

df['target'].unique()

Output:

array([0, 4], dtype=int64)

3.10: Check the number of target values

df['target'].nunique()

Output:

2

Step-4: Data Visualization of Target Variables

# Plotting the distribution for dataset.
ax = df.groupby('target').count().plot(kind='bar', title='Distribution of data', legend=False)
ax.set_xticklabels(['Negative', 'Positive'], rotation=0)

# Storing data in lists.
text, sentiment = list(df['text']), list(df['target'])

Output:

import seaborn as sns
sns.countplot(x='target', data=df)

Output:

Step-5: Data Preprocessing

In the above-given problem statement, before training the model, we performed various pre-processing steps on the dataset that mainly dealt with removing stopwords, removing special characters like emojis, hashtags, etc. The text document is then converted into lowercase for better generalization.

Subsequently, the punctuation was cleaned and removed, thereby reducing unnecessary noise in the dataset. After that, we also removed the repeating characters from the words, along with the URLs, as they do not have any significant importance.

At last, we then performed Stemming(reducing the words to their derived stems) and Lemmatization(reducing the derived words to their root form, known as lemma) for better results.
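To make the difference between the two concrete, here is a small standalone sketch (not part of the project code; the example words are arbitrary, and the WordNet corpus must be downloaded once for the lemmatizer):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download required by the lemmatizer

st = PorterStemmer()
lm = WordNetLemmatizer()

print(st.stem("studies"), st.stem("running"))                      # studi run  (rule-based stems)
print(lm.lemmatize("studies"), lm.lemmatize("running", pos="v"))   # study run  (dictionary-based lemmas)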

5.1: Selecting the text and Target column for our further analysis

data=df[['text','target']]

5.2: Replacing the values to ease understanding. (Assigning 1 to Positive sentiment 4)

data['target'] = data['target'].replace(4,1)

5.3: Printing unique values of target variables

data['target'].unique()

Output:

array([0, 1], dtype=int64)

5.4: Separating positive and negative tweets

data_pos = data[data['target'] == 1] data_neg = data[data['target'] == 0]

5.5: Taking one-fourth of the data so we can run it on our machine easily

data_pos = data_pos.iloc[:int(20000)] data_neg = data_neg.iloc[:int(20000)]

5.6: Combining positive and negative tweets

dataset = pd.concat([data_pos, data_neg])

5.7: Making statement text in lowercase

dataset['text']=dataset['text'].str.lower() dataset['text'].tail()

Output:

5.8: Defining set containing all stopwords in English.

stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an', 'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do', 'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once', 'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're','s', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such', 't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was', 'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom', 'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre", "youve", 'your', 'yours', 'yourself', 'yourselves']

5.9: Cleaning and removing the above stop words list from the tweet text

STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))
dataset['text'].head()

Output:

5.10: Cleaning and removing punctuations

import string
english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x))
dataset['text'].tail()

Output:

5.11: Cleaning and removing repeating characters

def cleaning_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_repeating_char(x))
dataset['text'].tail()

Output:

5.12: Cleaning and removing URLs

def cleaning_URLs(data):
    # URL pattern assumed; it matches www and http(s) links (the original regex was lost in the page export)
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_URLs(x))
dataset['text'].tail()

Output:

5.13: Cleaning and removing numeric numbers

def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_numbers(x))
dataset['text'].tail()

Output:

5.14: Getting tokenization of tweet text

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
dataset['text'] = dataset['text'].apply(tokenizer.tokenize)
dataset['text'].head()

Output:

5.15: Applying stemming

import nltk
st = nltk.PorterStemmer()
def stemming_on_text(data):
    text = [st.stem(word) for word in data]
    return text
dataset['text'] = dataset['text'].apply(lambda x: stemming_on_text(x))
dataset['text'].head()

Output:

5.16: Applying lemmatizer

lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    text = [lm.lemmatize(word) for word in data]
    return text
dataset['text'] = dataset['text'].apply(lambda x: lemmatizer_on_text(x))
dataset['text'].head()

Output:

5.17: Separating input feature and label

X=data.text y=data.target

5.18: Plot a cloud of words for negative tweets

data_neg = data['text'][:800000]
plt.figure(figsize=(20, 20))
wc = WordCloud(max_words=1000, width=1600, height=800, collocations=False).generate(" ".join(data_neg))
plt.imshow(wc)

Output:

5.19: Plot a cloud of words for positive tweets

data_pos = data['text'][800000:]
wc = WordCloud(max_words=1000, width=1600, height=800, collocations=False).generate(" ".join(data_pos))
plt.figure(figsize=(20, 20))
plt.imshow(wc)

Output:

Step-6: Splitting Our Data Into Train and Test Subsets

# Separating the 95% data for training data and 5% for testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=26105111)

Step-7: Transforming the Dataset Using TF-IDF Vectorizer

7.1: Fit the TF-IDF Vectorizer

vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
vectoriser.fit(X_train)
print('No. of feature_words: ', len(vectoriser.get_feature_names()))

Output:

No. of feature_words: 500000

7.2: Transform the data using TF-IDF Vectorizer

X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)

Step-8: Function for Model Evaluation

After training the model, we then apply the evaluation measures to check how the model is performing. Accordingly, we use the following evaluation parameters to check the performance of the models respectively:

Accuracy Score

Confusion Matrix with Plot

ROC-AUC Curve

def model_Evaluate(model):
    # Predict values for Test dataset
    y_pred = model.predict(X_test)
    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot=labels, cmap='Blues', fmt='',
                xticklabels=categories, yticklabels=categories)
    plt.xlabel("Predicted values", fontdict={'size': 14}, labelpad=10)
    plt.ylabel("Actual values", fontdict={'size': 14}, labelpad=10)
    plt.title("Confusion Matrix", fontdict={'size': 18}, pad=20)

Step-9: Model Building

In the problem statement, we have used three different models respectively :

Bernoulli Naive Bayes Classifier

SVM (Support Vector Machine)

Logistic Regression

The idea behind choosing these models is that we want to try all the classifiers on the dataset ranging from simple ones to complex models, and then try to find out the one which gives the best performance among them.

8.1: Model-1

BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)

Output:

8.2: Plot the ROC-AUC Curve for model-1

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred1)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

Output:

8.3: Model-2:

SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)

Output:

8.4: Plot the ROC-AUC Curve for model-2

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred2)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

Output:

8.5: Model-3

LRmodel = LogisticRegression(C=2, max_iter=1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)

Output:

8.6: Plot the ROC-AUC Curve for model-3

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred3)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

Output:

Step-10: Model Evaluation

Upon evaluating all the models, we can conclude the following details i.e.

Accuracy: As far as the accuracy of the model is concerned, Logistic Regression performs better than SVM, which in turn performs better than Bernoulli Naive Bayes.

AUC Score: All three models have the same ROC-AUC score.

We, therefore, conclude that the Logistic Regression is the best model for the above-given dataset.

In our problem statement, Logistic Regression follows the principle of Occam's Razor, which states that for a particular problem, if the data makes no further assumptions, then the simplest model works best. Since our dataset does not carry any such assumptions and Logistic Regression is a simple model, the concept holds true for the above-mentioned dataset.

Conclusion

We hope that through this article you got a basic idea of how sentiment analysis is used to understand public emotions behind people's tweets. As you've read in this article, Twitter sentiment analysis helps us preprocess the data (tweets) using different methods and feed them into ML models to get the best accuracy.

Key Takeaways

Twitter Sentimental Analysis is used to identify as well as classify the sentiments that are expressed in the text source.

Logistic Regression, SVM, and Naive Bayes are some of the ML algorithms that can be used for Twitter Sentimental Analysis.


Feature Detection, Description And Matching Of Images Using Opencv

This article was published as a part of the Data Science Blogathon

Introduction

In this article, I am going to discuss various algorithms for image feature detection, description, and matching using OpenCV.

First of all, let’s see what is computer vision because OpenCV is an Open source Computer Vision library.

What happens when a human sees this image?


He will be able to recognize the faces which are there inside the images. So, in a simple form, computer vision is what allows computers to see and process visual data just like humans. Computer vision involves analyzing images to produce useful information.

What is a feature?

When you see a mango image, how can you identify it as a mango?

By analyzing the color, shape, and texture you can say that it is a mango.

The clues which are used to identify or recognize an image are called features of an image. In the same way, a computer uses functions to detect various features in an image.

We will discuss some of the algorithms of the OpenCV library that are used to detect features.

1. Feature Detection Algorithms

1.1 Harris Corner Detection

Harris corner detection algorithm is used to detect corners in an input image. This algorithm has three main steps.

Determine which part of the image has a large variation in intensity as corners have large variations in intensities. It does this by moving a sliding window throughout the image.

For each window identified, compute a score value R.

Apply threshold to the score and mark the corners.

Here is the Python implementation of this algorithm.

import cv2
import numpy as np

input_img = 'det_1.jpg'
ori = cv2.imread(input_img)
image = cv2.imread(input_img)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = np.float32(gray)
dst = cv2.cornerHarris(gray, 2, 3, 0.04)
dst = cv2.dilate(dst, None)
image[dst > 0.01 * dst.max()] = [0, 0, 255]  # threshold the corner response and mark corners in red (step 3)
cv2.imshow('Original', ori)
cv2.imshow('Harris', image)
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()

Here is the output.

1.2 Shi-Tomasi Corner Detector

This is another corner detection algorithm. It works similar to Harris Corner detection. The only difference here is the computation of the value of R. This algorithm also allows us to find the best n corners in an image.

Let’s see the Python implementation.

import numpy as np
import cv2
from matplotlib import pyplot as plt

img = cv2.imread('det_1.jpg')
ori = cv2.imread('det_1.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
corners = cv2.goodFeaturesToTrack(gray, 20, 0.01, 10)
corners = np.int0(corners)
for i in corners:
    x, y = i.ravel()
    cv2.circle(img, (x, y), 3, 255, -1)
cv2.imshow('Original', ori)
cv2.imshow('Shi-Tomasi', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

This is the output of the Shi-Tomasi algorithm. Here the top 20 corners are detected.

The next one is Scale-Invariant Feature Transform.

1.3 Scale-Invariant Feature Transform (SIFT)

SIFT is used to detect corners, blobs, circles, and so on. It is also used for scaling an image.


Consider these three images. Though they differ in color, rotation, and angle, you know that these are the three different images of mangoes. How can a computer be able to identify this?

Both Harris corner detection and Shi-Tomasi corner detection algorithms fail in this case. But SIFT algorithm plays a vital role here. It can detect features from the image irrespective of its size and orientation.

Let’s implement this algorithm.

import numpy as np
import cv2 as cv

ori = cv.imread('det_1.jpg')
img = cv.imread('det_1.jpg')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
sift = cv.SIFT_create()
kp, des = sift.detectAndCompute(gray, None)
img = cv.drawKeypoints(gray, kp, img, flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv.imshow('Original', ori)
cv.imshow('SIFT', img)
if cv.waitKey(0) & 0xff == 27:
    cv.destroyAllWindows()

The output is shown below.

You can see that there are some lines and circles in the image. The size and orientation of the feature are indicated by the circle and line inside the circle respectively.

We will see the next algorithm of feature detection.

1.4 Speeded-up Robust Features (SURF)

SURF algorithm is simply an upgraded version of SIFT.

Let’s implement this.

import numpy as np
import cv2 as cv

ori = cv.imread('/content/det1.jpg')
img = cv.imread('/content/det1.jpg')
surf = cv.xfeatures2d.SURF_create(400)
kp, des = surf.detectAndCompute(img, None)
img2 = cv.drawKeypoints(img, kp, None, (255, 0, 0), 4)
cv.imshow('Original', ori)
cv.imshow('SURF', img2)

Next, we will see how to extract another feature called a blob.

2. Detection of blobs

Let's implement this one.

import cv2
import numpy as np

ori = cv2.imread('det_1.jpg')
im = cv2.imread("det_1.jpg", cv2.IMREAD_GRAYSCALE)
detector = cv2.SimpleBlobDetector_create()
keypoints = detector.detect(im)
im_with_keypoints = cv2.drawKeypoints(im, keypoints, np.array([]), (0, 0, 255),
                                      cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imshow('Original', ori)
cv2.imshow('BLOB', im_with_keypoints)
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()

Let’s see the output. Here, the blobs are detected very well.

Now, let’s jump into feature descriptor algorithms.

3. Feature Descriptor Algorithms

Features are typically distinct points in an image and the descriptor gives a signature, so it describes the key point that is considered. It extracts the local neighborhood around that point so a local image patch is created and a signature from this local patch is computed.

3.1 Histogram of Oriented Gradients (HoG)

HoG is mostly used in object detection applications. It is a technique that counts the occurrences of gradient orientation in localized portions of an image.

Let’s implement this algorithm.

from skimage.feature import hog
import cv2

ori = cv2.imread('/content/det1.jpg')
img = cv2.imread("/content/det1.jpg")
_, hog_image = hog(img, orientations=8, pixels_per_cell=(16, 16),
                   cells_per_block=(1, 1), visualize=True, multichannel=True)
cv2.imshow('Original', ori)
cv2.imshow('HoG', hog_image)

The next one is BRIEF.

3.2 Binary Robust Independent Elementary Features (BRIEF)

BRIEF is an alternative to the popular SIFT descriptor and they are faster to compute and more compact.

Let’s see its implementation.

import numpy as np
import cv2 as cv

ori = cv.imread('/content/det1.jpg')
img = cv.imread('/content/det1.jpg', 0)
star = cv.xfeatures2d.StarDetector_create()
brief = cv.xfeatures2d.BriefDescriptorExtractor_create()
kp = star.detect(img, None)
kp, des = brief.compute(img, kp)  # compute the BRIEF descriptors for the detected keypoints
print(brief.descriptorSize())
print(des.shape)
img2 = cv.drawKeypoints(img, kp, None, color=(0, 255, 0), flags=0)
cv.imshow('Original', ori)
cv.imshow('BRIEF', img2)

Here is the result.

3.3 Oriented FAST and Rotated BRIEF (ORB)

ORB is a one-shot facial recognition algorithm. It is currently being used in mobile phones and in apps like Google Photos, in which images are grouped according to the people they contain. This algorithm does not require any kind of major computation, and it does not require a GPU. Here, two algorithms are involved: FAST and BRIEF. It works on keypoint matching, i.e. matching distinctive regions in an image, such as regions with intensity variations.

Here is the implementation of this algorithm.

import numpy as np
import cv2

ori = cv2.imread('/content/det1.jpg')
img = cv2.imread('/content/det1.jpg', 0)
orb = cv2.ORB_create(nfeatures=200)
kp = orb.detect(img, None)
img2 = cv2.drawKeypoints(img, kp, None, color=(0, 255, 0), flags=0)
cv2.imshow('Original', ori)
cv2.imshow('ORB', img2)

Here is the output.

Now, let’s see about feature matching.

4. Feature Matching

Feature matching is like comparing the features of two images which may be different in orientations, perspective, lightening, or even differ in sizes and colors. Let’s see its implementation.

import cv2

img1 = cv2.imread('/content/det1.jpg', 0)
img2 = cv2.imread('/content/88.jpg', 0)
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)
match_img = cv2.drawMatches(img1, kp1, img2, kp2, matches[:50], None)
cv2.imshow('original image', img1)
cv2.imshow('test image', img2)
cv2.imshow('Matches', match_img)
cv2.waitKey()

This is the result of this algorithm.

Endnotes

I hope you enjoyed this article. I have given a brief introduction to various feature detection, description, and feature matching techniques. The above-mentioned techniques are used in object detection, object tracking, and object classification applications.

The real fun starts when you start practicing. So, start practicing these algorithms, implement them in real-world projects, and see the fun. Keep learning.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


What To Know About Using Google Calendar’s Time Insights Feature

In today’s data-heavy world, there seems to be no shortage of tools that collect and display information about our lives, ripe for whatever level of analysis you’d like. Google Calendar is no different, with an easy-to-miss feature called Time Insights.

Introduced at the end of August, this analytical tool rests within the left-hand sidebar when viewing GCal on a computer—you won’t see it in the app. It takes up a mere five lines (one if you’ve somehow minimized it), so it’s possible to overlook if you’re not constantly poring over your schedule or only work out of the Android or iOS app. Whether you’ve noticed it or not, understanding how it works may help you better structure your day.

How to see and use Time Insights

When not minimized, Time Insights displays the same chunk of time visible on the main calendar page (perhaps a day, week, month, or year), how much of that is filled by meetings, and how that compares to your average total meeting time across the three previous equivalent periods. For example, if you’re looking at the week of Dec. 19 to 25, it may say you have five hours of scheduled meetings after averaging 1.9 hours over the past three weeks.

Google also displays this information with a multi-colored bar that’s divided into chunks for “focus time,” meetings, and meetings you have yet to respond to. Hover your cursor over the graphic and you’ll see the total and average times for each category.


Time breakdown

The prime feature here is the time breakdown ring, which is broken into separate colors for focus time, one-on-ones, meetings with three or more guests, requests you haven’t responded to, and—if enabled—how many remaining work hours you have in the day, week, or whatever time period you’re looking at.

Hover your cursor over a color and it’ll dim all events on your calendar except ones that match the type you’re on, putting a shadow under those so you can see them easier among everything else on your calendar.

Time in meetings

Under the time breakdown ring, there’s a Time in meetings heading. This shows you which day you tend to have the most meetings, your daily average of time spent in meetings over the past three weeks, and colored bars detailing your current calendar view, the next time period of the same length, and the two prior equivalent chunks of time. These have separate colors for recurring and one-time meetings, and if you hover your cursor over a block you’re currently in, GCal will highlight all meetings of that type.

People you meet with

The last heading on the Time Insights sidebar is People you meet with. This shows who you have the most meetings with in the selected time period, and you can pin up to 10 people to always see your shared meeting time. You’ll also see colored bars that indicate whether these meetings are one-on-one or in a group of up to 15 people—mouse over them to highlight them on the main calendar. If you don’t have a meeting with a pinned person within the chosen time period, this section will also tell you when your next meeting with them is.

Using Python + Streamlit To Find Striking Distance Keyword Opportunities

Python is an excellent tool to automate repetitive tasks as well as gain additional insights into data.

It’s perfect for Python beginners and pros alike and is a great introduction to using Python for SEO.

If you’d just like to get stuck in there’s a handy Streamlit app available for the code. This is simple to use and requires no coding experience.

There’s also a Google Colaboratory Sheet if you’d like to poke around with the code. If you can crawl a website, you can use this script!

Here’s an example of what we’ll be making today:

These keywords are found in the page title and H1, but not in the copy. Adding these keywords naturally to the existing copy would be an easy way to increase relevancy for these keywords.

By taking the hint from search engines and naturally including any missing keywords a site already ranks for, we increase the confidence of search engines to rank those keywords higher in the SERPs.

This report can be created manually, but it’s pretty time-consuming.

So, we’re going to automate the process using a Python SEO script.

Preview Of The Output

This is a sample of what the final output will look like after running the report:

The final output takes the top five opportunities by search volume for each page and neatly lays each one horizontally along with the estimated search volume.

It also shows the total search volume of all keywords a page has within striking distance, as well as the total number of keywords within reach.

The top five keywords by search volume are then checked to see if they are found in the title, H1, or copy, then flagged TRUE or FALSE.

This is great for finding quick wins! Just add the missing keyword naturally into the page copy, title, or H1.
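Under the hood, those TRUE/FALSE flags come down to a case-insensitive substring check of each keyword against the page title, H1, and copy. A minimal sketch of the idea (the column names and toy data here are illustrative, not the script's exact ones):

import pandas as pd

# Toy data; the real report builds these columns from the crawl and keyword exports.
df = pd.DataFrame({
    "Keyword": ["striking distance", "python seo"],
    "Title": ["Striking Distance Keywords", "SEO Guide"],
    "H1": ["Striking Distance Report", "Python for SEO"],
    "Copy": ["Find quick win keywords fast.", "Automate SEO tasks with Python."],
})

for col in ["Title", "H1", "Copy"]:
    df[f"KW in {col}"] = df.apply(
        lambda row: str(row["Keyword"]).lower() in str(row[col]).lower(), axis=1
    )

print(df[["Keyword", "KW in Title", "KW in H1", "KW in Copy"]])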

Getting Started

The setup is fairly straightforward. We just need a crawl of the site (ideally with a custom extraction for the copy you’d like to check), and an exported file of all keywords a site ranks for.

This post will walk you through the setup, the code, and will link to a Google Colaboratory sheet if you just want to get stuck in without coding it yourself.

To get started you will need:

A crawl of the website.

An export of all keywords a site ranks for.

This Google Colab sheet or this Streamlit app to mash up the crawl and keyword data

We’ve named this the Striking Distance Report as it flags keywords that are easily within striking distance.

(We have defined striking distance as keywords that rank in positions four to 20, but have made this a configurable option in case you would like to define your own parameters.)

Striking Distance SEO Report: Getting Started

1. Crawl The Target Website

Set a custom extractor for the page copy (optional, but recommended).

Filter out pagination pages from the crawl.

2. Export All Keywords The Site Ranks For Using Your Favorite Provider

Filter keywords that trigger as a site link.

Remove keywords that trigger as an image.

Filter branded keywords.

Use both exports to create an actionable Striking Distance report from the keyword and crawl data with Python.

Crawling The Site

I’ve opted to use Screaming Frog to get the initial crawl. Any crawler will work, so long as the CSV export uses the same column names or they’re renamed to match.

The script expects to find the following columns in the crawl CSV export:

"Address", "Title 1", "H1-1", "Copy 1", "Indexability"

Crawl Settings

The first thing to do is to head over to the main configuration settings within Screaming Frog:

The main settings to use are:

Crawl Internal Links, Canonicals, and the Pagination (Rel Next/Prev) setting.

(The script will work with everything else selected, but the crawl will take longer to complete!)

Next, it’s on to the Extraction tab.

At a bare minimum, we need to extract the page title, H1, and calculate whether the page is indexable as shown below.

Indexability is useful because it’s an easy way for the script to identify which URLs to drop in one go, leaving only keywords that are eligible to rank in the SERPs.

If the script cannot find the indexability column, it’ll still work as normal but won’t differentiate between pages that can and cannot rank.
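For example, a filter along these lines is the kind of one-step drop the indexability column makes possible (a sketch, assuming Screaming Frog's "Indexable" / "Non-Indexable" labels; the filename is illustrative):

import pandas as pd

df_crawl = pd.read_csv("internal_html.csv")  # the crawl export

# Keep only pages that are eligible to rank in the SERPs.
if "Indexability" in df_crawl.columns:
    df_crawl = df_crawl[df_crawl["Indexability"] == "Indexable"]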

Setting A Custom Extractor For Page Copy

In order to check whether a keyword is found within the page copy, we need to set a custom extractor in Screaming Frog.

Name the extractor “Copy” as seen below.

Important: The script expects the extractor to be named “Copy” as above, so please double check!

Lastly, make sure Extract Text is selected to export the copy as text, rather than HTML.

There are many guides on using custom extractors online if you need help setting one up, so I won’t go over it again here.

Once the extraction has been set it’s time to crawl the site and export the HTML file in CSV format.

Exporting The CSV File

Exporting the CSV file is as easy as changing the drop-down menu displayed underneath Internal to HTML and pressing the Export button.

The export screen should look like the below:

Tip 1: Filtering Out Pagination Pages

I recommend filtering out pagination pages from your crawl either by selecting Respect Next/Prev under the Advanced settings (or just deleting them from the CSV file, if you prefer).

Tip 2: Saving The Crawl Settings

Once you have set the crawl up, it’s worth just saving the crawl settings (which will also remember the custom extraction).

This will save a lot of time if you want to use the script again in the future.

Exporting Keywords

Once we have the crawl file, the next step is to load your favorite keyword research tool and export all of the keywords a site ranks for.

The goal here is to export all the keywords a site ranks for, filtering out branded keywords and any which triggered as a sitelink or image.

For this example, I’m using the Organic Keyword Report in Ahrefs, but it will work just as well with Semrush if that’s your preferred tool.

In Ahrefs, enter the domain you’d like to check in Site Explorer and choose Organic Keywords.

This will bring up all keywords the site is ranking for.

Filtering Out Sitelinks And Image links

The next step is to filter out any keywords triggered as a sitelink or an image pack.

The reason we need to filter out sitelinks is that they have no influence on the parent URL ranking. This is because only the parent page technically ranks for the keyword, not the sitelink URLs displayed under it.

Filtering out sitelinks will ensure that we are optimizing the correct page.

Here’s how to do it in Ahrefs.

Lastly, I recommend filtering out any branded keywords. You can do this by filtering the CSV output directly, or by pre-filtering in the keyword tool of your choice before the export.

Finally, when exporting make sure to choose Full Export and the UTF-8 format as shown below.

By default, the script works with Ahrefs (v1/v2) and Semrush keyword exports. It can work with any keyword CSV file as long as the column names the script expects are present.

Processing

The following instructions pertain to running a Google Colaboratory sheet to execute the code.

There is now a simpler option for those that prefer it in the form of a Streamlit app. Simply follow the instructions provided to upload your crawl and keyword file.

Now that we have our exported files, all that’s left to be done is to upload them to the Google Colaboratory sheet for processing.

The script will prompt you to upload the keyword CSV from Ahrefs or Semrush first and the crawl file afterward.

That’s it! The script will automatically download an actionable CSV file you can use to optimize your site.

Once you’re familiar with the whole process, using the script is really straightforward.

Code Breakdown And Explanation

If you’re learning Python for SEO and interested in what the code is doing to produce the report, stick around for the code walkthrough!

Install The Libraries

Let’s install pandas to get the ball rolling.

!pip install pandas

Import The Modules

Next, we need to import the required modules.

import pandas as pd
from pandas import DataFrame, Series
from typing import Union
from google.colab import files

Set The Variables

Now it’s time to set the variables.

The script considers any keywords between positions four and 20 as within striking distance.

Changing the variables here will let you define your own range if desired. It’s worth experimenting with the settings to get the best possible output for your needs.

# set all variables here
min_volume = 10  # set the minimum search volume
min_position = 4  # set the minimum position / default = 4
max_position = 20  # set the maximum position / default = 20
drop_all_true = True  # if all checks (h1/title/copy) are true, remove the recommendation (nothing to do)

Upload The Keyword Export CSV File

The next step is to read in the list of keywords from the CSV file.

It is set up to accept an Ahrefs report (V1 and V2) as well as a Semrush export.

upload = files.upload()
upload = list(upload.keys())[0]
df_keywords = pd.read_csv(
    (upload),
    error_bad_lines=False,
    low_memory=False,
    encoding="utf8",
    dtype={
        "URL": "str",
        "Keyword": "str",
        "Volume": "str",
        "Position": int,
        "Current URL": "str",
        "Search Volume": int,
    },
)
print("Uploaded Keyword CSV File Successfully!")

If everything went to plan, you’ll see a preview of the DataFrame created from the keyword CSV export. 

Upload The Crawl Export CSV File

Once the keywords have been imported, it’s time to upload the crawl file.

upload = files.upload()
upload = list(upload.keys())[0]
df_crawl = pd.read_csv(
    (upload),
    error_bad_lines=False,
    low_memory=False,
    encoding="utf8",
    dtype="str",
)
print("Uploaded Crawl Dataframe Successfully!")

Once the CSV file has finished uploading, you’ll see a preview of the DataFrame.

Clean And Standardize The Keyword Data

The next step is to rename the columns so that the most common types of keyword export are standardized to the same names.

Essentially, we’re getting the keyword DataFrame into a good state and filtering using cutoffs defined by the variables.

df_keywords.rename(
    columns={
        "Current position": "Position",
        "Current URL": "URL",
        "Search Volume": "Volume",
    },
    inplace=True,
)

# keep only the following columns from the keyword dataframe
cols = "URL", "Keyword", "Volume", "Position"
df_keywords = df_keywords.reindex(columns=cols)

try:
    # clean the data. (v1 of the ahrefs keyword export combines strings and ints in the volume column)
    df_keywords["Volume"] = df_keywords["Volume"].str.replace("0-10", "0")
except AttributeError:
    pass

# clean the keyword data
df_keywords = df_keywords[df_keywords["URL"].notna()]  # remove any missing values
df_keywords = df_keywords[df_keywords["Volume"].notna()]  # remove any missing values
df_keywords = df_keywords.astype({"Volume": int})  # change data type to int
df_keywords = df_keywords.sort_values(by="Volume", ascending=False)  # sort by highest vol to keep the top opportunity

# make new dataframe to merge search volume back in later
df_keyword_vol = df_keywords[["Keyword", "Volume"]]

# drop rows if minimum search volume doesn't match specified criteria
df_keywords.loc[df_keywords["Volume"] < min_volume, "Volume_Too_Low"] = "drop"
df_keywords = df_keywords[~df_keywords["Volume_Too_Low"].isin(["drop"])]

# drop rows if minimum search position doesn't match specified criteria
df_keywords.loc[df_keywords["Position"] <= min_position, "Position_Too_High"] = "drop"
df_keywords = df_keywords[~df_keywords["Position_Too_High"].isin(["drop"])]

# drop rows if maximum search position doesn't match specified criteria
df_keywords.loc[df_keywords["Position"] > max_position, "Position_Too_Low"] = "drop"
df_keywords = df_keywords[~df_keywords["Position_Too_Low"].isin(["drop"])]

Clean And Standardize The Crawl Data

Next, we need to clean and standardize the crawl data.

Essentially, we use reindex to keep only the “Address,” “Indexability,” “Title 1,” “H1-1,” and “Copy 1” columns, discarding the rest.

We use the handy “Indexability” column to only keep rows that are indexable. This will drop canonicalized URLs, redirects, and so on. I recommend enabling this option in the crawl.

Lastly, we standardize the column names so they’re a little nicer to work with.

# keep only the following columns from the crawl dataframe
cols = "Address", "Indexability", "Title 1", "H1-1", "Copy 1"
df_crawl = df_crawl.reindex(columns=cols)

# drop non-indexable rows
df_crawl = df_crawl[~df_crawl["Indexability"].isin(["Non-Indexable"])]

# standardise the column names
df_crawl.rename(columns={"Address": "URL", "Title 1": "Title", "H1-1": "H1", "Copy 1": "Copy"}, inplace=True)
df_crawl.head()

Group The Keywords

As we approach the final output, it’s necessary to group our keywords together to calculate the total opportunity for each page.

Here, we’re calculating how many keywords are within striking distance for each page, along with the combined search volume.

# groups the URLs (remove the dupes and combines stats)
# make a copy of the keywords dataframe for grouping - this ensures stats can be merged back in later from the OG df
df_keywords_group = df_keywords.copy()
df_keywords_group["KWs in Striking Dist."] = 1  # used to count the number of keywords in striking distance
df_keywords_group = (
    df_keywords_group.groupby("URL")
    .agg({"Volume": "sum", "KWs in Striking Dist.": "count"})
    .reset_index()
)
df_keywords_group.head()

Once complete, you’ll see a preview of the DataFrame.

Display Keywords In Adjacent Rows

We use the grouped data as the basis for the final output, and the pandas unstack function to reshape the DataFrame so the keywords are displayed in the style of a GrepWords export.

# create a new df, combine the merged data with the original data. display in adjacent rows ala grepwords
df_merged_all_kws = df_keywords_group.merge(
    df_keywords.groupby("URL")["Keyword"]
    .apply(lambda x: x.reset_index(drop=True))
    .unstack()
    .reset_index()
)

# sort by biggest opportunity
df_merged_all_kws = df_merged_all_kws.sort_values(
    by="KWs in Striking Dist.", ascending=False
)

# reindex the columns to keep just the top five keywords
cols = "URL", "Volume", "KWs in Striking Dist.", 0, 1, 2, 3, 4
df_merged_all_kws = df_merged_all_kws.reindex(columns=cols)

# create union and rename the columns
df_striking: Union[Series, DataFrame, None] = df_merged_all_kws.rename(
    columns={
        "Volume": "Striking Dist. Vol",
        0: "KW1",
        1: "KW2",
        2: "KW3",
        3: "KW4",
        4: "KW5",
    }
)

# merges striking distance df with crawl df to merge in the title, h1 and category description
df_striking = pd.merge(df_striking, df_crawl, on="URL", how="inner")

Set The Final Column Order And Insert Placeholder Columns

Lastly, we set the final column order and merge in the original keyword data.

There are a lot of columns to sort and create!

# set the final column order and merge the keyword data in
cols = [
    "URL", "Title", "H1", "Copy", "Striking Dist. Vol", "KWs in Striking Dist.",
    "KW1", "KW1 Vol", "KW1 in Title", "KW1 in H1", "KW1 in Copy",
    "KW2", "KW2 Vol", "KW2 in Title", "KW2 in H1", "KW2 in Copy",
    "KW3", "KW3 Vol", "KW3 in Title", "KW3 in H1", "KW3 in Copy",
    "KW4", "KW4 Vol", "KW4 in Title", "KW4 in H1", "KW4 in Copy",
    "KW5", "KW5 Vol", "KW5 in Title", "KW5 in H1", "KW5 in Copy",
]

# re-index the columns to place them in a logical order + inserts new blank columns for kw checks.
df_striking = df_striking.reindex(columns=cols)

Merge In The Keyword Data For Each Column

This code merges the keyword volume data back into the DataFrame. It’s more or less the equivalent of an Excel VLOOKUP function.
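The script writes this step out merge by merge for KW1 through KW5, as shown below. If you prefer something more compact, the same logic could be expressed as a loop. This is just a sketch that reuses the df_striking and df_keyword_vol DataFrames created above, not the script's own code:

# sketch: the same five merges written as a loop
for kw_col in ["KW1", "KW2", "KW3", "KW4", "KW5"]:
    df_striking = pd.merge(df_striking, df_keyword_vol, left_on=kw_col, right_on="Keyword", how="left")
    df_striking[kw_col + " Vol"] = df_striking["Volume"]
    df_striking.drop(["Keyword", "Volume"], axis=1, inplace=True)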

# merge in keyword data for each keyword column (KW1 - KW5)
df_striking = pd.merge(df_striking, df_keyword_vol, left_on="KW1", right_on="Keyword", how="left")
df_striking['KW1 Vol'] = df_striking['Volume']
df_striking.drop(['Keyword', 'Volume'], axis=1, inplace=True)

df_striking = pd.merge(df_striking, df_keyword_vol, left_on="KW2", right_on="Keyword", how="left")
df_striking['KW2 Vol'] = df_striking['Volume']
df_striking.drop(['Keyword', 'Volume'], axis=1, inplace=True)

df_striking = pd.merge(df_striking, df_keyword_vol, left_on="KW3", right_on="Keyword", how="left")
df_striking['KW3 Vol'] = df_striking['Volume']
df_striking.drop(['Keyword', 'Volume'], axis=1, inplace=True)

df_striking = pd.merge(df_striking, df_keyword_vol, left_on="KW4", right_on="Keyword", how="left")
df_striking['KW4 Vol'] = df_striking['Volume']
df_striking.drop(['Keyword', 'Volume'], axis=1, inplace=True)

df_striking = pd.merge(df_striking, df_keyword_vol, left_on="KW5", right_on="Keyword", how="left")
df_striking['KW5 Vol'] = df_striking['Volume']
df_striking.drop(['Keyword', 'Volume'], axis=1, inplace=True)

Clean The Data Some More

The data requires additional cleaning to replace empty values (NaNs) with empty strings. This improves the readability of the final output by creating blank cells instead of cells populated with NaN string values.

Next, we convert the columns to lowercase so that they match when checking whether a target keyword is featured in a specific column.

# replace nan values with empty strings
df_striking = df_striking.fillna("")

# drop the title, h1 and category description to lower case so kws can be matched to them
df_striking["Title"] = df_striking["Title"].str.lower()
df_striking["H1"] = df_striking["H1"].str.lower()
df_striking["Copy"] = df_striking["Copy"].str.lower()

Check Whether The Keyword Appears In The Title/H1/Copy And Return True Or False

This code checks if the target keyword is found in the page title/H1 or copy.

It’ll flag true or false depending on whether a keyword was found within the on-page elements.

df_striking["KW1 in Title"] = df_striking.apply(lambda row: row["KW1"] in row["Title"], axis=1) df_striking["KW1 in H1"] = df_striking.apply(lambda row: row["KW1"] in row["H1"], axis=1) df_striking["KW1 in Copy"] = df_striking.apply(lambda row: row["KW1"] in row["Copy"], axis=1) df_striking["KW2 in Title"] = df_striking.apply(lambda row: row["KW2"] in row["Title"], axis=1) df_striking["KW2 in H1"] = df_striking.apply(lambda row: row["KW2"] in row["H1"], axis=1) df_striking["KW2 in Copy"] = df_striking.apply(lambda row: row["KW2"] in row["Copy"], axis=1) df_striking["KW3 in Title"] = df_striking.apply(lambda row: row["KW3"] in row["Title"], axis=1) df_striking["KW3 in H1"] = df_striking.apply(lambda row: row["KW3"] in row["H1"], axis=1) df_striking["KW3 in Copy"] = df_striking.apply(lambda row: row["KW3"] in row["Copy"], axis=1) df_striking["KW4 in Title"] = df_striking.apply(lambda row: row["KW4"] in row["Title"], axis=1) df_striking["KW4 in H1"] = df_striking.apply(lambda row: row["KW4"] in row["H1"], axis=1) df_striking["KW4 in Copy"] = df_striking.apply(lambda row: row["KW4"] in row["Copy"], axis=1) df_striking["KW5 in Title"] = df_striking.apply(lambda row: row["KW5"] in row["Title"], axis=1) df_striking["KW5 in H1"] = df_striking.apply(lambda row: row["KW5"] in row["H1"], axis=1) df_striking["KW5 in Copy"] = df_striking.apply(lambda row: row["KW5"] in row["Copy"], axis=1) Delete True/False Values If There Is No Keyword

This clears the true/false values wherever there is no keyword in the adjacent column.

# delete true / false values if there is no keyword
df_striking.loc[df_striking["KW1"] == "", ["KW1 in Title", "KW1 in H1", "KW1 in Copy"]] = ""
df_striking.loc[df_striking["KW2"] == "", ["KW2 in Title", "KW2 in H1", "KW2 in Copy"]] = ""
df_striking.loc[df_striking["KW3"] == "", ["KW3 in Title", "KW3 in H1", "KW3 in Copy"]] = ""
df_striking.loc[df_striking["KW4"] == "", ["KW4 in Title", "KW4 in H1", "KW4 in Copy"]] = ""
df_striking.loc[df_striking["KW5"] == "", ["KW5 in Title", "KW5 in H1", "KW5 in Copy"]] = ""
df_striking.head()

Drop Rows If All Values == True

This configurable option is really useful for reducing the QA time needed on the final output: if a keyword is already found in the title, H1, and copy, the row is dropped because there is nothing left to do.

def true_dropper(col1, col2, col3):
    drop = df_striking.drop(
        df_striking[
            (df_striking[col1] == True)
            & (df_striking[col2] == True)
            & (df_striking[col3] == True)
        ].index
    )
    return drop

if drop_all_true == True:
    df_striking = true_dropper("KW1 in Title", "KW1 in H1", "KW1 in Copy")
    df_striking = true_dropper("KW2 in Title", "KW2 in H1", "KW2 in Copy")
    df_striking = true_dropper("KW3 in Title", "KW3 in H1", "KW3 in Copy")
    df_striking = true_dropper("KW4 in Title", "KW4 in H1", "KW4 in Copy")
    df_striking = true_dropper("KW5 in Title", "KW5 in H1", "KW5 in Copy")

Download The CSV File

The last step is to download the CSV file and start the optimization process.

df_striking.to_csv('Keywords in Striking Distance.csv', index=False)
files.download("Keywords in Striking Distance.csv")

Conclusion

If you are looking for quick wins for any website, the striking distance report is a really easy way to find them.

Don’t let the number of steps fool you. It’s not as complex as it seems. It’s as simple as uploading a crawl and keyword export to the supplied Google Colab sheet or using the Streamlit app.

The results are definitely worth it!

