Ramgopal Prajapat:

Learnings and Views

Wonderful Word Cloud in Python

By: Ram on Jul 05, 2020

Overview

Extracting keywords and topics from text documents is one of the key applications of Natural Language Processing (NLP).

How often a word or phrase occurs indicates how significant it is. A word cloud is one of the wonderful ways to depict the relative importance of words in a collection of text documents.
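As a minimal sketch of this idea, the snippet below counts how often each word occurs in a couple of made-up sentences (the `docs` list is illustrative, not the Big Basket data). A word cloud essentially scales each word by counts like these.

```python
from collections import Counter
import re

# Hypothetical sample text, used only for illustration
docs = [
    "Delivery was on time and the delivery person was polite",
    "Good quality products and quick delivery",
]

# Lower-case the text and pull out alphabetic tokens
tokens = [w for doc in docs for w in re.findall(r"[a-z]+", doc.lower())]

# Count occurrences; a higher count means a bigger word in the cloud
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```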

 

Context and Application

In this example, we have pulled Google Play Store reviews for Big Basket, one of the leading online grocery companies. We want to find the key topics users discuss when rating the app 4 or 5 stars, and then compare them with the topics in 1-, 2-, and 3-star reviews.

 

Data

Using web scraping, we extracted the latest 204 user reviews. We can extract more reviews if you want a detailed analysis of your app; we are happy to help.

 

 

Positive Review Word Cloud

We apply a filter on the rating column to get a new data frame containing only the reviews with 4- or 5-star ratings. It may be interesting for management to understand which features or services users like, so that they can continue to focus on those.

positive_reviews = bb_reviews.loc[bb_reviews.rating.gt(3)]
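To make the filter concrete, here is a small sketch on a made-up data frame. The sample rows are invented for illustration; only the column names `rating` and `review` follow the ones used in this post. The complementary condition `le(3)` gives the 1- to 3-star reviews used later for comparison.

```python
import pandas as pd

# Hypothetical reviews; the real bb_reviews holds the scraped Play Store data
bb_reviews = pd.DataFrame({
    "rating": [5, 4, 3, 1, 2],
    "review": [
        "Great app and on time delivery",
        "Good quality products",
        "Slots are always full",
        "Checkout keeps failing",
        "Wallet money not refunded",
    ],
})

# 4- and 5-star reviews
positive_reviews = bb_reviews.loc[bb_reviews.rating.gt(3)]

# 1-, 2- and 3-star reviews (the complement of the filter above)
negative_reviews = bb_reviews.loc[bb_reviews.rating.le(3)]

print(len(positive_reviews), len(negative_reviews))
```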

 

Now we will generate a basic word cloud with only a few options.

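from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all positive reviews into one string; WordCloud.generate() expects text
reviews = ' '.join(positive_reviews['review'].astype(str))

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                min_font_size = 10).generate(reviews)

plt.imshow(wordcloud)
plt.show()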

 

 

We can improve the visual by changing the input parameters.

wordcloud = WordCloud(min_font_size = 10,
                      width=800,
                      height=800,
                      max_words=50,
                      background_color="white").generate(reviews)

plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

 

The current word cloud shows ‘app’ and ‘Order’ as the most frequently discussed topics. This is as expected, and we may want to exclude ‘app’ from the text before the word cloud is created.
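One way to exclude ‘app’ is the `stopwords` argument of `WordCloud`. The sketch below is a hedged example rather than the exact code used for the charts here: the helper name `make_cloud` is made up. Note that passing `stopwords` replaces the library's default list, so the defaults in `STOPWORDS` are merged back in.

```python
def make_cloud(text, extra_stopwords=("app",)):
    """Build a word cloud from text, ignoring the extra stop words."""
    # Imported here so the helper can be defined before the package is installed
    from wordcloud import WordCloud, STOPWORDS

    # Keep the default English stop words and add our own (lower case)
    stop_words = STOPWORDS.union(extra_stopwords)
    return WordCloud(width=800, height=800,
                     background_color="white",
                     max_words=50,
                     min_font_size=10,
                     stopwords=stop_words).generate(text)
```

Calling `make_cloud(reviews)` and passing the result to `plt.imshow` as before should give a cloud without ‘app’.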

Key topics for us to review are:

  • On-time Delivery
  • Available Products
  • Quality Items/Products

Also, we may want to apply some text treatments before using the words to generate the word clouds.

The text processing steps are:

  • Tokenization
  • Changing to lower case
  • Removing Stopwords
  • Stemming or Lemmatization

Here are the steps coded in Python:

# Tokenization
# Assumes the NLTK data has been downloaded once with nltk.download():
# the tokenizer models ('punkt', or 'punkt_tab' on recent NLTK releases),
# 'stopwords' and 'wordnet'
from nltk.tokenize import word_tokenize

data_wtoken=[]
data = positive_reviews['review']
# Word tokenization: one list of tokens per review
for dt in data:
    data_wtoken.append(word_tokenize(dt))

# Change to lower case
data_lower=[[wd.lower() for wd in sent] for sent in data_wtoken]

# Remove stop words (a set makes the membership test fast)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data_stopwords= [[wd for wd in sent if wd not in stop_words] for sent in data_lower]

# Remove words with 3 or fewer characters (this also drops most punctuation tokens)
data_keepwords= [[ wd for wd in sent if len(wd)>3] for sent in data_stopwords]

# Lemmatize words, first as verbs and then as nouns
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

data_lemms= [[ wordnet_lemmatizer.lemmatize(wd,pos="v") for wd in sent ] for sent in data_keepwords]
data_lemms1= [[ wordnet_lemmatizer.lemmatize(wd,pos="n") for wd in sent ] for sent in data_lemms]

 

# Convert to text
from itertools import chain

# Flatten the per-review token lists into one list, then join into a single string
review_word_list = list(chain.from_iterable(data_lemms1))
review_words = ' '.join(review_word_list)

 

Then generate the word cloud again and see whether the themes discussed by users have changed.
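A minimal sketch of that step follows. It counts the cleaned tokens and hands the counts to `WordCloud.generate_from_frequencies`. The short `review_word_list` here is a made-up stand-in for the real cleaned list, and the `exclude` set and the helper name `plot_word_cloud` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned token list
review_word_list = ["app", "order", "delivery", "time", "product", "quality",
                    "delivery", "order", "service", "delivery", "app"]

# Words we decided not to show
exclude = {"app"}

# Frequency of each remaining word
word_freq = Counter(wd for wd in review_word_list if wd not in exclude)
print(word_freq.most_common(5))  # quick text check of the main themes

def plot_word_cloud(frequencies):
    """Draw a word cloud from a word -> count mapping."""
    # Imported lazily; requires the wordcloud and matplotlib packages
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    cloud = WordCloud(width=800, height=800, background_color="white",
                      max_words=50, min_font_size=10)
    cloud.generate_from_frequencies(frequencies)
    plt.figure(figsize=(8, 8))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```

Call `plot_word_cloud(word_freq)` to render the cloud.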

 

Another useful step could be to build the word cloud from phrases rather than single words.

 

Positive and Negative Sentiments

Now, for reviews with a 1-, 2-, or 3-star rating, we can create a word cloud to find areas of opportunity. Some of the topics are similar to those in the 4- and 5-star reviews, but there are some new topics such as “Checkout”, “lastmoney” and “wallet”.
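One way to surface such new topics is to compare the most frequent words on each side. The sketch below does this with plain counters; both token lists are invented for illustration and stand in for the cleaned positive and negative review words.

```python
from collections import Counter

# Hypothetical cleaned tokens for each group of reviews
positive_words = ["delivery", "time", "quality", "product", "delivery", "order"]
negative_words = ["checkout", "wallet", "delivery", "order", "checkout", "refund"]

def top_words(words, n=5):
    """Return the set of the n most frequent words."""
    return {wd for wd, _ in Counter(words).most_common(n)}

top_positive = top_words(positive_words)
top_negative = top_words(negative_words)

# Topics common to both groups
shared_topics = top_positive & top_negative
# Topics that show up only in the low-rated reviews
new_negative_topics = top_negative - top_positive

print(sorted(shared_topics))
print(sorted(new_negative_topics))
```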

 

The next improvement could be to create the word cloud from phrases rather than single words. This may help us understand the real issues, e.g. delayed delivery, slots full, etc.
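A simple way to get phrases is to count pairs of adjacent words (bigrams) within each review. The sketch below is one possible approach, not a definitive one; `sample_reviews` is made-up data standing in for the cleaned per-review token lists, and the function name is illustrative.

```python
from collections import Counter

# Hypothetical cleaned tokens, one list per review
sample_reviews = [
    ["delayed", "delivery", "again"],
    ["slots", "full", "every", "time"],
    ["delayed", "delivery", "slots", "full"],
]

def bigram_counts(token_lists):
    """Count adjacent word pairs without crossing review boundaries."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with the one that follows it
        for first, second in zip(tokens, tokens[1:]):
            counts[f"{first} {second}"] += 1
    return counts

phrase_freq = bigram_counts(sample_reviews)
print(phrase_freq.most_common(3))
```

The resulting counts can be passed to `WordCloud.generate_from_frequencies`, which accepts multi-word keys, to draw a phrase-level cloud.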

 

 
