By: Ram on Jul 05, 2020
Extracting keywords and topics from text documents is one of the key applications of Natural Language Processing (NLP).
How often a word or phrase occurs indicates its significance. A word cloud is an effective way to depict the relative importance of words in a collection of text documents.
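To make the frequency idea concrete, here is a minimal sketch that counts word occurrences using only the Python standard library. The helper name `word_frequencies` and the sample sentence are made up for illustration and are not part of the scraped review data.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Toy sentence invented for illustration, not taken from the scraped reviews.
sample = "Good app. Delivery was good and the app is easy to use."
freq = word_frequencies(sample)
top_words = freq.most_common(2)
print(top_words)
```

A word cloud is essentially a visual rendering of counts like these, with more frequent words drawn in a larger font.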
In this example, we have pulled Google Play Store reviews for one of the leading online grocery companies, Big Basket. We want to find out the key topics users discuss when rating 4 and 5 stars, and then compare them with the topics raised when users rate 1, 2, or 3 stars.
Based on web scraping of user reviews, we have extracted the latest 204 reviews. We can extract more reviews if you want a detailed analysis of your app; we are happy to help.
We apply a filter condition on the rating column to get a new data frame containing only the reviews where the user gave a 4- or 5-star rating. It may be interesting for management to understand which features or services users like, so that they can continue to focus on those.
positive_reviews = bb_reviews.loc[bb_reviews.rating.gt(3)]
Now we will generate a basic word cloud with only a few options set.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
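# Join all positive review texts into one string (the text is in the 'review' column)
reviews = " ".join(positive_reviews["review"])
wordcloud = WordCloud(width=800, height=800,
                      background_color="white",
                      min_font_size=10).generate(reviews)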
plt.imshow(wordcloud)
plt.show()
We can improve the visual by changing the input parameters.
wordcloud = WordCloud(min_font_size = 10,
width=800,
height=800,
max_words=50,
background_color="white").generate(reviews)
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
The current word cloud shows ‘app’ and ‘Order’ as the most frequently discussed topics. These are as expected, and we may want to exclude ‘app’ from the text before the word cloud is created.
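One way to exclude a word such as ‘app’ is to filter it out of the text before generating the cloud. The sketch below is an illustration only: the helper `drop_words`, the set `custom_stop_words`, and the sample text are hypothetical names and data, not part of the original analysis.

```python
def drop_words(text, excluded):
    """Return text with every word in `excluded` removed (case-insensitive)."""
    kept = [w for w in text.split() if w.lower().strip(".,!?") not in excluded]
    return " ".join(kept)

# Hypothetical exclusion list; extend it with other uninformative words.
custom_stop_words = {"app"}

# Toy review text invented for illustration.
sample = "Great app, fast Order delivery. The App works well."
filtered = drop_words(sample, custom_stop_words)
print(filtered)
```

The `WordCloud` constructor also accepts a `stopwords` argument, so passing a set such as `STOPWORDS | {"app"}` (with `STOPWORDS` imported from `wordcloud`) is another option.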
Key topics for us to review are:
Also, we may want to apply some text treatments before using the words to generate the word clouds.
The text processing steps are tokenization, conversion to lower case, stop word removal, removal of short words, and lemmatization.
Here are the steps coded in Python:
# Tokenization (NLTK's tokenizer, stop word, and WordNet data must be downloaded once via nltk.download)
from nltk.tokenize import word_tokenize
data_wtoken = []
data = positive_reviews['review']
# Word tokenization
for dt in data:
    data_wtoken.append(word_tokenize(dt))
# Change to lower case
data_lower=[[wd.lower() for wd in sent] for sent in data_wtoken]
# remove stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
data_stopwords= [[wd for wd in sent if wd not in stop_words] for sent in data_lower]
# Remove words with 3 or fewer characters
data_keepwords= [[ wd for wd in sent if len(wd)>3] for sent in data_stopwords]
# Lemmatize words
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
data_lemms= [[ wordnet_lemmatizer.lemmatize(wd,pos="v") for wd in sent ] for sent in data_keepwords]
data_lemms1= [[ wordnet_lemmatizer.lemmatize(wd,pos="n") for wd in sent ] for sent in data_lemms]
# Convert to text
import pandas as pd
from itertools import chain
df_text =pd.DataFrame({'col':data_lemms1})
review_word_list = list(chain.from_iterable(df_text.col.values))
review_words = ' '.join(review_word_list)
Then generate the word cloud and see whether the themes discussed by users change.
Another useful step could be to build the word cloud from phrases rather than single words.
Now, for reviews with star ratings of 1, 2, and 3, we can create a word cloud to find areas of opportunity. Some of the topics are similar to those in the 4- and 5-star reviews, but there are some new topics such as “Checkout”, “lastmoney” and “wallet”.
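The comparison between the two groups can also be done programmatically rather than by eye. The following is a rough sketch, assuming the cleaned words for each group are available as plain Python lists; the helper `top_words` and both word lists are invented for illustration and do not reflect the actual review data.

```python
from collections import Counter

def top_words(words, n):
    """Return the set of the n most frequent words in a list of words."""
    return {word for word, _ in Counter(words).most_common(n)}

# Toy word lists invented for illustration; in practice these would be the
# cleaned word lists built from the 4-5 star and the 1-3 star reviews.
positive_words = ["delivery", "order", "quality", "delivery", "order", "quality", "fresh"]
negative_words = ["order", "wallet", "delivery", "wallet", "checkout", "order", "delivery"]

positive_top = top_words(positive_words, 3)
negative_top = top_words(negative_words, 3)

# Topics common to both groups, and topics that only surface in low ratings.
shared_topics = positive_top & negative_top
new_in_negative = negative_top - positive_top
print(sorted(shared_topics), sorted(new_in_negative))
```

Words that appear only in the low-rating set are candidates for the areas of opportunity mentioned above.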
A next improvement could be to create the word cloud from phrases rather than individual words. This may help us understand the real issues, e.g. delayed delivery or full slots.
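A simple way to approximate phrases is to count pairs of adjacent words (bigrams). The sketch below assumes each review is already tokenized into a list of words, as in the processing steps earlier; the helper `bigram_counts` and the sample reviews are hypothetical.

```python
from collections import Counter

def bigram_counts(tokenized_reviews):
    """Count adjacent word pairs within each tokenized review."""
    counts = Counter()
    for tokens in tokenized_reviews:
        # zip pairs each word with its successor, so pairs never span reviews.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

# Toy tokenized reviews invented for illustration.
sample_reviews = [
    ["delayed", "delivery", "again"],
    ["slot", "full", "delayed", "delivery"],
]
phrase_freq = bigram_counts(sample_reviews)
print(phrase_freq.most_common(1))
```

A phrase-to-count mapping like `phrase_freq` could then be passed to `WordCloud.generate_from_frequencies` to draw a phrase-based cloud.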