By: Ram on Dec 29, 2022
Customers write reviews and share their experiences with brands and products. Extracting structured information from this review data makes it possible to perform aspect-based sentiment analysis.
In this scenario, the review data is about beauty products and customers’ beauty-related issues. To track brand mentions over time, we need to find the brands mentioned in the reviews for each date.
We have a list of brands and the review data. We need to develop an NER model that identifies these brands in the reviews. The model will also be used on future reviews to identify brand mentions.
Brands are not among the standard entities, so we need to develop a custom NER model. We can develop a custom Named Entity Recognition model using spaCy.
spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It facilitates Part-of-speech (POS) Tagging, Named Entity Recognition (NER), Text Classification and many more.
In this blog, we will use spaCy for Named Entity Recognition (NER). spaCy has pre-trained models that help extract named entities from the input text. A named entity can be of a type such as person, country, product, or book title. Since a brand is not a standard entity, we will train our own model; spaCy v3 provides a training pipeline for developing a custom NER model.
We can process the text data and create a word cloud. This helps in understanding the data and lets us review the prominent brands.
To create the word cloud, we first need to process the text data.
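A word cloud is driven by word frequencies, so the processing can be sketched as a simple count. This is a minimal illustration only: the sample reviews, the `word_frequencies` name, and the `min_length` parameter are assumptions and not part of the original analysis.

```python
from collections import Counter

# Hypothetical sample reviews; in practice these come from the review data.
sample_reviews = [
    "Dermadew lotion works well for dry skin",
    "Boroline helped my dry skin in winter",
    "Dermadew soap is gentle on skin",
]


def word_frequencies(reviews, min_length=3):
    """Count lower-cased words that have at least `min_length` characters."""
    counts = Counter()
    for review in reviews:
        for word in review.lower().split():
            if len(word) >= min_length:
                counts[word] += 1
    return counts


frequencies = word_frequencies(sample_reviews)
print(frequencies.most_common(3))
```

A frequency mapping like this can then be handed to a plotting library, for example the `generate_from_frequencies` method of the wordcloud package, to draw the cloud.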
To develop the custom NER model, we need to follow these steps.
We removed unwanted characters and any words of two characters or fewer. The code below removes special characters as well. However, some brands may have & or other special characters as part of their names, so the text processing steps must be revised accordingly.
import re

def processed_review(review):
    # Keep only tokens longer than two characters
    processed_token = []
    for token in review.split():
        if len(token) > 2:
            processed_token.append(token)
    out_str = ' '.join(processed_token)
    # Strip @mentions, non-alphanumeric characters, URLs, and a leading "rt"
    out_str = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", out_str)
    return out_str
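If brand names containing & need to survive the cleaning, one possible revision is sketched below. It is an assumption about how the step could be adapted, not the code used for the results in this post; the function name and the brand "Shea & Co" are hypothetical.

```python
import re


def processed_review_keep_ampersand(review):
    """Variant of the cleaning step that keeps '&' inside the text."""
    # Remove URLs and @mentions before touching individual characters
    text = re.sub(r"\w+://\S+|@[A-Za-z0-9]+", "", review)
    # Drop everything except letters, digits, whitespace and '&'
    text = re.sub(r"[^0-9A-Za-z&\s]", "", text)
    # Keep tokens longer than two characters, plus a standalone '&'
    tokens = [tok for tok in text.split() if len(tok) > 2 or tok == "&"]
    return " ".join(tokens)


cleaned = processed_review_keep_ampersand(
    "Loved the Shea & Co lotion!! see http://example.com @shop"
)
print(cleaned)
```

Note that the short token "Co" is still dropped by the length filter, so a brand list used with this variant would need to be cleaned the same way, or the length filter relaxed for known brand words.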
From the review data, we will create training data with labelled information. For each brand mention we record the start and end character positions, so each training item consists of the input review, the start and end positions, and the entity label. Since we need only ‘BRAND’ entities, the label will always be ‘BRAND’, but we can also custom train for multiple entities such as Brand (e.g. Dermadew, Boroline, or purple), Issue (e.g. Pimples, Dry Skin, or Itching) and Product (e.g. Lotion, Serum, Facewash, and Sunscreen).
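# Create Training Data for Brands
# Assumes `reviews` (a DataFrame with a 'Reviews' column) and `brand_list` are already defined
import re

count = 0
TRAIN_DATA = []
for _, item in reviews.iterrows():
    ent_dict = {}
    proceed_review = processed_review(item['Reviews'])
    visited_items = []
    entities = []
    for token in proceed_review.split():
        if token in brand_list:
            # Escape the token and match whole words only
            pattern = r"\b" + re.escape(token) + r"\b"
            for i in re.finditer(pattern, proceed_review):
                # Only the first occurrence of each brand is labelled
                if token not in visited_items:
                    entity = (i.span()[0], i.span()[1], 'BRAND')
                    visited_items.append(token)
                    entities.append(entity)
    if len(entities) > 0:
        ent_dict['entities'] = entities
        train_item = (proceed_review, ent_dict)
        TRAIN_DATA.append(train_item)
    count = count + 1  # running count of reviews processed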
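The same idea extends to several entity types. The sketch below is illustrative only: the `LABELLED_TERMS` lists and the `find_entities` helper are hypothetical, and this post trains on the BRAND label alone.

```python
import re

# Hypothetical term lists; only the brand list exists in this post.
LABELLED_TERMS = {
    "BRAND": ["Dermadew", "Boroline"],
    "ISSUE": ["Pimples", "Itching"],
    "PRODUCT": ["Lotion", "Serum"],
}


def find_entities(text, labelled_terms):
    """Return sorted (start, end, label) tuples for every whole-word match."""
    entities = []
    for label, terms in labelled_terms.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text):
                entities.append((match.start(), match.end(), label))
    return sorted(entities)


text = "Dermadew Lotion reduced Itching"
train_item = (text, {"entities": find_entities(text, LABELLED_TERMS)})
print(train_item)
```

If the same word appears in more than one list, the resulting spans overlap, and spaCy does not accept overlapping spans in `doc.ents`, so the term lists should be kept disjoint.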
We need to convert the labelled data into the format required by spaCy v3.
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline
db = DocBin()  # create a DocBin object
for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:  # skip spans that do not align with token boundaries
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)
db.to_disk("/kaggle/working/train.spacy")  # save the docbin object
We will use a transformer to train the custom NER model. For that, we need to create a configuration file using spaCy's quickstart widget (on the spaCy training documentation page).
A few learnings: select the language appropriately, and select only NER from the components. Also, for the transformer, please select GPU.
Then download the config file using the download option at the bottom right-hand side.
If you are using Google Colab or Kaggle for training, please upload this “base_config.cfg” and run the command below after updating the paths appropriately.
!python -m spacy init fill-config /kaggle/input/config/base_config.cfg /kaggle/working/config.cfg
We also need to pass a few additional paths for the input data and the configuration file when starting the training.
!python -m spacy train /kaggle/working/config.cfg --output /kaggle/working/output --paths.train /kaggle/working/train.spacy --paths.dev /kaggle/working/train.spacy --gpu-id 0
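The command above reads both the training and the evaluation examples from train.spacy. If a separate evaluation file is wanted, one possible approach (not part of the original workflow) is to split the labelled examples before converting them, then write each part to its own DocBin file. The `split_train_dev` helper, its parameters, and the stand-in data are assumptions for illustration.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split off a dev portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical stand-in for TRAIN_DATA
examples = [(f"review {i}", {"entities": []}) for i in range(10)]
train_examples, dev_examples = split_train_dev(examples)
print(len(train_examples), len(dev_examples))
```

The dev file would then be passed to --paths.dev in place of train.spacy.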
Once the model is trained, the last and best models will be saved in the output folder (as model-last and model-best).
Now we can use the best custom model that was trained. We need to load the model first.
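Loading and applying the model could look roughly like the sketch below. The default path assumes the --output folder from the training command and spaCy's model-best subfolder; the helper names `load_best_model` and `extract_brands` are hypothetical.

```python
def load_best_model(model_dir="/kaggle/working/output/model-best"):
    """Load the trained pipeline from the training output folder."""
    # Imported lazily so the helper below does not require spaCy at import time
    import spacy
    return spacy.load(model_dir)


def extract_brands(nlp, review):
    """Return the text of every entity labelled BRAND in one review."""
    doc = nlp(review)
    return [ent.text for ent in doc.ents if ent.label_ == "BRAND"]


# Typical use once training has finished:
#   nlp = load_best_model()
#   brands = extract_brands(nlp, processed_review(review_text))
```

New reviews should go through the same cleaning step as the training data before being passed to the model, so that the text looks like what the model was trained on.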
Once the model is developed, we can use it to identify brand mentions across reviews and track brand trends.
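Tracking trends then comes down to counting brand mentions per date. The sketch below assumes the model output has already been collected as (date, brands) pairs; the sample data and the `brand_mentions_by_date` name are hypothetical.

```python
from collections import Counter


def brand_mentions_by_date(dated_brands):
    """Count mentions per (date, brand) from (date, [brands]) pairs."""
    counts = Counter()
    for date, brands in dated_brands:
        for brand in brands:
            counts[(date, brand)] += 1
    return counts


# Hypothetical model output: review date plus the brands found in that review
dated_brands = [
    ("2022-12-01", ["Dermadew"]),
    ("2022-12-01", ["Dermadew", "Boroline"]),
    ("2022-12-02", ["Boroline"]),
]
mentions = brand_mentions_by_date(dated_brands)
for (date, brand), n in sorted(mentions.items()):
    print(date, brand, n)
```

Plotting these counts over time for each brand gives the trend view.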
Some of these brands show growth in the number of mentions in the reviews.