Ramgopal Prajapat:

Learnings and Views

Brand Tracking using Custom NER with spaCy v3

By: Ram on Dec 29, 2022

Context

Customers write reviews and share their experiences with brands and products. Our goal is to extract structured information from this review data and perform aspect-based sentiment analysis.

In this scenario, the review data covers beauty products and customers’ beauty-related issues. To track brand mentions over time, we need to find the brands mentioned in the reviews for each date.

We have a list of brands and the review data. We need to develop an NER model that identifies these brands in the reviews. The same model will also be used on future reviews to identify brand mentions.

Brands are not among the standard entity types, so we need a custom Named Entity Recognition (NER) model. We can develop one using spaCy.

spaCy and NER

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It facilitates Part-of-speech (POS) Tagging, Named Entity Recognition (NER), Text Classification and many more.

 

In this blog, we will use spaCy for Named Entity Recognition (NER). spaCy ships pre-trained models that extract named entities from input text. A named entity can be of a type such as person, country, product, or book title. Since a brand is not a standard entity type, we will train our own model; spaCy v3 provides a training pipeline for developing a custom NER component.

 

Review Data and Analysis

We can process the text data and create a word cloud. This helps in understanding the data and in reviewing the prominent brands.

To create the word cloud, we need to process the text data:

  • Tokenize the review data
  • Clean up the data – remove the URLs and non-text data
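The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the original notebook code: the sample reviews and the helper name `tokenize_for_wordcloud` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample reviews; the real input is the review dataset
sample_reviews = [
    "Dermadew lotion worked well for dry skin https://example.com/review",
    "Dry skin again, so I bought Dermadew lotion!!",
]

def tokenize_for_wordcloud(text):
    # Remove URLs first, then keep only lowercase alphabetic tokens
    text = re.sub(r"\w+://\S+", " ", text)
    return re.findall(r"[A-Za-z]+", text.lower())

# Count how often each token appears across all reviews
word_counts = Counter()
for review in sample_reviews:
    word_counts.update(tokenize_for_wordcloud(review))

print(word_counts.most_common(5))
```

If the third-party wordcloud package is installed, a frequency mapping like `word_counts` can be passed to `WordCloud().generate_from_frequencies(word_counts)` to draw the cloud.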

 

Custom NER using spaCy v3

For developing a custom NER model, we need to follow these steps.

  1. Process the review data
  2. Label data – Brands – position of the brands etc.
  3. Formatting label data to the .spacy format
  4. Creating the config file using spaCy
  5. Setting up and training the custom NER model
  6. Inferences

 

  1. Data Processing

We removed unwanted characters and any words of two characters or fewer. The code below also removes special characters. However, some brands may have & or other special characters as part of their names, so the text processing steps must be revised accordingly in that case.

import re

def processed_review(review):
    # Keep only tokens longer than 2 characters
    processed_token = []
    for token in review.split():
        if len(token) > 2:
            processed_token.append(token)
    out_str = ' '.join(processed_token)
    # Remove @mentions, non-alphanumeric characters, URLs and a leading "rt"
    out_str = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", out_str)
    return out_str
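If brand names contain characters such as & or an apostrophe, one possible revision is to allow those characters in the cleaning pattern. The sketch below is an assumption about how that could look; the function name `processed_review_keep_symbols`, the pattern and the sample review are hypothetical and not part of the original code.

```python
import re

# Hypothetical variant: "&" and "'" are allowed so brand names keep them
CLEAN_PATTERN = re.compile(r"(@[A-Za-z0-9]+)|(\w+:\/\/\S+)|([^0-9A-Za-z&' \t])")

def processed_review_keep_symbols(review):
    # Remove @mentions, URLs and every character outside the allowed set
    cleaned = CLEAN_PATTERN.sub("", review)
    # Drop tokens of two characters or fewer, as in the original step
    tokens = [token for token in cleaned.split() if len(token) > 2]
    return " ".join(tokens)

example = "Loved the H&M cream!! see http://example.com @user123 it is ok"
print(processed_review_keep_symbols(example))  # Loved the H&M cream see
```

Any character added to the allowed set must also appear in the same form in the brand list, otherwise the brand lookup in the labelling step will not match.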

  2. Label data – Brands – position of the brands

From the review data, we create training data with labelled information. For each brand mention we record its start and end character positions, so each training item consists of the input review plus the start position, end position and entity label of every mention. Since we need only ‘BRAND’ entities here, the label is always ‘BRAND’, but we could train for multiple entity types, such as Brand (e.g. Dermadew, Boroline, or purple), Issue (e.g. pimples, dry skin, or itching) and Product (e.g. lotion, serum, facewash and sunscreen).
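To make the format concrete, here is a small sketch of one labelled item with three entity types. The review text, the `find_span` helper and the ISSUE and PRODUCT labels are illustrative assumptions; the model in this post is trained on BRAND only.

```python
# Hypothetical review text, used only to illustrate the annotation format
text = "Dermadew lotion helped my dry skin"

def find_span(text, phrase, label):
    # Return (start, end, label) for the first occurrence of phrase in text
    start = text.index(phrase)
    return (start, start + len(phrase), label)

# One training item: the text plus character offsets for each entity
example = (
    text,
    {"entities": [
        find_span(text, "Dermadew", "BRAND"),
        find_span(text, "lotion", "PRODUCT"),
        find_span(text, "dry skin", "ISSUE"),
    ]},
)
print(example)
```

The end offset is exclusive, so `text[start:end]` returns exactly the labelled phrase.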


# Create Training Data for Brands
import re

count = 0
TRAIN_DATA = []
for _, item in reviews.iterrows():
    ent_dict = {}
    proceed_review = processed_review(item['Reviews'])
    visited_items = set()
    entities = []
    for token in proceed_review.split():
        # Search for each brand token once; finditer returns all its occurrences
        if token in brand_list and token not in visited_items:
            visited_items.add(token)
            # Escape the token and match whole words only, not substrings
            for i in re.finditer(r"\b" + re.escape(token) + r"\b", proceed_review):
                entities.append((i.start(), i.end(), 'BRAND'))
    # Keep only the reviews that mention at least one brand
    if len(entities) > 0:
        ent_dict['entities'] = entities
        TRAIN_DATA.append((proceed_review, ent_dict))
        count += 1
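Before converting the data, it can help to check the offsets, since spaCy rejects overlapping entity spans when they are assigned to a document. The following is a minimal sketch of such a check; the function name `validate_annotations` and the sample items are assumptions for illustration.

```python
def validate_annotations(train_data, brand_list):
    """Return a list of problems found in (text, {"entities": [...]}) items."""
    problems = []
    for idx, (text, annot) in enumerate(train_data):
        previous_end = -1
        for start, end, label in sorted(annot["entities"]):
            # The labelled slice should be exactly one of the known brands
            if text[start:end] not in brand_list:
                problems.append((idx, "not a brand", text[start:end]))
            # A span starting before the previous one ends is an overlap
            if start < previous_end:
                problems.append((idx, "overlap", text[start:end]))
            previous_end = max(previous_end, end)
    return problems

# Hypothetical items: the first is clean, the second has a bad span
brands = {"Dermadew", "Boroline"}
good = ("Dermadew and Boroline", {"entities": [(0, 8, "BRAND"), (13, 21, "BRAND")]})
bad = ("Dermadew cream", {"entities": [(0, 8, "BRAND"), (4, 14, "BRAND")]})
print(validate_annotations([good, bad], brands))
```

An empty result means every span slices to a known brand and no two spans overlap.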

  3. Formatting label data to the .spacy format

We need to convert the labelled data into the binary .spacy format required by spaCy v3.

 

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # load a new blank English pipeline
db = DocBin()  # create a DocBin object

for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            continue  # skip spans that do not align to token boundaries
        ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

db.to_disk("/kaggle/working/train.spacy")  # save the DocBin object

 

  4. Creating the config file using spaCy

We will use a transformer to train the custom NER model. For that, we need to create a configuration file using the quickstart widget in the spaCy training documentation.

A few learnings: select the language appropriately, and select only NER under Components. Also, for the transformer, select GPU as the hardware.

Then download the config file using the download option at the bottom right of the widget.


If you are using Google Colab or Kaggle for training, upload this “base_config.cfg” and run the command below after updating the paths appropriately.

!python -m spacy init fill-config /kaggle/input/config/base_config.cfg /kaggle/working/config.cfg

 

  5. Configurations and Custom NER Model Training

We also need to define a few additional paths for the input and configuration files to set up the training.

  • Configuration – this is the file that was created in the above step.
  • Output – models will be saved in the given output path. When Kaggle is used, the path must be inside the /kaggle/working/ folder.
  • Train and Dev – in this case, I have used the same file for training and validation, so the reported scores will be optimistic; different files can be given as well.

!python -m spacy train /kaggle/working/config.cfg --output /kaggle/working/output --paths.train /kaggle/working/train.spacy --paths.dev /kaggle/working/train.spacy --gpu-id 0
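To give the trainer different files for training and validation, the labelled items can be split before conversion. The sketch below is one simple way to do it, under the assumption of a random 80/20 split; the function name `split_train_dev`, the fraction and the seed are illustrative choices, not part of the original workflow.

```python
import random

def split_train_dev(train_data, dev_fraction=0.2, seed=42):
    # Shuffle a copy with a fixed seed so the split is reproducible
    items = list(train_data)
    random.Random(seed).shuffle(items)
    # Hold out at least one item for validation
    n_dev = max(1, int(len(items) * dev_fraction))
    return items[n_dev:], items[:n_dev]

# Placeholder items standing in for the labelled reviews
placeholder = [(f"review {i}", {"entities": []}) for i in range(10)]
train_items, dev_items = split_train_dev(placeholder)
print(len(train_items), len(dev_items))  # 8 2
```

Each part would then be written to its own .spacy file with the same DocBin conversion, and the validation file passed to --paths.dev.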

 


 

Once the model is trained, the last and the best models are saved in the output folder as model-last and model-best.

 

  6. Inferences

Now, we can leverage the best custom model trained. We need to load the model first.
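A minimal sketch of loading the model and reading the predicted entities could look like the following. It assumes the model-best folder under the --output path used for training; the helper name `extract_brands` and the sample review are hypothetical.

```python
from pathlib import Path

import spacy

# Assumed location: the model-best folder under the training --output path
MODEL_DIR = Path("/kaggle/working/output/model-best")

def extract_brands(nlp, text):
    """Return (text, start_char, end_char) for every BRAND entity found."""
    doc = nlp(text)
    return [(ent.text, ent.start_char, ent.end_char)
            for ent in doc.ents if ent.label_ == "BRAND"]

# Only load and run when the trained model is actually present
if MODEL_DIR.exists():
    nlp = spacy.load(MODEL_DIR)  # load the best model saved by training
    # Hypothetical review text
    print(extract_brands(nlp, "Dermadew lotion worked well for my dry skin"))
```

The helper takes the pipeline as an argument, so the same function works with any spaCy pipeline that produces BRAND entities.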


Brand Mentions and Monitoring

Once the model is developed, we can use that to identify brand mentions across reviews and track the brand trends. 
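Tracking the trend then comes down to counting brand mentions per period. The sketch below groups mentions by month using only the standard library; the function name `monthly_brand_mentions`, the record layout and the sample records are assumptions for illustration, with the brand lists standing in for the model's output on each review.

```python
from collections import Counter, defaultdict
from datetime import date

def monthly_brand_mentions(records):
    """records: iterable of (review_date, [brand, ...]) -> {"YYYY-MM": Counter}."""
    counts = defaultdict(Counter)
    for review_date, brands in records:
        # Count each brand once per review, even if it is mentioned repeatedly
        counts[review_date.strftime("%Y-%m")].update(set(brands))
    return dict(counts)

# Hypothetical records: review date plus the brands predicted for that review
records = [
    (date(2022, 11, 5), ["Dermadew"]),
    (date(2022, 11, 20), ["Dermadew", "Boroline"]),
    (date(2022, 12, 2), ["Dermadew", "Dermadew"]),
]
trend = monthly_brand_mentions(records)
for month in sorted(trend):
    print(month, dict(trend[month]))
```

Counting a brand once per review measures how many reviews mention it; dropping the `set(...)` call would count every individual mention instead.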


Some of these brands show growth in the number of mentions in the reviews.

Full Code 

 
