Ramgopal Prajapat:

Learnings and Views

Complete tutorial on NER using spaCy Python

By: Ram on Dec 20, 2022

SpaCy and NER

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It supports part-of-speech (POS) tagging, named entity recognition (NER), text classification, and more.

In this blog, we will use spaCy for Named Entity Recognition (NER). spaCy provides pre-trained models that extract named entities from input text. A named entity can be of a type such as person, country, product, or book title. spaCy's built-in entity types include the following, among others.

PERSON - People, including fictional.

ORG - Companies, agencies, institutions, etc.

GPE - Countries, cities, states.

LANGUAGE - Any named language.

DATE - Absolute or relative dates or periods.

TIME - Times smaller than a day.

PERCENT - Percentage, including "%".

MONEY - Monetary values, including unit.

ORDINAL - "first", "second", etc.

CARDINAL - Numerals that do not fall under another type.

The NER model in spaCy is a statistical (machine learning) model, so the quality of its output depends on the examples it was trained on. The model can be trained further on domain-specific data to improve accuracy, or extended with an entity type that is not available in the pre-trained model.

Multiple models are available, covering several languages. We will download one and load it.

A model must be downloaded before it can be used. Model names follow a fixed structure.

The name encodes the language, type, genre, and size, in that order. For example, in en_core_web_sm, en is the language (English), core is the type, web is the genre of the training text, and sm is the size (small).
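The naming convention can be made concrete with a small helper. This is only a sketch: parse_model_name is a hypothetical function written for illustration, not part of spaCy, and it assumes the four-part language_type_genre_size pattern described above.

```python
def parse_model_name(name: str) -> dict:
    """Split a model name such as 'en_core_web_sm' into its parts.

    Hypothetical helper (not a spaCy API). Assumes the name has exactly
    four underscore-separated parts: language, type, genre, size.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts in {name!r}, got {len(parts)}")
    language, model_type, genre, size = parts
    return {"language": language, "type": model_type, "genre": genre, "size": size}


print(parse_model_name("en_core_web_sm"))
```

Reading the name this way makes it easier to pick a model, for example choosing between the sm and trf sizes of the same English web pipeline.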

To test the NER model we need some text data; we will take it from a news article.

First, we need to install spaCy and download a model.

NER using SpaCy

!pip install spacy

 

!python -m spacy download en_core_web_sm

import spacy

from spacy import displacy

# load the model

spacy_nlp = spacy.load("en_core_web_sm")

We can try the pre-trained spaCy NER model using sample text.

text = "Lionel Messi bagged his maiden FIFA World Cup title with Argentina as his team defeated France in the final on penalties."

tagged_text = spacy_nlp(text)

The call returns a Doc object that exposes several NLP attributes, such as the tokens and the named entities.

For this use case, we want the named entities, which are available as tagged_text.ents.

We can also get additional information for each entity, such as its start and end character offsets and its label.

for ent in tagged_text.ents:

    print(ent.text, ent.start_char, ent.end_char, ent.label_)
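The start and end values are character offsets into the original string, with the end offset exclusive, so slicing the text with them gives back the entity text. The sketch below shows this with a span located by str.find rather than by the model; the span tuple is a hand-built stand-in for (ent.text, ent.start_char, ent.end_char, ent.label_), and the PERSON label is assumed for illustration.

```python
text = ("Lionel Messi bagged his maiden FIFA World Cup title with Argentina "
        "as his team defeated France in the final on penalties.")

# Hand-built stand-in for one entity; not produced by the model.
surface = "Messi"
start = text.find(surface)   # offset of the first character of the span
end = start + len(surface)   # offset just past the last character
span = (surface, start, end, "PERSON")

# Slicing the original text with the offsets recovers the entity text.
assert text[span[1]:span[2]] == span[0]
print(span)
```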

 

We can also visualize the input text annotated with its NER tags. Ideally, the model should have tagged "Lionel Messi", but it identified only "Messi". We can try a different text input and check again.
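To make the idea of an annotated view concrete without relying on a notebook renderer, here is a minimal plain-text sketch. annotate is a hypothetical helper, and the spans passed to it are hand-written placeholders, not model output; in practice they would be built from tagged_text.ents.

```python
def annotate(text, spans):
    """Return text with each (start, end, label) span wrapped as [text|LABEL].

    Hypothetical helper; assumes the spans do not overlap.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])              # untouched text before the span
        out.append(f"[{text[start:end]}|{label}]")  # the tagged span itself
        cursor = end
    out.append(text[cursor:])                       # remainder after the last span
    return "".join(out)


sample = "Messi plays for Argentina."
# Placeholder spans for illustration only.
spans = [(0, 5, "PERSON"), (16, 25, "GPE")]
print(annotate(sample, spans))
```

In a notebook, spaCy's displacy.render(tagged_text, style="ent") produces a highlighted HTML version of the same information.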

 

Let us try one more piece of text.

 

text2 = "On Sunday morning, Sargam Koushal, 32, wore a gorgeous pink slit gown for the final round and won the title of Mrs World, beating contestants from 63 countries at a gala Mrs World event in Las Vegas, USA."

tagged_text2 = spacy_nlp(text2)

for ent in tagged_text2.ents:

    print(ent.text, ent.start_char, ent.end_char, ent.label_)
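With longer text it is often handy to group the entities by label. A minimal sketch, assuming the entities have been collected as (text, label) pairs, for example via [(ent.text, ent.label_) for ent in tagged_text2.ents]; group_by_label is a hypothetical helper, and the pairs below are hand-written placeholders, not the model's actual output.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for entity_text, label in pairs:
        grouped[label].append(entity_text)
    return dict(grouped)


# Placeholder pairs for illustration; real labels depend on the model.
pairs = [("Sargam Koushal", "PERSON"), ("Las Vegas", "GPE"), ("USA", "GPE")]
print(group_by_label(pairs))
```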

 

 

Transformers for NER

Transformers are a neural network architecture that has been successful across a wide range of NLP tasks, including named entity recognition (NER).

Since their introduction, transformer models have led most NLP benchmarks, and we can use them for NER as well.

spaCy can use transformer models from the Hugging Face transformers library through its spacy-transformers package. Here we use en_core_web_trf, spaCy's pre-trained transformer-based English pipeline.

!python -m spacy download en_core_web_trf

import spacy

from spacy import displacy

nlp_trf = spacy.load('en_core_web_trf')

 

text2 = "On Sunday morning, Sargam Koushal, 32, wore a gorgeous pink slit gown for the final round and won the title of Mrs World, beating contestants from 63 countries at a gala Mrs World event in Las Vegas, USA."

tagged_text2 = nlp_trf(text2)

for ent in tagged_text2.ents:

    print(ent.text, ent.start_char, ent.end_char, ent.label_)
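To compare the small and transformer pipelines on the same text, one option is to diff their entity sets. This is a sketch under the assumption that each model's entities are collected as a set of (text, label) pairs; compare_entities is a hypothetical helper, and the two sets below are made-up placeholders, not recorded outputs of either model.

```python
def compare_entities(ents_a, ents_b):
    """Return the entities found by both inputs, only by A, and only by B."""
    ents_a, ents_b = set(ents_a), set(ents_b)
    return {
        "both": ents_a & ents_b,
        "only_a": ents_a - ents_b,
        "only_b": ents_b - ents_a,
    }


# Made-up placeholder sets. In practice, substitute
# {(e.text, e.label_) for e in spacy_nlp(text2).ents} and
# {(e.text, e.label_) for e in nlp_trf(text2).ents}.
small = {("Las Vegas", "GPE"), ("Sunday", "DATE")}
large = {("Las Vegas", "GPE"), ("Sargam Koushal", "PERSON")}
result = compare_entities(small, large)
print(result["only_b"])
```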

I used Google Colab to run the transformer model.
