Named Entity Recognition Specifics
The goal of Named Entity Recognition is to locate and classify named entities in a sequence. The named entities are pre-defined categories chosen according to the use case such as names of people, organizations, places, codes, time notations, monetary values, etc. Essentially, NER aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.
Usage Steps
The process of performing Named Entity Recognition in Simple Transformers does not deviate from the standard pattern.
- Initialize a
NERModel
- Train the model with
train_model()
- Evaluate the model with
eval_model()
- Make predictions on (unlabelled) data with
predict()
Supported Model Types
New model types are regularly added to the library. Named Entity Recognition tasks currently supports the model types given below.
Model | Model code for NERModel |
---|---|
ALBERT | albert |
BERT | bert |
BERTweet | bertweet |
BigBird | bigbird |
CamemBERT | camembert |
DeBERTa | deberta |
DeBERTa | deberta |
DeBERTaV2 | deberta-v2 |
DistilBERT | distilbert |
ELECTRA | electra |
HerBERT | herbert |
LayoutLM | layoutlm |
LayoutLMv2 | layoutlmv2 |
Longformer | longformer |
MobileBERT | mobilebert |
MPNet | mpnet |
RemBERT | rembert |
RoBERTa | roberta |
SqueezeBert | squeezebert |
XLM | xlm |
XLM-RoBERTa | xlmroberta |
XLNet | xlnet |
Tip: The model code is used to specify the model_type
in a Simple Transformers model.
Custom Labels
The default list of labels used in the NERModel
is from the CoNLL dataset which uses the following tags/labels.
["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
However, named entity recognition is a very versatile task and has many different applications. It is highly likely that you will wish to define and use your own token tags/labels.
This can be done by passing in your list of labels when creating the NERModel
to the labels
parameter.
1
2
3
4
5
custom_labels = ["O", "B-SPELL", "I-SPELL", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-PLACE", "I-PLACE"]
model = NERModel(
"bert", "bert-cased-base", labels=custom_labels
)
Prediction Caveats
By default, NERModel
will split input sequences to the predict()
method on spaces and assign a NER tag to each “word” of the split sequence. This might not be desirable in some languages (e.g. Chinese). To avoid this, you can specify split_on_space=False
when calling the NERModel.predict()
method. In this case, you must provide a list of lists as the to_predict
input to the predict()
method. The inner list will be the list of split “words” belonging to a single sequence and the outer list is the list of all sequences.
Lazy Loading Data
The system memory required to keep a large dataset in memory can be prohibitively large. In such cases, the data can be lazy loaded from disk to minimize memory consumption.
To enable lazy loading, you must set the lazy_loading
flag to True
in NERArgs
.
1
2
model_args = NERArgs()
model_args.lazy_loading = True
Note: The data must be input as a path to a file in the CoNLL format to use lazy loading. See here for the correct format.
Note: This will typically be slower as the feature conversion is done on the fly. However, the tradeoff between speed and memory consumption should be reasonable.
Tip: See Configuring a NER model for information on configuring the model to read the lazy loading data file correctly.