This section describes how Text Classification tasks are organized and conducted with Simple Transformers.
Sub-Tasks Falling Under Text Classification
- Binary and multi-class text classification
- Multi-label text classification
The process of performing text classification in Simple Transformers does not deviate from the standard pattern.
- Initialize a `ClassificationModel` or a `MultiLabelClassificationModel`
- Train the model with `train_model()`
- Evaluate the model with `eval_model()`
- Make predictions on (unlabelled) data with `predict()`
Supported Model Types
New model types are regularly added to the library. Text classification tasks currently support the model types given below.
|Model|Model code for `ClassificationModel`|

\* Not available with multi-label classification
Tip: The model code is used to specify the `model_type` in a Simple Transformers model.
Dealing With Long Text
Transformer models typically have a restriction on the maximum length allowed for a sequence. This is defined in terms of the number of tokens, where a token is any of the “words” that appear in the model vocabulary.
Note: Each Transformer model has a vocabulary which consists of tokens mapped to a numeric ID. The input sequence to a Transformer consists of a tensor of numeric values found in the vocabulary.
`max_seq_length` is the maximum number of such tokens (technically, token IDs) that a sequence can contain. Any tokens that appear after the `max_seq_length` will be truncated when working with Transformer models. Unfortunately, each model type also has an upper bound for the `max_seq_length` itself, most commonly 512 tokens.
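As a toy illustration of what truncation means (using a made-up six-word vocabulary, not any real model's tokenizer), tokens beyond `max_seq_length` are simply dropped:

```python
# Hypothetical vocabulary mapping tokens to numeric IDs
vocab = {"the": 0, "heir": 1, "of": 2, "isildur": 3, "was": 4, "aragorn": 5}

def encode(text, max_seq_length):
    # Convert each word to its token ID, then keep only the
    # first max_seq_length IDs -- the rest are truncated.
    token_ids = [vocab[word] for word in text.lower().split()]
    return token_ids[:max_seq_length]

# A 6-token sentence truncated to 4 token IDs
ids = encode("Aragorn was the heir of Isildur", max_seq_length=4)
```

Real tokenizers also split unknown or rare words into sub-word tokens, so the token count is usually higher than the word count; the truncation step, however, works the same way.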
While there is currently no standard method of circumventing this issue, a plausible strategy is the sliding window approach. Here, any sequence exceeding the `max_seq_length` is split into several windows (sub-sequences), each of length `max_seq_length`.

The windows will typically overlap each other to a certain degree to minimize any information loss caused by the hard cutoffs. The amount of overlap between the windows is determined by the `stride`: the distance (in number of tokens) that the window is, well, slid to obtain the next sub-sequence. The `stride` can be specified either as a fraction of the `max_seq_length` or as an absolute number of tokens. The default `stride` is `0.8 * max_seq_length`, which results in about 20% overlap between the sub-sequences.
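The windowing described above can be sketched in plain Python (a simplified sketch, not the library's implementation; the fractional-stride interpretation follows the description above):

```python
def sliding_windows(token_ids, max_seq_length=512, stride=0.8):
    # A stride below 1 is treated as a fraction of max_seq_length;
    # otherwise it is an absolute number of tokens.
    step = int(max_seq_length * stride) if stride < 1 else int(stride)
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_seq_length])
        if start + max_seq_length >= len(token_ids):
            break  # the last window reached the end of the sequence
    return windows

# A 1000-token sequence with the defaults (step = 0.8 * 512 = 409 tokens)
# is split into 3 windows; consecutive windows share 512 - 409 = 103
# tokens, i.e. roughly 20% overlap.
chunks = sliding_windows(list(range(1000)))
```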
```python
model_args = ClassificationArgs(sliding_window=True)

model = ClassificationModel(
    "roberta",
    "roberta-base",
    args=model_args,
)
```
Training with sliding window
When training a model with `sliding_window` enabled, each sub-sequence is assigned the label of the original sequence, and the model is then trained on the full set of sub-sequences. Depending on the number of sequences and how much each sequence exceeds the `max_seq_length`, the total number of training samples will be higher than the number of sequences originally in the train data.
Evaluation and prediction with sliding window
During evaluation and prediction, the model will predict a label for each window or sub-sequence of an example. The final prediction for an example will be the mode of the predictions for all its sub-sequences.
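Taking the mode of the per-window predictions can be sketched as follows (a simplified illustration, not the library's code; tie handling here follows `Counter` order rather than the library's tie rule):

```python
from collections import Counter

def aggregate_window_predictions(window_preds):
    # The final label for an example is the most common (mode)
    # prediction across all of its windows.
    return Counter(window_preds).most_common(1)[0][0]

# Two of the three windows predicted label 1, so the example gets label 1
label = aggregate_window_predictions([1, 1, 0])
```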
In the case of a tie, the predicted label will be assigned the `tie_value` (default `1`).
Note: The sliding window technique is not currently implemented for multi-label classification.
Lazy Loading Data
The system memory required to keep a large dataset in memory can be prohibitively large. In such cases, the data can be lazy loaded from disk to minimize memory consumption.
To enable lazy loading, you must set the `lazy_loading` flag to `True`.
```python
model_args = ClassificationArgs()
model_args.lazy_loading = True
```
Note: This will typically be slower as the feature conversion is done on the fly. However, the tradeoff between speed and memory consumption should be reasonable.
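The idea behind lazy loading can be illustrated with a plain-Python generator (a sketch only; the tab-separated file format and file name here are hypothetical -- the actual formats Simple Transformers expects are described in the linked docs):

```python
import os
import tempfile

def lazy_examples(path):
    # Yield one (text, label) pair at a time instead of loading the
    # whole dataset into memory; conversion happens on the fly.
    with open(path) as f:
        for line in f:
            text, label = line.rstrip("\n").rsplit("\t", 1)
            yield text, int(label)

# Write a tiny demo file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("Aragorn was the heir of Isildur\t1\n")
    f.write("Frodo was the heir of Isildur\t0\n")
    path = f.name

examples = list(lazy_examples(path))
os.remove(path)
```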
Tip: See Lazy Loading Data Formats for information on the data formats.
Tip: See Configuring a Classification model for information on configuring the model to read the lazy loading data file correctly.
Tip: You can find minimal example scripts in the
`ClassificationModel` expects the labels to be ints from `0` up to `num_labels - 1`. If your dataset contains labels in another format (e.g. string labels like `negative`), you can provide the list of all labels to the model args. Simple Transformers will handle the label mappings internally. Note that this will also automatically set `num_labels` to the length of the labels list.
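The mapping performed internally can be pictured as follows (a minimal sketch of the idea, not the library's actual implementation):

```python
# String labels as they would be passed via labels_list
labels_list = ["true", "false"]

# Each label is mapped to an int from 0 up to num_labels - 1
label_map = {label: i for i, label in enumerate(labels_list)}

# Dataset labels are encoded through the mapping before training
encoded = [label_map[label] for label in ["true", "false", "true"]]

# num_labels is set to the length of the labels list
num_labels = len(labels_list)
```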
```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Preparing train data
train_data = [
    ["Aragorn was the heir of Isildur", "true"],
    ["Frodo was the heir of Isildur", "false"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

# Preparing eval data
eval_data = [
    ["Theoden was the king of Rohan", "true"],
    ["Merry was the king of Rohan", "false"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["text", "labels"]

# Optional model configuration
model_args = ClassificationArgs()
model_args.num_train_epochs = 1
model_args.labels_list = ["true", "false"]

# Create a ClassificationModel
model = ClassificationModel(
    "roberta", "roberta-base", args=model_args
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

# Make predictions with the model
predictions, raw_outputs = model.predict(["Sam was a Wizard"])
```
Note: Custom labels are not currently supported with multi-label classification.