General Usage

This section contains general usage information and tips applicable to most tasks in the library.

Task Specific Models

Simple Transformer models are built with a particular Natural Language Processing (NLP) task in mind. Each such model comes equipped with features and functionality designed to best fit the task that they are intended to perform. The high-level process of using Simple Transformers models follows the same pattern.

Initialize a task-specific model
Train the model with train_model()
Evaluate the model with eval_model()
Make predictions on (unlabelled) data with predict()

However, there are necessary differences between the different models to ensure that they are well suited for their intended task. The key differences will typically be the differences in input/output data formats and any task specific features/configuration options. These can all be found in the documentation section for each task.

The currently implemented task-specific Simple Transformer models, along with their task, are given below.

Task	Model
Binary and multi-class text classification	`ClassificationModel`
Conversational AI (chatbot training)	`ConvAIModel`
Language generation	`LanguageGenerationModel`
Language model training/fine-tuning	`LanguageModelingModel`
Multi-label text classification	`MultiLabelClassificationModel`
Multi-modal classification (text and image data combined)	`MultiModalClassificationModel`
Named entity recognition	`NERModel`
Question answering	`QuestionAnsweringModel`
Regression	`ClassificationModel`
Sentence-pair classification	`ClassificationModel`
Text Representation Generation	`RepresentationModel`
Document Retrieval	`RetrievalModel`

Creating a Task-Specific Model

To create a task-specific Simple Transformers model, you will typically specify a model_type and a model_name. Any deviation from this will be noted in the appropriate model documentation.

model_type should be one of the model types from the supported models (e.g. bert, electra, xlnet)
model_name specifies the exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, or the path to a directory containing model files.

Note: For a list of standard pre-trained models, see here.

Note: For a list of community models, see here.

You may use any of these models provided the model_type is supported.

The code snippets below demonstrate the typical process of creating a Simple Transformers model, using the ClassificationModel as an example.

Importing the task-specific model

from simpletransformers.classification import ClassificationModel

Loading a pre-trained model

model = ClassificationModel(
    "roberta", "roberta-base"
)

Loading a community model

model = ClassificationModel(
    "bert", "KB/bert-base-swedish-cased"
)

Loading a local save

When loading a saved model, the path to the directory containing the model file should be used.

model = ClassificationModel(
    "bert", "outputs/best_model"
)

To CUDA or not to CUDA

Deep Learning (DL) models are typically run on CUDA-enabled GPUs as the performance is far, far superior compared to running on a CPU. This is especially true for Transformer models considering that they are quite large even in relation to other DL models.

CUDA is enabled by default on all Simple Transformers models.

Enabling/Disabling CUDA

All Simple Transformers models have a use_cuda parameter to easily flip the switch on CUDA. Attempting to use CUDA when a CUDA device is not available will result in an error.

model = ClassificationModel(
    "roberta", "roberta-base", use_cuda=False
)

Pro tip: You can use the following code snippet to ensure that your script will use CUDA if it is available, but won’t error out if it is not.

import torch


cuda_available = torch.cuda.is_available()

model = ClassificationModel(
    "roberta", "roberta-base", use_cuda=cuda_available
)

Selecting a CUDA device

If your environment has multiple CUDA devices, but you wish to use a particular device, you can specify the device ID (int starting from 0) as shown below.

model = ClassificationModel(
    "roberta", "roberta-base", cuda_device=1

Configuring a Simple Transformers Model

Every task-specific Simple Transformers model comes with tons of configuration options to enable the user to easily tailor the model for their use case. These options can be categorized into two types, options common to all tasks and task-specific options. This section focuses on the common (or global) options. The task-specific options are detailed in the relevant documentation for the task.

Configuration options in Simple Transformers are defined as either dataclasses or as Python dicts. The ModelArgs dataclass contains all the global options set to their default values, as shown below.

Argument	Type	Default	Description
adafactor_beta1	float	0	Coefficient used for computing running averages of gradient.
adafactor_clip_threshold	float	1.0	Threshold of root mean square of final gradient update.
adafactor_decay_rate	float	-0.8	Coefficient used to compute running averages of square.
adafactor_eps	tuple	(1e-30, 1e-3)	Regularization constants for square gradient and parameter scale respectively.
adafactor_relative_step	bool	True	If True, time-dependent learning rate is computed instead of external learning rate.
adafactor_scale_parameter	bool	True	If True, learning rate is scaled by root mean square.
adafactor_warmup_init	bool	True	Time-dependent learning rate computation depends on whether warm-up initialization is being used.
adam_betas	tuple	(0.9, 0.999)	coefficients used for computing running averages of gradient and its square with AdamW.
adam_epsilon	float	1e-8	Epsilon hyperparameter used in AdamOptimizer.
best_model_dir	str	outputs/best_model	The directory where the best model (model checkpoints) will be saved (based on eval_during_training)
cache_dir	str	cache_dir	The directory where cached files will be saved.
config	dict	{}	A dictionary containing configuration options that should be overriden in a model’s config.
cosine_schedule_num_cycles	float	0.5	The number of waves in the cosine schedule if `cosine_schedule_With_warmup` is used. The number of hard restarts if `cosine_with_hard_restarts_schedule_with_warmup` is used.
dataloader_num_workers	int	cpu_count () - 2 if cpu_count () > 2 else 1	Number of worker processed to use with the Pytorch dataloader.
do_lower_case	bool	False	Set to True when using uncased models.
dynamic_quantize	bool	False	Set to True to use dynamic quantization.
early_stopping_consider_epochs	bool	False	If True, end of epoch evaluation score will be considered for early stopping.
early_stopping_delta	float	0	The improvement over best_eval_loss necessary to count as a better checkpoint.
early_stopping_metric	str	eval_loss	The metric that should be used with early stopping. (Should be computed during eval_during_training).
early_stopping_metric_minimize	bool	True	Whether early_stopping_metric should be minimized (or maximized).
early_stopping_patience	int	3	Terminate training after this many evaluations without an improvement in the evaluation metric greater then early_stopping_delta.
encoding	str	None	Specify an encoding to be used when reading text files.
eval_batch_size	int	8	The evaluation batch size.
evaluate_during_training	bool	False	Set to True to perform evaluation while training models. Make sure eval data is passed to the training method if enabled.
evaluate_during_training_steps	int	2000	Perform evaluation at every specified number of steps. A checkpoint model and the evaluation results will be saved.
evaluate_during_training_verbose	bool	False	Print results from evaluation during training.
fp16	bool	True	Whether or not fp16 mode should be used. Requires NVidia Apex library.
gradient_accumulation_steps	int	1	The number of training steps to execute before performing a optimizer.step(). Effectively increases the training batch size while sacrificing training time to lower memory consumption.
learning_rate	float	4e-5	The learning rate for training.
logging_steps	int	50	Log training loss and learning at every specified number of steps.
manual_seed	int	None	Set a manual seed if necessary for reproducible results.
max_grad_norm	float	1.0	Maximum gradient clipping.
max_seq_length	int	128	Maximum sequence length the model will support.
multiprocessing_chunksize	int	-1	Number of examples sent to a CPU core at a time when using multiprocessing. If this is set to `-1`, the chunksize will be calculated dynamically as `max(len(data) // (args.process_count * 2), 500)`
n_gpu	int	1	Number of GPUs to use.
no_cache	bool	False	Cache features to disk.
no_save	bool	False	If `True`, models will not be saved to disk.
not_saved_args	list	()	The `model_args` which should not be saved when the model is saved. If any `model_args` are not JSON serializable, those argument names should be specified here.
num_train_epochs	int	1	The number of epochs the model will be trained for.
optimizer	str	“AdamW”	Should be one of (AdamW, Adafactor)
output_dir	str	“outputs/”	The directory where all outputs will be stored. This includes model checkpoints and evaluation results.
overwrite_output_dir	bool	False	If True, the trained model will be saved to the ouput_dir and will overwrite existing saved models in the same directory.
polynomial_decay_schedule_lr_end	float	1e-7	The end learning rate.
polynomial_decay_schedule_power	float	1.0	Power factor.
process_count	int	cpu_count () - 2 if cpu_count () > 2 else 1	Number of cpu cores (processes) to use when converting examples to features. Default is (number of cores - 2) or 1 if (number of cores <= 2)
quantized_model	bool	False	Set to True if loading a quantized model. Note that this will automatically be set to True if `dynamic_quantize` is enabled.
reprocess_input_data	bool	True	If True, the input data will be reprocessed even if a cached file of the input data exists in the cache_dir.
save_eval_checkpoints	bool	True	Save a model checkpoint for every evaluation performed.
save_model_every_epoch	bool	True	Save a model checkpoint at the end of every epoch.
save_optimizer_and_scheduler	bool	True	Save optimizer and scheduler whenever they are available.
save_steps	int	2000	Save a model checkpoint at every specified number of steps. Set to -1 to disable.
scheduler	str	“linear_schedule_with_warmup”	The scheduler to use when training. Should be one of (constant_schedule, constant_schedule_with_warmup, linear_schedule_with_warmup, cosine_schedule_with_warmup, cosine_with_hard_restarts_schedule_with_warmup, polynomial_decay_schedule_with_warmup)
silent	bool	False	Disables progress bars.
tensorboard_dir	str	None	The directory where Tensorboard events will be stored during training. By default, Tensorboard events will be saved in a subfolder inside runs/ like runs/Dec02_09-32-58_36d9e58955b0/.
train_batch_size	int	8	The training batch size.
use_cached_eval_features	bool	False	Evaluation during training uses cached features. Setting this to False will cause features to be recomputed at every evaluation step.
use_early_stopping	bool	False	Use early stopping to stop training when early_stopping_metric doesn’t improve (based on early_stopping_patience, and early_stopping_delta)
use_multiprocessing	bool	True	If True, multiprocessing will be used when converting data into features. Enabling can speed up processing, but may be unstable in certain cases. Defaults to True.
use_multiprocessing_for_evaluation	bool	False	If True, multiprocessing will be used when converting evaluation data into features. Enabling this can sometimes cause issues when evaluating during training. Defaults to False.
wandb_kwargs	dict	{}	Dictionary of keyword arguments to be passed to the W&B project.
wandb_project	str	None	Name of W&B project. This will log all hyperparameter values, training losses, and evaluation metrics to the given project.
warmup_ratio	float	0.06	Ratio of total training steps where learning rate will “warm up”. Overridden if `warmup_steps` is specified.
warmup_steps	int	0	Number of training steps where learning rate will “warm up”. Overrides `warmup_ratio`.
weight_decay	int	0	Adds L2 penalty.

You can override any of these default values by either editing the dataclass attributes or by passing in a Python dict containing the appropriate key-value pairs when initializing a Simple Transformers model.

Using the dataclass

from simpletransformers.classification import ClassificationModel, ClassificationArgs

model_args = ClassificationArgs()
model_args.num_train_epochs = 5
model_args.learning_rate = 1e-4

model = ClassificationModel("bert", "bert-base-cased", args=model_args)

Using a python dictionary

from simpletransformers.classification import ClassificationModel


model_args = {
    "num_train_epochs": 5,
    "learning_rate": 1e-4,
}

model = ClassificationModel("bert", "bert-base-cased", args=model_args)

Tip: Using the dataclass approach has the benefits of IDE auto-completion as well as ensuring that there are no typos in arguments that could lead to unexpected behaviour.

Tip: Both the dataclass and the dictionary approaches are interchangeable.