Tips and Tricks

This section contains various tips and tricks applicable to most tasks in the library.

Visualization support

The Weights & Biases framework is supported for visualizing model training.

To use this, simply set a project name for W&B in the wandb_project attribute of the args dictionary. This will log all hyperparameter values, training losses, and evaluation metrics to the given project.

model = ClassificationModel('roberta', 'roberta-base', args={'wandb_project': 'project-name'})

For a complete example, see here.

Using early stopping

Early stopping is a technique used to prevent model overfitting. In a nutshell, the idea is to periodically evaluate the performance of a model against a test dataset and terminate the training once the model stops improving on the test data.

The exact conditions for early stopping can be adjusted as needed using a model’s configuration options.

Note: Refer to the configuration options table for more details (early_stopping_consider_epochs, early_stopping_delta, early_stopping_metric, early_stopping_metric_minimize, early_stopping_patience).

You must set use_early_stopping to True in order to use early stopping.

from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()
model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01
model_args.early_stopping_metric = "mcc"
model_args.early_stopping_metric_minimize = False
model_args.early_stopping_patience = 5
model_args.evaluate_during_training_steps = 1000
model_args.evaluate_during_training = True  # early stopping relies on evaluation during training

model = ClassificationModel("bert", "bert-base-cased", args=model_args)

With this configuration, the training will terminate if the mcc score of the model on the test data does not improve upon the best mcc score by at least 0.01 for 5 consecutive evaluations. An evaluation will occur once for every 1000 training steps.

Pro tip: You can use the evaluation during training functionality without invoking early stopping by setting evaluate_during_training to True while keeping use_early_stopping as False.
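
For instance, the following sketch (assuming train_df and eval_df have already been prepared as pandas DataFrames) evaluates the model every 1000 training steps without applying early stopping:

from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()
model_args.evaluate_during_training = True  # periodic evaluation only
model_args.evaluate_during_training_steps = 1000
model_args.use_early_stopping = False  # the default, shown here for clarity

model = ClassificationModel("bert", "bert-base-cased", args=model_args)

# eval_df must be supplied so that evaluation can run during training
model.train_model(train_df, eval_df=eval_df)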

Additional Evaluation Metrics

Task-specific Simple Transformers models each have their own default metrics that will be calculated when a model is evaluated on a dataset. The default metrics have been chosen according to the task, usually by looking at the metrics used in standard benchmarks for that task.

However, it is likely that you will wish to calculate your own metrics depending on your particular use case. To facilitate this, all eval_model() and train_model() methods in Simple Transformers accept keyword arguments consisting of the name of the metric (str) and the metric function itself. The metric function should accept two inputs: the true labels and the model predictions (sklearn format).

from simpletransformers.classification import ClassificationModel
import sklearn


model = ClassificationModel("bert", "bert-base-cased")

model.train_model(train_df, acc=sklearn.metrics.accuracy_score)

model.eval_model(eval_df, acc=sklearn.metrics.accuracy_score)

Pro tip: You can combine the additional evaluation metrics functionality with early stopping by setting the name of your metrics function as the early_stopping_metric.
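
For example, a minimal sketch (again assuming train_df and eval_df are prepared as pandas DataFrames) that uses a custom accuracy metric as the early stopping metric could look like this:

import sklearn

from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()
model_args.use_early_stopping = True
model_args.early_stopping_metric = "acc"  # must match the kwarg name passed to train_model()
model_args.early_stopping_metric_minimize = False  # higher accuracy is better
model_args.early_stopping_patience = 5
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 1000

model = ClassificationModel("bert", "bert-base-cased", args=model_args)

# The custom metric is passed under the same name ("acc") used above
model.train_model(train_df, eval_df=eval_df, acc=sklearn.metrics.accuracy_score)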

Simple-Viewer (Visualizing Model Predictions with Streamlit)

Simple Viewer is a web-app built with the Streamlit framework which can be used to quickly try out trained models.

To start Simple Viewer, run the command simple-viewer.

When Simple Viewer is started, it will look for Simple Transformers models in the current directory and any subdirectories. All detected models can be found in the Choose Model dropdown. Alternatively, you can load a model by specifying the Simple Transformers task, model type, and model name (the model type and model name follow the usual Simple Transformers conventions). The model name may be the path to a local model, or it may be the model name of a model from the Hugging Face model hub.

The following Simple Transformers tasks are currently supported:

  • Classification
  • Multi-Label Classification
  • Named Entity Recognition
  • Question Answering

Hyperparameter Optimization

Machine learning models can be very sensitive to the hyperparameters used to train them. While large models like Transformers can perform well across a relatively wide hyperparameter range, they can also break completely under certain conditions (like training with a large learning rate for many iterations).

Hint: We can define two kinds of parameters used to train Transformer models. The first is the learned parameters (like the model weights) and the second is the hyperparameters. At a high level, the hyperparameters (learning rate, batch size, etc.) control the training process through which the learned parameters are obtained.

Choosing a good set of hyperparameter values plays a huge role in developing a state-of-the-art model. Because of this, Simple Transformers has native support for the excellent W&B Sweeps feature for automated hyperparameter optimization.

How to perform hyperparameter optimization with Simple Transformers and W&B Sweeps (Adapted from W&B docs):

1. Setup the sweep

The sweep can be configured through a Python dictionary (sweep_config). The dictionary contains at least three keys:

  1. method – Specifies the search strategy

    method    Meaning
    grid      Grid search iterates over all possible combinations of parameter values.
    random    Random search chooses random sets of values.
    bayes     Bayesian Optimization uses a Gaussian process to model the function and then chooses parameters to optimize probability of improvement. This strategy requires a metric key to be specified.
  2. metric – Specifies the metric to be optimized

    This should be a metric that is logged to W&B by the training script

    The metric key of the sweep_config points to another Python dictionary containing the name, goal, and (optionally) target.

    sub-key   Meaning
    name      Name of the metric to optimize
    goal      "minimize" or "maximize" (Default is "minimize")
    target    Value that you’d like to achieve for the metric you’re optimizing. When any run in the sweep achieves that target value, the sweep’s state will be set to “Finished.” This means all agents with active runs will finish those jobs, but no new runs will be launched in the sweep.
  3. parameters – Specifies the hyperparameters and their values to explore

    The parameters key of the sweep_config points to another Python dictionary which contains all the hyperparameters to be optimized and their possible values. Generally, these will be any combination of the model_args for the particular Simple Transformers model.

    W&B offers a variety of ways to define the possible values for each parameter, all of which can be found in the W&B docs. The possible values are also represented using a Python dictionary. Two common methods are given below.

    1. Discrete values

      A dictionary with the key values pointing to a Python list of discrete values.

    2. Range of values

      A dictionary with the two keys min and max, which specify the minimum and maximum values of the range. The range is continuous if min and max are floats and discrete if min and max are ints.

Example sweep_config:

sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "train_loss", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [2, 3, 5]},
        "learning_rate": {"min": 5e-5, "max": 4e-4},
    },
}

2. Initialize the sweep

Initialize a W&B sweep with the config defined earlier.

sweep_id = wandb.sweep(sweep_config, project="Simple Sweep")

3. Prepare the data and default model configuration

In order to run our sweep, we must get our data ready. This is identical to how you would normally set up datasets for training a Simple Transformers model.

For example:

# Preparing train data
train_data = [
    ["Aragorn was the heir of Isildur", "true"],
    ["Frodo was the heir of Isildur", "false"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

# Preparing eval data
eval_data = [
    ["Theoden was the king of Rohan", "true"],
    ["Merry was the king of Rohan", "false"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["text", "labels"]

Next, we can set up the default configuration for the Simple Transformers model. This would include any args that are not being optimized through the sweep.

Hint: As a rule of thumb, it might be a good idea to set all of reprocess_input_data, overwrite_output_dir, and no_save to True when running sweeps.

model_args = ClassificationArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.evaluate_during_training = True
model_args.manual_seed = 4
model_args.use_multiprocessing = True
model_args.train_batch_size = 16
model_args.eval_batch_size = 8
model_args.labels_list = ["true", "false"]
model_args.wandb_project = "Simple Sweep"

4. Set up the training function

W&B will call this function to run the training for a particular sweep run. This function must perform 3 critical tasks.

  1. Initialize the wandb run
  2. Initialize a Simple Transformers model and pass in sweep_config=wandb.config as a kwarg.
  3. Run the training for the Simple Transformers model.

wandb.config contains the hyperparameter values for the current sweep run. Simple Transformers will update the model args accordingly.

An example training function is shown below.

def train():
    # Initialize a new wandb run
    wandb.init()

    # Create a TransformerModel
    model = ClassificationModel(
        "roberta",
        "roberta-base",
        use_cuda=True,
        args=model_args,
        sweep_config=wandb.config,
    )

    # Train the model
    model.train_model(train_df, eval_df=eval_df)

    # Evaluate the model
    model.eval_model(eval_df)

    # Sync wandb
    wandb.join()

In addition to the 3 tasks outlined earlier, the function also performs an evaluation and manually syncs the W&B run.

Hint: This function can be reused across any Simple Transformers task by simply replacing ClassificationModel with the appropriate model class.

5. Run the sweeps

The following line will execute the sweeps.

wandb.agent(sweep_id, train)

6. Putting it all together

import logging

import pandas as pd
import sklearn

import wandb
from simpletransformers.classification import (
    ClassificationArgs,
    ClassificationModel,
)

sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "train_loss", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [2, 3, 5]},
        "learning_rate": {"min": 5e-5, "max": 4e-4},
    },
}

sweep_id = wandb.sweep(sweep_config, project="Simple Sweep")

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Preparing train data
train_data = [
    ["Aragorn was the heir of Isildur", "true"],
    ["Frodo was the heir of Isildur", "false"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

# Preparing eval data
eval_data = [
    ["Theoden was the king of Rohan", "true"],
    ["Merry was the king of Rohan", "false"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["text", "labels"]

model_args = ClassificationArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.evaluate_during_training = True
model_args.manual_seed = 4
model_args.use_multiprocessing = True
model_args.train_batch_size = 16
model_args.eval_batch_size = 8
model_args.labels_list = ["true", "false"]
model_args.wandb_project = "Simple Sweep"

def train():
    # Initialize a new wandb run
    wandb.init()

    # Create a TransformerModel
    model = ClassificationModel(
        "roberta",
        "roberta-base",
        use_cuda=True,
        args=model_args,
        sweep_config=wandb.config,
    )

    # Train the model
    model.train_model(train_df, eval_df=eval_df)

    # Evaluate the model
    model.eval_model(eval_df)

    # Sync wandb
    wandb.join()


wandb.agent(sweep_id, train)

Hint: This script can also be found in the examples directory of the GitHub repo.

To visualize your sweep results, open the project on W&B. Please refer to W&B docs for more details on understanding the results.

Guide: A guide for hyperparameter optimization is available here.

Custom Parameter Groups (Freezing Layers)

Simple Transformers supports custom parameter groups which can be used to set different learning rates for different layers in a model, freeze layers, train only the final layer, etc.

All Simple Transformers models support the following three configuration options for setting up custom parameter groups.

Custom parameter groups

custom_parameter_groups offers the most granular configuration option. This should be a list of Python dicts where each dict contains a params key and any other optional keys matching the keyword arguments accepted by the optimizer (e.g. lr, weight_decay). The value for the params key should be a list of named parameters (e.g. ["classifier.weight", "bert.encoder.layer.10.output.dense.weight"]).

Hint: All Simple Transformers models have a get_named_parameters() method that returns a list of all parameter names in the model.
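
For example, you could print the parameter names of an initialized model (using the get_named_parameters() method mentioned above) to decide which ones to include; a minimal sketch:

from simpletransformers.classification import ClassificationModel


# List all named parameters in the model to pick targets for a custom parameter group
model = ClassificationModel("bert", "bert-base-cased")

for name in model.get_named_parameters():
    print(name)  # e.g. "classifier.weight", "bert.encoder.layer.10.output.dense.weight"

A custom_parameter_groups configuration using some of these names could then look like this: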

model_args = ClassificationArgs()
model_args.custom_parameter_groups = [
    {
        "params": ["classifier.weight", "bert.encoder.layer.10.output.dense.weight"],
        "lr": 1e-2,
    }
]

Custom layer parameters

custom_layer_parameters makes it more convenient to set the optimizer options for a given layer or set of layers. This should be a list of Python dicts where each dict contains a layer key and any other optional keys matching the keyword arguments accepted by the optimizer (e.g. lr, weight_decay). The value for the layer key should be an int specifying the layer (e.g. 0, 1, 11).

model_args = ClassificationArgs()
model_args.custom_layer_parameters = [
    {
        "layer": 10,
        "lr": 1e-3,
    },
    {
        "layer": 0,
        "lr": 1e-5,
    },
]

Note: Any named parameters specified through custom_layer_parameters with bias or LayerNorm.weight in the name will have their weight_decay set to 0.0, as will any parameters not specified in either custom_parameter_groups or custom_layer_parameters. This does not apply to parameters specified through custom_parameter_groups.

Order of precedence:

Note that custom_parameter_groups has higher priority than custom_layer_parameters, as custom_parameter_groups is more specific. If a parameter specified in custom_parameter_groups also happens to be in a layer specified in custom_layer_parameters, that particular parameter will be assigned to the parameter group specified in custom_parameter_groups.

For example:

model_args = ClassificationArgs()
model_args.custom_layer_parameters = [
    {
        "layer": 10,
        "lr": 1e-3,
    },
    {
        "layer": 0,
        "lr": 1e-5,
    },
]
model_args.custom_parameter_groups = [
    {
        "params": ["classifier.weight", "bert.encoder.layer.10.output.dense.weight"],
        "lr": 1e-2,
    }
]

Here, "bert.encoder.layer.10.output.dense.weight" is specified in both the custom_parameter_groups and the custom_layer_parameters. However, "bert.encoder.layer.10.output.dense.weight" will have a lr of 1e-2 due to the higher precedence of custom_parameter_groups.

Hint: Any parameters not specified in either custom_parameter_groups or in custom_layer_parameters will be assigned the general values from the model args.

Train custom parameters only

The train_custom_parameters_only option is used to facilitate the training of specific parameters only. If train_custom_parameters_only is set to True, only the parameters specified in either custom_parameter_groups or in custom_layer_parameters will be trained.

For example, to train only the Classification layers of a ClassificationModel:

from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Preparing train data
train_data = [
    ["Aragorn was the heir of Isildur", 1],
    ["Frodo was the heir of Isildur", 0],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

# Preparing eval data
eval_data = [
    ["Theoden was the king of Rohan", 1],
    ["Merry was the king of Rohan", 0],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["text", "labels"]

# Train only the classifier layers
model_args = ClassificationArgs()
model_args.train_custom_parameters_only = True
model_args.custom_parameter_groups = [
    {
        "params": ["classifier.weight"],
        "lr": 1e-3,
    },
    {
        "params": ["classifier.bias"],
        "lr": 1e-3,
        "weight_decay": 0.0,
    },
]
# Create a ClassificationModel
model = ClassificationModel(
    "bert", "bert-base-cased", args=model_args
)

# Train the model
model.train_model(train_df)

Options For Downloading Pre-Trained Models

Most Simple Transformers models will use the from_pretrained() method from the Hugging Face Transformers library to download pre-trained models. You can pass kwargs to this method to configure things like proxies and force downloading (refer to the from_pretrained() documentation).

You can pass these kwargs when initializing a Simple Transformers task-specific model to access the same functionality. For example, if you are behind a firewall and need to set the proxy settings:

model = ClassificationModel(
    "bert",
    "bert-base-cased",
    proxies={"http": "foo.bar:3128", "http://hostname": "foo.bar:4012"}
)

ONNX Support (Beta)

Simple Transformers has ONNX support for Classification and NER tasks. These models can be converted to an ONNX model and run through the ONNX-runtime.

Heads up: ONNX support should be considered experimental at this time. If you encounter any problems, please open an issue in the repo. Please provide a detailed explanation and the minimal code necessary to replicate the issue.

ONNX setup

Please refer to the ONNX and ONNX Runtime documentation for instructions on installing ONNX.

Converting a Simple Transformers model to the ONNX format

The following models are currently compatible:

  • ClassificationModel
  • NERModel

These models can be converted by calling the convert_to_onnx() method. You can change the output directory by specifying output_dir when calling this method.

from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


model = ClassificationModel(
    "roberta",
    "roberta-base",
)

model.convert_to_onnx("onnx_outputs")

Loading a converted ONNX model

You can load the ONNX model just as you would load any other model in Simple Transformers.

from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


model = ClassificationModel(
    "roberta",
    "onnx_outputs",
)

After the model is loaded, you can use the predict() method to make predictions.

Code example

from time import time

from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


model_args = ClassificationArgs()
model_args.overwrite_output_dir = True


# Create a TransformerModel
model = ClassificationModel(
    "roberta",
    "roberta-base",
    use_cuda=False,
    args=model_args,
)

start = time()
print(model.predict(["test " * 450]))
end = time()
print(f"Pytorch CPU: {end - start}")

model.convert_to_onnx("onnx_outputs")

model_args.dynamic_quantize = True

model = ClassificationModel(
    "roberta",
    "onnx_outputs",
    args=model_args,
)

start = time()
print(model.predict(["test " * 450]))
end = time()
print(f"ONNX CPU (Cold): {end - start}")

start = time()
print(model.predict(["test " * 450]))
end = time()
print(f"ONNX CPU (Warm): {end - start}")

Execution Providers

ONNX-Runtime supports many different Execution Providers.

If use_cuda is True, CUDAExecutionProvider will be used. If it is False, the CPUExecutionProvider will be used.

You can manually specify the provider using the onnx_execution_provider argument when loading a model.

model = ClassificationModel(
    "roberta",
    "onnx_outputs",
    args=model_args,
    onnx_execution_provider="CPUExecutionProvider",
)

Note that the library is only tested with CPU and CUDA Execution Providers.

Saving checkpoints

Don’t save model checkpoints

When training takes little time, we may want to skip saving intermediate checkpoints to reduce disk space usage and training time.
Note that the model artifacts will still be saved to output_dir when the training process finishes.
We can prevent the model from saving intermediate checkpoints by setting save_steps to -1 and save_model_every_epoch to False.

from simpletransformers.classification import ClassificationModel, ClassificationArgs


model_args = ClassificationArgs()
model_args["save_steps"] = -1
model_args["save_model_every_epoch"] = False
model = ClassficationModel("bert", "bert-base-cased", args=model_args)

Save model checkpoint every 3 epochs

Every model checkpoint takes the same disk space as the final model. When training transformer models for a large number of epochs, we may not want to save a checkpoint for every single epoch, since this would use a lot of disk space. The following example shows how to save a checkpoint every 3 epochs.

The procedure just requires two steps:

  • Turn off the automatic save after every epoch by setting the save_model_every_epoch arg to False
  • Set save_steps to N (to save every N epochs) times the number of training steps the model performs per epoch

from simpletransformers.classification import ClassificationModel, ClassificationArgs
import math
SAVE_EVERY_N_EPOCHS = 3
model_args = ClassificationArgs()
# One epoch corresponds to roughly len(train_df) / train_batch_size training steps
# (assuming gradient_accumulation_steps is left at 1)
steps_per_epoch = math.ceil(len(train_df) / model_args.train_batch_size)
model_args.save_steps = steps_per_epoch * SAVE_EVERY_N_EPOCHS
model_args.save_model_every_epoch = False
model = ClassificationModel("bert", "bert-base-cased", args=model_args)
