Language Modeling Model

LanguageModelingModel

The LanguageModelingModel class is used for Language Modeling. It can be used both to fine-tune an existing Language Model and to train a Language Model from scratch.

To create a LanguageModelingModel, you must specify a model_type and a model_name.

Note: Set model_name to None to train a Language Model from scratch.

  • model_type should be one of the model types from the supported models (e.g. bert, electra, gpt2)
  • model_name specifies the exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, the path to a directory containing model files, or None to train a Language Model from scratch.

    Note: For a list of standard pre-trained models, see here.

    Note: For a list of community models, see here.

    You may use any of these models provided the model_type is supported.

Language Model fine-tuning

from simpletransformers.language_modeling import (
    LanguageModelingModel,
)


model = LanguageModelingModel("bert", "bert-base-cased")

Language Model training from scratch

from simpletransformers.language_modeling import (
    LanguageModelingModel,
)

model = LanguageModelingModel(
    "bert", None
)
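
If the tokenizer is also being trained from scratch, the training file(s) must be supplied through train_files, and a vocab_size must be set (see Configuring a LanguageModelingModel and the class parameters below). A minimal sketch, with placeholder values:

from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)

model_args = LanguageModelingArgs()
model_args.vocab_size = 30000  # required when a new tokenizer is being trained

# "train.txt" is a placeholder path; the file is used to train the new tokenizer.
model = LanguageModelingModel(
    "bert",
    None,
    args=model_args,
    train_files="train.txt",
)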

Note: For more information on working with Simple Transformers models, please refer to the General Usage section.

Configuring a LanguageModelingModel

LanguageModelingModel has several task-specific configuration options.

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| block_size | int | -1 | Optional input sequence length after tokenization. The training dataset will be truncated into blocks of this size for training. Defaults to the model's maximum input length for single-sentence inputs (taking special tokens into account). |
| clean_text | bool | True | Performs invalid character removal and whitespace cleanup on text. |
| config_name | str | None | Name of a pretrained config or path to a directory containing a config.json file. |
| dataset_class | Subclass of PyTorch Dataset | None | A custom dataset class to use instead of dataset_type. |
| dataset_type | str | "simple" | Choose between simple, line_by_line, and text dataset types. (See Dataset types below) |
| discriminator_config | dict | {} | Key-values given here will override the default values used in an ELECTRA discriminator model config. (See ELECTRA models) |
| generator_config | dict | {} | Key-values given here will override the default values used in an ELECTRA generator model config. (See ELECTRA models) |
| handle_chinese_chars | bool | True | Whether to tokenize Chinese characters. If False, Chinese text will not be tokenized properly. |
| max_steps | int | -1 | If max_steps > 0, sets the total number of training steps to perform. Supersedes num_train_epochs. |
| min_frequency | int | 2 | Minimum frequency required for a word to be added to the vocabulary. |
| mlm | bool | True | Train with masked language modeling loss instead of standard language modeling. Set to False for models which don't use Masked Language Modeling. |
| mlm_probability | float | 0.15 | Ratio of tokens to mask for masked language modeling loss. |
| sliding_window | bool | False | Whether the sliding window technique should be used when preparing data. Only works with the simple dataset type (SimpleDataset). |
| special_tokens | list | Defaults to the special tokens of the model used | List of special tokens to be used when training a new tokenizer. |
| stride | float | 0.8 | The fraction of max_seq_length to use as the stride when using a sliding window. |
| strip_accents | bool | True | Strips accents from a piece of text. |
| tokenizer_name | str | None | Name of a pretrained tokenizer or path to a directory containing tokenizer files. |
| vocab_size | int | None | The maximum size of the vocabulary of the tokenizer. Required when training a new tokenizer. |

Note: For configuration options common to all Simple Transformers models, please refer to the Configuring a Simple Transformers Model section.
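
For example, a few of these options can be set through a LanguageModelingArgs object. A minimal sketch (the values shown are illustrative, not recommendations):

from simpletransformers.language_modeling import LanguageModelingArgs

model_args = LanguageModelingArgs()
model_args.mlm_probability = 0.2  # mask 20% of tokens instead of the default 15%
model_args.vocab_size = 30000     # maximum tokenizer vocabulary size; required when training a new tokenizer
model_args.min_frequency = 5      # a word must occur at least 5 times to enter the vocabulary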

Dataset types
  • simple (or None) - Each line in the train file is considered to be a single, separate sample. sliding_window can be set to True to automatically split longer sequences into samples of length max_seq_length. Uses multiprocessing for significantly improved performance on multi-core systems.

  • line_by_line - Treats each line in the train file as a separate sample. Uses tokenizers from the Hugging Face tokenizers library.

  • text - Treats the train file as a continuous stream of text, which is tokenized and split into consecutive blocks of block_size tokens (the standard Hugging Face TextDataset behaviour). Uses default tokenizers.

Using simple is recommended.
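
For example, the dataset type and the sliding window behaviour can be set through the model args. A minimal sketch (the values are illustrative):

from simpletransformers.language_modeling import LanguageModelingArgs

model_args = LanguageModelingArgs()
model_args.dataset_type = "simple"
model_args.sliding_window = True  # split sequences longer than max_seq_length into multiple samples
model_args.max_seq_length = 512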

Configuring the architecture of a Language Model

When training a Language Model from scratch, you are free to define your own architecture. For all model types except ELECTRA, this is controlled through the config entry in the model args dict. For ELECTRA, the generator and discriminator architectures can be specified through the generator_config and discriminator_config entries, respectively.

If not specified, the default configuration (the base architecture) for the given model will be used. For all available parameters and their default values, please refer to the Hugging Face docs for the relevant config class (e.g. the BERT config).

A custom BERT architecture:

from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs


model_args = LanguageModelingArgs()
model_args.config = {
    "num_hidden_layers": 2
}
model_args.vocab_size = 5000

model = LanguageModelingModel(
    "bert", None, args=model_args, train_files=train_file
)

A custom ELECTRA architecture:

from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs


model_args = LanguageModelingArgs()
model_args.vocab_size = 30000
model_args.generator_config = {
    "embedding_size": 512,
    "hidden_size": 256,
    "num_hidden_layers": 4,
}
model_args.discriminator_config = {
    "embedding_size": 512,
    "hidden_size": 256,
    "num_hidden_layers": 16,
}

model = LanguageModelingModel(
    "electra",
    None,
    args=model_args,
    train_files=train_file
)

Class LanguageModelingModel

simpletransformers.language_modeling.LanguageModelingModel(self, model_type, model_name, generator_name=None, discriminator_name=None, train_files=None, args=None, use_cuda=True, cuda_device=-1, **kwargs)

Initializes a LanguageModelingModel model.

Parameters

  • model_type (str) - The type of model to use (model types)

  • model_name (str) - The exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, the path to a directory containing model files, or None to train a Language Model from scratch.

  • generator_name (str, optional) - A pretrained model name or path to a directory containing an ELECTRA generator model. (See ELECTRA models)

  • discriminator_name (str, optional) - A pretrained model name or path to a directory containing an ELECTRA discriminator model. (See ELECTRA models and the example below)

  • train_files (str or List, optional) - A file or a List of files to be used when training the tokenizer. Required if the tokenizer is being trained from scratch.

  • args (dict, optional) - Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.

  • use_cuda (bool, optional) - Use GPU if available. Setting to False will force model to use CPU only. (See here)

  • cuda_device (int, optional) - Specific GPU that should be used. Will use the first available GPU by default. (See here)

  • kwargs (optional) - For providing proxies, force_download, resume_download, cache_dir and other options specific to the ‘from_pretrained’ implementation where this will be supplied. (See here)

Returns

  • None

Note: For configuration options common to all Simple Transformers models, please refer to the Configuring a Simple Transformers Model section.
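
For example, an ELECTRA language model can be initialized from separately saved generator and discriminator weights by passing generator_name and discriminator_name. A minimal sketch (the paths below are placeholders):

from simpletransformers.language_modeling import LanguageModelingModel

# Placeholder paths to directories containing previously saved
# ELECTRA generator and discriminator models.
model = LanguageModelingModel(
    "electra",
    None,
    generator_name="outputs/generator_model",
    discriminator_name="outputs/discriminator_model",
)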

Training a LanguageModelingModel

The train_model() method is used to train the model.

model.train_model(train_file)

simpletransformers.language_modeling.LanguageModelingModel.train_model(self, train_file, output_dir=None, show_running_loss=True, args=None, eval_file=None, verbose=True, **kwargs)

Trains the model using ‘train_file’

Parameters

  • train_file (str) - Path to text file containing the text to train the language model on. The model will be trained on this data. Refer to the Language Modeling Data Formats section for the correct formats.

  • output_dir (str, optional) - The directory where model files will be saved. If not given, self.args['output_dir'] will be used.

  • show_running_loss (bool, optional) - If True, the running loss (training loss at current step) will be logged to the console.

  • args (dict, optional) - A dict of configuration options for the LanguageModelingModel. Any changes made will persist for the model.

  • eval_file (str, optional) - Evaluation data (in the same format as train_file) against which evaluation will be performed when evaluate_during_training is enabled. Required if evaluate_during_training is enabled (see the example below).

  • kwargs (optional) - Additional metrics are not currently supported for Language Modeling.

Returns

  • None
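
For example, to evaluate against a held-out file during training, evaluate_during_training must be enabled in the model args. A minimal sketch (the file paths are placeholders):

from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)

model_args = LanguageModelingArgs()
model_args.evaluate_during_training = True  # evaluate on eval_file periodically during training

model = LanguageModelingModel("bert", "bert-base-cased", args=model_args)

# "train.txt" and "eval.txt" are placeholder paths.
model.train_model("train.txt", eval_file="eval.txt")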

Note: For more details on training models with Simple Transformers, please refer to the Tips and Tricks section.

Evaluating a LanguageModelingModel

The eval_model() method is used to evaluate the model.

The following metrics will be calculated by default:

  • perplexity - The exponential of the evaluation loss; a standard measure of how well a language model predicts the evaluation data (lower is better).
  • eval_loss - Cross Entropy Loss for eval_file
result = model.eval_model(eval_file)

simpletransformers.language_modeling.LanguageModelingModel.eval_model(self, eval_file, output_dir=None, verbose=True, silent=False)

Evaluates the model using ‘eval_file’

Parameters

  • eval_file (str) - Path to text file containing the text to evaluate the language model on. The model will be evaluated on this data. Refer to the Language Modeling Data Formats section for the correct formats.

  • output_dir (str, optional) - The directory where model files will be saved. If not given, self.args['output_dir'] will be used.

  • verbose (bool, optional) - If verbose, results will be printed to the console on completion of evaluation.

  • silent (bool, optional) - If silent, tqdm progress bars will be hidden.

Returns

  • result (dict) - Dictionary containing evaluation results.

  • texts (dict) - A dictionary containing the 3 dictionaries correct_text, similar_text, and incorrect_text.
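
The metrics can then be read from the returned dictionary. A minimal sketch (assuming the result keys match the metric names listed above; the file path is a placeholder):

# "eval.txt" is a placeholder path to the evaluation data.
result = model.eval_model("eval.txt")

# "perplexity" and "eval_loss" are assumed to be the keys for the default metrics.
print(result["perplexity"])
print(result["eval_loss"])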

Note: For more details on evaluating models with Simple Transformers, please refer to the Tips and Tricks section.
