Language Modeling Model
LanguageModelingModel
The LanguageModelingModel class is used for Language Modeling. It can be used both for Language Model fine-tuning and for training a Language Model from scratch.
To create a LanguageModelingModel, you must specify a model_type and a model_name.

Note: model_name is set to None to train a Language Model from scratch.
- model_type should be one of the model types from the supported models (e.g. bert, electra, gpt2).
- model_name specifies the exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, the path to a directory containing model files, or None to train a Language Model from scratch.

Note: For a list of standard pre-trained models, see here.

Note: For a list of community models, see here.

You may use any of these models provided the model_type is supported.
Language Model fine-tuning
from simpletransformers.language_modeling import (
    LanguageModelingModel,
)

model = LanguageModelingModel("bert", "bert-base-cased")
Language Model training from scratch
from simpletransformers.language_modeling import (
    LanguageModelingModel,
)

model = LanguageModelingModel(
    "bert", None
)
Note: For more information on working with Simple Transformers models, please refer to the General Usage section.
Configuring a LanguageModelingModel
LanguageModelingModel has several task-specific configuration options.
| Argument | Type | Default | Description |
|---|---|---|---|
| block_size | int | -1 | Optional input sequence length after tokenization. The training dataset will be truncated into blocks of this size for training. Defaults to the model's max input length for single-sentence inputs (taking into account special tokens). |
| clean_text | bool | True | Performs invalid character removal and whitespace cleanup on text. |
| config_name | str | None | Name of a pretrained config or path to a directory containing a config.json file. |
| dataset_class | Subclass of Pytorch Dataset | None | A custom dataset class to use instead of dataset_type. |
| dataset_type | str | "simple" | Choose between simple, line_by_line, and text dataset types. (See Dataset types below) |
| discriminator_config | dict | {} | Key-values given here will override the default values used in an ELECTRA discriminator model config. (See ELECTRA models) |
| generator_config | dict | {} | Key-values given here will override the default values used in an ELECTRA generator model config. (See ELECTRA models) |
| handle_chinese_chars | bool | True | Whether to tokenize Chinese characters. If False, Chinese text will not be tokenized properly. |
| max_steps | int | -1 | If max_steps > 0, sets the total number of training steps to perform. Supersedes num_train_epochs. |
| min_frequency | int | 2 | Minimum frequency required for a word to be added to the vocabulary. |
| mlm | bool | True | Train with masked language modeling loss instead of causal language modeling. Set to False for models which don't use Masked Language Modeling. |
| mlm_probability | float | 0.15 | Ratio of tokens to mask for masked language modeling loss. |
| sliding_window | bool | False | Whether the sliding window technique should be used when preparing data. Only works with SimpleDataset. |
| special_tokens | list | Defaults to the special_tokens of the model used | List of special tokens to be used when training a new tokenizer. |
| stride | float | 0.8 | The fraction of max_seq_length to use as the stride when using a sliding window. |
| strip_accents | bool | True | Strips accents from a piece of text. |
| tokenizer_name | str | None | Name of a pretrained tokenizer or path to a directory containing tokenizer files. |
| vocab_size | int | None | The maximum size of the vocabulary of the tokenizer. Required when training a tokenizer. |
Note: For configuration options common to all Simple Transformers models, please refer to the Configuring a Simple Transformers Model section.
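For example, any of these options can be set on a LanguageModelingArgs object (or passed as a dict through args) before creating the model. A minimal sketch, with purely illustrative values:

from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)

# Illustrative values only - tune these for your own data and hardware
model_args = LanguageModelingArgs()
model_args.mlm_probability = 0.20  # mask 20% of tokens instead of the default 15%
model_args.block_size = 128        # truncate the training data into blocks of 128 tokens

model = LanguageModelingModel("bert", "bert-base-cased", args=model_args)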
Dataset types

- simple (or None) - Each line in the train file is considered to be a single, separate sample. sliding_window can be set to True to automatically split longer sequences into samples of length max_seq_length. Uses multiprocessing for significantly improved performance on multi-core systems.
- line_by_line - Treats each line in the train file as a separate sample. Uses tokenizers from the Hugging Face tokenizers library.
- text - Treats each line in the train file as a separate sample. Uses default tokenizers.

Using simple is recommended.
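A minimal sketch of selecting a dataset type through the model args (the values shown are only examples):

from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)

model_args = LanguageModelingArgs()
model_args.dataset_type = "simple"  # "simple", "line_by_line", or "text"
model_args.sliding_window = True    # only has an effect with the simple dataset type
model_args.max_seq_length = 128     # longer lines are split into samples of this length

model = LanguageModelingModel("bert", "bert-base-cased", args=model_args)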
Configuring the architecture of a Language Model
When training a Language Model from scratch, you are free to define your own architecture. For all model types except ELECTRA, this is controlled through the config entry in the model args dict. For ELECTRA, the generator and the discriminator architectures can be specified through the generator_config and discriminator_config entries respectively.
If not specified, the default configurations (the base architecture) for the given model will be used. For all available parameters and their default values, please refer to the Hugging Face docs for the relevant config class (E.g. BERT config).
A custom BERT architecture:
from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs

model_args = LanguageModelingArgs()
model_args.config = {
    "num_hidden_layers": 2
}
model_args.vocab_size = 5000

model = LanguageModelingModel(
    "bert", None, args=model_args, train_files=train_file
)
A custom ELECTRA architecture:
from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs

model_args = LanguageModelingArgs()
model_args.vocab_size = 30000
model_args.generator_config = {
    "embedding_size": 512,
    "hidden_size": 256,
    "num_hidden_layers": 4,
}
model_args.discriminator_config = {
    "embedding_size": 512,
    "hidden_size": 256,
    "num_hidden_layers": 16,
}

model = LanguageModelingModel(
    "electra",
    None,
    args=model_args,
    train_files=train_file,
)
Class LanguageModelingModel
simpletransformers.language_modeling.LanguageModelingModel(self, model_type, model_name, args=None, use_cuda=True, cuda_device=-1, **kwargs)
Initializes a LanguageModelingModel model.
Parameters
- model_type (str) - The type of model to use (model types)
- model_name (str) - The exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, the path to a directory containing model files, or None to train a Language Model from scratch.
- generator_name (str, optional) - A pretrained model name or path to a directory containing an ELECTRA generator model. (See ELECTRA models)
- discriminator_name (str, optional) - A pretrained model name or path to a directory containing an ELECTRA discriminator model. (See ELECTRA models)
- train_files (str or List, optional) - A file or a List of files to be used when training the tokenizer. Required if the tokenizer is being trained from scratch.
- args (dict, optional) - Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
- use_cuda (bool, optional) - Use GPU if available. Setting to False will force the model to use CPU only. (See here)
- cuda_device (int, optional) - Specific GPU that should be used. Will use the first available GPU by default. (See here)
- kwargs (optional) - For providing proxies, force_download, resume_download, cache_dir and other options specific to the ‘from_pretrained’ implementation where this will be supplied. (See here)
Returns
None
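For example, a minimal sketch of pinning the model to a specific GPU with the use_cuda and cuda_device parameters described above:

from simpletransformers.language_modeling import LanguageModelingModel

# Use the second GPU (index 1); set use_cuda=False to force CPU-only execution
model = LanguageModelingModel(
    "bert",
    "bert-base-cased",
    use_cuda=True,
    cuda_device=1,
)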
Note: For configuration options common to all Simple Transformers models, please refer to the Configuring a Simple Transformers Model section.
Training a LanguageModelingModel
The train_model() method is used to train the model.
model.train_model(train_file)
simpletransformers.language_modeling.LanguageModelingModel.train_model(self, train_file, output_dir=None, show_running_loss=True, args=None, eval_file=None, verbose=True, **kwargs)
Trains the model using ‘train_file’
Parameters
- train_file (str) - Path to a text file containing the text to train the language model on. The model will be trained on this data. Refer to the Language Modeling Data Formats section for the correct formats.
- output_dir (str, optional) - The directory where model files will be saved. If not given, self.args['output_dir'] will be used.
- show_running_loss (bool, optional) - If True, the running loss (training loss at current step) will be logged to the console.
- args (dict, optional) - A dict of configuration options for the LanguageModelingModel. Any changes made will persist for the model.
- eval_file (str, optional) - Evaluation data (same format as train_file) against which evaluation will be performed when evaluate_during_training is enabled. Required if evaluate_during_training is enabled.
- kwargs (optional) - Additional metrics are not currently supported for Language Modeling.
Returns
None
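A short sketch combining these parameters (the file paths are placeholders, and evaluate_during_training is one of the common configuration options):

# Placeholder paths - point these at your own data
train_file = "data/train.txt"
eval_file = "data/eval.txt"

model.train_model(
    train_file,
    eval_file=eval_file,
    args={"evaluate_during_training": True},
)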
Note: For more details on training models with Simple Transformers, please refer to the Tips and Tricks section.
Evaluating a LanguageModelingModel
The eval_model() method is used to evaluate the model.
The following metrics will be calculated by default:

- perplexity - Perplexity is a score used to evaluate language models.
- eval_loss - Cross Entropy Loss for eval_file
result = model.eval_model(eval_file)
simpletransformers.language_modeling.LanguageModelingModel.eval_model(self, eval_file, output_dir=None, verbose=True, silent=False)
Evaluates the model using ‘eval_file’
Parameters
- eval_file (str) - Path to a text file containing the text to evaluate the language model on. The model will be evaluated on this data. Refer to the Language Modeling Data Formats section for the correct formats.
- output_dir (str, optional) - The directory where model files will be saved. If not given, self.args['output_dir'] will be used.
- verbose (bool, optional) - If verbose, results will be printed to the console on completion of evaluation.
- silent (bool, optional) - If silent, tqdm progress bars will be hidden.
Returns
- result (dict) - Dictionary containing evaluation results.
- texts (list) - A dictionary containing the 3 dictionaries correct_text, similar_text, and incorrect_text.
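A short usage sketch, assuming the returned result dictionary is keyed by the default metric names listed above:

result = model.eval_model(eval_file)

# Assumes the keys match the default metrics described above
print(result["perplexity"])
print(result["eval_loss"])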
Note: For more details on evaluating models with Simple Transformers, please refer to the Tips and Tricks section.