Multi-Modal Classification Specifics

Multi-Modal Classification fuses text and image data in a single classifier. It uses the multimodal bitransformer models introduced in the paper "Supervised Multimodal Bitransformers for Classifying Images and Text".

Usage Steps

Performing Multi-Modal Classification in Simple Transformers follows the standard pattern:

  1. Initialize a Model
  2. Train the model with train_model()
  3. Evaluate the model with eval_model()
  4. Make predictions on (unlabelled) data with predict()
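The four steps above can be sketched roughly as follows. This is a minimal sketch, not a reference implementation: the function name run_multimodal_pipeline, its parameters, the "bert-base-uncased" checkpoint, and the "text", "labels" and "images" column names are assumptions made for illustration, and the keyword arguments should be checked against the installed version of Simple Transformers.

```python
def run_multimodal_pipeline(train_df, eval_df, to_predict, image_dir, label_list):
    """Hypothetical helper walking through the four usage steps.

    train_df / eval_df: DataFrames assumed to hold "text", "labels" and
        "images" columns (image file names relative to image_dir).
    to_predict: dict assumed to look like {"text": [...], "images": [...]}.
    """
    # Imported inside the function so that merely defining the helper
    # does not require the library to be installed.
    from simpletransformers.classification import MultiModalClassificationModel

    # 1. Initialize a model. "bert" is the model code; the checkpoint
    #    name is an assumption chosen for this sketch.
    model = MultiModalClassificationModel(
        "bert", "bert-base-uncased", label_list=label_list
    )

    # 2. Train the model on the labelled training data.
    model.train_model(train_df, image_path=image_dir)

    # 3. Evaluate the model on held-out labelled data.
    result, model_outputs = model.eval_model(eval_df, image_path=image_dir)

    # 4. Make predictions on unlabelled data.
    predictions, raw_outputs = model.predict(to_predict, image_path=image_dir)

    return result, predictions
```

Keeping the import inside the function is only a convenience for this sketch; in a real script it would normally sit at the top of the file.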

Supported Model Types

Model    Model code
BERT     bert

Tip: The model code is used to specify the model_type in a Simple Transformers model.

Label formats

With Multi-Modal Classification, labels are always given as strings. You may specify the set of labels by passing a list to the label_list argument when creating the model. If label_list is given, num_labels is not required.

If label_list is not given, num_labels is required and the labels should be the strings "0", "1", and so on, up to but not including "<num_labels>" (for example, "0", "1", "2" when num_labels is 3).
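The interaction between label_list and num_labels can be illustrated with a small stand-alone helper. This is a sketch of the rule as described here, not the library's own code: resolve_labels is a hypothetical name, and it assumes the default labels are zero-based with num_labels itself excluded.

```python
def resolve_labels(label_list=None, num_labels=None):
    """Return the string labels implied by label_list / num_labels.

    Hypothetical helper for illustration only; it is not part of
    Simple Transformers.
    """
    if label_list is not None:
        # Labels are always strings, so normalise whatever was passed in.
        labels = [str(label) for label in label_list]
        if num_labels is not None and num_labels != len(labels):
            raise ValueError("num_labels does not match len(label_list).")
        return labels

    if num_labels is None:
        raise ValueError("Either label_list or num_labels must be given.")

    # Default labels (assumed): "0", "1", ..., str(num_labels - 1).
    return [str(i) for i in range(num_labels)]


explicit = resolve_labels(label_list=["cat", "dog"])
default = resolve_labels(num_labels=3)
```

Here explicit keeps the labels that were passed in, while default falls back to the numeric strings.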