Multi-Modal Classification fuses text and image data. This is performed using multi-modal bitransformer models introduced in the paper Supervised Multimodal Bitransformers for Classifying Images and Text.
The process of performing Multi-Modal Classification in Simple Transformers does not deviate from the standard pattern.
- Initialize a `MultiModalClassificationModel`
- Train the model with `train_model()`
- Evaluate the model with `eval_model()`
- Make predictions on (unlabelled) data with `predict()`
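The four steps above can be sketched as follows. This is a minimal illustration, not a full working script: it assumes `simpletransformers` is installed, uses `"bert"` / `"bert-base-uncased"` as example model arguments, and the `train_df`, `eval_df`, and `unlabelled_df` DataFrames are placeholders you must supply.

```python
# Hypothetical sketch of the standard Simple Transformers pattern applied to
# multi-modal classification. The import is deferred into the function so the
# sketch can be read (and the function defined) without the library installed.
def run_multimodal_example(train_df, eval_df, unlabelled_df):
    from simpletransformers.classification import MultiModalClassificationModel

    # 1. Initialize a MultiModalClassificationModel.
    #    The label_list values here are arbitrary example labels.
    model = MultiModalClassificationModel(
        "bert", "bert-base-uncased", label_list=["0", "1"]
    )

    # 2. Train the model.
    model.train_model(train_df)

    # 3. Evaluate the model.
    result, model_outputs = model.eval_model(eval_df)

    # 4. Make predictions on unlabelled data.
    predictions, raw_outputs = model.predict(unlabelled_df)
    return result, predictions
```

The exact columns expected in the DataFrames (text, image path, labels) follow the library's data-format conventions described elsewhere in these docs.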
Supported Model Types
|Model|Model code for `MultiModalClassificationModel`|
|---|---|
|BERT|bert|
Tip: The model code is used to specify the `model_type` in a Simple Transformers model.
With Multi-Modal Classification, labels are always given as strings. You may specify a list of labels by passing the list to the `label_list` argument when creating the model. If `label_list` is given, `num_labels` is not required.

If `label_list` is not given, `num_labels` is required and the labels should be strings from `"0"` up to `"<num_labels - 1>"`.