Classification Data Formats

This section describes the required input data formats for each classification sub-task.

Train Data Format

Used with train_model()

Binary classification

The train data should be contained in a Pandas DataFrame with at least two columns. One column should contain the text and the other should contain the labels. The text column should be of datatype str, while the labels column should be of datatype int (0 or 1).

If the dataframe has a header row, the text column should have the heading text and the labels column should have the heading labels.

text | labels
Aragorn was the heir of Isildur | 1
Frodo was the heir of Isildur | 0
import pandas as pd

# Each row is [text, label]; the label is 0 or 1.
train_data = [
    ["Aragorn was the heir of Isildur", 1],
    ["Frodo was the heir of Isildur", 0],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]
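A quick sanity check before training can catch format problems early. The helper below is a hypothetical sketch, not part of Simple Transformers; it only tests the requirements stated above (a text column of strings and a labels column of ints restricted to 0 or 1).

```python
import pandas as pd


def check_binary_train_df(df):
    """Return a list of format problems; an empty list means none were found."""
    problems = [f"missing column: {c}" for c in ("text", "labels") if c not in df.columns]
    if problems:
        return problems
    # Every text entry should be a Python str.
    if not df["text"].map(lambda value: isinstance(value, str)).all():
        problems.append("text column contains non-string values")
    # Labels should be integers, and only 0 or 1 for binary classification.
    if not pd.api.types.is_integer_dtype(df["labels"]):
        problems.append("labels column is not an integer dtype")
    elif not df["labels"].isin([0, 1]).all():
        problems.append("labels column contains values other than 0 and 1")
    return problems


train_df = pd.DataFrame(
    [["Aragorn was the heir of Isildur", 1], ["Frodo was the heir of Isildur", 0]],
    columns=["text", "labels"],
)
problems = check_binary_train_df(train_df)
```

An empty `problems` list means the dataframe passed these checks; it does not guarantee that training will succeed.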

Multi-class classification

Identical to binary classification, except the labels are integers from 0 to n-1, where n is the number of distinct labels (classes).

text | labels
Aragorn was the heir of Isildur | 1
Frodo was the heir of Isildur | 0
Pippin is stronger than Merry | 2
import pandas as pd

# Each row is [text, label]; the label is an integer class ID.
train_data = [
    ["Aragorn was the heir of Isildur", 1],
    ["Frodo was the heir of Isildur", 0],
    ["Pippin is stronger than Merry", 2],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]
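Raw datasets often store class names as strings rather than integers. The sketch below shows one way to map them to consecutive integer IDs starting from 0. The `class_name` column and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical raw data with string class names.
raw_df = pd.DataFrame(
    [
        ["Aragorn was the heir of Isildur", "true"],
        ["Frodo was the heir of Isildur", "false"],
        ["Pippin is stronger than Merry", "unknown"],
    ],
    columns=["text", "class_name"],
)

# Sort the distinct names so the mapping is reproducible across runs.
class_names = sorted(raw_df["class_name"].unique())
label_to_id = {name: idx for idx, name in enumerate(class_names)}

train_df = pd.DataFrame(
    {
        "text": raw_df["text"].astype(str),
        "labels": raw_df["class_name"].map(label_to_id).astype(int),
    }
)
num_labels = len(class_names)
```

Keeping `label_to_id` around makes it possible to translate predicted IDs back into class names later.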

Data format for LayoutLM models

The LayoutLM model (LayoutLM: Pre-training of Text and Layout for Document Image Understanding) is pre-trained to use both text and layout information for document image understanding and information extraction tasks.

Although the paper discusses using combinations of text, layout, and image features, Simple Transformers currently only supports text + layout as inputs.

The data format for LayoutLM is similar to the default format described above, but it also includes the bounding box information (x0, y0, x1, y1) in addition to the text. Here, x0 and y0 are lists holding the coordinates of the top-left vertices of the bounding boxes, and x1 and y1 are lists holding the coordinates of the bottom-right vertices. Each list contains one coordinate per word in text.

Note: The bounding box coordinates must be normalized to the range 0-1000, where (0,0) is the top-left corner of the image.

text | labels | x0 | y0 | x1 | y1
Aragorn was the heir of Isildur | 1 | [10, 20, 30, 40, 50, 60] | [10, 10, 10, 10, 20, 20] | [20, 30, 40, 50, 60, 70] | [20, 20, 20, 20, 30, 40]
Frodo was the heir of Isildur | 0 | [15, 20, 30, 40, 50, 60] | [10, 10, 10, 10, 20, 20] | [20, 30, 45, 50, 60, 70] | [20, 20, 20, 20, 30, 40]
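OCR tools usually report bounding boxes in pixels, so they need to be rescaled before use. The following is a minimal sketch of one way to do that, assuming each box is a pixel-space (x0, y0, x1, y1) tuple and that the page width and height are known. The function name, the sample words, and the page size are all hypothetical.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )


# Hypothetical OCR output: one pixel-space box per word.
words = ["Aragorn", "was", "the", "heir"]
pixel_boxes = [
    (50, 100, 150, 130),
    (160, 100, 200, 130),
    (210, 100, 250, 130),
    (260, 100, 320, 130),
]
page_width, page_height = 1000, 2000

normalized = [normalize_box(box, page_width, page_height) for box in pixel_boxes]
# Transpose the per-word boxes into the four per-coordinate lists.
x0, y0, x1, y1 = (list(column) for column in zip(*normalized))

# One dataframe row in the order: text, labels, x0, y0, x1, y1.
row = [" ".join(words), 1, x0, y0, x1, y1]
```

Because (0,0) is the top-left corner, x values are scaled by the page width and y values by the page height.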

Warning: Pandas can cause issues when saving and loading lists stored in a column. After loading, check whether your lists have been converted to strings!

Regression

Identical to binary classification, except the labels are continuous values and the labels column is of type float.

text | labels
Aragorn was the heir of Isildur | 1.0
Frodo was the heir of Isildur | 0.0
Pippin is stronger than Merry | 0.3
import pandas as pd

# Each row is [text, label]; the label is a continuous float value.
train_data = [
    ["Aragorn was the heir of Isildur", 1.0],
    ["Frodo was the heir of Isildur", 0.0],
    ["Pippin is stronger than Merry", 0.3],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

Multi-label classification

Identical to binary classification, except the labels are lists of ints and the labels column is of type list.

text | labels
Aragorn was the heir of Isildur | [0, 1]
Frodo was the heir of Isildur | [0, 0]
Pippin is stronger than Merry | [1, 1]
import pandas as pd

# Each row is [text, labels]; labels is a list with one 0/1 entry per label.
train_data = [
    ["Aragorn was the heir of Isildur", [0, 1]],
    ["Frodo was the heir of Isildur", [0, 0]],
    ["Pippin is stronger than Merry", [1, 1]],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

Note: Each distinct label can only take the values 0 or 1. I.e., multi-class-multi-label classification is not currently supported.
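If the raw annotations are sets of tag names rather than 0/1 lists, they can be converted to the list format with a small helper. This is an illustrative sketch; the tag names and the `to_multi_hot` function are hypothetical.

```python
# Hypothetical label vocabulary; the position in this list fixes the position
# of each label inside the 0/1 list.
label_names = ["royalty", "hobbit"]


def to_multi_hot(tags, label_names):
    """Return a list with 1 where the label is present in tags and 0 otherwise."""
    return [1 if name in tags else 0 for name in label_names]


# Hypothetical annotated samples: (text, set of tags).
samples = [
    ("Aragorn was the heir of Isildur", {"royalty"}),
    ("Frodo was the heir of Isildur", {"hobbit"}),
    ("Pippin is stronger than Merry", {"hobbit", "royalty"}),
    ("Gandalf was a Wizard", set()),
]

# Rows in the [text, labels] layout used for multi-label training data.
train_data = [[text, to_multi_hot(tags, label_names)] for text, tags in samples]
```

Every labels list has the same length as `label_names`, and a sample with no tags becomes a list of zeros.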

Warning: Pandas can cause issues when saving and loading lists stored in a column. After loading, check whether your lists have been converted to strings!
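The sketch below reproduces the problem with a CSV round trip and shows one possible fix, parsing the strings back into lists with `ast.literal_eval` from the Python standard library. An in-memory buffer stands in for a file on disk here.

```python
import ast
import io

import pandas as pd

train_df = pd.DataFrame(
    [
        ["Aragorn was the heir of Isildur", [0, 1]],
        ["Frodo was the heir of Isildur", [0, 0]],
    ],
    columns=["text", "labels"],
)

# Save to CSV and load it back (the buffer stands in for a real file).
buffer = io.StringIO()
train_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded_df = pd.read_csv(buffer)

# After the round trip, each labels entry is a string such as "[0, 1]".
labels_are_strings = isinstance(loaded_df["labels"].iloc[0], str)

# Parse the strings back into Python lists.
loaded_df["labels"] = loaded_df["labels"].apply(ast.literal_eval)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.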

Evaluation Data Format

Used with eval_model()

The evaluation data format is identical to the train data format.
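Since both formats are the same, a single labelled dataframe can be split into a train set and an evaluation set. The following is a minimal sketch using `DataFrame.sample`; the 80/20 ratio, the seed, and the labels on the last two rows are arbitrary choices made for illustration.

```python
import pandas as pd

# Hypothetical labelled data in the binary classification format.
df = pd.DataFrame(
    [
        ["Aragorn was the heir of Isildur", 1],
        ["Frodo was the heir of Isildur", 0],
        ["Pippin is stronger than Merry", 0],
        ["Gandalf was a Wizard", 1],
        ["Sam was a Wizard", 0],
    ],
    columns=["text", "labels"],
)

# Randomly pick 80% of the rows for training; the remaining rows form the eval set.
train_df = df.sample(frac=0.8, random_state=42)
eval_df = df.drop(train_df.index)

# Reset the indices so both dataframes are numbered from 0.
train_df = train_df.reset_index(drop=True)
eval_df = eval_df.reset_index(drop=True)
```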

Binary classification

text | labels
Aragorn was the heir of Isildur | 1
Frodo was the heir of Isildur | 0

Multi-class classification

text | labels
Aragorn was the heir of Isildur | 1
Frodo was the heir of Isildur | 0
Pippin is stronger than Merry | 2

Regression

text | labels
Aragorn was the heir of Isildur | 1.0
Frodo was the heir of Isildur | 0.0
Pippin is stronger than Merry | 0.3

Multi-label classification

text | labels
Aragorn was the heir of Isildur | [0, 1]
Frodo was the heir of Isildur | [0, 0]
Pippin is stronger than Merry | [1, 1]

Data format for LayoutLM models

The LayoutLM model (LayoutLM: Pre-training of Text and Layout for Document Image Understanding) is pre-trained to use both text and layout information for document image understanding and information extraction tasks.

Although the paper discusses using combinations of text, layout, and image features, Simple Transformers currently only supports text + layout as inputs.

The data format for LayoutLM is similar to the default format described above, but it also includes the bounding box information (x0, y0, x1, y1) in addition to the text. Here, x0 and y0 are lists holding the coordinates of the top-left vertices of the bounding boxes, and x1 and y1 are lists holding the coordinates of the bottom-right vertices. Each list contains one coordinate per word in text.

Note: The bounding box coordinates must be normalized to the range 0-1000, where (0,0) is the top-left corner of the image.

text | labels | x0 | y0 | x1 | y1
Aragorn was the heir of Isildur | 1 | [10, 20, 30, 40, 50, 60] | [10, 10, 10, 10, 20, 20] | [20, 30, 40, 50, 60, 70] | [20, 20, 20, 20, 30, 40]
Frodo was the heir of Isildur | 0 | [15, 20, 30, 40, 50, 60] | [10, 10, 10, 10, 20, 20] | [20, 30, 45, 50, 60, 70] | [20, 20, 20, 20, 30, 40]

Prediction Data Format

Used with predict()

The prediction data must be a list of strings.

to_predict = [
    "Gandalf was a Wizard",
    "Sam was a Wizard",
]

This format is identical for binary classification, multi-class classification, regression, and multi-label classification.
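When the texts to classify live in a dataframe, the column can be converted to a plain list of strings first. A minimal sketch, assuming a hypothetical dataframe `new_df` with a text column that may contain missing values:

```python
import pandas as pd

# Hypothetical unlabelled data; the last entry is missing.
new_df = pd.DataFrame({"text": ["Gandalf was a Wizard", "Sam was a Wizard", None]})

# Drop missing entries, make sure every value is a str, and convert to a list.
to_predict = new_df["text"].dropna().astype(str).tolist()
```

Dropping missing values shortens the list, so keep the filtered index if the predictions must be matched back to the original rows.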

Data format for LayoutLM models

The prediction data must be a list of lists, where each inner list holds the text followed by the x0, y0, x1, and y1 coordinate lists. For example:

to_predict = [
    [
        "OCR text from long page one",
        [1, 2, 3, 4, 5, 6], # x0 values for each word
        [11, 12, 13, 14, 15, 16], # y0 values for each word
        [21, 22, 23, 24, 25, 26], # x1 values for each word
        [31, 32, 33, 34, 35, 36], # y1 values for each word
    ],
    [
        "OCR text from long page two",
        [1, 2, 3, 4, 5, 6], # x0 values for each word
        [11, 12, 13, 14, 15, 16], # y0 values for each word
        [21, 22, 23, 24, 25, 26], # x1 values for each word
        [31, 32, 33, 34, 35, 36], # y1 values for each word
    ],
]

This format is identical for binary classification, multi-class classification, regression, and multi-label classification.

Note: The bounding box coordinates must be normalized to the range 0-1000, where (0,0) is the top-left corner of the image.
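A mismatch between the number of words and the number of coordinates is easy to introduce when assembling these lists by hand. The hypothetical helper below sketches a basic consistency check before prediction. It assumes that words are separated by whitespace, which may not match how your OCR tool tokenizes text.

```python
def build_layoutlm_sample(text, x0, y0, x1, y1):
    """Return [text, x0, y0, x1, y1] after basic consistency checks."""
    n_words = len(text.split())  # assumption: words are whitespace-separated
    for name, coords in (("x0", x0), ("y0", y0), ("x1", x1), ("y1", y1)):
        # Each coordinate list needs exactly one value per word.
        if len(coords) != n_words:
            raise ValueError(f"{name} has {len(coords)} values for {n_words} words")
        # Coordinates must already be normalized to the 0-1000 range.
        if any(not 0 <= value <= 1000 for value in coords):
            raise ValueError(f"{name} contains values outside 0-1000")
    return [text, list(x0), list(y0), list(x1), list(y1)]


to_predict = [
    build_layoutlm_sample(
        "OCR text from long page one",
        [1, 2, 3, 4, 5, 6],
        [11, 12, 13, 14, 15, 16],
        [21, 22, 23, 24, 25, 26],
        [31, 32, 33, 34, 35, 36],
    ),
]
```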

Sentence-Pair Data Format

When performing sentence-pair tasks (e.g. sentence similarity), both the training and evaluation dataframes must contain a header row. The dataframes must also have at least three columns: text_a, text_b, and labels.

text_a | text_b | labels
Gimli fought with a battle axe | Gimli’s preferred weapon was a battle axe | 1
Legolas was an expert archer | Legolas was taller than Gimli | 0
import pandas as pd

# Each row is [text_a, text_b, label].
train_data = [
    [
        "Gimli fought with a battle axe",
        "Gimli's preferred weapon was a battle axe",
        1,
    ],
    [
        "Legolas was an expert archer",
        "Legolas was taller than Gimli",
        0,
    ],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text_a", "text_b", "labels"]

The input to the predict() method in sentence-pair tasks must be a list of lists, where the inner list contains two sentences (text_a and text_b for a single sample) while the outer list is the list of all samples.

to_predict = [
    ["Gimli fought with a battle axe", "Gimli's preferred weapon was a battle axe"],
    ["Legolas was an expert archer", "Legolas was taller than Gimli"],
]

Everything else is identical to the single-sentence data formats.
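If the sentence pairs are already stored in a dataframe, the list-of-lists input for predict() can be derived from its two text columns. A minimal sketch, assuming a hypothetical dataframe `pairs_df` with text_a and text_b columns:

```python
import pandas as pd

# Hypothetical unlabelled sentence pairs.
pairs_df = pd.DataFrame(
    [
        ["Gimli fought with a battle axe", "Gimli's preferred weapon was a battle axe"],
        ["Legolas was an expert archer", "Legolas was taller than Gimli"],
    ],
    columns=["text_a", "text_b"],
)

# Each inner list is [text_a, text_b] for one sample.
to_predict = pairs_df[["text_a", "text_b"]].astype(str).values.tolist()
```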

Lazy Loading Data Format

To use lazy loading, the data must be passed as a path to a file.

Warning: Lazy loading is not currently implemented for multi-label tasks.

The file format mirrors the structure of the corresponding dataframes in the normal input formats: one sample per row, with \t (tab) as the column separator.

Binary classification

Aragorn was the heir of Isildur    1
Frodo was the heir of Isildur    0

Multi-class classification

Aragorn was the heir of Isildur    1
Frodo was the heir of Isildur    0
Pippin is stronger than Merry    2

Regression

Aragorn was the heir of Isildur    1.0
Frodo was the heir of Isildur    0.0
Pippin is stronger than Merry    0.3

Sentence-pair classification

Gimli fought with a battle axe    Gimli's preferred weapon was a battle axe    1
Legolas was an expert archer    Legolas was taller than Gimli    0
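An existing dataframe can be written out in this tab-separated layout with `DataFrame.to_csv`. The sketch below writes no header row so that the file matches the examples shown here. Whether a header row is expected depends on your model configuration, which this section does not specify, so treat that choice as an assumption. The file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

train_df = pd.DataFrame(
    [
        ["Aragorn was the heir of Isildur", 1],
        ["Frodo was the heir of Isildur", 0],
    ],
    columns=["text", "labels"],
)

# A tab inside the text would be read as a column separator, so replace it.
train_df["text"] = train_df["text"].str.replace("\t", " ", regex=False)

# Hypothetical output location in a temporary directory.
train_path = os.path.join(tempfile.mkdtemp(), "train.tsv")

# One sample per row, tab-separated, no index column, no header (assumption).
train_df.to_csv(train_path, sep="\t", index=False, header=False)

with open(train_path, encoding="utf-8") as tsv_file:
    lines = tsv_file.read().splitlines()
```

`train_path` is then the value to pass in place of the dataframe.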
