T5 Data Formats

A single input to a T5 model has the following pattern:

"<prefix>: <input_text> </s>"

The label sequence has the following pattern:

"<target_sequence> </s>"
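
The two patterns above can be sketched as plain string formatting. This is only an illustration of the layout; the helper names build_input and build_label are hypothetical and are not part of the library.

```python
EOS_TOKEN = "</s>"  # end-of-sequence marker shown in the patterns above


def build_input(prefix: str, input_text: str) -> str:
    """Return a string shaped like "<prefix>: <input_text> </s>"."""
    return f"{prefix}: {input_text} {EOS_TOKEN}"


def build_label(target_sequence: str) -> str:
    """Return a string shaped like "<target_sequence> </s>"."""
    return f"{target_sequence} {EOS_TOKEN}"


example_input = build_input("binary classification", "Anakin was Luke's father")
example_label = build_label("1")
print(example_input)
print(example_label)
```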

Train Data Format

Used with train_model()

The train data should be a Pandas DataFrame with three columns: prefix, input_text, and target_text.

  • prefix: A string indicating the task to perform. (E.g. "binary classification", "generate question")
  • input_text: The input text sequence. The prefix is automatically prepended to form the full input (<prefix>: <input_text>).
  • target_text: The target sequence.

If preprocess_inputs is set to True in the model args, the " </s>" tokens (including the preceding space) and the ": " prefix separator (a colon followed by a trailing space) between prefix and input_text are added automatically. Otherwise, the input DataFrame must already contain the " </s>" tokens (including the preceding space) and the ": " prefix separator (including the trailing space).
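
When preprocess_inputs is off, the separator and end tokens have to be written into the DataFrame by hand. The sketch below shows one plausible layout. It assumes the separator is attached to the prefix column so that prefix and input_text can simply be joined; that placement is an assumption made for illustration, and format_manually is a hypothetical helper, not a library function.

```python
import pandas as pd

PREFIX_SEPARATOR = ": "  # colon plus trailing space
EOS_SUFFIX = " </s>"     # end token plus preceding space


def format_manually(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the separator and end tokens written into the columns."""
    out = df.copy()
    # Assumption: the separator lives at the end of the prefix column.
    out["prefix"] = out["prefix"] + PREFIX_SEPARATOR
    out["input_text"] = out["input_text"] + EOS_SUFFIX
    # Cast targets to str so numeric labels such as 1 and 0 can take the suffix.
    out["target_text"] = out["target_text"].astype(str) + EOS_SUFFIX
    return out


raw_df = pd.DataFrame(
    [["binary classification", "Luke was a Sith Lord", 0]],
    columns=["prefix", "input_text", "target_text"],
)
manual_df = format_manually(raw_df)
full_input = manual_df.loc[0, "prefix"] + manual_df.loc[0, "input_text"]
print(full_input)
```

Joining the two columns reproduces the "<prefix>: <input_text> </s>" pattern, and the original DataFrame is left unchanged because the helper works on a copy.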

| prefix | input_text | target_text |
|---|---|---|
| binary classification | Anakin was Luke’s father | 1 |
| binary classification | Luke was a Sith Lord | 0 |
| generate question | Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon | Who created the Star Wars franchise? |
| generate question | Anakin was Luke’s father | Who was Luke’s father? |

import pandas as pd

# Each row is [prefix, input_text, target_text]
train_data = [
    ["binary classification", "Anakin was Luke's father", 1],
    ["binary classification", "Luke was a Sith Lord", 0],
    ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data, columns=["prefix", "input_text", "target_text"])

Evaluation Data Format

Used with eval_model()

The evaluation data format is identical to the train data format.

| prefix | input_text | target_text |
|---|---|---|
| binary classification | Leia was Luke’s sister | 1 |
| binary classification | Han was a Sith Lord | 0 |
| generate question | In 2020, the Star Wars franchise’s total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time. | What is the total value of the Star Wars franchise? |
| generate question | Leia was Luke’s sister | Who was Luke’s sister? |

import pandas as pd

# Each row is [prefix, input_text, target_text]
eval_data = [
    ["binary classification", "Leia was Luke's sister", 1],
    ["binary classification", "Han was a Sith Lord", 0],
    ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
    ["generate question", "Leia was Luke's sister", "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data, columns=["prefix", "input_text", "target_text"])
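
Because the train and evaluation formats are identical, a single check can cover both DataFrames before they are handed to the model. The following is a minimal sketch; missing_columns is a hypothetical helper and not something the library provides.

```python
import pandas as pd

REQUIRED_COLUMNS = ["prefix", "input_text", "target_text"]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the required column names that are absent from df."""
    return [name for name in REQUIRED_COLUMNS if name not in df.columns]


good_df = pd.DataFrame(
    [["binary classification", "Leia was Luke's sister", 1]],
    columns=["prefix", "input_text", "target_text"],
)
# A frame that forgot the prefix column
bad_df = pd.DataFrame(
    [["Leia was Luke's sister", 1]],
    columns=["input_text", "target_text"],
)
print(missing_columns(good_df))
print(missing_columns(bad_df))
```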

Prediction Data Format

Used with predict()

The prediction data should be a list of strings with the prefix and the prefix separator (: ) included.

If preprocess_inputs is set to True in the model args, the " </s>" token (including the preceding space) is automatically added to each string in the list. Otherwise, each string must already end with the " </s>" token (including the preceding space).

Note: Unlike with training and evaluation, the prefix separator is NOT added in prediction even when preprocess_inputs is set to True.

to_predict = [
    "binary classification: Luke blew up the first Death Star",
    "generate question: In 1971, George Lucas wanted to film an adaptation of the Flash Gordon serial, but could not obtain the rights, so he began developing his own space opera.",
]
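
Since the prefix separator is never added for prediction, it can help to assemble the strings from (prefix, text) pairs in one place. The sketch below does that, with an optional flag for appending the end token when preprocess_inputs is False. The function build_prediction_inputs and its add_end_token parameter are hypothetical names used for illustration.

```python
def build_prediction_inputs(pairs, add_end_token=False):
    """Turn (prefix, text) pairs into prediction strings.

    The ": " separator is always written, because it is not added
    automatically at prediction time. Set add_end_token=True only when
    preprocess_inputs is False and the " </s>" token must be supplied.
    """
    suffix = " </s>" if add_end_token else ""
    return [f"{prefix}: {text}{suffix}" for prefix, text in pairs]


pairs = [
    ("binary classification", "Luke blew up the first Death Star"),
    ("generate question", "Leia was Luke's sister"),
]
to_predict_auto = build_prediction_inputs(pairs)
to_predict_manual = build_prediction_inputs(pairs, add_end_token=True)
print(to_predict_auto[0])
print(to_predict_manual[0])
```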
