T5 Data Formats

A single input to a T5 model has the following pattern:

"<prefix>: <input_text> </s>"

The label sequence has the following pattern:

"<target_sequence> </s>"

Train Data Format

Used with train_model()

The training data should be a pandas DataFrame with three columns: prefix, input_text, and target_text.

prefix: A string indicating the task to perform. (E.g. "binary classification", "generate question")
input_text: The input text sequence. prefix is automatically prepended to form the full input. (<prefix>: <input_text>)
target_text: The target sequence
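How the three columns combine into the input and label patterns shown at the top of this page can be sketched in plain Python. The helper names format_input and format_target are hypothetical, chosen only for illustration; they are not part of any library API.

```python
EOS_TOKEN = "</s>"  # end-of-sequence marker from the patterns above


def format_input(prefix: str, input_text: str) -> str:
    """Hypothetical helper: build '<prefix>: <input_text> </s>'."""
    return f"{prefix}: {input_text} {EOS_TOKEN}"


def format_target(target_text) -> str:
    """Hypothetical helper: build '<target_sequence> </s>'."""
    # str() lets numeric labels such as 1 or 0 be used as target text.
    return f"{target_text} {EOS_TOKEN}"


print(format_input("binary classification", "Anakin was Luke's father"))
print(format_target(1))
```

The sketch only illustrates the string layout; tokenization and truncation are left out.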

If preprocess_inputs is set to True in the model args, the " </s>" token (including the preceding space) and the ": " prefix separator (including the trailing space) between prefix and input_text are added automatically. Otherwise, the input DataFrame must already contain the " </s>" tokens (including the preceding space) and the ": " prefix separator (including the trailing space).

prefix	input_text	target_text
binary classification	Anakin was Luke’s father	1
binary classification	Luke was a Sith Lord	0
generate question	Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon	Who created the Star Wars franchise?
generate question	Anakin was Luke’s father	Who was Luke’s father?

import pandas as pd

# Each row is [prefix, input_text, target_text].
train_data = [
    ["binary classification", "Anakin was Luke's father", 1],
    ["binary classification", "Luke was a Sith Lord", 0],
    ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]
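When preprocess_inputs is not set to True, the separator and end-of-sequence token have to be present in the data itself. The following is a minimal sketch of adding them to the rows before building the DataFrame. The function name add_special_tokens is hypothetical, and attaching the ": " separator to the end of the prefix column is an assumption: the text above only says that the DataFrame must contain it, not which column carries it.

```python
PREFIX_SEPARATOR = ": "  # prefix separator, including the trailing space
EOS_SUFFIX = " </s>"     # end-of-sequence token, including the preceding space


def add_special_tokens(rows):
    """Hypothetical helper: return rows with separator and EOS added by hand."""
    formatted = []
    for prefix, input_text, target_text in rows:
        formatted.append([
            prefix + PREFIX_SEPARATOR,      # assumption: separator goes on the prefix
            input_text + EOS_SUFFIX,
            str(target_text) + EOS_SUFFIX,  # str() handles numeric labels
        ])
    return formatted


raw_rows = [
    ["binary classification", "Luke was a Sith Lord", 0],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]
for row in add_special_tokens(raw_rows):
    print(row)
```

The returned rows can then be passed to pd.DataFrame in the same way as the unformatted rows.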

Evaluation Data Format

Used with eval_model()

The evaluation data format is identical to the train data format.

prefix	input_text	target_text
binary classification	Leia was Luke’s sister	1
binary classification	Han was a Sith Lord	0
generate question	In 2020, the Star Wars franchise’s total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.	What is the total value of the Star Wars franchise?
generate question	Leia was Luke’s sister	Who was Luke’s sister?

import pandas as pd

# Each row is [prefix, input_text, target_text].
eval_data = [
    ["binary classification", "Leia was Luke's sister", 1],
    ["binary classification", "Han was a Sith Lord", 0],
    ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
    ["generate question", "Leia was Luke's sister", "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["prefix", "input_text", "target_text"]

Prediction Data Format

Used with predict()

The prediction data should be a list of strings with the prefix and the prefix separator (: ) included.

If preprocess_inputs is set to True in the model args, the " </s>" token (including the preceding space) is automatically added to each string in the list. Otherwise, each string must already end with " </s>" (including the preceding space).

Note: Unlike with training and evaluation, the prefix separator is NOT added in prediction even when preprocess_inputs is set to True.

to_predict = [
    "binary classification: Luke blew up the first Death Star",
    "generate question: In 1971, George Lucas wanted to film an adaptation of the Flash Gordon serial, but could not obtain the rights, so he began developing his own space opera.",
]
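Because the prefix separator is never added at prediction time, it can help to build the prediction strings from (prefix, text) pairs in one place. The sketch below does this in plain Python. The function name build_prediction_inputs and its add_eos parameter are hypothetical; add_eos=True corresponds to the case where preprocess_inputs is not set to True and the " </s>" token must be supplied by hand.

```python
def build_prediction_inputs(pairs, add_eos=False):
    """Hypothetical helper: join each (prefix, text) pair with ': '.

    With add_eos=True, ' </s>' is appended to every string as well.
    """
    eos = " </s>" if add_eos else ""
    return [f"{prefix}: {text}{eos}" for prefix, text in pairs]


pairs = [
    ("binary classification", "Luke blew up the first Death Star"),
    ("generate question", "Leia was Luke's sister"),
]
to_predict = build_prediction_inputs(pairs)
for line in to_predict:
    print(line)
```

The resulting list has the same shape as the to_predict example above.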