T5 Data Formats
A single input to a T5 model has the following pattern;
1
"<prefix>: <input_text> </s>"
The label sequence has the following pattern;
1
"<target_sequence> </s>"
Train Data Format
Used with train_model()
The train data should be a Pandas DataFrame containing the 3 columns - prefix
, input_text
, target_text
.
prefix
: A string indicating the task to perform. (E.g."binary classification"
,"generate question"
)input_text
: The input text sequence.prefix
is automatically prepended to form the full input. (<prefix>: <input_text>
)target_text
: The target sequence
If preprocess_inputs
is set to True
in the model args
, then the < /s>
tokens (including preceeding space) and the :
(prefix separator including trailing separator) between prefix
and input_text
are automatically added. Otherwise, the input DataFrames must contain the < /s>
tokens (including preceeding space) and the :
(prefix separator including trailing separator).
prefix | input_text | target_text |
---|---|---|
binary classification | Anakin was Luke’s father | 1 |
binary classification | Luke was a Sith Lord | 0 |
generate question | Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon | Who created the Star Wars franchise? |
generate question | Anakin was Luke’s father | Who was Luke’s father? |
1
2
3
4
5
6
7
8
train_data = [
["binary classification", "Anakin was Luke's father" , 1],
["binary classification", "Luke was a Sith Lord" , 0],
["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]
Evaluation Data Format
Used with eval_model()
The evaluation data format is identical to the train data format.
prefix | input_text | target_text |
---|---|---|
binary classification | Leia was Luke’s sister | 1 |
binary classification | Han was a Sith Lord | 0 |
generate question | In 2020, the Star Wars franchise’s total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time. | What is the total value of the Star Wars franchise? |
generate question | Leia was Luke’s sister | Who was Luke’s sister? |
1
2
3
4
5
6
7
8
train_data = [
["binary classification", "Leia was Luke's sister" , 1],
["binary classification", "Han was a Sith Lord" , 0],
["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
["generate question", "Leia was Luke's sister" , "Who was Luke's sister?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]
Prediction Data Format
Used with predict()
The prediction data should be a list of strings with the prefix and the prefix separator (:
) included.
If preprocess_inputs
is set to True
in the model args
, then the ` < /s> token (including preceeding space) is automatically added to each string in the list. Otherwise, the strings must have the
< /s>` (including preceeding space) must be included.
Note: Unlike with training and evaluation, the prefix separator is NOT added in prediction even when preprocess_inputs
is set to True
.
1
2
3
4
to_predict = [
"binary classification: Luke blew up the first Death Star",
"generate question: In 1971, George Lucas wanted to film an adaptation of the Flash Gordon serial, but could not obtain the rights, so he began developing his own space opera.",
]