Question Answering Data Formats
For question answering tasks, the input data can be a JSON file or a Python list of dictionaries in the correct format. The structure of both formats is identical, i.e. the input may be a string pointing to a JSON file containing a list of dictionaries, or it may be a list of dictionaries itself.
Input Structure
The input data should be a single list of dictionaries (or path to a JSON file containing the same). A dictionary represents a single context and its associated questions.
Each such dictionary contains two attributes, "context" and "qas".
context: The paragraph or text from which the question is asked.
qas: A list of questions and answers (format below).
Questions and answers are represented as dictionaries. Each dictionary in qas has the following format.
id: (string) A unique ID for the question. Should be unique across the entire dataset.
question: (string) A question.
is_impossible: (bool) True if the question cannot be answered from the context, False otherwise.
answers: (list) The list of correct answers to the question.
A single answer is represented by a dictionary with the following attributes.
text: (string) The answer to the question. Must be a substring of the context.
answer_start: (int) Starting index of the answer in the context.
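The constraints above (unique IDs, answer text being a substring of the context, and answer_start pointing at that substring) can be checked before training. The following is a minimal sketch; find_format_errors is a hypothetical helper written for illustration and is not part of Simple Transformers.

```python
def find_format_errors(data):
    """Return a list of human-readable problems found in QA-format data.

    Hypothetical helper: checks ID uniqueness and that each answer text
    appears in the context exactly at answer_start.
    """
    errors = []
    seen_ids = set()
    for entry in data:
        context = entry["context"]
        for qa in entry["qas"]:
            if qa["id"] in seen_ids:
                errors.append(f"duplicate id: {qa['id']}")
            seen_ids.add(qa["id"])
            for answer in qa["answers"]:
                text = answer["text"]
                start = answer["answer_start"]
                # The slice starting at answer_start must equal the answer text.
                if context[start:start + len(text)] != text:
                    errors.append(f"{qa['id']}: {text!r} not found at index {start}")
    return errors


sample = [
    {
        "context": "Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.",
        "qas": [
            {
                "id": "00001",
                "is_impossible": False,
                "question": "Who is the author of the Mistborn series?",
                "answers": [{"text": "Brandon Sanderson", "answer_start": 71}],
            }
        ],
    }
]

print(find_format_errors(sample))  # prints []
```

If an index is unknown, context.find(text) gives the position of the first occurrence, which is the correct answer_start only when the answer appears once in the context.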
Train Data Format
Used with train_model()
Train data can be in the form of a path to a JSON file or a list of dictionaries in the structure specified.
Note: There cannot be multiple correct answers to a single question during training. Each question must have a single answer (or an empty string as the answer with is_impossible=True).
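The single-answer rule can be verified with a short check before calling train_model(). This is only a sketch: check_training_answers is a hypothetical helper, and it assumes that an unanswerable question may carry either an empty answers list (as in the example data on this page) or one answer whose text is an empty string.

```python
def check_training_answers(train_data):
    """Return the IDs of questions that break the single-answer rule.

    Hypothetical helper, not part of Simple Transformers.
    """
    bad_ids = []
    for entry in train_data:
        for qa in entry["qas"]:
            answers = qa["answers"]
            if qa["is_impossible"]:
                # Assumption: no answers, or one empty-string answer, is acceptable.
                ok = len(answers) == 0 or (
                    len(answers) == 1 and answers[0]["text"] == ""
                )
            else:
                # Answerable questions need exactly one answer for training.
                ok = len(answers) == 1
            if not ok:
                bad_ids.append(qa["id"])
    return bad_ids


candidate = [
    {
        "context": "Vin is a Mistborn of great power and skill.",
        "qas": [
            {
                "id": "a",
                "is_impossible": False,
                "question": "Who is a Mistborn?",
                "answers": [
                    {"text": "Vin", "answer_start": 0},
                    {"text": "Vin is a Mistborn", "answer_start": 0},
                ],
            },
            {
                "id": "b",
                "is_impossible": True,
                "question": "Where was Vin born?",
                "answers": [],
            },
        ],
    }
]

print(check_training_answers(candidate))  # prints ['a']
```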
List of dictionaries
train_data = [
    {
        "context": "Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.",
        "qas": [
            {
                "id": "00001",
                "is_impossible": False,
                "question": "Who is the author of the Mistborn series?",
                "answers": [
                    {
                        "text": "Brandon Sanderson",
                        "answer_start": 71,
                    }
                ],
            }
        ],
    },
    {
        # Adjacent string literals are concatenated, so the first one ends with a space.
        "context": "The first series, published between 2006 and 2008, consists of The Final Empire, "
        "The Well of Ascension, and The Hero of Ages.",
        "qas": [
            {
                "id": "00002",
                "is_impossible": False,
                "question": "When was the series published?",
                "answers": [
                    {
                        "text": "between 2006 and 2008",
                        "answer_start": 28,
                    }
                ],
            },
            {
                "id": "00003",
                "is_impossible": False,
                "question": "What are the three books in the series?",
                "answers": [
                    {
                        "text": "The Final Empire, The Well of Ascension, and The Hero of Ages",
                        "answer_start": 63,
                    }
                ],
            },
            {
                "id": "00004",
                "is_impossible": True,
                "question": "Who is the main character in the series?",
                "answers": [],
            },
        ],
    },
]
JSON file
train_data = "data/train.json"
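Because the JSON file contains the same list of dictionaries, it can be produced from an in-memory list with the standard json module. The sketch below writes to a temporary directory so that it runs anywhere; the variable names and the temporary path are illustrative assumptions, not something the library requires.

```python
import json
import tempfile
from pathlib import Path

train_records = [
    {
        "context": "Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.",
        "qas": [
            {
                "id": "00001",
                "is_impossible": False,
                "question": "Who is the author of the Mistborn series?",
                "answers": [{"text": "Brandon Sanderson", "answer_start": 71}],
            }
        ],
    }
]

# Illustrative location; any writable path works.
train_path = Path(tempfile.mkdtemp()) / "train.json"

# json.dump converts Python's False/True to JSON's false/true.
with train_path.open("w", encoding="utf-8") as f:
    json.dump(train_records, f, ensure_ascii=False, indent=2)

# The path string is what would be passed in place of the list.
train_data = str(train_path)

# Reading the file back yields the original list of dictionaries.
with open(train_data, encoding="utf-8") as f:
    reloaded = json.load(f)
```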
Evaluation Data Format
Used with eval_model()
Evaluation data can be in the form of a path to a JSON file or a list of dictionaries in the structure specified.
eval_data = [
    {
        "context": "The series primarily takes place in a region called the Final Empire "
        "on a world called Scadrial, where the sun and sky are red, vegetation is brown, "
        "and the ground is constantly being covered under black volcanic ashfalls.",
        "qas": [
            {
                "id": "00001",
                "is_impossible": False,
                "question": "Where does the series take place?",
                "answers": [
                    {
                        "text": "region called the Final Empire",
                        "answer_start": 38,
                    },
                    {
                        "text": "world called Scadrial",
                        "answer_start": 74,
                    },
                ],
            }
        ],
    },
    {
        "context": "\"Mistings\" have only one of the many Allomantic powers, while \"Mistborns\" have all the powers.",
        "qas": [
            {
                "id": "00002",
                "is_impossible": False,
                "question": "How many powers does a Misting possess?",
                "answers": [
                    {
                        "text": "one",
                        "answer_start": 21,
                    }
                ],
            },
            {
                "id": "00003",
                "is_impossible": True,
                "question": "What are Allomantic powers?",
                "answers": [],
            },
        ],
    },
]
Prediction Data Format
Used with predict()
The predict() method of a Simple Transformers model is typically used to get a prediction from the model when the true label/answer is not known. Reflecting this, the predict() method of the QuestionAnsweringModel class expects a list of dictionaries which contains only contexts, questions, and a unique ID for each question.
The prediction data should be in the following format.
to_predict = [
    {
        "context": "Vin is a Mistborn of great power and skill.",
        "qas": [
            {
                "question": "What is Vin's speciality?",
                "id": "0",
            }
        ],
    }
]
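Data that is already in the train or evaluation format can be reduced to the prediction format by keeping only the context, question, and id fields. The following is a minimal sketch; to_prediction_format is a hypothetical helper, not a Simple Transformers function.

```python
def to_prediction_format(data):
    """Drop answers and is_impossible, keeping only context, question, and id."""
    return [
        {
            "context": entry["context"],
            "qas": [
                {"question": qa["question"], "id": qa["id"]}
                for qa in entry["qas"]
            ],
        }
        for entry in data
    ]


labelled = [
    {
        "context": '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers.',
        "qas": [
            {
                "id": "00002",
                "is_impossible": False,
                "question": "How many powers does a Misting possess?",
                "answers": [{"text": "one", "answer_start": 21}],
            }
        ],
    }
]

unlabelled = to_prediction_format(labelled)
print(unlabelled[0]["qas"])
# prints [{'question': 'How many powers does a Misting possess?', 'id': '00002'}]
```

The helper builds new dictionaries, so the labelled input is left unchanged.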
Lazy Loading Data Format
To use lazy loading, the training data (train_data) must be given as a path (str) to a JSONL file.
The structure of the JSON object is identical to the normal Question Answering train data format.
Note: Currently, lazy loading is only supported for training. The full eval_data will be loaded to memory.
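A JSONL file holds one JSON object per line. The sketch below converts a list in the normal train format to JSONL under the assumption that each line carries one context dictionary (a "context" with its "qas"); that per-line layout is inferred from the description above rather than confirmed by it, and write_jsonl is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (assumed: one context dictionary each)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps emits no newlines by default, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {
        "context": "Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.",
        "qas": [
            {
                "id": "00001",
                "is_impossible": False,
                "question": "Who is the author of the Mistborn series?",
                "answers": [{"text": "Brandon Sanderson", "answer_start": 71}],
            }
        ],
    },
    {
        "context": "Vin is a Mistborn of great power and skill.",
        "qas": [
            {
                "id": "00002",
                "is_impossible": True,
                "question": "Where was Vin born?",
                "answers": [],
            }
        ],
    },
]

# Illustrative location; any writable path works.
jsonl_path = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(records, jsonl_path)

# The path string is what would be passed as train_data.
train_data = str(jsonl_path)

# Each line parses independently back into one dictionary.
with open(train_data, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
```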