Retrieval Data Formats

Train Data Format

Used with train_model()

The train data should be a Pandas DataFrame with three columns: query_text, gold_passage, and title (title is optional). If use_hf_datasets is True, the train data may instead be the path to a TSV file with the same columns.

  • query_text: The query text sequence
  • gold_passage: The gold passage text sequence
  • title: The title of the gold passage
import pandas as pd

train_data = [
    {
        "query_text": "Who is the protagonist of Dune?",
        "title": "Dune (novel)",
        "gold_passage": 'Dune is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an inhospitable and sparsely populated desert wasteland, it is the only source of melange, or "spice", a drug that extends life and enhances mental abilities. Melange is also necessary for space navigation, which requires a kind of multidimensional awareness and foresight that only the drug provides. As melange can only be produced on Arrakis, control of the planet is a coveted and dangerous undertaking. The story explores the multilayered interactions of politics, religion, ecology, technology, and human emotion, as the factions of the empire confront each other in a struggle for the control of Arrakis and its spice.',
    },
    {
        "query_text": "Who is the author of Dune?",
        "title": "Dune (novel)",
        "gold_passage": "Dune is a 1965 science fiction novel by American author Frank Herbert, originally published as two separate serials in Analog magazine. It tied with Roger Zelazny's This Immortal for the Hugo Award in 1966 and it won the inaugural Nebula Award for Best Novel. It is the first installment of the Dune saga; in 2003, it was described as the world's best-selling science fiction novel.",
    }
]

train_df = pd.DataFrame(
    train_data
)
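
The TSV option mentioned above can be prepared from the same DataFrame. The following is a minimal sketch, assuming a tab-separated file with a header row naming the columns is what is expected; the file name train.tsv, the temporary directory, and the shortened passage are illustrative choices, not requirements stated in this document.

```python
import os
import tempfile

import pandas as pd

# Hypothetical, shortened row; real gold passages are usually full paragraphs.
train_df = pd.DataFrame(
    [
        {
            "query_text": "Who is the author of Dune?",
            "title": "Dune (novel)",
            "gold_passage": "Dune is a 1965 science fiction novel by American author Frank Herbert.",
        },
    ]
)

# Write a tab-separated file with a header row and without the index column.
train_path = os.path.join(tempfile.mkdtemp(), "train.tsv")
train_df.to_csv(train_path, sep="\t", index=False)

# Read the file back to confirm that the columns survive the round trip.
round_trip_df = pd.read_csv(train_path, sep="\t")
print(list(round_trip_df.columns))
```

The resulting path (train_path here) would then take the place of the DataFrame when use_hf_datasets is True.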

Evaluation Data Format

Used with eval_model()

The evaluation data format is identical to the train data format.

The evaluation data should be a Pandas DataFrame with three columns: query_text, gold_passage, and title (title is optional). If use_hf_datasets is True, the evaluation data may instead be the path to a TSV file with the same columns.

  • query_text: The query text sequence
  • gold_passage: The gold passage text sequence
  • title: The title of the gold passage
import pandas as pd

eval_data = [
    {
        "query_text": "How many Dune sequels did Herbert write?",
        "title": "Dune (novel)",
        "gold_passage": "Herbert wrote five sequels: Dune Messiah, Children of Dune, God Emperor of Dune, Heretics of Dune, and Chapterhouse: Dune. Following Herbert's death in 1986, his son Brian Herbert and author Kevin J. Anderson continued the series in over a dozen additional novels since 1999.",
    },
    {
        "query_text": "What is Arrakis?",
        "title": "Dune (novel)",
        "gold_passage": "Duke Leto Atreides of House Atreides, ruler of the ocean planet Caladan, is assigned by the Padishah Emperor Shaddam IV to serve as fief ruler of the planet Arrakis. Although Arrakis is a harsh and inhospitable desert planet, it is of enormous importance because it is the only planetary source of melange, or the \"spice\", a unique and incredibly valuable substance that extends human youth, vitality and lifespan — the official reason for its high demand in the Empire. It is also through the consumption of spice that the Guild navigators are able to navigate around the stars to find paths to planetary or spatial targets. Shaddam sees House Atreides as a potential future rival and threat, and conspires with House Harkonnen, currently in charge of spice harvesting on Arrakis and longstanding enemies of House Atreides, to destroy Leto and his family after their arrival. Leto is aware his assignment is a trap of some kind, but he must obey the Emperor’s orders.",
    }
]

eval_df = pd.DataFrame(
    eval_data
)
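
Because the train and evaluation formats share the same required columns, it can help to check a DataFrame before handing it to train_model() or eval_model(). The sketch below is one possible check; check_retrieval_columns and REQUIRED_COLUMNS are hypothetical names introduced for illustration and are not part of the library.

```python
import pandas as pd

# Columns described above as required; "title" is optional and not checked.
REQUIRED_COLUMNS = ("query_text", "gold_passage")


def check_retrieval_columns(df):
    """Hypothetical helper: raise ValueError if a required column is absent or has missing values."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")
    for col in REQUIRED_COLUMNS:
        if df[col].isna().any():
            raise ValueError(f"Column {col!r} contains missing values")
    return df


# A DataFrame without the optional title column passes the check.
sample_df = pd.DataFrame(
    {
        "query_text": ["What is Arrakis?"],
        "gold_passage": ["Arrakis is a harsh and inhospitable desert planet."],
    }
)
check_retrieval_columns(sample_df)
```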

Prediction Data Format

Used with predict()

The prediction data should be a list of strings.

to_predict = [
    "What is spice in Dune?",
    "What are the spiceworms in Dune?"
]
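
When the queries already live in a DataFrame in the format described earlier, the list of strings can be taken from the query_text column. This is a small sketch using plain pandas; the DataFrame contents and the variable name queries_df are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame holding queries in the query_text column.
queries_df = pd.DataFrame(
    {
        "query_text": ["What is spice in Dune?", "What is Arrakis?"],
    }
)

# Convert the column to a plain Python list of strings.
to_predict = queries_df["query_text"].astype(str).tolist()
print(to_predict)
```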
