Reading Comprehension Instances¶
These Instances are designed for the set of tasks known today as “reading comprehension”, where
the input is a natural language question, a passage, and (optionally) some number of answer
options, and the output is either a (span begin index, span end index) decision over the passage,
or a classification decision over the answer options (if provided).
QuestionPassageInstances¶
-
class
deep_qa.data.instances.reading_comprehension.question_passage_instance.
IndexedQuestionPassageInstance
(question_indices: typing.List[int], passage_indices: typing.List[int], label: typing.List[int], index: int = None)[source]¶ Bases:
deep_qa.data.instances.instance.IndexedInstance
This is an indexed instance that is used for (question, passage) pairs.
-
as_training_data
()[source]¶ Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.
Returns: train_data : (inputs, label)
The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
-
classmethod
empty_instance
()[source]¶ Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
-
-
class
deep_qa.data.instances.reading_comprehension.question_passage_instance.
QuestionPassageInstance
(question_text: str, passage_text: str, label: typing.Any, index: int = None)[source]¶ Bases:
deep_qa.data.instances.instance.TextInstance
A QuestionPassageInstance is a base class for datasets that consist primarily of a question text and a passage, where the passage contains the answer to the question. This class should not be used directly, because it does not implement _index_label; use a subclass instead.
-
_index_label
(label: typing.Any) → typing.List[int][source]¶ Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method.
-
to_indexed_instance
(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]¶ Converts the words in this Instance into indices using the DataIndexer.
Parameters: data_indexer : DataIndexer
The DataIndexer to use in converting the Instance to an IndexedInstance.
Returns: indexed_instance : IndexedInstance
A TextInstance that has had all of its strings converted into indices.
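The conversion from words to indices can be pictured with a small sketch. This is not deep_qa's DataIndexer: the ToyIndexer class, its add_word and get_index methods, the reserved padding and unknown tokens, and the whitespace tokenization are all assumptions made for illustration.

```python
from typing import Dict, List


class ToyIndexer:
    """Hypothetical stand-in for a DataIndexer; not the deep_qa class."""

    def __init__(self) -> None:
        # Assumption: index 0 is reserved for padding, index 1 for unknown words.
        self.word_to_index: Dict[str, int] = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}

    def add_word(self, word: str) -> int:
        # Assign the next free index the first time a word is seen.
        if word not in self.word_to_index:
            self.word_to_index[word] = len(self.word_to_index)
        return self.word_to_index[word]

    def get_index(self, word: str) -> int:
        # Words that were never added map to the unknown index.
        return self.word_to_index.get(word, 1)


def index_text(text: str, indexer: ToyIndexer) -> List[int]:
    # Lowercasing and whitespace tokenization are simplifying assumptions.
    return [indexer.get_index(token) for token in text.lower().split()]


indexer = ToyIndexer()
for word in "the cat sat on the mat".split():
    indexer.add_word(word)

question_indices = index_text("Where sat the cat", indexer)
passage_indices = index_text("The cat sat on the mat", indexer)
```

In this sketch, "where" was never added to the vocabulary, so it maps to the unknown index.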
-
words
() → typing.Dict[str, typing.List[str]][source]¶ Returns all of the words in this instance, contained in a namespace dictionary.
This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.
Returns: namespace : Dictionary of {str: List[str]}
The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
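As a rough illustration of the namespace dictionary described above, the sketch below builds one for a (question, passage) pair and counts word occurrences. The function name words_sketch, the ‘characters’ key, and the whitespace tokenization are assumptions for the example, not deep_qa's implementation.

```python
from collections import Counter
from typing import Dict, List


def words_sketch(question_text: str, passage_text: str) -> Dict[str, List[str]]:
    """Hypothetical illustration of a namespace dictionary."""
    # Lowercasing and whitespace tokenization are simplifying assumptions.
    tokens = question_text.lower().split() + passage_text.lower().split()
    # A second namespace holding every character of every token.
    characters = [char for token in tokens for char in token]
    return {"words": tokens, "characters": characters}


namespaces = words_sketch("Who wrote Hamlet", "Shakespeare wrote Hamlet")

# Word counts of this kind are what a vocabulary would be fitted on.
word_counts = Counter(namespaces["words"])
```

Keeping words and characters in separate namespaces means each can be given its own vocabulary and its own embedding matrix.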
-
McQuestionPassageInstances¶
-
class
deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.
IndexedMcQuestionPassageInstance
(question_indices: typing.List[int], passage_indices: typing.List[int], option_indices: typing.List[typing.List[int]], label: typing.List[int], index: int = None)[source]¶ -
-
as_training_data
()[source]¶ Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.
Returns: train_data : (inputs, label)
The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
-
classmethod
empty_instance
()[source]¶ Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
-
get_padding_lengths
() → typing.Dict[str, int][source]¶ We need to pad the answer option length (in words), the number of answer options, the question length (in words), the passage length (in words), and the word length (in characters) among all the questions, passages, and answer options.
-
pad
(padding_lengths: typing.Dict[str, int])[source]¶ In this function, we pad the questions and passages (in terms of the number of words in each), as well as the individual words in the questions and passages themselves. We also pad the number of answer options, the answer options (in terms of the number of words in each), as well as the individual words in the answer options.
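A minimal sketch of word-level padding, under stated assumptions: the dictionary keys (num_question_words, num_passage_words, num_options, num_option_words), the helper names, and the choice to pad and truncate on the right with index 0 are all hypothetical and may differ from deep_qa's behaviour. Character-level padding of individual words is left out for brevity.

```python
from typing import Dict, List


def pad_sequence(sequence: List[int], length: int, padding_value: int = 0) -> List[int]:
    """Truncate or right-pad a list of word indices to exactly `length` items."""
    truncated = sequence[:length]
    return truncated + [padding_value] * (length - len(truncated))


def pad_instance(question: List[int],
                 passage: List[int],
                 options: List[List[int]],
                 padding_lengths: Dict[str, int]) -> Dict[str, list]:
    num_options = padding_lengths["num_options"]
    option_length = padding_lengths["num_option_words"]
    # Pad each existing option to the same number of words.
    padded_options = [pad_sequence(option, option_length)
                      for option in options[:num_options]]
    # Add all-padding options until every instance has the same number of options.
    while len(padded_options) < num_options:
        padded_options.append([0] * option_length)
    return {
        "question": pad_sequence(question, padding_lengths["num_question_words"]),
        "passage": pad_sequence(passage, padding_lengths["num_passage_words"]),
        "options": padded_options,
    }


padding_lengths = {"num_question_words": 4, "num_passage_words": 5,
                   "num_options": 3, "num_option_words": 2}
padded = pad_instance([4, 9], [7, 3, 5, 1, 8, 2], [[6], [2, 5, 9]], padding_lengths)
```

After padding, every instance in a batch has the same shape, which is what allows the instances to be stacked into fixed-size arrays.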
-
-
class
deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.
McQuestionPassageInstance
(question: str, passage: str, answer_options: typing.List[str], label: int, index: int = None)[source]¶ Bases:
deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance
A McQuestionPassageInstance is a QuestionPassageInstance that represents a (question, passage, answer_options) tuple from the McQuestionPassageInstance dataset, with an associated label indicating the index of the correct answer choice.
-
_index_label
(label: typing.Tuple[int, int]) → typing.List[int][source]¶ Specify how to index self.label, which is needed to convert the McQuestionPassageInstance into an IndexedInstance (conversion handled in superclass).
-
classmethod
read_from_line
(line: str)[source]¶ Reads a McQuestionPassageInstance object from a line. The line must be in one of two formats:
- [example index][tab][passage][tab][question][tab][options][tab][label]
- [passage][tab][question][tab][options][tab][label]
The answer_options column is assumed to be formatted as: [option]###[option]###[option]...
That is, we split on three hashes ("###").
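The line format above can be parsed along the following lines. This is a sketch rather than deep_qa's read_from_line: the ParsedMcLine tuple and the parse_mc_line function are hypothetical names, and treating the label and example index as integers is an assumption based on the constructor signature.

```python
from typing import List, NamedTuple, Optional


class ParsedMcLine(NamedTuple):
    """Hypothetical container for the fields of one line."""
    index: Optional[int]
    passage: str
    question: str
    options: List[str]
    label: int


def parse_mc_line(line: str) -> ParsedMcLine:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 5:
        # [example index][tab][passage][tab][question][tab][options][tab][label]
        index: Optional[int] = int(fields[0])
        passage, question, options, label = fields[1:]
    elif len(fields) == 4:
        # [passage][tab][question][tab][options][tab][label]
        index = None
        passage, question, options, label = fields
    else:
        raise ValueError("Expected 4 or 5 tab-separated fields, got %d" % len(fields))
    # The options column is split on three hashes.
    return ParsedMcLine(index, passage, question, options.split("###"), int(label))


parsed = parse_mc_line("12\tCats sleep a lot.\tWhat do cats do?\teat###sleep###swim\t1")
unindexed = parse_mc_line("Cats sleep a lot.\tWhat do cats do?\teat###sleep###swim\t1")
```

Note that the passage comes before the question in this format, which is the reverse of the CharacterSpanInstance line format.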
-
to_indexed_instance
(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]¶ Converts the words in this Instance into indices using the DataIndexer.
Parameters: data_indexer : DataIndexer
The DataIndexer to use in converting the Instance to an IndexedInstance.
Returns: indexed_instance : IndexedInstance
A TextInstance that has had all of its strings converted into indices.
-
words
() → typing.Dict[str, typing.List[str]][source]¶ Returns all of the words in this instance, contained in a namespace dictionary.
This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.
Returns: namespace : Dictionary of {str: List[str]}
The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
-
CharacterSpanInstances¶
-
class
deep_qa.data.instances.reading_comprehension.character_span_instance.
CharacterSpanInstance
(question: str, passage: str, label: typing.Tuple[int, int], index: int = None)[source]¶ Bases:
deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance
A CharacterSpanInstance is a QuestionPassageInstance that represents a (question, passage) pair with an associated label, which is the data given for the span prediction task. The label is a span of characters in the passage that indicates where the answer to the question begins and where the answer to the question ends.
The main thing this class adds to QuestionPassageInstance is specifying the form of the label and how to index it. The label is given as a span of _characters_ in the passage, but the label used in the rest of the code is a span of _tokens_ in the passage. The mapping from character labels to token labels therefore depends on the tokenization, and the logic to handle this is, unfortunately, a little complicated. The label conversion happens when converting a CharacterSpanInstance to an IndexedInstance (where character indices are generally lost, anyway).
This class should be used to represent training instances for the SQuAD (Stanford Question Answering) and NewsQA datasets, to name a few.
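To make the character-to-token mapping concrete, here is a simplified sketch. It is not the logic deep_qa uses: whitespace tokenization, an exclusive character end offset, and an inclusive token end index are assumptions, and token_offsets and char_span_to_token_span are hypothetical helper names.

```python
import re
from typing import List, Tuple


def token_offsets(passage: str) -> List[Tuple[int, int]]:
    # Assumption: tokens are maximal runs of non-whitespace characters.
    # Each entry is (start offset, end offset), with the end exclusive.
    return [(match.start(), match.end()) for match in re.finditer(r"\S+", passage)]


def char_span_to_token_span(passage: str, char_span: Tuple[int, int]) -> Tuple[int, int]:
    char_begin, char_end = char_span  # assumption: char_end is exclusive
    # Collect every token that overlaps the character span at all.
    overlapping = [i for i, (start, end) in enumerate(token_offsets(passage))
                   if start < char_end and end > char_begin]
    if not overlapping:
        raise ValueError("Character span does not overlap any token")
    # Return inclusive (first token, last token) indices.
    return overlapping[0], overlapping[-1]


passage = "The Eiffel Tower is in Paris ."
answer = "Eiffel Tower"
char_begin = passage.index(answer)
token_span = char_span_to_token_span(passage, (char_begin, char_begin + len(answer)))
```

The result depends entirely on the tokenizer: a tokenizer that splits differently yields different token indices for the same character span, which is why the conversion is tied to the indexing step.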
-
_index_label
(label: typing.Tuple[int, int]) → typing.List[int][source]¶ Specify how to index self.label, which is needed to convert the CharacterSpanInstance into an IndexedInstance (handled in superclass).
-
classmethod
read_from_line
(line: str)[source]¶ Reads a CharacterSpanInstance object from a line. The line must be in one of two formats:
- [example index][tab][question][tab][passage][tab][label]
- [question][tab][passage][tab][label]
[label] is assumed to be a comma-separated pair of integers.
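A sketch of parsing this format, assuming the example index is an integer and the label is a pair such as 23,28. The function name parse_character_span_line and the returned tuple layout are hypothetical and do not reflect deep_qa's implementation.

```python
from typing import Optional, Tuple


def parse_character_span_line(line: str) -> Tuple[Optional[int], str, str, Tuple[int, int]]:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        # [example index][tab][question][tab][passage][tab][label]
        index: Optional[int] = int(fields[0])
        question, passage, label = fields[1:]
    elif len(fields) == 3:
        # [question][tab][passage][tab][label]
        index = None
        question, passage, label = fields
    else:
        raise ValueError("Expected 3 or 4 tab-separated fields, got %d" % len(fields))
    # The label is a comma-separated pair of integers.
    begin, end = (int(part) for part in label.split(","))
    return index, question, passage, (begin, end)


index, question, passage, span = parse_character_span_line(
    "7\tWhere is the Eiffel Tower?\tThe Eiffel Tower is in Paris .\t23,28")
```

With an exclusive end offset (an assumption of this sketch), slicing the passage with the parsed span recovers the answer text.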
-
stop_token
= '@@STOP@@'¶
-
to_indexed_instance
(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]¶ Converts the words in this Instance into indices using the DataIndexer.
Parameters: data_indexer : DataIndexer
The DataIndexer to use in converting the Instance to an IndexedInstance.
Returns: indexed_instance : IndexedInstance
A TextInstance that has had all of its strings converted into indices.
-