Entailment Instances¶
These Instances
are designed for an entailment task, where the input is a pair of sentences
(or larger text sequences) and the output is a classification decision.
SentencePairInstances¶
-
class
deep_qa.data.instances.entailment.sentence_pair_instance.
IndexedSentencePairInstance
(first_sentence_indices: typing.List[int], second_sentence_indices: typing.List[int], label: typing.List[int], index: int = None)[source]¶ Bases:
deep_qa.data.instances.instance.IndexedInstance
This is an indexed instance that is commonly used for labeled sentence pairs. Examples of this are SnliInstances where we have a labeled pair of text and hypothesis, and a sentence2vec instance where the objective is to train an encoder to predict whether the sentences are in context or not.
-
as_training_data
()[source]¶ Convert this
IndexedInstance
to NumPy arrays suitable for use as training data to Keras models.
Returns: train_data : (inputs, label)
The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
-
classmethod
empty_instance
()[source]¶ Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
-
get_padding_lengths
() → typing.Dict[str, int][source]¶ Returns the length of this instance in all dimensions that require padding.
Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.
Returns: padding_lengths: Dict[str, int]
A dictionary mapping padding keys (like “num_sentence_words”) to lengths.
-
pad
(padding_lengths: typing.Dict[str, int])[source]¶ Add zero-padding to make each data example of equal length for use in the neural network.
This modifies the current object.
Parameters: padding_lengths: Dict[str, int]
In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
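The interplay between get_padding_lengths() and pad() can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, and padding or truncating on the right is an assumption, since the text above does not say which side is used.

```python
from typing import Dict, List, Tuple


def get_pair_padding_lengths(first: List[int], second: List[int]) -> Dict[str, int]:
    """Report the longest sentence in the pair under the "num_sentence_words" key."""
    return {"num_sentence_words": max(len(first), len(second))}


def pad_indices(indices: List[int], desired_length: int, pad_value: int = 0) -> List[int]:
    """Truncate or zero-pad a list of word indices to exactly `desired_length`.

    Assumption: both truncation and padding happen on the right.
    """
    truncated = indices[:desired_length]
    return truncated + [pad_value] * (desired_length - len(truncated))


def pad_sentence_pair(first: List[int], second: List[int],
                      padding_lengths: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Pad both sentences to the length stored under "num_sentence_words"."""
    length = padding_lengths["num_sentence_words"]
    return pad_indices(first, length), pad_indices(second, length)
```

In practice the padding lengths are usually the maximum over a whole dataset, so that every instance ends up with the same shape.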
-
-
class
deep_qa.data.instances.entailment.sentence_pair_instance.
SentencePairInstance
(first_sentence: str, second_sentence: str, label: typing.List[int], index: int = None)[source]¶ Bases:
deep_qa.data.instances.instance.TextInstance
A SentencePairInstance contains a pair of sentences accompanied by a binary label. The label can represent whatever you want, such as entailment or whether the two sentences occur in the same context.
-
classmethod
read_from_line
(line: str)[source]¶ Expected format: [sentence1][tab][sentence2][tab][label]
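A line in this format could be parsed roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the label is assumed to be written as the integer 0 or 1 and wrapped in a single-element list to match the List[int] type in the constructor signature.

```python
from typing import List, Tuple


def parse_sentence_pair_line(line: str) -> Tuple[str, str, List[int]]:
    """Split a "[sentence1][tab][sentence2][tab][label]" line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("Expected 3 tab-separated fields, got %d" % len(fields))
    first_sentence, second_sentence, label_string = fields
    # Assumption: the label column holds an integer such as 0 or 1.
    return first_sentence, second_sentence, [int(label_string)]
```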
-
to_indexed_instance
(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]¶ Converts the words in this
Instance into indices using the DataIndexer.
Parameters: data_indexer : DataIndexer
DataIndexer to use in converting the Instance to an IndexedInstance.
Returns: indexed_instance : IndexedInstance
A TextInstance that has had all of its strings converted into indices.
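The idea of converting words into indices can be illustrated with a plain dictionary standing in for the DataIndexer. Everything here is hypothetical: the helper names, the whitespace tokenization, and the reserved indices (0 for padding, 1 for unknown words) are assumptions made for this sketch and do not describe the DataIndexer API.

```python
from typing import Dict, List

PADDING_INDEX = 0  # assumed: reserved so that zero-padding never collides with a word
OOV_INDEX = 1      # assumed: shared index for words missing from the vocabulary


def build_vocabulary(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token the next free index, starting after the reserved ones."""
    vocabulary: Dict[str, int] = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary) + 2
    return vocabulary


def index_sentence(sentence: str, vocabulary: Dict[str, int]) -> List[int]:
    """Lowercase, split on whitespace, and look up each token in the vocabulary."""
    return [vocabulary.get(token, OOV_INDEX) for token in sentence.lower().split()]
```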
-
words
() → typing.Dict[str, typing.List[str]][source]¶ Returns a list of all of the words in this instance, contained in a namespace dictionary.
This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the
DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.
Returns: namespace : Dictionary of {str: List[str]}
The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
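For a sentence pair, such a namespace dictionary might look like the one built below. This is only a sketch: the function name is hypothetical, and lowercasing plus whitespace splitting stands in for whatever tokenizer is actually configured.

```python
from collections import Counter
from typing import Dict, List


def sentence_pair_words(first_sentence: str, second_sentence: str) -> Dict[str, List[str]]:
    """Collect the tokens of both sentences under the "words" namespace."""
    tokens = first_sentence.lower().split() + second_sentence.lower().split()
    return {"words": tokens}


def count_words(namespaces: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count token occurrences separately for each namespace."""
    return {namespace: Counter(tokens) for namespace, tokens in namespaces.items()}
```

Counting per namespace is what makes it possible to fit separate vocabularies, for example one for words and one for characters.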
SnliInstances¶
-
class
deep_qa.data.instances.entailment.snli_instance.
SnliInstance
(text: str, hypothesis: str, label: str, index: int = None)[source]¶ Bases:
deep_qa.data.instances.entailment.sentence_pair_instance.SentencePairInstance
An SnliInstance is a SentencePairInstance that represents a pair of (text, hypothesis) from the Stanford Natural Language Inference (SNLI) dataset, with an associated label. The main thing we need to add here is handling of the label, because there are a few different ways we can use this Instance.
The label can either be a three-way decision (one of “entails”, “contradicts”, or “neutral”), or a binary decision (grouping either “entails” and “contradicts”, for relevance decisions, or “contradicts” and “neutral”, for entails/not entails decisions).
The input label must be one of the strings in the label_mapping field below. The difference between the *_softmax and *_sigmoid labels is just for implementation reasons. A softmax over two dimensions is exactly equivalent to a sigmoid, but to make our lives easier in building models, sometimes we use a sigmoid and sometimes we use a softmax over two dimensions. Having separate labels for these cases makes it easier to use this data in whatever kind of model you want.
It might make sense to push this difference more generally into some common place, so that we can separate the label itself from how it’s encoded for training. But that might also be complicated to implement, and it’s not needed right now. TODO(matt): if we find ourselves doing this kind of thing in several places, we should think about making that change.
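The equivalence between a two-dimensional softmax and a sigmoid can be checked numerically. The sketch below uses only the standard library; the function names are illustrative. For two logits a and b, the softmax probability of the second class equals the sigmoid of their difference, b - a.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax2(a: float, b: float) -> Tuple[float, float]:
    """Softmax over two logits, shifted by the maximum for numerical stability."""
    shift = max(a, b)
    exp_a, exp_b = math.exp(a - shift), math.exp(b - shift)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


def second_class_probabilities(a: float, b: float) -> Tuple[float, float]:
    """Return the second-class probability computed both ways, for comparison."""
    return softmax2(a, b)[1], sigmoid(b - a)
```

This is why a one-dimensional sigmoid label such as [1] and a two-dimensional softmax label such as [0, 1] can describe the same decision.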
-
label_mapping
= {'contradicts': [0, 1, 0], 'neutral': [0, 0, 1], 'entails_softmax': [0, 1], 'attention_false': [0], 'not_entails_sigmoid': [0], 'attention_true': [1], 'entails': [1, 0, 0], 'not_entails_softmax': [1, 0], 'entails_sigmoid': [1]}¶
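Looking up a label string in this mapping could be done as follows. The dictionary is copied from the attribute above; the encode_label helper is hypothetical and only illustrates the lookup with an explicit error for unknown labels.

```python
from typing import Dict, List

# Copied from the label_mapping attribute documented above.
LABEL_MAPPING: Dict[str, List[int]] = {
    "entails": [1, 0, 0],
    "contradicts": [0, 1, 0],
    "neutral": [0, 0, 1],
    "entails_softmax": [0, 1],
    "not_entails_softmax": [1, 0],
    "entails_sigmoid": [1],
    "not_entails_sigmoid": [0],
    "attention_true": [1],
    "attention_false": [0],
}


def encode_label(label: str) -> List[int]:
    """Return a copy of the encoded vector for `label`, or raise on an unknown label."""
    if label not in LABEL_MAPPING:
        raise ValueError("Unrecognized label: %r" % label)
    return list(LABEL_MAPPING[label])
```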
-
classmethod
read_from_line
(line: str)[source]¶ Reads an SnliInstance object from a line. The format has one of two options:
- [example index][tab][text][tab][hypothesis][tab][label]
- [text][tab][hypothesis][tab][label]
[label] is assumed to be one of “entails”, “contradicts”, or “neutral”.
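Handling both line formats might look like the sketch below. The function name is hypothetical, and the example index is assumed to be an integer; the number of tab-separated fields decides which format applies.

```python
from typing import Optional, Tuple


def parse_snli_line(line: str) -> Tuple[Optional[int], str, str, str]:
    """Parse a 3- or 4-field line into (index, text, hypothesis, label)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        index_string, text, hypothesis, label = fields
        index: Optional[int] = int(index_string)  # assumption: indices are integers
    elif len(fields) == 3:
        text, hypothesis, label = fields
        index = None
    else:
        raise ValueError("Expected 3 or 4 tab-separated fields, got %d" % len(fields))
    return index, text, hypothesis, label
```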
-
to_entails_instance
(activation: str)[source]¶ This returns a new SnliInstance with a different label. The new label will be binary (entails / not entails), but we need to distinguish between two different label types. Sometimes we need the label to be encoded in a single dimension (i.e., either 0 or 1), and sometimes we need it to be encoded in two dimensions (i.e., either [0, 1] or [1, 0]). This depends on the activation function of the final layer in our network - a sigmoid activation will need the former, while a softmax activation will need the latter. So, we encode these differently, as strings, which will be converted to the right array later, in IndexedSnliInstance.
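The relabeling step can be sketched as a function over label strings. This is an illustration rather than the method's actual body: the function name is hypothetical, and the accepted activation values ("softmax" and "sigmoid") are assumed from the *_softmax and *_sigmoid keys in label_mapping.

```python
def to_entails_label(label: str, activation: str) -> str:
    """Collapse a three-way SNLI label into an entails / not-entails label string."""
    if label not in ("entails", "contradicts", "neutral"):
        raise ValueError("Unrecognized three-way label: %r" % label)
    if activation not in ("softmax", "sigmoid"):
        raise ValueError("Unrecognized activation: %r" % activation)
    # "contradicts" and "neutral" are grouped together as "not entails".
    prefix = "entails" if label == "entails" else "not_entails"
    return "%s_%s" % (prefix, activation)
```

The resulting string, for example not_entails_sigmoid, is one of the keys of label_mapping, so the array conversion can be deferred to a later step.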
-