Sequence Tagging Instances¶
These Instances
are designed for a sequence tagging task, where the input is a passage of
natural language (e.g., a sentence), and the output is some classification decision for each token
in that passage (e.g., part-of-speech tags, any kind of BIO tagging like NER or chunking, etc.).
TaggingInstances¶
class deep_qa.data.instances.sequence_tagging.tagging_instance.IndexedTaggingInstance(text_indices: typing.List[int], label: typing.List[int], index: int = None)[source]¶
Bases: deep_qa.data.instances.instance.IndexedInstance
as_training_data()[source]¶
Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns: train_data : (inputs, label)
The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
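As a concrete illustration, the sketch below (not the library's own code) shows the kind of arrays as_training_data() returns for a hypothetical three-token instance with two tag classes; the indices and one-hot labels are invented for the example:

```python
import numpy as np

# Hypothetical three-token instance with two tag classes; the word
# indices and one-hot tag labels below are made up for illustration.
text_indices = [4, 12, 7]             # one word index per token
label = [[1, 0], [0, 1], [1, 0]]      # one-hot tag vector per token

# as_training_data() returns an (inputs, label) pair of NumPy arrays:
inputs = np.asarray(text_indices, dtype="int32")
labels = np.asarray(label, dtype="int32")

print(inputs.shape)   # (3,)
print(labels.shape)   # (3, 2)
```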
classmethod empty_instance()[source]¶
Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
get_padding_lengths() → typing.Dict[str, int][source]¶
Returns the length of this instance in all dimensions that require padding. Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns: padding_lengths : Dict[str, int]
A dictionary mapping padding keys (like “num_sentence_words”) to lengths.
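For instance, a padding-lengths dictionary for a hypothetical three-token instance might look like the sketch below, reusing the "num_sentence_words" key mentioned above (the token indices are invented for the example):

```python
# Illustrative padding_lengths dictionary for a three-token instance;
# the key name follows the "num_sentence_words" example above, and the
# word indices are hypothetical.
text_indices = [4, 12, 7]
padding_lengths = {"num_sentence_words": len(text_indices)}

print(padding_lengths)   # {'num_sentence_words': 3}
```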
pad(padding_lengths: typing.Dict[str, int])[source]¶
Add zero-padding to make each data example of equal length for use in the neural network. This modifies the current object.

Parameters: padding_lengths : Dict[str, int]
In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
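The zero-padding described here can be sketched as follows. This is an illustration, not the library's implementation, and the choice to pad and truncate on the left is an assumption made for the example:

```python
# Sketch of zero-padding a token-index sequence to a fixed length.
# Assumption: we pad on the left with zeros and truncate from the
# right; the real library may choose either side.
def pad_sequence(indices, desired_length):
    truncated = indices[:desired_length]
    return [0] * (desired_length - len(truncated)) + truncated

# As in pad(), the target length comes from a padding_lengths dict
# with the same keys as get_padding_lengths() returned.
padding_lengths = {"num_sentence_words": 5}
print(pad_sequence([4, 12, 7], padding_lengths["num_sentence_words"]))
# [0, 0, 4, 12, 7]
```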
class deep_qa.data.instances.sequence_tagging.tagging_instance.TaggingInstance(text: str, label: typing.Any, index: int = None)[source]¶
Bases: deep_qa.data.instances.instance.TextInstance

A TaggingInstance represents a passage of text and a tag sequence over that text.

There are some sticky issues with tokenization and how exactly the label is specified. For example, if your label is a sequence of tags, that assumes a particular tokenization, which interacts in a funny way with our tokenization code. This is a general superclass containing common functionality for most simple sequence tagging tasks. The specifics of reading in data from a file and converting that data into properly-indexed tag sequences are left to subclasses.
_index_label(label: typing.Any, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]¶
Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method. If you need to convert tag names into indices, use the namespace ‘tags’ in the DataIndexer.

Returns all of the tag words in this instance, so that we can convert them into indices. This is called in self.words(). Not necessary if you have some pre-indexed labeling scheme.
words() → typing.Dict[str, typing.List[str]][source]¶
Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns: namespace : Dictionary of {str: List[str]}
The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
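An illustrative namespace dictionary of this form, with invented tokens and a 'tags' namespace alongside 'words', might be:

```python
# Illustrative namespace dictionary of the shape words() returns; the
# tokens and tags below are made up for the example.
namespace = {
    "words": ["the", "cat", "sat"],   # word tokens, under the 'words' key
    "tags": ["DT", "NN", "VBD"],      # tag labels, under the 'tags' namespace
}

print(sorted(namespace.keys()))   # ['tags', 'words']
```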
PretokenizedTaggingInstances¶
class deep_qa.data.instances.sequence_tagging.pretokenized_tagging_instance.PreTokenizedTaggingInstance(text: typing.List[str], label: typing.List[str], index: int = None)[source]¶
Bases: deep_qa.data.instances.sequence_tagging.tagging_instance.TaggingInstance

This is a TaggingInstance where the text has been pre-tokenized. Thus the text member variable here is actually a List[str], instead of a str.

When using this Instance, you must use the NoOpWordSplitter as well, or things will break. You probably also do not want any kind of filtering (though stemming is ok), because only the words will get filtered, not the labels.
_index_label(label: typing.List[str], data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]¶
Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method. If you need to convert tag names into indices, use the namespace ‘tags’ in the DataIndexer.
classmethod read_from_line(line: str)[source]¶
Reads a PreTokenizedTaggingInstance from a line. The format has one of two options:

- [example index][token1]###[tag1][tab][token2]###[tag2][tab]...
- [token1]###[tag1][tab][token2]###[tag2][tab]...
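A minimal sketch of parsing the second format above (the one with no example index) might look like this; parse_tagged_line is a hypothetical helper written for illustration, not the library's implementation:

```python
# Parse a tab-separated line of token###tag pairs into parallel
# token and tag lists, following the second format described above.
def parse_tagged_line(line):
    tokens, tags = [], []
    for pair in line.split("\t"):
        token, tag = pair.split("###")
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

line = "The###DT\tdog###NN\tbarks###VBZ"
print(parse_tagged_line(line))
# (['The', 'dog', 'barks'], ['DT', 'NN', 'VBZ'])
```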
Returns all of the tag words in this instance, so that we can convert them into indices. This is called in self.words(). Not necessary if you have some pre-indexed labeling scheme.