Tokenizers
character_tokenizer
class deep_qa.data.tokenizers.character_tokenizer.CharacterTokenizer(params: deep_qa.common.params.Params)

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer
A CharacterTokenizer splits strings into character tokens.

Notes

Note that in the code, we're still using the "words" namespace, and the "num_sentence_words" padding key, instead of using a different "characters" namespace. This is so that the rest of the code doesn't have to change as much to just use this different tokenizer. For example, this is an issue when adding start and stop tokens - how is an Instance class supposed to know if it should use the "words" or the "characters" namespace when getting a start token id? If we just always use the "words" namespace for the top-level token namespace, it's not an issue. But confusingly, we'll still use the "characters" embedding key... At least the user-facing parts all use characters; it's only in writing tokenizer code that you need to be careful about namespaces.

TODO(matt): it probably makes sense to change the default namespace to "tokens", and use that for both the words in WordTokenizer and the characters in CharacterTokenizer, so the naming isn't so confusing.
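The character-splitting step itself is trivial; as a rough sketch (illustrative only, not deep_qa's actual code):

```python
def character_tokenize(text):
    """Split a string into single-character tokens.

    Every character, including whitespace, becomes its own token; these
    tokens then live under the top-level "words" namespace, as described
    in the note above.
    """
    return list(text)

tokens = character_tokenize("the cat")
# → ['t', 'h', 'e', ' ', 'c', 'a', 't']
```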
embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer
    The layer to embed.
embed_function: Callable[['Layer', str, str], 'Tensor']
    This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.
text_trainer: TextTrainer
    Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.
embedding_suffix: str, optional (default="")
    A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.
get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.
get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either 'words' or 'characters'. An example for indexing the string 'the' might be {'words': ['the'], 'characters': ['t', 'h', 'e']}, if you are indexing both words and characters.
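The namespaced return value can be sketched like this (a toy stand-in, not the library's implementation; `simple_split` here is a hypothetical placeholder for the real word-splitting logic):

```python
def simple_split(text):
    # Placeholder word splitter: whitespace split only.
    return text.split()

def get_words_for_indexer(text):
    # Return every string the DataIndexer would be asked to index,
    # grouped by namespace: whole words under 'words', and the
    # characters of each word under 'characters'.
    words = simple_split(text)
    characters = [char for word in words for char in word]
    return {'words': words, 'characters': characters}

get_words_for_indexer('the')
# → {'words': ['the'], 'characters': ['t', 'h', 'e']}
```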
index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.
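To illustrate what such an indexed list might look like for a character-level encoding, here is a toy sketch using a plain dict in place of the real DataIndexer (the index values and the OOV index of 1 are made up for the example):

```python
def index_text(text, char_indices, oov_index=1):
    # Map each character to its integer index, falling back to a
    # hypothetical OOV index for characters the indexer has not seen.
    return [char_indices.get(char, oov_index) for char in text]

char_indices = {'t': 2, 'h': 3, 'e': 4}
index_text('the', char_indices)
# → [2, 3, 4]
```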
tokenizer
class deep_qa.data.tokenizers.tokenizer.Tokenizer(params: deep_qa.common.params.Params)

Bases: object
A Tokenizer splits strings into sequences of tokens that can be used in a model. The “tokens” here could be words, characters, or words and characters. The Tokenizer object handles various things involved with this conversion, including getting a list of tokens for pre-computing a vocabulary, getting the shape of a word sequence in a model, etc. The Tokenizer needs to handle these things because the tokenization you do could affect the shape of word sequence tensors in the model (e.g., a sentence could have shape (num_words,), (num_characters,), or (num_words, num_characters)).
static _spans_match(sentence_tokens: typing.List[str], span_tokens: typing.List[str], index: int) → bool
char_span_to_token_span(sentence: str, span: typing.Tuple[int, int], slack: int = 3) → typing.Tuple[int, int]

Converts a character span from a sentence into the corresponding token span in the tokenized version of the sentence. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we'll do our best, but the behavior is officially undefined.

The basic outline of this method is to find the token that starts the same number of characters into the sentence as the given character span. We try to handle a bit of error in the tokenization by checking slack tokens in either direction from that initial estimate.

The returned (begin, end) indices are inclusive for begin, and exclusive for end. So, for example, (2, 2) is an empty span, (2, 3) is the one-word span beginning at token index 2, and so on.
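The basic outline (without the slack-based error correction) can be sketched as follows. This is a simplified stand-in, not the library's implementation, and it assumes tokens are joined by single spaces:

```python
def char_span_to_token_span(tokens, char_span):
    # Walk the tokens, tracking how many characters into the sentence
    # each token starts, and return an inclusive-begin, exclusive-end
    # token span covering the given character span.
    char_begin, char_end = char_span
    offset = 0
    begin = end = None
    for i, token in enumerate(tokens):
        if offset == char_begin:
            begin = i
        offset += len(token) + 1  # +1 for the single space between tokens
        if begin is not None and end is None and offset - 1 >= char_end:
            end = i + 1
    return begin, end

tokens = "the quick brown fox".split()
char_span_to_token_span(tokens, (4, 9))   # characters of "quick"
# → (1, 2)
```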
embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer
    The layer to embed.
embed_function: Callable[['Layer', str, str], 'Tensor']
    This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.
text_trainer: TextTrainer
    Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.
embedding_suffix: str, optional (default="")
    A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.
get_custom_objects() → typing.Dict[str, typing.Layer]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.
get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.
get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either 'words' or 'characters'. An example for indexing the string 'the' might be {'words': ['the'], 'characters': ['t', 'h', 'e']}, if you are indexing both words and characters.
index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.
word_and_character_tokenizer
class deep_qa.data.tokenizers.word_and_character_tokenizer.WordAndCharacterTokenizer(params: deep_qa.common.params.Params)

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer
A WordAndCharacterTokenizer first splits strings into words, then splits those words into characters, and returns a representation that contains both a word index and a sequence of character indices for each word. See the documentation for WordTokenizer for a note about naming, and the typical notion of "tokenization" in NLP.

Notes

In embed_input, this Tokenizer uses an encoder to get a character-level word embedding, which then gets concatenated with a standard word embedding from an embedding matrix. To specify the encoder to use for this character-level word embedding, use the "word" key in the encoder parameter to your model (which should be a TextTrainer subclass - see the documentation there for some more info). If you do not give a "word" key in the encoder dict, we'll create a new encoder using the "default" parameters.
embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')

A combined word-and-characters representation requires some fancy footwork to do the embedding properly.

This method assumes the input shape is (..., sentence_length, word_length + 1), where the first integer for each word in the tensor is the word index, and the remaining word_length entries are the character sequence. We'll first split this into two tensors, one of shape (..., sentence_length), and one of shape (..., sentence_length, word_length), where the first is the word sequence, and the second is the character sequence for each word. We'll pass the word sequence through an embedding layer, as normal, and pass the character sequence through a _separate_ embedding layer, then an encoder, to get a word vector out. We'll then concatenate the two word vectors, returning a tensor of shape (..., sentence_length, embedding_dim * 2).
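The splitting step can be sketched with concrete arrays in numpy (the real code operates on symbolic Keras tensors; the shapes and values here are illustrative only):

```python
import numpy as np

# Shape (batch, sentence_length, word_length + 1): for each word, the
# first entry is the word index and the rest are character indices.
sentence_length, word_length = 3, 4
combined = np.arange(sentence_length * (word_length + 1)).reshape(
    1, sentence_length, word_length + 1)

word_indices = combined[..., 0]    # shape (1, sentence_length)
char_indices = combined[..., 1:]   # shape (1, sentence_length, word_length)

assert word_indices.shape == (1, 3)
assert char_indices.shape == (1, 3, 4)
```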
get_custom_objects() → typing.Dict[str, typing.Any]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.
get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.
get_sentence_shape(sentence_length: int, word_length: int = None) → typing.Tuple[int]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either 'words' or 'characters'. An example for indexing the string 'the' might be {'words': ['the'], 'characters': ['t', 'h', 'e']}, if you are indexing both words and characters.
index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.
word_splitter
class deep_qa.data.tokenizers.word_splitter.NltkWordSplitter

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses nltk's word_tokenize method.

I found that nltk is very slow, so I switched to using my own simple one, which is a good deal faster. But I'm adding this one back so that there's consistency with older versions of the code, if you really want it.
class deep_qa.data.tokenizers.word_splitter.NoOpWordSplitter

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

This is a word splitter that does nothing. We're playing a little loose with python's dynamic typing, breaking the typical WordSplitter API a bit and assuming that you've already split sentence into a list somehow, so you don't need to do anything else here. For example, the PreTokenizedTaggingInstance requires this word splitter, because it reads in pre-tokenized data from a file.
class deep_qa.data.tokenizers.word_splitter.SimpleWordSplitter

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.
split_words(sentence: str) → typing.List[str]

Splits a sentence into word tokens. We handle three kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn't, don't, Matt's), and beginning and ending punctuation ("antennagate", (parentheticals), and such.).

The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then strip off ending punctuation, then strip off contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn't in our list of special cases.
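That outline can be sketched like this (a toy stand-in for illustration, not deep_qa's actual tokenizer; the special-case, punctuation, and contraction sets here are made up and much smaller than the real ones):

```python
SPECIAL_CASES = {'mr.', 'mrs.', 'etc.', 'e.g.'}   # kept whole, never stripped
BEGIN_PUNCT = {'"', "'", '(', '['}
END_PUNCT = {'"', "'", ')', ']', '.', ',', '!', '?'}
CONTRACTIONS = {"n't", "'s", "'re", "'ve"}

def split_words(sentence):
    tokens = []
    for field in sentence.split():
        word = field
        pending = []  # pieces stripped off the end, emitted after the word
        # Strip leading punctuation; emit it immediately.
        while word and word not in SPECIAL_CASES and word[0] in BEGIN_PUNCT:
            tokens.append(word[0])
            word = word[1:]
        # Strip trailing punctuation; save it for after the word.
        while word and word not in SPECIAL_CASES and word[-1] in END_PUNCT:
            pending.insert(0, word[-1])
            word = word[:-1]
        # Strip one trailing contraction, if present.
        for contraction in CONTRACTIONS:
            if (word not in SPECIAL_CASES and word.endswith(contraction)
                    and len(word) > len(contraction)):
                pending.insert(0, contraction)
                word = word[:-len(contraction)]
                break
        if word:
            tokens.append(word)
        tokens.extend(pending)
    return tokens

split_words('she said "don\'t stop."')
# → ['she', 'said', '"', 'do', "n't", 'stop', '.', '"']
```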
class deep_qa.data.tokenizers.word_splitter.SpacyWordSplitter

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses spaCy's Tokenizer, which is much faster than the others.
word_tokenizer
class deep_qa.data.tokenizers.word_tokenizer.WordTokenizer(params: deep_qa.common.params.Params)

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer
A WordTokenizer splits strings into word tokens.

There are several ways that you can split a string into words, so we rely on a WordProcessor to do that work for us. Note that we're using the word "tokenizer" here for something different than is typical in NLP - we're referring here to how strings are represented as numpy arrays, not the linguistic notion of splitting sentences into tokens. Those things are handled in the WordProcessor, which is a common dependency in several Tokenizers.

Parameters:

processor: Dict[str, Any], default={}
    Contains parameters for processing text strings into word tokens, including, e.g., splitting, stemming, and filtering words. See WordProcessor for a complete description of available parameters.
embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer
    The layer to embed.
embed_function: Callable[['Layer', str, str], 'Tensor']
    This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.
text_trainer: TextTrainer
    Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.
embedding_suffix: str, optional (default="")
    A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.
get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.
get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either 'words' or 'characters'. An example for indexing the string 'the' might be {'words': ['the'], 'characters': ['t', 'h', 'e']}, if you are indexing both words and characters.
index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.