Datasets
deep_qa.data.dataset
-
class deep_qa.data.datasets.dataset.Dataset(instances: typing.List[deep_qa.data.instances.instance.Instance])
Bases: object
A collection of Instances.
This base class has general methods that apply to all collections of Instances. In practice these are set-like operations, such as merging and truncating.
-
merge(other: deep_qa.data.datasets.dataset.Dataset) → deep_qa.data.datasets.dataset.Dataset
Combine two datasets. If you merge two Datasets of the same subtype, you get back a Dataset of that same type (e.g., calling IndexedDataset.merge() with another IndexedDataset returns an IndexedDataset). If the types differ, this method currently raises an error, because the underlying Instance objects are not currently type-compatible.
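The type-preserving behavior of merge() can be illustrated with a minimal, self-contained sketch. The classes below are toy stand-ins, not the deep_qa implementations, and the choice of ValueError for mismatched types is an assumption; the documentation above only says that an error is raised.

```python
from typing import List


class Instance:
    """Stand-in for a deep_qa Instance (hypothetical, for illustration only)."""

    def __init__(self, label: int):
        self.label = label


class Dataset:
    """Toy Dataset that mirrors the documented merge() contract."""

    def __init__(self, instances: List[Instance]):
        self.instances = instances

    def merge(self, other: "Dataset") -> "Dataset":
        # Same subtype in, same subtype out; mismatched types are rejected.
        # ValueError is an assumed exception type, not taken from deep_qa.
        if type(self) is not type(other):
            raise ValueError("Cannot merge datasets of different types")
        return type(self)(self.instances + other.instances)


class IndexedDataset(Dataset):
    """Toy subclass used to show that merge() preserves the subtype."""


merged = IndexedDataset([Instance(0)]).merge(IndexedDataset([Instance(1)]))
print(type(merged).__name__, len(merged.instances))
```

Constructing the result with type(self)(...) rather than Dataset(...) is what keeps the subtype intact in this sketch.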
-
-
class deep_qa.data.datasets.dataset.IndexedDataset(instances: typing.List[deep_qa.data.instances.instance.IndexedInstance])
Bases: deep_qa.data.datasets.dataset.Dataset
A Dataset of IndexedInstances, with some helper methods.
IndexedInstances have text sequences replaced with lists of word indices, and are thus able to be padded to consistent lengths and converted to training inputs.
-
as_training_data()
Takes each IndexedInstance and converts it into (inputs, labels), according to the Instance’s as_training_data() method. Both the inputs and the labels are numpy arrays. Note that if the Instances return tuples for their inputs, we convert the list of tuples into a tuple of lists before converting everything to numpy arrays.
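The "list of tuples into a tuple of lists" step can be sketched in plain Python. The helper name transpose_inputs and the example data are hypothetical, and the final conversion to numpy arrays is left out to keep the sketch dependency-free; in practice each resulting list would then be converted to an array.

```python
from typing import List, Sequence, Tuple


def transpose_inputs(inputs: Sequence[Tuple]) -> Tuple[List, ...]:
    """Turn [(a1, b1), (a2, b2)] into ([a1, a2], [b1, b2])."""
    # zip(*inputs) groups the first elements together, the second elements
    # together, and so on; each group becomes one list.
    return tuple(list(column) for column in zip(*inputs))


# Two hypothetical instances, each returning a pair of padded index lists.
per_instance_inputs = [
    ([1, 2, 0], [7, 8, 9, 0]),
    ([3, 4, 5], [6, 0, 0, 0]),
]
questions, passages = transpose_inputs(per_instance_inputs)
print(questions)  # [[1, 2, 0], [3, 4, 5]]
print(passages)   # [[7, 8, 9, 0], [6, 0, 0, 0]]
```

After this step, each element of the tuple holds one model input across the whole batch, which is the layout a multi-input model typically expects.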
-
pad_instances(padding_lengths: typing.Dict[str, int] = None, verbose: bool = True)
Makes all of the IndexedInstances in the dataset have the same length by padding them. This Dataset object doesn’t know what things there are in the Instance to pad, but the Instances do, and so does the model that called us, passing in a padding_lengths dictionary. The keys in that dictionary must match the lengths that the Instance knows about.
Given that, this method does two things: (1) it asks each of the Instances what their padding lengths are (using padding_lengths()) and takes the max of each; (2) it reconciles those values with the padding_lengths we were passed as an argument to this method, and pads the instances with IndexedInstance.pad(). If padding_lengths has a particular key specified with a value, that value takes precedence over whatever we computed in our data.
TODO(matt): with dynamic padding, we should probably have this be a max padding length, not a hard setting, but that requires some API changes.
This method modifies the current object; it does not return a new IndexedDataset.
Parameters:
padding_lengths: Dict[str, int]
If a key is present in this dictionary with a non-None value, we will pad to that length instead of the length calculated from the data. This lets you, e.g., set a maximum value for sentence length, or word length, if you want to throw out long sequences.
verbose: bool, optional (default=True)
Should we output logging information when we’re doing this padding? If the dataset is large, this is nice to have, because padding a large dataset could take a long time. But if you’re doing this inside of a data generator, having all of this output per batch is a bit obnoxious.
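The two-step reconciliation described for pad_instances can be sketched as follows. This is a minimal illustration under stated assumptions, not the deep_qa code: the function names, the key names, and the right-padding and truncation behavior of pad_sequence are all hypothetical.

```python
from typing import Dict, List, Optional


def reconcile_padding_lengths(
        instance_lengths: List[Dict[str, int]],
        padding_lengths: Optional[Dict[str, Optional[int]]] = None) -> Dict[str, int]:
    """Take the per-key max over instances, then apply caller overrides."""
    # Step 1: the longest value seen for each key across all instances.
    computed: Dict[str, int] = {}
    for lengths in instance_lengths:
        for key, value in lengths.items():
            computed[key] = max(computed.get(key, 0), value)
    # Step 2: any non-None value passed by the caller takes precedence.
    for key, value in (padding_lengths or {}).items():
        if value is not None:
            computed[key] = value
    return computed


def pad_sequence(indices: List[int], length: int, padding_value: int = 0) -> List[int]:
    """Right-pad or truncate to exactly `length` (assumed behavior)."""
    return (indices + [padding_value] * length)[:length]


# Hypothetical key name; real keys depend on the Instance type.
lengths = [{"num_sentence_words": 3}, {"num_sentence_words": 7}]
from_data = reconcile_padding_lengths(lengths)
capped = reconcile_padding_lengths(lengths, {"num_sentence_words": 5})
print(from_data)  # {'num_sentence_words': 7}
print(capped)     # {'num_sentence_words': 5}
print(pad_sequence([4, 9], capped["num_sentence_words"]))  # [4, 9, 0, 0, 0]
```

Note that in this sketch the override is a hard setting, matching the behavior described in the TODO above: a passed value of 5 is used even though the data contains a sequence of length 7.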
-
-
class deep_qa.data.datasets.dataset.TextDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)
Bases: deep_qa.data.datasets.dataset.Dataset
A Dataset of TextInstances, with a few helper methods.
TextInstances aren’t useful for much with Keras until they’ve been indexed. So this class just has methods to read in data from a file and convert it into other kinds of Datasets.
-
static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)
-
static