Datasets
The armory.data.datasets module implements functionality to return datasets of
various data modalities. By default, this is a NumPy ArmoryDataGenerator, which
implements the methods needed by the ART framework. Specifically, get_batch will
return a tuple of (data, labels) for a specified batch size in numpy format.
We have experimental support for returning tf.data.Dataset and
torch.utils.data.Dataset. These can be specified with the framework argument to
the dataset function. Options are <numpy|tf|pytorch>.
Currently, datasets are loaded using TensorFlow Datasets from cached tfrecord files.
These tfrecord files will be pulled from S3 if they are not available in your
dataset_dir directory.
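The (data, labels) batch contract described above can be illustrated without Armory installed. The following is a minimal sketch, not Armory's implementation: ToyBatchGenerator is a hypothetical stand-in that only mimics the get_batch interface, and the MNIST-like shape and zero-valued data are placeholders.

```python
import numpy as np


class ToyBatchGenerator:
    """Hypothetical stand-in that mimics only the get_batch() contract."""

    def __init__(self, x, y, batch_size):
        self.x = x
        self.y = y
        self.batch_size = batch_size
        self._cursor = 0

    def get_batch(self):
        """Return the next (data, labels) tuple, wrapping around at the end."""
        n = len(self.x)
        idx = np.arange(self._cursor, self._cursor + self.batch_size) % n
        self._cursor = (self._cursor + self.batch_size) % n
        return self.x[idx], self.y[idx]


# Placeholder data shaped like an mnist batch: (N, 28, 28, 1) float32, (N,) int64.
x = np.zeros((10, 28, 28, 1), dtype=np.float32)
y = np.arange(10, dtype=np.int64)

generator = ToyBatchGenerator(x, y, batch_size=4)
data, labels = generator.get_batch()
print(data.shape, data.dtype, labels.shape, labels.dtype)
```

Code that consumes a real generator can rely on the same tuple unpacking; only the construction of the generator differs.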
Image Datasets
Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
---|---|---|---|---|---|---|
cifar10 | CIFAR 10 classes image dataset | (N, 32, 32, 3) | float32 | (N,) | int64 | train, test |
german_traffic_sign | German traffic sign dataset | (N, variable_height, variable_width, 3) | float32 | (N,) | int64 | train, test |
imagenette | Smaller subset of 10 classes from Imagenet | (N, variable_height, variable_width, 3) | uint8 | (N,) | int64 | train, validation |
mnist | MNIST handwritten digit image dataset | (N, 28, 28, 1) | float32 | (N,) | int64 | train, test |
resisc45 | REmote Sensing Image Scene Classification | (N, 256, 256, 3) | float32 | (N,) | int64 | train, validation, test |
Coco2017 | Common Objects in Context | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, validation, test |
xView | Objects in Context in Overhead Imagery | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, test |
minicoco | A 3-class subset of Common Objects in Context | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, validation |
NOTE: the Coco2017 dataset's class labels are 0-indexed (start from 0).
Multimodal Image Datasets
Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
---|---|---|---|---|---|---|
so2sat | Co-registered synthetic aperture radar and multispectral optical images | (N, 32, 32, 14) | float32 | (N,) | int64 | train, validation |
carla_obj_det_train | CARLA Simulator Object Detection | (N, 960, 1280, 3 or 6) | float32 | n/a | List[dict] | train, val |
carla_over_obj_det_train | CARLA Simulator Object Detection | (N, 960, 1280, 3 or 6) | float32 | n/a | List[dict] | train, val |
CARLA Object Detection
The carla_obj_det_train dataset contains rgb and depth modalities. The modality defaults to rgb and must be one of ["rgb", "depth", "both"].
When using the dataset function imported from armory.data.datasets, this value is passed via the modality kwarg. When running an Armory scenario, the value
is specified in the dataset_config as such:
"dataset": {
"batch_size": 1,
"modality": "rgb",
}
When modality is set to "both", the input will be of shape (nb=1, 960, 1280, 6),
where x[..., :3] are the rgb channels and x[..., 3:] the depth channels.
The carla_over_obj_det_train dataset has the same properties as the dataset described above but was collected from overhead perspectives.
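The channel layout for the "both" modality can be sketched with plain NumPy slicing. This is an illustrative helper, not part of Armory: split_modalities is a hypothetical name, the zero-filled batch is a placeholder, and the assumption that a depth-only input has 3 channels is inferred from the "3 or 6" entry in the table above.

```python
import numpy as np

VALID_MODALITIES = ("rgb", "depth", "both")


def split_modalities(x, modality="rgb"):
    """Return (rgb, depth) views of a batch; the absent modality is None."""
    if modality not in VALID_MODALITIES:
        raise ValueError(f"modality must be one of {VALID_MODALITIES}, got {modality!r}")
    # Assumption: single-modality inputs have 3 channels, "both" has 6.
    expected_channels = 6 if modality == "both" else 3
    if x.shape[-1] != expected_channels:
        raise ValueError(
            f"expected {expected_channels} channels for {modality!r}, got {x.shape[-1]}"
        )
    if modality == "both":
        return x[..., :3], x[..., 3:]
    if modality == "rgb":
        return x, None
    return None, x


# Placeholder batch with the documented shape for modality="both".
batch = np.zeros((1, 960, 1280, 6), dtype=np.float32)
rgb, depth = split_modalities(batch, modality="both")
print(rgb.shape, depth.shape)
```

Because the slices are views, no pixel data is copied when the two modalities are separated.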
Audio Datasets
Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | sampling_rate | splits |
---|---|---|---|---|---|---|---|
digit | Audio dataset of spoken digits | (N, variable_length) | int64 | (N,) | int64 | 8 kHz | train, test |
librispeech | Librispeech dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | dev_clean, dev_other, test_clean, train_clean100 |
librispeech-full | Full Librispeech dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | dev_clean, dev_other, test_clean, train_clean100, train_clean360, train_other500 |
librispeech_dev_clean | Librispeech dev dataset for speaker identification | (N, variable_length) | float32 | (N,) | int64 | 16 kHz | train, validation, test |
librispeech_dev_clean_asr | Librispeech dev dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | train, validation, test |
speech_commands | Speech commands dataset for audio poisoning | (N, variable_length) | float32 | (N,) | int64 | 16 kHz | train, validation, test |
NOTE: because the Librispeech dataset is over 300 GB with all splits, the librispeech_full
dataset has all splits, whereas the librispeech dataset does not have the
train_clean360 or train_other500 splits.
Video Datasets
Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
---|---|---|---|---|---|---|
ucf101 | UCF 101 Action Recognition | (N, variable_frames, None, None, 3) | float32 | (N,) | int64 | train, test |
ucf101_clean | UCF 101 Action Recognition | (N, variable_frames, None, None, 3) | float32 | (N,) | int64 | train, test |
NOTE: The dimension of UCF101 videos is (N, variable_frames, 240, 320, 3)
for the entire training set and all of the test set except for 4 examples.
For those, the dimensions are (N, variable_frames, 226, 400, 3). If not shuffled,
these correspond to (0-indexed) examples 333, 694, 1343, and 3218.
NOTE: The only difference between ucf101 and ucf101_clean is that the latter uses
the ffmpeg flag -q:v 2, which results in fewer video compression errors. These are
stored as separate datasets, however.
Preprocessing
Armory applies preprocessing to convert each dataset to canonical form (e.g., normalize the range of values and set the data type). The poisoning scenario loads its own custom preprocessing; however, the GTSRB data is also available in its canonical form. Any additional preprocessing should occur as part of the model under evaluation.
Canonical preprocessing is not yet supported when framework is tf or pytorch.
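As a rough illustration of what converting to canonical form can involve, the sketch below normalizes the value range and sets the data type of an image batch. It is not Armory's preprocessing code: to_canonical_image is a hypothetical name, and the choice of float32 values in [0, 1] as the target is an assumption made for this example.

```python
import numpy as np


def to_canonical_image(x):
    """Scale uint8 pixels to float32 in [0, 1] (assumed canonical form)."""
    if x.dtype != np.uint8:
        raise ValueError(f"expected uint8 input, got {x.dtype}")
    # Convert the dtype first so the division is done in float32.
    return x.astype(np.float32) / 255.0


# Tiny placeholder batch covering the extremes of the uint8 range.
raw = np.array([[[[0, 128, 255]]]], dtype=np.uint8)
canonical = to_canonical_image(raw)
print(canonical.dtype, canonical.min(), canonical.max())
```

Anything beyond this kind of range and type normalization, such as resizing or model-specific standardization, belongs in the model under evaluation.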
Splits
Datasets that are imported directly from TFDS have splits that are defined according to the
TensorFlow Datasets library. The german_traffic_sign and digit dataset splits each
follow the description of the original source of the dataset. The following
table describes datasets with custom splits in Armory.
| Dataset | Split | Description | Split logic details |
|:---------------------:|:----------:|:--------------------------------------:|:------------------------------------------------------:|
| resisc45 | train | First 5/7 of dataset | See armory/data/resisc45/resisc45_dataset_partition.py |
| | validation | Next 1/7 of dataset | |
| | test | Final 1/7 of dataset | |
| librispeech_dev_clean | train | 1371 recordings from dev_clean dataset | Assign discrete clips so at least 50% of audio time |
| | validation | 692 recordings from dev_clean dataset | is in train, at least 25% is in validation, |
| | test | 640 recordings from dev_clean dataset | and the remainder are in test |
Adversarial Datasets
See adversarial_datasets.md for descriptions of Armory's adversarial datasets.
Dataset Licensing
See dataset_licensing.md for details related to the licensing of datasets.