# Datasets

The `armory.data.datasets` module implements functionality to return datasets of various data modalities. By default, a dataset function returns a NumPy `ArmoryDataGenerator`, which implements the methods needed by the ART framework. Specifically, `get_batch` returns a tuple of `(data, labels)` for a specified batch size, in NumPy format.

We have experimental support for returning `tf.data.Dataset` and `torch.utils.data.Dataset`. These can be specified with the `framework` argument to the dataset function. Options are `<numpy|tf|pytorch>`.

Currently, datasets are loaded using TensorFlow Datasets from cached TFRecord files. These TFRecord files will be pulled from S3 if they are not available in your `dataset_dir` directory.
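The batch interface described above can be sketched with a minimal NumPy stand-in. This is illustrative only: the real `ArmoryDataGenerator` lives in `armory.data.datasets` and is constructed for you by the dataset functions; the `ToyDataGenerator` class here is a hypothetical mock showing only the `get_batch` contract of `(data, labels)` tuples.

```python
import numpy as np

class ToyDataGenerator:
    """Hypothetical stand-in mimicking the (data, labels) batch interface."""
    def __init__(self, x, y, batch_size):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self._i = 0

    def get_batch(self):
        # Return one (data, labels) tuple of batch_size examples, advancing
        # an internal cursor that wraps around the dataset.
        s = slice(self._i, self._i + self.batch_size)
        self._i = (self._i + self.batch_size) % len(self.x)
        return self.x[s], self.y[s]

# Fake MNIST-shaped data: 10 examples of (28, 28, 1) float32 images
x = np.zeros((10, 28, 28, 1), dtype=np.float32)
y = np.arange(10, dtype=np.int64)

gen = ToyDataGenerator(x, y, batch_size=4)
data, labels = gen.get_batch()
print(data.shape, labels.shape)  # (4, 28, 28, 1) (4,)
```

An ART attack or defense consuming this generator only needs the tuple shape of `get_batch`, which is why the NumPy format is the default.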

## Image Datasets

| Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
|:-------:|:-----------:|:-------:|:-------:|:-------:|:-------:|:------:|
| cifar10 | CIFAR 10-class image dataset | (N, 32, 32, 3) | float32 | (N,) | int64 | train, test |
| german_traffic_sign | German traffic sign dataset | (N, variable_height, variable_width, 3) | float32 | (N,) | int64 | train, test |
| imagenette | Smaller subset of 10 classes from ImageNet | (N, variable_height, variable_width, 3) | uint8 | (N,) | int64 | train, validation |
| mnist | MNIST handwritten digit image dataset | (N, 28, 28, 1) | float32 | (N,) | int64 | train, test |
| resisc45 | REmote Sensing Image Scene Classification | (N, 256, 256, 3) | float32 | (N,) | int64 | train, validation, test |
| coco2017 | Common Objects in Context | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, validation, test |
| xview | Objects in Context in Overhead Imagery | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, test |
| minicoco | A 3-class subset of Common Objects in Context | (N, variable_height, variable_width, 3) | float32 | n/a | List[dict] | train, validation |

NOTE: the coco2017 dataset's class labels are 0-indexed (they start from 0).

## Multimodal Image Datasets

| Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
|:-------:|:-----------:|:-------:|:-------:|:-------:|:-------:|:------:|
| so2sat | Co-registered synthetic aperture radar and multispectral optical images | (N, 32, 32, 14) | float32 | (N,) | int64 | train, validation |
| carla_obj_det_train | CARLA Simulator Object Detection | (N, 960, 1280, 3 or 6) | float32 | n/a | List[dict] | train, val |
| carla_over_obj_det_train | CARLA Simulator Object Detection | (N, 960, 1280, 3 or 6) | float32 | n/a | List[dict] | train, val |

### CARLA Object Detection

The carla_obj_det_train dataset contains rgb and depth modalities. The modality defaults to `"rgb"` and must be one of `["rgb", "depth", "both"]`. When using the dataset function imported from `armory.data.datasets`, this value is passed via the `modality` kwarg. When running an Armory scenario, the value is specified in the `dataset_config` as follows:

 "dataset": {
    "batch_size": 1,
    "modality": "rgb",
}

When `modality` is set to `"both"`, the input will have shape `(nb=1, 960, 1280, 6)`, where `x[..., :3]` are the rgb channels and `x[..., 3:]` are the depth channels.
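Separating a `"both"`-modality batch into its rgb and depth components is plain NumPy slicing, per the shapes described above (the zero-filled array here is a dummy stand-in for a real batch):

```python
import numpy as np

# Dummy "both"-modality CARLA batch: nb=1, 960 x 1280 pixels, 6 channels
x = np.zeros((1, 960, 1280, 6), dtype=np.float32)

rgb = x[..., :3]    # first three channels: rgb
depth = x[..., 3:]  # last three channels: depth

print(rgb.shape, depth.shape)  # (1, 960, 1280, 3) (1, 960, 1280, 3)
```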

The carla_over_obj_det_train dataset has the same properties as the dataset above but is collected from overhead perspectives.

## Audio Datasets

| Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | sampling_rate | splits |
|:-------:|:-----------:|:-------:|:-------:|:-------:|:-------:|:-------------:|:------:|
| digit | Audio dataset of spoken digits | (N, variable_length) | int64 | (N,) | int64 | 8 kHz | train, test |
| librispeech | Librispeech dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | dev_clean, dev_other, test_clean, train_clean100 |
| librispeech_full | Full Librispeech dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | dev_clean, dev_other, test_clean, train_clean100, train_clean360, train_other500 |
| librispeech_dev_clean | Librispeech dev dataset for speaker identification | (N, variable_length) | float32 | (N,) | int64 | 16 kHz | train, validation, test |
| librispeech_dev_clean_asr | Librispeech dev dataset for automatic speech recognition | (N, variable_length) | float32 | (N,) | bytes | 16 kHz | train, validation, test |
| speech_commands | Speech commands dataset for audio poisoning | (N, variable_length) | float32 | (N,) | int64 | 16 kHz | train, validation, test |

NOTE: because the Librispeech dataset is over 300 GB with all splits, only the librispeech_full dataset has all splits; the librispeech dataset does not have the train_clean360 or train_other500 splits.

## Video Datasets

| Dataset | Description | x_shape | x_dtype | y_shape | y_dtype | splits |
|:-------:|:-----------:|:-------:|:-------:|:-------:|:-------:|:------:|
| ucf101 | UCF 101 Action Recognition | (N, variable_frames, None, None, 3) | float32 | (N,) | int64 | train, test |
| ucf101_clean | UCF 101 Action Recognition | (N, variable_frames, None, None, 3) | float32 | (N,) | int64 | train, test |

NOTE: The dimension of UCF101 videos is (N, variable_frames, 240, 320, 3) for the entire training set and all of the test set except for 4 examples, whose dimensions are (N, variable_frames, 226, 400, 3). If not shuffled, these correspond to (0-indexed) examples 333, 694, 1343, and 3218.

NOTE: The only difference between ucf101 and ucf101_clean is that the latter uses the ffmpeg flag `-q:v 2`, which results in fewer video compression errors. They are stored as separate datasets, however.


## Preprocessing

Armory applies preprocessing to convert each dataset to canonical form (e.g., normalizing the range of values and setting the data type). The poisoning scenario loads its own custom preprocessing; however, the GTSRB data is also available in its canonical form. Any additional preprocessing that is desired should occur as part of the model under evaluation.
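As an illustration of what canonical form means for image data, consider casting uint8 pixels to float32 scaled into [0, 1]. This is a sketch of the idea only, not Armory's exact preprocessing function:

```python
import numpy as np

def to_canonical(x_uint8):
    """Illustrative canonicalization: cast uint8 images to float32 in [0, 1]."""
    return x_uint8.astype(np.float32) / 255.0

x = np.array([[0, 128, 255]], dtype=np.uint8)
x_canon = to_canonical(x)
print(x_canon.dtype, x_canon.min(), x_canon.max())  # float32 0.0 1.0
```

Because the data arrive already in canonical form, a model-specific step (say, ImageNet mean/std normalization) belongs inside the model under evaluation, not in the dataset pipeline.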

Canonical preprocessing is not yet supported when `framework` is `tf` or `pytorch`.

## Splits

Datasets that are imported directly from TFDS have splits that are defined according to the TensorFlow Datasets library. The german_traffic_sign and digit dataset splits follow the descriptions of their original sources. The following table describes datasets with custom splits in Armory.

| Dataset | Split | Description | Split logic details |
|:---------------------:|:----------:|:--------------------------------------:|:------------------------------------------------------:|
| resisc45 | train | First 5/7 of dataset | See armory/data/resisc45/resisc45_dataset_partition.py |
| | validation | Next 1/7 of dataset | |
| | test | Final 1/7 of dataset | |
| librispeech_dev_clean | train | 1371 recordings from dev_clean dataset | Assign discrete clips so at least 50% of audio time |
| | validation | 692 recordings from dev_clean dataset | is in train, at least 25% is in validation, |
| | test | 640 recordings from dev_clean dataset | and the remainder are in test |
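Assuming RESISC-45's standard size of 31,500 images (45 classes with 700 images each, an assumption about the source data rather than something stated above), the 5/7 : 1/7 : 1/7 partition works out as follows:

```python
# Assumed RESISC-45 size: 45 classes x 700 images each
total = 45 * 700

train = total * 5 // 7       # first 5/7 of the dataset
validation = total // 7      # next 1/7
test = total - train - validation  # final 1/7 (the remainder)

print(total, train, validation, test)  # 31500 22500 4500 4500
```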


## Adversarial Datasets

See [adversarial_datasets.md](adversarial_datasets.md) for descriptions of Armory's adversarial datasets.

## Dataset Licensing

See [dataset_licensing.md](dataset_licensing.md) for details related to the licensing of datasets.