Instructions to Integrate TFDS Datasets
- Get the name and version number of the TensorFlow Datasets (TFDS) dataset, and optionally its config, in the form "name[/config]:version_number", where the brackets denote optional text.
- Set the environment variables ARMORY_PRIVATE_S3_ID and ARMORY_PRIVATE_S3_KEY to credentials with write access to the Armory S3 bucket.
- From a locally cloned version of armory, on a new branch, run:
python -m armory exec pytorch -- python -m armory.data.integrate_tfds name[/config]:version
where the brackets denote optional text.
The script downloads and processes the TFDS dataset, generates TFRecord files, creates a tarball, and uploads the tarball to S3. It also creates an S3 checksum file at armory/data/cached_s3_checksums/{name}.txt.
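Before launching the integration script, it can help to validate the dataset string and confirm that both credentials are set. The following is a minimal sketch, not part of Armory: the helper names (parse_spec, missing_credentials, build_command) are hypothetical, and the pattern assumes the version is written as three dot-separated integers such as 3.0.1.

```python
import os
import re

# Hypothetical pre-flight helpers; none of these names exist in Armory.
# Assumption: versions look like "3.0.1" (three dot-separated integers).
SPEC_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z0-9_]+)(?:/(?P<config>[^:/]+))?:(?P<version>\d+\.\d+\.\d+)$"
)

# Environment variables named in the instructions above.
REQUIRED_ENV = ("ARMORY_PRIVATE_S3_ID", "ARMORY_PRIVATE_S3_KEY")


def parse_spec(spec):
    """Split 'name[/config]:version' into (name, config, version).

    config is None when the optional '/config' part is absent.
    """
    match = SPEC_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"expected 'name[/config]:version', got {spec!r}")
    return match.group("name"), match.group("config"), match.group("version")


def missing_credentials(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [key for key in REQUIRED_ENV if not environ.get(key)]


def build_command(spec):
    """Return the documented command as an argument list, after validating spec."""
    parse_spec(spec)
    return [
        "python", "-m", "armory", "exec", "pytorch", "--",
        "python", "-m", "armory.data.integrate_tfds", spec,
    ]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit(f"missing environment variables: {', '.join(missing)}")
    print(" ".join(build_command("mnist:3.0.1")))
```

The argument list returned by build_command can be passed to subprocess.run if you prefer to launch the script from Python rather than from the shell.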
- Run git status to confirm the S3 checksum file was generated and to see the path of the template file.
- Manually put the template code from TEMPLATE_{name}.txt in armory/data/datasets.py. Create a context object that contains metadata, and a preprocessing function that performs the appropriate integrity checks and input normalization. For example, see the canonical fixed-size image preprocessing function, which checks the shape of an image and renormalizes it to the range defined by the context object (typically 0.0-1.0) with a standard type. See the documentation on dataset preprocessing for more details.
- [Optional] Add the dataset to the SUPPORTED_DATASETS dictionary by adding a key with the dataset's name and a value of the dataset function from the template code.
- [Optional] Create a continuous integration test for the dataset in tests/test_docker/test_dataset.py, possibly using pytest.skip.
- Commit the changes to the branch on your fork of the Armory repo.
- Open a PR to integrate the dataset.
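
The context object and preprocessing function described in the template step can be sketched roughly as follows. This is an illustrative NumPy example under stated assumptions, not Armory's actual code: the class and function names are hypothetical, and it assumes uint8 input images in the range 0-255 that should be rescaled to float32 in 0.0-1.0.

```python
import numpy as np


class ImageContext:
    """Hypothetical metadata holder for a fixed-size image dataset."""

    def __init__(self, x_shape):
        self.x_shape = tuple(x_shape)  # per-example shape, e.g. (28, 28, 1)
        # Assumed input format: raw uint8 pixels in [0, 255].
        self.input_type = np.uint8
        self.input_min, self.input_max = 0, 255
        self.quantization = 255
        # Assumed output format: float32 in [0.0, 1.0].
        self.output_type = np.float32
        self.output_min, self.output_max = 0.0, 1.0


def preprocess_fixed_size_images(context, batch):
    """Check shape, dtype, and range of a batch, then rescale it."""
    # Integrity check: a leading batch axis followed by the expected shape.
    if batch.ndim != len(context.x_shape) + 1 or batch.shape[1:] != context.x_shape:
        raise ValueError(f"batch shape {batch.shape} != (N,) + {context.x_shape}")
    if batch.dtype != context.input_type:
        raise ValueError(f"batch dtype {batch.dtype} != {context.input_type}")
    if batch.min() < context.input_min or batch.max() > context.input_max:
        raise ValueError("input values fall outside the expected range")

    # Normalization: convert to the output type and rescale.
    out = batch.astype(context.output_type) / context.quantization

    # Confirm the result matches what the context promises.
    if out.dtype != context.output_type:
        raise ValueError(f"output dtype {out.dtype} != {context.output_type}")
    if out.min() < context.output_min or out.max() > context.output_max:
        raise ValueError("output values fall outside the expected range")
    return out


if __name__ == "__main__":
    context = ImageContext((2, 2, 1))
    images = np.array([0, 51, 102, 255], dtype=np.uint8).reshape(1, 2, 2, 1)
    print(preprocess_fixed_size_images(context, images).ravel())
```

Keeping the expected ranges on the context object, rather than hard-coding them in the function, lets one preprocessing function serve several datasets that differ only in image shape.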