Data

Overview

Dataset stores the samples and their corresponding labels.
DataLoader wraps an iterable around the Dataset to enable easy access to the samples and easy batching, shuffling and multiprocess data loading.

Dataset

3 magic methods in the Dataset class should be implemented:

from torch.utils.data import Dataset


class MyDataset__(Dataset):
    def __init(self, *args, **kargs):
        pass

    def __len__(self):
        """Return the size of dataset when called by len(dataset)."""
        pass

    def __getitem__(self, idx):
        """Return the idx-th sample when called by dataset[idx]."""
        pass

Dataloader

torch.utils.data.DataLoader supports for:

map-style and iterable-style datasets
customizing data loading order
automatic batching
single- and multi-process data loading
automatic memory pinning

Dataset types

Map-style datasets: __getitem__ and __len__
Iterable-style datasets: __iter__ and __len__

Data loading order

For iterable-style datasets, the data loading order is entirely controlled by the user-defined iterable.

Loading batched and non-batched data

DataLoader supports automatically collating individual fetched data samples into batches via arguments batch_size: how many samples per batch to load

drop_last
batch_sampler: iterable, a sampler that returns mini-batches of indices. batch_sampler is mutually exclusive with batch_size, shuffle, sampler, and drop_last.
collate_fn: callable, merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

Single- and Multi-process Data Loading

num_workers: how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. Generally speaking, num_workers are set to the number of available CPUs.

Pinning memory

Pinning memory makes data transfer between CPU and GPU faster, but it may also lead to increased memory usage. Therefore, you should only pin memory if you know it will improve the performance of your application, and be aware that it may indeed decrease performance in some cases.

API

The parameters of DataLoader are as follows:

dataset: Dataset
batch_size: the size of each batch
shuffle: whether to shuffle the data
sampler: defines the strategy to draw samples from the dataset.
batch_sampler: like the sampler, but returns a batch of indices at a time.
num_workers: how many subprocesses to use for data loading.
collate_fn: merges a list of samples to form a mini-batch of Tensor(s).
pin_memory: whether to copy tensors into CUDA pinned memory.
drop_last: whether to drop the last incomplete batch.

Other useful APIs

concat: Concatenates a list of tensors along a given dimension.

subset: Subset of a dataset at specified indices.

Tips

Subset is the subclass of Dataset.

def __init__(self, dataset: Dataset[T_co], indices: Sequence[int]) -> None:
    self.dataset = dataset
    self.indices = indices

def __getitem__(self, idx):
    if isinstance(idx, list):
        return self.dataset[[self.indices[i] for i in idx]]
    return self.dataset[self.indices[idx]]

def __len__(self):
    return len(self.indices)

random_split: Randomly split a dataset into non-overlapping new datasets of given lengths.
sampler: Base class for all Samplers. RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler and BatchSampler are subclasses of Sampler.
The difference between `Sampler` and `BatchSampler`
When using auto_collation, specified batch_sampler according to,
```
@property
def _index_sampler(self):
    if self._auto_collation:
        return self.batch_sampler
    else:
        return self.sampler
```

Data

# Overview

# Dataset

# Dataloader

# Dataset types

# Data loading order

# Loading batched and non-batched data

# Single- and Multi-process Data Loading

# Pinning memory

# API

# Other useful APIs

# References