class documentation

A class that accepts a FiftyOne dataset and creates a corresponding torch.utils.data.Dataset.

Notes

General:

  • This only works with persistent datasets. To make a dataset persistent, set dataset.persistent = True.

  • Process start methods

    It is recommended to use the 'spawn' or 'forkserver' start methods over 'fork':

    - 'spawn' and 'forkserver' are safer when dealing with code that may be threaded (which a lot of code is, for example NumPy).
    - MongoDB, which backs FiftyOne's database, is not fork safe. In theory nothing here should break, but you will see a lot of warnings.
    - When using persistent_workers=True, the overhead of 'spawn' and 'forkserver' is low.
    - When using Jupyter notebooks, if you want all of your code to live in the notebook rather than calling it from a python file, you'll have to use 'fork'. Do so at your own risk.

    You can set the start method for all of your torch code with torch.multiprocessing.set_start_method('forkserver') (see the sketch after these notes).

  • Do not touch self.samples or subscript this object until after all workers are initialized. This avoids unnecessary memory usage, and if you're using DDP, it will keep your code from crashing.
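For reference, a minimal setup sketch honoring the notes above ("quickstart" is only an example dataset name):

    import fiftyone as fo
    import torch.multiprocessing

    # This class only works with persistent datasets
    dataset = fo.load_dataset("quickstart")  # example dataset name
    dataset.persistent = True

    # Prefer 'forkserver' (or 'spawn') over 'fork'; MongoDB, which backs
    # FiftyOne's database, is not fork safe
    torch.multiprocessing.set_start_method("forkserver")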

DDP:

  • A helper function :method:`distributed_init` is provided, to be called by each trainer process at the beginning of DDP training. This function:

    - Safely creates a database connection for each trainer
    - Shares an authkey between all local processes to ensure they can communicate

torch.utils.data.DataLoader use:

  • DO NOT use torch.Tensor.to in your get_item function; do so in the train loop, or use the pin_memory argument of your torch.utils.data.DataLoader (a sketch follows these notes).
  • If using a dataloader with many workers, remember to pass :method:`fo.utils.torch.FiftyOneTorchDataset.worker_init` to the worker_init_fn argument. This class will not work otherwise.
  • Using persistent_workers=True is a good idea.
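As a sketch, assuming the class lives at fiftyone.utils.torch (per the reference above) and that torch_dataset is an instance of it:

    import torch
    import fiftyone.utils.torch as fout

    loader = torch.utils.data.DataLoader(
        torch_dataset,  # a FiftyOneTorchDataset
        batch_size=16,
        num_workers=4,
        worker_init_fn=fout.FiftyOneTorchDataset.worker_init,  # required
        persistent_workers=True,  # amortizes 'spawn'/'forkserver' startup cost
        pin_memory=True,  # instead of calling torch.Tensor.to in get_item
    )

    for images, labels in loader:
        # device transfers belong in the train loop, not in get_item
        images = images.to("cuda", non_blocking=True)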

On reading from and writing to the FO object during training:

  • Reading

    - Try to have as many of the reads as possible happen in get_item or when caching fields, rather than in the main training loop (see the sketch below). Reads from the FO object during training may slow your code down significantly.

  • Writing

    - Writing currently happens on the process from which it is called. This shows moderate slowdown, and will be addressed.
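For example, a get_item that keeps all FO reads inside the workers might look like this (the ground_truth field name is hypothetical; note that get_item must be a serializable, module-level function):

    from PIL import Image
    import torchvision.transforms.functional as F

    def get_item(sample):
        # All reads from the FO object happen here, inside a worker,
        # rather than in the main training loop
        image = Image.open(sample.filepath).convert("RGB")
        label = sample.ground_truth.label  # hypothetical Classification field
        return F.to_tensor(image), label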

Parameters

samples: a fo.core.collections.SampleCollection
get_item: a Callable[[fo.core.sample.SampleView], Any]. Must be a serializable function.
cache_field_names: a list of strings. Fields to cache in memory. If this argument is passed, get_item should map a dict, whose keys and values correspond to the sample's field names and values, to the model input. This argument is highly recommended, as it offers a significant performance boost. Please note: the field values must be pickle serializable, i.e. pickle.dumps(field_value) should not raise an error, and pickle.loads(pickle.dumps(field_value)) should have all of the functionality of the original field value that you would need in your get_item function.
local_process_group: the process group with each of the processes running the main train script on the machine this object is on.
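To illustrate the cache_field_names contract, a sketch (module path and field names are assumptions, as above):

    import fiftyone.utils.torch as fout

    def get_item(d):
        # With cache_field_names, get_item receives a dict of cached
        # field values rather than a SampleView, e.g.
        # {"filepath": "/path/to/img.jpg", "ground_truth": <Classification>}
        ...

    torch_dataset = fout.FiftyOneTorchDataset(
        dataset,
        get_item,
        cache_field_names=["filepath", "ground_truth"],
    )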
Static Method distributed_init Function to be called by each trainer process in DDP training. Facilitates communication between processes. Safely creates database connection for each trainer.
Static Method worker_init Worker initialization function; pass it as the worker_init_fn argument of any torch.utils.data.DataLoader that wraps this dataset.
Method __getitem__ Returns get_item applied to the sample (or dict of cached field values) at the given index.
Method __init__ Undocumented
Method __len__ Returns the number of samples in the collection.
Instance Variable cache_field_names Undocumented
Instance Variable cached_fields Undocumented
Instance Variable get_item Undocumented
Instance Variable ids Undocumented
Instance Variable name Undocumented
Instance Variable stages Undocumented
Property samples The fo.core.collections.SampleCollection backing this dataset.
Method _load_cached_fields Undocumented
Method _load_samples Undocumented
Instance Variable _dataset Undocumented
Instance Variable _samples Undocumented
@staticmethod
def distributed_init(dataset_name, local_process_group, view_name=None): (source)

Function to be called by each trainer process in DDP training. Facilitates communication between processes. Safely creates database connection for each trainer.

This function should be called at the beginning of the training script.

Parameters

dataset_name: the name of the dataset to load
local_process_group: the process group with all the processes running the main training script
view_name: the name of the view to load (default None). If None, the whole dataset is loaded.
Returns
The loaded fiftyone.core.dataset.Dataset or fiftyone.core.view.DatasetView
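A usage sketch from within a DDP trainer process (the process-group construction and dataset name are placeholders):

    import torch.distributed as dist
    import fiftyone.utils.torch as fout

    def train(rank, world_size):
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        # a group containing the processes running this script on this machine
        local_group = dist.new_group(ranks=list(range(world_size)))

        # safely connect this trainer to the FiftyOne database
        samples = fout.FiftyOneTorchDataset.distributed_init(
            "my-dataset", local_group
        )
        ...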
@staticmethod
def worker_init(worker_id): (source)

Worker initialization function; pass it as the worker_init_fn argument of any torch.utils.data.DataLoader that wraps this dataset (see the notes above).

def __getitem__(self, index): (source)

Returns get_item applied to the sample (or dict of cached field values) at the given index.

def __init__(self, samples: focol.SampleCollection, get_item: Callable[fos.SampleView, Any], cache_field_names: list[str] = None, local_process_group=None): (source)

Undocumented

def __len__(self): (source)

Returns the number of samples in the collection.

cache_field_names: None = (source)

Undocumented

cached_fields: dict = (source)

Undocumented

get_item = (source)

Undocumented

ids = (source)

Undocumented

name = (source)

Undocumented

stages = (source)

Undocumented

@property
def samples(self): (source)

The fo.core.collections.SampleCollection backing this dataset.

def _load_cached_fields(self, samples, cache_field_names, local_process_group): (source)

Undocumented

def _load_samples(self): (source)

Undocumented

_dataset = (source)

Undocumented

_samples = (source)

Undocumented