darwin.dataset package

Submodules

darwin.dataset.download_manager module

Holds helper functions that deal with downloading videos and images.

darwin.dataset.download_manager.download_all_images_from_annotations(client: Client, annotations_path: Path, images_path: Path, force_replace: bool = False, remove_extra: bool = False, annotation_format: str = 'json', use_folders: bool = True, video_frames: bool = False, force_slots: bool = False, ignore_slots: bool = False) → Tuple[Callable[[], Iterable[Any]], int][source]

Downloads all the images corresponding to a project.

Parameters:

api_key (str) – API Key of the current team
annotations_path (Path) – Path where the annotations are located
images_path (Path) – Path where to download the images
force_replace (bool, default: False) – Forces the re-download of an existing image
remove_extra (bool, default: False) – Removes local files that would not be overwritten by the release being pulled.
annotation_format (str, default: "json") – Format of the annotations. Currently only JSON and xml are expected
use_folders (bool, default: False) – Recreate folders
video_frames (bool, default: False) – Pulls video frames images instead of video files
force_slots (bool, default: False) – Pulls all slots of items into deeper file structure ({prefix}/{item_name}/{slot_name}/{file_name}) If False, all multi-slotted items and items with slots containing multiple source files will be downloaded as the deeper file structure

Returns:

generator (function) – Generator for doing the actual downloads
count (int) – The files count

Raises:

ValueError – If the given annotation file is not in darwin (json) or pascalvoc (xml) format.

darwin.dataset.download_manager.lazy_download_image_from_annotation(client: Client, annotation: AnnotationFile, images_path: Path, annotation_format: str, use_folders: bool, video_frames: bool, force_slots: bool, ignore_slots: bool = False) → Iterable[Callable[[], None]][source]

Returns functions to download an image given an annotation. Same as download_image_from_annotation but returns Callables that trigger the download instead fetching files interally.

Parameters:

client (Client) – Client of the current team
annotation (AnnotationFile) – Annotation file corresponding to the dataset file
images_path (Path) – Path where to download the image
annotation_format (str) – Format of the annotations. Currently only JSON is supported
use_folders (bool) – Recreate folder structure
video_frames (bool) – Pulls video frames images instead of video files
force_slots (bool) – Pulls all slots of items into deeper file structure ({prefix}/{item_name}/{slot_name}/{file_name})

Raises:

NotImplementedError – If the format of the annotation is not supported.

darwin.dataset.download_manager.download_manifest_txts(urls: List[str], client: Client, folder: Path) → List[Path][source]

darwin.dataset.download_manager.get_segment_manifests(slot: Slot, parent_path: Path, client: Client) → List[SegmentManifest][source]

darwin.dataset.identifier module

class darwin.dataset.identifier.DatasetIdentifier(dataset_slug: str, team_slug: str | None = None, version: str | None = None)[source]

Bases: object

Formal representation of a dataset identifier for the SDK.

A dataset identifier is a string that uniquely identifies a dataset on Darwin. A dataset identifier is made of the following substrings: <team-slug>/<dataset-slug>:<version>.

If version is missing, it defaults to latest.

Parameters:

dataset_slug (str) – The slugified name of the dataset.
team_slug (Optional[str], default: None) – The slugified name of the team.
version (Optional[str], default: None) – The version of the identifier.

dataset_slug

The slugified name of the dataset.

Type:: str

team_slug

The slugified name of the team.

Type:: Optional[str], default: None

version

The version of the identifier.

Type:: Optional[str], default: None

classmethod parse(identifier: str | DatasetIdentifier) → DatasetIdentifier[source]

Parses the given identifier and returns the corresponding DatasetIdentifier.

Parameters:: identifier (Union[str, DatasetIdentifier]) – The identifier to be parsed.
Returns:: The SDK representation of a DatasetIdentifier.
Return type:: DatasetIdentifier
Raises:: ValueError – If the identifier given is invalid.

darwin.dataset.local_dataset module

class darwin.dataset.local_dataset.LocalDataset(dataset_path: Path, annotation_type: str, partition: str | None = None, split: str = 'default', split_type: str = 'random', release_name: str | None = None, keep_empty_annotations: bool = False)[source]

Bases: object

Base class representing a V7 Darwin dataset that has been pulled locally already. It can be used with PyTorch dataloaders. See darwin.torch module for more specialized dataset classes, extending this one.

Parameters:

dataset_path (Path) – Path to the location of the dataset on the file system.
annotation_type (str) – The type of annotation classes ["tag", "bounding_box", "polygon"].
partition (Optional[str], default: None) – Selects one of the partitions ["train", "val", "test"].
split (str, default: "default") – Selects the split that defines the percentages used (use ‘default’ to select the default split).
split_type (str, default: "random") – Heuristic used to do the split ["random", "stratified"].
release_name (Optional[str], default: None) – Version of the dataset.

dataset_path

Path to the location of the dataset on the file system.

Type:: Path

annotation_type

The type of annotation classes ["tag", "bounding_box", "polygon"].

Type:: str

partition

Selects one of the partitions ["train", "val", "test"].

Type:: Optional[str], default: None

split

Selects the split that defines the percentages used (use ‘default’ to select the default split).

Type:: str, default: “default”

split_type

Heuristic used to do the split ["random", "stratified"].

Type:: str, default: “random”

release_name

Version of the dataset.

Type:: Optional[str], default: None

Raises:

ValueError –

If partition, split_type or annotation_type have an invalid value. - If an annotation has no corresponding image - If an image has multiple extensions (meaning it is present in multiple formats) - If no images are found

get_img_info(index: int) → Dict[str, Any][source]

Returns the annotation information for a given image.

Parameters:: index (int) – The index of the image.
Returns:: A dictionary with the image’s class and annotaiton information.
Return type:: Dict[str, Any]
Raises:: ValueError – If there are no annotations downloaded in this machine. You can pull them by using the command darwin dataset pull $DATASET_NAME --only-annotations in the CLI.

get_height_and_width(index: int) → Tuple[float, float][source]

Returns the width and height of the image with the given index.

Parameters:: index (int) – The index of the image.
Returns:: A tuple where the first element is the height of the image and the second is the width.
Return type:: Tuple[float, float]

extend(dataset: LocalDataset, extend_classes: bool = False) → LocalDataset[source]

Extends the current dataset with another one.

Parameters:

dataset (Dataset) – Dataset to merge
extend_classes (bool, default: False) – Extend the current set of classes by merging it with the set of classes belonging to the given dataset.

Returns:

This LocalDataset extended with the classes of the give one.

Return type:

LocalDataset

Raises:

ValueError –

If the annotation_type of this LocalDataset differs from the annotation_type of the given one. - If the set of classes from this LocalDataset differs from the set of classes from the given one AND extend_classes is False.

get_image(index: int) → Image[source]

Returns the correspoding PILImage.Image.

Parameters:: index (int) – The index of the image in this LocalDataset.
Returns:: The image.
Return type:: PILImage.Image

get_image_path(index: int) → Path[source]

Returns the path of the image with the given index.

Parameters:: index (int) – The index of the image in this LocalDataset.
Returns:: The Path of the image.
Return type:: Path

parse_json(index: int) → Dict[str, Any][source]

Load an annotation and filter out the extra classes according to what is specified in self.classes and the annotation_type.

Parameters:: index (int) – Index of the annotation to read.
Returns:: A dictionary containing the index and the filtered annotation.
Return type:: Dict[str, Any]

annotation_type_supported(annotation) → bool[source]

measure_mean_std(multi_processed: bool = True) → Tuple[ndarray, ndarray][source]

Computes mean and std of trained images, given the train loader.

Parameters:

multi_processed (bool, default: True) – Uses multiprocessing to download the dataset in parallel.

Returns:

mean (ndarray[double]) – Mean value (for each channel) of all pixels of the images in the input folder.
std (ndarray[double]) – Standard deviation (for each channel) of all pixels of the images in the input folder.

darwin.dataset.local_dataset.get_annotation_filepaths(release_path: Path, annotations_dir: Path, annotation_type: str, split: str, partition: str | None = None, split_type: str = 'random') → Iterator[str][source]

Returns a list of annotation filepaths for the given release & partition.

Parameters:

release_path (Path) – The path of the Release saved locally.
annotations_dir (Path) – The path for a directory where annotations.
annotation_type (str) – The type of the annotations.
split (str) – The split name.
partition (Optional[str], default: None) –
How to partition files. If no partition is specified, then it takes all the json files in the annotations directory. The resulting generator prepends parent directories relative to the main annotation directory.

E.g.: ["annotations/test/1.json", "annotations/2.json", "annotations/test/2/3.json"]:
- annotations/test/1
- annotations/2
- annotations/test/2/3
str (split_type) – The type of split. Can be "random" or "stratified".
default ("random") – The type of split. Can be "random" or "stratified".

Returns:

An iterator with the path for the stem files.

Return type:

Iterator[str]

Raises:

ValueError – If the provided split_type is invalid.
FileNotFoundError – If no dataset partitions are found.

darwin.dataset.release module

class darwin.dataset.release.ReleaseStatus(value)[source]

Bases: Enum

An enumeration.

PENDING = 'pending'

COMPLETE = 'complete'

FAILED = 'failed'

class darwin.dataset.release.Release(dataset_slug: str, team_slug: str, version: str, name: str, status: ReleaseStatus, url: str | None, export_date: datetime, image_count: int | None, class_count: int | None, available: bool, latest: bool, format: str)[source]

Bases: object

Represents a release/export. Releases created this way can only contain items with ‘completed’ status.

Parameters:

dataset_slug (str) – The slug of the dataset.
team_slug (str) – the slug of the team.
version (str) – The version of the Release.
name (str) – The name of the Release.
status (ReleaseStatus) – The status of the Release.
url (Optional[str]) – The full url used to download the Release.
export_date (datetime.datetime) – The datetime of when this release was created.
image_count (Optional[int]) – Number of images in this Release.
class_count (Optional[int]) – Number of distinct classes in this Release.
available (bool) – If this Release is downloadable or not.
latest (bool) – If this Release is the latest one or not.
format (str) – Format for the file of this Release should it be downloaded.

dataset_slug

The slug of the dataset.

Type:: str

team_slug

the slug of the team.

Type:: str

version

The version of the Release.

Type:: str

name

The name of the Release.

Type:: str

status

The status of the Release.

Type:: ReleaseStatus

url

The full url used to download the Release.

Type:: Optional[str]

export_date

The datetime of when this release was created.

Type:: datetime.datetime

image_count

Number of images in this Release.

Type:: Optional[int]

class_count

Number of distinct classes in this Release.

Type:: Optional[int]

available

If this Release is downloadable or not.

Type:: bool

latest

If this Release is the latest one or not.

Type:: bool

format

Format for the file of this Release should it be downloaded.

Type:: str

classmethod parse_json(dataset_slug: str, team_slug: str, payload: Dict[str, Any]) → Release[source]

Given a json, parses it into a Release object instance.

Parameters:

dataset_slug (str) – The slug of the dataset this Release belongs to.
team_slug (str) – The slug of the team this Release’s dataset belongs to.
payload (Dict[str, Any]) –
A Dictionary with the Release information. It must have a minimal format similar to:
```
{
    "version": "a_version",
    "name": "a_name"
}
```
If no format key is found in payload, the default will be json.

Optional payload has no download_url key, then url, available, image_count, class_count and latest will default to either None or False depending on the type.

A more complete format for this parameter would be similar to:
```
{
    "version": "a_version",
    "name": "a_name",
    "metadata": {
        "num_images": 1,
        "annotation_classes": []
    },
    "download_url": "http://www.some_url_here.com",
    "latest": false,
    "format": "a_format"
}
```

Returns:

A Release created from the given payload.

Return type:

Release

download_zip(path: Path) → Path[source]

Downloads the release content into a zip file located by the given path.

Parameters:: path (Path) – The path where the zip file will be located.
Returns:: Same Path as provided in the parameters.
Return type:: Path
Raises:: ValueError – If this Release object does not have a specified url.

property identifier: DatasetIdentifier

The DatasetIdentifier for this Release.

Type:: DatasetIdentifier

darwin.dataset.remote_dataset module

class darwin.dataset.remote_dataset.RemoteDataset(*, client: Client, team: str, name: str, slug: str, dataset_id: int, item_count: int = 0, progress: float = 0, version: int = 1, release: str | None = None)[source]

Bases: ABC

Manages the remote and local versions of a dataset hosted on Darwin. It allows several dataset management operations such as syncing between remote and local, pulling a remote dataset, removing the local files, …

Parameters:

client (Client) – Client to use for interaction with the server.
team (str) – Team the dataset belongs to.
name (str) – Name of the datasets as originally displayed on Darwin. It may contain white spaces, capital letters and special characters, e.g. Bird Species!.
slug (str) – This is the dataset name with everything lower-case, removed specials characters and spaces are replaced by dashes, e.g., bird-species. This string is unique within a team.
dataset_id (int) – Unique internal reference from the Darwin backend.
item_count (int, default: 0) – Dataset size (number of items).
progress (float, default: 0) – How much of the dataset has been annotated 0.0 to 1.0 (1.0 == 100%).

client

Client to use for interaction with the server.

Type:: Client

team

Team the dataset belongs to.

Type:: str

name

Name of the datasets as originally displayed on Darwin. It may contain white spaces, capital letters and special characters, e.g. Bird Species!.

Type:: str

slug

This is the dataset name with everything lower-case, removed specials characters and spaces are replaced by dashes, e.g., bird-species. This string is unique within a team.

Type:: str

dataset_id

Unique internal reference from the Darwin backend.

Type:: int

item_count

Dataset size (number of items).

Type:: int, default: 0

progress

How much of the dataset has been annotated 0.0 to 1.0 (1.0 == 100%).

Type:: float, default: 0

abstract push(files_to_upload: Sequence[str | Path | LocalFile] | None, *, blocking: bool = True, multi_threaded: bool = True, max_workers: int | None = None, fps: int = 0, as_frames: bool = False, extract_views: bool = False, files_to_exclude: List[str | Path] | None = None, path: str | None = None, preserve_folders: bool = False, progress_callback: Callable[[int, float], None] | None = None, file_upload_callback: Callable[[str, int, int], None] | None = None, item_merge_mode: str | None = None) → UploadHandler[source]

split_video_annotations(release_name: str = 'latest') → None[source]

Splits the video annotations from this RemoteDataset using the given release.

Parameters:: release_name (str, default: "latest") – The name of the release to use.

pull(*, release: Release | None = None, blocking: bool = True, multi_processed: bool = True, only_annotations: bool = False, force_replace: bool = False, remove_extra: bool = False, subset_filter_annotations_function: Callable | None = None, subset_folder_name: str | None = None, use_folders: bool = True, video_frames: bool = False, force_slots: bool = False, ignore_slots: bool = False, retry: bool = False, retry_timeout: int = 600, retry_interval: int = 10) → Tuple[Callable[[], Iterator[Any]] | None, int][source]

Downloads a remote dataset (images and annotations) to the datasets directory.

Parameters:

release (Optional[Release], default: None) – The release to pull.
blocking (bool, default: True) – If False, the dataset is not downloaded and a generator function is returned instead.
multi_processed (bool, default: True) – Uses multiprocessing to download the dataset in parallel. If blocking is False this has no effect.
only_annotations (bool, default: False) – Download only the annotations and no corresponding images.
force_replace (bool, default: False) – Forces the re-download of an existing image.
remove_extra (bool, default: False) – Removes local files that would not be overwritten by the release being pulled.
subset_filter_annotations_function (Optional[Callable], default: None) – This function receives the directory where the annotations are downloaded and can perform any operation on them i.e. filtering them with custom rules or else. If it needs to receive other parameters is advised to use functools.partial() for it.
subset_folder_name (Optional[str], default: None) – Name of the folder with the subset of the dataset. If not provided a timestamp is used.
use_folders (bool, default: True) – Recreates folders from the dataset.
video_frames (bool, default: False) – Pulls video frames images instead of video files.
force_slots (bool) – Pulls all slots of items into deeper file structure ({prefix}/{item_name}/{slot_name}/{file_name})
retry (bool) – If True, will repeatedly try to download the release if it is still processing up to a maximum of 5 minutes.

Returns:

generator (function) – Generator for doing the actual downloads. This is None if blocking is True.
count (int) – The number of files.

Raises:

UnsupportedExportFormat – If the given release has an invalid format.
ValueError – If darwin in unable to get Team configuration.
ValueError – If the release is still processing after the maximum retry duration.

remove_remote() → None[source]: Archives (soft-deletion) this RemoteDataset.

abstract fetch_remote_files(filters: Dict[str, str | List[str]] | None = None, sort: str | ItemSorter | None = None) → Iterator[DatasetItem][source]

Fetch and lists all files on the remote dataset.

Parameters:

filters (Optional[Dict[str, Union[str, List[str]]]], default: None) – The filters to use. Files excluded by the filter won’t be fetched.
sort (Optional[Union[str, ItemSorter]], default: None) – A sorting direction. It can be a string with the values ‘asc’, ‘ascending’, ‘desc’, ‘descending’ or an ItemSorter instance.

Yields:

Iterator[DatasetItem] – An iterator of DatasetItem.

abstract archive(items: Iterable[DatasetItem]) → None[source]

Archives (soft-deletion) the given DatasetItems belonging to this RemoteDataset.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be archived.

abstract restore_archived(items: Iterable[DatasetItem]) → None[source]

Restores the archived DatasetItems that belong to this RemoteDataset.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be restored.

abstract move_to_new(items: Iterable[DatasetItem]) → None[source]

Changes the given DatasetItems status to new.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems whose status will change.

abstract complete(items: Iterable[DatasetItem]) → None[source]

Completes the given DatasetItems.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be completed.

abstract delete_items(items: Iterable[DatasetItem]) → None[source]

Deletes the given DatasetItems.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be deleted.

fetch_annotation_type_id_for_name(name: str) → int | None[source]

Fetches annotation type id for a annotation type name, such as bounding_box.

Parameters:: name (str) – The name of the annotation we want the id for.
Returns:: The id of the annotation type or None if it doesn’t exist.
Return type:: Optional[int]

create_annotation_class(name: str, type: str, subtypes: List[str] = []) → Dict[str, Any][source]

Creates an annotation class for this RemoteDataset.

Parameters:

name (str) – The name of the annotation class.
type (str) – The type of the annotation class.
subtypes (List[str], default: []) – Annotation class subtypes.

Returns:

Dictionary with the server response.

Return type:

Dict[str, Any]

Raises:

ValueError – If a given annotation type or subtype is unknown.

add_annotation_class(annotation_class: AnnotationClass | int) → Dict[str, Any] | None[source]

Adds an annotation class to this RemoteDataset.

Parameters:: annotation_class (Union[AnnotationClass, int]) – The annotation class to add or its id.
Returns:: Dictionary with the server response or None if the annotations class already exists.
Return type:: Optional[Dict[str, Any]]
Raises:: ValueError – If the given annotation_class does not exist in this RemoteDataset’s team.

fetch_remote_classes(team_wide=False) → List[Dict[str, Any]][source]

Fetches all the Annotation Classes from this RemoteDataset.

Parameters:: team_wide (bool, default: False) – If True will return all Annotation Classes that belong to the team. If False will only return Annotation Classes which have been added to the dataset.
Returns:: List of Annotation Classes (can be empty).
Return type:: List[Dict[str, Any]]

fetch_remote_attributes() → List[Dict[str, Any]][source]

Fetches all remote attributes on the remote dataset.

Returns:: A List with the attributes, where each attribute is a dictionary.
Return type:: List[Dict[str, Any]]

abstract export(name: str, annotation_class_ids: List[str] | None = None, include_url_token: bool = False, include_authorship: bool = False, version: str | None = None) → None[source]

Create a new release for this RemoteDataset.

Parameters:

name (str) – Name of the release.
annotation_class_ids (Optional[List[str]], default: None) – List of the classes to filter.
include_url_token (bool, default: False) – Should the image url in the export include a token enabling access without team membership or not?
include_authorship (bool, default: False) – If set, include annotator and reviewer metadata for each annotation.
version (Optional[str], default: None, enum: ["1.0", "2.0"]) – When used for V2 dataset, allows to force generation of either Darwin JSON 1.0 (Legacy) or newer 2.0. Omit this option to get your team’s default.

abstract get_releases(include_unavailable: bool = False) → List[Release][source]

Get a sorted list of releases with the most recent first.

Parameters:: include_unavailable (bool, default: False) – If True, return all releases, including those that are not available.
Returns:: Returns a sorted list of available Releases with the most recent first.
Return type:: List[“Release”]

get_release(name: str = 'latest', include_unavailable: bool = True) → Release[source]

Get a specific Release for this RemoteDataset.

Parameters:

name (str, default: "latest") – Name of the export.
include_unavailable (bool, default: True) – If True, return all releases, including those that are not available.

Returns:

The selected release.

Return type:

Release

Raises:

NotFound – The selected Release does not exist.

split(val_percentage: float = 0.1, test_percentage: float = 0, split_seed: int = 0, make_default_split: bool = True, release_name: str | None = None) → None[source]

Creates lists of file names for each split for train, validation, and test. Note: This functions needs a local copy of the dataset.

Parameters:

val_percentage (float, default: 0.1) – Percentage of images used in the validation set.
test_percentage (float, default: 0) – Percentage of images used in the test set.
split_seed (int, default: 0) – Fix seed for random split creation.
make_default_split (bool, default: True) – Makes this split the default split.
release_name (Optional[str], default: None) – Version of the dataset.

Raises:

NotFound – If this RemoteDataset is not found locally.

classes(annotation_type: str, release_name: str | None = None) → List[str][source]

Returns the list of class_type classes.

Parameters:

annotation_type (str) – The type of annotation classes, e.g. ‘tag’ or ‘polygon’.
release_name (Optional[str], default: None) – Version of the dataset.

Returns:

classes – List of classes in the dataset of type class_type.

Return type:

List[str]

annotations(partition: str, split: str = 'split', split_type: str = 'stratified', annotation_type: str = 'polygon', release_name: str | None = None, annotation_format: str | None = 'darwin') → Iterable[Dict[str, Any]][source]

Returns all the annotations of a given split and partition in a single dictionary.

Parameters:

partition (str) – Selects one of the partitions [train, val, test].
split (str, default: "split") – Selects the split that defines the percentages used (use ‘split’ to select the default split.
split_type (str, default: "stratified") – Heuristic used to do the split [random, stratified].
annotation_type (str, default: "polygon") – The type of annotation classes [tag, polygon].
release_name (Optional[str], default: None) – Version of the dataset.
annotation_format (Optional[str], default: "darwin") – Re-formatting of the annotation when loaded [coco, darwin].

Yields:

Dict[str, Any] – Dictionary representing an annotation from this RemoteDataset.

abstract workview_url_for_item(item: DatasetItem) → str[source]

Returns the darwin URL for the given DatasetItem.

Parameters:: item (DatasetItem) – The DatasetItem for which we want the url.
Returns:: The url.
Return type:: str

abstract post_comment(item: DatasetItem, text: str, x: float, y: float, w: float, h: float) → None[source]

Adds a comment to an item in this dataset. The comment will be added with a bounding box. Creates the workflow for said item if necessary.

Parameters:

item (DatasetItem) – The DatasetItem which will receive the comment.
text (str) – The text of the comment.
x (float) – The x coordinate of the bounding box containing the comment.
y (float) – The y coordinate of the bounding box containing the comment.
w (float) – The width of the bounding box containing the comment.
h (float) – The height of the bounding box containing the comment.

abstract import_annotation(item_id: str | int, payload: Dict[str, Any]) → None[source]

Imports the annotation for the item with the given id.

Parameters:

item_id (ItemId) – Identifier of the Item that we are import the annotation to.
payload (Dict[str, Any]) – A dictionary with the annotation to import. The default format is: {“annotations”: serialized_annotations, “overwrite”: “false”}

property remote_path: Path: Returns an URL specifying the location of the remote dataset.

property local_path: Path: Returns a Path to the local dataset.

property local_releases_path: Path: Returns a Path to the local dataset releases.

property local_images_path: Path: Returns a local Path to the images folder.

property identifier: DatasetIdentifier: The DatasetIdentifier of this RemoteDataset.

darwin.dataset.remote_dataset_v2 module

class darwin.dataset.remote_dataset_v2.RemoteDatasetV2(*, client: Client, team: str, name: str, slug: str, dataset_id: int, item_count: int = 0, progress: float = 0)[source]

Bases: RemoteDataset

Manages the remote and local versions of a dataset hosted on Darwin. It allows several dataset management operations such as syncing between remote and local, pulling a remote dataset, removing the local files, …

Parameters:

client (Client) – Client to use for interaction with the server.
team (str) – Team the dataset belongs to.
name (str) – Name of the datasets as originally displayed on Darwin. It may contain white spaces, capital letters and special characters, e.g. Bird Species!.
slug (str) – This is the dataset name with everything lower-case, removed specials characters and spaces are replaced by dashes, e.g., bird-species. This string is unique within a team.
dataset_id (int) – Unique internal reference from the Darwin backend.
item_count (int, default: 0) – Dataset size (number of items).
progress (float, default: 0) – How much of the dataset has been annotated 0.0 to 1.0 (1.0 == 100%).

client

Client to use for interaction with the server.

Type:: Client

team

Team the dataset belongs to.

Type:: str

name

Name of the datasets as originally displayed on Darwin. It may contain white spaces, capital letters and special characters, e.g. Bird Species!.

Type:: str

slug

This is the dataset name with everything lower-case, removed specials characters and spaces are replaced by dashes, e.g., bird-species. This string is unique within a team.

Type:: str

dataset_id

Unique internal reference from the Darwin backend.

Type:: int

item_count

Dataset size (number of items).

Type:: int, default: 0

progress

How much of the dataset has been annotated 0.0 to 1.0 (1.0 == 100%).

Type:: float, default: 0

get_releases(include_unavailable: bool = False) → List[Release][source]

Get a sorted list of releases with the most recent first.

Parameters:: include_unavailable (bool, default: False) – If True, return all releases, including those that are not available.
Returns:: Returns a sorted list of available Releases with the most recent first.
Return type:: List[“Release”]

push(files_to_upload: Sequence[str | Path | LocalFile] | None, *, blocking: bool = True, multi_threaded: bool = True, max_workers: int | None = None, fps: int = 0, as_frames: bool = False, extract_views: bool = False, handle_as_slices: bool | None = False, files_to_exclude: List[str | Path] | None = None, path: str | None = None, preserve_folders: bool = False, progress_callback: Callable[[int, float], None] | None = None, file_upload_callback: Callable[[str, int, int], None] | None = None, item_merge_mode: str | None = None) → UploadHandler[source]

Uploads a local dataset (images ONLY) in the datasets directory.

Parameters:

files_to_upload (Optional[List[Union[PathLike, LocalFile]]]) – List of files to upload. These can be folders. If item_merge_mode is set, these paths must be folders.
blocking (bool, default: True) – If False, the dataset is not uploaded and a generator function is returned instead.
multi_threaded (bool, default: True) – Uses multiprocessing to upload the dataset in parallel. If blocking is False this has no effect.
max_workers (int, default: None) – Maximum number of workers to use for parallel upload.
fps (int, default: 0) – When the uploading file is a video, specify its framerate.
as_frames (bool, default: False) – When the uploading file is a video, specify whether it’s going to be uploaded as a list of frames.
extract_views (bool, default: False) – When the uploading file is a volume, specify whether it’s going to be split into orthogonal views.
handle_as_slices (Optioonal[bool], default: False) – Whether to upload DICOM files as slices
files_to_exclude (Optional[PathLike]], default: None) – Optional list of files to exclude from the file scan. These can be folders.
path (Optional[str], default: None) – Optional path to store the files in.
preserve_folders (bool, default: False) – Specify whether or not to preserve folder paths when uploading
progress_callback (Optional[ProgressCallback], default: None) – Optional callback, called every time the progress of an uploading files is reported.
file_upload_callback (Optional[FileUploadCallback], default: None) – Optional callback, called every time a file chunk is uploaded.
item_merge_mode (Optional[str]) –
If set, each file path passed to files_to_upload behaves as follows: - Paths pointing directly to individual files are ignored - Paths pointing to folders of files will be uploaded according to the following mode rules. Note that folders will not be recursively searched, so only files in the first level of the folder will be uploaded:
- ”slots”: Each file in the folder will be uploaded to a different slot of the same item.
- ”series”: All .dcm files in the folder will be concatenated into a single slot. All other files are ignored.
- ”channels”: Each file in the folder will be uploaded to a different channel of the same item.

Returns:

handler – Class for handling uploads, progress and error messages.

Return type:

UploadHandler

Raises:

ValueError –

If files_to_upload is None. - If a path is specified when uploading a LocalFile object. - If there are no files to upload (because path is wrong or the exclude filter excludes everything).

fetch_remote_files(filters: Dict[str, str | List[str]] | None = None, sort: str | ItemSorter | None = None) → Iterator[DatasetItem][source]

Fetch and lists all files on the remote dataset.

Parameters:

filters (Optional[Dict[str, Union[str, List[str]]]], default: None) – The filters to use. Files excluded by the filter won’t be fetched.
sort (Optional[Union[str, ItemSorter]], default: None) – A sorting direction. It can be a string with the values ‘asc’, ‘ascending’, ‘desc’, ‘descending’ or an ItemSorter instance.

Yields:

Iterator[DatasetItem] – An iterator of DatasetItem.

archive(items: Iterable[DatasetItem]) → None[source]

Archives (soft-deletion) the given DatasetItems belonging to this RemoteDataset.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be archived.

restore_archived(items: Iterable[DatasetItem]) → None[source]

Restores the archived DatasetItems that belong to this RemoteDataset.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be restored.

move_to_new(items: Iterable[DatasetItem]) → None[source]

Changes the given DatasetItems status to new.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems whose status will change.

complete(items: Iterable[DatasetItem]) → None[source]

Completes the given DatasetItems.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be completed.

delete_items(items: Iterable[DatasetItem]) → None[source]

Deletes the given DatasetItems.

Parameters:: items (Iterable[DatasetItem]) – The DatasetItems to be deleted.

export(name: str, annotation_class_ids: List[str] | None = None, include_url_token: bool = False, include_authorship: bool = False, version: str | None = None) → None[source]

Create a new release for this RemoteDataset.

Parameters:

name (str) – Name of the release.
annotation_class_ids (Optional[List[str]], default: None) – List of the classes to filter.
include_url_token (bool, default: False) – Should the image url in the export include a token enabling access without team membership or not?
include_authorship (bool, default: False) – If set, include annotator and reviewer metadata for each annotation.
version (Optional[str], default: None, enum: ["1.0", "2.0"]) – When used for V2 dataset, allows to force generation of either Darwin JSON 1.0 (Legacy) or newer 2.0. Omit this option to get your team’s default.

workview_url_for_item(item: DatasetItem) → str[source]

Returns the darwin URL for the given DatasetItem.

Parameters:: item (DatasetItem) – The DatasetItem for which we want the url.
Returns:: The url.
Return type:: str

post_comment(item: DatasetItem, text: str, x: float, y: float, w: float, h: float, slot_name: str | None = None)[source]: Adds a comment to an item in this dataset, Tries to infer slot_name if left out.

import_annotation(item_id: str | int, payload: Dict[str, Any]) → None[source]

Imports the annotation for the item with the given id.

Parameters:

item_id (ItemId) – Identifier of the Item that we are import the annotation to.
payload (Dict[str, Any]) – A dictionary with the annotation to import. The default format is: {“annotations”: serialized_annotations, “overwrite”: “false”}

register(object_store: ObjectStore, storage_keys: List[str] | Dict[str, List[str]], fps: str | float | None = None, multi_planar_view: bool = False, preserve_folders: bool = False, multi_slotted: bool = False) → Dict[str, List[str]][source]

Register files from external storage in a Darwin dataset.

Parameters:

object_store (ObjectStore) – Object store to use for the registration.
storage_keys (List[str] | Dict[str, List[str]]) – Either: - Single-slotted items: A list of storage keys - Multi-slotted items: A dictionary with keys as item names and values as lists of storage keys
fps (Optional[str], default: None) – When the uploading file is a video, specify its framerate.
multi_planar_view (bool, default: False) – Uses multiplanar view when uploading files.
preserve_folders (bool, default: False) – Specify whether or not to preserve folder paths when uploading.
multi_slotted (bool, default: False) – Specify whether the items are multi-slotted or not.

Returns:

A dictionary with the list of registered files.

Return type:

Dict[str, List[str]]

Raises:

ValueError – If the type of storage_keys: - Isn’t List[str] when multi_slotted is False. - Isn’t Dict[str, List[str]] when multi_slotted is True.

register_single_slotted(object_store: ObjectStore, storage_keys: List[str], fps: str | float | None = None, multi_planar_view: bool = False, preserve_folders: bool = False) → Dict[str, List[str]][source]

Register files in the dataset in a single slot.

Parameters:

object_store (ObjectStore) – Object store to use for the registration.
storage_keys (List[str]) – List of storage keys to register.
fps (Optional[str], default: None) – When the uploading file is a video, specify its framerate.
multi_planar_view (bool, default: False) – Uses multiplanar view when uploading files.
preserve_folders (bool, default: False) – Specify whether or not to preserve folder paths when uploading

Returns:

A dictionary with the list of registered files.

Return type:

Dict[str, List[str]]

Raises:

TypeError – If the file type of any storage keyis not supported.

register_multi_slotted(object_store: ObjectStore, storage_keys: Dict[str, List[str]], fps: str | float | None = None, multi_planar_view: bool = False, preserve_folders: bool = False) → Dict[str, List[str]][source]

Register files in the dataset in multiple slots.

Parameters:

object_store (ObjectStore) – Object store to use for the registration.
storage_keys (Dict[str, List[str]) – Storage keys to register. The keys are the item names and the values are lists of storage keys.
fps (Optional[str], default: None) – When the uploading file is a video, specify its framerate.
multi_planar_view (bool, default: False) – Uses multiplanar view when uploading files.
preserve_folders (bool, default: False) – Specify whether or not to preserve folder paths when uploading

Returns:

A dictionary with the list of registered files.

Return type:

Dict[str, List[str]]

Raises:

TypeError – If the file type of any storage key is not supported.

darwin.dataset.split_manager module

class darwin.dataset.split_manager.Split(random: Dict[str, Path] | None = None, stratified: Dict[str, Dict[str, Path]] | None = None)[source]

Bases: object

A Split object holds the state of a split as a set of attributes. For each split type (namely, random and stratified), the Split object will keep a record of paths were the splits are going to be stored as files.

If a dataset can be split randomly, then the random attribute will be set as a dictionary between a particular partition (e.g.: train, val, test) and the Path of the file where that partition split file is going to be stored.

{
    "train": Path("/path/to/split/random_train.txt"),
    "val": Path("/path/to/split/random_val.txt"),
    "test": Path("/path/to/split/random_test.txt")
}

If a dataset can be split with a stratified strategy based on a given annotation type, then the stratified attribute will be set as a dictionary between a particular annotation type and a dictionary between a particular partition (e.g.: train, val, test) and the Path of the file where that partition split file is going to be stored.

{
    "polygon": {
        "train": Path("/path/to/split/stratified_polygon_train.txt"),
        "val": Path("/path/to/split/stratified_polygon_val.txt"),
        "test": Path("/path/to/split/stratified_polygon_test.txt")
    },
    "tag": {
        "train": Path("/path/to/split/stratified_tag_train.txt"),
        "val": Path("/path/to/split/stratified_tag_val.txt"),
        "test": Path("/path/to/split/stratified_tag_test.txt")
    }
}

random: Dict[str, Path] | None = None: Stores the type of split (e.g. train, val, test) and the file path where the split is stored if the split is of type random.

stratified: Dict[str, Dict[str, Path]] | None = None: Stores the relation between an annotation type and the partition-filepath key value of the split if its type is stratified.

is_valid() → bool[source]

Returns whether or not this split instance is valid.

Returns:: True if this instance is valid, False otherwise.
Return type:: bool

darwin.dataset.split_manager.split_dataset(dataset_path: str | Path, release_name: str | None = None, val_percentage: float = 0.1, test_percentage: float = 0.2, split_seed: int = 0, make_default_split: bool = True, stratified_types: List[str] = ['bounding_box', 'polygon', 'tag']) → Path[source]

Given a local a dataset (pulled from Darwin), split it by creating lists of filenames. The partitions to split the dataset into are called train, val and test.

The dataset is always split randomly, and can be additionally split according to the stratified strategy by providing a list of stratified types.

Requires scikit-learn to split a dataset.

Parameters:

dataset_path (PathLike) – Local path to the dataset.
release_name (Optional[str], default: None) – Version of the dataset.
val_percentage (float, default: 0.1) – Percentage of images used in the validation set.
test_percentage (float, default: 0.2) – Percentage of images used in the test set.
split_seed (int, default: 0) – Fix seed for random split creation.
make_default_split (bool, default: True) – Makes this split the default split.
stratified_types (List[str], default: ["bounding_box", "polygon", "tag"]) – List of annotation types to split with the stratified strategy.

Returns:

Keys are the different splits (random, tags, …) and values are the relative file names.

Return type:

Path

Raises:

ImportError – If sklearn is not installed.

darwin.dataset.upload_manager module

class darwin.dataset.upload_manager.ItemMergeMode(value)[source]

Bases: Enum

An enumeration.

SLOTS = 'slots'

SERIES = 'series'

CHANNELS = 'channels'

class darwin.dataset.upload_manager.ItemPayload(*, dataset_item_id: int, filename: str, path: str, reasons: List[str] | None = None, slots: List[Dict[str, str]])[source]

Bases: object

Represents an item’s payload.

Parameters:

dataset_item_id (int) – The id of the dataset this item belongs to.
filename (str) – The filename of where this ItemPayload’s data is.
path (str) – The path to filename.
reasons (Optional[List[str]], default: None) – A per-slot reason to upload this ItemPayload.

dataset_item_id

The id of the dataset this item belongs to.

Type:: int

filename

The filename of where this ItemPayload’s data is.

Type:: str

path

The path to filename.

Type:: str

static parse_v2(payload)[source]

property full_path: str: The full Path (with filename inclduded) to the file.

class darwin.dataset.upload_manager.UploadStage(value)[source]

Bases: DocEnum

The different stages of uploading a file.

REQUEST_SIGNATURE = 0

UPLOAD_TO_S3 = 1

CONFIRM_UPLOAD_COMPLETE = 2

OTHER = 3

exception darwin.dataset.upload_manager.UploadRequestError(file_path: Path, stage: UploadStage, error: Exception | None = None)[source]

Bases: Exception

Error throw when uploading a file fails with an unrecoverable error.

file_path: Path: The Path of the file being uploaded.

stage: UploadStage: The UploadStage when the failure happened.

error: Exception | None = None: The Exception that triggered this unrecoverable error.

class darwin.dataset.upload_manager.LocalFile(local_path: str | Path, **kwargs)[source]

Bases: object

Represents a file locally stored.

Parameters:

local_path (PathLike) – The Path of the file.
kwargs (Any) – Data relative to this file. Can be anything.

local_path

The Path of the file.

Type:: PathLike

data

Dictionary with metadata relative to this file. It has the following format:

{
    "filename": "a_filename",
    "path": "a path"
}

data["filename"] will hold the value passed as filename from kwargs or default to self.local_path.name
data["path"] will hold the value passed as path from kwargs or default to "/"

Type:: Dict[str, str]

serialize()[source]

serialize_darwin_json_v2()[source]

property full_path: str: The full Path (with filename inclduded) to the item.

class darwin.dataset.upload_manager.MultiFileItem(directory: Path, files: List[Path], merge_mode: ItemMergeMode, fps: int)[source]

Bases: object

serialize_darwin_json_v2()[source]

property full_path: str: The full Path (with filename included) to the item

class darwin.dataset.upload_manager.FileMonitor(io: BinaryIO, file_size: int, callback: Callable[[FileMonitor], None])[source]

Bases: object

Monitors the progress of a :class:BufferedReader.

To use this monitor, you construct your :class:BufferedReader as you normally would, then construct this object with it as argument.

Parameters:

io (BinaryIO) – IO object used by this class. Depency injection.
file_size (int) – The fie of the file in bytes.
callback (Callable[["FileMonitor"], None]) – Callable function used by this class. Depency injection via constructor.

io

IO object used by this class. Depency injection.

Type:: BinaryIO

callback

Callable function used by this class. Depency injection.

Type:: Callable[[“FileMonitor”], None]

bytes_read

Amount of bytes read from the IO.

Type:: int

len

Total size of the IO.

Type:: int

read(size: int = -1) → Any[source]

Reads given amount of bytes from configured IO and calls the configured callback for each block read. The callback is passed a reference this object that can be used to get current self.bytes_read.

Parameters:: size (int, default: -1) – The number of bytes to read. Defaults to -1, so all bytes until EOF are read.
Returns:: data – Data read from the IO.
Return type:: Any

class darwin.dataset.upload_manager.UploadHandler(dataset: RemoteDataset, local_files: List[LocalFile], multi_file_items: List[MultiFileItem] | None = None, handle_as_slices: bool | None = False)[source]

Bases: ABC

Holds responsibilities for file upload management and failure into RemoteDatasets.

Parameters:

dataset (RemoteDataset) – Target RemoteDataset where we want to upload our files to.
uploading_files (Union[List[LocalFile], List[MultiFileItems]]) – List of LocalFiles or MultiFileItems to be uploaded.

dataset

Target RemoteDataset where we want to upload our files to.

Type:: RemoteDataset

errors

List of errors that happened during the upload process

Type:: List[UploadRequestError]

local_files

List of LocalFiles to be uploaded.

Type:: List[LocalFile]

multi_file_items

List of MultiFileItems to be uploaded.

Type:: List[MultiFileItem]

blocked_items

List of items that were not able to be uploaded.

Type:: List[ItemPayload]

pending_items

List of items waiting to be uploaded.

Type:: List[ItemPayload]

static build(dataset: RemoteDataset, local_files: List[LocalFile], handle_as_slices: bool | None = False)[source]

property client: Client: The Client used by this UploadHander's RemoteDataset.

property dataset_identifier: DatasetIdentifier: The DatasetIdentifier of this UploadHander's RemoteDataset.

property blocked_count: int: Number of items that could not be uploaded successfully.

property error_count: int: Number of errors that prevented items from being uploaded.

property pending_count: int: Number of items waiting to be uploaded.

property total_count: int: Total number of blocked and pending items.

property progress: Current level of upload progress.

prepare_upload() → Iterator[Callable[[Callable[[str | None, float, float], None] | None], None]] | None[source]

upload(multi_threaded: bool = True, progress_callback: Callable[[int, float], None] | None = None, file_upload_callback: Callable[[str, int, int], None] | None = None, max_workers: int | None = None) → None[source]

class darwin.dataset.upload_manager.UploadHandlerV2(dataset: RemoteDataset, local_files: List[LocalFile], multi_file_items: List[MultiFileItem] | None = None, handle_as_slices: bool | None = False)[source]: Bases: UploadHandler

darwin.dataset.utils module

darwin.dataset.utils.get_release_path(dataset_path: Path, release_name: str | None = None) → Path[source]

Given a dataset path and a release name, returns the path to the release.

Parameters:

dataset_path (Path) – Path to the location of the dataset on the file system.
release_name (Optional[str], default: None) – Version of the dataset.

Returns:

Path to the location of the dataset release on the file system.

Return type:

Path

Raises:

NotFound – If no dataset is found in the location provided by dataset_path.

darwin.dataset.utils.extract_classes(annotations_path: Path, annotation_type: str | List[str]) → Tuple[Dict[str, Set[int]], Dict[int, Set[str]]][source]

Given the GT as json files extracts all classes and maps images index to classes.

Parameters:

annotations_files (Path) – Path to the json files with the GT information of each image.
annotation_type (Union[str, List[str]]) – Type(s) of annotation to use to extract the GT information.

Returns:

A Tuple where the first element is a Dictionary where keys are the classes found in the GT and values are a list of file numbers which contain it; and the second element is Dictionary where keys are image indices and values are all classes contained in that image.

Return type:

Tuple[Dict[str, Set[int]], Dict[int, Set[str]]]

darwin.dataset.utils.make_class_lists(release_path: Path) → None[source]

Support function to extract classes and save the output to file.

Parameters:: release_path (Path) – Path to the location of the dataset on the file system.

darwin.dataset.utils.get_classes_from_file(path: Path) → List[str][source]: Helper function to read class names from a file.

darwin.dataset.utils.available_annotation_types(release_path: Path) → List[str][source]: Returns a list of available annotation types based on the existing files.

darwin.dataset.utils.get_classes(dataset_path: str | Path, release_name: str | None = None, annotation_type: str | List[str] = 'polygon', remove_background: bool = True) → List[str][source]

Given a dataset and an annotation_type returns the list of classes.

Parameters:

dataset_path (PathLike) – Path to the location of the dataset on the file system.
release_name (Optional[str], default: None) – Version of the dataset.
annotation_type (str, default: "polygon") – The type of annotation classes [tag, polygon, bounding_box].
remove_background (bool, default: True) – Removes the background class (if exists) from the list of classes.

Returns:

List of classes in the dataset of type classes_type.

Return type:

List[str]

darwin.dataset.utils.exhaust_generator(progress: Generator, count: int, multi_processed: bool, worker_count: int | None = None) → Tuple[List[Dict[str, Any]], List[Exception]][source]

Exhausts the generator passed as parameter. Can be done multi processed if desired. Creates and returns a coco record from the given annotation.

Uses BoxMode.XYXY_ABS from detectron2.structures if available, defaults to box_mode = 0 otherwise. :param annotation_path: Path to the annotation file. :type annotation_path: Path :param annotation_type: Type of the annotation we want to retrieve. :type annotation_type: str = “polygon” :param image_path: Path to the image the annotation refers to. :type image_path: Optional[Path], default: None :param image_id: Id of the image the annotation refers to. :type image_id: Optional[Union[str, int]], default: None :param classes: Classes of the annotation. :type classes: Optional[List[str]], default: None

Returns:

A coco record with the following keys: .. code-block:: python

{
“height”: 100, “width”: 100, “file_name”: “a file name”, “image_id”: 1, “annotations”: [ … ]

}

Return type:

Dict[str, Any]

darwin.dataset.utils.get_coco_format_record(annotation_path: Path, annotation_type: str = 'polygon', image_path: Path | None = None, image_id: int | str | None = None, classes: List[str] | None = None) → Dict[str, Any][source]

darwin.dataset.utils.create_polygon_object(obj, box_mode, classes=None)[source]

darwin.dataset.utils.create_bbox_object(obj, box_mode, classes=None)[source]

darwin.dataset.utils.get_annotations(dataset_path: str | Path, partition: str | None = None, split_type: str | None = 'random', annotation_format: str = 'coco', split: str | None = 'default', annotation_type: str = 'polygon', release_name: str | None = None, ignore_inconsistent_examples: bool = False) → Iterator[Dict[str, Any]][source]

Returns all the annotations of a given dataset and split in a single dictionary.

Parameters:

dataset_path (PathLike) – Path to the location of the dataset on the file system.
partition (Optional[str], default: None) – Selects one of the partitions [train, val, test, None]. If not specified, all annotations are returned.
split_type (Optional[str], default: "random") – Heuristic used to do the split [random, stratified]. If not specified, random is used.
annotation_format (str) – Re-formatting of the annotation when loaded [coco, darwin]..
split (Optional[str], default: "default") – Selects the split that defines the percentages used (use ‘default’ to select the default split).
annotation_type (str, default: "polygon") – The type of annotation classes [tag, bounding_box, polygon].
release_name (Optional[str], default: None) – Version of the dataset.
ignore_inconsistent_examples (bool, default: False) – Ignore examples for which we have annotations, but either images are missing, or more than one images exist for the same annotation. If set to True, then filter those examples out of the dataset. If set to False, then raise an error as soon as such an example is found.

Returns:

Dictionary containing all the annotations of the dataset.

Return type:

Iterator[Dict[str, Any]]

Raises:

ValueError –
- If the partition given is not valid. - If the split_type given is not valid. - If the annotation_type given is not valid. - If an annotation has no corresponding image. - If an image is present with multiple extensions.
FileNotFoundError – If no dataset in dataset_path is found.

darwin.dataset.utils.load_pil_image(path: Path, to_rgb: bool | None = True) → Image[source]

Loads a PIL image and converts it into RGB (optional).

Parameters:

path (Path) – Path to the image file.
to_rgb (Optional[bool], default: True) – Converts the image to RGB.

Returns:

The loaded image.

Return type:

PILImage.Image

darwin.dataset.utils.convert_to_rgb(pic: Image) → Image[source]

Converts a PIL image to RGB.

Parameters:: pic (PILImage.Image) – The image to convert.
Returns:: Values between 0 and 255.
Return type:: PIL Image
Raises:: TypeError – If the image given via pic has an unsupported type.

darwin.dataset.utils.compute_max_density(annotations_dir: Path) → int[source]

Calculates the maximum density of all of the annotations in the given folder. Density is calculated as the number of polygons present in an annotation file.

Parameters:: annotations_dir (Path) – Directory where the annotations are present.
Returns:: The maximum density.
Return type:: int

darwin.dataset.utils.compute_distributions(annotations_dir: Path, split_path: Path, partitions: List[str] = ['train', 'val', 'test'], annotation_types: List[str] = ['polygon']) → Dict[str, Dict[str, Counter]][source]

Builds and returns the following dictionaries:

class_distribution: count of all files where at least one instance of a given class exists for each partition
instance_distribution: count of all instances of a given class exist for each partition

Note that this function can only be used after a dataset has been split with “stratified” strategy.

Parameters:

annotations_dir (Path) – Directory where the annotations are.
split_path (Path) – Path to the split.
partitions (List[str], default: ["train", "val", "test"]) – Partitions to use.
annotation_types (List[str], default: ["polygon"]) – Annotation types to consider.

Returns:

class_distribution: count of all files where at least one instance of a given class exists for each partition
instance_distribution: count of all instances of a given class exist for each partition

Return type:

Dict[str, AnnotationDistribution]

darwin.dataset.utils.is_relative_to(path: Path, *other) → bool[source]

Returns True if the path is relative to another path or False otherwise. It also returns False in the event of an exception, making False the default value.

Parameters:

path (Path) – The path to evaluate.
other (Path) – The other path to compare against.

Returns:

bool
True if the path is relative to other or False otherwise.

darwin.dataset.utils.sanitize_filename(filename: str) → str[source]

Sanitizes the given filename, removing/replacing forbiden characters.

Parameters:: filename (str) – The filename to sanitize.
Returns:: The sanitized filename.
Return type:: str

darwin.dataset.utils.get_external_file_type(storage_key: str) → str | None[source]

Returns the type of file given a storage key.

Parameters:: storage_key (str) – The storage key to get the type of file from.
Returns:: The type of file, or None if the file type is not supported.
Return type:: Optional[str]

darwin.dataset.utils.parse_external_file_path(storage_key: str, preserve_folders: bool) → str[source]

Returns the Darwin dataset path given a storage key.

Parameters:

storage_key (str) – The storage key to parse.
preserve_folders (bool) – Whether to preserve folders or place the file in the Dataset root.

Returns:

The parsed external file path.

Return type:

str

darwin.dataset.utils.get_external_file_name(storage_key: str) → str[source]

Returns the name of the file given a storage key.

Parameters:: storage_key (str) – The storage key to get the file name from.
Returns:: The name of the file.
Return type:: str

darwin.dataset.utils.chunk_items(items: List[Any], chunk_size: int = 500) → Iterator[List[Any]][source]

Splits the list of items into chunks of specified size.

Parameters:

items (List[Any]) – The list of items to split.
chunk_size (int, default: 500) – The size of each chunk.

Returns:

An iterator that yields lists of items, each of length chunk_size.

Return type:

Iterator[List[Any]]

darwin.dataset package

Submodules

darwin.dataset.download_manager module

darwin.dataset.identifier module

darwin.dataset.local_dataset module

darwin.dataset.release module

darwin.dataset.remote_dataset module

darwin.dataset.remote_dataset_v2 module

darwin.dataset.split_manager module

darwin.dataset.upload_manager module

darwin.dataset.utils module

Module contents