dareblopy package

Module contents

class dareblopy.Archive

Bases: pybind11_builtins.pybind11_object

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

exists(self: _dareblopy.Archive, arg0: str) → bool

Exists

list_directory(self: _dareblopy.Archive, arg0: str) → bool

ListDirectory

open(self: _dareblopy.Archive, arg0: str) → object

Opens file

open_as_bytes(self: _dareblopy.Archive, arg0: str) → object
open_as_numpy_ubyte(self: _dareblopy.Archive, arg0: str, arg1: object) → numpy.ndarray[uint8]
read_jpg_as_numpy(self: _dareblopy.Archive, filename: str, use_turbo: bool = False) → numpy.ndarray[uint8]
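
Example (a minimal sketch; the archive name 'data.zip' and the member path 'images/img0.jpg' are hypothetical):

import dareblopy as db

archive = db.open_zip_archive('data.zip')  # see open_zip_archive below
if archive.exists('images/img0.jpg'):
    raw = archive.open_as_bytes('images/img0.jpg')  # bytes object
    img = archive.read_jpg_as_numpy('images/img0.jpg', use_turbo=True)  # decoded uint8 ndarray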
class dareblopy.Compression

Bases: pybind11_builtins.pybind11_object

Enumeration for compression type used for tfrecords.

Possible values:

  • NONE - default

  • GZIP

  • ZLIB

Example:

record_reader = db.RecordReader('zlib_compressed.tfrecords', db.Compression.ZLIB)
record_reader = db.RecordReader('gzip_compressed.tfrecords', db.Compression.GZIP)

record_yielder = db.RecordYielderBasic(['test_utils/test-small-gzip-r00.tfrecords',
                                        'test_utils/test-small-gzip-r01.tfrecords',
                                        'test_utils/test-small-gzip-r02.tfrecords',
                                        'test_utils/test-small-gzip-r03.tfrecords'], db.Compression.GZIP)

record_yielder_random = db.RecordYielderRandomized(['test_utils/test-small-gzip-r00.tfrecords',
                                                    'test_utils/test-small-gzip-r01.tfrecords',
                                                    'test_utils/test-small-gzip-r02.tfrecords',
                                                    'test_utils/test-small-gzip-r03.tfrecords'],
                                                   buffer_size=16,
                                                   seed=0,
                                                   epoch=0,
                                                   compression=db.Compression.GZIP)

Members:

NONE

GZIP

ZLIB

GZIP = Compression.GZIP
NONE = Compression.NONE
ZLIB = Compression.ZLIB
__init__(self: _dareblopy.Compression, arg0: int) → None
property name

Type

(self: handle) -> str

class dareblopy.DataType

Bases: pybind11_builtins.pybind11_object

Enumeration for FixedLenFeature dtype.

Equivalent to tf.string, tf.float32, tf.int64.

Note

uint8 is an alias for string that enables reading directly into a preallocated numpy ndarray of uint8 dtype and a given shape. This eliminates any additional copying/casting. To use it, the shape of the encoded numpy array must be known.

Example:

features = {
    'shape': db.FixedLenFeature([3], db.int64),
    'data': db.FixedLenFeature([], db.string)
}

Members:

string

float32

int64

uint8

__init__(self: _dareblopy.DataType, arg0: int) → None
float32 = DataType.float32
int64 = DataType.int64
property name

Type

(self: handle) -> str

string = DataType.string
uint8 = DataType.uint8
class dareblopy.File

Bases: pybind11_builtins.pybind11_object

__init__(self: _dareblopy.File) → None
get_last_write_time(self: _dareblopy.File) → int
path(self: _dareblopy.File) → str
read(self: _dareblopy.File, size: int = -1) → object
seek(self: _dareblopy.File, offset: int, origin: int = 0) → int
size(self: _dareblopy.File) → int
tell(self: _dareblopy.File) → int
class dareblopy.FileSystem

Bases: pybind11_builtins.pybind11_object

__init__(self: _dareblopy.FileSystem) → None
clear_search_paths(self: _dareblopy.FileSystem) → None

ClearSearchPaths

create_directory(self: _dareblopy.FileSystem, arg0: _dareblopy.Location) → fsal::Status

CreateDirectory

exists(self: _dareblopy.FileSystem, arg0: _dareblopy.Location) → bool

Exists

mount_archive(self: _dareblopy.FileSystem, arg0: _dareblopy.Archive) → fsal::Status

AddArchive

open(self: _dareblopy.FileSystem, location: _dareblopy.Location, mode: _dareblopy.Mode = Mode.read, lockable: bool = False) → object

Opens file

pop_search_path(self: _dareblopy.FileSystem) → None

PopSearchPath

push_search_path(self: _dareblopy.FileSystem, arg0: _dareblopy.Location) → None

PushSearchPath

remove(self: _dareblopy.FileSystem, arg0: _dareblopy.Location) → fsal::Status

Remove

rename(self: _dareblopy.FileSystem, arg0: _dareblopy.Location, arg1: _dareblopy.Location) → fsal::Status

Rename
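
Example (a minimal sketch of combined FileSystem/File usage; the file name 'notes.txt' is hypothetical, and it is an assumption that pushed search paths are used to resolve relative locations):

import dareblopy as db

fs = db.FileSystem()
fs.push_search_path(db.Location('.'))  # assumed: relative locations resolve against this path
if fs.exists(db.Location('notes.txt')):
    f = fs.open(db.Location('notes.txt'), db.Mode.read)
    data = f.read()            # read the whole file
    print(f.size(), f.tell())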

class dareblopy.FixedLenFeature

Bases: pybind11_builtins.pybind11_object

Specifies a fixed-length input feature for parsing tfrecord examples. Instances are used as values in the features dictionary passed to RecordParser or ParsedTFRecordsDatasetIterator.

Variables
  • shape (TensorShape) – a TensorShape object that defines input data shape.

  • dtype (DataType) – a DataType object that defines input data type.

  • default_value (object, optional) – default value.

Note

The constructor is overloaded and accepts either:

  • shape (List[int]), datatype (DataType)

  • shape (List[int]), datatype (DataType), default_value (object)

Example:

features = {
    'shape': db.FixedLenFeature([3], db.int64),
    'data': db.FixedLenFeature([], db.string)
}

# or

features = {
    'shape': db.FixedLenFeature([3], db.int64),
    'data': db.FixedLenFeature([3, 32, 32], db.uint8)
}
__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: _dareblopy.FixedLenFeature) -> None

  2. __init__(self: _dareblopy.FixedLenFeature, arg0: List[int], arg1: _dareblopy.DataType) -> None

  3. __init__(self: _dareblopy.FixedLenFeature, arg0: List[int], arg1: _dareblopy.DataType, arg2: object) -> None

property default_value
property dtype
property shape
class dareblopy.Location

Bases: pybind11_builtins.pybind11_object

__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: _dareblopy.Location, arg0: str) -> None

  2. __init__(self: _dareblopy.Location, arg0: str, arg1: fsal::Location::Options, arg2: fsal::PathType, arg3: fsal::LinkType) -> None

class dareblopy.Mode

Bases: pybind11_builtins.pybind11_object

Members:

read

write

append

read_update

write_update

append_update

__init__(self: _dareblopy.Mode, arg0: int) → None
append = Mode.append
append_update = Mode.append_update
property name

Type

(self: handle) -> str

read = Mode.read
read_update = Mode.read_update
write = Mode.write
write_update = Mode.write_update
class dareblopy.ParsedRecordYielderRandomized

Bases: pybind11_builtins.pybind11_object

Generator that yields parsed records from a list of tfrecord files in a randomized way.

ParsedRecordYielderRandomized gives slightly better performance than RecordYielderRandomized since it reduces data copying.

Parameters
  • parser (RecordParser) – parser to be used to decode records.

  • filenames (List[str]) – a list of filenames of the tfrecord files.

  • buffer_size (int) – size of the buffer, in number of samples. Data is read from the tfrecords into this buffer sequentially, but the order of tfrecords is picked at random, and samples are drawn from the buffer at random. The larger the buffer and the smaller the tfrecord files, the more random the yielding order. Similar to https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle

  • seed (int) – seed for the random number generator.

  • compression (Compression, optional) – compression type. Default is Compression.NONE.

__init__(self: _dareblopy.ParsedRecordYielderRandomized, parser: object, filenames: List[str], buffer_size: int, seed: int, epoch: int, compression: _dareblopy.Compression = Compression.NONE) → None
__iter__(self: object) → object
__next__(self: _dareblopy.ParsedRecordYielderRandomized) → object
next_n(self: _dareblopy.ParsedRecordYielderRandomized, arg0: int) → list
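
Example (a hedged sketch; the file names reuse the test files from the examples above, and the exact structure of the parsed output is an assumption):

features = {
    'shape': db.FixedLenFeature([3], db.int64),
    'data': db.FixedLenFeature([], db.string)
}
parser = db.RecordParser(features)
record_yielder = db.ParsedRecordYielderRandomized(parser,
                                                  ['test_utils/test-small-r00.tfrecords',
                                                   'test_utils/test-small-r01.tfrecords'],
                                                  buffer_size=16, seed=0, epoch=0)
batch = record_yielder.next_n(32)  # assumed: a list of parsed records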
class dareblopy.RecordParser

Bases: pybind11_builtins.pybind11_object

__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: _dareblopy.RecordParser, arg0: dict) -> None

  2. __init__(self: _dareblopy.RecordParser, arg0: dict, arg1: bool) -> None

  3. __init__(self: _dareblopy.RecordParser, arg0: dict, arg1: bool, arg2: int) -> None

parse_example(self: _dareblopy.RecordParser, arg0: List[str]) → list
parse_single_example(self: _dareblopy.RecordParser, arg0: str) → list
parse_single_example_inplace(self: _dareblopy.RecordParser, arg0: str, arg1: List[object], arg2: int) → None
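
Example (a hedged sketch of parsing raw records read with RecordReader; the exact layout of the returned list is an assumption):

features = {
    'shape': db.FixedLenFeature([3], db.int64),
    'data': db.FixedLenFeature([], db.string)
}
parser = db.RecordParser(features)
records = list(db.RecordReader('test_utils/test-small-r00.tfrecords'))
parsed = parser.parse_example(records)  # assumed: one list entry per feature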
class dareblopy.RecordReader

Bases: pybind11_builtins.pybind11_object

An iterator that reads a tfrecord file and returns raw records (protobuf messages). Supports compressed tfrecords via the compression argument. Performs a crc32 check of the read data.

Parameters
  • file (File) – a File object.

  • filename (str) – a filename of the file.

  • compression (Compression, optional) – compression type. Default is Compression.NONE.

Note

The constructor is overloaded and accepts either file (File) or filename (str).

Example:

rr = db.RecordReader('test_utils/test-small-r00.tfrecords')
file_size, data_size, entries = rr.get_metadata()
records = list(rr)

# Or for the compressed records:
rr = db.RecordReader('test_utils/test-small-gzip-r00.tfrecords', db.Compression.GZIP)
file_size, data_size, entries = rr.get_metadata()
records = list(rr)
__init__(*args, **kwargs)

Overloaded function.

  1. __init__(self: _dareblopy.RecordReader, file: fsal::File, compression: _dareblopy.Compression = Compression.NONE) -> None

  2. __init__(self: _dareblopy.RecordReader, filename: str, compression: _dareblopy.Compression = Compression.NONE) -> None

__iter__(self: object) → object
__next__(self: _dareblopy.RecordReader) → object
get_metadata(self: _dareblopy.RecordReader) → Tuple[int, int, int]

Returns metadata of the tfrecord and checks all crc32 checksums.

Note

It has to scan the whole file to compute the metadata and verify the checksums.

Returns

Tuple[int, int, int] – (file_size, data_size, entries), where file_size is the size of the file, data_size is the size of the data stored in the tfrecord, and entries is the number of entries.

read_record(self: _dareblopy.RecordReader, arg0: int) → object

Reads a record at a specific offset. In the majority of cases you won't need this method; instead, use RecordReader as an iterator.

class dareblopy.RecordYielderBasic

Bases: pybind11_builtins.pybind11_object

Generator that yields records from a list of tfrecord files.

Parameters
  • filenames (List[str]) – a list of filenames of the tfrecord files.

  • compression (Compression, optional) – compression type. Default is Compression.NONE.

__init__(self: _dareblopy.RecordYielderBasic, filenames: List[str], compression: _dareblopy.Compression = Compression.NONE) → None
__iter__(self: object) → object
__next__(self: _dareblopy.RecordYielderBasic) → object
next_n(self: _dareblopy.RecordYielderBasic, arg0: int) → list
class dareblopy.RecordYielderRandomized

Bases: pybind11_builtins.pybind11_object

Generator that yields records from a list of tfrecord files in a randomized way.

Parameters
  • filenames (List[str]) – a list of filenames of the tfrecord files.

  • buffer_size (int) – size of the buffer, in number of samples. Data is read from the tfrecords into this buffer sequentially, but the order of tfrecords is picked at random, and samples are drawn from the buffer at random. The larger the buffer and the smaller the tfrecord files, the more random the yielding order. Similar to https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle

  • seed (int) – seed for the random number generator.

  • compression (Compression, optional) – compression type. Default is Compression.NONE.

__init__(self: _dareblopy.RecordYielderRandomized, filenames: List[str], buffer_size: int, seed: int, epoch: int, compression: _dareblopy.Compression = Compression.NONE) → None
__iter__(self: object) → object
__next__(self: _dareblopy.RecordYielderRandomized) → object
next_n(self: _dareblopy.RecordYielderRandomized, arg0: int) → list
class dareblopy.Status

Bases: pybind11_builtins.pybind11_object

__init__(self: _dareblopy.Status) → None
is_eof(self: _dareblopy.Status) → bool
dareblopy.open_as_bytes(arg0: str) → object

Opens file as bytes object

Parameters

filename (str) – filename

dareblopy.open_as_numpy_ubyte(filename: str, shape: object = None) → object

Opens file as a numpy array of type np.ubyte

Parameters
  • filename (str) – filename

  • shape (List[Int]) – shape

dareblopy.open_zip_archive(*args, **kwargs)

Overloaded function.

  1. open_zip_archive(arg0: str) -> fsal::Archive

    Opens zip archive

    Args:

    filename (str): filename

  2. open_zip_archive(arg0: fsal::File) -> fsal::Archive

dareblopy.read_jpg_as_numpy(filename: str, use_turbo: bool = False) → object

Opens a jpeg file as a numpy array of type np.ubyte

Parameters
  • filename (str) – filename

  • use_turbo (bool) – uses libjpeg-turbo if True
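
Example (a short sketch of the module-level helpers; the file names are hypothetical):

import dareblopy as db

b = db.open_as_bytes('file.bin')                         # bytes object
arr = db.open_as_numpy_ubyte('file.bin')                 # np.ubyte ndarray
img = db.read_jpg_as_numpy('image.jpg', use_turbo=True)  # decoded uint8 ndarray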

Submodules

dareblopy.TFRecordsDatasetIterator module

class dareblopy.TFRecordsDatasetIterator.ParsedTFRecordsDatasetIterator(filenames, features, batch_size, buffer_size=1000, seed=None, epoch=0, compression=None)

Bases: object

__init__(filenames, features, batch_size, buffer_size=1000, seed=None, epoch=0, compression=None)

Initialize self. See help(type(self)) for accurate signature.

__iter__()
__next__()
class dareblopy.TFRecordsDatasetIterator.TFRecordsDatasetIterator(filenames, batch_size, buffer_size=1000, seed=None, epoch=0, compression=None)

Bases: object

__init__(filenames, batch_size, buffer_size=1000, seed=None, epoch=0, compression=None)

Initialize self. See help(type(self)) for accurate signature.

__iter__()
__next__()
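
These iterators wrap the record yielders and parser into a Python-level dataset loop. Example (a hedged sketch; it assumes the class is importable from dareblopy.TFRecordsDatasetIterator and reuses the test files from the examples above; the batch structure is an assumption):

import dareblopy as db
from dareblopy.TFRecordsDatasetIterator import ParsedTFRecordsDatasetIterator

features = {
    'data': db.FixedLenFeature([3, 32, 32], db.uint8)
}
iterator = ParsedTFRecordsDatasetIterator(['test_utils/test-small-r00.tfrecords'],
                                          features, batch_size=32, buffer_size=1000)
for batch in iterator:  # yields batches until the records are exhausted
    pass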

dareblopy.data_loader module

dareblopy.data_loader.data_loader(yielder, collator=None, iteration_count=None, worker_count=1, queue_size=16)

Return an iterator that retrieves objects from yielder and passes them through collator.

Maintains a queue of given size and can run several worker threads. Intended to be used for asynchronous, buffered data loading. Uses threads instead of multiprocessing, so tensors can be uploaded to GPU in collator.

The collator function can serve many purposes, depending on your use case (a sketch follows at the end of this entry):

  • Reading data from disk or db

  • Data decoding, e.g. from JPEG.

  • Augmenting data: flipping, rotating, adding noise, etc.

  • Concatenation of data, stacking to single ndarray, conversion to a tensor, uploading to GPU.

  • Data generation.

Note

Sequential order of batches is guaranteed only if the number of workers is 1 (the default); otherwise batches might be supplied out of order.

Parameters
  • yielder (iterator) – Input data, returns batches.

  • collator (Callable, optional) – Function for processing batches. Receives a batch from yielder and can return an object of any type. Defaults to None.

  • worker_count (int, optional) – Number of workers; should be greater than or equal to one. To process data in parallel and fully load the CPU, worker_count should be close to the number of CPU cores. Defaults to one.

  • queue_size (int, optional) – Maximum size of the queue, i.e. the number of batches to buffer. Should be larger than worker_count. Typically, one would want this to be as large as possible to amortize all disk IO and computational costs; the downside of a large value is increased RAM consumption. Defaults to 16.

Returns

An object that produces a sequence of batches. The iterator's next() method returns the object produced by the collator function.

Return type

Iterator

Raises

StopIteration – Raised when all data has been iterated through; stops the for loop.
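
Example (a minimal sketch; it assumes data_loader is importable as below, that yielder is one of the record yielders above, and that all raw records have equal length so they can be stacked):

import numpy as np
from dareblopy.data_loader import data_loader

def collate(batch):
    # hypothetical collator: view each raw record as uint8 and stack into one ndarray
    return np.stack([np.frombuffer(record, dtype=np.uint8) for record in batch])

loader = data_loader(yielder, collator=collate, worker_count=1, queue_size=16)
for batch in loader:
    pass  # each item is whatever collate returned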

dareblopy.utils module

dareblopy.utils.display_grid(tensor, nrow=8, padding=2)
dareblopy.utils.make_grid(tensor, nrow=8, padding=2)