dlutils package

Submodules

dlutils.async module

dlutils.async_func(fnc=None, callback=None)[source]

dlutils.batch_provider module

dlutils.batch_provider(data, batch_size, processor=None, worker_count=1, queue_size=16, report_progress=True)[source]

Return an object that produces a sequence of batches from input data.

Input data is split into batches of size batch_size, which are processed with the function processor. Data is split and processed by separate threads and dumped into a queue, allowing continuous provision of data. The main purpose of this primitive is to provide an easy-to-use tool for parallel batch processing/generation in the background while the main thread runs the main algorithm. Batches are processed in parallel, allowing better utilization of CPU cores and disk, which may improve GPU utilization for DL tasks with a storage/IO bottleneck.

This primitive can be used in various ways. For small datasets, the input data list may contain the actual dataset, while the processor function does little to no processing. For larger datasets, the data list may contain just filenames or keys, while the processor function reads the data from disk or a database.

There are many purposes that the processor function can be used for, depending on your use case:

  • Reading data from disk or db

  • Data decoding, e.g. from JPEG.

  • Augmenting data: flipping, rotating, adding noise, etc.

  • Concatenation of data: stacking into a single ndarray, converting to a tensor, uploading to the GPU.

  • Data generation.

Note

Sequential order of batches is guaranteed only if the number of workers is 1 (the default); otherwise, batches may be supplied out of order.

Parameters
  • data (list) – Input data; each entry in the list should be a separate data point.

  • batch_size (int) – Size of a batch. If the size of data is not divisible by batch_size, then the last batch will be smaller.

  • processor (Callable[[list], Any], optional) – Function for processing batches. Receives a slice of the data list as input. Can return an object of any type. Defaults to None.

  • worker_count (int, optional) – Number of workers; should be greater than or equal to one. To process data in parallel and fully load the CPU, worker_count should be close to the number of CPU cores. Defaults to one.

  • queue_size (int, optional) – Maximum size of the queue, which is the number of batches to buffer. Should be larger than worker_count. Typically, one would want this to be as large as possible to amortize all disk IO and computational costs. The downside of a large value is increased RAM consumption. Defaults to 16.

  • report_progress (bool, optional) –

    Print a progress bar similar to tqdm. You may still use tqdm instead if you set report_progress to False. To use tqdm, just do

    for x in tqdm(batch_provider(...)):
        ...
    

    Defaults to True.

Returns

An object that produces a sequence of batches. The next() method of the iterator will return the object that was produced by the processor function.

Return type

Iterator

Raises

StopIteration – When all data has been iterated through. Stops the for loop.

Example

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image  # scipy.misc.imread was removed from scipy; PIL serves the same purpose

import dlutils

def process(batch):
    # Read and stack the batch images into an NCHW float tensor.
    images = [np.asarray(Image.open(x[0])) for x in batch]
    images = np.asarray(images, dtype=np.float32)
    images = images.transpose((0, 3, 1, 2))  # NHWC -> NCHW
    labels = [x[1] for x in batch]
    labels = np.asarray(labels, np.int64)  # np.int is deprecated
    return torch.from_numpy(images) / 255.0, torch.from_numpy(labels)

data = [('some_list.jpg', 1), ('of_filenames.jpg', 2), ('etc.jpg', 4), ...] # filenames and labels
batches = dlutils.batch_provider(data, 32, process)

for images, labels in batches:
    optimizer.zero_grad()  # clear gradients from the previous step
    result = model(images)
    loss = F.nll_loss(result, labels)
    loss.backward()
    optimizer.step()

dlutils.cache module

class dlutils.cache(function)[source]

Bases: object

Caches the return value of a function.

Given a function with no side effects, it computes a SHA-256 hash of the passed arguments and uses that hash to retrieve a saved pickle.

Note

Passed arguments must be picklable.

If you change the function, or make any other change that invalidates previously saved caches, you will need to delete them manually.

Results are saved to the '.cache' folder in the current directory.

Parameters

function (function) – function to be called.

Example

import dlutils

@dlutils.cache
def expensive_function(x):
    for i in range(12):
        x = x + x * x
    return x
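
The first call with a given argument computes the result and pickles it under .cache/; subsequent calls with the same (picklable) arguments load the saved pickle instead:

expensive_function(3)  # computed and written to the .cache folder
expensive_function(3)  # same arguments: loaded from the cached pickle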

dlutils.default_cfg module

dlutils.default_cfg.get_default_cfg()[source]

dlutils.download module

Module for downloading files, including from Google Drive, and for uncompressing tar.gz archives.

dlutils.download.cifar10(directory='cifar10')[source]

Downloads CIFAR10 Dataset.

Parameters

directory (str) – Directory in which to save the files

dlutils.download.cifar100(directory='cifar100')[source]

Downloads CIFAR100 Dataset.

Parameters

directory (str) – Directory in which to save the files

dlutils.download.fashion_mnist(directory='fashion-mnist')[source]

Downloads Fashion-MNIST Dataset.

Parameters

directory (str) – Directory in which to save the files

dlutils.download.from_google_drive(google_drive_fileid, directory='.', file_name=None, extract_targz=False, extract_gz=False, extract_zip=False)[source]

Downloads file from Google Drive.

Given the file ID, the file is downloaded from Google Drive; optionally, it can be unpacked after the download completes.

Note

You need to share the file as 'Anyone who has the link can access. No sign-in required.' You can find the file ID in the link; it is the segment between /d/ and /view:

https://drive.google.com/file/d/0B3kP5zWXwFm_OUpQbDFqY2dXNGs/view?usp=sharing

Parameters
  • google_drive_fileid (str) – file ID.

  • directory (str) – Directory in which to save the file

  • file_name (str, optional) – If not None, this overrides the file name; otherwise the filename returned by the HTTP request is used. Defaults to None.

  • extract_targz (bool) – Extract tar.gz archive. Defaults to False.

  • extract_gz (bool) – Decompress gz compressed file. Defaults to False.

  • extract_zip (bool) – Extract zip archive. Defaults to False.

Example

dlutils.download.from_google_drive(directory="data/", google_drive_fileid="0B3kP5zWXwFm_OUpQbDFqY2dXNGs")

dlutils.download.from_url(url, directory='.', file_name=None, extract_targz=False, extract_gz=False, extract_zip=False)[source]

Downloads file from specified URL.

Optionally, it can be unpacked after the download completes.

Parameters
  • url (str) – file URL.

  • directory (str) – Directory in which to save the file

  • file_name (str, optional) – If not None, this overrides the file name; otherwise the filename returned by the HTTP request is used. Defaults to None.

  • extract_targz (bool) – Extract tar.gz archive. Defaults to False.

  • extract_gz (bool) – Decompress gz compressed file. Defaults to False.

  • extract_zip (bool) – Extract zip archive. Defaults to False.

Example

dlutils.download.from_url("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "mnist", extract_gz=True)

dlutils.download.mnist(directory='mnist')[source]

Downloads MNIST Dataset.

Parameters

directory (str) – Directory in which to save the files

dlutils.epoch module

class dlutils.epoch.EpochRange(epoch_count, log_func=None)[source]

Bases: object

Range for iterating over epochs.
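
EpochRange is otherwise undocumented; a minimal usage sketch, assuming it iterates like range(epoch_count) and optionally reports through log_func (an assumption based on the signature, not confirmed by the source):

from dlutils.epoch import EpochRange

for epoch in EpochRange(10):  # assumed to yield epoch indices 0..9
    train_one_epoch(epoch)    # hypothetical user-defined function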

class dlutils.epoch.LossTracker[source]

Bases: object

Tracker for easy recording and computing of mean values of quantities such as losses. A summary of average values is printed at the end of each epoch.

add(name, format_str='%s: %.3f')[source]
reset()[source]
class dlutils.epoch.RunningMean[source]

Bases: object

dlutils.measures module

dlutils.measures.auc(label, prediction)[source]
dlutils.measures.f1(label, prediction, threshold)[source]
dlutils.measures.f1_from_pr(precision, recall)[source]
dlutils.measures.f1_from_tp_fp_fn(true_positive, false_positive, false_negative)[source]
dlutils.measures.openset_f1(label_inlier, prediction_inlier, threshold, correctly_classified)[source]
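
The measures are otherwise undocumented. For reference, a minimal sketch of the standard F1 definitions the names suggest (an assumption based on the function names, not on the source):

# Standard F1 formulas (sketch; not the library's actual implementation).
def f1_from_pr(precision, recall):
    # Harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)

def f1_from_tp_fp_fn(true_positive, false_positive, false_negative):
    # Equivalent form computed directly from counts.
    return 2.0 * true_positive / (2.0 * true_positive + false_positive + false_negative)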

dlutils.numpy_dataset module

class dlutils.numpy_dataset.NumpyDataset(data)[source]

Bases: object

static list_of_pairs_to_numpy(l)[source]
shuffle()[source]

dlutils.progress_bar module

class dlutils.progress_bar.ProgressBar(total_iterations, prefix='Progress:', suffix='', decimals=1, length=None, fill='#')[source]

Bases: object

increment(val=1)[source]
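
A minimal usage sketch, assuming the bar is drawn as increment() is called (inferred from the constructor parameters above; do_work is a hypothetical placeholder):

from dlutils.progress_bar import ProgressBar

bar = ProgressBar(total_iterations=100, prefix='Training:', fill='#')
for i in range(100):
    do_work(i)       # hypothetical workload
    bar.increment()  # advance the bar by one iteration (val defaults to 1)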

dlutils.random_rotation module

Random rotation matrix

dlutils.random_rotation.random_rotation(size)[source]
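
A hedged sketch, assuming random_rotation(size) returns a size-by-size rotation matrix (orthogonal with determinant +1), in line with the module docstring:

import numpy as np
from dlutils.random_rotation import random_rotation

R = random_rotation(3)                    # assumed: a 3x3 rotation matrix
assert np.allclose(R @ R.T, np.eye(3))    # rotation matrices are orthogonal
assert np.isclose(np.linalg.det(R), 1.0)  # and have determinant +1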

dlutils.reader module

Utils for reading the MNIST and CIFAR datasets

class dlutils.reader.Cifar10(path, train=True, test=False)[source]

Bases: object

Read CIFAR out of binary batches

get_images()[source]
get_labels()[source]
class dlutils.reader.Cifar100(path, train=True, test=False)[source]

Bases: object

Read CIFAR out of binary batches

get_images()[source]
get_labels()[source]
class dlutils.reader.Mnist(path, items=None, train=True, test=False, resize_to_32x32=False)[source]

Bases: object

Read MNIST out of binary batches

get_images()[source]
get_labels()[source]
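
A usage sketch combining dlutils.download with the readers above (assuming, as the signatures suggest, that path points at the directory containing the downloaded binary batches and that train/test select the split):

import dlutils

dlutils.download.mnist('mnist')                    # fetch the binary files
mnist = dlutils.reader.Mnist('mnist', train=True)  # read the training split
images = mnist.get_images()
labels = mnist.get_labels()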

dlutils.registry module

class dlutils.registry.Registry(*args, **kwargs)[source]

Bases: dict

register(module_name)[source]
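
Registry is a dict subclass; a hedged sketch, assuming register(module_name) follows the common decorator-registry pattern (an assumption, not confirmed by the source):

from dlutils.registry import Registry

MODELS = Registry()

@MODELS.register('resnet')  # assumed: stores the decorated class under the key
class ResNet:
    pass

model = MODELS['resnet']()  # look the class up by name, as in a plain dict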

dlutils.save_image module

dlutils.make_grid(images, nrow=8, padding=2, NCWH=False)[source]
dlutils.save_image(images, filename, nrow=8, padding=2, NCWH=False, format=None)[source]
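
A usage sketch for the grid utilities, assuming images is a batch in NHWC layout unless NCWH=True (inferred from the parameter name, not confirmed by the source):

import numpy as np
import dlutils

batch = np.random.rand(16, 32, 32, 3)          # hypothetical batch of images
dlutils.save_image(batch, 'grid.png', nrow=4)  # assumed: writes a 4x4 grid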

dlutils.shuffle module

dlutils.shuffle.shuffle_ndarray(x, axis=0)[source]

Shuffle slices of ndarray along specific axis.

For example, given a 4-dimensional ndarray representing a batch of images in BCHW format, one could shuffle the samples in that batch by applying shuffle_ndarray() with axis=0.

Note

Function does not return anything. It shuffles the ndarray in place.

Parameters
  • x (array_like) – ndarray to shuffle.

  • axis (int, optional) – The axis over which to shuffle. Defaults to 0.

Example

>>> a = np.asarray([[1, 5], [0, 2], [0, 1]])
>>> a
array([[1, 5],
       [0, 2],
       [0, 1]])
>>> dlutils.shuffle.shuffle_ndarray(a, axis=0)
>>> a
array([[0, 2],
       [0, 1],
       [1, 5]])
>>> dlutils.shuffle.shuffle_ndarray(a, axis=1)
>>> a
array([[2, 0],
       [1, 0],
       [5, 1]])

dlutils.shuffle.shuffle_ndarrays_in_unison(arrays, axis=0)[source]

Shuffle slices of a list of ndarrays along a specific axis, using the same permutation for each of the arrays in the list.

Works similarly to shuffle_ndarray(), but applies the same permutation to all arrays in the list; see the sketch after the parameters.

Note

Function does not return anything. It shuffles the ndarrays in place. All arrays in the list should have the same shape.

Parameters
  • arrays (list[array_like]) – list of ndarrays to shuffle.

  • axis (int, optional) – The axis over which to shuffle. Defaults to 0.
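
For instance, to keep two same-shape arrays aligned while shuffling (a sketch based on the documented in-place semantics):

import numpy as np
import dlutils

a = np.random.rand(100, 8)
b = np.random.rand(100, 8)

# Both arrays are reordered with the same permutation along axis 0,
# so a[i] and b[i] still correspond after the call.
dlutils.shuffle.shuffle_ndarrays_in_unison([a, b], axis=0)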

dlutils.timer module

Profiling utils

dlutils.timer.timer(f)[source]

Decorator for timing function (method) execution.

After the function returns, it prints the string: func:'<function name>' took: <time in seconds> sec.

Parameters

f (Callable[Any]) – function to decorate.

Returns

Decorated function.

Return type

Callable[Any]

Example

>>> from dlutils import timer
>>> @timer.timer
... def foo(x):
...     for i in range(x):
...             pass
...
>>> foo(100000)
func:'foo'  took: 0.0019 sec

dlutils.tracker module

class dlutils.tracker.LossTracker(output_dir='.')[source]

Bases: object

add(name, pytorch=True)[source]
load_state_dict(state_dict)[source]
plot()[source]
register_means(epoch)[source]
state_dict()[source]
update(d)[source]
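
The methods are otherwise undocumented; a hedged training-loop sketch, assuming the names mean what they suggest: add registers a quantity, update records batch values, register_means aggregates them per epoch, and plot writes loss curves to output_dir (assumptions, not confirmed by the source):

from dlutils.tracker import LossTracker

tracker = LossTracker(output_dir='logs')
tracker.add('loss')                     # assumed: register a tracked quantity

for epoch in range(10):
    for batch in batches:               # hypothetical data source
        loss = train_step(batch)        # hypothetical training step
        tracker.update({'loss': loss})  # assumed: record the batch value
    tracker.register_means(epoch)       # assumed: aggregate means per epoch
tracker.plot()                          # assumed: write curves to output_dir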

Module contents