dlutils package
Submodules
dlutils.batch_provider module
dlutils.batch_provider(data, batch_size, processor=None, worker_count=1, queue_size=16, report_progress=True)

Returns an object that produces a sequence of batches from the input data.

The input data is split into batches of size batch_size, which are processed with the function processor. Data is split and processed by separate threads and pushed into a queue, allowing continuous provision of data. The main purpose of this primitive is to provide an easy-to-use tool for parallel batch processing/generation in the background while the main thread runs the main algorithm. Batches are processed in parallel, allowing better utilization of CPU cores and disk, which may improve GPU utilization for DL tasks with a storage/IO bottleneck.

This primitive can be used in various ways. For small datasets, the data list may contain the actual dataset, while the processor function does little to no processing. For larger datasets, the data list may contain just filenames or keys, while the processor function reads the data from disk or a database.

The processor function can serve many purposes, depending on your use case:

- Reading data from disk or a database.
- Decoding data, e.g. from JPEG.
- Augmenting data: flipping, rotating, adding noise, etc.
- Concatenating data: stacking into a single ndarray, converting to a tensor, uploading to GPU.
- Generating data.

Note

Sequential order of batches is guaranteed only if the number of workers is 1 (the default); otherwise batches may be supplied out of order.
- Parameters

  data (list) – Input data; each entry in the list should be a separate data point.

  batch_size (int) – Size of a batch. If the size of data is not divisible by batch_size, the last batch will be smaller.

  processor (Callable[[list], Any], optional) – Function for processing batches. Receives a slice of the data list as input. Can return an object of any type. Defaults to None.

  worker_count (int, optional) – Number of workers; must be greater than or equal to one. To process data in parallel and fully load the CPU, worker_count should be close to the number of CPU cores. Defaults to 1.

  queue_size (int, optional) – Maximum size of the queue, i.e. the number of batches to buffer. Should be larger than worker_count. Typically, one would want this to be as large as possible to amortize all disk IO and computational costs; the downside of a large value is increased RAM consumption. Defaults to 16.

  report_progress (bool, optional) – Print a progress bar similar to tqdm. You may still use tqdm itself if you set report_progress to False; in that case, just do: for x in tqdm(batch_provider(...)): .... Defaults to True.
- Returns

  An object that produces a sequence of batches. The next() method of the iterator returns the object produced by the processor function.

- Return type

  Iterator

- Raises

  StopIteration – When all data has been iterated through. Stops the for loop.
Example

    import numpy as np
    import torch
    import torch.nn.functional as F
    from scipy import misc

    import dlutils

    def process(batch):
        images = [misc.imread(x[0]) for x in batch]
        images = np.asarray(images, dtype=np.float32)
        images = images.transpose((0, 3, 1, 2))
        labels = [x[1] for x in batch]
        labels = np.asarray(labels, np.int64)
        return torch.from_numpy(images) / 255.0, torch.from_numpy(labels)

    data = [('some_list.jpg', 1), ('of_filenames.jpg', 2), ('etc.jpg', 4), ...]  # filenames and labels

    batches = dlutils.batch_provider(data, 32, process)

    for images, labels in batches:
        result = model(images)
        loss = F.nll_loss(result, labels)
        loss.backward()
        optimizer.step()
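The "generating data" use case from the list above can look like the following minimal sketch; the seeding scheme and array shapes are illustrative, not from the library docs:

    import numpy as np

    import dlutils

    # The data list holds only integer seeds; the processor synthesizes batches.
    seeds = list(range(1024))

    def generate(batch):
        # Use the first seed of the slice to make each batch reproducible.
        rng = np.random.RandomState(batch[0])
        return rng.rand(len(batch), 3, 32, 32).astype(np.float32)

    for batch in dlutils.batch_provider(seeds, 32, generate, worker_count=4):
        pass  # consume synthetic batches of shape (32, 3, 32, 32)

With worker_count=4, batches arrive as they are generated, so their order is not guaranteed (see the Note above).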
dlutils.cache module

class dlutils.cache(function)

Bases: object

Caches the return value of a function.
Given a function with no side effects, it computes the sha256 hash of the passed arguments and uses that hash to retrieve a saved pickle.

Note

Passed arguments must be picklable.

If you change the function, or make any other change that invalidates previously saved caches, you will need to delete them manually.

Results are saved to the '.cache' folder in the current directory.
- Parameters

  function (function) – Function to be called.
Example

    @dlutils.cache
    def expensive_function(x):
        for i in range(12):
            x = x + x * x
        return x
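A short follow-up sketch of the caching behavior described above (the argument value is illustrative):

    y1 = expensive_function(3)  # computed, then pickled under './.cache'
    y2 = expensive_function(3)  # same picklable arguments: loaded from the cache
    assert y1 == y2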
dlutils.download module

Module for downloading files (including from Google Drive) and uncompressing tar.gz archives.
dlutils.download.cifar10(directory='cifar10')

Downloads the CIFAR-10 dataset.

- Parameters

  directory (str) – Directory where to save the files.
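For instance (the target directory here is illustrative):

    import dlutils

    dlutils.download.cifar10(directory='data/cifar10')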
dlutils.download.cifar100(directory='cifar100')

Downloads the CIFAR-100 dataset.

- Parameters

  directory (str) – Directory where to save the files.
dlutils.download.fashion_mnist(directory='fashion-mnist')

Downloads the Fashion-MNIST dataset.

- Parameters

  directory (str) – Directory where to save the files.
dlutils.download.from_google_drive(google_drive_fileid, directory='.', file_name=None, extract_targz=False, extract_gz=False, extract_zip=False)

Downloads a file from Google Drive.

Given the file ID, the file is downloaded from Google Drive; optionally, it can be unpacked after downloading completes.

Note

You need to share the file as "Anyone who has the link can access. No sign-in required." You can find the file ID in the link: https://drive.google.com/file/d/0B3kP5zWXwFm_OUpQbDFqY2dXNGs/view?usp=sharing

- Parameters

  google_drive_fileid (str) – File ID.

  directory (str) – Directory where to save the file.

  file_name (str, optional) – If not None, this will overwrite the file name; otherwise the filename returned by the HTTP request is used. Defaults to None.

  extract_targz (bool) – Extract tar.gz archive. Defaults to False.

  extract_gz (bool) – Decompress gz-compressed file. Defaults to False.

  extract_zip (bool) – Extract zip archive. Defaults to False.
Example

    dlutils.download.from_google_drive(directory="data/", google_drive_fileid="0B3kP5zWXwFm_OUpQbDFqY2dXNGs")
dlutils.download.from_url(url, directory='.', file_name=None, extract_targz=False, extract_gz=False, extract_zip=False)

Downloads a file from the specified URL.

Optionally, it can be unpacked after downloading completes.

- Parameters

  url (str) – File URL.

  directory (str) – Directory where to save the file.

  file_name (str, optional) – If not None, this will overwrite the file name; otherwise the filename returned by the HTTP request is used. Defaults to None.

  extract_targz (bool) – Extract tar.gz archive. Defaults to False.

  extract_gz (bool) – Decompress gz-compressed file. Defaults to False.

  extract_zip (bool) – Extract zip archive. Defaults to False.
Example

    dlutils.download.from_url("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", directory, extract_gz=True)
dlutils.epoch module

class dlutils.epoch.EpochRange(epoch_count, log_func=None)

Bases: object

Range for iterating over epochs.
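The class is otherwise undocumented here. A hypothetical usage sketch, assuming EpochRange iterates like range(epoch_count) and uses log_func for reporting (the exact behavior is defined in dlutils/epoch.py):

    from dlutils.epoch import EpochRange

    # Assumption: yields successive epoch indices and, if given,
    # calls log_func to report per-epoch progress.
    for epoch in EpochRange(10, log_func=print):
        pass  # run one training epoch here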
dlutils.measures module

dlutils.measures.auc(label, prediction)

dlutils.measures.f1(label, prediction, threshold)

dlutils.measures.f1_from_pr(precision, recall)

dlutils.measures.f1_from_tp_fp_fn(true_positive, false_positive, false_negative)

dlutils.measures.openset_f1(label_inlier, prediction_inlier, threshold, correctly_classified)
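These functions are listed without docstrings upstream. For reference, a sketch of the standard F1 definition that f1_from_tp_fp_fn and f1_from_pr presumably implement (the textbook formula, not necessarily the library's exact code):

    def f1_reference(true_positive, false_positive, false_negative):
        # Textbook definitions; reference sketch only.
        precision = true_positive / (true_positive + false_positive)
        recall = true_positive / (true_positive + false_negative)
        if precision + recall == 0:
            return 0.0
        # F1 is the harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)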
dlutils.numpy_dataset module

dlutils.progress_bar module

dlutils.random_rotation module

Random rotation matrix

dlutils.random_rotation.random_rotation(size)
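A hedged usage sketch, assuming random_rotation(size) returns a size x size rotation matrix (orthogonal with determinant 1), as the module docstring suggests:

    import numpy as np
    from dlutils.random_rotation import random_rotation

    R = random_rotation(3)
    # A rotation matrix satisfies R @ R.T == I and det(R) == 1.
    assert np.allclose(R @ R.T, np.eye(3), atol=1e-6)
    assert np.isclose(np.linalg.det(R), 1.0, atol=1e-6)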
dlutils.reader module

Utility for reading the MNIST dataset

class dlutils.reader.Cifar10(path, train=True, test=False)

Bases: object

Reads CIFAR out of binary batches.

get_images()

get_labels()
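A hypothetical usage sketch (the path assumes the standard CIFAR-10 binary batches, e.g. as fetched by dlutils.download.cifar10):

    from dlutils.reader import Cifar10

    reader = Cifar10('cifar10/cifar-10-batches-bin', train=True, test=False)
    images = reader.get_images()  # expected: ndarray of image data
    labels = reader.get_labels()  # expected: ndarray of integer class labels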
dlutils.registry module

dlutils.save_image module
dlutils.make_grid(images, nrow=8, padding=2, NCWH=False)

dlutils.save_image(images, filename, nrow=8, padding=2, NCWH=False, format=None)
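A hedged sketch of saving a batch as an image grid (the channel-first layout and the [0, 1] value range are assumptions inferred from the NCWH flag, not stated by the library):

    import numpy as np

    import dlutils

    # Hypothetical batch: 16 RGB images, channel-first, values in [0, 1].
    images = np.random.rand(16, 3, 32, 32).astype(np.float32)
    dlutils.save_image(images, 'grid.png', nrow=4, NCWH=True)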
dlutils.shuffle module

dlutils.shuffle.shuffle_ndarray(x, axis=0)

Shuffle slices of an ndarray along a specific axis.

For example, given a 4-dimensional ndarray representing a batch of images in BCHW format, one could shuffle the samples in that batch by applying shuffle_ndarray() with axis=0.

Note

The function does not return anything; it shuffles the ndarray in place.

- Parameters

  x (array_like) – ndarray to shuffle.

  axis (int, optional) – The axis over which to shuffle. Defaults to 0.
Example

    >>> a = np.asarray([[1, 5], [0, 2], [0, 1]])
    >>> a
    array([[1, 5],
           [0, 2],
           [0, 1]])
    >>> dlutils.shuffle.shuffle_ndarray(a, axis=0)
    >>> a
    array([[0, 2],
           [0, 1],
           [1, 5]])
    >>> dlutils.shuffle.shuffle_ndarray(a, axis=1)
    >>> a
    array([[2, 0],
           [1, 0],
           [5, 1]])
dlutils.shuffle.shuffle_ndarrays_in_unison(arrays, axis=0)

Shuffle slices of a list of ndarrays along a specific axis, using the same permutation for each of the arrays in the list.

Works like shuffle_ndarray(), but applies the same permutation to all arrays in the list; see the sketch below.

Note

The function does not return anything; it shuffles the ndarrays in place. All arrays in the list should have the same shape.
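A doctest-style sketch of the paired shuffle; since the permutation is random, the arrays below show one possible outcome:

    >>> a = np.asarray([[1, 5], [0, 2], [0, 1]])
    >>> b = np.asarray([[10, 50], [0, 20], [0, 10]])
    >>> dlutils.shuffle.shuffle_ndarrays_in_unison([a, b], axis=0)
    >>> a
    array([[0, 2],
           [0, 1],
           [1, 5]])
    >>> b
    array([[ 0, 20],
           [ 0, 10],
           [10, 50]])

Rows of a and b stay paired: a[i] and b[i] still correspond after the shuffle.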
dlutils.timer module

Profiling utils

dlutils.timer.timer(f)

Decorator for timing function (method) execution time.

After the function returns, it prints the string: func: <function name> took: <time in seconds> sec.
- Parameters

  f (Callable[Any]) – Function to decorate.

- Returns

  Decorated function.

- Return type

  Callable[Any]
Example

    >>> from dlutils import timer
    >>> @timer.timer
    ... def foo(x):
    ...     for i in range(x):
    ...         pass
    ...
    >>> foo(100000)
    func:'foo' took: 0.0019 sec