Dask array rounding
Dask array rounding float32
This notebook is an extract from the Dask array Tutorial notebook, see also the Youtube SciPy 2020 class at https://www.youtube.com/watch?v=mqdglv9GnM8.
We notice that dask
is automatically rounding float32
numbers to machine precision, which I think is the most sensible choice, but surprising difference compared to numpy
.
import os
import h5py
import numpy as np
%%time
blocksize = 1000000
nblocks = 1000
shape = nblocks * blocksize
if not os.path.exists('random.hdf5'):
with h5py.File('random.hdf5', mode='w') as f:
dset = f.create_dataset('/x', shape=(shape,), dtype='f4')
for i in range(0, shape, blocksize):
dset[i: i + blocksize] = np.random.exponential(size=blocksize)
from dask.distributed import Client
client = Client(n_workers=24, processes=False)
# this creates a pointer to the data, but does not actually load
import h5py
import os
f = h5py.File('random.hdf5', mode='r')
dset = f['/x']
dset.dtype
!ls -lah data/random.hdf5
Compute sum using blocked algorithm
Before using dask, lets consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.
Here we compute the sum of this large array on disk by
- Computing the sum of each 1,000,000 sized chunk of the array
- Computing the sum of the 1,000 intermediate sums
Note that this is a sequential process in the notebook kernel, both the loading and summing.
len(dset)
sums = []
for i in range(0, 1000000000, 1000000):
chunk = dset[i: i + 1000000] # pull out numpy array
sums.append(chunk.sum())
total = sum(sums)
print(total)
Create dask.array
object
You can create a dask.array
Array
object with the da.from_array
function. This function accepts
data
: Any object that supports NumPy slicing, likedset
chunks
: A chunk size to tell us how to block up our array, like(1000000,)
import dask.array as da
import numpy as np
x = da.from_array(dset, chunks=(10000000,))
x
x_float64 = x.astype(np.float64)
x.sum().compute()
The machine resolution of float32
is 1e-6
, therefore everything after the 7th digit is garbage, so it is reasonable to remove it, otherwise it gives you the impression that the computation is more precise than it actually it.
Still I am surprised dask
does it, numpy
above instead doesn't care about that and prints all the digits.
If we need more precision, we need to increase the precision of the calculation, see below, but we are going to use a lot more memory, also, the input data were float32
, so it is not very useful anyway, we should generate again the input with higher precision.
np.finfo(np.float32)
x_float64.sum()
x_float64.sum().compute()
client.shutdown()