Dataset Handling

Deep.Net provides a generic type for handling datasets used in machine learning. It can handle samples that are of a user-defined record type containing fields of type ArrayNDT. The following features are provided:

data storage on host and CUDA GPU
indexed sample access
sample range access
mini-batch sequencing (with optional padding of last batch)
partitioning into training, validation and test set
loading from and saving to disk

We are going to introduce it using a simple, synthetic dataset.

Creating a dataset

In most cases you will load a dataset by parsing text or binary files. Since this is quite application-specific, we do not cover it here and instead create a synthetic dataset on the fly using hyperbolic functions.

Defining the sample type

Our sample type consists of two fields: a scalar \(x\) and a vector \(\mathbf{v}\). This corresponds to the following record type:

open ArrayNDNS

type MySampleType = {
    X:      ArrayNDT<single>
    V:      ArrayNDT<single>
}

We use the data type single for fast arithmetic operations on the GPU.

Generating some samples

Next, let us generate some samples. The scalar \(x\) shall be sampled randomly from a uniform distribution on the interval \([-2, 2]\). The values of the vector \(\mathbf{v}\) shall be given by the relation

\[\mathbf{v}(x) = \left( \begin{matrix} \mathrm{sinh} \, x \\ \mathrm{cosh} \, x \end{matrix} \right)\]

We can implement that using the following code.

let generateSamples cnt = seq {
    let rng = System.Random (100)
    for n = 0 to cnt - 1 do
        let x = 2. * (rng.NextDouble () - 0.5) * 2. |> single
        yield {
            X = ArrayNDHost.scalar x
            V = ArrayNDHost.ofList [sinh x; cosh x]
        }    
}
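As a brief check that the expression for x in the code above covers the requested interval: System.Random.NextDouble returns a value \(u\) with \(0 \le u < 1\), so

\[x = 2 \, (u - 0.5) \cdot 2 = 4u - 2, \qquad -2 \le x < 2\]

The upper bound \(2\) itself is never produced, which does not matter for this synthetic example.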

The generateSamples function produces the specified number of samples. We can test it as follows.

let smpls = generateSamples 100 |> List.ofSeq

for idx, smpl in List.indexed smpls do
    printfn "Sample %3d: X=%A    V=%A" idx smpl.X smpl.V

This prints

Sample   0: X=   1.8751    V=[   3.1841    3.3374]
Sample   1: X=  -1.3633    V=[  -1.8265    2.0824]
Sample   2: X=   0.6673    V=[   0.7179    1.2310]
Sample   3: X=   1.6098    V=[   2.4010    2.6009]
...
Sample  99: X=  -0.1610    V=[  -0.1617    1.0130]

Now that we have some data, we can create a dataset.

Instantiating the dataset type

There are two ways to construct a dataset.

The Dataset<'S>.FromSamples method takes a sequence of samples (of type 'S) and constructs a dataset from them.
The Dataset<'S> constructor takes a list of ArrayNDTs corresponding to the fields of the record type 'S. The first dimension of each passed array must correspond to the sample index.

Since we already have a sequence of samples, we use the first method.

open Datasets

let ds = smpls |> Dataset.FromSamples

Accessing single and multiple elements

The dataset type supports the indexing and slicing operations to access samples.

When accessing a single sample using the indexing operator, we obtain a record from the sequence of samples we passed into the Dataset.FromSamples method. For example, to print the third sample we write

let smpl2 = ds.[2]
printfn "Sample 3: X=%A    V=%A" smpl2.X smpl2.V

and get the output

Sample 3: X=   0.6673    V=[   0.7179    1.2310]

When accessing multiple elements using the slicing operator, the returned value is of the same sample record type, but the contained tensors have one additional dimension on the left corresponding to the sample index. For example, we can get a record containing the first three samples using the following code.

let smpl0to2 = ds.[0..2]
printfn "Samples 0,1,2:\nX=%A\nV=\n%A" smpl0to2.X smpl0to2.V

This prints

Samples 0,1,2:
X=[   1.8751   -1.3633    0.6673]
V=
[[   3.1841    3.3374]
 [  -1.8265    2.0824]
 [   0.7179    1.2310]]

Hence all tensors in the sample record gain one dimension, i.e. the scalar X becomes a vector and the vector V becomes a matrix with each row corresponding to a sample.
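The stacking behaviour of slicing can be mimicked with ordinary F# arrays. The following is only an illustrative sketch: the Sample type and the use of float32 arrays in place of ArrayNDT tensors are assumptions made for this example and are not part of Deep.Net.

```fsharp
// Plain F# arrays stand in for tensors here; this is not Deep.Net code.
type Sample = { X: float32; V: float32[] }

let samples = [
    { X =  1.8751f; V = [|  3.1841f; 3.3374f |] }
    { X = -1.3633f; V = [| -1.8265f; 2.0824f |] }
    { X =  0.6673f; V = [|  0.7179f; 1.2310f |] }
]

// Stacking three scalars along a new leading axis yields a vector of shape [3].
let xs : float32[] = samples |> List.map (fun s -> s.X) |> Array.ofList

// Stacking three vectors of shape [2] yields a matrix of shape [3; 2],
// with one row per sample.
let vs : float32[,] = samples |> List.map (fun s -> s.V) |> array2D

printfn "X shape: [%d]" xs.Length
printfn "V shape: [%d; %d]" (Array2D.length1 vs) (Array2D.length2 vs)
```

The leading axis of both xs and vs plays the role of the sample index, just as in the sliced record above.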

Iterating over the dataset

You can also iterate over the samples of the dataset directly.

for smpl in ds do
    printfn "Sample: %A" smpl

This prints

Sample: {X =    1.8751;
 V = [   3.1841    3.3374];}
Sample: {X =   -1.3633;
 V = [  -1.8265    2.0824];}
Sample: {X =    0.6673;
 V = [   0.7179    1.2310];}
Sample: {X =    1.6098;
 V = [   2.4010    2.6009];}
...
Sample: {X =   -0.1610;
 V = [  -0.1617    1.0130];}

Mini-batches

The ds.Batches method returns a sequence of mini-batches from the dataset. It takes one argument specifying the number of samples in each batch. If the total number of samples in the dataset is not a multiple of the batch size, the last batch will have fewer samples.

The following code prints the sizes of the obtained mini-batches.

for idx, batch in Seq.indexed (ds.Batches 30) do
    printfn "Batch %d: shape of X: %A    shape of V: %A" 
        idx batch.X.Shape batch.V.Shape

This outputs

Batch 0: shape of X: [30]    shape of V: [30; 2]
Batch 1: shape of X: [30]    shape of V: [30; 2]
Batch 2: shape of X: [30]    shape of V: [30; 2]
Batch 3: shape of X: [10]    shape of V: [10; 2]

If you need the last batch to be padded to the specified batch size, use the ds.PaddedBatches method instead.
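The batch sizes shown above follow from simple integer arithmetic, which can be reproduced without Deep.Net. In the sketch below, batchSizes and paddingNeeded are hypothetical helper names introduced for illustration. The sketch only counts how many filler entries a padded last batch would need; it makes no claim about what ds.PaddedBatches fills them with.

```fsharp
// Sizes of consecutive mini-batches when the last batch is not padded.
let batchSizes (nSamples: int) (batchSize: int) : int list =
    Seq.init nSamples id
    |> Seq.chunkBySize batchSize
    |> Seq.map Array.length
    |> List.ofSeq

// Number of filler entries needed to bring the last batch up to full size.
// The outer modulo yields 0 when nSamples is a multiple of batchSize.
let paddingNeeded (nSamples: int) (batchSize: int) : int =
    (batchSize - nSamples % batchSize) % batchSize

printfn "%A" (batchSizes 100 30)      // [30; 30; 30; 10]
printfn "%d" (paddingNeeded 100 30)   // 20
```

For our dataset of 100 samples and a batch size of 30, the last batch holds 10 samples, so 20 entries of padding would be required to reach the full batch size.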

Partitioning

It is often necessary to split a dataset into partitions.

The ds.Partition method takes a list of ratios and returns a list of new datasets obtained by splitting the dataset according to the specified ratios. Partitioning is done by sequentially taking samples from the beginning until the first partition has the requested number of samples. Then the samples for the second partition are taken, and so on.

The following example splits our dataset into three partitions of ratios \(1/2\), \(1/4\) and \(1/4\).

let partitions = ds.Partition [0.5; 0.25; 0.25]

for idx, p in List.indexed partitions do
    printfn "Partition %d has %d samples." idx p.NSamples

This prints

Partition 0 has 50 samples.
Partition 1 has 25 samples.
Partition 2 has 25 samples.
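The sequential partitioning scheme described above can be sketched on a plain F# list. This is not Deep.Net's implementation: partitionByRatios is a hypothetical helper, and the rounding rule (round each size to the nearest integer, let the last partition take the remainder) is an assumption that may differ from what ds.Partition does.

```fsharp
// Split a list into consecutive partitions according to the given ratios.
// Assumption: sizes are rounded to the nearest integer and the last
// partition takes whatever remains; Deep.Net's rounding may differ.
let partitionByRatios (ratios: float list) (items: 'a list) : 'a list list =
    let n = List.length items
    let rec go ratios remaining =
        match ratios with
        | [] -> []
        | [_] -> [remaining]
        | r :: rest ->
            let size = min (List.length remaining) (int (round (r * float n)))
            let taken, left = List.splitAt size remaining
            taken :: go rest left
    go ratios items

// Sample indices 0..99 stand in for the 100 samples of the dataset.
let parts = partitionByRatios [0.5; 0.25; 0.25] [0 .. 99]

for i, p in List.indexed parts do
    printfn "Partition %d: indices %d..%d" i (List.head p) (List.last p)
```

Because samples are taken in order, the first partition covers indices 0 to 49, the second 50 to 74 and the third 75 to 99. If the samples are stored in a meaningful order (for example sorted by class), it may be worth shuffling them before building the dataset.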

Training, validation and test splits

In machine learning it is common practice to split the dataset into a training, validation and test dataset. Deep.Net provides the TrnValTst<'S> type for that purpose. It is a record type with the fields Trn, Val and Tst of type Dataset<'S>. It can be constructed from an existing dataset using the TrnValTst.Of function.

The following code demonstrates its use with the ratios \(0.7\), \(0.15\) and \(0.15\) for the training, validation and test sets respectively. The ratio specification is optional; if it is omitted, ratios of \(0.8\), \(0.1\) and \(0.1\) are used.

let dsp = TrnValTst.Of (ds, 0.7, 0.15, 0.15)

printfn "Training set size:    %d" dsp.Trn.NSamples
printfn "Validation set size:  %d" dsp.Val.NSamples
printfn "Test set size:        %d" dsp.Tst.NSamples

This prints

Training set size:    70
Validation set size:  15
Test set size:        15

Data transfer

The ds.ToCuda and ds.ToHost methods copy the dataset to the CUDA GPU or to the host respectively. The TrnValTst type provides the same methods.

Disk storage

Use the ds.Save method to save a dataset to disk using the HDF5 format. The Dataset<'S>.Load function loads a saved dataset. The TrnValTst type provides the same methods.

Dataset loaders

Currently Deep.Net provides the following loaders for common datasets.

MNIST. Use the Mnist.load function. It takes two parameters: the first is the path to the directory containing the MNIST files (t10k-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz) and the second is the desired ratio of the validation set to the original MNIST training set of 60 000 samples (for example 0.166 if you want about 50 000 training samples and 10 000 validation samples). The sample type MnistT contains two fields: Img for the flattened images and Lbl for the labels in one-hot encoding.
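For readers unfamiliar with one-hot encoding, the following sketch shows the idea on a plain F# array. The oneHot helper is hypothetical and not part of Deep.Net; the element type float32 is an assumption chosen to match the single type used elsewhere in this document.

```fsharp
// One-hot encode a class label: a vector of zeros with a single one
// at the position given by the label.
let oneHot (nClasses: int) (label: int) : float32[] =
    Array.init nClasses (fun i -> if i = label then 1.0f else 0.0f)

// MNIST has ten digit classes (0 to 9); encode the digit 3.
let encoded = oneHot 10 3
printfn "%A" encoded
```

The resulting vector has ten entries, all zero except for the entry at index 3.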

Summary

The Dataset<'S> type provides a convenient way to work with datasets. Type-safety is provided by preserving the user-specified sample type 'S when accessing individual or multiple samples. The dataset handler is used by the generic training function.
