Reference

Note

Note that for all keyword arguments of type NTuple{2, Int}, the order of dimensions is (y/height-dimension, x/width-dimension).

GradValley

DataLoader

GradValley.DataLoader — Type

DataLoader(get_function::Function, dataset_size::Integer; batch_size::Integer=1, shuffle::Bool=false, drop_last::Bool=false)

The DataLoader was designed to easily iterate over batches. Each time a new batch is requested, the data loader loads this batch "just in time" (instead of loading all the batches to memory at once).

The get_function is expected to load one item from a dataset at a given index. It must accept exactly one positional argument, the index of the item to return, and must return a tuple of arbitrary length in which each element is an array. The length of the tuple and the size and element type of each array are expected to be the same at every index. When a batch is requested, the data loader returns a tuple of arrays, each extended by a batch dimension (the last dimension).
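As a rough illustration of the batching described above, the following sketch stacks the tuples returned by a get_function along a new last dimension. The names get_item and stack_batch are hypothetical and exist only for this example; the actual internals of the DataLoader may differ.

```julia
# Hypothetical get_function: returns a tuple (image, target) of arrays for one index.
get_item(index) = (fill(Float32(index), 28, 28, 1), zeros(Float32, 10))

# Hypothetical helper (not part of GradValley): stack the i-th array of every item
# along a new last dimension, which becomes the batch dimension.
function stack_batch(items)
    first_item = first(items)
    return ntuple(length(first_item)) do i
        batch_dim = ndims(first_item[i]) + 1
        cat((item[i] for item in items)...; dims=batch_dim)
    end
end

image_batch, target_batch = stack_batch([get_item(i) for i in 1:4])
size(image_batch)  # (28, 28, 1, 4)
size(target_batch) # (10, 4)
```

Every element of the returned tuple keeps its original dimensions and gains one trailing dimension whose size is the number of stacked items.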

Note

The DataLoader is iterable and indexable. size(dataloader) returns the given size of the dataset, and length(dataloader) returns the total number of batches (the two are equal if batch_size=1). When a range is given as the index argument, a vector containing multiple batches (arrays) is returned.

Tip

If you really want to load the whole dataset into memory (useful, for example, when training over multiple epochs, because the dataset then does not have to be reloaded each epoch), you can do so: all_batches = dataloader[begin:end], where typeof(dataloader) == DataLoader

Arguments

get_function::Function: the function which takes the index of an item from a dataset and returns that item (an arbitrary sized tuple containing arrays)
dataset_size::Integer: the maximum index the get_function accepts (the number of items in the dataset, the dataset size)
batch_size::Integer=1: the batch size (the last dimension, the extended batch dimension, of each array in the returned tuple has this size)
shuffle::Bool=false: shuffle the data (the data is not reshuffled automatically after each epoch; call reshuffle! for that)
drop_last::Bool=false: if true, the last incomplete batch is dropped when the dataset size is not divisible by the batch size; if false, the last batch is smaller in that case
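The effect of batch_size and drop_last on the number of batches can be checked with a little arithmetic. The helper below is a hypothetical sketch derived from the argument descriptions above, not a function of the package.

```julia
# Hypothetical helper: number of batches implied by the drop_last description.
function num_batches(dataset_size::Integer, batch_size::Integer; drop_last::Bool=false)
    # fld rounds down (incomplete batch dropped), cld rounds up (smaller last batch kept)
    return drop_last ? fld(dataset_size, batch_size) : cld(dataset_size, batch_size)
end

num_batches(10_000, 32)                  # 313, the last batch holds only 16 items
num_batches(10_000, 32, drop_last=true)  # 312
num_batches(10_000, 1)                   # 10000, equal to the dataset size
```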

Examples

# EXAMPLE FROM https://jonas208.github.io/GradValley.jl/tutorials_and_examples/#Tutorials-and-Examples
julia> using MLDatasets # a package for downloading datasets
# initialize train- and test-dataset
julia> mnist_train = MNIST(:train) 
julia> mnist_test = MNIST(:test)
# define the get_element function:
# function for getting an image and the corresponding target vector from the train or test partition
julia> function get_element(index, partition)
            # load one image and the corresponding label
            if partition == "train"
                image, label = mnist_train[index]
            else # test partition
                image, label = mnist_test[index]
            end
            # add channel dimension and rescale the values to their original 8 bit gray scale values
            image = reshape(image, 28, 28, 1) .* 255
            # generate the target vector from the label, one for the correct digit, zeros for the wrong digits
            target = zeros(10)
            target[label + 1] = 1.00

            return image, target
       end
# initialize the data loaders (with anonymous function which helps to easily distinguish between test- and train-partition)
julia> train_data_loader = DataLoader(index -> get_element(index, "train"), length(mnist_train), batch_size=32, shuffle=true)
julia> test_data_loader = DataLoader(index -> get_element(index, "test"), length(mnist_test), batch_size=32)
# in most cases NOT recommended: you can force the data loaders to load all the data at once into memory, depending on the dataset's size, this may take a while
julia> # train_data = train_data_loader[begin:end] # turned off to save time
julia> # test_data = test_data_loader[begin:end] # turned off to save time
# now you can write your train- or test-loop like so 
julia> for (batch, (image_batch, target_batch)) in enumerate(test_data_loader) #=do anything useful here=# end
julia> for (batch, (image_batch, target_batch)) in enumerate(train_data_loader) #=do anything useful here=# end

GradValley.reshuffle! — Function

reshuffle!(data_loader::DataLoader)

Manually shuffle the data loader (even if shuffle is disabled in the given data loader). It is recommended to reshuffle after each epoch during training.

GradValley.Layers

Containers

GradValley.Layers.SequentialContainer — Type

SequentialContainer(layer_stack::Vector{<: Any})

A sequential container (recommended method for building models). A SequentialContainer can take a vector of layers or other containers (submodules). During the forward pass, the given inputs are propagated sequentially through every layer (or submodule) and the output is returned. The execution order during the forward pass is the same as the order of the layers or submodules in the vector.

Note

You can use a SequentialContainer in a GraphContainer (and vice versa). You can also use a SequentialContainer in a SequentialContainer (nesting allowed).

Arguments

layer_stack::Vector{<: Any}: the vector containing the layers (or submodules, so other containers), the order of the modules in the vector corresponds to the execution order

Indexing and Iteration

The sequential container is indexable and iterable. Indexing a single element and iterating behave like indexing and iterating over sequential_container.layer_stack, the vector passed to the container at initialization. However, if the index is a range (UnitRange{<: Integer}), a new SequentialContainer containing all the requested submodules/layers is initialized and returned. length(sequential_container) and size(sequential_container) both just return the number of modules in the layers vector (equivalent to length(sequential_container.layer_stack)).
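The sequential propagation and the range indexing described above can be pictured with plain Julia. In this sketch, ordinary functions stand in for layers and sequential_forward is a hypothetical name; it only mirrors the idea and is not how the container is implemented.

```julia
# Stand-ins for layers: plain functions instead of GradValley layers (assumption for this sketch).
layer_stack = [x -> 2 .* x, x -> x .+ 1, x -> x .^ 2]

# Sequential propagation: the output of each layer is the input of the next one.
sequential_forward(stack, input) = foldl((x, layer) -> layer(x), stack; init=input)

sequential_forward(layer_stack, [1.0, 2.0])               # [9.0, 25.0]
# A range index selects a sub-chain, analogous to m[begin:end-1] returning a new container.
sequential_forward(layer_stack[begin:end-1], [1.0, 2.0])  # [3.0, 5.0]
```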

Examples

# a simple chain of fully connected layers
julia> m = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)

# a more complicated example with nested submodules
julia> feature_extractor_part_1 = SequentialContainer([Conv(1, 6, (5, 5), activation_function="relu"), AvgPool((2, 2))])
julia> feature_extractor_part_2 = SequentialContainer([Conv(6, 16, (5, 5), activation_function="relu"), AvgPool((2, 2))])
julia> feature_extractor = SequentialContainer([feature_extractor_part_1, feature_extractor_part_2])
julia> classifier = SequentialContainer([Fc(256, 120, activation_function="relu"), Fc(120, 84, activation_function="relu"), Fc(84, 10)])
julia> m = SequentialContainer([feature_extractor, Reshape((256, )), classifier, Softmax(dims=1)])
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 28, 28, 1, 32)
julia> output = forward(m, input)

# indexing 
julia> m[begin] # returns the feature_extractor submodule (SequentialContainer)
julia> m[end] # returns the softmax layer (Softmax)
julia> m[begin:end-1] # returns the entire model except the softmax layer (a new SequentialContainer is initialized and returned) 

# if a SequentialContainer contains BatchNorm layers (regardless of whether they are nested somewhere in a submodule or not), 
# the mode of all these layers at once can be switched as follows
julia> trainmode!(m)
julia> testmode!(m)

# if a SequentialContainer contains layers with trainable parameters/weights (which is the case in nearly all situations),
# regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers at once can be reset as follows
julia> zero_gradients(m)

GradValley.Layers.GraphContainer — Type

GraphContainer(forward_pass::Function, layer_stack::Vector{<: Any})

A computational graph container (recommended method for building models). A GraphContainer takes a function representing the forward pass of a model and a vector of layers or other containers (submodules). During the forward pass, a tracked version of the given inputs is passed through the given forward pass function and the output is returned. While this happens, the computational graph is built by a function-overload-based automatic differentiation (AD) system. During the backward pass, this computational graph is used to compute the gradients.

Note

You can use a GraphContainer in a SequentialContainer (and vice versa). You can also use a GraphContainer in a GraphContainer (nesting allowed).

Warning

Note that the GraphContainer is an experimental feature. The behavior of this module could change dramatically in the future. Using this module may cause problems.

Arguments

forward_pass::Function: the function representing the forward pass of a model
layer_stack::Vector{<: Any}: the vector containing the layers (or submodules, so other Containers), the order doesn't matter

Guidelines

GradValley has its own small, rudimentary, function-overload-based automatic differentiation system built on ChainRules.jl. It was designed to allow simple modifications of a normal sequential signal flow, which is the basis of most neural networks, for example to implement ResNet's residual connections. It is therefore an alternative to the data flow layers known from other deep learning packages. In a way, it is similar to the forward function known from every PyTorch model. Since the AD does not offer much functionality at this point in time, the following guidelines must be observed:

The forward pass function must take at least two arguments. The first is the vector containing the layers (which was passed to GraphContainer at initialization). The following arguments (the last could also be a Vararg argument) are the data inputs.
The forward pass function must be written generically enough to accept arrays of type T<:AbstractArray or numbers of type T<:Real as input (starting with the second argument).
Array inputs that are being differentiated cannot be mutated.
Initializing new arrays (for example with zeros or rand) and mixing them with the inputs passed to the forward function is not allowed.
Avoid dot syntax in most cases: differentiation rules exist only for the most basic vectorized operators (.+, .-, .*, ./, .^).
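To make the guidelines more concrete, here is a sketch of one forward pass function that follows them and one that does not. Plain functions stand in for layers so that the snippet runs without GradValley; in a real GraphContainer, the layers would be called via forward(layer, input). All names are hypothetical.

```julia
# Stand-in "layers": plain functions, used here only so the sketch runs without GradValley.
layers = [x -> 2 * x, x -> x + x]

# Follows the guidelines: the first argument is the layer vector, the data inputs follow,
# nothing is mutated, no new arrays are created, and no dot syntax is used.
function forward_pass(layers::Vector, a::AbstractArray, b::AbstractArray)
    branch_a, branch_b = layers
    return branch_a(a) + branch_b(b)
end

# Violates the guidelines (shown for contrast only, never called here):
function bad_forward_pass(layers::Vector, a::AbstractArray)
    a[1] = 0                    # mutates an input that is being differentiated
    return a + zeros(size(a))   # mixes a freshly initialized array with the input
end

forward_pass(layers, [1.0, 2.0], [3.0, 4.0]) # [8.0, 12.0]
```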

Examples

# a simple chain of fully connected layers (equivalent to the first example of SequentialContainer)
julia> layers = [Fc(1000, 500), Fc(500, 250), Fc(250, 125)]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           fc_1, fc_2, fc_3 = layers
           output = forward(fc_1, input)
           output = forward(fc_2, output)
           output = forward(fc_3, output)
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)

# a more complicated example: implementation of an inverted residual block
julia> layers = [Conv(16, 64, (1, 1), activation_function="relu"), 
                 Conv(64, 64, (3, 3), padding=(1, 1), groups=64, activation_function="relu"), # depthwise-conv layer because groups==in_channels
                 Conv(64, 16, (1, 1), activation_function="relu")]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           conv_1, depthwise_conv, conv_2 = layers
           output = forward(conv_1, input)
           output = forward(depthwise_conv, output)
           output = forward(conv_2, output)
           output = output + input # residual/skipped connection
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 50, 50, 16, 32)
julia> output = forward(m, input)

# a simple example with a polynomial, just to show that it is possible to use the GraphContainer like an automatic differentiation (AD) tool 
julia> f(layers, x) = 0.5x^3 - 2x^2 + 10
julia> df(x) = 1.5x^2 - 4x # checking the result of the AD with this manually written derivative
julia> m = GraphContainer(f, [])
julia> y = forward(m, 3)
julia> dydx = backward(m, 1) # in this case, no loss function was used, so we have no gradient information, therefore we use 1 as the so-called seed
1-element Vector{Float64}:
 1.5
julia> manual_dydx = df(3)
1.5
julia> isapprox(dydx[1], manual_dydx)
true

# if a GraphContainer contains BatchNorm layers (regardless of whether they are nested somewhere in a submodule or not), 
# the mode of all these layers at once can be switched as follows
julia> trainmode!(m)
julia> testmode!(m)

# if a GraphContainer contains layers with trainable parameters/weights (which is the case in nearly all situations),
# regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers at once can be reset as follows
julia> zero_gradients(m)

GradValley.Layers.summarize_model — Function

summarize_model(container::AbstractContainer)

Return a string intended for printing (and the total number of parameters) with an overview of the model and its number of parameters. A visualization of the computational graph is currently not included.

GradValley.Layers.module_to_eltype_device! — Function

module_to_eltype_device!(layer::AbstractLayer; element_type::DataType=Float32, device::AbstractString="cpu")

Convert the parameters of a container or layer to a different element type and move the parameters to the specified device.

Arguments

layer::AbstractLayer: the layer or container (often just the entire model) holding the parameters to be changed
element_type::DataType=Float32: the element type to which the parameters will be converted
device::AbstractString="cpu": the device to which the parameters will be moved, can be "cpu" or "gpu" (only CUDA is supported)

GradValley.Layers.clean_model_from_backward_information! — Function

clean_model_from_backward_information!(model::AbstractContainer)

Clean a container from backward pass information (e.g. computational graph). It is recommended to run this function on a model which should be saved to a file.

Forward- and Backward-Pass

GradValley.Layers.forward — Function

forward(layer, input::AbstractArray{T, N}) where {T, N}

The forward function for computing the output of a module. For every layer/container, an individual method exists. However, all these methods work exactly the same. They all take the layer/container as the first argument and the input data as the second argument. The output is returned.

Examples

# define some layers and containers
julia> layer = Conv(3, 6, (5, 5))
julia> container = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> layer_input = rand(Float32, 50, 50, 3, 32)
julia> container_input = rand(Float32, 1000, 32)
# compute the output of the modules
julia> layer_output = forward(layer, layer_input)
julia> container_output = forward(container, container_input)

GradValley.Layers.backward — Function

backward(sc::SequentialContainer, derivative_loss::Union{AbstractArray{T, N}, Real}) where {T, N}

The backward function for computing the gradients for a SequentialContainer (highly recommended for model building). The function takes the container (so mostly the whole model) as the first argument and the derivative of the loss as the second argument. No gradients are returned; they are just saved in the layers the container contains.

Warning

Calling backward multiple times can have serious consequences. Gradients are added (accumulated) by convention, so when backward is called multiple times after the corresponding forward call, the gradients of the weights AND the returned gradient w.r.t. the input are added up (accumulated)! So even the gradient returned by the backward call doesn't stay the same when backward is called multiple times after the forward call. Note that the gradients of the weights can be reset by zero_gradients, but the gradient w.r.t. the input of a container cannot be reset (except, of course, by another forward call).
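The accumulation convention can be illustrated with a toy layer. Everything in this sketch (ToyLayer, toy_forward, toy_backward, toy_zero_gradients) is hypothetical and only models the weight gradient of y = w * x; it is not GradValley code.

```julia
# Hypothetical toy layer (not a GradValley type): y = w * x with an accumulated gradient.
mutable struct ToyLayer
    weight::Float64
    weight_gradient::Float64
    last_input::Float64
end

# The forward pass stores the input because the backward pass needs it.
toy_forward(layer::ToyLayer, x) = (layer.last_input = x; layer.weight * x)

# By convention, the gradient is added to the stored value instead of overwriting it.
function toy_backward(layer::ToyLayer, derivative_loss)
    layer.weight_gradient += derivative_loss * layer.last_input
    return layer.weight_gradient
end

toy_zero_gradients(layer::ToyLayer) = (layer.weight_gradient = 0.0; nothing)

layer = ToyLayer(2.0, 0.0, 0.0)
toy_forward(layer, 3.0)
toy_backward(layer, 1.0)   # 3.0
toy_backward(layer, 1.0)   # 6.0, the second call accumulated on top of the first
toy_zero_gradients(layer)
toy_backward(layer, 1.0)   # 3.0 again after the reset
```

This is why the examples reset the gradients before each backward call.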

Examples

# define a model
julia> m = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# compute the output of the model (with random inputs)
julia> output = forward(m, rand(Float32, 1000, 32))
# use a loss function (with random data as target values) and save the derivative of the loss
julia> loss, derivative_loss = mse_loss(output, rand(Float32, 125, 32)) # note that GradValley.Optimization.mse_loss must be imported
# before the gradients are recalculated, the old gradients should always be reset first
julia> zero_gradients(m)
# backpropagation 
julia> backward(m, derivative_loss)

backward(grc::GraphContainer, derivative_loss::Union{AbstractArray{T, N}, Real}) where {T, N}

The backward function for computing the gradients for a GraphContainer (recommended for model building). The function takes the container (so mostly the whole model) as the first argument and the derivative of the loss as the second argument. The gradients w.r.t. the input arguments are returned (in a vector, in the same order as the inputs were passed to the forward function).

Warning

Calling backward multiple times can have serious consequences. Gradients are added (accumulated) by convention, so when backward is called multiple times after the corresponding forward call, the gradients of the weights AND the returned gradients w.r.t. the inputs are added up (accumulated)! So even the gradients returned by the backward call don't stay the same when backward is called multiple times after the forward call. Note that the gradients of the weights can be reset by zero_gradients, but the gradients w.r.t. the inputs of a container cannot be reset (except, of course, by another forward call).

Examples

# define a model
julia> layers = [Fc(1000, 500), Fc(500, 250), Fc(250, 125)]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           fc_1, fc_2, fc_3 = layers
           output = forward(fc_1, input)
           output = forward(fc_2, output)
           output = forward(fc_3, output)
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# compute the output of the model (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)
# use a loss function (with random data as target values) and save the derivative of the loss
julia> loss, derivative_loss = mse_loss(output, rand(Float32, 125, 32)) # note that GradValley.Optimization.mse_loss must be imported
# before the gradients are (re)calculated, the old gradients should always be reset first
julia> zero_gradients(m)
# backpropagation 
julia> input_gradients = backward(m, derivative_loss) # input_gradients is a vector of length 1 because we only passed one input to the forward function
julia> input_gradient = input_gradients[1] # input_gradient is the gradient w.r.t. the single input

Reset/zero gradients

GradValley.Layers.zero_gradients — Function

zero_gradients(layer_or_container)

Resets the gradients of a layer or a container (any kind of module with trainable parameters).

Methods exist only for layers with parameters. However, if a container without layers with trainable parameters is given, NO error is thrown. If the given container contains layers with trainable parameters/weights, regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers are reset at once.

Training mode/test mode

GradValley.Layers.trainmode! — Function

trainmode!(batch_norm_layer_or_container)

Switches the mode of the given batch normalization layer or container to training mode. See Normalization.

If the given container contains batch normalization layers (regardless of whether they are nested somewhere in a submodule or not), the mode of all these layers at once will be switched to training mode.

GradValley.Layers.testmode! — Function

testmode!(batch_norm_layer_or_container)

Switches the mode of the given batch normalization layer or container to test mode. See Normalization.

If the given container contains batch normalization layers (regardless of whether they are nested somewhere in a submodule or not), the mode of all these layers at once will be switched to test mode.

Convolution

GradValley.Layers.Conv — Type

Conv(in_channels::Int, out_channels::Int, kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=(1, 1), padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), groups::Int=1, activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)

A convolution layer. Apply a 2D convolution over an input signal with additional batch and channel dimensions.

Arguments

in_channels::Int: the number of channels in the input image
out_channels::Int: the number of channels produced by the convolution
kernel_size::NTuple{2, Int}: the size of the convolving kernel
stride::NTuple{2, Int}=(1, 1): the stride of the convolution
padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
dilation::NTuple{2, Int}=(1, 1): the spacing between kernel elements
groups::Int=1: the number of blocked connections from input channels to output channels (in-channels and out-channels must both be divisible by groups)
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the convolution
init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
use_bias::Bool=true: if true, adds a learnable bias to the output

Shapes

Input: $(W_{in}, H_{in}, C_{in}, N)$
Weight: $(W_{w}, H_{w}, \frac{C_{in}}{groups}, C_{out})$
Bias: $(C_{out}, )$
Output: $(W_{out}, H_{out}, C_{out}, N)$
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$
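The output size formulas can be evaluated per dimension with a small helper. The function below is a hypothetical sketch, and it assumes that the division rounds down, which is the usual convention for convolution layers but is not stated explicitly above.

```julia
# Hypothetical helper: output length along one spatial dimension.
# Assumption: the division rounds down (fld), as is common for convolution layers.
function conv_output_length(in_len::Int, kernel_len::Int; stride::Int=1, padding::Int=0, dilation::Int=1)
    return fld(in_len + 2 * padding - dilation * (kernel_len - 1) - 1, stride) + 1
end

# kernel_size=(5, 5) with default keyword arguments on a 50x50 input
conv_output_length(50, 5)                               # 46
# kernel_size=(3, 5), stride=(2, 1), padding=(2, 1) on a 50x50 input
H_out = conv_output_length(50, 3, stride=2, padding=2)  # 26 (first tuple entries: height)
W_out = conv_output_length(50, 5, stride=1, padding=1)  # 48 (second tuple entries: width)
```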

Useful Fields/Variables

weight::AbstractArray{<: Real, 4}: the learnable weight of the layer
bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
weight_gradient::AbstractArray{<: Real, 4}: the current gradient of the weight/kernel
bias_gradient::AbstractVector{<: Real}: the current gradient of the bias

Definition

For one group, a multichannel 2D convolution (disregarding batch dimension and activation function) can be described as:

$o_{c_{out}, y_{out}, x_{out}} = \big(\sum_{c_{in}=1}^{C_{in}}\sum_{y_w=1}^{H_{w}}\sum_{x_w=1}^{W_{w}} i_{c_{in}, y_{in}, x_{in}} \cdot w_{c_{out}, c_{in}, y_w, x_w}\big) + b_{c_{out}}$, where
- $y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
- $x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$

O is the output array, I the input array, W the weight array and B the bias array.
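The definition translates almost literally into nested loops. The following is a minimal, unoptimized sketch for one group without padding, batch dimension or activation function; naive_conv is a hypothetical name, and the array layout (width, height, channels) is taken from the Shapes section. It is meant for checking the index arithmetic, not as the layer's actual implementation.

```julia
# Naive direct convolution for one group (hypothetical reference sketch, no padding).
function naive_conv(input::AbstractArray{T,3}, weight::AbstractArray{T,4}, bias::AbstractVector{T};
                    stride::NTuple{2,Int}=(1, 1), dilation::NTuple{2,Int}=(1, 1)) where {T}
    W_in, H_in, C_in = size(input)
    W_w, H_w, _, C_out = size(weight)
    # output size with padding = 0, assuming the division rounds down
    H_out = fld(H_in - dilation[1] * (H_w - 1) - 1, stride[1]) + 1
    W_out = fld(W_in - dilation[2] * (W_w - 1) - 1, stride[2]) + 1
    output = zeros(T, W_out, H_out, C_out)
    for c_out in 1:C_out, y_out in 1:H_out, x_out in 1:W_out
        acc = bias[c_out]
        for c_in in 1:C_in, y_w in 1:H_w, x_w in 1:W_w
            # input indices exactly as in the definition
            y_in = y_out + (stride[1] - 1) * (y_out - 1) + (y_w - 1) * dilation[1]
            x_in = x_out + (stride[2] - 1) * (x_out - 1) + (x_w - 1) * dilation[2]
            acc += input[x_in, y_in, c_in] * weight[x_w, y_w, c_in, c_out]
        end
        output[x_out, y_out, c_out] = acc
    end
    return output
end

# a 3x3 kernel of ones over a 5x5 input of ones sums 9 values per output element
output = naive_conv(ones(5, 5, 1), ones(3, 3, 1, 1), zeros(1))
size(output) # (3, 3, 1)
```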

Examples

# square kernels and fully default values of keyword arguments
julia> m = Conv(3, 6, (5, 5))
# non-square kernels and unequal stride and with padding as well as specified weight initialization mode
# (init_mode="kaiming" stands for kaiming weight initialization with normally distributed values)
julia> m = Conv(3, 6, (3, 5), stride=(2, 1), padding=(2, 1), init_mode="kaiming")
# non-square kernels and unequal stride and with padding, dilation and 3 groups
# (because groups=in_channels and out_channels is divisible by groups, it is even a depthwise convolution)
julia> m = Conv(3, 6, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1), groups=3)
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

GradValley.Layers.ConvTranspose — Type

ConvTranspose(in_channels::Int, out_channels::Int, kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=(1, 1), padding::NTuple{2, Int}=(0, 0), output_padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), groups::Int=1, activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)

A transpose convolution layer (also known as fractionally-strided convolution or deconvolution). Apply a 2D transposed convolution over an input signal with additional batch and channel dimensions.

Arguments

in_channels::Int: the number of channels in the input image
out_channels::Int: the number of channels produced by the convolution
kernel_size::NTuple{2, Int}: the size of the convolving kernel
stride::NTuple{2, Int}=(1, 1): the stride of the convolution
padding::NTuple{2, Int}=(0, 0): because transposed convolution can be seen as a partial (not true) inverse of convolution, padding means in this case cutting off the desired number of pixels on each side (instead of adding pixels)
output_padding::NTuple{2, Int}=(0, 0): additional size added to one side of each dimension in the output shape (note that output_padding is only used to calculate the output shape, but does not actually add zero-padding to the output)
dilation::NTuple{2, Int}=(1, 1): the spacing between kernel elements
groups::Int=1: the number of blocked connections from input channels to output channels (in-channels and out-channels must both be divisible by groups)
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the convolution
init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
use_bias::Bool=true: if true, adds a learnable bias to the output

Shapes

Input: $( W_{in}, H_{in}, C_{in}, N)$
Weight: $(W_{w}, H_{w}, \frac{C_{out}}{groups}, C_{in})$
Bias: $(C_{out}, )$
Output: $(W_{out}, H_{out}, C_{out}, N)$, where
- $H_{out} = (H_{in} - 1) \cdot stride[1] - 2 \cdot padding[1] + dilation[1] \cdot (H_w - 1) + output\_padding[1] + 1$
- $W_{out} = (W_{in} - 1) \cdot stride[2] - 2 \cdot padding[2] + dilation[2] \cdot (W_w - 1) + output\_padding[2] + 1$
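A short sketch of the output size formula shows why output_padding exists. The helper name is hypothetical; the arithmetic is taken directly from the formulas above.

```julia
# Hypothetical helper: output length of a transposed convolution along one dimension.
conv_transpose_output_length(in_len::Int, kernel_len::Int; stride::Int=1, padding::Int=0,
                             output_padding::Int=0, dilation::Int=1) =
    (in_len - 1) * stride - 2 * padding + dilation * (kernel_len - 1) + output_padding + 1

# undoes the size change of a 5x5 convolution with default settings (50 -> 46 -> 50)
conv_transpose_output_length(46, 5)                                          # 50
# A stride-2 convolution with a 3x3 kernel and padding 1 maps inputs of length 49 and 50
# both to length 25, so the transposed layer needs output_padding to pick one of them.
conv_transpose_output_length(25, 3, stride=2, padding=1)                     # 49
conv_transpose_output_length(25, 3, stride=2, padding=1, output_padding=1)   # 50
```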

Useful Fields/Variables

weight::AbstractArray{<: Real, 4}: the learnable weight of the layer
bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
weight_gradient::AbstractArray{<: Real, 4}: the current gradient of the weight/kernel
bias_gradient::Vector{<: Real}: the current gradient of the bias

Definition

A transposed convolution can be seen as the gradient of a normal convolution with respect to its input. The forward pass of a transposed convolution is the backward pass of a normal convolution, so the forward pass of a normal convolution becomes the backward pass of a transposed convolution (with respect to its input). For more detailed information, you can look at the source code of (transposed) convolution. A nice looking visualization of (transposed) convolution can be found here.

Examples

# square kernels and fully default values of keyword arguments
julia> m = ConvTranspose(6, 3, (5, 5))
# upsampling an output from normal convolution like in GANs, U-Net, etc.
julia> input = forward(Conv(3, 6, (5, 5)), rand(Float32, 50, 50, 3, 32))
julia> output = forward(m, input)
# the size of the output of the transposed convolution is equal to the size of the original input of the normal convolution
julia> size(output)
(50, 50, 3, 32)

Pooling

GradValley.Layers.MaxPool — Type

MaxPool(kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=kernel_size, padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), activation_function::Union{Nothing, AbstractString}=nothing)

A maximum pooling layer. Apply a 2D maximum pooling over an input signal with additional batch and channel dimensions.

Arguments

kernel_size::NTuple{2, Int}: the size of the window to take the maximum over
stride::NTuple{2, Int}=kernel_size: the stride of the window
padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
dilation::NTuple{2, Int}=(1, 1): the spacing between the window elements
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling

Shapes

Input: $(W_{in}, H_{in}, C, N)$
Output: $(W_{out}, H_{out}, C, N)$, where $(H_w, W_w) = kernel\_size$ and
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$

Definition

A multichannel 2D maximum pooling (disregarding batch dimension and activation function) can be described as:

\[\begin{align*} o_{c, y_{out}, x_{out}} = \max_{\substack{y_w = 1, \dots, kernel\_size[1] \\ x_w = 1, \dots, kernel\_size[2]}} i_{c, y_{in}, x_{in}} \end{align*}\]

Where

$y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
$x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$

O is the output array and I the input array.
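For a single channel, the definition can be written out as a short reference function. This is a hypothetical sketch without padding, using the (width, height) array layout from the Shapes section and the (height, width) order for the tuples; it assumes the output size division rounds down.

```julia
# Naive single-channel maximum pooling (hypothetical reference sketch, no padding).
function naive_max_pool(input::AbstractMatrix, kernel_size::NTuple{2,Int};
                        stride::NTuple{2,Int}=kernel_size, dilation::NTuple{2,Int}=(1, 1))
    W_in, H_in = size(input)   # array layout is (width, height)
    H_w, W_w = kernel_size     # tuple order is (height, width)
    H_out = fld(H_in - dilation[1] * (H_w - 1) - 1, stride[1]) + 1
    W_out = fld(W_in - dilation[2] * (W_w - 1) - 1, stride[2]) + 1
    output = similar(input, W_out, H_out)
    for y_out in 1:H_out, x_out in 1:W_out
        # maximum over the window, with x_in and y_in computed as in the definition
        output[x_out, y_out] = maximum(
            input[x_out + (stride[2] - 1) * (x_out - 1) + (x_w - 1) * dilation[2],
                  y_out + (stride[1] - 1) * (y_out - 1) + (y_w - 1) * dilation[1]]
            for y_w in 1:H_w, x_w in 1:W_w)
    end
    return output
end

input = reshape(collect(1.0:16.0), 4, 4)
naive_max_pool(input, (2, 2)) # [6.0 14.0; 8.0 16.0]
```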

Examples

# pooling of square window of size=(3, 3) and automatically selected stride
julia> m = MaxPool((3, 3))
# pooling of non-square window with custom stride and padding
julia> m = MaxPool((3, 2), stride=(2, 1), padding=(1, 1))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

GradValley.Layers.AvgPool — Type

AvgPool(kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=kernel_size, padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), activation_function::Union{Nothing, AbstractString}=nothing)

An average pooling layer. Apply a 2D average pooling over an input signal with additional batch and channel dimensions.

Arguments

kernel_size::NTuple{2, Int}: the size of the window to take the average over
stride::NTuple{2, Int}=kernel_size: the stride of the window
padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
dilation::NTuple{2, Int}=(1, 1): the spacing between the window elements
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling

Shapes

Input: $(W_{in}, H_{in}, C, N)$
Output: $(W_{out}, H_{out}, C, N)$, where $(H_w, W_w) = kernel\_size$ and
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$

Definition

A multichannel 2D average pooling (disregarding batch dimension and activation function) can be described as:

$o_{c, y_{out}, x_{out}} = \frac{1}{kernel\_size[1] \cdot kernel\_size[2]} \sum_{y_w=1}^{kernel\_size[1]}\sum_{x_w=1}^{kernel\_size[2]} i_{c, y_{in}, x_{in}}$, where
- $y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
- $x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$

O is the output array and I the input array.
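A small numeric check of the definition, assuming a single channel, a 2x2 window, the default stride (equal to the kernel size) and no padding or dilation. The first part uses non-overlapping windows directly; the second part evaluates one output element with the index formulas. All variable names are made up for this sketch.

```julia
using Statistics: mean

input = reshape(collect(1.0:16.0), 4, 4)

# 2x2 average pooling with stride 2: each output element is the mean of one window
output = [mean(input[x:x+1, y:y+1]) for x in 1:2:3, y in 1:2:3]
# [3.5 11.5; 5.5 13.5]

# the same element computed with the index formulas of the definition (dilation = 1)
kernel_size = (2, 2)
stride = kernel_size
y_out, x_out = 1, 2
window_sum = sum(input[x_out + (stride[2] - 1) * (x_out - 1) + (x_w - 1),
                       y_out + (stride[1] - 1) * (y_out - 1) + (y_w - 1)]
                 for y_w in 1:kernel_size[1], x_w in 1:kernel_size[2])
window_sum / prod(kernel_size) # 5.5, equal to output[x_out, y_out]
```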

Examples

# pooling of square window of size=(3, 3) and automatically selected stride
julia> m = AvgPool((3, 3))
# pooling of non-square window with custom stride and padding
julia> m = AvgPool((3, 2), stride=(2, 1), padding=(1, 1))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

GradValley.Layers.AdaptiveMaxPool — Type

AdaptiveMaxPool(output_size::NTuple{2, Int}; activation_function::Union{Nothing, AbstractString}=nothing)

An adaptive maximum pooling layer. Apply a 2D adaptive maximum pooling over an input signal with additional batch and channel dimensions. For any input size, the spatial size of the output is always equal to the specified $output\_size$.

Arguments

output_size::NTuple{2, Int}: the target output size of the image (can even be larger than the input size) of the form $(H_{out}, W_{out})$
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling

Shapes

Input: $(W_{in}, H_{in}, C, N)$
Output: $(W_{out}, H_{out}, C, N)$, where $(H_{out}, W_{out}) = output\_size$

Definition

In some cases, the kernel size and stride can be calculated such that a standard maximum pooling (without padding and dilation) produces an output of the target size. However, this approach only works if the input size is an integer multiple of the output size (see this question on Stack Overflow for further information: stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work). A more generic approach is to calculate the input indices with a separate algorithm used only for adaptive pooling. With this approach, the output can even be larger than the input, which is unusual for pooling because it is the opposite of what pooling normally does, namely reducing the size. The function get_in_indices(in_len, out_len) in gv_functional.jl (line 68 - 85) implements such an algorithm (similar to the one in the Stack Overflow question), so you can check there how exactly it is defined. Thus, the mathematical definition is identical to the one of MaxPool, with the difference that the indices $y_{in}$ and $x_{in}$ have already been calculated beforehand.
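To make the idea of precomputed index windows more concrete, here is a minimal sketch along one axis. It uses the floor/ceil rule commonly described for adaptive pooling; this rule is an assumption and not a copy of get_in_indices, and `adaptive_window` is a hypothetical name.

```julia
# Hypothetical helper: the input index range pooled into output position `o`.
# The floor/ceil rule is an assumption, not a copy of get_in_indices.
adaptive_window(o, in_len, out_len) =
    (fld((o - 1) * in_len, out_len) + 1):cld(o * in_len, out_len)

# in_len = 10 is not a multiple of out_len = 4, so the windows overlap
windows = [adaptive_window(o, 10, 4) for o in 1:4]
println(windows)  # [1:3, 3:5, 6:8, 8:10]

# adaptive maximum pooling along one axis, using the windows above
signal = collect(1:10)
pooled = [maximum(signal[w]) for w in windows]
```

The same rule also yields valid (repeated) windows when the output is longer than the input, for example `adaptive_window(3, 2, 4)` gives `2:2`.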

Examples

# target output size of 5x5
julia> m = AdaptiveMaxPool((5, 5))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

GradValley.Layers.AdaptiveAvgPool — Type

AdaptiveAvgPool(output_size::NTuple{2, Int}; activation_function::Union{Nothing, AbstractString}=nothing)

An adaptive average pooling layer. Apply a 2D adaptive average pooling over an input signal with additional batch and channel dimensions. For any input size, the spatial size of the output is always equal to the specified $output\_size$.

Arguments

output_size::NTuple{2, Int}: the target output size of the image (can even be larger than the input size) of the form $(H_{out}, W_{out})$
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling

Shapes

Input: $(W_{in}, H_{in}, C, N)$
Output: $(W_{out}, H_{out}, C, N)$, where $(H_{out}, W_{out}) = output\_size$

Definition

In some cases, the kernel size and stride can be calculated such that a standard average pooling (without padding and dilation) produces an output of the target size. However, this approach only works if the input size is an integer multiple of the output size (see this question on Stack Overflow for further information: stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work). A more generic approach is to calculate the input indices with a separate algorithm used only for adaptive pooling. With this approach, the output can even be larger than the input, which is unusual for pooling because it is the opposite of what pooling normally does, namely reducing the size. The function get_in_indices(in_len, out_len) in gv_functional.jl (line 68 - 85) implements such an algorithm (similar to the one in the Stack Overflow question), so you can check there how exactly it is defined. Thus, the mathematical definition is identical to the one of AvgPool, with the difference that the indices $y_{in}$ and $x_{in}$ have already been calculated beforehand.

Examples

# target output size of 5x5
julia> m = AdaptiveAvgPool((5, 5))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

Fully connected

GradValley.Layers.Fc — Type

Fc(in_features::Int, out_features::Int; activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)

A fully connected layer (sometimes also known as dense or linear). Apply a linear transformation (matrix multiplication) to the input signal with additional batch dimension.

Arguments

in_features::Int: the size of each input sample ("number of input neurons")
out_features::Int: the size of each output sample ("number of output neurons")
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output
init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
use_bias::Bool=true: if true, adds a learnable bias to the output

Shapes

Input: $(in\_features, N)$
Weight: $(out\_features, in\_features)$
Bias: $(out\_features, )$
Output: $(out\_features, N)$

Useful Fields/Variables

weight::AbstractArray{<: Real, 2}: the learnable weights of the layer
bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
weight_gradient::AbstractArray{<: Real, 2}: the current gradients of the weights
bias_gradient::AbstractVector{<: Real}: the current gradients of the bias

Definition

The forward pass of a fully connected layer is given by the matrix multiplication between the weight matrix and the input vector (disregarding batch dimension and activation function):

$O = WI + B$

This operation can also be described by:

$o_{j} = \big(\sum_{k=1}^{in\_features} w_{j,k} \cdot i_{k}\big) + b_{j}$

O is the output vector, I the input vector, W the weight matrix and B the bias vector. Visually interpreted, it means that each input neuron i is weighted with the corresponding weight w connecting the input neuron to the output neuron o where all the incoming signals are summed up.
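The two equivalent descriptions above can be compared directly in plain Julia. This is a small sketch with made-up numbers; the variable names are illustrative and do not refer to the fields of an actual Fc layer.

```julia
in_features, out_features, batch = 4, 3, 2
weight = reshape(collect(1.0:12.0), out_features, in_features)  # (out_features, in_features)
bias = [0.5, -0.5, 1.0]                                           # (out_features,)
input = ones(in_features, batch)                                  # (in_features, N)

# matrix form: O = W * I + B, with the bias broadcast over the batch dimension
output = weight * input .+ bias

# element-wise form for output neuron j of the first sample
j = 2
o_j = sum(weight[j, k] * input[k, 1] for k in 1:in_features) + bias[j]
```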

Examples

# a fully connected layer with 784 input features and 120 output features
julia> m = Fc(784, 120)
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 784, 32)
julia> output = forward(m, input)

Identity

GradValley.Layers.Identity — Type

Identity(; activation_function::Union{Nothing, AbstractString}=nothing)

An identity layer (can be used as an activation function layer). If no activation function is used, this layer does not change the signal in any way. However, if an activation function is used, the activation function will be applied to the input element-wise.

Tip

This layer is helpful to apply an element-wise activation independent of a "normal" computational layer.

Arguments

activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the inputs

Shapes

Input: $(*)$, where $*$ means any number of dimensions
Output: $(*)$ (same shape as input)

Definition

A placeholder identity operator: apart from the optional activation function, the input signal is not changed in any way. If an activation function is used, it is applied to the input element-wise.

Examples

# an independent relu activation
julia> m = Identity(activation_function="relu")
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 10, 32)
julia> output = forward(m, input)

Normalization

GradValley.Layers.BatchNorm — Type

BatchNorm(num_features::Int; epsilon::Real=1e-05, momentum::Real=0.1, affine::Bool=true, track_running_stats::Bool=true, activation_function::Union{Nothing, AbstractString}=nothing)

A batch normalization layer. Apply a batch normalization over a 4D input signal (a mini-batch of 2D inputs with additional channel dimension).

This layer has two modes: training mode and test mode. If track_running_stats::Bool=true, this layer behaves differently in the two modes. During training, this layer always uses the currently calculated batch statistics. If track_running_stats::Bool=true, the running mean and variance are tracked during training and will be used while testing. If track_running_stats::Bool=false, even in test mode, the currently calculated batch statistics are used. The mode can be switched with trainmode! or testmode! respectively. The training mode is active by default.

Arguments

num_features::Int: the number of channels
epsilon::Real=1e-05: a value added to the denominator for numerical stability
momentum::Real=0.1: the value used for the running mean and running variance computation
affine::Bool=true: if true, this layer uses learnable affine parameters/weights ($\gamma$ and $\beta$)
track_running_stats::Bool=true: if true, this layer tracks the running mean and variance during training and will use them for testing/evaluation, if false, such statistics are not tracked and, even in test mode, the batch statistics are always recalculated for each new input
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output

Shapes

Input: $(W, H, C, N)$
$\gamma$ Weight, $\beta$ Bias: $(C, )$
Running Mean/Variance: $(C, )$
Output: $(W, H, C, N)$ (same shape as input)

Useful Fields/Variables

Weights (used if affine::Bool=true)

weight::AbstractVector{<: Real}: $\gamma$, a learnable parameter for each channel, initialized with ones
bias::AbstractVector{<: Real}: $\beta$, a learnable parameter for each channel, initialized with zeros

Gradients of weights (used if affine::Bool=true)

weight_gradient::AbstractVector{<: Real}: the gradient of $\gamma$
bias_gradient::AbstractVector{<: Real}: the gradient of $\beta$

Running statistics (used if track_running_stats::Bool=true)

running_mean::AbstractVector{<: Real}: the continuously updated batch statistics of the mean
running_variance::AbstractVector{<: Real}: the continuously updated batch statistics of the variance

Definition

A batch normalization operation can be described as: For input values over a mini-batch: $\mathcal{B} = \{x_1, x_2, ..., x_n\}$

\[\begin{align*} y_i = \frac{x_i - \overline{\mathcal{B}}}{\sqrt{Var(\mathcal{B}) + \epsilon}} \cdot \gamma + \beta \end{align*}\]

Where $y_i$ is an output value and $x_i$ an input value. $\overline{\mathcal{B}}$ is the mean of the input values in $\mathcal{B}$ and $Var(\mathcal{B})$ is the variance of the input values in $\mathcal{B}$. Note that this definition is fairly general and not specific to 4D inputs. In this case, the input values of $\mathcal{B}$ are taken for each channel individually. So the mean and variance are calculated per channel over the mini-batch.

The update rule for the running statistics (running mean/variance) is:

\[\begin{align*} \hat{x}_{new} = (1 - momentum) \cdot \hat{x} + momentum \cdot x \end{align*}\]

Where $\hat{x}$ is the estimated statistic and $x$ is the new observed value. So $\hat{x}_{new}$ is the new, updated estimated statistic.
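The normalization and the running-statistics update can be sketched in plain Julia as follows. This is not GradValley's implementation: `naive_batchnorm` and `update_running` are hypothetical names, and using the biased variance (`corrected=false`) for normalization is an assumption.

```julia
using Statistics

# Illustrative per-channel batch normalization for a (W, H, C, N) array.
# Assumption: the biased variance (corrected=false) is used for normalization.
function naive_batchnorm(x::AbstractArray{T, 4}, gamma, beta; epsilon=1e-5) where T
    mu = mean(x, dims=(1, 2, 4))                      # shape (1, 1, C, 1)
    var_b = var(x, dims=(1, 2, 4), corrected=false)  # shape (1, 1, C, 1)
    scale = reshape(gamma, 1, 1, :, 1)
    shift = reshape(beta, 1, 1, :, 1)
    y = (x .- mu) ./ sqrt.(var_b .+ T(epsilon)) .* scale .+ shift
    return y, vec(mu), vec(var_b)
end

# running-statistics update rule from the text
update_running(running, observed; momentum=0.1) =
    (1 - momentum) .* running .+ momentum .* observed

x = rand(Float32, 8, 8, 3, 16)
y, batch_mean, batch_var = naive_batchnorm(x, ones(Float32, 3), zeros(Float32, 3))
running_mean = update_running(zeros(Float32, 3), batch_mean)
```

With $\gamma = 1$ and $\beta = 0$, each channel of `y` has a mean close to 0 and a variance close to 1.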

Examples

# a batch normalization layer (3 channels) with learnable parameters and continuously updated batch statistics for evaluation
julia> m = BatchNorm(3)
# the mode can be switched with trainmode! or testmode!
julia> trainmode!(m)
julia> testmode!(m)
# compute the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)

Reshape / Flatten

GradValley.Layers.Reshape — Type

Reshape(out_shape; activation_function::Union{Nothing, AbstractString}=nothing)

A reshape layer (probably mostly used as a flatten layer). Reshape the input signal (affects all dimensions except the batch dimension).

Arguments

out_shape: the target output size (the output has the same data as the input and must have the same number of elements)
activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output

Shapes

Input: $(*, N)$, where * means any number of dimensions
Output: $(out\_shape..., N)$

Definition

This layer uses the standard reshape function built into Julia.
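For illustration, the equivalent call to Julia's `reshape` might look like this. How exactly the layer keeps the batch dimension is an assumption here; the sketch simply appends the size of the last input dimension to `out_shape`.

```julia
input = rand(Float32, 28, 28, 1, 32)
out_shape = (784,)

# keep the last (batch) dimension and reshape everything in front of it
output = reshape(input, out_shape..., size(input)[end])
println(size(output))  # (784, 32)
```

Because `reshape` does not reorder the data, each column of `output` contains the elements of one sample in column-major order.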

Examples

# flatten the input of size 28*28*1 to a vector of length 784 (each plus batch dimension of course)
julia> m = Reshape((784, ))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 28, 28, 1, 32)
julia> output = forward(m, input)
julia> size(output) # specified size plus batch dimension
(784, 32)

Activation functions

Almost every layer constructor has the keyword argument activation_function specifying the element-wise activation function. activation_function can be nothing or a string: nothing means no activation function, a string gives the name of the activation. Because softmax isn't a simple element-wise activation function like most activations, Softmax has its own layer. The following element-wise activation functions are currently implemented:

"relu": applies the element-wise relu activation (rectified linear unit): $f(x) = \max(0, x)$
"sigmoid": applies the element-wise sigmoid activation (logistic curve): $f(x) = \frac{1}{1 + e^{-x}}$
"tanh": applies the element-wise tanh activation (tangens hyperbolicus): $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
"leaky_relu": applies an element-wise leaky relu activation with a negative slope of 0.01: $f(x) = \begin{cases}x &\text{if $x \geq 0$}\\\textnormal{0.01} \times x &\text{if $x < 0$}\end{cases}$
"leaky_relu:$(negative_slope)": applies an element-wise leaky relu activation with a negative slope of negative_slope (e.g. "leaky_relu:0.2" means a leaky relu activation with a negative slope of 0.2): $f(x) = \begin{cases}x &\text{if $x \geq 0$}\\\textnormal{negative\_slope} \times x &\text{if $x < 0$}\end{cases}$

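The formulas of the element-wise activations listed above translate almost literally into plain Julia. The functions below are illustrative stand-ins with hypothetical names, not the functions GradValley uses internally.

```julia
relu(x) = max(zero(x), x)
sigmoid(x) = 1 / (1 + exp(-x))
leaky_relu(x; negative_slope=0.01) = x >= 0 ? x : oftype(x, negative_slope) * x

xs = Float32[-2, 0, 3]
relu_out = relu.(xs)
sigmoid_out = sigmoid.(xs)
leaky_out = leaky_relu.(xs; negative_slope=0.2)  # corresponds to "leaky_relu:0.2"
```
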
Special activation functions

GradValley.Layers.Softmax — Type

Softmax(; dims=1)

A softmax activation function layer (probably mostly used at the "end" of a classifier model). Apply the softmax function to an n-dimensional input array. The softmax will be computed along the given dimensions (dims), so every slice along these dimensions will sum to 1.

Note

Note that this is the only activation function in the form of a layer. All other activation functions can be used with the activation_function::AbstractString keyword argument that nearly every layer provides. All the activation functions which can be used that way are simple element-wise activation functions. Softmax is currently the only non-element-wise activation function. Besides, it is important to be able to select the specific dimension along which the softmax is computed, which would not work well with a simple keyword argument that only takes the name of the function as a string.

Arguments

dims=1: the dimensions along which the softmax will be computed (so every slice along these dimensions will sum to 1)

Shapes

Input: $(*)$, where $*$ means any number of dimensions
Output: $(*)$ (same shape as input)

Definition

The softmax function converts a vector of real numbers into a probability distribution. The softmax function is defined as:

\[\begin{align*} softmax(x_i) = \frac{e^{x_i}}{\sum_{j}e^{x_j}} = \frac{exp(x_i)}{\sum_{j}exp(x_j)} \end{align*}\]

Where $x_i$ is one value of the input array (slice). Note that the $x_j$ values are taken from each slice individually along the specified dimension. So each slice along the specified dimension will sum to 1. All values in the output are between 0 and 1.
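A minimal sketch of this definition in plain Julia is shown below. `naive_softmax` is a hypothetical name; subtracting the maximum of each slice is a common numerical-stability trick that does not change the result and is an addition not stated in the definition above.

```julia
# Illustrative softmax along `dims`. Subtracting the per-slice maximum
# only improves numerical stability; the result is mathematically unchanged.
function naive_softmax(x::AbstractArray; dims=1)
    shifted = exp.(x .- maximum(x, dims=dims))
    return shifted ./ sum(shifted, dims=dims)
end

x = rand(Float32, 10, 32)
y = naive_softmax(x; dims=1)
column_sums = sum(y, dims=1)  # every entry is (approximately) 1
```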

Examples

# the softmax will be computed along the first dimension
julia> m = Softmax(dims=1)
# computing the output of the layer 
# (with random input data which could represent a batch of unnormalized output values from a classifier)
julia> input = rand(Float32, 10, 32)
julia> output = forward(m, input)
# summing up the values in the output along the first dimension result in a batch of 32 ones
julia> sum(output, dims=1)
1x32 Matrix{Float64}:
1.0 1.0 ... 1.0

GradValley.Optimization

Optimizers

GradValley.Optimization.Adam — Type

Adam(layer_stack::Union{Vector, AbstractContainer}; learning_rate::Real=0.001, beta1::Real=0.9, beta2::Real=0.999, epsilon::Real=1e-08, weight_decay::Real=0, amsgrad::Bool=false, maximize::Bool=false)

Implementation of Adam optimization algorithm (including the optional AMSgrad version of this algorithm and optional weight decay).

Arguments

layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
learning_rate::Real=0.001: the learning rate (shouldn't be 0)
beta1::Real=0.9, beta2::Real=0.999: the two coefficients used for computing running averages of gradient and its square
epsilon::Real=1e-08: value for numerical stability
weight_decay::Real=0: the weight decay (L2 penalty)
amsgrad::Bool=false: use the AMSgrad version of this algorithm
maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it

Definition

For example, a definition of this algorithm in pseudocode can be found here.
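As a rough orientation, one step of the standard Adam update (without AMSgrad, weight decay or maximize) could look as follows in plain Julia. This is a sketch of the textbook formulation, not GradValley's code; `adam_step!` and its arguments are hypothetical names.

```julia
# One standard Adam update for a single parameter array (illustrative only).
# m and v are the running averages of the gradient and its square, t is the step count.
function adam_step!(param, grad, m, v, t; learning_rate=0.001, beta1=0.9,
                    beta2=0.999, epsilon=1e-8)
    @. m = beta1 * m + (1 - beta1) * grad
    @. v = beta2 * v + (1 - beta2) * grad^2
    m_hat = m ./ (1 - beta1^t)  # bias-corrected first moment
    v_hat = v ./ (1 - beta2^t)  # bias-corrected second moment
    @. param -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)
    return param
end

param = [1.0, -2.0]
grad = [0.5, -0.5]
m, v = zero(param), zero(param)
adam_step!(param, grad, m, v, 1)
```

In the first step, the bias correction makes the update size roughly equal to the learning rate, independent of the gradient magnitude.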

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an Adam optimizer with default arguments
julia> optimizer = Adam(model)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)

GradValley.Optimization.SGD — Type

SGD(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)

Implementation of stochastic gradient descent optimization algorithm (including optional weight decay and dampening).

Arguments

layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without parameters)
learning_rate::Real: the learning rate (shouldn't be 0)
weight_decay::Real=0.00: the weight decay (L2 penalty)
dampening::Real=0.00: the dampening (normally only relevant for optimizers with momentum; without momentum, it effectively scales the learning rate to $(1 - dampening) \cdot learning\_rate$)
maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it

Definition

For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of a simple SGD with no momentum, the momentum $\mu$ is zero in the sense of the mentioned documentation.)

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an SGD optimizer with a learning rate of 0.1 and a weight decay of 0.5 (otherwise default values)
julia> optimizer = SGD(model, 0.1, weight_decay=0.5)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)

GradValley.Optimization.MSGD — Type

MSGD(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; momentum::Real=0.90, weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)

Implementation of stochastic gradient descent with momentum optimization algorithm (including optional weight decay and dampening).

Arguments

layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
learning_rate::Real: the learning rate (shouldn't be 0)
momentum::Real=0.90: the momentum factor (shouldn't be 0)
weight_decay::Real=0.00: the weight decay (L2 penalty)
dampening::Real=0.00: the dampening for the momentum
maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it

Definition

For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of SGD with default momentum, in the sense of the mentioned documentation, the momentum $\mu$ isn't zero ($\mu \neq 0$) and $nesterov$ is $false$.)

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an MSGD optimizer with a learning rate of 0.1 and a momentum of 0.75 (otherwise default values)
julia> optimizer = MSGD(model, 0.1, momentum=0.75)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)

GradValley.Optimization.Nesterov — Type

Nesterov(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; momentum::Real=0.90, weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)

Implementation of stochastic gradient descent with nesterov momentum optimization algorithm (including optional weight decay and dampening).

Arguments

layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
learning_rate::Real: the learning rate (shouldn't be 0)
momentum::Real=0.90: the momentum factor (shouldn't be 0)
weight_decay::Real=0.00: the weight decay (L2 penalty)
dampening::Real=0.00: the dampening for the momentum (for true nesterov momentum, dampening must be 0)
maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it

Definition

For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of SGD with nesterov momentum, $nesterov$ is $true$ in the sense of the mentioned documentation.)
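The roles of momentum, dampening and the nesterov flag can be sketched with a single update function. This follows one common formulation of SGD with momentum (the one used, for example, in PyTorch's SGD documentation); whether GradValley matches it term by term is an assumption, and `sgd_momentum_step!` is a hypothetical name.

```julia
# One SGD step with (Nesterov) momentum for a single parameter array (illustrative only).
# `buffer` holds the momentum buffer; `first_step` marks the first call.
function sgd_momentum_step!(param, grad, buffer, first_step::Bool; learning_rate,
                            momentum=0.9, dampening=0.0, weight_decay=0.0,
                            nesterov=false)
    g = grad .+ weight_decay .* param
    if first_step
        buffer .= g
    else
        @. buffer = momentum * buffer + (1 - dampening) * g
    end
    direction = nesterov ? g .+ momentum .* buffer : copy(buffer)
    @. param -= learning_rate * direction
    return param
end

# classic momentum (as in MSGD): two steps with a constant gradient of 1
p_msgd, buf = [1.0], [0.0]
sgd_momentum_step!(p_msgd, [1.0], buf, true; learning_rate=0.1)
sgd_momentum_step!(p_msgd, [1.0], buf, false; learning_rate=0.1)

# Nesterov momentum: one step
p_nesterov, buf_n = [1.0], [0.0]
sgd_momentum_step!(p_nesterov, [1.0], buf_n, true; learning_rate=0.1, nesterov=true)
```

With `nesterov=false` the parameter moves along the momentum buffer itself; with `nesterov=true` the current gradient is combined with a look-ahead along the buffer.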

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize a Nesterov optimizer with a learning rate of 0.1 and a Nesterov momentum of 0.8 (otherwise default values)
julia> optimizer = Nesterov(model, 0.1, momentum=0.8)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)

Optimization step function

GradValley.Optimization.step! — Function

step!(optimizer::Union{SGD, MSGD, Nesterov})

Perform a single optimization step (parameter update) for the given optimizer.

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an optimizer (which optimizer specifically doesn't matter)
julia> optimizer = SGD(model, 0.1)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)

Loss functions

GradValley.Optimization.mae_loss — Function

mae_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N

Calculate the (mean) absolute error (L1 norm, with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.

Arguments

prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions
target::AbstractArray{<: Real, N}: the corresponding target values of shape (*), must have the same shape as the prediction input
reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned

Definition

$L$ is the loss value which will be returned. If return_derivative is true, then an array with the same shape as prediction/target is returned as well; it contains the partial derivatives of $L$ w.r.t. each prediction value: $\frac{\partial L}{\partial p_i}$, where $p_i$ is one prediction value. If reduction_method is nothing, the element-wise computed losses are returned. Note that for reduction_method=nothing, the derivative is the same as for reduction_method="sum". The element-wise calculation can be defined as (where $t_i$ is one target value and $l_i$ is one loss value):

\[\begin{align*} l_i = |p_i - t_i| \end{align*}\]

Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):

\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{p_i - t_i}{l_i \cdot n} &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; \frac{p_i - t_i}{l_i} &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
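For the "mean" case, the formula can be sketched in plain Julia as shown below. `naive_mae` is a hypothetical name and not GradValley's implementation. Note that $\frac{p_i - t_i}{l_i}$ is undefined where $p_i = t_i$; the sketch uses `sign`, which gives the same values elsewhere and 0 at those points (how mae_loss itself treats this case is not stated here).

```julia
# Illustrative MAE loss with reduction_method="mean".
# sign(p - t) equals (p - t) / |p - t| wherever p != t and is 0 where p == t.
function naive_mae(prediction, target)
    n = length(prediction)
    loss = sum(abs.(prediction .- target)) / n
    derivative = sign.(prediction .- target) ./ n
    return loss, derivative
end

loss, derivative = naive_mae([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 5.0, 3.0])
```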

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> input = rand(1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = mae_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)

GradValley.Optimization.mse_loss — Function

mse_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N

Calculate the (mean) squared error (squared L2 norm, with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.

Arguments

prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions
target::AbstractArray{<: Real, N}: the corresponding target values of shape (*), must have the same shape as the prediction input
reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned

Definition

\[\begin{align*} l_i = (p_i - t_i)^2 \end{align*}\]

Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):

\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{2}{n}(p_i - t_i) &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; 2(p_i - t_i) &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
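Both reduction cases of the formula fit into a few lines of plain Julia. This is only a sketch with the hypothetical name `naive_mse`, not the implementation of mse_loss.

```julia
# Illustrative MSE loss and derivative for reduction_method "mean" or "sum".
function naive_mse(prediction, target; reduction_method="mean")
    diff = prediction .- target
    scale = reduction_method == "mean" ? length(diff) : 1
    return sum(abs2, diff) / scale, 2 .* diff ./ scale
end

loss_mean, derivative_mean = naive_mse([1.0, 2.0], [0.0, 4.0])
loss_sum, derivative_sum = naive_mse([1.0, 2.0], [0.0, 4.0]; reduction_method="sum")
```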

Examples

# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> input = rand(1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients 
julia> backward(model, loss_derivative)

GradValley.Optimization.bce_loss — Function

bce_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; epsilon::Real=eps(eltype(prediction)), reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N

Calculate the binary cross entropy loss (with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.

Arguments

prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions (the prediction is typically given by the output of a sigmoid activation function)
target::AbstractArray{<: Real, N}: the corresponding target values (should be between 0 and 1) of shape (*), must have the same shape as the prediction input
epsilon::Real=eps(eltype(prediction)): term to avoid infinity
reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned

Definition

\[\begin{align*} l_i = -t_i \cdot \log(p_i + \epsilon) - (1 - t_i) \cdot \log(1 - p_i + \epsilon) \end{align*}\]

Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):

\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{1}{n}(\frac{-t_i}{p_i + \epsilon} - \frac{t_i - 1}{1 - p_i + \epsilon}) &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; \frac{-t_i}{p_i + \epsilon} - \frac{t_i - 1}{1 - p_i + \epsilon} &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
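The derivative in the "mean" case can be checked numerically against the loss formula. The following sketch uses the hypothetical name `naive_bce` and is not the implementation of bce_loss; the finite-difference comparison is only there to illustrate that the two expressions agree.

```julia
# Illustrative BCE loss and derivative with reduction_method="mean".
function naive_bce(prediction, target; epsilon=eps(eltype(prediction)))
    n = length(prediction)
    losses = @. -target * log(prediction + epsilon) -
                (1 - target) * log(1 - prediction + epsilon)
    derivative = @. (-target / (prediction + epsilon) -
                     (target - 1) / (1 - prediction + epsilon)) / n
    return sum(losses) / n, derivative
end

p = [0.9, 0.2, 0.6]
t = [1.0, 0.0, 1.0]
loss, derivative = naive_bce(p, t)

# central finite-difference estimate of the partial derivative w.r.t. p[1]
h = 1e-6
numeric = (naive_bce(p .+ [h, 0, 0], t)[1] - naive_bce(p .- [h, 0, 0], t)[1]) / (2h)
```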

Examples

# define a model (with a sigmoid activation at the end, so the predictions lie between 0 and 1)
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125, activation_function="sigmoid")])
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values 
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = bce_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)

GradValley.Functional

GradValley.Functional contains many primitives common to various neural networks. Only a few of these functions are documented because they are mostly used internally (not by the user).

GradValley.Functional.zero_pad_nd — Function

zero_pad_nd(input::AbstractArray{T, N}, padding::NTuple{N, Int}) where {T, N}

Perform a padding-operation (nd => number of dimensions doesn't matter) as is usual for neural networks: equal padding at each "end" of each axis/dimension.

Arguments

input::AbstractArray{T, N}: of shape (d1, d2, ..., dn)
padding::NTuple{N, Int}: must always be a tuple whose length equals the number of dimensions of input: (pad-d1, pad-d2, ..., pad-dn)

Shape of returned output: (d1 + padding[1] * 2, d2 + padding[2] * 2, ..., dn + padding[n] * 2)
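The described behavior can be reproduced with a few lines of plain Julia. This is a minimal sketch under the shape rule stated above, not the library's implementation; `naive_zero_pad` is a hypothetical name.

```julia
# Illustrative n-dimensional zero padding: padding[d] zeros at both ends of axis d.
function naive_zero_pad(input::AbstractArray{T, N}, padding::NTuple{N, Int}) where {T, N}
    output = zeros(T, (size(input) .+ 2 .* padding)...)
    # index ranges of the original data inside the padded array
    inner = ntuple(d -> (padding[d] + 1):(padding[d] + size(input, d)), N)
    output[inner...] = input
    return output
end

padded = naive_zero_pad(ones(Float32, 2, 3), (1, 2))
println(size(padded))  # (4, 7)
```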

zero_pad_nd(input::CuArray{T, N}, padding::NTuple{N, Int}) where {T, N}

Perform a padding-operation (nd => number of dimensions doesn't matter) as is usual for neural networks: equal padding at each "end" of each axis/dimension.

Arguments

input::CuArray{T, N}: of shape (d1, d2, ..., dn)
padding::NTuple{N, Int}: must always be a tuple whose length equals the number of dimensions of input: (pad-d1, pad-d2, ..., pad-dn)

Shape of returned output: (d1 + padding[1] * 2, d2 + padding[2] * 2, ..., dn + padding[n] * 2)

GradValley.Functional.zero_pad_2d — Function

zero_pad_2d(input::AbstractArray{T, 4}, padding::NTuple{2, Int}) where {T}

Perform a padding-operation (2d => 4 dimensions, where the first 2 dimensions will be padded) as is usual for neural networks: equal padding at each "end" of each spatial axis/dimension.

Arguments

input::AbstractArray{T, 4}: of shape (d1, d2, d3, d4), where d1 is expected to be the width dimension and d2 the height dimension
padding::NTuple{2, Int}: must always be a tuple of length 2: (pad-d2, pad-d1) == (pad-height, pad-width)

Shape of returned output: (d1 + padding[2] * 2, d2 + padding[1] * 2, d3, d4)

zero_pad_2d(input::CuArray{T, 4}, padding::NTuple{2, Int}) where {T}

Perform a padding-operation (2d => 4 dimensions, where the first 2 dimensions will be padded) as is usual for neural networks: equal padding at each "end" of each spatial axis/dimension.

Arguments

input::CuArray{T, 4}: of shape (d1, d2, d3, d4), where d1 is expected to be the width dimension and d2 the height dimension
padding::NTuple{2, Int}: must always be a tuple of length 2: (pad-d2, pad-d1) == (pad-height, pad-width)

Shape of returned output: (d1 + padding[2] * 2, d2 + padding[1] * 2, d3, d4)