Reference
Note that for all keyword arguments of type NTuple{2, Int}, the order of dimensions is (y/height-dimension, x/width-dimension).
GradValley
DataLoader
GradValley.DataLoader
— Type
DataLoader(get_function::Function, dataset_size::Integer; batch_size::Integer=1, shuffle::Bool=false, drop_last::Bool=false)
The DataLoader was designed to easily iterate over batches. Each time a new batch is requested, the data loader loads this batch "just in time" (instead of loading all the batches to memory at once).
The get_function is expected to load one item from a dataset at a given index. It must accept exactly one positional argument, the index of the item to return, and it must return a tuple of arbitrary length in which each element is an array. The length of the tuple and the size and element type of each array are expected to be the same at every index. When a batch is requested, the data loader returns a tuple of arrays, each extended by a trailing batch dimension.
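The batching behaviour described above can be pictured with a small sketch in plain Julia. It does not use GradValley: get_item and collate are hypothetical names, and concatenating along a new trailing dimension is an assumption based on the description of the batch dimension, not the library's actual implementation.

```julia
# Hypothetical stand-in for a get_function: returns a tuple of arrays for one index.
get_item(index) = (fill(Float32(index), 2, 2, 1), Float32[index])

# Stack the k-th array of every item along a new trailing (batch) dimension.
function collate(items)
    first_item = first(items)
    return ntuple(length(first_item)) do k
        batch_dim = ndims(first_item[k]) + 1
        cat((item[k] for item in items)...; dims=batch_dim)
    end
end

items = [get_item(i) for i in 1:4]
image_batch, target_batch = collate(items)
println(size(image_batch))   # (2, 2, 1, 4)
println(size(target_batch))  # (1, 4)
```

Every array keeps its own shape and gains one extra dimension whose size is the number of items in the batch.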
The DataLoader is iterable and indexable. size(dataloader) returns the given size of the dataset, and length(dataloader) returns the total number of batches (the two are equal if batch_size=1). When a range is given as the index argument, a vector containing the requested batches is returned.
If you really want to load the whole dataset into memory (useful, for example, when training over multiple epochs, because the dataset then does not have to be reloaded every epoch), you can do so: all_batches = dataloader[begin:end], where typeof(dataloader) == DataLoader.
Arguments
- get_function::Function: the function which takes the index of an item from a dataset and returns that item (an arbitrary sized tuple containing arrays)
- dataset_size::Integer: the maximum index the get_function accepts (the number of items in the dataset, the dataset size)
- batch_size::Integer=1: the batch size (the last dimension, the extended batch dimension, of each array in the returned tuple has this size)
- shuffle::Bool=false: reshuffle the data (doesn't reshuffle automatically after each epoch, use reshuffle! instead)
- drop_last::Bool=false: set to true to drop the last incomplete batch if the dataset size is not divisible by the batch size; if false and the dataset size is not divisible by the batch size, the last batch will be smaller
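As a rough illustration of how batch_size and drop_last interact, the following plain-Julia sketch computes the number of batches and the item indices of one batch. The helper names num_batches and batch_indices are hypothetical and not part of GradValley; the sketch only follows the argument descriptions above.

```julia
# Hypothetical helper: number of batches for a given dataset size and batch size.
function num_batches(dataset_size::Integer, batch_size::Integer; drop_last::Bool=false)
    # drop_last=true discards the incomplete final batch, otherwise it is kept
    return drop_last ? fld(dataset_size, batch_size) : cld(dataset_size, batch_size)
end

# Hypothetical helper: the (1-based) item indices that belong to one batch.
function batch_indices(batch::Integer, dataset_size::Integer, batch_size::Integer)
    start = (batch - 1) * batch_size + 1
    return start:min(start + batch_size - 1, dataset_size)
end

println(num_batches(10_000, 32))                  # 313
println(num_batches(10_000, 32; drop_last=true))  # 312
println(length(batch_indices(313, 10_000, 32)))   # 16
```

With 10,000 items and a batch size of 32, the last batch holds only 16 items unless drop_last=true removes it.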
Examples
# EXAMPLE FROM https://jonas208.github.io/GradValley.jl/tutorials_and_examples/#Tutorials-and-Examples
julia> using MLDatasets # a package for downloading datasets
# initialize train- and test-dataset
julia> mnist_train = MNIST(:train)
julia> mnist_test = MNIST(:test)
# define the get_element function:
# function for getting an image and the corresponding target vector from the train or test partition
julia> function get_element(index, partition)
           # load one image and the corresponding label
           if partition == "train"
               image, label = mnist_train[index]
           else # test partition
               image, label = mnist_test[index]
           end
           # add channel dimension and rescaling the values to their original 8 bit gray scale values
           image = reshape(image, 28, 28, 1) .* 255
           # generate the target vector from the label, one for the correct digit, zeros for the wrong digits
           target = zeros(10)
           target[label + 1] = 1.00
           return image, target
       end
# initialize the data loaders (with anonymous function which helps to easily distinguish between test- and train-partition)
julia> train_data_loader = DataLoader(index -> get_element(index, "train"), length(mnist_train), batch_size=32, shuffle=true)
julia> test_data_loader = DataLoader(index -> get_element(index, "test"), length(mnist_test), batch_size=32)
# in most cases NOT recommended: you can force the data loaders to load all the data at once into memory, depending on the dataset's size, this may take a while
julia> # train_data = train_data_loader[begin:end] # turned off to save time
julia> # test_data = test_data_loader[begin:end] # turned off to save time
# now you can write your train- or test-loop like so
julia> for (batch, (image_batch, target_batch)) in enumerate(test_data_loader) #=do anything useful here=# end
julia> for (batch, (image_batch, target_batch)) in enumerate(train_data_loader) #=do anything useful here=# end
GradValley.reshuffle!
— Function
reshuffle!(data_loader::DataLoader)
Manually shuffle the data loader (even if shuffle is disabled in the given data loader). It is recommended to reshuffle after each epoch during training.
GradValley.Layers
Containers
GradValley.Layers.SequentialContainer
— Type
SequentialContainer(layer_stack::Vector{<: Any})
A sequential container (recommended method for building models). A SequentialContainer can take a vector of layers or other containers (submodules). During the forward pass, the given inputs are propagated sequentially through every layer (or submodule) and the output is returned. The execution order during the forward pass is the same as the order of the layers or submodules in the vector.
You can use a SequentialContainer in a GraphContainer (and vice versa). You can also use a SequentialContainer in a SequentialContainer (nesting allowed).
Arguments
- layer_stack::Vector{<: Any}: the vector containing the layers (or submodules, i.e. other containers); the order of the modules in the vector corresponds to the execution order
Indexing and Iteration
The sequential container is indexable and iterable. Indexing one element or iterating behaves like indexing one element of, or iterating over, the sequential_container.layer_stack passed to the container at initialization. However, if the index is a range (UnitRange{<: Integer}), a new SequentialContainer containing all the requested submodules/layers is initialized and returned. length(sequential_container) and size(sequential_container) both return the number of modules in the layers vector (equivalent to length(sequential_container.layer_stack)).
Examples
# a simple chain of fully connected layers
julia> m = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)
# a more complicated example with nested submodules
julia> feature_extractor_part_1 = SequentialContainer([Conv(1, 6, (5, 5), activation_function="relu"), AvgPool((2, 2))])
julia> feature_extractor_part_2 = SequentialContainer([Conv(6, 16, (5, 5), activation_function="relu"), AvgPool((2, 2))])
julia> feature_extractor = SequentialContainer([feature_extractor_part_1, feature_extractor_part_2])
julia> classifier = SequentialContainer([Fc(256, 120, activation_function="relu"), Fc(120, 84, activation_function="relu"), Fc(84, 10)])
julia> m = SequentialContainer([feature_extractor, Reshape((256, )), classifier, Softmax(dims=1)])
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 28, 28, 1, 32)
julia> output = forward(m, input)
# indexing
julia> m[begin] # returns the feature_extractor submodule (SequentialContainer)
julia> m[end] # returns the softmax layer (Softmax)
julia> m[begin:end-1] # returns the entire model except the softmax layer (a new SequentialContainer is initialized and returned)
# if a SequentialContainer contains BatchNorm layers (regardless of whether they are nested somewhere in a submodule or not),
# the mode of all these layers at once can be switched as follows
julia> trainmode!(m)
julia> testmode!(m)
# if a SequentialContainer contains layers with trainable parameters/weights (which is hopefully the case in nearly all situations),
# regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers can be reset at once as follows
julia> zero_gradients(m)
GradValley.Layers.GraphContainer
— Type
GraphContainer(forward_pass::Function, layer_stack::Vector{<: Any})
A computational graph container (recommended method for building models). A GraphContainer can take a function representing the forward pass of a model and a vector of layers or other containers (submodules). During the forward pass, a tracked version of the given inputs is passed through the given forward pass function and the output is returned. Meanwhile, the computational graph is built by a function-overload-based automatic differentiation (AD) system. During the backward pass, this computational graph is used to compute the gradients.
You can use a GraphContainer in a SequentialContainer (and vice versa). You can also use a GraphContainer in a GraphContainer (nesting allowed).
Note that the GraphContainer is an experimental feature. The behavior of this module could change dramatically in the future. Using this module may cause problems.
Arguments
- forward_pass::Function: the function representing the forward pass of a model
- layer_stack::Vector{<: Any}: the vector containing the layers (or submodules, i.e. other containers); the order doesn't matter
Guidelines
GradValley has its own little, rudimentary function overload based automatic differentiation system based on ChainRules.jl. It was designed to allow simple modifications of a normal sequential signal flow, which is the basis of most neural networks. For example, to be able to implement ResNet's residual connections. So it represents an alternative to data flow layers known from other Deep Learning packages. In a way, it is similar to the forward function known from every PyTorch model. Since the AD does not offer that much functionality at this point in time, the following guidelines must be observed:
- The forward pass function must take at least two arguments. The first is the vector containing the layers (which was passed to GraphContainer at initialization). The following arguments (the last could also be a Vararg argument) are the data inputs.
- The forward pass function must be written generically enough to accept arrays of type T<:AbstractArray or numbers of type T<:Real as input (starting with the second argument).
- Array inputs that are being differentiated cannot be mutated.
- The initialization of new arrays (for example with zeros or rand) and their use in combination with the inputs passed to the forward function is not allowed.
- Avoid dot syntax in most cases; differentiation rules exist only for the most basic vectorized operators (.+, .-, .*, ./, .^).
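To make the idea of function-overload-based AD more concrete, here is a deliberately tiny reverse-mode sketch for scalars in plain Julia. It is not GradValley's implementation and does not use ChainRules.jl; the names Tracked, TAPE, track and toy_backward! are hypothetical. It only shows the principle: overloaded operators record how to propagate gradients, and the backward pass replays these records in reverse.

```julia
# A tracked scalar stores its value and the gradient accumulated so far.
mutable struct Tracked
    value::Float64
    grad::Float64
end
track(x::Real) = Tracked(Float64(x), 0.0)

# Each overloaded operation pushes one backward step onto this global tape.
const TAPE = Function[]

function Base.:+(a::Tracked, b::Tracked)
    out = track(a.value + b.value)
    push!(TAPE, () -> (a.grad += out.grad; b.grad += out.grad))
    return out
end
function Base.:-(a::Tracked, b::Tracked)
    out = track(a.value - b.value)
    push!(TAPE, () -> (a.grad += out.grad; b.grad -= out.grad))
    return out
end
function Base.:*(a::Tracked, b::Tracked)
    out = track(a.value * b.value)
    push!(TAPE, () -> (a.grad += b.value * out.grad; b.grad += a.value * out.grad))
    return out
end
function Base.:^(a::Tracked, n::Integer)
    out = track(a.value^n)
    push!(TAPE, () -> (a.grad += n * a.value^(n - 1) * out.grad))
    return out
end
# Plain numbers are treated as tracked constants.
Base.:*(c::Real, a::Tracked) = track(c) * a
Base.:+(a::Tracked, c::Real) = a + track(c)

# Seed the output gradient and replay the tape in reverse order.
function toy_backward!(y::Tracked, seed::Real=1)
    y.grad = seed
    foreach(step -> step(), reverse(TAPE))
    empty!(TAPE)
    return nothing
end

f(x) = 0.5x^3 - 2x^2 + 10
x = track(3)
y = f(x)
toy_backward!(y)
println(y.value)  # 5.5
println(x.grad)   # 1.5, equal to 1.5 * 3^2 - 4 * 3
```

The sketch also hints at why the guidelines exist: an operation without an overloaded method, or an array created outside the tracked inputs, leaves no record on the tape and therefore cannot be differentiated.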
Examples
# a simple chain of fully connected layers (equivalent to the first example of SequentialContainer)
julia> layers = [Fc(1000, 500), Fc(500, 250), Fc(250, 125)]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           fc_1, fc_2, fc_3 = layers
           output = forward(fc_1, input)
           output = forward(fc_2, output)
           output = forward(fc_3, output)
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)
# a more complicated example: implementation of an inverted residual block
julia> layers = [Conv(16, 64, (1, 1), activation_function="relu"),
                 Conv(64, 64, (3, 3), padding=(1, 1), groups=64, activation_function="relu"), # depthwise-conv layer because groups==in_channels
                 Conv(64, 16, (1, 1), activation_function="relu")]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           conv_1, depthwise_conv, conv_2 = layers
           output = forward(conv_1, input)
           output = forward(depthwise_conv, output)
           output = forward(conv_2, output)
           output = output + input # residual/skipped connection
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# computing the output of the module (with random inputs)
julia> input = rand(Float32, 50, 50, 16, 32)
julia> output = forward(m, input)
# a simple example with a polynomial, just to show that it is possible to use the GraphContainer like an automatic differentiation (AD) tool
julia> f(layers, x) = 0.5x^3 - 2x^2 + 10
julia> df(x) = 1.5x^2 - 4x # checking the result of the AD with this manually written derivative
julia> m = GraphContainer(f, [])
julia> y = forward(m, 3)
julia> dydx = backward(m, 1) # in this case, no loss function was used, so we have no gradient information, therefore we use 1 as the so-called seed
1-element Vector{Float64}:
1.5
julia> manual_dydx = df(3)
1.5
julia> isapprox(dydx[1], manual_dydx)
true
# if a GraphContainer contains BatchNorm layers (regardless of whether they are nested somewhere in a submodule or not),
# the mode of all these layers at once can be switched as follows
julia> trainmode!(m)
julia> testmode!(m)
# if a GraphContainer contains layers with trainable parameters/weights (which is hopefully the case in nearly all situations),
# regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers can be reset at once as follows
julia> zero_gradients(m)
GradValley.Layers.summarize_model
— Function
summarize_model(container::AbstractContainer)
Return a string intended for printing (together with the total number of parameters) that gives an overview of the model and its number of parameters. It currently doesn't show a visualization of the computational graph.
GradValley.Layers.module_to_eltype_device!
— Function
module_to_eltype_device!(layer::AbstractLayer; element_type::DataType=Float32, device::AbstractString="cpu")
Convert the parameters of a container or layer to a different element type and move the parameters to the specified device.
Arguments
- layer::AbstractLayer: the layer or container (often just the entire model) holding the parameters to be changed
- element_type::DataType=Float32: the element type to which the parameters will be converted
- device::AbstractString="cpu": the device to which the parameters will be moved, can be "cpu" or "gpu" (only CUDA is supported)
GradValley.Layers.clean_model_from_backward_information!
— Function
clean_model_from_backward_information!(model::AbstractContainer)
Clean a container from backward pass information (e.g. computational graph). It is recommended to run this function on a model which should be saved to a file.
Forward- and Backward-Pass
GradValley.Layers.forward
— Function
forward(layer, input::AbstractArray{T, N}) where {T, N}
The forward function for computing the output of a module. For every layer/container, an individual method exists. However, all these methods work exactly the same. They all take the layer/container as the first argument and the input data as the second argument. The output is returned.
Examples
# define some layers and containers
julia> layer = Conv(3, 6, (5, 5))
julia> container = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> layer_input = rand(Float32, 50, 50, 3, 32)
julia> container_input = rand(Float32, 1000, 32)
# compute the output of the modules
julia> layer_output = forward(layer, layer_input)
julia> container_output = forward(container, container_input)
GradValley.Layers.backward
— Function
backward(sc::SequentialContainer, derivative_loss::Union{AbstractArray{T, N}, Real}) where {T, N}
The backward function for computing the gradients for a SequentialContainer (highly recommended for model building). The function takes the container (usually the whole model) as the first argument and the derivative of the loss as the second argument. No gradients are returned, they are just saved in the layers the container contains.
Calling backward multiple times can have serious consequences. Gradients are added (accumulated) by convention, so when backward is called multiple times after the corresponding forward call, the gradients of the weights AND the returned gradient w.r.t. the input are added up (accumulated)! So even the gradient returned by the backward call doesn't stay the same when backward is called multiple times after the forward call. Note that the gradients of the weights can be reset by zero_gradients, but the gradient w.r.t. the input of a container cannot be reset (except, of course, by another forward call).
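The accumulation convention can be illustrated with a toy layer in plain Julia. ToyLayer and the toy_* functions are hypothetical and are not GradValley types or functions; the sketch only mimics the "add to the stored gradient" behaviour described above.

```julia
# A toy element-wise layer: output = weight .* input.
mutable struct ToyLayer
    weight::Vector{Float64}
    weight_gradient::Vector{Float64}
    last_input::Vector{Float64}
end
ToyLayer(weight::Vector{Float64}) = ToyLayer(weight, zeros(length(weight)), Float64[])

function toy_forward!(layer::ToyLayer, input::Vector{Float64})
    layer.last_input = input  # remembered for the backward pass
    return layer.weight .* input
end

function toy_backward!(layer::ToyLayer, derivative_loss::Vector{Float64})
    # the new gradient is ADDED to the stored one, it does not overwrite it
    layer.weight_gradient .+= derivative_loss .* layer.last_input
    return nothing
end

toy_zero_gradients!(layer::ToyLayer) = (fill!(layer.weight_gradient, 0.0); nothing)

layer = ToyLayer([1.0, 2.0])
toy_forward!(layer, [3.0, 4.0])
toy_backward!(layer, [1.0, 1.0])
println(layer.weight_gradient)  # [3.0, 4.0]
toy_backward!(layer, [1.0, 1.0])
println(layer.weight_gradient)  # [6.0, 8.0]
toy_zero_gradients!(layer)
println(layer.weight_gradient)  # [0.0, 0.0]
```

The second backward call doubles the stored gradient, which is why the gradients should be reset before each new backward pass unless accumulation is intended.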
Examples
# define a model
julia> m = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# compute the output of the model (with random inputs)
julia> output = forward(m, rand(Float32, 1000, 32))
# use a loss function (with random data as target values) and save the derivative of the loss
julia> loss, derivative_loss = mse_loss(output, rand(Float32, 125, 32)) # note that GradValley.Optimization.mse_loss must be imported
# before the gradients are recalculated, the old gradients should always be reset first
julia> zero_gradients(m)
# backpropagation
julia> backward(m, derivative_loss)
backward(grc::GraphContainer, derivative_loss::Union{AbstractArray{T, N}, Real}) where {T, N}
The backward function for computing the gradients for a GraphContainer (recommended for model building). The function takes the container (usually the whole model) as the first argument and the derivative of the loss as the second argument. The gradients w.r.t. the input arguments are returned (in a vector, in the same order as the inputs were passed to the forward function).
Calling backward multiple times can have serious consequences. Gradients are added (accumulated) by convention, so when backward is called multiple times after the corresponding forward call, the gradients of the weights AND the returned gradients w.r.t. the inputs are added up (accumulated)! So even the gradients returned by the backward call don't stay the same when backward is called multiple times after the forward call. Note that the gradients of the weights can be reset by zero_gradients, but the gradients w.r.t. the inputs of a container cannot be reset (except, of course, by another forward call).
Examples
# define a model
julia> layers = [Fc(1000, 500), Fc(500, 250), Fc(250, 125)]
julia> function forward_pass(layers::Vector, input::AbstractArray)
           fc_1, fc_2, fc_3 = layers
           output = forward(fc_1, input)
           output = forward(fc_2, output)
           output = forward(fc_3, output)
           return output
       end
julia> m = GraphContainer(forward_pass, layers)
# compute the output of the model (with random inputs)
julia> input = rand(Float32, 1000, 32)
julia> output = forward(m, input)
# use a loss function (with random data as target values) and save the derivative of the loss
julia> loss, derivative_loss = mse_loss(output, rand(Float32, 125, 32)) # note that GradValley.Optimization.mse_loss must be imported
# before the gradients are (re)calculated, the old gradients should always be reset first
julia> zero_gradients(m)
# backpropagation
julia> input_gradients = backward(m, derivative_loss) # input_gradients is a vector of length 1 because we only passed one input to the forward function
julia> input_gradient = input_gradients[1] # input_gradient is the gradient w.r.t to the single input
Reset/zero gradients
GradValley.Layers.zero_gradients
— Function
zero_gradients(layer_or_container)
Resets the gradients of a layer or a container (any kind of module with trainable parameters).
Methods exist only for layers with parameters. However, if a container without any layers with trainable parameters is given, NO error is thrown. If the given container does contain layers with trainable parameters/weights, regardless of whether they are nested somewhere in a submodule or not, the gradients of all these layers are reset at once.
Training mode/test mode
GradValley.Layers.trainmode!
— Function
trainmode!(batch_norm_layer_or_container)
Switches the mode of the given batch normalization layer or container to training mode. See Normalization.
If the given container contains batch normalization layers (regardless of whether they are nested somewhere in a submodule or not), the mode of all these layers at once will be switched to training mode.
GradValley.Layers.testmode!
— Function
testmode!(batch_norm_layer_or_container)
Switches the mode of the given batch normalization layer or container to test mode. See Normalization.
If the given container contains batch normalization layers (regardless of whether they are nested somewhere in a submodule or not), the mode of all these layers at once will be switched to test mode.
Convolution
GradValley.Layers.Conv
— Type
Conv(in_channels::Int, out_channels::Int, kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=(1, 1), padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), groups::Int=1, activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)
A convolution layer. Apply a 2D convolution over an input signal with additional batch and channel dimensions.
Arguments
- in_channels::Int: the number of channels in the input image
- out_channels::Int: the number of channels produced by the convolution
- kernel_size::NTuple{2, Int}: the size of the convolving kernel
- stride::NTuple{2, Int}=(1, 1): the stride of the convolution
- padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
- dilation::NTuple{2, Int}=(1, 1): the spacing between kernel elements
- groups::Int=1: the number of blocked connections from input channels to output channels (in-channels and out-channels must both be divisible by groups)
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the convolution
- init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
- use_bias::Bool=true: if true, adds a learnable bias to the output
Shapes
- Input: $(W_{in}, H_{in}, C_{in}, N)$
- Weight: $(W_{w}, H_{w}, \frac{C_{in}}{groups}, C_{out})$
- Bias: $(C_{out}, )$
- Output: $(W_{out}, H_{out}, C_{out}, N)$
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$
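The output-size formulas can be checked with a few lines of plain Julia. The helper names conv_output_length and conv_output_size are hypothetical, and rounding the division down to an integer is an assumption of this sketch rather than something stated above.

```julia
# Output length along one spatial dimension, following the formula above.
# Integer division (rounding down) is an assumption of this sketch.
function conv_output_length(in_len, kernel_len; stride=1, padding=0, dilation=1)
    return (in_len + 2 * padding - dilation * (kernel_len - 1) - 1) ÷ stride + 1
end

# All tuples are given as (height, width), matching the keyword-argument convention.
function conv_output_size(in_size::NTuple{2,Int}, kernel_size::NTuple{2,Int};
                          stride=(1, 1), padding=(0, 0), dilation=(1, 1))
    return ntuple(2) do d
        conv_output_length(in_size[d], kernel_size[d];
                           stride=stride[d], padding=padding[d], dilation=dilation[d])
    end
end

println(conv_output_size((50, 50), (5, 5)))                                  # (46, 46)
println(conv_output_size((50, 50), (3, 5); stride=(2, 1), padding=(2, 1)))  # (26, 48)
```

The same arithmetic explains the Reshape((256, )) in the SequentialContainer example: a 28x28 input shrinks to 24, 12, 8 and finally 4 pixels per side across the two convolution and pooling stages, and 4 * 4 * 16 channels gives 256 features.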
Useful Fields/Variables
- weight::AbstractArray{<: Real, 4}: the learnable weight of the layer
- bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
- weight_gradient::AbstractArray{<: Real, 4}: the current gradient of the weight/kernel
- bias_gradient::AbstractVector{<: Real}: the current gradient of the bias
Definition
For one group, a multichannel 2D convolution (disregarding batch dimension and activation function) can be described as:
- $o_{c_{out}, y_{out}, x_{out}} = \big(\sum_{c_{in}=1}^{C_{in}}\sum_{y_w=1}^{H_{w}}\sum_{x_w=1}^{W_{w}} i_{c_{in}, y_{in}, x_{in}} \cdot w_{c_{out}, c_{in}, y_w, x_w}\big) + b_{c_{out}}$, where
- $y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
- $x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$
O is the output array, I the input array, W the weight array and B the bias array.
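The definition translates almost literally into nested loops. The following plain-Julia sketch is a naive reference for a single group without padding and without a batch dimension. The function name naive_conv is hypothetical, and the arrays are indexed as [y, x, channel] purely for readability, which is an assumption of the sketch and not necessarily the memory layout used by the library.

```julia
# Naive multichannel 2D convolution for one group (no padding, no batch dimension).
# input: (H_in, W_in, C_in), weight: (H_w, W_w, C_in, C_out), bias: (C_out,)
function naive_conv(input::AbstractArray{<:Real,3}, weight::AbstractArray{<:Real,4},
                    bias::AbstractVector{<:Real}; stride=(1, 1), dilation=(1, 1))
    h_in, w_in, c_in = size(input)
    h_w, w_w, _, c_out = size(weight)
    h_out = (h_in - dilation[1] * (h_w - 1) - 1) ÷ stride[1] + 1
    w_out = (w_in - dilation[2] * (w_w - 1) - 1) ÷ stride[2] + 1
    output = zeros(h_out, w_out, c_out)
    for co in 1:c_out, x_out in 1:w_out, y_out in 1:h_out
        acc = float(bias[co])
        for ci in 1:c_in, x_w in 1:w_w, y_w in 1:h_w
            # index mapping from the definition above
            y_in = y_out + (stride[1] - 1) * (y_out - 1) + (y_w - 1) * dilation[1]
            x_in = x_out + (stride[2] - 1) * (x_out - 1) + (x_w - 1) * dilation[2]
            acc += input[y_in, x_in, ci] * weight[y_w, x_w, ci, co]
        end
        output[y_out, x_out, co] = acc
    end
    return output
end

input = ones(4, 4, 1)
weight = ones(2, 2, 1, 1)
bias = [0.5]
println(size(naive_conv(input, weight, bias)))                 # (3, 3, 1)
println(size(naive_conv(input, weight, bias; stride=(2, 2))))  # (2, 2, 1)
println(naive_conv(input, weight, bias)[1, 1, 1])              # 4.5
```

Each output element sums a 2x2 window of ones and adds the bias of 0.5, which gives 4.5 everywhere.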
Examples
# square kernels and fully default values of keyword arguments
julia> m = Conv(3, 6, (5, 5))
# non-square kernels and unequal stride and with padding as well as specified weight initialization mode
# (init_mode="kaiming" stands for kaiming weight initialization with normally distributed values)
julia> m = Conv(3, 6, (3, 5), stride=(2, 1), padding=(2, 1), init_mode="kaiming")
# non-square kernels and unequal stride and with padding, dilation and 3 groups
# (because groups=in_channels and out_channels is divisible by groups, it is even a depthwise convolution)
julia> m = Conv(3, 6, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1), groups=3)
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
GradValley.Layers.ConvTranspose
— Type
ConvTranspose(in_channels::Int, out_channels::Int, kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=(1, 1), padding::NTuple{2, Int}=(0, 0), output_padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), groups::Int=1, activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)
A transpose convolution layer (also known as fractionally-strided convolution or deconvolution). Apply a 2D transposed convolution over an input signal with additional batch and channel dimensions.
Arguments
- in_channels::Int: the number of channels in the input image
- out_channels::Int: the number of channels produced by the convolution
- kernel_size::NTuple{2, Int}: the size of the convolving kernel
- stride::NTuple{2, Int}=(1, 1): the stride of the convolution
- padding::NTuple{2, Int}=(0, 0): because transposed convolution can be seen as a partial (not true) inverse of convolution, padding means in this case to cut off the desired number of pixels on each side (instead of adding pixels)
- output_padding::NTuple{2, Int}=(0, 0): additional size added to one side of each dimension in the output shape (note that output_padding is only used to calculate the output shape, but does not actually add zero-padding to the output)
- dilation::NTuple{2, Int}=(1, 1): the spacing between kernel elements
- groups::Int=1: the number of blocked connections from input channels to output channels (in-channels and out-channels must both be divisible by groups)
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the convolution
- init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
- use_bias::Bool=true: if true, adds a learnable bias to the output
Shapes
- Input: $( W_{in}, H_{in}, C_{in}, N)$
- Weight: $(W_{w}, H_{w}, \frac{C_{out}}{groups}, C_{in})$
- Bias: $(C_{out}, )$
- Output: $(W_{out}, H_{out}, C_{out}, N)$, where
- $H_{out} = (H_{in} - 1) \cdot stride[1] - 2 \cdot padding[1] + dilation[1] \cdot (H_w - 1) + output\_padding[1] + 1$
- $W_{out} = (W_{in} - 1) \cdot stride[2] - 2 \cdot padding[2] + dilation[2] \cdot (W_w - 1) + output\_padding[2] + 1$
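A quick way to sanity-check these formulas is to evaluate them for one dimension in plain Julia. The helper name conv_transpose_output_length is hypothetical; the body is a direct transcription of the formula above.

```julia
# Output length of a transposed convolution along one spatial dimension.
function conv_transpose_output_length(in_len, kernel_len;
                                      stride=1, padding=0, output_padding=0, dilation=1)
    return (in_len - 1) * stride - 2 * padding + dilation * (kernel_len - 1) + output_padding + 1
end

# A 5x5 convolution maps 50 pixels to 46; the transposed convolution maps 46 back to 50.
println(conv_transpose_output_length(46, 5))  # 50

# With stride 2, output_padding selects between the possible output sizes.
println(conv_transpose_output_length(25, 3; stride=2, padding=1))                    # 49
println(conv_transpose_output_length(25, 3; stride=2, padding=1, output_padding=1))  # 50
```

The last two calls show why output_padding exists: with a stride larger than 1, several input sizes of a normal convolution map to the same output size, so the transposed layer needs a hint about which size to restore.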
Useful Fields/Variables
- weight::AbstractArray{<: Real, 4}: the learnable weight of the layer
- bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
- weight_gradient::AbstractArray{<: Real, 4}: the current gradient of the weight/kernel
- bias_gradient::Vector{<: Real}: the current gradient of the bias
Definition
A transposed convolution can be seen as the gradient of a normal convolution with respect to its input. The forward pass of a transposed convolution is the backward pass of a normal convolution, so the forward pass of a normal convolution becomes the backward pass of a transposed convolution (with respect to its input). For more detailed information, you can look at the source code of (transposed) convolution. A nice looking visualization of (transposed) convolution can be found here.
Examples
# square kernels and fully default values of keyword arguments
julia> m = ConvTranspose(6, 3, (5, 5))
# upsampling an output from normal convolution like in GANS, Unet, etc.
julia> input = forward(Conv(3, 6, (5, 5)), rand(Float32, 50, 50, 3, 32))
julia> output = forward(m, input)
# the size of the output of the transposed convolution is equal to the size of the original input of the normal convolution
julia> size(output)
(50, 50, 3, 32)
Pooling
GradValley.Layers.MaxPool
— Type
MaxPool(kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=kernel_size, padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), activation_function::Union{Nothing, AbstractString}=nothing)
A maximum pooling layer. Apply a 2D maximum pooling over an input signal with additional batch and channel dimensions.
Arguments
- kernel_size::NTuple{2, Int}: the size of the window to take the maximum over
- stride::NTuple{2, Int}=kernel_size: the stride of the window
- padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
- dilation::NTuple{2, Int}=(1, 1): the spacing between the window elements
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling
Shapes
- Input: $(W_{in}, H_{in}, C, N)$
- Output: $(W_{out}, H_{out}, C, N)$
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$
Definition
A multichannel 2D maximum pooling (disregarding batch dimension and activation function) can be described as:
\[\begin{align*} o_{c, y_{out}, x_{out}} = \max _{y_w = 1, ..., kernel\_size[1] \ x_w = 1, ..., kernel\_size[2]} i_{c, y_{in}, x_{in}} \end{align*}\]
Where
- $y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
- $x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$
O is the output array and I the input array.
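For a single channel, the definition can be written as a short loop in plain Julia. The function name naive_max_pool is hypothetical; the sketch ignores padding and the channel and batch dimensions, and indexes the matrix as [y, x], which is an assumption made for readability.

```julia
# Naive 2D maximum pooling for one channel (no padding).
function naive_max_pool(input::AbstractMatrix{<:Real}, kernel_size::NTuple{2,Int};
                        stride::NTuple{2,Int}=kernel_size, dilation::NTuple{2,Int}=(1, 1))
    h_in, w_in = size(input)
    h_out = (h_in - dilation[1] * (kernel_size[1] - 1) - 1) ÷ stride[1] + 1
    w_out = (w_in - dilation[2] * (kernel_size[2] - 1) - 1) ÷ stride[2] + 1
    output = similar(input, h_out, w_out)
    for x_out in 1:w_out, y_out in 1:h_out
        best = typemin(eltype(input))
        for x_w in 1:kernel_size[2], y_w in 1:kernel_size[1]
            # index mapping from the definition above
            y_in = y_out + (stride[1] - 1) * (y_out - 1) + (y_w - 1) * dilation[1]
            x_in = x_out + (stride[2] - 1) * (x_out - 1) + (x_w - 1) * dilation[2]
            best = max(best, input[y_in, x_in])
        end
        output[y_out, x_out] = best
    end
    return output
end

input = reshape(collect(1.0:16.0), 4, 4)  # columns are 1:4, 5:8, 9:12, 13:16
println(naive_max_pool(input, (2, 2)))    # [6.0 14.0; 8.0 16.0]
```

Replacing the running maximum with a running sum divided by the window size would give the corresponding average pooling.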
Examples
# pooling of square window of size=(3, 3) and automatically selected stride
julia> m = MaxPool((3, 3))
# pooling of non-square window with custom stride and padding
julia> m = MaxPool((3, 2), stride=(2, 1), padding=(1, 1))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
GradValley.Layers.AvgPool
— Type
AvgPool(kernel_size::NTuple{2, Int}; stride::NTuple{2, Int}=kernel_size, padding::NTuple{2, Int}=(0, 0), dilation::NTuple{2, Int}=(1, 1), activation_function::Union{Nothing, AbstractString}=nothing)
An average pooling layer. Apply a 2D average pooling over an input signal with additional batch and channel dimensions.
Arguments
- kernel_size::NTuple{2, Int}: the size of the window to take the average over
- stride::NTuple{2, Int}=kernel_size: the stride of the window
- padding::NTuple{2, Int}=(0, 0): the zero padding added to all four sides of the input
- dilation::NTuple{2, Int}=(1, 1): the spacing between the window elements
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling
Shapes
- Input: $(W_{in}, H_{in}, C, N)$
- Output: $(W_{out}, H_{out}, C, N)$
- $H_{out} = {\frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (H_w - 1) - 1}{stride[1]}} + 1$
- $W_{out} = {\frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (W_w - 1) - 1}{stride[2]}} + 1$
Definition
A multichannel 2D average pooling (disregarding batch dimension and activation function) can be described as:
- $o_{c, y_{out}, x_{out}} = \frac{1}{kernel\_size[1] \cdot kernel\_size[2]} \sum_{y_w=1}^{kernel\_size[1]}\sum_{x_w=1}^{kernel\_size[2]} i_{c, y_{in}, x_{in}}$, where
- $y_{in} = y_{out} + (stride[1] - 1) \cdot (y_{out} - 1) + (y_w - 1) \cdot dilation[1]$
- $x_{in} = x_{out} + (stride[2] - 1) \cdot (x_{out} - 1) + (x_w - 1) \cdot dilation[2]$
O is the output array and I the input array.
Examples
# pooling of square window of size=(3, 3) and automatically selected stride
julia> m = AvgPool((3, 3))
# pooling of non-square window with custom stride and padding
julia> m = AvgPool((3, 2), stride=(2, 1), padding=(1, 1))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
GradValley.Layers.AdaptiveMaxPool
— Type
AdaptiveMaxPool(output_size::NTuple{2, Int}; activation_function::Union{Nothing, AbstractString}=nothing)
An adaptive maximum pooling layer. Apply a 2D adaptive maximum pooling over an input signal with additional batch and channel dimensions. For any input size, the spatial size of the output is always equal to the specified $output\_size$.
Arguments
- output_size::NTuple{2, Int}: the target output size of the image (can even be larger than the input size) of the form $(H_{out}, W_{out})$
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling
Shapes
- Input: $(W_{in}, H_{in}, C, N)$
- Output: $(W_{out}, H_{out}, C, N)$, where $(H_{out}, W_{out}) = output\_size$
Definition
In some cases, the kernel size and stride could be calculated such that a standard maximum pooling (without padding or dilation) produces the target output size. However, this approach only works if the input size is an integer multiple of the output size (see this Stack Overflow question for further information: stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work). A more generic approach is to calculate the input indices of each pooling window with a separate algorithm used only for adaptive pooling. With this approach, the output can even be larger than the input, which is unusual for pooling because pooling normally reduces the size. The function get_in_indices(in_len, out_len) in gv_functional.jl (lines 68-85) implements such an algorithm (similar to the one in the Stack Overflow question), so you can check there how exactly it is defined. Thus, the mathematical definition is identical to the one of MaxPool, with the difference that the indices $y_{in}$ and $x_{in}$ have already been calculated beforehand.
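The index calculation described above can be sketched in one dimension. This is only an illustration of the floor/ceil windowing rule discussed in the linked Stack Overflow question; it is not a copy of get_in_indices, and the names adaptive_windows and adaptive_max_pool_1d are hypothetical.

```julia
# Hypothetical helper (not GradValley's get_in_indices): the 1-based input
# index range that each output position pools over.
function adaptive_windows(in_len::Int, out_len::Int)
    return [(div((o - 1) * in_len, out_len) + 1):cld(o * in_len, out_len) for o in 1:out_len]
end

# 1D adaptive max pooling: take the maximum of each window.
adaptive_max_pool_1d(x::AbstractVector, out_len::Int) =
    [maximum(x[w]) for w in adaptive_windows(length(x), out_len)]

# The input length (7) is not a multiple of the output length (3),
# so neighbouring windows overlap by one element.
windows = adaptive_windows(7, 3)
pooled = adaptive_max_pool_1d([1, 5, 2, 8, 3, 9, 4], 3)

# Output longer than the input: some input positions are reused.
upsampled = adaptive_max_pool_1d([1, 2, 3], 5)
```

Integer division (div and cld) is used instead of floating-point floor and ceil so that the window borders are exact for any pair of lengths.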
Examples
# target output size of 5x5
julia> m = AdaptiveMaxPool((5, 5))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
GradValley.Layers.AdaptiveAvgPool — Type
AdaptiveAvgPool(output_size::NTuple{2, Int}; activation_function::Union{Nothing, AbstractString}=nothing)
An adaptive average pooling layer. Apply a 2D adaptive average pooling over an input signal with additional batch and channel dimensions. For any input size, the size of the output is always equal to the specified $output\_size$.
Arguments
- output_size::NTuple{2, Int}: the target output size of the image (can even be larger than the input size) of the form $(H_{out}, W_{out})$
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output after the pooling
Shapes
- Input: $(W_{in}, H_{in}, C, N)$
- Output: $(W_{out}, H_{out}, C, N)$, where $(H_{out}, W_{out}) = output\_size$
Definition
In some cases, the kernel size and stride could be calculated such that a standard average pooling (without padding or dilation) produces the target output size. However, this approach only works if the input size is an integer multiple of the output size (see this Stack Overflow question for further information: stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work). A more generic approach is to calculate the input indices of each pooling window with a separate algorithm used only for adaptive pooling. With this approach, the output can even be larger than the input, which is unusual for pooling because pooling normally reduces the size. The function get_in_indices(in_len, out_len) in gv_functional.jl (lines 68-85) implements such an algorithm (similar to the one in the Stack Overflow question), so you can check there how exactly it is defined. Thus, the mathematical definition is identical to the one of AvgPool, with the difference that the indices $y_{in}$ and $x_{in}$ have already been calculated beforehand.
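A two-dimensional version of this idea might look as follows. It is a sketch in plain Julia for a single (width, height) image, assuming the floor/ceil windowing rule from the linked Stack Overflow question; window and adaptive_avg_pool_2d are hypothetical names, not library functions.

```julia
# Hypothetical window rule: the input indices pooled into output position `o`.
window(o::Int, in_len::Int, out_len::Int) =
    (div((o - 1) * in_len, out_len) + 1):cld(o * in_len, out_len)

# `input` has shape (W_in, H_in); `output_size` is (H_out, W_out),
# following the dimension order used throughout this reference.
function adaptive_avg_pool_2d(input::AbstractMatrix, output_size::NTuple{2, Int})
    w_in, h_in = size(input)
    h_out, w_out = output_size
    output = zeros(float(eltype(input)), w_out, h_out)
    for y in 1:h_out, x in 1:w_out
        patch = @view input[window(x, w_in, w_out), window(y, h_in, h_out)]
        output[x, y] = sum(patch) / length(patch)
    end
    return output
end

image = reshape(collect(1.0:16.0), 4, 4)
pooled = adaptive_avg_pool_2d(image, (2, 2))           # 4x4 -> 2x2
enlarged = adaptive_avg_pool_2d(ones(2, 2), (3, 3))    # 2x2 -> 3x3
```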
Examples
# target output size of 5x5
julia> m = AdaptiveAvgPool((5, 5))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
Fully connected
GradValley.Layers.Fc — Type
Fc(in_features::Int, out_features::Int; activation_function::Union{Nothing, AbstractString}=nothing, init_mode::AbstractString="default_uniform", use_bias::Bool=true)
A fully connected layer (sometimes also known as dense or linear). Apply a linear transformation (matrix multiplication) to the input signal with additional batch dimension.
Arguments
- in_features::Int: the size of each input sample ("number of input neurons")
- out_features::Int: the size of each output sample ("number of output neurons")
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output
- init_mode::AbstractString="default_uniform": the initialization mode of the weights (can be "default_uniform", "default", "kaiming_uniform", "kaiming", "xavier_uniform" or "xavier")
- use_bias::Bool=true: if true, adds a learnable bias to the output
Shapes
- Input: $(in\_features, N)$
- Weight: $(out\_features, in\_features)$
- Bias: $(out\_features, )$
- Output: $(out\_features, N)$
Useful Fields/Variables
- weight::AbstractArray{<: Real, 2}: the learnable weights of the layer
- bias::AbstractVector{<: Real}: the learnable bias of the layer (used when use_bias=true)
- weight_gradient::AbstractArray{<: Real, 2}: the current gradients of the weights
- bias_gradient::AbstractVector{<: Real}: the current gradients of the bias
Definition
The forward pass of a fully connected layer is given by the matrix multiplication between the weight matrix and the input vector (disregarding batch dimension and activation function):
- $O = WI + B$
This operation can also be described by:
- $o_{j} = \big(\sum_{k=1}^{in\_features} w_{j,k} \cdot i_{k}\big) + b_{j}$
O is the output vector, I the input vector, W the weight matrix and B the bias vector. Visually interpreted, it means that each input neuron i is weighted with the corresponding weight w connecting the input neuron to the output neuron o where all the incoming signals are summed up.
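The two formulas above can be checked against each other in plain Julia. The sketch below does not call GradValley; the variable names and the concrete numbers are chosen only for illustration, and the shapes follow the Shapes list of this layer.

```julia
# Plain-Julia sketch of O = W * I + B for a batch of inputs.
in_features, out_features, batch_size = 4, 3, 2
weight = reshape(Float32.(1:12), out_features, in_features)  # (out_features, in_features)
bias = Float32[0.5, -1.0, 2.0]                                # (out_features,)
input = reshape(Float32.(1:8), in_features, batch_size)       # (in_features, N)

# Matrix form: the bias is broadcast over the batch dimension.
output = weight * input .+ bias                               # (out_features, N)

# Element-wise form: o_j = sum_k w_{j,k} * i_k + b_j, for every sample n.
output_sum = [sum(weight[j, k] * input[k, n] for k in 1:in_features) + bias[j]
              for j in 1:out_features, n in 1:batch_size]
```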
Examples
# a fully connected layer with 784 input features and 120 output features
julia> m = Fc(784, 120)
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 784, 32)
julia> output = forward(m, input)
Identity
GradValley.Layers.Identity — Type
Identity(; activation_function::Union{Nothing, AbstractString}=nothing)
An identity layer (can be used as an activation function layer). If no activation function is used, this layer does not change the signal in any way. However, if an activation function is used, the activation function will be applied to the input element-wise.
This layer is helpful to apply an element-wise activation independent of a "normal" computational layer.
Arguments
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the inputs
Shapes
- Input: $(*)$, where $*$ means any number of dimensions
- Output: $(*)$ (same shape as input)
Definition
A placeholder identity operator: apart from the optional activation function, the input signal is not changed in any way. If an activation function is used, it is applied to the input element-wise.
Examples
# an independent relu activation
julia> m = Identity(activation_function="relu")
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 10, 32)
julia> output = forward(m, input)
Normalization
GradValley.Layers.BatchNorm — Type
BatchNorm(num_features::Int; epsilon::Real=1e-05, momentum::Real=0.1, affine::Bool=true, track_running_stats::Bool=true, activation_function::Union{Nothing, AbstractString}=nothing)
A batch normalization layer. Apply a batch normalization over a 4D input signal (a mini-batch of 2D inputs with additional channel dimension).
This layer has two modes: training mode and test mode. If track_running_stats::Bool=true, this layer behaves differently in the two modes. During training, this layer always uses the currently calculated batch statistics. If track_running_stats::Bool=true, the running mean and variance are tracked during training and will be used while testing. If track_running_stats::Bool=false, even in test mode, the currently calculated batch statistics are used. The mode can be switched with trainmode! or testmode! respectively. The training mode is active by default.
Arguments
- num_features::Int: the number of channels
- epsilon::Real=1e-05: a value added to the denominator for numerical stability
- momentum::Real=0.1: the value used for the running mean and running variance computation
- affine::Bool=true: if true, this layer uses learnable affine parameters/weights ($\gamma$ and $\beta$)
- track_running_stats::Bool=true: if true, this layer tracks the running mean and variance during training and will use them for testing/evaluation, if false, such statistics are not tracked and, even in test mode, the batch statistics are always recalculated for each new input
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output
Shapes
- Input: $(W, H, C, N)$
- $\gamma$ Weight, $\beta$ Bias: $(C, )$
- Running Mean/Variance: $(C, )$
- Output: $(W, H, C, N)$ (same shape as input)
Useful Fields/Variables
Weights (used if affine::Bool=true)
- weight::AbstractVector{<: Real}: $\gamma$, a learnable parameter for each channel, initialized with ones
- bias::AbstractVector{<: Real}: $\beta$, a learnable parameter for each channel, initialized with zeros
Gradients of weights (used if affine::Bool=true)
- weight_gradient::AbstractVector{<: Real}: the gradient of $\gamma$
- bias_gradient::AbstractVector{<: Real}: the gradient of $\beta$
Running statistics (used if track_running_stats::Bool=true)
- running_mean::AbstractVector{<: Real}: the continuously updated batch statistics of the mean
- running_variance::AbstractVector{<: Real}: the continuously updated batch statistics of the variance
Definition
A batch normalization operation can be described as: For input values over a mini-batch: $\mathcal{B} = \{x_1, x_2, ..., x_n\}$
\[\begin{align*} y_i = \frac{x_i - \overline{\mathcal{B}}}{\sqrt{Var(\mathcal{B}) + \epsilon}} \cdot \gamma + \beta \end{align*}\]
Where $y_i$ is an output value and $x_i$ an input value. $\overline{\mathcal{B}}$ is the mean of the input values in $\mathcal{B}$ and $Var(\mathcal{B})$ is the variance of the input values in $\mathcal{B}$. Note that this definition is fairly general and not specific to 4D inputs. For the 4D inputs of this layer, the input values of $\mathcal{B}$ are taken for each channel individually. So the mean and variance are calculated per channel over the mini-batch.
The update rule for the running statistics (running mean/variance) is:
\[\begin{align*} \hat{x}_{new} = (1 - momentum) \cdot \hat{x} + momentum \cdot x \end{align*}\]
Where $\hat{x}$ is the estimated statistic and $x$ is the new observed value. So $\hat{x}_{new}$ is the new, updated estimated statistic.
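The normalization and the running-statistics update can be sketched with Julia's Statistics standard library. This is an illustration of the formulas above, not the layer's implementation: the names batchnorm_sketch and update_running are hypothetical, and the use of the biased variance (corrected=false) for normalization is an assumption.

```julia
using Statistics

# Per-channel batch normalization for a (W, H, C, N) array.
# Assumption: the biased variance (corrected=false) is used for normalization.
function batchnorm_sketch(x::AbstractArray{<:Real, 4}, gamma::AbstractVector, beta::AbstractVector; epsilon=1e-5)
    dims = (1, 2, 4)                       # reduce over everything except the channels
    batch_mean = mean(x, dims=dims)        # shape (1, 1, C, 1)
    batch_var = var(x, dims=dims, corrected=false)
    scale = reshape(gamma, 1, 1, :, 1)     # broadcast gamma/beta over the channel dimension
    shift = reshape(beta, 1, 1, :, 1)
    y = (x .- batch_mean) ./ sqrt.(batch_var .+ epsilon) .* scale .+ shift
    return y, vec(batch_mean), vec(batch_var)
end

# Running-statistics update: new = (1 - momentum) * old + momentum * observed.
update_running(running, observed; momentum=0.1) =
    (1 - momentum) .* running .+ momentum .* observed

x = rand(Float32, 8, 8, 3, 16)
y, batch_mean, batch_var = batchnorm_sketch(x, ones(Float32, 3), zeros(Float32, 3))
running_mean = update_running(zeros(Float32, 3), batch_mean)
```

With $\gamma = 1$ and $\beta = 0$, each channel of y has a mean close to 0 and a variance close to 1.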
Examples
# a batch normalization layer (3 channels) with learnable parameters and continuously updated batch statistics for evaluation
julia> m = BatchNorm(3)
# the mode can be switched with trainmode! or testmode!
julia> trainmode!(m)
julia> testmode!(m)
# compute the output of the layer (with random inputs)
julia> input = rand(Float32, 50, 50, 3, 32)
julia> output = forward(m, input)
Reshape / Flatten
GradValley.Layers.Reshape — Type
Reshape(out_shape; activation_function::Union{Nothing, AbstractString}=nothing)
A reshape layer (probably mostly used as a flatten layer). Reshape the input signal (affects all dimensions except the batch dimension).
Arguments
- out_shape: the target output size (the output has the same data as the input and must have the same number of elements)
- activation_function::Union{Nothing, AbstractString}=nothing: the element-wise activation function which will be applied to the output
Shapes
- Input: $(*, N)$, where * means any number of dimensions
- Output: $(out\_shape..., N)$
Definition
This layer uses the standard reshape function built into Julia.
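As a rough sketch of what "all dimensions except the batch dimension" means, the following plain-Julia helper keeps the last dimension and reshapes the rest. The name reshape_keep_batch is hypothetical and the layer's internals may differ.

```julia
# Sketch: reshape everything except the trailing batch dimension.
reshape_keep_batch(input::AbstractArray, out_shape::Tuple) =
    reshape(input, out_shape..., size(input, ndims(input)))

input = rand(Float32, 28, 28, 1, 32)
flat = reshape_keep_batch(input, (784,))
```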
Examples
# flatten the input of size 28*28*1 to a vector of length 784 (each plus batch dimension of course)
julia> m = Reshape((784, ))
# computing the output of the layer (with random inputs)
julia> input = rand(Float32, 28, 28, 1, 32)
julia> output = forward(m, input)
julia> size(output) # specified size plus batch dimension
(784, 32)
Activation functions
Almost every layer constructor has the keyword argument activation_function specifying the element-wise activation function. activation_function can be nothing or a string. nothing means no activation function; a string gives the name of the activation. Because softmax isn't a simple element-wise activation function like most activations, Softmax has its own layer. The following element-wise activation functions are currently implemented:
- "relu": applies the element-wise relu activation (rectified linear unit): $f(x) = \max(0, x)$
- "sigmoid": applies the element-wise sigmoid activation (logistic curve): $f(x) = \frac{1}{1 + e^{-x}}$
- "tanh": applies the element-wise tanh activation (tangens hyperbolicus): $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- "leaky_relu": applies an element-wise leaky relu activation with a negative slope of 0.01: $f(x) = \begin{cases}x &\text{if $x \geq 0$}\\\textnormal{0.01} \times x &\text{if $x < 0$}\end{cases}$
- "leaky_relu:$(negative_slope)": applies an element-wise leaky relu activation with a negative slope of negative_slope (e.g. "leaky_relu:0.2" means a leaky relu activation with a negative slope of 0.2): $f(x) = \begin{cases}x &\text{if $x \geq 0$}\\\textnormal{negative\_slope} \times x &\text{if $x < 0$}\end{cases}$
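The formulas in this list, including the "leaky_relu:$(negative_slope)" naming scheme, can be sketched as scalar Julia functions. How GradValley parses these strings internally is not documented here, so activation_from_name is a hypothetical helper that only mirrors the naming convention for floating-point inputs.

```julia
# Scalar definitions matching the formulas above (apply with broadcasting, e.g. relu.(x)).
relu(x) = max(zero(x), x)
sigmoid(x) = 1 / (1 + exp(-x))
leaky_relu(x, negative_slope=0.01) = x >= 0 ? x : oftype(x, negative_slope) * x

# Hypothetical lookup from an activation name to a function.
function activation_from_name(name::AbstractString)
    name == "relu" && return relu
    name == "sigmoid" && return sigmoid
    name == "tanh" && return tanh
    name == "leaky_relu" && return leaky_relu
    if startswith(name, "leaky_relu:")
        # the part after the colon is the negative slope
        slope = parse(Float64, split(name, ":")[2])
        return x -> leaky_relu(x, slope)
    end
    throw(ArgumentError("unknown activation function: $name"))
end

f = activation_from_name("leaky_relu:0.2")
activated = f.([-2.0, 3.0])
```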
Special activation functions
GradValley.Layers.Softmax — Type
Softmax(; dims=1)
A softmax activation function layer (probably mostly used at the "end" of a classifier model). Apply the softmax function to an n-dimensional input array. The softmax will be computed along the given dimensions (dims), so every slice along these dimensions will sum to 1.
Note that this is the only activation function in the form of a layer. All other activation functions can be used with the activation_function::AbstractString keyword argument that nearly every layer provides. All the activation functions which can be used that way are simple element-wise activation functions. Softmax is currently the only non-element-wise activation function. Besides, it is important to be able to select the specific dimension along which the softmax should be computed, which would not work well with a simple keyword argument that only takes the name of the function as a string.
Arguments
- dims=1: the dimensions along which the softmax will be computed (so every slice along these dimensions will sum to 1)
Shapes
- Input: $(*)$, where $*$ means any number of dimensions
- Output: $(*)$ (same shape as input)
Definition
The softmax function converts a vector of real numbers into a probability distribution. The softmax function is defined as:
\[\begin{align*} softmax(x_i) = \frac{e^{x_i}}{\sum_{j}e^{x_j}} = \frac{exp(x_i)}{\sum_{j}exp(x_j)} \end{align*}\]
Where $x_i$ is one value of the input array (slice). Note that the $x_j$ values are taken from each slice individually along the specified dimension. So each slice along the specified dimension will sum to 1. All values in the output are between 0 and 1.
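The definition can be sketched in a few lines of plain Julia. Subtracting the per-slice maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged; whether the layer does this internally is not stated here, and softmax_sketch is a hypothetical name.

```julia
# Softmax along `dims`, written directly from the formula above.
function softmax_sketch(x::AbstractArray; dims=1)
    shifted = x .- maximum(x, dims=dims)   # stability shift, cancels out in the ratio
    exps = exp.(shifted)
    return exps ./ sum(exps, dims=dims)
end

logits = Float32[1 2; 3 4; 5 6]            # 3 classes, batch of 2
probs = softmax_sketch(logits, dims=1)
```

The second column of logits is the first column shifted by 1, so both columns map to the same probabilities.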
Examples
# the softmax will be computed along the first dimension
julia> m = Softmax(dims=1)
# computing the output of the layer
# (with random input data which could represent a batch of unnormalized output values from a classifier)
julia> input = rand(Float32, 10, 32)
julia> output = forward(m, input)
# summing up the values in the output along the first dimension results in a batch of 32 ones
julia> sum(output, dims=1)
1x32 Matrix{Float64}:
1.0 1.0 ... 1.0
GradValley.Optimization
Optimizers
GradValley.Optimization.Adam — Type
Adam(layer_stack::Union{Vector, AbstractContainer}; learning_rate::Real=0.001, beta1::Real=0.9, beta2::Real=0.999, epsilon::Real=1e-08, weight_decay::Real=0, amsgrad::Bool=false, maximize::Bool=false)
Implementation of the Adam optimization algorithm (including the optional AMSgrad version of this algorithm and optional weight decay).
Arguments
- layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
- learning_rate::Real=0.001: the learning rate (shouldn't be 0)
- beta1::Real=0.9, beta2::Real=0.999: the two coefficients used for computing running averages of the gradient and its square
- epsilon::Real=1e-08: value for numerical stability
- weight_decay::Real=0: the weight decay (L2 penalty)
- amsgrad::Bool=false: use the AMSgrad version of this algorithm
- maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it
Definition
For example, a definition of this algorithm in pseudocode can be found here.
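For orientation, a single update of the standard Adam rule for one parameter vector can be sketched as follows. This is the textbook update with an L2 penalty added to the gradient, without AMSgrad or maximize; AdamState and adam_step! are hypothetical names and this is not the optimizer's actual code.

```julia
# State kept per parameter array: first/second moment estimates and step count.
mutable struct AdamState
    m::Vector{Float64}
    v::Vector{Float64}
    t::Int
end
AdamState(n::Integer) = AdamState(zeros(n), zeros(n), 0)

function adam_step!(theta::Vector{Float64}, grad::Vector{Float64}, state::AdamState;
                    learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.0)
    state.t += 1
    g = grad .+ weight_decay .* theta                       # L2 penalty
    state.m .= beta1 .* state.m .+ (1 - beta1) .* g         # running average of the gradient
    state.v .= beta2 .* state.v .+ (1 - beta2) .* g .^ 2    # running average of its square
    m_hat = state.m ./ (1 - beta1^state.t)                  # bias correction
    v_hat = state.v ./ (1 - beta2^state.t)
    theta .-= learning_rate .* m_hat ./ (sqrt.(v_hat) .+ epsilon)
    return theta
end

theta = [0.0, 0.0]
state = AdamState(2)
adam_step!(theta, [1.0, -2.0], state)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly one learning rate regardless of the gradient's magnitude.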
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an Adam optimizer with default arguments
julia> optimizer = Adam(model)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)
GradValley.Optimization.SGD — Type
SGD(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)
Implementation of the stochastic gradient descent optimization algorithm (including optional weight decay and dampening).
Arguments
- layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without parameters)
- learning_rate::Real: the learning rate (shouldn't be 0)
- weight_decay::Real=0.00: the weight decay (L2 penalty)
- dampening::Real=0.00: the dampening (normally only used by optimizers with momentum; it can theoretically be used without momentum, in which case it scales the learning rate: $(1 - dampening) \cdot learning\_rate$)
- maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it
Definition
For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of a simple SGD with no momentum, the momentum $μ$ is zero in the sense of the mentioned documentation.)
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an SGD optimizer with a learning rate of 0.1 and a weight decay of 0.5 (otherwise default values)
julia> optimizer = SGD(model, 0.1, weight_decay=0.5)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)
GradValley.Optimization.MSGD — Type
MSGD(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; momentum::Real=0.90, weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)
Implementation of the stochastic gradient descent with momentum optimization algorithm (including optional weight decay and dampening).
Arguments
- layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
- learning_rate::Real: the learning rate (shouldn't be 0)
- momentum::Real=0.90: the momentum factor (shouldn't be 0)
- weight_decay::Real=0.00: the weight decay (L2 penalty)
- dampening::Real=0.00: the dampening for the momentum
- maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it
Definition
For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of SGD with default momentum, in the sense of the mentioned documentation, the momentum $\mu$ isn't zero ($\mu \neq 0$) and $nesterov$ is $false$.)
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an MSGD optimizer with a learning rate of 0.1 and a momentum of 0.75 (otherwise default values)
julia> optimizer = MSGD(model, 0.1, momentum=0.75)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)
GradValley.Optimization.Nesterov — Type
Nesterov(layer_stack::Union{Vector, AbstractContainer}, learning_rate::Real; momentum::Real=0.90, weight_decay::Real=0.00, dampening::Real=0.00, maximize::Bool=false)
Implementation of the stochastic gradient descent with nesterov momentum optimization algorithm (including optional weight decay and dampening).
Arguments
- layer_stack::Union{Vector, AbstractContainer}: the vector OR the container (SequentialContainer/GraphContainer, often simply the whole model) containing the layers with the parameters to be optimized (can also contain layers without any parameters)
- learning_rate::Real: the learning rate (shouldn't be 0)
- momentum::Real=0.90: the momentum factor (shouldn't be 0)
- weight_decay::Real=0.00: the weight decay (L2 penalty)
- dampening::Real=0.00: the dampening for the momentum (for true nesterov momentum, dampening must be 0)
- maximize::Bool=false: maximize the objective with respect to the parameters, instead of minimizing it
Definition
For example, a definition of this algorithm in pseudocode can be found here. (Note that in this case of SGD with nesterov momentum, $nesterov$ is $true$ in the sense of the mentioned documentation.)
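To make the role of $\mu$, dampening and the $nesterov$ flag concrete, here is a sketch of one momentum update for a single parameter vector. It assumes the widely used formulation in which the momentum buffer is initialized with the first gradient; MomentumState and sgd_momentum_step! are hypothetical names, and the library's own update may differ in details.

```julia
# Momentum buffer for one parameter array.
mutable struct MomentumState
    buffer::Vector{Float64}
    initialized::Bool
end
MomentumState(n::Integer) = MomentumState(zeros(n), false)

function sgd_momentum_step!(theta::Vector{Float64}, grad::Vector{Float64}, state::MomentumState;
                            learning_rate, momentum=0.9, weight_decay=0.0, dampening=0.0, nesterov=false)
    g = grad .+ weight_decay .* theta                  # L2 penalty
    if state.initialized
        state.buffer .= momentum .* state.buffer .+ (1 - dampening) .* g
    else
        state.buffer .= g                              # assumption: first step copies the gradient
        state.initialized = true
    end
    # nesterov=true looks ahead along the buffer; nesterov=false uses the buffer itself
    direction = nesterov ? g .+ momentum .* state.buffer : copy(state.buffer)
    theta .-= learning_rate .* direction
    return theta
end

# Two steps of momentum SGD (nesterov=false) with a constant gradient of 1.
theta = [1.0]
state = MomentumState(1)
sgd_momentum_step!(theta, [1.0], state; learning_rate=0.1, momentum=0.5)
sgd_momentum_step!(theta, [1.0], state; learning_rate=0.1, momentum=0.5)

# One step with nesterov=true from the same starting point.
theta_nesterov = [1.0]
sgd_momentum_step!(theta_nesterov, [1.0], MomentumState(1); learning_rate=0.1, momentum=0.5, nesterov=true)
```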
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize a Nesterov optimizer with a learning rate of 0.1 and a nesterov momentum of 0.8 (otherwise default values)
julia> optimizer = Nesterov(model, 0.1, momentum=0.8)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)
Optimization step function
GradValley.Optimization.step! — Function
step!(optimizer::Union{SGD, MSGD, Nesterov})
Perform a single optimization step (parameter update) for the given optimizer.
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# initialize an optimizer (which optimizer specifically doesn't matter)
julia> optimizer = SGD(model, 0.1)
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
# perform a single optimization step (parameter update)
julia> step!(optimizer)
Loss functions
GradValley.Optimization.mae_loss — Function
mae_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N
Calculate the (mean) absolute error (L1 norm, with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.
Arguments
- prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions
- target::AbstractArray{<: Real, N}: the corresponding target values of shape (*), must have the same shape as the prediction input
- reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
- return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned
Definition
$L$ is the loss value which will be returned. If return_derivative is true, then an array with the same shape as prediction/target is returned as well; it contains the partial derivatives of $L$ w.r.t. each prediction value: $\frac{\partial L}{\partial p_i}$, where $p_i$ is one prediction value. If reduction_method is nothing, the element-wise computed losses are returned. Note that for reduction_method=nothing, the derivative is the same as for reduction_method="sum". The element-wise calculation can be defined as (where $t_i$ is one target value and $l_i$ is one loss value):
\[\begin{align*} l_i = |p_i - t_i| \end{align*}\]
Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):
\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{p_i - t_i}{l_i \cdot n} &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; \frac{p_i - t_i}{l_i} &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
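The case distinction above can be written out in plain Julia. The sketch uses sign(p_i - t_i), which equals $\frac{p_i - t_i}{l_i}$ wherever $l_i \neq 0$; using 0 at $l_i = 0$ is an assumption of this sketch, and mae_sketch is a hypothetical name rather than the library function.

```julia
# Sketch of the MAE loss and derivative for the three reduction methods.
function mae_sketch(prediction::AbstractArray, target::AbstractArray; reduction_method="mean")
    diff = prediction .- target
    losses = abs.(diff)                    # l_i = |p_i - t_i|
    derivative = sign.(diff)               # (p_i - t_i) / l_i wherever l_i != 0
    if reduction_method == "mean"
        n = length(losses)
        return sum(losses) / n, derivative ./ n
    elseif reduction_method == "sum"
        return sum(losses), derivative
    else                                   # reduction_method === nothing
        return losses, derivative
    end
end

prediction = [1.0, 2.0, 5.0]
target = [2.0, 2.5, 3.0]
loss, loss_derivative = mae_sketch(prediction, target)
loss_sum, derivative_sum = mae_sketch(prediction, target; reduction_method="sum")
```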
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> input = rand(1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = mae_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
GradValley.Optimization.mse_loss — Function
mse_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N
Calculate the (mean) squared error (squared L2 norm, with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.
Arguments
- prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions
- target::AbstractArray{<: Real, N}: the corresponding target values of shape (*), must have the same shape as the prediction input
- reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
- return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned
Definition
$L$ is the loss value which will be returned. If return_derivative is true, then an array with the same shape as prediction/target is returned as well; it contains the partial derivatives of $L$ w.r.t. each prediction value: $\frac{\partial L}{\partial p_i}$, where $p_i$ is one prediction value. If reduction_method is nothing, the element-wise computed losses are returned. Note that for reduction_method=nothing, the derivative is the same as for reduction_method="sum". The element-wise calculation can be defined as (where $t_i$ is one target value and $l_i$ is one loss value):
\[\begin{align*} l_i = (p_i - t_i)^2 \end{align*}\]
Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):
\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{2}{n}(p_i - t_i) &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; 2(p_i - t_i) &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
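One way to convince yourself that the derivative for reduction_method="mean" is right is to compare it with a central finite difference. The sketch below does this in plain Julia; mse_sketch is a hypothetical name and only covers the "mean" case.

```julia
# MSE with "mean" reduction and its analytic derivative (2/n) * (p_i - t_i).
function mse_sketch(prediction::AbstractVector, target::AbstractVector)
    n = length(prediction)
    loss = sum((prediction .- target) .^ 2) / n
    derivative = (2 / n) .* (prediction .- target)
    return loss, derivative
end

prediction = [0.5, -1.0, 2.0]
target = [1.0, 0.0, 2.5]
loss, derivative = mse_sketch(prediction, target)

# Central finite difference: perturb one prediction value at a time.
h = 1e-6
numeric = map(eachindex(prediction)) do i
    plus = copy(prediction); plus[i] += h
    minus = copy(prediction); minus[i] -= h
    (mse_sketch(plus, target)[1] - mse_sketch(minus, target)[1]) / (2h)
end
```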
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125)])
# create some random input data
julia> input = rand(1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = mse_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
GradValley.Optimization.bce_loss — Function
bce_loss(prediction::AbstractArray{<: Real, N}, target::AbstractArray{<: Real, N}; epsilon::Real=eps(eltype(prediction)), reduction_method::Union{AbstractString, Nothing}="mean", return_derivative::Bool=true) where N
Calculate the binary cross entropy loss (with optional reduction to a single loss value (mean or sum)) and its derivative with respect to the prediction input.
Arguments
- prediction::AbstractArray{<: Real, N}: the prediction of the model of shape (*), where * means any number of dimensions (the prediction is typically given by the output of a sigmoid activation function)
- target::AbstractArray{<: Real, N}: the corresponding target values (should be between 0 and 1) of shape (*), must have the same shape as the prediction input
- epsilon::Real=eps(eltype(prediction)): term added inside the logarithms to avoid infinite values
- reduction_method::Union{AbstractString, Nothing}="mean": can be "mean", "sum" or nothing, specifies the reduction method which reduces the element-wise computed loss to a single value
- return_derivative::Bool=true: if true, the loss and its derivative with respect to the prediction input are returned, if false, just the loss will be returned
Definition
$L$ is the loss value which will be returned. If return_derivative is true, then an array with the same shape as prediction/target is returned as well; it contains the partial derivatives of $L$ w.r.t. each prediction value: $\frac{\partial L}{\partial p_i}$, where $p_i$ is one prediction value. If reduction_method is nothing, the element-wise computed losses are returned. Note that for reduction_method=nothing, the derivative is the same as for reduction_method="sum". The element-wise calculation can be defined as (where $t_i$ is one target value and $l_i$ is one loss value):
\[\begin{align*} l_i = -t_i \cdot \log(p_i + \epsilon) - (1 - t_i) \cdot \log(1 - p_i + \epsilon) \end{align*}\]
Then, $L$ and $\frac{\partial L}{\partial p_i}$ differ a little bit from case to case ($n$ is the number of values in prediction/target):
\[\begin{align*} L;\frac{\partial L}{\partial p_i} = \begin{cases}\frac{1}{n}\sum_{i=1}^{n} l_i; \frac{1}{n}(\frac{-t_i}{p_i + \epsilon} - \frac{t_i - 1}{1 - p_i + \epsilon}) &\text{for reduction\_method="mean"}\\\sum_{i=1}^{n} l_i; \frac{-t_i}{p_i + \epsilon} - \frac{t_i - 1}{1 - p_i + \epsilon} &\text{for reduction\_method="sum"}\end{cases} \end{align*}\]
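The purpose of $\epsilon$ becomes visible when a prediction is exactly 0 or 1. The following plain-Julia sketch implements the "mean" case of the formulas above; bce_sketch is a hypothetical name, not the library function.

```julia
# Binary cross entropy with "mean" reduction, written from the formulas above.
function bce_sketch(prediction::AbstractArray, target::AbstractArray; epsilon=eps(eltype(prediction)))
    losses = @. -target * log(prediction + epsilon) - (1 - target) * log(1 - prediction + epsilon)
    derivative = @. -target / (prediction + epsilon) - (target - 1) / (1 - prediction + epsilon)
    n = length(losses)
    return sum(losses) / n, derivative ./ n
end

prediction = [0.9, 0.2, 1.0]     # the last prediction is exactly 1
target = [1.0, 0.0, 1.0]
loss, derivative = bce_sketch(prediction, target)

# Without epsilon, the term 0 * log(0) of the last element evaluates to NaN.
loss_no_eps, _ = bce_sketch(prediction, target; epsilon=0.0)
```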
Examples
# define a model
julia> model = SequentialContainer([Fc(1000, 500), Fc(500, 250), Fc(250, 125, activation_function="sigmoid")])
# create some random input data
julia> input = rand(Float32, 1000, 32)
# compute the output of the model
julia> output = forward(model, input)
# generate some random target values
julia> target = rand(Float32, size(output)...)
# compute the loss and its derivative (with default reduction method "mean")
julia> loss, loss_derivative = bce_loss(output, target)
# compute the gradients
julia> backward(model, loss_derivative)
GradValley.Functional
GradValley.Functional contains many primitives common to various neural networks. Only a few of these functions are documented because most of them are used only internally (not by the user).
GradValley.Functional.zero_pad_nd — Function
zero_pad_nd(input::AbstractArray{T, N}, padding::NTuple{N, Int}) where {T, N}
Perform a padding-operation (nd => number of dimensions doesn't matter) as is usual for neural networks: equal padding at each "end" of each axis/dimension.
Arguments
- input::AbstractArray{T, N}: of shape (d1, d2, ..., dn)
- padding::NTuple{N, Int}: must always be a tuple whose length equals the number of dimensions of input: (pad-d1, pad-d2, ..., pad-dn)
Shape of returned output: (d1 + padding[1] * 2, d2 + padding[2] * 2, ..., dn + padding[n] * 2)
zero_pad_nd(input::CuArray{T, N}, padding::NTuple{N, Int}) where {T, N}
Perform a padding-operation (nd => number of dimensions doesn't matter) as is usual for neural networks: equal padding at each "end" of each axis/dimension.
Arguments
- input::CuArray{T, N}: of shape (d1, d2, ..., dn)
- padding::NTuple{N, Int}: must always be a tuple whose length equals the number of dimensions of input: (pad-d1, pad-d2, ..., pad-dn)
Shape of returned output: (d1 + padding[1] * 2, d2 + padding[2] * 2, ..., dn + padding[n] * 2)
GradValley.Functional.zero_pad_2d — Function
zero_pad_2d(input::AbstractArray{T, 4}, padding::NTuple{2, Int}) where {T}
Perform a padding-operation (2d => 4 dimensions, where the first 2 dimensions will be padded) as is usual for neural networks: equal padding at each "end" of each spatial axis/dimension.
Arguments
- input::AbstractArray{T, 4}: of shape (d1, d2, d3, d4), d2 is expected to be the height dimension, d1 is expected to be the width dimension
- padding::NTuple{2, Int}: must always be a tuple of length 2: (pad-d2, pad-d1) == (pad-height, pad-width)
Shape of returned output: (d1 + padding[2] * 2, d2 + padding[1] * 2, d3, d4)
zero_pad_2d(input::CuArray{T, 4}, padding::NTuple{2, Int}) where {T}
Perform a padding-operation (2d => 4 dimensions, where the first 2 dimensions will be padded) as is usual for neural networks: equal padding at each "end" of each spatial axis/dimension.
Arguments
- input::CuArray{T, 4}: of shape (d1, d2, d3, d4), d2 is expected to be the height dimension, d1 is expected to be the width dimension
- padding::NTuple{2, Int}: must always be a tuple of length 2: (pad-d2, pad-d1) == (pad-height, pad-width)
Shape of returned output: (d1 + padding[2] * 2, d2 + padding[1] * 2, d3, d4)
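The output shapes stated for both padding functions can be reproduced with a small CPU-only sketch in plain Julia. zero_pad_sketch and zero_pad_2d_sketch are hypothetical names that imitate the documented behaviour; they are not the library's implementations.

```julia
# n-dimensional zero padding: `padding[d]` zeros on both ends of dimension d.
function zero_pad_sketch(input::AbstractArray{T, N}, padding::NTuple{N, Int}) where {T, N}
    out_size = ntuple(d -> size(input, d) + 2 * padding[d], N)
    output = zeros(T, out_size)
    # index ranges of the original data inside the padded array
    inner = ntuple(d -> (padding[d] + 1):(padding[d] + size(input, d)), N)
    output[inner...] = input
    return output
end

# 2D variant: padding is (pad-height, pad-width), the array is (width, height, C, N),
# so the two padding values are swapped before padding the first two dimensions.
zero_pad_2d_sketch(input::AbstractArray{T, 4}, padding::NTuple{2, Int}) where {T} =
    zero_pad_sketch(input, (padding[2], padding[1], 0, 0))

x = ones(Float32, 5, 4, 3, 2)          # width 5, height 4
y = zero_pad_2d_sketch(x, (1, 2))      # pad height by 1 and width by 2
```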