Continuous Normalizing Flows
Now, we study a single-layer neural network that can estimate the density p_x of a variable of interest x by re-parameterizing a base variable z with known density p_z through the neural network model passed to the layer.
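For orientation, the re-parameterization can be written with the instantaneous change-of-variables formula from [1]. This is a sketch of the underlying relations, not a description of the library internals; the symbols f, θ, t_0, and t_1 are notation assumed here and do not appear in the text above.

```latex
\begin{align}
\frac{\partial z(t)}{\partial t} &= f\big(z(t), t; \theta\big), \\
\frac{\partial \log p\big(z(t)\big)}{\partial t} &= -\operatorname{tr}\!\left(\frac{\partial f}{\partial z(t)}\right), \\
\log p_x(x) &= \log p_z\big(z(t_0)\big) - \int_{t_0}^{t_1} \operatorname{tr}\!\left(\frac{\partial f}{\partial z(t)}\right) dt,
\qquad z(t_1) = x .
\end{align}
```

Here f is the neural network passed to the layer, z(t_0) is distributed according to the base density p_z, and z(t_1) is the variable of interest x.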
Copy-Pasteable Code
Before getting to the explanation, here's some code to start with. A full explanation of the definition and training process follows:
using ComponentArrays, DiffEqFlux, OrdinaryDiffEq, Optimization, Distributions, Random,
OptimizationOptimisers, OptimizationOptimJL
nn = Chain(Dense(1, 3, tanh), Dense(3, 1, tanh))
tspan = (0.0f0, 10.0f0)
ffjord_mdl = FFJORD(nn, tspan, (1,), Tsit5(); ad = AutoZygote())
ps, st = Lux.setup(Xoshiro(0), ffjord_mdl)
ps = ComponentArray(ps)
model = StatefulLuxLayer{true}(ffjord_mdl, nothing, st)
# Training
data_dist = Normal(6.0f0, 0.7f0)
train_data = Float32.(rand(data_dist, 1, 100))
function loss(θ)
    logpx, λ₁, λ₂ = model(train_data, θ)
    return -mean(logpx)
end
function cb(state, l)
    @info "FFJORD Training" loss=l
    return false
end
adtype = Optimization.AutoForwardDiff()
optf = Optimization.OptimizationFunction((x, p) -> loss(x), adtype)
optprob = Optimization.OptimizationProblem(optf, ps)
res1 = Optimization.solve(
    optprob, OptimizationOptimisers.Adam(0.01); maxiters = 20, callback = cb)
optprob2 = Optimization.OptimizationProblem(optf, res1.u)
res2 = Optimization.solve(optprob2, Optim.LBFGS(); allow_f_increases = false, callback = cb)
# Evaluation
using Distances
st_ = (; st..., monte_carlo = false)
actual_pdf = pdf.(data_dist, train_data)
learned_pdf = exp.(ffjord_mdl(train_data, res2.u, st_)[1][1])
train_dis = totalvariation(learned_pdf, actual_pdf) / size(train_data, 2)
# Data Generation
ffjord_dist = FFJORDDistribution(ffjord_mdl, ps, st)
new_data = rand(ffjord_dist, 100)
1×100 Matrix{Float32}:
-5.55262 -4.96205 -5.37136 -5.37277 … -5.07958 -6.13076 -5.76396
Step-by-Step Explanation
We can use DiffEqFlux.jl to define and train CNF layers and to output the densities they compute. As with a neural ODE, the layer takes a neural network that defines its derivative function (see [1] for a reference). One possible way to define a CNF layer is:
using ComponentArrays, DiffEqFlux, OrdinaryDiffEq, Optimization, OptimizationOptimisers,
OptimizationOptimJL, Distributions, Random
nn = Chain(Dense(1, 3, tanh), Dense(3, 1, tanh))
tspan = (0.0f0, 10.0f0)
ffjord_mdl = FFJORD(nn, tspan, (1,), Tsit5(); ad = AutoZygote())
ps, st = Lux.setup(Xoshiro(0), ffjord_mdl)
ps = ComponentArray(ps)
model = StatefulLuxLayer{true}(ffjord_mdl, ps, st)
ffjord_mdl
FFJORD(
model = Chain(
layer_1 = Dense(1 => 3, tanh), # 6 parameters
layer_2 = Dense(3 => 1, tanh), # 4 parameters
),
) # Total: 10 parameters,
# plus 0 states.
where we also pass as input the desired time span over which the differential equations that define log p_x and z(t) will be solved.
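To make the role of the time span concrete, here is a minimal plain-Julia sketch of a one-dimensional CNF density evaluation. It is an illustration under stated assumptions, not how FFJORD is implemented: the dynamics `f`, the helper names `dfdz`, `logpz`, and `toy_logpx`, the standard normal base density, and the fixed-step Euler scheme are all hypothetical choices made here for readability, whereas the layer above uses the Lux network and the ODE solver you pass in.

```julia
# Toy 1-D dynamics (hypothetical stand-in for the neural network).
f(z, w, b) = tanh(w * z + b)
# In one dimension the trace of the Jacobian is just df/dz.
dfdz(z, w, b) = w * (1 - tanh(w * z + b)^2)

# Log-density of a standard normal base variable (an assumption of this sketch).
logpz(z) = -0.5 * z^2 - 0.5 * log(2π)

# Estimate log p_x(x) by integrating from t1 back to t0 with fixed-step Euler,
# accumulating the integral of the Jacobian trace along the way.
function toy_logpx(x, w, b; tspan = (0.0, 1.0), nsteps = 1000)
    t0, t1 = tspan
    dt = (t1 - t0) / nsteps
    z = x
    trace_integral = 0.0
    for _ in 1:nsteps
        trace_integral += dfdz(z, w, b) * dt  # approximate ∫ tr(∂f/∂z) dt
        z -= f(z, w, b) * dt                  # step z backward in time
    end
    return logpz(z) - trace_integral
end

# With w = 0 and b = 0 the dynamics vanish, so x keeps the base density.
toy_logpx(0.5, 0.0, 0.0)
```

A longer time span gives the flow more time to transport the base density, at the cost of more solver work.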
Training
First, let's draw an array of samples from a normal distribution to use as the training data. Note that we want the data as Float32 values to match how we have set up the neural network weights and the state space of the ODE.
data_dist = Normal(6.0f0, 0.7f0)
train_data = Float32.(rand(data_dist, 1, 100))
1×100 Matrix{Float32}:
6.56684 5.31679 6.53929 7.49943 … 5.87144 5.54236 5.83153 4.94166
Now we define a loss function that we wish to minimize and a callback function to track loss improvements:
function loss(θ)
    logpx, λ₁, λ₂ = model(train_data, θ)
    return -mean(logpx)
end
function cb(state, l)
    # `l` is the current loss value passed in by the optimizer
    @info "FFJORD Training" loss=l
    return false
end
cb (generic function with 1 method)
In this example, we wish to choose the parameters of the network such that the likelihood of the re-parameterized variable is maximized, which is why the loss is the negative mean of logpx. Other loss functions may be used depending on the application. Note that the CNF layer returns the log of the density of the variable x, not the density itself.
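Written out, the quantity returned by the loss function above is the average negative log-likelihood of the training samples. The symbols N and x_i are notation introduced here for the number of samples and the individual columns of train_data.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_x(x_i; \theta)
```

In this example N = 100, since train_data has 100 columns.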
We then train the neural network to learn the distribution of x. Here we showcase starting the optimization with Adam to quickly approach a minimum, and then homing in on the minimum by using LBFGS.
adtype = Optimization.AutoForwardDiff()
optf = Optimization.OptimizationFunction((x, p) -> loss(x), adtype)
optprob = Optimization.OptimizationProblem(optf, ps)
res1 = Optimization.solve(
    optprob, OptimizationOptimisers.Adam(0.01); maxiters = 20, callback = cb)
retcode: Default
u: ComponentVector{Float32}(layer_1 = (weight = Float32[0.11590508; -0.7445653; -2.7860525;;], bias = Float32[-0.39763144, 0.52249235, -0.4585489]), layer_2 = (weight = Float32[-0.4738097 0.7351604 -0.34725535], bias = Float32[0.2251537]))
We then complete the training using a different optimizer, starting from where Adam stopped.
optprob2 = Optimization.OptimizationProblem(optf, res1.u)
res2 = Optimization.solve(optprob2, Optim.LBFGS(); allow_f_increases = false, callback = cb)
retcode: Failure
u: ComponentVector{Float32}(layer_1 = (weight = Float32[0.8417258; 0.43129402; -2.791226;;], bias = Float32[0.11248932, 0.12563767, -0.5826275]), layer_2 = (weight = Float32[0.4657826 -0.0684071 -0.04422123], bias = Float32[-1.0858264]))
Evaluation
To evaluate the result, we can use the totalvariation function from Distances.jl. First, we compute the densities at the training points using the actual distribution and the trained FFJORD model. Then we compute the distance between these two sets of densities and divide by the number of samples.
using Distances
st_ = (; st..., monte_carlo = false)
actual_pdf = pdf.(data_dist, train_data)
learned_pdf = exp.(ffjord_mdl(train_data, res2.u, st_)[1][1])
train_dis = totalvariation(learned_pdf, actual_pdf) / size(train_data, 2)
0.03142902f0
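As a rough guide to what this number measures, the sketch below writes the statistic out by hand in plain Julia. It assumes that totalvariation follows the definition documented in Distances.jl, sum(abs(x - y)) / 2; the helper name `tv` and the small vectors `p` and `q` are made up for illustration and are not part of the example above.

```julia
# Hand-written equivalent of the total variation statistic, for illustration only.
tv(p, q) = sum(abs.(p .- q)) / 2

# Two made-up vectors of density values evaluated at the same three points.
p = Float32[0.1, 0.4, 0.5]
q = Float32[0.2, 0.3, 0.5]

# Per-sample average, mirroring the division by size(train_data, 2) above.
tv(p, q) / length(p)
```

A value close to zero means the learned and actual densities nearly agree at the sampled points.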
Data Generation
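What's more, we can generate new data by using FFJORD as a distribution in rand. Note that the snippet below passes the initial parameters ps; to sample from the trained model, pass res2.u instead.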
ffjord_dist = FFJORDDistribution(ffjord_mdl, ps, st)
new_data = rand(ffjord_dist, 100)
1×100 Matrix{Float32}:
-5.24674 -6.36796 -4.85017 -6.03722 … -5.49508 -6.6998 -5.50485
References
[1] Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. "FFJORD: Free-form continuous dynamics for scalable reversible generative models." arXiv preprint arXiv:1810.01367 (2018).