Tips & Tricks, FAQ
Stochastic gradient descent will not work (well) for INNs. Use e.g. Adam instead.
Gradient clipping can be useful if you are experiencing training instabilities, e.g. use torch.nn.utils.clip_grad_norm_.
Add some slight noise to the inputs (on the order of 1E-2). This stabilizes training and prevents sparse gradients if some input dimensions are quantized or perfectly correlated.
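The general tips above (Adam, input noise, gradient clipping) can be combined in a single training step. A minimal sketch, assuming the model is an INN whose forward pass returns the latent code and the log-Jacobian determinant (as FrEIA's SequenceINN does); the loss is the standard maximum-likelihood objective, and the noise scale, learning rate and clipping norm are placeholder values:

```python
import torch

def training_step(inn, optimizer, x):
    """One training step illustrating the tips above: input noise,
    an adaptive optimizer such as Adam (passed in as `optimizer`),
    and gradient clipping."""
    # Slight input noise (order 1E-2) against quantized or perfectly
    # correlated input dimensions.
    x = x + 1e-2 * torch.randn_like(x)

    z, log_jac_det = inn(x)
    # Maximum-likelihood loss under a standard normal latent distribution.
    loss = (0.5 * torch.sum(z**2, dim=1) - log_jac_det).mean()

    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to tame occasional training instabilities.
    torch.nn.utils.clip_grad_norm_(inn.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()

# Usage (placeholder names):
# optimizer = torch.optim.Adam(inn.parameters(), lr=1e-3)
# loss = training_step(inn, optimizer, x_batch)
```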
For coupling blocks in particular:
Use Xavier initialization for the subnetwork weights. This prevents unstable training at the start (see the sketch after this list).
If your network is very deep (>30 coupling blocks), initialize the last layer in the subnetworks to zero. This means the INN as a whole is initialized to the identity, and you will not get NaNs at the first iteration.
Do not forget permutations/orthogonal transforms between coupling blocks.
Keep the subnetworks shallow (2-3 layers only), but wide (>= 128 neurons / >= 64 conv. channels).
Keep in mind that one coupling block contains between 4 and 12 individual convolutions or fully connected layers, so you may not need as many blocks as you think; otherwise the number of parameters will be huge.
That said, since the coupling blocks are initialized to roughly the identity transform, it is hard to use too many coupling blocks and break the training completely (as opposed to a standard feed-forward NN).
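The coupling-block tips above translate into a subnet constructor and a block sequence roughly like the following. This is a minimal sketch using FrEIA's SequenceINN, GLOWCouplingBlock and PermuteRandom; the input dimension, depth, width and clamp value are placeholder choices, not recommendations:

```python
import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

def subnet_fc(dims_in, dims_out):
    # Shallow (2 layers) but wide (128 neurons) subnetwork.
    net = nn.Sequential(nn.Linear(dims_in, 128), nn.ReLU(),
                        nn.Linear(128, dims_out))
    for layer in net:
        if isinstance(layer, nn.Linear):
            # Xavier initialization for stable training at the start.
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)
    # For very deep INNs: zero the last layer, so each coupling block
    # starts out as (roughly) the identity transform.
    nn.init.zeros_(net[-1].weight)
    return net

inn = Ff.SequenceINN(8)  # e.g. 8-dimensional inputs
for _ in range(12):
    inn.append(Fm.GLOWCouplingBlock, subnet_constructor=subnet_fc, clamp=2.0)
    # Do not forget a permutation between coupling blocks.
    inn.append(Fm.PermuteRandom)
```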
For convolutional INNs in particular:
Perform some kind of reshaping early on, so the INN has >3 channels to work with (see the sketch after this list).
Coupling blocks using 1x1 convolutions in the subnets seem important for the quality; they should constitute every other or every third coupling block.
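A rough sketch of the convolutional tips, again using FrEIA module names (IRevNetDownsampling, GLOWCouplingBlock, PermuteRandom) with illustrative sizes: the early downsampling trades spatial resolution for channels, and every third coupling block uses a 1x1-convolution subnet.

```python
import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

def subnet_conv(c_in, c_out):
    # Shallow, wide conv subnetwork with 3x3 kernels.
    return nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, c_out, 3, padding=1))

def subnet_conv_1x1(c_in, c_out):
    # Same structure, but with 1x1 convolutions.
    return nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU(),
                         nn.Conv2d(64, c_out, 1))

inn = Ff.SequenceINN(3, 32, 32)  # e.g. 3x32x32 images
# Reshape early: 3 channels at 32x32 become 12 channels at 16x16.
inn.append(Fm.IRevNetDownsampling)
for k in range(9):
    # Every third coupling block uses 1x1 convolutions in its subnet.
    subnet = subnet_conv_1x1 if k % 3 == 2 else subnet_conv
    inn.append(Fm.GLOWCouplingBlock, subnet_constructor=subnet, clamp=1.2)
    inn.append(Fm.PermuteRandom)
```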