Tips & Tricks, FAQ

  • Stochastic gradient descent will not work (well) for INNs. Use e.g. Adam instead.

  • Gradient clipping can be useful if you are experiencing training instabilities, e.g. using torch.nn.utils.clip_grad_norm_.

  • Add some slight noise to the inputs (on the order of 1e-2). This stabilizes training and prevents sparse gradients if there are quantized or perfectly correlated input dimensions (see the training sketch after this list).

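Taken together, these points lead to a training loop roughly like the following minimal sketch. The INN construction and data here (the fully connected subnet, the 8-dimensional SequenceINN, the random data_loader) are dummy stand-ins so the snippet runs on its own; the loss is the standard maximum-likelihood objective of a normalizing flow with a standard-normal latent.

```python
import torch
import FrEIA.framework as Ff
import FrEIA.modules as Fm

# Dummy stand-ins so the sketch is self-contained; replace with your own
# INN and data loader in practice.
subnet = lambda c_in, c_out: torch.nn.Sequential(
    torch.nn.Linear(c_in, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, c_out))
inn = Ff.SequenceINN(8)
for _ in range(4):
    inn.append(Fm.AllInOneBlock, subnet_constructor=subnet)
data_loader = [torch.rand(64, 8) for _ in range(10)]

# Adam instead of plain SGD
optimizer = torch.optim.Adam(inn.parameters(), lr=1e-3)

for x in data_loader:
    optimizer.zero_grad()

    # slight input noise (order of 1e-2) against quantized or
    # perfectly correlated input dimensions
    x = x + 1e-2 * torch.randn_like(x)

    # SequenceINN returns the latent code and the log-Jacobian determinant
    z, log_jac_det = inn(x)

    # maximum-likelihood loss of a normalizing flow with a
    # standard-normal latent (up to an additive constant)
    loss = (0.5 * torch.sum(z**2, dim=1) - log_jac_det).mean()
    loss.backward()

    # gradient clipping against training instabilities
    torch.nn.utils.clip_grad_norm_(inn.parameters(), max_norm=10.0)
    optimizer.step()
```
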
For coupling blocks in particular:

  • Use Xavier initialization for the weights. This prevents unstable training at the start (see the sketch after this list).

  • If your network is very deep (>30 coupling blocks), initialize the last layer in the subnetworks to zero. This means the INN as a whole is initialized to the identity, and you will not get NaNs at the first iteration.

  • Do not forget permutations/orthogonal transforms between coupling blocks.

  • Keep the subnetworks shallow (2-3 layers only), but wide (>= 128 neurons / >= 64 conv. channels).

  • Keep in mind that one coupling block contains between 4 and 12 individual convolutions or fully connected layers, so you may not need as many blocks as you think; otherwise the number of parameters will be huge.

  • This being said, as the coupling blocks initialize to roughly the identity transform, it is hard to have too many coupling blocks and break the training completely (as opposed to a standard feed-forward NN).

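A minimal sketch of how the initialization and permutation advice above can look in code, using FrEIA's SequenceINN and AllInOneBlock with fully connected subnetworks. The input dimension (8), hidden width (128) and block count (12) are arbitrary placeholders.

```python
import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

def subnet_fc(dims_in, dims_out):
    # shallow (2 layers) but wide (128 neurons) subnetwork
    net = nn.Sequential(nn.Linear(dims_in, 128), nn.ReLU(),
                        nn.Linear(128, dims_out))
    for layer in net:
        if isinstance(layer, nn.Linear):
            # Xavier initialization against unstable training at the start
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)
    # zero-init the last layer, so each coupling block (and hence the
    # whole INN) starts out as roughly the identity and produces no NaNs
    nn.init.zeros_(net[-1].weight)
    return net

inn = Ff.SequenceINN(8)   # e.g. 8-dimensional inputs (placeholder)
for _ in range(12):
    # AllInOneBlock already includes a random permutation / soft
    # orthogonal transform between consecutive coupling blocks
    inn.append(Fm.AllInOneBlock, subnet_constructor=subnet_fc,
               permute_soft=True)
```
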
For convolutional INNs in particular:

  • Perform some kind of reshaping early, so the INN has more than 3 channels to work with.

  • Coupling blocks using 1x1 convolutions in the subnets seem important for the quality; they should constitute every other, or every third, coupling block (see the sketch below).
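
A minimal sketch of a convolutional INN following the two points above. The input shape (3x32x32), channel width (64) and block count are arbitrary placeholders, and IRevNetDownsampling is one possible choice for the early reshaping.

```python
import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

def subnet_conv_3x3(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, c_out, 3, padding=1))

def subnet_conv_1x1(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU(),
                         nn.Conv2d(64, c_out, 1))

# e.g. 3 x 32 x 32 RGB inputs (placeholder shape)
inn = Ff.SequenceINN(3, 32, 32)

# reshape early, so the INN works with 12 channels instead of 3
inn.append(Fm.IRevNetDownsampling)

for k in range(8):
    # every third coupling block uses 1x1 convolutions in its subnet
    subnet = subnet_conv_1x1 if k % 3 == 2 else subnet_conv_3x3
    inn.append(Fm.AllInOneBlock, subnet_constructor=subnet)
```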