Tips & Tricks, FAQ
- Stochastic gradient descent will not work (well) for INNs; use an adaptive optimizer such as Adam instead (a training-loop sketch covering the points in this list follows below).
- Gradient clipping can be useful if you are experiencing training instabilities, e.g. using torch.nn.utils.clip_grad_norm_
- Add some slight noise to the inputs (on the order of 1E-2). This stabilizes training and prevents sparse gradients if some input dimensions are quantized or perfectly correlated.
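The sketch below combines these three points in a minimal training loop. It assumes an invertible network `inn` whose forward pass returns both the latent output and the log-determinant of the Jacobian, a data loader `train_loader`, and a standard maximum-likelihood objective; these names and the learning rate are placeholders, not part of the tips above.

```python
import torch

# Hypothetical setup: `inn` is your invertible network, `train_loader` yields input batches.
optimizer = torch.optim.Adam(inn.parameters(), lr=1e-3)

for x in train_loader:
    optimizer.zero_grad()

    # Slight input noise (order of 1E-2) against quantized or perfectly correlated dimensions
    x_noisy = x + 1e-2 * torch.randn_like(x)

    # Assumed forward interface: latent code z and log |det J|
    z, log_jac_det = inn(x_noisy)

    # Example objective: negative log-likelihood under a standard normal latent
    loss = (0.5 * torch.sum(z**2, dim=1) - log_jac_det).mean()
    loss.backward()

    # Gradient clipping against occasional training instabilities
    torch.nn.utils.clip_grad_norm_(inn.parameters(), max_norm=10.0)
    optimizer.step()
```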
For coupling blocks in particular:
- Use Xavier initialization for the weights. This prevents unstable training at the start. 
- If your network is very deep (>30 coupling blocks), initialize the last layer in the subnetworks to zero. This means the INN as a whole is initialized to the identity, and you will not get NaNs at the first iteration. 
- Do not forget permutations/orthogonal transforms between coupling blocks. 
- Keep the subnetworks shallow (2-3 layers only), but wide (>= 128 neurons / >= 64 conv. channels); see the subnet constructor sketch after this list.
- Keep in mind that a single coupling block contains between 4 and 12 individual convolutions or fully connected layers, so you may not need as many blocks as you think; otherwise the number of parameters becomes huge.
- That said, because coupling blocks are initialized to roughly the identity transform, it is hard to break training completely by using too many of them (in contrast to a standard feed-forward NN).
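As an illustration of the initialization and subnetwork advice above, here is a sketch of a shallow but wide fully connected subnetwork constructor in plain PyTorch. The `(dims_in, dims_out)` signature follows the usual subnet-constructor convention for coupling blocks; the layer width and the function name are just example choices.

```python
import torch.nn as nn

def subnet_fc(dims_in, dims_out):
    """Shallow (2 hidden layers) but wide (128 neurons) subnetwork for a coupling block."""
    net = nn.Sequential(
        nn.Linear(dims_in, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, dims_out),
    )
    # Xavier initialization to prevent unstable training at the start
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)
    # For very deep INNs (>30 coupling blocks): zero the last layer, so every
    # coupling block (and hence the whole INN) starts out as the identity
    nn.init.zeros_(net[-1].weight)
    return net
```

Between the coupling blocks themselves, remember to insert random permutations or (soft) orthogonal transforms; most INN frameworks provide these as ready-made modules.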
For convolutional INNs in particular:
- Perform some kind of reshaping early, so the INN has >3 channels to work with 
- Coupling blocks using 1x1 convolutions in the subnetworks seem important for quality; they should make up every second or third coupling block (a sketch of both points follows below).
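A minimal sketch of both convolutional points, in the same subnet-constructor style as above. The channel widths, kernel choices, and the use of `nn.PixelUnshuffle` to illustrate the channel-increasing reshape are assumptions; inside an actual INN, the reshaping should be performed by an invertible downsampling/squeeze block of your framework.

```python
import torch.nn as nn

# Channel-increasing reshape: (B, 3, H, W) -> (B, 12, H/2, W/2), so the INN has
# more than 3 channels to work with. This module only illustrates the reshape
# itself; inside the INN, use your framework's invertible downsampling block.
squeeze = nn.PixelUnshuffle(2)

def subnet_conv3x3(c_in, c_out):
    """Subnetwork with 3x3 convolutions for the 'spatial' coupling blocks."""
    return nn.Sequential(
        nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, c_out, 3, padding=1),
    )

def subnet_conv1x1(c_in, c_out):
    """Subnetwork with 1x1 convolutions; use for every second or third coupling block."""
    return nn.Sequential(
        nn.Conv2d(c_in, 64, 1), nn.ReLU(),
        nn.Conv2d(64, c_out, 1),
    )
```

When stacking coupling blocks, alternate the two constructors so that every second or third block uses the 1x1 variant.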