Tips & Tricks, FAQ
===============================

* Stochastic gradient descent will not work (well) for INNs. Use e.g. Adam instead.
* Gradient clipping can be useful if you are experiencing training instabilities, e.g. use ``torch.nn.utils.clip_grad_norm_`` (see the training-loop sketch at the end of this section).
* Add some slight noise to the inputs (on the order of 1e-2). This stabilizes training and prevents sparse gradients if some input dimensions are quantized or perfectly correlated.

For coupling blocks in particular:

* Use Xavier initialization for the weights. This prevents unstable training at the start.
* If your network is very deep (> 30 coupling blocks), initialize the last layer of each subnetwork to zero. The INN as a whole then starts out as the identity, and you will not get NaNs at the first iteration (see the subnet constructor sketch at the end of this section).
* Do not forget permutations/orthogonal transforms between coupling blocks.
* Keep the subnetworks shallow (2-3 layers only), but wide (>= 128 neurons / >= 64 conv. channels).
* Keep in mind that one coupling block contains between 4 and 12 individual convolutions or fully connected layers, so you may not need as many blocks as you think; otherwise the number of parameters will become huge.
* That said, because coupling blocks initialize to roughly the identity transform, it is hard to break training completely by stacking too many of them (as opposed to a standard feed-forward NN).

For convolutional INNs in particular:

* Perform some kind of reshaping early on, so the INN has more than 3 channels to work with.
* Coupling blocks whose subnets use 1x1 convolutions seem important for the quality of the results; they should make up every other, or every third, coupling block (see the convolutional sketch at the end of this section).
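
To make the first three tips concrete, here is a minimal training-loop sketch combining Adam, slight input noise, and gradient clipping. It assumes the ``SequenceINN``/``AllInOneBlock`` interface from the FrEIA README; the toy 8-dimensional random data, the layer widths, and the clipping threshold of 10 are illustrative choices, not recommendations from this section.

.. code-block:: python

    import torch
    import torch.nn as nn
    import FrEIA.framework as Ff
    import FrEIA.modules as Fm

    def subnet_fc(c_in, c_out):
        # shallow (2 layers) but wide (128 neurons) subnetwork
        return nn.Sequential(nn.Linear(c_in, 128), nn.ReLU(),
                             nn.Linear(128, c_out))

    # small toy INN on 8-dimensional inputs
    inn = Ff.SequenceINN(8)
    for _ in range(4):
        inn.append(Fm.AllInOneBlock, subnet_constructor=subnet_fc, permute_soft=True)

    # use Adam rather than plain SGD
    optimizer = torch.optim.Adam(inn.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(256, 8)              # stand-in for a real data batch
        x = x + 1e-2 * torch.randn_like(x)   # slight input noise, order 1e-2

        optimizer.zero_grad()
        z, log_jac_det = inn(x)

        # standard maximum-likelihood loss for normalizing flows
        loss = (0.5 * torch.sum(z**2, dim=1) - log_jac_det).mean()
        loss.backward()

        # gradient clipping against occasional training instabilities
        torch.nn.utils.clip_grad_norm_(inn.parameters(), max_norm=10.0)
        optimizer.step()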
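
The coupling-block tips about initialization can be bundled into the subnet constructor itself. A sketch, assuming FrEIA's usual two-argument ``(c_in, c_out)`` ``subnet_constructor`` convention; the width of 128 is just the value suggested above.

.. code-block:: python

    import torch.nn as nn

    def subnet_fc(c_in, c_out):
        # shallow (2 layers) but wide (128 neurons) subnetwork
        net = nn.Sequential(nn.Linear(c_in, 128), nn.ReLU(),
                            nn.Linear(128, c_out))
        for layer in net:
            if isinstance(layer, nn.Linear):
                # Xavier initialization prevents unstable training at the start
                nn.init.xavier_uniform_(layer.weight)
                nn.init.zeros_(layer.bias)
        # zero the last layer, so each coupling block starts as roughly the identity
        # (recommended for very deep INNs, > 30 coupling blocks)
        nn.init.zeros_(net[-1].weight)
        return net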
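
For the convolutional tips, a possible sketch that reshapes early (Haar downsampling from 3 to 12 channels) and lets every third coupling block use 1x1 convolutions in its subnet. It assumes that ``Fm.HaarDownsampling`` can be appended to a ``Ff.SequenceINN`` as shown; the depth of 8 blocks and the hidden width of 64 channels are illustrative.

.. code-block:: python

    import torch.nn as nn
    import FrEIA.framework as Ff
    import FrEIA.modules as Fm

    def subnet_conv_3x3(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(64, c_out, 3, padding=1))

    def subnet_conv_1x1(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU(),
                             nn.Conv2d(64, c_out, 1))

    # toy RGB input of shape 3 x 32 x 32
    inn = Ff.SequenceINN(3, 32, 32)

    # reshape early: Haar downsampling turns 3 x 32 x 32 into 12 x 16 x 16,
    # so the coupling blocks have more than 3 channels to work with
    inn.append(Fm.HaarDownsampling)

    for k in range(8):
        # every third coupling block uses 1x1 convolutions in its subnet
        subnet = subnet_conv_1x1 if k % 3 == 2 else subnet_conv_3x3
        inn.append(Fm.AllInOneBlock, subnet_constructor=subnet)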