Adding another entry to the debate about whether batch normalization should be applied before or after the non-linear activation:
In addition to the original paper using batch normalization before the activation, the Deep Learning book by Goodfellow, Bengio, and Courville (section 8.7.1) gives some reasoning for why applying batch normalization after the activation (i.e. to the input of the next layer) may cause issues:
It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
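For concreteness, here is a minimal sketch of that recommended ordering, assuming a tf.keras Sequential model (the layer sizes and input shape are made up for illustration): the affine transform without its bias, then batch normalization, then the nonlinearity.

```python
import tensorflow as tf

# Batch normalization *before* the activation, as the quoted passage describes:
# the Dense layer's bias is dropped because it is redundant with BN's beta parameter.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, use_bias=False, input_shape=(784,)),  # XW, no +b
    tf.keras.layers.BatchNormalization(),   # normalize XW, then apply gamma (scale) and beta (shift)
    tf.keras.layers.Activation("relu"),     # nonlinearity applied to the normalized values
    tf.keras.layers.Dense(10, activation="softmax"),
])
```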
In other words, if we use a ReLU activation, all negative values are mapped to zero. The mean gets pulled toward zero because all the clipped values sit exactly at zero, but the distribution of the remaining data is heavily skewed to the right, with a large spike at zero and a long positive tail. Trying to normalize that data into a nice bell-shaped curve with a simple shift and scale probably won't give the best results. For activations outside of the ReLU family this may not be as big of an issue.
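A quick toy check (not from the original answer; plain NumPy with an arbitrary sample size) shows what ReLU does to roughly Gaussian pre-activations:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.normal(size=100_000)          # roughly Gaussian pre-activations
relu_out = np.maximum(pre_act, 0.0)         # ReLU clips every negative value to zero

print(relu_out.mean())                      # ~0.40: pulled toward zero by the clipped half
print((relu_out == 0).mean())               # ~0.50: half the values sit exactly at zero

# Standardizing still leaves a large spike (the former zeros) plus a long right
# tail, so the result is far from a bell-shaped curve.
standardized = (relu_out - relu_out.mean()) / relu_out.std()
```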
Keep in mind that there are reports of models getting better results when using batch normalization after the activation, while others get the best results with batch normalization before the activation. It is probably best to test your model in both configurations, and if batch normalization after the activation gives a significant decrease in validation loss, use that configuration instead.
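If you want to run that comparison, a sketch of the "after activation" variant (again assuming tf.keras, with the same made-up layer sizes) looks like this; train both models on the same data and compare validation loss:

```python
import tensorflow as tf

# Batch normalization *after* the activation; the bias is kept because BN no
# longer directly follows the affine transform.
alt_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```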