When doing mathematical modeling, it is desirable that your functions are continuous and have well-defined gradients. This is because many numerical methods use a function’s gradient information in order to converge faster. However, not all mathematical functions have a gradient everywhere. We shall see how to deal with such functions.
For example, max(a,b) doesn’t have a defined gradient at a = b. However, we can use log(exp(a) + exp(b)), often called the soft maximum, to get a smooth approximation of the max function. There is one problem with this definition, though: it is numerically unstable when a or b is large, because exp overflows. We can resolve this instability using the following fact:
log( exp(a - k) + exp(b - k) ) = log( exp(a) + exp(b) ) - k.
Rearranging, for any k:
log( exp(a) + exp(b) ) = log( exp(a - k) + exp(b - k) ) + k
Choosing k to be the larger of the two arguments makes the largest exponent exp(0) = 1, so this form cannot overflow and is numerically stable. If you are wondering how to model the min function, min(a,b), we just reuse the soft maximum above with negated arguments, since min(a,b) = -max(-a,-b).
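To make the shift trick concrete, here is a minimal Python sketch. The function names naive_soft_max, soft_max and soft_min are made up for this illustration, and the shift k is assumed to be max(a, b); treat it as one possible way to write this rather than a reference implementation.

```python
import math


def naive_soft_max(a, b):
    # Direct translation of log(exp(a) + exp(b)).
    # math.exp overflows once its argument gets large (around 710 for doubles).
    return math.log(math.exp(a) + math.exp(b))


def soft_max(a, b):
    # Shift both arguments by k = max(a, b) (an assumed choice of k),
    # so the largest exponent evaluated is exp(0) = 1.
    k = max(a, b)
    return k + math.log(math.exp(a - k) + math.exp(b - k))


def soft_min(a, b):
    # Soft minimum via the identity min(a, b) = -max(-a, -b).
    return -soft_max(-a, -b)
```

For moderate inputs the two versions agree, but naive_soft_max(1000.0, 0.0) raises an OverflowError, while soft_max(1000.0, 0.0) returns a finite value.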
The closely related softmax function has numerical stability issues as well. See Eli Bendersky’s blog for softmax in the context of neural networks, or more specifically, for what happens when dividing an exponential by a sum of exponentials.
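The same shift idea carries over to dividing an exponential by a sum of exponentials. The sketch below is not taken from the blog post mentioned above; the function name softmax and the choice of subtracting the largest input are assumptions made for illustration. Subtracting the same constant from every input leaves the ratio unchanged, because the common factor cancels between numerator and denominator.

```python
import math


def softmax(xs):
    # Subtract the largest input (an assumed choice of shift) so that
    # every exponent is at most exp(0) = 1 and nothing overflows.
    k = max(xs)
    exps = [math.exp(x - k) for x in xs]
    total = sum(exps)
    # Each output is exp(x_i - k) / sum_j exp(x_j - k).
    return [e / total for e in exps]
```

With this version, an input such as [1000.0, 1000.0] produces finite values, whereas computing math.exp(1000.0) directly would overflow.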
For more on numerical stability issues, see What Every Computer Scientist Should Know About Floating-Point Arithmetic.