Mish: A Self Regularized Non-Monotonic Neural Activation Function

Review of paper by Diganta Misra, 2019

This paper presents a new neural network activation function and shows with a number of examples that it often increases the accuracy of deep networks.

Note

This paper was suggested to me by the author. I decided to review it because it seems interesting and potentially useful to many people, assuming that its assertions are accurate.

What can we learn from this paper?

That a properly chosen activation function can be important for achieving the best neural network accuracy.

A specific activation function is introduced that is claimed to improve upon the current state of the art in deep learning.

Prerequisites (to understand the paper, what does one need to be familiar with?)

  • General understanding of neural networks
  • Activation functions

Motivation

To increase the accuracy of neural networks by making a small, easy change to their code.

Results

Activation functions are used in neural networks to transform the weighted sum of inputs for each neuron into its output. Activation functions are generally nonlinear, which makes it possible for the network to model complex systems.

A great number of activation functions have been suggested in the literature. A recent review of the history of activation functions in deep learning can be found here. The most common functions used in the inner layers of deep networks are

  • ReLU: \(f(x) = \max(0, x)\)
  • tanh: \(f(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}\)
  • sigmoid: \(f(x) = \dfrac{1}{1 + e^{-x}}\)

and their variations.
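For reference, all three are one-liners in code; here is a plain NumPy sketch of my own (applied element-wise to an array), not code from any of the papers discussed:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), taken element-wise
    return np.maximum(0.0, x)

def tanh(x):
    # tanh: (e^x - e^{-x}) / (e^x + e^{-x}); NumPy provides it directly
    return np.tanh(x)

def sigmoid(x):
    # sigmoid: 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))
```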

In a 2017 Google Brain paper, a new compound activation function named Swish was proposed, which is defined as \(f(x) = x \cdot \mathrm{sigmoid}(x) = x \cdot \left(1 - \dfrac{1}{1 + e^x}\right)\). This function largely mimics the ReLU function but is smooth, which facilitates gradient calculations.
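As a quick illustration of the definition above (again a NumPy sketch of my own, not code from the Swish paper), Swish is also a one-liner:

```python
import numpy as np

def swish(x):
    # Swish: x * sigmoid(x) = x / (1 + e^{-x}),
    # equivalently x * (1 - 1/(1 + e^x)); smooth and ReLU-like for large |x|
    return x / (1.0 + np.exp(-x))
```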

In this paper, a new function called Mish is suggested: \(f(x) = x \cdot \tanh(\mathrm{softplus}(x)) = x \cdot \left(1 - \dfrac{1}{1 + e^x + e^{2x}/2}\right)\). The difference from Swish is the second-order term in the denominator. The author admits that he does not know why this small change should improve training, but apparently it does. In the paper, he describes a number of experiments where using Mish results in a small but noticeable and consistent improvement in accuracy. The disadvantage is that it is a bit slower, but tolerably so (perhaps about 10% slower than Swish).
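To see how small a code change this actually is, here is a minimal sketch of Mish as a drop-in PyTorch module; this is my own illustration based on the formula above, not the author's reference implementation (the official code is in the GitHub repository linked below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: f(x) = x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# Example: swap the activation in an otherwise unchanged model,
# i.e. use Mish() wherever nn.ReLU() would normally go.
model = nn.Sequential(
    nn.Linear(784, 256),
    Mish(),
    nn.Linear(256, 10),
)
```

Swapping only the activation leaves the rest of the training code untouched, which is what makes the experiment cheap to run.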

If you are interested in getting maximum performance out of your deep neural network, why not give this function a try and see if it improves the results for your task? If you do try it, please feel free to leave a comment below and let me know how it worked out for you!

Original paper link

Github repository

Further reading
