
Computing -ln(ε) in NumPy returns relatively small values like this:

import numpy as np

print(-np.log(np.finfo(np.float32).eps))
print(-np.log(np.finfo(np.float64).eps))

Output:

15.942385
36.04365338911715

Compare this with -log2(ε), which has a greater range than the base-e version:

print(-np.log2(np.finfo(np.float16).eps))
print(-np.log2(np.finfo(np.float32).eps))
print(-np.log2(np.finfo(np.float64).eps))

Output:

10.0
23.0
52.0

So, why don't softmax/sigmoid functions use 2 instead of e as the base for the exponential function? Using base 2 for the logarithm when converting back to logits would give more precision than base e in IEEE 754 representation.

3 Answers


The base of the logarithm changes the range, but it does not necessarily affect the precision.

There is the change-of-base formula:

$\log_b(x) = \log(x) \cdot \frac{1}{\log(b)}$

where $b$ is the base.

So the choice of base affects the range: larger bases produce smaller values.
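
A quick NumPy check of the change-of-base identity (variable names here are just illustrative):

import numpy as np

eps32 = np.float64(np.finfo(np.float32).eps)   # 2**-23

# Change of base: log2(x) = ln(x) / ln(2)
print(np.log2(eps32))                # -23.0
print(np.log(eps32) / np.log(2.0))   # ≈ -23.0, computed via base e

# The base-e and base-2 results differ only by the constant factor ln(2):
print(np.log(eps32))                 # ≈ -15.9424
print(np.log2(eps32) * np.log(2.0))  # same value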

Also, computing gradients with base-e logarithms and exponentials is easier than with other bases, and gradients are needed for backpropagation.
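
Concretely, for a general base $b$ the exponential picks up an extra constant factor in its derivative, which disappears only for $b = e$:

$\frac{d}{dx} b^x = b^x \ln b, \qquad \frac{d}{dx} e^x = e^x$

so with base $e$ the factor $\ln b = 1$ drops out of every gradient expression.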

allo

IEEE 754 provides sufficient precision for most ML tasks even with base $e$, and the choice in softmax and sigmoid functions is rooted in mathematical and practical considerations.

Notice that the function $e^x$ has the unique property that its derivative is itself, which simplifies gradient calculations during backpropagation of the common cross-entropy (CE) loss. Changing the base to $2$ doesn't gain much precision, but it complicates existing mathematical formulations and optimization computations in machine learning, potentially in many places. Therefore the softmax and sigmoid functions, often used in probability and information theory, typically use the natural logarithm and the exponential function with base $e$ in practice, as in the CE loss implemented in PyTorch. In information-theoretic discussions, base $2$ is often preferred because it yields entropy in bits, as in Shannon's original paper on entropy.
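
As a small NumPy sketch (the function names are just for illustration), a "base-2 softmax" is the ordinary softmax applied to logits scaled by $\ln 2$, since $2^z = e^{z \ln 2}$, so changing the base only rescales the logits:

import numpy as np

def softmax_base_e(z):
    # standard softmax, shifted by the max for numerical stability
    z = z - np.max(z)
    ez = np.exp(z)
    return ez / ez.sum()

def softmax_base_2(z):
    # hypothetical variant using 2**z instead of e**z
    z = z - np.max(z)
    tz = np.exp2(z)
    return tz / tz.sum()

z = np.array([1.5, -0.3, 2.0, 0.1])
print(softmax_base_2(z))
print(softmax_base_e(z * np.log(2.0)))  # same probabilities, up to rounding

That constant rescaling of the logits is something the preceding layer's weights can absorb during training, so nothing is gained by switching bases.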

cinch

$\log_2$ results in larger numbers than $\ln$, but that doesn't make them more precise. Floating point arithmetic has the same relative precision for numbers of any size.
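
A quick way to see this in NumPy: np.spacing(x) gives the gap to the next representable float (one ULP), and that gap relative to the value stays around machine epsilon regardless of magnitude:

import numpy as np

for x in [15.942385, 36.04365338911715, 23.0, 52.0]:
    x = np.float64(x)
    ulp = np.spacing(x)        # distance to the next float64 value
    print(x, ulp, ulp / x)     # relative spacing ~1e-16 in every case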

Tomek Czajka