
In the paper Hopfield Networks is All You Need, the authors mention that their modern Hopfield network layers are a good replacement for pooling, GRU, LSTM, and attention layers, and tend to outperform them on various tasks.

I understand that they show the layers can store an exponential number of vectors, but that should still be worse than attention layers, which can attend to parts of an arbitrary-length input sequence.

Also, in their paper they briefly allude to the Neural Turing Machine and related memory-augmented architectures, but do not comment on how the two approaches compare.

Has anyone studied how these layers improve performance over pooling and attention layers, and is there any comparison between replacing layers with Hopfield layers and augmenting networks with external memory such as Neural Turing Machines?

Edit 29 Jan 2020: I believe my intuition that the attention mechanism should outperform Hopfield layers was wrong, as I was considering the Hopfield layer that uses an input vector as the query $R (\approx Q)$ and the stored patterns $Y$ as both the keys $K$ and the values $V$. In that case my assumption was that the Hopfield layer would be limited by its storage capacity, while the attention mechanism has no such constraint.

However, the authors do mention that the input $Y$ may be modified so that the layer ingests two extra inputs, one for the keys and one for the values. I believe in that case it would perform a Hopfield-network mapping rather than attention, and I do not know how the two compare. (Figure: Hopfield layer vs. attention analogy.)
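To spell out the analogy I am referring to, as I understand the paper's construction (this is my paraphrase, not a quote): applying the continuous update rule once to projected inputs, with query patterns $R$, key/value patterns $Y$, and learned projections $W_Q$, $W_K$, $W_V$, gives

$$ Z = \text{softmax}\!\left( \beta \, (R W_Q)(Y W_K)^T \right) Y W_V, $$

which for $\beta = 1/\sqrt{d_k}$ is transformer attention with $Q = R W_Q$, $K = Y W_K$, $V = Y W_V$. Supplying the keys and values from two different inputs is what gives the more general layer mentioned above.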

Rijul Gupta

1 Answer


I will try to formulate my understanding of the ideas in this paper, mention my own concerns that seem relevant to your question, and see if we can identify any confusions along the way that might clarify the issue.

In eq. (6) of the relevant blog post, they identify the weight matrix of a discrete, binary Hopfield network as

$$ \boldsymbol{W} = \sum_{i=1}^{N} x_i x_i^T $$

with $N$ raw stored patterns, which are retrieved by iterating an initial guess $\xi$ under the following update rule:

$$ \xi_{t+1} = \text{sgn}( \boldsymbol{W} \xi_{t} - b ) $$
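To make the classical rule concrete, here is a minimal NumPy sketch of my own (not code from the paper; the dimensions, the number of flipped bits, and the choice $b = 0$ are arbitrary) that stores a few binary patterns in $\boldsymbol{W}$ and retrieves one from a corrupted cue:

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 64, 3                                # pattern dimension, number of stored patterns
X = rng.choice([-1.0, 1.0], size=(d, N))    # binary patterns stored as columns

# Hebbian weight matrix W = sum_i x_i x_i^T, with self-connections zeroed and bias b = 0
W = X @ X.T
np.fill_diagonal(W, 0.0)

# Start from a corrupted copy of the first pattern and iterate xi <- sgn(W xi)
xi = X[:, 0].copy()
flipped = rng.choice(d, size=10, replace=False)
xi[flipped] *= -1.0

for _ in range(5):
    xi = np.where(W @ xi >= 0, 1.0, -1.0)

print("recovered first pattern:", bool(np.array_equal(xi, X[:, 0])))
```

With only a handful of stored patterns relative to the dimension, a few iterations should recover the original pattern exactly.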

Now, for the paper in question, the update rule they propose for the generalization to continuous states is (eq. 22 of the original paper):

$$ \xi_{t+1} = \boldsymbol{X} \text{softmax}( \beta \boldsymbol{X}^T \xi_{t} ) $$

where $\boldsymbol{X} = (x_1, x_2, \dots , x_N )$ is the matrix whose columns are the stored patterns.
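For comparison, a similarly minimal sketch of the continuous update rule above (again my own illustration; $\beta$, the dimensions, and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

d, N = 64, 50                      # pattern dimension, number of stored patterns
X = rng.standard_normal((d, N))    # continuous stored patterns as columns
beta = 8.0                         # inverse temperature (arbitrary choice here)

def retrieve(xi, steps=3):
    """Iterate the update rule xi <- X softmax(beta X^T xi)."""
    for _ in range(steps):
        a = beta * (X.T @ xi)      # similarity of the state to every stored pattern
        a = np.exp(a - a.max())    # numerically stable softmax
        xi = X @ (a / a.sum())     # convex combination of the stored patterns
    return xi

# Query with a noisy version of one stored pattern
target = X[:, 7]
xi = retrieve(target + 0.3 * rng.standard_normal(d))
cos = target @ xi / (np.linalg.norm(target) * np.linalg.norm(xi))
print("cosine similarity to the target pattern:", float(cos))
```

Note that the iteration never touches a fixed-size weight matrix: every step multiplies by the full pattern matrix $\boldsymbol{X}$.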

The first substantial difference I see is that in the binary case all the weights of the network are encoded in the matrix $\boldsymbol{W}$, so the storage cost is constant regardless of how many patterns are actually stored in it. In contrast, in the continuous generalized rule the matrix $\boldsymbol{X}$ grows linearly with the number of stored patterns; in fact, it keeps all the stored patterns directly. Given their update rule, there does not seem to be a way around storing and keeping the entire set of patterns, and the update rule appears to only find a "best fit" among them, using the scaled dot product of the attention mechanism. I still think I might be missing something important here.
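To put rough numbers on that point (my own back-of-the-envelope arithmetic, not figures from the paper): for pattern dimension $d$ and $N$ stored patterns, $\boldsymbol{W}$ always takes $d^2$ entries while $\boldsymbol{X}$ takes $dN$. With $d = 64$, that is a fixed $64^2 = 4096$ entries for the binary network, versus $64N$ for the continuous rule, e.g. $640{,}000$ entries once $N = 10{,}000$ patterns are stored.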

lurscher