
This question is based on this feature dependencies problem.

$G_{x,y,z}$ is a rank-3 tensor where:

  • $x$: number of samples.
  • $y$: number of features.
  • $z$: number of embedding dimensions.

If I train a simple attention-layer architecture where the input and the target output are identical, namely $G_{x,y,z}$, then my hypothesis is that the attention layer's weights form the adjacency matrix I am expecting in the referenced question.

This is supported by the attention layer's shape argument, which expects the shape (batch_size, Tq, Tv), at least in Keras 3. Considering the analogies:

  • batch_size is $x$
  • Tq is $y$
  • Tv is $z$.

The adjacency matrix must therefore have the shape $y \times y$ (a minimal sketch of this shape argument follows).
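As a concrete check of this shape argument, here is a minimal sketch (the sizes $x=8$, $y=5$, $z=16$ are my own hypothetical values, not from the referenced problem): feeding $G$ as both query and value to keras.layers.Attention makes $T_q = T_v = y$, so the returned attention score matrix is $y \times y$ for each sample.

```python
import numpy as np
import keras

# Hypothetical sizes for G_{x,y,z}: x samples, y features, z embedding dims.
x, y, z = 8, 5, 16
G = np.random.rand(x, y, z).astype("float32")

attn = keras.layers.Attention()

# Self-attention over the feature axis: query = value = G, so Tq = Tv = y
# and the scores come out as one y×y matrix per sample.
output, scores = attn([G, G], return_attention_scores=True)

print(output.shape)  # (8, 5, 16) -> same shape as G
print(scores.shape)  # (8, 5, 5)  -> one y×y score matrix per sample
```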

So, is my hypothesis correct?

Which weight has the shape $y \times y$?

1 Answer


An adjacency matrix is typically a square matrix where the entry at position $(i,j)$ indicates the relationship (or weight) between node $i$ and node $j$. Your hypothesis makes sense: the attention weight (score) matrix in this case could act as the adjacency matrix of an undirected graph over the corresponding input nodes, provided that it is a symmetric, non-causal attention (otherwise it is a lower-triangular matrix). The learned attention weight matrix will indeed have shape $y \times y$ for each sample in the mini-batch, as it represents the dependencies between the $y$ nodes. These dependencies are learned from the input node data, and the output gives you quantitative relation strengths between each pair of nodes in a symmetric way.
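To illustrate the above in code (a hedged sketch; all sizes and variable names are my own assumptions, not from the question): the model below is trained with the same tensor $G$ as both input and target, and the per-sample $y \times y$ attention score matrix is read out afterwards. Note that a plain keras.layers.Attention layer stores no $y \times y$ weight matrix of its own (with use_scale=True its only trainable weight is a scalar scale), so the $y \times y$ object you are after is the attention weight (score) matrix, not a stored layer weight.

```python
import numpy as np
import keras

# Hypothetical sizes for G_{x,y,z}.
x, y, z = 8, 5, 16
G = np.random.rand(x, y, z).astype("float32")

# Self-attention over the y features; use_scale adds the layer's only trainable weight.
inp = keras.Input(shape=(y, z))
attn = keras.layers.Attention(use_scale=True)
out = attn([inp, inp])

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")
model.fit(G, G, epochs=5, verbose=0)  # input and target are both G, as hypothesized

# Read out the attention scores: one y×y candidate adjacency matrix per sample.
_, scores = attn([G, G], return_attention_scores=True)
A = keras.ops.convert_to_numpy(scores)[0]
print(A.shape)  # (5, 5)
```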

If you want to go further and model the non-symmetric adjacency matrix of a directed graph, you need to apply an additional mask to the above symmetric attention score matrix during training, or use the causal attention score matrix from a causally masked self-attention mechanism if your graph is a pure DAG without hybrid edges.
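A minimal sketch of the causal variant (assuming the $y$ nodes are already indexed in a topological order; sizes are again my own hypothetical values): passing use_causal_mask=True in the layer call masks out scores above the diagonal, so each per-sample score matrix is lower triangular and can be read as a directed adjacency matrix.

```python
import numpy as np
import keras

x, y, z = 8, 5, 16
G = np.random.rand(x, y, z).astype("float32")

attn = keras.layers.Attention()

# Causal (lower-triangular) attention: node i can only attend to nodes j <= i.
_, scores = attn([G, G], return_attention_scores=True, use_causal_mask=True)

A = keras.ops.convert_to_numpy(scores)[0]
print(np.allclose(np.triu(A, k=1), 0.0))  # True: no edges above the diagonal
```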

cinch