
I am currently reading a paper about negative sampling in knowledge graphs, and ran into the objective function of KG representation learning. There are two nodes $u, v$ that compose a triple, and there are negative samples of $u$ given $v$, denoted $\bar{u}$. We optimize the objective function by maximizing the similarity of positive samples (the left side of the $+$) and minimizing the similarity of negative samples (the right side of the $+$).

What I don't understand is how the 2nd line of the equation below becomes the 3rd line.

$$
\begin{align}
J^{(v)} &= \mathbb{E}_{u \sim P_{d}(u|v)} \log \sigma(\mathbf{u}^{\top}\mathbf{v}) + \mathbb{E}_{\bar{u} \sim P_{d}(\bar{u}|v)} \log \sigma(-\bar{\mathbf{u}}^{\top}\mathbf{v}) \\
&= \sum_{u} P_{d}(u|v) \log \sigma(\mathbf{u}^{\top}\mathbf{v}) + k \sum_{\bar{u}} P_{n}(\bar{u}|v) \log \sigma(-\bar{\mathbf{u}}^{\top}\mathbf{v}) \\
&= \sum_{u} \left[ P_{d}(u|v) \log \sigma(\mathbf{u}^{\top}\mathbf{v}) + k P_{n}(u|v) \log \sigma(-\mathbf{u}^{\top}\mathbf{v}) \right] \\
&= \sum_{u} \left[ P_{d}(u|v) \log \sigma(\mathbf{u}^{\top}\mathbf{v}) + k P_{n}(u|v) \log \left( 1 - \sigma(\mathbf{u}^{\top}\mathbf{v}) \right) \right]
\end{align}
$$

From what I understand, the negative distribution $P_{n}(\bar{u}|v)$ is defined over the negative sample set, which doesn't contain the positive sample $u$. And even if it did, the values of $P_{n}(\bar{u}|v)$ and $P_{n}(u|v)$ wouldn't be the same.

1 Answer


In many negative sampling approaches, the expectation over negatives is approximated by drawing $k$ negative samples per positive sample from a negative, or more accurately, noise distribution $P_n(u|v)$ instead of the data distribution $P_d(\bar u|v)$, and then rewriting the expectation as a weighted sum over all possible samples. It's true that the negative distribution $P_n(u|v)$ appears to be defined over the negative sample set only; however, it is common to define $P_n(u|v)$ over the entire set of samples. See Mikolov et al.'s 2013 paper "Distributed Representations of Words and Phrases and their Compositionality", which first introduced negative sampling for NLP:

NCE posits that a good model should be able to differentiate data from noise by means of logistic regression... While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define Negative sampling (NEG) by the objective... Thus the task is to distinguish the target word $w_O$ from draws from the noise distribution $P_n(w)$ using logistic regression, where there are $k$ negative samples for each data sample. Our experiments indicate that values of $k$ in the range $5–20$ are useful for small training datasets, while for large datasets the $k$ can be as small as $2–5$. Both NCE and NEG have the noise distribution $P_n(w)$ as a free parameter.
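To make the approximation concrete, here is a minimal sketch, not taken from either paper, using toy random embeddings and hypothetical names (`emb`, `v`, `noise_dist`, `k`). It assumes a degree-based noise distribution defined over the whole node set (analogous to word2vec's unigram$^{3/4}$ distribution), and shows that summing $\log\sigma(-\bar{\mathbf{u}}^{\top}\mathbf{v})$ over $k$ draws from $P_n$ is a Monte-Carlo estimate of the exact weighted sum $k\sum_{\bar u} P_n(\bar u|v)\log\sigma(-\bar{\mathbf{u}}^{\top}\mathbf{v})$ appearing in the 2nd line of the question's derivation:

```python
# Minimal sketch (toy data): k negatives drawn from a noise distribution
# P_n defined over the WHOLE node set approximate the exact weighted sum
# used in the 2nd line of the question's derivation.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, k = 1000, 16, 5

emb = rng.normal(size=(n_nodes, dim))   # candidate node embeddings u
v = rng.normal(size=dim)                # embedding of the anchor node v

# Noise distribution over ALL nodes, e.g. proportional to degree^(3/4),
# analogous to the unigram^(3/4) distribution used in word2vec.
degree = rng.integers(1, 50, size=n_nodes).astype(float)
noise_dist = degree ** 0.75
noise_dist /= noise_dist.sum()          # P_n(u | v); support = every node

def log_sigmoid(x):
    # numerically stable log(sigma(x))
    return -np.logaddexp(0.0, -x)

# Monte-Carlo term: draw k negatives and sum their log sigma(-u^T v).
neg = rng.choice(n_nodes, size=k, p=noise_dist)
mc_term = log_sigmoid(-(emb[neg] @ v)).sum()

# Exact term from the 2nd line: k * sum_u P_n(u|v) * log sigma(-u^T v).
exact_term = k * np.sum(noise_dist * log_sigmoid(-(emb @ v)))

print(f"Monte-Carlo estimate: {mc_term:.3f}")
print(f"Exact weighted sum:   {exact_term:.3f}")
```

With only $k = 5$ draws the estimate is of course noisy; averaging over many positive pairs, or taking the expectation as the derivation in the question does, recovers the exact weighted-sum form.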

The noise distribution for negative sampling is typically derived directly from the training data's frequency statistics, such as unigram frequency in NLP, or node weights based on properties like degree or importance in KGs. The 3rd line can therefore be derived from the 2nd line mainly because, under such a noise distribution, the sum over all negative samples can be rewritten as a sum over all samples. In addition, the probability mass that the noise distribution places on positive samples is small enough to be negligible in practice compared to the mass on negative samples, given the empirically useful range of the hyperparameter $k$ quoted above.
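Spelling that step out: once $P_n$ is defined over the entire sample set, the dummy index $\bar{u}$ in the 2nd line ranges over exactly the same values as $u$, so it can simply be renamed and the two sums merged term by term:

$$
\sum_{u} P_{d}(u|v)\,\log\sigma(\mathbf{u}^{\top}\mathbf{v}) + k\sum_{\bar{u}} P_{n}(\bar{u}|v)\,\log\sigma(-\bar{\mathbf{u}}^{\top}\mathbf{v})
\;=\; \sum_{u}\Bigl[ P_{d}(u|v)\,\log\sigma(\mathbf{u}^{\top}\mathbf{v}) + k P_{n}(u|v)\,\log\sigma(-\mathbf{u}^{\top}\mathbf{v}) \Bigr]
$$

The 4th line then follows from the identity $\sigma(-x) = 1 - \sigma(x)$.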

cinch