Most Popular
1500 questions
22
votes
1 answer
What is the intuition behind the dot product attention?
I am watching the video Attention Is All You Need by Yannic Kilcher.
My question is: what is the intuition behind the dot product attention?
$$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$
becomes:
$$A(Q, K, V) = \text{softmax}(QK^T)V$$
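One way to see that the two forms agree is to compute both on a toy example. The following is a minimal pure-Python sketch, not taken from the video or the paper: the helper names (`softmax`, `dot`, `attend`, `attend_batch`) and the toy matrices are assumptions made for illustration. It follows the unscaled formula quoted above; the attention in Vaswani et al. additionally divides the scores by $\sqrt{d_k}$.

```python
import math

def softmax(xs):
    # Subtracting the max is only for numerical stability; it cancels out
    # in the ratio, so the result equals e^{x_i} / sum_j e^{x_j}.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, K, V):
    """Single-query form: sum_i softmax(q . k_i) * v_i."""
    weights = softmax([dot(q, k) for k in K])
    dim = len(V[0])
    return [sum(w * v[d] for w, v in zip(weights, V)) for d in range(dim)]

def attend_batch(Q, K, V):
    """Matrix form softmax(Q K^T) V, with the softmax taken row by row."""
    scores = [[dot(q, k) for k in K] for q in Q]   # Q K^T
    weights = [softmax(row) for row in scores]     # one distribution per query
    dim = len(V[0])
    return [[sum(w[i] * V[i][d] for i in range(len(V))) for d in range(dim)]
            for w in weights]

# Toy data (hypothetical): two keys, two values, two queries.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
Q = [[5.0, 0.0],   # strongly aligned with the first key
     [0.0, 0.0]]   # equally (un)aligned with both keys

out = attend_batch(Q, K, V)
```

In this sketch the first query returns almost exactly the first value, because its dot product with the first key dominates the softmax, while the second query returns the plain average of the two values. Each row of the matrix form is the single-query sum applied to one row of $Q$.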
DRV
- 1,843
- 3
- 15
- 20
22
votes
3 answers
Is a dystopian surveillance state computationally possible?
This isn't really a conspiracy theory question. It's more of an inquiry into global computational power and data storage logistics.
Most recording instruments such as cameras and microphones are typically voluntary opt-in devices, in that,…
Harrison Tran
- 329
- 2
- 6
22
votes
2 answers
Why would you implement the position-wise feed-forward network of the transformer with convolution layers?
The Transformer model introduced in "Attention is all you need" by Vaswani et al. incorporates a so-called position-wise feed-forward network (FFN):
In addition to attention sub-layers, each of the layers in our encoder
and decoder contains a…
Eli Korvigo
- 321
- 1
- 2
- 6
22
votes
1 answer
Has the Lovelace Test 2.0 been successfully used in an academic setting?
In October 2014, Dr. Mark Riedl published an approach to testing AI intelligence, called the "Lovelace Test 2.0", after being inspired by the original Lovelace Test (published in 2001). Mark believed that the original Lovelace Test would be…
Left SE On 10_6_19
- 1,670
- 10
- 23
22
votes
3 answers
Why doesn't Q-learning converge when using function approximation?
The tabular Q-learning algorithm is guaranteed to find the optimal $Q$ function, $Q^*$, provided the following conditions (the Robbins-Monro conditions) regarding the learning rate are satisfied
$\sum_{t} \alpha_t(s, a) = \infty$
$\sum_{t}…
nbro
- 42,615
- 12
- 119
- 217
21
votes
3 answers
How can we process the data from both the true distribution and the generator?
I'm struggling to understand the GAN loss function as provided in Understanding Generative Adversarial Networks (a blog post written by Daniel Seita).
In the standard cross-entropy loss, we have an output that has been run through a sigmoid function…
tryingtolearn
- 395
- 1
- 2
- 10
21
votes
3 answers
What kind of word embedding is used in the original transformer?
I am currently trying to understand transformers.
To start, I read Attention Is All You Need and also this tutorial.
What I am wondering about is the word embedding used in the model. Is word2vec or GloVe being used? Are the word embeddings trained from…
Bert Gayus
- 645
- 1
- 5
- 12
21
votes
1 answer
What is the number of neurons required to approximate a polynomial of degree n?
I learned about the universal approximation theorem from this guide. It states that a network even with a single hidden layer can approximate any function within some bound, given a sufficient number of neurons. Or mathematically, ${|g(x)−f(x)|<…
mark mark
- 813
- 6
- 25
21
votes
2 answers
What is the "Hello World" problem of Reinforcement Learning?
As we all know, "Hello World" is usually the first program that any programmer learns/implements in any language/framework.
Aurélien Géron mentioned in his book that MNIST is often called the Hello World of Machine Learning. Is there any "Hello…
Arpit-Gole
- 404
- 3
- 9
21
votes
1 answer
What is a fully convolutional network?
I was surveying some literature related to Fully Convolutional Networks and came across the following phrase,
A fully convolutional network is achieved by replacing the parameter-rich fully connected layers in standard CNN architectures by…
r4bb1t
- 345
- 1
- 3
- 8
21
votes
2 answers
What are the main differences between skip-gram and continuous bag of words?
Skip-gram and continuous bag of words (CBOW) are two different types of word2vec models.
What are the main differences between them? What are the pros and cons of both methods?
DRV
- 1,843
- 3
- 15
- 20
21
votes
2 answers
Problems that only humans will ever be able to solve
With the increasing complexity of reCAPTCHA, I wondered whether there is some problem that only a human will ever be able to solve (or that AI won't be able to solve as long as it doesn't exactly reproduce the human brain).
For instance, the…
Marc Perlade
- 313
- 1
- 6
20
votes
1 answer
How does LSTM in deep reinforcement learning differ from experience replay?
In the paper Deep Recurrent Q-Learning for Partially Observable MDPs, the author processed the Atari game frames with an LSTM layer at the end. My questions are:
How does this method differ from the experience replay, as they both use past…
Kevin. Fang
- 373
- 1
- 2
- 7
20
votes
4 answers
What is the difference between actor-critic and advantage actor-critic?
I'm struggling to understand the difference between actor-critic and advantage actor-critic.
At least, I know they are different from asynchronous advantage actor-critic (A3C), as A3C adds an asynchronous mechanism that uses multiple worker agents…
Blaszard
- 1,097
- 4
- 11
- 25
20
votes
3 answers
How do I choose the best algorithm for a board game like checkers?
So far, I have considered only three algorithms, namely, minimax, alpha-beta pruning, and Monte Carlo tree search (MCTS). Apparently, both the alpha-beta pruning and MCTS are…
JoeyB
- 487
- 1
- 6
- 15