Most Popular

1500 questions

22 votes • 1 answer

What is the intuition behind the dot product attention?

I am watching the video Attention Is All You Need by Yannic Kilcher. My question is: what is the intuition behind the dot product attention? $$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$ becomes: $$A(Q, K, V) = \text{softmax}(QK^T)V$$
DRV • 1,843 • 3 • 15 • 20
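
A minimal NumPy sketch of that softmax-weighted sum (illustrative only; it follows the formula as quoted and omits the $1/\sqrt{d_k}$ scaling the paper adds): each query's dot products with the keys become softmax weights, and the output is the corresponding weighted average of the values.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T) V row-wise: each query attends over all keys."""
    scores = Q @ K.T                                 # (num_queries, num_keys) dot products
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # convex combination of the value vectors

# Illustrative shapes: 2 queries, 4 key/value pairs, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(dot_product_attention(Q, K, V).shape)          # (2, 8)
```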

22 votes • 3 answers

Is a dystopian surveillance state computationally possible?

This isn't really a conspiracy theory question. It's more of an inquiry into global computational power and data storage logistics. Most recording instruments, such as cameras and microphones, are typically voluntary opt-in devices, in that,…
Harrison Tran • 329 • 2 • 6
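
A back-of-envelope calculation makes the scale of the storage side concrete. The population, bitrate, and hours-per-day figures below are purely illustrative assumptions of mine, not claims from the question:

```python
# Rough, purely illustrative estimate of storing compressed audio/video
# for continuous surveillance of a large population. All numbers are assumptions.
population    = 1_000_000_000   # assumed number of monitored people
hours_per_day = 16              # assumed recorded waking hours per person per day
bitrate_mbps  = 1               # assumed compressed A/V bitrate in megabits per second

bytes_per_person_per_day = hours_per_day * 3600 * bitrate_mbps * 1e6 / 8
total_bytes_per_day = population * bytes_per_person_per_day
print(f"{total_bytes_per_day / 1e18:.1f} exabytes per day")   # ~7.2 EB/day under these assumptions
```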

22 votes • 2 answers

Why would you implement the position-wise feed-forward network of the transformer with convolution layers?

The Transformer model introduced in "Attention is all you need" by Vaswani et al. incorporates a so-called position-wise feed-forward network (FFN): In addition to attention sub-layers, each of the layers in our encoder and decoder contains a…
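
The paper itself notes that the FFN can be described either as two linear transformations with a ReLU in between or as two convolutions with kernel size 1; a kernel-size-1 convolution applies the same linear map independently at every position, which is exactly what "position-wise" means. A minimal PyTorch sketch (illustrative shapes, not the authors' code) showing the two formulations compute the same function:

```python
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 512, 2048, 10

# Formulation 1: two position-wise linear layers, FFN(x) = max(0, x W1 + b1) W2 + b2
ffn_linear = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

# Formulation 2: the same mapping written as two 1D convolutions with kernel size 1
ffn_conv = nn.Sequential(nn.Conv1d(d_model, d_ff, kernel_size=1), nn.ReLU(),
                         nn.Conv1d(d_ff, d_model, kernel_size=1))

# Copy the weights so both branches are the identical function
with torch.no_grad():
    ffn_conv[0].weight.copy_(ffn_linear[0].weight.unsqueeze(-1))
    ffn_conv[0].bias.copy_(ffn_linear[0].bias)
    ffn_conv[2].weight.copy_(ffn_linear[2].weight.unsqueeze(-1))
    ffn_conv[2].bias.copy_(ffn_linear[2].bias)

x = torch.randn(1, seq_len, d_model)                    # (batch, positions, features)
out_linear = ffn_linear(x)
out_conv = ffn_conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, positions)
print(torch.allclose(out_linear, out_conv, atol=1e-5))  # True: kernel-size-1 convs act per position
```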

22 votes • 1 answer

Has the Lovelace Test 2.0 been successfully used in an academic setting?

In October 2014, Dr. Mark Riedl published an approach to testing AI intelligence, called the "Lovelace Test 2.0", after being inspired by the original Lovelace Test (published in 2001). Mark believed that the original Lovelace Test would be…
Left SE On 10_6_19 • 1,670 • 10 • 23

22 votes • 3 answers

Why doesn't Q-learning converge when using function approximation?

The tabular Q-learning algorithm is guaranteed to find the optimal $Q$ function, $Q^*$, provided the following conditions (the Robbins-Monro conditions) regarding the learning rate are satisfied: $\sum_{t} \alpha_t(s, a) = \infty$, $\sum_{t}…
nbro • 42,615 • 12 • 119 • 217
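
As a concrete illustration of the Robbins-Monro conditions (my own sketch, not part of the question): a per-state-action schedule such as $\alpha_t(s,a) = 1/N_t(s,a)$, where $N_t(s,a)$ counts visits, satisfies both conditions, since the harmonic series diverges while $\sum_n 1/n^2$ converges. A minimal tabular update using that schedule:

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-function, keyed by (state, action)
N = defaultdict(int)     # visit counts used for the learning-rate schedule
gamma = 0.99

def q_update(s, a, r, s_next, actions):
    """One tabular Q-learning step with alpha_t(s, a) = 1 / N_t(s, a)."""
    N[(s, a)] += 1
    alpha = 1.0 / N[(s, a)]   # sum of alphas diverges, sum of squared alphas converges
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy usage with made-up states and actions
q_update(s="s0", a="right", r=1.0, s_next="s1", actions=["left", "right"])
```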

21 votes • 3 answers

How can we process the data from both the true distribution and the generator?

I'm struggling to understand the GAN loss function as provided in Understanding Generative Adversarial Networks (a blog post written by Daniel Seita). In the standard cross-entropy loss, we have an output that has been run through a sigmoid function…
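
For reference, the discriminator's objective in the standard GAN is exactly a binary cross-entropy with label 1 for real samples and label 0 for generated ones, $-\big[\mathbb{E}\log D(x) + \mathbb{E}\log(1 - D(G(z)))\big]$. A small NumPy sketch of that loss (illustrative, not the blog post's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(logits_real, logits_fake):
    """Binary cross-entropy with labels 1 (real) and 0 (fake): -[E log D(x) + E log(1 - D(G(z)))]."""
    d_real = sigmoid(logits_real)   # D's probability that real samples are real
    d_fake = sigmoid(logits_fake)   # D's probability that generated samples are real
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

# Toy logits: D is fairly confident about both batches, so the loss is small
print(discriminator_loss(np.array([2.0, 1.5]), np.array([-1.0, -2.0])))
```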

21 votes • 3 answers

What kind of word embedding is used in the original transformer?

I am currently trying to understand transformers. To start, I read Attention Is All You Need and also this tutorial. What makes me wonder is the word embedding used in the model. Is word2vec or GloVe being used? Are the word embeddings trained from…
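
In the original paper the token embeddings are not pre-trained word2vec or GloVe vectors; they are learned from scratch jointly with the rest of the model, shared with the pre-softmax projection, and multiplied by $\sqrt{d_{model}}$. A minimal PyTorch sketch of such a learned embedding layer (sizes roughly follow the base model but are illustrative):

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model = 37000, 512            # roughly the base model's BPE vocabulary and width

embed = nn.Embedding(vocab_size, d_model)   # trained jointly with the rest of the network
tokens = torch.tensor([[5, 42, 7]])         # a batch of one sequence of token ids
x = embed(tokens) * math.sqrt(d_model)      # the paper scales embeddings by sqrt(d_model)
print(x.shape)                              # torch.Size([1, 3, 512])
```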

21 votes • 1 answer

What is the number of neurons required to approximate a polynomial of degree n?

I learned about the universal approximation theorem from this guide. It states that even a network with a single hidden layer can approximate any function within some bound, given a sufficient number of neurons. Or mathematically, $|g(x)−f(x)| <…
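
For reference, the single-hidden-layer form of the theorem (as usually stated, following Cybenko and Hornik) says that for any continuous $f$ on a compact set $K$ and any $\varepsilon > 0$ there is a finite sum of the form below that stays within $\varepsilon$ of $f$ on all of $K$; it guarantees existence but gives no bound on the number of neurons $N$, which is exactly what the question asks about.

$$g(x) = \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^\top x + b_i\right), \qquad \sup_{x \in K} \lvert g(x) - f(x) \rvert < \varepsilon$$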

21 votes • 2 answers

What is the "Hello World" problem of Reinforcement Learning?

As we all know, "Hello World" is usually the first program that any programmer learns/implements in any language/framework. As Aurélien Géron mentioned in his book, MNIST is often called the Hello World of Machine Learning; is there any "Hello…
Arpit-Gole • 404 • 3 • 9

21 votes • 1 answer

What is a fully convolution network?

I was surveying some literature related to Fully Convolutional Networks and came across the following phrase, A fully convolutional network is achieved by replacing the parameter-rich fully connected layers in standard CNN architectures by…
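
A minimal PyTorch sketch of the idea (illustrative only): the dense classifier head of a small CNN is replaced by a 1×1 convolution, so the network contains no fully connected layers, accepts inputs of any spatial size, and produces a spatial map of class scores rather than a single vector.

```python
import torch
import torch.nn as nn

num_classes = 10

# Fully convolutional classifier: no nn.Linear layers anywhere
fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),   # 1x1 conv replaces the fully connected head
)

print(fcn(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10, 32, 32])
print(fcn(torch.randn(1, 3, 64, 48)).shape)   # torch.Size([1, 10, 64, 48]); any input size works
```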

21 votes • 2 answers

What are the main differences between skip-gram and continuous bag of words?

The skip-gram and continuous bag of words (CBOW) are two different types of word2vec models. What are the main differences between them? What are the pros and cons of both methods?
DRV • 1,843 • 3 • 15 • 20
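
The core difference is the direction of the prediction: skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from its (averaged) context. A small sketch of how the two models slice the same sentence into training pairs (illustrative, not the original word2vec code):

```python
def training_pairs(tokens, window=2):
    """Build (input, target) pairs for skip-gram and CBOW from the same context windows."""
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        skipgram += [(center, c) for c in context]   # skip-gram: centre word -> each context word
        cbow.append((context, center))               # CBOW: whole context -> centre word
    return skipgram, cbow

sg, cb = training_pairs("the quick brown fox jumps".split())
print(sg[:3])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
print(cb[0])    # (['quick', 'brown'], 'the')
```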

21 votes • 2 answers

Problems that only humans will ever be able to solve

With the increasing complexity of reCAPTCHA, I wondered about the existence of some problem that only a human will ever be able to solve (or that AI won't be able to solve as long as it doesn't reproduce the human brain exactly). For instance, the…

20 votes • 1 answer

How does LSTM in deep reinforcement learning differ from experience replay?

In the paper Deep Recurrent Q-Learning for Partially Observable MDPs, the authors processed the Atari game frames with an LSTM layer at the end. My questions are: How does this method differ from experience replay, as they both use past…
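
A rough sketch of the kind of recurrent Q-network the paper describes: convolutional features per frame feed an LSTM over the frame sequence, which then yields Q-values. The layer sizes below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Conv features per frame -> LSTM over the frame sequence -> Q-value per action."""
    def __init__(self, num_actions, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden, batch_first=True)
        self.q = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):
        b, t = frames.shape[:2]                                # (batch, time, 1, 84, 84)
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, state = self.lstm(feats, state)                   # memory integrates past frames
        return self.q(out), state                              # Q-values at every timestep

net = RecurrentQNetwork(num_actions=4)
q_values, _ = net(torch.randn(2, 5, 1, 84, 84))
print(q_values.shape)                                          # torch.Size([2, 5, 4])
```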

20 votes • 4 answers

What is the difference between actor-critic and advantage actor-critic?

I'm struggling to understand the difference between actor-critic and advantage actor-critic. At least, I know they are different from asynchronous advantage actor-critic (A3C), as A3C adds an asynchronous mechanism that uses multiple worker agents…
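
For reference, the distinction usually comes down to the weight multiplying the score function in the policy-gradient update: a plain actor-critic weights it by the critic's action value, while the advantage version subtracts the state value as a baseline, $A(s,a) = Q(s,a) - V(s)$, to reduce variance without changing the expected gradient:

$$\nabla_\theta J \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big] \quad\text{vs.}\quad \nabla_\theta J \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q_w(s,a) - V_w(s)\big)\big]$$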

20 votes • 3 answers

How do I choose the best algorithm for a board game like checkers?

How do I choose the best algorithm for a board game like checkers? So far, I have considered only three algorithms, namely, minimax, alpha-beta pruning, and Monte Carlo tree search (MCTS). Apparently, both the alpha-beta pruning and MCTS are…
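
A compact sketch of minimax with alpha-beta pruning over a generic game interface; the `legal_moves`, `apply`, `evaluate`, and `game_over` hooks are hypothetical placeholders you would implement for checkers, not part of any library.

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing):
    """Depth-limited minimax with alpha-beta pruning; returns the position's value."""
    if depth == 0 or state.game_over():
        return state.evaluate()   # heuristic score from the maximizing player's point of view
    if maximizing:
        value = -math.inf
        for move in state.legal_moves():
            value = max(value, alphabeta(state.apply(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:     # the opponent will never allow this branch: prune
                break
        return value
    else:
        value = math.inf
        for move in state.legal_moves():
            value = min(value, alphabeta(state.apply(move), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value
```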