Most Popular

1500 questions
8 votes, 2 answers

How does AlphaZero's MCTS work when starting from the root node?

From the AlphaGo Zero paper, during MCTS, the statistics for each new node are initialized as follows: $\{N(s_L, a) = 0,\ W(s_L, a) = 0,\ Q(s_L, a) = 0,\ P(s_L, a) = p_a\}$. The PUCT algorithm for selecting the best child node is $a_t = \operatorname{argmax}(Q(s,a) +…
sb3 • 167 • 1 • 7
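
For readers skimming this listing, here is a minimal Python sketch of the PUCT selection rule the excerpt quotes (the per-child statistics dictionary and the constant `c_puct` are illustrative assumptions, not taken from the question):

```python
import math

def puct_select(children, c_puct=1.0):
    """Pick the action maximizing Q(s,a) + U(s,a), where
    U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    `children` maps each action to its {N, W, Q, P} statistics."""
    total_visits = sum(child["N"] for child in children.values())
    best_action, best_score = None, float("-inf")
    for action, child in children.items():
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1 + child["N"])
        if child["Q"] + u > best_score:
            best_action, best_score = action, child["Q"] + u
    return best_action
```

Note that at a freshly expanded root, where every $N(s,a)$ is zero, the square-root term vanishes and the exploration bonus is zero for all actions, which is exactly the corner case the question title asks about.
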
8 votes, 1 answer

Validation accuracy higher than training accuracy

I implemented a U-Net in TensorFlow for the segmentation of MRI images of the thigh. I noticed that I always get a slightly higher validation accuracy than training accuracy, independently of the initial split. One example: So I researched when this could be…
Lis Louise • 139 • 4
8 votes, 1 answer

Why is there a Uniform and Normal version of He / Xavier initialization in DL libraries?

Two of the most popular initialization schemes for neural network weights today are Xavier and He. Both methods propose random weight initialization with a variance dependent on the number of input and output units. Xavier proposes $$W \sim…
Tinu • 628 • 1 • 4 • 14
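
A rough sketch of how the uniform and normal variants typically coexist: both target the same variance (here the Xavier/Glorot value $2/(\text{fan\_in}+\text{fan\_out})$), and the uniform bound is scaled so a uniform draw has that same variance. Function and argument names below are illustrative.

```python
import numpy as np

def xavier_init(fan_in, fan_out, distribution="normal"):
    """Xavier/Glorot initialization with either a normal or a uniform draw.
    Both variants share the target variance 2 / (fan_in + fan_out)."""
    var = 2.0 / (fan_in + fan_out)
    if distribution == "normal":
        return np.random.normal(0.0, np.sqrt(var), size=(fan_in, fan_out))
    # A uniform distribution on [-a, a] has variance a**2 / 3,
    # so choosing a = sqrt(3 * var) gives the same spread.
    a = np.sqrt(3.0 * var)
    return np.random.uniform(-a, a, size=(fan_in, fan_out))
```
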
8 votes, 2 answers

Why is KL divergence used so often in Machine Learning?

The KL divergence is quite easy to compute in closed form for simple distributions (such as Gaussians), but it has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangular…
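
For context, the closed form the excerpt alludes to, for two univariate Gaussians, is
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$
and swapping the two arguments generally changes the value, which is the asymmetry the excerpt mentions.
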
8 votes, 1 answer

How is the DQN loss derived from (or theoretically motivated by) the Bellman equation, and how is it related to the Q-learning update?

I'm doing a project on Reinforcement Learning. I programmed an agent that uses DDQN. There are a lot of tutorials on that, so the code implementation was not that hard. However, I have problems understanding how one should come up with this kind of…
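
For orientation, the loss usually written for DQN (and, with decoupled action selection, for DDQN) treats the Bellman optimality target as a fixed regression target:
$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big],$$
where $\theta^-$ are the parameters of the target network, held constant while differentiating with respect to $\theta$.
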
8 votes, 1 answer

What is the cost function of a transformer?

The paper Attention Is All You Need describes the transformer architecture that has an encoder and a decoder. However, I wasn't clear on what the cost function to minimize is for such an architecture. Consider a translation task, for example, where…
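
As a rough sketch of the usual objective for such a translation setup: the decoder's per-position softmax is scored with token-level cross-entropy against the reference translation (shapes, names, and the padding id below are assumptions; the original paper additionally applies label smoothing):

```python
import torch.nn.functional as F

def translation_loss(decoder_logits, target_ids, pad_id=0):
    """Token-level cross-entropy for a seq2seq Transformer.
    decoder_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len), long."""
    vocab = decoder_logits.size(-1)
    return F.cross_entropy(
        decoder_logits.reshape(-1, vocab),  # flatten batch and time steps
        target_ids.reshape(-1),
        ignore_index=pad_id,                # do not penalize padded positions
    )
```
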
8 votes, 2 answers

What is the difference between the positional encoding techniques of the Transformer and GPT?

I know the original Transformer and GPT (1-3) use two slightly different positional encoding techniques. More specifically, in GPT they say the positional encoding is learned. What does that mean? OpenAI's papers don't go into much detail. How…
Leevo • 305 • 2 • 9
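
For readers comparing the two approaches, a small sketch of how they are commonly implemented (dimensions are illustrative): the original Transformer adds a fixed sinusoidal table, while GPT-style models add rows of an ordinary embedding table that are updated by backpropagation like any other weight.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len, d_model):
    """Fixed (non-learned) sinusoidal positional encoding."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # computed once, never trained

# Learned positional encoding (GPT-style): a trainable lookup table.
learned_pe = nn.Embedding(num_embeddings=1024, embedding_dim=512)
```
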
8 votes, 2 answers

Could there be existential threats to humanity due to AI?

We are doing research, spending hours figuring out how we can make real AI software (intelligent agents) work better. We are also trying to implement some applications, e.g. in business, health, and education, using AI technology. Nonetheless,…
quintumnia • 1,173 • 2 • 10 • 35
8 votes, 2 answers

How should we interpret this figure that relates the perceptron criterion and the hinge loss?

I am currently studying the textbook Neural Networks and Deep Learning by Charu C. Aggarwal. Section 1.2.1.2, Relationship with Support Vector Machines, says the following: The perceptron criterion is a shifted version of the hinge-loss used in…
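
For reference, with a label $y \in \{-1,+1\}$ and a linear score $\hat{y} = \bar{W}\cdot\bar{X}$, the two losses being related are commonly written as
$$L_{\text{perceptron}} = \max\{0,\, -y\,\hat{y}\}, \qquad L_{\text{hinge}} = \max\{0,\, 1 - y\,\hat{y}\},$$
so the hinge loss is the perceptron criterion shifted by a margin of one (this notation is a common convention, not necessarily the book's exact form).
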
8 votes, 2 answers

Why is the perceptron criterion function differentiable?

I'm reading Chapter 1 of the book Neural Networks and Deep Learning by Aggarwal. In Section 1.2.1.1, I'm learning about the perceptron. One thing the book says is that, if we use the sign function for the following loss function:…
8 votes, 1 answer

Is there a connection between the bias term in a linear regression model and the bias that can lead to under-fitting?

Here is a linear regression model $$y = mx + b,$$ where $b$ is known as the $y$-intercept, but also as the bias [1], $m$ is the slope, and $x$ is the feature vector. As I understand it, in machine learning there is also the bias that can cause the…
8 votes, 1 answer

Why is the learning rate generally beneath 1?

In all examples I've ever seen, the learning rate of an optimisation method is always less than $1$. However, I've never found an explanation as to why this is. In addition to that, there are some cases where having a learning rate bigger than 1 is…
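
A toy illustration of the overshooting intuition behind this question (purely a sketch; the actual threshold depends on the objective's curvature): gradient descent on $f(x)=x^2$ with too large a step scales the iterate by a factor of magnitude greater than one and diverges.

```python
def gradient_descent(lr, steps=20, x=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2x."""
    for _ in range(steps):
        x = x - lr * 2 * x   # each step scales x by (1 - 2 * lr)
    return x

print(gradient_descent(lr=0.4))  # |1 - 2*lr| < 1: shrinks toward the minimum at 0
print(gradient_descent(lr=1.1))  # |1 - 2*lr| > 1: the iterates blow up
```
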
8 votes, 1 answer

Which loss function should I use in REINFORCE, and what are the labels?

I understand that this is the update for the parameters of a policy in REINFORCE: $$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ is usually the discounted future reward and …
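
In practice, the update in the excerpt is usually obtained by minimizing a surrogate loss whose gradient matches it; a minimal PyTorch-style sketch (tensor names are assumptions):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Surrogate loss for REINFORCE: minimizing -log pi(a_t|s_t) * v_t
    performs gradient ascent on the policy objective. `log_probs` holds
    log pi(a_t|s_t) for each step of an episode; `returns` holds v_t."""
    return -(log_probs * returns.detach()).sum()
```

One common way to read this is that there is no label in the supervised sense: $v_t$ acts as a per-step weight on the log-likelihood of the actions that were actually taken.
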
8 votes, 1 answer

Why do the standard and deterministic Policy Gradient Theorems differ in their treatment of the derivatives of $R$ and the conditional probability?

I would like to understand the difference between the standard policy gradient theorem and the deterministic policy gradient theorem. These two theorems are quite different, although the only difference is whether the policy function is deterministic…
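
For reference, the usual statements being contrasted are
$$\nabla_\theta J(\theta) = \mathbb{E}_{s,\,a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\big]$$
for a stochastic policy, and
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\big]$$
for a deterministic policy $\mu_\theta$; the second moves the derivative onto the action through the Q-function because there is no action distribution to differentiate.
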
8 votes, 2 answers

What are some best practices when trying to design a reward function?

Generally speaking, is there a best-practice procedure to follow when trying to define a reward function for a reinforcement-learning agent? What common pitfalls are there when defining the reward function, and how should you avoid them? What…