For questions about AI theory that relies on knowledge of a probability distribution over one or more variables. Such a distribution may be given in discrete buckets, such as quartiles, octiles, or percentiles, or as a continuous function with a closed form (an algebraic formula). Probability distributions are key in planning, natural language processing, and other AI objectives.
Questions tagged [probability-distribution]
89 questions
8
votes
2 answers
Why is KL divergence used so often in Machine Learning?
The KL divergence is quite easy to compute in closed form for simple distributions, such as Gaussians, but it has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangular…
Federico Taschin
- 253
- 2
- 8
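For the question above, here is a minimal NumPy sketch of the closed-form KL divergence between two univariate Gaussians (the helper name `kl_gaussian` is illustrative, not from the answer); evaluating it with the arguments swapped shows the asymmetry the excerpt mentions.

```python
# Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians.
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL divergence between two univariate Gaussians, in closed form."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # KL(p || q) ~ 0.443
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))  # KL(q || p) ~ 1.307 -- a different value
```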
8
votes
1 answer
What are the main benefits of using Bayesian networks?
I have some trouble understanding the benefits of Bayesian networks.
Am I correct that the key benefit of the network is that one does not need to use the chain rule of probability in order to calculate joint distributions?
So, using the chain…
Sebastian Dine
- 181
- 1
7
votes
1 answer
What loss function to use when labels are probabilities?
What loss function is most appropriate when training a model with target values that are probabilities? For example, I have a 3-output model. I want to train it with a feature vector $x=[x_1, x_2, \dots, x_N]$ and a target $y=[0.2, 0.3, 0.5]$.
It…
Thomas Johnson
- 173
- 4
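One common choice for targets that are themselves probabilities, as in the question above, is cross-entropy against the soft target distribution (equivalent, up to an additive constant, to minimising KL between target and prediction). A hedged NumPy sketch with illustrative names:

```python
# Cross-entropy between a probability target and softmax(logits).
import numpy as np

def soft_cross_entropy(y_true, logits):
    """Cross-entropy of softmax(logits) against a target distribution y_true."""
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -(y_true * log_probs).sum()

y = np.array([0.2, 0.3, 0.5])        # target probabilities from the question
logits = np.array([0.1, 0.4, 1.0])   # hypothetical 3-output model logits
print(soft_cross_entropy(y, logits))
```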
5
votes
1 answer
Many of the best probabilistic models represent probability distributions only implicitly
I am currently studying Deep Learning by Goodfellow, Bengio, and Courville. In chapter 5.1.2 The Performance Measure, P, the authors say the following:
The choice of performance measure may seem straightforward and objective, but it is often…
The Pointer
- 611
- 5
- 22
5
votes
2 answers
What is a probability distribution in machine learning?
If we are learning or working in the machine learning field, then we frequently come across the term "probability distribution". I know what probability, conditional probability, and probability distribution/density mean in math, but what is its…
Eka
- 1,106
- 8
- 24
5
votes
1 answer
Why is the Jensen-Shannon divergence preferred over the KL divergence in measuring the performance of a generative network?
I have read articles on how the Jensen-Shannon divergence is preferred over the Kullback-Leibler divergence for measuring how well a distribution mapping is learned in a generative network, because JS-divergence better measures distribution similarity…
ashenoy
- 1,419
- 6
- 19
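For readers of the question above, a minimal NumPy sketch of the Jensen-Shannon divergence for discrete distributions (helper names are illustrative); it shows two properties often cited in this context: JS is symmetric and stays finite even when the supports differ, where KL would diverge.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions (0 log 0 = 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.5, 0.5, 0.0])
print(js(p, q), js(q, p))   # equal and finite, even though KL(q || p) diverges
```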
4
votes
1 answer
How can I make an MNIST digit recognizer that rejects out-of-distribution data?
I've done an MNIST digit recognition neural network.
When you put images in that are completely unlike its training data, it still tries to classify them as digits. Sometimes it strongly classifies nonsense data as being a specific digit.
I am…
river
- 143
- 6
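One simple (and admittedly imperfect) baseline for the rejection problem in the question above is to treat the maximum softmax probability as a confidence score and refuse to classify below a threshold. This is a hedged sketch, not the asker's network; the threshold value is an assumption.

```python
import numpy as np

def classify_or_reject(probs, threshold=0.9):
    """probs: softmax output of shape (10,) for one image; None means rejected."""
    if probs.max() < threshold:
        return None                      # reject as out-of-distribution
    return int(probs.argmax())

probs_in = np.array([0.01] * 9 + [0.91])   # confident digit prediction
print(classify_or_reject(probs_in))        # 9
probs_ood = np.full(10, 0.1)               # flat distribution for nonsense input
print(classify_or_reject(probs_ood))       # None -> rejected
```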
4
votes
1 answer
How does the VAE learn a joint distribution?
I found that the following paragraph from An Introduction to Variational Autoencoders sounds relevant, but I do not fully understand it.
A VAE learns stochastic mappings between an observed $\mathbf{x}$-space, whose empirical distribution…
a12345
- 243
- 1
- 7
4
votes
1 answer
Why do we sample vectors from a standard normal distribution for the generator?
I am new to GANs. I noticed that everybody generates a random vector (usually 100 dimensional) from a standard normal distribution $N(0, 1)$. My question is: why? Why don't they sample these vectors from a uniform distribution $U(0, 1)$? Does the…
dato nefaridze
- 882
- 10
- 22
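As a minimal sketch of the two sampling choices discussed in the question above (the 100-dimensional latent size follows the question; `generator` is a hypothetical model and appears only in a comment):

```python
import numpy as np

z_normal = np.random.randn(100)                # z ~ N(0, I), the common default
z_uniform = np.random.uniform(0.0, 1.0, 100)   # z ~ U(0, 1), also seen in practice
# fake_image = generator(z_normal)             # either vector could be fed to a GAN generator
```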
4
votes
2 answers
When should one prefer using Total Variational Divergence over KL divergence in RL
In RL, both the KL divergence (DKL) and Total variational divergence (DTV) are used to measure the distance between two policies. I'm most familiar with using DKL as an early stopping metric during policy updates to ensure the new policy doesn't…
mugoh
- 549
- 4
- 21
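For the question above, a hedged NumPy sketch computing both quantities for the discrete action distributions of two policies (helper names are illustrative):

```python
import numpy as np

def d_tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def d_kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions (0 log 0 = 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

pi_old = np.array([0.7, 0.2, 0.1])   # hypothetical action distributions
pi_new = np.array([0.6, 0.3, 0.1])
print(d_tv(pi_old, pi_new))          # 0.1
print(d_kl(pi_old, pi_new))          # ~0.027; Pinsker: D_TV <= sqrt(D_KL / 2)
```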
4
votes
1 answer
What is the difference between model and data distributions?
Is there any difference between the model distribution and data distribution, or are they the same?
Bhuwan Bhatt
- 404
- 2
- 13
4
votes
1 answer
In deep learning, do we learn a continuous distribution based on the training dataset?
At least at some level, maybe not always end-to-end, deep learning always learns a function, essentially a mapping from a domain to a range. The domain and range, at least in most cases, would be multivariate.
So, when a model learns a…
ashenoy
- 1,419
- 6
- 19
4
votes
1 answer
How are the parameters of the Bernoulli distribution learned?
In the paper Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, they learn a mask for the network by setting up the mask parameters as $M_i = \text{Bern}(\sigma(v_i))$, where $M$ is the parameter mask ($f(x;\theta, M) = f(x; M \odot \theta)$),…
mshlis
- 2,399
- 9
- 23
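A hedged PyTorch-style sketch of the construction described in the question above: each mask entry is sampled as $M_i \sim \text{Bern}(\sigma(v_i))$ and gradients reach $v_i$ through a straight-through estimator. This is an illustration, not the paper's exact code.

```python
import torch

v = torch.zeros(10, requires_grad=True)      # learnable mask logits v_i
probs = torch.sigmoid(v)                     # Bernoulli parameters sigma(v_i)
m_hard = torch.bernoulli(probs).detach()     # sampled 0/1 mask, no gradient path
m = m_hard + probs - probs.detach()          # straight-through: forward uses m_hard,
                                             # backward flows through sigma(v_i)
theta = torch.randn(10)                      # frozen network weights theta
effective_weights = m * theta                # corresponds to M ⊙ θ in the question
```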
4
votes
1 answer
What does $x,y \sim \hat{p}_{data}$ mean in the Deep Learning book by Goodfellow
In chapter 5 of the Deep Learning book by Ian Goodfellow, some notation in the loss function below really confuses me.
I tried to understand whether $x,y \sim p_{data}$ means a sample $(x, y)$ sampled from the original dataset distribution (or $y$ is the…
David Ng
- 143
- 4
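A worked spelling-out of the notation (not a quote from the book) may help readers of the question above: taking an expectation with $x, y \sim \hat{p}_{\text{data}}$ just means averaging over the $m$ training pairs drawn from the empirical distribution,
$$\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}}\, L\bigl(f(x;\theta), y\bigr) \;=\; \frac{1}{m} \sum_{i=1}^{m} L\bigl(f(x^{(i)};\theta),\, y^{(i)}\bigr),$$
where $L$ is a per-example loss and $f(x;\theta)$ the model's prediction (both symbols here are illustrative, not the book's exact choices).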
3
votes
0 answers
Relation between SDE diffusion and DDPM/DDIM
Background & Definitions
In DDPM, the diffusion backward step is described as follows (where $z\sim \mathcal{N}(0,I)$ and $x_{T}\sim \mathcal{N}(0,I)$):
and in DDIM we have
while in the SDE formulation (from the Fokker-Planck equation) the step…
snatchysquid
- 89
- 6