Questions tagged [probability-distribution]

For questions about AI theory that relies on knowledge of a probability distribution over one or more variables. Such a distribution may be discrete (for example, buckets following quartile, octile, or percentile conventions) or continuous, given by some closed-form (algebraic) density. Probability distributions are key in planning, natural language processing, and other AI objectives.

89 questions
8 votes • 2 answers

Why is KL divergence used so often in Machine Learning?

The KL divergence is quite easy to compute in closed form for simple distributions (such as Gaussians), but it has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangular…
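
For illustration (not part of the question), a minimal NumPy sketch of the closed-form KL divergence between two univariate Gaussians; the parameter values are made up, and swapping the arguments shows the asymmetry:

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Asymmetry: swapping the arguments gives a different value,
# which is one reason KL divergence is not a metric.
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # KL(P || Q) ~= 0.443
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))  # KL(Q || P) ~= 1.307
```
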
8 votes • 1 answer

What are the main benefits of using Bayesian networks?

I have some trouble understanding the benefits of Bayesian networks. Am I correct that the key benefit of the network is that one does not need to use the chain rule of probability in order to calculate joint distributions? So, using the chain…
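
As a hedged illustration of that point (the network structure and probability tables below are invented), a tiny chain-shaped Bayesian network A → B → C factorizes the joint into small local conditionals, which is the chain rule simplified by conditional-independence assumptions:

```python
# A tiny hand-made Bayesian network A -> B -> C over binary variables.
# The conditional probability tables are made-up numbers for illustration.
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},      # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},      # P(C | B)
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    """Joint probability from the network's local factors:
    P(A, B, C) = P(A) * P(B | A) * P(C | B).
    The full chain rule would need P(C | A, B); the network's
    independence assumption (C independent of A given B) removes it."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Sanity check: the joint sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(total)  # 1.0
```
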
7 votes • 1 answer

What loss function to use when labels are probabilities?

What loss function is most appropriate when training a model with target values that are probabilities? For example, I have a 3-output model. I want to train it with a feature vector $x=[x_1, x_2, \dots, x_N]$ and a target $y=[0.2, 0.3, 0.5]$. It…
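
One commonly suggested choice for probability-valued targets is cross-entropy against the soft target (equivalently, KL divergence up to a constant). A minimal NumPy sketch, with hypothetical logits; the target vector is the one from the question:

```python
import numpy as np

def soft_target_cross_entropy(logits, target_probs):
    """Cross-entropy H(target, softmax(logits)) for a probability-valued target.
    Minimizing it is equivalent to minimizing KL(target || prediction),
    since the two differ only by the target's (constant) entropy."""
    log_probs = logits - np.logaddexp.reduce(logits)  # numerically stable log-softmax
    return -np.sum(target_probs * log_probs)

logits = np.array([0.1, 0.4, 1.0])   # hypothetical 3-output model scores
target = np.array([0.2, 0.3, 0.5])   # probability target from the question
print(soft_target_cross_entropy(logits, target))
```
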
5 votes • 1 answer

Many of the best probabilistic models represent probability distributions only implicitly

I am currently studying Deep Learning by Goodfellow, Bengio, and Courville. In section 5.1.2, The Performance Measure, P, the authors say the following: The choice of performance measure may seem straightforward and objective, but it is often…
The Pointer • 611 • 5 • 22
5 votes • 2 answers

What is a probability distribution in machine learning?

If we are learning or working in the machine learning field, we frequently come across the term "probability distribution". I know what probability, conditional probability, and probability distributions/densities mean in mathematics, but what is its…
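
As a concrete, hypothetical illustration of the term in an ML setting: the empirical distribution of class labels in a dataset is itself a probability distribution (non-negative, sums to one):

```python
import numpy as np

# Hypothetical class labels from a small dataset; the empirical
# distribution assigns each label its relative frequency.
labels = np.array([0, 2, 1, 2, 2, 0, 1, 2])
values, counts = np.unique(labels, return_counts=True)
empirical_dist = counts / counts.sum()
print(dict(zip(values.tolist(), empirical_dist.tolist())))
# {0: 0.25, 1: 0.25, 2: 0.5}  -- non-negative and sums to 1
```
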
5 votes • 1 answer

Why is the Jensen-Shannon divergence preferred over the KL divergence in measuring the performance of a generative network?

I have read articles on how the Jensen-Shannon divergence is preferred over the Kullback-Leibler divergence for measuring how well a distribution mapping is learned in a generative network, because JS divergence better measures distribution similarity…
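
For reference, the Jensen-Shannon divergence can be written in terms of KL against the mixture $M = \tfrac{1}{2}(P + Q)$; a minimal NumPy sketch with made-up discrete distributions, showing it stays finite and symmetric even when the supports do not fully overlap:

```python
import numpy as np

def kl(p, q):
    """KL divergence for discrete distributions (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence: symmetric and always finite,
    unlike KL, which blows up where q is 0 but p is not."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up distributions with disjoint support on the outer bins:
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(js(p, q), js(q, p))   # finite and symmetric; kl(p, q) would be infinite
```
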
4 votes • 1 answer

How can I make an MNIST digit recognizer that rejects out-of-distribution data?

I've built an MNIST digit recognition neural network. When you feed it images that are completely unlike its training data, it still tries to classify them as digits. Sometimes it confidently classifies nonsense data as a specific digit. I am…
river • 143 • 6
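
One common baseline for this (maximum-softmax-probability thresholding, not necessarily what the asker ended up using) is to reject inputs on which the classifier is insufficiently confident. A hypothetical NumPy sketch; the logits and the 0.9 threshold are made up:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_or_reject(logits, threshold=0.9):
    """Return the predicted digit, or None if the maximum softmax
    probability falls below the threshold (treated as out-of-distribution)."""
    probs = softmax(logits)
    if probs.max() < threshold:
        return None        # reject: the model is not confident enough
    return int(probs.argmax())

# Hypothetical logits: confident on a digit vs. nearly uniform on nonsense input.
print(classify_or_reject(np.array([8.0, 0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 0.1, 0.0, 0.4])))  # 0
print(classify_or_reject(np.array([0.5, 0.4, 0.6, 0.5, 0.4, 0.5, 0.6, 0.4, 0.5, 0.5])))  # None
```
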
4 votes • 1 answer

How does the VAE learn a joint distribution?

I found the following paragraph from An Introduction to Variational Autoencoders sounds relevant, but I am not fully understanding it. A VAE learns stochastic mappings between an observed $\mathbf{x}$-space, whose empirical distribution…
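
A rough sketch of the idea (with a toy linear map standing in for the trained decoder network): the VAE's generative model defines a joint $p_\theta(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\, p_\theta(\mathbf{x} \mid \mathbf{z})$, which can be sampled ancestrally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained decoder: maps a latent z to the mean of p(x | z).
# In a real VAE this would be a neural network; here it is a fixed linear map.
W = rng.normal(size=(4, 2))

def sample_joint(n):
    """Ancestral sampling from the model's joint p(x, z) = p(z) p(x | z):
    draw z from the standard-normal prior, then x from the decoder's likelihood."""
    z = rng.normal(size=(n, 2))                        # z ~ N(0, I), the prior
    x_mean = z @ W.T                                   # decoder output, mean of p(x | z)
    x = x_mean + 0.1 * rng.normal(size=x_mean.shape)   # Gaussian observation noise
    return x, z

x, z = sample_joint(5)
print(x.shape, z.shape)   # (5, 4) (5, 2)
```
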
4 votes • 1 answer

Why do we sample vectors from a standard normal distribution for the generator?

I am new to GANs. I noticed that everybody generates a random vector (usually 100 dimensional) from a standard normal distribution $N(0, 1)$. My question is: why? Why don't they sample these vectors from a uniform distribution $U(0, 1)$? Does the…
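
For concreteness, a minimal NumPy sketch of the two sampling choices being compared; the 100-dimensional latent size follows the convention mentioned in the question:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100

# The usual choice: each coordinate drawn from a standard normal N(0, 1).
z_normal = rng.standard_normal(latent_dim)

# The alternative asked about: coordinates drawn uniformly from [0, 1).
z_uniform = rng.uniform(0.0, 1.0, size=latent_dim)

# One practical difference: the normal prior is centered at 0 and unbounded,
# while the uniform prior is confined to the unit hypercube.
print(z_normal.mean(), z_normal.std())
print(z_uniform.mean(), z_uniform.std())
```
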
4 votes • 2 answers

When should one prefer using total variation divergence over KL divergence in RL?

In RL, both the KL divergence ($D_{KL}$) and the total variation divergence ($D_{TV}$) are used to measure the distance between two policies. I'm most familiar with using $D_{KL}$ as an early-stopping metric during policy updates to ensure the new policy doesn't…
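
For discrete action distributions both quantities are straightforward to compute; a minimal NumPy sketch with made-up old/new policy probabilities at a single state:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete action distributions (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def tv(p, q):
    """Total variation distance: half the L1 distance, always in [0, 1]."""
    return 0.5 * np.sum(np.abs(p - q))

# Made-up action distributions of an old and a new policy at one state.
old_policy = np.array([0.7, 0.2, 0.1])
new_policy = np.array([0.6, 0.3, 0.1])
print(kl(old_policy, new_policy), tv(old_policy, new_policy))
```
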
4 votes • 1 answer

What is the difference between model and data distributions?

Is there any difference between the model distribution and data distribution, or are they the same?
4 votes • 1 answer

In deep learning, do we learn a continuous distribution based on the training dataset?

At some level, perhaps not always end-to-end, deep learning learns a function: essentially a mapping from a domain to a range. The domain and range would, in most cases, be multivariate. So, when a model learns a…
4 votes • 1 answer

How are the parameters of the Bernoulli distribution learned?

In the paper Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, they learn a mask for the network by setting up the mask parameters as $M_i \sim \text{Bern}(\sigma(v_i))$, where $M$ is the parameter mask ($f(x; \theta, M) = f(x; M \odot \theta)$),…
mshlis • 2,399 • 9 • 23
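
A hedged sketch of the sampling step (not necessarily the paper's exact training procedure): each mask entry is drawn as $M_i \sim \text{Bern}(\sigma(v_i))$ and applied elementwise to the weights; the parameter values below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sample_mask(v):
    """Sample M_i ~ Bern(sigmoid(v_i)) for each mask parameter v_i."""
    probs = sigmoid(v)
    return (rng.random(v.shape) < probs).astype(np.float64), probs

v = np.array([-2.0, 0.0, 3.0])        # hypothetical learnable mask parameters
theta = np.array([0.5, -1.2, 0.8])    # hypothetical network weights
mask, probs = sample_mask(v)
masked_theta = mask * theta           # effective weights M * theta used in f(x; M * theta)
print(probs, mask, masked_theta)

# Training note: the sampling step is not differentiable; a common trick
# (straight-through estimation) backpropagates through it as if it were
# the identity, so gradients reach v via sigmoid(v).
```
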
4 votes • 1 answer

What does $x,y \sim \hat{p}_{data}$ mean in the Deep Learning book by Goodfellow

In chapter 5 of the Deep Learning book by Ian Goodfellow, some of the notation in the loss function below really confuses me. My understanding is that $x, y \sim p_{data}$ means a sample $(x, y)$ drawn from the original dataset distribution (or $y$ is the…
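
For what it's worth, an expectation over the empirical distribution $\hat{p}_{data}$ reduces to a plain average over the training pairs; a minimal sketch with a made-up dataset and a hypothetical squared-error loss:

```python
import numpy as np

# Hypothetical dataset: the empirical distribution \hat{p}_{data} puts
# probability 1/m on each of the m training pairs (x_i, y_i).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

def loss(x, y_true, w=1.0):
    """A hypothetical per-example squared-error loss for a linear model."""
    return (w * x[0] - y_true) ** 2

# E_{x,y ~ \hat{p}_data}[ loss(x, y) ] is simply the mean loss over the dataset.
expected_loss = np.mean([loss(x_i, y_i) for x_i, y_i in zip(X, y)])
print(expected_loss)  # 0.01
```
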
3 votes • 0 answers

Relation between SDE diffusion and DDPM/DDIM

Background & Definitions: In DDPM, the diffusion backward step is described as follows (where $z \sim \mathcal{N}(0, I)$ and $x_{T} \sim \mathcal{N}(0, I)$): …; in DDIM we have …; while in the SDE formulation (from the Fokker-Planck equation) the step…