The technique for off-policy value evaluation comes from importance sampling, which states that
$$E_{x \sim q}[f(x)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i)\frac{q(x_i)}{p(x_i)},$$ where $x_i$ is sampled from $p$.
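To make the estimator concrete, here is a minimal sketch of it in plain Python. The specific choices are assumptions for illustration only and do not come from any RL setup: $p = \mathcal{N}(0, 2^2)$ stands in for the behavior distribution, $q = \mathcal{N}(1, 1^2)$ for the target distribution, and $f(x) = x^2$ for the quantity being averaged. The helper names `normal_pdf` and `importance_sampling_estimate` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, n, seed=0):
    """Estimate E_{x~q}[f(x)] using samples drawn only from p.

    Assumed toy distributions (illustrative, not from the question):
      p = N(0, 2^2)  -- the distribution we can sample from
      q = N(1, 1^2)  -- the distribution we want the expectation under
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)  # x_i ~ p
        # Importance weight q(x_i) / p(x_i)
        weight = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)
        total += f(x) * weight
    return total / n


estimate = importance_sampling_estimate(lambda x: x * x, n=200_000)
print(estimate)  # should land near E_q[x^2] = 1^2 + 1 = 2
```

No sample is ever drawn from $q$ here; the weights $q(x_i)/p(x_i)$ alone correct for the mismatch between the two distributions.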
When importance sampling is applied to RL, is the expectation of the function $f$ equivalent to the value of the trajectories, with each $x$ representing a trajectory?
Does the distribution $p$ represent the probability of sampling trajectories under the behavior policy, and the distribution $q$ the probability of sampling trajectories under the target policy?
How would trajectories from distribution $q$ be better than those from $p$? I can see from the equation that they are, but it is hard to understand intuitively why this is so.
 
     
     
     
    
