At the bottom end of what you can do, there's the "three-cornered hat" measurement. You need your new clock and two other clocks. The other two clocks don't have to be better than, or even comparable to, your new clock (although the better they are, the more reliable the results are). What is important is that the errors of all three clocks are statistically uncorrelated. Then you measure the differences between each of the three pairs of clocks (A vs. B, A vs. C, B vs. C) many times over a period of time.
Under the assumption of uncorrelated errors, you can calculate:
$$
\begin{aligned}
\sigma^{2}_{A} &= \frac{1}{2}\left(\sigma^{2}_{ab} + \sigma^{2}_{ac} - \sigma^{2}_{bc}\right) \\
\sigma^{2}_{B} &= \frac{1}{2}\left(\sigma^{2}_{ab} + \sigma^{2}_{bc} - \sigma^{2}_{ac}\right) \\
\sigma^{2}_{C} &= \frac{1}{2}\left(\sigma^{2}_{ac} + \sigma^{2}_{bc} - \sigma^{2}_{ab}\right)
\end{aligned}
$$
(source; I reformatted the equation in the introduction slightly). Here $\sigma^{2}$ denotes a variance. The equations hold for the conventional statistical variance, but when measuring clocks you would more likely use the Allan variance (AVAR) or modified Allan variance (MVAR). They hold for those too, as long as all three $\sigma^{2}$ values are of the same type (and, for AVAR and MVAR, computed at the same averaging time $\tau$).
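For anyone wondering where these expressions come from, here is a short derivation sketch. It assumes that each measured pairwise variance is the variance of the difference between two clocks' errors, and that all covariances between clocks are zero, so the variance of a difference is just the sum of the two individual variances:

$$
\begin{aligned}
\sigma^{2}_{ab} &= \sigma^{2}_{A} + \sigma^{2}_{B} \\
\sigma^{2}_{ac} &= \sigma^{2}_{A} + \sigma^{2}_{C} \\
\sigma^{2}_{bc} &= \sigma^{2}_{B} + \sigma^{2}_{C} \\
\sigma^{2}_{ab} + \sigma^{2}_{ac} - \sigma^{2}_{bc} &= 2\sigma^{2}_{A}
\end{aligned}
$$

Dividing the last line by 2 gives the expression for $\sigma^{2}_{A}$, and the other two follow by permuting the labels.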
In other words, you can calculate the quality of your clock (its variance against "perfect time") from the three sets of pairwise variances, even if the two other clocks are worse than yours.
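To make that concrete, here is a minimal simulation sketch in Python with NumPy. Everything in it is an assumption made for illustration: the clock errors are independent white Gaussian noise with made-up standard deviations, the function name `three_cornered_hat` is hypothetical, and it uses the ordinary sample variance rather than AVAR or MVAR. The point is only that the code never sees an individual clock's error, just the pairwise differences.

```python
import numpy as np

def three_cornered_hat(var_ab, var_ac, var_bc):
    """Return (var_a, var_b, var_c) from the three pairwise variances."""
    var_a = 0.5 * (var_ab + var_ac - var_bc)
    var_b = 0.5 * (var_ab + var_bc - var_ac)
    var_c = 0.5 * (var_ac + var_bc - var_ab)
    return var_a, var_b, var_c

rng = np.random.default_rng(0)
n = 200_000

# Assumed "true" standard deviations (arbitrary units); A is the best clock.
true_sigma = {"A": 1.0, "B": 3.0, "C": 5.0}

# Independent errors for each clock: this is the key assumption of the method.
err = {name: rng.normal(0.0, sigma, n) for name, sigma in true_sigma.items()}

# In a real experiment only these pairwise differences are observable.
var_ab = np.var(err["A"] - err["B"], ddof=1)
var_ac = np.var(err["A"] - err["C"], ddof=1)
var_bc = np.var(err["B"] - err["C"], ddof=1)

var_a, var_b, var_c = three_cornered_hat(var_ab, var_ac, var_bc)
print(var_a, var_b, var_c)  # expected to land near 1, 9 and 25
```

With a long enough record, the recovered variance of A comes out close to its true value even though B and C are several times noisier. Shortening `n` makes the estimate for A scatter much more.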
In real life, the assumption of uncorrelated errors never holds exactly, and this method cannot tell the difference between an error that affects A and an error that affects both B and C in the same way. When the assumptions are violated, the variances that come out will be inaccurate, and will sometimes even be negative (which is physically impossible). Likewise, if B and C are much noisier than A, or if the collection period simply isn't long enough, A's true variance may be "lost in the noise". Those are the downsides. Nonetheless, it's commonly assumed that if you design the experiment well, use the best standards you can get your hands on (but of different designs), and isolate them from common environmental influences, you can get measurements that are useful for some purpose (for example, while tweaking your new design to find the best stability).
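The first failure mode can be seen in a small sketch, again with invented numbers. Here B and C share a hypothetical common disturbance (think of a temperature swing that hits both). If the covariances are kept in the algebra, the estimate for A works out to $\sigma^{2}_{A} - \mathrm{cov}(A,B) - \mathrm{cov}(A,C) + \mathrm{cov}(B,C)$, so anything B and C have in common gets blamed on A, and the same kind of expression can go negative when the covariance terms outweigh the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed noise levels, arbitrary units (illustrative only).
a = rng.normal(0.0, 1.0, n)        # clock A's own error, variance 1
b_own = rng.normal(0.0, 3.0, n)    # clock B's own error, variance 9
c_own = rng.normal(0.0, 3.0, n)    # clock C's own error, variance 9
common = rng.normal(0.0, 2.0, n)   # disturbance shared by B and C, variance 4

b = b_own + common
c = c_own + common

var_ab = np.var(a - b, ddof=1)
var_ac = np.var(a - c, ddof=1)
var_bc = np.var(b - c, ddof=1)     # the shared term cancels in this difference

est_a = 0.5 * (var_ab + var_ac - var_bc)
est_b = 0.5 * (var_ab + var_bc - var_ac)

print(est_a)  # near 5 (= 1 + 4), although A's true variance is 1
print(est_b)  # near 9, although B's true variance is 13 (= 9 + 4)
```

In this setup A looks five times worse than it is, and B looks better than it is, purely because of the shared disturbance. Nothing in the pairwise data can reveal that.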
Then, when you need to characterize things even better, you can use bigger ensembles, as in Dale's answer, to reduce your odds of getting fooled by two clocks that just happen to zig and zag in a coordinated way.