
I've been doing some exploratory work on quantum kernel methods (I'm a newcomer to the field), and I struggle to reconcile two facts.

  1. As I increase the number of qubits, the Hilbert space becomes so big that most samples tend to be orthogonal to each other, thereby losing any similarity measure the kernel is supposed to capture. At that point, the Gram matrix of the kernel tends to the identity matrix, from which it is impossible to make any predictions (see the sketch after this list). Hence, in order for a quantum kernel method to work, I end up using a small qubit register (as small as 3 qubits).
  2. A necessary (but not sufficient) condition for quantum advantage is that the circuit must be hard to simulate or compute classically. However, a 3-qubit circuit is trivial to simulate on a classical computer. In fact, the comparison probably favors classical simulation even more once one considers that the dimensionality of the data might require a deep circuit to encode all data points, which poses problems in terms of hardware error.
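To make point 1 concrete, here is a toy calculation I put together (using Haar-random states as a stand-in for a generic encoding, and the fidelity-type kernel, i.e. the squared overlap of the encoded states):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_state(n_qubits):
    """Haar-random pure state, used as a stand-in for a generic data encoding."""
    dim = 2 ** n_qubits
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)

for n in (3, 6, 10, 14):
    # Fidelity-type kernel value |<psi|phi>|^2 for a batch of random pairs
    overlaps = [abs(np.vdot(random_state(n), random_state(n))) ** 2 for _ in range(200)]
    print(f"n = {n:2d} qubits: typical off-diagonal kernel value ~ {np.mean(overlaps):.1e} "
          f"(compare 1/2^n = {2.0 ** -n:.1e})")
```

The off-diagonal entries collapse roughly as 1/2^n, which is exactly the Gram-matrix-goes-to-identity behaviour I'm worried about.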

Am I on the right path, or is either of these facts not entirely true?

A brief overview of my thinking is as follows.

I appreciate that quantum kernel methods are essentially linear models on the encoded data, and that the real substance is in the encoding / feature map. It may well be the case that single-qubit rotations and entangling gates encode the data in a very useful way. However, as long as this is done on just 3 qubits, I can 'borrow' the idea of that feature map from the quantum circuit and simply simulate it on a classical computer.
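For instance, here is a sketch of what I mean by 'borrowing' the feature map (the specific circuit is just a placeholder I made up: RY angle encoding followed by a ring of CZ gates); the whole kernel is computed classically with 8-dimensional statevectors:

```python
import numpy as np
from functools import reduce

def ry(theta):
    """Single-qubit RY rotation."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cz_diag(n, q1, q2):
    """Diagonal of a CZ gate acting on qubits q1, q2 of an n-qubit register."""
    d = np.ones(2 ** n)
    for b in range(2 ** n):
        if (b >> (n - 1 - q1)) & 1 and (b >> (n - 1 - q2)) & 1:
            d[b] = -1.0
    return d

def feature_state(x):
    """Placeholder feature map: RY(x_i) on each qubit, then a ring of CZ gates."""
    n = len(x)                                    # 3 qubits for 3-dimensional data
    psi = np.zeros(2 ** n); psi[0] = 1.0
    psi = reduce(np.kron, [ry(xi) for xi in x]) @ psi
    for q in range(n):
        psi = cz_diag(n, q, (q + 1) % n) * psi
    return psi

def kernel(x1, x2):
    """Fidelity kernel k(x1, x2) = |<psi(x1)|psi(x2)>|^2, computed classically."""
    return abs(np.vdot(feature_state(x1), feature_state(x2))) ** 2

X = np.random.default_rng(1).uniform(0, np.pi, size=(5, 3))   # toy 3-dimensional data
G = np.array([[kernel(a, b) for b in X] for a in X])
print(np.round(G, 3))                                         # 5x5 Gram matrix from 8x8 statevectors
```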

Is it the case that there might be an exceptionally good encoding circuit that is able to capture similarity on a 30-qubit Hilbert space, thereby requiring a quantum computer to avoid manipulating 2^30-by-2^30 matrices? Is this the scenario where quantum kernel methods offer an advantage?

I've seen papers showing 20+ qubit circuits with nice-looking Gram matrices (with the desirable block-diagonal structure); however, upon further inspection, I see that the scale of the colormap is [0, 0.2]. I then wonder: does this pose a problem in terms of the number of shots required to resolve values in that rather small interval, compared to what would be required if the similarity were noticeable on the interval [0, 1]?

aim137

1 Answer


The basic constraint is that the (quantum) feature map has to be chosen very carefully so that the feature vectors do not have vanishing inner products. If $n$ is the number of qubits, $x_i$ for $i=1, \dots, N$ are the data, and $G$ is the Gram matrix with elements $(G)_{ij}=k(x_i, x_j)$, some of the problem constraints are:

  1. Small $n$ $\Rightarrow $ classical simulability
  2. Large $n$ $\not\Rightarrow$ small elements $G_{ij}$
  3. Efficient learnability (sample size subexponential in $n$) $\Rightarrow$ effective rank of $G$ is$^1$ $\text{poly}(n)$ as $N\rightarrow \infty$

Point 1 you already recognized, but point 2 is important so that there could be any possibility of advantage with a quantum kernel (there is). Point 3 clarifies that scaling behavior is more important than any fixed choice of $n$ and $N$, which should be expected, since this is how advantage works in other quantum algorithms.


To expand on these:

2. A high-dimensional feature space doesn't imply vanishing inner products.

  • Counterexample: Gaussian (RBF) kernel, which corresponds to an infinite-dimensional feature space
  • There are ways to reduce the effective dimensionality of a feature map while still using vectors in a high-dimensional feature space. For instance, you can use a kind of "bandwidth" that generically increases inner products without reducing dimensionality (originally appearing in [1], formalized in [2] and [3]; disclaimer: I was an author on [1] and [3]). This is precisely how the RBF kernel can function despite its infinite-dimensional embeddings. Personally, I no longer think this is a promising direction, and there's numerical evidence that it makes classical simulation easy [4].
  • Sometimes the kernel values concentrate around some fixed value that isn't zero, e.g. [5]. Then, as you said, sampling error becomes a significant issue (a rough shot-count estimate is sketched just after this list).
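As a rough back-of-envelope on that last point (my own illustration, not numbers from [5]): if a kernel value $k$ is estimated as an outcome frequency over $M$ shots, the standard error is about $\sqrt{k(1-k)/M}$, so resolving structure that lives in $[0, 0.2]$ to the same relative precision as structure spread over $[0, 1]$ costs substantially more shots per Gram-matrix entry:

```python
import numpy as np

def shots_needed(spread, rel_resolution=0.05, k_typical=None):
    """Shots M so that the binomial standard error sqrt(k(1-k)/M) equals a
    given fraction of the spread of kernel values we want to resolve."""
    if k_typical is None:
        k_typical = spread / 2             # a "typical" kernel value in the interval
    target_error = rel_resolution * spread
    return int(np.ceil(k_typical * (1 - k_typical) / target_error ** 2))

for spread in (1.0, 0.2, 0.02):
    print(f"values spread over [0, {spread}]: ~{shots_needed(spread):,} shots per entry")
```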

3. Eigenvalue scaling of $G$ dictates learnability

The problem of $k(x_i, x_j)$ concentrating, or being small, for $i \neq j$ is deeper than sampling error. The issue is rather in the eigendecomposition of $G$: roughly, you want the eigenvectors of $G$ to "align" nicely with the target function $f$ that you are learning, and so you want $G$ to have an effective rank (the dimension of the subspace spanned by eigenvectors with nontrivial eigenvalues) that matches the rank of $f$ (in an orthonormal-decomposition sense):

  • If $G$ is close to the identity in $2^n$ dimensions, then either (a) you have chosen the kernel poorly or (b) the function requires exponential sample complexity to learn.
  • If $(G)_{ij} \approx c$ for all $i\neq j$, then you have the opposite problem where $G$ is approximately rank-1 as $c \rightarrow 1$. Then you can succeed only if $f$ is a "simple" function that consists of one basis function.

Luckily for us, the rank of $f$ is not obviously related to classical hardness. For some choice of classically-hard, low-rank $f$, you want to find a kernel matrix $G$ that is approximately low-rank (polynomial scaling in $n$), and whose eigenvectors "align" with $f$ [6]. By assumption, the circuit that computes this kernel cannot be efficiently classically simulated.
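To caricature the two extremes above, here is a minimal numerical sketch (the toy Gram matrices and the entropy-based definition of effective rank are my own choices for illustration, not something taken from [6]):

```python
import numpy as np

def effective_rank(G):
    """Entropy-based effective rank: exp(H(p)) with p_i = lambda_i / sum(lambda)."""
    lam = np.clip(np.linalg.eigvalsh(G), 1e-12, None)
    p = lam / lam.sum()
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
N, n = 200, 10

# Caricature (a): off-diagonal kernel values of order 2^-n, so G is close to the identity.
G_id = np.eye(N) + 2.0 ** -n * rng.uniform(size=(N, N))
G_id = (G_id + G_id.T) / 2

# Caricature (b): kernel values concentrating around c != 0, so G is approximately rank-1.
c = 0.9
G_const = c * np.ones((N, N)) + (1 - c) * np.eye(N)

print(f"near-identity G : effective rank ~ {effective_rank(G_id):6.1f} (out of N = {N})")
print(f"near-constant G : effective rank ~ {effective_rank(G_const):6.1f}")
```

The first case has exponentially many nontrivial eigenvalues (bad sample complexity unless the target really needs them all); the second collapses onto essentially one basis function.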


$^1$ Whenever I say $G$, what I am actually talking about is the "integral kernel operator" $T: L_\mu^2(\mathcal{X}) \rightarrow L_\mu^2(\mathcal{X})$, where $(Tf)(\cdot) = \int_\mathcal{X} k(\cdot, x) f(x)\, \mu(dx)$ for some probability measure $\mu$ on your (measurable) space $\mathcal{X}$ where the data live, bounded $k$, etc. You can think of this as $\lim_{N\rightarrow \infty} G$, or read up on $T$ in any standard reference on kernel methods. The formalism isn't worth getting into here.

forky40