I have done quite a few Google/paper searches but did not find an answer. I would like to test the possibility of speeding up or improving the accuracy of an existing unsupervised machine learning project (mainly clustering and PCA). Before I analyse the exact implementation on a QC (I have never performed ML on a QC), I would like to ask whether there are any limitations or rules of thumb on the amount of input data (e.g. rows, gigabytes, …) that a QC (and e.g. quantum kernel machine learning) can handle.
1 Answer
One way to formalize limitations on the amount of data $n$ that you can process is to describe how the runtime of your chosen algorithm scales with $n$ compared to the amount of resources you have at your disposal. Both parts are important: a classical learning algorithm whose runtime scales like $n^3$ might not be a problem if you can parallelize it and run it on a large computing cluster for a month.
This means you need to constrain the computational resources that you have at your disposal when using a quantum device. On near-term superconducting qubit devices running shallow circuits, the readout time tends to dominate the circuit execution time (typical gate times are order $10$ ns, compared to $10^2$ or $10^3$ ns for measurement). So a reasonable constraint for superconducting qubit devices is a "shot budget" $M$, the total number of times you measure the output of a quantum circuit on the device. If, instead, your total access time to the device is limited, you can do a rough calculation to convert that time into a shot budget.
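To get a feel for the numbers, here is a minimal sketch (Python) of converting access time into a shot budget. The gate time, readout time, and circuit depth below are assumptions for illustration only, chosen to match the orders of magnitude quoted above; real devices add classical, queueing, and communication overhead that this ignores.

```python
GATE_TIME_NS = 10        # assumed typical gate time
READOUT_TIME_NS = 1_000  # assumed typical measurement/readout time
CIRCUIT_DEPTH = 20       # assumed depth of a shallow feature-map circuit

def shot_budget(access_time_seconds):
    """Rough number of shots that fit into a given amount of device time."""
    time_per_shot_ns = CIRCUIT_DEPTH * GATE_TIME_NS + READOUT_TIME_NS
    return int(access_time_seconds * 1e9 / time_per_shot_ns)

# Example: one hour of dedicated device access.
print(shot_budget(3600))  # ~3 billion shots with these idealized timings
```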
Now suppose you have a shot budget of $M$ measurements and want to implement a quantum kernel-based classifier. Here is an example analysis of how the amount of data will be constrained by this shot budget:
- If we have $n$ training data and $t$ testing data, a kernel method means we need to construct an $n\times n$ Gram matrix $K_{ij} = |\langle \psi(x_i)|\psi(x_j)\rangle|^2$ on the training data with $n(n-1)/2$ unique entries (assuming $1$'s on the diagonal) and a test matrix with $nt$ unique entries. The ratio of $n$ to $t$ has to be tuned carefully (e.g. via cross-validation) to make empirical claims about the generalization of your algorithm: if $n \gg t$ you will overfit, and $n < t$ doesn't make sense.
- Each entry $K_{ij}$ is estimated using a quantum circuit (e.g. a SWAP test), and so there will be statistical uncertainty associated with each entry. This uncertainty is controlled by the number of measurements $m$ used to estimate each $K_{ij}$. For an SVM, $m$ might matter a lot, while for kernel PCA it might not matter at all (Achlioptas, 2001). This issue also depends on the feature map $|\psi(x)\rangle$ you implement: the off-diagonal $K_{ij}$ might be larger or smaller, with the latter having a catastrophic effect on the trainability of the kernel algorithm unless $m$ is scaled suitably (e.g. Thanasilp, 2022). A toy model of this shot noise is sketched after this list.
- Error mitigation to improve the accuracy of estimates of $K_{ij}$ will incur additional overhead; this depends entirely on what kind of error mitigation you implement.
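To illustrate the second point, here is a minimal sketch (pure NumPy, no quantum hardware) of the shot noise incurred when estimating a single kernel entry from $m$ repetitions. It models each shot as a Bernoulli trial with success probability $K_{ij}$, which is a simplifying assumption; on a real SWAP test the ancilla-0 probability is $(1+K_{ij})/2$, so the constant differs but the $\sim 1/\sqrt{m}$ statistical error is the same.

```python
import numpy as np

def estimate_kernel_entry(k_true, m, rng):
    """Simulate estimating K_ij = k_true from m measurement shots.

    Each shot is modeled as a Bernoulli(k_true) outcome (a simplification;
    a real SWAP test measures an ancilla whose 0-probability is (1 + K_ij)/2,
    which changes the constant but not the ~1/sqrt(m) scaling).
    """
    return rng.binomial(m, k_true) / m

rng = np.random.default_rng(0)
k_true = 0.05  # a small overlap, as happens for highly expressive feature maps
for m in (100, 10_000):
    estimates = [estimate_kernel_entry(k_true, m, rng) for _ in range(1000)]
    print(f"m={m:>6}: empirical std {np.std(estimates):.4f}, "
          f"theory sqrt(k(1-k)/m) = {np.sqrt(k_true * (1 - k_true) / m):.4f}")
```

The takeaway is that resolving small kernel values requires $m$ to grow roughly like $1/K_{ij}^2$, which eats directly into the shot budget.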
So, to obey our shot budget (and not accounting for any error mitigation), we need to pick $n, t, m$ such that $$ M \approx mn\left( \frac{n-1}{2} + t\right), $$ subject to practical constraints like $n > t$. This is roughly the analysis that was used to choose the dataset size in (Peters, 2022), which contains some additional details that might be useful [disclaimer: I am an author on that paper].
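As a rough illustration, here is a small sketch that inverts this relation: given a shot budget $M$, shots per entry $m$, and an assumed test fraction (the 20% split below is arbitrary), it finds the largest training-set size $n$ that fits.

```python
def shots_needed(n, t, m):
    """Shots for the Gram matrix (n(n-1)/2 entries) plus test matrix (n*t entries)."""
    return m * n * ((n - 1) / 2 + t)

def max_train_size(M, m, test_fraction=0.2):
    """Largest n (with t = test_fraction * n test points, an assumed split) within budget M."""
    n = 1
    while shots_needed(n + 1, max(1, int(test_fraction * (n + 1))), m) <= M:
        n += 1
    return n, max(1, int(test_fraction * n))

# Example: a budget of 3e9 shots and 10_000 shots per kernel entry.
print(max_train_size(3e9, 10_000))  # roughly n ~ 650, t ~ 130 under these assumptions
```

The quadratic dependence on $n$ is the main reason dataset sizes in hardware experiments stay in the hundreds-to-thousands range rather than the millions typical of classical pipelines.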