
I am currently trying to compute logical error rates for the surface code, for different distances and noise strengths, using Stim's detector error models and PyMatching.

tl;dr: What is the best strategy for parallelising this process over different cores?

To do so, I am trying to make the best use of the cores at my disposal by parallelising the different computations across multiple cores. My current workflow uses joblib for this task:

from joblib import Parallel, delayed, parallel_backend

distances = [...]        # A set of distances
noise_strengths = [...]  # A set of noise strengths
num_shots = ...          # Some number of shots to sample
num_cycles = ...         # Number of cycles over which to perform syndrome extraction

threads = ...  # Maximum number of threads allowed
jobs = ...     # Maximum number of jobs allowed

with parallel_backend("loky", inner_max_num_threads=threads):
    joblib_results = Parallel(n_jobs=jobs)(
        delayed(compute_logical_error_rate)(distance, noise, num_shots, num_cycles)
        for distance in distances
        for noise in noise_strengths
    )

where the logical error rate is computed using Stim and PyMatching as follows:

import numpy as np
from pymatching import Matching

def compute_logical_error_rate(distance, noise, num_shots, num_cycles):
    sc_circuit = surface_code(distance, noise, num_cycles)  # Define circuit
    sc_model = sc_circuit.detector_error_model(decompose_errors=True)  # Detector error model

    sampler = sc_circuit.compile_detector_sampler()  # Define sampler
    syndrome, measured_obs = sampler.sample(shots=num_shots, separate_observables=True)  # Extract samples

    matching = Matching.from_detector_error_model(sc_model)  # Matching graph
    expected_obs = matching.decode_batch(syndrome)  # Perform decoding

    num_errors = np.sum(np.any(expected_obs != measured_obs, axis=1))  # Compute errors
    logical_error_rate = num_errors / num_shots  # Compute logical error rate

    return logical_error_rate

This workflow allows me to assign a different simulation to each core, i.e. each core computes the logical error rate for a given distance and noise strength. However, upon increasing the number of samples I take, defined by num_shots, I start running into memory problems. I have identified the cause to be the syndrome variable, a numpy array whose size scales linearly with the number of samples: syndrome.shape = (num_shots, ...). The way I am parallelising naturally results in this variable being stored multiple times (once per job/core), thus leading to a memory overload.

I am interested in understanding how I can best leverage multiple cores to make this whole set of simulations more efficient. Would it make more sense to tackle one job at a time (fixed distance and noise strength) and parallelise the sampling across different cores? If so, how can that be done? There seems to be a comment about this approach in the original Stim paper, but I do not see how to replicate that behaviour with my code.

FDGod

1 Answer


(Sounds like you want sinter. In particular, you should check out sinter.collect. It's built for exactly this task, and it supports additional features such as saving progress to a file so it can resume work later if your machine loses power.)
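For illustration, here is a minimal sinter sketch, assuming the question's surface_code builder and parameter lists; the worker count, shot/error caps, and filename are illustrative, and save_resume_filepath is what provides the resume-after-interruption behaviour:

import sinter

# Sketch only: surface_code(...), distances, noise_strengths, num_cycles
# are assumed from the question; the numeric limits are illustrative.
tasks = [
    sinter.Task(
        circuit=surface_code(distance, noise, num_cycles),
        json_metadata={'d': distance, 'p': noise},
    )
    for distance in distances
    for noise in noise_strengths
]
stats = sinter.collect(
    num_workers=4,
    tasks=tasks,
    decoders=['pymatching'],
    max_shots=1_000_000,
    max_errors=1_000,
    save_resume_filepath='stats.csv',  # progress persists across restarts
)
for stat in stats:
    print(stat.json_metadata, stat.errors / stat.shots)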

First, you should divide up the work into batches. Stim is optimized to work best when you keep asking for batches of shots of the same size, where that size is a multiple of 256.

Second, you should ask for bit packed data. This makes the data 8x denser, so there's a lot less copying.

Third, you should return how many samples were taken and how many errors were seen, instead of their ratio. This allows you to run the same circuit on multiple threads and easily combine the results afterwards by adding the returned numbers of shots and errors.

def count_logical_errors(distance, noise, num_shots, num_cycles):
    sc_circuit = surface_code(distance, noise, num_cycles)  # Define circuit
    sc_model = sc_circuit.detector_error_model(decompose_errors=True)  # Detector error model
    sampler = sc_circuit.compile_detector_sampler()  # Define sampler
    matching = Matching.from_detector_error_model(sc_model)  # Matching graph

    shots_left = num_shots
    num_errors = 0
    while shots_left > 0:
        batch_size = min(1024, shots_left)  # 1024 is a multiple of 256
        shots_left -= batch_size
        syndrome, measured_obs = sampler.sample(
            shots=batch_size,
            separate_observables=True,
            bit_packed=True,
        )  # Extract samples

        # Tell PyMatching the shots (and desired predictions) are bit packed,
        # so the comparison below compares packed bytes with packed bytes.
        expected_obs = matching.decode_batch(
            syndrome,
            bit_packed_shots=True,
            bit_packed_predictions=True,
        )  # Perform decoding

        num_errors += np.sum(np.any(expected_obs != measured_obs, axis=1))  # Compute errors

    return num_errors, num_shots
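
Since the function now returns counts, a single job (fixed distance and noise strength) can itself be parallelised: run the same circuit on several workers and sum the results, which addresses the question's final point. A sketch using the question's joblib setup, with an illustrative even split of the shots:

from joblib import Parallel, delayed

# Split one large simulation across several jobs, then combine by summing
# the returned counts. Remainder handling is omitted for brevity.
shots_per_job = num_shots // jobs
results = Parallel(n_jobs=jobs)(
    delayed(count_logical_errors)(distance, noise, shots_per_job, num_cycles)
    for _ in range(jobs)
)
total_errors = sum(errors for errors, _ in results)
total_shots = sum(shots for _, shots in results)
logical_error_rate = total_errors / total_shots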

Incidentally, your method will be much more flexible if you pass in the circuit instead of the parameters used to make the circuit.
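
For instance, a sketch of that refactor (count_logical_errors_for_circuit is a hypothetical name; the body reuses the batching loop from above):

import stim

def count_logical_errors_for_circuit(circuit: stim.Circuit, num_shots: int):
    # Caller supplies the circuit, so this works for any circuit family.
    sampler = circuit.compile_detector_sampler()
    matching = Matching.from_detector_error_model(
        circuit.detector_error_model(decompose_errors=True))
    num_errors = 0
    shots_left = num_shots
    while shots_left > 0:
        batch_size = min(1024, shots_left)
        shots_left -= batch_size
        syndrome, measured_obs = sampler.sample(
            shots=batch_size, separate_observables=True, bit_packed=True)
        expected_obs = matching.decode_batch(
            syndrome, bit_packed_shots=True, bit_packed_predictions=True)
        num_errors += int(np.sum(np.any(expected_obs != measured_obs, axis=1)))
    return num_errors, num_shots

# The original signature becomes a thin wrapper:
def count_logical_errors(distance, noise, num_shots, num_cycles):
    return count_logical_errors_for_circuit(
        surface_code(distance, noise, num_cycles), num_shots)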

Craig Gidney