- Output format: Use a bit-packed output format that has 8 results per byte instead of 1. In the Python bindings, this means calling `sample_bit_packed` instead of `sample`. From the command line, this means specifying `--out_format=b8`. You pay roughly a 5x performance penalty if you use an output format that isn't bit packed.
 - From the command line, for detection events, you can also consider sparse output formats like `--out_format=r8`, where each byte tells you the number of 0 results until the next 1 (except 0xFF, which is used for long runs of zeros). For error rates below 0.1%, this can easily produce 10x less output data.
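To get a feel for the size differences described above, here is a minimal pure-Python sketch (it does not use Stim) that encodes one shot three ways: one byte per result, bit packed, and run-length encoded in the style of `r8`. The helper names `pack_bits` and `encode_run_lengths` are made up for this illustration. The bit order inside each byte (first result in the least significant bit) and the terminating 1 appended before run-length encoding are assumptions about Stim's conventions, so check Stim's result format documentation before relying on them.

```python
def pack_bits(bits):
    """Pack 0/1 results into bytes, 8 results per byte.

    Assumption: the first result goes in the least significant bit.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def encode_run_lengths(bits):
    """Encode results as run lengths: each byte counts the 0s before the next 1.

    0xFF means "255 zeros, not followed by a 1".
    Assumption: a terminating 1 is appended to mark the end of the shot.
    """
    out = bytearray()
    zeros = 0
    for b in list(bits) + [1]:
        if b:
            out.append(zeros)
            zeros = 0
        else:
            zeros += 1
            if zeros == 255:
                out.append(0xFF)
                zeros = 0
    return bytes(out)


# One sparse shot: 1000 detection events, only two of which fired.
shot = [0] * 1000
shot[10] = 1
shot[700] = 1

unpacked = bytes(shot)            # one byte per result
packed = pack_bits(shot)          # 8 results per byte
sparse = encode_run_lengths(shot) # run lengths between 1s

print(len(unpacked), len(packed), len(sparse))
```

For a shot this sparse, the run-length form is far smaller than even the bit-packed form, which is the effect the sparse formats are meant to exploit.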
 
- Sample size: Take a number of samples that's a multiple of 256. Stim internally pads the number of samples up to a multiple of 256, because it processes samples in batches using 256-bit-wide AVX instructions. It does this even if you ask for only 2 samples, in which case the other 254 are computed and then discarded.
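The padding rule above is simple arithmetic. The sketch below rounds a requested sample count up to the next multiple of 256 and reports how many simulated samples would be thrown away. The function name `padded_shots` is just for illustration and is not part of Stim.

```python
BATCH = 256  # samples processed together, per the 256-bit width described above


def padded_shots(requested):
    """Round a requested number of samples up to the next multiple of BATCH."""
    return -(-requested // BATCH) * BATCH


for requested in (2, 256, 1000, 1024):
    simulated = padded_shots(requested)
    discarded = simulated - requested
    print(f"requested={requested:5d} simulated={simulated:5d} discarded={discarded:4d}")
```

Asking for 1000 samples costs the same as asking for 1024, so you may as well keep the extra 24.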
- Command Line Tool: Consider using the command line tool instead of the Python bindings. The Python bindings add overhead (extra data copying) and are currently less flexible than the command line tool. You pay roughly a 2x performance penalty for using stim via Python.
 - Note that when you print a `stim.Circuit` in Python, the result is text that the command line tool can parse to get that exact same circuit, so it's quite easy to generate your circuit files from Python by making the circuit as usual and printing it to a file.
 - An example where the command line tool is currently necessary is if your circuit is really big, with hundreds of billions of bits of data coming out. The command line tool will notice the size and automatically switch from buffering all results in memory to streaming results to disk. The Python bindings are not currently capable of doing this, because they return the results as an in-memory numpy array. Speaking from personal experience, streaming the results from huge circuits can be the difference between things working and watching helplessly as your computer completely freezes due to madly swapping memory to and from disk.
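On the consuming side, results that were streamed to disk can also be processed without loading them all into memory. The following is a minimal sketch, assuming a bit-packed results file in which each shot occupies `ceil(num_measurements / 8)` bytes. The file name, the helper `iter_shots`, and the bit order within each byte (first result in the least significant bit) are assumptions made for illustration, not something Stim provides.

```python
import os
import tempfile


def iter_shots(path, num_measurements, shots_per_chunk=1024):
    """Yield one bytes object per shot, reading the file a chunk at a time."""
    bytes_per_shot = (num_measurements + 7) // 8
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bytes_per_shot * shots_per_chunk)
            if not chunk:
                break
            for start in range(0, len(chunk), bytes_per_shot):
                yield chunk[start:start + bytes_per_shot]


# Demo with a small stand-in file: 3 shots of 10 measurements (2 bytes each).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.b8")
    with open(path, "wb") as f:
        f.write(bytes([0b00000001, 0b00, 0b00000000, 0b10, 0b00000101, 0b01]))

    # Count the shots, and the shots whose first measurement was 1,
    # holding only one chunk in memory at a time.
    num_shots = sum(1 for _ in iter_shots(path, 10))
    first_result_ones = sum(shot[0] & 1 for shot in iter_shots(path, 10))

print(num_shots, first_result_ones)
```

Memory use here is bounded by `shots_per_chunk`, regardless of how large the results file is.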
- REPEAT block: Use the `REPEAT` block instead of explicitly repeating things. If you have a circuit with a hundred thousand rounds of the same operations, putting them inside a `REPEAT 100000 { ... }` block instead of actually repeating them a hundred thousand times will reduce memory usage and hugely reduce parsing time.
 - In the Python bindings, you can create a repeat block by multiplying the circuit by an integer. So create a `stim.Circuit` containing the body of the loop, then do `full_circuit += loop_body * 100000`.
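To see why the `REPEAT` block helps, compare how much circuit text the parser has to read in each case. This pure-Python sketch builds both versions as plain strings. The loop body is an arbitrary illustrative example, not taken from any particular circuit.

```python
rounds = 100_000
body_lines = ["H 0", "CNOT 0 1", "M 0 1"]

# Version 1: the loop body written once inside a REPEAT block.
repeat_text = (
    f"REPEAT {rounds} {{\n"
    + "".join(f"    {line}\n" for line in body_lines)
    + "}\n"
)

# Version 2: the same operations explicitly written out for every round.
unrolled_text = "".join(line + "\n" for line in body_lines) * rounds

print(len(repeat_text), "characters with REPEAT")
print(len(unrolled_text), "characters unrolled")
```

The `REPEAT` version stays a few dozen characters long no matter how many rounds are requested, while the unrolled text grows linearly with the round count.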
 
- Grouping operations: Have your measurements come in groups and your resets come in groups, without other operations or noise in between. For example, this:

      M 0
      M 1
      M 2
      X_ERROR(0.01) 0
      X_ERROR(0.01) 1
      X_ERROR(0.01) 2

  happens to be faster than this:

      M 0
      X_ERROR(0.01) 0
      M 1
      X_ERROR(0.01) 1
      M 2
      X_ERROR(0.01) 2

  This is because (currently) stim temporarily transposes the entire stabilizer tableau when performing a measurement, but avoids re-transposing for adjacent measurements.
 - It's certainly possible for stim to be made smarter here, to move measurements around so more can be fused or to partially transpose the tableau as required, but currently it's not that smart.
 - Grouping measurements isn't relevant if you're sampling detection events or specify `--frame0`, because those modes bypass the stabilizer tableau simulation step where the transposing occurs. But you never know when you'll later want raw measurements...
 - Another, more minor, benefit of grouping operations of the same type is that a noisy operation with many targets can have its "did the error happen" bits generated together in an efficient way. But if you're asking for hundreds or thousands of samples, then you're already getting most of that benefit.
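A quick way to check whether a circuit follows this advice is to count how many separate runs of adjacent measurement instructions it contains. By the reasoning above, fewer runs should mean fewer tableau transposes. The sketch below does this on raw circuit text. The helper name `count_measurement_runs` is hypothetical, and it only recognizes the `M` instruction and does not account for `REPEAT` blocks, so treat it as a rough diagnostic rather than a model of what Stim does internally.

```python
def count_measurement_runs(circuit_text):
    """Count maximal runs of consecutive `M` instructions in circuit text."""
    runs = 0
    in_run = False
    for line in circuit_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name = line.split()[0].split("(")[0]  # "X_ERROR(0.01)" -> "X_ERROR"
        is_measurement = name == "M"
        if is_measurement and not in_run:
            runs += 1
        in_run = is_measurement
    return runs


grouped = """
M 0
M 1
M 2
X_ERROR(0.01) 0
X_ERROR(0.01) 1
X_ERROR(0.01) 2
"""

interleaved = """
M 0
X_ERROR(0.01) 0
M 1
X_ERROR(0.01) 1
M 2
X_ERROR(0.01) 2
"""

print(count_measurement_runs(grouped), count_measurement_runs(interleaved))
```

The grouped circuit has a single run of measurements, while the interleaved one has three, one per measurement.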