I am trying to reproduce Li's injection method in the surface code (from here). Specifically, I start with a d1=7 surface code, perform the injection plus two rounds of stabilizer measurements, and check the "acceptance rate". If no errors were detected, I continue with d2=7, i.e. I simply measure all data qubits and check whether a logical error occurred.
Since I am using Stim, I actually inject an S gate instead of T (my circuit can be found in Crumble). I only apply two-qubit-gate depolarizing noise with probability 0.1%, to match one data point from the paper. Nevertheless, I am not able to reproduce the results.
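For reference, the noise model is just a DEPOLARIZE2 after every two-qubit gate. Here is a minimal sketch of how it could be attached programmatically (in my actual circuit the noise instructions are written directly in Crumble, and the gate-name set below is an assumption about which two-qubit gates appear):

```python
import stim

P = 1e-3  # two-qubit depolarizing probability (0.1%)

def add_two_qubit_noise(circuit: stim.Circuit, p: float = P) -> stim.Circuit:
    """Return a copy of `circuit` with DEPOLARIZE2(p) after every two-qubit gate.

    Assumes a flat circuit (no REPEAT blocks), which is the case for a
    hand-built injection circuit exported from Crumble.
    """
    two_qubit_gates = {"CX", "CZ", "XCZ", "YCZ"}  # adjust to the gates actually used
    noisy = stim.Circuit()
    for inst in circuit:
        noisy.append(inst)
        if inst.name in two_qubit_gates:
            noisy.append("DEPOLARIZE2", inst.targets_copy(), p)
    return noisy
```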
Going over all possible errors Li mentions: for the T state he counts 6 harmful two-qubit errors from the CNOTs of the stabilizer measurements. In the case of the S gate I should actually have only 5 (because a Y error on the injected qubit is not an error), so I expect a logical error rate of 5p/15 = p/3.
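As a sanity check of this counting, one can sum, directly from the circuit's detector error model, the probabilities of all error mechanisms that flip the logical observable without triggering any detector; to first order in p this should reproduce the p/3 above if Li's counting covers everything. A sketch, assuming the noisy circuit is already built and that "accepted" means postselecting on every detector in the circuit:

```python
import stim

def undetected_logical_error_rate(circuit: stim.Circuit) -> float:
    """Sum the probabilities of error mechanisms that flip the logical
    observable while triggering no detector at all (a first-order estimate,
    since Stim merges equivalent circuit faults into one mechanism)."""
    dem = circuit.detector_error_model(flatten_loops=True)
    total = 0.0
    for inst in dem:
        if inst.type != "error":
            continue
        targets = inst.targets_copy()
        triggers_detector = any(t.is_relative_detector_id() for t in targets)
        flips_observable = any(t.is_logical_observable_id() for t in targets)
        if flips_observable and not triggers_detector:
            total += inst.args_copy()[0]
    return total
```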
Now, my problem is that Li states that these errors originate solely in the first round of measurement, and that these are the only errors that lead to a logical error (to first order in p). But why? If, for example, an XI error hits the second data qubit in the second round, it is not detectable and leads to a logical error. Indeed, if I assume NO errors in the second round, I get exactly a p/3 logical error rate, but an acceptance rate of ~75%, which is much higher than in the paper; on the other hand, if I DO include errors in the second round (as I should), I get the correct acceptance rate of ~59%, but a much higher logical error rate of ~1.8p.
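For concreteness, here is a minimal sketch of how the two rates can be estimated with Stim's detector sampler. The parameter `n_ps` (number of post-selected detectors) is an assumption and depends on how the DETECTORs are ordered in the circuit; if the final data-qubit measurement is decoded rather than read out directly, a decoder such as PyMatching would replace the last line:

```python
import numpy as np
import stim

def acceptance_and_logical_rate(circuit: stim.Circuit, n_ps: int, shots: int = 1_000_000):
    """Estimate the acceptance rate and the post-selected logical error rate,
    assuming the first `n_ps` detectors are exactly the stabilizer comparisons
    of the two injection rounds."""
    sampler = circuit.compile_detector_sampler()
    dets, obs = sampler.sample(shots, separate_observables=True)

    # Accept a shot only if every post-selected detector stayed silent.
    accepted = ~dets[:, :n_ps].any(axis=1)
    acceptance_rate = accepted.mean()

    # Raw logical flip rate among accepted shots (swap in a decoder on the
    # remaining detectors here if the final round is decoded).
    logical_rate = obs[accepted, 0].mean()
    return acceptance_rate, logical_rate
```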
So, my question is two-fold: First, why does Li ignore the possibility of errors in the second round, like the example I gave above? Second, if Li is correct, what am I missing, and how do I incorporate it into my simulations?