I am working on a prediction problem where each outcome vector $y_i$ in my training data was generated to satisfy a set of linear constraints $A_i y_i = b_i$. I know each $A_i$ and $b_i$ exactly. I want to train a model on inputs $u_i$ that returns predictions $\hat{y}_i$ satisfying $A_i \hat{y}_i = b_i$ $\forall i$. Then, at prediction time, given new inputs $v_j$ and new linear constraints $C_j, d_j$, the model should produce output vectors $\hat{z}_j$ that satisfy $C_j \hat{z}_j = d_j$ $\forall j$.
I know I can modify the loss function with a penalty multiplier $\rho$, akin to a Lagrangian, $L = \mathrm{MSE} + \rho\,(b - A\hat{y})$, but I have only seen examples of this where a single set of linear constraints is shared by all observations, $A y_i = b$ $\forall i$. (It may be that this is possible with per-observation $A_i, b_i$ and I just haven't seen how to do so.) In my case, every observation has its own set of linear constraints. An alternative to a Lagrangian-style loss might be to pass the problem context $(A_i, u_i, b_i)$ as the training data, but then the model could still return $\hat{y}_i$ that do not satisfy the system of equations. Is there an existing architecture that can enforce context-specific (per-observation) linear constraints, and possibly use the constraint data itself to learn the weights? Perhaps the constraints can be encoded in the final layer of the network, but I have not found any work that seems to do that.
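To make the penalty idea concrete, here is roughly what I imagine a per-observation version would look like in PyTorch (all names are my own placeholders; I have also written the penalty as a squared residual $\rho\,\lVert A_i \hat{y}_i - b_i\rVert^2$ rather than a linear multiplier term, since $\rho$ would be a fixed hyperparameter rather than an optimized multiplier):

```python
import torch

def constrained_mse_loss(y_pred, y_true, A, b, rho=1.0):
    """MSE plus a quadratic penalty on the per-observation constraint residuals.

    Shapes: y_pred, y_true are (batch, n); A is (batch, m, n); b is (batch, m).
    The penalty only encourages A_i y_i ~ b_i; it does not enforce it exactly.
    """
    mse = torch.mean((y_pred - y_true) ** 2)
    # Per-observation residual A_i y_i - b_i, computed with a batched matmul
    residual = torch.bmm(A, y_pred.unsqueeze(-1)).squeeze(-1) - b  # (batch, m)
    penalty = torch.mean(torch.sum(residual ** 2, dim=-1))
    return mse + rho * penalty
```

This handles distinct $A_i, b_i$ per observation, but the constraints are only satisfied approximately, and only as well as $\rho$ is tuned.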
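And here is the kind of final-layer construction I have in mind, sketched with a pseudoinverse-based correction (again, the class and layer names are hypothetical): the raw network output $\hat{y}^{\text{raw}}_i$ is replaced by $\hat{y}_i = \hat{y}^{\text{raw}}_i - A_i^{+}(A_i \hat{y}^{\text{raw}}_i - b_i)$, which satisfies $A_i \hat{y}_i = b_i$ exactly (up to numerical precision) whenever the system is consistent:

```python
import torch
import torch.nn as nn

class ConstraintProjectionLayer(nn.Module):
    """Projects a raw prediction onto the affine set {y : A y = b}.

    Uses the minimum-norm correction y - pinv(A) (A y - b), so the output
    satisfies A y_hat = b exactly whenever the system A y = b is consistent.
    """
    def forward(self, y_raw, A, b):
        # y_raw: (batch, n), A: (batch, m, n), b: (batch, m)
        residual = torch.bmm(A, y_raw.unsqueeze(-1)) - b.unsqueeze(-1)   # (batch, m, 1)
        correction = torch.bmm(torch.linalg.pinv(A), residual)           # (batch, n, 1)
        return y_raw - correction.squeeze(-1)

class ConstrainedNet(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        self.project = ConstraintProjectionLayer()

    def forward(self, u, A, b):
        # The constraints (A, b) are part of the forward pass, not just the loss
        return self.project(self.body(u), A, b)
```

The appeal of the minimum-norm correction is that it leaves the component of the raw prediction in the null space of $A_i$ untouched, so the network still learns the unconstrained degrees of freedom. Is there existing work that formalizes an approach like this?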