So the important thing is simply that it takes the form of a total derivative, even if symbolically that involves $\mathbf q$ and its derivatives. For example it could take the form $\mathbf q \cdot \dot{\mathbf q},$ and that’s fine because it is symbolically a total derivative, namely of $\tfrac12 \mathbf q \cdot \mathbf q.$
Background: the Lagrangian doesn’t know about paths
Backing up one step to explain this better: we start from a notion of paths, which map a time interval, say $t\in(0, T)$, to vectors in a coördinate space, say $\mathbf q(t)$ in $\mathbb R^{3N}$ for $N$ particles in unconstrained 3D space. One of the great strengths of the Lagrangian formalism is that it does not care how you parameterize the space, so you can impose constraints on the space without messy constraint forces. Call the time interval $\mathsf T$ and the coördinate space $\mathsf C$; paths are then functions $\mathsf T \to \mathsf C.$ [If you really want to go streamlined-and-abstract, you can also make time one of the coördinates in the space and take $\mathsf T = (0, 1)$ or so, as a “progress along the path” parameter rather than a “time” parameter.]
We then invent action principles: we say that the laws of physics can somehow be encoded in a function $S: (\mathsf T\to\mathsf C)\to \mathbb R,$ assigning numbers to paths. Then we are saying that of all the paths which a particle can take between two points in $\mathsf C$, the ones that it does take according to physics are the ones where, for all path-perturbations $\delta \mathbf{q}$ vanishing at the endpoints of $\mathsf T$, $$ S[\mathbf q + \delta \mathbf q] \approx S[\mathbf q]$$to first order in $\delta \mathbf q.$ Obviously this doesn’t really help us if we don’t have some additional structure, which is why we impose the Lagrangian structure. Now this is important: while $S$ takes a single path as its input and has to deal with strange things like “taking derivatives with respect to time” of that path, the Lagrangian doesn’t really know about those.
An $n^\text{th}$ order Lagrangian is just a function from $n+1$ coördinates and one time to the real numbers. It doesn’t know, as a function, that its various coördinate arguments are going to come from different paths ($\mathbf q, \dot{\mathbf q}, \ddot{\mathbf q}, \dots$) or that those paths are connected to each other by time derivatives in the action principle. It’s just a function $L : \mathsf T \times \mathsf C^{n+1} \to \mathbb R.$ The fact that these arguments are symbolic derivatives comes from the fact that we assume that the action principle $S$ can be phrased in terms of $L$ by an expression of the form, $$S[\mathbf q] = \int_\mathsf T dt~L\big(t, \mathbf q(t), \dot{\mathbf q}(t), \ddot {\mathbf q}(t), \dots\big).$$ Note that the logic has us generate $n$ further paths from the one path by differentiation, evaluate all of them at some time along the path, feed them to the Lagrangian, get a number, and then integrate those numbers over all times along the path. Then you know the rest of the major part of this story: we do this path-perturbation procedure and find that, assuming $L$ is a nice function, it has partials with respect to all of its $\mathbf q_{0,1,2,\dots n}$ arguments, not knowing that $\mathbf q_i$ corresponds to the $i^\text{th}$ time derivative of a path; and if the coördinate space is a vector space then we understand these partials as covectors $\mathsf C\to\mathbb R$. These partials mean that to first order,
$$
\begin{align}
S[\mathbf q + \delta\mathbf q] &=\int_\mathsf T dt~L\big(t, \mathbf q(t) + \delta\mathbf q(t), \dot{\mathbf q}(t) + \delta\dot{\mathbf q}(t), \ddot {\mathbf q}(t) + \delta\ddot{\mathbf q}(t), \dots\big)\\
&\approx\int_\mathsf T dt~\left[L\big(t, \mathbf q(t), \dot{\mathbf q}(t), \ddot {\mathbf q}(t), \dots\big) + \sum_{i=0}^{n}\frac{\partial L}{\partial \mathbf {q_i}}\cdot \left(\frac{d~}{dt}\right)^{i}\delta \mathbf q\right]
\end{align},$$and we then integrate-by-parts all of these time derivatives away into boundary terms which vanish because $\delta q, \delta \dot q, \dots = 0$ at the boundaries of $\mathsf T,$ getting the Euler-Lagrange equations of motion,$$0 = \sum_{i=0}^{n} (-1)^i ~ \left(\frac{d~}{dt}\right)^{i} \frac{\partial L}{\partial \mathbf {q_i}}.$$
Now, interpreting these equations requires a sort of “dance” in your head!
- First, we take the partial derivatives of the Lagrangian, ignoring the connections of the different derivatives to each other; that is what $\partial L / \partial q_i$ means.
- Then, we insert into those functions the actual path $q(t)$ and its derivatives $\dot q, \ddot q$.
- Then, we take the total time derivatives with respect to $t,$ and insert minus signs corresponding to integration-by-parts.
- And only after all of that is done, does the resulting expression need to be equal to zero.
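As a worked instance of these four steps, take the first-order Lagrangian $L(t, q_0, q_1) = \tfrac12 m q_1^2 - \tfrac12 k q_0^2.$ This harmonic oscillator is chosen purely for illustration; the constants $m$ and $k$ are assumptions of the example and appear nowhere else in this answer. $$\begin{align}
\text{1. partials:}\quad & \frac{\partial L}{\partial q_0} = -k\,q_0, \qquad \frac{\partial L}{\partial q_1} = m\,q_1\\
\text{2. insert the path:}\quad & -k\,q(t), \qquad m\,\dot q(t)\\
\text{3. total derivatives and signs:}\quad & (-1)^0\big({-k\,q}\big) + (-1)^1\frac{d~}{dt}\big(m\,\dot q\big) = -k\,q - m\,\ddot q\\
\text{4. set to zero:}\quad & m\,\ddot q = -k\,q.
\end{align}$$ Only in the third line does anything get differentiated with respect to time, and by then the slots $q_0, q_1$ have already been filled in with $q(t)$ and $\dot q(t).$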
It was very important to me, in resolving the sort of confusion that you are dealing with now, to see that the second step sits in the middle of this interpretation. I even had a professor at Cornell, who taught me all this, confess: “I’m actually not completely sure why we take partials and then total derivatives, but that is what the mathematicians and textbooks tell me to do.” It is the same confusion.
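To make this background concrete, here is a minimal numerical sketch in plain Python. Everything in it is an assumption made for illustration: the harmonic-oscillator Lagrangian with $m = k = 1$, the interval $(0, 1)$, the midpoint-rule discretization, and the names `lagrangian`, `action`, `bump` and `action_change`. The point is only that `lagrangian` is an ordinary function of numbers, while `action` is where the path gets differentiated.

```python
import math

def lagrangian(t, q0, q1):
    """Plain function of three numbers: harmonic oscillator with m = k = 1."""
    return 0.5 * q1**2 - 0.5 * q0**2

def action(path, T=1.0, steps=2000):
    """Midpoint-rule estimate of S[q] = integral of L(t, q, q_dot) over (0, T)."""
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        # The derivative is generated from the path here, outside the Lagrangian.
        q_dot = (path(t + 0.5 * dt) - path(t - 0.5 * dt)) / dt
        total += lagrangian(t, path(t), q_dot) * dt
    return total

def bump(t):
    return math.sin(math.pi * t)  # perturbation vanishing at t = 0 and t = 1

def true_path(t):
    return math.sin(t)  # solves q_ddot = -q

def wrong_path(t):
    return t * math.sin(1.0)  # same endpoints as true_path, but not a solution

def action_change(path, eps):
    """S[q + eps * bump] - S[q] for the given path."""
    return action(lambda t: path(t) + eps * bump(t)) - action(path)

for eps in (1e-2, 1e-3):
    print(eps, action_change(true_path, eps), action_change(wrong_path, eps))
```

Under these assumptions the change along `true_path` should shrink roughly like `eps**2`, which is what “$S[\mathbf q + \delta \mathbf q] \approx S[\mathbf q]$ to first order” means, while along `wrong_path` it should shrink only like `eps`.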
But we know about paths
Now, we usually impose our knowledge of the relationship upon the equations. We don’t write $q_{0,1,\dots n}$ but rather $q, \dot q, \ddot q, \dots$ as if we were taking derivatives. From the perspective of the Lagrangian these are all just symbols, but we abuse the notation for the sake of our own sanity.
Now we come to this total-time-derivative invariance. The Lagrangian function itself does not know that its arguments are time derivatives, but we know that certain assemblies like $\dot q \ddot q$ or $q \dot q$ or $q \ddot q + \dot q^2$ are all total time derivatives of something.
Adding a total time derivative to the Lagrangian cannot affect the equations of motion. The proof is really simple: we go back to the place where the equations of motion came from, the principle of least (strictly speaking, stationary) action, $S[\mathbf q + \delta \mathbf q]\approx S[\mathbf q].$
If we add a total time derivative of something to our Lagrangian, our action principle looks like, for some symbolic expression $K$, $$\begin{align}
S'[\mathbf q] &= \int_\mathsf T dt~\left[L\big(t, \mathbf q(t), \dot{\mathbf q}(t), \ddot {\mathbf q}(t), \dots\big) + \frac{dK}{dt}\right]\\
&= S[\mathbf q] + K[\mathbf q_1,\dot{\mathbf q}_1, \ddot{\mathbf q}_1, \dots] - K[\mathbf q_0,\dot{\mathbf q}_0, \ddot{\mathbf q}_0, \dots]
\end{align}
$$where subscript $1$ indicates the final value at the end of $\mathsf T$ and subscript $0$ indicates the initial value (these are endpoint labels, not the argument slots $\mathbf q_i$ of the Lagrangian). Substitute $\mathbf q + \delta \mathbf q$ for $\mathbf q$ and none of these $K$ terms change, because after the perturbation $q_{0,1}, \dot q_{0,1}, \dots$ are all the same: $\delta \mathbf q$ and its derivatives vanish at both endpoints.
So $K$ just drops out when we analyze the actual physics of the system, $S[\mathbf q + \delta \mathbf q]\approx S[\mathbf q].$ Whatever number the boundary term is, it is the same number on both sides and gets subtracted out.
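The boundary-term argument can be checked numerically with a small sketch. The choices here are assumptions for illustration only: $K = \tfrac12 q^2$ (so that $dK/dt$ is the $q \dot q$ mentioned earlier), a harmonic-oscillator base Lagrangian with $m = k = 1$, an arbitrary test path on $(0, 1)$, and the names `base_lagrangian`, `shifted_lagrangian` and `action_shift`.

```python
import math

def base_lagrangian(t, q0, q1):
    """Harmonic oscillator with m = k = 1 (an illustrative choice)."""
    return 0.5 * q1**2 - 0.5 * q0**2

def shifted_lagrangian(t, q0, q1):
    """Base Lagrangian plus q0 * q1, the slot form of dK/dt for K = q^2 / 2."""
    return base_lagrangian(t, q0, q1) + q0 * q1

def K(q):
    return 0.5 * q**2

def action(L, path, T=1.0, steps=2000):
    """Midpoint-rule estimate of the action of Lagrangian L along the path."""
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        q_dot = (path(t + 0.5 * dt) - path(t - 0.5 * dt)) / dt
        total += L(t, path(t), q_dot) * dt
    return total

def some_path(t):
    return 0.3 + t**2  # arbitrary; it need not solve any equation of motion

def perturbed_path(t):
    return some_path(t) + 0.2 * math.sin(math.pi * t)  # same endpoints

def action_shift(path):
    """How much adding dK/dt changes the action along the given path."""
    return action(shifted_lagrangian, path) - action(base_lagrangian, path)

boundary_term = K(some_path(1.0)) - K(some_path(0.0))
print(action_shift(some_path), action_shift(perturbed_path), boundary_term)
```

Up to discretization error, the three printed numbers should agree: the extra term shifts the action by $K$ at the end minus $K$ at the start, and that shift is the same for the original and the perturbed path because they share endpoints.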
You can formalize this by saying that you can always add to a Lagrangian, or subtract from it, any expression that looks like a total time derivative. Doing so does not preserve the identity of the Lagrangian, and it does not even preserve the value of the action integral, but it only introduces a boundary term into the action integral and therefore it must disappear when the equations of motion are considered.
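As a cross-check, one can also feed the example expressions from above straight into the Euler-Lagrange sum $\sum_i (-1)^i (d/dt)^i\, \partial L / \partial q_i$ and watch them cancel identically, for every path rather than only for solutions. This is a sketch for a single coördinate, treating each expression as a Lagrangian in the slot variables $q_0, q_1, q_2$: $$\begin{align}
L = q_0 q_1:\quad & \dot q - \frac{d~}{dt}\,q = 0\\
L = q_1 q_2:\quad & 0 - \frac{d~}{dt}\,\ddot q + \left(\frac{d~}{dt}\right)^{2}\dot q = 0\\
L = q_0 q_2 + q_1^2:\quad & \ddot q - \frac{d~}{dt}\big(2\dot q\big) + \left(\frac{d~}{dt}\right)^{2} q = \ddot q - 2\ddot q + \ddot q = 0.
\end{align}$$ In each line the terms are, in order, the $i = 0, 1, 2$ contributions after inserting the path, so these additions contribute nothing to the equations of motion, in agreement with the boundary-term argument.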