I am working on implementing the learning shown in this paper:
https://homes.cs.washington.edu/~todorov/courses/amath579/reading/Continuous.pdf
In the paper, the authors devise a continuous time learning extension of the actor critic method in reinforcement learning. My question is simply about the dimensions of some objects in equations shown in the paper.
I will summarize the relevant equations for my problem. Let the state of the system be described by a state vector $\mathbf{x}(t)\in\mathbb{R}^n$ and the actions undertaken by an agent interacting with the system be described by $\mathbf{u}(t)\in\mathbb{R}^m$. Then the natural generalisation of the evolution of the system to continuous time is:
$$\dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t),\mathbf{u}(t)) \tag{1} $$
A quick note is that I am simply copying the paper's notation (which is part of the problem) and I am also numbering the equations according to where they appear in the paper. From this we also see that $\mathbf{f}(\mathbf{x},\mathbf{u})\in\mathbb{R}^n$. I mention these equations just to set the scene and fix the dimensions of the various objects. They then do some derivations and end up describing the actor as:
$$\mathbf{u}(t) = s\left( \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A) + \sigma \mathbf{n}(t)\right) \tag{19}$$
Here, $s()$ is an activation function, $\mathbf{w}^A = \{ \omega_i^A \}$ is a parameter vector, $\mathbf{n}(t)\in\mathbb{R}^m$ and $\mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)\in\mathbb{R}^m$. The last two make sense since $\mathbf{u}(t)\in\mathbb{R}^m$ and I would guess then that the function $s()$ is applied elementwise. We have that $\mathbf{n}(t)$ is some noise to encourage exploration. Now $ \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)$ is defined by further refining the above expression in the appendix to
$$\mathbf{u}(t) = \mathbf{u}^{\text{max}}s\left( \sum_k \omega_k^{A} b_k(\mathbf{x}(t))+ \sigma \mathbf{n}(t)\right) \tag{33}$$
Here is where the problem comes in. We would expect the summation above to output an $m$-dimensional vector, as it is added to the presumably $m$-dimensional $\mathbf{n}(t)$; then $s()$ (a sigmoid function) is applied pointwise so that each of the $m$ elements is limited to a maximum value of 1, and then $\mathbf{u}^{\text{max}}$ pointwise multiplies each of those elements to set its new maximum possible value. However, the $b_k(\mathbf{x}(t))$ above apparently form a Normalized Gaussian Network and are defined as follows:
$$b_k(\mathbf{x}) = \frac{a_k(\mathbf{x})}{\sum_{l=1}^K a_l(\mathbf{x})}$$
with
$$a_k(\mathbf{x}) = e^{-||\mathbf{s}_k^T(\mathbf{x}-\mathbf{c}_k)||^2} \tag{32}$$
So each $\mathbf{c}_k$ is the center in state space of the $k$'th basis function, from which the current position in state space is subtracted. This difference is scaled by $\mathbf{s}_k$ to account for units, and the whole thing is put through a norm and exponentiated. So I believe the function $a_k(\mathbf{x})$ in some sense measures the closeness of the current state $\mathbf{x}$ to the $k$'th center, and $b_k(\mathbf{x})$ does the same thing but normalized over all $K$ basis functions.
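To make my reading of (32) concrete, here is a minimal numpy sketch. All names and shapes here are my own assumptions, not from the paper, and I am treating $\mathbf{s}_k^T(\mathbf{x}-\mathbf{c}_k)$ as an elementwise scaling of the difference before taking the norm:

```python
import numpy as np

# Sketch of the normalized Gaussian network of eq. (32), under my reading:
# s_k is a per-dimension scaling vector applied elementwise, and each
# b_k(x) comes out as a scalar. Names and shapes are my own assumptions.
def basis(x, centers, scales):
    # centers: (K, n) array of c_k; scales: (K, n) array of s_k
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))  # a_k(x), shape (K,)
    return a / a.sum()                                          # b_k(x), normalized over k

rng = np.random.default_rng(0)
K, n = 5, 3
centers = rng.normal(size=(K, n))
scales = np.ones((K, n))
x = rng.normal(size=n)
b = basis(x, centers, scales)
# each b_k is a scalar; together they are K nonnegative numbers summing to 1
```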
But here is my question: surely $b_k(\mathbf{x})$ is a scalar, given its form as a ratio of $a_k$ functions (and the form of $a_k$ itself)? In that case, the summation in (33) does not output an $m$-dimensional vector.
In my code (which does seem to work) I have $m=1$, so I haven't encountered this issue so far, but I need to understand it to build more complicated models.
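For reference, the $m=1$ actor of (33) in my code looks roughly like this (the choice of logistic sigmoid and all names are mine, not the paper's):

```python
import numpy as np

# Rough shape of the m = 1 actor of eq. (33): b_k and w_k^A are scalars,
# so the sum is a scalar and the sigmoid / u_max scaling is unambiguous.
def basis(x, centers, scales):
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))
    return a / a.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_dim, K = 3, 4
x = rng.normal(size=n_dim)
centers, scales = rng.normal(size=(K, n_dim)), np.ones((K, n_dim))
w_A = rng.normal(size=K)          # parameter vector w^A, one scalar per basis
u_max, sigma = 2.0, 0.1
noise = rng.normal()              # n(t), a scalar when m = 1
u = u_max * sigmoid(w_A @ basis(x, centers, scales) + sigma * noise)
# u is a scalar bounded in (0, u_max)
```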
An answer to how the above makes sense may be all I need, but if you think the solution is merely that we replace the scalar with an $m$-dimensional vector whose every entry is the scalar value, then we get another issue later on, when they define a Normalized Gaussian Network $\hat{\mathbf{f}}$ to learn $\dot{\mathbf{x}}(t)$ as follows:
$$\dot{\mathbf{x}}(t)\sim \hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t)) = \sum_k \omega_k^M b_k(\mathbf{x}(t),\mathbf{u}(t)) \tag{35}$$
They say very little about this, but I would assume they mean that the function $b_k$ takes as input the vectors $\mathbf{x}(t)$ and $\mathbf{u}(t)$ stacked into a single vector, and measures closeness to basis centers in the larger joint space of system states and agent actions. The issue is then with the way the $\omega_k^M$ are updated, which obeys the following equation:
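Under that stacking reading, and additionally assuming each $\omega_k^M$ is an $n$-vector (which is exactly what I am unsure about), eq. (35) would come out like this. All names are mine:

```python
import numpy as np

# Sketch of eq. (35) under the reading that b_k is evaluated on the
# stacked vector z = [x; u], and that each weight w_k^M is an n-vector,
# so the weighted sum of scalar b_k's lands back in R^n.
def basis(z, centers, scales):
    a = np.exp(-np.sum((scales * (z - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(1)
n, m, K = 3, 2, 6
x, u = rng.normal(size=n), rng.normal(size=m)
z = np.concatenate([x, u])                 # stacked state-action input, shape (n + m,)
centers = rng.normal(size=(K, n + m))
scales = np.ones((K, n + m))
W = rng.normal(size=(K, n))                # one n-vector w_k^M per basis function
f_hat = W.T @ basis(z, centers, scales)    # prediction of x_dot, shape (n,)
```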
$$\dot{\omega_k^M}(t)=\eta^M\left(\dot{\mathbf{x}}(t)-\hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t))\right)b_k\left(\mathbf{x}(t),\mathbf{u}(t)\right)$$
Now note that the weights $\omega_k^M$ must be vectors, since the difference $\dot{\mathbf{x}}(t)-\hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t))$ is a difference of elements of $\mathbb{R}^n$ and we are assuming the $b_k$ are scalars. I do not know if this is correct, and there is very little explanation given in this part of the paper.
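Just to show the shapes I am imagining, here is an Euler-step sketch of that update under the vector-weight assumption (names, step size, and learning rate are mine):

```python
import numpy as np

# Euler-step sketch of the model-weight update, assuming each w_k^M is an
# n-vector: the scalar b_k scales the n-dimensional prediction error, so
# each dw_k^M is itself an n-vector.
def basis(z, centers, scales):
    a = np.exp(-np.sum((scales * (z - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(2)
n, K = 3, 4
z = rng.normal(size=5)                      # stacked [x; u]
centers, scales = rng.normal(size=(K, 5)), np.ones((K, 5))
W = rng.normal(size=(K, n))                 # rows are the vectors w_k^M
b = basis(z, centers, scales)               # (K,) scalars
x_dot = rng.normal(size=n)                  # observed state derivative
f_hat = W.T @ b                             # model prediction, (n,)
eta_M, dt = 0.1, 0.01
dW = eta_M * np.outer(b, x_dot - f_hat)     # (K, n): one n-vector per k
W = W + dt * dW                             # Euler step on the weights
```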
Any help understanding the dimensions of these objects would be appreciated.
Edit 1: Some additional information that may help in deducing the dimensions involved: the weights $\omega_i^A$ in equation (33) are updated by the following equation:
$$\dot{\omega}_i^A = \eta^A \delta(t) \mathbf{n}(t) \frac{\partial \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)}{\partial \omega_i^A}\tag{20}$$
where $\eta^A$ and $\delta(t)$ are definitely scalars.
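For the $m=1$ case in my code, (20) reduces to a per-weight scalar update, since with $\mathbf{A}(\mathbf{x};\mathbf{w}^A)=\sum_k \omega_k^A b_k(\mathbf{x})$ we get $\partial \mathbf{A}/\partial \omega_i^A = b_i(\mathbf{x})$. A sketch (names are mine):

```python
import numpy as np

# Sketch of the actor update (20) for the m = 1 case: dA/dw_i = b_i(x),
# so each scalar weight update is eta^A * delta(t) * n(t) * b_i(x).
def basis(x, centers, scales):
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(3)
n_dim, K = 3, 4
x = rng.normal(size=n_dim)
centers, scales = rng.normal(size=(K, n_dim)), np.ones((K, n_dim))
b = basis(x, centers, scales)              # (K,) scalars, these are dA/dw_i
eta_A, delta, noise = 0.05, 0.3, rng.normal()   # delta(t) and n(t) scalar here
dw_A = eta_A * delta * noise * b           # (K,): one scalar per weight w_i^A
```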
As a further note, I pointed out above (and they specifically state in the paper) that $\mathbf{w}^A$ is a "parameter vector", so I would be surprised if $\omega_k^A$ in (33) were a vector too. However, they say almost nothing about (35), so there I'm less sure.