I am working on implementing the learning shown in this paper:
https://homes.cs.washington.edu/~todorov/courses/amath579/reading/Continuous.pdf
In the paper, the authors devise a continuous time learning extension of the actor critic method in reinforcement learning. My question is simply about the dimensions of some objects in equations shown in the paper.
I will summarize the relevant equations for my problem. Let the state of the system be described by a state vector $\mathbf{x}(t)\in\mathbb{R}^n$ and the actions undertaken by an agent interacting with the system be described by $\mathbf{u}(t)\in\mathbb{R}^m$. Then the natural generalisation of the evolution of the system to continuous time is:
$$\dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t),\mathbf{u}(t)) \tag{1} $$
A quick note is that I am simply copying the paper's notation (which is part of the problem) and I am also numbering the equations according to where they appear in the paper. From this we also see that $\mathbf{f}(\mathbf{x},\mathbf{u})\in\mathbb{R}^n$. I mention these equations just to set the scene and fix the dimensions of the various objects. They then do some derivations and end up describing the actor as:
$$\mathbf{u}(t) = s\left( \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A) + \sigma \mathbf{n}(t)\right) \tag{19}$$
Here, $s()$ is an activation function, $\mathbf{w}^A = \{ \omega_i^A \}$ is a parameter vector, $\mathbf{n}(t)\in\mathbb{R}^m$ and $\mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)\in\mathbb{R}^m$. The last two make sense since $\mathbf{u}(t)\in\mathbb{R}^m$ and I would guess then that the function $s()$ is applied elementwise. We have that $\mathbf{n}(t)$ is some noise to encourage exploration. Now $ \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)$ is defined by further refining the above expression in the appendix to
$$\mathbf{u}(t) = \mathbf{u}^{\text{max}}s\left( \sum_k \omega_k^{A} b_k(\mathbf{x}(t))+ \sigma \mathbf{n}(t)\right) \tag{33}$$
Here is where the problem comes in. We would expect the summation above to output an $m$-dimensional vector, as it is added to the presumably $m$-dimensional $\mathbf{n}(t)$; then $s()$ (a sigmoid function) is applied pointwise so that each of the $m$ elements is limited to a maximum value of 1, and then $\mathbf{u}^{\text{max}}$ pointwise multiplies each of those elements to set its new maximum possible value. However, the $b_k(\mathbf{x}(t))$ above apparently form a Normalized Gaussian Network and are defined as follows:
$$b_k(\mathbf{x}) = \frac{a_k(\mathbf{x})}{\sum_{l=1}^K a_l(\mathbf{x})}$$
with
$$a_k(\mathbf{x}) = e^{-||\mathbf{s}_k^T(\mathbf{x}-\mathbf{c}_k)||^2} \tag{32}$$
So each $\mathbf{c}_k$ is the center in state space of the $k$'th basis function, from which the current position in state space is subtracted. This difference is scaled by $\mathbf{s}_k$ to account for units, and the whole thing is put through a norm and exponentiated. So I believe the function $a_k(\mathbf{x})$ in some sense measures the closeness of the current state $\mathbf{x}$ to the $k$'th center, and $b_k(\mathbf{x})$ does the same thing but normalized over all $K$ basis functions.
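To make my reading of (32) concrete, here is a minimal numpy sketch. All names and shapes here are my own assumptions, not from the paper, and I am treating $\mathbf{s}_k^T(\mathbf{x}-\mathbf{c}_k)$ as an elementwise scaling of the difference before taking the norm:

```python
import numpy as np

# Sketch of the normalized Gaussian network of eq. (32), under my reading:
# s_k is a per-dimension scaling vector applied elementwise, and each
# b_k(x) comes out as a scalar. Names and shapes are my own assumptions.
def basis(x, centers, scales):
    # centers: (K, n) array of c_k; scales: (K, n) array of s_k
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))  # a_k(x), shape (K,)
    return a / a.sum()                                          # b_k(x), normalized over k

rng = np.random.default_rng(0)
K, n = 5, 3
centers = rng.normal(size=(K, n))
scales = np.ones((K, n))
x = rng.normal(size=n)
b = basis(x, centers, scales)
# each b_k is a scalar; together they are K nonnegative numbers summing to 1
```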
But here is my question: surely $b_k(\mathbf{x})$ is a scalar, given its form as a ratio of $a_k$ functions (and the form of $a_k$ itself)? In that case, the summation in (33) does not output an $m$-dimensional vector.
In my code (which does seem to work) I have $m=1$, so I haven't encountered this issue so far, but I need to understand it to build more complicated models.
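For reference, the $m=1$ actor of (33) in my code looks roughly like this (the choice of logistic sigmoid and all names are mine, not the paper's):

```python
import numpy as np

# Rough shape of the m = 1 actor of eq. (33): b_k and w_k^A are scalars,
# so the sum is a scalar and the sigmoid / u_max scaling is unambiguous.
def basis(x, centers, scales):
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))
    return a / a.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_dim, K = 3, 4
x = rng.normal(size=n_dim)
centers, scales = rng.normal(size=(K, n_dim)), np.ones((K, n_dim))
w_A = rng.normal(size=K)          # parameter vector w^A, one scalar per basis
u_max, sigma = 2.0, 0.1
noise = rng.normal()              # n(t), a scalar when m = 1
u = u_max * sigmoid(w_A @ basis(x, centers, scales) + sigma * noise)
# u is a scalar bounded in (0, u_max)
```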
An answer to how the above makes sense may be all I need, but if you think the solution is merely that we replace the scalar with an $m$-dimensional vector whose every entry is the scalar value, then we get another issue later on, when they define a Normalized Gaussian Network $\hat{\mathbf{f}}$ to learn $\dot{\mathbf{x}}(t)$ as follows:
$$\dot{\mathbf{x}}(t)\sim \hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t)) = \sum_k \omega_k^M b_k(\mathbf{x}(t),\mathbf{u}(t)) \tag{35}$$
They say very little about this, but I would assume they mean that the function $b_k$ takes as input the vectors $\mathbf{x}(t)$ and $\mathbf{u}(t)$ stacked into a single vector, and measures closeness to basis centers in the larger joint space of system states and agent actions. The issue is then with the way the $\omega_k^M$ are updated, which obeys the following equation:
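Under that stacking reading, and additionally assuming each $\omega_k^M$ is an $n$-vector (which is exactly what I am unsure about), eq. (35) would come out like this. All names are mine:

```python
import numpy as np

# Sketch of eq. (35) under the reading that b_k is evaluated on the
# stacked vector z = [x; u], and that each weight w_k^M is an n-vector,
# so the weighted sum of scalar b_k's lands back in R^n.
def basis(z, centers, scales):
    a = np.exp(-np.sum((scales * (z - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(1)
n, m, K = 3, 2, 6
x, u = rng.normal(size=n), rng.normal(size=m)
z = np.concatenate([x, u])                 # stacked state-action input, shape (n + m,)
centers = rng.normal(size=(K, n + m))
scales = np.ones((K, n + m))
W = rng.normal(size=(K, n))                # one n-vector w_k^M per basis function
f_hat = W.T @ basis(z, centers, scales)    # prediction of x_dot, shape (n,)
```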
$$\dot{\omega_k^M}(t)=\eta^M\left(\dot{\mathbf{x}}(t)-\hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t))\right)b_k\left(\mathbf{x}(t),\mathbf{u}(t)\right)$$
Now note that the weights $\omega_k^M$ must be vectors, since the difference $\dot{\mathbf{x}}(t)-\hat{\mathbf{f}}(\mathbf{x}(t),\mathbf{u}(t))$ is a difference of elements of $\mathbb{R}^n$ and we are assuming the $b_k$ are scalars. I do not know if this is correct, and there is very little explanation given in this part of the paper.
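Just to show the shapes I am imagining, here is an Euler-step sketch of that update under the vector-weight assumption (names, step size, and learning rate are mine):

```python
import numpy as np

# Euler-step sketch of the model-weight update, assuming each w_k^M is an
# n-vector: the scalar b_k scales the n-dimensional prediction error, so
# each dw_k^M is itself an n-vector.
def basis(z, centers, scales):
    a = np.exp(-np.sum((scales * (z - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(2)
n, K = 3, 4
z = rng.normal(size=5)                      # stacked [x; u]
centers, scales = rng.normal(size=(K, 5)), np.ones((K, 5))
W = rng.normal(size=(K, n))                 # rows are the vectors w_k^M
b = basis(z, centers, scales)               # (K,) scalars
x_dot = rng.normal(size=n)                  # observed state derivative
f_hat = W.T @ b                             # model prediction, (n,)
eta_M, dt = 0.1, 0.01
dW = eta_M * np.outer(b, x_dot - f_hat)     # (K, n): one n-vector per k
W = W + dt * dW                             # Euler step on the weights
```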
Any help understanding the dimensions of these objects would be appreciated.
Edit 1: Some additional information that may help in deducing the dimensions involved: the weights $\omega_i^A$ in equation (33) are updated by the following equation:
$$\dot{\omega}_i^A = \eta^A \delta(t) \mathbf{n}(t) \frac{\partial \mathbf{A}(\mathbf{x}(t);\mathbf{w}^A)}{\partial \omega_i^A}\tag{20}$$
where $\eta^A$ and $\delta(t)$ are definitely scalars.
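For the $m=1$ case in my code, (20) reduces to a per-weight scalar update, since with $\mathbf{A}(\mathbf{x};\mathbf{w}^A)=\sum_k \omega_k^A b_k(\mathbf{x})$ we get $\partial \mathbf{A}/\partial \omega_i^A = b_i(\mathbf{x})$. A sketch (names are mine):

```python
import numpy as np

# Sketch of the actor update (20) for the m = 1 case: dA/dw_i = b_i(x),
# so each scalar weight update is eta^A * delta(t) * n(t) * b_i(x).
def basis(x, centers, scales):
    a = np.exp(-np.sum((scales * (x - centers)) ** 2, axis=1))
    return a / a.sum()

rng = np.random.default_rng(3)
n_dim, K = 3, 4
x = rng.normal(size=n_dim)
centers, scales = rng.normal(size=(K, n_dim)), np.ones((K, n_dim))
b = basis(x, centers, scales)              # (K,) scalars, these are dA/dw_i
eta_A, delta, noise = 0.05, 0.3, rng.normal()   # delta(t) and n(t) scalar here
dw_A = eta_A * delta * noise * b           # (K,): one scalar per weight w_i^A
```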
As a further note, I pointed out above (and they specifically state in the paper) that $\mathbf{w}^A$ is a "parameter vector", so I would be surprised if $\omega_k^A$ in (33) were a vector too. However, they say almost nothing about (35), so there I'm less sure.