PCA is the method that finds the linear subspace spanned by orthogonal principal components which minimizes the reconstruction MSE of the data; by keeping enough components, it explains the data's variance up to a predefined threshold. If you used another loss function, say, absolute error or a different divergence, you wouldn't necessarily end up with the same optimal subspace.
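Concretely, for centered data points $x_1,\dots,x_n \in \mathbb{R}^d$, one common way to write the rank-$k$ PCA objective is

$$
\min_{W \in \mathbb{R}^{d\times k},\; W^\top W = I_k} \;\frac{1}{n}\sum_{i=1}^{n}\bigl\|x_i - W W^\top x_i\bigr\|_2^2 .
$$

Because $\|x_i\|_2^2 = \|W W^\top x_i\|_2^2 + \|x_i - W W^\top x_i\|_2^2$ for an orthogonal projection, minimizing this squared reconstruction error is equivalent to maximizing the variance captured by the projection; with a different loss that decomposition no longer holds, which is why the optimal subspace can change.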
The data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of $p$ unit vectors... Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line.
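Here is a minimal sketch of this in numpy (the data and variable names are illustrative, not from any particular library): the top-$k$ right singular vectors of the centered data give the principal directions, and projecting onto them yields the minimum-MSE rank-$k$ reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # toy data, shape (n, d)
Xc = X - X.mean(axis=0)                                      # center the data

# SVD of the centered data: rows of Vt are the principal directions,
# ordered by singular value (i.e., by explained variance).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
Vk = Vt[:k].T                      # d x k matrix of top-k principal components
X_proj = Xc @ Vk                   # k-dimensional scores
X_rec = X_proj @ Vk.T              # rank-k reconstruction in the original space

mse = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"reconstruction MSE: {mse:.4f}, variance explained: {explained:.2%}")
```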
On the other hand, a linear autoencoder (AE) has a linear encoder and a linear decoder. When you train such an AE to minimize the MSE between the input and its reconstruction, you are essentially looking for two matrices $W$ (encoder) and $W'$ (decoder) such that, on average over inputs $x$, the reconstruction $x' = W'Wx$ is as close as possible to $x$ in squared error. The AE therefore only needs to learn to encode the data into a predefined $k$-dimensional bottleneck latent space while minimizing MSE, just as PCA does. Although the basis learned by the linear AE is generally not orthogonal, geometrically the subspace it spans at the optimum is the same as the one spanned by the top-$k$ principal components from PCA.
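The sketch below illustrates this, assuming plain gradient descent on the MSE objective (no specific deep-learning library, and the learning rate and iteration count are arbitrary choices): it trains a linear AE and then compares the subspace spanned by its decoder with the top-$k$ principal subspace via their orthogonal projection matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)

# Reference: top-k principal components from SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T                          # d x k, orthonormal columns

# Linear autoencoder: z = x @ We (encoder), x_hat = z @ Wd (decoder).
We = 0.01 * rng.normal(size=(d, k))
Wd = 0.01 * rng.normal(size=(k, d))
lr = 1e-3
for _ in range(20_000):
    Z = Xc @ We
    R = Z @ Wd - Xc                    # reconstruction residual
    grad_Wd = 2 * Z.T @ R / n          # gradient of mean squared error w.r.t. decoder
    grad_We = 2 * Xc.T @ R @ Wd.T / n  # gradient w.r.t. encoder
    We -= lr * grad_We
    Wd -= lr * grad_Wd

# Compare subspaces: orthonormal basis of the decoder's row space vs. PCA's top-k.
Q, _ = np.linalg.qr(Wd.T)
gap = np.linalg.norm(Q @ Q.T - Vk @ Vk.T)
print(f"projection-matrix gap (approaches 0 as training converges): {gap:.4f}")
```

The individual columns of $W$ and rows of $W'$ will generally differ from the principal components (they need not be orthogonal or ordered by variance), which is why the comparison is done between projection matrices rather than between the weight matrices themselves.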