
I understand why both high dimensionality and overfitting are undesirable, but recently I came across multiple sources mentioning that

High-dimensional data often leads to overfitting ([example][1])

But as far as I understand, when more features are measured and considered, I need much more data to train a model, per the curse of dimensionality. This means that when more features are involved, it's much more likely that my model is underfitting: it learns just noise and doesn't have enough data to find a meaningful pattern in my data that is able to generalize.

Can somebody clarify? [1]: https://vtiya.medium.com/the-relationship-between-high-dimensionality-and-overfitting-5bca0967b60f

1 Answer


These are two different points, which nevertheless point in the same direction:

  1. The more features you have, the easier it is to overfit
  2. High dimensionality leads to overfitting

The "I need more data to train..." observation is exactly point (1): you need more data to train your model so that it does not overfit on the small sample you have.

I'd encourage you to consider the linear regression case: if you have N features and fewer than N samples, even linear regression overfits, since it can fit the training data exactly (thus, you need more data so that your model does not overfit).
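A minimal sketch of that linear regression case, assuming a pure-noise target so there is no real pattern to find: with more features than samples, ordinary least squares fits the training set (almost) perfectly, yet fails on fresh data from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 20, 50  # fewer samples than features
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)  # pure-noise target: nothing to learn

# With n_features > n_samples the system is underdetermined, so a
# coefficient vector exists that interpolates the training data exactly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.abs(X @ coef - y).max()

# Fresh data drawn from the same (pure-noise) process
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)
test_mse = np.mean((X_test @ coef - y_test) ** 2)

print(train_residual)  # essentially 0: the noise is fit perfectly
print(test_mse)        # clearly nonzero: the "pattern" does not generalize
```

The gap between the near-zero training residual and the substantial test error is exactly the overfitting the answer describes.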

Alberto