"According to the above logic, an object whose size is smaller than that (say a raindrop) should NEVER decrease in size while the distance increases." This is false!
The logic says that a good simple model of a camera is a pinhole camera (camera obscura). This is nice, because the aperture can be idealized as a point with no spatial extent. The plus side of a camera obscura is that you don't really get depth-of-field blurring; the downside is that the image is very dim. Your question implies that you understand why things appear in perspective for a pinhole camera.
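To make the pinhole argument concrete, here is a small Python sketch. The function name and the numbers (a 2 mm raindrop, a 17 mm pinhole-to-screen distance) are my own illustrative assumptions. By similar triangles, image size = object size × screen distance / object distance, so the image of *any* object, however small, shrinks as the object moves away:

```python
def pinhole_image_size(object_size_m: float, distance_m: float, screen_distance_m: float) -> float:
    """Image size on the screen of an idealized (point-like) pinhole camera.

    Rays from the top and bottom of the object both pass through the pinhole,
    so by similar triangles:
        image_size / screen_distance = object_size / distance
    """
    if distance_m <= 0:
        raise ValueError("object must be in front of the pinhole")
    return object_size_m * screen_distance_m / distance_m


raindrop = 0.002  # assumed raindrop size: 2 mm (illustrative)
screen = 0.017    # assumed pinhole-to-screen distance: ~17 mm, roughly eye-sized
for d in (1.0, 2.0, 4.0, 8.0):
    size_um = pinhole_image_size(raindrop, d, screen) * 1e6
    print(f"d = {d:4.1f} m -> image = {size_um:.2f} um")
```

Each doubling of the distance halves the image size; the object being small doesn't change that.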
Unfortunately, nature has said (for the most part): nope, not good enough, we need a better eye to survive in the wild. Our eyes use a lens, a retina, and a larger aperture. That's where your confusion comes in. What the heck goes on with a larger aperture? Without a lens, a point on any object would send rays to every visible point on your retina, light from every object would spread itself over the whole retina, and you'd just see a blur: a mash of colors.
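A rough geometric sketch of the lensless case, assuming an ideal circular hole of diameter A a distance s in front of the retina and a point source a distance d in front of the hole (diffraction is ignored, and the names and numbers are hypothetical). Rays from the point graze opposite edges of the hole and keep diverging, so the lit patch has diameter A(d + s)/d, which is never smaller than the hole itself:

```python
def blur_spot_diameter(aperture_m: float, source_distance_m: float, screen_distance_m: float) -> float:
    """Diameter of the patch lit by a point source through an open hole (no lens).

    By similar triangles the cone of rays through the hole widens to
        aperture * (source_distance + screen_distance) / source_distance
    at the screen. Geometric optics only; diffraction is ignored.
    """
    if source_distance_m <= 0:
        raise ValueError("source must be in front of the hole")
    return aperture_m * (source_distance_m + screen_distance_m) / source_distance_m


screen = 0.017  # assumed hole-to-retina distance (m)
for hole in (0.0002, 0.004):  # assumed: a small pinhole vs. a pupil-sized opening (m)
    spot = blur_spot_diameter(hole, 2.0, screen)
    print(f"hole = {hole * 1e3:.1f} mm -> blur spot = {spot * 1e3:.3f} mm")
```

With a pupil-sized opening, every point of the scene smears into a disk millimeters wide, so the disks from neighboring points overlap into the blur described above.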
So insisting on a finite-size aperture means that you now need a lens... well, great, but how could something possibly focus light? Paraphrasing my answer on another post:
> You can't just magically change the direction of a ray of light; you have to do it by mere mortal means, in the form of a slab of glass. Even simple lenses act in a way that's rather complicated.
If you really aren't satisfied with the pinhole camera explanation, you need ray optics. You need Snell's law, which tells you how light bends when it crosses from air into glass. You need the paraxial approximation, which gives you nice formulas for what lenses do. And you'll have to start worrying about depth of field and other things: you'll have to focus your camera differently depending on the distance to the raindrop you're worried about. And you'll find that the details aren't all that important!
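A minimal sketch of those two ingredients, under assumed illustrative numbers (refractive index 1.5 for glass, a 17 mm focal length, a 2 mm raindrop; the function names are hypothetical). It uses Snell's law, $n_1 \sin\theta_1 = n_2 \sin\theta_2$, and the paraxial thin-lens equation $1/f = 1/d_o + 1/d_i$ with magnification $m = -d_i/d_o$:

```python
import math


def snell_refraction_angle(theta_in_rad: float, n_in: float, n_out: float) -> float:
    """Refracted angle from Snell's law: n_in * sin(theta_in) = n_out * sin(theta_out)."""
    s = n_in * math.sin(theta_in_rad) / n_out
    if abs(s) > 1.0:
        raise ValueError("total internal reflection: no refracted ray")
    return math.asin(s)


def thin_lens_image(focal_m: float, object_distance_m: float) -> tuple[float, float]:
    """Paraxial thin lens, 1/f = 1/d_o + 1/d_i. Returns (d_i, magnification)."""
    if object_distance_m <= focal_m:
        raise ValueError("object inside the focal length: no real image")
    d_i = 1.0 / (1.0 / focal_m - 1.0 / object_distance_m)
    return d_i, -d_i / object_distance_m


# A ray hitting glass (assumed n = 1.5) at 30 degrees from the normal bends toward the normal.
theta_out = snell_refraction_angle(math.radians(30.0), 1.0, 1.5)
print(f"30.0 deg in air -> {math.degrees(theta_out):.2f} deg in glass")

f = 0.017         # assumed focal length (m), roughly eye-sized
raindrop = 0.002  # assumed object size (m)
for d_o in (1.0, 2.0, 4.0, 8.0):
    d_i, m = thin_lens_image(f, d_o)
    print(f"d_o = {d_o:3.0f} m: focus at d_i = {d_i * 1e3:.3f} mm, "
          f"image = {abs(m) * raindrop * 1e6:.2f} um")
```

The image distance $d_i$ drifts with $d_o$ (that's refocusing), but $|m| \approx f/d_o$ for $d_o \gg f$, so the raindrop's image still shrinks with distance, just as for the pinhole.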
Try to work out the details with a lens focused at infinity, which takes in a tube of parallel light rays from any direction and turns it into a point on your photosensitive surface (see, e.g., this page/image, where parallel rays of light are focused onto the same point). Spoiler: the mapping "direction" $\mapsto$ "point on your photograph" means you're almost working with a pinhole camera again.
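One way to check that spoiler, sketched in Python with paraxial ray tracing through an ideal thin lens (the function name, focal length, and ray angle are assumptions for illustration). A ray at height $y$ and angle $\theta$ leaves the lens with angle $\theta - y/f$; after travelling $f$ to the focal plane its height is $y + f(\theta - y/f) = f\theta$, which depends only on the direction:

```python
def focal_plane_position(ray_height_m: float, ray_angle_rad: float, focal_m: float) -> float:
    """Where a paraxial ray hits the back focal plane of an ideal thin lens.

    Thin lens:         height unchanged, angle -> angle - height / f
    Propagate by f:    height -> height + f * angle
    Combined:          height + f * (angle - height / f) = f * angle
    so the landing point depends only on the ray's direction, not its height.
    """
    bent_angle = ray_angle_rad - ray_height_m / focal_m
    return ray_height_m + focal_m * bent_angle


f = 0.017         # assumed focal length (m)
direction = 0.01  # assumed incoming direction (rad), about 0.57 degrees
for height in (-0.002, 0.0, 0.001, 0.002):  # rays spread across a 4 mm wide bundle
    x = focal_plane_position(height, direction, f)
    print(f"ray at height {height * 1e3:+.1f} mm lands at {x * 1e6:.3f} um")
```

A pinhole a distance $f$ in front of the film sends a ray with direction $\theta$ to the same height $f\theta$ (paraxially), which is the sense in which the focused lens is "almost a pinhole camera," but one that collects far more light.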