Questions tagged [computer-vision]

For questions related to computer vision, an interdisciplinary scientific field (which may, for example, use image-processing techniques) that deals with how computers can gain high-level understanding from digital images or videos. For example, image recognition (that is, identifying the types of objects in an image) is a computer vision problem.

For more info, see e.g. https://en.wikipedia.org/wiki/Computer_vision.

534 questions
88 votes, 9 answers

How is it possible that deep neural networks are so easily fooled?

The following page/study demonstrates that deep neural networks are easily fooled: they give high-confidence predictions for unrecognisable images. How is this possible? Can you please explain, ideally in plain English?
21 votes, 1 answer

What is a fully convolution network?

I was surveying some literature related to Fully Convolutional Networks and came across the following phrase: "A fully convolutional network is achieved by replacing the parameter-rich fully connected layers in standard CNN architectures with…"
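The replacement the excerpt describes rests on the fact that a fully connected layer over a 1×1 feature map is mathematically a 1×1 convolution, so the same weights can slide over any spatial size. A minimal numpy sketch (all shapes and sizes here are illustrative, not taken from any specific FCN implementation):

```python
import numpy as np

def conv1x1(x, w, b):
    """Apply a 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out).
    The same weight matrix is applied at every spatial location."""
    return x @ w + b

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 10))   # "FC" classifier: 512 features -> 10 classes
b = np.zeros(10)

# On a 1x1 feature map this is exactly the fully connected layer...
feat = rng.normal(size=(1, 1, 512))
fc_out = feat.reshape(512) @ w + b
assert np.allclose(conv1x1(feat, w, b).reshape(10), fc_out)

# ...but the same weights also accept a larger input, producing a coarse
# per-location score map instead of a single class vector.
big = rng.normal(size=(7, 7, 512))
score_map = conv1x1(big, w, b)
print(score_map.shape)  # (7, 7, 10)
```

This is why a fully convolutional network can ingest images of arbitrary size and emit dense (per-location) predictions, which is what makes it suitable for tasks like semantic segmentation.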
17 votes, 1 answer

Are information processing rules from Gestalt psychology still used in computer vision today?

There are decades-old books in machine vision which, by implementing various information-processing rules from Gestalt psychology, achieved impressive results with little code or special hardware in image identification and visual…
13 votes, 3 answers

Is it possible to train a neural network to estimate a vehicle's length?

I have a large dataset (over 100k samples) of vehicles with the ground truth of their lengths. Is it possible to train a deep network to measure/estimate vehicle length? I haven't seen any papers related to estimating object size using a deep neural…
12 votes, 1 answer

In Computer Vision, what is the difference between a transformer and attention?

Having been studying computer vision for a while, I still cannot understand the difference between a transformer and attention.
novice
12 votes, 0 answers

Extending FaceNet’s triplet loss to object recognition

FaceNet uses a novel loss metric (triplet loss) to train a model to output embeddings (128-D from the paper), such that any two faces of the same identity will have a small Euclidean distance, and such that any two faces of different identities will…
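The triplet loss the excerpt describes can be written down in a few lines. A minimal numpy sketch, assuming squared Euclidean distances and a margin hyperparameter as in the FaceNet paper (the 0.2 margin and toy embeddings below are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: pull same-identity embeddings together,
    push different-identity embeddings apart by at least `margin`
    (in squared L2 distance)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy 128-D unit-norm embeddings (FaceNet outputs 128-D embeddings).
rng = np.random.default_rng(0)
a = rng.normal(size=128); a /= np.linalg.norm(a)
p = a + 0.01 * rng.normal(size=128); p /= np.linalg.norm(p)  # same identity
n = rng.normal(size=128); n /= np.linalg.norm(n)             # different identity
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

Extending this to general object recognition mostly changes the sampling: "identity" becomes "object instance (or class)", and the hard part remains mining informative triplets rather than the loss itself.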
11 votes, 3 answers

Is it difficult to learn the rotated bounding box for a (rotated) object?

I have checked out many methods and papers, like YOLO, SSD, etc., with good results in detecting a rectangular box around an object. However, I could not find any paper that shows a method that learns a rotated bounding box. Is it difficult to learn…
11 votes, 2 answers

Do deep learning algorithms represent ensemble-based methods?

According to the Wikipedia article on deep learning: Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of…
10 votes, 2 answers

Does placing tires on aircraft help prevent AI-assisted attacks?

I see in the screenshots below from some footage of Operation Spider's Web (Ukrainians blowing up Russian airplanes with drones on 2025-06-01) that tires were placed on planes. Some people say that this is to confuse AI-assisted attacks.
Franck Dernoncourt
9 votes, 1 answer

Why does nobody use decision trees for visual question answering?

I'm starting a project that will involve computer vision, visual question answering, and explainability. I am currently choosing what type of algorithm to use for my classifier - a neural network or a decision tree. It would seem to me that, because…
9 votes, 1 answer

In YOLO, what exactly do the values associated with each anchor box represent?

I'm going through Andrew Ng's course, which talks about YOLO, but he doesn't go into the implementation details of anchor boxes. Having looked through the code, I see that each anchor box is represented by two values, but what exactly are these values…
9 votes, 2 answers

Why is a class embedding token added to the Visual Transformer?

In the famous work on Visual Transformers, the image is split into patches of a certain size (say 16×16), and these patches are treated like tokens in NLP tasks. In order to perform classification, a CLS token is added at the beginning of the…
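The tokenization the excerpt describes can be sketched with plain array operations. A minimal numpy sketch, assuming a 224×224 RGB input, 16×16 patches, and a small embedding dimension (all sizes illustrative; a real ViT learns the projection and the CLS token):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                              # embedding dimension (illustrative)
num_patches = (224 // 16) ** 2      # 14 x 14 = 196 patches

# Flattened 16x16x3 patches, linearly projected to d-dimensional tokens.
patches = rng.normal(size=(num_patches, 16 * 16 * 3))
proj = rng.normal(size=(16 * 16 * 3, d))
tokens = patches @ proj             # (196, d) patch tokens

# A learned [CLS] token is prepended; after the transformer layers,
# its final state is fed to the classification head.
cls = rng.normal(size=(1, d))
seq = np.concatenate([cls, tokens]) # (197, d): [CLS] + patch tokens
print(seq.shape)  # (197, 64)
```

The point of the extra token is that it has no patch of its own, so whatever ends up in its final state must be aggregated from the whole image via attention.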
9 votes, 1 answer

What are sim2sim, sim2real and real2real?

Recently, I keep hearing the terms sim2sim, sim2real and real2real. Would anyone explain the meaning/motivation of these terms (in the DL/RL research community)? What are the challenges in this research area? Anything intuitive would be appreciated!
9 votes, 4 answers

What could an oscillating training loss curve represent?

I tried to create a simple model that receives an $80 \times 130$ pixel image. I only had 35 training images and 10 test images. I trained this model for a binary classification task. The architecture of the model is described below. conv2d_1 (Conv2D) …
8 votes, 3 answers

What are the state-of-the-art approaches for detecting the most important "visual attention" area of an image?

I'm trying to detect the visual attention area in a given image and crop the image into that area. For instance, given an image of any size and a rectangle of say $L \times W$ dimension as an input, I would like to crop the image to the most…