
I'm curious to know about the capabilities of AI today in 2022. I know that AI has become pretty good at recognizing things like objects in photos. But what about when it comes to elements in HTML? Would it be feasible to use AI to determine things like:

  • Is there a call-to-action? Basically, a button or main action that directs the user somewhere. The call-to-action text can obviously vary widely.
  • Is there a form on the page for the user to fill out?

The last time I tried running a rendered image of a website through image recognition software, such as Google Vision or Amazon's Rekognition, it didn't detect these things, which didn't surprise me. Maybe there's a better or alternative approach, though, such as using the source code? The end goal would be to determine whether the page is meant to capture leads, and form elements are among the criteria we'd be looking for. Maybe this can also be framed as a classification task.

As I understand it, AI is a broad term. So, if this were a feasible project, I'd also be curious to know which branch of AI would be the right one to explore.

nbro
kenshin9

2 Answers


The branch of AI devoted to image-related processing is computer vision (nowadays powered by deep learning, indeed). For this project in particular, you would probably need to train an object detection model (e.g. Faster R-CNN, YOLO) that can find the relevant parts of a website, such as buttons and forms. An object detector outputs the location (a bounding box, i.e. a rectangle) and the class (which kind of object it is) of each detection. Alternatively, you can look at a semantic or even instance segmentation model (look here for an explanation), e.g. Mask R-CNN, which instead produces segmentation masks and can, in principle, handle arbitrarily complex shapes.

  • To train such a model, you have to collect a dataset of (image, label) pairs, where each label comprises the bounding box or segmentation mask of an object of interest on the page, together with its class (e.g. button, form, video, header, etc.).
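To make the label structure concrete, here is a sketch of what one annotation file could look like, loosely following the COCO object-detection format; the category names ("button", "form") and file names are hypothetical choices for this lead-capture task, not part of any existing dataset:

```python
# A minimal COCO-style annotation sketch for webpage screenshots.
# Category names and file names are hypothetical examples.
annotations = {
    "images": [
        {"id": 1, "file_name": "landing_page.png", "width": 1280, "height": 800},
    ],
    "categories": [
        {"id": 1, "name": "button"},
        {"id": 2, "name": "form"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, as in the COCO format
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [540, 620, 200, 48]},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [400, 300, 480, 260]},
    ],
}

def boxes_for_class(data, name):
    """Return all bounding boxes belonging to a given category name."""
    cat_ids = {c["id"] for c in data["categories"] if c["name"] == name}
    return [a["bbox"] for a in data["annotations"] if a["category_id"] in cat_ids]

print(boxes_for_class(annotations, "form"))  # → [[400, 300, 480, 260]]
```

Using a standard format like this means off-the-shelf training code for Faster R-CNN or YOLO can consume the dataset with little or no adaptation.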

I think it would be interesting to train a multi-modal model that learns simultaneously from images of webpages and from their source code. In this case, you would need to pair CV with NLP (natural language processing). In increasing order of complexity, code (and text in general) can be processed by recurrent neural networks, attention models, and large language models (e.g. BERT, GPT, etc.).

  • In this second solution, you have two neural networks: one processes the image, the other the text/code; you have to combine the two at some point, and then add the output layer.
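A minimal late-fusion sketch of that two-branch idea, in pure Python with made-up embedding dimensions; in practice each embedding would come from a trained CNN and language model rather than being hard-coded, and the classifier weights would be learned, not random:

```python
import math
import random

random.seed(0)

def linear(x, weights, bias):
    """A plain linear layer: y = Wx + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Pretend these come from the two branches:
image_embedding = [0.2, -0.5, 0.7]  # e.g. a CNN over the screenshot
text_embedding = [0.1, 0.9]         # e.g. a language model over the HTML source

# Late fusion: concatenate both embeddings, then apply a final classifier head.
fused = image_embedding + text_embedding

# Randomly initialised 2-way head ("captures leads" vs. "does not").
W = [[random.uniform(-1, 1) for _ in fused] for _ in range(2)]
b = [0.0, 0.0]
logits = linear(fused, W, b)

# Softmax turns the logits into class probabilities.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print(probs)
```

Concatenation is the simplest way to combine the branches; cross-attention between the two embeddings is a common, more expressive alternative.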
Luca Anzalone

Two main approaches are used for AI to recognize elements and understand the purpose of webpages:

Computer Vision with Deep Learning: This is the leading technique and relies on deep learning models, a type of artificial neural network particularly well suited to image recognition. These models are trained on large amounts of labeled webpage data, allowing them to identify specific elements on a webpage, such as buttons, menus, or text sections. Popular deep learning architectures for object recognition on webpages include Faster R-CNN and YOLO.

Natural Language Processing (NLP): This approach focuses on analyzing the text content of a webpage. By understanding the language used, NLP techniques can classify the webpage's purpose. For instance, an NLP system might identify keywords related to e-commerce on a webpage, suggesting it's an online store.
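Before reaching for a learned model, it's worth noting that some of the signals the question asks about (a form, a call-to-action button) can already be extracted from the source code with a plain parser. A sketch using Python's standard `html.parser`; the CTA keyword list is an illustrative guess, not an exhaustive heuristic:

```python
from html.parser import HTMLParser

# Phrases that often signal a call-to-action; illustrative, not exhaustive.
CTA_KEYWORDS = {"sign up", "subscribe", "get started", "contact us", "download"}

class LeadSignalParser(HTMLParser):
    """Collects simple lead-capture signals: <form> tags and CTA-like button text."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.cta_texts = []
        self._in_button = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.has_form = True
        elif tag == "button":
            self._in_button = True

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_button and any(kw in text.lower() for kw in CTA_KEYWORDS):
            self.cta_texts.append(text)

page = """
<html><body>
  <form action="/subscribe"><input name="email"></form>
  <button>Sign up now</button>
</body></html>
"""

parser = LeadSignalParser()
parser.feed(page)
print(parser.has_form, parser.cta_texts)  # → True ['Sign up now']
```

Rule-based signals like these could serve as features for, or a baseline against, the learned approaches described above; they will miss CTAs rendered by JavaScript or styled links, which is where the CV and NLP models earn their keep.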