I'm currently trying to build a semantic scraper that can extract product information from the websites of different suppliers in the packaging industry, with as little manual customization per supplier/website as possible.
The approach I have in mind is the following:
- Get all the text data via Scrapy (so basically an HTML-tag search). This data would hopefully already be semi-structured, for example: name, description, product image, etc.
- Fine-tune a pre-trained NLP model (such as BERT) on a domain-specific dataset for packaging to extract more information about the product, for example the weight and size of the product.
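To make the second step more concrete, here is a rough sketch of the kind of output I'm after. It is only a regex baseline, not the fine-tuned BERT model, and the patterns, the `extract_attributes` helper, and the sample description are all made up for illustration:

```python
import re

# Hypothetical pattern: a number followed by a unit of mass.
WEIGHT_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(kg|g|lb)\b", re.IGNORECASE)

# Hypothetical pattern: two or three numbers separated by "x", then a length unit.
SIZE_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*[x×]\s*(\d+(?:[.,]\d+)?)"
    r"(?:\s*[x×]\s*(\d+(?:[.,]\d+)?))?\s*(mm|cm|m)\b",
    re.IGNORECASE,
)


def _to_float(number_text):
    """Parse a number that may use a decimal comma."""
    return float(number_text.replace(",", "."))


def extract_attributes(text):
    """Return whichever of 'weight' and 'size' can be found in free text."""
    result = {}

    weight = WEIGHT_RE.search(text)
    if weight:
        result["weight"] = (_to_float(weight.group(1)), weight.group(2).lower())

    size = SIZE_RE.search(text)
    if size:
        # Groups 1-3 are the dimensions (the third is optional), group 4 is the unit.
        dims = tuple(_to_float(g) for g in size.groups()[:3] if g is not None)
        result["size"] = (dims, size.group(4).lower())

    return result


# Made-up product description, standing in for text scraped in the first step.
sample = "Corrugated shipping box, 300 x 200 x 150 mm, weight 0.25 kg, brown."
print(extract_attributes(sample))
```

The hope is that a fine-tuned model would produce the same kind of structured result for descriptions where a fixed pattern like this breaks down.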
What do you think about the approach? What would you do differently?
One challenge I already encountered is the following:
- Not all supplier websites are as structured as, for example, e-commerce sites → so small customizations of the XPaths are needed for each website. How can this be scaled?
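For context, this is roughly what the per-website customization looks like. It is a simplified sketch: the supplier domains, the XPaths, and the `extract_product` helper are invented, and it uses the standard library's `xml.etree.ElementTree` on well-formed markup instead of Scrapy selectors so that it runs on its own:

```python
import xml.etree.ElementTree as ET

# Shared XPaths that are tried for every website (hypothetical).
DEFAULT_XPATHS = {
    "name": ".//h1",
    "description": ".//p[@class='description']",
}

# Per-website overrides; only the fields that differ are listed (hypothetical).
SITE_OVERRIDES = {
    "supplier-a.example": {},
    "supplier-b.example": {"name": ".//div[@class='product-title']"},
}


def xpaths_for(domain):
    """Merge the shared defaults with the overrides for one website."""
    return {**DEFAULT_XPATHS, **SITE_OVERRIDES.get(domain, {})}


def extract_product(domain, markup):
    """Apply the merged XPaths to one page; missing fields become None."""
    root = ET.fromstring(markup)
    product = {}
    for field, xpath in xpaths_for(domain).items():
        node = root.find(xpath)
        product[field] = "".join(node.itertext()).strip() if node is not None else None
    return product


# Made-up pages standing in for two differently structured supplier websites.
page_a = (
    "<html><body><h1>Kraft mailer bag</h1>"
    "<p class='description'>Recyclable paper mailer.</p></body></html>"
)
page_b = (
    "<html><body><div class='product-title'>Stretch film</div>"
    "<p class='description'>Clear pallet wrap.</p></body></html>"
)

print(extract_product("supplier-a.example", page_a))
print(extract_product("supplier-b.example", page_b))
```

Even with shared defaults, every new website still means inspecting the markup by hand and adding an override, which is the part I would like to avoid.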
Also, does anyone know of an open-source project that would be a good starting point for this?
