
I am working on a project where I have a dataset consisting of unstructured data from multiple ERP systems. Each dataset (extracted from an ERP) has different columns, and unfortunately, there is no standard format for the data. Among the columns, there is a product code, along with other product-related information. The product code can be in various columns, or even within a larger column description.

My goal is to extract the product code from each row from this unstructured data. I am looking for advice on which type of model or strategy I can apply to extract the product codes automatically.

Here are some key points for consideration:

  • The data is unstructured and comes from various ERP systems;
  • There is no standard format for the columns or the product code placement;
  • The product code can be within a larger column description;
  • The extracted data isn't natural language (it has no syntax), so I don't have actual sequences. The columns are extracted from an ERP system and basically contain a bunch of keywords, like '3/4" SHOE RED BLUE NIKE', and so on;
  • There are millions of possible product codes.

Any suggestions or recommendations on models, strategies, or tools that would help me achieve this goal would be greatly appreciated. If you have any experience with a similar problem, please share your insights or any relevant resources.

Thank you in advance for your help!

delucca

1 Answer


There are two classes of methods that could solve this: hardcoded rules or learning. As with most problems, start with rules, and move on to learning only if rules turn out not to be enough.

Here's how I would proceed:

  • Try to work out what makes something a product code. For instance, if you take a column and split it into strings on spaces or column separators, you could look for criteria such as
    • minimum length,
    • combinations of letters, digits and other characters that are required, allowed or forbidden,
    • maximum length.

Even if these criteria aren't enough to get exactly the codes you want, they will help you in the next step. The important thing is that, as far as possible, you don't miss any valid code: a false positive can still be filtered out later, but a code dropped at this stage is lost. Refine the rules as much as you can, and if they still aren't enough, move on to learning.
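To make the rule-based step concrete, here is a minimal sketch in Python. The thresholds, the allowed character set, the "must contain a digit" rule and the function names are all assumptions made up for illustration; they would need to be tuned against the real ERP exports. The rules are deliberately loose so that valid codes are not missed.

```python
import re

# Hypothetical rules: adjust all of these to your own data.
MIN_LEN = 4
MAX_LEN = 20
ALLOWED = re.compile(r"[A-Za-z0-9\-_./]+")   # characters a code may contain
SEPARATORS = re.compile(r"[\s;,|]+")          # whitespace and column separators


def candidate_codes(cell: str) -> list[str]:
    """Return the tokens of one cell that pass the loose product-code rules."""
    candidates = []
    for token in SEPARATORS.split(cell):
        if not (MIN_LEN <= len(token) <= MAX_LEN):
            continue  # too short or too long
        if not ALLOWED.fullmatch(token):
            continue  # contains a forbidden character
        if not any(ch.isdigit() for ch in token):
            continue  # assumed rule: a code has at least one digit
        candidates.append(token)
    return candidates


def candidates_for_row(row: dict[str, str]) -> list[tuple[str, str]]:
    """Scan every column of a row, since the code can sit in any of them."""
    found = []
    for column, value in row.items():
        for token in candidate_codes(str(value)):
            found.append((column, token))
    return found


if __name__ == "__main__":
    row = {"desc": '3/4" SHOE RED BLUE NIKE AB-12345', "qty": "12"}
    print(candidates_for_row(row))
```

Running every row through `candidates_for_row` gives a list of (column, token) pairs, which is the pool of potential codes to inspect next.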

  • With the first method you have a way of generating a dataset of potential codes. Now look through that dataset and find the datapoints that aren't codes. If you can write a rule that excludes them, do it. If not, annotate them by hand (that is, add a "wrong" flag to them). Then you can train a model to classify potential codes as correct or incorrect.
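As a rough sketch of the learning step, the snippet below trains a tiny perceptron on hand-annotated candidates, using only the standard library. The labeled examples, the features and the function names are invented for illustration, and the perceptron is only a stand-in: with real data you would probably use a classifier from an established machine learning library and far more annotated rows.

```python
# Hand-annotated candidates: True = real code, False = flagged as "wrong".
# These examples are made up for illustration.
LABELED = [
    ("AB-12345", True),
    ("77120045", True),
    ("X9981-22", True),
    ("SIZE10", False),
    ("RED2", False),
    ("PACK12", False),
]


def features(token: str) -> list[float]:
    """Describe a token by its shape rather than by its exact characters."""
    n = len(token) or 1  # avoid division by zero on empty input
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    has_sep = any(ch in "-_./" for ch in token)
    # bias, digit share, letter share, separator flag, scaled length
    return [1.0, digits / n, letters / n, float(has_sep), min(n, 20) / 20]


def score(weights: list[float], token: str) -> float:
    return sum(w * x for w, x in zip(weights, features(token)))


def is_product_code(weights: list[float], token: str) -> bool:
    return score(weights, token) > 0


def train_perceptron(examples, max_epochs: int = 1000) -> list[float]:
    """Classic perceptron: nudge the weights on every misclassified example."""
    weights = [0.0] * len(features(examples[0][0]))
    for _ in range(max_epochs):
        mistakes = 0
        for token, is_code in examples:
            if is_product_code(weights, token) != is_code:
                sign = 1.0 if is_code else -1.0
                weights = [w + sign * x
                           for w, x in zip(weights, features(token))]
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights


if __name__ == "__main__":
    weights = train_perceptron(LABELED)
    for token, _ in LABELED:
        print(token, is_product_code(weights, token))
```

Shape-based features are used here because, with millions of possible codes, the model has to generalize from what a code looks like, not memorize individual codes.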

Let me know if this doesn't answer your question or if you have further questions :)