There are some predefined categories (Overview, Data Architecture, Technical Details, Applications, etc.). The requirement is to classify each input paragraph into its respective category. I can't use pre-trained word embeddings (Word2Vec, GloVe) because the input is not general English (talking about dogs, the environment, etc.) but purely technical (how a particular program works, steps to download Anaconda, etc.). I also don't have any data available on the internet to train on. Anything that understands a sentence at a surface-semantic level will work.
1 Answer
Problem Statement
Find the category of technical text at a surface-semantic level. The requirement is to classify input paragraphs into their respective categories.
The predefined categories are as follows.
- Overview
- Data Architecture
- Technical Details
- Applications
- etc
Some document types that would be added in place of 'etc' might include these.
- Requirements
- High Level Design
- System Architecture
- Testing Plan
- Deployment Plan
- Disaster Recovery Plan
Semantic Structure of Technical Documents
It is correct that the terminology of a requirements document and that of a data architecture document will likely have much in common. More precisely, the distributions of linguistic elements across all documents belonging to a given system or project are likely to share domain commonalities. A system that pilots a drone will have the linguistic elements "drone(s)", "hover(ing|s)", "target(ing|s)", "flight plan(s)" in similar distributions throughout its documentation.
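The point about shared distributions can be illustrated with a minimal sketch. The patterns are taken from the drone example above; the two sample sentences and the function name `element_distribution` are hypothetical, invented only for illustration.

```python
import re

# Patterns for the linguistic elements named in the drone example.
ELEMENT_PATTERNS = {
    "drone": re.compile(r"\bdrones?\b", re.IGNORECASE),
    "hover": re.compile(r"\bhover(?:ing|s)?\b", re.IGNORECASE),
    "target": re.compile(r"\btarget(?:ing|s)?\b", re.IGNORECASE),
    "flight plan": re.compile(r"\bflight plans?\b", re.IGNORECASE),
}

def element_distribution(text):
    """Return each element's share of all element matches found in the text."""
    counts = {name: len(p.findall(text)) for name, p in ELEMENT_PATTERNS.items()}
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in ELEMENT_PATTERNS}
    return {name: count / total for name, count in counts.items()}

# Hypothetical sentences from two different document types of one project.
requirements_doc = "The drone shall hover over each target named in the flight plan."
design_doc = "Hovering is controlled by the drone firmware; targets come from flight plans."

print(element_distribution(requirements_doc))
print(element_distribution(design_doc))
```

Both sentences yield the same distribution even though they come from different document types, which is why domain vocabulary alone is a weak signal for the category.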
Indicators
It is likely that these five distinguishing characteristics can be exploited in categorization.
- Sentence semantics
- Inserted diagrammatic and pictorial conventions
- Header conventions
- Linguistic elements that appear commonly in only one type of technical document
- Elements and structure that result from copying and modifying previous documents, or from the use of boilerplate and templates in specific departments
Focusing entirely on text recognition and abandoning the other items in the list above would be unwise. Diagrams, such as network diagrams and UML diagrams, may be quite easy to discern using deep convolutional approaches and would clearly identify the category to which the hosting document belongs. That is also true of test case tables.
The best results will come from treating the relative weights of the five indicators above, and of the document sections in which they appear, as variables in the model that training adjusts. For instance, the final paragraph may be more telling than the rest of the body.
Also, be aware that the prevalence of each linguistic element across all text documents on record can be coupled with the elements found in each training example and, later, in each input to the trained model. Training is likely to progress faster and produce more accurate and reliable results if the features indicate that the linguistic element "test case(s)" appearing in the input text is significantly less prevalent in the domain of technical documentation than the linguistic element "-ing", which marks continuous verb tenses.
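One way to express that prevalence as a feature is an inverse-document-frequency style weight. The sketch below is an assumption about how it could be done, not something the answer prescribes: the smoothing formula is a common variant, the four-document corpus is invented, and matching is by plain substring for brevity.

```python
import math

def document_frequency(element, corpus):
    """Count the documents that contain `element` as a substring."""
    return sum(1 for doc in corpus if element in doc.lower())

def rarity_weight(element, corpus):
    """Smoothed inverse document frequency: higher means rarer in the corpus."""
    df = document_frequency(element, corpus)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

# Hypothetical stand-in for "all text documents on record".
corpus = [
    "Installing the package requires Python.",
    "Each test case records the expected output of a running job.",
    "The following diagram shows the data flow.",
    "Logging is enabled by setting a flag.",
]

for element in ("test case", "ing"):
    print(element, round(rarity_weight(element, corpus), 3))
```

Here "ing" occurs in every document and receives the minimum weight, while "test case" occurs in only one and is weighted more heavily, so it contributes more to the profile of the document that contains it.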
Avoid Static Grammars
Language parsers based on fixed language rules (grammars) have not had much success compared to association-based semantic mapping, and linguistics has moved away from those static models for similar reasons. Avoid grammar-based parsing.
Existing Work
For the textual categorization, the academic publications below are some of the recent work that has already gained some recognition.
- Semantic clustering and convolutional neural network for short text categorization, P Wang, J Xu, B Xu, C Liu, H Zhang, F Wang… - Proceedings of the 53rd, 2015
- Document Modeling with Gated Recurrent Neural Network for Sentiment Classification, D Tang, B Qin, T Liu - Proceedings of the 2015 conference on empirical, 2015
- Learning Semantic Representations of Users and Products for Document Level Sentiment Classification, D Tang, B Qin, T Liu - Proceedings of the 53rd Annual Meeting of the, 2015
- Effective use of word order for text categorization with convolutional neural networks, R Johnson, T Zhang - arXiv preprint arXiv:1412.1058, 2014 - arxiv.org
- Jumping NLP curves: A review of natural language processing research, E Cambria, B White - IEEE Computational intelligence, 2014
- Recurrent Convolutional Neural Networks for Text Classification, S Lai, L Xu, K Liu, J Zhao - AAAI, 2015
Faulty Approaches
The approach suggested in the comment, comparing the semantics of sentences, will only determine one aspect of information redundancy between two sentences.
The number of comparisons is also a consideration. Comparing $\chi$ documents, each containing $\sigma$ sentences, would require $\chi \, (\chi - 1) \, \sigma \, (\sigma - 1)$ comparisons. For $\chi = 10{,}000$ documents containing $\sigma = 1{,}000$ sentences each, that is 99,890,010,000,000 sentence comparisons, the totality of which provides no particularly useful information about the category of any of the documents.
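The figure can be checked directly from the formula. The function name below is hypothetical; the expression is the one given above.

```python
def sentence_comparisons(documents, sentences_per_document):
    """Evaluate chi * (chi - 1) * sigma * (sigma - 1) from the text."""
    chi, sigma = documents, sentences_per_document
    return chi * (chi - 1) * sigma * (sigma - 1)

total = sentence_comparisons(10_000, 1_000)
print(f"{total:,}")  # 99,890,010,000,000
```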
The documents must be related to a concept class, not each other.
Visualizations of the semantics aren't particularly useful unless you are looking for something and the visualization is designed to present that which is sought.
A Better Plan
- Determine the number of example training documents, $t$, needed to be categorized by experts to create a sufficiently large training data set. (Consider using the PAC learning framework designed for this purpose.)
- Draw that number from the full set of documents, using an appropriate random or high-quality pseudo-random method, herein referred to as method $\mathbb{D}$. That is the training example set.
- Draw that number from the full set of documents again, using method $\mathbb{D}$. That is the test example set.
- Have experts label (categorize) both example sets according to technical document type. Use the same experts for both training and testing, and have them alternate periodically between the two sets so that their learning, fatigue, or boredom curves have a minimal effect on their categorizing of the two sets.
- Profile each of the category-labeled documents in terms of the above five indicators. Note that to use the images embedded in the document as part of the profiling, which could drastically improve system reliability, the images must be run through a separate diagram categorizing network trained separately on images to find diagrams characteristic of one of the technical document types. That may sound like much work, but consider that categorizing drawing types is a well developed science and the labeling is only of the diagram types that are representative of particular document types.
- Train an appropriately designed artificial neural network to categorize the documents, using the profiling results as example inputs and the experts' category labels as target outputs.
- Test using the test set.
- Based on the results of the test, decide whether to use the current training or re-execute a previous step using information gained from the first training and train again.
- Run the trained network on the full document set.
- Pull a sample from the result using method $\mathbb{D}$ to validate effective completion of the run.
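The two sampling steps of the plan can be sketched as follows. This is a minimal illustration under stated assumptions: a seeded `random.Random` stands in for method $\mathbb{D}$, the two sets are made disjoint (the plan does not say so explicitly, but a test set that overlaps the training set would inflate the test results), and the names `draw_example_sets`, `t`, and `seed` are hypothetical.

```python
import random

def draw_example_sets(document_ids, t, seed=0):
    """Draw disjoint training and test sets of t documents each."""
    ids = list(document_ids)
    if 2 * t > len(ids):
        raise ValueError("need at least 2*t documents to draw two disjoint sets")
    rng = random.Random(seed)       # assumed stand-in for method D
    drawn = rng.sample(ids, 2 * t)  # sampling without replacement
    return drawn[:t], drawn[t:]

# Hypothetical corpus of 1,000 document ids with t = 50.
training_ids, test_ids = draw_example_sets(range(1_000), t=50, seed=42)
print(len(training_ids), len(test_ids))
```

Fixing the seed makes the draw repeatable, which helps if a previous step has to be re-executed on the same example sets after the first test.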