TextVect is a tool for extracting features from text documents. It segments documents into paragraphs, sentences, entities, or tokens and extracts lexical, syntactic, and semantic features for each segment. These features are useful for machine-learning tasks such as text classification, assertion classification, and relation identification. TextVect lets users obtain these features without installing the many text-processing and NLP tools that would otherwise be required.
Using the tool involves three stages, as shown in the figure below: segmentation, feature selection, and classification. First, the user specifies the segment of text for which to generate features: document, paragraph or section, utterance, or entity/snippet. Second, the user selects the types of features to extract from the specified segment. Third, the user downloads the resulting feature vector for training a classifier. Currently, TextVect extracts the following features:
- unigrams and bigrams
- part-of-speech (POS) tags
- UMLS concepts
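
To make the first two stages concrete, the following is a minimal sketch of sentence-level segmentation followed by unigram and bigram extraction. It is not TextVect's own code or API: the function names (`split_sentences`, `tokenize`, `ngram_counts`, `extract_features`) and the regular-expression segmentation and tokenization rules are assumptions made for illustration only. POS tags and UMLS concepts are omitted because they require external NLP tools.

```python
import re
from collections import Counter


def split_sentences(document):
    """Naive segmentation (assumption): split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def tokenize(segment):
    """Lowercased alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", segment.lower())


def ngram_counts(tokens, n):
    """Count contiguous n-grams, each joined into one space-separated string."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def extract_features(segment):
    """Return a sparse feature vector (feature name -> count) for one segment."""
    tokens = tokenize(segment)
    features = Counter()
    features.update(ngram_counts(tokens, 1))  # unigrams
    features.update(ngram_counts(tokens, 2))  # bigrams
    return dict(features)


# Hypothetical input text, used only to exercise the sketch.
document = "The patient denies chest pain. The patient reports mild cough."
vectors = [extract_features(s) for s in split_sentences(document)]
```

Each entry in `vectors` is a dictionary mapping a feature name to its count within one sentence. Such sparse dictionaries can be converted to a fixed-width matrix by whichever classifier toolkit is used for the third stage.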
NLP Task Performed: Classification, Information Extraction, Evaluation
A. Kumar. Feature Engineering for Classification of Clinical Text. Technical report, UC San Diego Master's Thesis. 15 March 2013.
A Kumar, WW Chapman. TxtVect: A Tool for Extracting Features from Clinical Documents. AMIA. 2012. (3).1822. poster