New to NLP

NLP researchers often need to compare annotated datasets between experts (e.g., two clinicians' diagnoses for a set of patients) or evaluate how well a system replicates the actions of an expert (e.g., diagnosing patients against a set of confirmed or refuted disease cases). The Clinical Evaluation Workbench is a platform that helps researchers compare two sets of annotations (manually created by experts, automatically created by a system, or both!) to determine how well the two approaches agree with each other. The workbench supports comparisons of marked events such as problems by determining how frequently the two approaches agree under various match criteria: for instance, whether a system spans a problem mention the same way an expert does in a given text, or whether both approaches label the span as the same concept (e.g., both system and expert call the span an instance of cough). Other supported comparisons include contextual attributes (Experiencer, Temporality, Negation; see the ConText algorithm description). The workbench allows NLP researchers to visualize all annotations and the differences between two files.

System development was funded by ONC SHARP Area 4, VA Consortium for Healthcare Informatics Research, and ShARe projects.

Know NLP

The Clinical Evaluation Workbench is a Java platform that supports the comparison of two files (primary and secondary) comprising annotations from clinical information extraction systems or manually annotated reports. The files must map to the Workbench Information Model (a UIMA-inspired common type system). The system then calculates outcome measures for all secondary-file annotations compared against the primary-file annotations (considered the correct answers), based on event string spans (exact and overlap), event concept classes, attributes, and relationships between events. The workbench allows NLP researchers to visualize all annotations and their outcome measures. The workbench currently supports Knowtator XML and CLEF pipe-delimited formats.

Authors

Lee Christensen

Wendy Chapman

Sean Murphy

Associated Institutions

University of Utah

University of California San Diego

Mayo Clinic

VA CHIR/VINCI

iDASH

Minimum Requirements

Java

Download

NLP Task Performed

Evaluation

Programming Languages

Java

Operating Systems

Windows, Mac, Linux

Documentation

How to Use Tutorial

Evaluation Workbench Overview

The Evaluation Workbench (WB) is a tool written in Java for comparing sets of human- or computer-generated annotations associated with a corpus of documents. Currently, the WB is mostly used for examining annotations generated with eHOST or Knowtator, in addition to annotations generated by several NLP systems. Annotations from each set are aligned on the basis of exact or partial textual match, and a number of statistical measures are calculated and displayed. The WB can be used to identify which classifications or attributes two sets differ on, to adjudicate which annotations were correct, and to annotate text correctly in those cases where neither set was correct.

WB Statistics

The WB's main function is to statistically analyze differences between two sets of annotations over a single corpus of text documents. The first set is considered the reference or "gold" standard, and the second set is the comparison set. To identify matching annotations, the WB looks for pairs of annotations that cover the same snippet of text in a document and that satisfy a user-selectable match criterion. Match criteria include classification (e.g. two annotations that cover the same text and have the classification "chest pain" would be considered matching), semantic type (e.g. two annotations of type "anatomic location"), attributes (e.g. two annotations containing "status=absent"), span only (e.g. two annotations that cover the same text position), and classification only (e.g. annotations with the same classification, regardless of whether they cover the same text).
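
As an illustration only, a match criterion can be thought of as a predicate over a pair of aligned annotations. The sketch below is not the WB's actual code; the Annotation record, its field names, and the Criterion values are simplifying assumptions for this example, not the Workbench Information Model.

  import java.util.Map;

  // Illustration only: a simplified annotation and match-criterion check.
  class MatchCriteria {

      record Annotation(String docId, int begin, int end,
                        String classification, String semanticType,
                        Map<String, String> attributes) {}

      enum Criterion { CLASSIFICATION, SEMANTIC_TYPE, ATTRIBUTES, SPAN_ONLY, CLASSIFICATION_ONLY }

      // sameSpan is true when the two annotations cover the same text,
      // under either "exact" or "overlap" mode (see the next paragraph).
      static boolean matches(Annotation a, Annotation b, Criterion c, boolean sameSpan) {
          return switch (c) {
              case CLASSIFICATION      -> sameSpan && a.classification().equals(b.classification());
              case SEMANTIC_TYPE       -> sameSpan && a.semanticType().equals(b.semanticType());
              case ATTRIBUTES          -> sameSpan && a.attributes().equals(b.attributes());
              case SPAN_ONLY           -> sameSpan;
              case CLASSIFICATION_ONLY -> a.classification().equals(b.classification()); // span not required
          };
      }
  }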

The WB has two user-selectable modes for determining whether two annotations cover the same text: in "exact" mode the two annotations must start and end at the same character positions; in "overlap" mode they must overlap in at least one character position. Since annotators will sometimes vary in where they mark the start and end positions of an annotation, "overlap" mode will sometimes pick up matches that "exact" mode does not.
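
A minimal sketch of the two modes, assuming character offsets with an inclusive begin and an exclusive end (an assumption of this example, not a statement about the WB's internal representation):

  class SpanMatch {
      // "Exact" mode: identical start and end offsets.
      static boolean exact(int begin1, int end1, int begin2, int end2) {
          return begin1 == begin2 && end1 == end2;
      }
      // "Overlap" mode: the spans share at least one character position.
      static boolean overlap(int begin1, int end1, int begin2, int end2) {
          return begin1 < end2 && begin2 < end1;
      }
  }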

Based on the selected match criterion, the WB calculates counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), and from these counts further calculates accuracy, positive and negative predictive value, sensitivity, specificity, Scott's Pi, Cohen's Kappa, and F measure. Determining FP and FN counts relies both on the presence or absence of an annotation in the comparison set that matches an annotation in the reference set, and on annotation attributes that refer to the presence or absence of the concept denoted. For instance, if a reference annotation contains the attribute "status=absent" and the matching comparison annotation also contains "status=absent", the WB counts this as a TN match, since both annotations denote an object described as being absent. However, if the comparison annotation did not have this attribute, it would be counted as an FP. If the reference annotation did not contain the status attribute and there was no matching comparison annotation, the WB would count that as an FN.
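
For concreteness, the sketch below shows one way these measures can be computed from the four counts. It is not the Workbench's own code; the Kappa and Pi formulas are the standard two-category forms.

  class AgreementStats {

      static double accuracy(long tp, long fp, long tn, long fn) {
          return (double) (tp + tn) / (tp + fp + tn + fn);
      }

      static double ppv(long tp, long fp)         { return (double) tp / (tp + fp); } // positive predictive value
      static double npv(long tn, long fn)         { return (double) tn / (tn + fn); } // negative predictive value
      static double sensitivity(long tp, long fn) { return (double) tp / (tp + fn); }
      static double specificity(long tn, long fp) { return (double) tn / (tn + fp); }

      // F measure (harmonic mean of PPV and sensitivity).
      static double f1(long tp, long fp, long fn) {
          double p = ppv(tp, fp), r = sensitivity(tp, fn);
          return 2 * p * r / (p + r);
      }

      // Cohen's Kappa: chance agreement estimated from each annotator's own marginals.
      static double cohensKappa(long tp, long fp, long tn, long fn) {
          double n = tp + fp + tn + fn;
          double observed = (tp + tn) / n;
          double chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);
          return (observed - chance) / (1 - chance);
      }

      // Scott's Pi: chance agreement estimated from marginals pooled across both annotators.
      static double scottsPi(long tp, long fp, long tn, long fn) {
          double n = tp + fp + tn + fn;
          double observed = (tp + tn) / n;
          double posShare = ((tp + fp) + (tp + fn)) / (2 * n);
          double negShare = ((fn + tn) + (fp + tn)) / (2 * n);
          double chance = posShare * posShare + negShare * negShare;
          return (observed - chance) / (1 - chance);
      }
  }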

WB Functionality

The first function of the WB is to calculate statistics based on annotation matches / mismatches.

The second function is to enable users to quickly locate annotations based on the selected match criterion and their status as TP, FP, TN, or FN matches. Being able to view all of these annotations in the corpus is useful for a variety of purposes, including adjudicating between the two annotation sets and finding patterns of errors made by annotators. As the user hovers the mouse over the TP, FP, TN, or FN column for a selected classification in the Statistics panel, a table of documents containing those matches/mismatches is displayed, along with the match/mismatch count for each. If the user hovers the mouse over one of those documents, that document appears with its matches/mismatches highlighted. If the user hovers the mouse over an annotation of the current semantic type (as displayed in the Schema/Type panel), that annotation becomes the current annotation and its attributes are displayed in the Attribute panel. As the user hovers the mouse over annotations belonging to other semantic types in the Document panel, those annotations are transiently highlighted in gray; if the user clicks the mouse, that annotation becomes the selected annotation and its semantic type becomes the selected semantic type.

The third function of the WB is to enable human users to adjudicate between mismatched annotations. The user can select an annotation and mark it as either correct or incorrect. Adjudicated annotations are highlighted using a different color scheme and can be stored to a CSV file (the ValidationFile parameter) for further analysis. Each CSV entry contains the document name, annotator name, classification, start and end offsets, and whether the annotation was judged valid. For instance, the following lines indicate two matching annotations judged as correct:

  • rec003,ss1_batch02_relations_Instance_0,C0004153,91,106,true
  • rec003,ss1_batch02_relations_Instance_1,C0004153,91,106,true
To adjudicate, the user selects an annotation in the Document panel, then selects the menu item Annotate->Verify Annotation or Annotate->Falsify Annotation to validate or invalidate that annotation. Validating or invalidating an annotation changes its color highlighting in the Document panel. To validate all annotations in the current document, select Annotate->Verify All Annotations. To remove verification status from an annotation or from all annotations in a document, select Annotate->Unverify Annotation or Annotate->Unverify All Annotations. To store adjudications to the validation file (specified in the startup parameter file) for later use, invoke File->Store Verified Annotations. When the WB is initialized later on, verified annotations will be highlighted in the verification color scheme.
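
As a small illustration of the validation-file format, each row can be parsed as shown below. The field meanings are taken from the description and example lines above; the reader class itself is hypothetical and not part of the WB.

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  // Hypothetical reader for the validation CSV described above:
  // document, annotator, classification, start, end, valid.
  class ValidationFileReader {

      record Adjudication(String document, String annotator, String classification,
                          int start, int end, boolean valid) {}

      static List<Adjudication> read(Path csv) throws IOException {
          return Files.readAllLines(csv).stream()
                  .map(line -> line.split(","))
                  .map(f -> new Adjudication(f[0], f[1], f[2],
                          Integer.parseInt(f[3]), Integer.parseInt(f[4]),
                          Boolean.parseBoolean(f[5])))
                  .toList();
      }
  }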

WB Architecture

The WB (figure X) consists of four panels, including the Schema / Type panel ("STP", upper right), the Statistics Panel ("SP", upper left), the Annotation Panel ("AP", lower right), and the Document Panel ("DP", lower left). In addition, there is a popup window for viewing and editing attributes associated with annotations, and another popup for viewing named semantic relationships between annotations.

Related Publications

Kelly L, Goeuriot L, Suominen H, Mowery DL, Velupillai S, Chapman WW, Zuccon G, Palotti J. Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. Information Access Evaluation. Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science, Volume 8685, 2014: 172-191.

Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, Pradhan S, South BR, Mowery DL, Leveling J, Kelly L, Goeuriot L, Martinez D, Zuccon G. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Lecture Notes in Computer Science, Volume 8138, 2013: 212-231.

Mowery DL, Velupillai S, South BR, Christensen L, Martinez D, Elhadad N, Pradhan S, Savova G, Chapman WW. Task 2: ShARe/CLEF eHealth Evaluation Lab 2014. CLEF 2014 Working Notes, 1180. pp. 31-42. ISSN 1613-0073. Sheffield, United Kingdom. 2014.

Suominen H, Schreck T, Leroy G, Hochheiser HS, Nualart J, Goeuriot L, Kelly L, Mowery DL, Ferraro G, Keim D, Chapman WW, Hensen P. Task 1 of the CLEF eHealth Evaluation Lab 2014: Visual-Interactive Search and Exploration of eHealth Data. CLEF 2014 Working Notes, 1180. pp. 1-30. ISSN 1613-0073. Sheffield, United Kingdom. 2014.

Velupillai S, Mowery DL, Christensen L, Elhadad N, Pradhan S, Savova G, Chapman WW. Disease/Disorder Semantic Template Filling – information extraction challenge in the ShARe/CLEF eHealth Evaluation Lab 2014. AMIA Symp Proc. 1613. Washington, DC. 2014.

Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, Suominen H, Chapman WW, Savova G. Task 1: ShARe/CLEF eHealth Evaluation Lab 2013. Online Working Notes of the CLEF 2013 Evaluation Labs and Workshop, 23-26 September, Valencia, Spain.

Mowery DL, South BR, Leng J, Murtola LM, Danielsson-Ojala R, Salanterä S, Chapman WW. Creating a reference standard of acronym and abbreviation annotations for the ShARe/CLEF eHealth challenge 2013. AMIA Symp Proc. Washington, DC. 2013.

Use Cases

Evaluating Inter-annotator Agreement when Identifying/Normalizing Acronyms/Abbreviations from Clinical Texts

Publication

Mowery DL, South BR, Murtola LM, Salanterä S, Suominen H, Elhadad N, Pradhan S, Savova G, Chapman WW. Task 2: ShARe/CLEF eHealth Evaluation Lab 2013. CLEF Proc. Valencia, Spain. 2013.

Document Types

ShARe corpus - electrocardiograms, radiology, discharge summaries, echocardiograms

Sample Size

n=300 reports
Reference standard developed by 9 Finnish nursing professionals, 1 Australian nurse, 1 Australian NLP researcher, and 2 American informaticians

Performance

Used to evaluate 2013 ShARe/CLEF Challenge participants' performance against reference standard annotations.
Task 2 (AA normalization): Highest Accuracy = 72

Evaluating Inter-annotator Agreement when Identifying/Normalizing Disease/Disorders from Clinical Texts

Publication

Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, Suominen H, Chapman WW, Savova G. Task 1: ShARe/CLEF eHealth Evaluation Lab 2013. Online Working Notes of the CLEF 2013 Evaluation Labs and Workshop, 23-26 September, Valencia, Spain.

Document Types

ShARe corpus - electrocardiograms, radiology, discharge summaries, echocardiograms

Sample Size

n=300 reports
Reference standard developed by 3 clinical coders

Performance

Used to evaluate 2013 ShARe/CLEF Challenge participants' performance against reference standard annotations.
Task 1a (DD identification): Highest F1 = 75; Recall = 71; Precision = 80
Task 2a (DD normalization): Highest Accuracy = 59