Explanation-Based Analysis of RTE Data


This page collects information about the collaborative effort to annotate RTE data with a human-like inference process. You can find existing annotation files and other resources on the Annotation Resources page.


We seek to start – and allow coordination of – a community-wide effort to annotate Recognizing Textual Entailment (RTE) examples with the inference steps required to reach a decision about the example label (Entailed vs. Contradicted vs. Unknown). This effort was proposed by Sammons et al. at ACL 2010. This wiki contains the annotation resources described in that paper, and is intended to be a starting point for the NLP community to develop the annotation scheme into a formal, consistent standard, and to augment, extend, and refine the existing annotations accordingly.


Broadly speaking, the RTE challenge frames Natural Language Understanding in the context of recognizing when two text fragments share the same meaning. The task is specified operationally, with a human-annotated gold standard (similar to many NLP task formulations), which allows solutions to use any means to reach the correct decision. This open-endedness has many advantages, but at least one significant drawback: it is hard to assign partial credit to systems that make a partially correct inference about a given RTE example. Most RTE examples require a number of inference steps to reach the correct answer, so solving only one of those steps is unlikely to significantly improve performance on the overall task. As a result, the RTE evaluation does not give researchers who develop NLP solutions for more focused inference tasks a much-needed resource: a means of evaluating those solutions in the context of an end-to-end inference application.

Anticipated Benefits

The NLP community gains a way to evaluate focused inference solutions in the RTE context, and a market for those solutions (caveat: this assumes adoption by RTE developers and/or development of openRTE software). Anyone is free to augment the annotation of existing data with a new entailment phenomenon, or to augment the pool of RTE data with examples that highlight a phenomenon in which they are interested, provided they also annotate this data with the full explanation-based analysis.

RTE system developers and NLP researchers can assess the relative importance of entailment phenomena in RTE corpora, by examining the distribution of labeled entailment phenomena.

RTE evaluation improves, allowing RTE system developers to evaluate contributions of individual components which may have broad application to textual inference tasks.
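
The distribution of labeled entailment phenomena mentioned above can be tallied directly from annotated examples. The sketch below is illustrative only: the record layout (`id`, `label`, `phenomena`) and the phenomenon names are assumptions made for this example, not the format of the actual annotation files.

```python
from collections import Counter

# Hypothetical annotated examples; field names and phenomenon
# names are illustrative, not taken from the real annotation files.
examples = [
    {"id": 1, "label": "Entailed", "phenomena": ["coreference", "passive-active"]},
    {"id": 2, "label": "Contradicted", "phenomena": ["coreference", "numeric reasoning"]},
    {"id": 3, "label": "Unknown", "phenomena": ["named entity"]},
    {"id": 4, "label": "Entailed", "phenomena": ["coreference"]},
]


def phenomenon_distribution(examples):
    """Return, for each phenomenon, the fraction of examples labeled with it."""
    counts = Counter()
    for ex in examples:
        # Count each phenomenon at most once per example.
        counts.update(set(ex["phenomena"]))
    total = len(examples)
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


dist = phenomenon_distribution(examples)
# Print phenomena from most to least frequent.
for name, frac in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {frac:.2f}")
```

Counting per example rather than per occurrence keeps the figures comparable across phenomena that tend to appear several times within one text pair.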


We need an elegant way to support gradual refinement of a fundamental annotation scheme. This means defining a "basic" level of annotation over which finer-grained inference steps can be labeled, and which localizes alternative inference paths that lead to the correct decision.

Related Work


Toledo et al. are working on a formal semantic annotation of inference phenomena in RTE examples; the project has its own page, and the annotation effort is described in a paper. This is an ambitious project, and they have started with a focus on annotating the more frequent and readily identifiable phenomena.


Alexander Yates and Peter LoBue published a paper in ACL 2011 that isolates knowledge categories needed for RTE inference, based on a process similar to ours.


Researchers from CELCT together with other prominent RTE researchers published a paper at LREC 2010 in which they specify a procedure for isolating entailment phenomena by generating RTE examples in which only a single inference step is required to determine the correct RTE label. This is very similar in spirit to the approach we propose here. One potential drawback arises when trying to evaluate a solution to a specific entailment phenomenon: how to provide/generate good negative ("contradicted"/"unknown") examples. What constitutes a "good" negative example? This is hard to pin down. But our intuition is that a simplistic solution may achieve good results on a single phenomenon in isolation, but hurt overall performance when incorporated into an RTE system, because that same simplistic solution causes new errors when other entailment phenomena are active. We therefore propose the annotation-based approach: given enough data, it should be possible to draw a sample with desired characteristics, e.g. a balance of entailment phenomena in both positive and negative examples.
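
As a rough illustration of drawing such a sample, the sketch below selects an equal number of positive and negative examples that all exhibit a chosen phenomenon. The record layout, the phenomenon names, and the function `balanced_sample` are hypothetical. Treating "Entailed" as positive and both "Contradicted" and "Unknown" as negative is an assumption made for this example.

```python
import random
from collections import defaultdict


def balanced_sample(examples, phenomenon, per_class, seed=0):
    """Draw up to per_class positive and per_class negative examples
    that are all annotated with the given phenomenon."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    buckets = defaultdict(list)
    for ex in examples:
        if phenomenon in ex["phenomena"]:
            # Assumption: "Entailed" is positive; anything else is negative.
            key = "positive" if ex["label"] == "Entailed" else "negative"
            buckets[key].append(ex)
    sample = []
    for key in ("positive", "negative"):
        pool = buckets[key]
        # Take fewer than per_class if the pool is too small.
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample


# Hypothetical annotated corpus for illustration only.
corpus = [
    {"id": 1, "label": "Entailed", "phenomena": ["coreference"]},
    {"id": 2, "label": "Entailed", "phenomena": ["coreference", "negation"]},
    {"id": 3, "label": "Contradicted", "phenomena": ["coreference", "negation"]},
    {"id": 4, "label": "Unknown", "phenomena": ["coreference"]},
    {"id": 5, "label": "Entailed", "phenomena": ["negation"]},
    {"id": 6, "label": "Unknown", "phenomena": ["coreference"]},
]

sample = balanced_sample(corpus, "coreference", per_class=2)
print([ex["id"] for ex in sample])
```

Because the examples keep their full annotations, the negatives in such a sample still involve other active phenomena, which is the situation where a simplistic single-phenomenon solution is expected to fail.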


Clark et al. analyzed knowledge needs for 100 examples from the RTE 3 data set.

Konstantina Garoufi performed a very detailed analysis of lexico-syntactic entailment phenomena for positive entailment examples in her Master's thesis.

Pitching In

If you would like to participate in this effort, please email Mark Sammons (mssammon at illinois dot edu). Since we do not know how involved "the community" wants to be yet, we thought we'd start small: this wiki, which provides some useful functionality at the cost of requiring users who want to participate to have an account. If a sufficient number of people want to participate, we'll open up the process.


The annotation files, annotator instructions, and annotation scheme can be found via the Annotation Resources page.

Other Materials

The slides from our ACL presentation are available.
