In TSRM, we attempt to match conceptually similar documents to a query by establishing term similarity, which serves both to relate similar documents to each other and to associate relevant documents with a query.
The establishment of term similarity in TSRM aims not only at identifying similar concepts expressed by different terms across documents, but also at resolving the so-called term mismatch problem in IR by allowing for query expansion using similar terms.
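To make the idea of query expansion by similar terms concrete, the following is a minimal sketch rather than the TSRM implementation. It assumes a precomputed table of pairwise term similarities; the names `expand_query`, `term_sim` and `threshold`, the similarity values and the cut-off of 0.5 are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical type: maps an unordered pair of terms to a similarity in [0, 1].
TermSim = Dict[Tuple[str, str], float]


def expand_query(query_terms: Iterable[str],
                 term_sim: TermSim,
                 threshold: float = 0.5) -> Dict[str, float]:
    """Return the query terms (weight 1.0) plus every term whose similarity
    to some query term is at least `threshold`, weighted by that similarity.

    Illustrative assumption: the expansion weight of an added term is the
    highest similarity it has to any original query term.
    """
    original = set(query_terms)
    expanded = {term: 1.0 for term in original}
    for (a, b), score in term_sim.items():
        if score < threshold:
            continue
        # The table is treated as symmetric, so check both directions.
        for source, candidate in ((a, b), (b, a)):
            if source in original and candidate not in original:
                expanded[candidate] = max(expanded.get(candidate, 0.0), score)
    return expanded


# Illustrative similarity values only, not measured data.
example_sim = {
    ("heart attack", "myocardial infarction"): 0.9,
    ("heart attack", "heart rate"): 0.3,
}
expanded_query = expand_query(["heart attack"], example_sim, threshold=0.5)
```

In this sketch a document that mentions only "myocardial infarction" can still be matched by the query "heart attack", which is the kind of term mismatch the expansion is meant to address.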
In the following, we present the TSRM resources in more detail, namely a natural language processing approach for multi-word term extraction, the FASTR tool for term variant detection [7] and, finally, the term similarity measures, which are inspired by the lexical and contextual term similarity criteria defined in the work of Nenadic et al.
In our TSRM model, we have implemented the Lexical Similarity (LS) and the Contextual Similarity (CS) measures for term similarity computation.
However, with TSRM there is no need to omit terms from our vectors, given that document vectors are usually very small (consisting of fewer than 20-30 terms).
The approach adopted by TSRM for the document retrieval application can be viewed as a three-phase process: the "Corpus Processing phase", the "Query Processing phase" and, finally, the "Document Retrieval phase", which are described below.
In TSRM we opted for a knowledge-poor approach; therefore, domain-specific external resources, such as domain lexica and thesauri, were not applied.
All document similarity measures above (VSM, TSRM) are normalized to the range [0,1].
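The [0,1] normalization can be illustrated with a short sketch. The first function is the standard cosine similarity used in VSM, which lies in [0,1] whenever term weights are non-negative. The second function is only an assumed example of a document similarity weighted by term similarity: it is one common normalized form and is not necessarily the equation used by TSRM. The names `cosine_similarity`, `soft_similarity` and `exact_match` are hypothetical.

```python
import math
from typing import Callable, Dict

# A document or query is represented as a sparse vector: term -> weight >= 0.
Vector = Dict[str, float]


def cosine_similarity(q: Vector, d: Vector) -> float:
    """Standard VSM cosine similarity; in [0, 1] for non-negative weights."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def soft_similarity(q: Vector, d: Vector,
                    term_sim: Callable[[str, str], float]) -> float:
    """Assumed form (not taken from the paper): a weighted average of the
    pairwise term similarities, with weights q_i * d_j. Because term_sim is
    in [0, 1] and the weights are non-negative, the result is in [0, 1]."""
    numerator = 0.0
    denominator = 0.0
    for term_i, q_i in q.items():
        for term_j, d_j in d.items():
            numerator += q_i * d_j * term_sim(term_i, term_j)
            denominator += q_i * d_j
    return numerator / denominator if denominator > 0.0 else 0.0


def exact_match(a: str, b: str) -> float:
    """Degenerate term similarity: 1.0 for identical terms, else 0.0."""
    return 1.0 if a == b else 0.0
```

With `exact_match` as the term similarity, only identical terms contribute to the score; replacing it with a graded measure lets related but non-identical terms contribute as well.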
We conducted a series of comparative experiments to investigate the potential efficiency of TSRM over classic IR models (such as VSM) and, most importantly, the performance of TSRM relative to state-of-the-art IR methods that use external knowledge resources (such as ontologies or term taxonomies) for discovering term associations (e.g., [5,3]): Query expansion by semantically similar terms is applied as a means of capturing similarities between terms of different degrees of generality in documents and queries (e.g., "human", "man").
TSRM: the proposed method with document similarity computed by Eq.