
The Hybride Research Project


Hybride is a project of the ANR 2011 "Programme Blanc", reference ANR-11-BS02-0002.

The Hybride Research Project aims at developing new methods and tools for supporting knowledge discovery from textual data by combining methods from Natural Language Processing (NLP) and Knowledge Discovery in Databases (KDD).

A key idea is to design an interactive, convergent process in which NLP methods guide text mining and KDD methods support the analysis of textual documents. The NLP methods rely mainly on text analysis and on the extraction of general and temporal information, while the KDD methods rely on pattern mining (e.g. itemsets and sequences), formal concept analysis and its variations, and graph mining.

For example, NLP methods applied to a set of texts can locate textual information that KDD methods then use as constraints to focus the mining of textual data. Conversely, KDD methods can extract itemsets or sequences that guide information extraction and text analysis. This combination of NLP and KDD methods in pursuit of common objectives can be viewed as a virtuous circle, i.e. a sequence of complex NLP and KDD operations that reinforces itself through a feedback loop.
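As a rough illustration of the first direction (NLP output used as a mining constraint), the sketch below mines frequent itemsets from term sets while keeping only itemsets that contain a set of required terms. It is a minimal toy under stated assumptions, not the project's method: the function name `constrained_itemsets`, the `required` parameter and the sample terms are all hypothetical, and the required terms stand in for whatever an NLP step might have located in the texts.

```python
from itertools import combinations


def constrained_itemsets(transactions, min_support, required=()):
    """Return {itemset: support count} for frequent itemsets containing `required`.

    `transactions` is an iterable of term sets (e.g. one per sentence).
    `required` plays the role of a constraint supplied by an NLP step
    (hypothetical: the source does not specify the form of the constraints).
    """
    required = frozenset(required)
    # Focus the mining: keep only transactions that satisfy the constraint.
    focused = [frozenset(t) for t in transactions if required <= frozenset(t)]
    optional = sorted(set().union(*focused) - required) if focused else []

    frequent = {}
    # Level-wise search: grow the required core by 0, 1, 2, ... optional terms.
    for size in range(len(optional) + 1):
        level = {}
        for extra in combinations(optional, size):
            itemset = required | frozenset(extra)
            if not itemset:
                continue  # skip the empty itemset when there is no constraint
            support = sum(1 for t in focused if itemset <= t)
            if support >= min_support:
                level[itemset] = support
        # Support can only shrink as itemsets grow, so an empty level ends the search.
        if not level and size > 0:
            break
        frequent.update(level)
    return frequent


# Toy term sets (illustrative only, not project data).
transactions = [
    {"disease", "mutation", "gene"},
    {"disease", "mutation", "symptom"},
    {"disease", "symptom"},
    {"mutation", "gene"},
]

for itemset, support in constrained_itemsets(transactions, 2, {"disease"}).items():
    print(sorted(itemset), support)
```

With the constraint `{"disease"}`, only the three term sets mentioning "disease" are mined, so a pattern such as `{"mutation", "gene"}` (frequent in the unconstrained data) is no longer reported. The enumeration is exhaustive per level and suits small examples only.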

The fundamental aspects of this project can be understood through the main steps of the knowledge discovery loop, seen from an NLP/KDD perspective:

Data preparation
Data mining
Interpretation and validation of the results
Knowledge construction

At each step, new methods have to be designed to achieve this interrelated NLP/KDD loop. The consortium has solid experience in both NLP and KDD, but further work is needed to turn the classical KDD loop into an actual NLP/KDD loop.

Interaction problems have to be solved at each step of the NLP/KDD loop, where interaction means that one process prepares the application of the other. Finally, a system integrates the operations involved in the whole loop, in the context of Orphanet, for text analysis and the production of new documentation on rare diseases.