Event Details

Graduate Research Seminar: Ramanathan Narayanan

12 noon - 1:30 p.m.
May 14, 2008
Ford ITW Auditorium


Ramanathan Narayanan
"Mining Protein Interactions from Text using Convolution Kernels"
Abstract: As biomedical literature databases grow exponentially in size, there is an urgent need for accurate and efficient methods for information extraction and text mining. An important problem in bioinformatics is the discovery of protein-protein interactions described in textual databases such as PubMed. Despite resource-intensive efforts to create manually curated interaction databases (BIND, HPRD, MINT, DIP), the sheer volume of biological literature makes it impossible to achieve significant coverage. As a result, several machine learning techniques have been applied to automate the extraction of interacting proteins. Among these, Support Vector Machines (SVMs) with a bag-of-words approach have been shown to be both accurate and efficient in mining protein interactions from text. In this paper, we describe a scalable, hierarchical SVM-based framework for efficiently mining protein interactions with high precision. Our system incorporates state-of-the-art named entity recognition and word-frequency approaches to identify protein references. In addition, we describe a convolution tree-vector kernel based on the syntactic similarity of natural language text to further enhance the mining process. By exploiting the inherent syntactic similarity of interaction phrases through this kernel, we significantly improve classification quality. Our hierarchical framework allows us to reduce the search space dramatically with each stage while sustaining a high level of accuracy. We tested our framework on a corpus of 15,000 manually annotated phrases gathered from various sources. Our named entity recognition technique yields a precision of 96% and a recall of 70% for identifying biological entity references. The convolution kernel technique identifies sentences describing interactions with a precision of 95% and a recall of 93%, a significant improvement over previous SVM-based techniques.

Northwestern University Robert R. McCormick School of Engineering
and Applied Science Electrical Engineering and Computer Science Department