Keywords:
Computer Science
Computer Engineering
topological data analysis
persistent homology
image classification
protein description
persistent cycles
gene expression analysis
applied machine learning
convolutional neural network
shallow learning
simplicial complex
homology
feature vector
Abstract:
The growing need to understand and process data has driven innovation in many disparate areas of data science. The computational biology, graphics, and machine learning communities, among others, are striving to develop robust and efficient methods for such analysis. In this work, we demonstrate the utility of topological data analysis (TDA), a new and powerful tool for understanding the shape and structure of data, in these diverse areas. First, we develop a new way to use persistent homology, a core tool in TDA, to extract machine learning features for image classification. Our work focuses on improving modern image classification techniques by considering topological features. We show that incorporating this information into supervised learning models improves classification, providing evidence that topological signatures can be leveraged to enhance some of the pioneering applications in computer vision. Next, we propose a topology-based, fast, scalable, and parameter-free technique to explore a related problem in protein analysis and classification. On an initial simplicial complex built from the constituent protein atoms and bonds, simplicial collapse is used to construct a filtration, from which we compute persistent homology. The resulting persistent homology serves as our signature for protein molecules. Our method, besides being scalable, shows sizable time and memory improvements compared to similar topology-based approaches. We use the signature to train a protein domain classifier and compare it against state-of-the-art structure-based protein signatures, achieving a substantial improvement in accuracy. Whereas our first two applications consider only the intervals of persistent homology, some applications also need representative cycles for those intervals. These cycles, especially the minimal ones, are useful geometric features that augment the intervals in a purely topological barcode. We address the problem of computing these