Keywords:
Information technology
Computer science
Abstract:
Much of the information created today is scattered throughout books, magazines, web pages, academic papers, and other documents that cannot be directly queried in a structured way. To address this challenge, people are building structured knowledge bases to make this information more accessible. The process of populating knowledge bases from unstructured inputs is called knowledge base construction (KBC). Through KBC, we can make troves of valuable information more accessible. However, existing KBC systems have limited ability to handle input data in a wide variety of formats, including unstructured text, tables, and figures, contained within highly variable file structures. This data heterogeneity makes automated, scalable KBC difficult to achieve in real-world scenarios. This dissertation focuses on automating and scaling the complex process of building KBC systems from heterogeneous data. In particular, we study both knowledge distillation and representation from richly formatted data, where information is expressed via combinations of textual, structural, tabular, and visual cues, as well as novel techniques for making these processes feasible in practice. This dissertation consists of two parts. In the first part, we aim to discover the fundamental building blocks of KBC from richly formatted data. We present Fonduer, a KBC system that enables the extraction of information from richly formatted data. Fonduer automatically models richly formatted data and allows both users and machines to systematically retrieve all of a document's multimodal information in a programmatic way. This information can formalize multimodal signals both for generating training data via weak supervision and for augmenting deep learning models with multimodal features to perform task learning. In the second part of this dissertation, we study how building knowledge bases can be made feasible in practice. We investigate two crucial parts of Fonduer's pipeline: training data generation and ta