In the past decade, information extraction has made huge progress. We can now extract facts from Web documents at large scale, and knowledge bases (KBs) such as KnowItAll, DBpedia, NELL, BabelNet, WikiData, and our own YAGO contain many millions of entities and hundreds of millions of facts about them.
And yet, all of these KBs operate on an extremely reduced fraction of knowledge: they essentially focus on binary relations between a subject and an object. For example, a KB can know that type(autism, developmentalDisorder), or that vaccinates(MmrVaccine, measles). This knowledge representation model is called RDF. The problem is that RDF can capture barely anything from the Wikipedia article about vaccines. Take for example this text about the supposed link between vaccines and autism:
In February 1998, Andrew Wakefield published a paper in the medical journal The Lancet, which reported on twelve children with developmental disorders. The parents were said to have linked the start of behavioral symptoms to vaccination. The resulting controversy became the biggest science story of 2002. As a result, vaccination rates dropped sharply. In 2011, the BMJ detailed how Wakefield had faked some of the data behind the 1998 Lancet article.
From this text, the current mainstream methods for KB construction would extract just “Andrew Wakefield published a paper” — and barely anything else. Of course, we could use non-symbolic methods (such as distributional methods or deep learning approaches) to decide whether Andrew Wakefield’s paper is trustworthy or not. But suppose that we want to decide whether there is a causal link between autism and vaccination; why we see a lower vaccination rate; or with which arguments another blog post supports the anti-vaccine movement. For this, we need a more detailed understanding of the text. The machine would have to understand:
Current mainstream methods cannot model or extract this type of information, let alone reason on it. The goal of the NoRDF project is to go beyond binary relations between entities, and to enrich KBs with events, causation, precedence, stories, negation, and beliefs. We want to extract this type of information at scale from structured and unstructured sources, and we want to allow the machine to reason on it, i.e., to apply logical arguments to reach a well-argued conclusion. For this purpose, we want to bring together research on knowledge representation, on reasoning, and on information extraction.
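To make the gap concrete, the following minimal Python sketch stores the two example facts from the text as binary triples, and then writes down a belief about a causal claim, which needs a whole statement as one of its arguments. The names `Triple`, `Statement`, and `objects_of`, as well as the predicates `believes` and `causes`, are assumptions made purely for this illustration; they are not the representation used by the NoRDF project or by any particular KB.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A binary fact of the form predicate(subject, object)."""
    predicate: str
    subject: str
    object: str


# The two example facts from the text, stored as flat binary triples.
kb = {
    Triple("type", "autism", "developmentalDisorder"),
    Triple("vaccinates", "MmrVaccine", "measles"),
}


def objects_of(kb, predicate, subject):
    """Return every o such that predicate(subject, o) is in the KB."""
    return {t.object for t in kb
            if t.predicate == predicate and t.subject == subject}


class Statement(NamedTuple):
    """Hypothetical nested statement: an argument may itself be a statement."""
    predicate: str
    args: tuple


# "The parents linked the behavioral symptoms to vaccination":
# a belief whose second argument is a causal claim, not an entity.
# Both predicate names are illustrative assumptions.
belief = Statement(
    "believes",
    ("parents", Statement("causes", ("vaccination", "behavioralSymptoms"))),
)

print(objects_of(kb, "type", "autism"))
print(belief.args[1].predicate)
```

In this sketch, the flat triples answer simple lookups, whereas the belief only fits once an argument position is allowed to hold another statement. That nesting is one example of the kind of structure that a model limited to binary relations between entities does not express directly.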