Foundation models are models that are pretrained on raw data at scale and can be adapted to various downstream applications. For example, language models such as BERT and the GPT family are pretrained on large amounts of raw text, such as Wikipedia and Books, and the resulting models can be adapted (e.g. via finetuning or prompting) to an extremely wide range of applications, including question answering and text classification. These language models achieve remarkable performance on many natural language processing (NLP) tasks, becoming the foundation of today’s NLP systems. They serve important roles in products and tools that we use every day, such as search engines like Google and personal assistants like Alexa.
What can foundation models learn from?
While text is commonly used to train language models, knowledge graphs (KGs) provide complementary information to text. KGs offer structured background knowledge by representing entities as nodes and relations between them as edges, e.g. <Leonardo da Vinci — born in — Italy>. Examples of knowledge graphs include Freebase and Wikidata (general-purpose facts), ConceptNet (commonsense), and UMLS (biomedical facts).
Text and KGs have complementary strengths. Text has broad coverage of knowledge and captures rich context. Meanwhile, KGs are structured and offer scaffolds for logical or multi-step reasoning by providing paths between entities. Some KGs also include knowledge that may not be commonly stated in text; for instance, people do not often state obvious facts like “people breathe” or compositional sentences like “The birthplace of the painter of the Mona Lisa is Italy”. Hence, text and KGs have the potential to mutually inform each other: text can contextualize the KG with its rich prose, and the KG can ground the text with its structure for reasoning.
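To make the path-based view of a KG concrete, here is a minimal sketch (not from the paper) of storing a KG as (head, relation, tail) triples and following entity paths; the specific triples, relation names, and the use of networkx are illustrative assumptions.

```python
# Minimal illustration: a KG stored as (head, relation, tail) triples in a graph.
# The triples here are made up for illustration; real KGs like ConceptNet or
# UMLS contain millions of such edges.
import networkx as nx

triples = [
    ("Leonardo da Vinci", "born in", "Italy"),
    ("Leonardo da Vinci", "painted", "Mona Lisa"),
    ("people", "capable of", "breathing"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Multi-step reasoning follows paths between entities, e.g. the path
# Mona Lisa -- painted -- Leonardo da Vinci -- born in -- Italy supports
# "The birthplace of the painter of the Mona Lisa is Italy".
print(nx.shortest_path(kg.to_undirected(), "Mona Lisa", "Italy"))
```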
Training a foundation model from text and KG
The above observation motivates research in fusing the strengths of the two modalities, text and KG. In our recent work published at NeurIPS 2022, we develop DRAGON, a new method to train a foundation model jointly from text and KG.
Challenges
To pretrain a powerful foundation model, we need both (i) an expressive model that allows the two modalities to interact in a deep, bidirectional manner; and (ii) a self-supervised training objective that allows the model to learn joint reasoning over text and KG at scale without manual labels. Existing models that combine text and KGs tend to perform indirect/uni-directional fusion of the two modalities, or are supervised by small labeled datasets rather than being self-supervised at scale. We will address these challenges below.
We introduce DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised method to pretrain a deeply fused foundation model from text and KG. As an overview, DRAGON consists of three steps. We first sample pairs of text segments and relevant KG subgraphs to create inputs for the model (Step 1). We then use a deep bidirectional model to fuse the input text and KG (Step 2), and finally pretrain the model using a joint self-supervised task over the two modalities (Step 3).
Step 1: Text-KG Input Sampling
Given a text corpus (e.g. Books) and a large KG (e.g. ConceptNet) as raw data, we want to sample informative pairs of (text segment, KG subgraph) as inputs for the model, so that the text and KG are semantically related and can inform each other. To achieve this, for each text segment sampled from the text corpus, we retrieve a relevant subgraph from the KG via simple entity linking, i.e. string-matching entities mentioned in the text segment to the KG, and extract these entity nodes as well as their neighboring nodes from the KG. Consequently, we obtain a pair of (text segment, local KG). Henceforth, we use “KG” to refer to this local KG for convenience.
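As a rough sketch of this step (assuming the KG is stored as a networkx graph; the function name, the naive string matching, and the one-hop neighbor expansion are illustrative simplifications, not the exact DRAGON implementation):

```python
# Simplified sketch of Step 1: string-match entities in a text segment against
# KG node names, then extract the matched nodes plus their one-hop neighbors
# as the local KG paired with this text segment.
import networkx as nx

def retrieve_local_kg(text_segment, kg):
    text_lower = text_segment.lower()
    # Naive entity linking: keep KG nodes whose surface form appears in the text.
    matched = [node for node in kg.nodes if node.lower() in text_lower]
    # Expand with one-hop neighbors to bring in relevant background knowledge.
    neighbors = {nbr for node in matched for nbr in kg.neighbors(node)}
    # Return the induced local subgraph over matched entities and their neighbors.
    return kg.subgraph(set(matched) | neighbors)
```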
Step 2: Deep Bidirectional Cross-Modal Model
Given the input text and KG, we want to design a model that can capture rich interactions between them. We draw inspiration from deep bidirectional contextualization of inputs, which has made BERT very successful, and from graph neural networks (GNNs), which have been shown to be effective for modeling graph algorithms, including knowledge graph reasoning. With these motivations, we designed a model architecture called GreaseLM that combines a Transformer and a GNN to fuse text and KG bidirectionally across multiple layers. Specifically, each layer of this model has a Transformer that encodes the input text and a GNN that encodes the input KG, which are then fused by a bidirectional modality interaction module.
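To make this concrete, below is a rough PyTorch sketch of one such fusion layer. The hidden size, the toy one-step GNN update, and the use of a single special interaction token and interaction node are simplifying assumptions for illustration, not the exact published GreaseLM/DRAGON architecture.

```python
# Rough sketch of one text-KG fusion layer: a Transformer layer contextualizes
# the tokens, a simple GNN update contextualizes the KG nodes, and a small MLP
# exchanges information between a special interaction token and interaction node.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.gnn_msg = nn.Linear(dim, dim)      # toy message function for the GNN
        self.interact = nn.Sequential(          # bidirectional modality interaction
            nn.Linear(2 * dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, 2 * dim)
        )

    def forward(self, tokens, nodes, adj):
        # tokens: (B, T, dim), where tokens[:, 0] is the interaction token
        # nodes:  (B, N, dim), where nodes[:, 0] is the interaction node
        # adj:    (B, N, N) adjacency matrix of the local KG
        tokens = self.text_layer(tokens)                     # encode the text
        nodes = nodes + torch.bmm(adj, self.gnn_msg(nodes))  # one GNN message-passing step
        # Fuse the two modalities through the interaction token / node.
        joint = self.interact(torch.cat([tokens[:, 0], nodes[:, 0]], dim=-1))
        t_int, n_int = joint.chunk(2, dim=-1)
        tokens = torch.cat([t_int.unsqueeze(1), tokens[:, 1:]], dim=1)
        nodes = torch.cat([n_int.unsqueeze(1), nodes[:, 1:]], dim=1)
        return tokens, nodes
```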
Step 3: Bidirectional Self-supervision
The final step is to pretrain the model using the inputs we prepared. We train the model by unifying two self-supervised reasoning tasks over text and KG. The first task is language modeling, which predicts masked words or next words in the input text. The other task is link prediction, which predicts edges that were held out from the input KG. The intuition is that by combining the two tasks, the model is encouraged to use both the text and the KG to reason about the missing words in the text and the missing links in the KG. This joint training encourages the model to propagate information bidirectionally between the two modalities.
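Below is a hedged sketch of this joint objective: a masked language modeling loss over the text plus a link prediction loss over held-out KG edges, scored here with a DistMult-style function. The scoring function, negative sampling, and loss weighting are illustrative assumptions and may differ from the paper.

```python
# Sketch of the joint self-supervised objective: sum a masked language modeling
# loss and a KG link prediction loss so gradients flow through both modalities.
import torch
import torch.nn.functional as F

def mlm_loss(token_logits, masked_labels):
    # token_logits: (B, T, vocab); masked_labels: (B, T), -100 at unmasked positions
    return F.cross_entropy(
        token_logits.flatten(0, 1), masked_labels.flatten(), ignore_index=-100
    )

def link_pred_loss(head_emb, rel_emb, tail_emb, neg_tail_emb):
    # DistMult-style scoring: score(h, r, t) = sum(h * r * t)
    pos = (head_emb * rel_emb * tail_emb).sum(-1)
    neg = (head_emb * rel_emb * neg_tail_emb).sum(-1)
    # Held-out edges should score higher than corrupted (negative) edges.
    return F.binary_cross_entropy_with_logits(
        torch.cat([pos, neg]),
        torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]),
    )

def joint_loss(token_logits, masked_labels, h, r, t, t_neg):
    return mlm_loss(token_logits, masked_labels) + link_pred_loss(h, r, t, t_neg)
```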
We pretrain DRAGON in two domains: the general domain, where we pair text from Books with the ConceptNet KG, and the biomedical domain, where we pair biomedical literature with the UMLS KG.
DRAGON improves over vanilla language models (LMs) and previous LM+KG models
We finetune and evaluate the pretrained DRAGON on diverse downstream tasks in each domain, including commonsense question answering tasks in the general domain and biomedical question answering tasks in the biomedical domain.
We compare DRAGON with two types of baselines. The first is a vanilla language model (LM), i.e. RoBERTa or BioLinkBERT. The second is GreaseLM, which takes a vanilla language model and *finetunes* it with the KG, but does not pretrain with the KG. The key difference is thus that DRAGON uses the KG during pretraining as well as finetuning.
The figure below shows the evaluation results. DRAGON outperforms the baseline language models and GreaseLM across the commonsense and biomedical question answering tasks. In particular, we can decompose and see the effect of using KG in pretraining with respect to the vanilla language model (purple arrow), and the effect of self-supervised pretraining with respect to GreaseLM (blue arrow). We can see that both components of DRAGON contribute to significant performance improvements.
Effective for complex reasoning
We also find that DRAGON exhibits several interesting strengths. The first is improved performance on complex reasoning tasks. We identified several types of complex questions: questions that contain negation terms (like “no”, “never”) or conjunction terms (like “and”, “but”), which indicate logical reasoning, and questions that contain many prepositional phrases or many entity mentions, which indicate more reasoning constraints or steps. We find that DRAGON attains large performance gains on these complex questions compared to the baseline models.
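For illustration, the sketch below shows the kind of simple surface heuristics that could be used to bucket questions into these complexity types; the exact term lists and thresholds used in our analysis may differ.

```python
# Illustrative heuristics for tagging a question with complexity types
# (negation, conjunction, many prepositional phrases, many entity mentions).
NEGATION_TERMS = {"no", "not", "never", "none"}
CONJUNCTION_TERMS = {"and", "but", "or"}
PREPOSITIONS = {"in", "on", "at", "of", "with", "from", "to", "for"}

def complexity_tags(question, num_entity_mentions):
    tokens = question.lower().split()
    tags = set()
    if any(t in NEGATION_TERMS for t in tokens):
        tags.add("negation")
    if any(t in CONJUNCTION_TERMS for t in tokens):
        tags.add("conjunction")
    if sum(t in PREPOSITIONS for t in tokens) >= 3:
        tags.add("many_prepositional_phrases")
    if num_entity_mentions >= 3:
        tags.add("many_entities")
    return tags
```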
The intuition is that because DRAGON is pretrained using KG, it learns to use KG as a scaffold for performing structured reasoning about entities. For instance, when the question contains a conjunction (example on the left of the figure), the model exhibits stronger attention weights over the entities related to the conjunction after several layers of the GNN over the KG, which lead to the correct answer. Similarly, when the question further contains a negation (example on the right of the figure), the model exhibits stronger attention weights over the entities that are not negated. One interpretation of these findings is that DRAGON uses the structure of KG and GNN as a scaffold for performing complex reasoning—this insight is related to recent works that provide language models with scratch space for doing intermediate reasoning. Another interpretation is that the GNN component of DRAGON learns to perform soft execution of natural language inputs (questions) on the KG—this insight is related to recent works showing that GNNs can learn to execute graph algorithms, including execution of complex logical queries on KG.
DRAGON also exhibits an ability to extrapolate to more complex questions. For instance, it adjusts the entity attention weights and final predictions accordingly when extra context (i.e. an extra reasoning step) is added to the original question, as in the figure below (left → right). Meanwhile, vanilla language models (RoBERTa) and KG-augmented finetuning (GreaseLM) struggle on these QA examples. This may suggest that KG-augmented pretraining (DRAGON) is important for acquiring broader reasoning abilities that generalize to harder test examples.
Effective for few-shot and data-efficient QA
Another strength of DRAGON is few-shot and data-efficient QA. For each QA dataset, we tried finetuning DRAGON and the baseline models with only 10% or 1% of the available training data. We find that DRAGON provides large improvements in these low-resource regimes, suggesting that it internalized more knowledge thanks to self-supervised pretraining with knowledge graphs.
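As a minimal sketch of this low-resource setup (the dataset format, the seed handling, and sharing the same subsets across models are illustrative assumptions):

```python
# Subsample a fixed fraction of the training set before finetuning, so that
# DRAGON and the baselines are finetuned on the same 10% or 1% subsets.
import random

def subsample(train_examples, fraction, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(train_examples) * fraction))
    return rng.sample(train_examples, k)

# e.g. train_10 = subsample(train_examples, 0.10)
#      train_01 = subsample(train_examples, 0.01)
```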
We introduced DRAGON, a new method to pretrain a deeply fused foundation model from text and knowledge graphs (KG). Specifically, we design a bidirectional cross-modal model for text and KG, and train the model using two joint self-supervised tasks: language modeling and KG link prediction.
DRAGON can be used as a drop-in replacement for existing BERT models, and can be finetuned to solve various NLP tasks. In particular, DRAGON achieves significant performance improvements for knowledge- and reasoning-intensive applications, such as complex question answering that involves commonsense/biomedical knowledge and multi-step reasoning.
The pretrained DRAGON models are publicly available, and we hope they will be helpful for your projects and research. Finally, we think DRAGON opens up many exciting directions for future work, such as generalizing it to GPT-style or sequence-to-sequence language models for knowledge-grounded text generation.
This blog post is based on the paper: Deep Bidirectional Language-Knowledge Graph Pretraining (NeurIPS 2022).
The models, code and data are available on GitHub. If you have questions, please feel free to email us.
Many thanks to my collaborators and advisors, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang and Jure Leskovec for their help, and to the members of the Stanford SNAP group, P-Lambda group, and AI lab for their valuable feedback.