Article

Enhancing SPARQL Query Generation for Knowledge Base Question Answering Systems by Learning to Correct Triplets

1 School of Electronic, Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(4), 1521; https://doi.org/10.3390/app14041521
Submission received: 10 January 2024 / Revised: 5 February 2024 / Accepted: 10 February 2024 / Published: 14 February 2024
(This article belongs to the Special Issue Unlocking the Potential of AI for Advancing Scientific Research)

Abstract

Generating SPARQL queries from natural language questions is challenging in Knowledge Base Question Answering (KBQA) systems. The current state-of-the-art models heavily rely on fine-tuning pretrained models such as T5. However, these methods still encounter critical issues such as triple-flip errors (e.g., (subject, relation, object) is predicted as (object, relation, subject)). To address this limitation, we introduce TSET (Triplet Structure Enhanced T5), a model with a novel pretraining stage positioned between the initial T5 pretraining and the fine-tuning for the Text-to-SPARQL task. In this intermediary stage, we introduce a new objective called Triplet Structure Correction (TSC) to train the model on a SPARQL corpus derived from Wikidata. This objective aims to deepen the model’s understanding of the order of triplets. After this specialized pretraining, the model undergoes fine-tuning for SPARQL query generation, augmenting its query-generation capabilities. We also propose a method named “semantic transformation” to fortify the model’s grasp of SPARQL syntax and semantics without compromising the pre-trained weights of T5. Experimental results demonstrate that our proposed TSET outperforms existing methods on three well-established KBQA datasets: LC-QuAD 2.0, QALD-9 plus, and QALD-10, establishing a new state-of-the-art performance (95.0% F1 and 93.1% QM on LC-QuAD 2.0, 75.85% F1 and 61.76% QM on QALD-9 plus, 51.37% F1 and 40.05% QM on QALD-10).

1. Introduction

The evolution of the Semantic Web has facilitated the creation and storage of structured knowledge [1,2,3,4,5,6]. With the rapid development of the Semantic Web, various types of knowledge bases (KBs) have been proposed, such as DBpedia [7], Wikidata [8], and FreeBase [9]. Knowledge Base Question Answering (KBQA) systems, which have garnered considerable attention from researchers, enable users, particularly those without programming expertise, to effortlessly interact with Knowledge Bases [10,11,12,13]. Central to KBQA is Complex Question Answering (CQA), which handles queries containing multiple subjects, compound relations, or numerical operations and is vital in applications like Apple’s Siri and Microsoft’s Cortana [14]. The fundamental challenge posed by this task is the translation of the user’s natural language queries into a specialized knowledge base query language, such as SPARQL.
A typical KBQA system consists of the following parts [11]: (1) Entity Linking (EL), (2) Relation Linking (RL), and (3) Query Building (QB). Figure 1 provides an illustrative example of this pipeline. Once entity linking and relation linking have been completed, the query building module coalesces this information into a SPARQL query. This paper places particular emphasis on the QB component, which is responsible for translating user queries into SPARQL queries when the gold entity and relation are already identified.
Generating SPARQL queries is challenging due to the inherent differences in syntax and structure between natural language and formal query languages used in knowledge bases. Traditional rule-based and template-filling methods have limitations in handling complex and variable natural language inputs, often suffering from poor generalization capabilities [10,11]. In recent years, methods based on neural networks, particularly those utilizing sequence-to-sequence models, have been increasingly adopted for this task. The most recent works, employing pre-trained models such as T5 [15] and BART [16], as well as BERT with Pointer-Generator Networks (PGN) [17], have significantly outperformed these traditional methods. However, despite achieving around 90% accuracy, state-of-the-art models still suffer from a significant number of “Triple Flip” errors, as pointed out in [11]. This type of error refers to the reversal of subject and object positions in the SPARQL triples generated by the model. For instance, given a correct SPARQL query
SELECT (COUNT (?vr0) AS ?vr1){wd:Q18813 wdt:P1574 ?vr0}
where wd:Q18813 and wdt:P1574 are the Internationalized Resource Identifiers (IRIs) for the entity New Testament and the relation exemplar of. The model may predict
SELECT (COUNT (?vr0) AS ?vr1) {?vr0 wdt:P1574 wd:Q18813}
instead, where the positions of the subject and object are flipped.
In this work, we address the triple flip error by introducing the TSET (Triplet Structure Enhanced T5) model, which exploits a novel pretraining stage positioned between the general T5 pretraining and the task-specific fine-tuning for Text-to-SPARQL parsing. The core of TSET is to enhance the model’s understanding of triple-structure information within SPARQL queries. To achieve this, we incorporate a Triple Structure Correction (TSC) objective into the model’s training regimen. This new objective is combined with the conventional Masked Language Modeling (MLM) objective to yield a more robust pretraining of the T5 model. Following the pretraining, we apply the TSET model to Text-to-SPARQL tasks using task-specific labeled data. To further improve the model’s performance, we introduce a semantic transformation approach. In this approach, we map the IRI encoding inside the initial input to the corresponding literal during the proposed pretraining. Then, in the Text-to-SPARQL tasks, the literal inside the final output is mapped back to the IRI encoding. These measures ensure that the semantic information is preserved during the training process, allowing for the maximum utilization of the pre-trained model’s weights and generating directly executable SPARQL queries as the final output.
Experimental results show that our model can significantly improve the quality of SPARQL query generation. On three well-known KBQA datasets, LC-QuAD 2.0 [18], QALD-9 plus [19], and QALD-10 [20], TSET surpasses all previous methods in answer F1 score and Query Match (QM) accuracy, achieving new state-of-the-art performance. We also conduct a comprehensive set of ablation studies to demonstrate the effectiveness of our proposed Triple Structure Correction (TSC) objective and the semantic transformation approach. A thorough error analysis shows that, although our primary focus is on addressing triple errors, our model also reduces sentence-level errors, which further underscores the effectiveness of our method in improving the overall quality of SPARQL query generation.
In summary, our main contributions are as follows:
  • We present TSET, a model incorporating the novel Triple Structure Correction (TSC) objective, which improves SPARQL query comprehension.
  • We design semantic transformation during pretraining to enhance the model’s understanding of triple-structure information while still being able to leverage the pre-trained weights.
  • Through exhaustive evaluation on three popular KBQA datasets, we demonstrate the superiority of TSET, setting new state-of-the-art performance on the query match and answer F1 metrics.

2. Related Work

Our work focuses on SPARQL query building, which is one of the most important submodules in KBQA systems, used to convert natural language questions into SPARQL queries. In previous work, this part was usually designed and evaluated together with tasks such as entity linking and relation linking. Recently, Banerjee et al. [11] separated this task and evaluated it independently using modern PLM models. In this section, we first review the two types of methods used in previous KBQA systems [10,21]: retrieval-based methods and semantics-based methods. Then, we provide a brief introduction to the application of pretrained language models in semantic parsing-based methods.

2.1. Information Retrieval-Based Methods

The information retrieval-based KBQA methods typically comprise four modules [10]: (1) retrieval source construction, which involves extracting a graph from the knowledge base that is relevant to the question, consisting of both pertinent facts and a substantial amount of noisy facts; (2) question representation, which aims to comprehend the question and generate instructions to guide the reasoning process; (3) graph-based reasoning, which conducts reasoning by employing semantic matching on the graph; and (4) answer generation, which generates answers based on the reasoning status at the conclusion of the reasoning process.
However, these methods face challenges in terms of accuracy and effectiveness, primarily due to the incompleteness of the source knowledge base and the exponential growth in subgraph size as the distance to topic entities increases, especially when dealing with complex user questions [22,23]. To address these challenges, researchers have proposed various solutions. For instance, some studies utilize sentences as nodes to expand incomplete knowledge bases [24], while others try to filter out irrelevant information in the reasoning process [25,26,27,28]. To comprehend complex questions, certain studies dynamically update reasoning instructions during the reasoning process [29,30]. Moreover, aside from explicitly recording the analytical aspect of the problem using attention, other approaches suggest incorporating information retrieved throughout the reasoning process to update instructions [31]. While information retrieval-based methods have their merits, they tend to lack interpretability and do not generate SPARQL queries [10,21], which are the focal points of this paper. As such, our research will primarily consider semantic parsing-based methods.

2.2. Semantic Parsing-Based Methods

The semantic parsing-based method obtains an executable query (such as SPARQL) by parsing the user’s natural language question and then returning the executed results. Generally, these methods include four modules [10]: (1) the question understanding module, which performs semantic and syntactic analysis on the question and generates an encoded question representation; (2) the logical parsing module, which converts the encoded question into an uninstantiated logical form; (3) the KB alignment module, which aligns the logical form with the knowledge base semantically and instantiates the logical form; and (4) the KB execution module, which executes the logical form to generate the predicted answer. In complex KBQA, these modules face different challenges, including difficulty in question understanding, diverse query types, and vast search space. In addition, manually annotating logical forms is costly and time-consuming, and using weak supervision signals for training is also challenging.
The SPARQL query building task is often interleaved with other modules in these pipelines, where it is rarely evaluated separately [21]. Early work usually exploits a slot-filling method by using a set of pre-defined SPARQL templates with some slots that have to be filled. These models often design and summarize some common simple SPARQL templates for a specific dataset, such as TeBaQA [32], Template-based QA over RNN [33], and SubQG [34]. However, the templates used by these methods are not guaranteed to be adapted to other datasets [11]. Other attempts try to construct intermediate query representations, such as AQG [35] and SQG [36].
Recent studies have introduced methodologies that utilize literal values instead of IRI encodings as inputs or outputs for models. For instance, Soru et al. [37,38] developed a technique named SPARQL 1:1 encoding, which utilizes character values to represent structural elements, including parentheses and punctuation marks. Diomedi and Hogan [39] applied Neural Machine Translation (NMT) to translate query templates containing placeholders. Meanwhile, Lin and Lu [40] implemented a method for transforming relation literals specifically for DBpedia.

2.3. Pretrained Language Models on Semantic Parsing-Based Methods

In recent years, large pre-trained language models such as BERT [41], GPT [42], BART [16], and T5 [15] have been showing exciting performance on many NLP tasks, such as sentiment analysis, natural language inference, coreference resolution, and question answering. These models are carefully designed with various training objectives based on reconstruction, such as mask [41], replacement [15], deletion [16], and permutation [43]. Shaw et al. [44] showed that fine-tuning a pre-trained T5 model could yield results competitive to the then-state-of-the-art method. Xie et al. [45] explored the T5 model for a series of structured knowledge grounding tasks such as fact verification, data-to-text, and conversational text-to-SQL tasks, achieving promising results.
More recently, Zou et al. [46] proposed a sequence-to-sequence model consisting of a relation-aware attention encoder and a pointer network decoder to handle the SPARQL query building task. Banerjee et al. [11] experimented with BART [16], T5 [15], and PGNs (Pointer-Generator Networks) [17] with BERT [41] embeddings, giving a new state of the art in the PLM era for the SPARQL query building task. Furthermore, many works consider pre-training to enhance the model’s performance on downstream semantic parsing tasks. For example, PPTOD [47] used dialogue multi-task pre-training to improve the model performance on task-oriented dialogue (TOD) tasks. Recent works such as TaBERT [48], SCoRe [49], and STAR [50] use synthetic data and carefully designed pretraining objectives to improve PLMs for text-to-SQL tasks.

3. Preliminaries

In this section, we provide a succinct introduction to the concept of a knowledge base, followed by a formal task definition for the process of building SPARQL queries, often referred to as the Text-to-SPARQL task.

3.1. Knowledge Base (KB)

A Knowledge Base (KB) is a repository used to store structured information in a machine-readable format. It consists of a collection of facts represented as triplets, often in the form of (subject, relation, object) [10]. The subject and object in these triplets are commonly referred to as “entities”. For example, (Canada, has capital, Ottawa) is a possible triple. Modern KBs contain vast amounts of data. For instance, Wikidata, one of the well-known KBs, encompasses over 13.9 billion triples [46]. Each entity and relation in a KB is uniquely identified by an Internationalized Resource Identifier (IRI). Taking Wikidata as an example, the entity “Canada” has the IRI “https://www.wikidata.org/entity/Q16” (accessed on 10 January 2024). Due to the length of IRIs, it is common practice to use abbreviations or prefixes to represent them. For example, “wd:Q16” denotes the full IRI “https://www.wikidata.org/entity/Q16” (accessed on 10 January 2024).
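To make the prefix convention concrete, the following minimal Python sketch expands abbreviated Wikidata IRIs into their full form. The prefix table and helper name are illustrative assumptions (the wdt: base shown here follows the usual Wikidata convention), not part of any particular library.

```python
# Illustrative sketch of expanding abbreviated IRIs; the prefix table is an assumption
# based on the conventions described above, not a library API.
PREFIXES = {
    "wd:": "https://www.wikidata.org/entity/",        # entity namespace, e.g., wd:Q16 (Canada)
    "wdt:": "https://www.wikidata.org/prop/direct/",  # direct-claim relation namespace (assumed base)
}

def expand_iri(abbrev: str) -> str:
    """Expand an abbreviated IRI such as 'wd:Q16' into its full form."""
    for prefix, base in PREFIXES.items():
        if abbrev.startswith(prefix):
            return base + abbrev[len(prefix):]
    return abbrev  # already a full IRI or an unknown prefix

print(expand_iri("wd:Q16"))  # -> https://www.wikidata.org/entity/Q16
```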

3.2. Problem Formulation

In the context of KBQA systems, the Text-to-SPARQL task aims to bridge the gap between natural language questions and their corresponding SPARQL queries given the output of entity and relation linking. Let
$$G = \{\langle e_s, r, e_o\rangle \mid e_s, e_o \in \mathcal{E},\ r \in \mathcal{R}\}$$
represent the KB, where $\langle e_s, r, e_o\rangle$ indicates the existence of relation $r$ between subject entity $e_s$ and object entity $e_o$. Here, $\mathcal{E}$ represents the set of all entities in the knowledge base, and $\mathcal{R}$ represents the set of all relations. Assume that the user poses a natural language question $Q = \{q_i\}_{i=1}^{|Q|}$, where $q_i$ represents the $i$-th word in the question. The gold linked entities are denoted as $E = \{E_i\}_{i=1}^{|E|}$, where $E \subseteq \mathcal{E}$, and the gold linked relations are represented as $R = \{R_i\}_{i=1}^{|R|}$, with $R \subseteq \mathcal{R}$. The objective of SPARQL query building is to predict the formal SPARQL query $Y$ given $Q$, $E$, and $R$. It is worth noting that $E_i$ and $R_i$ represent resource IRIs in the KB; for example, $E_1$ is wd:Q221211, and the corresponding literal value for $E_1$, denoted as $S_{E_1}$, is Twelfth Night. By accurately generating SPARQL queries based on user questions, KBQA systems can effectively retrieve relevant information from the KB and provide accurate answers to user queries.

4. Method

In this section, we introduce our TSET (Triplet Structure Enhanced T5) model designed specifically for the text-to-SPARQL task. The innovation of TSET lies in the inclusion of an intermediate pretraining stage that bridges the gap between generic T5 pretraining and task-specific fine-tuning for Text-to-SPARQL translation. This additional pretraining aims to enhance the model’s grasp of positional information within triple structures, a critical aspect of SPARQL queries. Moreover, to improve semantic understanding, we propose a method called semantic transformation for substituting Internationalized Resource Identifier (IRI) encodings with their corresponding literal values during both the pretraining and fine-tuning phases. Once the model is fine-tuned, these literal values are converted back to IRI encodings, ensuring that the generated SPARQL queries are directly executable.
The rest of this section is organized as follows: we will delve into the details of the semantic transformation in Section 4.1. The novel pretraining stage and the corresponding objectives will be elaborated in Section 4.2. We also discuss the fine-tuning on downstream text-to-SPARQL tasks in Section 4.3.

4.1. Semantic Transformation

In a typical SPARQL query, entities and relations are often represented using Internationalized Resource Identifiers (IRIs), such as wd:Q16 for “Canada” and wdt:P36 for “has capital”. Previous methods [11] directly transform natural language questions like “What is the capital city of Canada?” into SPARQL queries containing these IRIs. We argue that this approach imposes a semantic limitation on the model, reducing its understanding to merely memorizing the combinatorial relationship of specific IRI encodings, such as wd:Q16 wdt:P36, without comprehending their semantic meaning.
To address this limitation, we introduce a method called Semantic Transformation, designed to enhance the model’s semantic understanding of entities and relations within SPARQL queries. Specifically, this method replaces the original IRI encodings with their corresponding literal values during both the intermediate pretraining stage and the fine-tuning stage. This strategy aims to make the most of the T5 model’s pretrained weights by enabling it to process more semantically rich information. As depicted in Figure 2, during the TSET pretraining stage, entities and relations in the SPARQL queries are replaced by their respective literal values, a process we term Semantic Forwarding. Likewise, during the fine-tuning stage, any SPARQL queries output by the model, which contain these literal values, are reverse-mapped back to their original IRI encodings to ensure the query is executable. We refer to this operation as Semantic Backwarding.
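As a rough illustration of this mapping (not the authors’ released implementation), the sketch below replaces IRIs with their literal labels before training (semantic forwarding) and maps them back afterwards (semantic backwarding); the label table and function names are hypothetical.

```python
# Hypothetical sketch of semantic forwarding/backwarding; the label table is illustrative.
IRI_TO_LITERAL = {
    "wd:Q16": "Canada",
    "wdt:P36": "has capital",
}
LITERAL_TO_IRI = {v: k for k, v in IRI_TO_LITERAL.items()}

def semantic_forward(sparql: str) -> str:
    """Replace IRI encodings with their literal values (applied to pretraining/fine-tuning text)."""
    for iri, literal in IRI_TO_LITERAL.items():
        sparql = sparql.replace(iri, literal)
    return sparql

def semantic_backward(sparql: str) -> str:
    """Map literal values back to IRIs so that the generated query is directly executable."""
    for literal, iri in LITERAL_TO_IRI.items():
        sparql = sparql.replace(literal, iri)
    return sparql

query = "SELECT ?c WHERE { wd:Q16 wdt:P36 ?c }"
forwarded = semantic_forward(query)    # -> "SELECT ?c WHERE { Canada has capital ?c }"
assert semantic_backward(forwarded) == query
```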

4.2. Further Pretraining with TSC and MLM Objective

In this pretraining stage, we combine the proposed Triple Structure Correction (TSC) objective and the MLM objective to learn better representations for SPARQL query building tasks. We use the SPARQL queries in the training set of the LC-QuAD 2.0 dataset as the pretraining corpus, which contains about 24k examples.
In order to alleviate the triple flip problem in the output, we propose a new Triple Structure Correction (TSC) objective to strengthen the model during the pre-training. Specifically, we collect all SPARQL queries from the training data of LC-QuAD 2.0 as the pre-training corpus. We randomly swap the subject, predicate, and object positions in the triples of SPARQL queries with a certain probability and then let the model recover the correct SPARQL query. For an input SPARQL query x and the corresponding target y, the objective for TSC is as follows:
$$\mathcal{L}_{\Theta} = -\sum_{i=1}^{|y|} \log P_{\Theta}\left(y_i \mid y_{<i}; x\right)$$
where $\Theta$ denotes the model parameters.
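A minimal sketch of the TSC corruption step is given below, assuming triples can be extracted from the WHERE clause as whitespace-separated (subject, predicate, object) strings; the regex-based parsing and the swap probability are simplifications for illustration, not the authors’ exact procedure.

```python
import random
import re

def corrupt_triples(sparql: str, swap_prob: float = 0.5) -> str:
    """Randomly permute subject/predicate/object in WHERE-clause triples (naive parsing sketch).

    The corrupted query serves as the TSC input x; the original query is the target y
    that the model must recover.
    """
    match = re.search(r"\{(.*)\}", sparql)
    if match is None:
        return sparql
    body = match.group(1)
    corrupted = []
    for triple in body.split("."):
        tokens = triple.split()
        if len(tokens) == 3 and random.random() < swap_prob:
            random.shuffle(tokens)  # e.g., (s, r, o) may become (o, r, s)
        corrupted.append(" ".join(tokens))
    return sparql.replace(body, " . ".join(t for t in corrupted if t))

target = "SELECT (COUNT(?vr0) AS ?vr1) { wd:Q18813 wdt:P1574 ?vr0 }"
source = corrupt_triples(target)
# source may read "... {?vr0 wdt:P1574 wd:Q18813}" while the target stays unchanged.
```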
Additionally, we employ the Masked Language Modeling (MLM) objective by randomly masking specific tokens in an input sequence and letting the model predict the original masked tokens from their context. Since T5 is an encoder-decoder model, the implementation differs slightly from BERT-style models. Instead of using an extra MLP to predict which token the [MASK] is, our model treats it as a sequence-to-sequence problem. For example, we process the sentence
select distinct ?answer where wd:Porky Pig wdt:present in work ?answer.
The words “distinct”, “?”, and “answer” are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (such as T5’s reserved tokens <extra_id_0> and <extra_id_1>) that is unique over the example. Since “distinct”, “?”, and “answer” occur consecutively, they are replaced by a single sentinel <extra_id_0>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token. We use a corruption rate of 15% and an average span length of three in all experiments.
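The snippet below sketches this T5-style span corruption on a whitespace-tokenized sequence. The 15% corruption rate and sentinel naming follow the description above, while the helper itself (function name, span sampling) is a simplified illustration rather than the exact T5 preprocessing code.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """T5-style span corruption sketch: mask ~15% of tokens in spans and build an input/target pair."""
    n = len(tokens)
    num_to_mask = max(1, round(n * corruption_rate))
    masked = set()
    while len(masked) < num_to_mask:
        start = random.randrange(n)
        for i in range(start, min(n, start + mean_span_len)):
            masked.add(i)
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < n:
        if i in masked:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < n and i in masked:   # consecutive masked tokens share one sentinel
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target sequence
    return " ".join(inputs), " ".join(targets)

inp, tgt = span_corrupt("select distinct ?answer where wd:Porky Pig wdt:present in work ?answer".split())
```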
The final loss $\mathcal{L}_{pre}$ of this multi-task pretraining stage is the sum of $\mathcal{L}_{MLM}$ and $\mathcal{L}_{TSC}$:
$$\mathcal{L}_{pre} = \mathcal{L}_{MLM} + \mathcal{L}_{TSC}$$

4.3. Fine-Tuning on the Downstream SPARQL Query Building Task

Following the pre-training phase, the model undergoes fine-tuning for the SPARQL query building task. This task converts user questions into SPARQL queries when the gold entity and the gold relation are available. The input for our model is a combination of question Q, gold entity E, gold relation R, and necessary delimiters. We follow the approach of [11] to serialize the inputs. More specifically, the input is in the form
$$X = \{\, q_1\ q_2 \ldots\ [\mathrm{SEP}]\ E_1\ S_{E_1},\ E_2\ S_{E_2}, \ldots\ [\mathrm{SEP}]\ R_1\ S_{R_1},\ R_2\ S_{R_2}, \ldots \,\}$$
where each word is separated from the others by a space. Semantic forwarding is performed before the input is fed to the TSET model, and the training objective for this stage is the same as in Equation (1). For the tokenization of the input, we follow similar steps to [11], which exploits the sentinel tokens of the T5 model to replace IRIs and SPARQL keywords in the input and output, avoiding an increase in the token vocabulary size and the introduction of new randomly initialized word embeddings.
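As an illustration of this serialization (not the authors’ exact implementation), a minimal sketch might look as follows, where entities and relations are supplied as hypothetical (IRI, literal) pairs:

```python
def serialize_input(question, entities, relations, sep="[SEP]"):
    """Build the serialized model input: question [SEP] entity IRI/literal pairs [SEP] relation IRI/literal pairs."""
    ent_part = ", ".join(f"{iri} {literal}" for iri, literal in entities)
    rel_part = ", ".join(f"{iri} {literal}" for iri, literal in relations)
    return f"{question} {sep} {ent_part} {sep} {rel_part}"

x = serialize_input(
    "What is the capital city of Canada?",
    entities=[("wd:Q16", "Canada")],
    relations=[("wdt:P36", "has capital")],
)
# -> "What is the capital city of Canada? [SEP] wd:Q16 Canada [SEP] wdt:P36 has capital"
```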

5. Experiment

In this section, we present the experimental evaluation conducted on three extensively utilized KBQA datasets, namely LC-QuAD 2.0 [18], QALD-9 plus [19], and QALD-10 [20], to assess the viability and efficacy of our proposed TSET model. We begin by introducing the experimental settings, encompassing the datasets used, the evaluation metric employed, and the implementation details. Subsequently, we compare the performance of TSET with other state-of-the-art methods on the respective datasets. Additionally, to evaluate the performance of TSET in low-resource settings, we conduct experiments with limited training data. Finally, we conduct ablation studies to investigate the individual contributions of each proposed module and provide a comprehensive analysis of their impact on overall performance.

5.1. Experiment Setup

5.1.1. Datasets

In this work, we utilize three widely used datasets, all based on Wikidata: LC-QuAD 2.0 [18], QALD-9 plus [19], and QALD-10 [20].
LC-QuAD 2.0 provides labels in both DBpedia 2018 and Wikidata; it consists of a mixture of simple and complex questions that were verbalized by human workers on Amazon Mechanical Turk. It is a large and varied dataset comprising 24,180 training questions and 6046 test questions, and it covers a wide range of question types, such as single fact, multi fact, boolean, dual intentions, count, ranking, and so on. Since most of the DBpedia queries can no longer be executed against the current version of DBpedia, we use the Wikidata queries as the target labels, following the same practice as [11,46]. Our experimental approach is consistent with the methodologies outlined in [11,46], encompassing three primary phases. Initially, we systematically rename all variables within the SPARQL queries based on their sequence of appearance. For instance, a typical SPARQL query such as
SELECT (COUNT(?sub) AS ?value) {?sub wdt:P1433 wd:Q324878}
is transformed to
SELECT (COUNT(?var0) AS ?var1) {?var0 wdt:P1433 wd:Q324878}
to standardize variable names according to their occurrence order. Subsequently, we encode each element of the SPARQL syntax (e.g., SELECT, COUNT, DISTINCT, etc.) as a unique sentinel token, assigning a distinct ID to each keyword, a method detailed in Section 4.2. Lastly, we employ a serialized input format aligned with the one described in [11], as detailed in Section 4.3, to maintain consistency with established practices.
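A minimal sketch of this variable-renaming step, assuming variables are tokens starting with “?”, is shown below; the regex-based helper is illustrative rather than the exact preprocessing script.

```python
import re

def rename_variables(sparql: str) -> str:
    """Rename SPARQL variables to ?var0, ?var1, ... in order of first appearance."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"?var{len(mapping)}"
        return mapping[name]

    return re.sub(r"\?\w+", repl, sparql)

print(rename_variables("SELECT (COUNT(?sub) AS ?value) {?sub wdt:P1433 wd:Q324878}"))
# -> SELECT (COUNT(?var0) AS ?var1) {?var0 wdt:P1433 wd:Q324878}
```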
In addition, we leverage the QALD-9 plus and QALD-10 datasets, both integral components of the Question Answering over Linked Data (QALD) challenge series. The QALD-9 plus dataset enriches the original QALD-9 by introducing high-quality translations of questions into eight languages and transferring SPARQL queries from DBpedia to Wikidata. This not only bolsters the dataset’s usability and relevance but also facilitates the training and testing of KBQA systems over DBpedia and Wikidata in multiple languages. In contrast, QALD-10 seeks to rectify several deficiencies in previous benchmarks, such as insufficient translation quality for non-English languages and the lower complexity of gold standard SPARQL queries. Consequently, QALD-10 stands out as one of the most complex and practically applicable datasets in the QALD challenge series.

5.1.2. Evaluation Metric

In our experiments, we utilize two primary metrics: Query Match (QM) and Answer F1. The Query Match (QM) accuracy evaluates the congruence of the whole predicted sequence with the ground truth, achieved by decomposing the predicted SPARQL queries into distinct triples. On the other hand, the answer F1 score provides a measure of accuracy between the ground truth answer and the answers derived from the predicted SPARQL queries, which is frequently employed in similar works in the field.
For one example, assume the predicted SPARQL query is $Q_p$ and the ground truth (true label) SPARQL query is $Q_t$. We convert the triples in the WHERE clauses of $Q_p$ and $Q_t$ into sets, denoted as $S_p$ and $S_t$, respectively. The QM metric for one example can be defined as follows:
$$QM = \begin{cases} 1, & \text{if } S_p = S_t \\ 0, & \text{if } S_p \neq S_t \end{cases}$$
In other words, if the predicted SPARQL query and the ground truth query have the same set of triples, regardless of their order, the QM metric is 1. Otherwise, if the sets of triples do not match exactly, the QM metric is 0. This definition ensures that the QM metric is insensitive to the order of triples in the WHERE clause: as long as the two sets have the same elements, they are considered a match. This flexibility makes the QM metric suitable for evaluating text-to-SPARQL tasks, in which the ordering of triples within a query does not affect its meaning. The overall QM score for the test set is the average QM score over all examples in the test set.
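Under these definitions, the QM metric can be sketched as the simple set comparison below, assuming each WHERE-clause triple can be extracted as a normalized string; the naive brace/period parsing is an assumption for illustration.

```python
import re

def extract_triples(sparql: str) -> set:
    """Collect WHERE-clause triples as normalized strings (naive brace/period splitting)."""
    match = re.search(r"\{(.*)\}", sparql)
    body = match.group(1) if match else ""
    return {" ".join(t.split()) for t in body.split(".") if t.strip()}

def query_match(predicted: str, gold: str) -> int:
    """QM = 1 if the predicted and gold queries contain the same set of triples, else 0."""
    return int(extract_triples(predicted) == extract_triples(gold))

gold = "SELECT ?c { wd:Q16 wdt:P36 ?c }"
print(query_match("SELECT ?c { wd:Q16 wdt:P36 ?c }", gold))  # 1
print(query_match("SELECT ?c { ?c wdt:P36 wd:Q16 }", gold))  # 0 (triple flip)
```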
The answer F1 is a metric that measures the accuracy of the results obtained after executing a query. Assume $A_p$ is the set of predicted answers and $A_q$ is the set of ground truth answers. Let $P$ represent the precision, which is the ratio of the number of correct results obtained from the predicted query to the total number of predicted results. Similarly, let $R$ represent the recall, which is the ratio of the number of correct results obtained from the predicted query to the total number of correct results in the ground truth query.
$$P = \frac{|A_p \cap A_q|}{|A_p|}$$
$$R = \frac{|A_p \cap A_q|}{|A_q|}$$
$$F_1 = \frac{2 \times P \times R}{P + R}$$
The formal mathematical definition of the answer F1 score is the harmonic mean of precision and recall, as expressed by the equation above. Also, the overall answer F1 score for the test set is the average answer F1 score for each example in the test set.
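The corresponding computation, assuming the executed answers are available as Python sets, might look like the following sketch; the function name and edge-case handling are assumptions.

```python
def answer_f1(predicted_answers: set, gold_answers: set) -> float:
    """Answer F1: harmonic mean of precision and recall over the executed query results."""
    if not predicted_answers or not gold_answers:
        return 0.0
    overlap = len(predicted_answers & gold_answers)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_answers)
    recall = overlap / len(gold_answers)
    return 2 * precision * recall / (precision + recall)

print(answer_f1({"Ottawa"}, {"Ottawa"}))             # 1.0
print(answer_f1({"Ottawa", "Toronto"}, {"Ottawa"}))  # ~0.667 (precision 0.5, recall 1.0)
```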

5.1.3. Baseline Methods

In this paper, we compare our method with classical approaches such as AQG-net [35], Multi-hop QGG [51], and CLC (+BERT/Tencent Word) [46], as well as recent PLM-based methods such as BART [16], PGN-BERT and PGN-BERT-BERT [11], SGPT$_{Q,K}$ [52], and T5 [15].
AQG-net [35] employs a generative model based on neural networks to generate an abstract representation of query graphs, capturing the logical structures of queries. The generated graph is then populated with all possible candidate permutations. Subsequently, an existing ranking model is utilized to identify the most appropriate query. In a similar vein, Multi-hop QGG [51] explores a novel strategy for expanding the candidate query graph. This strategy incorporates both constraints and core paths, leading to an enriched graph representation. It also uses the reinforcement algorithm to learn a policy function, with the F1 score serving as the reward signal for the predicted answers compared to the ground truth answers. CLC [46] enhances the basic transformer model by incorporating a relation-aware self-attention encoder and a pointer-network-based decoder in their approach. Additionally, they conducted experiments using two different PLM embeddings, namely Tencent word and BERT, to further enhance the model’s performance.
Banerjee et al. [11] modify the original BART and T5 models with special tokenization, either by using the sentinel token or by adding new input tokens. They also propose a pointer generator network with KG embeddings and a reranking module. All of these approaches use PLMs and achieve better performance than before.

5.1.4. Implementation Details

Our code is implemented with PyTorch and the model is based on Hugging Face Transformers. The total batch size is 2048, achieved through gradient accumulation. The learning rate for our experiments is $1 \times 10^{-4}$ and we use Adafactor as the optimizer. The dropout rate is set to 0.1. Training runs for 1024 epochs in total and takes about one day on a GeForce RTX 3090 GPU.
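A hedged sketch of this training setup, using the Hugging Face Transformers Adafactor optimizer with gradient accumulation, is shown below; the checkpoint name, micro-batch size, and data handling are placeholders rather than the authors’ released code.

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast
from transformers.optimization import Adafactor

# Rough sketch of the reported setup; checkpoint name and micro-batch size are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
optimizer = Adafactor(model.parameters(), lr=1e-4, relative_step=False, scale_parameter=False)

micro_batch_size = 8
accumulation_steps = 2048 // micro_batch_size  # effective batch size of 2048 via gradient accumulation

def training_step(batch_inputs, batch_targets, step):
    """One micro-batch step; gradients are accumulated to reach the full batch size."""
    enc = tokenizer(batch_inputs, return_tensors="pt", padding=True, truncation=True).to(device)
    labels = tokenizer(batch_targets, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)
    loss = model(**enc, labels=labels).loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```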

5.2. Experiment Results

5.2.1. Main Results

The results on LC-QuAD 2.0 are shown in Table 1. Numbers are evaluated on the test set. Our proposed TSET model achieves state-of-the-art performance. In comparison to the T5 baseline model of equal size, the TSET model achieves superior performance across all sizes, as measured by both the QM and answer F1 metrics. Specifically, for the small-size models, TSET-small improved the F1 score by 1.6 percentage points, reaching 94.0%, and enhanced the QM score by 1.7 percentage points, achieving 92.0%. Similarly, for the base-size models, TSET-base improved the F1 score by 1.6 percentage points, reaching 95.0%, and enhanced the QM score by 1.8 percentage points, achieving 93.1%.
Table 2 shows the experimental results on the QALD-9 plus and QALD-10 datasets, where TSET surpasses the baseline methods and achieves state-of-the-art QM and answer F1 scores. On the QALD-9 plus dataset, TSET achieved a 61.76% QM score and a 75.85% answer F1 score, a large absolute improvement over the T5 baseline method. On the QALD-10 dataset, TSET attained a QM score of 40.05% and an answer F1 score of 51.37%, establishing a new state-of-the-art performance. We use bold font to highlight the highest score and improvement in each indicator.
Overall, our proposed TSET model achieves better results than other models on these three widely used datasets, demonstrating the TSC objective’s effectiveness in the pretraining period.

5.2.2. Low-Resource Evaluation

To further assess the model’s generalization ability, we conducted an evaluation in a challenging low-resource scenario. Given the limited availability of training data for QALD-9 plus and QALD-10, we opted to utilize LC-QuAD 2.0 as a suitable dataset for exploring this setting. Subsequently, we randomly selected subsets comprising 1%, 5%, 10%, and 20% of the original training data to train our model. The obtained results for the Query Match (QM) and answer F1 scores are presented in Table 3.
Remarkably, our model consistently outperformed the baselines across all four settings. Notably, even when the small size model was trained with just 1% of the data, our method achieved a substantial improvement in performance over the baseline model, with an increase of 21.19 points in the QM score and 20.56 points in the answer F1 score. These findings provide strong evidence of our model’s superior generalization capabilities, demonstrating its effectiveness in adapting to new tasks, even when confronted with limited training data.

5.3. Ablation Study

We also conducted a series of ablation experiments to analyze the effectiveness of our proposed approach, including the effects of different pre-training objectives and performance on different question types.

5.3.1. Natural Language Question Types

As previously stated, the LC-QuAD 2.0 dataset, with its considerable size and complexity, encompasses a broad range of question types. Table 4 shows all of the question types in this dataset, which include single or multi-fact, multi-intentions, boolean, count, ranking, string operation, and temporal aspect, among others. This section presents experimental evaluations assessing the performance of our model across these question types.
Figure 3 provides a detailed visual representation of the QM scores for each question type. The QM scores, depicted through horizontal bars, represent the performance of two models: TSET and T5. Question types are listed on the y-axis, with corresponding QM scores on the x-axis. The figure clearly demonstrates that the TSET model outperforms the T5 model for most problem types. Specifically, TSET has a higher QM score in 8 out of the 13 problem types, indicating its superior performance. Moreover, TSET matches T5’s performance in 2 out of 13 problem types, showcasing its robustness and versatility. The significant improvement of TSET over T5, denoted by the red line in the figure, is evident across several question types. In several cases, the improvement exceeds 5%, highlighting the substantial enhancement offered by the TSET model. Despite these improvements, it is worth noting that the T5 model performed better in three problem types. However, the overall superiority of TSET across most question types underscores the effectiveness of our proposed model and its potential for tackling a wide array of problem types.

5.3.2. Pretraining Objectives

We also conducted additional experiments to analyze the relative contributions of the proposed pretraining objectives. Firstly, we explored the impact of removing either the TSC or MLM pretraining objective. As shown in Table 5, removing the TSC objective led to an almost 1-point decrease in the QM score, while removing the MLM objective resulted in a loss of 0.41 points.
Further, we examined the TOC (Triple Order Correction) and SOC (Sentence Order Correction) objectives, which represent constrained and expanded variants of the proposed TSC objective, respectively. The TOC objective selectively randomizes the subject and object components of triples, whereas the SOC objective expands this permutation to encompass the entirety of a SPARQL query. Table 6 presents samples of pretraining input texts derived from the application of TOC, SOC, and TSC during the pretraining phase, illustrating the distinct permutations used. As shown in Table 5, the TOC+MLM objectives resulted in performance close to that of the TSC+MLM objectives, but with a 0.18-point drop. We also find that the performance of SOC+MLM was significantly worse than that of the TSC+MLM objectives.
In conclusion, these experiments demonstrate the effectiveness of the proposed TSC objective. The larger structural permutations, as in the SOC objective, are unsuitable for this task. Hence, we conclude that the TSC objective, with its more targeted and smaller-scale structure permutation, is more appropriate for enhancing the model’s understanding of triple-structure information in SPARQL queries.

5.3.3. Semantic Transformation

Semantic transformation is applied to the input of the first stage (semantic forwarding) and the output of the second stage (semantic backwarding). We conducted another experiment without semantic transformation to observe the effectiveness of this module. The experimental results are shown in Table 5. There is a significant drop in performance (1.9 points in QM score), highlighting the importance of this module.

5.3.4. Effect on SPARQL Difficulty

We categorized the test sets of LC-QuAD 2.0, QALD-9 plus, and QALD-10 based on the number of triples (hops) in the corresponding SPARQL queries into four categories: hop = 1, 2, 3, and 4. Table 7 presents the performance of the TSET model in comparison to the baseline T5 model on questions of varying complexities. The results demonstrate that the TSET model consistently outperforms the T5 model across almost all subsets. This performance enhancement validates the effectiveness of the methods introduced in our study.
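The hop count used for this breakdown can be obtained by counting the triples in the WHERE clause, for example with a sketch like the one below (reusing the naive triple-extraction idea from the QM metric above); the parsing is an illustrative assumption.

```python
import re

def hop_count(sparql: str) -> int:
    """Number of triples (hops) in the WHERE clause, via naive brace/period splitting."""
    match = re.search(r"\{(.*)\}", sparql)
    body = match.group(1) if match else ""
    return sum(1 for t in body.split(".") if len(t.split()) == 3)

print(hop_count("SELECT ?x { ?x wdt:P2575 wd:Q83500 . ?x wdt:P31 wd:Q12021385 }"))  # -> 2
```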

5.4. Case Study

In Table 8, we show how the TSET model could predict the SPARQL query more accurately by demonstrating two examples of question-SPARQL pairs sampled from three datasets. Within these examples, T5 gives the wrong prediction in which a triple flip occurs. For example, in the second question, the user asks, “What is the IQ test for insights measurements?” After Entity Linking and Relation Linking, the identified entities are as follows: intelligence wd:Q83500; IQ test wd:Q12021385; instance of wdt:P31; measures wdt:P2575. Both models correctly identify the second triple (?x wdt:P31 wd:Q12021385), indicating that the answer must be a type of IQ test. However, for the first triple, T5 fails to make the correct prediction, resulting in a Triple Flip error. In contrast, our proposed TSET model effectively captures the structural and positional information within the triple, giving the correct triple prediction (?x wdt:P2575 wd:Q83500). This indicates that it should be a subject measuring intelligence rather than intelligence measuring something. The TSET model demonstrates its ability to grasp the internal structure and positional information of triples, surpassing T5 in accurately generating SPARQL queries.

5.5. Error Analysis

We also provide a concise error analysis on the test set of the LC-QuAD 2.0 dataset. Table 9 shows the number of examples with incorrect predictions by our model and the baseline model. We classify errors into two categories: Triple errors and Sentence errors. A Triple error refers to an incorrectly predicted triple in the generated SPARQL query, while a Sentence error refers to errors in the remaining parts of the query outside the triples. Note that a wrong prediction may contain both triple and sentence errors.
Within the Triple errors, we meticulously examine a specialized error subset known as Triple Flip Errors. These errors are distinctly itemized in the last column of Table 9. The numbers in Table 9 demonstrate that our proposed TSET model substantially reduces Triple Flip Errors compared to the original T5 model. Moreover, there is a significant enhancement in the model’s performance concerning both Triple Errors and Sentence Errors.

6. Conclusions and Future Work

In this work, we address the critical challenge of triple-flip errors in SPARQL query building tasks, a pervasive issue affecting the performance of Knowledge Base Question Answering (KBQA) systems. We introduce a novel approach that leverages a new objective called Triple Structure Correction (TSC) to enrich the pre-training of the T5 model. Our work establishes a comprehensive framework to bridge the “semantic gap” between human language and machine-readable queries. The innovative Triple Structure Correction objective represents a key contribution in providing a practical solution to the challenges encountered in generating SPARQL queries. As a further testament to our methodology, we fine-tuned our model, which we named TSET, for the downstream task of SPARQL query building. Our experimental results validate the effectiveness of our approach, showing that TSET substantially outperforms state-of-the-art methods on three renowned KBQA datasets: LC-QuAD 2.0, QALD-9 plus, and QALD-10. Overall, this study paves the way for future research in KBQA, underscoring the pivotal roles of pre-training and fine-tuning in improving both the understanding and generation of complex SPARQL queries.
In the future, two prominent research directions beckon our attention. Firstly, we aspire to contemplate a more diverse array of neural network models and more complex knowledge bases. While the current study predominantly focuses on the T5 model-based KBQA system, advancements in neural network technology such as ChatGPT and other LLMs may also lend valuable insights into enhancing KBQA. Additionally, with the continuous enrichment and expansion of knowledge bases, fine-tuning and optimizing our model to accommodate larger and structurally complex knowledge bases remains a challenge. Secondly, we aim to delve deeper into pre-training and fine-tuning strategies to better leverage the structural information in knowledge bases and to devise more effective objective functions for improving model performance in understanding and generating SPARQL queries. We believe these future efforts will propel our model toward achieving greater success in handling intricate KBQA tasks.

Author Contributions

Conceptualization, J.Q. and L.F.; Methodology, J.Q. and L.F.; Software, J.Q., C.S. and L.W.; Formal analysis, J.Q. and L.F.; Investigation, C.S. and Z.G.; Validation, C.S., Z.G., Z.S. and L.W.; Writing—Original draft preparation, J.Q.; Writing—Reviewing and Editing, L.F.; Supervision, X.W.; Resources, C.Z.; Funding acquisition, L.F.; Project administration, L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by NSF China (No. 62020106005, 61960206002, 42050105, 62061146002), Shanghai Pilot Program for Basic Research—Shanghai Jiao Tong University. Moreover, this work is also a contribution to the Deep-time Digital Earth (DDE) Big Science Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We will release our code and data at https://github.com/JiexingQi/tset (accessed on 10 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shadbolt, N.; Berners-Lee, T.; Hall, W. The semantic web revisited. IEEE Intell. Syst. 2006, 21, 96–101. [Google Scholar] [CrossRef]
  2. Hitzler, P. A review of the semantic web field. Commun. ACM 2021, 64, 76–83. [Google Scholar] [CrossRef]
  3. Boumechaal, H.; Boufaida, Z. Complex Queries for Querying Linked Data. Future Internet 2023, 15, 106. [Google Scholar] [CrossRef]
  4. Zhang, C.; Zha, D.; Wang, L.; Mu, N.; Yang, C.; Wang, B.; Xu, F. Graph Convolution Network over Dependency Structure Improve Knowledge Base Question Answering. Electronics 2023, 12, 2675. [Google Scholar] [CrossRef]
  5. Hu, S.; Zhang, H.; Zhang, W. Domain Knowledge Graph Question Answering Based on Semantic Analysis and Data Augmentation. Appl. Sci. 2023, 13, 8838. [Google Scholar] [CrossRef]
  6. Wang, S.; Qin, B. A Novel Joint Training Model for Knowledge Base Question Answering. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 666–679. [Google Scholar] [CrossRef]
  7. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
  8. Pellissier Tanon, T.; Vrandečić, D.; Schaffert, S.; Steiner, T.; Pintscher, L. From freebase to wikidata: The great migration. In Proceedings of the 25th International Conference on World Wide Web, Montréal, QC, Canada, 11–15 April 2016; pp. 1419–1428. [Google Scholar]
  9. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
  10. Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J.R. Complex knowledge base question answering: A survey. IEEE Trans. Knowl. Data Eng. 2022, 35, 11196–11215. [Google Scholar] [CrossRef]
  11. Banerjee, D.; Nair, P.A.; Kaur, J.N.; Usbeck, R.; Biemann, C. Modern baselines for SPARQL semantic parsing. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2260–2265. [Google Scholar]
  12. Song, Y.; Li, W.; Dai, G.; Shang, X. Advancements in Complex Knowledge Graph Question Answering: A Survey. Electronics 2023, 12, 4395. [Google Scholar] [CrossRef]
  13. Borroto, M.A.; Ricca, F. SPARQL-QA-v2 system for Knowledge Base Question Answering. Expert Syst. Appl. 2023, 229, 120383. [Google Scholar] [CrossRef]
  14. Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 4623–4629. [Google Scholar]
  15. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  16. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  17. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar]
  18. Dubey, M.; Banerjee, D.; Abdelkawi, A.; Lehmann, J. Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia. In Proceedings of the Semantic Web–ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, 26–30 October 2019; Proceedings, Part II 18. Springer: Berlin/Heidelberg, Germany, 2019; pp. 69–78. [Google Scholar]
  19. Perevalov, A.; Diefenbach, D.; Usbeck, R.; Both, A. QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers. In Proceedings of the 2022 IEEE 16th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 26–28 January 2022; pp. 229–234. [Google Scholar]
  20. Usbeck, R.; Yan, X.; Perevalov, A.; Jiang, L.; Schulz, J.; Kraft, A.; Möller, C.; Huang, J.; Reineke, J.; Ngonga Ngomo, A.C.; et al. QALD-10—The 10th challenge on question answering over linked data. Semant. Web 2023, 1–15. [Google Scholar] [CrossRef]
  21. Diefenbach, D.; Lopez, V.; Singh, K.; Maret, P. Core techniques of question answering systems over knowledge bases: A survey. Knowl. Inf. Syst. 2018, 55, 529–569. [Google Scholar] [CrossRef]
  22. Min, B.; Grishman, R.; Wan, L.; Wang, C.; Gondek, D. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 777–782. [Google Scholar]
  23. Petrochuk, M.; Zettlemoyer, L. SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 554–558. [Google Scholar]
  24. Sun, H.; Dhingra, B.; Zaheer, M.; Mazaitis, K.; Salakhutdinov, R.; Cohen, W. Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4231–4242. [Google Scholar]
  25. Miller, A.; Fisch, A.; Dodge, J.; Karimi, A.H.; Bordes, A.; Weston, J. Key-Value Memory Networks for Directly Reading Documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1400–1409. [Google Scholar]
  26. Xiong, W.; Yu, M.; Chang, S.; Guo, X.; Wang, W.Y. Improving Question Answering over Incomplete KBs with Knowledge-Aware Reader. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4258–4264. [Google Scholar]
  27. Han, J.; Cheng, B.; Wang, X. Two-phase hypergraph based reasoning with dynamic relations for multi-hop KBQA. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 3615–3621. [Google Scholar]
  28. Yasunaga, M.; Ren, H.; Bosselut, A.; Liang, P.; Leskovec, J. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 535–546. [Google Scholar]
  29. Zhou, M.; Huang, M.; Zhu, X. An Interpretable Reasoning Network for Multi-Relation Question Answering. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 2010–2022. [Google Scholar]
  30. Xu, K.; Lai, Y.; Feng, Y.; Wang, Z. Enhancing key-value memory neural networks for knowledge based question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2937–2947. [Google Scholar]
  31. He, S.; Liu, C.; Liu, K.; Zhao, J. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 199–208. [Google Scholar]
  32. Vollmers, D.; Jalota, R.; Moussallem, D.; Topiwala, H.; Ngomo, A.C.N.; Usbeck, R. Knowledge Graph Question Answering using Graph-Pattern Isomorphism. arXiv 2021, arXiv:2103.06752. [Google Scholar]
  33. Athreya, R.G.; Bansal, S.K.; Ngomo, A.C.N.; Usbeck, R. Template-based question answering using recursive neural networks. In Proceedings of the 2021 IEEE 15th international conference on semantic computing (ICSC), Laguna Hills, CA, USA, 27–29 January 2021; pp. 195–198. [Google Scholar]
  34. Ding, J.; Hu, W.; Xu, Q.; Qu, Y. Leveraging Frequent Query Substructures to Generate Formal Queries for Complex Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2614–2622. [Google Scholar]
  35. Chen, Y.; Li, H.; Hua, Y.; Qi, G. Formal query building with query structure prediction for complex question answering over knowledge base. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 3751–3758. [Google Scholar]
  36. Hu, S.; Zou, L.; Zhang, X. A state-transition framework to answer complex questions over knowledge base. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2098–2108. [Google Scholar]
  37. Soru, T.; Marx, E.; Valdestilhas, A.; Esteves, D.; Moussallem, D.; Publio, G. Neural machine translation for query construction and composition. arXiv 2018, arXiv:1806.10478. [Google Scholar]
  38. Soru, T.; Marx, E.; Moussallem, D.; Publio, G.; Valdestilhas, A.; Esteves, D.; Neto, C.B. SPARQL as a Foreign Language. arXiv 2017, arXiv:1708.07624. [Google Scholar]
  39. Diomedi, D.; Hogan, A. Question answering over knowledge graphs with neural machine translation and entity linking. arXiv 2021, arXiv:2107.02865. [Google Scholar]
  40. Lin, J.H.; Lu, E.J.L. SPARQL Generation with an NMT-based Approach. J. Web Eng. 2022, 1471–1490. [Google Scholar] [CrossRef]
  41. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  42. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 9 January 2024).
  43. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  44. Shaw, P.; Chang, M.W.; Pasupat, P.; Toutanova, K. Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 922–938. [Google Scholar]
  45. Xie, T.; Wu, C.H.; Shi, P.; Zhong, R.; Scholak, T.; Yasunaga, M.; Wu, C.S.; Zhong, M.; Yin, P.; Wang, S.I.; et al. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 602–631. [Google Scholar]
  46. Zou, J.; Yang, M.; Zhang, L.; Xu, Y.; Pan, Q.; Jiang, F.; Qin, R.; Wang, S.; He, Y.; Huang, S.; et al. A chinese multi-type complex questions answering dataset over wikidata. arXiv 2021, arXiv:2111.06086. [Google Scholar]
  47. Su, Y.; Shu, L.; Mansimov, E.; Gupta, A.; Cai, D.; Lai, Y.A.; Zhang, Y. Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 4661–4676. [Google Scholar]
  48. Yin, P.; Neubig, G.; Yih, W.t.; Riedel, S. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8413–8426. [Google Scholar]
  49. Yu, T.; Zhang, R.; Polozov, A.; Meek, C.; Awadallah, A.H. Score: Pre-training for context representation in conversational semantic parsing. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  50. Cai, Z.; Li, X.; Hui, B.; Yang, M.; Li, B.; Li, B.; Cao, Z.; Li, W.; Huang, F.; Si, L.; et al. STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirate, 7–11 December 2022; pp. 1235–1247. [Google Scholar]
  51. Lan, Y.; Jiang, J. Query Graph Generation for Answering Multi-hop Complex Questions from Knowledge Bases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 969–974. [Google Scholar]
  52. Rony, M.R.A.H.; Kumar, U.; Teucher, R.; Kovriguina, L.; Lehmann, J. SGPT: A generative approach for SPARQL query generation from natural language questions. IEEE Access 2022, 10, 70712–70723. [Google Scholar] [CrossRef]
Figure 1. An Overview of the KBQA Pipeline: The user’s question will undergo processing through entity linking, relation linking, and SPARQL query building steps. The executed result will then be returned to the user.
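The last step of the pipeline in Figure 1 executes the generated SPARQL query against the knowledge base and returns the result to the user. The snippet below is a minimal sketch of that execution step, assuming the public Wikidata endpoint and the SPARQLWrapper library; the endpoint, user agent, and example query are illustrative assumptions, not the system described in the paper.

```python
# Sketch of the final pipeline step: run a generated SPARQL query against
# Wikidata and return the bindings. Endpoint/library are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

def run_sparql(query: str, endpoint: str = "https://query.wikidata.org/sparql"):
    client = SPARQLWrapper(endpoint, agent="kbqa-demo/0.1 (example)")
    client.setQuery(query)
    client.setReturnFormat(JSON)
    results = client.query().convert()
    # Flatten the JSON bindings into a list of {variable: value} dicts.
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]

if __name__ == "__main__":
    # Example: "Who wrote Harry Potter?" (Harry Potter wd:Q8337, author wdt:P50)
    print(run_sparql("SELECT DISTINCT ?x WHERE { wd:Q8337 wdt:P50 ?x . }"))
```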
Figure 2. The overview of our proposed TSET model. It was first pre-trained using the Triple Structure Correction (TSC) and Masked Language Modeling (MLM) objectives in a multi-task learning framework and then fine-tuned on the downstream dataset.
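During the intermediate pretraining stage shown in Figure 2, TSC and MLM examples are mixed into a single text-to-text stream for T5. The sketch below illustrates one way such examples could be constructed; the task prefixes, corruption details, and sampling scheme are simplified assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of building TSC and MLM pretraining examples for a T5-style model.
# Prefixes, masking rate, and triple handling are illustrative assumptions.
import random

def tsc_example(query_tokens):
    """Triplet Structure Correction: permute the triple's elements;
    the target is the original, well-ordered query."""
    corrupted = list(query_tokens)
    i = corrupted.index("{") + 1          # assume the triple follows the brace
    triple = corrupted[i:i + 3]
    random.shuffle(triple)
    corrupted[i:i + 3] = triple
    return "correct structure: " + " ".join(corrupted), " ".join(query_tokens)

def mlm_example(query_tokens, mask_prob=0.15):
    """Masked language modeling with T5-style sentinel tokens."""
    src, tgt, sent = [], [], 0
    for tok in query_tokens:
        if random.random() < mask_prob:
            src.append(f"<extra_id_{sent}>")
            tgt.append(f"<extra_id_{sent}> {tok}")
            sent += 1
        else:
            src.append(tok)
    return "fill mask: " + " ".join(src), " ".join(tgt)

query = "SELECT ( COUNT ( ?vr0 ) AS ?vr1 ) { wd:Q18813 wdt:P1574 ?vr0 . }".split()
builder = random.choice([tsc_example, mlm_example])   # multi-task sampling
print(builder(query))
```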
Figure 3. Comparison of TSET and T5 model performance across various problem types. The problem types are displayed on the y-axis, while the performance scores of the models are shown on the x-axis. The blue and light blue bars represent the scores of the T5 and TSET models, respectively, for each problem type. The red line with circular markers, which corresponds to the secondary x-axis at the top, shows the improvement of the TSET model over the T5 model. A higher score or a more positive improvement value indicates better performance. The problem types are ordered by the improvement of TSET over T5, with the problem type having the most significant improvement at the bottom.
Table 1. Experimental results on LC-QuAD 2.0. The QM scores are not provided by [11,52]. The Q,K subscript on SGPT denotes that only gold entities are provided; rows marked "re-implemented" are our re-implemented results. For the TSET model, the value in parentheses is the improvement over the T5 model of the same size.
Approach | F1 | QM
AQG-net [35] | 44.9 | 37.4
Multi-hop QGG [51] | 52.6 | 43.2
CLC+Tencent Word [46] | 52.9 | 48.4
CLC+BERT [46] | 59.3 | 55.4
BART [11] | 64.0 | -
PGN-BERT [11] | 77.0 | -
PGN-BERT-BERT [11] | 86.0 | -
SGPT Q,K [52] | 89.0 | -
T5-small [11] | 92.0 | -
T5-base [11] | 91.0 | -
T5-small (re-implemented) | 92.4 | 90.3
T5-base (re-implemented) | 93.4 | 91.3
TSET-small | 94.0 (+1.6) | 92.0 (+1.7)
TSET-base | 95.0 (+1.6) | 93.1 (+1.8)
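For readers reproducing these numbers: under the conventions commonly used for these benchmarks (an assumption here, since the formal metric definitions appear earlier in the paper), QM is an exact match between the normalized predicted and gold SPARQL queries, and F1 is the harmonic mean of precision and recall over the answer sets returned by executing the two queries. A minimal sketch:

```python
# Minimal sketch of the two reported metrics under common KBQA conventions
# (assumed here): QM = exact query match, F1 = answer-set F1.
def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def query_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def answer_f1(pred_answers: set, gold_answers: set) -> float:
    if not pred_answers and not gold_answers:
        return 1.0
    if not pred_answers or not gold_answers:
        return 0.0
    tp = len(pred_answers & gold_answers)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_answers), tp / len(gold_answers)
    return 2 * precision * recall / (precision + recall)
```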
Table 2. Experimental results for QALD-9 plus and QALD-10. For the TSET model, the value in parentheses is the improvement over the T5 model of the same size.
Approach | QALD-9 Plus QM | QALD-9 Plus F1 | QALD-10 QM | QALD-10 F1
T5-small | 55.15 | 64.46 | 33.93 | 38.78
T5-base | 58.09 | 69.50 | 36.48 | 39.86
TSET-small | 61.03 (+5.88) | 72.86 (+8.40) | 40.05 (+6.12) | 47.15 (+8.37)
TSET-base | 61.76 (+3.67) | 75.85 (+6.35) | 39.03 (+2.55) | 51.37 (+11.51)
Table 3. QM and F1 scores for the low-resource evaluation. For training, we selected 1%, 5%, 10%, and 20% of the data, respectively. "Impr." is short for "Improvement".
Ratio | Model / Impr. | Small QM | Small F1 | Base QM | Base F1
1% | T5 | 49.34 | 54.01 | 55.39 | 61.11
1% | TSET | 70.53 | 74.57 | 75.22 | 79.58
1% | Impr. | +21.19 | +20.56 | +19.83 | +18.47
5% | T5 | 78.58 | 81.45 | 80.23 | 83.09
5% | TSET | 83.18 | 86.50 | 85.81 | 89.04
5% | Impr. | +4.60 | +5.05 | +5.58 | +5.95
10% | T5 | 82.47 | 85.25 | 84.35 | 87.25
10% | TSET | 86.09 | 89.61 | 88.32 | 91.14
10% | Impr. | +3.62 | +4.36 | +3.97 | +3.89
20% | T5 | 85.49 | 87.51 | 86.52 | 88.61
20% | TSET | 88.31 | 90.96 | 90.39 | 92.48
20% | Impr. | +2.82 | +3.45 | +3.87 | +3.87
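The low-resource setup amounts to fine-tuning on a fixed fraction of the training pairs while evaluating on the full test set. A small sketch of the subsampling step (uniform random sampling with a fixed seed is an assumption; `fine_tune` is a hypothetical training routine):

```python
# Sketch of the low-resource sampling in Table 3; sampling strategy and seed
# are assumptions, and fine_tune() is a hypothetical placeholder.
import random

def sample_subset(train_pairs, ratio, seed=42):
    rng = random.Random(seed)
    k = max(1, int(len(train_pairs) * ratio))
    return rng.sample(train_pairs, k)

# for ratio in (0.01, 0.05, 0.10, 0.20):
#     subset = sample_subset(train_pairs, ratio)
#     fine_tune(model, subset)
```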
Table 4. Example questions for each question type in the LC-QuAD 2.0 dataset.
Type | Example Question
Rank | What open cluster has the largest radius?
Center | What is the total solar radiation reflected off of Saturn?
Left-subgraph | Who married to of actor of Bepanaah?
Right-subgraph | Which spot came from the Ebola virus?
Boolean double one_hop right subgraph | Did Pope Paul VI work in both Rome and Munich?
Boolean one_hop right subgraph | Did Zinedine Zidane play as a midfielder?
Boolean with filter | Does the BMW M20B20 have a torque equal to 160?
Simple question left | What is the name of the opera based on Twelfth Night?
Simple question right | What is the capital of the Hamburg region?
Statement_property | As of 2009, how many people lived in Somalia?
String matching simple contains word | What is the game name starts with z?
String matching type + relation contains word | What temple that belongs to the World Heritage starts with letter P?
Two intentions right subgraph | What is the cause and place of John Denver's death?
Table 5. Ablation of the different pretraining objectives. TOC and SOC are variants of TSC: TOC randomly permutes only the subject and object within a triplet, whereas SOC randomly permutes the entire SPARQL query.
Approach | QM | Performance Drop
TSET | 92.04 | -
w/o ST | 90.14 | −1.90
w/o TSC | 91.07 | −0.97
w/o MLM | 91.63 | −0.41
TOC instead of TSC | 91.60 | −0.44
TOC+MLM | 91.86 | −0.18
SOC instead of TSC | 91.32 | −0.72
SOC+MLM | 91.48 | −0.56
Table 6. Illustration of pretraining input text variants with Semantic Transformation (w. ST) and structure permutation for TOC, SOC, and TSC objectives. “w. ST” is short for “with Semantic Transformation”. E & R denote entities and relations inside the query.
Type | Example
Query | SELECT (COUNT (?vr0) AS ?vr1){wd:Q18813 wdt:P1574 ?vr0.}
E & R | New Testament wd:Q18813; exemplar of wdt:P1574
w. ST | SELECT (COUNT (?vr0) AS ?vr1){wd:New Testament wdt:exemplar of ?vr0.}
TOC | SELECT (COUNT (?vr0) AS ?vr1){?vr0 wdt:exemplar of wd:New Testament.}
SOC | (AS ?vr1 COUNT (?vr0)){wd:New Testament SELECT wdt:exemplar of ?vr0.}
TSC | SELECT (COUNT (?vr0) AS ?vr1){wd:New Testament ?vr0 wdt:exemplar of.}
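The variants in Table 6 can be produced mechanically from a query and its entity/relation labels: semantic transformation (ST) substitutes Wikidata IDs with their labels, TOC swaps subject and object inside the triple, SOC permutes tokens across the whole query, and TSC permutes the three elements of the triple. The following sketch illustrates this; the whitespace tokenization, underscored labels, and brace handling are simplified assumptions for illustration.

```python
# Sketch of the Table 6 input variants. Tokenization, underscored labels,
# and the way the triple is located are simplifying assumptions.
import random

LABELS = {"wd:Q18813": "wd:New_Testament", "wdt:P1574": "wdt:exemplar_of"}

def semantic_transform(tokens):
    """Replace entity/relation IDs with their textual labels (ST)."""
    return [LABELS.get(t, t) for t in tokens]

def locate_triple(tokens):
    i = tokens.index("{") + 1          # (subject, predicate, object) slice
    return i, i + 3

def toc(tokens):
    """Swap only the subject and object of the triple."""
    t = list(tokens); i, j = locate_triple(t)
    t[i], t[j - 1] = t[j - 1], t[i]
    return t

def soc(tokens):
    """Permute tokens across the entire SPARQL query."""
    t = list(tokens); random.shuffle(t)
    return t

def tsc(tokens):
    """Permute the three elements (subject, predicate, object) of the triple."""
    t = list(tokens); i, j = locate_triple(t)
    triple = t[i:j]; random.shuffle(triple); t[i:j] = triple
    return t

query = "SELECT ( COUNT ( ?vr0 ) AS ?vr1 ) { wd:Q18813 wdt:P1574 ?vr0 . }".split()
print(" ".join(tsc(semantic_transform(query))))
```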
Table 7. The performance on questions of varying difficulty levels (across different hops).
Dataset | Model | Hop 1 | Hop 2 | Hop 3 | Hop ≥4
LC-QuAD 2.0 | # examples | 1548 | 2346 | 1472 | 680
LC-QuAD 2.0 | T5-small | 85.4 | 91.51 | 92.79 | 92.5
LC-QuAD 2.0 | T5-base | 87.01 | 92.07 | 93.47 | 94.26
LC-QuAD 2.0 | TSET-small | 87.72 (+2.32) | 93.90 (+2.39) | 92.93 (+0.14) | 93.52 (+1.02)
LC-QuAD 2.0 | TSET-base | 89.92 (+2.91) | 94.37 (+2.30) | 94.29 (+0.82) | 93.82 (−0.44)
QALD-9 plus | # examples | 67 | 32 | 24 | 13
QALD-9 plus | T5-small | 73.13 | 37.5 | 37.5 | 7.69
QALD-9 plus | T5-base | 76.11 | 40.62 | 41.66 | 7.69
QALD-9 plus | TSET-small | 85.07 (+11.94) | 43.75 (+6.25) | 37.50 | 0 (−7.69)
QALD-9 plus | TSET-base | 85.07 (+8.96) | 50.00 (+9.38) | 46.87 (+5.21) | 0 (−7.69)
QALD-10 | # examples | 182 | 93 | 45 | 72
QALD-10 | T5-small | 58.24 | 13.97 | 4.44 | 0
QALD-10 | T5-base | 64.83 | 18.27 | 4.44 | 0
QALD-10 | TSET-small | 68.13 (+9.89) | 22.58 (+8.61) | 4.44 | 0
QALD-10 | TSET-base | 70.32 (+5.49) | 32.25 (+13.98) | 4.44 | 0
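The hop buckets in Table 7 group questions by the number of triple patterns in the gold SPARQL query. A rough way to derive the bucket, assuming one "."-terminated pattern per triple inside the WHERE clause (a simplification that ignores FILTER clauses and property paths), is sketched below.

```python
# Rough hop counter for bucketing questions as in Table 7; the parsing
# assumptions are simplifications made for illustration.
import re

def count_hops(sparql: str) -> int:
    body = re.search(r"\{(.*)\}", sparql, flags=re.S)
    if not body:
        return 0
    return len([p for p in body.group(1).split(".") if p.strip()])

print(count_hops(
    "SELECT DISTINCT ?x WHERE {?x wdt:P2575 wd:Q83500. ?x wdt:P31 wd:Q12021385}"
))  # -> 2
```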
Table 8. Example predictions on the three test datasets. In each case, the T5 and TSET queries differ only in the order of the subject and object within a triple; our model gives correct predictions in all of these cases, while the original T5-small model fails.
Question #1 (LC-QuAD 2.0) | How Many Exemplars of the New Testament Are There?
Entities & Relations | New Testament wd:Q18813; exemplar of wdt:P1574
T5 | SELECT (COUNT (?vr0) AS ?vr1) {wd:Q18813 wdt:P1574 ?vr0.}
TSET | SELECT (COUNT (?vr0) AS ?vr1) {?vr0 wdt:P1574 wd:Q18813.}
Question #2 (LC-QuAD 2.0) | What is the IQ test for insights measurements?
Entities & Relations | intelligence wd:Q83500; IQ test wd:Q12021385; instance of wdt:P31; measures wdt:P2575
T5 | SELECT DISTINCT ?x WHERE {wd:Q83500 wdt:P2575 ?x. ?x wdt:P31 wd:Q12021385}
TSET | SELECT DISTINCT ?x WHERE {?x wdt:P2575 wd:Q83500. ?x wdt:P31 wd:Q12021385}
Question #3 (QALD-9 plus) | Who wrote Harry Potter?
Entities & Relations | Harry Potter wd:Q8337; author wdt:P50
T5 | SELECT DISTINCT ?x WHERE {?x wdt:P50 wd:Q8337.}
TSET | SELECT DISTINCT ?x WHERE {wd:Q8337 wdt:P50 ?x.}
Question #4 (QALD-9 plus) | How many awards has Bertrand Russell?
Entities & Relations | Bertrand Russell wd:Q33760; award received wdt:P166
T5 | SELECT (COUNT (DISTINCT ?var0) AS ?var1) WHERE {?var0 wdt:P166 wd:Q33760.}
TSET | SELECT (COUNT (DISTINCT ?var0) AS ?var1) WHERE {wd:Q33760 wdt:P166 ?var0.}
Question #5 (QALD-10) | Is Isfahan a big city?
Entities & Relations | Isfahan wd:Q42053; instance of wdt:P31; big city wd:Q1549591
T5 | ASK {wd:Q1549591 wdt:P31 wd:Q42053.}
TSET | ASK {wd:Q42053 wdt:P31 wd:Q1549591.}
Question #6 (QALD-10) | When did human first start to bouldering?
Entities & Relations | bouldering wd:Q852989; date of establishment wdt:P571
T5 | SELECT DISTINCT ?x WHERE {?x wdt:P571 wd:Q852989.}
TSET | SELECT DISTINCT ?x WHERE {wd:Q852989 wdt:P571 ?x.}
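The failure cases in Table 8 all share the same symptom: the subject and object of a triple are swapped. A prediction can be flagged as a "triple flip" by parsing both queries into (subject, predicate, object) tuples and checking whether a differing predicted triple matches a gold triple with subject and object exchanged, as in the sketch below; the triple extraction is a simplified assumption that ignores FILTER/OPTIONAL blocks and literals containing periods.

```python
# Sketch of triple-flip detection between a predicted and a gold query.
import re

def triples(sparql: str):
    """Extract (subject, predicate, object) tuples from '.'-separated patterns."""
    body = re.search(r"\{(.*)\}", sparql, flags=re.S)
    if not body:
        return set()
    return {
        tuple(parts)
        for parts in (p.split() for p in body.group(1).split("."))
        if len(parts) == 3
    }

def has_triple_flip(pred: str, gold: str) -> bool:
    """True if a subject/object swap explains (part of) the mismatch."""
    p, g = triples(pred), triples(gold)
    return any((o, r, s) in g for (s, r, o) in p - g)

pred = "SELECT DISTINCT ?x WHERE {?x wdt:P50 wd:Q8337.}"
gold = "SELECT DISTINCT ?x WHERE {wd:Q8337 wdt:P50 ?x.}"
print(has_triple_flip(pred, gold))  # -> True
```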
Table 9. Error analysis on the test sets of the three datasets, showing the number of instances of each error type. Values in parentheses are the relative reduction (%) of these errors compared with T5.
Dataset | Size | Approach | Total | Sentence | Triple | Triple Flip
LC-QuAD 2.0 | Small | T5 | 619 | 99 | 582 | 348
LC-QuAD 2.0 | Small | TSET | 511 (−17.4) | 92 (−7.1) | 481 (−17.3) | 254 (−27.0)
LC-QuAD 2.0 | Base | T5 | 560 | 95 | 522 | 313
LC-QuAD 2.0 | Base | TSET | 449 (−19.8) | 78 (−17.9) | 414 (−20.7) | 223 (−28.7)
QALD-9 plus | Small | T5 | 65 | 54 | 58 | 8
QALD-9 plus | Small | TSET | 56 (−13.8) | 48 (−11.1) | 51 (−12.1) | 7 (−12.5)
QALD-9 plus | Base | T5 | 61 | 48 | 55 | 8
QALD-9 plus | Base | TSET | 54 (−11.5) | 47 (−2.1) | 49 (−10.9) | 7 (−12.5)
QALD-10 | Small | T5 | 271 | 233 | 247 | 24
QALD-10 | Small | TSET | 245 (−9.6) | 220 (−5.6) | 227 (−8.1) | 14 (−41.6)
QALD-10 | Base | T5 | 255 | 225 | 233 | 21
QALD-10 | Base | TSET | 232 (−9.0) | 197 (−12.4) | 215 (−7.7) | 13 (−38.0)
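One plausible reading of the error categories in Table 9 (not the authors' exact tooling) is a set of overlapping checks on each wrong prediction: whether the query skeleton outside the triple patterns differs (sentence), whether the set of triple patterns differs (triple), and whether a subject/object swap accounts for some of the differing triples (triple flip). A rough classifier along these lines, reusing the `triples()` helper from the sketch after Table 8, is shown below.

```python
# Rough, assumption-laden classifier for the error categories in Table 9.
# Reuses triples() from the previous sketch.
import re

def classify_error(pred: str, gold: str) -> set:
    def skeleton(q: str) -> str:
        # Replace the pattern body with a placeholder to compare the rest.
        return re.sub(r"\{.*\}", "{}", " ".join(q.split()))

    tags = set()
    if skeleton(pred) != skeleton(gold):
        tags.add("sentence")
    p, g = triples(pred), triples(gold)
    if p != g:
        tags.add("triple")
        if any((o, r, s) in g for (s, r, o) in p - g):
            tags.add("triple flip")
    return tags
```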