Peer-Review Record

Integrating Large Language Models (LLMs) and Deep Representations of Emotional Features for the Recognition and Evaluation of Emotions in Spoken English

Appl. Sci. 2024, 14(9), 3543; https://doi.org/10.3390/app14093543

by Liyan Wang¹, Jun Yang², Yongshan Wang², Yong Qi²,*, Shuai Wang³ and Jian Li²

Reviewer 1: Marco Palomino

Reviewer 2: Rainer Rubira-García

Reviewer 3: Anonymous

Submission received: 9 February 2024 / Revised: 5 March 2024 / Accepted: 5 March 2024 / Published: 23 April 2024

(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

It was a pleasure to read your manuscript. I have stated below some suggestions aiming to improve its quality. I hope you find them helpful.

Content

Repetition: Some of your text is repetitive. Some of the sentences are stated more than once, which is wrong and unnecessary. For example, the sentence going from Line 54 to Line 57 is repeated later from Line 113 to Line 115. The first time there is a citation ([4-7]), but the rest of the text is identical. Thus, remove the text from Line 113 to Line 115. If the authors wish to emphasize their point, they can start Line 113 by saying “As stated above, existing research focuses on classification and pays limited attention to the quantitative evaluation of emotions”. There is no need to say anything more than that. Do not repeat the text.

Final paragraph of the Introduction: The final paragraph of the Introduction must comprise an outline of the rest of the paper. While this may seem an old and traditional approach, it is useful to navigate through the text. Hence, authors must add a paragraph at the end of the Introduction indicating what Section 2 is about, what Section 3 will present, and so on.

Subsection 2.1: The heading of Subsection 2.1 must be replaced with the following: “Emotion Analysis Research on Spoken Language”.

Related Work: Generally, the Related Work section is poorly written. The Related Work is not only to show that the authors have read other people’s work. Authors should critique the existing work, state where it is strong or weak, and how they have improved it. It is good to know that Pan and others [11] conducted a review of multimodal emotion recognition (Line 102), but how does this relate to the authors’ work? Have the authors done another review? Have they used Pan’s review as the basis for their work? They need to say this explicitly, rather than just stating facts without providing any insights into the authors’ work. It is great to read that Vu and others [14] combined multimodal technology with scaled data, but how does this relate to the authors’ work? Did Vu and others achieve poor accuracy (thus, the authors followed a different approach)? Did the authors repeat exactly what Vu and others did because they were very accurate? All these must be stated.

Citations: When citing the work of more than one person, the authors should state the last name of the first author followed by et al. For instance, “Voloshina et al. [7]” or “Luna-Jiménez et al. [13]”. I do not like the idea of translating “et al.” from Latin to English (“and others”). I would prefer the authors to replace all the occurrences of “and others” with “et al.”. However, if the Editors agree with this, the authors can keep the phrase “and others”. In any case, there is no need to state the first name of the first author for each reference. “Paranjape and others [8]” is enough. There is no need for “Aditya Paranjape and others”. “Luna-Jiménez et al. [13]” is enough. There is no need for “Cristina Luna-Jiménez et al.”. Fix all the citations to make sure the text only mentions the last name.

Line 135-137: Replace the sentence,

“However, the above studies, due to the use of only real datasets, not only lead to a smaller volume of data but also result in weaker robustness of the models.”

with the following

“However, the studies described above limited their work to the use of datasets containing recordings of actual people (instead of combining these with materials created by an LLM). This led to a smaller volume of data and weaker models.”

Related Work II (Line 146 - Line 167): Please, make sure that the Related Work is not only a list of references to existing work. We do not need that. What the authors must do is critique the existing work, explain (explicitly) how it differs from theirs, or how they have improved it.

Dataset: IEMOCAP is mentioned for the first time in Line 61. I appreciate the authors know IEMOCAP well, and other experts in the field will know this dataset too. However, authors cannot assume that all readers will know IEMOCAP. Hence, authors must explain what sort of dataset IEMOCAP is. Moreover, they must do it before they start talking about the work that they have done with it. Authors must move Subsection 4.1 to an earlier part of the manuscript. I suggest moving Subsection 4.1 to the start of Section 3. It can become Subsection 3.1, which allows the readers to understand the dataset before the method and model are presented. The rest of Section 4 can stay where it is at present.

IEMOCAP citation: Again, Line 61 refers to the IEMOCAP dataset for the first time. Thus, authors must add a citation to a source referring to it in Line 61. At the very least, authors must include the URL where the readers can download the IEMOCAP dataset or find further information about it. This must be done in Line 61, which is where the authors talk about IEMOCAP for the first time.

Excitement vs Happiness: Line 285 states that the “Happiness and Excited categories are merged into a single Happiness category”. Why? I am not sure this is correct. The authors must justify this decision better. “Excited” describes all sorts of excessive emotions (and not exclusively happiness). If you are excited, you might be agitated, nervous, anxious, or worked up about something (not happy). I appreciate authors may have reasons to back up their decision of merging the Happiness and Excited categories, but they must explain these reasons here. This is very important to leave without a proper explanation. If this is what others have done, the authors must state here who has done it and cite such work.
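The category merge discussed above can be illustrated with a minimal sketch. The label strings and the `MERGE_MAP` and `merge_labels` names are hypothetical and chosen only for illustration; the actual annotation codes in IEMOCAP and the authors' preprocessing may differ.

```python
from collections import Counter

# Hypothetical label names; the real dataset annotation codes may differ.
MERGE_MAP = {"Excited": "Happiness"}


def merge_labels(labels):
    """Map each label through MERGE_MAP, leaving unmapped labels unchanged."""
    return [MERGE_MAP.get(label, label) for label in labels]


# Made-up example labels, not taken from the dataset.
labels = ["Happiness", "Excited", "Sadness", "Anger", "Neutral", "Excited"]
merged = merge_labels(labels)
counts = Counter(merged)  # class counts after merging
```

Keeping the merge in an explicit mapping makes the decision easy to document and to cite, which is the point the reviewer raises.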

Comments on the Quality of English Language

English language and style

Generally, the manuscript is easy to follow, but there are several sentences that are poorly written. The manuscript requires several changes to improve its readability. I recommend that the authors should have the text proofread before its publication. Please, consider the following suggestions:

Line 22-24: The abstract includes the following sentence: “(3) a spoken English emotion evaluation network that precisely scores student spoken language emotion by analyzing different audio's emotional characteristics”. This does not read well. I recommend that the authors should use the following wording: “(3) an emotion evaluation network for the spoken English language that identifies, accurately, emotions expressed by Chinese students by analyzing different audio characteristics”. If some students who participated in the study were not Chinese, then remove the word Chinese from my suggested wording. If the authors think I am misinterpreting their ideas, do not use my suggested wording, but amend the current text, because it is not written correctly.

Line 43-46: I recommend that the authors should never produce sentences longer than 30 words. A sentence with more than 30 words is a sentence that needs to be broken into two or more separate sentences. Hence, replace the following sentence in Line 43:

“Therefore, incorporating emotional evaluation into the spoken language evaluation system is of great significance to help improving emotional expression ability of Chinese students and will further enhance the comprehensiveness and accuracy of spoken English evaluation.”

with the following:

“Therefore, incorporating emotional evaluation into the system is of great significance. It helps Chinese students to improve their ability to express emotions in English language and enhances the evaluation system”.

Note that I am breaking the original sentence into two. Also, I do not think you need to refer to comprehensiveness and accuracy. If you state that you are enhancing the system, you are already implying that it is more accurate. In any case, if you do not like my suggestion, do not use it but amend your text.

Acronyms: The acronym LLM must be defined the first time you use the phrase “large language model”. The authors did it correctly in Line 62. However, once they have done it in Line 62, they no longer need to do it again. After Line 62, authors do not need to use the phrase “large language model” again. For example, Line 72 must be replaced with “(1) Combining LLMs with deep representation…”.

Line 77: Replace the sentence “The study ingeniously integrates speech emotional data generated by LLM…” with “The study integrates emotional speech data generated by an LLM…”. The word ingeniously is not appropriate in this context, and the order of the words is “emotional speech” instead of “speech emotional”.

Line 148: Replace “Fidelia A and others[19]” with “Orji et al. [19]” or “Orji and others [19]”. Orji is the last name of the author; thus, this is the name that must be stated.

Line 273: If you define the acronym MLP in Line 273, you do not need to define it again in Line 307. The sentence in Line 307 can be amended as “... are then processed through an MLP to obtain the final…”.

Figure 2: Replace “Audio whit Emotion Result” with “Audio with Emotion Result”.

Line 496: Replace the word “professional” with “practical”.

Line 506-507: Replace “existing English spoken language evaluation software” with “existing spoken English language evaluation software”.

Author Response

Hello, thank you for reading and providing editing suggestions. Below is our response:

Content:

Repetition: We have revised the repetitive content in lines 113-115 as per your suggestions.

Final paragraph of the Introduction: An outline for the remaining sections of the paper has been added.

Subsection 2.1: The title of subsection 2.1 has been replaced with "Emotion Analysis Research on Spoken Language."

Related Work: Thank you for your suggestions. We have modified the related work section to enhance our evaluation of relevant studies and highlight our points of differentiation.

Citations: "And others" has been changed to "et al.," and citations now include only the surname of the first author.

Lines 135-137: Replaced with "However, the studies described above limited their work to the use of datasets containing recordings of actual people (instead of combining these with materials created by an LLM). This led to a smaller volume of data and weaker models."

Related Work II (Lines 146 - 167): Evaluative comments have been added to the cited content.

Dataset: Section 4.1 has been moved to 3.1, and we have improved the introduction to the IEMOCAP dataset.

IEMOCAP citation: Our citation of IEMOCAP has been enhanced.

Excitement vs Happiness: Regarding the merging of these two emotion categories, citation [26] above selects four emotion classes for the task. Due to an oversight, the source for the merging was not included. Sahu G. (Multimodal speech emotion recognition and ambiguity resolution, arXiv preprint arXiv:1904.06022, 2019) merges these two categories. We have now added this reference to the document.

English language and style:

Lines 22-24: Your understanding is completely correct. Modified to "(3) an emotion evaluation network for the spoken English language that identifies, accurately, emotions expressed by Chinese students by analyzing different audio characteristics"

Lines 43-46: Modified to "Therefore, incorporating emotional evaluation into the system is of great significance. It helps Chinese students to improve their ability to express emotions in English language and enhances the evaluation system"

Acronyms: Clarifications for acronyms have been updated, and "LLM" is used directly afterward.

Line 77: Adjustments have been made as suggested.

Line 148: "Fidelia A and others[19]" has been replaced with "Orji et al. [19]."

Line 273: Modified to "... are then processed through an MLP to obtain the final…"

Figure 2: "Whit" has been corrected to "with."

Line 496: "Professional" has been changed to "practical."

Lines 506-507: "Existing English spoken language evaluation software" has been modified to "existing spoken English language evaluation software."

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This proposal deals with the integration of verbal and non-verbal communication analysis applied to foreign language learning. The topic is appealing and well presented, formally speaking. Moreover, the integration of two research dimensions, Large Language Models (LLMs) and affective space learning, gives a very interesting epistemological framework to this research although, for this very reason, its application to the case study may need further explanation in the manuscript. As the theoretical model integrates two distinct dimensions, we need a deeper reflection on why these specific models needed to be brought together to study Chinese students (defined in too general a way by the authors) and their ability to express emotions in spoken English, a category that goes beyond language into culture. The method should better explain the case study and the characteristics of the Chinese context, culturally and socially considered, as well as the internal bias that may arise from choosing subjects from a country where English language penetration varies considerably from region to region. We lack here a more qualitative approach, which is vital when analyzing emotions. It is unclear whether the study has considered differences within a country as big and complex as China, from Hong Kong (with its singular situation regarding the topic of analysis) to Tibet. Results also need more integration into the state of the art and the contributions, to clarify the reality of China's education system and the way it deals with English as a foreign language, with its own peculiarities in a country led by a communist regime. Tables and figures are extensive and need some editing, as some graphics are too small to be clearly understood.

Author Response

Thank you for your feedback and suggestions. Our research team took special consideration in collecting English oral speech audio data from Chinese students, specifically addressing the variations in accents across different regions of China. From this, we identified a common issue: introverted personality traits and a lack of emotional expression in English. This discovery became one of the focal points of our research.

To address this issue, we will focus on the practical application of sentiment analysis based on textual methods in Chinese students' oral English learning. We have developed a sentiment evaluation model and applied it to the practical learning of Chinese students' oral English. Through this approach, we aim to assist Chinese students in better expressing emotions, thereby enhancing their oral English proficiency.

Furthermore, we have adjusted the size and clarity of our charts and graphs to improve their comprehensibility, making it easier for readers to understand our research findings.

Overall, our research aims to explore new solutions to help alleviate the issue of emotional expression deficiency among Chinese students in English oral learning. We believe this study will provide valuable insights and guidance for Chinese students' English learning and have a positive impact on educational practices.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Comments for author File: Comments.docx

Author Response

Thank you for your review and valuable feedback on our article. Below, we address each of the points you raised:

Regarding the specific domain of spoken English assessment: This study focuses on the assessment of English oral emotion in telephony scenarios, marking the first implementation of automated evaluation in this area. This allows for practical application in the context of Chinese students' English oral practice, thereby aiding them in improving their emotional expression skills. We have supplemented the specific domain of spoken English assessment within our current framework to aid readers in better understanding the background and scope of our research.

Regarding the importance of emotional expression: We have expanded the content in the Introduction to discuss the significance of emotional expression in spoken English assessment, using examples of conversations between Asians and Westerners to illustrate how emotional expression impacts communication.

Regarding the validation of synthesized emotional voice quality: We utilized the Typecast TTS model to synthesize 2892 speech samples with corresponding emotions. During synthesis, we ensured that each synthesized speech sample's emotional category was derived from the highest scoring emotion label in the original speech. Sample voices can be downloaded from the provided link (https://github.com/leonialla/tts-audio-samples), and we have employed the MOS evaluation method to assess the quality of the generated speech, which yielded favorable results.
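As a rough illustration of the two steps mentioned in this response, the sketch below picks the highest scoring emotion label for a sample and averages listener ratings into a mean opinion score (MOS). The function names, the score dictionary, and the ratings are hypothetical and made up for the example; the 1 to 5 rating range is the conventional MOS scale and is assumed here, since the response does not state the scale or the authors' actual procedure.

```python
from statistics import mean


def top_emotion(scores):
    """Return the emotion label with the highest score."""
    return max(scores, key=scores.get)


def mean_opinion_score(ratings):
    """Average listener ratings, assuming the conventional 1-5 MOS scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie in [1, 5]")
    return mean(ratings)


# Made-up per-emotion scores for one original recording.
original_scores = {"Happiness": 0.2, "Sadness": 0.7, "Anger": 0.1}
synthesis_label = top_emotion(original_scores)  # label used for synthesis

# Made-up listener ratings for one synthesized sample.
mos = mean_opinion_score([4, 5, 4, 3, 4])
```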

Regarding the explanation of the Typecast pre-trained model and emotional classification: We have provided a brief explanation of the Typecast pre-trained model in section 3.2. However, as this model is not our own work, its specific details are referenced from [26] without further elaboration. The explanation of emotional classification in our article is included in section 3.3, referencing works [27][28].

Regarding the 'Mixed Dataset' in Figure 1: Following your suggestion, we have updated Figure 1 to clearly indicate that student-recorded voices include emotional annotations and average speech scores. We also revised the title of Step 1 to emphasize that it pertains to dataset construction.

Regarding the specific description of the process in Step 2: We have explicitly outlined the methods for extracting spatial and emotional features from audio in the text. The simplification of emotions into four categories is based on works [27][28], which we have now incorporated into the article.

Regarding the third step, we have rewritten the text to provide a clearer description of the flow and relationship between the coarse-tuning and fine-tuning stages. Additionally, the reason for setting lambda values to 0.1 and 1 during the fine-tuning period is to bias the model more towards teacher ratings while not neglecting the objective factors from the coarse-tuning stage. We have also added this content in the manuscript.
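One way to read the lambda weighting described above is as a weighted sum of two loss terms. The sketch below assumes that the weight 0.1 applies to the coarse-tuning objective and the weight 1 to the teacher-rating term; the response does not say which value belongs to which term, and the function and parameter names are hypothetical.

```python
def fine_tune_loss(coarse_loss, teacher_loss,
                   lambda_coarse=0.1, lambda_teacher=1.0):
    """Weighted sum of the two objectives.

    Assumption: 0.1 weights the coarse-tuning term and 1.0 weights the
    teacher-rating term, so teacher ratings dominate the total.
    """
    return lambda_coarse * coarse_loss + lambda_teacher * teacher_loss


# Made-up loss values for illustration.
total = fine_tune_loss(coarse_loss=2.0, teacher_loss=0.5)
```

Under this assumed assignment, a unit change in the teacher-rating loss moves the total ten times as much as a unit change in the coarse-tuning loss, which matches the stated intent of biasing the model toward teacher ratings without discarding the coarse-tuning signal.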

Regarding the explanation of Figure 3: We have added specific explanations related to Figure 3. For the majority of students, their ability to express various emotions is consistent, meaning that, as depicted in the graph, evaluation scores for six types of emotions are distributed around the same value for the same student. Only a few students exhibit excellent performance in expressing a specific emotion, as evidenced by significantly higher evaluation scores for one particular type of emotion compared to others. This graph serves to more effectively illustrate the issue of Chinese students' oral emotional expression in English learning.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for considering my comments and suggestions. Please, read below a few minor changes.

Content

Figure 1: What happened to Figure 1? It is corrupted. I hope it got corrupted while uploading the new version and you can still recover it; otherwise, you need to fix it. It is no longer readable.

Excitement vs Happiness (Line 346): Thank you for adding a citation to Sahu [28]. Could you please elaborate a little on this in the paper? It would be useful to make it very clear that you followed Sahu’s approach. For example, amend Line 346 to say: “... and the Happiness and Excitement category are merged into a single Happiness category, as suggested by Sahu [28]”. You do not need to say any more than that.

Comments on the Quality of English Language

English language and style

Thank you for following my suggestions. I have some minor comments regarding the references to your figures. Please, read below.

Line 264: Replace “the Figure” with “Figure 2”. The text in Line 264 must be “From Figure 2, it can be observed that for the majority of students, their ability...”.

Line 430: Please, remove the dot (“.”) between Figure and 1 in Line 430. It should be “Figure 1” (note the space between Figure and 1).

Line 463: Replace “The figure illustrates...” with “Figure 5 illustrates...”.

Line 475: Please, remove the dot (“.”) between Figure and 1 in Line 475. Also replace the current text in Line 475 with the following: “As shown in Step 3 (Training the Emotion Evaluation Network) in Figure 1, a quantitative analysis of emotions is conducted...”

Author Response

Hi, thanks for reading and providing editing suggestions. Here is our response below:

Content

Figure 1: We apologize for this. Figure 1 may have become corrupted during the upload process, and we will re-upload a readable version.

Excitement vs Happiness (Line 346): Okay, we've added the relevant statement to the article as you suggested.

English language and style

Line 264: We have replaced “the Figure” with “Figure 2”.

Line 430: We have removed the dot (“.”) between Figure and 1 in Line 430.

Line 463: We have replaced “The figure illustrates...” with “Figure 5 illustrates...”.

Line 475: We have removed the dot (“.”) between Figure and 1 in Line 475 and replaced text in Line 475 with “As shown in Step 3 (Training the Emotion Evaluation Network) in Figure 1, a quantitative analysis of emotions is conducted...” following your suggestions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript still needs to undergo significant changes, as requested previously. Furthermore, the proposed improvements are even more problematic in some parts, such as the unsupported generalization added in lines 43 to 48, which includes several stereotypical arguments without any scientific references. The case study, the different types of emotions and their connection to performative language, the characteristics of the Chinese context as a communist country, culturally and socially considered, the internal bias that may arise from choosing subjects from a country where English language penetration varies considerably from region to region, etc., are dimensions that should be covered in the analysis.

Author Response

In response to the concerns raised, we acknowledge the need for significant revisions and appreciate the constructive feedback. The manuscript's sections from lines 43 to 48, which were criticized for unsupported generalizations and stereotypical arguments, will be meticulously revised to include scientific references supporting our claims. We recognize the importance of addressing the diverse dimensions mentioned, such as the analysis of case studies, the interplay between different types of emotions and performative language, and the unique characteristics of the Chinese context, considering both its national background and its cultural and social aspects. We will also include a discussion on the potential internal bias arising from the variable English language penetration across different regions in China, acknowledging that this variability may influence our study's outcomes.

Our methodology, previously benchmarked against the emotional expressions of native English-speaking actors as a gold standard, will be reassessed to ensure it respects the nuances of non-native speakers’ emotional expressions. The purpose is to design AI tools to assist English learning and teaching. Improvements will be made to our evaluation criteria and methods to enhance the tool's effectiveness in future work.

Regarding English language learning in China, we aim to provide a more nuanced view. While it is true that English education is highly emphasized across the country, reflected in the educational system and the value placed on English proficiency for academic and professional advancement, we recognize the disparities in English language learning opportunities and outcomes. These disparities stem from regional differences, which we will explore further in our analysis. Our goal is to contribute to improving Chinese students' oral English capabilities, fostering better international communication and collaboration. We are committed to continuing our research in this area, addressing both the challenges and opportunities presented by the current emphasis on English language learning in China.
