Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models

Li Weigang; Mayara C. MARINHO; Denise L. LI

doi:10.1631/FITEE.2300384

Special issue | Published: 2024-07-11

Li Weigang 1, Mayara C. MARINHO 1, Denise L. LI 2, Vitor Vasconcelos DE OLIVEIRA 1

July 11, 2024 · 2024, Vol. 25, No. 1 · DOI: 10.1631/FITEE.2300384

AI summary

1 Introduction

This chapter discusses the importance of Chinese language models, the challenges they face, and the lack of standardized coding for Chinese characters. It introduces the Six-Writings multimodal processing (SWMP) concept and its application to Chinese language models. It then presents the Six-Writings pictophonetic coding (SWPC) approach, along with its applications, experimental results, and contributions to the theory and technology of Chinese natural language processing (CNLP).

2 Related works

To improve the performance of CNLP, several methods have been introduced to capture the semantic and morphological information of Chinese characters. Frameworks and models for the joint learning of character and word embeddings draw on convolutional neural networks (CNN), recurrent neural networks, and character embedding models. Other research has leveraged images of Chinese characters to improve CNLP tasks, for example by using a convolutional auto-encoder (convAE) to learn character glyph features. There have also been studies on the productive expression of phonetic-semantic relations of Chinese characters, on the calculation of Chinese string similarity, and on dividing Chinese characters into squares based on similarity calculations. This paper distinguishes itself by incorporating a multimodal analysis of Chinese characters and by using a generative radical/component coding approach to enhance the capabilities of CNLP. The approach offers a more comprehensive representation of Chinese characters and aims to address the difficulty of accurately calculating the similarity between them.

3 Variation in similarity calculation and augmentation methods

This chapter introduces the coefficient of variation (Cv), the ratio of the standard deviation to the mean, and its application in NLP, particularly to the computation of similarity between characters or words. It discusses the challenges that arise when CNLP research uses different technical approaches to represent Chinese characters and words, and the resulting variation problems in similarity calculations. The chapter then explores augmentation, specifically for Wubi (WB) and pinyin numerical codes, and how it can address the variation in similarity calculations for Chinese characters. A statistical analysis of four-corner (FC) numbers and WB letter coding is also presented, along with a proposed coding approach that combines FC numbers and WB codes to reduce the occurrence of duplicate codes. Finally, the chapter proposes a normalization and augmentation method for the digital conversion of pinyin to address the problem of a small Cv in similarity calculations.
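As a rough illustration of why a small Cv is a problem, the sketch below computes Cv for two lists of similarity scores. The score values are made up for illustration, and the use of the population standard deviation is an assumption; the summary does not say which estimator the paper uses.

```python
import statistics


def coefficient_of_variation(values):
    """Return Cv = standard deviation / mean for a non-empty list of numbers.

    Assumption: the population standard deviation is used here.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("Cv is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


# Hypothetical similarity scores, not taken from the paper.
clustered_scores = [0.90, 0.91, 0.92, 0.93]  # barely distinguishes pairs
spread_scores = [0.10, 0.40, 0.70, 0.95]     # separates pairs more clearly

print(coefficient_of_variation(clustered_scores))
print(coefficient_of_variation(spread_scores))
```

A coding scheme whose similarity scores cluster tightly yields a small Cv, which is the situation the augmentation methods in this chapter are meant to improve.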

4 Framework of SWMP

This chapter gives an overview of the Six-Writings concept and presents the SWMP framework for Chinese characters (or words), together with a related discussion. The SWMP framework for Chinese language models consists of six parts: (1) pictophonetic, (2) pinyin, (3) property, (4) image, (5) audio/video, and (6) understanding (word embedding). This multimodal framework represents Chinese characters and words in detail in the Six-Writings style, which supports various language processing tasks and a deeper understanding of their structure and attributes. The digital conversion and augmentation of Chinese pinyin are shown in Table 2. In summary, modern Chinese characters are flexible and diverse as a result of historical evolution and other factors. Recognizing them requires synthesizing multimodal information in the Six-Writings style, such as pictophonetic codes, pinyin, and images. The proposed SWMP is presented as a promising approach to this goal in CNLP.
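One way to picture the six parts is as fields of a single record per character or word. The sketch below is only an illustration: the class name, field names, and field types are assumptions, not the paper's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SWMPRecord:
    """Hypothetical container for the six SWMP parts of one character or word."""
    text: str
    pictophonetic_code: Optional[str] = None         # (1) pictophonetic
    pinyin: Optional[str] = None                     # (2) pinyin
    properties: dict = field(default_factory=dict)   # (3) property
    image: Optional[list] = None                     # (4) image, e.g. a 0-1 matrix
    audio_video_uri: Optional[str] = None            # (5) audio/video
    embedding: Optional[list] = None                 # (6) understanding (word embedding)

    def available_modalities(self):
        """List the names of the parts that have been filled in."""
        names = ["pictophonetic_code", "pinyin", "properties",
                 "image", "audio_video_uri", "embedding"]
        return [name for name in names if getattr(self, name)]


record = SWMPRecord(text="妈", pinyin="ma1")
print(record.available_modalities())
```

Keeping every part optional reflects that not all modalities need to be available for every character.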

5 SWPC approach

This chapter introduces the SWPC of Chinese characters and its application to measuring the similarity between characters or words. SWPC consists of a radical code and a phonetic code, and can represent Chinese characters more comprehensively and informatively than previous pictophonetic codes. The chapter gives detailed explanations and examples of the combinations used to form SWPC and discusses how it can improve the accuracy of Chinese character recognition and other NLP tasks. It then illustrates SWPC with different combinations of characters and explains how it avoids duplicate codes and enhances the digital representation of pictophonetic features. It further covers word formation and the coding of Chinese words using SWPC, showing how word-level similarity follows naturally from the character codes. The chapter concludes by discussing the results of similarity calculations between Chinese word pairs and how SWPC compares with other coding methods.
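The idea that word similarity follows from character codes can be sketched as follows. The summary does not give the actual SWPC code format, so the code table below uses placeholder characters and invented code strings; the position-by-position match ratio is likewise an assumed similarity measure, not necessarily the paper's.

```python
PAD = "_"  # padding symbol; assumed not to occur in any code string

# Hypothetical per-character codes. These are NOT real SWPC codes.
TOY_CODES = {"甲": "AB12", "乙": "AB34", "丙": "CD34"}


def word_code(word, table):
    """Concatenate the per-character codes of a word."""
    return "".join(table[ch] for ch in word)


def code_similarity(code_a, code_b):
    """Fraction of positions at which two code strings agree (shorter one padded)."""
    length = max(len(code_a), len(code_b))
    if length == 0:
        return 1.0
    padded_a = code_a.ljust(length, PAD)
    padded_b = code_b.ljust(length, PAD)
    matches = sum(x == y for x, y in zip(padded_a, padded_b))
    return matches / length


sim = code_similarity(word_code("甲乙", TOY_CODES), word_code("甲丙", TOY_CODES))
print(sim)
```

With these toy codes, the two words share the first character and half of the second character's code, so six of eight positions agree.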

6 SWPC for text/image processing

This chapter introduces SWPC for text/image processing, which simplifies multimodal processing of Chinese characters using image and text data. It outlines the main steps of the image/text multimodal processing algorithm, which has complexity O(n^2) and combines SWPC with the 0-1 image matrix of each Chinese character, and explains how Hamming (HM) distance similarity is used to predict the similarity between the image matrices of characters. It also discusses how results are generated and the limitations of an algorithm based on HM distance similarity. Finally, it emphasizes that further research in this direction will be essential for advancing algorithms in pattern recognition, image synthesis, and generation.
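A minimal sketch of a Hamming-style similarity between two 0-1 image matrices is shown below. The tiny 3x3 bitmaps are invented for illustration and are far smaller than real glyph images; normalizing by the number of cells is an assumption about how the distance is turned into a similarity.

```python
def hamming_similarity(a, b):
    """Return 1 - (mismatching cells / total cells) for two same-shape 0-1 matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("matrices must not be empty")
    # Visit every cell once: n * n comparisons for an n x n matrix.
    mismatches = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return 1.0 - mismatches / total


# Hypothetical 3x3 bitmaps standing in for rasterized glyphs.
GLYPH_A = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
GLYPH_B = [[1, 1, 1],
           [0, 1, 0],
           [1, 1, 1]]

print(hamming_similarity(GLYPH_A, GLYPH_B))
```

Comparing one pair of n x n matrices touches every cell once, which is one plausible source of an O(n^2) cost; the summary does not state what n refers to in the paper's algorithm.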

7 SWPC for analogical reasoning

This chapter establishes analogical reasoning models that align with the morphological features of Chinese words. The models are applied to the Chinese analogical reasoning data set (CA8), including its CA8-Mor-10177 and CA8-Sem-7636 parts, and to other Chinese idioms. The chapter presents analogical reasoning methods based on morphological rules and discusses how SWPC is applied to Chinese modification.

The chapter focuses on the analogical modes for CA8-Mor-10177, covering pattern combinations that involve repetitive, prefix, and suffix words and the corresponding generation of new words using SWPC. It also discusses analogical modes from CA8-Sem-7636, in which some question pairs switch from following the semantic rule to following the morphological rule, which requires calculating the similarity between words using SWPC.

Compared with the baseline, the chapter reports that SWPC is effective at generating new words for the CA8-Mor-10177 and CA8-Sem-7636 data subsets, with 100% accuracy in predicting the answers to these problems.
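The morphological patterns mentioned above (repetition, prefix, suffix) can be illustrated with a plain string-based solver. This sketch works directly on characters rather than on SWPC codes, so it is a simplified stand-in for the idea and not the paper's method; the function name and the example words are assumptions chosen for illustration.

```python
# Candidate repetition patterns, tried in order.
TRANSFORMS = {
    "double_each_char": lambda w: "".join(ch * 2 for ch in w),  # A -> AA, AB -> AABB
    "repeat_word": lambda w: w * 2,                             # AB -> ABAB
}


def solve_analogy(a, b, c):
    """Given a : b, infer the morphological rule and apply it to c.

    Returns the predicted word, or None if no known pattern explains a -> b.
    """
    for transform in TRANSFORMS.values():
        if transform(a) == b:
            return transform(c)
    if len(b) > len(a) and b.endswith(a):    # prefix added in front of a
        return b[: len(b) - len(a)] + c
    if len(b) > len(a) and b.startswith(a):  # suffix added after a
        return c + b[len(a):]
    return None


print(solve_analogy("高兴", "高高兴兴", "快乐"))
print(solve_analogy("一", "第一", "二"))
print(solve_analogy("桌", "桌子", "椅"))
```

Because the rule is inferred from surface form alone, the solver has no notion of meaning; semantic question pairs would need a similarity measure such as the one the chapter computes with SWPC.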

8 Fine-tuning of similarity by SWPC

This chapter compares fine-tuning of similarity using the FC number, the WB code, and SWPC, and evaluates the similarity between Chinese word pairs under each method. It provides a comparative analysis of the similarity calculation for the 960 word pairs in the COS960 test data set and describes the relative errors between the benchmark scores and the similarities computed using FC, WB, and SWPC separately. It also highlights the sensitivity of HM similarity to the pictophonetic coding of Chinese characters and its effectiveness in reflecting both similarity and dissimilarity between Chinese word pairs. SWPC's combination of FC and WB coding is emphasized as a more robust and credible approach for fine-tuning the results of CNLP tasks.
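The comparison against benchmark scores can be sketched as a relative-error calculation. The formula below, |computed - benchmark| / |benchmark|, is the usual definition and is assumed here; the summary does not give the paper's exact formula, and the score pairs are invented rather than taken from COS960.

```python
def relative_error(benchmark, computed):
    """Relative error of a computed similarity against a non-zero benchmark score."""
    if benchmark == 0:
        raise ValueError("relative error is undefined for a zero benchmark")
    return abs(computed - benchmark) / abs(benchmark)


def mean_relative_error(pairs):
    """Average relative error over (benchmark, computed) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(relative_error(bench, comp) for bench, comp in pairs) / len(pairs)


# Hypothetical (benchmark, computed) similarity pairs for one coding method.
example_pairs = [(0.8, 0.6), (0.5, 0.5), (0.4, 0.5)]
print(mean_relative_error(example_pairs))
```

Running the same function over the scores produced by each coding method gives one number per method, which is the kind of side-by-side comparison the chapter describes.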

9 Conclusions and future work

This chapter summarizes the proposed SWMP framework for Chinese language models, which integrates multimodal information about Chinese characters to make CNLP more effective. It revisits SWPC, which combines the expression of characters with Chinese grammar and flexible properties and provides a generative and prompting mechanism for multimodal processing of Chinese characters and graphics. The applications and contributions of SWPC are also outlined, including its effectiveness in establishing word pairs and improving the prediction accuracy of word similarity. The chapter concludes by acknowledging the shortcomings of the proposed framework and outlining future work: integrating SWMP into the language model, establishing a Chinese character database, developing suitable machine learning algorithms for coding/image multimodal analysis, and strengthening the coding of semantic features of Chinese words.

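* The above content was generated automatically by AI and is for reference only. This website assumes no commercial or legal liability for any consequences arising from use of the above content.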
