Unravel the past: The large language model Xunzi is well versed in the classics and packed with computing power



Thousands of years ago, texts appeared on animal bones, bronzes, bamboo slips, and silk brocades (织锦) before they were written on paper. But now these ancient Chinese texts have a “new container” in the modern age.


A research team from Nanjing Agricultural University recently rolled out Xunzi, a large language model (LLM), along with XunziChat, in association with Gulian, a leading publisher of ancient Chinese texts.


Wang Dongbo, the leader of the research team, said that the large language model was named after Xunzi because Xunzi was not only a prominent Confucian philosopher during the late Warring States Period (475-221 BC), but also a pioneer in presenting and explaining theories of linguistics in ancient China. 


When asked why he and his partners made the large language model, Wang explained that “traditional Chinese characters, vertical layout (竖版), and the absence of pausing and punctuation (句读) are all obstacles that readers have to overcome when they read traditional texts”.


To create Xunzi the LLM, Wang and his partners first needed to do a lot of research. Since 2013, his team has worked tirelessly to digitize Chinese classics like the Siku Quanshu, or the Complete Library in Four Sections. “The hard work involves a large-scale corpus (语料库) of two billion Chinese characters, which has laid a solid foundation for the large language model,” said Wang.


Their efforts seem to have paid off. Now Xunzi the LLM can tag (标记), translate, punctuate, and understand scraps (片段) of ancient Chinese texts. It can even do part-of-speech analysis and retrieve (检索) specific information, such as names, events, and places, from a text.


With this LLM, ancient Chinese texts can be accessed by more Chinese people, including students. For instance, if users type “shangu” into the chat box, they will not only discover that it translates to “valley” but also see that it can refer to a person’s courtesy name (字) in certain ancient Chinese texts. Through Xunzi’s retrieval function, users can get more specific cultural information based on courtesy names.


“The model can help us mine for more information hidden in our cultural legacy and find unnoticed models and connections,” said Wang. 


But Wang and his team aren’t simply focused on target users in China. They are aiming at the rest of the world as well. They have shared the LLM on GitHub and other websites, allowing users to download and use it for free. “Our team is committed to the philosophy of making our data and model globally accessible. We hope this will encourage more people to appreciate traditional Chinese culture,” Wang explained.



