Unravel the past: the large language model "Xunzi" is steeped in the classics and packed with computing power

Source: 融媒体采编平台 (integrated media editorial platform)
Author: 张语迎
Date: 2024-01-22

Thousands of years ago, texts appeared on animal bones, bronzes, bamboo slips, and silk brocades (织锦) before they were written on paper. But now these ancient Chinese texts have a “new container” in the modern age.

几千年前，文字先是写在兽骨、青铜器、竹简和织锦上，然后才被人们写在纸上。但如今，这些古老的中文文本在现代有了“新容器”。

A research team from Nanjing Agricultural University recently rolled out Xunzi, a large language model (LLM), and XunziChat, its chat model, in association with Gulian, a leading publisher of ancient Chinese texts.

近日，南京农业大学的研究团队，与一流的古籍出版公司古联联手，推出大型语言模型“荀子”和“荀子对话模型”。

Wang Dongbo, the leader of the research team, said that the large language model was named after Xunzi because Xunzi was not only a prominent Confucian philosopher during the late Warring States Period (475-221 BC), but also a pioneer in presenting and explaining theories of linguistics in ancient China.

研究团队带头人王东波表示，大型语言模型以荀子的名字命名，是因为荀子不仅是战国（公元前475-221年）晚期著名的儒学思想家，还是提出和解释中国古代语言学理论的先驱者。

When asked why he and his partners built the large language model, Wang explained that “traditional Chinese characters, vertical layout (竖版), and the absence of pausing and punctuation (句读) are all obstacles that readers have to overcome when they read traditional texts.”

当被问及他和他的同伴制作这个大型语言模型的原因时，王东波解释道：“繁体字、竖版、缺少停顿和标点符号（句读）都是读者在阅读繁体文本时需要克服的障碍。”

To create Xunzi the LLM, Wang and his partners first needed to do a lot of research. Since 2013, his team has worked tirelessly to digitize Chinese classics like the Siku Quanshu, or the Complete Library in Four Sections. “The hard work involves a large-scale corpus (语料库) of two billion Chinese characters, which has laid a solid foundation for the large language model,” said Wang.

为了创建大型语言模型“荀子”，王东波和他的同伴们需要先做大量的研究。自2013年以来，他的团队始终致力于将《四库全书》等中国经典书籍数字化。“经过辛勤努力，我们建立了20亿汉字的大型语料库，为建立大型语言模型奠定了坚实的基础，”王东波说。

Their efforts seem to have paid off. Xunzi the LLM can now tag (标记), translate, punctuate, and understand scraps (片段) of ancient Chinese texts. It can even do part-of-speech analysis and retrieve (检索) specific information, such as names, events, and places, from a text.

他们的努力得到了回报。现在，大型语言模型“荀子”可以对中国古代文本的片段进行标记、翻译、加标点和阅读理解。它甚至可以进行词性分析并检索特定信息，例如文本中的名称、事件和地点。

With this LLM, ancient Chinese texts can be accessed by more Chinese people, including students. For instance, if users type “shangu” into the chat box, they will not only discover that it translates to “valley” but also see that it can refer to a person’s courtesy name (字) in certain ancient Chinese texts. Through Xunzi’s retrieval function, users can get more specific cultural information based on courtesy names.

通过这个大型语言模型，包括学生在内的更多中国人，可以接触到中国古籍。例如，如果用户在聊天框中输入“shangu”的拼音，其不仅能识别出“山谷”一词，它还会给用户指出与这个词相关的、古籍中一个中国文人的字号等。通过“荀子”的检索功能，用户可以根据字获取更具体的文化信息。

“The model can help us mine for more information hidden in our cultural legacy and find unnoticed models and connections,” said Wang.

“这个模型可以帮助我们挖掘更多隐藏在文化遗产中的信息，找到未被注意到的样本和关联，”王东波说。

But Wang and his team aren’t simply focused on target users in China. They are aiming at the rest of the world as well. They have shared the LLM on GitHub and other websites, allowing users to download and use it for free. “Our team is committed to the philosophy of making our data and model globally accessible. We hope this will encourage more people to appreciate traditional Chinese culture,” Wang explained.

但王东波和他的团队不仅着眼于中国的目标用户，还将目光投向了世界其他地区。他们在 GitHub 和其他网站上共享了“荀子”，允许用户免费下载和使用。 “我们团队秉持着让我们的数据和模型能在全球范围内被人们使用的理念，希望以此鼓励更多人了解中国传统文化，”王东波解释道。

The above article is taken from issue 831 of the Senior 3 edition of 《21世纪英文报》.
