Yang Zongkai's Team at Central China Normal University: A Multimodal Model for Mathematical Problem Generation in the Education Domain

Source: SCIENCE CHINA, 2024-12-12 15:33:09 | Column: Expert Views

Generative artificial intelligence is driving educational transformation in unprecedented ways. Although many researchers have been exploring the possibility of combining large multimodal models (LMMs) with teaching, in-depth research on generating educational resources, especially on mathematical problem generation, remains insufficient. As an important carrier of knowledge and means of assessment, mathematical problems make automatic generation technology highly valuable in the teaching process. However, the traditional research path that separates "problem solving" from "problem setting", together with the mainstream LMM fine-tuning paradigm of a single data structure and a unified training objective, has limited the application of mathematical problem generation in real teaching scenarios.


Research team

Liu Sanya, Feng Jintian, Yang Zongkai, Wan Qian, Shen Xiaoxuan, Sun Jianwen: Central China Normal University, National Engineering Research Center of Educational Big Data Application Technology

Luo Yawei: Zhejiang University, School of Software Engineering

Key takeaways


This work

This paper first unifies stem generation and problem solving into a single mathematical problem generation framework, from the perspective of mutually reinforcing abilities and application logic. Second, it proposes a three-stage fine-tuning framework based on the "cone of experience": by introducing the idea of human-like learning, it explores and validates the effectiveness of treating the LMM training process as an analogy to human learning experience. By analyzing the growth path of professional teachers, the fine-tuning data are divided according to three levels of experience — abstract, observational, and practical — and on this basis a variety of fine-grained data construction and injection methods are designed. Finally, this work builds CMM12K, a Chinese multimodal mathematical problem dataset, providing an evaluation benchmark for subsequent multimodal mathematical problem generation.
 

Contributions

(1) From a data-centric AI perspective, we build COMET, a "cone of experience" enhanced large multimodal model for mathematical problem generation. To the best of our knowledge, this is the first work to systematically enhance both problem-solving and problem-setting abilities within a single LMM.

(2) We propose a three-stage fine-tuning framework based on the "cone of experience", together with data-flow generation and knowledge-injection algorithms for abstract, observational, and practical experience.

(3) We construct CMM12K, a Chinese multimodal mathematical problem dataset covering 4 problem types with 12,000 samples in total. This work alleviates the scarcity of Chinese multimodal mathematical problem corpora and provides an evaluation benchmark for subsequent research.

(4) Extensive experiments on multiple datasets verify the advancement and effectiveness of the proposed framework and model.

Experimental results

COMET is compared against 9 baselines (3 closed-source and 6 open-source models) on the GSM8K, TAL-SCQ5K, and CMM12K datasets. As shown in the figure below, across a total of 15 scoring dimensions over three tasks, COMET matches GPT-4o on several dimensions. Among open-source models, COMET achieves SOTA on 12 of the 15 evaluation dimensions. Notably, COMET keeps a clear lead on most dimensions despite its relatively small parameter scale (7B).


Original text

The impact of generative artificial intelligence on education is unprecedented [1]. Researchers have been exploring the possibilities of combining large multimodal models (LMMs) with the teaching process. Specifically, Luo and Yang [2] have explored collaboration between large models and domain models to support smart education, fostering personalized and adaptive educational experiences. However, existing studies still lack in-depth research on generating educational resources, especially on mathematical problem generation.

Traditional studies of mathematical problem generation fall into two independent subfields, namely stem generation [3] (which some studies simply call problem generation) and problem-solving. However, we believe that constructing high-quality mathematical problems requires the ability to generate both stems and solutions, forming a closed task loop. As shown in Figure 1, a high-quality mathematical problem needs to be carefully designed by domain experts and meet multiple requirements. (I) Completeness. During the teaching process, mathematical problems are aimed at teachers, students, and parents concurrently. Therefore, a problem should contain four logically clear parts: the design rationale, the stem, the solution rationale, and the answer. (II) Precision. The mathematical problem should accurately reflect curriculum objectives, be highly related to the given knowledge points, and serve the function of exercises and tests. (III) Differentiation. For the key knowledge points under examination, problems should be differentiated in theme, problem type, and difficulty level, to better serve complex and diverse learning needs.

LMMs offer a novel approach to mathematical problem generation. They can not only generate coherent and logical content from cross-modal data but also respond to diverse queries through in-context learning and instruction-following capabilities. However, challenges remain in directly applying LMMs to generate math problems. On the one hand, general LMMs lack the expertise for mathematical problem generation and require transfer training to inject domain knowledge. As training modes (such as pre-training and supervised fine-tuning) gradually solidify, the research focus of transfer training has shifted toward constructing high-quality domain datasets [4]. Previous construction methods are restricted by machine-style thinking and the form of the target task, resulting in low data quality density and serious homogenization. We believe that LMM training can be analogized to human learning: drawing on the multi-level experiences of human learning can guide the design of training data at each stage with fine granularity, enabling the model to acquire richer knowledge. On the other hand, previous studies mainly focus on enhancing the individual ability of LMMs in either stem generation or problem-solving. We believe that the professional knowledge and practical experience required for stem generation and problem-solving share commonalities. Integrating both abilities into a single model can make them mutually reinforcing and is more practical for educational scenarios.

To address the above issues, we propose COMET, a “cone of experience” enhanced LMM for mathematical problem generation. Firstly, stem generation and problem-solving are unified into the mathematical problem generation task. To the best of our knowledge, this is the first work to systematically enhance mathematical problem generation on a single LMM. Secondly, inspired by the “cone of experience” educational theory [5], we propose a three-stage fine-tuning framework. The “cone of experience” divides human learning experience into three levels: symbolic, iconic, and direct experience. The different levels of experience are interconnected, and only by fully integrating all three can high-quality learning be achieved. Finally, a Chinese multimodal mathematical problem dataset (CMM12K) is constructed, filling a gap in the field of Chinese multimodal corpora and providing a high-quality benchmark for subsequent research.

Figure 1 (Color online) Diagrams of (a) mathematical problem generation and (b) the “cone of experience” guided three-stage fine-tuning framework.

 

Methodology. Figure 1 shows the three-stage fine-tuning framework; more details can be found in Appendix A. The entire fine-tuning process is guided by the “cone of experience”, injecting symbolic, iconic, and direct experience. The framework is laid out according to the type of injected experience, elaborating on the definitions, construction methods, and training methods of each stage.

In stage 1, we define symbolic experience as the background knowledge of the target domain, or the prerequisite knowledge for carrying out the target task. Symbolic experience does not directly help the model solve specific tasks, but it provides strong support by supplementing conceptual knowledge. For mathematical problem generation, we summarize symbolic experience into four types for production: book knowledge, graph knowledge, arithmetic knowledge, and general knowledge. All data associated with symbolic experience are organized in pre-training form and infused into the LMM for learning, i.e., no masking of the data content is applied. Backpropagation computes the loss from the first token of the input.
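The unmasked stage-1 objective can be sketched as follows. This is our illustration in PyTorch, not the authors' released code: because no masking is applied, the labels are simply the input ids shifted by one position, so every token of the symbolic-experience text contributes to the loss.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (assumed, not the paper's code) of a next-token
# pre-training loss with no masking: the loss covers the whole sequence.
def pretraining_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size); input_ids: (seq_len,)."""
    shift_logits = logits[:-1, :]  # position t predicts token t+1
    shift_labels = input_ids[1:]   # every token is a training target
    return F.cross_entropy(shift_logits, shift_labels)

# Toy usage with random "model outputs" over a 10-token sequence.
vocab_size, seq_len = 32, 10
logits = torch.randn(seq_len, vocab_size)
ids = torch.randint(0, vocab_size, (seq_len,))
loss = pretraining_loss(logits, ids)
```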

In stage 2, iconic experience is defined as the data generated by a subject while performing the target task; subjects include not only human experts proficient in the task but also other LMMs. Injecting iconic experience allows the LMM to learn mathematical problem generation from humans and to improve upon the failed reasoning data produced by other LMMs. We summarize iconic experience into three types of production: the experience of stem generation, of problem-solving, and of failure. These data are learned by the LMM in the form of instruction tuning. All data are arranged as query-response pairs, and masking is applied to the query part. Backpropagation only computes the loss starting from the first token of the response.
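The query masking described above can be sketched like this (a hypothetical illustration, not the paper's code). In PyTorch, `-100` is the default `ignore_index` for cross-entropy, so masked query tokens are skipped by the loss and only response tokens are trained on:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy

# Hypothetical sketch of stage-2 label masking: query tokens are replaced
# by IGNORE_INDEX, so the loss only starts from the response.
def build_instruction_labels(input_ids: torch.Tensor, query_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:query_len] = IGNORE_INDEX  # mask the query part
    return labels

# Toy example: a 10-token sequence whose first 6 tokens are the query.
ids = torch.arange(10)
labels = build_instruction_labels(ids, query_len=6)

# Cross-entropy skips masked positions automatically (next-token shift).
logits = torch.randn(9, 16)  # 9 predictions over a 16-token toy vocabulary
loss = F.cross_entropy(logits, labels[1:], ignore_index=IGNORE_INDEX)
```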

In stage 3, direct experience is defined as the procedural data generated when the fine-tuned model carries out the target task and receives feedback on the results. Such experience aims to correct the inference preferences of the LMM with higher-order domain values, allowing it to evolve in an embodied way through practice. The training format is {task instruction, high-preference response, low-preference response}, and we apply direct preference optimization as the loss function.
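As a hedged sketch, the direct preference optimization loss on one {task instruction, high-preference response, low-preference response} triple can be written as below; variable names and the beta value are our illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Sketch of the DPO loss. Each argument is the summed log-probability of a
# full response under the policy or the frozen reference model.
def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # The loss shrinks as the policy prefers the high-preference response
    # more strongly than the reference model does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin))

# Toy numbers: the policy already favors the high-preference response.
loss = dpo_loss(torch.tensor(-10.0), torch.tensor(-14.0),
                torch.tensor(-12.0), torch.tensor(-12.0))
```

Widening the policy's preference margin (e.g. raising the chosen log-probability) lowers this loss, which is the intended optimization pressure.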

Experiments and results. Appendix B presents the experimental setup. We verify three capabilities of LMMs on two public datasets (GSM8K and TAL-SCQ5K) and one self-built dataset (CMM12K): controllable generation (CG), analogy generation (AG), and fine-grained solving (FS). CG and AG reflect the stem-generation ability of an LMM, while FS reflects its problem-solving ability. We select six open-source LMMs and three closed-source LMMs as baselines, and perform three evaluation modes under human and GPT-4V supervision, namely scoring mode, arena mode, and objective indicators.
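For intuition, a win rate in arena mode might be computed as in the sketch below; the convention that a tie counts as half a win is our assumption and is not stated in the paper.

```python
# Illustrative sketch (not the authors' evaluation code) of an arena-mode
# win rate over pairwise judge decisions for one model.
def win_rate(outcomes):
    """outcomes: list of 'win' / 'tie' / 'loss' judgments for one model."""
    score = sum(1.0 if o == 'win' else 0.5 if o == 'tie' else 0.0
                for o in outcomes)
    return score / len(outcomes)

# Toy example: 4 pairwise comparisons -> (1 + 1 + 0.5 + 0) / 4.
rate = win_rate(['win', 'win', 'tie', 'loss'])
```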

Appendix C presents the detailed results. Compared with baselines of the same parameter size, the proposed model consistently maintains significant advantages in CG, AG, and FS. On CMM12K, the FS accuracy of COMET leads the baselines by up to 20.67%, and the average win rate for AG under arena mode is about 94.33%. Compared with open-source baselines whose parameter scale exceeds 7B, COMET still shows a significant advantage in FS accuracy on CMM12K, ranking second on GSM8K and TAL-SCQ5K. The average win rate for CG under arena mode is about 70%. Against powerful closed-source models including GPT-4o, COMET achieves average win rates of 51.0%, 55.6%, and 40.5% in arena mode for CG, AG, and FS, respectively.

Conclusion. We propose COMET, a “cone of experience” enhanced LMM for mathematical problem generation. To explore the possibility of analogizing LMM training to human learning, we divide the teacher growth process into three levels of experience based on the “cone of experience” educational theory, and use them to guide the construction of training data at different stages. A three-stage fine-tuning framework is designed to enhance the capabilities of stem generation and problem-solving within a single LMM, meeting the requirements of educational applications. Moreover, the CMM12K dataset is built to alleviate the scarcity of Chinese multimodal corpora in this field. Extensive experiments have demonstrated the advancement and effectiveness of the proposed model and framework.

Acknowledgements This work was supported by National Science and Technology Major Project (Grant No. 2022ZD0117103), National Natural Science Foundation of China (Grant Nos. 62437002, 62307015, 62293554), China Postdoctoral Science Foundation (Grant Nos. 2023M741304, 2023T160256), Hubei Provincial Natural Science Foundation of China (Grant Nos. 2023AFA020, 2023AFB295), and Fundamental Research Funds for the Central Universities (Grant No. CCNU24AI016).

Supporting information Appendixes A–D. The supporting information is available online at info.scichina.com and link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.

References

1 Wang H, Fu T, Du Y, et al. Scientific discovery in the age of artificial intelligence. Nature, 2023, 620: 47–60

2 Luo Y, Yang Y. Large language model and domain-specific model collaboration for smart education. Front Inform Technol Electron Eng, 2024, 25: 333–341

3 Wu Q, Zhang Q, Huang X. Automatic math word problem generation with topic-expression co-attention mechanism and reinforcement learning. IEEE ACM Trans Audio Speech Lang Process, 2022, 30: 1061–1072

4 Zha D C, Lai K H, Yang F, et al. Data-centric AI: techniques and future perspectives. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023. 5839–5840

5 Dale E. Audio-visual materials. Air Aff, 1947, 2: 179
