HuggingFace 每日AI论文速递

Author: duan


Description

10 minutes a day to get you up to speed on the day's trending HuggingFace AI papers. New episodes every weekday; subscriptions welcome.

📢 Find the podcast by searching for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou (小宇宙) and Apple Podcasts.

🖼 An illustrated edition is also available: search for and follow 【AI速递】 on Xiaohongshu (小红书).
407 Episodes
This episode covers the following 14 papers:
[00:16] 🌱 Agent Learning via Early Experience
[00:50] 🧠 MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
[01:42] 🧪 From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
[02:19] 🎬 UniVideo: Unified Understanding, Generation, and Editing for Videos
[03:01] 🧠 When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
[03:43] 🧠 Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
[04:25] 🧠 MemMamba: Rethinking Memory Patterns in State Space Model
[05:17] 🛡 The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
[05:53] 🎯 Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
[06:40] 🧪 NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
[07:17] 🪚 DeepPrune: Parallel Scaling without Inter-trace Redundancy
[07:54] 🚀 Training-Free Group Relative Policy Optimization
[08:24] 🪄 ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
[08:55] 🤥 LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
[Follow us] For more beyond the podcast, find us on Xiaohongshu: AI速递
This episode covers the following 15 papers:
[00:21] 🔄 Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
[00:59] 🧠 Cache-to-Cache: Direct Semantic Communication Between Large Language Models
[01:32] 🌀 Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
[02:07] 🧠 SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
[03:06] 🤖 RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
[04:02] 🎬 MATRIX: Mask Track Alignment for Interaction-aware Video Generation
[04:51] 🎯 Vibe Checker: Aligning Code Evaluation with Human Preference
[05:44] 🤖 Multi-Agent Tool-Integrated Policy Optimization
[06:24] 🧠 CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
[06:59] ✂ OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
[07:52] 🧠 Artificial Hippocampus Networks for Efficient Long-Context Modeling
[08:30] 🔍 Revisiting Long-context Modeling from Context Denoising Perspective
[09:11] 🧠 Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
[09:51] 💥 Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
[10:37] ⚡ Native Hybrid Attention for Efficient Sequence Modeling
This episode covers the following 15 papers:
[00:24] 📊 TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
[00:57] 🔍 Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
[01:39] 🚀 Fast-dLLM v2: Efficient Block-Diffusion LLM
[02:30] 🧑 CoDA: Coding LM via Diffusion Adaptation
[03:01] 🧩 Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning
[03:52] ⚖ ASPO: Asymmetric Importance Sampling Policy Optimization
[04:34] 🔗 Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
[05:15] 🧠 AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems
[05:51] 🪂 Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
[06:35] 🌍 HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video
[07:22] ⚡ TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
[08:09] 🎯 Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
[09:00] 🩺 Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
[09:46] 🧠 MixReasoning: Switching Modes to Think
[10:20] ⚡ LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation
This episode covers the following 15 papers:
[00:21] 🎬 Paper2Video: Automatic Video Generation from Scientific Papers
[00:55] 🎬 Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
[01:38] 🎬 VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
[02:14] 👻 Imperceptible Jailbreaking against Large Language Models
[02:56] 🌳 MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information
[03:30] 🧬 Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
[04:07] 📊 Factuality Matters: When Image Generation and Editing Meet Structured Visuals
[04:59] 🔄 Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
[05:55] ⚖ Judging with Confidence: Calibrating Autoraters to Preference Distributions
[06:44] 🎯 Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
[07:27] 📏 Optimal Scaling Needs Optimal Norm
[07:51] 🔬 Code4MeV2: a Research-oriented Code-completion Platform
[08:31] 🪞 Self-Reflective Generation at Test Time
[09:15] 🔄 SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
[10:00] 👀 Watch and Learn: Learning to Use Computers from Online Videos
This episode covers the following 15 papers:
[00:28] 🧠 Apriel-1.5-15b-Thinker (a 15B open model achieving frontier multimodal reasoning)
[01:04] 🚀 Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
[01:42] 🧩 Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition
[02:19] 🪞 Self-Improvement in Multimodal Large Language Models: A Survey
[02:59] 🧬 Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
[03:38] 📊 CoDA: Agentic Systems for Collaborative Data Visualization
[04:21] 🧐 SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
[05:06] 🔧 REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration
[05:53] 🔍 OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
[06:38] 🔍 FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
[07:14] 🎯 Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
[08:05] 📏 LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning
[08:45] 🤖 WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
[09:19] 🍱 Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
[09:54] 🎯 LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
This episode covers the following 5 papers:
[00:43] TOP1 (🔥323) | 🐣 The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
[02:38] TOP2 (🔥167) | 🎬 LongLive: Real-time Interactive Long Video Generation
[05:04] TOP3 (🔥150) | 🔥 MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
[07:24] TOP4 (🔥124) | 🧠 EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
[09:18] TOP5 (🔥122) | 🎮 Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
This episode covers the following 15 papers:
[00:22] 🗜 LongCodeZip: Compress Long Context for Code Language Models
[00:56] 🎬 Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
[01:38] 🧠 ExGRPO: Learning to Reason from Experience
[02:32] 🥷 StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
[03:32] 🎛 Interactive Training: Feedback-Driven Neural Network Optimization
[04:24] 📈 StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
[05:07] 🔍 VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
[05:44] 🪓 The Rogue Scalpel: Activation Steering Compromises LLM Safety
[06:21] 🔍 CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
[07:09] 🔍 ModernVBERT: Towards Smaller Visual Document Retrievers
[07:54] 🗺 RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
[08:37] 🚀 F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
[09:13] 🧠 RLP: Reinforcement as a Pretraining Objective
[09:45] 🖱 DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
[10:19] 🚀 The Unreasonable Effectiveness of Scaling Agents for Computer Use
This episode covers the following 15 papers:
[00:19] 🧠 DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
[01:20] 🤖 GEM: A Gym for Agentic LLMs
[01:57] 🧠 VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
[02:36] 🎒 Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
[03:06] 🎬 Code2Video: A Code-centric Paradigm for Educational Video Generation
[03:41] ⚙ PIPer: On-Device Environment Setup via Online Reinforcement Learning
[04:11] 🗜 ACON: Optimizing Context Compression for Long-horizon LLM Agents
[04:52] 🔍 Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
[05:22] ⚖ BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
[06:01] ⚡ Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
[06:42] 🚀 BroRL: Scaling Reinforcement Learning via Broadened Exploration
[07:25] 📊 Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum
[08:02] 🎯 On Predictability of Reinforcement Learning Dynamics for Large Language Models
[08:31] 🖥 GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness
[09:17] 🧠 Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned
This episode covers the following 10 papers:
[00:29] TOP1 (🔥640) | 🤝 Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
[02:49] TOP2 (🔥341) | 🔒 A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
[04:59] TOP3 (🔥218) | 🤖 VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
[07:07] TOP4 (🔥212) | 🤖 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
[09:17] TOP5 (🔥207) | 🤔 Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
[11:19] TOP6 (🔥183) | 🤔 Why Language Models Hallucinate
[13:06] TOP7 (🔥174) | 🧠 A Survey of Reinforcement Learning for Large Reasoning Models
[15:32] TOP8 (🔥160) | 🎬 LongLive: Real-time Interactive Long Video Generation
[18:13] TOP9 (🔥145) | 💡 Reverse-Engineered Reasoning for Open-Ended Generation
[20:27] TOP10 (🔥140) | 🤖 A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
This episode covers the following 15 papers:
[00:20] 🎮 Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
[00:59] 🔥 MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
[01:36] 🐣 The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
[02:10] 🤥 TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
[02:55] 🌊 OceanGym: A Benchmark Environment for Underwater Embodied Agents
[03:41] ⚡ DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
[04:14] 🔍 Who's Your Judge? On the Detectability of LLM-Generated Judgments
[04:59] ✂ Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
[05:45] 👁 Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
[06:24] 🧠 Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
[07:09] 🧪 VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
[07:42] ⚡ dParallel: Learnable Parallel Decoding for dLLMs
[08:28] 🎯 IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
[09:15] 🎬 MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
[10:12] 🐬 Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
This episode covers the following 15 papers:
[00:22] ⚡ SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
[01:05] 🗣 StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
[01:54] 🎮 Multiplayer Nash Preference Optimization
[02:57] 🔗 RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
[03:44] 🎨 OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
[04:28] 🧠 Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
[05:05] 🧩 Visual Jigsaw Post-Training Improves MLLMs
[05:37] 🎬 SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
[06:15] 🔬 Democratizing AI scientists using ToolUniverse
[06:59] 🧠 When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
[07:31] 📊 GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
[08:04] 🖼 EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
[08:54] 🚀 SparseD: Sparse Attention for Diffusion Language Models
[09:40] 🎛 EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
[10:32] 🧠 Towards Personalized Deep Research: Benchmarks and Evaluations
This episode covers the following 15 papers:
[00:20] 🎬 LongLive: Real-time Interactive Long Video Generation
[00:56] 🎯 Quantile Advantage Estimation for Entropy-Safe Reasoning
[01:34] 📄 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
[02:11] 🧠 EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
[03:08] 🧠 Variational Reasoning for Language Models
[03:37] 💬 Language Models Can Learn from Verbal Feedback Without Scalar Rewards
[04:32] 🔍 ReviewScore: Misinformed Peer Review Detection with Large Language Models
[05:12] 🎯 CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
[05:49] 🪄 MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
[06:32] 🎯 No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
[07:14] 🗣 VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
[07:58] 🧭 UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
[08:29] 🖼 LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
[09:16] 🌐 WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
[09:49] 🔄 SPARK: Synergistic Policy And Reward Co-Evolving Framework
This episode covers the following 5 papers:
[00:38] TOP1 (🔥116) | 📜 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
[02:43] TOP2 (🔥113) | 🌐 Qwen3-Omni Technical Report (billed as the first omni-modal model with no performance degradation)
[05:23] TOP3 (🔥112) | 🗺 RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
[07:45] TOP4 (🔥104) | 📈 VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
[10:05] TOP5 (🔥89) | 🚀 LIMI: Less is More for Agency
This episode covers the following 15 papers:
[00:20] 🔬 SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
[01:00] 🧠 MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
[01:41] 📈 VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
[02:26] 🌳 Tree Search for LLM Agent Reinforcement Learning
[03:06] 🖼 Seedream 4.0: Toward Next-generation Multimodal Image Generation
[03:40] 🎯 Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
[04:29] 🤖 AutoIntent: AutoML for Text Classification
[05:10] ⚖ TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
[05:43] 🎢 CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
[06:30] 🖼 Does FLUX Already Know How to Perform Physically Plausible Image Composition?
[07:31] ✂ CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling
[08:26] 🧠 Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
[09:12] 🎮 V-GameGym: Visual Game Generation for Code Large Language Models
[09:49] 🗣 Interactive Recommendation Agent with Active User Commands
[10:22] 🔍 BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
This episode covers the following 10 papers:
[00:22] 🎥 Video models are zero-shot learners and reasoners
[01:09] 🧠 SIM-CoT: Supervised Implicit Chain-of-Thought
[01:55] 🪶 EmbeddingGemma: Powerful and Lightweight Text Representations
[02:29] 🗣 Advancing Speech Understanding in Speech-Aware Language Models with GRPO
[03:06] 🌍 LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines
[03:52] 🎬 EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
[04:29] 🌀 Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
[05:19] 🎬 PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
[05:58] 📄 Logics-Parsing Technical Report (RL-based end-to-end document parsing with LLMs)
[06:44] 🤖 On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub
This episode covers the following 15 papers:
[00:24] 📜 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
[00:58] 🚀 Reinforcement Learning on Pre-Training Data
[01:37] 👁 Do You Need Proprioceptive States in Visuomotor Policies?
[02:36] 🚀 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
[03:24] 🎯 MAPO: Mixed Advantage Policy Optimization
[04:06] 🚀 Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
[04:44] 🎯 VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
[05:31] 🌌 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
[06:08] 🧩 What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
[06:41] 🗣 Large Language Models Discriminate Against Speakers of German Dialects
[07:32] 📊 OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
[08:19] 🪄 HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
[09:07] 🛠 CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
[09:41] 🛰 Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
[10:28] 🌍 VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
This episode covers the following 15 papers:
[00:21] 🚀 LIMI: Less is More for Agency
[00:55] 🎬 OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
[01:28] 🧩 OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
[02:19] 🌐 Qwen3-Omni Technical Report
[02:55] 🎬 TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
[03:28] 📐 GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
[04:15] 🎯 DiffusionNFT: Online Diffusion Reinforcement with Forward Process
[05:05] 🤖 ByteWrist: A Parallel Robotic Wrist Enabling Flexible and Anthropomorphic Motion for Confined Spaces
[05:42] 💬 EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
[06:24] 🧠 SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
[07:01] 🧠 FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
[08:05] 🎬 VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models
[08:53] 🧪 ARE: Scaling Up Agent Environments and Evaluations
[09:28] 🧩 QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
[10:17] 🔍 Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
This episode covers the following 13 papers:
[00:25] 🗺 RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
[01:00] 🌉 MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
[01:42] 🧩 Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
[02:25] 🎯 BaseReward: A Strong Baseline for Multimodal Reward Model
[02:56] 🏠 SPATIALGEN: Layout-guided 3D Indoor Scene Generation
[03:46] 🧠 BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
[04:30] 🎭 Lynx: Towards High-Fidelity Personalized Video Generation
[05:20] 🤖 A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
[05:54] 📹 RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
[06:21] 🗣 Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
[07:07] 🎬 Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
[07:50] 🗣 WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
[08:30] 🗣 Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
This episode covers the following 5 papers:
[00:43] TOP1 (🔥95) | 🌍 OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
[02:51] TOP2 (🔥93) | 🔍 WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
[05:09] TOP3 (🔥91) | 🤖 Scaling Agents via Continual Pre-training
[07:33] TOP4 (🔥88) | 🖥 ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
[10:48] TOP5 (🔥79) | 🌊 FlowRL: Matching Reward Distributions for LLM Reasoning
This episode covers the following 15 papers:
[00:26] 🖥 ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
[01:01] 🌊 FlowRL: Matching Reward Distributions for LLM Reasoning
[01:57] 🧭 Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration
[02:55] 🧬 Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
[03:34] 🎨 Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
[04:12] 🔍 FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
[04:56] 🤖 RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
[05:39] 🔮 AToken: A Unified Tokenizer for Vision
[06:10] 🌌 WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
[06:58] 🖼 MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks
[07:54] 🎮 RecoWorld: Building Simulated Environments for Agentic Recommender Systems
[08:28] 🎯 Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
[09:03] 🔍 Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
[09:51] 🩺 EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
[10:34] 🛰 FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection