Discover
Deep Dive in Research
14 Episodes
Reverse
https://huggingface.co/blog/codelion/internal-coherence-maximizationThe article presents a novel method for improving large language models (LLMs) called Internal Coherence Maximization (ICM) combined with Direct Preference Optimization (DPO), which operates without any human supervision. This unsupervised approach demonstrates superior performance in mathematical reasoning tasks compared to traditional human-supervised methods like Group Relative Policy Optimization (GRPO). Key contributions include a complete implementation of ICM with diverse solution generation and a pipeline to convert ICM results into preference pairs for DPO training. The research also shows successful cross-model capability transfer, where knowledge from a stronger model (Qwen3) improves a weaker one (Gemma3), offering a scalable and cost-effective alternative to current LLM alignment paradigms. The authors emphasize that pretrained models already possess rich understanding, and ICM+DPO offers a way to elicit and refine this internal coherence, leading to better performance without the bottleneck of human annotation.
The article introduces EDINET-Bench, a novel open-source Japanese financial benchmark designed to evaluate Large Language Models (LLMs) on complex financial tasks. This benchmark addresses the scarcity of challenging Japanese financial datasets for LLM evaluation, crucial for tasks like accounting fraud detection, earnings forecasting, and industry prediction. The EDINET-Bench dataset is automatically compiled from ten years of Japanese annual reports available through the Electronic Disclosure for Investors’ NETwork (EDINET). Initial evaluations indicate that even state-of-the-art LLMs perform only marginally better than logistic regression in some complex financial tasks, highlighting the need for domain-specific adaptation and further research. The project makes its dataset, benchmark construction code, and evaluation code publicly available to foster advancements in LLM applications within the financial sector.
The article introduces AutoThink, an innovative approach designed to enhance the inference efficiency and accuracy of reasoning Large Language Models (LLMs). AutoThink addresses the challenge of LLMs generating excessive or insufficient reasoning tokens, which leads to computational inefficiency and suboptimal performance. This system comprises two main components: a query complexity classifier that dynamically allocates the optimal number of reasoning tokens, and a dataset of control vectors derived from "pivotal tokens" to guide the LLM's reasoning path. Experimental results demonstrate that AutoThink significantly reduces output tokens while substantially improving accuracy on complex reasoning tasks, suggesting a more strategic approach to LLM resource allocation rather than simply increasing computation.
The article introduces System Prompt Learning (SPL), an innovative approach enabling Large Language Models (LLMs) to learn and refine problem-solving strategies through practical experience. This method addresses the current disparity where most developers lack the sophisticated system prompts that make advanced AI assistants so capable. SPL represents a "third paradigm" of LLM learning, augmenting traditional pretraining and finetuning by allowing models to classify problems, apply relevant strategies, and continuously improve these strategies over time. The system maintains a dynamic database of human-readable strategies, demonstrating significant performance improvements across various benchmarks and offering benefits like cumulative learning, transparency, and adaptability. Implemented as an open-source plugin in optillm, SPL offers a practical way to integrate this adaptive intelligence into LLM applications.
This article introduces OpenEvolve, an open-source implementation of Google DeepMind's AlphaEvolve, a system that leverages Large Language Models (LLMs) in an evolutionary framework to generate and optimize code. OpenEvolve allows users to evolve entire codebases by iteratively creating modifications using LLMs, evaluating them with automated metrics, and selecting promising solutions through an evolutionary process. The article details OpenEvolve's architecture, highlighting its key components like the Prompt Sampler and LLM Ensemble, and provides examples demonstrating its ability to achieve results comparable to AlphaEvolve in complex problems such as circle packing and function minimization, showcasing the evolution from simpler algorithms to more sophisticated solutions. It also discusses the importance of LLM performance and diversity for successful evolution and provides guidance on how to install and use the software for developing and improving algorithms.
This paper introduces Pivotal Token Search (PTS), a novel method for improving the performance of large language models by focusing on critical decision points in their output sequences. Unlike traditional methods that treat all generated tokens equally, PTS identifies "pivotal tokens" that significantly influence the probability of a successful generation. By using a binary search algorithm to pinpoint these key tokens, PTS generates preference pairs specifically centered on these critical decisions, leading to a more efficient learning signal during training. The release includes an open-source implementation, datasets of pivotal tokens and preference pairs, and fine-tuned models demonstrating the technique's effectiveness. This approach has potential applications in improving reasoning abilities, agent trajectories, and model interpretability.
This episode introduces CameraBench, a large-scale dataset and benchmark designed to improve camera motion understanding in videos. It details a taxonomy of camera motion primitives developed with cinematographers, highlighting how motions can relate to scene content like tracking subjects. The authors describe a rigorous annotation framework and human study demonstrating how domain expertise and training enhance annotation accuracy. Using CameraBench, they evaluate both Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM struggles with semantic primitives while VLMs struggle with precise geometric motions. Finally, they show that fine-tuning a generative VLM on CameraBench significantly improves performance on tasks like motion-augmented captioning and video question answering.
This epidsode introduces Step1X-Edit, an open-source image editing model designed to close the performance gap with proprietary models like GPT-4o. The developers created a large-scale, high-quality dataset and a new benchmark (GEdit-Bench) reflecting real-world editing instructions to train and evaluate the model. Step1X-Edit integrates a Multimedia Large Language Model (MLLM) with a diffusion-based image decoder to perform diverse edits based on natural language instructions. Experimental results indicate that Step1X-Edit outperforms existing open-source models and achieves performance comparable to leading closed-source systems.
Visual reasoning is a core component of human intelligence and a critical capabilityfor advanced multimodal models. Yet current reasoning evaluations of multimodallarge language models (MLLMs) often rely on text descriptions and allow languagebased reasoning shortcuts, failing to measure genuine vision-centric reasoning.To address this, we introduce VisuLogic: a benchmark of 1,000 human-verifiedproblems across six categories (e.g., quantitative shifts, spatial relations, attributecomparisons). These various types of questions can be evaluated to assess the visualreasoning capabilities of MLLMs from multiple perspectives. We evaluate leadingMLLMs on this benchmark and analyze their results to identify common failuremodes. Most models score below 30% accuracy—only slightly above the 25% random baseline and far below the 51.4% achieved by humans—revealing significantgaps in visual reasoning. Furthermore, we provide a supplementary training datasetand a reinforcement-learning baseline to support further progress. Code, data, andbaselines are available at https://visulogic-benchmark.github.io/VisuLogic.
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed corresponding base models' capacity. In this study, however, we critically re-examines this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, the RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (\eg, k=1), base models can achieve a comparable or even higher pass@k score compared to their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts the performance by biasing the model's output distribution toward paths that are more likely to yield rewards, therefore sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, different from RLVR. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need of a better paradigm. Project Page: https://limit-of-RLVR.github.io
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an over +7.0 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
This episode explores a hopeful vision of the future with powerful AI, focusing on how AI could revolutionize five key areas: biology and health, neuroscience and mind, economic development and poverty, peace and governance, and work and meaning. Join us as we examine the potential of AI to solve humanity’s biggest challenges and unlock a future of abundance and well-being for everyone.
This text explores the nature of time from a computational perspective. It argues that time is not a fundamental coordinate but rather a consequence of the universe's computational processes. The author proposes that time is "the progressive doing of computation by the universe," and that our perception of time arises from our own computational limitations as observers. The text further suggests that the universe's computational irreducibility, the idea that there is no shortcut to understanding a system's evolution, contributes to the robustness of time as a unidirectional flow. The author also examines the concepts of multiple threads of time, the ruliad (the totality of all possible computational processes), and the role of computational boundedness in shaping our perception of time and physical laws.
This research paper describes the development and capabilities of "Movie Gen," a new suite of generative AI models that produce high-quality, realistic videos and audio. The paper highlights key advancements in text-to-video and video-to-audio synthesis, video editing, and video personalization. The authors detail their models' architecture, training procedures, and evaluation metrics, demonstrating superior performance compared to existing commercial and open-source solutions. This research aims to advance the field of media generation and enable new creative possibilities.




