AI Radar Research

Daily research digest for developers - Monday, May 18, 2026

arXiv

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

This paper examines how skills are integrated into LLM-based agent systems, focusing on how they are injected into the agent reasoning loop as contextual guidance.

Why it matters: Understanding how to effectively integrate skills into AI agents can enhance their ability to perform complex tasks autonomously.
arXiv

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

The paper presents CAX-Agent, a framework designed to improve the reliability of large language models in finite-element simulation by providing structured execution control and fault recovery.

Why it matters: Improving reliability in AI systems is crucial for their deployment in critical applications like engineering simulations.
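To make "structured execution control and fault recovery" concrete, here is a minimal Python sketch of a step runner that retries a failing step and stops cleanly when retries run out. It is not CAX-Agent's implementation: the names `run_steps`, `StepResult`, and the toy geometry and mesh steps are hypothetical, and a real harness would call into the simulation tool instead of plain functions.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    attempts: int
    error: str = ""


def run_steps(steps, max_retries=2):
    """Run (name, callable) steps in order, retrying each failing step.

    Hypothetical harness: stops at the first step that still fails
    after max_retries retries and reports what happened so far.
    """
    results = []
    for name, action in steps:
        attempts = 0
        while True:
            attempts += 1
            try:
                action()
                results.append(StepResult(name, True, attempts))
                break
            except Exception as exc:
                if attempts > max_retries:
                    results.append(StepResult(name, False, attempts, str(exc)))
                    return results  # abort the pipeline, keep the log
    return results


# Toy steps standing in for simulation commands (assumed, not from the paper).
calls = {"mesh": 0}


def define_geometry():
    pass


def flaky_mesh():
    calls["mesh"] += 1
    if calls["mesh"] < 2:
        raise RuntimeError("mesh generation failed")


demo_results = run_steps([("geometry", define_geometry), ("mesh", flaky_mesh)])
```

In this sketch the mesh step fails once and succeeds on the retry, so the run completes with a record of how many attempts each step needed.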
arXiv

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

This benchmark evaluates the ability of large language models to generate efficient code for performance-critical systems tasks, emphasizing the need for optimization beyond functional correctness.

Why it matters: Benchmarks like PerfCodeBench are essential for assessing and improving the performance capabilities of AI coding tools.
arXiv

Runtime-Structured Task Decomposition for Agentic Coding Systems

The paper discusses how agentic coding systems can benefit from runtime-structured task decomposition, which aids in debugging, root cause analysis, and code review.

Why it matters: Structuring task decomposition at runtime could make AI coding agents more effective at debugging, root cause analysis, and code review.
arXiv

Effective Harness Engineering for Algorithm Discovery with Coding Agents

This research highlights the importance of harness engineering in combining large language models with evolutionary search for automated algorithm discovery.

Why it matters: The harness around the model shapes how well LLM-driven evolutionary search can discover new algorithms.
arXiv

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

Hydra introduces a method for improving code generation by using checkpoint-and-rollback to handle static errors, aiming to reduce latency and improve correctness.

Why it matters: Efficient error handling is crucial for the practical deployment of AI in coding tasks.
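The general idea of checkpoint-and-rollback on static errors can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Hydra's method: generation is modeled as a list of candidate statements per step, the static check is just `ast.parse`, and `generate_with_rollback` is a hypothetical name.

```python
import ast


def generate_with_rollback(candidates_per_step):
    """Build a program statement by statement, rolling back bad candidates.

    `accepted` is the checkpoint: a prefix known to pass the static check.
    A candidate that breaks parsing is discarded and the next one is tried.
    """
    accepted = []
    rollbacks = 0
    for candidates in candidates_per_step:
        for candidate in candidates:
            trial = accepted + [candidate]
            try:
                ast.parse("\n".join(trial))  # static check only, nothing runs
            except SyntaxError:
                rollbacks += 1  # drop the candidate, keep the checkpoint
                continue
            accepted = trial  # advance the checkpoint
            break
        else:
            raise ValueError("no candidate passed the static check")
    return "\n".join(accepted), rollbacks


# Toy candidate lists standing in for model samples (assumed for the demo).
steps = [
    ["def add(a, b:\n    return a + b", "def add(a, b):\n    return a + b"],
    ["total = add(2, 3", "total = add(2, 3)"],
]
code, rollbacks = generate_with_rollback(steps)
```

Because only the failing statement is discarded, the already-validated prefix never has to be regenerated, which is where a latency saving would come from in this simplified picture.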
arXiv

Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle

This review synthesizes the current state of agentic AI in software development, highlighting mature adoption areas, dominant architectural patterns, and existing limitations.

Why it matters: Understanding the landscape of agentic AI can guide future research and development in AI-assisted software engineering.
arXiv

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

PBT-Bench provides a benchmark for evaluating AI agents on property-based testing, focusing on deriving semantic invariants rather than just reproducing known bugs.

Why it matters: Property-based testing benchmarks are crucial for developing AI systems that can understand and verify software properties.
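For readers new to property-based testing, the following Python sketch shows what checking a semantic invariant looks like, as opposed to replaying one known failing input. It uses only the standard library rather than a dedicated framework; `check_property`, `sort_invariant`, and `buggy_sort` are hypothetical names and are not taken from PBT-Bench.

```python
import random
from collections import Counter


def check_property(prop, gen, trials=200, seed=0):
    """Return the first generated input that violates prop, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value  # counterexample
    return None


def random_int_list(rng):
    """Generate a short list of small integers (duplicates are likely)."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 10))]


def sort_invariant(sort_fn):
    """Invariant: output is ordered and is a permutation of the input."""
    def prop(xs):
        out = sort_fn(list(xs))
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        same_items = Counter(out) == Counter(xs)
        return ordered and same_items
    return prop


def buggy_sort(xs):
    return sorted(set(xs))  # bug: silently drops duplicates


good = check_property(sort_invariant(sorted), random_int_list)
bad = check_property(sort_invariant(buggy_sort), random_int_list)
```

The invariant says nothing about any specific bug, yet it exposes `buggy_sort` as soon as a generated list contains a repeated value.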
arXiv

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

SDOF introduces a framework for multi-agent orchestration that enforces stage constraints, addressing the alignment tax in task routing through graph-based pipelines.

Why it matters: Enforcing constraints in multi-agent systems can improve alignment and efficiency in task execution.
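One way to picture constraint-enforcing dispatch over a graph-based pipeline is a router that refuses any transition the graph does not allow. The Python sketch below is an assumption-laden illustration, not SDOF's design: the stage names, the `ALLOWED` graph, and the `dispatch` and `run_route` helpers are all hypothetical.

```python
# Hypothetical pipeline graph: each stage lists the stages it may hand off to.
ALLOWED = {
    "plan": {"implement"},
    "implement": {"test"},
    "test": {"implement", "review"},
    "review": set(),
}


class StageError(Exception):
    """Raised when a requested hand-off is not an edge in the graph."""


def dispatch(current, requested, allowed=ALLOWED):
    """Route to `requested` only if the graph permits it from `current`."""
    if requested not in allowed.get(current, set()):
        raise StageError(f"{current} -> {requested} is not permitted")
    return requested


def run_route(route, start="plan"):
    """Follow a sequence of requested stages, enforcing every transition."""
    stage = start
    for requested in route:
        stage = dispatch(stage, requested)
    return stage


final_stage = run_route(["implement", "test", "implement", "test", "review"])
```

Here the check lives in the dispatcher rather than in each agent's prompt, so an agent that asks to skip from planning straight to review is rejected regardless of how it phrases the request.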
arXiv

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

ICRL explores how reinforcement learning can help large language models internalize self-critique, aiming to improve their ability to correct mistakes autonomously.

Why it matters: Self-critique mechanisms can enhance the reliability and autonomy of AI systems in coding tasks.