AI Radar Research

Daily research digest for developers — Friday, March 27 2026

arXiv

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

ARC-AGI-3 introduces an interactive benchmark for studying agentic intelligence through abstract, turn-based environments where agents must explore, infer goals, and plan actions.

Why it matters: This benchmark provides a new platform for testing and improving autonomous agents' exploration and decision-making capabilities.
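
The core loop the benchmark exercises can be pictured as an agent probing a turn-based environment whose goal is never stated. The sketch below is illustrative only, not the actual ARC-AGI-3 API: `GridEnv` and the random `explore` baseline are invented stand-ins for "explore, infer goals, and plan actions".

```python
import random

class GridEnv:
    """Toy stand-in for an interactive, turn-based environment:
    the episode ends when the agent reaches `goal`, which is never
    revealed to it (it must be inferred by exploration)."""
    def __init__(self, size=5, goal=(4, 4)):
        self.size, self.goal = size, goal
        self.pos = (0, 0)

    def step(self, action):
        # Actions move the agent on the grid, clamped at the edges.
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self.goal  # observation, done

def explore(env, max_turns=200, seed=0):
    """Random explorer: a floor baseline that discovers the goal by trial."""
    rng = random.Random(seed)
    for turn in range(1, max_turns + 1):
        _, done = env.step(rng.choice(["up", "down", "left", "right"]))
        if done:
            return turn  # turns spent before the goal was discovered
    return None

turns = explore(GridEnv())
```

Stronger agents would replace the random policy with explicit goal inference and planning; the benchmark's interest is in how far beyond this baseline they get.
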
arXiv

AutoSAM: an Agentic Framework for Automating Input File Generation for the SAM Code with Multi-Modal Retrieval-Augmented Generation

AutoSAM presents a framework that automates the generation of input files for the System Analysis Module (SAM) using multi-modal retrieval-augmented generation.

Why it matters: This framework reduces the manual effort required in safety analysis of reactor systems, showcasing the potential of AI in automating complex engineering tasks.
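
The retrieval half of such a pipeline can be sketched in a few lines. This is a hedged illustration, not AutoSAM's actual method: the documentation snippets, the token-overlap ranking, and the prompt template are all invented for the example.

```python
# Hypothetical documentation snippets the retriever would search over.
DOCS = [
    "channel geometry block: set pipe length, diameter, roughness",
    "boundary condition block: set inlet temperature and mass flow",
    "solver block: set time step and convergence tolerance",
]

def retrieve(query, docs, k=2):
    """Rank snippets by word overlap with the query; return the top k.
    A real system would use multi-modal embeddings instead."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

# Retrieved context is spliced into the generation prompt.
context = retrieve("inlet mass flow boundary condition", DOCS)
prompt = "Generate a SAM input file using:\n" + "\n".join(context)
```

The design point is that generation is grounded in retrieved reference material rather than the model's parametric memory alone, which matters when the output is a safety-analysis input file.
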
arXiv

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

This paper gives process-calculus semantics to agentic tool protocols, analyzing Schema-Guided Dialogue and other frameworks for zero-shot API interaction.

Why it matters: Formal semantics are crucial for ensuring the reliability and safety of AI agents interacting with external tools.
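
The underlying idea, reduced to its simplest form, is a labelled transition system: each protocol state permits only certain messages, and a trace is valid only if it is derivable from the rules. The protocol states and message names below are invented for illustration and are not taken from the paper.

```python
# Hypothetical tool-call protocol as state -> permitted message -> next state.
PROTOCOL = {
    ("idle", "list_tools"): "listed",
    ("listed", "call_tool"): "awaiting_result",
    ("awaiting_result", "tool_result"): "listed",
    ("listed", "end"): "done",
}

def conforms(trace, state="idle"):
    """Check that a message trace is derivable in the protocol."""
    for msg in trace:
        nxt = PROTOCOL.get((state, msg))
        if nxt is None:
            return False  # message not permitted in this state
        state = nxt
    return state == "done"

ok = conforms(["list_tools", "call_tool", "tool_result", "end"])
bad = conforms(["call_tool", "end"])  # calling before discovery: rejected
```

A process calculus generalizes this picture with composition and concurrency, but even the flat version shows what formal semantics buys: illegal interaction orders are rejected by construction rather than caught at runtime.
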
arXiv

Experiential Reflective Learning for Self-Improving LLM Agents

This research introduces a framework for LLM agents to improve through experiential reflective learning, enabling adaptation to specialized environments.

Why it matters: The ability for LLM agents to learn from experience is crucial for their effectiveness in dynamic and specialized coding tasks.
arXiv

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

TRAJEVAL introduces a method for decomposing code agent trajectories to provide detailed diagnostics beyond binary success metrics.

Why it matters: This approach allows developers to better understand and improve the performance of autonomous coding agents.
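
In miniature, the shift is from one pass/fail bit per episode to per-phase statistics over the trajectory's steps. The phase names and outcome records below are hypothetical, not TRAJEVAL's actual schema.

```python
from collections import Counter

# Hypothetical trajectory: a list of (phase, outcome) step records.
trajectory = [
    ("locate_file", "ok"),
    ("edit", "ok"),
    ("run_tests", "fail"),
    ("edit", "ok"),
    ("run_tests", "ok"),
]

def diagnose(steps):
    """Per-phase pass rates instead of one binary success flag."""
    totals, fails = Counter(), Counter()
    for phase, outcome in steps:
        totals[phase] += 1
        if outcome != "ok":
            fails[phase] += 1
    return {p: 1 - fails[p] / totals[p] for p in totals}

report = diagnose(trajectory)
# The episode ended in success, yet the report shows that half of the
# test runs failed along the way -- exactly what a binary metric hides.
```
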
arXiv

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

SlopCodeBench evaluates coding agents on iterative tasks, highlighting how performance can degrade over long horizons.

Why it matters: Understanding degradation in coding agents helps improve their reliability and effectiveness in real-world applications.
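
A minimal way to quantify such degradation is to fit a trend line to per-iteration pass rates: a clearly negative slope flags decay over the horizon. The pass-rate series below is invented; only the least-squares slope computation is standard.

```python
def degradation_slope(pass_rates):
    """Least-squares slope of pass rate vs. iteration index."""
    n = len(pass_rates)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(pass_rates) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, pass_rates))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical per-iteration pass rates from repeated edit/test cycles.
rates = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5]
slope = degradation_slope(rates)  # negative: performance is decaying
```
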
arXiv

Sketch2Simulation: Automating Flowsheet Generation via Multi Agent Large Language Models

Sketch2Simulation automates the conversion of process sketches into executable simulation models using multi-agent LLMs.

Why it matters: This automation reduces the manual effort and expertise required in process systems engineering, enhancing efficiency.
arXiv

Learning From Developers: Towards Reliable Patch Validation at Scale for Linux

This paper studies Linux patch reviews over a decade to improve the reliability of patch validation processes.

Why it matters: Reliable patch validation is crucial for maintaining the integrity and performance of large-scale open-source projects.
Microsoft Research AI

AsgardBench: A benchmark for visually grounded interactive planning

AsgardBench provides a benchmark for evaluating visually grounded interactive planning in embodied AI systems.

Why it matters: This benchmark helps advance the development of AI systems capable of complex interactive planning tasks.
OpenAI Blog

Powering product discovery in ChatGPT

ChatGPT introduces the Agentic Commerce Protocol for richer, visually immersive shopping experiences, enhancing product discovery and merchant integration.

Why it matters: This development showcases the potential of AI to transform e-commerce through enhanced interactive capabilities.