AI Radar Research

Daily research digest for developers: Friday, May 15, 2026

arXiv

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

This paper explores verifier-guided action selection as a way to improve the decision-making of embodied agents that use multimodal large language models.

Why it matters: Improving decision-making in embodied agents can enhance the reliability and efficiency of AI systems in real-world applications.

arXiv

CHAL: Council of Hierarchical Agentic Language

The paper introduces a multi-agent debate framework to improve reasoning in large language models by addressing structural limitations in current methodologies.

Why it matters: Enhancing reasoning in LLMs can lead to more accurate and reliable AI coding tools.

arXiv

DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

DisaBench introduces a framework to evaluate disability-related harms in language models, co-created with people with disabilities and experts.

Why it matters: Ensuring AI systems are safe and inclusive is crucial for their widespread adoption and trust.

arXiv

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

This paper presents EvolveMem, a self-evolving memory architecture for LLM agents that adapts retrieval infrastructure over time.

Why it matters: Adaptive memory architectures can enhance the performance and longevity of AI coding tools.

arXiv

Neural Code Translation of Legacy Code: APL to C#

This study investigates the translation of APL into C# using large language models, addressing challenges in automatic programming language translation.

Why it matters: Facilitating code translation can help modernize legacy systems and improve software maintenance.

arXiv

CA2: Code-Aware Agent for Automated Game Testing

CA2 introduces a code-aware agent for automated game testing, aiming to improve coverage and efficiency in identifying edge cases.

Why it matters: Improved game testing can lead to more reliable and robust software products.

arXiv

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE proposes a method for injecting constrained reasoning into code agents, enhancing their ability to handle complex tool-use protocols.

Why it matters: Enhancing reasoning in code agents can improve their effectiveness in complex coding tasks.

arXiv

LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

This empirical study explores the use of LLMs for robustness testing of microservice applications, focusing on generating diverse inputs to expose failures.

Why it matters: Robustness testing is crucial for ensuring the reliability of microservice-based systems.

arXiv

Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey

This survey examines the interplay between metamorphic testing and large language models, highlighting challenges and opportunities in software quality assurance.

Why it matters: Understanding the interaction between LLMs and testing can lead to improved software quality assurance practices.

OpenAI Blog

Sea's View on the Future of Agentic Software Development with Codex

Sea Limited's CPO discusses the deployment of Codex across engineering teams to accelerate AI-native software development in Asia.

Why it matters: Insights into real-world applications of Codex can guide developers in leveraging AI for software development.