• VISTA: View-Consistent Self-Verified Training for GUI Grounding
    Jun 15 2026
    Teaching AI to click the right button on a screen (GUI grounding) sounds simple but is surprisingly brittle. A core training problem is that reinforcement learning often collapses: on hard instances, every rollout fails, so there is no useful learning signal; on easy ones, every rollout succeeds, which is equally uninformative. VISTA addresses this by generating multiple crops of the same GUI screenshot and comparing model predictions across these geometrically different but semantically equivalent views. A self-verification mechanism further stabilizes training by anchoring on cases where the model has already produced a correct answer. Results across five benchmarks show consistent accuracy improvements, with the strongest gains on the most challenging GUI grounding tasks. Applications include desktop automation agents, accessibility tools, and software testing frameworks. Authors: Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu Paper: https://arxiv.org/abs/2606.14579v1
    3 min
  • CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
    Jun 15 2026
    High-throughput scientific experimentation — screening thousands of chemical compounds, for instance — is expensive and irreversible, making it a dangerous domain for unconstrained AI autonomy. CARE solves this by keeping a proven non-LLM optimizer as the default while allowing an LLM to propose challenger strategies, only authorizing the challenger when pre-outcome evidence actually supports the switch. Every decision is logged in an auditable trail. On chemistry benchmarks, this outperforms all other evaluated methods, improving best-found outcomes significantly over a strong baseline. Applications extend to drug discovery, materials science, process optimization in manufacturing, and any high-stakes experimental domain where AI creativity needs to be harnessed without sacrificing accountability or safety. Authors: Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi Paper: https://arxiv.org/abs/2606.14581v1
    2 min
  • A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems
    Jun 15 2026
    Railway networks are extraordinarily complex — trains of different gauges share limited track, single-track sections require precise coordination, and unexpected disruptions cascade through entire timetables. Most optimization research stops at high-level scheduling, leaving the messy operational details — track switching, gauge compatibility, disruption response — to human operators under pressure. This framework models the entire problem using PDDL 2.1 temporal planning, generating timestamped, conflict-free operational plans that account for gauge constraints and stochastic disruptions like blocked tracks or engine failures. Tested on 200 benchmark instances with up to 1,000 track points and 120 trains, it demonstrates practical viability for real-world railway systems seeking to reduce reliance on manual intervention during disruptions. Authors: Pollob Chandra Ray, Sabah Binte Noor, Fazlul Hasan Siddiqui Paper: https://arxiv.org/abs/2606.14582v1
    3 min
  • Sensitivity Shaping for Latent Modeling
    Jun 15 2026
    Generative dynamics models let robots plan behavior in rich, uncertain environments — but safely deploying them requires reliably detecting when the robot is about to enter unfamiliar territory. Existing out-of-distribution detection methods bolt on detectors after the fact, and this paper shows why that fails: if the dynamics model is locally insensitive to different control inputs in critical regions, unsafe actions can produce latent predictions that look like safe ones, suppressing the alert. The proposed fix — control-sensitivity regularization during training — makes the model more discriminating in exactly the regions where it matters. Applications include safer robot navigation in unstructured environments, robotic manipulation, autonomous vehicle planning, and any deployment where catastrophic failure must be caught before execution. Authors: Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao Paper: https://arxiv.org/abs/2606.14585v1
    3 min
  • When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime
    Jun 15 2026
    Most AI failure research is theoretical or laboratory-based; this paper is a rare longitudinal postmortem of a real production LLM agent system running continuously since early 2026, with 22 documented incidents over eight weeks. The most dangerous failure class identified is "fail-plausible": the agent doesn't just fail to report an error, it transforms the error into a fluent, convincing narrative delivered to the user. The study finds that human observation catches ~70% of silent failures that tests and audits miss entirely, and that audit processes function as regression engines rather than predictive ones. The taxonomy and design principles derived are immediately actionable for anyone building or operating long-running autonomous AI systems. Authors: Wei Wu Paper: https://arxiv.org/abs/2606.14589v1
    3 min
  • AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models
    Jun 15 2026
    Audio AI models have gotten good at recognizing what they hear, but complex reasoning — understanding causation, context, and implication across sound, speech, and music — remains a frontier challenge. A key bottleneck is training data: existing datasets are highly redundant, meaning models see many acoustically similar samples that provide overlapping rather than additive learning signal. AudioDER builds a pipeline that first deduplicates audio by acoustic similarity, then generates chain-of-thought reasoning annotations using a large language model. The resulting 191,000-sample dataset consistently improves reasoning performance across multiple benchmarks. Applications include voice assistants that reason about complex audio scenes, medical audio analysis, accessibility tools, and any system requiring nuanced understanding of audio in context. Authors: Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu Paper: https://arxiv.org/abs/2606.14591v1
    3 min
  • Regulating the Machine Contributor: Governance and Policy Alignment in Open Source
    Jun 15 2026
    AI agents can now autonomously plan changes, edit code, and submit pull requests, but open-source infrastructure was built around the assumption of a legally accountable human contributor who can attest to provenance and answer reviewers' questions. This paper systematically maps how six major open-source organizations (including Apache, the Linux Foundation, and SymPy) have responded with contribution policies, then scores those policies against the EU AI Act, the NIST AI RMF, and ISO frameworks. The result reveals fragmented, partially overlapping gaps that neither open-source policy nor AI regulation currently closes. Applications of this work include informing standardized AI contribution policies, guiding platform-level governance decisions at GitHub and GitLab, and shaping emerging regulatory frameworks for autonomous software agents. Authors: Jassem Manita, Aziz Amari Paper: https://arxiv.org/abs/2606.14594v1
    3 min
  • A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health
    Jun 15 2026
    Wearables generate a continuous stream of behavioral data — steps, screen time, sleep — that could power truly proactive health interventions, but it's been unclear which AI architectures best handle these signals across diverse populations and time horizons. This study benchmarks six deep learning models plus two foundation models across 800+ participants, tracking forecast accuracy out to eight days. Key findings: no single architecture dominates; the foundation model TimesFM matches trained models zero-shot; and personalized fine-tuning cuts error by 16–60%, with sleep benefiting most. Applications include preventive health apps, mental health monitoring, chronic disease management platforms, and research tools for digital health studies where population-level and individual-level accuracy both matter. Authors: Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios Paper: https://arxiv.org/abs/2606.14604v1
    3 min