Eye on AI Weekly Research Watch | Podcasts on Audible

Episodes

VISTA: View-Consistent Self-Verified Training for GUI Grounding

Jun 15 2026

Teaching AI to click the right button on a screen (GUI grounding) sounds simple but is surprisingly brittle. A core training problem is that reinforcement learning often collapses: on hard instances every rollout fails, so there is no useful learning signal; on easy ones every rollout succeeds, which is equally uninformative. VISTA addresses this by generating multiple crops of the same GUI screenshot and comparing model predictions across these geometrically different but semantically equivalent views. A self-verification mechanism further stabilizes training by anchoring on cases where the model has already produced a correct answer. Results across five benchmarks show consistent accuracy improvements, with the strongest gains on the most challenging GUI grounding tasks. Applications include desktop automation agents, accessibility tools, and software testing frameworks.

Authors: Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu
Paper: https://arxiv.org/abs/2606.14579v1
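
The collapse described above can be illustrated with a small sketch. It assumes a group-normalized advantage of the kind used in GRPO-style training; the episode summary does not name the RL algorithm, so that choice is an assumption, as are the helper name `group_advantages` and the 0/1 rewards.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Hypothetical helper: advantage of each rollout relative to its group.

    Computes (r - group mean) / (group std + eps). This normalization is an
    assumption for illustration, not a detail taken from the VISTA paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard instance: every rollout fails, so every advantage is zero.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# Easy instance: every rollout succeeds, so every advantage is zero as well.
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes get positive advantage, failures negative.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Only the mixed group yields non-zero advantages, which is why instances that are uniformly too hard or too easy contribute no gradient under this kind of objective.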

3 min

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Jun 15 2026

High-throughput scientific experimentation, such as screening thousands of chemical compounds, is expensive and irreversible, making it a dangerous domain for unconstrained AI autonomy. CARE addresses this by keeping a proven non-LLM optimizer as the default while allowing an LLM to propose challenger strategies, authorizing the challenger only when pre-outcome evidence actually supports the switch. Every decision is logged in an auditable trail. On chemistry benchmarks, this outperforms all other evaluated methods, improving best-found outcomes significantly over a strong baseline. Applications extend to drug discovery, materials science, process optimization in manufacturing, and any high-stakes experimental domain where AI creativity needs to be harnessed without sacrificing accountability or safety.

Authors: Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi
Paper: https://arxiv.org/abs/2606.14581v1

2 min

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

Jun 15 2026

Railway networks are extraordinarily complex: trains of different gauges share limited track, single-track sections require precise coordination, and unexpected disruptions cascade through entire timetables. Most optimization research stops at high-level scheduling, leaving the messy operational details (track switching, gauge compatibility, disruption response) to human operators under pressure. This framework models the entire problem using PDDL 2.1 temporal planning, generating timestamped, conflict-free operational plans that account for gauge constraints and stochastic disruptions such as blocked tracks or engine failures. Tested on 200 benchmark instances with up to 1,000 track points and 120 trains, it demonstrates practical viability for real-world railway systems seeking to reduce reliance on manual intervention during disruptions.

Authors: Pollob Chandra Ray, Sabah Binte Noor, Fazlul Hasan Siddiqui
Paper: https://arxiv.org/abs/2606.14582v1

3 min

Sensitivity Shaping for Latent Modeling

Jun 15 2026

Generative dynamics models let robots plan behavior in rich, uncertain environments, but deploying them safely requires reliably detecting when the robot is about to enter unfamiliar territory. Existing out-of-distribution detection methods bolt on detectors after the fact, and this paper shows why that fails: if the dynamics model is locally insensitive to different control inputs in critical regions, unsafe actions can produce latent predictions that look like safe ones, suppressing the alert. The proposed fix, control-sensitivity regularization during training, makes the model more discriminating in exactly the regions where it matters. Applications include safer robot navigation in unstructured environments, robotic manipulation, autonomous vehicle planning, and any deployment where catastrophic failure must be caught before execution.

Authors: Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao
Paper: https://arxiv.org/abs/2606.14585v1

3 min

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Jun 15 2026

Most AI failure research is theoretical or laboratory-based. This paper is a rare longitudinal postmortem of a real production LLM agent system running continuously since early 2026, with 22 documented incidents over eight weeks. The most dangerous failure class identified is "fail-plausible": the agent does not just fail to report an error, it transforms the error into a fluent, convincing narrative delivered to the user. The study finds that human observation catches about 70% of silent failures that tests and audits miss entirely, and that audit processes function as regression engines rather than predictive ones. The taxonomy and design principles derived are immediately actionable for anyone building or operating long-running autonomous AI systems.

Authors: Wei Wu
Paper: https://arxiv.org/abs/2606.14589v1

3 min

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Jun 15 2026

Audio AI models have gotten good at recognizing what they hear, but complex reasoning (understanding causation, context, and implication across sound, speech, and music) remains a frontier challenge. A key bottleneck is training data: existing datasets are highly redundant, meaning models see many acoustically similar samples that provide overlapping rather than additive learning signal. AudioDER builds a pipeline that first deduplicates audio by acoustic similarity, then generates chain-of-thought reasoning annotations using a large language model. The resulting 191,000-sample dataset consistently improves reasoning performance across multiple benchmarks. Applications include voice assistants that reason about complex audio scenes, medical audio analysis, accessibility tools, and any system requiring nuanced understanding of audio in context.

Authors: Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu
Paper: https://arxiv.org/abs/2606.14591v1
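
As a rough illustration of deduplication by acoustic similarity, the sketch below greedily drops clips whose embedding is too close to one already kept. The summary does not say how AudioDER measures similarity, so cosine similarity over precomputed, non-zero embedding vectors, the 0.95 threshold, and the names `cosine` and `greedy_dedup` are all assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_dedup(embeddings, threshold=0.95):
    """Hypothetical helper: return indices of clips to keep.

    A clip is kept only if its similarity to every previously kept clip
    is below `threshold` (an assumed value, not one from the paper).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: the second clip is a near-duplicate of the first.
clips = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = greedy_dedup(clips)
```

The greedy pass is quadratic in the number of kept clips; at the scale of a real dataset an approximate nearest-neighbor index would likely replace the inner loop.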

3 min

Regulating the Machine Contributor: Governance and Policy Alignment in Open Source

Jun 15 2026

AI agents can now autonomously plan changes, edit code, and submit pull requests, but open-source infrastructure was built around the assumption of a legally accountable human contributor who can attest to provenance and answer reviewers' questions. This paper systematically maps how six major open-source organizations (including Apache, Linux Foundation, and SymPy) have responded with contribution policies, then scores those policies against the EU AI Act, NIST AI RMF, and ISO frameworks. The result reveals fragmented, partially overlapping gaps that neither open-source policy nor AI regulation currently closes. Applications of this work include informing standardized AI contribution policies, guiding platform-level governance decisions at GitHub and GitLab, and shaping emerging regulatory frameworks for autonomous software agents.

Authors: Jassem Manita, Aziz Amari
Paper: https://arxiv.org/abs/2606.14594v1

3 min

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Jun 15 2026

Wearables generate a continuous stream of behavioral data (steps, screen time, sleep) that could power truly proactive health interventions, but it has been unclear which AI architectures best handle these signals across diverse populations and time horizons. This study benchmarks six deep learning models plus two foundation models across 800+ participants, tracking forecast accuracy out to eight days. Key findings: no single architecture dominates; the foundation model TimesFM matches trained models zero-shot; and personalized fine-tuning cuts error by 16–60%, with sleep benefiting most. Applications include preventive health apps, mental health monitoring, chronic disease management platforms, and research tools for digital health studies where population-level and individual-level accuracy both matter.

Authors: Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios
Paper: https://arxiv.org/abs/2606.14604v1

3 min

