arXiv cs.CV

https://arxiv.org/list/cs.CV/recent

Latest AI signals from arXiv cs.CV

DeepSignal tracks AI updates from arXiv cs.CV, filtering research and product signals into plain-English summaries, signal scores and source-linked article pages.

Current topics: Research, AI Image, Inference, Robotics, LLM · Companies: Meta

High-signal updates

RadarTwin: Scene-Specific mmWave Radar Simulation and Learning for Mobile Indoor Perception · 66 signal
RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning · 66 signal
Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs · 57 signal

arXiv cs.CV·Santosh Jaiswal

1d ago

Original

Zero-Label Driving Scenario Complexity Detection via Joint Embedding Predictive Architecture

AI Summary

The study introduces a Joint Embedding Predictive Architecture (JEPA) that autonomously detects driving scenario complexity without labels, achieving significant differentiation in complexity scores for various scenarios. The model demonstrated an Average Precision of 0.512 in anomaly detection, outperforming a baseline of 0.436, highlighting its potential in identifying critical driving situations.
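For readers unfamiliar with the metric quoted above, Average Precision can be sketched as follows. This is a minimal illustration of the standard ranking-based definition, not the paper's evaluation code; the function name and the toy labels and scores are assumptions made for the example.

```python
def average_precision(labels, scores):
    """Average Precision for binary labels (1 = anomaly) ranked by score.

    Illustrative helper, not taken from the paper: it averages the
    precision measured at the rank of each positive example.
    """
    # Rank examples from highest to lowest anomaly score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Toy data (hypothetical): two anomalies among four scenarios.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

A higher value means anomalous scenarios are ranked closer to the top of the list, which is how a score of 0.512 compares against a 0.436 baseline.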

Why Featured

The introduction of the Joint Embedding Predictive Architecture (JEPA) for zero-label driving scenario complexity detection is significant as it enhances the ability to autonomously assess critical driving situations, which is vital for improving safety in autonomous vehicles. This advancement can inform product development strategies and investment decisions in the growing field of AI-driven transportation technologies.

#Robotics #AI Assistant

arXiv cs.CV·Shanfeng Zhang, Bo Gou, Yue Cao, Lei Zhang, Zhang Yi, Tao He

1d ago

Original

DCSNet: Multiscale Feature Aggregation for Small Medical Object Segmentation with Detection-guided Hierarchical Cropping

AI Summary

DCSNet introduces a novel approach for small medical object segmentation, utilizing Detection-guided Hierarchical Cropping and Multiscale Feature Aggregation to enhance boundary precision. Extensive experiments show DCSNet significantly outperforms existing methods across three medical datasets, addressing class imbalance and edge degradation effectively.

Why Featured

DCSNet's novel approach for small medical object segmentation enhances boundary precision, addressing critical issues like class imbalance and edge degradation. This development is significant for builders and PMs in the healthcare AI space, as it could lead to more accurate diagnostic tools, while investors may see potential for improved market competitiveness in medical imaging technologies.

#AI Image #AI Assistant

arXiv cs.CV·Yunhun Nam, Jongheon Jeong

1d ago

Original

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

AI Summary

The ViPSy framework enhances Vision-Language Models (VLMs) by constructing preference pairs that are both policy-aligned and visually grounded, reducing hallucination rates by 35.7% on AMBER and 24.5% on Object HalBench. This approach improves visual grounding benchmarks and semantic segmentation, showcasing its effectiveness in mitigating hallucinations.

Why Featured

The ViPSy framework reduces hallucination rates in Vision-Language Models (VLMs) by 35.7% on AMBER and 24.5% on Object HalBench, enhancing their reliability for real-world applications. This advancement is crucial for builders and PMs focusing on deploying VLMs in products, as improved accuracy can lead to better user experiences and increased trust from investors.

#LLM #AI Assistant #Policy

arXiv cs.CV·Emily Bejerano, Federico Tondolo, Devang Gupta, Aaron Mano Cherian, Taeyoo Kim, Ayaan Qayyum, Xiaofan Yu, Xiaofan Jiang

1d ago

Featured · Original

RadarTwin: Scene-Specific mmWave Radar Simulation and Learning for Mobile Indoor Perception

AI Summary

RadarTwin is a novel framework that generates scene-specific mmWave radar training data using 3D reconstructions and , improving object recognition accuracy to 95.3% with minimal real data. This approach addresses the data scarcity issue in radar perception, enabling effective training before real data collection.

Why Featured

RadarTwin's ability to generate scene-specific mmWave radar training data significantly lowers the barrier to entry for companies developing indoor perception systems, allowing them to achieve high object recognition accuracy with minimal real-world data collection. This innovation can accelerate product development timelines and reduce costs, making it a compelling opportunity for builders, PMs, and investors in the AI and robotics sectors.

#AI Coding #Inference #Open Source

arXiv cs.CV·Jianlong Xiong, ChuanBo Xie, Le Yu, Quansong He, Tao He

1d ago

Original

Enhancing Layer Interaction Using Key-Correlated Layer Attention

AI Summary

Key-Correlated Layer Attention (KCLA) improves inter-layer interactions in neural networks by achieving linear computational complexity while maintaining dynamic information updates. This novel approach enhances long-range cross-layer connections and has shown strong performance in tasks like image recognition and medical image segmentation.

Why Featured

The development of Key-Correlated Layer Attention (KCLA) allows for efficient inter-layer interactions in neural networks with linear computational complexity, which can significantly enhance performance in applications like image recognition and medical segmentation. Builders and PMs should consider integrating KCLA to improve model efficiency and effectiveness, while investors may find opportunities in startups leveraging this technology.

#LLM #AI Image

arXiv cs.CV·L. A. Muñoz

1d ago

Original

GPU-Accelerated Inverse Structural Anastylosis from Block Collapse Dynamics

AI Summary

The Jenga Inverse Predictor (JIP-2) is a GPU-accelerated deep learning framework that reconstructs collapsed architectural structures using a physics engine and dual-stream ResNet-18 model. It predicts block removal probabilities and generates a 3D video of the reconstruction process, enhancing conservation efforts at sites like Uxmal, Yucatan.

Why Featured

The development of the Jenga Inverse Predictor (JIP-2) enables builders and project managers to assess and restore collapsed structures with greater accuracy and efficiency, potentially reducing costs and time in conservation projects. For investors, this technology represents a novel application of AI in heritage conservation, opening opportunities in both construction and preservation markets.

#Robotics #GPU #AI Video #AI Image

arXiv cs.CV·Shanwen Wang, Xin Sun, Sirui Wang, Xiao Xiang Zhu

1d ago

Original

RSGPNet: Geometric Prompting for Remote Sensing Open-Vocabulary Semantic Segmentation

AI Summary

RSGPNet introduces a training-free geometric prompting framework for open-vocabulary semantic segmentation in remote sensing, significantly enhancing segmentation accuracy through a novel combination of text-guided coarse masks, geometric re-prompting, and consistency verification. Extensive experiments show RSGPNet outperforms existing methods in both quantitative and qualitative metrics.

Why Featured

The introduction of RSGPNet, a training-free geometric prompting framework for open-vocabulary semantic segmentation, enhances segmentation accuracy in remote sensing applications. This development signals a shift towards more efficient AI models that can adapt to diverse datasets without extensive retraining, making it attractive for builders and PMs focused on scalable solutions and investors seeking innovative technologies in AI.

#Open Source #AI Image

arXiv cs.CV·Di Hu, Xia Yuan, Chunxia Zhao

1d ago

Original

GeoISF: Instance Semantic Forest Inspired Large-Scale Cross-View Geo-Localization via Ground LiDAR-to-Satellite Image

AI Summary

GeoISF introduces a novel large-scale LiDAR-to-image geo-localization pipeline that significantly enhances cross-view localization accuracy, achieving 13.22 times better performance than existing methods on the KITTI dataset. By utilizing an instance semantic forest for improved semantic representation, it effectively bridges the modality gap between point clouds and satellite images. The code will be released as an open-source resource for the research community.

Why Featured

The introduction of GeoISF, which enhances cross-view geo-localization accuracy by 13.22 times using a novel LiDAR-to-image pipeline, signals a significant advancement in geospatial technologies. This development is crucial for builders and PMs in sectors like autonomous vehicles and urban planning, as it can improve location-based services and decision-making processes.

#Open Source #AI Image #AI Search

arXiv cs.CV·Chenyang Zhang, Changwang Liu, Jinqi Zhu, Jiayi Chang, Yuxuan Wang, Shuqing He, Jia Guo

1d ago

Original

Semantic-Aware Generative Image Transmission for Resource-Constrained Visual IoT Systems

AI Summary

The paper presents a semantic-aware generative image transmission framework for resource-constrained visual IoT systems, achieving a bitrate of 0.074 bpp with 29.9 dB PSNR, significantly improving efficiency over existing methods. By utilizing a VQ encoder and MaskGIT for token recovery, it effectively balances quality and bandwidth, outperforming traditional approaches by preserving task-relevant objects better than random masking.
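The two figures in this summary, bits per pixel (bpp) and PSNR, follow standard definitions that can be sketched as below. This is an illustrative calculation only; the helper names, the 8-bit peak value and the 256x256 image size are assumptions, not details taken from the paper.

```python
import math


def bits_per_pixel(num_bits, height, width):
    """Bitrate in bits per pixel: transmitted bits divided by pixel count."""
    return num_bits / (height * width)


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    max_value is the peak pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical example: a 256x256 image sent with 4,850 bits is about 0.074 bpp.
example_bpp = bits_per_pixel(4850, 256, 256)
```

Lower bpp means less bandwidth per image, while higher PSNR means the reconstruction is closer to the original, so the two numbers together describe the quality and bandwidth trade-off the summary refers to.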

Why Featured

The development of a semantic-aware generative image transmission framework for resource-constrained IoT systems is significant as it enhances image quality while reducing bandwidth requirements. This advancement allows builders and PMs to deploy more efficient visual IoT applications, potentially lowering costs and improving user experience, while investors can see opportunities in optimizing IoT infrastructure.

#Inference #Robotics #AI Image

arXiv cs.CV·Md Irtiza Hossain, Humaira Ayesha, Junaid Ahmed Sifat

1d ago

Original

CLEAR-MoE: Shared-Basis Expert Extraction from Frozen Vision Transformers via Calibration-Driven Layer Selection

AI Summary

CLEAR-MoE introduces a four-phase pipeline to convert frozen Vision Transformers into sparse Mixture-of-Experts models, achieving 99.9% accuracy retention on Imagenette with DeiT-Small. The method utilizes shared low-rank SVD bases and lightweight routers, demonstrating minimal performance variation across different configurations. However, it incurs a 1.3-1.7x speed overhead compared to dense implementations due to routing complexities.

Why Featured

The development of CLEAR-MoE, which enables the conversion of frozen Vision Transformers into sparse Mixture-of-Experts models while retaining high accuracy, is significant for builders and PMs as it offers a way to optimize model efficiency without sacrificing performance. For investors, this innovation highlights the potential for advancements in AI model deployment, balancing speed and accuracy in real-world applications.

#LLM #Robotics #AI Image

arXiv cs.CV·Saeid Arabzadeh, Farshad Almasganj, Mohammad Mahdi Ahmadi

1d ago

Original

Memory-Augmented LSTM Autoencoder for Unsupervised Activity Recognition with IMU Sensor Fusion

AI Summary

The proposed memory-augmented LSTM autoencoder framework achieves 96.6% and 98.4% accuracy on DaLiAc and PAMAP2 datasets, respectively, outperforming both supervised and unsupervised methods in unsupervised human activity recognition using IMU sensor fusion. This approach effectively captures spatiotemporal dependencies despite challenges like noisy data and overlapping activities.

Why Featured

The development of a memory-augmented LSTM autoencoder that achieves over 96% accuracy in unsupervised human activity recognition using IMU sensor fusion is significant for builders and PMs as it enhances the potential for real-time, accurate activity tracking in various applications, from health monitoring to smart environments. For investors, this advancement signals a growing market for AI-driven solutions that can effectively handle complex, noisy data in dynamic settings.

#Robotics #AI Image

arXiv cs.CV·Faisal Altawijri, Ismail Mathkour

1d ago

Original

SoccerNet 2026 Player-Centric Ball Action Spotting: Per-Player Attention with Agreement-Based Ensembling

AI Summary

The SoccerNet 2026 submission introduces a two-stage pipeline for player-centric ball action spotting, achieving a Macro-F1 score of 58.94, up from a baseline of 48.6. Key innovations include a Track-Aware Action Detector (TAAD) enhanced with a temporal transformer and a Denoising Sequence Transduction (DST) transformer employing a novel per-player attention mechanism. The ensemble approach effectively reduces false positives while maintaining recall.
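As a rough guide to the headline metric, Macro-F1 is the unweighted mean of per-class F1 scores. The sketch below shows the plain classification version under that assumption; the challenge's own scoring may add details such as temporal tolerance windows, and the function name and toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy action labels (hypothetical).
truth = ["pass", "pass", "shot", "shot"]
preds = ["pass", "shot", "shot", "shot"]
toy_macro_f1 = macro_f1(truth, preds)
```

Because every class counts equally, rare actions weigh as much as common ones, which is why reducing false positives on infrequent events can move the score noticeably.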

Why Featured

The introduction of the Track-Aware Action Detector (TAAD) and Denoising Sequence Transduction (DST) transformer in SoccerNet 2026 significantly improves player-centric ball action spotting accuracy, as evidenced by a Macro-F1 score increase to 58.94. This advancement highlights the potential for enhanced analytics and real-time insights in sports tech, which can attract investment and drive product development in AI-driven sports applications.

#Inference #AI Video #AI Image

arXiv cs.CV·Wistan Marchadour, Pedro Soto Vega, Franck Vermet, Mathieu Hatt

1d ago

Original

Few-class Fidelity: Evaluating Explanations of Real-conditions CNN classifiers with Optimized Perturbations

AI Summary

This paper introduces a Fidelity-based XAI metric variation tailored for low-class real-world CNN applications, generating uncertainty-provoking perturbations for accurate evaluation. It demonstrates the framework's effectiveness by comparing it with human-centric metrics in medical and natural imaging, revealing the complex interplay between domain, data curation, and XAI solutions.

Why Featured

The introduction of a Fidelity-based XAI metric for low-class CNN applications allows builders and PMs to better evaluate model explanations in real-world scenarios, particularly in critical fields like healthcare. This development can lead to improved trust and transparency in AI systems, which is crucial for investors looking to support responsible AI technologies.

#AI Image #Policy

arXiv cs.CV·Jiasheng Wang, Tanun Jitwatcharakomol, Piyawadee Jongpradubgiat, Simeng Zhu

1d ago

Featured · Original

RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

AI Summary

RADIANT-PET integrates a voxel-level segmentation model with a large language model for enhanced PET/CT lesion classification, significantly reducing false positives. The framework outperforms traditional methods, especially when radiology reports are included, demonstrating improved lesion detection and clinical alignment.

Why Featured

The development of RADIANT-PET, which combines voxel-level segmentation with large language models for PET/CT lesion classification, is significant as it reduces false positives and enhances clinical alignment. Builders and PMs can leverage this technology to improve diagnostic accuracy in healthcare applications, while investors may see potential for growth in AI-driven medical imaging solutions.

#LLM #AI Coding #Inference #AI Assistant

arXiv cs.CV·Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

1d ago

Original

Transition-Aware best-of-N sampling for Longitudinal Chest X-ray Reports

AI Summary

The study introduces a training-free, transition-aware best-of-N sampling method for chest X-ray report generation, outperforming random selection, especially in the Impression section. Utilizing four directional set distances, it enhances the accuracy of report generation by leveraging longitudinal patient data across multiple visits.

Why Featured

The introduction of a training-free, transition-aware best-of-N sampling method for chest X-ray report generation enhances accuracy by utilizing longitudinal patient data. This development signals a shift towards more efficient and reliable AI solutions in healthcare, which can attract investment and inform product strategies for builders and PMs focused on medical AI applications.

#Inference #AI Image

arXiv cs.CV·Hao Li, Chen Chu, Filip Biljecki, Cyrus Shahabi, Wenwen Li

1d ago

Original

Automated Quality Assessment of Geospatial Vector Data: A GeoAI Approach using Spatial Representation Learning

AI Summary

Topo4Vec is an automated GeoAI framework for scalable quality assessment of geospatial vector data, achieving 0.99 accuracy in detecting overlapping building footprints and 0.60 for street network errors. It utilizes Spatial Representation Learning to isolate topological errors, addressing challenges in diverse urban morphologies and large data volumes. The framework demonstrates effectiveness across Los Angeles, Munich, and Singapore.

Why Featured

The development of Topo4Vec, an automated GeoAI framework for quality assessment of geospatial vector data, is significant for builders, PMs, and investors as it enhances accuracy in urban planning by efficiently detecting topological errors. This can lead to reduced project costs and improved decision-making in complex urban environments, ultimately fostering better infrastructure development.

#Robotics #AI Image #AI Search

arXiv cs.CV·Shaoxuan Li, Xiangyu Dong, Xiaoguang Ma, Junfeng Chen, Haoran Zhao, Yaoming Zhou

1d ago

Original

CLOSER-VLN: Closed-Loop Self-Verified Retrieval-Augmented Reasoning for Aerial Vision-Language Navigation

AI Summary

The CLOSER-VLN framework introduces a closed-loop self-verified retrieval-augmented reasoning method for aerial vision-language navigation, achieving a 32.01% success rate (SR) and 21.28% success weighted by path length (SPL) on the CityNav benchmark. This approach addresses critical errors in action execution by incorporating reliability verification and targeted retrieval, enhancing navigation performance in unseen environments without task-specific training.
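The SR and SPL figures can be read against the commonly used navigation definitions, sketched below. This assumes the standard formulation in which each successful episode is credited with the ratio of shortest path length to the path actually taken; whether the benchmark applies exactly this form is not stated in the summary, and the episode tuples are made up for illustration.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by path length.

    Each episode is (success, shortest_path_length, agent_path_length).
    A successful episode scores shortest / max(agent, shortest); a failed
    episode scores 0. The result is the mean over all episodes.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: one efficient success, one wasteful success, one failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_sr = success_rate(toy_episodes)
toy_spl = spl(toy_episodes)
```

SPL can never exceed SR, so a gap between the two (as in 32.01% versus 21.28%) indicates that some successful runs took longer routes than necessary.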

Why Featured

The introduction of the CLOSER-VLN framework, which achieves a 32.01% success rate in aerial vision-language navigation, signifies a major advancement in autonomous navigation systems. For builders and PMs, this development highlights the potential for improved reliability in navigation technologies, while investors should note its implications for applications in robotics and drone technology in complex environments.

#Robotics #AI Image

arXiv cs.CV·Marija Pizurica, Eric Zimmermann, Neil Tenenholtz, James Hall, Olivier Gevaert, Ava P. Amini, Lorin Crawford, Kristen A. Severson

1d ago

Original

JASPR: Joint Spatial Representation learning of histology and spatial genomics for improved virtual genomic screening and clinical prognostication

AI Summary

JASPR is a self-supervised deep learning framework that integrates hematoxylin and eosin (HE) images with spatial transcriptomics (ST) data, enhancing predictions of 9,248 genes in breast cancer. By learning joint representations and incorporating spatial context, JASPR significantly improves prognostic outcomes compared to traditional methods.

Why Featured

The development of JASPR, a self-supervised deep learning framework that integrates HE images with spatial transcriptomics, enhances breast cancer prognostication by improving gene prediction accuracy. This innovation signals potential advancements in personalized medicine and could attract investment in AI-driven healthcare solutions, making it relevant for builders and PMs in the biotech sector.

#AI Coding #Inference #AI Image

arXiv cs.CV·Can Demircan, Marcel Binz, Alireza Modirshanechi, Eric Schulz

1d ago

Original

Meta-learning as a principle for human-like visual representations

AI Summary

This study proposes that human-like visual representations in neural networks can be enhanced through meta-learning, allowing models to adapt to new tasks with minimal data. By training a sequence model on diverse tasks, the authors found that meta-learned representations outperform pretrained encoders in predicting human similarity judgments and learning semantic rules, highlighting the importance of flexibility in visual processing.

Why Featured

The development of meta-learning to enhance human-like visual representations in neural networks is significant for builders and PMs as it enables models to adapt quickly to new tasks with limited data, improving efficiency in AI applications. For investors, this innovation suggests a potential for more versatile AI solutions that can better meet diverse user needs, increasing market competitiveness.

#LLM #AI Image #AI Assistant

arXiv cs.CV·Ce Chen, Congrui Wang, Yonglin Li, Zhenchen Wan, Mingyang Geng, Junhao Xiao, Zhengpeng Xing, Yaqing Hu, Yao Wu, Zhaoyang Qu, Long Lan, Xinwang Liu, Yingqi Peng, Shijia Li, Zufeng Zhang, Chen Ma, Jingjing Zhou, Xingyu Wang, Qilin Lu, Bin Jiang, Qilin Sun, Shanzhi Gu, Yaoguang Jin, Tongliang Liu, Kede Ma, Yifan Peng

1d ago

Original

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

AI Summary

JuZhou 1.0 is an ultra-lightweight text-to-image model, trained entirely on Chinese AI accelerators, achieving a GenEval score of 0.69 with only 0.387B parameters. It enables efficient on-device execution for mobile applications, outperforming larger models like SDXL and IF-XL while maintaining low latency and cost.

Why Featured

The development of JuZhou 1.0, the first edge-native text-to-image model trained on Chinese AI accelerators, signifies a shift towards more efficient solutions. This allows builders and PMs to leverage advanced image generation capabilities in mobile applications with reduced latency and cost, making it a compelling option for investors focused on scalable AI technologies.

#GPU #Open Source #AI Image

arXiv cs.CV·Jia-Wei Liao, Li-Xuan Peng, Mei-Heng Yueh, Min Sun, Cheng-Fu Chou, Jun-Cheng Chen

1d ago

Original

DiffRGD: An Inference-Time Diffusion Guidance Through Riemannian Gradient Descent

AI Summary

DiffRGD introduces a distribution-aware guidance framework for diffusion models, preserving latent Gaussian structures during inference. It formulates sampling as a constrained optimization problem on a spherical manifold, outperforming previous methods in image restoration and conditional generation tasks. The method is plug-and-play, enhancing pre-trained models without retraining.
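To give a feel for what optimization on a spherical manifold involves, here is a generic Riemannian gradient descent step on a sphere: project the gradient onto the tangent space, take a step, then rescale back to the sphere. This is a textbook sketch, not the DiffRGD algorithm; the quadratic objective, step size and radius are assumptions chosen only to keep the example small.

```python
import math


def riemannian_step(x, grad, step_size, radius):
    """One gradient step constrained to the sphere of the given radius."""
    sq_norm = sum(v * v for v in x)
    # Remove the component of the gradient that points along x (tangent projection).
    coeff = sum(g * v for g, v in zip(grad, x)) / sq_norm
    tangent = [g - coeff * v for g, v in zip(grad, x)]
    # Step in the tangent direction, then retract by rescaling to the sphere.
    moved = [v - step_size * t for v, t in zip(x, tangent)]
    norm = math.sqrt(sum(v * v for v in moved))
    return [radius * v / norm for v in moved]


def minimize_on_sphere(target, x0, radius=1.0, steps=200, step_size=0.1):
    """Minimize 0.5 * ||x - target||^2 subject to ||x|| = radius (toy objective)."""
    x = list(x0)
    for _ in range(steps):
        grad = [a - b for a, b in zip(x, target)]  # Euclidean gradient
        x = riemannian_step(x, grad, step_size, radius)
    return x


# The closest point on the unit sphere to (3, 4, 0) is (0.6, 0.8, 0).
solution = minimize_on_sphere([3.0, 4.0, 0.0], [1.0, 0.0, 0.0])
```

Every iterate keeps the same norm, which is the property that matters for guidance: samples from a high-dimensional standard Gaussian concentrate near a sphere, so staying on it keeps the latent consistent with what a pre-trained diffusion model expects.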

Why Featured

The introduction of DiffRGD enhances diffusion models by enabling better image restoration and conditional generation without the need for retraining, which is crucial for builders and PMs looking to integrate advanced AI capabilities efficiently. For investors, this development signals a potential for improved product offerings and competitive advantages in the AI-driven market.

#Inference #AI Image

arXiv cs.CV·Teerath Kumar, Raja Vavekanand, Muhammad Turab

1d ago

Original

MedDiffuseMix: Preserving Diagnostic Evidence with Saliency-Aware Diffusion Medical Image Data Augmentation

AI Summary

MedDiffuseMix introduces a saliency-aware diffusion mixing framework for medical image augmentation, enhancing classification accuracy across four benchmarks. It outperforms standard methods, improving F1-scores and ROC AUC metrics by preserving diagnostically salient regions while minimizing semantic distortion.

Why Featured

The introduction of MedDiffuseMix, a saliency-aware diffusion framework for medical image augmentation, significantly enhances classification accuracy in medical diagnostics by preserving critical diagnostic features. This development is crucial for builders and PMs in healthcare AI, as it can lead to more reliable diagnostic tools, while investors may see potential for improved market competitiveness and better patient outcomes.

#AI Coding #AI Image

arXiv cs.CV·Maher Boughdiri, Mounira Msahli, Albert Bifet

1d ago

Original

AEGIS: A Semantic GAN and Evidential Learning Framework for Robust Adversarial Detection in Vision Sensors

AI Summary

AEGIS introduces a robust adversarial detection framework utilizing a SemantiGAN module and Evidential Deep Learning, achieving an AUROC of 92.1% and outperforming traditional detectors on the Tiny ImageNet dataset. The framework effectively filters adversarial inputs and provides calibrated uncertainty estimates, enhancing image classification in vision sensor networks.

Why Featured

The development of AEGIS, a robust adversarial detection framework utilizing SemantiGAN and Evidential Deep Learning, is significant as it enhances the reliability of image classification in vision sensor networks, achieving a high AUROC of 92.1%. This advancement is crucial for builders and PMs focused on deploying secure AI applications, while investors should note its potential to improve product safety and trustworthiness in AI-driven systems.

#Inference #Robotics #AI Image

arXiv cs.CV·Xiao Song, Haonan Qin, Zhaoxu Zhang, Jiong Zhang, Yuqi Fang, Caifeng Shan

1d ago

Original

Detecting Clinical Hallucinations in LVLMs via Counterfactual Visual Grounding Uncertainty

AI Summary

A new framework for detecting hallucinations in large vision-language models (LVLMs) enhances clinical image understanding by using visual evidence grounding. This method employs a counterfactual entity perturbation technique to improve detection accuracy, achieving better performance than recent baselines across various medical imaging modalities. The approach offers interpretable localization evidence and strong cross-model transferability.

Why Featured

The development of a framework for detecting hallucinations in LVLMs through counterfactual visual grounding is significant for builders and PMs in healthcare AI, as it enhances the reliability of clinical image analysis. For investors, this advancement indicates a growing market potential for AI tools that provide interpretable and accurate medical insights, reducing risks in clinical decision-making.

#LLM #Robotics #AI Image

arXiv cs.CV·Krzysztof Pysz, Artur Bartczak, Jarosław Kwiecień, Piotr Krajewski, Witold Dyrka

2d ago

Original

Distribution-based deep multiple instance learning for tumor proportion scoring in NSCLC

AI Summary

The proposed distribution-based deep multiple instance learning (MIL) framework enhances tumor proportion scoring (TPS) in non-small cell lung cancer (NSCLC) by employing a two-model approach: an embedding-extraction network and a MIL model for predicting zero-inflated beta parameters. This method significantly surpasses traditional linear and ridge regression models in accuracy and explainability, addressing challenges in annotating histopathological images.

Why Featured

The development of a distribution-based deep multiple instance learning framework for tumor proportion scoring in NSCLC represents a significant advancement in medical imaging analysis. For builders and PMs, this technology could streamline the integration of AI in diagnostic tools, while investors may see potential for improved patient outcomes and cost efficiencies in healthcare applications.

#AI Image #AI Assistant

arXiv cs.CV·Hana Kim, Minje Kim, Tae-Kyun Kim

2d ago

Original

CoIn: Comprehensive 2D-3D Inpainting with Gaussian Splatting Guidance

AI Summary

CoIn introduces a multi-stage framework for 2D-3D inpainting, utilizing Gaussian Splatting for enhanced scene reconstruction. It achieves state-of-the-art performance in both object removal and insertion tasks, leveraging a diffusion model and adaptive feature attention for consistency across views.

Why Featured

The introduction of CoIn, a multi-stage framework for 2D-3D inpainting using Gaussian Splatting, significantly enhances scene reconstruction capabilities. This development is crucial for builders and PMs in industries like gaming and virtual reality, as it allows for more realistic object integration and manipulation, potentially driving innovation and investment opportunities in immersive technologies.

#AI Video #AI Image

arXiv cs.CV·Liu Yu, Can Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Gillian Dobbie

2d ago

Original

Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

AI Summary

The paper introduces Fox, a novel inference-time framework that addresses hallucination in Large Vision-Language Models (LVLMs) by diagnosing structural misalignment and severing risky shortcuts. Fox outperforms the previous state-of-the-art method, SID, by 29.1% while maintaining linguistic richness, showcasing its effectiveness in enhancing model reliability.

Why Featured

The introduction of the Fox framework for Large Vision-Language Models (LVLMs) significantly reduces hallucinations by improving structural alignment, outperforming the previous state-of-the-art by 29.1%. This advancement is crucial for builders and PMs as it enhances model reliability, leading to more trustworthy applications, while investors can recognize its potential for commercial viability in AI-driven products.

#LLM #Inference

arXiv cs.CV·Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Daniel L. K. Yamins

2d ago

Original

Perceptual 3D Simulation With Physical World Modeling

AI Summary

P3Sim is a novel physical world modeling system that predicts future scene states from partial observations and incomplete 3D transformations. It integrates a learned world model, geometric conditioning, and persistent memory to enhance generalization across various 3D tasks, including novel view synthesis and dynamic scene prediction. This approach aims to improve 3D scene understanding in robotics and computer vision.

Why Featured

The development of P3Sim, a novel physical world modeling system, enhances 3D scene understanding by predicting future states from incomplete data. This advancement is crucial for builders and PMs in robotics and computer vision, as it enables more robust applications in navigation and interaction with dynamic environments, thereby attracting investor interest in scalable AI solutions.

#Robotics #AI Image

arXiv cs.CV·Yiwen Yan, Wanning He, Yu-Wing Tai

2d ago

Original

Beyond MoCap: Scaling Motion Tokenizers with Synthetic Human Motion for Generative Modeling

AI Summary

This study introduces a framework that enhances motion generation by integrating large-scale synthetic human motion with a redesigned VQ-VAE tokenizer, significantly improving the diversity and compositionality of learned motion vocabularies. The approach demonstrates consistent performance gains in tasks like text-to-motion and motion continuation, indicating that expanding the motion representation space is crucial for better generalization in human motion synthesis.

Why Featured

The introduction of a redesigned VQ-VAE tokenizer for motion generation, combined with large-scale synthetic human motion, significantly enhances the diversity of motion vocabularies. This development is crucial for builders and PMs in the gaming and animation sectors, as it opens up new possibilities for more realistic and varied character animations, which can lead to improved user engagement and satisfaction.

#Inference #AI Video #AI Image

arXiv cs.CV·Duarte Leão, Diogo Pereira Araújo, Catarina Barata, Carlos Santiago

2d ago

Original

Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification

AI Summary

The vMFProto framework introduces a mixture of von Mises-Fisher components for classifying images, enhancing interpretability by addressing intra-class variability. It achieves state-of-the-art explanation quality on benchmarks like CUB-200-2011 and Stanford Dogs while maintaining competitive accuracy through a two-stage training process.

Why Featured

The introduction of the vMFProto framework for image classification enhances interpretability while maintaining competitive accuracy, which is crucial for builders and PMs focusing on AI solutions that require transparency in decision-making. Investors should note this development as it signals a growing demand for interpretable AI models that can be trusted in critical applications.

#Inference #Open Source #AI Image
