https://arxiv.org/list/cs.CV/recent
SToRe3D enhances ViT-based 3D object detection by improving inference speed through relevance-aligned sparsity.
SToRe3D's relevance-aligned sparsity speeds up ViT-based 3D object detection, which matters to developers and PMs optimizing inference performance and to investors interested in scalable AI solutions.
This paper presents a robust real-time catheter tip tracking system for autonomous navigation in fluoroscopy.
This advancement in real-time catheter tracking enhances precision in medical procedures, signaling opportunities for developers in healthcare AI and attracting PMs and investors focused on innovative medical technologies.
DUET is a dual-paradigm framework enhancing spatial transcriptomics prediction using single-cell inductive priors.
DUET's innovative framework for spatial transcriptomics prediction signals a significant advancement in data analysis techniques, offering developers and PMs new tools for precision medicine and attracting investor interest in biotech innovations.
A lightweight U-Net architecture achieves high-resolution face reconstruction using YOLO-World landmark heatmaps for supervision.
This advancement in lightweight U-Net for face super-resolution signals a shift towards more efficient AI models, crucial for developers and PMs focusing on real-time applications and investors looking for scalable solutions.
This paper shows off-the-shelf embeddings are sufficient for few-shot learning without extensive fine-tuning.
This research indicates that developers can leverage existing embeddings for efficient few-shot learning, reducing the need for extensive fine-tuning, which is crucial for faster deployment and cost-effectiveness.
A new model unifies pix and word tokens for improved generative language and visual understanding.
This model's integration of visual and textual tokens enhances multi-modal applications, signaling potential for developers to create richer AI experiences and for investors to capitalize on emerging technologies.
DeFakerOne is a unified model for fake image detection and localization, outperforming existing benchmarks.
The DeFakerOne model enhances image authenticity verification, crucial for developers and PMs in content moderation, while offering investors insights into advancements in AI-driven trust and security technologies.
IFGNet enhances hyperspectral and LiDAR data fusion using Kolmogorov-Arnold Networks for improved accuracy.
IFGNet's advancement in hyperspectral and LiDAR data fusion using Kolmogorov-Arnold Networks offers developers and PMs a new tool for enhancing data accuracy, crucial for AI-driven applications.
CineMesh4D enables personalized 4D whole-heart reconstruction from sparse cine MRI using a novel pipeline.
CineMesh4D's ability to reconstruct personalized 4D heart models from sparse MRI data signals advancements in medical imaging AI, which can enhance diagnostic tools and patient-specific treatments for developers and investors.
cGANs enable effective computational staining and destaining of pathology images with preprocessing adaptation.
This advancement in generative deep learning enhances image processing in pathology, offering developers and PMs new tools for medical imaging, while investors can leverage improved diagnostic capabilities in healthcare technology.
ProtoMedAgent enhances clinical interpretability by integrating multimodal reporting with privacy-aware workflows.
ProtoMedAgent's integration of multimodal reporting with privacy-aware workflows signals a significant advancement in clinical interpretability, crucial for developers and PMs in healthcare AI and investors seeking innovative solutions.
The paper presents a novel method for 3D crowd reconstruction using contrastive multi-modal hypergraph reasoning.
This novel method enhances 3D crowd reconstruction, offering developers and PMs new tools for immersive applications and investors insights into advanced AI-driven solutions in computer vision.
PVRF is a unified framework for effective adverse weather removal in images using advanced perception and flow techniques.
PVRF's advanced framework for adverse weather removal can enhance image processing applications, offering developers and PMs a competitive edge while attracting investors interested in innovative visual technology.
This work enhances image restoration using dynamic resolution diffusion models to improve efficiency and fidelity.
This advancement in dynamic resolution diffusion models signals improved efficiency and fidelity in image restoration, crucial for developers and PMs focused on enhancing visual quality in applications.
CurveBench is a benchmark for evaluating topological reasoning from images of nested Jordan curves.
CurveBench offers developers and researchers a standardized method to assess topological reasoning in AI, enabling improved algorithms for image analysis and enhancing applications in computer vision.
PanoPlane enables high-fidelity indoor view synthesis using 360° panoramic completion without training.
PanoPlane's ability to synthesize high-fidelity indoor views without training signals a breakthrough for developers and PMs in creating immersive applications, while investors see potential for innovative solutions in 3D visualization.
CreFlow introduces a corrective reflow framework for enhancing video generation in reinforcement learning.
CreFlow's corrective reflow framework enhances video generation in reinforcement learning, signaling improved efficiency and quality in AI-driven content creation for developers, PMs, and investors.
A hardware-aware framework evolves layer-specific functions for efficient Vision Transformer deployment.
This development signals a shift towards optimizing AI models for specific hardware, enhancing efficiency and performance, which is crucial for developers and investors focused on scalable AI solutions.
A two-tier edge-cloud architecture enhances diabetic retinopathy screening in rural areas by reducing cloud dependency.
This architecture reduces latency and dependency on cloud resources, enabling developers and PMs to innovate in rural healthcare solutions while attracting investors interested in scalable tech for underserved markets.
TeDiO enhances temporal coherence in video diffusion models without training, improving motion stability and visual quality.
TeDiO's training-free approach to enhance video diffusion models signals a significant advancement in motion stability, offering developers and PMs a new tool for improving visual quality in video applications.
PhyMotion introduces a structured reward for evaluating realistic human motion in video generation.
PhyMotion's structured reward enhances realism in human video generation, giving developers and PMs a more advanced evaluation method for improving AI models, while investors may see potential for innovative applications in media.
The study addresses concept omission in MM-DiTs by introducing Omission Signal Intervention to enhance image generation.
This research introduces a method to improve multimodal diffusion transformers (MM-DiTs), giving developers and PMs a way to reduce omitted concepts in generated images, which may attract investor interest in advanced AI applications.
A landmark-guided approach enhances MRI brain segmentation accuracy by mimicking manual protocols.
This advancement in MRI segmentation can significantly improve the accuracy of brain imaging, providing developers and PMs with better tools and investors with promising applications in healthcare technology.
Massive activations in Diffusion Transformers critically shape image semantics and enable effective prompt interpolation.
This research highlights the importance of massive activations in Diffusion Transformers, guiding developers and PMs in optimizing image generation and prompting strategies, while investors can identify potential advancements in AI-driven visual technologies.
CoReDiT enhances Diffusion Transformers by optimizing token pruning for efficiency and quality.
CoReDiT's optimization of token pruning in Diffusion Transformers signals improved efficiency and quality, crucial for developers and PMs focusing on resource management and performance in AI applications.
The study reveals that prefill is crucial for GUI grounding in VLMs, proposing a new method to enhance candidate selection.
This research highlights the importance of prefill in vision-language models, prompting developers and PMs to refine GUI grounding techniques for improved user interface interactions.
3D geometric primitives enhance spatial reasoning in vision-language models through innovative benchmarks and techniques.
The integration of 3D primitives in vision-language models signals a significant advancement in spatial reasoning, offering developers and PMs new benchmarks for enhancing AI capabilities and attracting investor interest in innovative applications.
M3Net is a hierarchical 3D network for improved pulmonary nodule classification using multi-scale contextual information.
M3Net enhances pulmonary nodule classification accuracy, signaling a significant advancement in AI-driven medical diagnostics that developers and investors should leverage for healthcare applications.
MMCL-Bench is a benchmark for multimodal context learning from visual evidence and rules.
MMCL-Bench provides a new benchmark for developers and PMs to enhance AI's understanding of multimodal contexts, crucial for building more intuitive applications, while investors can identify opportunities in advanced AI capabilities.
MambaPanoptic introduces a Mamba-based framework for efficient panoptic segmentation with improved feature representation.
MambaPanoptic's efficient panoptic segmentation framework enhances feature representation, signaling a significant advancement for developers and PMs in computer vision applications, attracting investor interest in cutting-edge AI technologies.
DistractMIA introduces a black-box method for membership inference in vision-language models using semantic distraction.
DistractMIA highlights a new vulnerability in vision-language models, prompting developers and PMs to strengthen privacy measures and investors to consider security implications in AI investments.
LAMP enhances diffusion posterior sampling with lagged temporal corrections for improved image restoration.
LAMP's advancements in diffusion posterior sampling signal improved image restoration techniques, offering developers and PMs innovative tools and investors potential for enhanced product capabilities and market competitiveness.
SSDA enhances time series forecasting by bridging spectral and structural gaps in large vision models.
SSDA's approach to bridging spectral and structural gaps in vision models can significantly improve time series forecasting accuracy, which is crucial for developers and PMs in predictive analytics.
CROP reformulates aesthetic image cropping as a multimodal reasoning task to align with expert preferences.
CROP's multimodal approach to image cropping enhances developers' tools, PMs' product strategies, and investors' insights into AI-driven creative applications, signaling a shift towards expert-aligned design in visual content.
The A2A framework enhances ultrasound image denoising at test time using self-contrastive learning.
This framework denoises ultrasound images at test time, pointing to possible use in real-time medical imaging for developers and investors in healthcare technology.
M2Retinexformer enhances low-light images by integrating depth, luminance, and semantic features in a refined pipeline.
M2Retinexformer's innovative approach to low-light image enhancement signals a new opportunity for developers and PMs to improve user experience in applications relying on visual data.
Inline Critic enhances image editing by refining model predictions during the forward pass.
Inline Critic's ability to refine model predictions in real-time improves image editing efficiency, signaling a shift towards more interactive AI tools that developers, PMs, and investors should leverage.
VideoSEAL addresses evidence misalignment in long video understanding by decoupling planning from answer authority.
VideoSEAL's approach to decoupling planning from answer authority enhances long video understanding, providing developers and PMs with a robust framework for building more reliable AI systems.
FRAME enhances image manipulation detection through adaptive multi-path evidence fusion.
FRAME's advanced detection methods empower developers and PMs to build more reliable image verification tools, while investors can spot opportunities in the growing demand for digital content authenticity solutions.
The Clear2Fog pipeline enhances object detection in foggy conditions using synthetic data for improved model training.
This study demonstrates how synthetic data can significantly improve object detection models in challenging conditions, providing developers and PMs with insights for enhancing AI robustness and attracting investors interested in innovative solutions.
DIVER introduces a dual-stage distillation framework enhancing semantic recovery for improved dataset distillation.
DIVER's dual-stage distillation framework enhances semantic recovery, signaling to developers and PMs the potential for more efficient data usage and improved model performance, attracting investor interest in innovative AI solutions.
A thirty-token prompt significantly reduces sponsored recommendations in twelve LLMs.
This finding reveals how user prompts can effectively influence LLM behavior, informing developers and PMs on optimizing AI interactions and guiding investors on potential shifts in AI monetization strategies.
MorphOPC enhances mask optimization using multi-scale hierarchical morphological learning for improved pattern fidelity.
MorphOPC's advanced mask optimization techniques can significantly enhance pattern fidelity, presenting developers and PMs with new opportunities for precision in semiconductor manufacturing and attracting investor interest in cutting-edge technologies.
TrackCraft3R repurposes video diffusion transformers for efficient dense 3D tracking from monocular video.
TrackCraft3R's innovation in using video diffusion transformers for dense 3D tracking enhances real-time applications, signaling a shift in how developers can approach computer vision tasks.
Current geospatial foundation models lack standardization, hindering effective comparison and innovation.
The lack of standardization in geospatial foundation models signals potential barriers to innovation and collaboration for developers, PMs, and investors in the geospatial technology sector.
The paper critiques current video anomaly detection methods for neglecting scene-specific normality modeling.
This research highlights the need for scene-specific modeling in video anomaly detection, prompting developers and PMs to refine algorithms and investors to consider innovative solutions in AI surveillance technologies.
WildPose is a unified framework for robust pose estimation in dynamic and static environments.
WildPose enhances pose estimation accuracy in diverse environments, offering developers and PMs a reliable tool for applications in robotics and AR, while investors may see potential in its commercial viability.
CRAFT enhances medical image synthesis by aligning generated images with clinical criteria using a novel scoring system.
CRAFT's novel scoring system for medical image synthesis aligns generated images with clinical criteria, offering developers and PMs a pathway to improve diagnostic tools and investors insights into healthcare AI advancements.
The Visual Aesthetic Benchmark reveals gaps in MLLM aesthetic judgments compared to human experts.
This benchmark highlights the limitations of MLLMs in aesthetic evaluation, prompting developers to refine models, PMs to adjust product expectations, and investors to reassess market readiness for AI-driven design tools.
Scale-Gest is a scalable framework for adaptive on-device gesture detection optimizing energy and performance.
Scale-Gest offers a scalable solution for on-device gesture detection, crucial for developers and PMs focusing on energy efficiency and performance optimization in mobile applications.
Hi-GaTA is a novel adapter for generating surgical video reports using hierarchical temporal aggregation.
Hi-GaTA's innovative approach to surgical video report generation signals a significant advancement in AI's application in healthcare, presenting new opportunities for developers, PMs, and investors in medical technology.
This framework enhances dynamic human-object interaction by blending pretrained motion controllers for improved performance.
This AI framework signals a significant advancement in human-object interaction, offering developers and PMs new tools for immersive applications, while investors can capitalize on emerging market opportunities in robotics and gaming.
This study presents a markerless method for quantifying gait deviations in children with cerebral palsy (CP) using single-view videos.
This markerless gait analysis method can enhance clinical assessments and treatment strategies for children with cerebral palsy, signaling opportunities for developers and investors in health tech innovation.
This paper presents a framework for estimating island area and coastline using monocular vision.
This AI framework enables developers and PMs to efficiently estimate island metrics, potentially enhancing environmental monitoring and tourism applications, while investors may see opportunities in geospatial analytics innovations.
LatentHDR decouples exposure from diffusion, enabling efficient HDR generation with high quality.
LatentHDR's innovative approach to HDR generation signals a breakthrough for developers and PMs in creating high-quality imaging tools, attracting investor interest in advanced AI technologies.
3D-Belief introduces a generative 3D world model for embodied belief inference in partially observable environments.
3D-Belief's generative 3D world model enhances AI's ability to infer beliefs in complex environments, signaling a breakthrough for developers, PMs, and investors in creating more intelligent systems.
HamBR utilizes Hamiltonian dynamics for active decision boundary restoration in noisy label learning.
HamBR's innovative approach to noisy label learning can enhance model accuracy, making it crucial for developers, PMs, and investors focused on improving AI performance and reliability.
DenseTRF enhances surgical scene prediction by adapting texture-aware representations without supervision.
DenseTRF's unsupervised adaptation of texture-aware representations can significantly improve surgical scene prediction, offering developers and PMs a competitive edge and attracting investors interested in healthcare AI advancements.
CheXTemporal is a dataset for temporal reasoning in chest radiography with paired X-rays and annotations.
CheXTemporal's dataset enables developers and PMs to enhance AI models for medical imaging, while investors can identify opportunities in healthcare AI advancements.
This study presents a generative AI method for visualizing highway construction hazards using synthetic images.
This AI innovation enables developers and PMs to enhance safety protocols and investors to identify new market opportunities in construction technology through advanced hazard visualization.
PD-4DGS enables progressive compression and streaming of 4D Gaussian Splatting for dynamic scenes.
PD-4DGS enhances dynamic scene streaming efficiency, signaling a breakthrough in bandwidth-adaptive technologies that developers, PMs, and investors can leverage for improved user experiences and reduced costs.
PG-3DGS integrates physics simulation with 3D Gaussian Splatting for realistic and functional 3D structures.
PG-3DGS signals a breakthrough in realistic 3D modeling, crucial for developers and PMs aiming for high-quality simulations, while investors can capitalize on emerging technologies in the gaming and simulation sectors.
ABRA is a new benchmark for radiology agents, enabling navigation and task execution in medical imaging environments.
ABRA provides a standardized benchmark for evaluating radiology AI agents, signaling opportunities for developers, PMs, and investors to enhance medical imaging solutions and drive innovation in healthcare technology.
Lite3R is a model-agnostic framework enhancing efficiency in transformer-based 3D reconstruction.
Lite3R's model-agnostic approach offers developers and PMs a scalable solution for efficient 3D reconstruction, signaling potential cost savings and innovation opportunities for investors in the AI space.
JACoP enhances multi-agent trajectory prediction by ensuring scene-level compliance and reducing collisions.
JACoP's ability to improve multi-agent trajectory prediction with scene-level compliance signals a significant advancement for developers, PMs, and investors in autonomous systems and robotics.
GraphScan enhances Vision SSMs by using graph-based dynamic scanning for improved feature representation.
GraphScan's innovative approach to feature representation in Vision SSMs signals potential advancements in AI performance, crucial for developers, PMs, and investors focused on cutting-edge technology applications.
The PVPS classifier reveals how political and social identities shape evaluations of visual political content.
The PVPS classifier's findings on how identity shapes evaluations of visual political content are relevant to content targeting and user engagement strategies for developers, PMs, and investors working in politically charged environments.
USEMA introduces a hybrid UNet architecture combining CNNs with scalable Mamba-like attention for efficient medical image segmentation.
USEMA's innovative architecture enhances medical image segmentation efficiency, signaling a significant advancement for developers, PMs, and investors in healthcare AI applications.
The first global agricultural field boundary map at 10m resolution is released for 2024 and 2025.
This high-resolution agricultural field boundary map enables developers and PMs to create precision farming tools, while investors can identify opportunities in agri-tech innovations driven by data analytics.
Vision2Code is a benchmark for evaluating multi-domain image-to-code generation without paired reference code.
Vision2Code provides a standardized framework for assessing image-to-code generation, enabling developers, PMs, and investors to gauge advancements and potential in AI-driven software development tools.
The study introduces PriUS, a framework for interpretable uncertainty in medical image segmentation.
The PriUS framework enhances medical image segmentation by providing interpretable uncertainty, which is crucial for developers, PMs, and investors aiming to improve healthcare AI solutions and ensure reliability in clinical applications.
PresentAgent-2 generates multimodal presentation videos from user queries using an agentic framework.
PresentAgent-2's ability to create multimodal presentations from queries signals a shift towards more efficient content generation tools, benefiting developers, PMs, and investors in enhancing user engagement and productivity.
HiDream-O1-Image is a unified generative model using a pixel-level Diffusion Transformer for multimodal tasks.
HiDream-O1-Image's pixel-level Diffusion Transformer enhances multimodal capabilities, signaling a shift in generative AI that developers, PMs, and investors should leverage for innovative applications and competitive advantage.
This work presents a method for creating background-invariant representations in VLMs using synthetic data.
This research offers developers and PMs a novel approach to improve VLM robustness, signaling potential for investors in cutting-edge AI applications and enhanced user experiences.
VidSplat introduces a training-free framework for 3D scene reconstruction using video diffusion priors.
VidSplat's training-free 3D scene reconstruction framework points developers, PMs, and investors toward better video technology at lower development cost.
Stable-Video-3D generates 8-second 1080p video from text with physically plausible motion via a learned dynamics prior.
Physics consistency was the visible weakness in AI video; closing that gap brings consumer use cases within reach.