
TechcraftingAI Computer Vision

By: Brad Edwards

About this audio content

TechcraftingAI Computer Vision brings you daily summaries of the latest arXiv research, read by your virtual host, Sage. The podcast is produced by Brad Edwards, an AI Engineer from Vancouver, BC, and a graduate student of computer science studying AI at the University of York. Thank you to arXiv for the use of its open access interoperability.
    Episodes
    • Ep. 247 - Part 3 - June 13, 2024
      Jun 15 2024

      ArXiv Computer Vision research for Thursday, June 13, 2024.


      00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

      01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

      03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

      04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

      06:34: Towards Vision-Language Geo-Foundation Model: A Survey

      08:11: SimGen: Simulator-conditioned Driving Scene Generation

      09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

      11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

      12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

      13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

      15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

      16:29: Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

      17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

      19:39: Real-Time Deepfake Detection in the Real-World

      21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

      23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

      24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

      26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

      28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

      29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

      31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

      33:16: Towards Evaluating the Robustness of Visual State Space Models

      34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

      36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

      37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

      40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

      41:40: Explore the Limits of Omni-modal Pretraining at Scale

      42:46: Interpreting the Weight Space of Customized Diffusion Models

      43:58: Depth Anything V2

      45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

      46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

      48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

      49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

      52 min
    • Ep. 247 - Part 2 - June 13, 2024
      Jun 15 2024

      ArXiv Computer Vision research for Thursday, June 13, 2024.


      00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

      02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques

      03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

      05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

      06:41: Auto-Vocabulary Segmentation for LiDAR Points

      07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

      08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

      10:23: Fine-Grained Domain Generalization with Feature Structuralization

      12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

      14:13: ReMI: A Dataset for Reasoning with Multiple Images

      15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

      17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition

      18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

      20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

      22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

      24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

      25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

      26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

      27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction

      29:28: Comparison Visual Instruction Tuning

      30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

      32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV

      33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

      34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

      36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

      37:30: Parameter-Efficient Active Learning for Foundational models

      38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

      40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

      42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans

      44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

      46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation

      48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression

      50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

      52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

      53 min
    • Ep. 247 - Part 1 - June 13, 2024
      Jun 15 2024

      ArXiv Computer Vision research for Thursday, June 13, 2024.


      00:21: FouRA: Fourier Low Rank Adaptation

      01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

      03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

      04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

      06:46: ToSA: Token Selective Attention for Efficient Vision Transformers

      08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways

      09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint

      10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

      12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering

      13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

      15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

      16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

      18:16: Fusion of regional and sparse attention in Vision Transformers

      19:26: Zoom and Shift are All You Need

      20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

      21:49: The Penalized Inverse Probability Measure for Conformal Classification

      23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

      24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

      26:30: Computer Vision Approaches for Automated Bee Counting Application

      27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

      28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras

      29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

      31:25: Neural NeRF Compression

      32:29: Preserving Identity with Variational Score for General-purpose 3D Editing

      33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

      34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

      36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

      37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring

      38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

      40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

      42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

      43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

      45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models

      47:08: Suitability of KANs for Computer Vision: A preliminary investigation

      48 min