- Audio Waveforms Explained | Insights For Audio Editors 2024
by Mixing Monster | Last updated Dec 21, 2024
Audio waveforms are the backbone of sound editing and engineering. By visually representing sound – depicting amplitude and frequency variations over time – they give audio professionals a way to analyze and manipulate sound precisely, tailoring it to specific needs and contexts. Understanding and working with waveforms is essential for creating high-quality audio content.
The exploration of audio waveforms reveals various possibilities for sound manipulation. As we delve deeper, we uncover the concept and science behind these fascinating sound signatures.
Table Of Contents
- 1. The Fundamentals Of Audio Waveforms
- 2. Physics Behind Audio Waveforms
- 3. Types Of Audio Waveforms
- 4. Acoustics And Audio Waveforms
- 5. Waveforms In Analog And Digital Audio
- 6. Editing Techniques For Audio Waveforms
- 7. Audio Waveforms In Mixing And Mastering
- 8. Audio Waveforms: Key Takeaways And Future Directions
- Visualizing Sound: An Introduction To Waveform Graphics
Waveform graphics give these sound characteristics a visual form. They illustrate the amplitude and frequency of sound waves over time, allowing audio engineers to analyze and manipulate sound in a more tangible way.
Delving deeper into the science of sound, this section examines the physical principles governing audio waveforms. Understanding these concepts is vital for any audio professional seeking to master the craft of sound editing and engineering.
- Sound Wave Propagation
Sound waves propagate through various mediums (like air, water, and solids) as vibrations. These vibrations create longitudinal waves characterized by compressions and rarefactions, allowing the sound to travel over distances.
The speed of sound varies depending on the medium, affecting how the waveform is perceived.
- The Relationship Between Frequency And Pitch
Frequency, a fundamental aspect of sound waves, directly determines the pitch of a sound. Higher frequencies result in higher pitches, while lower frequencies lead to lower pitches. This relationship is pivotal in music and audio editing, as it shapes the tonal quality of the sound.
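To make this concrete with numbers the article doesn't give: in the standard equal-tempered system, doubling a frequency raises the pitch by one octave, and each semitone corresponds to a factor of 2^(1/12). A tiny Python sketch, assuming the common A4 = 440 Hz reference:

```python
A4 = 440.0  # common concert-pitch reference in Hz (an assumption, not from the article)

def note_frequency(semitones_from_a4: int) -> float:
    """Equal temperament: each semitone multiplies frequency by 2 ** (1/12)."""
    return A4 * 2 ** (semitones_from_a4 / 12)

print(note_frequency(12))   # 880.0 Hz -> one octave up, heard as a higher pitch
print(note_frequency(-12))  # 220.0 Hz -> one octave down, heard as a lower pitch
```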
- Harmonics And Overtones Explained
Harmonics and overtones are essential elements of a sound's timbre, or color. A fundamental frequency produces harmonics, which are whole-number multiples of that original frequency.
Overtones, which can be harmonic or inharmonic, add richness and complexity to the sound, playing a crucial role in distinguishing different sounds and instruments.
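One way to hear how harmonics shape timbre is additive synthesis: summing a fundamental with a few of its integer multiples. The short NumPy sketch below is only an illustration; the fundamental frequency and the 1/n amplitude roll-off are arbitrary choices, and changing them changes the perceived color of the tone.

```python
import numpy as np

sr = 44100                   # sample rate in Hz
t = np.arange(sr) / sr       # one second of time values
f0 = 110.0                   # arbitrary fundamental frequency

# Sum the fundamental plus its first few harmonics (integer multiples of f0),
# each a little quieter than the last.
tone = sum((1.0 / n) * np.sin(2 * np.pi * n * f0 * t) for n in range(1, 6))
tone /= np.max(np.abs(tone))  # normalize to the -1..1 range before playback or export
```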
Comparison Of Waveform Types
Understanding these basic waveform types and their sonic qualities is crucial for anyone engaged in sound design and audio editing. Each waveform brings a unique flavor to the sound palette, allowing various creative possibilities in music production and audio engineering.
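A quick way to compare the basic shapes yourself is to generate a burst of each with NumPy and SciPy, as in the sketch below; the test frequency and one-second duration are arbitrary choices, and plotting or listening to a few cycles of each makes their different sonic characters easy to relate to the visuals.

```python
import numpy as np
from scipy import signal

sr = 44100                 # sample rate in Hz
t = np.arange(sr) / sr     # one second of time values
f = 220.0                  # arbitrary test frequency

waveforms = {
    "sine": np.sin(2 * np.pi * f * t),
    "square": signal.square(2 * np.pi * f * t),
    "sawtooth": signal.sawtooth(2 * np.pi * f * t),
    "triangle": signal.sawtooth(2 * np.pi * f * t, width=0.5),  # symmetric ramp = triangle
}
```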
In this section, we explore the interaction between audio waveforms and their environments, focusing on acoustics, which significantly shapes how sound is perceived and manipulated in audio engineering.
- Room Acoustics And Waveform Behavior
Room acoustics play a vital role in how sound waves behave. A room’s size, shape, and materials can significantly affect sound waves’ absorption, reflection, and diffusion, altering waveform representations.
Understanding these acoustic properties is essential for optimizing recording and playback environments.
- The Impact Of Materials On Sound Waves
Different materials have varying effects on sound waves.
Hard surfaces like concrete and glass reflect sound, creating echoes and reverberations, while softer materials like foam and carpet absorb sound, reducing echoes.
These material characteristics must be considered when setting up a recording studio or optimizing a listening space.
- Understanding Echoes And Reverberations
Echoes and reverberations result from the interaction of sound waves with their environment.
Echoes are reflections of sound arriving at the listener’s ear after a delay, while reverberations are the persistence of sound in a space after the original sound has stopped.
Both can be creatively used or managed in audio editing and room acoustics design.
- Transients In Audio Waveforms Explained
Transients are short, high-energy bursts of sound that occur at the beginning of a waveform, like the strike of a drum or the pluck of a guitar string.
In digital audio, capturing these transients accurately is crucial for a realistic sound representation.
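As a toy illustration of spotting transients programmatically, the sketch below flags analysis frames whose energy jumps sharply relative to the previous frame. The frame size and ratio are arbitrary assumptions, and real onset detectors are considerably more refined.

```python
import numpy as np

def transient_frames(samples: np.ndarray, frame: int = 512, ratio: float = 4.0) -> np.ndarray:
    """Return indices of frames whose energy rises sharply versus the previous frame."""
    n = len(samples) // frame
    energy = np.array([np.sum(samples[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    rises = energy[1:] > ratio * (energy[:-1] + 1e-12)  # small constant avoids flagging pure silence
    return np.flatnonzero(rises) + 1
```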
Learn more about audio transients here:
What Are Audio Transients | How To Treat Transients
- The Role Of Digital Audio Workstations (DAWs)
DAWs are integral to modern audio editing, providing tools to manipulate digital waveforms.
They allow for precise editing, effects application, and mixing, making them indispensable in audio production.
This section is dedicated to the practical aspects of manipulating audio waveforms in editing. It covers a range of basic to advanced techniques suitable for novice and experienced audio editors.
- Basic Audio Waveform Manipulation Techniques
Fundamental techniques include trimming, fading, and adjusting gain. Trimming alters the length of an audio clip, fading creates smooth transitions into and out of sound, and gain adjustment controls the waveform’s amplitude. These are the starting points for any audio editing project.
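For illustration, here is roughly what those three operations look like when applied directly to an array of samples with NumPy. The function names and parameters are mine, not any particular editor's, and samples are assumed to be floating point in the -1..1 range.

```python
import numpy as np

def trim(samples: np.ndarray, sr: int, start_s: float, end_s: float) -> np.ndarray:
    """Keep only the audio between start_s and end_s (in seconds)."""
    return samples[int(start_s * sr):int(end_s * sr)]

def fade_in(samples: np.ndarray, sr: int, seconds: float) -> np.ndarray:
    """Linearly ramp the first `seconds` of audio from silence up to full level."""
    out = samples.astype(float)                 # work on a copy, leave the original untouched
    n = min(len(out), int(seconds * sr))
    out[:n] *= np.linspace(0.0, 1.0, n)
    return out

def apply_gain(samples: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale the waveform's amplitude by a gain expressed in decibels."""
    return samples * 10 ** (gain_db / 20)
```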
- Advanced Editing: Compression And Equalization
Compression reduces the dynamic range of an audio signal by turning down its loudest parts, narrowing the gap between loud and quiet passages, which can help achieve a balanced mix.
Equalization (EQ) allows the editor to boost or cut specific frequencies, shaping the tonal balance of the audio.
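To make the compression idea concrete, here is a deliberately simplified, static compressor sketch (samples assumed normalized to -1..1). Real compressors add attack/release smoothing, a knee, and make-up gain, and an EQ would similarly be implemented as a filter rather than a simple gain curve.

```python
import numpy as np

def compress(samples: np.ndarray, threshold_db: float = -18.0, ratio: float = 4.0) -> np.ndarray:
    """Turn down everything above the threshold by `ratio`, sample by sample."""
    eps = 1e-12                                         # avoids log of zero on silent samples
    level_db = 20 * np.log10(np.abs(samples) + eps)     # instantaneous level in dB
    over_db = np.maximum(level_db - threshold_db, 0.0)  # how far each sample exceeds the threshold
    gain_db = -over_db * (1.0 - 1.0 / ratio)            # a 4:1 ratio keeps 1 dB of every 4 dB over
    return samples * 10 ** (gain_db / 20)
```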
- Restoring And Enhancing Audio Quality
Audio restoration involves removing unwanted noises like hisses, clicks, or hums and can be particularly challenging.
Enhancing audio quality might involve:
- Stereo widening (a minimal mid/side sketch follows this list).
- Adding reverb.
- Using harmonic exciters to add richness and depth to the sound.
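For the stereo-widening item above, a common trick is mid/side processing: boost the side (difference) signal relative to the mid (sum). This sketch is a bare-bones illustration with an arbitrary default width, not a substitute for a proper widener plugin.

```python
import numpy as np

def widen_stereo(left: np.ndarray, right: np.ndarray, width: float = 1.5):
    """Mid/side widening: width > 1 widens the stereo image, width < 1 narrows it."""
    mid = 0.5 * (left + right)            # what the two channels share
    side = 0.5 * (left - right) * width   # what differs between them, scaled
    return mid + side, mid - side         # new left, new right
```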
- Common Mistakes In Audio Waveform Editing
It’s essential to be aware of common pitfalls in audio editing, such as over-compression, excessive EQ, or ignoring phase issues. Understanding these mistakes helps in avoiding them and achieving a more professional result.
Quick Tips For Effective Audio Waveform Editing
- Start With Clean Audio: Ensure your recordings are as clean as possible before editing.
- Monitor Levels Carefully: Keep an eye on your levels to avoid clipping.
- Use EQ Sparingly: Equalize gently; avoid overdoing it.
- Compress With Caution: Use compression to balance levels, but don’t overcompress.
- Listen On Different Systems: Check your edits on various audio systems for consistency.
- Save Originals: Always keep a copy of the original files before making edits.
- Take Breaks: Regular breaks help maintain a fresh perspective. Do it!
- Trust Your Ears: Your ears are your best tool; if it sounds right, it usually is.
This section focuses on managing audio waveforms in the critical mixing and mastering stages, emphasizing how waveform analysis can enhance these processes.
- Analyzing Tracks Using Waveforms
In mixing, visual analysis of waveforms can be as critical as listening. It helps identify issues like clipping, phase problems, and timing discrepancies. Waveform analysis is essential to ensuring tracks are well-aligned and dynamically balanced.
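As a small example of the kind of check this visual inspection supports, the sketch below flags likely clipping by looking for runs of samples stuck at the ceiling. The ceiling value and run length are arbitrary assumptions, and samples are assumed to be normalized floats.

```python
import numpy as np

def find_clipping(samples: np.ndarray, ceiling: float = 0.999, min_run: int = 3) -> list:
    """Return start indices of runs of samples sitting at or above the ceiling."""
    flat = np.abs(samples) >= ceiling
    starts, run = [], 0
    for i, is_flat in enumerate(flat):
        run = run + 1 if is_flat else 0
        if run == min_run:                 # record the start of each suspicious run once
            starts.append(i - min_run + 1)
    return starts
```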
- Balancing Tracks Using Waveforms
Balancing tracks involves adjusting levels, panning, and dynamics to create a harmonious mix. Waveforms provide a visual guide to these elements, helping to achieve a balanced and cohesive sound. For example, they make it easier to ensure that the waveforms of different instruments complement rather than compete with one another.
- Utilizing Waveforms For Stereo Imaging
Stereo imaging involves the placement of sounds within the stereo field to create a sense of width and depth. Waveforms can indicate stereo spread and help make informed decisions about panning and stereo enhancement techniques.
As we conclude our deep dive into audio waveforms, it’s clear that these fundamental sound elements are more than just technicalities—they are the art and science of audio engineering personified.
- The Essence Of Audio Waveforms
Reflecting on our journey through the intricacies of audio waveforms, we recognize their pivotal role in shaping the sounds that define our experiences.
From the pure tones of sine waves to the harmonically rich buzz of sawtooth waves, understanding waveforms is akin to understanding the language of sound itself.
- Innovations On The Horizon
The field of audio engineering is continually evolving, with new technologies and methodologies emerging regularly.
Innovations like AI-assisted editing, advanced digital synthesis, and immersive audio formats are set to revolutionize how we interact with waveforms, offering even more precision and creative freedom.
- A Parting Note: The Art Of Listening
Above all, the most crucial skill in our toolkit remains our ability to listen: not just to the sound itself, but to the story it tells through its waveform.
As we ride the ever-changing audio wave, remember that at the heart of every great sound is an even better listener.
Title: From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Abstract: Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio-visual modalities primarily focused on either audio-visual representation learning or generative modeling of a modality conditioned on the other, creating a disconnect between these two branches. A unified framework that learns representation and generates modalities has not been developed yet. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key approach of VAB is that rather than working with raw video frames and audio data, VAB performs representation learning and generative modeling within latent spaces. In particular, VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. It then performs the pre-training task of visual-conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning and simultaneous video-to-audio generation. After the pre-training phase, VAB employs the iterative-decoding approach to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine-tuned for various audio-visual downstream tasks. Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.
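For readers who want a rough mental model of what “visual-conditioned masked audio token prediction” can look like in code, here is a generic toy sketch written only from the abstract's description. It is not the authors' VAB implementation: the tiny transformer, all dimensions, and the masking rate are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class MaskedAudioTokenPredictor(nn.Module):
    """Toy model: predict masked audio tokens while attending to visual features."""

    def __init__(self, vocab_size=1024, d_model=256, visual_dim=512):
        super().__init__()
        self.mask_id = vocab_size                               # reserve one extra id for [MASK]
        self.audio_emb = nn.Embedding(vocab_size + 1, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, visual_feats, mask):
        tokens = audio_tokens.masked_fill(mask, self.mask_id)   # hide the masked positions
        x = torch.cat([self.visual_proj(visual_feats), self.audio_emb(tokens)], dim=1)
        x = self.encoder(x)                                     # joint attention over both modalities
        return self.head(x[:, visual_feats.size(1):])           # logits only for the audio positions

# Toy usage: reconstruct the ~40% of audio tokens that were hidden, given visual features.
audio = torch.randint(0, 1024, (2, 50))      # fake audio token ids
visual = torch.randn(2, 10, 512)             # fake visual features
mask = torch.rand(2, 50) < 0.4
logits = MaskedAudioTokenPredictor()(audio, visual, mask)
loss = nn.functional.cross_entropy(logits[mask], audio[mask])
```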
Understanding Audio Waveforms: A Comprehensive Guide
Audio waveforms are the visual representations of sound waves. Simply put, they show us how sounds change over time. Understanding audio waveforms is crucial in many areas, such as music production, sound engineering, and data analysis. In this comprehensive guide, we'll explore the intricacies of audio waveforms, their influence on technology and society, the key concepts and definitions you need to know, and how to visualize data using waveform images. We will also demystify the jargon and walk through the technical terminology. Whether you are a beginner or an expert in the field, this guide will deepen your understanding of audio waveforms and their role in the modern world.
Challenge your understanding
Before we delve into the details, let's first test your understanding of audio waveforms. Take a moment and analyze the waveform below:
Do you notice any patterns or irregularities in the waveform? What conclusions can you draw about the sound it represents? Take time to analyze and take notes; we will discuss this in more detail later.
As you examine the waveform, you may notice various patterns such as peaks and troughs that indicate changes in amplitude. These amplitude changes correspond to the loudness or intensity of the sound at different times. You may also notice distinct cycles or repetitions that indicate a periodic sound wave. The waveform can also provide information about the frequency of the sound, as the distance between each cycle indicates the time it takes for a complete oscillation.
Additionally, irregularities in the waveform, such as sudden peaks or dips, can indicate abrupt changes in the sound, such as a sudden increase or decrease in volume. These irregularities can also indicate the presence of overtones that contribute to the timbre or tonal quality of the sound.
Put your knowledge to the test
Now that you've learned the basics of audio waveforms, put your knowledge to the test with a few questions:
- What is the main purpose of audio waveforms?
- What can audio waveforms tell you about a sound?
- What are some practical applications for understanding audio waveforms?
Take a moment to consider your answers before continuing. We will provide detailed explanations shortly.
1. The main purpose of audio waveforms is to visually represent the changes in amplitude and frequency of a sound wave over time. By visually representing audio waveforms, we can better analyze and understand the properties of a sound.
2. Audio waveforms can reveal different aspects of a sound, such as its volume, pitch, duration, and timbre. By studying the waveform, we can determine the intensity of the sound at different times, identify the frequency of the sound wave, and even distinguish between different types of sounds based on their unique waveform patterns.
3. Understanding audio waveforms has numerous practical applications. In music production, sound engineers use waveforms to edit and mix tracks to ensure that the volume is balanced and the desired effects are achieved. In speech analysis, waveforms help researchers examine vocal patterns and detect speech disorders. In the field of acoustics, audio waveform analysis can also aid in noise reduction, architectural design, and the development of audio compression algorithms.
The influence of technology
Technology has revolutionized the way we create, analyze and interact with audio waveforms. In this section, we will explore the role of technology in modern society and its influence on audio waveforms.
Exploring the role of technology in modern society
In recent years, technological advances have radically changed the field of audio waveforms. With the advent of digital audio workstations (DAWs) and dedicated software, artists and producers can now manipulate, edit and enhance audio waveforms with unprecedented precision.
Additionally, the widespread adoption of portable recording devices and streaming platforms has democratized access to audio production, allowing anyone with a smartphone to create their own waveforms and share them with others.
Technology has also played a crucial role in the area of data analysis. By converting audio signals into visual waveforms, analysts can extract valuable insights and patterns from large data sets. This has proven particularly useful in areas such as speech recognition, music recommendation algorithms and acoustics research.
Important concepts and definitions
Before diving deeper into the world of audio waveforms, there are a few important concepts and definitions you should understand. Familiarizing yourself with these terms will help you better understand the intricacies of audio waveforms.
Understand basic technical terms
Below are some important technical terms that are commonly used when discussing audio waveforms:
These are just a few of the most important terms you'll come across when exploring audio waveforms. As you familiarize yourself with these concepts, you will be better equipped to understand the intricacies discussed in the following sections.
Visualizing data with waveform images
Visualizing audio data can provide valuable insights and facilitate analysis. In this section, we will explore how waveform images improve data analysis and the different tools available for creating waveform images.
How waveform images improve data analysis
Waveform images provide a comprehensive visual representation of audio data. By looking at waveforms, analysts can identify patterns, anomalies, and various sound characteristics. This visual representation simplifies the understanding and interpretation of complex audio data.
Waveform images also help with audio editing and manipulation. By zooming in and out of specific sections of the waveform, producers and engineers can precisely edit audio, remove noise, or enhance specific elements. This level of control results in higher quality audio recordings and productions.
In addition, waveform images are widely used to illustrate and explain audio-related content in various teaching materials, tutorials and online courses. They provide a clear and concise visual aid that allows students and learners to better understand the concepts being taught.
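As a minimal example of turning an audio file into a waveform image, the sketch below reads a WAV file (the filename is a placeholder) and plots amplitude against time with SciPy and Matplotlib; stereo files are collapsed to mono purely for display.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

sr, samples = wavfile.read("example.wav")   # placeholder path: point this at a real WAV file
if samples.ndim > 1:
    samples = samples.mean(axis=1)          # average the channels for a single display trace

t = np.arange(len(samples)) / sr
plt.plot(t, samples, linewidth=0.5)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Audio waveform")
plt.show()
```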
Demystifying technical jargon
In the world of audio waveforms, understanding the associated jargon is crucial. In this section, we provide a comprehensive guide to technical terminology so you can easily navigate discussions and resources.
A comprehensive guide to technical terminology
Here are some additional technical terms to add to your audio waveform lexicon:
- Waveform display
- Transients
- Spectral analysis
Familiarizing yourself with these terms and their meanings will help you better understand and discuss your audio waveforms.
In this comprehensive guide, we've explored the world of audio waveforms, their impact on technology and society, key concepts and definitions, how to use waveform images in data analysis, and demystifying the jargon often used in discussions about audio waveforms. By developing a thorough understanding of audio waveforms, you can unlock new possibilities in music production, audio engineering, data analysis, and more. With this knowledge, you are now able to begin your journey into the world of audio waveforms with confidence and expertise.
Mastering Audio Visualization: A Beginner's Guide to Waveforms
Audio waveforms are graphical representations of audio signals that display changes in a signal's amplitude over a period of time. They play a vital role in understanding and manipulating audio content, from podcast creation to music videos.
What is an Audio Waveform?
An audio waveform serves as a snapshot of a sound's overall dynamic range and helps us comprehend the intricacies of sound waves. To understand this better, let's delve into the basics of visualization and sound waves.
Visualization involves converting sound waves into a graphical format that can be understood easily. Sound waves are vibrations that travel through the air or another medium and can be heard when they reach a person's or animal's ear. They have characteristics like amplitude (loudness) and frequency (pitch), which when graphically represented, form an audio waveform. This waveform serves as a visual guide to the audio file, making it easier to edit, navigate, and understand the audio content.
An audio waveform generator creates visual representations of sound waves from audio files. It analyzes the audio file, breaks it down into various segments, and presents each segment's amplitude and frequency as a visual graph. These generators are especially helpful in video editing and music production, where precise editing and synchronization are necessary.
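As a rough illustration of that segment-and-measure idea (not how any particular generator is implemented), the sketch below reduces an audio signal to one peak value per display column, which is essentially how waveform overviews are drawn; the column count is an arbitrary assumption.

```python
import numpy as np

def waveform_peaks(samples: np.ndarray, columns: int = 800) -> np.ndarray:
    """Return one peak amplitude per display column, ready to draw as bars."""
    columns = min(columns, len(samples))
    usable = len(samples) - len(samples) % columns   # drop the remainder so segments are equal
    segments = samples[:usable].reshape(columns, -1)
    return np.abs(segments).max(axis=1)
```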
Usage of Audio Waveform
Audio waveforms are a universal tool employed across various sectors, demonstrating their wide-ranging functionality. In essence, they act as visual aids that simplify the audio processing workflow.
The podcasting realm is increasingly adopting audio waveforms. As visual cues, they assist in the audio editing process, streamlining operations such as cutting and trimming and optimizing audio levels. This enhancement leads to an improved listening experience for the audience.
With the surge in online video content, adding audio waveform animations to video posts on social media platforms has become a popular trend. These animated waveforms boost engagement, visually appealing to viewers and promoting interaction with the content.
For music producers and sound engineers, audio waveforms are invaluable. They provide a clear snapshot of a track's structure, helping pinpoint specific sections for editing or mastering, thereby facilitating the production of harmonious tunes.
YouTube creators harness the power of audio waveforms to enrich their videos. They improve the video editing workflow and can even serve as a backdrop for an online video, thereby enhancing the overall aesthetics and drawing more viewers to the YouTube channel.
In the field of sound engineering, audio waveforms allow professionals to analyze audio details intricately. They enable the modification and optimization of audio for performance, playing a crucial role in the delivery of high-quality sound.
Why Should You Use Audio Waveform for Your Video?
Incorporating audio waveforms in your videos is not merely a trendy aesthetic choice, but it carries several practical benefits that can enhance your content's value:
The dynamic movement of animated waveforms can intrigue viewers, drawing their attention and fostering interaction with your content. This added dimension can make your video stand out among the static content on the web.
Audio waveforms add a unique visual element to your content. They can transform a standard audio track into an audio waveform animation, heightening the video's overall aesthetic appeal and attracting more views.
Including waveform animations can instill your content with a professional feel, thereby enhancing your brand image and credibility.
Waveforms reveal essential details about the audio content, such as rhythm, dynamics, and amplitude, providing insights that could potentially inform your creative process.
5 Online Audio Waveform Generators to Try
Are you interested in creating stunning and engaging audio waveforms? Here are five online audio waveform generators that you can use to animate your audio:
Kapwing offers a versatile online editor where you can add audio waveforms to any video, free of charge. You can explore different styles, sizes, and colors to generate an audio waveform that complements your content.
WaveVisual presents an opportunity to generate unique sound wave art in seconds. All you need to do is upload your audio or choose a song from their library or Spotify. WaveVisual's sound wave generator will craft beautiful sound waves tailored to your taste.
The Speechify AI Video Editor can take your videos to another level. It incorporates audiograms that represent speech and sounds using waveforms, inviting viewers to listen in. This approach makes your videos more captivating and dynamic, helping you garner more views and engagement on social media platforms like TikTok, Instagram, and Facebook.
Veed.io provides a straightforward and free tool for generating animated sound waves. With a wide array of templates and customization options, you can shape the waveform to match your specific preferences and needs.
Echowave.io emphasizes the power of audio-waveform animation in driving social media engagement. This platform enables you to add dynamic animations to your videos effortlessly, enhancing the connection between your content and audience.
A Comprehensive Guide for Beginners: Adding Audio Waveforms to Your Videos
Delving into the world of audio waveforms may seem daunting for beginners. Still, with the right tools, guidance, and a little practice, you'll be able to seamlessly integrate audio waveform visuals into your video content. Here's a straightforward step-by-step tutorial to help you through the process, including a specific example using Speechify, an online audio waveform generator.
Choosing and Using Your Audio Waveform Tool
Start by selecting your audio waveform tool. There are numerous options available online, like Kapwing, Veed.io, or Speechify. Evaluate the pricing and features of each tool and choose the one that best aligns with your needs.
Once you've chosen your tool, you need to upload your audio file. Most of these tools support different audio formats, such as WAV and MP3. After uploading, the software will automatically analyze the audio and generate the waveform. Depending on the tool you chose, you should be able to customize the waveform's style, color, and size to match your video's aesthetics. For instance, Speechify allows you to select a waveform style and further customize it.
Customizing Your Video
Adding personal touches to your video can greatly enhance its appeal. Most waveform tools enable you to customize your video by adding text, subtitles, images, or even GIFs. Carefully chosen fonts and a relevant background image can add depth to your video, creating a perfect blend of audio and visuals.
With Speechify, this customization is just a click away. After generating and customizing your waveform, you can add captions and subtitles or insert images and GIFs directly into your video.
Exporting Your Audio Waveform Video
The final step in creating your audio waveform video is to preview it, ensuring the waveform aligns correctly with the audio. If everything looks good, save your changes and export the final video. In Speechify, this process is as simple as clicking 'Export'.
Remember, as with learning any new skill, practice is key. So, don't worry if you don't get it right the first time!
The use of audio waveforms can significantly improve the quality and engagement of your audio and video content. Tools like Speechify make the process of generating and integrating waveforms simpler and more accessible, allowing creators to produce visually captivating content. From podcasters to YouTubers, the value of using audio waveforms cannot be overstated. Experiment with audio waveforms in your next project and see the difference it makes.
How can you generate an audio waveform?
Generating an audio waveform can be achieved by using an audio visualizer or video editing software that supports waveform visualization. These tools interpret your audio file and produce a visual depiction of the sound waves.
What are the steps to incorporate a waveform into audio?
Incorporating a waveform into audio can be achieved through a music visualizer or sound wave animation tool. These tools create waveforms based on your audio's frequency and amplitude, which can then be overlaid on your audio or video content.
How do you craft a soundwave QR code?
To create a soundwave QR code, you first need to produce a sound wave from your audio file using a sound wave generator. Once your sound wave is ready, a QR code generator can transform the waveform image into a QR code.
Why would you adjust the amplitude of an audio waveform?
Altering the amplitude of an audio waveform changes the perceived volume of the sound. This adjustment is often made to ensure the audio isn't excessively loud or soft and to preserve a consistent volume level throughout the audio file.
Can you explain the contrast between a sound wave and an audio wave?
Sound waves are physical waves that travel through a medium, such as air, and are detectable by the human ear. Audio waves, or audio signals, are electronic representations of these sound waves that can be stored, manipulated, and replayed via electronic devices.
What are the basic types of waveforms?
The three fundamental types of waveforms are sine waves, square waves, and triangle waves, each possessing its own distinctive shape and properties.
What are the advantages of using an audio waveform generator?
Using an audio waveform generator can improve the quality of your audio and video content. It can provide a visual guide for the audio, assist in aligning audio with visuals, facilitate audio editing, and introduce an attractive visual element to your content.
How is a waveform different from a spectrum?
A waveform represents a sound wave graphically over time, showing changes in the sound wave's amplitude. A spectrum, in contrast, shows how the energy of the sound wave is distributed across frequencies.
What functionality does a waveform editor provide?
A waveform editor is a software application that lets you view and manipulate audio waveforms. It provides a variety of editing functions, such as cutting, copying, pasting, trimming, and adjusting volume levels, and it can also apply effects and filters to enhance or alter the sound.
How can you visualize an audio spectrum in After Effects?
After Effects provides a powerful tool called the Audio Spectrum effect. This tool allows you to create dynamic sound wave animations that can be synced with your audio, thereby visualizing the audio spectrum.
Can you add audio waveforms to videos on iPhone?
Yes, several apps available on iPhone let you add audio waveforms directly to your videos. This enables you to give a professional touch to your audio projects even while on the go.
How does a music visualizer work?
A music visualizer generates animated imagery in synchronization with a music track. It uses the frequency and amplitude of the music to drive the animation, creating a visually immersive experience for viewers.
Can video editing software generate a waveform visualization?
Yes, many video editing programs can generate waveform visualizations. These visualizations help with editing tasks by providing a visual reference for the audio content, making it easier to synchronize audio and video.
What is a Spectrogram? The Producer’s Guide to Visual Audio
Have you ever wished you could see audio for a better understanding of your sound?
That’s exactly what’s possible with a spectrogram. It’s a technology that transforms complex audio information into a readable visual form.
But beyond just viewing your sonic information, spectral analysis lets you manipulate sound like never before.
That said, getting started with spectral processing can be challenging if you’ve never used it before.
In this guide, I’ll explain everything you need to know about spectrograms and how you can use them in music production.
From enhancing your audio editing to experimenting with spectral effects, these tools unlock a new way to work with audio.
Let’s get started.
What is a spectrogram?
A spectrogram is a visual representation of the spectrum of frequencies in a signal as they vary over time.
They’re displayed as a two-dimensional graph with time on the x-axis and frequency on the y-axis.
The color or intensity of the filled in areas represents the sound’s amplitude. Like an infrared camera, areas with greater amplitude appear more opaque or in a brighter hue.
This method of displaying the sound gives you unique information about its frequency content and temporal evolution.
Tools like spectral filtering and morphing open up creative sound design possibilities that other methods can't match.
Despite all that, spectrograms aren’t widely used by emerging producers.
Spectral processing is relatively new and not immediately intuitive for those used to traditional waveform views.
Luckily, there are several options available to producers today for generating spectrograms and working with audio in the frequency domain.
Read - Spectrum Analyzer: How to Visualize Your Signal in Mixing
How can you view a spectrogram?
To view a spectrogram, you’ll need software that can convert traditional audio recordings into visual representations of the frequencies.
This is done using a technique called the Fourier Transform . It breaks down the signal into its individual frequency components, allowing you to see the amplitude of each frequency at specific points in time.
Traditional spectral editing works on a complete, recorded audio file rather than a live input, so it's worth noting that this kind of deep spectral processing is generally an offline technique rather than a real-time one.
While that might sound complicated, today’s spectrogram software makes it easy.
Tools in your DAW or specialized third-party plugins take care of the technical details so you can focus on using the tools to improve your results.
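For a hands-on look, the sketch below builds a synthetic test sweep and displays its spectrogram using SciPy's short-time Fourier analysis and Matplotlib. The window size and decibel scaling are arbitrary choices; swap the sweep for a real recording to inspect your own material.

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

sr = 22050
t = np.arange(sr * 2) / sr
audio = signal.chirp(t, f0=100, f1=8000, t1=2.0)   # test tone sweeping from 100 Hz to 8 kHz

f, times, Sxx = signal.spectrogram(audio, fs=sr, nperseg=1024)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="auto")  # intensity shown in dB
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.show()
```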
How can spectrograms help you make better music?
Speaking of which, you’re probably wondering how this advanced technology can actually improve your finished product.
A visual representation of the frequency information in your audio tracks is surprisingly powerful. Combine that with processing that can only take place in the frequency domain and you have some serious audio super powers.
Here are five real world uses for spectrogram tools:
1. Clean up recordings
Spectrograms can help you identify and remove unwanted noises or artifacts, such as hums, clicks, or pops, from your recordings.
By visually locating these sounds in the frequency spectrum, you can use tools like spectral filters or noise reduction to isolate and eliminate them without affecting the rest of the audio.
This is a game changer for any producer that has to deal with noise and artifacts from working in a DIY studio.
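As a bare-bones sketch of the underlying idea (nowhere near what dedicated restoration tools do), the function below estimates a per-frequency noise floor from a noise-only clip and zeroes out STFT bins that don't rise above it; the threshold factor and window length are assumptions.

```python
import numpy as np
from scipy import signal

def spectral_gate(audio: np.ndarray, sr: int, noise_clip: np.ndarray, factor: float = 2.0) -> np.ndarray:
    """Crude spectral gating: keep only STFT bins that clearly exceed the noise floor."""
    f, t, Z = signal.stft(audio, fs=sr, nperseg=1024)
    _, _, N = signal.stft(noise_clip, fs=sr, nperseg=1024)
    noise_floor = np.abs(N).mean(axis=1, keepdims=True)   # average noise magnitude per frequency bin
    mask = np.abs(Z) > factor * noise_floor
    _, cleaned = signal.istft(Z * mask, fs=sr, nperseg=1024)
    return cleaned
```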
2. Unique sound design
Spectral processing allows you to create never before heard sounds by manipulating the frequency content of your audio.
You can stretch, shift, or morph frequencies, combine elements from different sounds, or apply spectral effects like reverb or delay to specific frequency ranges.
This can lead to new and innovative textures that other methods can’t match.
3. Mixing in the frequency domain
Spectrograms can help you make more informed decisions when it comes to your mix. Amplitude hotspots can give you clues about problematic areas and concentrations of sonic energy.
You can identify frequency clashes between instruments or vocals, spot elements that need EQ adjustments, or visualize the stereo field to get a more balanced and cohesive sound.
4. Restoration and archiving
Spectral processing has become essential for audio restoration and archiving projects.
Spectrograms can be used to identify destructive processes like transcoding or even rebuild lost information to improve fidelity.
If you’re a producer who works with lo-fi recordings or vintage samples, this approach can help you integrate found sound and damaged audio back into your work.
5. Enhancing vocals
Vocals are another area where spectrogram editing excels.
Cutting edge tools can remove sibilance, plosives, or breath noises and even enhance specific frequency ranges, making the vocals sit better in the mix.
The best audio software for working with spectrograms
With the basics out of the way, here are the best apps for spectral processing in music production:
1. iZotope RX
iZotope RX is an audio restoration and spectral editing suite that offers powerful features for working with spectrograms.
RX comes with many different audio tools, but its standout capability is the Spectral Repair module. This lets you easily identify and remove unwanted noises and artifacts by eye.
RX’s other spectrogram tools include a spectral de-esser and advanced de-click and de-reverb modules.
Read - What is a De-Esser? The 8 Best De-Esser VSTs for Pro Vocals
2. Adobe Audition
Adobe’s Audition DAW comes with spectral editing capabilities built in.
Its dedicated Spectral Frequency Display allows you to view and edit the spectral content of any track in your session. On top of that, its built-in spectral effects and noise reduction provide plenty of flexibility for any task.
Audition integrates seamlessly with other Creative Cloud applications, making it an attractive option for users who rely on other Adobe products.
3. Steinberg SpectraLayers Pro
Steinberg’s SpectraLayers Pro is a dedicated spectral editing software with highly advanced technology.
It allows you to work with individual layers of your audio, making it possible to isolate and manipulate specific elements for mixing or sound design.
SpectraLayers Pro goes deep, with advanced spectral editing tools like spectral casting and molding.
This option is especially exciting for producers looking to dive deep into spectral manipulation for creative sound design or detailed audio editing tasks.
4. Audacity
Audacity is a free, open-source DAW that offers basic spectral editing capabilities.
It features a spectrogram view that lets you view and analyze the frequency content. It offers basic spectral editing operations like noise reduction and filtering to clean up issues in your files.
While not as feature-rich or specialized as the other options on this list, Audacity is a budget-friendly choice. Consider it if you just want to explore spectrograms and spectral editing without paying too much.
Spectrogram world
Spectrograms are your visual window into the frequencies that make up your music.
They can up your sound design skills, make your tracks cleaner and fix issues you thought were beyond repair.
Whether you’re removing unwanted noises, enhancing specific elements in a mix, or exploring the realm of spectral sound design, spectrograms are a great tool to add to your toolbox.
Now that you know the basics, go ahead and have fun diving into the world of spectral processing.
Visualizing Sound: A Beginner’s Guide to Using TouchDesigner with Live
An audio wheel by Bileam Tschepe showing a visual representation of the song, "Says" by Nils Frahm
Throughout the history of electronic music, audiovisual art has been integral to its aesthetic identity, both in live performance settings and more recently online. As a result, ever more music makers are incorporating visual media into their work, leading to a new generation of artists who continue to push the boundaries of music’s visual representation.
While there are many different ways to create audiovisual art, we’ve been hearing from a growing number of people who’ve been turning to the node-based programming language TouchDesigner to augment their music with real-time visuals and other media-based works. Developed by the Toronto-based company Derivative, TouchDesigner was created primarily for artists who want to work with image, light and sound. The application is typically used to craft live performances, installations, music visuals and other multi-sensory experiences — although use cases extend far beyond creative arts. Whether it's through projection mapping, virtual reality, or physical computing, TouchDesigner helps people bring their ideas to life in numerous ways.
In this beginner's guide, we'll introduce you to TouchDesigner and explore some of the ways in which you can use the platform to create various kinds of audiovisual art. Along the way, we'll be joined by Derivative’s co-founder Greg Hermanovic and Berlin-based A/V artists Bileam Tschepe and Ioana Bîlea. We’re grateful to them for talking with us and sharing their expertise and insights into the history of TouchDesigner, the available resources for learning the platform, and its creative potential when used with Live.
Visuals for Joachim Spieth's Terrain album, by Stanislav Glazov
Greg Hermanovic
From early on in his working life, Greg Hermanovic was always drawn to the idea of creating interactive tools for artists. Prior to his career in graphic software, he worked in aerospace, where he developed real-time simulators for pilot training and for the US Space Shuttle robot arm. Hermanovic later co-founded Side Effects Software, the makers of Houdini, and received two Academy Awards for his work in procedural visual effects. In 2000, he branched off to start Derivative, where he currently leads the team that makes TouchDesigner.
“A lot of the roots of TouchDesigner are in modular and analog synthesizers,” Hermanovic says. “The first synths I got to put my hands on were the EMS Synthi , the Arp 2600 , and a bit of Moog. And at the same time, I was really interested in experimental films like “Synchromy” by Norman McClaren . At the time McClaren was creating soundtracks from images drawn directly onto film. So the soundtrack was the image and the image was the soundtrack. This really excited me because I realized, OK, here’s a marriage of the two.”
Elekktronaut
Bileam Tschepe - also known as Elekktronaut - uses TouchDesigner to create interactive, organic, audiovisual art, as well as installations and live performances. He also hosts TouchDesigner workshops and meetups at Music Hackspace and has a YouTube channel featuring easy-to-follow TouchDesigner tutorials. Tschepe describes how he sees images, forms, and colors when he listens to music. “It's very abstract,” he says. “I can’t really explain it. It can be really hard to realize what I see. But, I think TouchDesigner is the perfect tool to do it.”
Visuals to Marlon Hoffstadt’s “Mantra Of A New Life,” by Bileam Tschepe
Spherical Aberration
For Ioana Bîlea, aka Spherical Aberration, getting into audiovisual arts came out of frustration. “I was coordinating a music festival in Romania,” she remembers. “But I always wanted to make music. It wasn't until I moved to Berlin, however, that I found a community who were really supporting female artists. I found the VJ community was generally more open though, so I eventually started VJing with Resolume. But that got boring for me, especially for bigger events where I would have to be there all night. I was looking for a tool that was more entertaining and fun. One night a friend showed me TouchDesigner and I immediately realized that's what I was looking for.”
An audiovisual collaboration between Spherical Aberration and Camilla Pisani
Artists working with TouchDesigner
Throughout his career in visual effects, Hermanovic has always been inspired and influenced by music culture. In the mid-'90s, he regularly moonlighted as an audiovisual artist for some of Toronto’s early rave parties. These experiences would ultimately contribute to shaping and inspiring many of the features we see in TouchDesigner today. “I was doing visuals voluntarily for this rave company called Chemistry,” he recalls. “I remember trying to project pictures onto the wall in real-time while doing live editing, live coding, and all that kind of stuff. At one event, I was typing in this super long expression of code, trying to get something to wobble on the screen with some decay and stuff like that. And I thought this was insane. I was trying to type parenthesis, going nuts, during a performance, with people stumbling, and dropping their drinks on my keyboard. And so that was the moment I said, OK, we need to put a new system of Operators into our software which we called CHOPS . So CHOPS were born through a lot of pain and inconvenience”.
In the late ‘90s, a fortuitous encounter between Hermanovic and Richie Hawtin paved the way for their first audiovisual collaboration at the Mutek festival in Montreal. At the time Hawtin was using an early version of Ableton Live with a custom MIDI controller, while at Derivative, Hermanovic and his team were working on early builds of TouchDesigner. “Richie and I pushed the technologies together and made a couple of songs,” remembers Hermanovic. “Some worked well, others didn’t. Later, we did the Plastikman 2010 tour with a lot more time, better organization, and technology behind us.”
Richie Hawtin presents Plastikman live, Timewarp 2010
Since the days of Hermanovic and Hawtin’s pioneering collaborations, the number of people using TouchDesigner to combine audio, video, graphics, and data has grown exponentially worldwide. Some notable names using the program include Christopher Bauder, Ben Heim, CLAUDE, Markus Heckmann, and Carsten Nicolai aka Alva Noto. These artists have used TouchDesigner to create a wide range of projects. Many of them also use TouchDesigner as a platform for prototyping and developing new ideas.
“Christopher Bauder’s studio used TouchDesigner for a collaboration with Robert Henke called DEEP WEB,” remembers Bîlea. “They had a giant audiovisual installation at Kraftwerk in Berlin. And there are people like 404zero who use TouchDesigner to generate controlling signals for a modular synth setup. And there’s NONOTAK Studio, who use it to control their light installations with sound.”
Below is a YouTube playlist featuring the work of various artists using TouchDesigner. This playlist showcases a range of creative possibilities which may spark inspiration for those interested in using the platform in their own work.
Learning TouchDesigner
TouchDesigner, like any programming language, has a learning curve, but those with experience in synthesis, Live, or Max/MSP may find some familiar concepts. There are many online resources available to help individuals at any level of experience learn TouchDesigner. “You’ve got to start somewhere and get your feet wet,” Hermanovic advises. “So try something easy first. Then you’ll get instant results which will motivate you to keep building.”
Music Hackspace , in partnership with Derivative, offers a free online course for those just getting started with TouchDesigner. In addition, they hold monthly meetups featuring presentations and discussions on TouchDesigner projects. “The course is designed for beginners or people who’ve never programmed before,” says Tschepe. “Everyone has to learn the same basics at first. Then you can branch out and focus on different parts that interest you. At our meetups, there's always a chance for a Q&A followed by an open discussion. And usually, someone from Derivative joins.”
“One of the things I really like about Music Hackspace meetups and other events we’re at is the opportunity to watch people use our software over their shoulders,” says Hermanovic. “Because quite often what our users really need isn’t apparent from what they say. Also, our users often miss the new tricks and shortcuts we add to TouchDesigner. So for us, the meetups provide good lessons because we can learn how to make new features more apparent to people”.
TouchDesigner also has a thriving community of users who share inspiration and know-how via Facebook groups, the Derivative forum and Discord servers. The community is where users who have skills can connect with those who either need to learn or want to find others to collaborate with.
“It was rough at first because there were no tutorials and the community was super small,” says Bîlea. “But we had a Facebook group and a GitHub page. The people on Derivative’s forum were so nice. You could ask a question, and someone would always jump in and try to help you. Aurelian Ionus aka Paketa 12 has been one of the main contributors to the community. We’ve all been inspired by him. He gives good advice and a lot of it for free. I don't think I would have gotten to where I am now without his coaching.”
“When I started there weren't that many tutorials and no beginner courses,” remembers Tschepe. “Without Mathew Ragan , I wouldn't have known where to begin. There are a few of his university classes on YouTube . These videos were definitely helpful for me when starting out.”
TouchDesigner has a free non-commercial license for users who are getting started. While the non-commercial license limits the resolution, all basic operations and functions remain accessible for those wanting to use it for their personal learning and non-paying projects.
“We really wanted to get TouchDesigner in the hands of as many people as possible so they could see if they like it or not,” Hermanovic mentions. “And we wanted to give access in countries where money is harder to come by, so there’s affordability. People can then make some art that is lower resolution but not watermarked. Some projects you can even do end to end, non-commercially without any limitations really.”
Useful links on getting started with TouchDesigner
Download TouchDesigner
Derivative’s official Learn TouchDesigner help page
TouchDesigner Beginner Course, by Bilieam Tschepe
Mathew Ragan Teaching Resources
A list of all TouchDesigner tutorials out there
Joy Division in TouchDesigner, with Paketa 12
Understanding Operators
Operators in TouchDesigner can be thought of as the building blocks or ingredients that make up your project. They are essentially small nodes that perform specific tasks, such as generating images, processing audio, or controlling data. You can use these Operators by themselves or combine them in different ways to create more complex behaviors and effects. To help explain the concept further, Hermanovic likens TouchDesigner’s family of Operators to the Periodic Table of elements. “Imagine every element does something different, there's no commonality,” he says. “And the elements combine to form molecules. And the molecules are these powerful things that form life. That's how I wanted TouchDesigner to work at the basic level. All of the Operators are bite-sized enough that each one has a good job, and can do lots of interesting things without being overloaded with features.”
TouchDesigner has a collection of sixteen Operators that are considered to be the most fundamental and essential for building projects. These are referred to as the “Sweet 16”. “They are the most popular and useful Operators which everyone should get to know,” Hermanovic suggests.
TouchDesigner also features a Palette Browser with pre-constructed networks of Operators, called Components. Beginners can quickly get started by dragging and dropping these Components into their projects. Additional Components can be downloaded from AlphaMoonbase's OLIB library page and ChopChopChop.org.
Operator Families, with Bileam Tschepe
For those not wanting to dive too deeply into node-based programming, Tschepe is currently developing Algorhythm, a real-time interface for TouchDesigner that allows users to easily play and manipulate generative visuals and media using audio and other inputs. Algorhythm allows users to fade between visuals and add post-effects without needing to program large networks of Operators. It is currently available through Tschepe's Patreon.
Making audio-reactive visuals with TouchDesigner and Ableton Live
There are several ways that TouchDesigner can be integrated with Live to create audio-reactive visuals and other kinds of interactive experiences. One approach is to use TouchDesigner to control audio and video clips in Live. This can be done by sending OSC (Open Sound Control) messages from TouchDesigner to Live. Another method is to use TouchDesigner to visualize audio data in real-time, such as waveforms or spectrums, using the audio input from Live. This can be done using TouchDesigner's audio analysis tools which convert sound frequencies into numerical data. The data can then be connected to any number of TouchDesigner’s Operators opening up infinite possibilities for real-time music visualization.
TouchDesigner's Ableton Link CHOP can also sync Live Sets using timing information on a Link-enabled network. You can take it further with the TDAbleton tool, which offers tight, two-way integration and full access to almost everything going on in a Live Set. The recently released TDAbleton 2.0 update also includes access to Clip information in Live 11 arrangements and allows users to trigger Macro Variations and access all 16 Rack Macros.
Alternatively, OSC signals can be sent between Live and TouchDesigner using the LiveGrabber Max for Live plugins.
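As a minimal illustration of the OSC side of this workflow (not tied to any specific Live or TouchDesigner project), the sketch below uses the python-osc package to stream a changing value over UDP. The address and port are placeholders; match them to whatever the receiving OSC In CHOP or LiveGrabber device is configured to listen on.

```python
import math
import time
from pythonosc import udp_client

# Placeholder destination: set the IP and port to match your OSC receiver's settings.
client = udp_client.SimpleUDPClient("127.0.0.1", 7000)

for i in range(200):
    level = 0.5 + 0.5 * math.sin(i / 10)        # stand-in for a real audio level or LFO
    client.send_message("/audio/level", level)  # one float per OSC message
    time.sleep(0.05)
```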
“I’m happy there are great audio tools already out there through Ableton, VST and stuff like that,” says Hermanovic. “So we don’t really have to go there. But just being on the periphery of audio and being able to talk two ways with it is really good. I’m excited about the development of TDAbleton. The feedback I’m getting is that its link is pretty tight, flexible and open”.
Useful links on audio-reactive visuals and TouchDesigner’s integration with Live
Ableton & TouchDesigner: How to Build an Audio & Visual Live Set, with CLAUDE
Audio-Reactive Visuals in TouchDesigner, with Torin Blankensmith
Make Anything Audio Reactive, with Bileam Tschepe
TDAbleton Tutorials, with Bileam Tschepe
Beat Detection Tutorial, with Bileam Tschepe
Audio Reactive Particle Cloud Tutorial, with Bileam Tschepe
Audio Reactive articles on Derivative.ca
‘Bouncing liquid’, created using Live and TouchDesigner
Choosing the right computer
When choosing a computer for use with TouchDesigner, it is advisable to opt for the most powerful configuration that your budget allows. This will provide the best performance and enable you to tackle more complex projects. Factors to consider when determining the spec of your computer include:
Processor: TouchDesigner can be a CPU-intensive application, so it's important to have a powerful processor. Look for a processor with a high number of cores and a high clock speed.
Graphics card: TouchDesigner is also graphics-intensive, so a good graphics card is important. Look for a card with a high amount of dedicated video memory and support for GPU acceleration. It’s also worth noting that certain TouchDesigner features only work with Nvidia GPUs.
Memory: TouchDesigner benefits from a lot of memory, so look for a computer with at least 16GB of RAM.
Storage: TouchDesigner files can be quite large, so it's important to have a fast and large hard drive or SSD to store them on.
Portability: If you need to take your computer with you on the go, you'll want to consider its size and weight.
“Technically for the basic stuff, you don't really need a good computer,” says Tschepe. “But, if you’re doing some high-end 3D work you’re going to have issues. You’ll need a really powerful computer. Or if you want to render an 8K live projection, obviously you have to get a really good graphics card.”
“I bought a gaming laptop,” says Bîlea. “When I bought it three years ago it was one of the most powerful of its kind. Now it’s already falling behind. But it’s still working well for what I need. The Nvidia graphics cards upgrade pretty fast though. You could change your laptop every year if you wanted.”
“At one time I had to ask Silicon Graphics to borrow one of their workstations,” Hermanovic reminisces. “It took four people to move, they were so heavy. They cost $250,000. There were these poor guys pulling it up stairways into the space. Fortunately, the affordability of hardware has come down a lot. We’re piggybacking off the gaming industry because it’s driven the price of graphic chips right down.”
Finding inspiration
There are many ways to find inspiration for audiovisual projects in TouchDesigner. One way is to attend events such as music festivals and conferences to see what other people are doing with audiovisual technology. You can also look at the portfolios of other artists and designers to see what they have created. And of course, looking to other art forms and occasionally stepping away from the computer and into the world can provide endless ideas and inspiration.
“Personally I think I pick ideas more from non-TouchDesigner-related things,” says Hermanovic. “There’s so much amazing art in many other fields spanning hundreds of years. I’m influenced by painting and photography from way back. I really liked going to NYC in the ‘70s because of the explosion in the Soho area of art galleries and large-format art and installations. That was a hotbed of creativity back then.”
“My visuals are inspired by nature and organic textures,” says Bîlea. “Everything is evolving and slow-moving. I don't usually work with sharp edges or bright colors. I think my music has the same source of inspiration, so it's very ambient.”
“I also like to recreate things I see in nature,” says Tschepe. “I can be inspired by anything really. Anywhere I see natural patterns. It could even be the inside of a coffee mug.”
Forms in Wood, by Bileam Tschepe
We hope that this beginner’s guide has given you a taste of the creative possibilities of using TouchDesigner together with Live and some ideas on where to start. Whether you're an experienced A/V artist looking to expand your skills, or a newcomer to the field looking to explore the intersections of sound, music, and visuals, TouchDesigner offers a wealth of opportunities for exploration and expression. There are no strict rules for using these tools, and the only limit is your imagination. We look forward to hearing – and seeing – what you come up with.
Keep up with Elekktronaut and Spherical Aberration. Follow Derivative on YouTube.
Text and interviews by Joseph Joyce and Phelan Kane
Artwork by Bileam Tschepe (Elekktronaut)
GeWu-Lab/awesome-audiovisual-learning
This is a curated list of audio-visual learning methods and datasets, based on our survey “Learning in Audio-visual Context: A Review, Analysis, and New Perspective”. This list will continue to be updated; please feel free to nominate good related works via Pull Requests!
[Website of Our Survey], [arXiv]
Table of contents
- Speech Recognition
- Speaker Recognition
- Action Recognition
- Emotion Recognition
- Speech Enhancement and Separation
- Object Sound Separation
- Face Super-Resolution and Reconstruction
- Natural Sound
- Spatial Sound Generation
- Talking Face
- Image Manipulation
- Depth Estimation
- Audio-visual Transfer Learning
- Cross-modal Retrieval
- Audio-visual Representation Learning
- Sound Localization in Videos
- Audio-visual Saliency Detection
- Audio-visual Navigation
- Localization
- Question Answering
- Audio-visual Boosting
- Audio-visual Recognition
[Applied Intelligence-2015] Audio-visual Speech Recognition Using Deep Learning Authors: Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata Institution: Waseda University; Kyoto University; Honda Research Institute Japan Co., Ltd.
[CVPR-2016] Temporal Multimodal Learning in Audiovisual Speech Recognition Authors: Di Hu, Xuelong Li, Xiaoqiang Lu Institution: Northwestern Polytechnical University; Chinese Academy of Sciences
[AVSP-2017] End-To-End Audiovisual Fusion With LSTMs Authors: Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic Institution: Imperial College London; University of Twente
[IEEE TPAMI-2018] Deep Audio-visual Speech Recognition Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman Institution: University of Oxford; Google Inc.
[2019] Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection Authors: Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun Institution: Peking University
[IEEE TNNLS-2022] Multimodal Sparse Transformer Network for Audio-visual Speech Recognition Authors: Qiya Song, Bin Sun, Shutao Li Institution: Hunan University
[Interspeech-2022] Robust Self-Supervised Audio-visual Speech Recognition Authors: Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed Institution: Toyota Technological Institute at Chicago; Meta AI
[2022] Bayesian Neural Network Language Modeling for Speech Recognition Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng Institution: the Chinese University of Hong Kong
[Interspeech-2022] Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology; Genesis Lab Inc.
[MLSP-2022] Rethinking Audio-visual Synchronization for Active Speaker Detection Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang Institution: Tsinghua University; Beijing National Research Center for Information Science and Technology; University of Rochester
[NeurIPS-2022] A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer Authors: Wei-Ning Hsu, Bowen Shi Institution: Toyota Technological Institute at Chicago
[ITOEC-2022] FSMS: An Enhanced Polynomial Sampling Fusion Method for Audio-Visual Speech Recognition Authors: Chenghan Li; Yuxin Zhang; Huaichang Du Institution: Communication University of China
[IJCNN-2022] Continuous Phoneme Recognition based on Audio-Visual Modality Fusion Authors: Julius Richter; Jeanine Liebold; Timo Gerkmann Institution: Universität Hamburg
[ICIP-2022] Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition Authors: Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai Institution: University of Science and Technology of China; Chinese Academy of Sciences; iFLYTEK Co., Ltd.
[ICASSP-2023] Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation Authors: Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu Institution: University of Science and Technology of China; iFLYTEK Co. Ltd.
[CVPR-2022] Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations Authors: Dan Oneaţă, Horia Cucu Institution: University POLITEHNICA of Bucharest
[AAAI-2022] Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology
[AAAI-2023] Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng Institution: Nanyang Technological University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; Zhejiang University
[WACV-2023] Audio-Visual Efficient Conformer for Robust Speech Recognition Authors: Maxime Burchi, Radu Timofte Institution: University of Würzburg
[2023] Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute
[2023] Multimodal Speech Recognition for Language-Guided Embodied Agents Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason Institution: University of Southern California
[2023] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang Institution: Meta AI
[ICASSP-2023] The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen Institution: Northwestern Polytechnical University
[CVPR-2023] Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology
[ICASSP-2023] Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic Institution: Imperial College London; Meta AI
[CVPR-2023] AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid Institution: Google Research
[CVPR-2023] SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen Institution: University of Surrey; Meta AI
[ICASSP-2023] Multi-Temporal Lip-Audio Memory for Visual Speech Recognition Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology
[ICASSP-2023] On the Role of LIP Articulation in Visual Speech Perception Authors: Zakaria Aldeneh, Masha Fedzechkina, Skyler Seto, Katherine Metcalf, Miguel Sarabia, Nicholas Apostoloff, Barry-John Theobald Institution: Apple Inc.
[ICASSP-2023] Practice of the Conformer Enhanced Audio-Visual Hubert on Mandarin and English Authors: Xiaoming Ren, Chao Li, Shenjian Wang, Biao Li Institution: Beijing OPPO Telecommunications Corp., Ltd.
[ICASSP-2023] Robust Audio-Visual ASR with Unified Cross-Modal Attention Authors: Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian Institution: Shanghai Jiao Tong University
[IJCAI-2023] Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China
[Interspeech-2023] Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath Institution: The University of Texas at Austin; Carnegie Mellon University
[Interspeech-2023] Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima Institution: Waseda University; Waseda Research Institute for Science and Engineering
[ACL-2023] AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation Authors: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao Institution: Zhejiang University; ByteDance
[ACL-2023] Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China
[ACL-2023] MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng Institution: Nanyang Technological University; University of Aberdeen
[IJCNN-2023] Exploiting Deep Learning for Sentence-Level Lipreading Authors: Isabella Wu, Xin Wang Institution: Choate Rosemary Hall; Stony Brook University
[IJCNN-2023] GLSI Texture Descriptor Based on Complex Networks for Music Genre Classification Authors: Andrés Eduardo Coca Salazar Institution: Federal University of Technology - Paraná
[ICME-2023] Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder Authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee Institution: University of Science and Technology of China; Alibaba Group; Georgia Institute of Technology
[ICME-2023] Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition Authors: Jinxin Wang, Zhongwen Guo, Chao Yang, Xiaomei Li, Ziyuan Cui Institution: Ocean University of China; University of Technology Sydney
[AAAI-2024] Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai Institution: NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China; Tencent AI LAB; Nanyang Technological University, Singapore
[CVPR-2024] A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition Authors: Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma, Haotian Wang, Chin-Hui Lee Institution: University of Science and Technology of China, Hefei, China; Georgia Institute of Technology, Atlanta, America
[IJCNN-2024] Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng Institution: The Chinese University of Hong Kong
[InterSpeech-2024] LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha Institution: University of Maryland, College Park, USA
[InterSpeech-2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass Institution: MIT, USA; IBM Research AI, USA; MIT-IBM Watson AI Lab, USA; University of Bonn, Germany
[MTA-2016] Audio-visual Speaker Diarization Using Fisher Linear Semi-discriminant Analysis Authors: Nikolaos Sarafianos, Theodoros Giannakopoulos, Sergios Petridis Institution: National Center for Scientific Research “Demokritos”
[ICASSP-2018] Audio-visual Person Recognition in Multimedia Data From the Iarpa Janus Program Authors: Gregory Sell, Kevin Duh, David Snyder, Dave Etter, Daniel Garcia-Romero Institution: The Johns Hopkins University
[ICASSP-2019] Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion Authors: Suwon Shon, Tae-Hyun Oh, James Glass Institution: MIT Computer Science and Artificial Intelligence Laboratory, Cambridge
[Interspeech-2019] Who Said That?: Audio-visual Speaker Diarisation Of Real-World Meetings Authors: Joon Son Chung, Bong-Jin Lee, Icksang Han Institution: Naver Corporation
[ICASSP-2020] Self-Supervised Learning for Audio-visual Speaker Diarization Authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang Institution: University of Central Florida; Tencent AI Lab; Beijing University of Posts and Telecommunications
[ICASSP-2021] A Multi-View Approach to Audio-visual Speaker Verification Authors: Leda Sari, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf Institution: University of Illinois at Urbana-Champaign, Facebook AI Research
[IEEE/ACM TASLP-2021] Audio-visual Deep Neural Network for Robust Person Verification Authors: Yanmin Qian, Zhengyang Chen, Shuai Wang Institution: Shanghai Jiao Tong University
[ICDIP 2022] End-To-End Audiovisual Feature Fusion for Active Speaker Detection Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu Institution: Interdisciplinary Innovation Research Institute, Zhejiang Lab; Dalhousie University; University of Electronic Science and Technology of China; Zhejiang University
[EUVIP-2022] Active Speaker Recognition using Cross Attention Audio-Video Fusion Authors: Bogdan Mocanu, Tapu Ruxandra Institution: University "Politehnica" of Bucharest; Télécom SudParis
[2022] Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection Authors: Rahul Sharma, Shrikanth Narayanan Institution: University of Southern California
[SLT-2023] Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection Authors: Xuanjun Chen, Haibin Wu, Helen Meng, Hung-yi Lee, Jyh-Shing Roger Jang Institution: National Taiwan University; The Chinese University of Hong Kong
[ICAI-2023] Speaker Recognition in Realistic Scenario Using Multimodal Data Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf Institution: University of Engineering and Technology Taxila; Swarm Robotics Lab NCRA; Deutsches Elektronen-Synchrotron DESY
[CVPR-2023] A Light Weight Model for Active Speaker Detection Authors: Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen Institution: Sichuan University; The Chinese University of Hong Kong
[ICASSP-2023] The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition Authors: Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu Institution: University of Science and Technology of China; Georgia Institute of Technology; Carnegie Mellon University; Kore University of Enna; iFlytek; Northwestern Polytechnical University; Delft University of Technology
[ICASSP-2023] ImagineNet: Target Speaker Extraction with Intermittent Visual Cue Through Embedding Inpainting Authors: Zexu Pan, Wupeng Wang, Marvin Borsdorf, Haizhou Li Institution: National University of Singapore; University of Bremen; The Chinese University of Hong Kong
[ICASSP-2023] Speaker Recognition with Two-Step Multi-Modal Deep Cleansing Authors: Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li Institution: National University of Singapore; A*STAR; The Chinese University of Hong Kong; University of Bremen; Shenzhen Research Institute of Big Data
[ICASSP-2023] Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction Authors: Timothée Dhaussy, Bassam Jabaian, Fabrice Lefèvre, Radu Horaud Institution: Avignon University; Université Grenoble Alpes
[ICASSP-2023] Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang Institution: Tianjin University; A*STAR; Singapore Institute of Technology; National Institute of Informatics
[ICASSP-2023] Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge Authors: Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu Institution: Shanghai Jiao Tong University
[ICASSP-2023] Av-Sepformer: Cross-Attention Sepformer for Audio-Visual Target Speaker Extraction Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng Institution: Tsinghua University; Xiaomi Inc.; The Chinese University of Hong Kong
[ICASSP-2023] The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge Authors: Ming Cheng, Haoxu Wang, Ziteng Wang, Qiang Fu, Ming Li Institution: Wuhan University; Duke Kunshan University; Alibaba Group
[ICASSP-2023] Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning Authors: Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang Institution: Tianjin University; A*STAR
[Interspeech-2023] Target Active Speaker Detection with Audio-visual Cues Authors: Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li Institution: National University of Singapore; The Chinese University of Hong Kong
[Interspeech-2023] CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition Authors: Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang Institution: Tsinghua University; Beijing University of Posts and Telecommunications
[Interspeech-2023] Rethinking the Visual Cues in Audio-visual Speaker Extraction Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang Institution: Tianjin University; National University of Singapore; Shenzhen Research Institute of Big Data
[ACL-2023] OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment Authors: Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao Institution: Zhejiang University; Huawei Cloud
[Interspeech-2023] PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network Authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li Institution: Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong; National University of Singapore
[IEEE/ACM TASLP-2023] A Dynamic Convolution Framework for Session-Independent Speaker Embedding Learning Authors: Bin Gu, Jie Zhang, Wu Guo Institution: University of Science and Technology of China
[IEEE/ACM TASLP-2024] Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification Authors: Bing Han, Zhengyang Chen, Yanmin Qian Institution: Shanghai Jiao Tong University
[ICASSP-2024] Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman Institution: Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
[FG-2024] Dynamic Cross Attention for Audio-Visual Person Verification Authors: R. Gnana Praveen, Jahangir Alam Institution: Computer Research Institute of Montreal (CRIM), Montreal, Canada
[IJCNN-2016] Exploring Multimodal Video Representation For Action Recognition Authors: Cheng Wang; Haojin Yang; Christoph Meinel Institution: University of Potsdam
[CVPR-2018] The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary Authors: Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, Cuong Duc Dao Institution: King Abdullah University of Science and Technology; Stanford University; Universidad del Norte; Universiteit van Amsterdam
[ICCV-2019] EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen Institution: University of Bristol; University of Oxford
[ICCV-2019] SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition Authors: Bruno Korbar, Du Tran, Lorenzo Torresani Institution: Facebook AI Research
[ICCV-2019] Uncertainty-Aware Audiovisual Activity Recognition Using Deep Bayesian Variational Inference Authors: Mahesh Subedar, Ranganath Krishnan, Paulo Lopez Meyer, Omesh Tickoo, Jonathan Huang Institution: Intel Labs
[CVPR-2020] Listen to Look: Action Recognition by Previewing Audio Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani Institution: The University of Texas at Austin; Facebook AI Research
[2020] Audiovisual SlowFast Networks for Video Recognition Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer Institution: University of California; Facebook AI Research
[ICCV-2021] AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition Authors: Rameswar Panda, Chun-Fu(Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris Institution: MIT-IBM Watson AI Lab; Boston University; Massachusetts Institute of Technology
[2021] Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo Institution: Politecnico di Torino; Istituto Italiano di Tecnologia
[WACV-2022] Domain Generalization Through Audio-Visual Relative Norm Alignment in First Person Action Recognition Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo Institution: Politecnico di Torino; Istituto Italiano di Tecnologia; CINI Consortium
[CVPR-2022] Audio-Adaptive Activity Recognition Across Video Domains Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek Institution: University of Amsterdam; Inception Institute of Artificial Intelligence
[WACV-2022] MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition Authors: Jiawei Chen, Chiu Man Ho Institution: OPPO US Research Center
[CVPR-2022] Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos Authors: Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou Institution: Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing; Pazhou Lab
[2022] Noise-Tolerant Learning for Audio-Visual Action Recognition Authors: Haochen Han, Qinghua Zheng, Minnan Luo, Kaiyao Miao, Feng Tian, Yan Chen Institution: Xi’an Jiaotong University, the Shanxi Provincial Key Laboratory of Institute of Multimedia Knowledge Fusion and Engineering; the Ministry of Education Key Laboratory for Intelligent Networks and Network Security
[ICLR-2023] Exploring Temporally Dynamic Data Augmentation for Video Recognition Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee Institution: NAVER Clova; Korea Advanced Institute of Science and Technology; NAVER AI Lab; Yonsei University
[ICASSP-2023] Epic-Sounds: A Large-scale Dataset of Actions That Sound Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman Institution: University of Oxford; University of Bristol
[ICASSP-2023] AV-TAD: Audio-Visual Temporal Action Detection With Transformer Authors: Yangcheng Li, Zefang Yu, Suncheng Xiang, Ting Liu, Yuzhuo Fu Institution: Shanghai Jiao Tong University
[ICCV-2023] Audio-Visual Glance Network for Efficient Video Recognition Authors: Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim Institution: Korea Advanced Institute of Science and Technology
[IEEE TMM-2023] Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition Authors: Maregu Assefa, Wei Jiang, Jinyu Zhan, Kumie Gedamu, Getinet Yilma, Melese Ayalew, Deepak Adhikari Institution: University of Electronic Science and Technology of China; Sichuan Artificial Intelligence Research Institute; Adama Science and Technology University
[CVPR-2024] TIM: A Time Interval Machine for Audio-Visual Action Recognition Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen Institution: University of Bristol; VGG, University of Oxford; Czech Technical University in Prague
[EMNLP-2017] Tensor Fusion Network for Multimodal Sentiment Analysis Authors: Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018] Multi-attention Recurrent Network for Human Communication Comprehension Authors: Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018] Memory Fusion Network for Multi-view Sequential Learning Authors: Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, Louis-Philippe Morency Institution: Carnegie Mellon University; Instituto Politécnico Nacional; Nanyang Technological University
[NAACL-2018] Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos Authors: Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, Roger Zimmermann Institution: National University of Singapore
[EMNLP-2018] Contextual Inter-modal Attention for Multi-modal Sentiment Analysis Authors: Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, Pushpak Bhattacharyya Institution: Indian Institute of Technology Patna; Nanyang Technological University
[ACL-2019] Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model Authors: Yitao Cai, Huiyu Cai, Xiaojun Wan Institution: Peking University
[ACL-2020] Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis Authors: Dushyant Singh Chauhan, Dhanush S R, Asif Ekbal and Pushpak Bhattacharyya Institution: Indian Institute of Technology Patna
[ACL-2020] A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis Authors: Jean-Benoit Delbrouck, Noe Tits, Mathilde Brousmiche, Stephane Dupont Institution: University of Mons
[ACL-2020] Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation Authors: Aman Shenoy, Ashish Sardana Institution: Birla Institute of Technology and Science, Pilani; NVIDIA Graphics
[CVPR-2021] Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences Authors: Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin Institution: Southwest Jiaotong University; Southwestern University of Finance and Economics; Tencent; University of Electronic Science and Technology of China; Nanyang Technological University
[IEEE TAFFC-2021] Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations Authors: Manjot Bedi, Shivani Kumar, Md Shad Akhtar, Tanmoy Chakraborty Institution: Indraprastha Institute of Information Technology, Delhi
[IEEE SLT-2021] Detecting expressions with multimodal transformers Authors: Srinivas Parthasarathy, Shiva Sundaram Institution: Amazon
[CVPR-2022] M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe Institution: Sony Research India
[CCC-2022] A Multimodal Emotion Perception Model based on Context-Aware Decision-Level Fusion Authors: Yishan Chen; Zhiyang Jia; Kaoru Hirota; Yaping Dai Institution: Beijing Institute of Technology; State Key Laboratory of Intelligent Control and Decision of Complex Systems
[IJCNN-2022] Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis Authors: Lingyong Fang, Gongshen Liu, Ru Zhang Institution: Shanghai Jiao Tong University; Beijing University Posts and Telecommunications
[IEEE/ACM TASLP-2022] EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations Authors: Gopendra Vikram Singh, Mauajama Firdaus, Asif Ekbal, Pushpak Bhattacharyya Institution: Indian Institute of Technology
[ICPR-2022] Self-attention fusion for audiovisual emotion recognition with incomplete data Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj Institution: Tampere University; Aarhus University
[IEEE TAFFC-2023] Audio-Visual Emotion Recognition With Preference Learning Based on Intended and Multi-Modal Perceived Labels Authors: Yuanyuan Lei, Houwei Cao Institution: Texas A&M University; New York Institute of Technology
[IEEE T-BIOM-2023] Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention Authors: R Gnana Praveen, Patrick Cardinal, Eric Granger Institution: Ecole de technologie supérieure
[ICASSP-2023] Adapted Multimodal Bert with Layer-Wise Fusion for Sentiment Analysis Authors: Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos Institution: National Technical University of Athens; Institute for Language and Speech Processing
[ICASSP-2023] Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition Authors: R Gnana Praveen, Eric Granger, Patrick Cardinal Institution: École de Technologie supérieure
[IEEE/ACM TASLP-2023] Exploring Semantic Relations for Social Media Sentiment Analysis Authors: Jiandian Zeng, Jiantao Zhou, Caishi Huang Institution: Beijing Normal University; University of Macau
[CVPR-2023] Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network Authors: Zhicheng Zhang, Lijuan Wang, Jufeng Yang Institution: Nankai University
[ACM MM-2023] Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 Authors: Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, Ya Jiang, Shi Cheng, Jie Zhang, Yuzhe Weng Institution: University of Science and Technology of China; Northwestern Polytechnical University
[arxiv-2024] HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Department of Automation, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
[IJCAI-2024] HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis Authors: Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu Institution: Tongji University; Beijing Institute of Technology; University of Oxford; DeepBlue Academy of Sciences
[ICPR-2024] Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition Authors: Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson Institution: School of Computing Science, University of Glasgow
[InterSpeech-2024] AVR: Synergizing Foundation Models for Audio-Visual Humor Detection Authors: Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma Institution: IIIT-Delhi, India; University of Tartu, Estonia
Uni-modal Enhancement
[Interspeech-2018] Visual Speech Enhancement Authors: Aviv Gabbay, Asaph Shamir, Shmuel Peleg Institution: The Hebrew University of Jerusalem
[Interspeech-2018] The Conversation: Deep Audio-Visual Speech Enhancement Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman Institution: University of Oxford
[IEEE TETCI-2018] Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks Authors: Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang Institution: Research Center for Information Technology Innovation; National Taiwan University; National Yang-Ming University; Mackay Medical College; Academia Sinica
[ICASSP-2018] Seeing Through Noise: Visually Driven Speaker Separation And Enhancement Authors: Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg Institution: The Hebrew University of Jerusalem
[GlobalSIP-2019] Visually Assisted Time-Domain Speech Enhancement Authors: Elham Ideli, Bruce Sharpe, Ivan V. Bajić, Rodney G. Vaughan Institution: Simon Fraser University; SingSoftNext
[ICASSP-2019] On Training Targets and Objective Functions for Deep-learning-based Audio-visual Speech Enhancement Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen Institution: Aalborg University; Oticon A/S
[InterSpeech-2019] Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues Authors: Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Tomohiro Nakatani Institution: Nippon Telegraph & Telephone Corporation
[Interspeech-2019] My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman Institution: University of Oxford; Naver Corporation
[2020] Facefilter: Audio-Visual Speech Separation Using Still Images Authors: Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang Institution: Yonsei University; Naver Corporation
[ICASSP-2020] Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders Authors: Mostafa Sadeghi, Xavier Alameda-Pineda Institution: Inria Grenoble Rhone-Alpes
[CVPR-2021] Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation Authors: Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn Institution: Yonsei University; Naver Corporation; Korea Aerospace University
[ISCAS-2021] Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras Authors: Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi Institution: Istituto Italiano di Tecnologia; University of Modena and Reggio Emilia
[ICASSP-2022] The Impact of Removing Head Movements on Audio-Visual Speech Enhancement Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar Institution: Inria Grenoble; Université Grenoble Alpes; Inria Nancy Grand-Est; Reality Labs Research
[2022] Dual-path Attention is All You Need for Audio-Visual Speech Extraction Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson Institution: University of Illinois at Urbana-Champaign
[ICASSP-2022] Audio-visual multi-channel speech separation, dereverberation and recognition Authors: Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng Institution: The Chinese University of Hong Kong; Tencent AI lab
[2022] Audio-visual speech separation based on joint feature representation with cross-modal attention Authors: Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang Institution: Northwestern Polytechnical University; Nanchang University
[CVPR-2022] Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis Authors: Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard Institution: Massachusetts Institute of Technology; Meta Reality Labs Research
[IEEE MMSP-2022] As We Speak: Real-Time Visually Guided Speaker Separation and Localization Authors: Piotr Czarnecki, Jakub Tkaczuk Institution: Warsaw University of Technology
[IEEE HEALTHCOM-2022] A Novel Frame Structure for Cloud-Based Audio-Visual Speech Enhancement in Multimodal Hearing-aids Authors: Abhijeet Bishnu, Ankit Gupta, Mandar Gogate, Kia Dashtipour, Ahsan Adeel, Amir Hussain, Mathini Sellathurai, Tharmalingam Ratnarajah Institution: University of Edinburgh; Heriot-Watt University; Edinburgh Napier University; University of Wolverhampton
[CVPR-2022] Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman Institution: University of Oxford
[WACV-2023] BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds Authors: Youshan Zhang, Jialu Li Institution: Yeshiva University; Cornell University
[SLT-2023] AVSE Challenge: Audio-Visual Speech Enhancement Challenge Authors: Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Ondrej Klejch, Mandar Gogate, Kia Dashtipour, Amir Hussain, Peter Bell Institution: University of Edinburgh
[ICLR-2023] Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation Authors: Haoyue Cheng, Zhaoyang Liu, Wayne Wu, Limin Wang Institution: Nanjing University; SenseTime
[WACV-2023] Unsupervised Audio-Visual Lecture Segmentation Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi Institution: International Institute of Information Technology, Hyderabad
[ISCSLP-2022] Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement Authors: Chenxi Wang, Hang Chen, Jun Du, Baocai Yin, Jia Pan Institution: University of Science and Technology of China; iFlytek
[ICASSP-2023] Real-Time Audio-Visual End-to-End Speech Enhancement Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang Institution: Microsoft
[ICASSP-2023] Efficient Intelligibility Evaluation Using Keyword Spotting: A Study on Audio-Visual Speech Enhancement Authors: Cassia Valentini-Botinhao, Andrea Lorena Aldana Blanco, Ondrej Klejch, Peter Bell Institution: University of Edinburgh
[ICASSP-2023] Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement Authors: Chenyue Zhang, Hang Chen, Jun Du, Baocai Yin, Jia Pan, Chinhui Lee Institution: University of Science and Technology of China; iFlytek Co., Ltd.; Georgia Institute of Technology
[ICASSP-2023] Audio-Visual Speech Enhancement with a Deep Kalman Filter Generative Model Authors: Ali Golmakani, Mostafa Sadeghi, Romain Serizel Institution: Université de Lorraine
[ICASSP-2023] A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement Authors: Haitao Xu, Liangfa Wei, Jie Zhang, Jianming Yang, Yannan Wang, Tian Gao, Xin Fang, Lirong Dai Institution: University of Science and Technology of China; Ethereal Audio Lab; Tsinghua Shenzhen International Graduate School
[ICASSP-2023] Egocentric Audio-Visual Noise Suppression Authors: Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, Kaustubh Kalgaonkar Institution: Carnegie Mellon University; Meta
[ICASSP-2023] Dual-Path Cross-Modal Attention for Better Audio-Visual Speech Extraction Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson Institution: University of Illinois at Urbana-Champaign
[ICASSP-2023] On the Role of Visual Context in Enriching Music Representations Authors: Kleanthis Avramidis, Shanti Stewart, Shrikanth Narayanan Institution: University of Southern California
[ICASSP-2023] LA-VOCE: LOW-SNR Audio-Visual Speech Enhancement Using Neural Vocoders Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic Institution: Imperial College London; Meta
[ICASSP-2023] Learning Audio-Visual Dereverberation Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman Institution: The University of Texas at Austin; Meta AI
[Interspeech-2023] Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling Institution: University of Science and Technology of China
[Interspeech-2023] Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model Authors: Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann Institution: Tsinghua University; Universität Hamburg; Chinese Institute for Brain Research
[ITG-2023] Audio-Visual Speech Enhancement with Score-Based Generative Models Authors: Julius Richter, Simone Frintrop, Timo Gerkmann Institution: Universität Hamburg
[Interspeech-2023] Speech inpainting: Context-based speech synthesis guided by video Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen Institution: Universitat Pompeu Fabra; Aalborg University; Oticon A/S
[EUSIPCO-2023] Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima Institution: Waseda University; Waseda Research Institute for Science and Engineering
[IEEE/ACM TASLP-2023] Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu Institution: The Chinese University of Hong Kong
[ICCV-2023] AdVerb: Visually Guided Audio Dereverberation Authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha Institution: University of Maryland; University of Montreal
[IEEE/ACM TASLP-2023] Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments Authors: Jiaming Xu, Jian Cui, Yunzhe Hao, Bo Xu Institution: Xiaomi Corporation; University of Chinese Academy of Sciences
[ICASSP-2024] Consistent and Relevant: Rethink the Query Embedding in General Sound Separation Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng Institution: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Tencent AI Lab, Audio and Speech Signal Processing Oteam, China; The Chinese University of Hong Kong, Hong Kong SAR, China
[ICASSP-2024] SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech Authors: Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi Institution: Cisco Systems, Inc
[IJCAI-2024] Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction Authors: Zhaoxi Mu, Xinyu Yang Institution: Xi’an Jiaotong University
[InterSpeech-2024] FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung Institution: Korea Advanced Institute of Science and Technology, South Korea
[InterSpeech-2024] RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement Authors: Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic Institution: Meta AI, UK; Imperial College London, UK
[ACM MM-2024] RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues Authors: Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu Institution: State Key Laboratory for Novel Software Technology, Nanjing University, China
[InterSpeech-2024] LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement Authors: Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary, Krish Agrawal, Rupal Shah, Rohan Jha, M. Sajid, Amir Hussain, M. Tanveer Institution: Indian Institute of Technology Indore, India; Edinburgh Napier University, United Kingdom
[ECCV-2018] Learning to Separate Object Sounds by Watching Unlabeled Video Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman Institution: The University of Texas at Austin; IBM Research; Facebook AI Research
[ECCV-2018] The Sound of Pixels Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[ICASSP-2019] Self-supervised Audio-visual Co-segmentation Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019] The Sound of Motions Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019] Recursive Visual Sound Separation Using Minus-Plus Net Authors: Xudong Xu, Bo Dai, Dahua Lin Institution: The Chinese University of Hong Kong
[ICCV-2019] Co-Separating Sounds of Visual Objects Authors: Ruohan Gao, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[ACCV-2020] Visually Guided Sound Source Separation using Cascaded Opponent Filter Network Authors: Lingyu Zhu, Esa Rahtu Institution: Tampere University
[CVPR-2020] Music Gesture for Visual Sound Separation Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2021] Visual Scene Graphs for Audio Source Separation Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian Institution: University of Illinois at Urbana-Champaign; Mitsubishi Electric Research Laboratories
[CVPR-2021] Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation Authors: Yapeng Tian, Di Hu, Chenliang Xu Institution: University of Rochester; Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
[ECCV-2022] AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey Institution: Google Research; University of Illinois Urbana-Champaign
[ICIP-2022] Visual Sound Source Separation with Partial Supervision Learning Authors: Huasen Wang, Lingling Gao, Qianchao Tan, Luping Ji Institution: University of Electronic Science and Technology of China
[NeurIPS-2022] Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian Institution: University of Illinois; Mitsubishi Electric Research Labs
[ICLR-2023] CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick Institution: Sony Group Corporation; University of California San Diego
[CVPR-2023] Language-Guided Audio-Visual Source Separation via Trimodal Consistency Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko Institution: Boston University; Adobe Research; MIT-IBM Watson AI Lab, IBM Research
[CVPR-2023] iQuery: Instruments As Queries for Audio-Visual Sound Separation Authors: Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, Jianbo Shi Institution: University of California San Diego; Shanghai AI Laboratory; The Chinese University of Hong Kong; National University of Singapore; ShanghaiTech University; University of Pennsylvania
[ICCV-2023] Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation Authors: Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu Institution: Michigan State University; University of Rochester; University of Texas at Dallas
[WACV-2024] LAVSS: Location-Guided Audio-Visual Spatial Audio Separation Authors: Yuxin Ye, Wenming Yang, Yapeng Tian Institution: Tsinghua University; The University of Texas at Dallas
[NeurIPS-2024] Continual Audio-Visual Sound Separation Authors: Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian Institution: The University of Texas at Dallas; Brown University; Carnegie Mellon University
[CVPR-2020] Learning to Have an Ear for Face Super-Resolution Authors: Givi Meishvili, Simon Jenni, Paolo Favaro Institution: University of Bern
[IEEE TCSVT-2021] Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer Authors: Chenqi Kong, Baoliang Chen, Wenhan Yang, Haoliang Li, Peilin Chen, Shiqi Wang Institution: City University of Hong Kong; Nanyang Technological University
[ICASSP-2022] Deep Video Inpainting Guided by Audio-Visual Self-Supervision Authors: Kyuyeon Kim; Junsik Jung; Woo Jae Kim; Sung-Eui Yoon Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022] Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? Authors: Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann Institution: University of Southern California
[WACV-2023] Audio-Visual Face Reenactment Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar Institution: International Institute of Information Technology, Hyderabad; University of Bath
[ICASSP-2023] Hearing and Seeing Abnormality: Self-Supervised Audio-Visual Mutual Learning for Deepfake Detection Authors: Changsung Sung, Juncheng Chen, Chusong Chen Institution: National Taiwan University; Academia Sinica
[CVPR-2023] AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction Authors: Aggelina Chatziagapi, Dimitris Samaras Institution: Stony Brook University
[CVPR-2023] Parametric Implicit Face Representation for Audio-Driven Facial Reenactment Authors: Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li Institution: Sun Yat-sen University; Cardiff University
[CVPR-2023] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior Authors: Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong Institution: The Chinese University of Hong Kong; Tencent AI Lab
[ICASSP-2024] Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection Authors: Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson Institution: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K.
[CVPR-2024] AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj Institution: University of Maryland - College Park; Reality Defender Inc.
[BMVC-2024] Content and Style Aware Audio-Driven Facial Animation Authors: Qingju Liu, Hyeongwoo Kim, Gaurav Bharaj Institution: Flawless AI, UK; Imperial College London, UK
[BMVC-2024] Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada Institution: Computer Vision, Imaging & Machine Intelligence Research Group (CVI2), Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg; Cristal Laboratory, National School of Computer Sciences, Manouba University, Tunisia
[SIGGRAPH-2024] PersonaTalk: Bring Attention to Your Persona in Visual Dubbing Authors: Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu Institution: ByteDance, China
Cross-modal Perception
Cross-modal Generation
Mono Sound Generation
[ICASSP-2017] Vid2speech: Speech Reconstruction From Silent Video Authors: Ariel Ephrat, Shmuel Peleg Institution: The Hebrew University of Jerusalem
[ICCV-2017] Improved Speech Reconstruction From Silent Video Authors: Ariel Ephrat, Tavi Halperin, Shmuel Peleg Institution: The Hebrew University of Jerusalem
[ICASSP-2018] Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani Institution: Columbia University
[ACM MM-2018] Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed Authors: Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin’ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann Institution: Netaji Subhas Institute of Technology; National Institute of Informatics; Indraprastha Institute of Information Technology; National University of Singapore
[2019] Video-Driven Speech Reconstruction using Generative Adversarial Networks Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic Institution: Imperial College London; Samsung AI Centre
[Interspeech-2019] Hush-Hush Speak: Speech Reconstruction Using Silent Videos Authors: Shashwat Uttam, Yaman Kumar Singla, Dhruva Sahrawat, Mansi Agarwal Institution: Netaji Subhas Institute of Technology; Adobe Research; National University of Singapore; Delhi Technological University
[ICASSP-2021] Learning Audio-Visual Correlations From Variational Cross-Modal Generation Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan Institution: Illinois Institute of Technology; University of Technology Sydney; Cisco
[IEEE TCYB-2022] End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic Institution: Imperial College London; University of Augsburg; Meta AI
[ICPR-2022] Learning Speaker-specific Lip-to-Speech Generation Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde Institution: Indian Institute of Technology; University of Bath
[ICASSP-2023] Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech Authors: Jiyoung Lee, Joon Son Chung, Soo-Whan Chung Institution: NAVER AI Lab; Korea Advanced Institute of Science and Technology; NAVER Cloud
[CVPR-2023] ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi Institution: Meta AI; Meta Reality Labs Research; Toyota Technological Institute at Chicago; The Hebrew University of Jerusalem
[ICCV-2023] DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology
[CVPR-2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners Authors: Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen Institution: HKUST; ARC Lab, Tencent PCG
[IEEE TMM-2015] Real-Time Piano Music Transcription Based on Computer Vision Authors: Mohammad Akbari, Howard Cheng Institution: Simon Fraser University; University of Lethbridge
[ACM MM-2017] Deep Cross-Modal Audio-Visual Generation Authors: Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu Institution: University of Rochester
[NeurIPS-2020] Audeo: Audio Generation for a Silent Performance Video Authors: Kun Su, Xiulong Liu, Eli Shlizerman Institution: University of Washington
[ECCV-2020] Foley Music: Learning to Generate Music from Videos Authors: Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba Institution: Cambridge
[ICASSP-2020] Sight to Sound: An End-to-End Approach for Visual Piano Transcription Authors: A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman Institution: University of Oxford; The Interdisciplinary Center
[2020] Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements Authors: Kun Su, Xiulong Liu, Eli Shlizerman Institution: University of Washington
[ICASSP-2021] Collaborative Learning to Generate Audio-Video Jointly Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi Institution: Indian Institute of Technology Kanpur; University of Bath; Indian Institute of Technology Bombay
[ACM MM-2021] Video Background Music Generation with Controllable Music Transformer Authors: Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan Institution: Beihang University; Charterhouse School, Godalming, Surrey; Sea AI Lab
[2022] Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia Institution: New York University, Shanghai; Queen Mary University of London; Tencent Inc.; Mohamed bin Zayed University of Artificial Intelligence
[CVPR-2023] Conditional Generation of Audio from Video via Foley Analogies Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens Institution: University of Michigan; Yale University; Adobe Research
[ICML-2023] Long-Term Rhythmic Video Soundtracker Authors: Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao Institution: Shanghai Artificial Intelligence Laboratory
[CVPR-2016] Visually Indicated Sounds Authors: Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman Institution: Massachusetts Institute of Technology; U.C. Berkeley; Google Research
[CVPR-2018] Visual to Sound: Generating Natural Sound for Videos in the Wild Authors: Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg Institution: University of North Carolina at Chapel Hill; Adobe Research
[IEEE TIP-2020] Generating Visually Aligned Sound From Videos Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan Institution: South China University of Technology; China Pazhou Laboratory; MIT-IBM Watson AI Lab
[BMVC-2021] Taming Visually Guided Sound Generation Authors: Vladimir Iashin, Esa Rahtu Institution: Tampere University
[IEEE TCSVT-2022] Towards an End-to-End Visual-to-Raw-Audio Generation With GAN Authors: Shiguang Liu, Sijia Li, Haonan Cheng Institution: Tianjin University
[ICASSP-2023] I Hear Your True Colors: Image Guided Audio Generation Authors: Roy Sheffer, Yossi Adi Institution: The Hebrew University of Jerusalem
[CVPR-2023] Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan Institution: University of Washington; MIT-IBM Watson AI Lab; MIT; UMass Amherst
[ACM TOG-2018] Scene-aware audio for 360° videos Authors: Dingzeyu Li, Timothy R. Langlois, Changxi Zheng Institution: Columbia University; Adobe Research
[NeurIPS-2018] Self-Supervised Generation of Spatial Audio for 360° Video Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang Institution: University of California San Diego; Adobe Research
[CVPR-2019] 2.5D Visual Sound Authors: Ruohan Gao, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[ICIP-2019] Self-Supervised Audio Spatialization with Correspondence Classifier Authors: Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, Ming-Hsuan Yang Institution: University of California at Merced
[ECCV-2020] Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu Institution: The Chinese University of Hong Kong
[CVPR-2021] Visually Informed Binaural Audio Generation without Binaural Audios Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin Institution: The Chinese University of Hong Kong; Nanyang Technological University
[AAAI-2021] Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation Authors: Yan-Bo Lin, Yu-Chiang Frank Wang Institution: National Taiwan University; ASUS Intelligent Cloud Services
[TOG-2021] Binaural Audio Generation via Multi-task Learning Authors: Sijia Li, Shiguang Liu, Dinesh Manocha Institution: Tianjin University; University of Maryland at College Park
[WACV-2022] Beyond Mono to Binaural: Generating Binaural Audio From Mono Audio With Depth and Cross Modal Attention Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma Institution: Indian Institute of Technology Kanpur; CDAC Noida; TensorTour Inc.
[CVPR-2023] Novel-View Acoustic Synthesis Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi Institution: University of Texas at Austin; Meta AI
Video Generation
[ACM TOG-2017] Synthesizing Obama: learning lip sync from audio Authors: Supasorn Suwajanakorn, Steven Maxwell Seitz, Ira Kemelmacher-Shlizerman Institution: University of Washington
[ECCV-2018] Lip Movements Generation at a Glance Authors: Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, Chenliang Xu Institution: University of Rochester
[IJCV-2019] You Said That?: Synthesising Talking Faces from Audio Authors: Amir Jamaludin, Joon Son Chung, Andrew Zisserman Institution: University of Oxford
[ICCV-2019] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models Authors: Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky Institution: Samsung AI Center; Skolkovo Institute of Science and Technology
[IJCV-2020] Realistic Speech-Driven Facial Animation with GANs Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic Institution: Imperial College London; Samsung AI Research Centre Cambridge
[IJCV-2020] GANimation: One-Shot Anatomically Consistent Facial Animation Authors: Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer Institution: Institut de Robòtica i Informàtica Industrial; The Ohio State University
[ACM TOG-2020] MakeItTalk: Speaker-Aware Talking-Head Animation Authors: Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li Institution: University of Massachusetts Amherst; Huya Inc.; Adobe Research
[CVPR-2020] FReeNet: Multi-Identity Face Reenactment Authors: Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, Changjie Fan Institution: Zhejiang University; Fuxi AI Lab
[ECCV-2020] Neural Voice Puppetry: Audio-driven Facial Reenactment Authors: Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner Institution: Technical University of Munich; Saarland Informatics Campus
[CVPR-2020] Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images Authors: Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, Xiaogang Wang Institution: The Chinese University of Hong Kong; SenseTime Research
[ECCV-2020] MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation Authors: Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, Chen Change Loy Institution: SenseTime Research; Carnegie Mellon University; Center for Research on Intelligent Perception and Computing, CASIA; University of Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Science; Nanyang Technological University
[AAAI-2021] Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation Authors: Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan Institution: Netease Fuxi AI Lab; University of Technology Sydney
[CVPR-2021] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation Authors: Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu Institution: The Chinese University of Hong Kong; SenseTime Research; Tokyo Institute of Technology; Nanyang Technological University
[CVPR-2021] Audio-Driven Emotional Video Portraits Authors: Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu Institution: Nanjing University; The Chinese University of Hong Kong; The University of Sydney; SenseTime Research; Nanyang Technological University; Tsinghua University
[AAAI-2022] One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning Authors: Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu Institution: Netease Fuxi AI Lab; University of Technology Sydney
[TVCG-2022] Generating talking face with controllable eye movements by disentangled blinking feature Authors: Shiguang Liu, Jiaqi Hao Institution: Tianjin University
[AAAI-2022] SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022] FaceFormer: Speech-Driven 3D Facial Animation with Transformers Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura Institution: The University of Hong Kong; The Hong Kong University of Science and Technology; Adobe Research; Texas A&M University
[CVPR-2023] Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert Authors: Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li Institution: National University of Singapore; University of Science and Technology Beijing; University of Electronic Science and Technology of China; The Chinese University of Hong Kong
[ICASSP-2023] Free-View Expressive Talking Head Video Editing Authors: Yuantian Huang, Satoshi Iizuka, Kazuhiro Fukui Institution: University of Tsukuba
[ICASSP-2023] Audio-Driven Facial Landmark Generation in Violin Performance using 3DCNN Network with Self Attention Model Authors: Tingwei Lin, Chaolin Liu, Li Su Institution: Taiwan International Graduate Program; Academia Sinica; National Chengchi University
[ICASSP-2023] Naturalistic Head Motion Generation from Speech Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald Institution: University of Maryland; Apple Inc.
[ICASSP-2023] Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound Authors: Valentina Sanguineti, Sanket Thakur, Pietro Morerio, Alessio Del Bue, Vittorio Murino Institution: Istituto Italiano di Tecnologia; University of Genova
[CVPR-2023] Identity-Preserving Talking Face Generation with Landmark and Appearance Priors Authors: Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, Guanbin Li Institution: Sun Yat-sen University; Xidian University; The University of Hong Kong
[CVPR-2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation Authors: Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang Institution: Xi’an Jiaotong University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Tencent AI Lab; Ant Group
[ACM MM-2023] Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline Authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng Institution: Du Xiaoman Financial; Shanghai Jiao Tong University
[CVPR-2023] LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook Authors: Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou Institution: Alibaba Group
[CVPR-2024] Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel Institution: Karlsruhe Institute of Technology; Istanbul Technical University; Carnegie Mellon University
[InterSpeech-2024] Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert Authors: Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Joo, Tae-Hyun Oh Institution: Grad. School of Artificial Intelligence and Dept. of Electrical Engineering, POSTECH, Korea; ENSC, Bordeaux INP, France; KRAFTON, Korea; Inst. for Convergence Research and Education in Advanced Technology, Yonsei University, Korea
[ACM MM-2024] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer Authors: Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu Institution: BNRist, DCST, Tsinghua University; Baidu Inc.; Zhongguancun Laboratory; S-Lab, Nanyang Technological University
[ECCV-2024] KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding Authors: Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang Institution: South China University of Technology; Technical University of Munich; Pazhou Laboratory
[IVA-2018] Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network Authors: Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, Kazuhiko Sumi Institution: Hokkai Gakuen University Sapporo; Aoyama Gakuin University; Yokohama National University
[IVA-2019] Analyzing Input and Output Representations for Speech-Driven Gesture Generation Authors: Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström Institution: KTH Royal Institute of Technology in Stockholm; Hokkai Gakuen University; Aoyama Gakuin University
[CVPR-2019] Learning Individual Styles of Conversational Gesture Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik Institution: University of California, Berkeley; Zebra Medical Vision; Massachusetts Institute of Technology
[ICMI-2019] To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations Authors: Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, Yaser Sheikh Institution: Carnegie Mellon University; Facebook Reality Labs
[EUROGRAPHICS-2020] Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows Authors: Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow Institution: KTH Royal Institute of Technology
[ICMI-2020] Gesticulator: A Framework For Semantically-Aware Speech-Driven Gesture Generation Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström Institution: KTH Royal Institute of Technology
[2020] Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency Institution: Carnegie Mellon University; Seikei University
[ACM TOG-2020] Speech Gesture Generation From The Trimodal Context Of Text, Audio, And Speaker Identity Authors: Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee Institution: Korea Advanced Institute of Science and Technology; University of Science and Technology; Electronics and Telecommunications Research Institute
[CVPR-2022] SEEG: Semantic Energized Co-Speech Gesture Generation Authors: Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang Institution: Alibaba; University of Technology Sydney; Zhejiang University
[IEEE TNNLS-2022] VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation Authors: Wangli Hao, He Guan, Zhaoxiang Zhang Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[CVPR-2023] Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation Authors: Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu Institution: The University of Hong Kong; The Chinese University of Hong Kong; Nanyang Technological University
[IJCAI-2024] Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model Authors: Wentao Lei, Li Liu, Jun Wang Institution: The Hong Kong University of Science and Technology (Guangzhou); Tencent AI Lab; The Hong Kong University of Science and Technology
[ACM MM-2018] Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis Authors: Taoran Tang, Jia Jia, Hanyang Mao Institution: Tsinghua University
[CVPR-2018] Audio to Body Dynamics Authors: Eli Shlizerman, Lucio Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman Institution: Facebook Inc.; Stanford University; University of Washington
[NeurIPS-2019] Dancing to Music Authors: Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, Jan Kautz Institution: University of California; NVIDIA
[ICLR-2021] Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning Authors: Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang Institution: Fudan University; Microsoft STCA; Meituan; Rinna AI
[ICCV-2021] AI Choreographer: Music Conditioned 3D Dance Generation With AIST++ Authors: Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa Institution: University of Southern California; Google Research; University of California, Berkeley
[ICASSP-2022] Genre-Conditioned Long-Term 3D Dance Generation Driven by Music Authors: Yuhang Huang, Junjie Zhang, Shuyan Liu, Qian Bao, Dan Zeng, Zhineng Chen, Wu Liu Institution: Shanghai University; University of Chinese Academy of Sciences; JD AI Research; Fudan University
[CVPR-2022] Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory Authors: Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, Ziwei Liu Institution: Nanyang Technological University; Sun Yat-Sen University; University of California, Los Angeles; SenseTime Research
[CVPR-2023] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation Authors: Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo Institution: Renmin University of China; Peking University; Microsoft Research
[IEEE TMM-2023] Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization Authors: Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan Institution: Shanghai Key Lab of Intelligent Information Processing; Fudan University; Tencent
[VCJ-2024] QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li Institution: Guangdong University of Technology, Guangdong, China; Huizhou University, Guangdong, China; Guangdong Mechanical and Electrical College, Guangdong, China; University of Western Australia, WA, Australia
[2021] Sound-Guided Semantic Image Manipulation Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim Institution: Korea University; Korea Advanced Institute of Science and Technology; NVIDIA Corp.
[2022] Learning Visual Styles from Audio-Visual Associations Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao Institution: Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute
[CVPR-2023] Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh Institution: Pohang University of Science and Technology; Korea Advanced Institute of Science and Technology; University of Michigan; Yonsei University
[ACM MM-2024] An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation Authors: Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo Institution: Beijing Institute of Technology, Beijing, China; Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China; Renmin University of China, Beijing, China
[ICRA-2020] BatVision: Learning to See 3D Spatial Layout with Two Ears Authors: Jesper Haahr Christensen, Sascha Hornauer, Stella X. Yu Institution: Technical University of Denmark; University of California
[ECCV-2020] VisualEchoes: Spatial Image Representation Learning Through Echolocation Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman Institution: The University of Texas at Austin; Facebook Reality Lab; Facebook AI Research
[CVPR-2021] Beyond Image to Depth: Improving Depth Prediction Using Echoes Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma Institution: Indian Institute of Technology Kanpur; Centre for Development of Advanced Computing Noida; TensorTour Inc.
[ICASSP-2022] Co-Attention-Guided Bilinear Model for Echo-Based Depth Estimation Authors: Go Irie, Takashi Shibata, Akisato Kimura Institution: Nippon Telegraph & Telephone Corporation
[NeurIPS-2022] Learning Neural Acoustic Fields Authors: Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, Chuang Gan Institution: Carnegie Mellon University; Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[NeurIPS-2022] Few-Shot Audio-Visual Learning of Environment Acoustics Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[NeurIPS-2016] SoundNet: Learning Sound Representations from Unlabeled Video Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba Institution: Massachusetts Institute of Technology
[ICCV-2019] Self-Supervised Moving Vehicle Tracking With Stereo Sound Authors: Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; IBM Research AI
[CVPR-2021] There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge Authors: Francisco Rivera Valverde, Juana Valeria Hurtado, Abhinav Valada Institution: University of Freiburg
[AAAI-2021] Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning Authors: Yifang Yin, Harsh Shrivastava, Ying Zhang, Zhenguang Liu, Rajiv Ratn Shah, Roger Zimmermann Institution: National University of Singapore; Northwestern Polytechnical University; Zhejiang Gongshang University; Indraprastha Institute of Information Technology, Delhi
[Interspeech-2021] Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian Institution: Shanghai Jiao Tong University
[ICCV-2021] Multimodal Knowledge Expansion Authors: Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao Institution: Shanghai Qi Zhi Institute; UT Austin; South China University of Technology; Massachusetts Institute of Technology; Tsinghua University
[CVPR-2021] Distilling Audio-visual Knowledge by Compositional Contrastive Learning Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata Institution: University of Tübingen; MPI for Informatics; Tencent; Max Planck Institute for Intelligent Systems
[2022] Estimating Visual Information From Audio Through Manifold Learning Authors: Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, Kwang Moo Yi Institution: University of British Columbia; Huawei Technologies Canada Co., Ltd; University of Victoria
[DCASE-2021] Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy Authors: Chengxin Chen, Meng Wang, Pengyuan Zhang Institution: Institute of Acoustics, CAS; University of Chinese Academy of Sciences
[Interspeech-2021] Audiovisual transfer learning for audio tagging and sound event detection Authors: Wim Boes, Hugo Van hamme Institution: ESAT, KU Leuven
[2023] Revisiting Pre-training in Audio-Visual Learning Authors: Ruoxuan Feng, Wenke Xia, Di Hu Institution: Hunan University; Renmin University of China
[IJCNN-2023] A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques Authors: Qichen Zheng, Jie Hong, Moshiur Farazi Institution: Australian National University; CSIRO Data61
[ICCV-2023] Audio-Visual Class-Incremental Learning Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian Institution: The University of Texas at Dallas; Carnegie Mellon University
[ICCV-2023] Hyperbolic Audio-visual Zero-shot Learning Authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson Institution: Australian National University; CSIRO Data61
[CVPR-2023] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan Institution: Carnegie Mellon University
[ICCV-2023] Class-Incremental Grouping Network for Continual Audio-Visual Learning Authors: Shentong Mo, Weiguo Pian, Yapeng Tian Institution: Carnegie Mellon University; University of Texas at Dallas
[ICCV-2023] Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation Authors: Heeseung Yun, Joonil Na, Gunhee Kim Institution: Seoul National University
[2017] Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint Authors: Sungeun Hong, Woobin Im, Hyun S. Yang
[ICCV-2017] Image2song: Song Retrieval via Bridging Image Content and Lyric Words Authors: Xuelong Li, Di Hu, Xiaoqiang Lu Institution: Chinese Academy of Sciences; Northwestern Polytechnical University
[CVPR-2018] Seeing voices and hearing faces: Cross-modal biometric matching Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman Institution: University of Oxford
[ECCV-2018] Cross-modal Embeddings for Video and Audio Retrieval Authors: Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto Institution: Universitat Politecnica de Catalunya; Barcelona Supercomputing Center
[ISM-2018] Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA Authors: Donghuo Zeng, Yi Yu, Keizo Oyama Institution: National Institute of Informatics
[TOMCCAP-2020] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval Authors: Donghuo Zeng, Yi Yu, Keizo Oyama Institution: National Institute of Informatics
[IEEE TGRS-2020] Deep Cross-Modal Image–Voice Retrieval in Remote Sensing Authors: Yaxiong Chen, Xiaoqiang Lu, Shuai Wang Institution: University of Chinese Academy of Sciences; Chinese Academy of Sciences
[2021] Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval Authors: Donghuo Zeng, Jianming Wu, Gen Hattori, Yi Yu, Rong Xu Institution: KDDI Research, Inc.; National Institute of Informatics, SOKENDAI
[ICCV-2021] Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion Authors: Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, Guang Yang Institution: Hangzhou Dianzi University; University of California; East China Normal University; University of Oxford; Wuhan University; Imperial College London
[ICMR-2024] Anchor-aware Deep Metric Learning for Audio-visual Retrieval Authors: Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu Institution: KDDI Research, Inc.; Hiroshima University
[IJCAI-2022] Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng Institution: National University of Defense Technology
[IEEE ISM-2022] Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval Authors: Donghuo Zeng, Yanan Wang, Jianming Wu, Kazushi Ikeda Institution: KDDI Research, Inc.
[IEEE SMC-2022] Graph Network based Approaches for Multi-modal Movie Recommendation System Authors: Daipayan Chakder, Prabir Mondal, Subham Raj, Sriparna Saha, Angshuman Ghosh, Naoyuki Onoe Institution: Indian Institute of Technology; Sony Research
[CVPR-2022] Visual Acoustic Matching Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman Institution: University of Texas at Austin; Stanford University; Meta AI
[IEEE TMM-2023] Deep Cross-Modal Retrieval Between Spatial Image and Acoustic Speech Authors: Xinyuan Qian, Wei Xue, Qiquan Zhang, Ruijie Tao, Haizhou Li Institution: University of Science and Technology Beijing; Hong Kong University of Science and Technology; University of New South Wales; National University of Singapore; The Chinese University of Hong Kong-Shenzhen
[ICASSP-2024] Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos Authors: Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju Institution: Dolby Laboratories; University at Buffalo
Back to Top
Audio-visual Collaboration
[ICCV-2017] Look, Listen and Learn Authors: Relja Arandjelovic, Andrew Zisserman Institution: Google Inc.; University of Oxford
[NeurIPS-2018] Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization Authors: Bruno Korbar, Du Tran, Lorenzo Torresani Institution: Dartmouth College; Facebook Research
[NeurIPS-2020] Learning Representations from Audio-Visual Spatial Alignment Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos Institution: University of California
[NeurIPS-2020] Self-Supervised Learning by Cross-Modal Audio-Video Clustering Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran Institution: King Abdullah University of Science and Technology; Facebook AI Research
[NeurIPS-2020] Labelling Unlabelled Videos From Scratch With Multi-Modal Self-Supervision Authors: Yuki Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi Institution: University of Oxford; Facebook AI Research
[CVPR-2021] Audio-Visual Instance Discrimination with Cross-Modal Agreement Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra Institution: University of California San Diego; Facebook AI Research
[CVPR-2021] Robust Audio-Visual Instance Discrimination Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos Institution: University of California San Diego; Facebook AI Research
[2021] Unsupervised Sound Localization via Iterative Contrastive Learning Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang Institution: National Yang Ming Chiao Tung University; University of California; Snap Inc.; Google Research
[ICCV-2021] Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang Institution: Columbia University; Massachusetts Institute of Technology; University of Central Florida; Goethe University Frankfurt; IBM Research AI; MIT-IBM Watson AI Lab; The University of Texas at Austin; NYU-Courant CS & CDS
[2021] OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation Authors: Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang Institution: Chinese Academy of Sciences
[NeurIPS-2021] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong Institution: Columbia University; Google Inc.; Cornell University
[2021] Audio-visual Representation Learning for Anomaly Events Detection in Crowds Authors: Junyu Gao, Maoguo Gong, Xuelong Li Institution: Xidian University; Northwestern Polytechnical University
[ICASSP-2022] AudioCLIP: Extending CLIP to Image, Text and Audio Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel Institution: TU Kaiserslautern, Germany; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
[CVPR-2022] MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi Institution: University of Washington; Allen Institute for Artificial Intelligence; University of Edinburgh
[2022] Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun Hou, Shuai Yi Institution: Sensetime Research; The Hong Kong Polytechnic University
[NeurIPS-2022] Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi Institution: Dartmouth College; Northwestern University
[IEEE TMM-2022] Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations Authors: Sijie Mai, Ying Zeng, Haifeng Hu Institution: Sun Yat-sen University; National Natural Science Foundation of China
[CVPR-2022] Audiovisual Generalised Zero-shot Learning with Cross-modal Attention and Language Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata Institution: University of Tübingen; Robert Bosch GmbH; Max Planck Institute
[CVPRW-2022] Multi-task Learning for Human Affect Prediction with Auditory–Visual Synchronized Representation Authors: Euiseok Jeong, Geesung Oh, Sejoon Lim Institution: Kookmin University
[CVPR-2023] Vision Transformers are Parameter-Efficient Audio-Visual Learners Authors: Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius Institution: The University of North Carolina at Chapel Hill
[ECCV-2022] Temporal and cross-modal attention for audio-visual zero-shot learning Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata Institution: University of Tübingen; Max Planck Institute
[NeurIPS-2022] u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality Authors: Wei-Ning Hsu, Bowen Shi Institution: Meta AI
[NeurIPS-2022] Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu Institution: Texas A&M University; Google Research; University of Texas at Austin; Celonis Inc.
[AAAI-2023] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity Authors: Pritam Sarkar, Ali Etemad Institution: Queen’s University; Vector Institute
[ICLR-2023] Contrastive Audio-Visual Masked Autoencoder Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass Institution: Massachusetts Institute of Technology; The University of Texas at Austin; MIT-IBM Watson AI Lab; Goethe University Frankfurt
[ICLR-2023] Jointly Learning Visual and Auditory Speech Representations from Raw Data Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic Institution: Imperial College London; Meta AI
[WACV-2023] Audio Representation Learning by Distilling Video as Privileged Information Authors: Amirhossein Hajavi, Ali Etemad Institution: Queen's University, Canada
[2023] AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli Institution: University of California; Meta AI
[AAAI-2023] Audio-Visual Contrastive Learning with Temporal Self-Supervision Authors: Simon Jenni, Alexander Black, John Collomosse Institution: Adobe Research; University of Surrey
[CVPR-2023] ImageBind One Embedding Space to Bind Them All Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra Institution: Meta AI
[NeurIPS-2023] Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao Institution: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Huawei Noah’s Ark Lab
[WACV-2024] OmniVec: Learning robust representations with cross modal sharing Authors: Siddharth Srivastava, Gaurav Sharma Institution: TensorTour Inc.
[InterSpeech-2024] Zero-Shot Fake Video Detection by Audio-Visual Consistency Authors: Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang Institution: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China; Center for Speech and Language Technologies, BNRist, Tsinghua University, China
[ICML-2024] From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation Authors: Kun Su, Xiulong Liu, Eli Shlizerman Institution: Department of ECE, University of Washington, Seattle, United States; Department of Applied Math, University of Washington, Seattle, United States
Audio-visual Localization
[ECCV-2018] Objects that Sound Authors: Relja Arandjelovic, Andrew Zisserman Institution: Google Inc.; University of Oxford
[ECCV-2018] Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Authors: Andrew Owens, Alexei A. Efros Institution: University of California, Berkeley
[ECCV-2018] The Sound of Pixels Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[CVPR-2019] Deep Multimodal Clustering for Unsupervised Audiovisual Learning Authors: Di Hu, Feiping Nie, Xuelong Li Institution: Northwestern Polytechnical University
[CVPR-2021] Localizing Visual Sounds the Hard Way Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman Institution: University of Oxford
[IEEE TPAMI-2021] Class-aware Sounding Objects Localization via Audiovisual Correspondence Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen Institution: Renmin University of China; Shanghai Jiao Tong University
[IEEE TPAMI-2021] Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon Institution: Korea Advanced Institute of Science and Technology; Pohang University of Science and Technology; University of California
[CVPR-2022] Mix and Localize: Localizing Sound Sources in Mixtures Authors: Xixi Hu, Ziyang Chen, Andrew Owens Institution: University of Michigan; The University of Texas at Austin
[ECCV-2022] Audio-Visual Segmentation Authors: Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong Institution: Hefei University of Technology; SenseTime Research; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[2022] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu Institution: Meta Reality Labs
[ACM MM-2022] Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation Authors: Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang Institution: Shanghai Jiao Tong University; Shanghai AI Laboratory
[CVPR-2022] Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes Authors: Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[CVPR-2022] Self-supervised object detection from audio-visual correspondence Authors: Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze Institution: University of Oxford; University of Amsterdam; Meta AI
[EUSIPCO-2022] Visually Assisted Self-supervised Audio Speaker Localization and Tracking Authors: Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang Institution: University of Surrey; Tencent AI Lab, Bellevue
[ICASSP-2023] MarginNCE: Robust Sound Localization with a Negative Margin Authors: Sooyoung Park, Arda Senocak, Joon Son Chung Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute, South Korea
[IEEE TMM-2022] Cross modal video representations for weakly supervised active speaker localization Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan Institution: University of Southern California; Google Inc.
[NeurIPS-2022] A Closer Look at Weakly-Supervised Audio-Visual Source Localization Authors: Shentong Mo, Pedro Morgado Institution: Carnegie Mellon University; University of Wisconsin-Madison
[AAAI-2022] Visual Sound Localization in the Wild by Cross-Modal Interference Erasing Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou Institution: The Chinese University of Hong Kong; Zhejiang University; Shanghai Jiao Tong University; Renmin University of China; Nanyang Technological University
[ECCV-2022] Sound Localization by Self-Supervised Time Delay Estimation Authors: Ziyang Chen, David F. Fouhey, Andrew Owens Institution: University of Michigan
[IEEE/ACM TASLP-2023] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking Authors: Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan, Haizhou Li Institution: University of Science and Technology Beijing; Chinese University of Hong Kong; Shenzhen Research Institute of Big Data; National University of Singapore; University of California at Berkeley; University of Bremen
[WACV-2023] Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization Authors: Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju Institution: University at Buffalo
[WACV-2023] Exploiting Visual Context Semantics for Sound Source Localization Authors: Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang Institution: The University of Sydney; Renmin University of China; Baidu Inc.
[2023] Audio-Visual Segmentation with Semantics Authors: Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong Institution: Hefei University of Technology; SenseTime Research; University of Oxford; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[CVPR-2023] Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning Authors: Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes Institution: Australian National University; Beihang University; The University of Oxford; Shanghai AI Lab; OPPO Research Institute
[CVPR-2023] Egocentric Audio-Visual Object Localization Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu Institution: University of Rochester; Meta Reality Labs Research
[CVPR-2023] Audio-Visual Grouping Network for Sound Localization from Mixtures Authors: Shentong Mo, Yapeng Tian Institution: Carnegie Mellon University; University of Texas at Dallas
[ICASSP-2023] Flowgrad: Using Motion for Visual Sound Source Localization Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes Institution: Universitat Pompeu Fabra; New York University
[ACM MM-2023] Audio-visual segmentation, sound localization, semantic-aware sounding objects localization Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu Institution: University of Technology Sydney; The University of Queensland; Futureverse; The Hong Kong University of Science and Technology; CSIRO DATA61; Netease Fuxi AI Lab
[ACM MM-2023] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization Authors: Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang Institution: Northwestern Polytechnical University; Nanchang University
[ACM MM-2023] Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim Institution: Kyung Hee University
[IJCAI-2023] Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beihang University; Alibaba Group
[ICCV-2023] Sound Source Localization is All about Cross-Modal Alignment Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung Institution: Korea Advanced Institute of Science and Technology; Harvard University; Pohang University of Science and Technology; Yonsei University
[ICCV-2023] Multimodal Variational Auto-encoder based Audio-Visual Segmentation Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai Institution: Northwestern Polytechnical University; Shaanxi Key Laboratory of Information Acquisition and Processing; Australian National University; Shanghai AI Laboratory
[WACV-2024] Can CLIP Help Sound Source Localization? Authors: Sooyoung Park, Arda Senocak, Joon Son Chung Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute
[NeurIPS-2023] Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization Authors: Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences, Beijing, China; DAMO Academy, Alibaba Group
[ICASSP-2024] Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization Authors: Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); Institute of Automation of Chinese Academy of Sciences
[CVPR-2024] Audio-Visual Segmentation via Unlabeled Frame Exploitation Authors: Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang Institution: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai AI Laboratory
[ACM MM-2024] CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization Authors: Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng Institution: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Center for Long-term Artificial Intelligence, Beijing, China; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, CAS, Shanghai, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China
[ACM MM-2024] Open-Vocabulary Audio-Visual Semantic Segmentation Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying Institution: National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China; Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Berkeley AI Research, University of California, Berkeley, CA, USA; College of Information and Electrical Engineering, China Agricultural University, Beijing, China
[ECCV-2024] Spherical World-Locking for Audio-Visual Localization in Egocentric Videos Authors: Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla, Anurag Kumar, Jacob Donley, Chao Li, Gunhee Kim, Vamsi Krishna Ithapu, Calvin Murdock Institution: Seoul National University; Reality Labs Research at Meta
[2019] DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction Authors: Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala Institution: Aalto University; Tampere University
[CVPR-2020] STAViS: Spatio-Temporal AudioVisual Saliency Network Authors: Antigoni Tsiami, Petros Koutras, Petros Maragos Institution: National Technical University of Athens
[IEEE TIP-2020] A Multimodal Saliency Model for Videos With High Audio-visual Correspondence Authors: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, Xinping Guan Institution: Shanghai Jiao Tong University; University of Macau; Ryerson University
[IROS-2021] ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction Authors: Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi Institution: International Institute for Information Technology; University of Canberra
[CVPR-2021] From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach Authors: Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin Institution: Beihang University; Qingdao University; Chinese Academy of Medical Sciences
[ICME-2021] Lavs: A Lightweight Audio-Visual Saliency Prediction Model Authors: Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, Xiaokang Yang Institution: Shanghai Jiao Tong University; Tongji University; Stevens Institute of Technology; Jiangxi Normal University; East China Normal University; Hainan University
[2022] A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian Institution: China University of Petroleum; Shandong University of Finance and Economics; Beijing Information Science and Technology University
[TOMCCAP-2022] PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection Authors: Yi Zhang, Fang-Yi Chao, Wassim Hamidouche, Olivier Deforges Institution: University Rennes; Institut National des Sciences Appliquées Rennes; Centre national de la recherche scientifique; Trinity College Dublin
[CVPR-2023] CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective Authors: Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao Zhai Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University; Shanghai Jiao Tong University
[CVPR-2023] Self-Supervised Video Forensics by Audio-Visual Anomaly Detection Authors: Chao Feng, Ziyang Chen, Andrew Owens Institution: University of Michigan
[IJCNN-2023] 3DSEAVNet: 3D-Squeeze-and-Excitation Networks for Audio-Visual Saliency Prediction Authors: Silong Liang, Chunxiao Li, Naying Cui, Minghui Sun, Hao Xue Institution: Jilin University
[IEEE TMM-2023] SVGC-AVA: 360-Degree Video Saliency Prediction with Spherical Vector-Based Graph Convolution and Audio-Visual Attention Authors: Qin Yang, Yuqi Li, Chenglin Li, Hao Wang, Sa Yan, Li Wei, Wenrui Dai, Junni Zou, Hongkai Xiong, Pascal Frossard Institution: Shanghai Jiao Tong University; École Polytechnique Fédérale de Lausanne
[TMM-2023] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio Authors: Dandan Zhu, Kaiwei Zhang, Nana Zhang, Qiangqiang Zhou, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang Institution: Institute of AI Education, Shanghai, East China Normal University; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; School of Computer Science and Technology, Donghua University; School of Software, Jiangxi Normal University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
[CVPR-2024] DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction Authors: Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University
[ECCV-2020] SoundSpaces: Audio-Visual Navigation in 3D Environments Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman Institution: The University of Texas at Austin; University of Illinois at Urbana-Champaign; Facebook Reality Labs; Facebook AI Research
[ICRA-2020] Look, Listen, and Act: Towards Audio-Visual Embodied Navigation Authors: Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum Institution: MIT-IBM Watson AI Lab; Tsinghua University; Massachusetts Institute of Technology; Google Inc.
[ICLR-2021] Learning to Set Waypoints for Audio-Visual Navigation Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[CVPR-2021] Semantic Audio-Visual Navigation Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[ICCV-2021] Move2Hear: Active Audio-Visual Source Separation Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman Institution: The University of Texas at Austin; Facebook AI Research
[2022] Sound Adversarial Audio-Visual Navigation Authors: Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, Xiaohong Liu Institution: Tsinghua University; Xinjiang University; The University of Texas at Austin; JD Explore Academy
[CVPR-2022] Towards Generalisable Audio Representations for Audio-Visual Navigation Authors: Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai Institution: University of Sydney
[NeurIPS-2022] SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Institution: The University of Texas at Austin; Reality Labs at Meta; Georgia Tech; Meta AI
[NeurIPS-2022] AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian Institution: University of California; Mitsubishi Electric Research Labs, Cambridge
[BMVC-2022] Pay Self-Attention to Audio-Visual Navigation Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, Liejun Wang Institution: Tsinghua University; Motherbrain, EQT; Xinjiang University
[CVPR-2022] Finding Fallen Objects Via Asynchronous Audio-Visual Integration Authors: Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[CVPR-2022] ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu Institution: Stanford University; Carnegie Mellon University
[IEEE RAL-2023] Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds Authors: Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, Abhinav Valada Institution: University of Freiburg
[2023] Audio Visual Language Maps for Robot Navigation Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard Institution: University of Freiburg; Google Research; University of Technology Nuremberg
[ICCV-2023] Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation Authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang Institution: Beihang University; Zhejiang University; The Chinese University of Hong Kong
[IROS-2024] Audio-Visual Traffic Light State Detection for Urban Robots Authors: Sagar Gupta, Akansel Cosgun Institution: Deakin University, Australia
Audio-visual Event Localization and Parsing
[ECCV-2018] Audio-visual Event Localization in Unconstrained Videos Authors: Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu Institution: University of Rochester
[ICASSP-2019] Dual-modality Seq2Seq Network for Audio-visual Event Localization Authors: Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang Institution: National Taiwan University
[ICCV-2019] Dual Attention Matching for Audio-Visual Event Localization Authors: Yu Wu, Linchao Zhu, Yan Yan, Yi Yang Institution: Baidu Research; University of Technology Sydney; Texas State University
[AAAI-2020] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan Institution: Nanjing University of Science and Technology
[ACCV-2020] Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization Authors: Yan-Bo Lin, Yu-Chiang Frank Wang Institution: National Taiwan University; ASUS Intelligent Cloud Services
[WACV-2021] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention Authors: Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan Institution: Illinois Institute of Technology; University of Trento; Texas State University
[CVPR-2021] Positive Sample Propagation along the Audio-Visual Event Line Authors: Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang Institution: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province; Australian National University
[AIKE-2021] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance Authors: Qiurui Yue, Xiaoyu Wu, Jiayi Gao Institution: Communication University of China
[TMM-2021] Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention Authors: Cheng Xue, Xionghu Zhong, Minjie Cai, Hao Chen, Wenwu Wang Institution: Hunan University; University of Surrey, United Kingdom
[CVPR-2022] Cross-Modal Background Suppression for Audio-Visual Event Localization Authors: Yan Xia, Zhou Zhao Institution: Zhejiang University
[ICASSP-2022] Bi-Directional Modality Fusion Network For Audio-Visual Event Localization Authors: Shuo Liu, Weize Quan, Yuan Liu, Dong-Ming Yan Institution: Chinese Academy of Sciences; Alibaba Group
[ICSIP-2022] Audio-Visual Event and Sound Source Localization Based on Spatial-Channel Feature Fusion Authors: Xiaolong Zheng, Ying Wei Institution: Shandong University
[IJCNN-2022] Look longer to see better: Audio-visual event localization by exploiting long-term correlation Authors: Longyin Guo, Qijun Zhao, Hongmei Gao Institution: Sichuan University; Tibet University
[EUSIPCO-2022] Audio Visual Graph Attention Networks for Event Detection in Sports Video Authors: Taichi Ishiwatari, Makiko Azuma, Takuya Handa, Masaki Takahashi, Takahiro Mochizuki, Masanori Sano Institution: Science and Technology Research Laboratories, NHK; Tokyo Institute of Technology
[IEEE TPAMI-2022] Contrastive Positive Sample Propagation along the Audio-Visual Event Line Authors: Jinxing Zhou, Dan Guo, Meng Wang Institution: Hefei University of Technology
[IEEE TPAMI-2022] Semantic and Relation Modulation for Audio-Visual Event Localization Authors: Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, Jiebo Luo Institution: University of Science and Technology of China; Chinese Academy of Sciences; University of Rochester
[WACV-2023] AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization Authors: Tanvir Mahmud, Diana Marculescu Institution: The University of Texas at Austin
[WACV-2023] Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, In So Kweon Institution: Korea Advanced Institute of Science & Technology; Harvard University; Pohang University of Science and Technology; Adobe Research
[ICASSP-2023] A dataset for Audio-Visual Sound Event Detection in Movies Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan Institution: University of Southern California
[CVPR-2023] Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline Authors: Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng Institution: Southern University of Science and Technology; University of Birmingham; The University of Hong Kong; Shandong University; Peng Cheng Laboratory
[CVPR-2023] Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies Authors: Bei Gan, Xiujun Shu, Ruizhi Qiao, Haoqian Wu, Keyu Chen, Hanjun Li, Bo Ren Institution: Tencent YouTu Lab
[ICASSP-2023] Collaborative Audio-Visual Event Localization Based on Sequential Decision and Cross-Modal Consistency Authors: Yuqian Kuang, Xiaopeng Fan Institution: Harbin Institute of Technology; PengCheng Lab
[CVPR-2023] Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory
[IJCNN-2023] Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization Authors: Jinqiao Dou, Xi Chen, Yuehai Wang Institution: Zhejiang University
[AAAI-2023] Furnishing Sound Event Detection with Language Model Abilities Authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing Jiaotong University
[IEEE TMM-2023] Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization Authors: Yuanyuan Jiang, Jianqin Yin, Yonghao Dang Institution: Beijing University of Posts and Telecommunications
[CVPR-2024] T-VSL: Text-Guided Visual Sound Source Localization in Mixtures Authors: Tanvir Mahmud, Yapeng Tian, Diana Marculescu Institution: University of Texas at Austin
[ICME-2024] Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios Authors: Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee Institution: University of Science and Technology of China; iFlytek Research; Georgia Institute of Technology
[ECCV-2024] Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu Institution: Gaoling School of Artificial Intelligence, Renmin University of China, China; Beijing University of Posts and Telecommunications, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Engineering Research Center of Next-Generation Search and Recommendation
[ECCV-2024] Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation Authors: Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu Institution: University of Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; Gaoling School of Artificial Intelligence, Renmin University of China, China; Engineering Research Center of Next-Generation Search and Recommendation
[ACM MM-2024] Unveiling and Mitigating Bias in Audio Visual Segmentation Authors: Peiwen Sun, Honggang Zhang, Di Hu Institution: Beijing University of Posts and Telecommunications, Beijing, China; Renmin University of China, Beijing, China
[ICASSP-2025] A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio Authors: Xavier Juanola, Gloria Haro, Magdalena Fuentes Institution: Universitat Pompeu Fabra, Barcelona, Spain; MARL-IDM, New York University, New York, USA;
[ECCV-2020] Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing Authors: Yapeng Tian, Dingzeyu Li, Chenliang Xu Institution: University of Rochester; Adobe Research
[CVPR-2021] Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing Authors: Yu Wu, Yi Yang Institution: Baidu Research; University of Technology Sydney
[NeurIPS-2021] Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang Institution: National Yang Ming Chiao Tung University; UNC Chapel Hill; University of California, Merced; Snap Research; Google Research; Yonsei University
[2022] Investigating Modality Bias in Audio Visual Video Parsing Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan Institution: Indian Institute of Technology
[ICASSP-2022] Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding Authors: Penghong Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan Institution: Harbin Institute of Technology; Wireless Technology Lab
[ECCV-2022] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing Authors: Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, Limin Wang Institution: Nanjing University; SenseTime Research; The Chinese University of Hong Kong; Shanghai AI Laboratory
[NeurIPS-2022] Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing Authors: Shentong Mo, Yapeng Tian Institution: Carnegie Mellon University; University of Texas at Dallas
[2023] Improving Audio-Visual Video Parsing with Pseudo Visual Labels Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang Institution: Hefei University of Technology; Shanghai AI Lab
[ICASSP-2023] CM-CS: Cross-Modal Common-Specific Feature Learning For Audio-Visual Video Parsing Authors: Hongbo Chen, Dongchen Zhu, Guanghui Zhang, Wenjun Shi, Xiaolin Zhang, Jiamao Li Institution: Chinese Academy of Sciences; ShanghaiTech University; University of Chinese Academy of Sciences
[2023] Towards Long Form Audio-visual Video Understanding Authors: Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu Institution: Renmin University of China; The University of Texas at Dallas
[ACM MM-2023] TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification Authors: Meng Liu, Ke Liang, Dayu Hu, Hao Yu, Yue Liu, Lingyuan Meng, Wenxuan Tu, Sihang Zhou, Xinwang Liu Institution: National University of Defense Technology
[WACV-2024] Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing Authors: Yating Xu, Conghui Hu, Gim Hee Lee Institution: National University of Singapore
[ECCV-2024] Label-anticipated Event Disentanglement for Audio-Visual Video Parsing Authors: Jinxing Zhou, Dan Guo, Yuxin Mao, Yiran Zhong, Xiaojun Chang, Meng Wang Institution: Hefei University of Technology; Anhui Zhonghuitong Technology Co., Ltd.; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Northwestern Polytechnical University; Shanghai AI Laboratory; University of Science and Technology of China; MBZUAI
Audio-visual Question Answering and Dialog
[ICCV-2021] Pano-AVQA: Grounded Audio-Visual Question Answering on 360deg Videos Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim Institution: Seoul National University; Allen Institute for AI; University of Oxford; Hyundai Motor Company
[CVPR-2022] Learning To Answer Questions in Dynamic Audio-Visual Scenarios Authors: Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu Institution: Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; University of Rochester
[NeurIPS-2022] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji Institution: University of Illinois at Urbana-Champaign; MSR; The University of North Carolina at Chapel Hill; Columbia University
[ACM MM-2023] Progressive Spatio-temporal Perception for Audio-Visual Question Answering Authors: Guangyao Li, Wenxuan Hou, Di Hu Institution: Renmin University of China
[WACV-2024] CAD -- Contextual Multi-modal Alignment for Dynamic AVQA Authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa Institution: University of Surrey; BBC Research and Development
[AAAI-2024] Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang Institution: School of Computer Science and Information Engineering, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[InterSpeech-2024] Towards Multilingual Audio-Visual Question Answering Authors: Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma Institution: IIIT-Delhi, India; Reliance Jio AICoE, Hyderabad, India; University of Tartu, Estonia;
[ECCV-2024] Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality Authors: Kyu Ri Park, Hong Joo Lee, Jung Uk Kim Institution: Kyung Hee University, Yong-in, South Korea; Technical University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
[ACM MM-2024] Boosting Audio Visual Question Answering via Key Semantic-Aware Cues Authors: Guangyao Li, Henghui Du, Di Hu Institution: GSAI, Renmin University of China, Beijing, China;
[CVPR-2019] Audio Visual Scene-Aware Dialog Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh Institution: Georgia Institute of Technology; Mitsubishi Electric Research Laboratories
[Interspeech-2019] Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog Authors: Chiori Hori, Anoop Cherian, Tim K. Marks, Takaaki Hori Institution: Mitsubishi Electric Research Laboratories
[ICASSP-2019] End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh Institution: Mitsubishi Electric Research Laboratories; Georgia Institute of Technology
[CVPR-2019] A Simple Baseline for Audio-Visual Scene-Aware Dialog Authors: Idan Schwartz, Alexander G. Schwing, Tamir Hazan Institution: Technion; University of Illinois at Urbana-Champaign
[CVPR-2019] Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman Institution: Intel Labs (Anticipatory Computing Lab)
[2020] TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog Authors: Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li Institution: Didi Chuxing
[AAAI-2021] Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian Institution: Rutgers University; The Chinese University of Hong Kong; University of Illinois at Urbana Champaign; Mitsubishi Electric Research Laboratories
[2021] VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani Institution: Columbia University; Facebook AI; Georgia Tech; Dartmouth
[ICASSP-2022] Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning Authors: Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori Institution: Mitsubishi Electric Research Laboratories; Carnegie Mellon University; Rutgers University; The Chinese University of Hong Kong
[WACV-2022] QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog Authors: Muchao Ye, Quanzeng You, Fenglong Ma Institution: The Pennsylvania State University; Microsoft Azure Computer Vision
[TACL-2022] Learning English with Peppa Pig Authors: Mitja Nikolaus, Afra Alishahi, Grzegorz Chrupała Institution: Aix-Marseille University; Tilburg University
[2022] End-to-End Multimodal Representation Learning for Video Dialog Authors: Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa Institution: Georgia Institute of Technology
[AAAI-2022] Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima Institution: Nippon Telegraph and Telephone Corporation
[IEEE/ACM TASLP-2023] DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog Authors: Zhe Chen, Hongcheng Liu, Yu Wang Institution: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
[ICIP-2022] Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition Authors: Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai Institution: University of Science and Technology of China; Chinese Academy of Sciences; iFLYTEK Co., Ltd.