Unsupervised Improvement of Audio-Text Cross-Modal Representations
dc.contributor.author | Wang, Zhepei | |
dc.contributor.author | Subakan, Cem | |
dc.contributor.author | Subramani, Krishna | |
dc.contributor.author | Wu, Junkai | |
dc.contributor.author | TIAGO FERNANDES TAVARES | |
dc.contributor.author | FABIO JOSE AYRES | |
dc.contributor.author | Smaragdis, Paris | |
dc.creator | Wang, Zhepei | |
dc.creator | Subakan, Cem | |
dc.creator | Subramani, Krishna | |
dc.creator | Wu, Junkai | |
dc.creator | Smaragdis, Paris | |
dc.date.accessioned | 2025-01-08T00:04:10Z | |
dc.date.available | 2025-01-08T00:04:10Z | |
dc.date.issued | 2023 | |
dc.description.abstract | Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks. | en |
dc.format | Digital | |
dc.format.extent | 5 p. | |
dc.identifier.uri | https://repositorio.insper.edu.br/handle/11224/7245 | |
dc.language.iso | English | |
dc.subject | Audio-text representation learning | en |
dc.subject | Data augmentation | en |
dc.subject | Contrastive learning | en |
dc.subject | Sound event classification | en |
dc.subject | Acoustic scene classification | en |
dc.title | Unsupervised Improvement of Audio-Text Cross-Modal Representations | |
dc.type | conference paper | |
dspace.entity.type | Publication | |
local.description.event | 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics | |
local.identifier.sourceUri | https://arxiv.org/abs/2305.01864 | |
local.publisher.city | New York | |
local.publisher.country | United States | |
local.subject.cnpq | EXACT AND EARTH SCIENCES::COMPUTER SCIENCE | |
local.subject.cnpq | ENGINEERING::ELECTRICAL ENGINEERING | |
local.type | Conference Paper | |
relation.isAuthorOfPublication | b94cce1d-a49e-40dc-becd-051f9254fab8 | |
relation.isAuthorOfPublication | 37971022-7c69-4e93-9186-4c9431a1f95c | |
relation.isAuthorOfPublication.latestForDiscovery | b94cce1d-a49e-40dc-becd-051f9254fab8 |
Files
Original Bundle
- Name: ACESSO_RESTRITO_Trabalho_de_Evento_2023_Unsupervised_improvement_of_audio_text_cross_modal_representations_TC.pdf
- Size: 262.54 KB
- Format: Adobe Portable Document Format
- Name: Primeira_Pagina_Trabalho_de_Evento_2023_Unsupervised_improvement_of_audio_text_cross_modal_representations_TC.pdf
- Size: 139.37 KB
- Format: Adobe Portable Document Format
License Bundle
- Name: license.txt
- Size: 236 B
- Format: Item-specific license agreed upon to submission