Last modified: 2026-01-24
Abstract
Computational linguistics and multimodality intersect significantly, enhancing language comprehension through the integration of various data types. Recent advances in machine learning have led to the development of models that leverage both linguistic and visual features, improving performance across tasks. This synthesis of modalities not only aids in language processing but also addresses challenges in data representation and alignment (Sun et al., 2023).
Models that rely on multimodal data have been reported to outperform unimodal models, as evidenced by improvements on the GLUE benchmark (Luo et al., 2021). Research indicates that individual modalities can significantly affect performance in tasks such as sentiment analysis and emotion recognition, underscoring the need to better understand their roles (Haouhat et al., 2023). However, effective integration of diverse modalities remains a challenge, particularly in ensuring robust alignment and reasoning capabilities (Haouhat et al., 2023). Current models often focus on the unimodal origins of language, highlighting the need for frameworks that incorporate multimodal communication principles (Grifoni et al., 2021).
The objective of this research was to analyze emerging trends in computational linguistics and multimodality in the scientific literature. A systematic content analysis was conducted (Cabanillas-García et al., 2022) using the Scopus database and the keywords "computational linguistics" and "multi-modality," yielding a total of 127 documents following the PRISMA protocol. The resulting keyword co-occurrence map (Figure 1) represents the interrelations within computational linguistics and multimodality, showing how various subfields interconnect in this discipline. At the center, the terms "computational linguistics" and "multi-modality" stand out as the main axes, illustrating that research in this area focuses on integrating various data modalities, such as language, image, and audio, to enhance information processing and comprehension. Surrounding these central concepts are several thematic clusters, represented by different colors, each with a specialized focus.
For instance, the blue nodes, including "language model" and "large language models," relate to the development and application of language models, which are essential for natural language processing (NLP) tasks and semantic analysis. These nodes are strongly connected to terms like "natural language processing" and "semantics," highlighting the importance of building robust systems to deeply understand human language.
In contrast, the red cluster, which includes terms such as "machine translations," "performance," and "benchmarking," appears focused on machine translation and the performance of these models in practical applications. This suggests that, beyond modeling, there is a strong emphasis on improving the efficiency and accuracy of these technologies, especially in translation tasks where precise interpretation across languages is crucial.
Meanwhile, the yellow nodes, such as "visual languages," "classification of information," and "speech recognition," point to the integration of different data modalities. This multimodal approach focuses on merging visual and auditory information with language processing, highlighting applications ranging from speech recognition to image classification and text comprehension in various contexts. This shows how computational linguistics extends beyond text to develop systems capable of interpreting and combining data from different sources, such as images and sounds, for a more comprehensive representation of information.
This map illustrates the complex interaction between language models and their performance, along with the importance of benchmarking and performance analysis as key elements in measuring progress in the accuracy and efficiency of these technologies. Altogether, the map demonstrates how modern computational linguistics encompasses an interconnected network of concepts, where multimodal learning, language processing, and machine translation converge to create more versatile and accurate AI applications across diverse settings. While multimodal approaches show promise, reliance on dominant modalities may overshadow the contributions of others, suggesting the need for a balanced exploration in future research.
Figure 1. Co-occurrence map of keywords
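As an illustration of the kind of counting that underlies a keyword co-occurrence map, the following minimal Python sketch tallies how often pairs of keywords appear together in the same document. It is not the procedure or software used in this study: the function name `cooccurrence_counts` and the three sample records are hypothetical, chosen only to echo terms discussed above, and the sketch assumes each document is reduced to a simple list of keywords.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_lists):
    """Count how often each unordered keyword pair shares a document."""
    pair_counts = Counter()
    for keywords in keyword_lists:
        # Normalise case and drop duplicates within a single document.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        # Each pair is stored in alphabetical order so (a, b) == (b, a).
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical records, invented for illustration only
# (not the 127 Scopus documents analysed in the study).
records = [
    ["Computational Linguistics", "Multi-modality", "Language Model"],
    ["Computational Linguistics", "Machine Translations", "Benchmarking"],
    ["Computational Linguistics", "Multi-modality", "Speech Recognition"],
]

counts = cooccurrence_counts(records)
for (a, b), n in counts.most_common(3):
    print(f"{a} / {b}: {n}")
```

In a map of this kind, pair counts such as these would typically serve as edge weights between keyword nodes, with a separate clustering step assigning the colors.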
References
- Cabanillas-García, J. L., Luengo, R., & Carvalho, J. L. (2022). Bibliographic and content review on the use of technology in people with disabilities during the pandemic. In E. M. Pope, C. Brindão & C. G. Sanders (Eds.), Qualitative Research: Practices and Challenges (Vol. 11) (e535). https://doi.org/10.36367/ntqr.11.2022.e535
- Haouhat, A., Bellaouar, S., Nehar, A., & Cherroun, H. (2023). Modality Influence in Multimodal Machine Learning. arXiv preprint arXiv:2306.06476. https://arxiv.org/abs/2306.06476
- Grifoni, P., D’Ulizia, A., & Ferri, F. (2021). When language evolution meets multimodality: Current status and challenges toward multimodal computational models. IEEE Access, 9, 35196-35206. https://ieeexplore.ieee.org/abstract/document/9361559
- Sun, H., Niu, Z., Yu, X., Liu, J., Chen, Y. W., & Lin, L. (2023). Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding. arXiv preprint arXiv:2311.10791. https://arxiv.org/abs/2311.10791
- Luo, X., Cao, C., & Wang, L. (2021, December). Multi-modal Universal Embedding Representations for Language Understanding. In International Conference on Frontiers in Cyber Security (pp. 103-119). Springer Singapore.