As computational methods become more widely developed and used in humanities research, methodologies are needed that properly address the questions posed by different disciplines while taking into account the locality of both the data and the process that generated it. In the present work, we explore the problem of automatically identifying the main topics in collections of Nahua discourses known as huehuetlahtollis. Each document in the collections is introduced by an extended title, which naturally raises the question of whether giving title terms a greater role in the unsupervised learning process could enrich the results. Aiming at explainability, we consider a model based on nonnegative matrix factorization (NMF). An overview of the historical process behind the composition of the explored corpora suggests that titles reflect the point of view of the collection’s compiler in ways that justify viewing the paratext as a supplementary source on the material. We therefore propose a bi-objective NMF scheme that reflects this a priori knowledge about the corpus, linking and combining the information in titles and content to improve accuracy in identifying topic groups and relevant terms within a corpus. By comparing three different schemes against the labels assigned by an expert, we show that our model better reflects the nature of the data, which translates into higher accuracy. Finally, we present some insights into the studied corpora derived from our analysis of the identified relevant terms.
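
To make the idea of combining titles and content concrete, the following is a minimal sketch of one common way to couple two NMF objectives: a weighted sum of two reconstruction errors that share a single document-topic factor. It is an illustrative assumption, not the scheme proposed in this work. The function name `joint_nmf`, the weight `lam`, the multiplicative update rules, and the random toy matrices are all hypothetical choices made for the example.

```python
import numpy as np

def joint_nmf(content, titles, k, lam=1.0, n_iter=200, seed=0, eps=1e-9):
    """Weighted-sum joint NMF (illustrative assumption, not the paper's method).

    Minimizes ||content - W @ H_c||_F^2 + lam * ||titles - W @ H_t||_F^2
    with W, H_c, H_t >= 0, using Lee-Seung style multiplicative updates.
    W (documents x topics) is shared by both objectives.
    """
    rng = np.random.default_rng(seed)
    n_docs = content.shape[0]
    W = rng.random((n_docs, k)) + eps
    H_c = rng.random((k, content.shape[1])) + eps
    H_t = rng.random((k, titles.shape[1])) + eps
    history = []
    for _ in range(n_iter):
        # Update the two topic-term factors with W held fixed.
        WtW = W.T @ W
        H_c *= (W.T @ content) / (WtW @ H_c + eps)
        H_t *= (W.T @ titles) / (WtW @ H_t + eps)
        # Update the shared document-topic factor; lam weights the title term.
        numer = content @ H_c.T + lam * (titles @ H_t.T)
        denom = W @ (H_c @ H_c.T + lam * (H_t @ H_t.T)) + eps
        W *= numer / denom
        loss = (np.linalg.norm(content - W @ H_c) ** 2
                + lam * np.linalg.norm(titles - W @ H_t) ** 2)
        history.append(loss)
    return W, H_c, H_t, history

# Toy nonnegative matrices standing in for document-term counts
# (6 documents, 8 content terms, 4 title terms); not real corpus data.
toy_rng = np.random.default_rng(42)
content = toy_rng.random((6, 8))
titles = toy_rng.random((6, 4))

W, H_c, H_t, history = joint_nmf(content, titles, k=2, lam=2.0)
topic_of_doc = W.argmax(axis=1)  # dominant topic index per document
```

In this sketch, a larger `lam` pushes the shared factor `W` to fit the title matrix more closely, which is one simple way to give title terms more influence; the rows of `H_c` and `H_t` can then be inspected for the highest-weighted terms per topic.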

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)