Integration strategy for impact metrics of institutional academic output through a Data Warehouse

a case study with OpenAlex, OpenAIRE, and COAR

Authors

Pablo César de Albuquerque, Gonzalo Luján Villarreal

DOI:

https://doi.org/10.5195/biblios.2025.1348

Keywords:

Institutional academic output, Persistent identifiers, Responsible metrics, Data Warehouse, Data Vault

Abstract

Objective. This article proposes a strategy for integrating data from multiple sources on academic output, facilitating informed decision-making. The approach is adaptable to various organizations, regardless of the number or type of sources involved. Method. An integration system was designed based on open-source tools and a scalable hybrid data model. It combines Data Warehouse techniques (Kimball & Ross) to optimize analysis, and Data Vault 2.0 to manage heterogeneity and ensure traceability, enabling flexible integration. Results. Data from OpenAIRE, OpenAlex, and COAR were integrated into a unified academic publications table, consolidating key metrics such as citations, views, and downloads. The table includes relevant information such as title, DOI, publication type and year, as well as open access status. Conclusions. Data integration enables a more comprehensive view of the impact of institutional scientific output. This approach supports the implementation of responsible metrics.
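As a rough illustration of the consolidation step described in the Results, the following Python sketch merges per-source records into one row per publication, keyed by a normalized DOI. It is not the authors' implementation (the cited dbt-scholar and kedro-scholar repositories suggest that dbt and Kedro are used instead), and it does not reproduce the Data Vault or dimensional layers. The `normalize_doi` and `integrate` helpers, the field names, and the sample records are all hypothetical and do not mirror the real OpenAlex or OpenAIRE schemas.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical column names for the unified table (assumptions, not real API fields).
DESCRIPTIVE = ("title", "year", "pub_type", "is_open_access")
METRICS = ("citations", "views", "downloads")
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' scheme."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


@dataclass
class Publication:
    doi: str
    title: Optional[str] = None
    year: Optional[int] = None
    pub_type: Optional[str] = None
    is_open_access: Optional[bool] = None
    # metric name -> {source name -> value}, so each figure stays traceable
    metrics: dict = field(default_factory=dict)


def integrate(records_by_source: dict) -> dict:
    """Merge per-source records into one Publication per normalized DOI."""
    table = {}
    for source, records in records_by_source.items():
        for rec in records:
            doi = normalize_doi(rec["doi"])
            pub = table.setdefault(doi, Publication(doi=doi))
            for name in DESCRIPTIVE:
                # The first non-missing value wins; later sources only fill gaps.
                if getattr(pub, name) is None and rec.get(name) is not None:
                    setattr(pub, name, rec[name])
            for name in METRICS:
                if rec.get(name) is not None:
                    pub.metrics.setdefault(name, {})[source] = rec[name]
    return table


# Made-up records for illustration only.
sample = {
    "openalex": [
        {"doi": "https://doi.org/10.1234/ABC", "title": "Example work",
         "year": 2023, "citations": 5},
    ],
    "openaire": [
        {"doi": "10.1234/abc", "pub_type": "article", "is_open_access": True,
         "citations": 4, "views": 120, "downloads": 30},
    ],
}
unified = integrate(sample)
```

Storing each metric as a per-source mapping, rather than summing the values or keeping only one of them, is a deliberate choice in this sketch: it keeps every figure attributable to its origin, in the spirit of the traceability the abstract emphasizes.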

Author Biographies

Pablo César de Albuquerque, National University of La Plata

He holds a Bachelor’s degree in Computer Science from the National University of La Plata (UNLP) and is currently pursuing a Ph.D. in Computer Science at the School of Computer Science of the same university. He conducts his research at PREBI-SEDICI (UNLP) and at the Center for Information Management Services (CESGI) of the Scientific Research Commission of the Province of Buenos Aires (CIC).

Gonzalo Luján Villarreal, National University of La Plata

He holds a Ph.D. in Computer Science from the National University of La Plata (UNLP). He currently serves as Director of PREBI-SEDICI (UNLP) and Director of the Center for Information Management Services (CESGI) of the Scientific Research Commission of the Province of Buenos Aires (CIC).

References

Add seeds to your DAG. (2025, April 3). dbt Developer Hub. Retrieved April 4, 2025, from https://docs.getdbt.com/docs/build/seeds

Aghassibake, N., Castello, O. G., Gujilde, P., & Rabun, S. (2023). Visualizing institutional activity using persistent identifier metadata. Information Services & Use, 43(3-4), 335–342. https://doi.org/10.3233/ISU-230218

Albuquerque, P. C. (2024a). PabloDeAlbu/dbt-scholar [Software]. GitHub. https://github.com/PabloDeAlbu/dbt-scholar

Albuquerque, P. C. (2024b). PabloDeAlbu/kedro-scholar [Jupyter notebook]. GitHub. https://github.com/PabloDeAlbu/kedro-scholar

Albuquerque, P. C., Villarreal, G. L., & De Giusti, M. R. (2021, June 22–25). Proposal of a data warehouse for scholarly institutions built on institutional repositories [Conference object]. IX Jornadas de Cloud Computing, Big Data & Emerging Topics, La Plata, Buenos Aires, Argentina. http://sedici.unlp.edu.ar/handle/10915/125161

Albuquerque, P. C., Villarreal, G. L., & De Giusti, M. R. (2022, October 3–7). WebID como base para el desarrollo de una marca personal en repositorios institucionales [Conference object]. XI Conferencia Internacional de Bibliotecas y Repositorios Digitales (BIREDIAL-ISTEC), Costa Rica. http://sedici.unlp.edu.ar/handle/10915/145739

Albuquerque, P. C., Villarreal, G. L., & De Giusti, M. R. (2023, October 18–20). Modelo dimensional para la medición de la producción académica [Conference object]. XII Conferencia Internacional de Bibliotecas y Repositorios Digitales (BIREDIAL-ISTEC), Montevideo, Uruguay. http://sedici.unlp.edu.ar/handle/10915/161906

Apache Superset. (2025). Apache Superset™ is an open-source modern data exploration and visualization platform. Retrieved April 4, 2025, from https://superset.apache.org/

Bollini, A., Knoth, P., Perakakis, P., Rodrigues, E., Shearer, K., Sompel, V. de, & Walk, P. (2017). Next generation repositories: Behaviours and technical recommendations of the COAR Next Generation Repositories Working Group (Version 2) [Original report]. Zenodo. https://doi.org/10.5281/zenodo.8077381

Cabezas-Clavijo, A., & Torres-Salinas, D. (2021). Bibliometric reports for institutions: Best practices in a responsible metrics scenario. Frontiers in Research Metrics and Analytics, 6, Article e696470. https://doi.org/10.3389/frma.2021.696470

Carletti, E., Rucci, E., & Villarreal, G. L. (2024, October 22–24). HERA 2.0: Más funcionalidad para la evaluación de recursos académicos [Conference object]. XIII Conferencia Internacional de Bibliotecas y Repositorios Digitales (BIREDIAL-ISTEC), Santiago de Chile, Chile. http://sedici.unlp.edu.ar/handle/10915/177287

Ciuciu-Kiss, J. T., & Garijo, D. (2024, May 27). Assessing the overlap of science knowledge graphs: A quantitative analysis [Conference paper]. International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs, Hersonissos, Crete, Greece. In G. Rehm, S. Dietze, S. Schimmler, & F. Krüger (Eds.), Natural scientific language processing and research knowledge graphs, Lecture Notes in Computer Science (Vol. 14770, pp. 171–185). Springer. https://doi.org/10.1007/978-3-031-65794-8_11

Cuartas, G. V., Tirado, A. U., Restrepo-Quintero, D., Gutiérrez, J. O., Pallares, C., Gómez-Molina, H. F., Suárez-Tamayo, M., & Calle, J. (2019). Hacia un modelo de medición de la ciencia desde el Sur Global: Métricas responsables. Palabra Clave, 8(2), Article e068. https://doi.org/10.24215/18539912e068

Data catalog. (2025). Kedro. Retrieved July 22, 2025, from https://docs.kedro.org/en/1.0.0/catalog-data/introduction/

Dhaouadi, A., Bousselmi, K., Gammoudi, M. M., Monnet, S., & Hammoudi, S. (2022). Data warehousing process modeling from classical approaches to new trends: Main features and comparisons. Data, 7(8), Article 113. https://doi.org/10.3390/data7080113

Donthu, N., Kumar, S., Mukherjee, D., Pandey, N., & Lim, W. M. (2021, September). How to conduct a bibliometric analysis: An overview and guidelines. Journal of Business Research, 133, 285–296. https://doi.org/10.1016/j.jbusres.2021.04.070

Filtering search results. (2025). OpenAIRE Graph Documentation. Retrieved July 22, 2025, from https://graph.openaire.eu/docs/10.3.0/apis/graph-api/searching-entities/filtering-search-results/

Harder, R. (2024, June). Using Scopus and OpenAlex APIs to retrieve bibliographic data for evidence synthesis: A procedure based on Bash and SQL. MethodsX, 12, Article 102601. https://doi.org/10.1016/j.mex.2024.102601

Hogan, A., Blomqvist, E., Cochez, M., D’Amato, C., Melo, G. de, Gutierrez, C., Kirrane, S., Gayo, J. E. L., Navigli, R., Neumaier, S., Ngomo, A. C. N., Polleres, A., Rashid, S. M., Rula, A., Schmelzeisen, L., Sequeda, J., Staab, S., & Zimmermann, A. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), Article 71. https://doi.org/10.1145/3447772

Kimball, R., & Ross, M. (2013). The data warehouse lifecycle toolkit (3rd ed.). John Wiley & Sons.

Linstedt, D., & Olschimke, M. (2015). Building a scalable data warehouse with Data Vault 2.0 (1st ed.). Morgan Kaufmann.

Manghi, P., Bardi, A., Atzori, C., Baglioni, M., Manola, N., Schirrwagen, J., & Principe, P. (2019). The OpenAIRE research graph data model (Version 1.3) [Original report]. Zenodo. https://doi.org/10.5281/zenodo.2643199

Öztürk, O., Kocaman, R., & Kanbach, D. K. (2024). How to design bibliometric research: An overview and a framework proposal. Review of Managerial Science, 18, 3333–3361. https://doi.org/10.1007/s11846-024-00738-0

Priem, J., Piwowar, H., & Orr, R. (2022, May 4). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts [Preprint arXiv]. Submitted to the 26th International Conference on Science, Technology and Innovation Indicators (STI 2022), Granada, Spain. arXiv. https://doi.org/10.48550/arXiv.2205.01833

Searching entities. (2025). OpenAIRE Graph Documentation. Retrieved July 22, 2025, from https://graph.openaire.eu/docs/apis/graph-api/searching-entities/

Silva, V. S., Matas, L., Moreira, T., & Segundo, W. C. (2022). An ETL strategy for integrating the LA Referencia platform and VIVO for the Brazilian CRIS. Procedia Computer Science, 211, 111–117. https://doi.org/10.1016/j.procs.2022.10.182

Tomczyńska, A., Ostrowska, S., Protasiewicz, J., & Podwysocki, E. (2023, June 15). Beyond CRIS: A research and higher education information system in Poland [Paper]. EUNIS 2023 Annual Conference, Vigo, Spain. http://hdl.handle.net/11366/2477

Universidad Nacional de La Plata. (2025). OpenAlex. Retrieved April 4, 2025, from https://openalex.org/institutions/i874386039

Use a Jupyter notebook for Kedro project experiments. (2024). Kedro. Retrieved April 4, 2025, from https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html

Works overview: Schema reference for Works entities. (2025). OpenAlex. Retrieved April 4, 2025, from https://docs.openalex.org/api-entities/works/work-object

Published

2026-05-15

How to Cite

Albuquerque, P. C. de, & Villarreal, G. L. (2026). Integration strategy for impact metrics of institutional academic output through a Data Warehouse: a case study with OpenAlex, OpenAIRE, and COAR. Biblios Journal of Librarianship and Information Science, (88), e013. https://doi.org/10.5195/biblios.2025.1348

Issue

Section

Original