Minería de texto en la clasificación de documentos digitales

Marcial Contreras Barrera

doi:10.5195/biblios.2016.309

Text mining in the classification of digital documents

Authors

Marcial Contreras Barrera Universidad Nacional Autónoma de México – UNAM http://orcid.org/0000-0002-2837-5221

DOI:

https://doi.org/10.5195/biblios.2016.309

Keywords:

Text mining, Classification, Automated classifier, Bibliographic material

Abstract

Objective: Develop an automated classifier for the classification of bibliographic material by means of the text mining. Methodology: The text mining is used for the development of the classifier, based on a method of type supervised, conformed by two phases; learning and recognition, in the learning phase, the classifier learns patterns across the analysis of bibliographical records, of the classification Z, belonging to library science, information sciences and information resources, recovered from the database LIBRUNAM, in this phase is obtained the classifier capable of recognizing different subclasses (LC). In the recognition phase the classifier is validated and evaluates across classification tests, for this end bibliographical records of the classification Z are taken randomly, classified by a cataloguer and processed by the automated classifier, in order to obtain the precision of the automated classifier. Results: The application of the text mining achieved the development of the automated classifier, through the method classifying documents supervised type. The precision of the classifier was calculated doing the comparison among the assigned topics manually and automated obtaining 75.70% of precision. Conclusions: The application of text mining facilitated the creation of automated classifier, allowing to obtain useful technology for the classification of bibliographical material with the aim of improving and speed up the process of organizing digital documents.

Author Biography

Marcial Contreras Barrera, Universidad Nacional Autónoma de México – UNAM

Técnico Académico, Subdirección de Informática, Departamento de Producción, Dirección General de Bibliotecas, Universidad Nacional Autónoma de México – UNAM, México.

References

Abbott, D. (10 de Julio de 2013). Introduction to Text Mining. Recuperado el 17 de 6 de 2014, de http://www.vscse.org/summerschool/2013/Abbott.pdf

Abdullah Muhammad, A. (2014). Medical Document Classification Based on MeSH. 2014 47th Hawaii International Conference on System Sciences (págs. 2571 - 2575). Waikoloa, HI: I EEE.

Ananiadou, S., Kell, D. B., & Tsujiii, J.-i. (October de 2006). Text mining and its potential applications in systems biology. (ELSEVIER, Ed.) Trends in Biotechnology, 24(12), 9.

Arkaitz Zubiaga, V. F. (2009). Comparativa de aproximaciones a SVM semisupervisado multiclase para clasificación de páginas Web. Recuperado el 16 de 10 de 2015, de Dialnet: http://dialnet.unirioja.es/servlet/articulo?codigo=2973575

Dey, L., Rastogi, A. C., & Kumar, S. (2006). Generating Concept Ontologies Through Text Mining. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (págs. 23 - 32 ). Hong Kong : IEEE.

Katerina Frantzi, S. A. (August de 2000). Automatic recognition of multi-word terms:. the C-value/NC-value method. (S. Link, Ed.) International Journal on Digital Libraries, 3(2), 115-130.

LAN, Q. (2010). Extraction of News Content for Text Mining Based on Edit Distance. Journal of Computational Information Systems, (págs. 3761-3777).

Lee, S., Baker, J., Song, J., & Wetherbe, J. C. (2010). An Empirical Comparison of Four Text Mining Methods . Proceedings of the 43rd Hawaii International Conference on System Sciences - 2010 (págs. 1-10). Hawaii : IEEE.

Lévano, G. L. (12 de 06 de 2011). Clasificación de colecciones. Recuperado el 12 de 08 de 2013, de http://www.ugel05.edu.pe/

M.Sukanya, S. (2012). Techniques on Text Mining. 2012 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), (págs. 269-271). Ramanathapuram .

Maggini, M., Rigutini, L., & Turchi, M. (2004). Pseudo-Supervised Clustering for Text Documents. Web Intelligence, 2004. WI 2004. Proceedings. IEEE/WIC/ACM International Conference on (págs. 363 - 369 ). IEEE .

Mahdi Shafiei, S. W. (2007). Document Representation and Dimension Reduction for Text Clustering. Workshop on Text Data Mining and Management (TDMM) in conjuction with 23rd IEEE conference (págs. 770-778). Turquia: IEEE.

Maowen, W., Caidong, Z., Weiyao, L., & QingQiang, W. (2012 ). Text Topic Mining Based on LDA and Co-occurrence Theory. Computer Science & Education (ICCSE), 2012 7th International Conference on (págs. 525 - 528 ). Melbourne, VIC : IEEE .

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. En J. K. Michael W. Berry, Text mining : applications and theory. New Jersey: Mic hael W. Berry and Jacob Kogan.

Salton, G. (1989). Automatic text processing : The transformation, analysis, and retrieval of information by computer. E.U.A: Eddison Wesley.

Salton, G., & Mcgill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Swanson, D. R. (1991). Complementary structures in disjoint science literatures. In Proceedings of the 14th Annual International ACM/SIGIR Conference, 280-289.

Swanson, D., & Smalhaiser, N. (1994). Assessing a gap in the biomedical literature: magnesium deficiency and neurologic disease. Neuroscience research communications, 15, 1-9.

Verma, V. K., Ranjan, M., & Mishra, P. (2015). Text mining and information professionals: Role, issues and challenges . Emerging Trends and Technologies in Libraries and Information Services (ETTLIS), 2015 4th International Symposium on (págs. 133 - 137 ). Noida : IEEE .

Wang, Z. (2010). Document Classification Algorithm Based on Kernel Logistic Regression. Industrial and Information Systems (IIS), 2010 2nd International Conference on (Volume:1 ) (págs. 76 - 79 ). Dalian : IEEE.

Wei, W., & Barnaghi, P. M. (23 de sep de 2013). University of Surrey. Recuperado el 15 de octubre de 2015, de http://epubs.surrey.ac.uk/533646/

Xiu-Li, P., Feng, Y.-Q., & Jiang, W. (2007). An improved document classificaction approach with maximum entropy and entropy feauture selection. 2007 International Conference on Machine Learning and Cybernetics (págs. 3911-3915). Hong Kong: IEEE.

Zhang, Y., & Gu, H. (2011). Text Mining with Application to Academic Libraries. En Computer Science for Environmental Engineering and EcoInformatics (págs. 200-205). Springer Berlin Heidelberg.

Downloads

PDF (Español (España))

Published

2016-11-21

How to Cite

Contreras Barrera, M. (2016). Text mining in the classification of digital documents. Biblios Journal of Librarianship and Information Science, (64), 33–43. https://doi.org/10.5195/biblios.2016.309

Download Citation

Issue

No. 64 (2016)

Section

Original

License

Authors who publish with this journal agree to the following terms:

The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons Attribution 4.0 International License or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
1. Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
with the understanding that the above condition can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post online a prepublication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work. Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
The Author represents and warrants that:
1. the Work is the Author’s original work;
2. the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
3. the Work is not pending review or under consideration by another publisher;
4. the Work has not previously been published;
5. the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
6. the Work contains no libel, invasion of privacy, or other unlawful matter.
The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.

Revised 7/16/2018. Revision Description: Removed outdated link.