A Review of Text Mining Approaches and Their Performance in Topic Discovery and Extraction

Authors
University of Isfahan
Abstract
Background and aim: This study examines four text mining methods, focusing on understanding and identifying their characteristics and limitations in topic discovery. The four methods are (1) latent semantic analysis (LSA), (2) probabilistic latent semantic analysis (PLSA), (3) latent Dirichlet allocation (LDA), and (4) correlated topic modeling (CTM).
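For orientation, the four methods can be summarized by their standard textbook formulations. This is a reference sketch, not a derivation taken from the article; the notation ($X$ for the term-document matrix, $k$ for the number of retained dimensions, $\theta_d$ for the topic proportions of document $d$, $\beta$ for the topic-word distributions, and $\eta_d$, $\mu$, $\Sigma$ for the CTM parameters) is assumed here.

```latex
% LSA: rank-k truncated singular value decomposition of the term-document matrix
X \approx U_k \Sigma_k V_k^{\top}

% PLSA: each word in document d is drawn from a mixture over latent topics z
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)

% LDA: per-document topic proportions are drawn from a Dirichlet prior
\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
z_{dn} \sim \mathrm{Multinomial}(\theta_d), \quad
w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})

% CTM: the Dirichlet prior is replaced by a logistic normal,
% so the covariance matrix \Sigma can encode correlations between topics
\eta_d \sim \mathcal{N}(\mu, \Sigma), \quad
\theta_{dk} = \frac{\exp(\eta_{dk})}{\sum_{j} \exp(\eta_{dj})}
```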

Methodology: This is a library-based study that reviews and analyzes the literature on text mining and topic modeling.

Findings: LSA can be used to identify specific, distinctive topics in documents that address only a single topic. The other three text mining methods focus on the topics and overall orientation of the text. PLSA is also applicable to documents that address a single topic but, unlike LSA, it is used to discover the general topics and themes of the text. LDA, by contrast, is better suited to documents that address several topics. CTM can be used to identify relationships between different topic categories.
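The contrast between multi-topic documents (LDA) and related topic categories (CTM) can be illustrated with a minimal generative sketch. It assumes the standard formulations of the two models: LDA draws a document's topic proportions from a Dirichlet distribution, while CTM draws them from a logistic normal whose covariance lets topics co-occur. The vocabulary, topic-word probabilities, Cholesky factor, and all function names below are hypothetical and chosen only for illustration; nothing here is taken from the article's data.

```python
import math
import random

# Toy vocabulary and topic-word distributions; all values are illustrative assumptions.
VOCAB = ["library", "catalog", "index", "gene", "protein", "cell"]
TOPICS = [
    [0.50, 0.30, 0.20, 0.00, 0.00, 0.00],  # topic 0: cataloguing
    [0.20, 0.20, 0.60, 0.00, 0.00, 0.00],  # topic 1: indexing
    [0.00, 0.00, 0.00, 0.40, 0.30, 0.30],  # topic 2: biology
]

# Hypothetical lower-triangular Cholesky factor of the CTM covariance matrix.
# Topics 0 and 1 have covariance 0.9, so they tend to rise and fall together.
CHOL = [
    [1.0, 0.0, 0.0],
    [0.9, 0.436, 0.0],
    [0.0, 0.0, 1.0],
]


def sample_dirichlet(alpha, rng):
    """LDA-style topic proportions: normalize independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_logistic_normal(mean, chol, rng):
    """CTM-style topic proportions: softmax of a correlated Gaussian draw."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [
        m + sum(chol[i][j] * z[j] for j in range(i + 1))
        for i, m in enumerate(mean)
    ]
    peak = max(eta)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def generate_document(theta, topics, length, rng):
    """Pick a topic for each word from theta, then a word id from that topic."""
    topic_ids = list(range(len(theta)))
    word_ids = list(range(len(topics[0])))
    words = []
    for _ in range(length):
        k = rng.choices(topic_ids, weights=theta)[0]
        words.append(rng.choices(word_ids, weights=topics[k])[0])
    return words


rng = random.Random(0)
lda_theta = sample_dirichlet([0.5, 0.5, 0.5], rng)
ctm_theta = sample_logistic_normal([0.0, 0.0, 0.0], CHOL, rng)
doc = generate_document(lda_theta, TOPICS, 12, rng)
print("LDA topic mixture:", [round(p, 3) for p in lda_theta])
print("CTM topic mixture:", [round(p, 3) for p in ctm_theta])
print("Generated words:", [VOCAB[i] for i in doc])
```

In both cases a single document mixes several topics. The only difference is the prior on the mixture: the Dirichlet has no parameter for expressing that two topics tend to appear together, whereas the off-diagonal entry in `CHOL` gives the logistic normal exactly that.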

Conclusion: Because they draw on semantic analysis, text mining approaches are well suited to discovering and extracting the topics of texts.

Article title (English)

A review of text mining approaches and their function in discovering and extracting a topic

Authors (English)

Ali Mansouri
Fatemeh Zarmehr
Hossein Karshenas
Isfahan University
Abstract (English)

Background and aim: This study examines four text mining methods, focusing on understanding and identifying their properties and limitations in topic discovery.

Methodology: The study is an analytical review of the literature on text mining and topic modeling.

Findings: LSA can be used to identify specific and unique topics in documents that address only a single topic. The other three text mining methods focus on the topics and general orientation of the text. PLSA is likewise applicable to documents dealing with a single topic but, unlike LSA, it is used to discover general themes and contexts. LDA, however, is more applicable to documents that address several topics. The CTM method can be used to identify relationships between different subject categories.

Conclusion: Because they employ semantic analysis, text mining approaches are suitable for discovering and extracting the topics of texts.

Keywords (English)

Text mining
Topic Modeling
Semantic Analysis
Topic Discovery