Building an Indonesian-Language Word Topic Dataset from Twitter Using TF-IDF & Cosine Similarity

Kristian Adi Nugraha, Danny Sebastian


Social media is evidently more popular than any other kind of web application. Indonesians spend an average of 3 hours and 15 minutes on social media every day, which generates a substantial flow of information. Although research on information retrieval from social media data is common, only a small share of it focuses on social media content in the Indonesian language. Our research aims to build an Indonesian-language topic dataset from Twitter data. The methods used are TF-IDF to form the data and cosine similarity to group the Tweets. In our tests, the system produced fairly accurate results, with a best accuracy of 64% for every 200 Tweets processed.
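The combination of TF-IDF weighting and cosine similarity named above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the `tf * log(N / df)` weighting variant, the helper names, and the four sample Tweets are all assumptions made for illustration, since the abstract does not give the preprocessing steps, the exact formula, or the grouping threshold.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace. A real
    # pipeline would likely also remove URLs, mentions, and stop words.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common TF-IDF variant, assumed here for illustration.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample Tweets (not taken from the paper's data).
tweets = [
    "harga cabai naik di pasar",
    "harga beras naik lagi",
    "timnas menang lawan malaysia",
    "timnas kalah di final",
]
vectors = tfidf_vectors(tweets)

# The two price-related Tweets share weighted terms ("harga", "naik").
sim_same = cosine_similarity(vectors[0], vectors[1])
# The first and third Tweets share no terms at all.
sim_diff = cosine_similarity(vectors[0], vectors[2])
```

In this sketch, Tweets that share informative terms score above zero while Tweets with no terms in common score exactly zero, so a similarity threshold could be used to decide which Tweets belong to the same topic group.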



Copyright (c) 2018 Jurnal Teknik Informatika dan Sistem Informasi