Do first encounters make or break new users?

Using text features in the first comment to predict new user return on Reddit

Keywords:

new user attrition, churn, Reddit, text analysis, sentiment analysis, online comments, social media, social feedback

Abstract

Many new users quit a site after only one interaction. Existing studies of user return consider user characteristics and simple feedback such as upvotes, but leave potentially useful text data unstudied. Here, we analyze 700,000 first-post/sole-comment pairs on Reddit to determine whether the content of the comment is related to the probability that the new user returns. Using two complementary text analysis techniques, text regression (Concise Comparative Summarization, CCS) and Linguistic Inquiry and Word Count (LIWC), we demonstrate that information from the first comment a new user receives improves predictions of new user return. Our work shows that useful predictive features can be extracted from very short text comments, and it illustrates the importance of social feedback to the experiences of new users.

Author Biography

Emma Mary Klugman, Harvard University

Emma Klugman is a doctoral student in Education and Data Science at Harvard University.

Published

2023-05-31