International Journal of Medical Informatics 146 (2021) 104350
Contents lists available at ScienceDirect
International Journal of Medical Informatics journal homepage: www.elsevier.com/locate/ijmedinf
Identifying current Juul users among emerging adults through Twitter feeds Tung Tran a, Melinda J. Ickes b, Jakob W. Hester b, Ramakanth Kavuluru a, c, * a
Department of Computer Science University of Kentucky, Lexington, USA Department of Kinesiology and Health Promotion University of Kentucky, Lexington, USA c Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, USA b
A R T I C L E I N F O
A B S T R A C T
Keywords: e-cigarettes Juul Tobacco prevention Machine learning
Introduction: Juul is the most popular electronic cigarette on the market. Amid concerns around uptake of ecigarettes by never smokers, can we detect whether someone uses Juul based on their social media activities? This is the central premise of the effort reported in this paper. Several recent social media-related studies on Juul use tend to focus on the characterization of Juul-related messages on social media. In this study, we assess the potential in using machine learning methods to automatically identify Juul users (past 30-day usage) based on their Twitter data. Methods: We obtained a collection of 588 instances, for training and testing, of Juul use patterns (along with associated Twitter handles) via survey responses of college students. With this data, we built and tested super vised machine learning models based on linear and deep learning algorithms with textual, social network (friends and followers), and other hand-crafted features. Results: The linear model with textual and follower network features performed best with a precision-recall tradeoff such that precision (PPV) is 57 % at 24 % recall (sensitivity). Hence, at least every other college-attending Twitter user flagged by our model is expected to be a Juul user. Additionally, our results indicate that social network features tend to have a large impact (positive) on classification performance. Conclusion: There are enough latent signals from social feeds for supervised modeling of Juul use, even with limited training data, implying that such models are highly beneficial to very focused intervention campaigns. This initial success indicates potential for more involved automated surveillance of Juul use based on social media data, including Juul usage patterns, nicotine dependence, and risk awareness.
1. Introduction E-cigarettes, particularly pod-based devices like Juul, remain popu lar among youth and emerging adults . One brand, Juul, encom passed 64 % of the market share in October 2019, per Nielsen all channel data for tobacco sales . While marketing campaigns claim they are not specifically targeted at minors, Juul use soared among these pop ulations, prompting public health concerns and subsequent ongoing investigations by the U.S. Food and Drug Administration [3–5]. The popularity of Juul use among young people is partly facilitated by its ease of use or concealment, the effect that it gives users, and use by friends [6,7]. With Juul’s rapid expansion to international markets , and growth of related products like disposable e-cigarettes , new generations of youth across the world may be at risk of dependence on nicotine products and various other negative health outcomes such as hindered brain development and long-term lung damage .
Furthermore, during the current times of the COVID-19 pandemic, evi dence shows COVID-19 diagnosis is five times more likely among e-cigarette users . Hence, it is critical to empower public health agencies and other prevention researchers with strategies for reaching youth who are currently using the product, to provide relevant infor mation, as well as cessation messaging and support resources. One such way could be through social media given approximately 90 % of young adults use social media and over 80 % own a smartphone . There is also evidence to suggest e-cigarette manufacturers target young people through advertising on social media, which has been linked to susceptibility of initiation . Several recent studies explore and characterize social media posts related to Juul use and dependence [13–16]; however, to our knowledge, none are designed for the purpose of identifying Juul usage by consumers based on their social media feeds. In this first of its kind study, using self-reported survey answers as
* Corresponding author at: 230E MDS Building, 725 Rose St. Lexington, KY 40508, USA. E-mail address: [email protected]
(R. Kavuluru). https://doi.org/10.1016/j.ijmedinf.2020.104350 Received 1 July 2020; Received in revised form 12 November 2020; Accepted 20 November 2020 Available online 10 December 2020 1386-5056/© 2020 Elsevier B.V. All rights reserved.
T. Tran et al.
International Journal of Medical Informatics 146 (2021) 104350
ground truth, we explore the potential for using publicly available his torical Twitter data, including user-specific Twitter messages (tweets) and friend and follower networks, to determine an individual’s past 30day use of Juul. Our main intuition behind this study is based on past studies that show social media participation and associated influence can be drivers of beliefs and habits of tobacco product usage . Additionally, our prior work also indicates that having friends who are also users of Juul is one of the reasons respondents cited as their moti vation to try the product . This effort leverages ‘friends’ on social media as a variable to assess the ability to identify current use. Addi tionally, Unger et al.  also show that explicitly talking positively about tobacco products correlates with using them in the real world. Our effort combines these ideas and uses both user generated content and follower/friend connections to identify Juul users. Unlike Unger et al.’s effort, we do not limit our focus to positive mentions of specific pre-selected tobacco terms and simply let supervised models elicit signal from tweets. We contend that models that are able to automatically distinguish between users and non-users of Juul in a quick and inexpensive manner will benefit the development and reach of tailored prevention cam paigns on social media to combat pro-use information young people are currently exposed to. Twitter specifically allows such campaigns by advertising prevention messages, clinical trial recruitment calls, or additional surveys. This way, the focus is on specific likely users of the product as opposed to traditional surveys, which target a more general audience where there may not be many users. Our data in this project pertains to University of Kentucky students and this was only possible because we had access to incoming students’ emails through official channels of the organization. This may not be practical, if we want to target other segments of population across the nation. There is ongoing research that determines the age group or life stage of a user on Twitter . Our approach (specifically the model we learn to identify current users) could be used in combination with such methods to reach a wider and more diverse group of potential users across the nation.
2.2. Evaluation procedure We contend that precision-focused metrics are reasonable for assessing model performance in terms of practical utility. This is espe cially true in the context of highly focused interventions where it is more important to ensure resources are prioritized for people who are highly likely to be Juul users. While coverage (sensitivity) is important, there is no interventional expense associated with false negative errors as we do not need to pay for the missed users. Thus, we evaluated the proposed models with precision-focused metrics. One choice of metrics is Fβ with β = 14, which is the harmonic mean between precision and recall with a greater emphasis on precision. Another metric used was average precision at n , a metric commonly used in information retrieval tasks for search engines. As commonly applied, we selected n = 10 and normal ized by n. Intuitively, suppose we ranked all test examples by predicted likelihood of being a Juul user; average precision at 10 ([email protected]
) mea sures the quality of results in the top 10 considering the number of relevant results (i.e., positive for Juul use) and their placement in the ranking. We note that Fβ is more interpretable in terms of precision-recall trade-off, while [email protected]
is useful for fine-grained com parisons. We performed our experiments using a stratified 5-fold cross-validation setup repeated six times. On each rotation, we reserved one fold for testing, one fold for validation, and the remaining (three) folds for training. Overall model performance is assessed based on the average of the aforementioned metrics over the thirty fold-wise evaluations. 2.3. Machine learning models 2.3.1. Traditional linear model The well-known logistic regression algorithm with bag-of-word fea tures served as our linear model. We concatenated all tweets and encode each n -gram (for n = 2, …, 6) as a 0/1 binary feature (ignoring fre quency). We ignored n -grams with a global frequency less than three. These settings were chosen based on preliminary experiments. During training, we optimized the regularization hyperparameter c and threshold t, by performing grid search over the held-out validation fold, such that Fβ is maximized. In a typical binary classification model, t = 0.5 and examples with p(x) > t are labeled positive; however, varying t allows us to obtain a desirable trade-off between precision and recall as determined by the Fβ metric. Prior to testing, we trained on both the training and validation set using the optimized c value and predicted as positive only test examples where p(x) > t.
2. Materials and methods 2.1. Data collection The dataset is derived from a longitudinal study on Juul use among incoming college students (an important emerging adult subgroup) at the University of Kentucky conducted between August 16, 2018 and April 30, 2019. It has been approved by the University of Kentucky IRB (protocol ID: 45840). Students were invited to participate in an online survey and asked various questions regarding familiarity with Juul, patterns of Juul use, dependence, poly-use with other tobacco products, reasons for Juul use, and general demographic information over three time points (T1, T2, and T3) spaced approximately three to five months apart. The survey was adapted from the National Youth Tobacco Survey to target specific tobacco products or brands was earlier used by our team to characterize prevalence and reasons for Juul use among incoming freshmen . During T1, 555 participants optionally provided their Twitter handle (i.e., username or identification). Of these, 214 participated again at T2 and 170 again participated at T3. Hence, we have 939 total discrete survey records that are each associated with a Twitter handle. As tweets are an essential predictive feature, we keep only records associated with at least one tweet in the 90-day window immediately prior to the day the survey response was received. The final dataset consisted of 588 records. We observed answers to the survey question on whether the participant has used Juul in the past thirty days as a discrete binary Yes/No answer in the ground truth. Of the 588 in stances in the dataset, 151 (25.6 %) reported current Juul use and were categorized as Juul users.
2.3.2. Deep learning model Our deep learning model is based on an LSTM-based recurrent neural network  over the list of tweets in chronological order. Each tweet is first encoded as a 100-dimensional vector using a standard convolu tional neural network  of window sizes 2, 3, and 4 with 50 filters each that convolves over randomly initialized word vectors of size 200. Our preliminary experiments showed that these settings were optimal, and inclusion of pretrained embeddings resulted in worse model per formance. Once tweets are processed by the recurrent network, the output state at the last recurrent unit is fully connected to a softmax output layer for identifying Juul use. The model is trained using binary cross-entropy loss and optimized with stochastic gradient descent. We first train on the training set for ten epochs; using the trained model, we optimize t on the validation set. Next, we train the model on both training and validation set for 30 epochs, with simple weight-averaging of checkpoints along the last 10 epochs as a form of regularization. 2.4. Feature extraction 2.4.1. Textual features For both linear and deep models, we used the NLTK Twitter-aware 2
T. Tran et al.
International Journal of Medical Informatics 146 (2021) 104350
tokenizer , designed for tokenizing tweets, that takes into consid eration domain-specific tokens such as smileys (stylized faces con structed using punctuation symbols) and hashtags (user-generated metadata tag). All tweets posted by the individual within a 90-day window immediately prior to survey response were supplied as input to the model. Both linear and deep learning models effectively treated input text as n-gram based features. In light of recent advancements in deep contextual embeddings (e.g., ELMo  and BERT ), and their revolutionary impact on many important NLP tasks, we additionally include experiments wherein textual features are instead replaced with their corresponding pre-trained BERT embeddings. We specifically use the BERT-Large-Cased variant of pre-trained BERT models, made pub licly available by Google Research (https://github.com/google-research /bert), to generate vector embeddings for each tweet in the dataset – internally, these vector embeddings correspond to the [CLS] token produced by the BERT model.
Table 1 Our main results based on average performances from 5-fold cross-validation repeated six times. The ‘ + ’ symbol indicates the addition of a feature to the model with just the text features.
2.4.2. User features We included features specific to the Twitter account including the age of the account, number of tweets, whether the user is verified, number of followers, number of friends, and number of lists the user is assigned. For both linear and deep learning models, these features were encoded as vectors and are concatenated to the n-gram feature vector immediately prior to the softmax layer. 2.4.3. Tweet features We additionally included features specific to a tweet which include the age of the tweet, whether the tweet is a reply, a retweet, or a quote, and the number of times the tweet has been retweeted. As the linear model does not encode tweets individually, but rather as a singular “blob” of text, tweet features were only included in the deep learning model.
Random Uniform Random Stratified
Positive if “Juul” is in tweet text
Linear Model (Only text features) + User features
+ Friends network
+ Followers network
Linear Model (All features)
Linear Model (All features + BERT)
Deep Model (Only text features)
Deep Model (All features)
Deep Model (All features + BERT)
low recall . In contrast, in rows 6 and 7 of the table, we see precision close to the simple string matching-based method but with substantially higher recall. Thus, we conclude that several Juul users may not be tweeting about Juul, but machine-learned models are a viable alterna tive for handling these cases. Between linear and deep models where pretrained BERT embeddings are not utilized, the best linear model (with only text and friends/fol lowers network features) performed at almost ten Fβ and seven [email protected]
points above the best deep neural model. Deep learning models seemed to overfit when trained on relatively small datasets and these results likely indicate that simpler models may be more suitable given the limited amount of training data. However, our results show that these issues can be mitigated with pre-trained BERT embeddings, which is a well-known benefit of transfer learning. Features derived from the friends and followers network had the most substantial impact on per formance such that we observe almost a 10-point improvement in Fβ from simply adding either types of network feature. Follower features were slightly more beneficial than friend features; inclusion of these isolated features resulted in improvements from 34.88 % (text-only features) to 46.25 % (text and follower features) and 43.21 % (text and friend features) on [email protected]
Including more hand-crafted features generally improved recall without adversely affecting precision, and this is consistent for both linear and deep learning models. When evaluating holistically for a balance of both Fβ and [email protected]
, the linear model with only follower-based features performed best at 50.33 % Fβ and 46.25 % [email protected]
A linear model that has all features proposed in the methods section lead to a recall of 32.65 % with a precision of 52.48 %, indicating the features are acting complementarily to improve recall with a com parable loss in precision. Given our precision focus, we believe this full model may not be worth exploring further. However, this model may be of utility for heavily funded intervention scenarios where the primary goal is to achieve better coverage at the expense of many false positives. For interpretability, we additionally included the [email protected]
measure, which represents the precision rate evaluated on the top 10 instances when ranked by likelihood of a tweeter being a Juul user. Based on the main results, we achieved approximately 60 % precision for the top 10 out of 118 examples (average number of examples per test fold). To illustrate a practical use case for the proposed model, we offer the following hypothetical scenario. Suppose there are 2000 individuals in the demographic for an intervention campaign — namely, first year
2.4.4. Friend and follower network Lastly, we included social network-based features based on friends and followers associated with the account. ‘Followers’ are a list of users that are subscribed to one’s tweets, while ‘friends’ are a list of users to which one is subscribed. Intuitively, information about a user may be informed by the list of people that they are following. We performed feature extraction by first encoding these networks as adjacency matrices. The entire ‘friends’ network (as a subnetwork of all twitter users) can be encoded as an 85,718 × 85,718 square adjacency matrix of ones and zeros where a one at the ith row and jth column indicates that the user represented by index j is a “friend” of the user represented by the index i. We filter along the rows dimension so that only the 588 users in our dataset are represented to obtain a matrix of size 588 × 85718. To overcome issues related to sparse representations, we apply singularvalue decomposition (SVD) as a form of dimensionality reduction to reduce the feature dimension to a manageable size (specifically, 100) . Thus, the final feature matrix representing the ‘friends’ network is of size 588 × 100. We perform the same procedure to obtain the feature matrix for the follower network; here the original adjacency matrix was instead 98520 × 98520. Like user features, social network features were included via simple concatenation. 3. Results and discussion Main results are presented in Table 1. Both linear and deep models performed well above random baselines (first two rows) in terms of Fβ . To facilitate comparison with a simple approach, in the third row we show the performance if we classify all tweeters whose tweets contain “Juul” as a substring as Juul users. Although precision is better than that for most other methods, recall is in single digits and lower compared to all other methods. Precision is more important, but this approach will drastically reduce the coverage we may be able to obtain. This is not surprising because rule-based methods well known to suffer from very 3
T. Tran et al.
International Journal of Medical Informatics 146 (2021) 104350
students of a typical American university that regularly use social media. 10 × 2000) will By simple extrapolation, approximately 169 students (118 be targeted in the campaign. With 60 % precision, 101 of those targeted will be actual Juul users (169 × 0.6). As Juul use occurs with a preva lence of 25 % according to our training data, a crude estimate of 500 students using Juul per 2000 would conclude that our model has recall of about 20 % (101/500), which is consistent with our main results.
 B. Herzog, P. Kanada, Nielsen all channel data thru 10/5. Wells Fargo Securities, 2019.  M. Richtel, S. Kaplan, Did Juul Lure Teenagers and Get ‘Customers for Life’?, 2019, 27 August 2018. [Online]. Available: https://www.nytimes.com/2018/08/27/scie nce/juul-vaping-teen-marketing.html. [Accessed 9 September 2019].  M. Richtel, S. Kaplan, Juul Illegally Marketed E-Cigarettes, F.D.A. Says, 2019, 9 September 2019. [Online]. Available: https://www.nytimes.com/2019/09/09/h ealth/vaping-juul-e-cigarettes-fda.html. [Accessed 9 September 2019].  U.S. Food, Drug Administration, Statement From FDA Commissioner Scott Gottlieb, M.D., on New Enforcement Actions and a Youth Tobacco Prevention Planto Stop Youth Use of, and Access to, JUUL and Other e-cigarettes, 23 April 2018. [Online]. Available: https://www.fda.gov/NewsEvents/Newsroom/PressAnnouncement s/ucm605432.htm.  M. Ickes, J.W. Hester, A.T. Wiggins, M.K. Rayens, E.J. Hahn, R. Kavuluru, Prevalence and reasons for Juul use among college students, J. Am. Coll. Health 68 (5) (2020) 455–459.  E.L. Leavens, E.M. Stevens, E.I. Brett, E.T. Hebert, A.C. Villanti, J.L. Pearson, T. L. Wagener, JUUL electronic cigarette use patterns, other tobacco product use, and reasons for use among ever users: results from a convenience sample, Addict. Behav. 95 (2019) 178–183.  D. Madlen, J. Kasperkevic, M. Chapman, Juul Spreads Over the World As Home Market Collapses in Scandal, 2019. November 2019. [Online]. Available: https:// www.thebureauinvestigates.com/stories/2019-11-21/juul-spreads-over-theworld-as-home-market-collapses-in-scandal. [Accessed October 2020].  V.H. Murthy, E-cigarette use among youth and young adults: a major public health concern, JAMA Pediatr. 171 (3) (2017) 209–210.  S.M. Gaiha, J. Cheng, B. Halpern-Felsher, Association between youth smoking, electronic cigarette use, and COVID-19, J. Adolesc. Health 67 (4) (2020) 519–523.  A. Perrin, Social Media Usage: 2005–2015, Pew Research Center, 2015 [Online]. Available: https://www.pewresearch.org/internet/2015/10/08/social-networki ng-usage-2005-2015/. [Accessed 2020].  Z.B. Massey, L.O. Brockenberry, P.T. Harrell, Vaping, smartphones, and social media use among young adults: snapchat is the platform of choice for young adult vapers, Addict. Behav. 112 (2020) 106576.  R. Kavuluru, S. Han, E.J. Hahn, On the popularity of the USB flash drive-shaped electronic cigarette Juul, Tob. Control 28 (1) (2019) 110–112.  J.E. Sidani, J.B. Colditz, E.L. Barrett, A. Shensa, K.-H. Chu, A.E. James, B. A. Primack, I wake up and hit the JUUL: analyzing twitter for JUUL nicotine effects and dependence, Drug Alcohol Depend. 204 (2019) 107500.  J.-P. Allem, L. Dharmapuri, J.B. Unger, T.B. Cruz, Characterizing JUUL-related posts on twitter, Drug Alcohol Depend. 190 (2019) 1–5.  L. Czaplicki, G. Kostygina, Y. Kim, S.N. Perks, G. Szczypka, S.L. Emery, D. Vallone, E.C. Hair, Characterising JUUL-related posts on Instagram, Tob. Control (2019) 1–6.  W. Yoo, J. Yang, E. Cho, How social media influence college students’ smoking attitudes and intentions, Comput. Human Behav. 64 (2016) 173–182.  J.B. Unger, R. Urman, T.B. Cruz, A. Majmundar, J. Barrington-Trimis, M.A. Pentz, R. McConnell, Talking about tobacco on Twitter is associated with tobacco product use, Prev. Med. 114 (2018) 54–56.  J. Zhang, X. Hu, Y. Zhang, H. Liu, Your age is no secret: inferring microbloggers’ ages via content and interaction analysis. International Conference on Web and Social Media, 2016.  N. Craswell, S. Robertson, Average precision at n. Encyclopedia of Database Systems, Springer US, Boston, MA, 2009, pp. 193–194.  F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM, Neural Comput. 12 (10) (2000) 2451–2471.  Y. Kim, Convolutional neural networks for sentence classification, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.  NLTK Project, NLTK 3.4.5 Documentation, 2019, 20 August 2019. [Online]. Available: https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize. casual. [Accessed 12 September 2019].  M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, Conference of the North American Chapter of the Association for Computational Linguistics (2018).  J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, Conference of the North American Chapter of the Association for Computational Linguistics (2019).  N. Halko, P.-G. Martinsson, J.A. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 52 (2) (2011) 217–288.  L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, B. Liu, Combining lexicon-based and learning-based methods for Twitter sentiment analysis, in: HP Laboratories, Technical Report HPL-2011, vol. 89, 2011.
4. Conclusion and future work In this study, we compared linear and deep learning models for automatically identifying whether an individual is positive or negative for current Juul use (past 30-day use) based on their Twitter data, including tweets and social network features. Results indicate that linear models outperform deep learning models, and that inclusion of nontextual features, such as the network of friends and followers, is important for maximizing model performance. It is important to note that, given limited training data, it is not conclusive that linear models will always outperform deep learning models for this task. Future work will focus on curating a larger dataset; with an abundance of training data, we expect a substantial leap in performance for both linear and (especially) deep models. Other avenues of research include modeling of more detailed usage patterns, nicotine dependence, and risk awareness. In terms of surveillance implications, our models can help public health campaigns target cessation and tobacco treatment messages, launch additional electronic surveys (as targeted advertisements), or advertise clinical trial recruitment calls to potential users. Given not all Juul users are on Twitter (or other social media), we emphasize this is only one component of a potentially multi-pronged approach to reach the user space. However, this may be relatively inexpensive and can capture a segment of users that tend to actively use social media . As such, we believe this line of work may benefit future exploration by other mem bers of the scientific community who work at the intersection of public health, informatics, and computer science. Funding source Research reported in this publication was supported by the National Cancer Institute of the U.S. National Institutes of Health under Award Number R21CA218231. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Na tional Institutes of Health. CRediT authorship contribution statement Tung Tran: Methodology, Software, Investigation, Writing - original draft. Melinda J. Ickes: Resources, Data curation, Writing - review & editing. Jakob W. Hester: Resources, Data curation, Writing - review & editing. Ramakanth Kavuluru: Conceptualization, Methodology, Writing - review & editing, Supervision. Declaration of Competing Interest The authors report no declarations of interest. References  T.W. Wang, L.J. Neff, E. Park-Lee, C. Ren, K.A. Cullen, B.A. King, E-cigarette use among middle and high school students—United States, 2020, Morbidity Mortality Weekly Rep. 69 (37) (2020) 1310–1312.