1 Introduction
1.1 Motivation
1.2 Contribution and research questions
- RQ1: Is users’ self-disclosure behaviour associated with their engagement in cybersecurity discussions? Prior studies of OSNs in general, and of Q&A platforms in particular, have shown correlations between users’ engagement and self-disclosure practices (e.g. Adaji and Vassileva 2016; Kayes et al. 2015; Vargo and Matsubara 2018). Hence, this RQ zooms in on developers’ decisions regarding profile visibility and their participation in discussions about privacy and security. In particular, it investigates whether self-disclosure patterns differ between SO users who actively take part in such discussions and those who do not.
- RQ2: Are privacy-related constructs associated with users’ engagement in cybersecurity discussions? As with RQ1, earlier studies have examined the relation between psychological constructs (e.g. perceived risks and control) and people’s engagement within OSNs (e.g. Staddon et al. 2012; Jozani et al. 2020). This RQ examines whether such correlations also hold in SO with respect to users’ participation in discussions about privacy and security.
2 Background and related work
2.1 Cybersecurity discussions in SO
2.2 Insights from online social networks
3 Methodology
3.1 Data collection
privacy, security, privacy-policy, code-access-security, data-security, network-security, and gdpr-consentform as topic tags. Additionally, r and python were used as language tags given the increasing popularity of these languages within the data science community (Moutidis and Williams 2021). In this way, we sought to narrow the scope of the study mainly to data science practitioners, who are likely to handle sensitive data (e.g. medical records, biometric data, demographics). Furthermore, their cybersecurity practices can have a great impact on automated decision-making systems (e.g. biases, discrimination).
3.1.1 Discussions dataset (D1)
search/advanced endpoint and a tag filter provided by the API itself. By the end of the mining process, a total of 1239 questions/posts were retrieved from SO (Fig. 1).
3.1.2 Profiles dataset (D2)
users/{ids} endpoint). The e-mail addresses of some of these users were also mined using the GitHub (GH) URL available in their profiles (e-mail addresses are never included in SO profile pages). This step was necessary to recruit participants for the online survey afterwards. This complementary mining process was executed using the R package gh, resulting in 457 unique e-mail addresses corresponding to engaged users. This information was included in the profiles dataset \(D_2\) along with the rest of the profile information extracted from SO.
r tag, and 777587 for the python tag. Next, we mined the profile information of a representative sample of each of these two groups, with a 99% confidence level and a 3% margin of error. This information was mined directly from the users/{ids} endpoint, ensuring that the corresponding SO ids were not already part of the engaged group and were not repeated across the two languages. Overall, we obtained 1830 Python users and 1645 R users (3475 in total). These results were merged into the \(D_2\) dataset, using an additional variable to indicate whether each record corresponds to an engaged or an unengaged user. As with the engaged profiles, we collected the e-mail addresses of 413 unengaged users via GH (Fig. 1).
3.2 Data aggregation
3.2.1 Amount of self-disclosure
- We gave each link (website, Twitter and GitHub) a value of 1 if it was filled in the user’s profile, and 0 if not.
- The location variable was coded in the same way as the links (i.e. 1 if it was completed and 0 if not). Since users can obfuscate this field (e.g. by using nicknames or aliases), we conducted a card-sorting analysis to estimate the reliability of this coding scheme. From this analysis, we concluded that location information could be considered accurate if present.
- The variable corresponding to the display name was computed as the proportion of characters used out of the 30 available. As with location, we conducted another card-sorting analysis to assess its reliability. Once again, we concluded that the information present in this field could be considered accurate. Both card-sorting analyses can be found in Appendix A.
- The profile image was retrieved as a URL during the data collection process. To determine whether an image is a custom or a default one, we compared its URL against a collection of Gravatar URLs (Gravatar pictures are frequently used as defaults in SO profiles). Using regular expressions, we assigned a value of 0 to those profile pictures found in the Gravatar database. Otherwise, they were considered custom and given a value of 1.
- The about me field can hold up to 3000 characters and allows HTML formatting. The HTML tags were removed with an R script, and the proportion of used characters was calculated to determine the disclosure value of this field. This approach assumes that the more characters are included, the more personal information is revealed.
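The coding rules listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the R implementation used in the study: the dictionary keys, the Gravatar pattern, the regex-based tag stripping, the treatment of a missing image as 0, and the unweighted mean used as an aggregate are all hypothetical choices that the text does not specify.

```python
import html
import re

DISPLAY_NAME_MAX = 30   # character limit stated in the text
ABOUT_ME_MAX = 3000     # character limit stated in the text

# Assumption: a default avatar is recognised by a Gravatar host in its URL.
GRAVATAR_RE = re.compile(r"^https?://([a-z0-9-]+\.)*gravatar\.com/", re.IGNORECASE)
# Crude HTML tag stripper; sufficient for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def disclosure_scores(profile: dict) -> dict:
    """Return one disclosure value in [0, 1] per profile field."""
    # Binary fields: 1 if filled in, 0 otherwise.
    scores = {
        field: 1.0 if profile.get(field) else 0.0
        for field in ("website", "twitter", "github", "location")
    }
    # Display name: proportion of the 30 available characters that are used.
    name = profile.get("display_name") or ""
    scores["display_name"] = min(len(name), DISPLAY_NAME_MAX) / DISPLAY_NAME_MAX
    # Profile image: 0 for a Gravatar (default) picture, 1 for a custom one.
    # Assumption: a missing image URL is also scored as 0.
    image_url = profile.get("profile_image") or ""
    is_default = not image_url or GRAVATAR_RE.match(image_url) is not None
    scores["profile_image"] = 0.0 if is_default else 1.0
    # About me: strip HTML tags, then take the proportion of 3000 characters.
    about = html.unescape(TAG_RE.sub("", profile.get("about_me") or ""))
    scores["about_me"] = min(len(about), ABOUT_ME_MAX) / ABOUT_ME_MAX
    return scores


def disclosure_percentage(scores: dict) -> float:
    """Hypothetical aggregate: unweighted mean of the field values, in percent."""
    return 100.0 * sum(scores.values()) / len(scores)


# Hypothetical profile used only to exercise the functions.
example_profile = {
    "website": "https://example.org",
    "twitter": "",
    "github": "https://github.com/someone",
    "location": "Berlin",
    "display_name": "Jane Doe",
    "profile_image": "https://www.gravatar.com/avatar/abc?s=256&d=identicon",
    "about_me": "<p>R and Python developer.</p>",
}
scores = disclosure_scores(example_profile)
print(scores)
print(f"{disclosure_percentage(scores):.1f}% (unweighted mean, an assumption)")
```

Keeping the per-field values separate from the aggregate makes it easy to swap in a different weighting without touching the coding rules themselves.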
3.2.2 Engagement in cybersecurity discussions
3.3 Survey structure
3.3.1 Population and sampling
| Demographic | Ranges | Freq. | Resp. (%) |
|---|---|---|---|
| Gender | Female | 1 | 1.56 |
| | Male | 61 | 95.31 |
| | Non-Binary | 1 | 1.56 |
| | Prefer not to say | 1 | 1.56 |
| Educational level | Graduate Degree (MSc, PhD) | 36 | 56.25 |
| | High School or Less | 3 | 4.69 |
| | Some College | 11 | 17.19 |
| | Undergrad Degree (BSc, BA) | 14 | 21.88 |
| Employment status | Currently in School | 1 | 1.56 |
| | Currently in University | 5 | 7.81 |
| | Unemployed, not looking for work | 2 | 3.13 |
| | Unemployed, looking for work | 1 | 1.56 |
| | Working full-time | 49 | 76.56 |
| | Working part-time | 6 | 9.38 |
| Programming experience (R/Python) | <2 years | 2 | 3.13 |
| | 2–5 years | 14 | 21.88 |
| | 5–10 years | 22 | 34.38 |
| | >10 years | 26 | 40.63 |
| Other programming experience | 2–5 years | 3 | 4.69 |
| | 5–10 years | 12 | 18.75 |
| | >10 years | 49 | 76.56 |
3.3.2 Ethical considerations
4 Results
4.1 Privacy and security discussions (SO Q&A data)
Indicator | Frequency | Total (%) |
---|---|---|
Has answers | 825 | 67 |
Has accepted answers | 588 | 47 |
Has score \(>0\) | 719 | 58 |
Has comments | 489 | 39 |
Closed | 94 | 8 |
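As a quick consistency check, the percentages in the table can be recomputed from the frequencies. The sketch assumes that they are expressed relative to the 1239 questions retrieved for the discussions dataset; this denominator is inferred from Sect. 3.1.1 and is not stated in the table itself.

```python
TOTAL_QUESTIONS = 1239  # assumed denominator: questions retrieved for D1

# Frequencies copied from the table above.
frequencies = {
    "Has answers": 825,
    "Has accepted answers": 588,
    "Has score > 0": 719,
    "Has comments": 489,
    "Closed": 94,
}

# Share of all questions, rounded to whole percentage points.
percentages = {
    name: round(100 * count / TOTAL_QUESTIONS)
    for name, count in frequencies.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```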
4.2 Self-disclosure practices (SO profile data)
Comparison | Mean diff. | SE | \(p\) | 95% CI |
---|---|---|---|---|
Unengaged - Proactive | 0.107* | 0.008 | 0.000 | (0.086, 0.127) |
Unengaged - Reactive | \(-0.020\)* | 0.006 | 0.002 | (\(-0.034\), \(-0.006\)) |
Proactive - Unengaged | \(-0.107\)* | 0.009 | 0.000 | (\(-0.127\), \(-0.086\)) |
Proactive - Reactive | \(-0.127\)* | 0.008 | 0.000 | (\(-0.148\), \(-0.106\)) |
Reactive - Unengaged | 0.020* | 0.006 | 0.002 | (0.006, 0.034) |
Reactive - Proactive | 0.127* | 0.009 | 0.000 | (0.106, 0.148) |
| Group | | B | SE | Sig. | Exp(B) |
|---|---|---|---|---|---|
| Proactive | Intercept | \(-0.991\) | 0.062 | 0.000 | |
| | % self-disclosure | \(-0.023\) | 0.002 | 0.000 | 0.978 |
| Reactive | Intercept | \(-0.313\) | 0.043 | 0.000 | |
| | % self-disclosure | 0.004 | 0.001 | 0.001 | 1.004 |
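The Exp(B) column is the exponential of the coefficient B, that is, the factor by which the odds of belonging to a group change for each additional percentage point of self-disclosure. The sketch below recomputes it from the rounded coefficients in the table. It assumes that the odds are expressed relative to a reference category (presumably the unengaged group, which the table does not state), and small deviations from the reported values are expected because B is given to three decimals only.

```python
import math


def odds_ratio(b: float) -> float:
    """Exp(B): multiplicative change in the odds per one-unit increase."""
    return math.exp(b)


def percent_change_in_odds(b: float) -> float:
    """Same effect expressed as a percentage change in the odds."""
    return 100.0 * (math.exp(b) - 1.0)


# Coefficients of % self-disclosure copied from the table above.
coefficients = {"Proactive": -0.023, "Reactive": 0.004}

for group, b in coefficients.items():
    print(
        f"{group}: Exp(B) = {odds_ratio(b):.3f}, "
        f"change in odds per point = {percent_change_in_odds(b):+.2f}%"
    )
```

Read this way, a negative B for the proactive group means that higher self-disclosure goes with lower odds of being proactive, while the positive B for the reactive group points in the opposite direction.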
4.3 Privacy-related constructs (survey data)
Variable | SS | d.f. | MS | F | \(p\) | \(\eta^2\) |
---|---|---|---|---|---|---|
Profile data | | | | | | |
% self-disclosure | 9.340 | 2 | 4.670 | 86.180 | 0.000 | 0.024 |
Survey data | | | | | | |
GPC | 0.825 | 2 | 0.412 | 0.333 | 0.718 | 0.011 |
PCST | 0.400 | 2 | 0.200 | 0.141 | 0.869 | 0.005 |
PCOT | 3.860 | 2 | 1.930 | 1.127 | 0.331 | 0.036 |
RSK | 0.547 | 2 | 0.274 | 0.384 | 0.682 | 0.012 |
PC | 2.174 | 2 | 1.087 | 0.854 | 0.431 | 0.027 |
SD | 3.082 | 2 | 1.541 | 1.040 | 0.360 | 0.033 |
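The effect sizes of the survey constructs can be cross-checked against the reported F statistics. For a one-way ANOVA, \(\eta^2 = SS_{between}/(SS_{between} + SS_{within})\), which is equivalent to \(F \cdot df_{between}/(F \cdot df_{between} + df_{within})\). The sketch below assumes \(df_{within} = 61\), i.e. 64 survey respondents minus three engagement groups; this value is inferred from the demographics table and is not reported alongside the ANOVA results.

```python
def eta_squared(f_value: float, df_between: int, df_within: int) -> float:
    """Eta squared of a one-way ANOVA, recovered from F and the degrees of freedom."""
    effect = f_value * df_between
    return effect / (effect + df_within)


DF_BETWEEN = 2        # d.f. column of the table (three groups)
DF_WITHIN = 64 - 3    # assumption: 64 respondents, three engagement groups

# construct -> (reported F, reported eta squared), copied from the table above
survey_rows = {
    "GPC": (0.333, 0.011),
    "PCST": (0.141, 0.005),
    "PCOT": (1.127, 0.036),
    "RSK": (0.384, 0.012),
    "PC": (0.854, 0.027),
    "SD": (1.040, 0.033),
}

recomputed = {
    name: round(eta_squared(f_value, DF_BETWEEN, DF_WITHIN), 3)
    for name, (f_value, _reported) in survey_rows.items()
}

for name, (_f_value, reported) in survey_rows.items():
    print(f"{name}: reported {reported:.3f}, recomputed {recomputed[name]:.3f}")
```

The profile-data row is left out on purpose, since the number of profiles entering that analysis, and hence its within-group degrees of freedom, is not given in the table.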