GitHub and more: sharing data & code


A recent Nature News article, ‘Democratic databases: Science on GitHub’, discussed GitHub and other programs used for sharing code and data. As a measure of GitHub’s popularity, Nature News looked at citations of GitHub repositories in research papers from various disciplines (source: Scopus). The article also mentioned Bitbucket, Figshare and Zenodo as alternative tools for data and code sharing, but did not analyze their ‘market share’ in the same way.

Our survey on scholarly communication tools included a question about tools used for archiving and sharing data & code, with GitHub, Figshare, Zenodo and Bitbucket among the preselected answer options (Figure 1). Our results can therefore provide another measure of how much these online platforms are used for sharing data and code.

Figure 1 – Survey question on archiving and sharing data & code

Open Science – in word or deed

Perhaps the most striking result is that of the 14,896 researchers among our 20,663 respondents (counting PhD students, postdocs and faculty), only 4,358 (29.3%) reported using any tools for archiving/sharing data. By contrast, of the 13,872 researchers who answered the question ‘Do you support the goals of Open Science’ (defined in the survey as ‘openly creating, sharing and assessing research, wherever viable’), 80.0% said ‘yes’. Clearly, for open science, support in theory and adoption in practice are still quite far apart, at least as far as sharing data is concerned.
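As a quick check of the arithmetic behind the 29.3% figure, here is a minimal Python sketch. The helper name `share` is purely illustrative and is not part of the survey's own analysis. Only the counts come from the text above.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)

researchers = 14_896   # PhD students, postdocs and faculty among respondents
sharing = 4_358        # researchers reporting any tool for archiving/sharing data

print(f"{share(sharing, researchers)}% of researchers archive or share data")
```

The 80.0% figure for Open Science support uses a different denominator (the 13,872 researchers who answered that question), so the two percentages describe overlapping but not identical groups.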

Figure 2 – Support for Open Science among researchers in our survey

Among those researchers who do archive and share data, GitHub is indeed the most often used tool, but just as many people indicate using ‘others’ (i.e. tools not included among the preselected options). Figshare comes in third, followed by Bitbucket, Dryad, Dataverse, Zenodo and Pangaea (Figure 3).

Figure 3 – Survey results: tools used for archiving and sharing data & code

Among ‘others’, the most often mentioned tool was Dropbox (mentioned by 496 researchers), with other tools trailing far behind. Unfortunately, the survey setup rules out a direct comparison between the number of responses for preset tools and for tools mentioned as ‘others’ (see: Data are out. Start analyzing. But beware). Thus, we cannot say whether Dropbox is used more or less than GitHub, for example, only that it is the most often mentioned ‘other’ tool.

Disciplinary differences

As mentioned above, 29.3% of researchers in our survey reported archiving and sharing code/data. Are there disciplinary differences in this percentage? We explored this earlier in our post ‘The number games’. We found that researchers in engineering & technology are the most inclined to archive/share data or code, followed by those in physical and life sciences. Medicine, social sciences and humanities lag behind at more or less comparable levels (Figure 4). But it is also clear that in all disciplines, archiving/sharing data or code is an activity that only a minority of researchers engage in.

Figure 4 – Share of researchers archiving/sharing data & code

Do researchers from different disciplines use different tools for archiving and sharing code & data? Our data suggest that they do (Table 1, data here). Percentages given are the share of researchers (from a given discipline) who indicate using a certain tool. For this analysis, we looked at the population of researchers (n=4,358) who indicated using at least one tool for archiving/sharing data (see also Figure 4). As multiple answers were allowed for disciplines as well as for tools used, percentages do not add up to 100%.
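To illustrate why multi-answer percentages do not add up to 100%, here is a small Python sketch of the calculation described above. The respondent records, the field names (`disciplines`, `tools`) and the function `tool_shares` are hypothetical and invented for illustration. They are not the survey data or the survey's actual analysis code.

```python
from collections import defaultdict

# Hypothetical respondents: each may select several disciplines and several tools.
respondents = [
    {"disciplines": {"Physical Sciences"}, "tools": {"GitHub", "Bitbucket"}},
    {"disciplines": {"Physical Sciences", "Life Sciences"}, "tools": {"GitHub", "Figshare"}},
    {"disciplines": {"Life Sciences"}, "tools": {"Dryad"}},
    {"disciplines": {"Life Sciences"}, "tools": {"Figshare", "Dryad"}},
]

def tool_shares(rows):
    """Per discipline, the percentage of its respondents who use each tool."""
    totals = defaultdict(int)                       # respondents per discipline
    users = defaultdict(lambda: defaultdict(int))   # tool users per discipline
    for row in rows:
        for discipline in row["disciplines"]:
            # A respondent counts once in every discipline they selected.
            totals[discipline] += 1
            for tool in row["tools"]:
                users[discipline][tool] += 1
    return {
        d: {t: round(100 * n / totals[d], 1) for t, n in users[d].items()}
        for d in totals
    }

shares = tool_shares(respondents)
for discipline, per_tool in sorted(shares.items()):
    print(discipline, dict(sorted(per_tool.items())))
```

Because one respondent can be counted under several disciplines and several tools, the percentages within a discipline can sum to well over 100%, and the same person can contribute to more than one row.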

While it may be no surprise that researchers from Physical Sciences and Engineering & Technology are the heaviest GitHub users (and also the main users of Bitbucket), GitHub use is strong across most disciplines. Figshare and Dryad are predominantly used in Life Sciences, which may partly be explained by the coupling of these repositories to journals in this domain (i.e. PLOS to Figshare, and GigaScience, along with many others, to Dryad).

Table 1 – Specific tool usage for sharing data & code across disciplines

More surprisingly, Dataverse seems to be adopted by some disciplines more than others. This might be because Dataverse (which was developed by Harvard and is in use at many universities) often comes with institutional support from librarians and administrative staff. That support might increase use by people who have somewhat less affinity with ‘do-it-yourself’ solutions like GitHub or Figshare. An additional reason, especially for Medicine, could be the possibility of archiving data privately in Dataverse, with control over who is given access. This is often an important consideration when dealing with potentially sensitive and confidential patient data.

Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU projects and institutions. The fact that Zenodo is a data-sharing platform available to anyone (and thus not just for EU project data) might not be widely known yet.

A final interesting observation, which might run counter to common expectations, is that among researchers in Arts & Humanities who archive and share code, use of these specific tools is not lower than in Social Sciences and Medicine. In some cases, it is even higher.

A more detailed breakdown, e.g. by research role (PhD student, postdoc or faculty), year of first publication, or country, is possible using the publicly available survey data.
