A recent Research Article published by Chiara Gabella (CG), SIB Swiss Institute of Bioinformatics, and colleagues explored how best to fund knowledgebases, which many life scientists rely on as highly accurate and reliable sources of scientific information. There are many open questions about how to fund these resources. In her article, Chiara uses UniProtKB as a case study: a knowledgebase run by the UniProt Consortium, a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). Chiara’s article was openly reviewed by Helen Berman (HB), Rutgers, The State University of New Jersey, who also works on a knowledgebase: the RCSB Protein Data Bank.
What is a knowledgebase and how is it different from a deposition database or repository? Can you briefly describe the knowledgebases that you both work on and how they serve the scientific community?
CG: As the name suggests, a knowledgebase is a database of knowledge. It is not a simple collection of data, but can be seen as a dynamic digital encyclopaedia, where information is built on many datasets from the literature or other sources, such as other knowledgebases.
Knowledgebases often add layers of value through expert curation: highly qualified scientists manually select, review and annotate the information. Deposition databases allow scientists to deposit their scientific findings and are essential to ensure long-term data preservation and accessibility. The added value of a knowledgebase lies in the integration and interpretation of large amounts of data and information, making knowledgebases a primary source of extremely reliable scientific information for researchers in academia as well as in the private sector.
The knowledgebase studied in the paper is the Universal Protein Resource knowledgebase (UniProtKB), a key resource consulted every month by 500,000 users from the international scientific community and containing more than 100 million protein sequences and associated functional information. At the SIB in Geneva, more than 40 biocurators contribute to the curated section of UniProtKB (UniProtKB/Swiss-Prot), which contains more than 550,000 protein sequences. At SIB we have more than 150 data resources, and many of them are manually curated knowledgebases.
HB: Knowledgebases aggregate and integrate information from a variety of data resources, thereby creating a “one-stop shop” for a particular scientific domain. UniProt does this for protein sequences; rather than having to sort through information from 150 sources, it is possible to go to one place and learn about a particular protein sequence. The RCSB Protein Data Bank provides an analogous function for the more than 137,000 macromolecular structures in the PDB. Starting with the data in the PDB Archive managed by the wwPDB, the RCSB PDB website integrates information from 40 different resources to provide knowledge about the many aspects of macromolecular structure. The site receives about one million unique visitors per year, including scientists, educators, and students from all over the world.
In her article, Chiara says one of the criteria for a funding model for UniProt is that it must “guarantee open access and equal opportunity”. Why do you think this is so important for a knowledgebase?
CG: At SIB, we believe that access to scientific data must not be a privilege. The purpose of our work is to empower advances in life sciences and health. We strive to ensure that knowledge and data are accessible to all in a sustainable way, irrespective of their location or academic status. Scientific research as a whole is moving in this direction: data should be open and FAIR – Findable, Accessible, Interoperable and Reusable – which allows their preservation after the end of the related research projects and guarantees access to the wider community.
“The ability to easily find known information about a particular subject enables and promotes the scientific process and speeds up the acquisition of new knowledge.” Helen Berman
HB: Open access allows community based review of data and information. The ability to easily find known information about a particular subject enables and promotes the scientific process and speeds up the acquisition of new knowledge. Any errors that may exist in a knowledgebase can be reported and corrected quickly by authors of the original work or by biocurators.
The article outlines twelve different funding models for knowledgebases and then applies these to UniProt. For you, which two models stood out most, and why?
CG: Knowledgebases, UniProtKB included, are currently funded mainly by two of the models presented in the paper: institutional funding and research grants. Institutional funding allows relative stability over time. Both models are compliant with open access principles and allow equity among users. However, these models cannot guarantee the long-term sustainability of data resources. The Infrastructure Model that we propose brings together the positive aspects of those two models: data resources still receive funds from funding agencies, but not as cyclical grants in competition with research projects. This model guarantees stable funding and scales with the amount of research data generated in a given activity domain.
HB: In an analysis of domain repositories, the Infrastructure Model was proposed as the best way to fund these types of data resources. That analysis came to similar conclusions as the current analysis of knowledgebases. In addition to providing open access and equity among researchers and universities, the infrastructure model provides for sustainable funding. This kind of stability, not possible through grant funding, is critical for all types of core data resources on which the research community depends.
In her review, Helen asks: “is there scope for significant future reduction in manual biocuration?”. How much of a role do you think automated curation will play in the future, and what impact will this have?
HB: Although curation cannot be completely automated, it is important to identify which parts of curation can be automated at any one point in time. This requires continuing review of the processes to see which aspects are repetitive and can be encoded. It also requires attention to new methods being developed by computer scientists that can be adapted to curation. These steps will ensure that curation becomes more efficient, which will lead to a higher-quality product and will contain costs.
“Automated curation is now used as a first screening tool for data extraction: we can imagine that in the future some tasks that now belong to manual curation will be automatically completed.” Chiara Gabella
CG: New methods in machine learning and text mining are gradually improving the efficiency of automated information extraction programs. Those techniques cannot, however, replace the knowledge and added value brought by highly experienced manual curators. Automated curation is now used as a first screening tool for data extraction: we can imagine that in the future some tasks that now belong to manual curation will be completed automatically. However, expert curation will always be needed to guarantee the high quality of data, through the selection and validation of appropriate data and the extraction of reliable information from the published literature.
In Helen’s review, she also highlights that sustainability is an issue across all data resources. What do you think can be done by the various stakeholders to address this?
“Data resources provide the foundational infrastructure that enables scientific research. The funding mechanisms should reflect that.” Helen Berman
HB: There are some important steps that can be taken. The first is to establish international agreements among data resources to ensure standards, collaboration on software development, and cooperation on data processing on a global scale. There are several models that have been adopted by different data resources for doing this.
The second is more complex. There needs to be better global coordination of the various public and private agencies that fund data resources. Criteria for merit review of these resources must be developed that are distinct from those used to review research proposals. Stable mechanisms for providing the funds need to be put in place; the infrastructure model is a good one. This idea has been discussed for many years but has never been implemented. It would go a long way in stabilizing core resources and ultimately make the funding process more efficient. Data resources provide the foundational infrastructure that enables scientific research. The funding mechanisms should reflect that.
CG: I couldn’t agree more with Helen’s statement. Long-term sustainability is a point of concern for data resources in every discipline. There is currently a lot of discussion about this issue, and initiatives are being taken at an international level. It is important to take into account not only the specificities of repositories but also those of curated databases.
Joining forces internationally is the key to long-term data and knowledge preservation. The global coordination of public and private funders is an essential step. Its practical implementation will require criteria to evaluate the impact of data resources, as well as a stable funding model. A fundamental step towards deciding on a viable funding model is the estimation of the overall cost of the data resources that need to be sustained.
The post How best to fund knowledgebases – an author and reviewer in conversation appeared first on F1000 Blogs.