As every librarian knows, there are three main sources of citation data. The three citation indexes (in increasing order of size) are Web of Science, Scopus and Google Scholar.
@aarontay I’ve actually been meaning to ask you- What do you make of publishers’ not seeming to care that most of RG/Acad material breaks copyright?
— Ryan Regier (@ryregier) May 1, 2017
Honestly, I’m not sure why the publishers are not currently acting (though they did act against academia.edu a few years ago), but the instability of the situation is perhaps the greatest sticking point in using ResearchGate as a citation index.
But is there anything better?
Microsoft Academic search (new version)
Microsoft’s alternative to Google Scholar, Microsoft Academic Search, is now in its 2nd iteration and may fit the bill.
Some of you might remember the earlier version, which appeared to have been abandoned in 2014, when analysis suggested that Microsoft had stopped indexing new articles.
Fortunately, Microsoft has since clarified that they were actually working on a new version of the service.
They then launched a preview version of this new Microsoft Academic service about a year ago.
As you will see later, many bibliometrics experts are very excited by this new service. But why?
Microsoft describes it thus in a 2015 paper: “At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, we in this version include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine.”
In short, it seems to offer the best of both worlds: Google Scholar-style crawling combined with publisher metadata feeds, giving a large index with the kind of attention to metadata fields that you get from library databases. On top of that, while extracting data from Google Scholar is well known to be difficult, you can do so easily in Microsoft Academic Search via an API or by downloading the Microsoft Academic Graph (MAG).
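To make the "heterogeneous entity graph" idea a bit more concrete, here is a minimal sketch in Python. It only models the six entity types named in the quote above. The class names, the relation labels such as "written_by", and the ids are all my own assumptions for illustration; this is not the actual schema of the Microsoft Academic Graph.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    """The six entity types named in Microsoft's 2015 description."""
    FIELD_OF_STUDY = "field of study"
    AUTHOR = "author"
    INSTITUTION = "institution"
    PAPER = "paper"
    VENUE = "venue"
    EVENT = "event"


@dataclass(frozen=True)
class Entity:
    entity_id: int            # hypothetical internal id
    entity_type: EntityType
    name: str


@dataclass
class EntityGraph:
    entities: dict = field(default_factory=dict)  # entity_id -> Entity
    edges: set = field(default_factory=set)       # (source_id, relation, target_id)

    def add(self, entity):
        self.entities[entity.entity_id] = entity

    def link(self, source_id, relation, target_id):
        # Relation labels are made up for this sketch.
        self.edges.add((source_id, relation, target_id))

    def neighbours(self, source_id, relation):
        """Return the entities reachable from source_id via the given relation."""
        return [
            self.entities[target]
            for source, rel, target in sorted(self.edges)
            if source == source_id and rel == relation
        ]


graph = EntityGraph()
graph.add(Entity(1, EntityType.PAPER, "An example paper"))
graph.add(Entity(2, EntityType.AUTHOR, "A. Author"))
graph.add(Entity(3, EntityType.VENUE, "Example Journal"))
graph.link(1, "written_by", 2)
graph.link(1, "published_in", 3)

authors = [e.name for e in graph.neighbours(1, "written_by")]
```

The point of the sketch is simply that papers, authors, venues and so on are separate typed nodes joined by relations, rather than one flat text record per result.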
I’ve been playing around with the service, setting up my own profile, kicking the tires etc.
My profile in the new Microsoft academic search
I’m not ready to do a full review yet, but it does look promising despite the bugs.
Some preliminary things I noticed
Search-wise, the number of results returned seems closer to what you would see when searching library databases. For example, a search for the terms web scale discovery gets you around 111 results, but you get 2.6 million in Google Scholar! It seems unlikely that the Microsoft Academic Search index is smaller by such a degree (see later), so it is probably because it does not search and match within full text (for whatever technical reason).
And this was confirmed via Twitter
@aarontay We default to semantic search over full text, so less but more accurate results. If semantic fails we do fall back to full text though— Microsoft Academic (@MSFTAcademic) May 8, 2017
The other major difference is that Google Scholar shows [citation] results, that is, items that are cited but not themselves indexed, while Microsoft Academic Search does not.
[Citation] results in Google Scholar
All this perhaps explains some of the reasons why you get fewer results compared to Google Scholar.
The size of Microsoft Academic search versus the rest
Microsoft claimed an index of 83 million publication records in 2015, and by 2016 this had risen to 140 million publication records, 40 million authors and 60,000 journal titles. As estimates for the size of Google Scholar’s index typically fall into the 100+ million range (it’s notoriously hard to get any hard facts on the size of Google Scholar), Microsoft now seems to be in the same ballpark and is significantly bigger than Scopus and Web of Science, which are perhaps 60%-70% of its size.
But that’s what is claimed, what does the research by Harzing and other researchers show?
Harzing is of course well known as the author of the free “Publish or Perish” tool, the only tool allowed by Google to extract citation data from Google Scholar. She has now added support for Microsoft Academic Search in version 5.0.
In blog posts such as Microsoft Academic (Search): a Phoenix arisen from the ashes? and the peer-reviewed article “Microsoft Academic: Is the Phoenix getting wings?” (http://www.harzing.com/download/mas2.pdf – Publisher’s version), she shows that the coverage of Microsoft Academic Search matches or exceeds that of the paid indexes Scopus and Web of Science. While it still falls behind Google Scholar, the gap is closing.
For example, her blog post studying coverage of her own works finds that practically all of her publications that are indexed in Scopus (all but 2) and Web of Science (all but 1) are also in Microsoft Academic Search. On top of that, Microsoft Academic Search finds 30 and 43 more of her works than Scopus and Web of Science respectively. Google Scholar still dominates Microsoft Academic Search though.
In general, Scopus & Web of Science detect slightly more citations than Microsoft Academic Search in the Life Sciences (11% more) and Sciences (7% more) and pretty much tie for Engineering, but Microsoft Academic Search beats the other two handily in Humanities (170% of Scopus) and Social Sciences (145% of Scopus). Google Scholar clearly dominates all as usual.
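Because those figures are quoted in two different directions ("X% more than Microsoft Academic" versus "Y% of Scopus"), it can help to put them on one scale. The short sketch below does only that arithmetic. It assumes that "11% more" means the paid index finds 1.11 times as many citations as Microsoft Academic, and it treats Scopus and Web of Science as a single baseline; both are my reading of the numbers above, not something taken from the paper. The function name is made up.

```python
def ma_share_of_baseline(baseline_pct_more_than_ma):
    """If the baseline finds X% more citations than Microsoft Academic,
    return Microsoft Academic's count as a percentage of the baseline."""
    return round(100 / (1 + baseline_pct_more_than_ma / 100), 1)


# Microsoft Academic citations as a percentage of the paid-index baseline.
ma_vs_paid = {
    "Life Sciences": ma_share_of_baseline(11),  # paid indexes find 11% more
    "Sciences": ma_share_of_baseline(7),        # paid indexes find 7% more
    "Humanities": 170.0,                        # already quoted as % of Scopus
    "Social Sciences": 145.0,                   # already quoted as % of Scopus
}

for discipline, share in ma_vs_paid.items():
    print(f"{discipline}: {share}% of the paid-index count")
```

On this reading, Microsoft Academic sits at roughly 90% of the paid indexes in Life Sciences and a little over 93% in Sciences, so the gap where it trails is much smaller than its lead in Humanities and Social Sciences.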
Harzing goes on to explain that Microsoft Academic Search uses machine learning to drop citations that it cannot verify as true cites, and attempts to correct for this to “estimate “true” citation counts”. This leads to the following comparison.
When looking at this estimated true citation count (MA ECC) , Microsoft Academic Search actually finds more citations than Google Scholar in Life Sciences and just barely loses out in Science & Engineering.
But Google Scholar continues to dominate in Social Sciences and particularly Humanities. This is probably due to the impact of Google Books for book-related items.
I could go on to describe the results from the thesis, Comparison of Microsoft Academic (Graph) with Web of Science, Scopus and Google Scholar, or the paper, The coverage of Microsoft Academic: Analyzing the publication output of a university, but the results are pretty much the same, with the latter describing the service as one that “is on the verge of becoming a bibliometric superpower.”
Completeness of metadata fields
But I suspect the interest in Microsoft Academic Search is not based purely on the size of the index. After all, Google Scholar still seems to have the edge in size.
Rather, the interest lies in the fact that the service is now sufficiently big while also offering rich metadata and easy extraction of data, both areas where Google Scholar is extremely poor. The only tool officially licensed by Google, Publish or Perish, is often unreliable and cannot be used for large-scale extraction, for example.
It might be worth reading the thesis, Comparison of Microsoft Academic (Graph) with Web of Science, Scopus and Google Scholar, particularly chapter 4, which compares the openness of accessing data of the 3 sources and the completeness of metadata fields.
Similarly, Citation Analysis with Microsoft Academic goes in depth to assess the suitability of Microsoft Academic Search as a bibliometric tool in terms of the completeness of metadata fields and the accessibility of data for extraction.
In general, the results are positive. It’s far easier to extract and manipulate data than with Google Scholar, either through the API or by downloading the Microsoft Academic Graph, which has much richer and more structured data than Google Scholar. Even something like having internal Microsoft Academic Search assigned ids for “papers, references, authors, affiliations, fields of study, journals and venues” is very helpful.
It’s not a perfect tool though. For example, examining the available attributes, the authors find that there is no document type (which makes metrics that normalize by document type hard to compute), nor is there the very obvious DOI attribute (a strange omission). While there is a subject attribute, “field of study”, it’s dynamically generated and far too specific (50,000 fields of study?).
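To show why stable internal ids matter so much, here is a minimal sketch of counting citations once papers and references are keyed by id. The ids, titles and the function name are all made up for illustration; this is not the actual MAG file layout or the API, just the kind of join that ids make trivial and that title-string matching makes painful.

```python
# Hypothetical paper table: internal id -> title.
papers = {
    101: "Paper A",
    102: "Paper B",
    103: "Paper C",
}

# Hypothetical reference table: (citing_paper_id, cited_paper_id).
references = [
    (102, 101),
    (103, 101),
    (103, 102),
    (103, 999),  # cites something outside our paper table
]


def citation_counts(papers, references):
    """Count incoming citations per paper id, ignoring ids not in `papers`."""
    counts = {paper_id: 0 for paper_id in papers}
    for citing_id, cited_id in references:
        if citing_id in papers and cited_id in papers:
            counts[cited_id] += 1
    return counts


counts = citation_counts(papers, references)
for paper_id, n in counts.items():
    print(f"{papers[paper_id]}: cited {n} time(s)")
```

With ids there is no need to guess whether two slightly different title strings refer to the same paper, which is exactly the kind of cleanup that scraped Google Scholar results tend to need.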
Currently our citation sources consist of either paid services like Scopus and Web of Science, or free-to-access services from commercial companies – Google Scholar and Microsoft Academic Search. Neither is ideal.
The OpenCitations Project is probably the best solution but as of writing there is no study I know of quantifying the size of this index.
Still, one wonders if it might be the beginning of the end for paid citation indexes. Use of Scopus and Web of Science as discovery tools has greatly declined in recent years, and much of their value now lies in generating citation metrics.
As open access continues to march on, more and more content will be freely available. This will free up citations/references as well to be mined (albeit not always in structured format), so citation indexes will have to compete on data quality, feature sets and ease of use.
Players that have strengths in handling and cleaning of large datasets (e.g. Google) will have a big edge here of course. Traditional companies that serve libraries and academia may not be able to match this but do have strengths in terms of better understanding of academics so it’s going to be interesting to watch.