The thing about Google Scholar

September 13, 2013

I’m a fan of Google Scholar, no doubt about it. From a practical point of view you can’t go past it as a tool for finding documents by subject and, much as I prize and value our databases, if you restrict your searching to the traditional commercially-produced high-quality sources you run the risk of missing out on a substantial proportion of the relevant material that is out there. I think most librarians get this and if we are not including it as part of our toolkit and recommending it to students and researchers then we are simply losing credibility because sooner or later they are going to hear about it anyway.

That’s probably a battle that was fought and won years ago, but it is still worth revisiting because an understanding of what GS is and how it works will be important when we move on to look at its place in (tada!) bibliometrics. So here’s the thing. A traditional database like Web of Science or Historical Abstracts contains a number of searchable elements, three of which are generally used when we search by subject – the title, the abstract and the keywords. If it doesn’t find your combination of search terms in those fields it won’t return a “hit”, so if an article contains a paragraph on jellyfish attacks on ad hoc networks on page 9 the database won’t find it unless those terms (“jellyfish attack” and “ad hoc network”) are in the title, abstract or keywords because the database has no knowledge of what is in the article. (Like most great and simple truths this is something of an over-simplification but it will do for present purposes.) The other limitation of a database is that it will only return results from journals that it “covers”, so anything that appears in a journal that is not on its list will not show up in the results – this is not entirely a bad thing, especially given the amount of junk publishing that is out there, but it does mean that no single database can be guaranteed to find “everything”. Sorry.
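The difference can be sketched in a few lines of code. This is a toy illustration (no real database works quite this simply): a record whose body mentions jellyfish attacks is invisible to a title/abstract/keyword search but visible to a full-text one.

```python
# Toy record: the phrase we want appears ONLY in the body of the article,
# not in the fields a traditional database indexes.
article = {
    "title": "Security issues in mobile networking",
    "abstract": "We survey denial-of-service threats in wireless systems.",
    "keywords": ["security", "DoS", "wireless"],
    "full_text": "... on page 9 we discuss the jellyfish attack "
                 "on ad hoc networks in detail ...",
}

def field_limited_match(record, phrase):
    """Database-style: match only title, abstract and keywords."""
    searchable = " ".join(
        [record["title"], record["abstract"], " ".join(record["keywords"])]
    ).lower()
    return phrase.lower() in searchable

def full_text_match(record, phrase):
    """Search-engine-style: match everything, including the body."""
    searchable = " ".join(
        [record["title"], record["abstract"],
         " ".join(record["keywords"]), record["full_text"]]
    ).lower()
    return phrase.lower() in searchable

print(field_limited_match(article, "jellyfish attack"))  # False
print(full_text_match(article, "jellyfish attack"))      # True
```

The database-style search returns nothing because it never looks at page 9; the full-text search finds the phrase immediately.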

Google Scholar, on the other hand, is not a database but a search engine and manages to avoid these limitations. What it does for the scholarly world is more or less what big Google does for the everything-else world: it just sends out its spider to crawl across the scholarly web and creates a huge word index of everything it finds. It does then try to organise what it finds into meaningful elements, so when it comes across an article it generally recognises that a certain bit is the names of the authors, another bit is the title, another the name of the journal and so on. This is generally successful but occasionally goes badly wrong, so that a search for Influenza as an author turns up the following
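That "huge word index" is, at bottom, an inverted index. Here is a toy version of the structure (the details of Google Scholar's own implementation are, of course, not public): every word maps back to the set of documents that contain it, which is why a phrase buried on page 9 is just as findable as one in a title.

```python
# A toy inverted index: the basic data structure behind a crawler-built
# search engine. (Illustrative only; GS's real index is not documented.)
from collections import defaultdict

documents = {
    "doc1": "jellyfish attack in ad hoc networks",
    "doc2": "avian influenza surveillance bulletin",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Every word points back to the documents containing it:
print(sorted(index["jellyfish"]))  # ['doc1']
print(sorted(index["influenza"]))  # ['doc2']
```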

[Screenshot: a Google Scholar record listing "Bulletins Portal, Avian Influenza and West Nile Virus" as its authors]

There is no obvious author for this report so poor old GS chose three headings at random from the sidebar at the left – Bulletins Portal, Avian Influenza and West Nile Virus. Looking at the list of options it could have been worse –

[Screenshot: the sidebar headings Google Scholar had to choose from]

For those with too much time on their hands, it used to be possible to have great fun doing this sort of thing, but they do seem to have cleaned their act up a bit over the years and you no longer get massive numbers of hits by author-searching the names of nasty diseases. For glass-half-full people, on the other hand, there is a huge upside to GS’s cheap-and-cheerful approach and that relates to its ability to find stuff right in the middle of articles, reports and even books. Let’s look again at jellyfish attacks on ad hoc networks – you can click here to run this search in Google Scholar. Now we get 150 results and it’s pretty clear from the first couple of pages that many or most of them are from bona fide scholarly journals in the IT field. Even if we go to the last page they still look okay, and even if six of the links are to the same passage in the Encyclopedia of Cryptography and Security who is going to worry about that when we see the results from doing the same search in the conventional databases? Here’s what we find –

Scopus 12
Web of Knowledge 6
ACM Digital Library 5
IEEE Xplore 3 (25 if you do a full-text search)

Game over. No brainer. Google Scholar is hands-down winner. Choose your cliché. And wait, there’s more. If you look at the first twenty results of our search, thirteen of the items found come with links to copies of the documents in digital repositories

[Screenshot: Google Scholar results with links to repository copies of the documents]

What this means is that those thirteen articles are directly available to any Internet user regardless of whether or not they have access to a subscribed (aka paywalled) copy. In terms of Open Access and Information Democracy this is huge, and it is difficult to imagine that documents in repositories would get a fraction of the use that they do without this sort of exposure. Four cheers for Google Scholar.

So what’s the downside? Nothing can be that good, right? Right. In the results of our search above, you may have noticed that nothing on the first page of results was more recent than 2011, and the top hits were relatively old for such a hot topic. This is because the sorting algorithm favours highly cited items, which tend to be older ones – as I said last week, it takes time to get cited – and you need to limit your results to recent years to see the latest work. Of the nine articles from 2013, none appears before the third page of results. This is fine if you know that you have to apply the date limit, but many students will not know to do this, not to mention members of the general public. And then get this – our search of Scopus turned up the following 2012 paper from Communications in Computer and Information Science – “JAM: Mitigating jellyfish attacks in wireless ad hoc networks” – but even with both our keyword phrases in the title it didn’t show up in the list of 150 results on GS, at least not today, 13 September 2013. So phooey to Google Scholar too, sometimes you just can’t trust it.
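The burying of recent work can be illustrated with invented numbers (these are not real GS results, and GS's actual ranking formula is not public – the sketch just assumes citation count dominates): sort by citations and the 2013 paper drops to the bottom; only a year filter brings it back.

```python
# Hypothetical result set illustrating the behaviour described above:
# citation-weighted sorting pushes recent, lightly cited papers down
# the list unless a date limit is applied first.
results = [
    {"title": "A", "year": 2005, "citations": 240},
    {"title": "B", "year": 2008, "citations": 120},
    {"title": "C", "year": 2013, "citations": 3},
    {"title": "D", "year": 2011, "citations": 45},
]

by_citations = sorted(results, key=lambda r: r["citations"], reverse=True)
print([r["title"] for r in by_citations])   # ['A', 'B', 'D', 'C'] - C is last

recent_only = [r for r in by_citations if r["year"] >= 2012]
print([r["title"] for r in recent_only])    # ['C'] - only the date limit surfaces it
```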

Actually none of this is very new and, love GS or hate it, we have learned to live with it and profit from the experience. I wrote an article expressing qualified support back in 2006 – Examining the claims of Google Scholar as a serious information source – and the other day someone asked me if I still stood by the opinions I expressed then. I do, and looking back at it I’m surprised at how little GS has changed. At the time, though, opinions were quite strongly divided, with many in the library profession arguing against its appearance on library websites alongside “proper” databases, and it seemed to be necessary to speak out in support of its use as a practical information discovery tool. Over time new functionality has been added, obviously in response to users wanting GS to behave more like a database, and one of the first additions was the facility to create records for EndNote, RefMan and BibTeX from GS data. This works pretty well, but only operates on one record at a time and records don’t include abstracts or keywords – it is not a database, so that sort of data isn’t really present – and users have to be warned to check records to ensure that the authors don’t turn out to be Meningococcal Disease and Oral Vaccination. But, as I said, GS is cheap (free actually) and cheerful and a little wild inaccuracy is all part of the fun.

What has changed since 2006 is the world, or more precisely that part of the world impacted by the growth of research evaluation and bibliometric analysis of research outputs. Major evaluation exercises have been undertaken in the United Kingdom, Australia and New Zealand, and while the United States does not have the same centralised funding system, citation counts and the h-index are routinely used in hiring and tenure decisions. The good folks at Google Scholar (hi there!) don’t venture out much to consult or explain, but they obviously take careful note of what is going on, and the fact that their citation counts were routinely higher than those found in Web of Science and Scopus was clearly attractive to many scholars. Unless you are playing golf, a larger number is somehow much more satisfying than a smaller one, and GS nearly always has the advantage there; its spider has access to all of the major digital journal collections covered by WoS and Scopus, and as well as this GS covers the mass of worthwhile smaller titles like New Zealand Studies in Applied Linguistics and He Pukenga Korero that are overlooked by the big databases. So, all good. If you want to know who cited your work then look on Google Scholar; give or take the odd botch-up you will find not only all the citing journal articles, but the theses and reports that mention it as well, plus a number of books. The latter can be topped up by a visit to Google Books and then you have a really rich picture of the influence of a publication, much more nuanced and informative than simply knowing that Journal Article A was cited by Journal Articles B through X between the years Y and Z. Knowing that it has been cited in government or NGO reports, for example, could indicate that a piece of research has informed policy and helped bring about change, while its appearance in books could be a sign of the wider dispersal of its ideas. Brilliant.
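Since the h-index keeps coming up, here is how it is usually computed – the largest h such that an author has h papers with at least h citations each. The citation numbers below are invented purely to show why the source of the counts matters: the same set of papers yields a higher h when each paper's count is inflated.

```python
# h-index: the largest h such that h papers have at least h citations each.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Same six papers, two hypothetical sources of citation counts
# (invented numbers; GS counts are typically the higher ones).
wos_counts = [12, 9, 7, 3, 2, 1]
gs_counts = [21, 15, 11, 8, 4, 2]

print(h_index(wos_counts))  # 3
print(h_index(gs_counts))   # 4
```

The point is not that either number is wrong, but that the h-index is only as meaningful as the citation counts fed into it.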

Okay, not so fast. Remember that I said that Google Scholar was a search engine, created by a spider, rather than a database with a clear structure, conforming to clearly understood principles of inclusion and exclusion. (Most of the time.) Well, this is where things begin to get murky, because GS does a thorough job of dredging that part of the web which it designates scholarly, and its scoop really does capture everything – oysters and pearls, yes, but also old boots and the supermarket trolley some students tried to use as a boat. I exaggerate of course, but the difficulty lies in defining what we might count as a citation. By more or less inventing the notion of citation counting, the Institute for Scientific Information under Eugene Garfield (now Thomson Reuters Web of Science) put in place a de facto definition that it was citation in a peer-reviewed journal that counted – i.e. that literally contributed to the count – and for many years that was more or less our understanding too, partly because we had no alternative and partly because it made some sort of sense. A citation in a peer-reviewed journal was a real pat on the back – not just a pat on the back from your friend Johnny or from your cousin’s girlfriend, but a pat on the back from one of the teachers. Awesome! However, by extending this little sign of approval (and it usually is approval) from the boring old peer-reviewed sources to include working papers of uncertain status, draft documents that are themselves titled “Not For Citation”, reports of uncertain origin, undergraduate student essays, PowerPoints and just plain old Word documents sitting there on the web for reasons no one can remember, we inevitably change the meaning of the number that is returned. I’m not saying that these numbers don’t have a meaning – it’s always nice to be noticed – but in many cases this meaning may be more like an altmetrics number than what has been traditionally defined as bibliometrics.
And yes, I do know that the world moves on, but as it moves on it changes and part of our task as a community of librarians and scholars is to notice these changes and to think clearly about their meaning.

This is the point at which I say that there is a darker side to all of this. There is a darker side to all of this. If it is possible for clever people to game the databases by “citation stacking” then Google Scholar really is an unlocked door with no one at home. I draw your attention to a forthcoming paper in the Journal of the American Society for Information Science and Technology by López-Cózar, Robinson-Garcia and Torres-Salinas. Titled “The Google Scholar Experiment: how to index false papers and manipulate bibliometric indicators” it’s about the ease with which citations can be dumped onto GS in order to inflate an individual h-index. I’m not saying for sure that we are going to be deluged by a flood of academic spam, but where there’s a way there’s almost certainly some scallywag out there with a will, so it won’t be for want of trying.

So, am I saying that you should simply stay away from Google Scholar when trying to evaluate the influence of your research? Hell no! If you don’t keep track of GS citations you will miss out on a lot of pats on the back and also some important evidence of how your work has been received. One thing published scholars can do, and really should do, is to register with Google Scholar Citations and “claim” their publications and make this public. I have done this and while it inflates my citation counts and credits me with an h-index, that’s just fine by me – as long as I don’t go around bragging about it to anyone who might know better. It also does another couple of things which are useful – sending me alerts when any of the works are cited (I can tell you now that that is a rare and happy event) and creating for me a GS portfolio that links any of the works that someone might find to all of the others. In the example below, because the document in question is linked to my GS Citations profile my name is underlined, which means anyone clicking on it will link through to the portfolio with all the other stuff. I actually think that this portfolio approach is much more important than any number, especially as the number is currently subject to significant “grade inflation”.

[Screenshot: a Google Scholar result with the author's name underlined and linked to a GS Citations profile]

So here in a nutshell is what I’m saying. Google Scholar – good to great. Google Scholar Citations – really interesting but use with care. But anything purely metric, like the egregiously-named Publish or Perish (a piece of software that turns GS data into fancy-looking metrics) or Google Scholar Metrics, is best avoided by anyone with a concern for the meaning and validity of numbers in the evaluation of scholarly influence. And that’s all for now. As usual, tweet, retweet, and if anyone wants to spend the weekend doubling my h-index that’s good too*.

Bruce White
eResearch Librarian
eResearch on Library Out Loud

*Just kidding! Seriously, DON’T DO THAT!!!
