The thing about Google Scholar
September 13, 2013
I’m a fan of Google Scholar, no doubt about it. From a practical point of view you can’t go past it as a tool for finding documents by subject, and, much as I prize and value our databases, if you restrict your searching to the traditional commercially produced high-quality sources you run the risk of missing a substantial proportion of the relevant material that is out there. I think most librarians get this, and if we are not including it in our toolkit and recommending it to students and researchers then we are simply losing credibility, because sooner or later they are going to hear about it anyway.
That’s probably a battle that was fought and won years ago, but it is still worth revisiting because an understanding of what GS is and how it works will be important when we move on to look at its place in (tada!) bibliometrics. So here’s the thing. A traditional database like Web of Science or Historical Abstracts contains a number of searchable elements, three of which are generally used when we search by subject – the title, the abstract and the keywords. If it doesn’t find your combination of search terms in those fields it won’t return a “hit”, so if an article contains a paragraph on jellyfish attacks on ad hoc networks on page 9 the database won’t find it unless those terms (“jellyfish attack” and “ad hoc network”) are in the title, abstract or keywords because the database has no knowledge of what is in the article. (Like most great and simple truths this is something of an over-simplification but it will do for present purposes.) The other limitation of a database is that it will only return results from journals that it “covers”, so anything that appears in a journal that is not on its list will not show up in the results – this is not entirely a bad thing, especially given the amount of junk publishing that is out there, but it does mean that no single database can be guaranteed to find “everything”. Sorry.
Google Scholar, on the other hand, is not a database but a search engine and manages to avoid these limitations. What it does for the scholarly world is more or less what big Google does for the everything-else world, it just sends out its spider to crawl across the scholarly web and creates a huge word index of everything it finds. It does then try to organise what it finds into meaningful elements, so when it comes across an article it generally recognises that a certain bit is the names of the authors, another bit is the title, another the name of the journal and so on. This is generally successful but occasionally goes badly wrong, so that a search for Influenza as an author turns up the following
There is no obvious author for this report, so poor old GS chose three headings at random from the sidebar at the left – Bulletins Portal, Avian Influenza and West Nile Virus. Looking at the list of options, it could have been worse.
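To make the contrast with a field-based database concrete, here is a toy sketch of the word-index (inverted index) idea – purely illustrative, and nothing like Google Scholar’s actual implementation. The point is that every word of every document is indexed, so a phrase buried on page 9 is just as findable as one in the title:

```python
# Toy inverted index -- an illustration of the word-index idea,
# not Google Scholar's actual implementation.
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents: the phrase sits deep in doc A's full text.
docs = {
    "A": "routing protocols ... page 9 mentions a jellyfish attack on an ad hoc network",
    "B": "a paper on coral reefs and jellyfish biology",
}
index = build_index(docs)
print(search(index, "jellyfish attack"))  # prints {'A'}
```

A database that indexed only A’s title, abstract and keywords would never return it for this search; the full-text index finds it immediately.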
For those with too much time on their hands, it used to be possible to have great fun doing this sort of thing, but they do seem to have cleaned their act up a bit over the years and you no longer get massive numbers of hits by author-searching the names of nasty diseases. For glass-half-full people, on the other hand, there is a huge upside to GS’s cheap-and-cheerful approach and that relates to its ability to find stuff right in the middle of articles, reports and even books. Let’s look again at jellyfish attacks on ad hoc networks – you can click here to run this search in Google Scholar. Now we get 150 results and it’s pretty clear from the first couple of pages that many or most of them are from bona fide scholarly journals in the IT field. Even if we go to the last page they still look okay, and even if six of the links are to the same passage in the Encyclopedia of Cryptography and Security who is going to worry about that when we see the results from doing the same search in the conventional databases? Here’s what we find –
Scopus 12
Web of Knowledge 6
ACM Digital Library 5
IEEE Xplore 3 (25 if you do a full-text search)
Game over. No brainer. Google Scholar is the hands-down winner. Choose your cliché. And wait, there’s more. If you look at the first twenty results of our search, thirteen of the items found come with links to copies of the documents in digital repositories.
What this means is that those thirteen articles are directly available to any Internet user regardless of whether or not they have access to a subscribed (aka paywalled) copy. In terms of Open Access and Information Democracy this is huge, and it is difficult to imagine that documents in repositories would get a fraction of the use that they do without this sort of exposure. Four cheers for Google Scholar.
So what’s the downside? Nothing can be that good, right? Right. In the results of our search above, you may have noticed that nothing on the first page of results was more recent than 2011, and the top hits were relatively old for such a hot topic. This is because the sorting algorithm favours highly cited items, which tend to be older ones – as I said last week, it takes time to get cited – and you need to limit your results to recent years to be looking at the latest work. Of the nine articles from 2013, none appears before the third page of results. This is fine if you know that you have to apply the date limit, but many students will not know to do this, not to mention members of the general public. And then get this – our search of Scopus turned up the following 2012 paper from Communications in Computer and Information Science – “JAM: Mitigating jellyfish attacks in wireless ad hoc networks” – but even with both our keyword phrases in the title it didn’t show up in the list of 150 results on GS, at least not today, 13 September 2013. So phooey to Google Scholar too, sometimes you just can’t trust it.
Actually none of this is very new and, love GS or hate it, we have learned to live with it and profit from the experience. I wrote an article expressing qualified support back in 2006 – Examining the claims of Google Scholar as a serious information source – and the other day someone asked me if I still stood by the opinions I expressed then. I do, and looking back at it I’m surprised at how little GS has changed. At the time, though, opinions were quite strongly divided, with many in the library profession arguing against its appearance on library websites alongside “proper” databases, and it seemed to be necessary to speak out in support of its use as a practical information discovery tool. Over time new functionality has been added, obviously in response to users wanting GS to behave more like a database, and one of the first additions was the facility to create records for EndNote, RefMan and BibTeX from GS data. This works pretty well, but only operates on one record at a time, and records don’t include abstracts or keywords – it is not a database, so that sort of data isn’t really present – and users have to be warned to check records to ensure that the authors don’t turn out to be Meningococcal Disease and Oral Vaccination. But, as I said, GS is cheap (free actually) and cheerful, and a little wild inaccuracy is all part of the fun.
What has changed since 2006 is the world, or more precisely that part of the world impacted by the growth of research evaluation and bibliometric analysis of research outputs. Major evaluation exercises have been undertaken in the United Kingdom, Australia and New Zealand, and while the United States does not have the same centralised funding system, citation counts and the h-index are routinely used in hiring and tenure decisions. The good folks at Google Scholar (hi there!) don’t venture out much to consult or explain, but they obviously take careful note of what is going on, and the fact that their citation counts were routinely higher than those found in Web of Science and Scopus was clearly attractive to many scholars. Unless you are playing golf, a larger number is somehow much more satisfying than a smaller one, and GS nearly always has the advantage there; its spider has access to all of the major digital journal collections covered by WoS and Scopus, and it also covers the mass of worthwhile smaller titles like New Zealand Studies in Applied Linguistics and He Pukenga Korero that are overlooked by the big databases. So, all good. If you want to know who cited your work then look on Google Scholar; give or take the odd botch-up you will find not only all the citing journal articles, but the theses and reports that had mentioned it as well, plus a number of books. The latter can be topped up by a visit to Google Books, and then you have a really rich picture of the influence of a publication, much more nuanced and informative than simply knowing that Journal Article A was cited by Journal Articles B through X between the years Y and Z. Knowing that it has been cited in government or NGO reports, for example, could indicate that a piece of research has informed policy and helped bring about change, while its appearance in books could be a sign of the wider dispersal of its ideas. Brilliant.
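For anyone who hasn’t met the h-index mentioned above, here is a minimal sketch of how it is computed – a simple illustration, not any database’s actual code:

```python
# Rough illustration of the h-index: the largest h such that the
# author has h papers with at least h citations each.
def h_index(citation_counts):
    """Compute the h-index from a list of per-paper citation counts."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # this paper still clears the bar
        else:
            break
    return h

# Five papers cited 10, 8, 5, 4 and 3 times: four papers have at
# least 4 citations each, so the h-index is 4.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```

This also shows why GS’s routinely higher citation counts translate into a higher h-index: the extra citations push more papers over the bar.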
Okay, not so fast. Remember that I said Google Scholar was a search engine, created by a spider, rather than a database with a clear structure conforming to clearly understood principles of inclusion and exclusion. (Most of the time.) Well, this is where things begin to get murky, because GS does a thorough job of dredging that part of the web which it designates scholarly, and its scoop really does capture everything – oysters and pearls, yes, but also old boots and the supermarket trolley some students tried to use as a boat. I exaggerate of course, but the difficulty lies in defining what we might count as a citation. By more or less inventing the notion of citation counting, the Institute for Scientific Information under Eugene Garfield (now Thomson Reuters Web of Science) put in place a de facto definition that it was citation in a peer-reviewed journal that counted – i.e. that literally contributed to the count – and for many years that was more or less our understanding too, partly because we had no alternative and partly because it made some sort of sense. A citation in a peer-reviewed journal was a real pat on the back, not just a pat on the back from your friend Johnny or from your cousin’s girlfriend but a pat on the back from one of the teachers. Awesome! However, by extending this little sign of approval (and it usually is approval) from the boring old peer-reviewed sources to include working papers of uncertain status, draft documents that are themselves titled “Not For Citation”, reports of uncertain origin, undergraduate student essays, PowerPoints and just plain old Word documents sitting there on the web for reasons no one can remember, we inevitably change the meaning of the number that is returned. I’m not saying that these citations don’t have a meaning – it’s always nice to be noticed – but in many cases this meaning may be more like an altmetrics number than what has been traditionally defined as bibliometrics.
And yes, I do know that the world moves on, but as it moves on it changes and part of our task as a community of librarians and scholars is to notice these changes and to think clearly about their meaning.
This is the point at which I say that there is a darker side to all of this. There is a darker side to all of this. If it is possible for clever people to game the databases by “citation stacking” then Google Scholar really is an unlocked door with no one at home. I draw your attention to a forthcoming paper in the Journal of the American Society for Information Science and Technology by López-Cózar, Robinson-Garcia and Torres-Salinas. Titled “The Google Scholar Experiment: how to index false papers and manipulate bibliometric indicators”, it’s about the ease with which citations can be dumped onto GS in order to inflate an individual’s h-index. I’m not saying for sure that we are going to be deluged by a flood of academic spam, but where there’s a way there’s almost certainly some scallywag out there with a will, so it won’t be for want of trying.
So, am I saying that you should simply stay away from Google Scholar when trying to evaluate the influence of your research? Hell no! If you don’t keep track of GS citations you will miss out on a lot of pats on the back and also some important evidence of how your work has been received. One thing published scholars can do, and really should do, is to register with Google Scholar Citations and “claim” their publications and make this public. I have done this and while it inflates my citation counts and credits me with an h-index, that’s just fine by me – as long as I don’t go around bragging about it to anyone who might know better. It also does another couple of things which are useful – sending me alerts when any of the works are cited (I can tell you now that that is a rare and happy event) and creating for me a GS portfolio that links any of the works that someone might find to all of the others. In the example below, because the document in question is linked to my GS Citations profile my name is underlined, which means anyone clicking on it will link through to the portfolio with all the other stuff. I actually think that this portfolio approach is much more important than any number, especially as the number is currently subject to significant “grade inflation”.
So here in a nutshell is what I’m saying. Google Scholar – good to great. Google Scholar Citations – really interesting but use with care. But anything purely metric, like the egregiously-named Publish or Perish (a piece of software that turns GS data into fancy-looking metrics) or Google Scholar Metrics, is best avoided by anyone with a concern for the meaning and validity of numbers in the evaluation of scholarly influence. And that’s all for now. As usual tweet, retweet, and if anyone wants to spend the weekend doubling my h-index that’s good too*.
Bruce White
eResearch Librarian
eResearch on Library Out Loud
*Just kidding! Seriously, DON’T DO THAT!!!