March 2006

We made some citation frequency comparisons between Scopus, Web of Science and Google Scholar. As Scopus counts citations from 1996 onward, we limited the comparisons to articles published from 1996 to the present. The figures in the screenshot showed:

Scopus finds 9% more citations than Web of Science when limited to articles from 1996-.

Scopus finds 20% more citations than Google Scholar when limited to articles from 1996-.

Web of Science finds 10% more citations than Google Scholar when limited to articles from 1996-.
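
To make the comparison explicit, here is a minimal sketch of the percentage calculation, with hypothetical citation totals chosen only to reproduce the figures above (the real totals are in the screenshot):

```python
def pct_more(a: int, b: int) -> float:
    """Percentage by which citation count a exceeds count b."""
    return (a - b) / b * 100

# Hypothetical totals chosen only to illustrate the reported percentages.
scopus, wos, gs = 109, 100, 91

print(f"Scopus vs Web of Science: {pct_more(scopus, wos):.0f}% more")
print(f"Scopus vs Google Scholar: {pct_more(scopus, gs):.0f}% more")
print(f"Web of Science vs Google Scholar: {pct_more(wos, gs):.0f}% more")
```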

It is important to know that Web of Science indexes more than 9,000 journals, compared with Scopus's 15,000, though Web of Science argues that (according to Bradford's Law) it has the core journals, which attract the most citations. Google Scholar has no list of the journals and other sources it indexes, but it covers both articles from the proprietary web and scholarly archives, master's theses, books etc. Google Scholar's citation counting is not working properly either, as we already pointed out in a previous posting. In this test, not all cited references in Scopus were retrieved, just the indexed articles.

As we also mentioned, the article "An Examination of Citation Counts in a New Scholarly Communication Environment", published in D-Lib Magazine, September 2005, Vol. 11, No. 9, by Kathleen Bauer et al. at Yale University Library, also did some citation counting. But whereas we counted all citations for a random set of 5 authors at Umeå University, Bauer et al. compared the average number of times an article is cited. Neither our test nor the test by Bauer et al. checked for Google Scholar's inconsistencies in citation counting and duplicates.

Some of the findings from the article by Bauer et al follow below. The information derives from the tables in their article.

The search for articles published 2000 in Journal of the American Society for Information Science and Technology (JASIST) showed for example:

Web of Science counts 0.3 more citations than Scopus.

The search for articles published 1985 in Journal of the American Society for Information Science and Technology (JASIST) showed for example:

Web of Science counts 11.9 more citations than Scopus.

Because Scopus only counts citations from articles published from 1996 onward, the 11.9 difference is not surprising. The 0.3 difference for articles published in 2000 is more questionable, though. The test by Bauer et al. has its limitations because it covers just one journal (i.e., JASIST).

Conclusion: These different testing methods at least show that Scopus is definitely important when searching citations for articles published from 1996 onward. Due to its inconsistencies, Google Scholar is not recommended as the sole tool for citation searching.

In several of his writings, Peter Jacso has pointed out the inconsistencies of Google Scholar. One important flaw is the citation search. Both his web-published paper "Google Scholar and The Scientist" and the article "As we may search", published in Current Science 2005 (please see References to literature), discuss the problems.

My tests indicate fewer inconsistencies than before, but they still exist. The article "An Examination of Citation Counts in a New Scholarly Communication Environment", published in D-Lib Magazine, September 2005, Vol. 11, No. 9, by Kathleen Bauer et al. at Yale University Library, made some comparisons of the average number of times an article is cited. They checked the citation frequency of each article for a given year, in this case both 1985 and 2000, in the Journal of the American Society for Information Science and Technology (JASIST). The search for 2000 showed Google Scholar had 4.5 more citations than Web of Science and 3.9 more citations than Scopus. But for 1985, Web of Science had 8.7 more citations than Google Scholar, and Google Scholar had "just" 2.9 more citations than Scopus. The major shortcoming of this article is that it never analyzed the inconsistencies in Google Scholar's citation search. The citation count doesn't always work properly. Here's an example of a record that indicates Cited by 15 (other sources):

When clicking this link Cited by 15 you will find only 14 citations:

Here's another example of an article by P Jacso himself. Cited by 3 sources according to Google Scholar:

When clicking this link Cited by 3 you will find only 2 citations:
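
The check we did by hand can be sketched like this; the citation lists here are placeholders standing in for the actual Google Scholar results:

```python
def count_matches(reported: int, retrieved: list) -> bool:
    """True if the 'Cited by N' label agrees with the list behind the link."""
    return reported == len(retrieved)

# Placeholder citation lists modeled on the two mismatches above.
print(count_matches(15, ["citation"] * 14))  # label says 15, link gives 14
print(count_matches(3, ["citation"] * 2))    # label says 3, link gives 2
```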

This search on semiconductors is an example from Jacso. In this reference it looks as if the article was published in 2006, but checking the source shows it was published in 1990; 2006 is the starting page of the article:

Jacso has also pointed out the flaws of duplication in his article "As we may search" in Current Science. Google Scholar works hard on the ability to cluster duplicate articles together. If you look at the preceding screenshot, after the title you see the link group of 3>>. Clicking that link shows you 3 duplicates. Because Google Scholar indexes not just peer-reviewed journal articles but also preprint archives, conference papers, master's theses, web-published materials etc., you can understand that discovering duplicates is a hard problem.

Here's an example. Searching sojka modeling drop size distributions gives as the first hit an article by Babinsky and Sojka with the title "Modeling drop size distributions". That article is Cited by 7 other sources according to Google Scholar.

By clicking Cited by 7 you find 7 hits, but two of them are duplicates. Note the two identical titles "Modeling Spray Impingement using Linear Stability Theories for Droplet Shattering". Yet the first title even has a link to group of 2>>.
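
A crude way to spot such duplicates programmatically is to normalize titles before comparing them. This is just an illustrative sketch with a placeholder hit list, not how Google Scholar actually clusters:

```python
import re

def normalize_title(title: str) -> str:
    """Crude duplicate key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

# Placeholder hit list modeled on the duplicated title above.
hits = [
    "Modeling Spray Impingement using Linear Stability Theories for Droplet Shattering",
    "Modeling spray impingement using linear stability theories for droplet shattering.",
    "Some unrelated citing article",
]

clusters = {}
for hit in hits:
    clusters.setdefault(normalize_title(hit), []).append(hit)

duplicates = [group for group in clusters.values() if len(group) > 1]
print(len(duplicates))  # one duplicate pair found
```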

Conclusion: Don't trust the Google Scholar citation counting without manually checking it for inconsistencies in terms of counting and clustering duplicates.

It's not hard to find inconsistencies and flaws in Google Scholar. Some of them follow below.
This search on semiconductors is an example from Peter Jacso's Google Scholar and The Scientist 2005 (published on a university homesite as extra material).

In this reference it looks as if the article was published in 2006, but checking the source shows it was published in 1990; 2006 is the starting page of the article:

Searching in advanced search with the date range limited to 1995-2006 returns 135,000 hits. But extending the date range to 1985-2006 returns just 131,000 hits. How come, Google Scholar?
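
The underlying logic is simple set inclusion: widening a date range can only add documents, never remove them. A tiny sketch of the sanity check Google Scholar fails here:

```python
def widening_is_consistent(narrow_hits: int, wide_hits: int) -> bool:
    """A wider date range matches a superset of documents, so its hit
    count should never be smaller than the narrower range's count."""
    return wide_hits >= narrow_hits

# The two counts reported above: 1995-2006 vs the wider 1985-2006.
print(widening_is_consistent(135_000, 131_000))  # False
```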

Another flaw in Google Scholar concerns the Boolean OR operator. In this case the result for dahlqvist OR dahlquist is 16,200, while the two searches separately add up to 16,700. That would mean there are exactly 500 documents containing both dahlqvist AND dahlquist, but there aren't.
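
The arithmetic behind that claim is plain inclusion-exclusion. In this sketch, the split of the 16,700 total between the two spellings is hypothetical; only the sum and the OR result come from the searches above:

```python
def implied_overlap(count_a: int, count_b: int, count_or: int) -> int:
    """Inclusion-exclusion: |A AND B| = |A| + |B| - |A OR B|."""
    return count_a + count_b - count_or

# Hypothetical split of the 16,700 total between the two spellings.
dahlqvist, dahlquist = 12_000, 4_700

print(implied_overlap(dahlqvist, dahlquist, 16_200))  # 500
```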

This is a quotation from Peter Jacso, "As we may search", Current Science Vol. 89, No. 9 (10 November 2005), pp. 1537-1547:

"G-S is a free service, and for many who consider it to be a gift for the world it may be anathema to say any but good words of it. It is also to be emphasized that it is a joint gift by some publishers and/or their digital facilitators (the content part), and Google (the software and the service operation part). If ISI or Elsevier could have received such unfettered access to the publishers’ archives for harvesting their sites offering standard-compliant metadata, they could probably sell their services – if not for free – at a fraction of their current price. Building a multi-million record database incurs multi-million dollar investment just to subscribe to the journals, administer their processing, and record their standard bibliographic data, abstract, and descriptors, for about 1 million papers per year in the most recent period".

Of course, Google Scholar also has problems similar to Web of Science and Scopus when indexing the author Rantapaa Dahlqvist S. When searching in the author field via advanced scholar search, the advanced search operators author:Rantapaa Dahlqvist author:s are generated, and 41 hits are returned.

One of the 41 hits has both first-name initials, SB. The result also includes 7 hits on the author variant Dahlqvist SR. A search on Dahlqvist SR, by contrast, shows 28 hits on that variant. And here we find the article that was published in Lancet and was misspelled in Web of Science as Dahlwvist SR.

In contrast, a search on Rantapaa-Dahlqvist S without ää returns just 35 hits(!).

Searching for the misspelling Rantapaa-Dahlquist S returns no hits, while the same misspelling with ää, Rantapää-Dahlquist S, returns 2 hits, which are not included in the 41 hits from the Rantapaa Dahlqvist S search.

But searching Rantapää S returns 43 hits, including the misspelled rantapää-dahlquist.

But this is not all. Searching Rantapaa SB gives one more hit, not included in the 43 hits or the 41 hits mentioned before. Such a mess!

Let’s try some other authors. Searching author:sojka author:p also returns hits for P Jakubus, Z Sojka. This means that any record with some author whose first-name initial is P and some author whose surname is Sojka matches this syntax.

To get more refined matching, use quotation marks, like this: author:"p sojka". But the problem remains that this matches both PA and PE Sojka. To exclude Paul E. Sojka you could write author:"p sojka" -author:"pe sojka", but you still can't be sure that all remaining records are by PA Sojka. And you can't add PA Sojka as an OR statement; that didn't work for me. You need to do a separate search: author:"PA Sojka".
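
To make the difference between the two syntaxes concrete, here is a toy model of the matching behavior observed above. The functions and matching rules are my own simplification, not Google Scholar's actual implementation:

```python
def matches_unquoted(authors, initial, surname):
    """author:sojka author:p semantics: the initial and the surname may
    belong to *different* authors on the same record."""
    names = [a.lower().split() for a in authors]
    has_initial = any(parts[0].startswith(initial) for parts in names)
    has_surname = any(parts[-1] == surname for parts in names)
    return has_initial and has_surname

def matches_quoted(authors, phrase):
    """author:"p sojka" semantics (as observed above): initials and surname
    must belong to the same author, but "p" still matches pa, pe, ..."""
    initial, surname = phrase.split()
    return any(
        parts[0].startswith(initial) and parts[-1] == surname
        for parts in (a.lower().split() for a in authors)
    )

record = ["P Jakubus", "Z Sojka"]               # the false positive above
print(matches_unquoted(record, "p", "sojka"))   # True
print(matches_quoted(record, "p sojka"))        # False
print(matches_quoted(["PA Sojka"], "p sojka"))  # True: pa and pe both match
```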

And it’s not possible to limit a search by address. But, as you may already have learnt from previous postings, author address search in Web of Science and Scopus is too inconsistent to recommend as a method for refining your author search.
It is also worth noting that if you don’t restrict an author search to the author field (in advanced search or with the advanced search operators), you will get hits where the author name merely appears in the full text of the articles, parts of which Google Scholar indexes.

Conclusion: The same problems with author search as in Web of Science and Scopus exist in Google Scholar. And as I said in a previous posting, a proper publication list from the author is the best way to be sure of finding every important article.

It's hard to make an easy yet deep and thorough evaluation of subject coverage in Scopus, Web of Science and Google Scholar, for many reasons. Especially because the databases in question do not use established thesauri. Still, I made a small comparison between these multidisciplinary databases and PubMed.

I chose three MeSH terms (two of them with subheadings included), each consisting of three words. I limited my search to 1996 onward, mainly because Scopus's subject coverage before 1996 is selective. The MeSH terms were:

Hormone replacement therapy
Antifreeze proteins toxicity
Neonatal screening ethics

Result from PubMed searching MeSH database:

Result from Scopus searching field keywords:

Result when broadening the search to title, abstract and keywords:

Result from Web of Science when searching Topic in General Search, which includes title, abstract and keywords (author keywords and KeyWords Plus):

Results in Google Scholar are harder to evaluate, because Google Scholar indexes significant parts of the full text. It's possible to limit to a title search, but not to abstracts, for example. A lot of the material Google Scholar indexes is retrieved from the open web, and other material is journal article references (and full text) from publishers. Google Scholar has not integrated any thesauri for the article references, however. Instead it has 7 subject areas available for limiting in advanced search. As seen in this screenshot, one of the 7 subject areas is Medicine, Pharmacology and Veterinary Medicine. I limited the search to that subject area and the timespan 1996-.

2,310 hits is definitely more than the others, but as you see, while the second hit is clearly of high relevance, the others merely have the word ethics indexed somewhere in the full text, often as part of an ethics committee, and are not necessarily relevant.

Screen shot of search on antifreeze proteins toxicity shows 60 hits:

Not all of these hits are relevant and some are hits from books.

Screen shot of search on hormone replacement therapy shows 26,200 hits:

Conclusion: It's not advisable to use Web of Science, Scopus or Google Scholar for exhaustive, specific searches where all possibly important records of current science have to be found. This is because thesauri and controlled vocabulary are either not integrated at all or not integrated properly.

Broadening a subject search in Scopus from keywords to title, abstract and keywords gives higher recall, but not always relevant records. To broaden a search, both Scopus and Google Scholar are recommended, but not Web of Science, which indexes less material from 1996 onward.

Web of Science has no thesaurus integrated in its database. Instead, Eugene Garfield and ISI invented something in 1990 called KeyWords Plus, which Garfield explains here:

Garfield, E. "Journalology, KeyWords Plus, and Other Essays", Essays of an Information Scientist, Vol. 13, p. 295, 1990; Current Contents #32, pp. 3-7, August 6, 1990 [PDF]

Here's an explanation of how KeyWords plus works:

”KeyWords Plus supplies additional search terms extracted from the titles of articles cited by authors in their bibliographies and footnotes”.

"Records without references won't have KW+ – but more specifically, articles whose references are not linked to source items. In addition, it may be that those with very few linked references won't generate good candidates for KW+ either".
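
To illustrate the idea (this is not ISI's actual algorithm), here is a toy sketch that harvests recurring words from the titles of a record's cited references; the stopword list and sample titles are made up:

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "and", "a", "for", "on"}

def keywords_plus(cited_titles, top_n=3):
    """Toy version of the KeyWords Plus idea: collect words that recur
    across the titles of a record's cited references."""
    words = Counter(
        w for title in cited_titles
        for w in title.lower().split()
        if w not in STOPWORDS
    )
    # Keep only terms appearing in more than one cited title's words.
    return [w for w, n in words.most_common(top_n) if n > 1]

refs = [
    "Hormone replacement therapy and cardiovascular risk",
    "Risk factors in hormone replacement therapy",
    "Cardiovascular outcomes of estrogen treatment",
]
print(keywords_plus(refs))  # ['hormone', 'replacement', 'therapy']
```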

Here's an example of a record retrieved when searching for hormone replacement therapy. Look at the KeyWords Plus field and you find one hit on Hormone replacement therapy in bold.

To check whether this KeyWords Plus phrase is extracted from the reference list, click on Cited References and search for the phrase. In this screen shot you'll see it exists:

Besides KeyWords Plus, Web of Science also has author keywords.

Conclusion: I haven't evaluated KeyWords Plus yet, but since these are uncontrolled terms, it's impossible to make exhaustive, refined subject searches in Web of Science.

From the Elsevier databases, Scopus has integrated thesauri such as the GEOBASE Subject Index (geology, geography, earth and environmental science), EMTREE (life sciences, health), MeSH (life sciences, health), FLX terms and WTA terms (fluid sciences, textile sciences), the Regional Index (geology, geography, earth and environmental science), the Species Index (biology, life sciences), and the Ei thesaurus, with controlled and uncontrolled terms (engineering, technology, physical sciences). As you see, the last one includes uncontrolled terms. Scopus also integrates author keywords, which are uncontrolled keywords supplied by the author of the article.

When searching the Keywords field in Scopus, you won’t get just controlled vocabulary; you will also get uncontrolled vocabulary from Ei and author keywords.

This reference from PubMed:
Nicolau B, Marcenes W, Bartley M, Sheiham A.
Associations between socio-economic circumstances at two stages of life and
adolescents' oral health status.
J Public Health Dent. 2005 Winter;65(1):14-20.

It does not have the MeSH terms (from PubMed) or the EMTREE terms (from Embase) integrated in the Scopus record, just author keywords, as you see in this screen shot:

Another example of a Scopus record with no EMTREE terms. (No MeSH headings exist yet because it's a PubMed in-process record.)

And the same reference in Embase with EMTREE terms:

The following reference is from PubMed:

Anderson C.
Breast cancer. How not to publicize a misconduct finding.
Science. 1994 Mar 25;263(5154):1679.

See the MeSH terms in the screen shot. Terms with an * sign are major MeSH headings:

Not all major MeSH headings are included in the Scopus record:

The Ei thesaurus, sometimes called the Compendex thesaurus, is not properly implemented either. In this screen shot you find a record from Scopus with no Ei thesaurus terms attached:

And here's a screen shot from the database Compendex showing the same record with Ei thesaurus terms attached:

When testing and comparing Compendex with Scopus, quite a lot of records didn't have the Ei thesaurus integrated, but when it does exist on a Scopus record, the main heading and both controlled and uncontrolled terms are integrated properly.

Unfortunately, subject search in Scopus has a lot of disadvantages:

  1. It’s impossible to browse the keywords and thesauri integrated in Scopus.
  2. The thesauri are inconsistently integrated: sometimes no MeSH terms, sometimes no EMTREE, sometimes not all major MeSH headings.
  3. Uncontrolled terms are mixed with controlled ones and not possible to separate when refining a search.
  4. You can’t choose which thesauri to use.
  5. No mapping of terms as in Embase and PubMed.
  6. No possibility to explode terms.
  7. No integration of MeSH subheadings.

Conclusion: This means Scopus is unusable for refined and comprehensive subject searches. You have to use PubMed to properly use MeSH terms, Embase to properly use EMTREE and Compendex to properly use Ei terms. Of course, Scopus is not built to substitute for the Elsevier databases. That’s why I don’t think Scopus will ever implement the thesauri of the Elsevier databases properly. But it is very strange that MeSH subject search is not properly implemented when PubMed is a free source.

When searching for an author in Cited Ref Search, my recommendation is to use the Cited Author Index, which makes it possible to browse authors. Searching Rantapaa in the Cited Author Index gives 7 author variants, as seen in the screenshot:

As we learnt from the Rantapaa Dahlqvist search test in Scopus in the previous posting, we also have to try the author variant Dahlqvist SR. Two additional author variants are found: Dahlqvist SR and Dahlqvist SRR.

In contrast, browsing Author Index via General Search gives 8 author variants when searching Rantapaa.

And browsing Dahlqvist S gives three author variants: Dahlqvist SR, Dahlqvist SBR and Dahlqvist SRA.

Searching all these author variants in either Cited Ref Search or General Search doesn't retrieve the article published by Rantapaa Dahlqvist S in Lancet. Why? Searching for the article title "New concept in echocardiography" gives the explanation. As seen in this screen shot, the author is misspelled Dahlwvist SR.

And as we know from the same author test in Scopus, both PubMed and the original electronic article in Science Direct have implemented the author correctly as Rantapaa Dahlqvist S(olbritt).

These are all author variants found in Web of Science:

From Citations:

Rantapaa S
Rantapaa SB
Rantapaadahlqvi S
Rantapaadahlqvi SR
Rantapaadahlqvist S
Dahlqvist SR
Dahlqvist SRR

From articles:

Rantapaa Dahlquist S
Rantapaa Dahlquist SB
Rantapaa Dahlqvist S
Rantapaa Dahlqvist SSRD
Rantapaa Dahlqvst S
RantapaaDahlqvist S
RantapaaDahlqvist SB
RantapeDahlquist S
Dahlqvist SR
Dahlqvist SRA
Dahlqvist SBR
Dahlwvist SR

Conclusion: Although this is an extreme example, I suggest using the author index to check for possible misspellings. Having a proper publication list is recommended.

In Scopus there is a separate tab for author search. You can also choose to limit to Author in Basic search, or use search operators in Advanced search. Searching rantapaa s or Rantapää s with the Swedish/Finnish diacritic ä returns 23 hits, including the article indexed with the full first name Solbritt.

But if you choose to search Rantapaa-Dahlqvist S or Rantapaa Dahlqvist, you lose the hit with the first name Solbritt, because that record is spelled Dahlquist.

But searching Rantapaa Dahlquist S doesn’t find all dahlquist-spelled records. Neither does a search for Rantapaa Dahlquist SB, with dots (S.B.) or with a space (S B).

When you instead search Dahlquist s, you will find the dahlquist-spelled records.

So when I click on the 20 hits from my initial search Rantapaa Dahlqvist S, I get 69 (!) records instead of 20. How come?

Even stranger: when clicking the author name in one of the records, 71 records are retrieved. But the number of records retrieved in the initial search was 72. Could it be more confusing? View this video [AVI]

Still, none of the 72 records mentioned includes the article Rantapää Dahlqvist S published in Lancet in 1998. Searching for that article by title shows that her name is implemented as Dahlqvist SR. Clicking on her name under Author(s) gives 28 hits. 4 of them are hits on other spelling variants of her name, but not all 72 articles from the initial search.

If you check the article “New concept in echocardiography” in PubMed, you will see it’s implemented with Rantapaa Dahlqvist S. Checking the article in the electronic source of Lancet via Science Direct (yes, it’s owned by Elsevier, which owns Scopus!) shows the name implemented as Rantapaa Dahlqvist, Solbritt.

So how do you find all articles by Rantapaa Dahlqvist S and her author variants? The initial search Rantapaa S gives 72 records. Searching Dahlquist S gives 2 records. Searching Dahlqvist SR gives 24 records. Together: 98.

Conclusion: Though this author search is an extreme example, it shows the problem with the implementation of author names. You have to search all possible misspellings of a name. Preferably, you should have a proper publication list from the author. And finally: WHY is the Lancet reference discussed above consistent in one Elsevier product (Science Direct) but not in the other (Scopus)?
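
The 98-record total assumes the three result sets don't overlap. A small sketch with placeholder record IDs makes that explicit; taking the union of sets would also catch any record indexed under more than one variant:

```python
# Placeholder record IDs per variant search; only the counts (72, 2, 24)
# come from the searches above, and they sum to 98 only because the
# three sets happen to be disjoint.
results = {
    "Rantapaa S":   {f"r{i}" for i in range(72)},
    "Dahlquist S":  {"d1", "d2"},
    "Dahlqvist SR": {f"s{i}" for i in range(24)},
}

all_records = set().union(*results.values())
print(len(all_records))  # 98 when the result sets are disjoint
```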

When searching with an author limited to an address, you sometimes lose articles because of the inconsistent implementation of addresses, as we have already shown in a previous posting. We tested Web of Science and Scopus. Google Scholar has no option for limiting by address.

By searching, for example, au=haglin l NOT ad=umea in Advanced Search in Web of Science, and haglin l (author field) AND NOT umea (affiliations field) in Scopus, the articles lacking correct addresses were easily found. Then we counted the citations of these articles that would be lost when limiting by address.

Note: The example is based on searching in author in General Search, not Cited ref search.
