Saturday, August 20, 2011

The H index: how many for how long

Figure: Relationships between h-index and number of papers, log-transformed number of citations for the highest-cited publication, and number of years since first publication, for 38 plant/ecosystem ecologists.
I looked at the h-indices a little bit more today. In short, I added a number of people to see if I could bust any of the relationships. Again, these are people from the US at different career stages in a field similar to mine.

I knew that I could find people who were coauthors on highly cited papers but didn't have a high h-index. That wasn't too hard. Being a coauthor on a highly cited paper isn't as diagnostic as the number of publications and the number of years cited.

The one relationship that is hard to find outliers for is the number of publications. I couldn't think of anyone who had published a lot of papers and had a low h-index. Tilman is really the only outlier for this. Based on the number of papers he has published, his h-index should be 56, not 86.

Outside of Tilman, the number of publications and how long someone has been publishing together explain 90% of the variation in h-index. National Academy members (red dots) aren't necessarily higher or lower than non-academy members (P = 0.2). You can find individuals 10 points higher than you'd expect, which is diagnostic of something, but looking at the individuals who are 10 points too low, I don't think one would denigrate their stature because their h-index was 65, not 75. Still, there might be something to the residuals.

The final equation I get is: h-index = 3.8 + 0.17 × (number of publications) + 0.54 × (years publishing), with r² = 0.90.

For what it's worth, my h-index is spot on. I've authored or co-authored 57 papers published in 14 years. That predicts an h-index of 21. Mine is 22.
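For anyone who wants to plug in their own numbers, the fitted equation can be written as a small function. This is only a sketch: the function name `predicted_h_index` and its parameter names are made up for illustration, and the coefficients are the rounded ones quoted above, so the result is probably only good to about a point.

```python
def predicted_h_index(n_pubs, years_publishing):
    """Predicted h-index from the fitted line in this post.

    Uses the rounded coefficients quoted in the text (3.8, 0.17, 0.54);
    the function and argument names are hypothetical.
    """
    return 3.8 + 0.17 * n_pubs + 0.54 * years_publishing


# 57 papers over 14 years, the example from the text
print(round(predicted_h_index(57, 14)))  # 21
```

The same function gives roughly 45 for 160 papers in 25 years or 150 papers in 30 years.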

One thing that is interesting here is that the h-index, at least in my discipline and for almost everyone, really doesn't provide much more information than knowing how many papers someone has published and how long they've been publishing.

Another thing is quantifying what it takes to get to an h-index of 45. Just 160 publications in 25 years is all. Or 150 publications in 30 years, if you can wait a bit longer.

For me, that would be 10 papers a year for the next 11 years.
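The "10 papers a year" figure comes from turning the fitted equation around and solving for the number of publications. A rough sketch, assuming the rounded coefficients from the equation above (3.8, 0.17, 0.54); the constant and function names here are hypothetical:

```python
# Rounded coefficients from the fitted equation in the post
INTERCEPT = 3.8
PER_PUB = 0.17
PER_YEAR = 0.54


def pubs_needed(target_h, years_publishing):
    """Total publications the fitted line implies for a target h-index."""
    return (target_h - INTERCEPT - PER_YEAR * years_publishing) / PER_PUB


current_pubs = 57   # papers so far
current_years = 14  # years publishing so far
extra_years = 11    # years until the 25-year mark

total = pubs_needed(45, current_years + extra_years)  # about 163 papers
per_year = (total - current_pubs) / extra_years       # a bit under 10 a year

print(round(total), round(per_year, 1))
```

With these rounded coefficients the total comes out near 163 papers, which is where "160 publications in 25 years" and "10 papers a year" come from.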

For most people, the h-index might not provide more information than how many for how long, but what an h-index of 45 represents is pretty impressive.


  1. I wonder if your results were skewed because you chose the sample? Maybe you've just chosen people who are good scientists - and for good scientists it is time & number of papers that are important? The less good scientists by definition have a lower profile so would be less likely to be chosen by you.

    I suspect that there are lots of people that have been around for a long time but have a low H-index (I can think of some) and others who have published lots but also have a low H-index (I can think of some of those too). I strongly suspect that your clear results are partially an artefact of your sampling strategy.

  2. The h-index was put into use to provide more information than just the number of publications. In most cases, it really doesn't. I'm sure we could find someone who published a paper 40 years ago and doesn't have an h-index of 20, but we're not really trying to compare their productivity. Publishing 100 papers over 20 years but only having an h-index of 10 is really hard to do.

  3. I agree that time and number of publications are important and clearly have to be accounted for before comparing people. I still reckon you've delineated the upper limit of the relationship by choosing 'good' researchers. This might actually make it a good way of comparing people: if they fit on that line, then they are among the 'elite'. You're probably right about the lack of high-publication, low-h-index people; they would have given up before they got there. It would be interesting to see the whole picture. The h-index is a fascinating little metric. We don't really use it over here in the UK, but people are starting to get interested in it.

  4. It looks like all the people who haven't been publishing long (say, fewer than ~15 years) or don't have many publications (fewer than ~30) have a negative residual. It seems the model doesn't work well for these scientists. I think this supports what Jon is saying. All the scientists above this threshold are very good and productive, but even Joe can't necessarily pick out the scientists who are going to make a big impact when they have just 30 publications.