Challenges in Authorship Attribution: The “Name Problem”
I’ve encountered many challenges during the development of the Iquant Engine, but none has been more frustrating than what I’ve dubbed the “Name Problem”. In programming it is often the simplest of errors that are the most difficult to figure out and a misplaced comma has devoured my entire day more than once. However, I wasn’t expecting the same to be true on the data analysis side as well.
Simply put – how do I ensure that the bibliography of each author is both fully correct and exhaustive? Relatedly, how do I differentiate authors with often very similar names?
Early on in development I foresaw aspects of this problem and included checks to reconcile authors who appear under multiple or similar names in their publication lists – for example, combining the records of an author who publishes some papers with a middle initial and some without. These initial rules worked for relatively small sets of publications, in which there were fewer authors and consequently fewer possible matches. Unfortunately, I started encountering more errors when confronting larger topic areas, as a surprising number of examples defeated even my best algorithms. Undaunted, I set to work developing more algorithms – confident that I could build a set of functions that would permanently solve the problem for me.
I can only plead ignorance to excuse my hubris.
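To make the middle-initial rule concrete, here is a minimal sketch of that kind of check. It is not the Iquant Engine's actual code: the function names and the assumed "Last, First M." name format are hypothetical and chosen only for illustration.

```python
def parse_name(name):
    """Split a 'Last, First M.' string into lowercase (last, first, middle initial).

    Assumes the 'Last, First M.' layout; real records are far messier.
    """
    last, _, given = name.partition(",")
    tokens = given.replace(".", " ").split()
    first = tokens[0].lower() if tokens else ""
    middle = tokens[1][0].lower() if len(tokens) > 1 else ""
    return last.strip().lower(), first, middle


def same_author_candidate(name_a, name_b):
    """Return True if two name strings could plausibly be the same author.

    Last and first names must match exactly; middle initials must either
    agree or be missing from at least one of the two names.
    """
    last_a, first_a, mid_a = parse_name(name_a)
    last_b, first_b, mid_b = parse_name(name_b)
    if (last_a, first_a) != (last_b, first_b):
        return False
    return not mid_a or not mid_b or mid_a == mid_b
```

Under this rule "Smith, John" is compatible with both "Smith, John A." and "Smith, John B.", even though those two are not compatible with each other. That ambiguity is rare in a small publication set and common in a large one, which is one way a rule like this breaks down at scale.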
Affiliation comparison and co-author analysis partially mitigated the problem, and techniques like algorithmic string comparison helped further. Yet no matter what I did, I always found exceptions to my carefully constructed and increasingly complicated system. After many months of trying new approaches, I had to accept that perfection was not achievable. As frustrating as it was, some level of error had to be accepted in the author accounting process.
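As an illustration of how these three signals might be combined, the sketch below scores a pair of author records using string similarity on the name, overlap between co-author sets, and an exact affiliation match. The record layout, the weights, and the function names are all assumptions made for this example, not a description of the actual system.

```python
from difflib import SequenceMatcher


def name_similarity(name_a, name_b):
    """Case-insensitive string similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def jaccard(items_a, items_b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def match_score(rec_a, rec_b, weights=(0.5, 0.3, 0.2)):
    """Weighted score that two records refer to the same author.

    Each record is a dict with hypothetical keys 'name', 'coauthors'
    and 'affiliation'. The weights are arbitrary placeholders.
    """
    w_name, w_coauthors, w_affiliation = weights
    name = name_similarity(rec_a["name"], rec_b["name"])
    coauthors = jaccard(rec_a["coauthors"], rec_b["coauthors"])
    same_affiliation = rec_a["affiliation"].lower() == rec_b["affiliation"].lower()
    affiliation = 1.0 if same_affiliation else 0.0
    return w_name * name + w_coauthors * coauthors + w_affiliation * affiliation
```

Any threshold applied to a score like this trades false merges against missed merges. Authors change institutions and collaborators, and two different people with the same name can share a department, so no choice of weights removes the exceptions.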
I eventually learned that I was far from the first person to encounter this problem. In fact, it appears to be widespread, and it has no clear solution even given significantly more resources than I was able to devote. In 2010 the National Library of Medicine announced a project to assign authors a unique ID, but four years later the project was scrapped (PubMed press release) and it was announced that PubMed would instead rely on third parties like ORCID.
Third-party solutions will never be the answer simply due to lack of uptake. The Name Problem will persist unless a universal system with full adoption is instituted. Until then, if you are publishing it is critical to put careful thought into how your name is displayed on papers. Unfortunately, differing conventions in how journals display author names and what information is accessible to databases like PubMed will trip up even the most diligent of scientists, especially if you have a relatively common last name.
The Name Problem will persist unless a universal system with full adoption is instituted. Until then, if you are publishing it is critical to put careful thought into how your name is displayed on papers.
– Ben Verdoorn, Senior Data Scientist
The Name Problem has real consequences beyond my data-nerd frustration. Notably, it fosters inequality in our publication-based merit systems, at least partly because of culture-specific naming conventions. Without a unique author identifier, many scientists will not be properly credited or correctly sought out when their expertise is required. For now, we will account for the Name Problem in our analysis and accept the error it injects into our data. We will not be rid of it without concerted universal effort, which is likely out of reach for our shambolic scientific publishing landscape. Therefore, I must adjust to this reality despite the errors it propagates in my data and the sleepless nights spent worrying away at different possible solutions – like an ever-present popcorn kernel I just can’t get out of my tooth.