Monday, 22 April 2019

Paper Title: The IEEE Conference Proceedings Template Text

It all started with IEC 61499 function blocks - a way of modelling industrial systems using pictorial representations in a standardised, and therefore programmable, way. It is used widely around the world, and a lot of research effort has gone into enhancing its capabilities and making it more usable in real-world applications. The paper "Remote Web-Based Execution of IEC 61499 Function Blocks" (ID:7090220), published in the 6th Electronics, Computers, and Artificial Intelligence (ECAI) Conference held in 2014, described a prototype that integrated IEC 61499 with web technologies in a safe and secure way. The introduction of the paper suggests that this might allow for computationally expensive tasks like iterative optimisation or image processing to be executed on the cloud, with results used to control specific function blocks. The introduction also suggests that "this template, modified in MS Word 2003 and saved as 'Word 97-2003 & 6.0/95 - RTF' for the PC, provides authors with most of the formatting specifications needed for preparing electronic versions of their papers."

If the last point seems incongruous with the highly technical subject matter of the paper, that is because it comes from the IEEE Conference Proceedings Template. The authors of that paper used the template to start the writing of their paper, and while they deleted most of the original template text, a large chunk of text was simply forgotten and submitted. In total, 147 words from the introduction section of the IEEE template remain in this IEEE Xplore-published paper. The rest of the paper seemed original and interesting, yet this passage of text clearly should not have been in the final paper. How did a paper with such a large block of text from the IEEE template make it past peer-review, plagiarism checks, and eXpress PDF checks to become indexed and published on IEEE Xplore?


This started a journey into uncovering just how widespread this issue was. Thousands of IEEE Xplore-published papers were discovered that contain at least some text that matches the IEEE conference template. I thought it might be worth documenting this process and describing our journey. This blog article covers how these papers were found, briefly describes how IEEE was informed about the issue and how they responded, and offers some opinions on the systematic failures that have allowed these errors to go unnoticed.

In most cases, I believe that the presence of template text in a paper was just a genuine mistake on the part of the authors. In many of the papers that I read, there is legitimate scientific work being reported that is of value to the academic community, and there may only be a few sentences of template text. It is not my intention to offend or embarrass any of these authors. Therefore, rather than referring to papers by their full title or authors, I mostly refer to them by their IEEE Xplore ID numbers. Readers interested in tracking these papers down can search for the ID number in IEEE Xplore to retrieve publication details.

Data Collection
The methodology was pretty simple - I used Google Scholar to search for papers that match some part of the IEEE conference template text. This was because Google Scholar's exact quote search seemed to be more accurate than the IEEE Xplore search. Each search used quote marks in order to get exact matches only. Google Scholar's search results were restricted to only those matching site:ieeexplore.ieee.org. Google Scholar has an undocumented limit to the length of each search query. Empirically, this appears to be 256 characters, so after taking the 24-character site filter into account, the rest of each query can be a maximum of 232 characters. After each search, a random sample of the papers was checked to make sure that the search was accurate and that the queried text was in fact in the paper (the examples in the Table below have been manually verified). Unfortunately, since Google Scholar does not offer an API, and scraping the website is against the Terms of Service, all of the data collection was done manually. The Table below shows some of the queries that were run, and gives an indication of the scale of this problem. While this cannot be interpreted as an exact number of papers that contain template text, it is hoped that this analysis gives a sense of the scale of the problem; it is not limited to a handful of papers. Hundreds, if not thousands, of papers have some template text in them. This search was done in June 2017, so the numbers have increased since then (by an estimated 5-10%).


There are two important caveats that prevent us from simply adding up the number of papers to find the total number of papers that match template text. Firstly, it is probable that there are some papers that appear in more than one search. In manual checks, most papers appeared in only one of these searches, meaning that the amount of template text was relatively small in most papers (usually from one of the sections or just one of the sentences), but without further analysis no strong claim can be made here. Secondly, Google Scholar's search may not be perfect, and it is possible that papers may be listed more than once in the search results or listed when there actually is no match. In some cases, the authors have hidden the text so that it is not readable to humans (e.g. making the text white or placing a figure over the text), but is still searchable by computers, leading to an erroneous listing in the search results.

Importantly, there are also reasons to suspect that these numbers may undercount the actual number of papers matching template text. Firstly, these searches only parse papers where the PDFs are text-searchable. A small proportion of conferences have uploaded their papers with scanned PDF files that are essentially images without searchable text, which may not appear in these results. This is more likely to happen for older conferences. Secondly, even slight changes such as an additional word or an extra space could result in the paper not being included in the search results, because exact quotes were sought. It should be noted that although we are primarily interested in papers published on IEEE Xplore, the IEEE conference template is widely used around the world for other publishers as well, and there are large numbers of papers published outside of the IEEE that also contain text from this template, that were excluded because of the site filter used in the search query.
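To make the query-length constraint from the data collection concrete, here is a minimal sketch of how a passage of template text could be split into exact-quote queries that stay within the limit. It only builds the query strings; it does not contact Google Scholar, and the searches themselves were run by hand. The function name build_queries is hypothetical, and the sketch assumes that the quote marks and the space before the site filter also count towards the 256-character limit, which I have not confirmed.

```python
SITE_FILTER = "site:ieeexplore.ieee.org"
MAX_QUERY_CHARS = 256  # empirical limit described above, not a documented one


def build_queries(template_text, max_chars=MAX_QUERY_CHARS, site_filter=SITE_FILTER):
    """Split template text into exact-quote queries that fit the length limit."""
    # Characters left once the site filter is accounted for (232 by default).
    budget = max_chars - len(site_filter)
    # Assumption: two quote marks and one separating space also use up the budget.
    phrase_budget = budget - 3

    queries = []
    current = []  # words collected for the phrase being built
    for word in template_text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > phrase_budget:
            # Adding this word would overflow, so close off the current phrase.
            phrase = " ".join(current)
            queries.append(f'"{phrase}" {site_filter}')
            current = [word]
        else:
            current.append(word)
    if current:
        phrase = " ".join(current)
        queries.append(f'"{phrase}" {site_filter}')
    return queries
```

Because the phrases are matched exactly, each query produced this way is still sensitive to the slight wording and spacing differences mentioned above, so a paper whose authors changed a single word inside a chunk would not be found by that chunk.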

(Some Other) Analysis
The IEEE conference template file also includes seven references. A number of papers have failed to remove these references or re-used some of them, significantly increasing the number of citations that the referenced works receive. We can easily assess the magnitude of this issue, because two of the references in the template are not for real publications. K. Elissa's work, "Title of paper if known" (unpublished), and R. Nicole's work, "Title of paper with only first word capitalized" (published in the J. Name Stand. Abbrev.), have been cited over a thousand times by IEEE Xplore papers according to Google Scholar (1440 and 1110 respectively). There are some issues with this result, as Google Scholar's citation tracking is not perfect, but I have found IEEE Xplore papers that cite these papers directly, such as ID:5166784 and ID:5012315. Some papers only appear if the reference text is searched directly, as sometimes these placeholder references appear appended to the end of legitimate references, such as ID:6964641. Meanwhile, the other five real references have received an artificial boost to their citation counts - James Clerk Maxwell's "A treatise on electricity and magnetism" had plenty of citations anyway, but a non-negligible number of these citations, such as ID:6983343, are not genuine.

As far as I can tell, the current IEEE conference template was created around 2002/2003, based on the IEEEtran LaTeX class made by Michael Shell. It therefore makes sense that the earliest paper that was found with template text was from a conference in 2004 (ID:1376936), although at this point in time most papers were still scanned into IEEE Xplore and not text-searchable.

The most egregious case, ID:6263645, was literally just the IEEE conference template in full with the title changed. Even the authors section of the paper was from the template. How was this paper accepted and published? The conference website seems to suggest that only the abstracts were peer-reviewed, with full submission of the papers after notification of acceptance to authors. The conference website includes the text "Failure to present the paper at the conference will result in withdrawal of the paper from the final published Proceedings," which implies that a presentation was made since the paper was published to IEEE Xplore. But perhaps, no one checked the uploaded paper itself after the conference.

After this paper was reported to the IEEE, it was removed several months later "in accordance with IEEE policy", although evidence of the original paper is still available through secondary sources such as ResearchGate and Semantic Scholar, which carry the original abstract. In fact, the website DocSlide contains a copy of the full text of the paper. It is important to note that this paper appeared in the conference schedule and proceedings table of contents, alongside legitimate papers in a legitimate conference. As stated earlier, my intention in investigating and reporting template text in conference papers is not to punish or embarrass the authors who have made these errors, as I believe that in most cases these errors were made unintentionally, and there is still scientific merit in the papers that outweighs the impact of these errors. I am not advocating for papers containing template text to be removed from IEEE Xplore. However, in cases like ID:6263645, where the whole paper is nothing but template text, the paper is so flagrantly against the spirit of academic publication that there is little choice but to remove it.

The IEEE Response
Members of our research group first notified IEEE about this in July 2017. After much searching about the correct process for reporting this type of issue, we tried to contact the IEEE Publication Services and Products Board (PSPB) Publishing Conduct Committee. However, no contact details were to be found anywhere, so we e-mailed the Managing Director of IEEE Publications. Eventually our report made its way to the Meetings, Conferences and Events (MCE) team, where the matter was placed under investigation and it began a slow internal process. Every couple of months we would e-mail for an update, and be told that the investigation was ongoing and we would be notified when it was concluded, but that they would be unable to report on each individual instance. IEEE assured us that "IEEE has been fully assessing the situation regarding this circumstance, and putting the appropriate time and resources into investigating this issue thoroughly." To my knowledge, ID:6263645 was the only paper that was removed since it contained no original content other than the title (and I am not advocating for papers that only have a few sentences of template text to be removed).

Since our original report, in May 2018 the following text was added to the IEEE conference template page (partly in bold) and in the actual template files at the end (in red):


IEEE conference templates contain guidance text for composing and formatting conference papers. Please ensure that all template text is removed from your conference paper prior to submission to the conference. Failure to remove template text from your paper may result in your paper not being published.

This is slowly being reflected in copies of the template as it propagates throughout the world for new conferences. Will this action by the IEEE resolve the problem?

In the subsequent year or so (to April 2019), Google Scholar suggests that there are 18 papers published on IEEE Xplore that contain the above warning text. A manual check over these papers reveals that authors have changed the text colour of the warning to white for most of these papers (which makes it invisible to humans, but not to computers), leaving four papers that contain the new template text. This includes ID:8580104, which appears to be a new paper from a 2018 conference that is just the new template published in its entirety (which we have just informed the IEEE about). Maybe the new warning in the template has helped reduce the rate of incidence, but cases are still slipping through.

Systematic Failures?
The IEEE claims to publish "more than 1,500 leading-edge conference proceedings every year". While the standards of IEEE are high, it is understandable that with so many papers being published every year, some papers will inevitably slip through the cracks of quality control. It could even be argued that a couple of papers out of the hundreds of thousands published by IEEE each year is relatively insignificant. However, we should still seek to understand why so many papers containing template text, something which should be easily avoidable, have been published in the IEEE Xplore database.

Similarity Checks
The IEEE requires that all papers submitted for publication be checked for plagiarism. It is important to note here that the inclusion of template text in a paper is not generally intentional plagiarism. However, the method for automatically detecting template text, similarity analysis, is more commonly used for identifying plagiarism. In the case of conferences, all organisers are expected to screen their papers for plagiarism. Any papers that are not screened during manuscript submission are checked by the Intellectual Property Rights (IPR) Office before the papers are published on IEEE Xplore. The point to emphasise here is that it is claimed by IEEE that at some point, every paper passes through a standard plagiarism check before publication.

The IEEE has its own portal, CrossCheck, which program chairs and other conference proceeding organisers can use to check for plagiarism. It is essentially an IEEE-branded front-end, with iThenticate running as the back-end engine. iThenticate is arguably the world's leading plagiarism checking service; it is made by Turnitin, and is also used by CrossRef, many universities, and others. The strength of CrossCheck in particular is that all participating organisations agree to provide full-text versions of their content, so that they can build up a large corpus of work and increase the probability of catching plagiarised text. It stands to reason that a plagiarism checking service as powerful as this should be able to detect text from the IEEE conference template and alert reviewers/editors/organisers.

However, anecdotally, I have heard that for many conferences the rule of thumb is that a paper should have an overall similarity score of less than 30%, and a similarity score with any single source of less than 7%. If the similarity scores exceed these thresholds, in most cases authors are given an opportunity to edit and reduce their similarity scores, or the paper is rejected. In paper management systems like EDAS, an alert is only generated if the similarity score exceeds a threshold; otherwise it is normally assumed that the paper doesn't have significant plagiarism and can be reviewed.

The template text problem shows an issue with this percentage-based approach - one or two sentences can easily fall below these thresholds and avoid automatic detection. In a 6-8 page conference paper, even an entire paragraph of template text may only constitute 1-2% of the overall paper. If the IEEE template appears towards the bottom of the similarity report, then it is likely to be missed by publication volunteers and staff, if the similarity report is checked at all.
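The arithmetic behind this can be sketched in a few lines. The 147-word figure is the block of template text found in ID:7090220, and the 30% and 7% limits are the anecdotal rule of thumb mentioned above. The 6,000-word paper length is purely my assumption for a 6-8 page paper, and the function names are hypothetical; real similarity tools compute their scores in their own ways.

```python
def similarity_percent(matched_words, total_words):
    """Share of a paper (in percent) that matches a single source."""
    return 100.0 * matched_words / total_words


def triggers_alert(overall_pct, max_single_source_pct,
                   overall_limit=30.0, single_limit=7.0):
    """Apply the anecdotal rule of thumb: alert at 30% overall or 7% from one source."""
    return overall_pct >= overall_limit or max_single_source_pct >= single_limit


# Assumption: the paper is about 6,000 words long and the template is its only match.
template_pct = similarity_percent(147, 6000)
alert = triggers_alert(template_pct, template_pct)
print(f"template share: {template_pct:.2f}%, alert raised: {alert}")
```

Under these assumptions the template text accounts for well under the single-source limit, so a system that only raises an alert above the thresholds would stay silent.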

Perhaps we should recognise that not all sentences are equal, and that some matching sentences are more problematic than others. One possible solution is to develop similarity checks that use two corpora: one corpus that contains the current collection of internet and otherwise published sources, and another corpus that contains privileged text that should never appear in texts passed through the similarity check. If there is any sentence in the paper that exactly matches one in the second corpus, then that should produce an alert at the top of the similarity report. Passages to include in this second corpus could be template text from different publishers, lorem ipsum, and other sources that contain text that should never (or very rarely) appear in a published paper. A human reviewer is still required to interpret the results of these similarity reports to ensure that false positives do not hinder or prevent the publication of good papers.
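As a rough illustration of the second corpus idea, the sketch below flags any sentence in a paper that matches a sentence from a small set of privileged texts. This is not how iThenticate or any existing tool works; the function names are hypothetical, the sentence splitter is deliberately naive, and I have assumed that ignoring letter case and extra whitespace is an acceptable relaxation of "exact" matching.

```python
import re


def normalise(sentence):
    """Lower-case and collapse whitespace so trivial spacing differences still match."""
    return " ".join(sentence.lower().split())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def privileged_matches(paper_text, privileged_texts):
    """Return the sentences of the paper that also appear in the privileged corpus."""
    privileged = {
        normalise(sentence)
        for text in privileged_texts
        for sentence in split_sentences(text)
    }
    return [s for s in split_sentences(paper_text) if normalise(s) in privileged]


# Privileged corpus: here just the warning text from the IEEE template page.
TEMPLATE_WARNING = (
    "IEEE conference templates contain guidance text for composing and "
    "formatting conference papers. Please ensure that all template text is "
    "removed from your conference paper prior to submission to the conference. "
    "Failure to remove template text from your paper may result in your paper "
    "not being published."
)
```

Any non-empty result from privileged_matches would be the trigger for the alert at the top of the similarity report, regardless of how small a percentage of the paper it represents. A human would still need to look at each hit, for example to rule out a paper that quotes template text on purpose.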

Peer Review
Conference peer-review is generally of lower quality than journal peer-review. There are, of course, exceptions in terms of the highest level conferences and the lowest quality journals, but overall, review expectations are lower for conference publications. The shorter review periods and lower standards disincentivise reviewers from spending too much time conducting their reviews. Anecdotally, recruiting reviewers for conferences has become increasingly difficult as the number of publication opportunities grows.

One of the problems with the presence of template text is that there should be no cases where including the template text makes any logical sense in the context of the paper (unless it was a paper about the template text like this one could have been). If a reviewer has read the paper, then this error should be obvious. So why has peer-review failed to detect the template text?

First of all, it appears that some of the papers that are published in IEEE Xplore have not actually been peer-reviewed. In some cases, only conference abstracts are peer-reviewed, and once accepted, the subsequent paper is not reviewed at all. In these cases, the fault does not lie with the reviewers, but demonstrates that this model of publication is flawed and easily exploitable.

Where reviewers do spot template text, there is generally limited opportunity for them to inform authors. There may be a field in the paper review system to enter some comments. If the reviewer is motivated enough, then they might indicate to the authors exactly where the template text in their paper is. But in my experience, conference paper reviewers tend to provide higher-level feedback, looking at the contribution and novelty of the paper, rather than specific grammar or spelling errors. After all, these should be caught during proof-reading.

Even if the reviewer has provided the feedback to the authors that template text is in the paper, there is generally no opportunity for anyone in the process to make sure that the template text has been removed. For many conferences, there is only one round of review, and therefore reviewers do not see the papers again after camera-ready submission. Program Chairs and Publication Chairs cannot be expected to read and check every single paper. So if a paper is accepted but the authors have ignored the feedback provided by the reviewers, then chances are, it will go straight through to publication and appear in IEEE Xplore.

However, following Occam's Razor, the obvious answer here is that not all reviewers are fully reading their assigned papers. It is easy for template text to slip past the review process if no one actually reads the template text in the paper. This is perhaps an uncomfortable truth, and cannot be easily proven (or disproven).

The issues that are discussed here are symptomatic of a wider challenge in scientific peer-review. The issue of predatory open-access journals that publish papers without sufficient (or any) peer-review has been well publicised. One has to wonder if similar issues have affected conferences as well. Solving the issues of peer-review is well above my pay grade, and there is a wide range of literature on the subject across many academic disciplines.

The sheer scale of this problem indicates another major issue - the general apathy of the academic community towards this behaviour. Many of these papers have hundreds of reads, and some have even been cited. Apparently we were the first to report these issues to the IEEE. Does this mean that this isn't really a significant problem, and that no one really cares? The impact is probably relatively small: most readers access the paper for the meaningful scientific content, and are probably smart enough to ignore the template text, right? One could have said the same about the authors who published these papers in the first place.

Conclusions
So, maybe the impact of this template text being in published papers is negligible beyond it being a source of some amusement and entertainment. But at the same time, it can be seen as a symptom of the wider issues that face academia. There are more, and more, and more papers being published every year, and peer-review is falling apart. Automated tools that are meant to help detect misconduct are woefully insufficient. The current models of publishing research articles are exploitable. And there is always the uncomfortable question lingering in the background - how much "high quality" research output is genuinely high quality? Meanwhile, no one really has the time to figure out how to fix these issues while under the pressures of Publish or Perish.

I repeat here that this article is not accusing anyone of any intentional plagiarism or misconduct - everyone makes mistakes sometimes, and that's okay. However, a high-quality repository of academic content should have systems in place to catch mistakes and help rectify them. Over time, the problem has grown too large for the IEEE to retrospectively rectify, and realistically that's probably okay. But does this reflect the academic literature that we want to build and share, or is it just the academic literature that we deserve?

Acknowledgements
The initial instance of template text found was reported by Hammond Pearce, who then brought it to the attention of our research group, which kicked off this whole prosaic journey. This article is informed by discussions between members of the Embedded Systems Research Group, part of the Department of Electrical, Computer, and Software Engineering at the University of Auckland, New Zealand. The IEEE conference template says that "The preferred spelling of the word 'acknowledgment' in America is without an 'e' after the 'g'", but this article isn't being written in America, and the author prefers the 'e' to be in there.
