Image: Bryn Nelson
By Columnist
msnbc.com
updated 3/3/2008 9:21:54 AM ET

Presidential hopeful Hillary Rodham Clinton called it “change you can Xerox.” Barack Obama, her opponent for the Democratic nomination, said it was much ado about nothing (although not in those exact words).

But the resulting brouhaha over whether Obama really committed plagiarism when he borrowed a passage from Massachusetts Gov. Deval Patrick for a recent speech is overshadowing a larger question that has plagued scientists for years. Without YouTube or blogs or talking heads to guide them, how can researchers uncover the flagrant copycat studies that have infiltrated the scientific world?

A Dallas research group may be providing some new answers by setting its computer-assisted sights on questionable cut-and-paste documents published by fellow academics, revealing startling examples of unethical behavior in the process.

The group’s freely available online search engine has identified thousands of potential instances of plagiarism by highlighting significant similarities between blocks of text within a vast database of medical research. Subsequent sleuthing by the team of curators at the University of Texas Southwestern Medical Center has led to the retraction of three published studies and the launch of seven further investigations among the 20 cases pursued so far.

Beyond the evidence of blatant pirating, the results have hinted at broader reasons for worry. By plagiarizing a report on a clinical procedure, for instance, an unethical researcher can artificially bolster the initial report’s conclusions. “In medicine, researchers and clinicians rely very heavily on the research, and so this has high potential for doing harm,” said Harold “Skip” Garner, a professor of biochemistry and internal medicine at the medical center and one of the project’s co-leaders.

Or, to borrow the phrasing of at least one presidential contender: Words matter.

Uncovering plagiarism
Garner said he and his colleagues originally devised their program, named eTBLAST, as a service for other biomedical researchers. By depositing a research summary or entire document in the program’s search window, scientists can retrieve similar documents among the millions stored in MEDLINE, an online storehouse. “You can check the novelty of your idea. You can identify competitors or collaborators, or other experts in the field,” Garner said.

You can also, as it turns out, identify instances of less-than-honorable behavior. Spurred on by ethical discussions in several high-profile publications, Garner and his collaborators began asking whether their program could find studies that were a bit too similar. Last year, the team randomly selected summaries, or abstracts, from the MEDLINE database and put its tool to the test. “We quickly discovered that our code works very well,” he said.

With funding from the federal Office of Research Integrity, Garner's group conducted a more systematic review. The results, published recently in the journal Bioinformatics and in a follow-up commentary in the journal Nature, found that essentially identical research published by different sets of authors — potential plagiarism — represented about 0.04 percent of MEDLINE’s database (roughly 6,700 cases in all).

Highly similar studies re-published by the same authors represented another 1.3 percent of the database’s documents. Garner estimated that about half of those seeming duplicates may represent clinical updates, reports of annual meetings or other legitimate publications. But re-positioning the same data in different journals can also pad an author’s resume, a far murkier ethical issue that he said is likely to be resolved not with algorithms or text mining but with clearer community standards.

In all, the group’s aptly named Déjà vu database has flagged more than 71,000 suspicious pairs.

“It’s a sensitive topic, so we’ve tried to be very precise and analytical about this,” Garner said.

How the database works
After first throwing out common words such as “and,” “or,” and “the,” eTBLAST compares the abstract’s remainder to the wording of other summaries in the database and retrieves the top 400 to 1,000 matches.
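
As a rough illustration of that first pass, here is a minimal Python sketch. It is not eTBLAST's actual code: the three-word stop list comes straight from the examples above, while the overlap score and the function names are assumptions made for the sake of the example.

```python
import re
from collections import Counter

# Only the three examples named in the article; a real stop list would be longer.
STOP_WORDS = {"and", "or", "the"}


def keywords(text):
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def overlap_score(query, candidate):
    """Share of the query's keywords that also appear in the candidate.

    This simple ratio is an assumed stand-in for whatever scoring
    the real search engine uses.
    """
    q = Counter(keywords(query))
    c = Counter(keywords(candidate))
    total = sum(q.values())
    if total == 0:
        return 0.0
    shared = sum((q & c).values())  # multiset intersection
    return shared / total


def top_matches(query, abstracts, limit=400):
    """Return indexes of the best-scoring abstracts, best first."""
    scored = [(overlap_score(query, a), i) for i, a in enumerate(abstracts)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored[:limit] if score > 0]
```

Calling top_matches() with an abstract and a list of stored summaries returns the positions of the closest candidates, which would then go on to the slower second step.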

For the second step, the algorithm goes sentence by sentence, scanning for matching keywords and word order. The program doesn’t always match alternate spellings of the same word but can pick out synonyms and different forms of the same word — matching “cancerous” and “tumor” to “cancer,” for example.
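
The second pass might look something like the sketch below. Again, this is an assumption-laden illustration rather than the team's algorithm: Python's standard difflib module stands in for the real word-order comparison, and the tiny CANONICAL lookup is a hypothetical table built only from the “cancer” example above.

```python
import re
from difflib import SequenceMatcher

STOP_WORDS = {"and", "or", "the"}

# Hypothetical lookup based on the article's single example;
# a real system would need a far larger vocabulary.
CANONICAL = {"cancerous": "cancer", "tumor": "cancer"}


def sentence_keywords(sentence):
    """Keywords of one sentence, in order, mapped to a canonical form."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [CANONICAL.get(w, w) for w in words if w not in STOP_WORDS]


def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_similarity(a, b):
    """Score from 0 to 1 that rewards shared keywords in the same order."""
    return SequenceMatcher(None, sentence_keywords(a), sentence_keywords(b)).ratio()


def best_sentence_matches(query, candidate):
    """For each query sentence, the score of its closest candidate sentence."""
    candidate_sentences = split_sentences(candidate)
    return [
        max((sentence_similarity(s, c) for c in candidate_sentences), default=0.0)
        for s in split_sentences(query)
    ]
```

Because the comparison runs over ordered keyword lists, two sentences that use the same words in a scrambled order score lower than a straight copy, which is the behavior the description above implies.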

For highly similar abstracts written by different groups, at least two curators in Garner’s group read the corresponding full-text articles side by side, noting their similarities and differences and posting their comments in Déjà vu.

Within that public database, suspicious pairings have pointed toward several serial plagiarists who have copied others’ work a half-dozen times or more.

Investigations launched
So far, Garner’s group has sent out e-mails calling attention to 20 individual cases, notifying authors and journal editors of the striking resemblances and asking for clarification.

The surprised reactions of editors and initial authors, along with the subsequent investigations and retracted papers, are likely to be repeated on a much larger scale. In the coming weeks, Garner’s group will send out automated questionnaires related to another 80 cases in which at least two curators have found reason to continue the process. Thousands more potential duplicates remain to be sorted.

Garner cautioned that the review is meant to initiate a process that should involve everyone with a vested interest. “We’re really trying not to be the judge and jury here,” he said. With the publicly available tool, though, some journal editors and reviewers have become sleuths on their own, and in at least one case they intercepted a manuscript before its publication and referred it to a university’s ethics committee. “That validates our statement that tools like this can help make the medical database better,” Garner said.

The computer program isn’t perfect, he said, and determined rule-breakers can still escape detection. But as knowledge spreads about the tool, researchers may be less tempted to act unethically. And those who do now have a higher probability of being caught and punished. Under the common federal definition of misconduct for U.S. scientists, plagiarism, fabrication and falsification of data all carry stiff penalties.

Bad behavior detectors
The concept of an online bad behavior detector is nothing new (the popular Web-based Turnitin, developed by University of California at Berkeley researchers, got its start more than 10 years ago). What sets Garner's work apart is its method of picking out specific instances of questionable conduct and estimating the overall prevalence within a massive database, said Melissa Anderson, director of the Post-Secondary Education Research Institute at the University of Minnesota.

In 2005, Anderson and two co-authors published the results of an anonymous survey of more than 3,200 researchers in the U.S., a seminal report that, as she put it, revealed “some rather shocking estimates” of ethical lapses. Among her team’s findings, 4.7 percent of researchers confessed to re-publishing the same data within the past three years while 1.4 percent admitted to outright plagiarism.

Anderson said her group’s results, based on self-reporting by academics, couldn’t be directly compared to Garner’s study. Nevertheless, she praised his approach as a “great idea” aimed at firming up the figures.

Understanding the intent behind those numbers is still a delicate matter. Differing cultural norms could be one contributing factor, Anderson said, especially for researchers from countries where copying is seen as a way of honoring a good idea. Her group’s analysis also suggests that scientists who perceive a tenure review as unfair are more likely to engage in unethical practices. A recent study published in the journal Psychological Science backs that assessment, finding that students with more fatalistic beliefs — ones primed to believe they have less control over their own destinies — are more likely to cheat and lie.

A drain on research
Regardless of motive, Anderson believes the problem must be addressed before it creates an even bigger drain on legitimate research. “People work really hard to get good science published,” she said. “Then those venues are being wasted on research that isn’t new, when that space could have gone to new science. That’s a waste of public dollars.”

Garner’s team is taking a similar angle in a recent expansion of its work: comparing summaries of government-funded grant proposals. If groups are being funded multiple times by different agencies, he believes, the new tool could help cut waste and root out unethical practices. The same watchdog concept could be applied to a slew of publications stashed away within the suite of LexisNexis storehouses or any other accessible database, whether term papers or political speeches.

Ultimately, Garner believes his program has just as much potential to uncover positive connections. Among its newest projects, his team has begun working with the Susan G. Komen Breast Cancer Foundation to establish social networks among scientists conducting related cancer studies. From an initial computer-assisted search through similar-sounding research, the effort aims to promote auspicious pairings, opportunities for fruitful collaborations and a faster overall pace for subsequent research.

Those matched words, Garner hopes, will be the ones that eventually matter most of all.

© 2013 msnbc.com
