From: Sam The Scientist
Dr. Tomkins,
My name is Sam [CONFIDENTIAL] and I am a PhD student in Bioinformatics and Systems Biology in the [CONFIDENTIAL] lab at [CONFIDENTIAL]. My research focuses on developing network biology methods for assessing functional relationships between genes. A friend recently directed me to one of your articles at Answers in Genesis that disputed the fusion of ancestral chromosomes 2A and 2B into human chromosome 2. I have noticed that the article has already drawn several criticisms, so I will tread only briefly over the common threads and focus on the technical aspects of your analysis. These are my criticisms in order:
1. In your introduction, you note that (as reported by Fan et al. 2002b) the fusion site does not represent a pristine telomeric site. However, this is a curious criticism as genomic rearrangements are rarely pristine (which is now greatly appreciated in the heterogeneity of certain cancerous lesions as curated by TCGA). Moreover, given the protective role of telomeres against such rearrangements, we should expect that the telomeric sites are degenerated if a fusion occurred.
2. As a general critique, your article did not include a methods section. As a bioinformatician, I was greatly dismayed by this, as it made it very difficult to recapitulate your analyses myself. With all due respect, as a scientist, I view myself as a professional skeptic and I like to do my due diligence when it comes to testing claims. For example, to extract the annotated fusion site, I had to dig up the original mapping of the fusion site and subsequently convert that precise mapping to newer genome builds. In the future, it would be helpful for you to include a methods section in your articles to make them more reproducible.
3. As another stylistic critique, you tended to switch between DDX11L2 and DDXL11L2. This is a silly typo that could be included in a corrigendum or an errata section.
4. You focused on one version of the annotated transcript of DDX11L2 that spanned the fusion site. However, this transcript is annotated by Ensembl with no supporting evidence. Moreover, on investigating the expression of this gene further, I found that only the shorter transcript was detected, without exception.
5. You erroneously attributed function to the histone modifications bordering the gene. As we know, biochemically active regulatory regions have characteristic histone modifications. But we cannot conclude that any region with histone modifications is biochemically active. This would be an example of the fallacy of affirming the consequent. This is why we cannot construct predictive models of the cell even though we have genome-wide maps of histone modifications.
6. You claimed that DDX11L2 was highly expressed in several tissues. There are several issues with this. First, you did not normalize the expression level on a genome-wide level for each tissue type. Instead, you opted to display RPKM values (which are normalized to the experiment). As a result, your figure 3 reported values without a genome-wide context or reference to the expression of other genes. In the future, I would suggest carrying out this normalization using a Z-score (which centers each value on the mean of the distribution and scales it by the standard deviation). Second, given our best data (GTEx), you cannot conclude that DDX11L2 is highly expressed. Rather, it is expressed at extremely low levels in all tissues. Given that GTEx is the current gold standard with regard to tissue-specific expression, I have to ask that you retract the claim of DDX11L2 being highly expressed. This is simply not true given our best data.
7. I have to question your use of the COXPRESdb database. Given the low expression of DDX11L2, any signal is likely an instance of a type I error, otherwise known as a false positive. There are many more genes that are expressed at low levels than at high levels, and this can introduce artificial correlations (this is because the dispersion of expression values is lower at a low/unexpressed value). For this reason, coexpression is considered far less reliable as an indicator of functional relationships when expression levels are low among the genes.
Moreover, coexpression is a notoriously faulty inference tool for functional relationships. Statistically, it has low precision and high recall. In layman’s terms, it is good at detecting signals, but poor at resolving true positives from false positives. This is compounded by the issue I highlighted earlier. If you want to infer a function for DDX11L2, you need to conduct experiments to validate its function. All else is mere hand-waving using unreliable correlations. Again, I have to encourage you to retract any functional inferences you make using coexpression data.
As a final note on this topic, in the future, it is an absolute necessity that you report the magnitude of the coexpression correlation. As with your reporting of expression data, you offer no context by which we can judge the strength of the correlation. It is entirely possible that these correlations are illusory or a statistical artifact, as I explained above. This type of functional relationship inference is my area of research, so it disappoints me to see it done improperly.
8. You overlook the paralogous and orthologous relationships presented by Costa et al. (2009). This has been pointed out by others so I will not be elaborating here.
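To make the suggestions in points 6 and 7 concrete, here is a minimal sketch in Python of a per-tissue Z-score and of a coexpression correlation reported with its magnitude. Everything in it is an assumption for illustration only: the gene names, the expression values, and the function names `zscores` and `pearson_r` are hypothetical, and none of the numbers come from GTEx or any other real dataset.

```python
from math import sqrt
from statistics import mean, pstdev


def zscores(values):
    """Z-score each value against the mean and standard deviation of the whole list."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("Z-score is undefined when all values are identical")
    return [(v - mu) / sigma for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


# Hypothetical expression values for one tissue (toy numbers, not real data).
tissue = {"GENE_A": 120.0, "GENE_B": 45.0, "GENE_C": 8.0, "GENE_OF_INTEREST": 0.4}

# Point 6: place each gene in the context of every other gene in the same tissue.
for gene, z in zip(tissue, zscores(list(tissue.values()))):
    print(f"{gene}: Z = {z:+.2f}")

# Point 7: report the magnitude of the correlation, not just that one exists.
# Hypothetical expression profiles of two genes across five tissues.
profile_1 = [0.4, 0.1, 0.3, 0.2, 0.5]
profile_2 = [0.3, 0.2, 0.4, 0.1, 0.6]
print(f"Pearson r = {pearson_r(profile_1, profile_2):.3f}")
```

A real analysis would compute the Z-score over all genes quantified in each tissue rather than over four, and would report the correlation together with the number of samples behind it, so that a reader can judge both the strength and the reliability of the signal.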
In sum, I believe that it is your duty as a scientist to retract this article. I do not believe that the data support your claims or conclusions. If you have any questions regarding my more technical objections, feel free to respond.
Best,
Sam