Peer Review is still BROKEN! The NeurIPS 2021 Review Experiment (results are in)

**The Inconsistency of Peer Review**

In the world of academia, peer review is often touted as the gold standard for evaluating the quality and validity of research. However, the NeurIPS 2021 consistency experiment, a repeat of the famous 2014 experiment, shows just how inconsistent the process can be. The randomness of the system has produced some truly astonishing results, which we will delve into below.

**A Look at the Numbers**

The inconsistency is not anecdotal. At NeurIPS 2021, the organizers repeated the 2014 experiment: a subset of submissions was reviewed independently by two separate committees. In 2014, each committee agreed with only about half of the other committee's acceptances. In 2021, of the roughly 298 dual-reviewed papers that were ultimately accepted, about 199 had been rejected by one of the two committees. The stronger recommendations fared no better: none of the six papers one committee suggested for an oral presentation was confirmed by the other committee, and of the 44 papers suggested for a spotlight, only three received that recommendation from both.
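To make the committee agreement concrete, here is a minimal Python sketch that recomputes the 2014 agreement rates from the counts quoted in the video transcript at the end of this post (22 papers accepted by both committees, 22 and 21 accepted by only one, 101 rejected by both). Treat the counts as approximate figures repeated from the video, not as an official release.

```python
# 2014 duplicated-review experiment: counts as quoted in the transcript below.
table = {
    ("accept", "accept"): 22,   # accepted by both committees
    ("accept", "reject"): 22,   # accepted by committee 1 only
    ("reject", "accept"): 21,   # accepted by committee 2 only
    ("reject", "reject"): 101,  # rejected by both committees
}

total = sum(table.values())                                              # 166 papers
c1_accepts = table[("accept", "accept")] + table[("accept", "reject")]   # 44
c2_accepts = table[("accept", "accept")] + table[("reject", "accept")]   # 43
both = table[("accept", "accept")]                                       # 22

print(f"papers reviewed by two committees: {total}")
print(f"agreement among committee 1's accepts: {both / c1_accepts:.0%}")  # ~50%
print(f"agreement among committee 2's accepts: {both / c2_accepts:.0%}")  # ~51%
```

In other words, swapping committees would have swapped out roughly half of the accepted papers.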

**The Accept/Reject Ratio**

When we look at the accept/reject decisions, the extremes are handled consistently: a really good paper has a high probability of being accepted by a second committee, and a really bad paper has an even higher probability of being rejected by both. Most papers, however, fall somewhere in the middle, neither clearly excellent nor clearly flawed, and it is in this middle band that the outcome is effectively random.

**The Role of Randomness**

So, what exactly causes this randomness? It is not a hidden quality signal that only some committees can read: the follow-up analysis of the 2014 experiment found no correlation between reviewer scores and the papers' future citations. The problem is reviewer noise. Two committees looking at the same borderline paper frequently reach opposite conclusions, so for the large middle band of submissions the decision amounts to drawing a random number. This also undercuts the usual justification for arXiv and social media blackouts, namely that early exposure might bias the reviewers; it is hard to bias what is effectively a random number generator.

**The Impact on Ph.D. Students**

For Ph.D. students, this inconsistency can be devastating. A student typically needs on the order of three published papers over a four- to five-year Ph.D., and there are only three or four big conferences each year that one can realistically submit to. With the middle of the distribution decided essentially at random, even solid work can bounce through several review cycles before it lands, and each bounce costs months.
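To see why this hurts, here is a back-of-envelope sketch; the acceptance probability and the number of yearly deadlines are assumptions chosen purely for illustration, not figures from the video.

```python
# Purely illustrative numbers (my assumptions): treat a borderline paper as a
# lottery ticket with acceptance probability p at each deadline; the number of
# submission rounds until acceptance is then geometrically distributed.
p_accept = 0.25          # assumed per-round acceptance probability
deadlines_per_year = 4   # assumed number of realistic top-venue deadlines

expected_rounds = 1 / p_accept
print(f"expected submission rounds until acceptance: {expected_rounds:.1f}")
print(f"expected calendar delay per paper: "
      f"{expected_rounds / deadlines_per_year:.1f} years")
# A student who needs three such papers can easily lose years to this noise.
```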

**The Solution**

So, what can be done to address this issue? One suggestion is that tenured professors hand out Ph.D.s independently of conference acceptances, for example on the basis of a few strong arXiv publications. Another is that universities stop using the impact factor of publication venues when granting tenure and instead look at citations, at how widely the work is used, and at the quality and originality of the research itself.

**The Role of Tenure**

Tenured professors play a significant role here. Tenure is often granted on the basis of reputation and the impact factor of the venues a researcher publishes in, rather than on the quality of the research itself. This pushes professors, and by extension their students, toward the handful of high-prestige conferences, which is exactly what makes the review lottery so consequential.

**Grant Agencies and Conference Influence**

Grant agencies also contribute to the problem by awarding funding based on the reputation of the researcher and the impact factor of the venues where the work is presented. The incentive is the same: researchers chase high-prestige conferences because that is what gets funded.

**A Solution for Grant Agencies**

One potential remedy is for grant agencies to evaluate proposals on the originality and likely impact of the research itself, rather than on the reputation of the applicant or of the venues where previous work appeared. This would weaken the grip that conference prestige currently has on funding decisions.

**Conclusion**

In conclusion, while peer review is essential for ensuring the quality and validity of research, its current implementation is inconsistent and, for the broad middle of submissions, effectively random. By acknowledging these flaws and changing the incentives around the system, we can create a fairer and more equitable process that rewards good research and gives Ph.D. students a realistic path to success.

**The Final Word**

Ultimately, it's up to us as academics and researchers to take responsibility for creating a better system. We must recognize the flaws in our current implementation of peer review and work together to develop new solutions that prioritize quality and originality over reputation and prestige. Only then can we ensure that research is truly valued and rewarded, regardless of conference or impact factor.

"WEBVTTKind: captionsLanguage: endo you know how hard it is to truly generate random numbers i i don't mean the random number generator on your phone or anything like this that's just algorithm that crunches something but it's deterministic true random numbers are super difficult to generate there is even a wikipedia article about it what you need to do is you need to measure some actual physical phenomenon like atmospheric noise or thermal noise or or other things that we have no idea they are so chaotic we just can't predict them and thus their results are truly truly random random.org even sells to random number generators for you this is big topic humanity has searched for and wide for truly random processes but now ladies and gentlemen we found it the nurip's review process is a absolutely truly random phenomenon so if you're not aware a way way time ago in nurips what was that 2014 the organizers made a little experiment where they gave certain set of papers that was submitted to the conference not only to one committee to review but the two separate committees in order to track how the committees would agree or disagree now the results right there were quite damning to be honest so not only did they not find any sort of correlation between what the reviewers scores they gave with any sort of future citations and that's a paper that i've covered in a video where they look back seven years later at whether or not the reviewers could predict anything about these papers turns out they cannot they also found that the reviewers mostly didn't really agree that much so here were these experiments now of the 166 papers most were rejected by both committees which most papers to such a conference are rejected so reject is sort of the default answer but here look at that if committee one accepted and committee one accepted for 22 plus 21 papers so for 33 papers committee 2 only agreed on half of them and likewise when committee 2 accepted for the 43 papers and this is 44 papers so for the 44 papers that committee 2 accepted committee one only agreed again in half of them so this means that if you were to switch committees for the papers only half of the accepted papers would be the same papers half of them would be other papers that had actually been rejected by the other committee which is kind of crazy but this just shows you how noisy this process really is now it's 20 21 and we've actually repeated this experiment so here's a reddit post by the user wai gua chiang that has scraped from open review these scores and put together some statistics such as this one here that shows the average rating of the papers versus how many of papers were in a particular bucket and what ultimately happened to them so we only have full data insight into the accepted papers and the rejected papers that have sort of voluntarily agreed to make their reviews public which most papers that are rejected don't now the most interesting part here is this one this is the repetition of the nurip's experiment you can see at the bottom the total is almost 300 papers and again these are not all the papers part of the experiment these are only the papers that were accepted because we don't know anything about the other ones so the way this worked was the follows papers were given to two separate committees these two committees reached a decision independently of each other and then the maximum of the two decisions was taken as an acceptance criterion so if either of the committees accepted the paper to be published the paper 
was going to be published so to understand this table the leftmost column is the final decision which is the max of decision one and decision two not always but we'll get to that then the second column is the decision of the first committee and the third column is the decision of the second committee now these things are ordered so it's not the same as in the last paper i've shown you so since there's no clear ordering we simply always put the larger decision on the left and the second large decision on the right so the most interesting part of this is how many papers were accepted by one committee but rejected by another one for that we have to add together all the rows where one of the decision is a reject so 174 plus 16 plus 9 is i think 199 papers 199 papers out of the 298 papers that were accepted had actually been rejected by a second committee so to compare we have to do the following we'll say that essentially the analogy would be that 22 and 22 and 21 papers so 65 papers would be our analogous total number from down here those are the papers that ultimately ended up being accepted because they were accepted by one of the committees and then 22 plus 21 papers so 43 papers would be the amount of papers that would have been rejected by one of the two committees but ultimately ended up being accepted because it was accepted by the other one so according to this here we see 43 out of 65 papers only were accepted by one of the committees and here we see that roughly 200 out of 300 papers were only accepted by one of the committees in both cases it's about two-thirds of the paper which means that actually this is remarkably consistent so in the face of that and with the explosion of the machine learning community more papers more reviewers and so on you could actually say it's a good thing it's actually surprising this hasn't gotten much worse over the years now that's one way to look at it and the other way to look at it is to say this is crap like come on this is completely inconsistent not only the accept reject is inconsistent you see of the six papers suggested to an oral by one of the committees this was never confirmed by another committee and how many were suggested for a spotlight by one of the committees 16 20 29 41 44 44 papers were suggested for a spotlight by one of the committees yet only three had actually both committees agreeing and again the same results hold if you were to swap out committees if you just differently assign people to papers half of the papers that are in the conference would be different half and i don't know how people can still claim that peer review is like this esteemed thing that is supposed to catch errors and do quality control and yada yada there's something to be said that if you have a really good paper the probability that a different committee also accepts it is is pretty high and also if you have a really bad paper the probability that two committees agree on rejecting it i guess that's even higher however most papers fall somewhere in the middle and that's the area of true randomness essentially what you do is you throw your paper in there and then something something happens and then you get a random number at the end and remember people use this to justify archive blackouts social media blackouts oh my god you cannot buy us the reviewers you must not buy us the pristine review you're you cannot buy us a random number generator i guess you can but it makes no makes no sense like honestly this is only half joking at this point the social 
media networks that we have people surfacing interesting papers from the depths of archive and from their social networks all the people filtering this kind of stuff yes there's promotion going on yes there's hype yes money plays a role but still this is a much better process than just like three random dudes sitting on the toilet like scrolling through your paper a bit and then writing not enough experiments uh reject i don't understand it it's confusing look at the learning rate grafting video i did like these are the types of reviews that reviewers have to battle with yes it hasn't gotten much worse over the years yes really good papers are consistent really bad papers are consistent but i still maintain that this situation is not really a good one this is absolutely inconsistent it's a lottery your best bet is to write as many papers as you can that are just barely barely not crap and then throw all of them in and through the random number process some of them will get accepted and that's a sad state because big companies do this for clout big companies do it to recruit new people and so on but there are a lot of phd students that need to get whatever their three papers published in their four or five years that they're doing the phd and with such randomness and with only very very limited amount of conferences that you can submit to over the course of a year there's like three or four different big conferences that you realistically can submit to if you want a good impact factor this is very bad situation and a lot of people are going to be damaged just because the universe has some random fluctuations the solution to this honestly starts with professors tenured professors start handing out phds independent of conference submissions universities start giving professors tenure not on the basis of the impact factor of where they publish look at citations look at how popular the work is in any other metric stop considering impact factors of conferences grant agencies stop giving out grants based on the reputations of the professors based on the impact factors essentially disregard conference publications for anything you do i see some people they have to do it some professors have to get tenure and this is a criterion phd students have to do this because that's a requirement for their phd but if you're in a position to discard all of this do it what stops you you have tenure tell your phd students do three really nice really good archive publications if i'm happy with it phd all right that was it from me for ranting about this topic what do you think about it let me know in the comments maybe i'm completely wrong here but you know i'm happy to be educated to the contrary see ya youdo you know how hard it is to truly generate random numbers i i don't mean the random number generator on your phone or anything like this that's just algorithm that crunches something but it's deterministic true random numbers are super difficult to generate there is even a wikipedia article about it what you need to do is you need to measure some actual physical phenomenon like atmospheric noise or thermal noise or or other things that we have no idea they are so chaotic we just can't predict them and thus their results are truly truly random random.org even sells to random number generators for you this is big topic humanity has searched for and wide for truly random processes but now ladies and gentlemen we found it the nurip's review process is a absolutely truly random phenomenon so if you're not aware a 
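As a quick sanity check on the arithmetic above, the following sketch reproduces the "max of the two decisions" rule and the roughly two-thirds disagreement figure for both experiments. The decision ordering (reject < accept < spotlight < oral) is my assumption of how the decisions compare, and the counts are the ones quoted in the transcript.

```python
# Assumed ordering of decisions for the "more favourable decision wins" rule.
RANK = {"reject": 0, "accept": 1, "spotlight": 2, "oral": 3}

def final_decision(d1: str, d2: str) -> str:
    """Roughly the NeurIPS 2021 rule: the more favourable committee decision wins."""
    return d1 if RANK[d1] >= RANK[d2] else d2

assert final_decision("reject", "spotlight") == "spotlight"

# Accepted papers that the other committee had rejected, counts from the transcript.
rej_by_one_2021, accepted_2021 = 174 + 16 + 9, 298          # 2021 repeat
rej_by_one_2014, accepted_2014 = 22 + 21, 22 + 22 + 21      # 2014, counting a paper
                                                            # as accepted if either
                                                            # committee accepted it
print(f"2014: {rej_by_one_2014}/{accepted_2014} "
      f"= {rej_by_one_2014 / accepted_2014:.0%} of accepted papers")
print(f"2021: {rej_by_one_2021}/{accepted_2021} "
      f"= {rej_by_one_2021 / accepted_2021:.0%} of accepted papers")
```

Both ratios come out at about two-thirds, which is the "remarkably consistent inconsistency" the video points out.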