NLP Scholar: The State of Natural Language Processing Literature

Contact: Saif M. Mohammad (uvgotsaif@gmail.com, saif.mohammad@nrc-cnrc.gc.ca)




Project Overview

This work examines Natural Language Processing (NLP) research literature to identify broad trends in productivity, focus, and impact. To do so, we extracted and aligned the information in the ACL Anthology (AA) and Google Scholar (GS). We present the analyses in a sequence of questions and answers. The goal is to record the state of the NLP literature: who and how many of us are publishing? what are we publishing on? where and in what form are we publishing? and what is the impact of our publications? The answers are usually in the form of numbers, graphs, and inter-connected visualizations. Special emphasis is laid on the demographics and inclusiveness of NLP publishing.

The work is presented in a number of ways, including:

NLP Scholar Data: A single unified source of information from both the ACL Anthology (AA) and Google Scholar (GS) for tens of thousands of NLP papers. The dataset is described in the LREC-2020 paper listed at the bottom of this page. It was first collected in June 2019; a second round of data collection was done in June 2020. Click here to download the latest version of the data. The interactive demo now uses this dataset.


Papers

Examining Citations of Natural Language Processing Literature. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.
Paper (pdf)    Presentation

  • Summary: Examines nine questions pertaining to broad trends in citations of NLP papers (across time, across venue types, across paper types, across areas, etc.).
  • BibTeX:

    @inproceedings{mohammad2020citations,
       title={Examining Citations of Natural Language Processing Literature},
       author={Mohammad, Saif M.},
       booktitle={Proceedings of the 2020 annual conference of the association for computational linguistics},
       address={Seattle, USA},
       year={2020} }


Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.
Paper (pdf)        Video       Presentation

  • Summary: Examines eight questions pertaining to disparities between male and female NLP researchers (in authorship and citations).
  • BibTeX:

    @inproceedings{mohammad2020gender,
       title={Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations},
       author={Mohammad, Saif M.},
       booktitle={Proceedings of the 2020 annual conference of the association for computational linguistics},
       address={Seattle, USA},
       year={2020} }


The State of NLP Literature: A Diachronic Analysis of the ACL Anthology. Saif M. Mohammad. arXiv preprint arXiv:1911.03562. November 2019.
Paper (pdf)

  • Summary: A manuscript that brings together the analyses of NLP papers first presented in the four State of NLP blog posts.
  • BibTeX:

    @article{mohammad2019nlpscholar,
       title={The State of NLP Literature: A Diachronic Analysis of the ACL Anthology},
       author={Mohammad, Saif M.},
       journal={arXiv preprint arXiv:1911.03562},
       year={2019} }


NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.
Paper (pdf)       Presentation

  • Summary: Presents an interactive visualization tool to help users find (related) work published in the ACL Anthology.
  • BibTeX:

    @inproceedings{mohammad2020demo,
       title={NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature},
       author={Mohammad, Saif M.},
       booktitle={Proceedings of the 2020 annual conference of the association for computational linguistics},
       address={Seattle, USA},
       year={2020} }


NLP Scholar: A Dataset for Examining the State of NLP Research. Saif M. Mohammad. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020). May 2020. Marseille, France.
Paper (pdf)      Presentation

  • Summary: Presents the NLP Scholar Dataset -- a single unified source of information from both the ACL Anthology (AA) and Google Scholar for tens of thousands of NLP papers. Presents initial work on analyzing the volume of research in NLP over the years, identifies some of the most cited papers in AA, and outlines a list of potential applications of the dataset.
  • BibTeX:

    @inproceedings{mohammad2020data,
       title={NLP Scholar: A Dataset for Examining the State of NLP Research},
       author={Mohammad, Saif M.},
       booktitle={Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020)},     
       address={Marseille, France},    
       year={2020} }

 

Screenshot of NLP Scholar When Showing the Full June 2020 Data

 

Caveats and Limitations

A detailed list of caveats and limitations associated with this work is available here.

 

Acknowledgments

This work was made possible by helpful discussions with, and encouragement from, a number of awesome people, including: Dan Jurafsky, Tara Small, Michael Strube, Cyril Goutte, Eric Joanis, Matt Post, Patrick Littell, Torsten Zesch, Ellen Riloff, Norm Vinson, Iryna Gurevych, Rebecca Knowles, Isar Nejadgholi, and Peter Turney. Also, a big thanks to the ACL Anthology and Google Scholar teams for creating and maintaining wonderful resources.