Papers were downloaded as PDFs from the publishers' sites. The PDFs were processed with GROBID, a machine-learning library that parses PDF documents into structured XML (TEI) documents. It's a really powerful and surprisingly accurate parser for scientific documents, including functionality to recognize, parse, and verify references against Crossref. The abstracts and body text were extracted after stripping in-text citation and figure references. For journals without abstracts, the first paragraph was taken as the abstract.
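The cleanup step above can be sketched on GROBID's TEI output: citation and figure callouts arrive as `<ref>` elements, which can be dropped before flattening each paragraph to plain text. This is a minimal illustration, not the project's actual code, and the TEI snippet is a toy stand-in for real GROBID output.

```python
# Sketch: strip <ref> (citation/figure) elements from GROBID-style TEI
# and flatten paragraphs to plain text. Toy TEI snippet, not real output.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # GROBID emits TEI-namespaced XML

tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <p>Coral growth slowed <ref type="bibr">[3]</ref> as shown in
    <ref type="figure">Fig. 2</ref> over the study period.</p>
  </div></body></text>
</TEI>"""

def flatten(el):
    """Collect text recursively, skipping <ref> elements but keeping
    the text that trails them."""
    parts = [el.text or ""]
    for child in el:
        if child.tag != TEI + "ref":
            parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)

def clean_paragraphs(xml_text):
    root = ET.fromstring(xml_text)
    # Normalize whitespace after removing the callouts.
    return [" ".join(flatten(p).split()) for p in root.iter(TEI + "p")]

print(clean_paragraphs(tei))
```

In practice the same traversal runs over the `<abstract>` and `<body>` sections of each parsed paper.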

The two-axis text plot was created with scattertext, a truly awesome and rich package that supports a wide variety of analyses, most of which I haven't used for the PoTY project. Phrase associations were visualized with pytextrank, run as part of a spaCy pipeline.
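pytextrank implements the TextRank algorithm inside spaCy. The core idea can be sketched standalone: build a word co-occurrence graph and rank nodes with PageRank. This is a deliberately naive illustration (whitespace tokenization, no part-of-speech filtering), not what pytextrank actually ships.

```python
# Minimal TextRank-style keyword sketch: PageRank over a word
# co-occurrence graph built with a small sliding window.
import networkx as nx

def textrank_keywords(text, window=2, top_n=5):
    words = [w.strip(".,").lower() for w in text.split()]
    g = nx.Graph()
    # Connect each word to its neighbors within the window.
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            g.add_edge(w, words[j])
    scores = nx.pagerank(g)
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [w for w, _ in ranked[:top_n]]
```

pytextrank adds the pieces this sketch omits: lemmatization, noun-phrase chunking, and scoring of multi-word phrases rather than single tokens.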

For topic models, the text was tokenized and stemmed using a Snowball stemmer and the stopword lists included with the NLTK (Natural Language Toolkit) package. The Latent Dirichlet Allocation (LDA) model from tomotopy was used. Visualization of the topic models was done with pyLDAvis, the Python implementation of the R-based LDAvis.

Place names were geocoded with the Geocoder package, using the ArcGIS provider. The map was generated with Folium.