Every year, at the beginning of October, a heightened excitement can be felt within the scientific world: it is the time when the year's Nobel Prize winners are announced. From an outside perspective, it can be hard to grasp what is going on, and newspapers spend a great deal of effort trying to give insight into the inner workings of this world. Now imagine you are fascinated by a new topic and want to gain some insight, whether you are an undergraduate student who wants to steer a future career, an experienced scientist who wants to broaden their scope, or someone who is just browsing around. When exploring outside of one's own expertise, getting an overview can be overwhelming. Who are the key players, how can I get to know them, and what are the hot topics? Typically, it takes several years within a field, going to conferences and reading publication after publication, to get an intuition about the people involved and their connections and contributions. Luckily, with some Python and its extensive libraries, we can speed up this process and quickly generate insight into this network by simply analyzing all relevant journal publications.
The topic of interest: Nobel Prize in physics 2018
As an example, we take a look at this year's Nobel Prize in physics, awarded to Arthur Ashkin "for the optical tweezers and their application to biological systems." Also awarded were Gérard Mourou and Donna Strickland for their invention of chirped pulse amplification. For the sake of demonstration, we will focus on the field of optical tweezers. In brief, optical tweezers use a highly focused laser beam to exert piconewton-scale forces that hold small particles in place.
Biopython for fetching Abstracts
To save us from reading all the scientific literature, we will use the Biopython package to fetch abstracts related to our search query from the PubMed database. Make sure to start by reading the NCBI's Entrez User Requirements. Specifically, we do not want to overload this service, so make large requests outside of USA peak times and set the e-mail parameter correctly so that you can be contacted. As the search argument, we will pass optical trap as a keyword. Note that optical tweezers were initially called a "single-beam gradient force trap," but since most publications are associated with several keywords, we should still be able to get relevant results. For fetching, we first collect the number of publications that are associated with the keyword and then collect all their IDs for subsequent downloading in batches of 100.
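The fetching steps described above could look roughly like the following. This is a minimal sketch rather than the original code: the function names `batch_starts` and `fetch_records` are made up for illustration, and it assumes the number of hits stays below the limit that ESearch allows for a single request.

```python
def batch_starts(total, batch_size=100):
    """Return the start offset of every batch needed to cover `total` items."""
    return list(range(0, total, batch_size))


def fetch_records(term, email, batch_size=100):
    """Download the MEDLINE records of all PubMed hits for `term` (hypothetical helper)."""
    # Imported here so that the batching helper above works without Biopython.
    from Bio import Entrez, Medline

    Entrez.email = email  # lets NCBI contact you, as their requirements ask

    # First query: only the hit count is of interest.
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    count = int(Entrez.read(handle)["Count"])
    handle.close()
    print(f"Total number of publications that contain the term {term}: {count}")

    # Second query: collect the ids of all hits.
    handle = Entrez.esearch(db="pubmed", term=term, retmax=count)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()

    # Download the records in batches to keep each request small.
    record_list = []
    for start in batch_starts(len(id_list), batch_size):
        batch = id_list[start:start + batch_size]
        handle = Entrez.efetch(db="pubmed", id=",".join(batch),
                               rettype="medline", retmode="text")
        record_list.extend(Medline.parse(handle))
        handle.close()
    return record_list
```

Calling `fetch_records("optical trap", "you@example.com")` with your own address would then return a list of dictionary-like records, one per publication. With 5059 hits and a batch size of 100, `batch_starts` yields 51 batches.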
Total number of publications that contain the term optical trap: 5059
100%|██████████| 51/51 [01:20<00:00, 1.36s/it]
Exploring the data: Top Authors, Time Series, Top Journals and Top Keywords
Having gathered the information on thousands of publications, we would now like to get some insight into the data. Some immediate questions might be: Who are the leading scientists? Are people still publishing in this field? Which journals should I read? What keywords should I look out for? To get an overview, we'll explore the data. First, we convert our record_list into a pandas DataFrame. The columns of interest are FAU for the author names, TA for the journal title, EDAT for the date, and OT for a list of keywords associated with a publication. As the authors and the keywords are stored as lists in the DataFrame, we need to flatten them first. This can be conveniently done with a list comprehension. We can then use the Counter class from collections to extract the most common elements and visualize them by creating bar plots using seaborn.
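As a rough illustration of the flattening and counting, here is a small sketch on made-up records. The author names, journals, and dates are invented, and the assumption that EDAT starts with a four-digit year should be checked against the real data.

```python
from collections import Counter

import pandas as pd

# Made-up records that mimic the MEDLINE fields used in the text.
record_list = [
    {"FAU": ["Author, A", "Author, B"], "TA": "Journal X",
     "EDAT": "2016/03/01 06:00", "OT": ["optical trap", "DNA"]},
    {"FAU": ["Author, A", "Author, C"], "TA": "Journal X",
     "EDAT": "2017/07/15 06:00", "OT": ["optical trap"]},
    {"FAU": ["Author, A"], "TA": "Journal Y",
     "EDAT": "2017/11/30 06:00"},  # no keywords, as can happen in real data
]

df = pd.DataFrame(record_list)


def flatten(column):
    """Flatten a column of lists, skipping records where the field is missing."""
    return [item for items in column.dropna() for item in items]


top_authors = Counter(flatten(df["FAU"])).most_common(10)
top_keywords = Counter(flatten(df["OT"])).most_common(10)
top_journals = Counter(df["TA"].dropna()).most_common(10)
# Assumption: the first four characters of EDAT are the year.
papers_per_year = Counter(df["EDAT"].dropna().str[:4])
```

Each `most_common` result is a list of (name, count) pairs, which can be split into two sequences and handed to a bar plot.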
From the four plots, we can make several interesting observations. First, when examining the top authors, we note that the Nobel laureate Ashkin is not present. The top scholar is Bustamante, Carlos. When considering the publications over time, we see that most of the work was published after 2000; keeping in mind that Ashkin's publication was in 1986, it is evident how fundamental his work was. The time course also tells us that the number of publications is now in decline. The journal with the most publications related to our search query is Optics Express.
Visualizing the network
We can now dig deeper into our data and start to visualize the scientific network. Specifically, we will analyze which authors were publishing together and how often. For this, we create a list of all author combinations for each paper, which will be our author connections. Here, we do not consider the position of the author, i.e., whether one author was the principal investigator or not. After transforming this into a pandas DataFrame, we can use it to create an undirected graph with the networkx package. Each author now becomes a node, and the connection between two authors is an edge. To display the graph, we use the nxviz library and display it as a CircosPlot. Here, the authors are arranged on a circle with connections in between them. While I consider this a beneficial representation for the given task, be sure to check out the different types of graphs you can use for visualization.
Additionally, we limit the graph to the TOP50 authors. The number of author connections will be reflected by the edge_width parameter, which determines the line width of the connections. We will order the nodes by publication count so that the number of publications decreases clockwise, beginning with the scholar with the most papers (Bustamante, Carlos).
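One way to build such a graph is sketched below. It is not the original code: it skips the intermediate pandas DataFrame and counts the author pairs directly, and the function name `build_coauthor_graph` and the node attribute `publications` are chosen here for illustration only. The plotting step with nxviz is left out.

```python
from collections import Counter
from itertools import combinations

import networkx as nx


def build_coauthor_graph(author_lists, top_n=50):
    """Build an undirected co-author graph restricted to the top_n authors."""
    # Number of papers per author (each author counted once per paper).
    paper_counts = Counter(a for authors in author_lists for a in set(authors))
    # Number of shared papers per unordered author pair.
    pair_counts = Counter(
        pair
        for authors in author_lists
        for pair in combinations(sorted(set(authors)), 2)
    )

    graph = nx.Graph()
    # Nodes are added in order of decreasing publication count.
    for name, count in paper_counts.most_common(top_n):
        graph.add_node(name, publications=count)
    # Keep only the connections between authors that made the cut.
    for (first, second), shared in pair_counts.items():
        if first in graph and second in graph:
            graph.add_edge(first, second, edge_width=shared)
    return graph
```

Passing `flatten`-free author lists, i.e. one list of names per paper, gives a graph whose `edge_width` edge attribute holds the number of joint papers, which a plotting library can map to line width.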
With the CircosPlot, we now have a colorful representation of the scientific network that allows us to immediately spot the most important people and how connected they are within the network. We can see that within this TOP50 network many authors are very well connected, while others, even though they published a lot, are not connected at all. While this representation is ideal for getting a qualitative overview of the people within the field, we will next explore how we can quantitatively assess the importance of people.
Network Analysis: Degree Centrality and Betweenness Centrality
In order to get the best-connected nodes, we will calculate two related parameters for the network: degree_centrality and betweenness_centrality. While the former measures the connections of a node in relation to all its possible connections, the latter measures how often the node lies on the shortest paths between other pairs of nodes. Translating to our case: Who had the most collaborations, and who is essential in connecting to other authors?
Now we can extend our initial TOP10 plot with those two metrics. For this we normalize each parameter by the maximum of the TOP10 authors to get a relative value between 0 and 1:
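A minimal sketch of the two centrality measures and the normalization by the maximum could look like this. The helper name `relative_centralities` is hypothetical, and the three-node graph at the end is a toy example, not the author network.

```python
import networkx as nx


def relative_centralities(graph, authors):
    """Return degree and betweenness centrality of `authors`, scaled to [0, 1]."""
    degree = nx.degree_centrality(graph)
    betweenness = nx.betweenness_centrality(graph)

    def scale(values):
        # Divide by the largest value among the selected authors.
        peak = max(values[a] for a in authors)
        return {a: (values[a] / peak if peak else 0.0) for a in authors}

    return scale(degree), scale(betweenness)


# Toy graph: B connects A and C, which are otherwise unconnected.
toy = nx.Graph([("A", "B"), ("B", "C")])
rel_degree, rel_betweenness = relative_centralities(toy, ["A", "B", "C"])
```

In the toy graph, B has the most connections and is also the only node lying on a shortest path between two others, so it receives the maximum value for both metrics.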
This allows us to verify an important observation that was already present in the CircosPlot: the people with the most papers are not necessarily the ones who are best connected within this network. Here, Rubinsztein-Dunlop, Halina and Chemla, Yann R can be identified as playing an integral part in the author network relative to their publication count.
Pathfinding: How to get in contact with a person of interest
Lastly, we will explore how we can use this network for finding paths, e.g., when you know a certain professor and would like to get to know another one. This is easily done by using all_shortest_paths from the networkx package. As an example, if you happen to know Ha, Taekjip and would like to see the connections to Bustamante, Carlos, we can calculate the shortest paths:
['Ha, Taekjip', 'Chemla, Yann R', 'Bustamante, Carlos']
['Ha, Taekjip', 'Yu, Jin', 'Bustamante, Carlos']
Here, we find two paths: one via Chemla, Yann R, whom we already identified as important in Figure 3, and the other via Yu, Jin. These results can be easily verified in the graphic representation.
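The path search can be reproduced on a small toy graph. The sketch below contains only the four connections that appear in the two reported paths, not the full author network.

```python
import networkx as nx

# Toy graph holding only the edges that occur in the two reported paths.
graph = nx.Graph()
graph.add_edges_from([
    ("Ha, Taekjip", "Chemla, Yann R"),
    ("Chemla, Yann R", "Bustamante, Carlos"),
    ("Ha, Taekjip", "Yu, Jin"),
    ("Yu, Jin", "Bustamante, Carlos"),
])

# all_shortest_paths returns a generator, so we collect it into a list.
paths = list(
    nx.all_shortest_paths(graph, source="Ha, Taekjip", target="Bustamante, Carlos")
)
for path in paths:
    print(path)
```

Because all shortest paths have the same length, the function returns every route that ties for the minimum number of intermediaries, which is useful when one contact is easier to reach than another.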
Here, we used several Python packages to analyze abstracts of journal publications that are related to a certain scientific field. With the help of network analysis, this can give great insight with only a few lines of code. Ultimately, it helps to identify the scientists who are well connected and productive within the field.
However, it is also important to stress the limitations, which arise from the core metric that we used in this analysis: the number of publications. Arguably, this does not reflect how the scientific work was received within the community. Here, considering the number of citations would be beneficial. Looking at Ashkin's publication from 1986, we learn that this work was cited more than 6000 times, making its importance obvious. Other author-level metrics that attempt to measure impact and productivity, such as the h- or i-index, seem to be beneficial choices.
Further analysis could include these additional metrics. Geographic information could be used to visualize the locations of the scientists. Example applications could be to do further network analysis and build a recommender system for scientists.