Coronavirus, the evolution through decades

Textual data analysis of Corona research over decades

Abhinav Pathak
5 min read · Mar 8, 2021

Scientists first identified a human coronavirus in 1965. It caused a common cold. Later that decade, researchers found a group of similar human and animal viruses and named them after their crown-like appearance

Between 1965 and 2020, various coronaviruses were identified across the globe, studied, and linked to sources such as bats, turkeys, and camels, while several techniques were proposed, tested, and developed

Using 5GB of textual data from over 5K Coronavirus research papers, this analysis aims to understand how the research around Coronavirus has evolved over the decades

Let’s begin!

Data is downloaded from this link

There are more than 5K research papers (PDFs) published over 55 years. The text of each PDF is extracted using the PyPDF module and appended to the ‘docs’ list
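A minimal sketch of this import step is shown here. It assumes the `pypdf` package (`PdfReader` and `page.extract_text()`); the article only says "PyPDF", so the exact package and version used originally are unknown. The names `extract_pdf_text` and `load_docs` are hypothetical:

```python
from pathlib import Path


def extract_pdf_text(path):
    """Concatenate the text of every page of one PDF (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily, only when a real PDF is read

    reader = PdfReader(str(path))
    # extract_text() can return an empty result for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def load_docs(folder, extract=extract_pdf_text):
    """Return (fnames, docs): file names and extracted text, sorted by file name."""
    fnames, docs = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        try:
            text = extract(path)
        except Exception as err:  # skip corrupt or unreadable PDFs
            print(f"skipping {path.name}: {err}")
            continue
        fnames.append(path.name)
        docs.append(text)
    return fnames, docs
```

Keeping the file names next to the text is a deliberate choice here, because the year is later read from the file name. The `extract` argument makes it possible to swap in another PDF reader if some files fail to parse.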

Quick data cleaning and exploration:

Number of papers on Coronavirus from 1968 to 2020

Although the research has grown steadily since 1968, the number of published papers on coronavirus has picked up pace since the identification of the deadly SARS Coronavirus in 2002, as the spike shows

Let's peek into a single document out of the 5,329 research papers. This snapshot shows sample text from one of the first research papers, published in 1967

Load functions for Textual data processing:
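The original helper functions are not shown here, so the following is only a rough sketch of what a cleaning step could look like. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for illustration; a real run would likely use a fuller stop-word list:

```python
import re

# Deliberately small, illustrative stop-word list (an assumption, not the original one)
STOP_WORDS = frozenset({
    "the", "a", "an", "of", "and", "in", "to", "is", "was",
    "for", "with", "by", "on", "that", "were",
})


def clean_text(text):
    """Lowercase, replace punctuation with spaces, drop stop words and 1-char tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, whitespace
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]
    return " ".join(tokens)
```

Replacing punctuation with a space (rather than deleting it) turns "SARS-CoV" into the two tokens "sars cov", which is what lets it show up later as a bigram.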

Data aggregation by year:

To understand the progression of research, data is transformed such that one doc represents all research work produced in a single year

Conveniently, the first 4 characters of each PDF Filename (from the link) contain the year number, which makes the job a little easier. year_from_fname is the list of all the years obtained from file names

This operation changes the shape of our corpus from 5,329 documents to just 53 documents (1967–2020)
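The aggregation described above can be sketched as follows. `year_from_fname` is the name used in the text; `aggregate_by_year` and the sample file names are hypothetical:

```python
from collections import defaultdict


def aggregate_by_year(fnames, docs):
    """Merge all papers of the same year into one document, keyed by year."""
    # The first 4 characters of each file name hold the publication year
    year_from_fname = [int(name[:4]) for name in fnames]

    grouped = defaultdict(list)
    for year, text in zip(year_from_fname, docs):
        grouped[year].append(text)

    # One long string per year, in chronological order
    return {year: " ".join(texts) for year, texts in sorted(grouped.items())}


docs_by_year = aggregate_by_year(
    ["1967_a.pdf", "1967_b.pdf", "2003_c.pdf"],  # made-up file names
    ["first paper", "second paper", "sars paper"],
)
```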

The growth in research content in the last 20 years is discernible from the size of the documents towards the tail of the list (printed below)

Term Frequency from Corpus:

The DataFrame is transformed such that each column represents a year and the row index holds the features/tokens
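One way to build such a table is sketched below, assuming pandas and a plain whitespace tokenizer. The function name `term_frequency_frame` is hypothetical, and the original may well have used a vectorizer instead:

```python
from collections import Counter

import pandas as pd


def term_frequency_frame(docs_by_year):
    """Rows are tokens, columns are years, values are raw counts."""
    counts = {year: dict(Counter(text.split())) for year, text in docs_by_year.items()}
    # Tokens missing in a given year become NaN, so fill with 0 before casting
    return pd.DataFrame(counts).fillna(0).astype(int)


tf = term_frequency_frame({
    1967: "coronavirus oc43 coronavirus",
    2003: "sars coronavirus",
})
```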

TFIDF metric from Corpus:

Analysis through the TFIDF metric is much more useful for the stated goal than raw term frequency. Rather than simply focusing on the most common mentions every year, we need to pinpoint noteworthy words/phrases to grasp the true essence of Corona research

Without the limit of 10K features, we end up with 2.5M additional features, which don’t add proportional value to the findings and may also cause a memory error

Bigrams are included as well, to account for phrases like ‘SARS COV’ and ‘rotavirus infection’. Limits on the maximum and minimum document frequency are also set

Top 15 salient words each year using TFIDF metric:

Let us look at the output of the TFIDF metric and the top features produced. The snapshot below reflects data from 1969–1983 only, for readability
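Given a TFIDF table with tokens as rows and years as columns, the top features per year can be pulled out with pandas' `nlargest`. This is a sketch with made-up scores, and `top_terms_per_year` is a hypothetical helper name:

```python
import pandas as pd


def top_terms_per_year(tfidf, n=15):
    """For each year (column), list the n tokens with the highest TFIDF score."""
    return pd.DataFrame(
        {year: tfidf[year].nlargest(n).index.tolist() for year in tfidf.columns}
    )


# Tiny illustrative table (scores are invented)
scores = pd.DataFrame(
    {1972: [0.9, 0.1, 0.4], 1973: [0.0, 0.8, 0.3]},
    index=["hemadsorption", "turkey", "oc43"],
)
top = top_terms_per_year(scores, n=2)
```

Each column of the result reads top to bottom as the ranked terms for that year.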

Validation of references from TFIDF results:

1972: The hemadsorption technique provided a simple method for performing oc43 Coronavirus neutralization tests

1973: Particles morphologically similar to coronaviruses were found in bluecomb-infected turkey

1975: The haemagglutination of Connecticut was detectable after sucrose gradient purification whereas that of Massachusetts required both the purification step and incubation with the enzyme phospholipase C to reveal it

Mentions of Coronavirus family members (using TFIDF):

Each member of the Coronavirus family was detected at a different time and in a different part of the world. 229E and OC43 were detected first, followed by SARS-CoV-1 in 2002, HKU1 in Jan 2004 in Hong Kong, NL63 in late 2004 in the Netherlands, MERS-CoV in 2012, and SARS-CoV-2 very recently
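To trace such mentions over time, the rows for the tokens of interest can be picked from the TFIDF table and transposed so that years form the index. The token spellings below are assumptions, since they depend on how the text was cleaned, and `mentions_over_time` is a hypothetical name:

```python
import pandas as pd


def mentions_over_time(tfidf, terms):
    """Return a year-indexed table with one column per requested term."""
    # reindex keeps the requested order; terms absent from the vocabulary become 0
    return tfidf.reindex(terms).fillna(0.0).T


# Made-up scores for illustration
scores = pd.DataFrame(
    {1972: [0.7, 0.0], 2003: [0.1, 0.6]},
    index=["oc43", "sars cov"],
)
family = mentions_over_time(scores, ["oc43", "nl63"])
```

Calling `family.plot()` on the result would then draw one line per term, provided matplotlib is installed.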

Mentions of a few buzzwords around Corona (using TFIDF)

Line plot showing important discussion

Most mentioned Words/phrases (using Term freq)

Most Mentioned Bigrams (using Term freq)

These bigram counts also show one of the disadvantages of using the Term Frequency metric instead of the TFIDF metric. Useless tokens like ‘doi org’, ‘org 10’, and ‘dx doi’ are largely taken care of automatically in TFIDF, since terms that recur across many documents are down-weighted, but they have to be removed separately when using term frequency
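When working with raw term frequency, a simple workaround is to drop a hand-picked list of noise tokens from the table. The three tokens below are the ones named above; the function name `drop_noise_tokens` and the counts are made up for illustration:

```python
import pandas as pd

NOISE_TOKENS = ("doi org", "org 10", "dx doi")


def drop_noise_tokens(tf, noise=NOISE_TOKENS):
    """Return a copy of the term-frequency table without the listed tokens."""
    # errors="ignore" skips tokens that are not present in the index
    return tf.drop(index=list(noise), errors="ignore")


tf_bigrams = pd.DataFrame(
    {2019: [120, 45, 30]},
    index=["doi org", "sars cov", "dx doi"],
)
cleaned = drop_noise_tokens(tf_bigrams)
```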


To end the analysis, here are a few word clouds, one for each decade from 1970 to 2020. Interestingly, a lot of insights can be extracted through Word Clouds alone





Thanks for reading through. In addition to the analysis presented above, the data can be processed further to obtain more refined insights, and other techniques like Topic modeling (LDA) can be applied as well