Covid 19 Data Exploration

I have taken the covid 19 data from public websites. The dataset had information about daily cases, daily deaths, daily vaccinations for every country. There are two tables – coviddata and covidprevention. The former table contains details about the spread of covid 19 infections over time. The latter table contains data about the tests conducted and vaccinations done over time. In this exercise, I had explored the data in these two tables to gather some information.

Lets start by looking at the columns present in the coviddata table:

SELECT *
FROM SQLPortfolio.dbo.coviddata;

I see that there are some useful information in this table with which we can perform further analysis.

SELECT date, location, total_cases, new_cases, total_deaths, new_deaths, population
FROM SQLPortfolio.dbo.coviddata
ORDER BY 2,1;

Using these columns, let’s see how the data for cases vs deaths look like: (the condition “continent IS NOT NULL” is used to get the details of only the countries because the data has both countries and continents in the location column)

SELECT date, location, total_cases, total_deaths, (total_deaths / total_cases) * 100 AS death_vs_cases_percent
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
ORDER BY 2,1;

Next, I drilled down to my country, India. The death to cases percent is a minimal 1.3 % recently.

SELECT date, location, total_cases, total_deaths, (total_deaths / total_cases) * 100 AS death_vs_cases_percent
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
AND location = 'India'
ORDER BY 2,1;

Let’s have a look at the Cases vs Population ( I feel that this is the infection rate). India has around 2.7% of the population infected.

SELECT date, location, total_cases, population, (total_cases / population ) * 100 AS percent_of_population_covid
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
AND location = 'India'
ORDER BY 2,1;

Now let’s find the country with the highest infection rate. The tiny country Andorra which is located near Spain and France tops the list with 38.6% infection rate

SELECT location, MAX(total_cases) as latest_total, population, (MAX(total_cases) / population ) * 100 AS percent_of_population_covid
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
GROUP BY location, population
ORDER BY percent_of_population_covid DESC;

Next, let’s jump on to finding which is the country that has the highest death rate. Peru tops the list.

SELECT location, MAX(CAST(total_deaths AS INT)) as latest_total_deaths, population, (MAX(CAST(total_deaths AS INT))/population)*100 death_rate
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
GROUP BY location, population
ORDER BY death_rate DESC;

Let’s repeat the same activity to find the country with most number of deaths. I guess this will a country with a large population. The United States top this list with over 900k deaths, followed by Brazil and India

SELECT location, MAX(CAST(total_deaths AS INT)) as latest_total_deaths
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
GROUP BY location
ORDER BY latest_total_deaths DESC;

Let’s try to dig more into the data. I will start with finding the maximum number of cases in a single day worldwide. The number is huge – it’s a humungous 3.7 million single day spike that was the highest.

SELECT MAX(new_c.total_new_cases) AS highest_new_cases_per_day FROM
(SELECT date, SUM(new_cases) AS total_new_cases
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL
GROUP BY date
) AS new_c;

I tried to validate the information between the new_cases and total_cases columns. I wrote a query to calculate the running total to validate the same

SELECT date, location, SUM(new_cases) OVER (PARTITION BY location ORDER BY location,date) AS Running_total_cases, total_cases
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL;

Let’s move on to explore the covidprevention table

SELECT *
FROM SQLPortfolio.dbo.covidprevention
;

Which country (with population more than a million people) has the most Percentage of people partially vaccinated? UAE it is, they have partially vaccinated almost all of their people

SELECT location, MAX(CAST(people_vaccinated AS INT)) AS partially_vaccinated, population, (MAX(CAST(people_vaccinated AS INT)) / population) * 100 AS partially_vaccinated_people
FROM SQLPortfolio.dbo.covidprevention
WHERE continent IS NOT NULL
AND population > 1000000
GROUP BY location, population
ORDER BY 4 DESC;

I repeated the same activity as above to find the country (with population more than a million people) has the most Percentage of people fully vaccinated? It’s UAE again but closely followed by Portugal

SELECT location, MAX(CAST(people_fully_vaccinated AS INT)) AS fully_vaccinated, population, (MAX(CAST(people_fully_vaccinated AS INT)) / population) * 100 AS fully_vaccinated_people
FROM SQLPortfolio.dbo.covidprevention
WHERE continent IS NOT NULL
AND population > 1000000
GROUP BY location, population
ORDER BY 4 DESC;

Next, I wanted to dig into the vaccination trend for India. I wrote a query to find the percentage of the Indian population vaccinated over time. I used a temp table as I had to reuse a aggregated field in the same query.

DROP TABLE IF EXISTS #temp_vac;
CREATE TABLE #temp_vac
(
date datetime,
location nvarchar(255),
cumulative_total_vac float,
popu float
)

INSERT INTO #temp_vac
SELECT date, location, SUM(CAST(new_vaccinations AS FLOAT)) OVER (PARTITION BY location ORDER BY location,date) AS Running_total_vacc, population
FROM SQLPortfolio.dbo.covidprevention
WHERE continent IS NOT NULL
AND location = 'India'
SELECT date, location, cumulative_total_vac, popu, (cumulative_total_vac/popu)*100 AS percent_run_total_vac
FROM #temp_vac;

After this, I was researching of an alertnative way to do this without the temp table. That’s when I learnt about CTE – Common Table Expressions in MS SQL

WITH running_total_vaccinated_cte (date, location, run_total_vac, popu)
AS
(
SELECT date, location, SUM(CAST(new_vaccinations AS FLOAT)) OVER (PARTITION BY location ORDER BY location,date) AS Running_total_vacc, population
FROM SQLPortfolio.dbo.covidprevention
WHERE continent IS NOT NULL
AND location = 'India'
)
SELECT date, location, run_total_vac, popu, (run_total_vac/popu)*100 AS percent_run_total_vac
FROM running_total_vaccinated_cte;

To use the new SQL functions I learnt, I found out how much were the number of new cases in India in the top 5% of the days with most cases. There were more than 259k cases in the top 5% of the days with most number of cases.

SELECT PERCENTILE_CONT(0.95) WITHIN GROUP(ORDER BY new_cases) OVER (PARTITION BY location) as 'Top_5_percentile'
FROM SQLPortfolio.dbo.coviddata
WHERE continent IS NOT NULL and location = 'India';

Next, I wanted to use the ROW_NUMBER() function to add ranks to the list of countries with the most number of cases:

SELECT ROW_NUMBER() OVER (ORDER BY scd.sum_cases DESC) AS 'Rank',scd.loc AS 'Country',scd.sum_cases AS 'Total Cases'
FROM (
SELECT cd.location AS loc, SUM(cd.new_cases) AS sum_cases
FROM SQLPortfolio.dbo.coviddata AS cd
WHERE continent IS NOT NULL
GROUP BY cd.location) AS scd;

Next is a query to find the 7 day moving average for the number of cases in India:

SELECT
AVG(new_cases) OVER (ORDER BY s.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS '7DayMA',
new_cases as 'TotalNewCases',
date
FROM	SQLPortfolio.dbo.coviddata AS s
WHERE location = 'India'
GROUP BY date, new_cases
ORDER BY date ASC;

I tried a more complex query to find the moving average over a period of 7 weeks for the cases in India:

SELECT
AVG(WeeklyCases) OVER (ORDER BY YNo,WNo ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS '7WMA',
WeeklyCases as 'TotalNewCases',
YNo,
WNo
FROM (
SELECT	SUM(s.new_cases) 'WeeklyCases',
DATEPART(ww, s.date) AS WNo,
DATEPART(yyyy,s.date) AS YNo
FROM	SQLPortfolio.dbo.coviddata AS s
WHERE	location = 'India'
GROUP BY DATEPART(ww, s.date), DATEPART(yyyy,s.date)
) AS s
GROUP BY YNo,WNo, WeeklyCases
ORDER BY YNo ASC,WNo ASC;

I applied the same logic as above to find the weekly running total number of cases in India

SELECT
SUM(WeeklyCases) OVER (ORDER BY YNo,WNo ROWS UNBOUNDED PRECEDING) AS 'Cummulative Total',
WeeklyCases as 'TotalNewCases',
YNo,
WNo
FROM (
SELECT	SUM(s.new_cases) 'WeeklyCases',
DATEPART(ww, s.date) AS WNo,
DATEPART(yyyy,s.date) AS YNo
FROM	SQLPortfolio.dbo.coviddata AS s
WHERE	location = 'India'
GROUP BY DATEPART(ww, s.date), DATEPART(yyyy,s.date)
) AS s
GROUP BY YNo,WNo, WeeklyCases
ORDER BY YNo ASC,WNo ASC;

That’s it for now. I will keep this page update whenever I get new ideas or learn something new with SQL

Liked my work?