Data science enthusiast who loves getting lost in Jupyter notebooks đ
Love travel and foodie actvities!
View My LinkedIn Profile
IMPORTANT NOTE!! You may right click on any of the images and click âOpen image in new tabâ to enlarge any of the images below
In this project we are going to play the level of a newly hired healthcare data analyst at a hospital and because our supervisor is busy, they have given us a lot of tasks to handle for them.
The tasks:
admission_type_id
is 1), but stayed less than the average time in the hospitalThis SQL project will show us how versatile this dataset is and tease the possibilities of how much time can be spent on it. Because of all the columns that this dataset contains, there are so many insights and relationships that can be looked at and tested. But because of âjob dutiesâ, weâll just be skimming the tip of the iceberg.
I chose this project because I was interested in a health domain-focused dataset to add to my data repository! Also, because my mother is a nurse, I have always had a curiosity for the health industry at large and always wondered how or if I could make a contribution to it without having to go to nursing or medical school! đ
For this project, we are using this diabetic-focused dataset. This dataset represents 10 years of data (1999-2008) and is specifically geared toward the analysis of factors regarding patient readmission and other outcomes pertaining specifically to diabetic patients. But the version that weâre using is a modified version on Kaggle and we arenât focusing either on readmissions or diabetic-factors to complete our assigned tasks in the previous section.
There are over 40 columns of data, so I will only go over the relevant ones thatâll be most crucial to our data analysis for this project:
patient_nbr
: An individual patientâs identifiertime_in_hospital
: Number of days that a patient has spent in the hospitalnum_procedures
: Number of procedures that a patient has hadnum_lab_procedures
: Number of lab procedures that a patient has hadnum_medications
: Number of times that medications have been administered to a patientNow letâs get to our given tasks and our data analysis!
To start out, first weâre going to create a histogram! This might sound odd to see that because SQL isnât necessarily renowned for its visualization capabilities, but there is a query we can use to produce a histogram.
We use this query to output the histogram. Note that we are using the time_in_hospital
column as weâre trying to determine whether the majority of our patients are spending less than 7 days in the hospital. Knowing that our patients are spending less than 7 days in the hospital can free up beds faster in the event of overcrowding and also save some money for both us and the patient! Letâs print our histogram now.
As we can see by the results, the majority of our patients do indeed spend less than 7 days in our hospital beds!
Now with this line of code weâre going to check out the distinct values (or fields) in the medical_specialty
column using the following query.
I also counted the unique values in the medical_specialty
column and it turns out that we have 73 distinct values.
Now that we know what those are, letâs create a query that groups all the values of medical_specialty
together and find the average num_procedures
per specialty. We also want to count how many patients there are per specialty (please note that one row in this dataset represents one patient).
And here are a few lines of the output
Not only does it look more organized now, but we are closer to our next goal of answering which specialties have the highest number of procedures on average. So letâs say that our supervisor instructs us to filter down to specialties that have at least 50 patients and an average of at least 2.5 procedures per patient for a given specialty. With this we should have a great idea on what specialties to focus on!
So letâs put this concept into a query and see what we get.
Just as planned! We now have 5 specialties that were able to meet our rigorous criteria and as we can see from both the avg_procedures
and rows_count
columns, a lot of patients are treated in these specialties and they get at least 2.5 procedures too. These are amazing results and Iâm quite sure that the supervisor will be delighted with these results!
Next, our supervisor has asked us to check to see if there are any discrepancies in treatment based on race. For this purpose, we will use the num_lab_procedures
that patients from all racial backgrounds are receiving the highest quality treatment. The table that we have been using until now, patient
, doesnât have any information that outlines a personâs racial background. Therefore, we will use another table called demographics
employing the SQL inner join technique along the shared patient_nbr
column to ensure that we can properly align the demographic info of the new table to the parent table. The patient_nbr
column is our primary key for both tables, so we know that the values present are all unique.
For this query, we created a CTE and used it in a view to use for future reference and set up our join and the columns to be displayed from said join. Next we want to take that view and create a query that categorizes our database by race and uses the average of num_lab_procedures
and we can determine whether all of our patients are getting equal treatment.
Here is the query:
And the output:
From these results, we can determine that there arenât any outliers that would indicate the presence of preferential treatment due to racial makeup. Now letâs move further in our analysis.
Next we were asked to see if patients that have a lot of procedures done tend to have longer stay times in the hospital. First, we run a query that gives a sense of the data distribution for the num_lab_procedures
column.
From the output, we can see that num_lab_procedures
has an average value of around 43 procedures. With this knowledge, we can use a series of CASE
statements to categorize our patients based on their num_lab_procedures
value. If a value for num_lab_procedures
falls within 0-24 it will be assigned the tag of âfewâ, for 25-54: âaverageâ, for over 55: âmanyâ.
Now we are going to create a query that creates these tags for every patient, groups all of them by these tags and looks at the average hospital stay per group. Now, letâs look at that query and its output together
From what we have seen, we can now draw a positive correlation between time spent in the hospital and how many procedures that they have. The longer a patient stays at a hospital the more procedures they have and the shorter their stay, the fewer procedures they have.
Next, Iâve been tasked to compile a list of patients who are either African-American or have a value of âUpâ for the metformin
column. Here is a simple query that we can use, making use of the UNION
function. Some lines of the output will also be displayed.
admission_type_id
is 1), but stayed less than the average time in the hospitalTo find these success stories, we have to have something in our query like this: WHERE time_in_hospital < AVG(time_in_hospital)
. But actually, this wouldnât work and it would produce an error because we canât use AVG
in a WHERE
clause, only in a SELECT
clause. So we have to use a subquery to work around this. Here is an example of how we can get what we want:
Subqueries are a useful tool, but they can make things look messy if we have a lot of them to manage. Therefore, we can once again use a CTE to make things more neat. Last time we created a VIEW
, so this time weâre going to use WITH
. Weâll update the query and confirm that the output is still the same as our previous query.
With the help of the WITH
CTE our subquery has now been condensed from (SELECT AVG(time_in_hospital) FROM health)
to just (SELECT * FROM avg_time)
.
And we also have a nice list of âsuccess storyâ patients to provide to our supervisor too!
Next weâve been asked to generate a query that will summarize the results of the top 50 medicated patients along with the number of lab procedures that they had(table will be ordered by num_medications
and afterwards num_lab_procedures
). The output of our queries should produce values in this format:
Using a combination of CONCAT
and CASE
, we will be able to generate multiple summaries like this. Here is the query and the output below:
In conclusion, we set out to complete the tasks that our supervisor instructed us to and weâve done just that. To review what was asked of us, letâs refresh our memories with the below tasks:
admission_type_id
is 1), but stayed less than the average time in the hospitalFrom my own standpoint, the most interesting insights that we can come away with are that we werenât able to find any racial favoritism amongst the patients, the cardiology department had by far the highest amount of admitted patients and that we were able to find a strong correlation between the number of lab procedures done and length of time spent for our patients in the hospital.
I didnât know much about our patients going into this dataset, but now I have a much firmer understanding of them thanks to this brief analysis! Iâll stop here for now, but Iâd definitely like to jump back into this dataset sometime in the near future.
Thank you so much for your time and reading all of this! This was an absolute pleasure to do and if you have any questions, please comment below or please contact me at lance.inimgba@gmail.com or on LinkedIn!
Iâm actively looking for new opportunities in the data science field, so please donât hesitate to contact me if you know of anything out there!