What does it take for a student to study well

Introduction

All students want to do well in exams

But performance depends on a variety of factors. Whether it is attendance, study habits, and access to academic supports.

In this post, I will examine external factors and how they are related to performances in tests.

Approach

I will use the Student Academic Performance Dataset from Kaggle to analyze this. All data processing and visualizations are done in Google Sheets.

I will proceed this in the following steps.

Import and clean data to understand them
Visualize relationships between variables to observe potential correlations
I will summarize findings to answer the question “Which factors are involved for exam performances”

Clean the Data

The dataset is already in .csv file format and is directly imported into Google Sheets.

Assumptions

The following assumptions are made:

Use Exam Scores as Final Scores.
Assume that the scores are marked out of 101. This is the maximum exam score in the dataset.
Hours studied are interpreted as total weekly studied hours. Although the dataset tells us to interpret it daily, some entries exceed 24 hours. So interpreting it daily doesn’t make sense.

Handling Missing Data

Remove all rows where the following entries are empty
- exam scores
- attendance
- hours studied
- tutoring sessions
Entries for Teacher Quality and Access to Resources are allowed to remain blank. A blank response is reasonable. It represents uncertainty in a student’s answer. Also, these are controlled variables.

Variable Selection

Measured Variable
- Exam Scores
Varied Variable
- Attendance
- Hours Studied
- Tutoring Sessions
Controlled Variables
- Teacher Quality
- Access to Resources

A new columns No. is added to index each entry. With a total number of 6607 entries.

//generate indexing column
=SEQUENCE(6607)

Extract relevant columns from the dataset

//choosing relevant columns from entire dataset
=QUERY(raw_dataset,"select A, U, B, C, K, E, M")
//checking for empty entries for relevant columns
=COUNTBLANK(B1:B6608)
//Repeat the same process for rows C,K,E

This will give me the following dataset to work with

Visualize the Data

I analyze each factor individually to examine its relationship with exam performance.

Hours Studies vs. Scores

My hypothesis is

Students who study more tend to achieve higher exam scores

I can use scatter plot to demonstrate this with x-axis being hours studied and y-axis being exam scores.

=QUERY(cleaned_dataset, "SELECT B,C")

The initial plot shows a weak positive correlation. To refine this, I’m going to control two variables

Teacher Quality
Access to Resources

Here are my procedures and results after varying my controlled conditions.

//filtering my dataset with Teacher Quality and Access to Resources = High
=FILTER(clean_dataset,'Cleaned DataSet'!F1:F6608 = "High", 'Cleaned DataSet'!G1:G6608 = "High")

Teacher Quality = High
Access to Resources = High

Teacher Quality = High
Access to Resources = Low

Teacher Quality = Low
Access to Resources = Medium

The correlation is still weak. This suggests that

Study time shows only a slight association with exam scores, and increasing study hours alone does not substantially improve performance, regardless of learning conditions.

Attendance vs. Exam Scores

My hypothesis is

Exam performance are positively correlated to attendance in classes

A scatter pot was created using attendance percentage and exam scores.

The overall relationship is again very weak.

No meaningful trend emerges Hence, I conclude

Simply attending classes does not reliably translate into high exam performance.

Tutoring Sessions vs. Exam Scores

My hypothesis is

More tutoring sessions associates with higher exam scores

I repeat the process as before. However, there is an additional assumption to make here.

The tutoring sessions are measured in discrete counts. Each session has equal number of hours

A scatter plot does not reveal a clear relationship. To understand this pattern better, I use bar charts and a box plots instead.

To proceed, I will

Filter out 2 columns: Exam Scores, Tutoring Sessions
Group exam scores by number of tutoring sessions.
Work out Minimum, 25% Quartile, Average, 75% Quartile, Maximum of Exam Scores for each group.
Present the data with bar chart and box plot

//Filtering out columns 
=QUERY(clean_dataset, "select B,D")

To work out my average, I convert my dataset into a pivot table. Then use it to calculate the average of scores for each number of tutoring sessions.

This can be present in a Bar Chart

The bar chart suggests that exam scores increase with tutoring frequency up to six sessions, after which starts decreasing.

Next I will use relevant values namely,

Quartile Ranges
Minimum
Maximum

To present the data in a box plot format. Here is my procedure in Google Sheets and visualization.

//working out minimum while filtering each session
=MIN(FILTER(examscore,session = G1))
//working out percentiles while filtering each session
=PERCENTILE(FILTER(examscore,session = G1),0.25)
=PERCENTILE(FILTER(examscore,session = G1),0.75)
//working out maximum while filtering each session
=MAX(FILTER(examscore,session = G1))

Here’s my numerical result

With this, I can box plot with 8 boxes

Tutoring exhibits a law of diminishing returns. Moderate tutoring is associated with improved exam performance, but excessive tutoring does not guarantee continued improvement. It reflects over-reliance rather than effective learning. However, overall, the relationship is still very weak

Summarize the Data

Initial visualizations suggests almost no correlation at all between studied-related behaviours and exam performance. To verify these observations, I going to apply hypothesis testing and linear regression.

Before testing, I examined the distribution of exam scores. A histogram of exam scores below shows an approximately normal distribution.

//selecting only exam score data
=QUERY(clean_dataset,"select B")
//B is exam score here

This exam score dataset is a sample data set with

mean = 67.24
variance = 3.89

to 2 significant figures

Hours Studied vs. Scores

I’m going to apply linear regression and hypothesis testing to see if there is any correlation. First, I’m going to controlled

Teaching Quality = High
Access to Resources = High

I can then use XL Miner Analysis Toolpak add on for Google Sheets to get the details I needed. Apply Linear Regression function on this toolpak with 95% confidence level gives me the following results

The R-Square value is 0.22. This means that I can say with 22% of the variation in exam scores is explained by the following linear relationship:

Exam Scores = Studied Hours * 0.29 + 62.62. Where

0.29 = coefficient of Hours Studied
62.62 = coefficient of Intercept

Now I can apply hypothesis testing with

H0 = Studied Hours = 0 – There is no correlation between studied hours and exam scores.
H1 = Studied Hours != 0 – There is a correlation between studied hours and exam scores

Looking at confidence interval, the interval lies between [0.25, 0.34]. Since 0 does not lie in that interval we

reject H0 and accept H1

and concluded that there is a correlation between hours studied and exam scores at a 5% significant test.

Repeating these analysis without controlling variables gives out results as shown below.

The confidence interval is between [0.22,0.32]. 0 Does not lie within the interval so we reject H0 accept H1 and conclude with 95% confidence level that there is a positive correlation between studied hours and exam scores.

Attendance vs. Scores

In a similar manner, I apply hypothesis testing at 95% confidence level. My hypothesis is the following

H0 = Attendance = 0 – There is no correlation between attendance and exam scores.
H1 = Attendance != 0 – There is a correlation between attendance and exam scores

This gives me the following data

The confidence interval is between [0.16,0.21] . Since 0 does not lie in the interval, we reject H0 accept H1 and conclude with 95% confidence level that there is a positive correlation between attendance and exam scores.

When we apply test on linear regression we see that

28% of the variation in Exam Score is explain by the relationship of

Exam Scores = 0.19 * Attendance + 52.14

Tutoring Sessions vs. Scores

We repeat the same procedure as before with the following hypothesis at 5% significance level

H0 = Tutoring Sessions = 0 – There is no correlation between tutoring sessions and exam scores.
H1 = Tutoring Sessions != 0 – There is a correlation between tutoring sessions and exam scores

This time we will proceed this by hand

P-Value Testing

I will proceed in the following steps

Calculate the correlation between Exam Scores and Tutoring Sessions
Calculate T-Statistical Value
Apply two-tailed hypothesis testing

Here is my procedure with

Cells A2:A6608 being Exam Scores
Cells B2:B6608 being Tutoring Sessions

#Full Procedure of P value testing
#Work out Correlation
=CORREL(A2:A6608,B2:B6608) //set this to be at cell D13
#Work out T-Statistical value
=D13*SQRT((A2:A6608-2))/(1-D13^2)) //set this to be at cell D14
#Calculate p-value using 2-tail hypothesis testing
=T.DIST.2T(D14,COUNT(A2:A6608-2))

This gives out p-value to be 0.049. Since 0.049 is less than 0.05, we reject H0, accept H1 and concluded that there is sufficient evidence to say that there is correlation between Exam Scores and Tutoring Sessions.

Let’s check this result using confidence interval as well

Confidence Interval Testing

I will proceed in the following steps

Calculate the slope of the dataset
Calculate the Residue Standard Error (RSE)
Calculate Sum of Squares of Deviation (Sxx)
Calculate the Standard Error (SE)
Calculate T-Test Value
Calculate Margin Error
Calculate Confidence Interval

#Calculate Slope 
=SLOPE(A2:A6608,B2:B6608) //set to D19
#Calculating RSE
=STEYX(A2:A6608,B2:B6608) //set to D21
#Calculate Sxx
=DEVSQ(B2:B6608) //set to F19
#Calculate SE
=D21/SQRT(F19) //set to D23
#Calculating T-Test values
=T.INV.2T(0.05, 6605) //set to F21
#Margin Error (Standard Error * T Test Value)
=D23*F21 //set to D23
#Confidence Interval (Gradient +/- Margin Error)
=D19+D25 , D19 - D25

This gives us the confidence interval of [0.42,0.57]. Since 0 does not lie in the interval, we reject H0, Accept H1 and concluded that there is a correlation. Between tutoring sessions and exam scores

Linear Regression between this dataset

To work out the coefficient and intercept. I use

=LINEST(A2:A6608,B2:B6608)

To work out its R-Squared value, we squared the correlation value. This gives us 0.02 to 2 decimal places.

This is to say that only 2% of variation in Exams Scores is related to Tutoring Sessions in the following way:

Exam Scores = 0.49 * Tutoring Sessions + 66.50

Key Interpretation

Each Studied habits are statistically significant, but their predictive power is weak.

This supports the earlier visualization results:

Studied-Habits helps up to a point
The overall effect size is small
Other factors play a much larger role in exam performance

Statistical significance here reflects sample size more than practical impact.

Conclusion

We can conclude that there is a statistically significant relationship between exam scores and the factors analyzed. However, these relationships are weak in explanatory power, as indicated by their low R-squared values.

Attending lectures, participating in tutoring sessions, and studying more are all associated with higher exam scores. However, none of these factors alone strongly predicts performance. Exam outcomes are influenced by a combination of academic, environmental, and individual factors beyond the variables examined in this analysis.

Disclosures

AI tools were used to assist with outlining, clarification, and editing suggestions.
All ideas, interpretations, and final writing decisions are my own.

References

Data Rockie – Data Science Bootcamp
Student Academic Dataset By A.

What does it take for a student to study well

Introduction

Approach

Clean the Data

Assumptions

Handling Missing Data

Variable Selection

Refinements

Visualize the Data

Hours Studies vs. Scores

Attendance vs. Exam Scores

Tutoring Sessions vs. Exam Scores

Summarize the Data

Hours Studied vs. Scores

Attendance vs. Scores

Tutoring Sessions vs. Scores

P-Value Testing

Confidence Interval Testing

Linear Regression between this dataset

Key Interpretation

Conclusion

Disclosures

References

Comments

Leave a ReplyCancel reply

More posts

The Almnack of Naval Ravikant: An approach for stoicism

Temples in Thailand: When Thai Culture and Photography Combined.

Building a Second Brain with AI

The Four Months Worth Experience of Writing

Discover more from Quiet Horizon