Introduction
All students want to do well in exams
But performance depends on a variety of factors. Whether it is attendance, study habits, and access to academic supports.
In this post, I will examine external factors and how they are related to performances in tests.
Approach
I will use the Student Academic Performance Dataset from Kaggle to analyze this. All data processing and visualizations are done in Google Sheets.
I will proceed this in the following steps.
- Import and clean data to understand them
- Visualize relationships between variables to observe potential correlations
- I will summarize findings to answer the question “Which factors are involved for exam performances”
Clean the Data
The dataset is already in .csv file format and is directly imported into Google Sheets.

Assumptions
The following assumptions are made:
- Use Exam Scores as Final Scores.
- Assume that the scores are marked out of 101. This is the maximum exam score in the dataset.
- Hours studied are interpreted as total weekly studied hours. Although the dataset tells us to interpret it daily, some entries exceed 24 hours. So interpreting it daily doesn’t make sense.
Handling Missing Data
- Remove all rows where the following entries are empty
- exam scores
- attendance
- hours studied
- tutoring sessions
- Entries for Teacher Quality and Access to Resources are allowed to remain blank. A blank response is reasonable. It represents uncertainty in a student’s answer. Also, these are controlled variables.
Variable Selection
- Measured Variable
- Exam Scores
- Varied Variable
- Attendance
- Hours Studied
- Tutoring Sessions
- Controlled Variables
- Teacher Quality
- Access to Resources
Refinements
- A new columns No. is added to index each entry. With a total number of 6607 entries.
//generate indexing column
=SEQUENCE(6607)
- Extract relevant columns from the dataset
//choosing relevant columns from entire dataset
=QUERY(raw_dataset,"select A, U, B, C, K, E, M")
//checking for empty entries for relevant columns
=COUNTBLANK(B1:B6608)
//Repeat the same process for rows C,K,E
This will give me the following dataset to work with

Visualize the Data
I analyze each factor individually to examine its relationship with exam performance.
Hours Studies vs. Scores
My hypothesis is
Students who study more tend to achieve higher exam scores
I can use scatter plot to demonstrate this with x-axis being hours studied and y-axis being exam scores.
=QUERY(cleaned_dataset, "SELECT B,C")

The initial plot shows a weak positive correlation. To refine this, I’m going to control two variables
- Teacher Quality
- Access to Resources
Here are my procedures and results after varying my controlled conditions.
//filtering my dataset with Teacher Quality and Access to Resources = High
=FILTER(clean_dataset,'Cleaned DataSet'!F1:F6608 = "High", 'Cleaned DataSet'!G1:G6608 = "High")

Access to Resources = High

Access to Resources = Low

Access to Resources = Medium
The correlation is still weak. This suggests that
Study time shows only a slight association with exam scores, and increasing study hours alone does not substantially improve performance, regardless of learning conditions.
Attendance vs. Exam Scores
My hypothesis is
Exam performance are positively correlated to attendance in classes
A scatter pot was created using attendance percentage and exam scores.

The overall relationship is again very weak.

Access to Resources = High

Access to Resources = Medium

Access to Resources = Low
No meaningful trend emerges Hence, I conclude
Simply attending classes does not reliably translate into high exam performance.
Tutoring Sessions vs. Exam Scores
My hypothesis is
More tutoring sessions associates with higher exam scores
I repeat the process as before. However, there is an additional assumption to make here.
The tutoring sessions are measured in discrete counts. Each session has equal number of hours

A scatter plot does not reveal a clear relationship. To understand this pattern better, I use bar charts and a box plots instead.
To proceed, I will
- Filter out 2 columns: Exam Scores, Tutoring Sessions
- Group exam scores by number of tutoring sessions.
- Work out Minimum, 25% Quartile, Average, 75% Quartile, Maximum of Exam Scores for each group.
- Present the data with bar chart and box plot
//Filtering out columns
=QUERY(clean_dataset, "select B,D")
To work out my average, I convert my dataset into a pivot table. Then use it to calculate the average of scores for each number of tutoring sessions.

This can be present in a Bar Chart

The bar chart suggests that exam scores increase with tutoring frequency up to six sessions, after which starts decreasing.
Next I will use relevant values namely,
- Quartile Ranges
- Minimum
- Maximum
To present the data in a box plot format. Here is my procedure in Google Sheets and visualization.
//working out minimum while filtering each session
=MIN(FILTER(examscore,session = G1))
//working out percentiles while filtering each session
=PERCENTILE(FILTER(examscore,session = G1),0.25)
=PERCENTILE(FILTER(examscore,session = G1),0.75)
//working out maximum while filtering each session
=MAX(FILTER(examscore,session = G1))
Here’s my numerical result

With this, I can box plot with 8 boxes

Tutoring exhibits a law of diminishing returns. Moderate tutoring is associated with improved exam performance, but excessive tutoring does not guarantee continued improvement. It reflects over-reliance rather than effective learning. However, overall, the relationship is still very weak
Summarize the Data
Initial visualizations suggests almost no correlation at all between studied-related behaviours and exam performance. To verify these observations, I going to apply hypothesis testing and linear regression.
Before testing, I examined the distribution of exam scores. A histogram of exam scores below shows an approximately normal distribution.
//selecting only exam score data
=QUERY(clean_dataset,"select B")
//B is exam score here

This exam score dataset is a sample data set with
- mean = 67.24
- variance = 3.89
to 2 significant figures
Hours Studied vs. Scores
I’m going to apply linear regression and hypothesis testing to see if there is any correlation. First, I’m going to controlled
- Teaching Quality = High
- Access to Resources = High
I can then use XL Miner Analysis Toolpak add on for Google Sheets to get the details I needed. Apply Linear Regression function on this toolpak with 95% confidence level gives me the following results

The R-Square value is 0.22. This means that I can say with 22% of the variation in exam scores is explained by the following linear relationship:
Exam Scores = Studied Hours * 0.29 + 62.62. Where
- 0.29 = coefficient of Hours Studied
- 62.62 = coefficient of Intercept
Now I can apply hypothesis testing with
- H0 = Studied Hours = 0 – There is no correlation between studied hours and exam scores.
- H1 = Studied Hours != 0 – There is a correlation between studied hours and exam scores
Looking at confidence interval, the interval lies between [0.25, 0.34]. Since 0 does not lie in that interval we
- reject H0 and accept H1
and concluded that there is a correlation between hours studied and exam scores at a 5% significant test.
Repeating these analysis without controlling variables gives out results as shown below.

The confidence interval is between [0.22,0.32]. 0 Does not lie within the interval so we reject H0 accept H1 and conclude with 95% confidence level that there is a positive correlation between studied hours and exam scores.
Attendance vs. Scores
In a similar manner, I apply hypothesis testing at 95% confidence level. My hypothesis is the following
- H0 = Attendance = 0 – There is no correlation between attendance and exam scores.
- H1 = Attendance != 0 – There is a correlation between attendance and exam scores
This gives me the following data

The confidence interval is between [0.16,0.21] . Since 0 does not lie in the interval, we reject H0 accept H1 and conclude with 95% confidence level that there is a positive correlation between attendance and exam scores.
When we apply test on linear regression we see that
28% of the variation in Exam Score is explain by the relationship of
Exam Scores = 0.19 * Attendance + 52.14
Tutoring Sessions vs. Scores
We repeat the same procedure as before with the following hypothesis at 5% significance level
- H0 = Tutoring Sessions = 0 – There is no correlation between tutoring sessions and exam scores.
- H1 = Tutoring Sessions != 0 – There is a correlation between tutoring sessions and exam scores
This time we will proceed this by hand
P-Value Testing
I will proceed in the following steps
- Calculate the correlation between Exam Scores and Tutoring Sessions
- Calculate T-Statistical Value
- Apply two-tailed hypothesis testing
Here is my procedure with
- Cells A2:A6608 being Exam Scores
- Cells B2:B6608 being Tutoring Sessions
#Full Procedure of P value testing
#Work out Correlation
=CORREL(A2:A6608,B2:B6608) //set this to be at cell D13
#Work out T-Statistical value
=D13*SQRT((A2:A6608-2))/(1-D13^2)) //set this to be at cell D14
#Calculate p-value using 2-tail hypothesis testing
=T.DIST.2T(D14,COUNT(A2:A6608-2))
This gives out p-value to be 0.049. Since 0.049 is less than 0.05, we reject H0, accept H1 and concluded that there is sufficient evidence to say that there is correlation between Exam Scores and Tutoring Sessions.
Let’s check this result using confidence interval as well
Confidence Interval Testing
I will proceed in the following steps
- Calculate the slope of the dataset
- Calculate the Residue Standard Error (RSE)
- Calculate Sum of Squares of Deviation (Sxx)
- Calculate the Standard Error (SE)
- Calculate T-Test Value
- Calculate Margin Error
- Calculate Confidence Interval
#Calculate Slope
=SLOPE(A2:A6608,B2:B6608) //set to D19
#Calculating RSE
=STEYX(A2:A6608,B2:B6608) //set to D21
#Calculate Sxx
=DEVSQ(B2:B6608) //set to F19
#Calculate SE
=D21/SQRT(F19) //set to D23
#Calculating T-Test values
=T.INV.2T(0.05, 6605) //set to F21
#Margin Error (Standard Error * T Test Value)
=D23*F21 //set to D23
#Confidence Interval (Gradient +/- Margin Error)
=D19+D25 , D19 - D25
This gives us the confidence interval of [0.42,0.57]. Since 0 does not lie in the interval, we reject H0, Accept H1 and concluded that there is a correlation. Between tutoring sessions and exam scores
Linear Regression between this dataset
To work out the coefficient and intercept. I use
=LINEST(A2:A6608,B2:B6608)
To work out its R-Squared value, we squared the correlation value. This gives us 0.02 to 2 decimal places.
This is to say that only 2% of variation in Exams Scores is related to Tutoring Sessions in the following way:
Exam Scores = 0.49 * Tutoring Sessions + 66.50
Key Interpretation
Each Studied habits are statistically significant, but their predictive power is weak.
This supports the earlier visualization results:
- Studied-Habits helps up to a point
- The overall effect size is small
- Other factors play a much larger role in exam performance
Statistical significance here reflects sample size more than practical impact.
Conclusion
We can conclude that there is a statistically significant relationship between exam scores and the factors analyzed. However, these relationships are weak in explanatory power, as indicated by their low R-squared values.
Attending lectures, participating in tutoring sessions, and studying more are all associated with higher exam scores. However, none of these factors alone strongly predicts performance. Exam outcomes are influenced by a combination of academic, environmental, and individual factors beyond the variables examined in this analysis.
Disclosures
AI tools were used to assist with outlining, clarification, and editing suggestions.
All ideas, interpretations, and final writing decisions are my own.
References
Data Rockie – Data Science Bootcamp
Student Academic Dataset By A.


Leave a Reply