Introduction
Gaming has become a widespread activity across age groups, raising questions about its effects on health, academic performance, and work performance.
This analysis explores the relationship between gaming behavior and academic, work, and health outcomes. It then applies machine learning models to predict academic and work performance.
Approach
This project uses the Gaming and Mental Health Dataset from Kaggle and analyzes it in Python. The workflow consists of four stages:
- Data Cleaning: Load the dataset, inspect its structure, and address missing values
- Data Visualization: Explore relationships using various plots
- Data Summarization: Verify relationships through hypothesis testing
- Data Prediction: Build machine learning models to estimate academic and work performance
Cleaning the Data
I begin by importing the necessary Python libraries for data manipulation, visualization, statistical analysis, and machine learning, and then load the dataset for inspection.
# Load packages
# pandas: works like Excel in Python
import pandas as pd
# numpy: handles numbers and math more efficiently
import numpy as np
# matplotlib: basic plotting engine
import matplotlib.pyplot as plt
# seaborn: advanced plotting engine
import seaborn as sns
# scipy.stats: statistical tests
from scipy import stats
# scikit-learn: machine learning toolkit
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

# Import the dataset
df = pd.read_csv('Dataset/Gaming and Mental Health.csv')
This gives us the following dataset:

Data Inspection
Let’s inspect the data first.
# Inspect the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 27 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   record_id                         1000 non-null   object
 1   age                               1000 non-null   int64
 2   gender                            1000 non-null   object
 3   daily_gaming_hours                1000 non-null   float64
 4   game_genre                        1000 non-null   object
 5   primary_game                      1000 non-null   object
 6   gaming_platform                   1000 non-null   object
 7   sleep_hours                       1000 non-null   float64
 8   sleep_quality                     1000 non-null   object
 9   sleep_disruption_frequency        1000 non-null   object
 10  academic_work_performance         1000 non-null   object
 11  grades_gpa                        754 non-null    float64
 12  work_productivity_score           674 non-null    float64
 13  mood_state                        1000 non-null   object
 14  mood_swing_frequency              1000 non-null   object
 15  withdrawal_symptoms               1000 non-null   bool
 16  loss_of_other_interests           1000 non-null   bool
 17  continued_despite_problems        1000 non-null   bool
 18  eye_strain                        1000 non-null   bool
 19  back_neck_pain                    1000 non-null   bool
 20  weight_change_kg                  1000 non-null   float64
 21  exercise_hours_weekly             1000 non-null   float64
 22  social_isolation_score            1000 non-null   int64
 23  face_to_face_social_hours_weekly  1000 non-null   float64
 24  monthly_game_spending_usd         1000 non-null   float64
 25  years_gaming                      1000 non-null   int64
 26  gaming_addiction_risk_level       1000 non-null   object
dtypes: bool(5), float64(8), int64(3), object(11)
memory usage: 176.9+ KB
The dataset contains 1,000 records and 27 variables, including behavioral, health, academic, and work-related measures.
Two variables contain missing values:
- work_productivity_score
- grades_gpa
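These rows are kept in the dataset. Rows with missing values are filtered out only where a specific analysis requires it, and a later section attempts to predict the missing values using machine learning.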
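The missing-value counts can also be read off directly rather than from the `df.info()` listing. The following is a minimal sketch on a hypothetical miniature table (`toy` and its values are made up for illustration); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: two columns with gaps, one without
toy = pd.DataFrame({
    "daily_gaming_hours": [2.0, 5.5, 8.0, 1.0],
    "grades_gpa": [3.4, np.nan, 2.9, np.nan],
    "work_productivity_score": [np.nan, 7.0, 6.0, 8.0],
})

# Count missing entries per column and keep only the columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Given the non-null counts listed above (754 and 674 out of 1,000), the same call on the full dataset should report 246 missing grades_gpa values and 326 missing work_productivity_score values.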
Variable Selection
Measured Variables
- Health factors (both physical and mental)
  - eye_strain
  - back_neck_pain
  - mood_state
  - mood_swing_frequency
  - sleep_quality
  - withdrawal_symptoms
- Academic and work life factors
  - academic_work_performance
  - grades_gpa
  - work_productivity_score
Varied Variables
- daily_gaming_hours
- years_gaming
- gaming_addiction_risk_level
Controlled Variables
- age
- sleep_hours
- exercise_hours_weekly
- social_isolation_score
- face_to_face_social_hours_weekly
- loss_of_other_interests
- sleep_disruption_frequency
Excluded Variables
These variables are excluded because they are irrelevant or their meaning is unclear:
- gender
- game_genre
- primary_game
- gaming_platform
- continued_despite_problems
- weight_change_kg
This list serves as a starting point and may be refined as the analysis develops.
Visualize the Data
I investigate whether an increase in daily gaming hours is associated with poorer physical and mental health outcomes. My hypothesis is:
Gaming negatively affects all aspects of health.
Each health factor is visualized independently.
Health Factors vs. Daily Gaming Hours
I’ll begin by examining how gaming strains the eyes. To do this, I isolate the relevant variables and visualize the distribution of gaming hours across eye strain categories using a box plot.
# Filter the dataset: eye strain vs. daily gaming hours
eyeVsGaming = df[['daily_gaming_hours', 'eye_strain', 'age']]

# Draw the box plot with pandas (which uses matplotlib.pyplot)
eyeVsGaming.boxplot(column='daily_gaming_hours', by='eye_strain')
# Set the title and remove the automatic subtitle
plt.title("Daily Gaming Hours vs. Eye Strain")
plt.suptitle("")
# Show the plot
plt.show()

Note that matplotlib draws whiskers rather than the minimum and maximum. By default, each whisker extends to the most extreme data point that lies within these limits:
Upper limit = Q3 + 1.5 * IQR; lower limit = Q1 - 1.5 * IQR
Points beyond the whiskers are treated as outliers. I leave this representation as is.
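The whisker rule can be checked numerically. The sketch below uses a handful of made-up gaming hours (not values from the dataset) to compute the 1.5 * IQR limits, the whisker positions, and the points that would be drawn as outliers.

```python
import numpy as np

# Hypothetical gaming hours, with one unusually large value
hours = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 14.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(hours, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the limits
lower_whisker = hours[hours >= lower_limit].min()
upper_whisker = hours[hours <= upper_limit].max()

# Anything outside the limits is plotted as an individual outlier point
outliers = hours[(hours < lower_limit) | (hours > upper_limit)]
print(lower_whisker, upper_whisker, outliers)
```

Here the upper whisker sits at 5.0 rather than at the upper limit itself, because the whisker ends on an actual observation, and 14.0 is the only outlier.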
The box plot shows a higher median gaming time among individuals reporting eye strain. This suggests an association between prolonged gaming and eye discomfort.
The other comparisons are box-plotted using the same method. Here are the results:


Because the mood state variable contains multiple categories, I use the seaborn package to produce clearer categorical visualizations.
# Box plot using the seaborn (sns) package
# Increase the width of the figure
plt.figure(figsize=(12, 6))
# Draw the box plot
sns.boxplot(x="mood_state", y="daily_gaming_hours", data=df)
# Set the title
plt.title("Daily Gaming Hours vs. Mood State")
# Tighten the layout
plt.tight_layout()
# Show the plot
plt.show()
To improve interpretability, mood states are grouped into broader classifications:
- Anxious, Irritable, Withdrawn, Angry, Euphoric, Restless, Depressed as “Negative”
- Normal, Excited as “Positive”
# Create a new column "mood" to categorize positive and negative emotions
# np.where acts as a vectorized if statement
df["mood"] = np.where(
    # check whether mood_state is "Normal" or "Excited"
    df["mood_state"].isin(["Normal", "Excited"]),
    # if so, return "Positive"
    "Positive",
    # otherwise, return "Negative"
    "Negative"
)
# Construct a new table with "daily_gaming_hours", "mood_state", "mood"
dvm = df[['daily_gaming_hours', 'mood_state', 'mood']]
# Box plot using sns
sns.boxplot(x="mood", y="daily_gaming_hours", data=dvm)
plt.title("Daily Gaming Hours vs. Mood State")
plt.show()




The box plots suggest that:
Extensive gaming is associated with poorer physical and mental health.
The only factor with a weak association is mood swings.
Academic and Work Life Factors
I begin by stating my hypothesis:
Higher gaming hours are associated with lower academic and work performance.
To examine how gaming relates to academic performance, I use a box plot to compare gaming hours and academic work performance.

Since academic performance follows a natural ranking, I reorder the categories from Excellent to Failing. This makes any trend easier to see.
Excellent -> Good -> Average -> Below Average -> Poor -> Failing
To do this, I create a list of these categories in this particular order and pass it to the box plot.
# Set the figure size (increase the width)
plt.figure(figsize=(12, 6))
# Create a list with the category order
new_order = ['Excellent', 'Good', 'Average', 'Below Average', 'Poor', 'Failing']
# Draw the box plot
sns.boxplot(x="academic_work_performance", y="daily_gaming_hours", data=df, order=new_order)
# Set the title
plt.title("Daily Gaming Hours vs. Academic Work Performance")
# Display the plot
plt.show()

The plot shows that median gaming hours increase as academic performance declines.
Students in lower performance categories tend to report higher gaming hours.
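An alternative to passing `order=` to each plot is to store the ranking on the column itself as an ordered categorical, so that sorting and grouping respect it as well. This is a minimal sketch on hypothetical sample rows (`toy` and `performance_order` are illustrative names, and the ranking is the one used in the plot code above).

```python
import pandas as pd

# Ranking taken from the plot order above, best to worst
performance_order = ["Excellent", "Good", "Average", "Below Average", "Poor", "Failing"]

# Hypothetical sample rows; the real column is df["academic_work_performance"]
toy = pd.DataFrame({
    "academic_work_performance": ["Poor", "Excellent", "Average", "Good", "Failing"],
    "daily_gaming_hours": [6.0, 2.0, 4.0, 3.0, 9.0],
})

# Attach the ranking to the column as an ordered categorical
toy["academic_work_performance"] = pd.Categorical(
    toy["academic_work_performance"],
    categories=performance_order,
    ordered=True,
)

# Sorting now follows the ranking rather than the alphabet
ranked = toy.sort_values("academic_work_performance")
print(ranked)
```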
Next, I turn to grades_gpa. This column contains missing values, so I filter them out first.
# Filter out all missing values
fildf = df[pd.notna(df['grades_gpa'])]
# Select the relevant columns
fildf[['daily_gaming_hours', 'grades_gpa']]
After removing missing values, 754 observations remain.
From here, I can draw a scatter plot of the data.
sns.scatterplot(data=fildf, x='grades_gpa', y='daily_gaming_hours')
plt.title('Daily Gaming Hours vs. Grades')
plt.suptitle("")

The scatter plot does not show a clear relationship between gaming hours and GPA.
Because there is no strong pattern, I restrict the data based on:
- face to face social hours weekly
- sleep disruption frequency
- social isolation scores
- loss of other interests
- exercise hours weekly
As before, I filter the data to match these requirements and then draw a scatter plot. Here is an example.
# Filter the dataset to match my requirements:
#   face-to-face social hours weekly < 10
#   sleep disruption frequency = Often
new_df = df[
    # start by filtering out missing GPA values, as before
    pd.notna(df['grades_gpa'])
    # keep rows with fewer than 10 face-to-face social hours per week
    & (df['face_to_face_social_hours_weekly'] < 10)
    # keep rows where sleep disruption frequency is "Often"
    & (df['sleep_disruption_frequency'] == "Often")
]

I tried controlling for various conditions. Here are the results:


From this, there seems to be no correlation between grades and daily gaming hours, regardless of the restrictions applied.
I turn my attention to work_productivity_score instead. As before, I remove observations with missing values and begin with a scatter plot.

The scatter plot does not reveal a clear pattern because the work productivity score is discrete.
To clarify the relationship, I visualize the results using a bar plot.

The vertical lines represent confidence intervals around the mean.
The bar plot shows a weak positive relationship between daily gaming hours and work productivity scores.
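The numbers behind such a bar plot can be sketched by hand: one mean per productivity score, plus an interval around it. The example below uses hypothetical rows and assumes the bars show mean daily gaming hours per score. It uses a normal-approximation 95% interval (mean ± 1.96 standard errors), whereas seaborn's default error bars are bootstrapped 95% intervals, so the two will be close but not identical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real columns are work_productivity_score and daily_gaming_hours
toy = pd.DataFrame({
    "work_productivity_score": [3, 3, 3, 3, 7, 7, 7, 7],
    "daily_gaming_hours": [2.0, 4.0, 4.0, 6.0, 3.0, 5.0, 5.0, 7.0],
})

# One row per score: mean, sample standard deviation, and group size
summary = (
    toy.groupby("work_productivity_score")["daily_gaming_hours"]
    .agg(["mean", "std", "count"])
)

# Normal-approximation 95% interval: mean +/- 1.96 * standard error
summary["sem"] = summary["std"] / np.sqrt(summary["count"])
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary[["mean", "ci_low", "ci_high"]])
```

When the intervals of neighboring bars overlap heavily, an apparent upward trend in the bar heights is weak evidence on its own.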
Overall, the visual evidence does not strongly support the hypothesis that increased gaming hours reduce academic and work performance.
Summarize the Data
I will focus on academic and work factors. Based on the visualizations above, there appears to be:
- no correlation between hours spent on gaming and grades
- a slight positive correlation between daily gaming hours and work productivity scores
To verify these two statements, I conduct hypothesis tests at the 5% significance level.
Hours Spent Gaming Vs. Grades GPA
The measured variable is grades_gpa, so we first check whether the grades are normally distributed.
# Check for a normal distribution
# Keep only the rows that have a grade
grade = df[pd.notna(df['grades_gpa'])]
gradeplot = grade['grades_gpa']
# Plot these values
sns.histplot(data=gradeplot, kde=True)
plt.title("Grades GPA distribution")

The distribution does not appear normal, so I apply Spearman’s rank correlation test, which does not assume normality.
The hypotheses are defined as:
H0: There is no association between hours spent gaming and grades
H1: There is an association between hours spent gaming and grades
Then, I apply Spearman’s rank correlation test to find the p-value.
rho, p_value = stats.spearmanr(grade['daily_gaming_hours'], grade['grades_gpa'])
print("Spearman correlation:", rho)
print("p-value:", p_value)
This gives me:
Spearman Correlation = 0.02 (extremely weak correlation)
p-value = 0.48
Since 0.48 > 0.05, we fail to reject H0 and conclude that, at the 5% significance level, there is insufficient evidence of an association between gaming hours and grades. This result aligns with the earlier visual analysis.
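To see why Spearman's test suits non-normal data, it helps to recall that Spearman's rho is simply the Pearson correlation computed on ranks instead of raw values. The sketch below illustrates this with made-up numbers (not values from the dataset) in which GPA always falls as hours rise, but not along a straight line.

```python
import pandas as pd

# Hypothetical values: GPA falls as hours rise, but not along a straight line
hours = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
gpa = pd.Series([4.0, 3.9, 3.5, 3.4, 2.0, 1.0])

# Spearman's rho is the Pearson correlation of the ranks
rho = hours.rank().corr(gpa.rank())

# Pearson on the raw values only measures the linear part of the relationship
pearson = hours.corr(gpa)
print(rho, pearson)
```

Because the relationship is perfectly monotonic, the rank-based coefficient reaches -1, while the raw Pearson coefficient stays weaker. Only the ordering of the values matters, which is why the shape of the distribution does not.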
Hours Spent Gaming Vs. Work Productivity
As with GPA, we first check the distribution of work productivity.

Since the data is not normally distributed, we apply Spearman’s rank correlation test with:
H0: There is no association between work productivity and hours spent on gaming
H1: There is an association between work productivity and hours spent on gaming
Running Spearman’s rank correlation test gives:
Spearman correlation: -0.002
p-value: 0.9451239659498083
The correlation is essentially 0. Since 0.95 > 0.05, we fail to reject the null hypothesis. At the 5% significance level, there is insufficient evidence of an association between gaming hours and work productivity, so the slight positive trend seen in the bar plot is not statistically supported.
Data Prediction
We applied hypothesis tests to check for correlation within two pairs of variables:
- Daily Gaming Hours and Grades (GPA)
- Daily Gaming Hours and Work Productivity Score
The hypothesis testing results show no meaningful association between daily gaming hours and either GPA or work productivity when examined individually. However, this does not imply that prediction is impossible. Machine learning models can incorporate multiple variables simultaneously.
To predict missing values, I proceed as follows:
- Prepare the Data: Remove rows with missing values
- Split the Data: Use an 80:20 train-test ratio
- Train and test the model on this dataset
- Estimate the missing values
I will apply supervised learning to this dataset, since the goal is to predict missing values from rows where the value is known. The model used is a regression model.
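The four steps above can be sketched end to end on synthetic data. Everything in this example is hypothetical (the `feature` and `target` columns, the random seed, the noise level); it only illustrates the shape of the workflow, not the results on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one numeric feature and a target with 10 gaps
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.uniform(0, 10, size=50)})
toy["target"] = 2.0 * toy["feature"] + rng.normal(0, 1, size=50)
toy.loc[toy.index[:10], "target"] = np.nan

# 1. Prepare: separate rows with a known target from rows to be estimated
known = toy[toy["target"].notna()]
unknown = toy[toy["target"].isna()]

# 2. Split the known rows 80:20
x_train, x_test, y_train, y_test = train_test_split(
    known[["feature"]], known["target"], test_size=0.2, random_state=0
)

# 3. Train, then score on the held-out 20% (score() returns R-squared)
model = LinearRegression().fit(x_train, y_train)
test_r2 = model.score(x_test, y_test)

# 4. Estimate the missing targets and write them into a copy
filled = toy.copy()
filled.loc[unknown.index, "target"] = model.predict(unknown[["feature"]])
print(test_r2, filled["target"].isna().sum())
```

Step 4 is only worthwhile when the test score from step 3 is acceptable. If the model cannot beat a simple mean prediction, its estimates add nothing over filling in the mean.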
Grades_GPA
Prepare the Dataset
We first remove all rows with missing GPA values.
# Clean the data by filtering out any missing GPA values
gpa = df[pd.notna(df['grades_gpa'])]
Splitting the Dataset and Applying a Regression Model
Based on the earlier analysis, daily gaming hours alone are insufficient for prediction. Therefore, I will incorporate other variables into the model.
To do this, I construct a list of all the variables used to train the model.
- Daily Gaming Hours
- Sleep Hours
- Sleep Disruption Frequency
- Academic Work Performance
- Gaming Addiction Risk Level
I start with these 5 variables. work_productivity_score is excluded because it contains many missing values. Some of these variables are categorical and must be converted to a numerical format. We do this through mapping.
# Define mappings
# Create 3 new columns in the original dataframe

# Convert sleep_disruption_frequency to numeric
# Never = 0; Rarely = 1; Sometimes = 2; Often = 3; Always = 4
sleepmapping = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Often': 3, 'Always': 4}
df['en_sleep_disruption_frequency'] = df['sleep_disruption_frequency'].map(sleepmapping)

# Convert gaming_addiction_risk_level to numeric
# Low = 0; Moderate = 1; High = 2; Severe = 3
gamemapping = {'Low': 0, 'Moderate': 1, 'High': 2, 'Severe': 3}
df['en_gaming_addiction_risk_level'] = df['gaming_addiction_risk_level'].map(gamemapping)

# Convert academic_work_performance to numeric
# Poor = 0; Below Average = 1; Average = 2; Good = 3; Excellent = 4
# Note: any category missing from a mapping (e.g. 'Failing' here) would become NaN
academicmapping = {'Poor': 0, 'Below Average': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['en_academic_work_performance'] = df['academic_work_performance'].map(academicmapping)
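One thing worth checking after a dictionary mapping is whether every label was covered. `Series.map` returns NaN for any label that is not a key in the dictionary, and scikit-learn's `LinearRegression` rejects NaN inputs. The sketch below uses hypothetical labels, with "Failing" deliberately left out of the mapping, to show how uncovered labels can be listed.

```python
import pandas as pd

# Hypothetical labels, including one that the mapping does not cover
labels = pd.Series(["Good", "Poor", "Failing", "Average"])
mapping = {"Poor": 0, "Below Average": 1, "Average": 2, "Good": 3, "Excellent": 4}

encoded = labels.map(mapping)

# Labels absent from the mapping come back as NaN
unmapped = sorted(labels[encoded.isna()].unique())
print(unmapped)
```

If the list is not empty, the mapping can be extended or the affected rows dropped before training.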
Next, we define the variables used for prediction. All of them must be numerical.
# Construct the list of variables used to train the model
variables = ['daily_gaming_hours', 'sleep_hours', 'en_sleep_disruption_frequency',
             'en_academic_work_performance', 'en_gaming_addiction_risk_level']
Now we split the data into the ratio of 80% train data to 20% test data and run a regression model.
# Train the model
# Re-filter so that gpa includes the encoded columns created above
gpa = df[pd.notna(df['grades_gpa'])]
# Assign the variables used for prediction
x = gpa[variables]
# Assign the variable to be predicted
y = gpa['grades_gpa']
# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Apply the linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# Test the model
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred))            # R-squared value
print(mean_squared_error(y_test, y_pred))  # MSE value
This gives an R-squared value of -0.0096. A negative R-squared means the model performs worse than simply predicting the mean GPA for every observation. This suggests that the current linear model fails to produce meaningful predictions.
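To make the meaning of a negative score concrete, R-squared can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the squared error of the predictions and SS_tot is the squared error of always predicting the mean. The sketch below uses three made-up GPA values, and `r_squared` is an illustrative helper rather than part of the analysis code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical GPA values
y_true = np.array([2.0, 3.0, 4.0])

# Always predicting the mean scores exactly 0
baseline = r_squared(y_true, np.full(3, y_true.mean()))
# Predictions that point the wrong way score below 0
worse = r_squared(y_true, np.array([4.0, 3.0, 2.0]))
print(baseline, worse)
```

A score slightly below zero therefore means the model's test-set predictions are marginally worse than the mean of the test targets.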
Given that earlier visualizations did not suggest a strong linear relationship, this result is not surprising. I therefore explore alternative models that may better capture potential non-linear patterns in the dataset.
Random Forest Model
I run the Random Forest model on the same train-test split.
# Apply the Random Forest model
rf = RandomForestRegressor(n_estimators=200, max_depth=None, random_state=42)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print(r2_score(y_test, y_pred))
This gives an R-squared value of -0.19, which is worse than a baseline that always predicts the mean. In this case, the Random Forest model also performs worse than the linear regression model.
After experimenting with additional feature combinations and alternative model specifications, the highest R² achieved was −0.003. This value is effectively zero and is not meaningfully different from the initial linear regression result.
These results suggest that the available predictors do not contain sufficient explanatory power to accurately predict GPA within this dataset.
Work Productivity Score
I apply the same modeling framework to predict work productivity scores, using identical preprocessing steps and an 80:20 train-test split.
The linear regression and Random Forest models both produce R-squared values close to zero or negative, indicating that the models do not meaningfully outperform a baseline mean prediction.
As with GPA, additional feature combinations and model adjustments do not substantially improve predictive performance. These results suggest that the available predictors lack sufficient explanatory power to accurately predict work productivity within this dataset.
Conclusion
Based on the visualizations, the analysis suggests an association between daily gaming hours and several health problems. However, both the visualizations and the hypothesis tests indicate that the relationship between gaming and academic or work-related outcomes is weak.
Even when incorporating additional variables and testing multiple regression models, predictive performance for GPA and work productivity remains poor.
There are several possible reasons for this:
- The sample size may be insufficient.
- Important external factors, both within and beyond the dataset, may not have been included.
- Only a limited range of modeling approaches was explored.
Overall, the findings suggest that while gaming behavior is associated with poorer health, this dataset provides no evidence of an impact on academic or work performance.
Disclosures
AI tools were used to assist with outlining, clarification, and editing suggestions.
All ideas, interpretations, and final writing decisions are my own.
References
Data Rockie – Data Science Bootcamp
Gaming and Mental Health Dataset By Shaista Sahid

