How Gaming affects our daily lives

Introduction

Gaming has become a widespread activity across age groups, raising questions about its effects on health, academic performance, and work performance.

This analysis explores the relationship between gaming behavior and academic, work, and health outcomes. It then applies machine learning models to predict academic and work performance.

Approach

This project uses the Gaming and Mental Health Dataset from Kaggle and analyzes it in Python. The workflow consists of four stages:

Data Cleaning: Load the dataset, inspect its structure, and address missing values
Data Visualization: Explore relationships using various plots
Data Summarization: Verify relationships through hypothesis testing
Data Prediction: Build machine learning models to estimate academic and work performance

Cleaning the Data

I begin by importing the necessary Python libraries for data manipulation, visualization, statistical analysis, and machine learning, and then load the dataset for inspection.

			
#load packages
#pandas: tabular data handling, similar to a spreadsheet in Python
import pandas as pd
#numpy: efficient numerical arrays and math
import numpy as np
#matplotlib: basic plotting engine
import matplotlib.pyplot as plt
#seaborn: higher-level statistical plots built on matplotlib
import seaborn as sns
#scipy.stats: statistical tests
from scipy import stats
#scikit-learn: machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

		

			
#importing the dataset
df = pd.read_csv('Dataset/Gaming and Mental Health.csv')

This gives us the following dataset

Data Inspection

Let’s inspect our data first

			
#inspecting data
df.info()

			
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 27 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   record_id                         1000 non-null   object 
 1   age                               1000 non-null   int64  
 2   gender                            1000 non-null   object 
 3   daily_gaming_hours                1000 non-null   float64
 4   game_genre                        1000 non-null   object 
 5   primary_game                      1000 non-null   object 
 6   gaming_platform                   1000 non-null   object 
 7   sleep_hours                       1000 non-null   float64
 8   sleep_quality                     1000 non-null   object 
 9   sleep_disruption_frequency        1000 non-null   object 
 10  academic_work_performance         1000 non-null   object 
 11  grades_gpa                        754 non-null    float64
 12  work_productivity_score           674 non-null    float64
 13  mood_state                        1000 non-null   object 
 14  mood_swing_frequency              1000 non-null   object 
 15  withdrawal_symptoms               1000 non-null   bool   
 16  loss_of_other_interests           1000 non-null   bool   
 17  continued_despite_problems        1000 non-null   bool   
 18  eye_strain                        1000 non-null   bool   
 19  back_neck_pain                    1000 non-null   bool   
 20  weight_change_kg                  1000 non-null   float64
 21  exercise_hours_weekly             1000 non-null   float64
 22  social_isolation_score            1000 non-null   int64  
 23  face_to_face_social_hours_weekly  1000 non-null   float64
 24  monthly_game_spending_usd         1000 non-null   float64
 25  years_gaming                      1000 non-null   int64  
 26  gaming_addiction_risk_level       1000 non-null   object 
dtypes: bool(5), float64(8), int64(3), object(11)
memory usage: 176.9+ KB

		

The dataset contains 1,000 records and 27 variables, including behavioral, health, academic, and work-related measures.

Two variables contain missing values:

work_productivity_score
grades_gpa

These values will not be removed. Missing values will be predicted using machine learning.
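As a quick cross-check of the `df.info()` output, missing values can also be counted per column. The snippet below is a minimal sketch on a hypothetical four-row stand-in (`toy`) rather than the real dataset; only the column names are taken from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataset (values are made up)
toy = pd.DataFrame({
    "daily_gaming_hours": [1.5, 4.0, 6.5, 2.0],
    "grades_gpa": [3.2, np.nan, 2.8, np.nan],
    "work_productivity_score": [np.nan, 7.0, 5.0, 6.0],
})

# Count missing values per column, then keep only columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Run against the full `df`, the same two lines should report 246 missing `grades_gpa` values and 326 missing `work_productivity_score` values, which follows from the non-null counts of 754 and 674 out of 1,000.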

Variable Selection

Measured Variables

Health factors (both physical and mental)
- eye_strain
- back_neck_pain
- mood_state
- mood_swing_frequency
- sleep_quality
- withdrawal_symptoms
Academic and Work Life factors
- academic_work_performance
- grades_gpa
- work_productivity_score

Varied Variables

daily_gaming_hours
years_gaming
gaming_addiction_risk_level

Controlled Variables

age
sleep_hours
exercise_hours_weekly
social_isolation_score
face_to_face_social_hours_weekly
loss_of_other_interests
sleep_disruption_frequency

Excluded Variables

These variables are excluded because they are irrelevant to the analysis or their meaning is unclear.

gender
game_genre
primary_game
gaming_platform
continued_despite_problems
weight_change_kg

This list serves as a starting point and may be refined as the analysis develops.

Visualize the Data

I investigate whether an increase in daily gaming hours is associated with poorer physical and mental health outcomes. My hypothesis is

Gaming negatively affects all aspects of health

Each health factor is visualized independently.

Health Factors vs. Daily Gaming hours

I’ll begin by examining how gaming relates to eye strain. To do this, I isolate the relevant variables and visualize the distribution of gaming hours across eye strain categories using a box plot.

			
#Filter my dataset
#Eyestrain vs. Daily Gaming Hours 
eyeVsGaming = df[['daily_gaming_hours','eye_strain','age']]

			
#Draw a box plot with pandas' DataFrame.boxplot (rendered through matplotlib)
eyeVsGaming.boxplot(column = 'daily_gaming_hours', by = 'eye_strain')
#Set Title and Remove Subtitles
plt.title("Daily Gaming Hours vs. Eye Strain")
plt.suptitle("")
#Show plot
plt.show()

		

Note that the box plot draws whiskers rather than the minimum and maximum. The whiskers extend to the most extreme data points that still lie within these limits:

Upper limit = Q3 + 1.5 * IQR; Lower limit = Q1 - 1.5 * IQR

Points beyond the whiskers are drawn as outliers. I leave this representation as is.
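To make the whisker rule concrete, here is a small sketch that computes the limits by hand with NumPy. The `hours` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical daily gaming hours, including one extreme value
hours = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(hours, [25, 75])
iqr = q3 - q1

# Limits that decide which points count as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the limits
inside = hours[(hours >= lower_limit) & (hours <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual outlier point
outliers = hours[(hours < lower_limit) | (hours > upper_limit)]

print(lower_whisker, upper_whisker, outliers)
```

In this toy example the limits are -0.5 and 7.5, so the whiskers end at 1.0 and 5.0 and the value 12.0 is flagged as an outlier.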

The box plot shows a higher median gaming time among individuals reporting eye strain. This suggests an association between prolonged gaming and eye discomfort.

The other comparisons use box plots built with the same method. Here are the results.

Because the mood state variable contains multiple categories, the seaborn package is used to produce clearer categorical visualizations.

			
#boxplot using seaborn (sns) package
#increase the width of the boxplot
plt.figure(figsize=(12, 6))  # increase width
#actual sns boxplot
sns.boxplot(x="mood_state", y="daily_gaming_hours", data=df)
#set title
plt.title("Daily Gaming Hours vs. Mood State")
#tighten the layout
plt.tight_layout()
#show the plot
plt.show()

		

To improve interpretability, mood states are grouped into broader classifications:

Anxious, Irritable, Withdrawn, Angry, Euphoric, Restless, Depressed as “Negative”
Normal, Excited as “Positive”

			
#create a new column "mood" to categorize positive and negative emotions
#with an if condition
#import numpy package and use the where function which acts as an if statement
df["mood"] = np.where(
  #check if the mood_state is "Normal" or "Excited"
    df["mood_state"].isin(["Normal", "Excited"]),
  #return positive
    "Positive",
  #otherwise return negative
    "Negative"
)
#Construct a new table with "daily_gaming_hours", "mood_state", "mood"
dvm = df[['daily_gaming_hours', 'mood_state', 'mood']]
#box-plot using sns
sns.boxplot(x="mood", y="daily_gaming_hours", data=dvm)
plt.title("Daily Gaming Hours vs. Mood State")
plt.show()

		

The box plots suggest that

Extensive gaming is associated with poorer physical and mental health.

The only factor where the association is weak is mood swings.

Academic and Work Life Factors

I begin by stating my hypothesis

Higher gaming hours are associated with lower academic and work performance

To examine how gaming relates to academic performance, I use a box plot to compare gaming hours and academic work performance.

Since academic performance follows a natural ranking, I order the categories from Excellent to Failing. This makes any trend easier to see.

Excellent -> Good -> Average -> Below Average -> Poor -> Failing

To do this, I create a list of these categories in this order and pass it to the box plot.

			
#set graph scale
plt.figure(figsize=(12, 6))  # increase width
#create an array of orders
new_order = ['Excellent', 'Good', 'Average', 'Below Average', 'Poor', 'Failing']
#boxplot it
sns.boxplot(x="academic_work_performance", y="daily_gaming_hours", data=df, order= new_order)
#set title
plt.title("Daily Gaming Hours vs. Academic Work Performance")
#display plot
plt.show()

		

The plot shows that median gaming hours increase as academic performance declines.

Students in lower performance categories tend to report higher gaming hours.

Next, I turn to grades_gpa. This column has missing values, so I filter them out first.

			
#keep only rows where grades_gpa is not missing
fildf = df[pd.notna(df['grades_gpa'])]
#select relevant columns
fildf[['daily_gaming_hours', 'grades_gpa']]

After removing missing values, 754 observations remain.

From here, I can draw a scatter plot of the data.

			
#pass the dataframe by keyword so the call works across seaborn versions
sns.scatterplot(data=fildf, x='grades_gpa', y='daily_gaming_hours')
plt.title('Daily Gaming Hours vs. Grades')
plt.show()

The scatter plot does not show a clear relationship between gaming hours and GPA.

Because there is no strong pattern, I restrict the data based on:

face to face social hours weekly
sleep disruption frequency
social isolation scores
loss of other interests
exercise hours weekly

As before, I filter the data to match each set of conditions and then draw a scatter plot. Here is an example.

			
#filter the dataset to match my conditions
#face to face social hours weekly < 10
#sleep disruption frequency = Often
#start by filtering out missing values as before
new_df = df[
    pd.notna(df['grades_gpa'])
    #keep rows with fewer than 10 face-to-face social hours per week
    & (df['face_to_face_social_hours_weekly'] < 10)
    #keep rows where sleep disruption frequency is "Often"
    & (df['sleep_disruption_frequency'] == "Often")
]

		

Conditions: face to face social hours less than 10 hours; Sleep disruption frequency = Often

I tried several other sets of conditions. The results are below.

Conditions: Weekly Exercise Hours < 5; Loss of other interests = True

Conditions: Social Isolation Score < 5; Loss of other interests = False

From this, there seems to be no correlation between grades and daily gaming hours, regardless of the restrictions applied.
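The conditional comparisons above can be wrapped in a small helper so each set of conditions is applied the same way. This is only a sketch: `subset_spearman` is a hypothetical helper, the `toy` values are invented, and reporting a Spearman coefficient per subset is an addition to the purely visual check used here.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical rows using the dataset's column names (values are made up)
toy = pd.DataFrame({
    "daily_gaming_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "grades_gpa": [3.9, 3.5, np.nan, 3.0, 2.5, 3.8, 2.0, 3.1],
    "exercise_hours_weekly": [2, 3, 1, 4, 6, 2, 3, 8],
    "loss_of_other_interests": [True, True, True, True, True, False, True, False],
})

def subset_spearman(frame, condition):
    """Apply a boolean mask, drop rows without GPA, return (row count, rho)."""
    subset = frame[condition & frame["grades_gpa"].notna()]
    rho, _ = stats.spearmanr(subset["daily_gaming_hours"], subset["grades_gpa"])
    return len(subset), rho

# Conditions: weekly exercise hours < 5 and loss of other interests = True
mask = (toy["exercise_hours_weekly"] < 5) & toy["loss_of_other_interests"]
n_rows, rho = subset_spearman(toy, mask)
print(n_rows, rho)
```

The toy values are deliberately monotone, so the coefficient they produce says nothing about the real data. The point is only that each restriction becomes one boolean mask.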

I turn my attention to work_productivity_score instead. As before, I remove observations with missing values and begin with a scatter plot.

The scatter plot does not reveal a clear pattern because the work productivity score is discrete.

To clarify the relationship, I visualize the results using a bar plot.

The vertical lines represent confidence intervals around the mean.

The bar plot shows a weak positive relationship between daily gaming hours and work productivity scores.

Overall, the visual evidence does not strongly support the hypothesis that increased gaming hours reduce academic and work performance.
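The numbers behind such a bar plot can be sketched with a group-by. The example below assumes the bars show mean daily gaming hours for each productivity score, which is a guess about the axes, and it uses a normal-approximation 95% interval (mean ± 1.96 × standard error). seaborn's bar plot uses a bootstrap interval by default, so its error bars would differ slightly. The `toy` values are made up.

```python
import pandas as pd

# Hypothetical data: three respondents at each of three productivity scores
toy = pd.DataFrame({
    "work_productivity_score": [3, 3, 3, 5, 5, 5, 8, 8, 8],
    "daily_gaming_hours": [2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 4.0, 5.0, 6.0],
})

# Mean, standard error and group size for each productivity score
summary = (
    toy.groupby("work_productivity_score")["daily_gaming_hours"]
    .agg(["mean", "sem", "count"])
)

# Normal-approximation 95% confidence interval around each group mean
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```

Wide, overlapping intervals between neighboring scores are a sign that an apparent trend in the bar heights may not be reliable.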

Summarize the Data

I will focus on academic and work factors. Based on those visualizations above, I would say that there is

no correlation between hours spent on gaming and grades
a slight positive correlation between daily gaming hours and work productivity scores

To verify these two statements, I conduct hypothesis tests at the 5% significance level.

Hours Spent Gaming Vs. Grades GPA

The variable being measured is GPA, so I first check whether grades are normally distributed.

			
#checking for normal distribution
#keep only rows that have a GPA value
grade = df[pd.notna(df['grades_gpa'])]
gradeplot = grade['grades_gpa']
#plot a histogram with a density curve
sns.histplot(data= gradeplot, kde = True)
plt.title("Grades GPA distribution")

		

The distribution does not appear normal, so I apply Spearman’s rank correlation test, which does not assume normality.

The hypotheses are defined as:

H0: There is no association between hours spent gaming and grades
H1: There is an association between hours spent gaming and grades

Then, I run the Spearman test to obtain its p-value.
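The normality check here is visual. A formal test such as Shapiro-Wilk could complement the histogram, though it is not part of the original analysis. The sketch below runs `scipy.stats.shapiro` on made-up, uniformly distributed values (`fake_gpa`) purely to show the call and how to read the result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical GPA-like values: uniform on [0, 4], so clearly not bell-shaped
fake_gpa = rng.uniform(0.0, 4.0, size=500)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
statistic, p_value = stats.shapiro(fake_gpa)
print(f"W = {statistic:.3f}, p = {p_value:.4g}")

# A small p-value is evidence against normality
looks_normal = p_value > 0.05
```

With the real data, the same call would be `stats.shapiro(grade['grades_gpa'])`. A p-value below 0.05 would support using a rank-based test.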

			
rho, p_value = stats.spearmanr(grade['daily_gaming_hours'], grade['grades_gpa'])
print("Spearman correlation:", rho)
print("p-value:", p_value)

This gives:

Spearman Correlation = 0.02 (extremely weak correlation)
p-value = 0.48

Since 0.48 > 0.05, we fail to reject H0 and conclude that, at the 5% significance level, there is insufficient evidence of an association between daily gaming hours and grades. This result aligns with the earlier visual analysis.
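For intuition about what the test measures: Spearman's coefficient is the Pearson correlation computed on the ranks of the two variables, which is why it does not need normally distributed data. The sketch below checks this on six made-up pairs (not values from the dataset) and applies the same 5% decision rule.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations
hours = np.array([1.0, 2.5, 3.0, 4.5, 6.0, 7.5])
gpa = np.array([3.1, 3.6, 2.9, 3.3, 3.0, 3.4])

# Spearman's rank correlation and its p-value
rho, p_value = stats.spearmanr(hours, gpa)

# The same coefficient, obtained as Pearson's r on the ranks
rank_r, _ = stats.pearsonr(stats.rankdata(hours), stats.rankdata(gpa))

# Decision at the 5% significance level
reject_h0 = p_value < 0.05
print(rho, rank_r, p_value, reject_h0)
```

Both routes give the same coefficient, and with such a weak rank correlation the null hypothesis is not rejected.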

Hours Spent Gaming Vs. Work Productivity

As with GPA, we first check the distribution of work productivity.

Since the data are not normally distributed, we again apply the Spearman test with

H0: No association between work productivity and hours spent on gaming
H1: There is an association between work productivity and hours spent on gaming

Running the Spearman test gives

Spearman correlation: -0.002
p-value: 0.9451239659498083

The correlation is essentially 0. With a p-value of about 0.95, we fail to reject the null hypothesis. At the 5% significance level, there is insufficient evidence of an association between gaming hours and work productivity, so the slight positive trend seen in the bar plot is not supported.

Data Prediction

We applied hypothesis tests to check whether there is any correlation between two pairs of variables, namely:

Daily Gaming Hours and Grades (GPA)
Daily Gaming Hours and Work Productivity Score

The hypothesis testing results show no meaningful association between daily gaming hours and either GPA or work productivity when examined individually. However, this does not imply that prediction is impossible. Machine learning models can incorporate multiple variables simultaneously.

To predict missing values, I proceed as follows:

Prepare the Data: Remove any missing values
Split the data into an 80:20 train-test ratio
Train and test the model using the dataset
Estimate the missing values

I apply supervised learning to this dataset because I am trying to predict the missing values. The model used is a regression model.
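The four steps, including the final fill-in of missing values, can be sketched end to end on synthetic data. Everything in this block is hypothetical: `feature_a`, `feature_b` and `target` are invented columns, and the relationship between them is deliberately strong so that the fill-in step is worth doing, which is not guaranteed for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic data: the target depends linearly on two features plus small noise
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
toy["target"] = 2.0 * toy["feature_a"] - toy["feature_b"] + rng.normal(scale=0.1, size=n)

# Blank out 25% of the target to mimic the missing GPA values
toy.loc[toy.sample(frac=0.25, random_state=0).index, "target"] = np.nan

features = ["feature_a", "feature_b"]

# Step 1: separate rows with a known target from rows to be estimated
known = toy[toy["target"].notna()]
unknown = toy[toy["target"].isna()]

# Step 2: 80:20 train-test split on the known rows
x_train, x_test, y_train, y_test = train_test_split(
    known[features], known["target"], test_size=0.2, random_state=0
)

# Step 3: train the model and evaluate it on the held-out test rows
model = LinearRegression().fit(x_train, y_train)
test_r2 = r2_score(y_test, model.predict(x_test))
print("test R^2:", test_r2)

# Step 4: estimate the missing targets (only sensible if test R^2 is clearly above 0)
filled = toy.copy()
filled.loc[unknown.index, "target"] = model.predict(unknown[features])
```

The check in step 3 matters: if the test R² is near or below zero, the predictions in step 4 are no better than filling in the mean.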

Grades_GPA

Prepare the Dataset

We remove all the missing values first.

			
#clean data filtering out any missing gpa values
gpa = df[pd.notna(df['grades_gpa'])]

Splitting the Dataset and Apply Regression Model

Based on the earlier analysis, daily gaming hours alone are insufficient for prediction. Therefore, I incorporate other variables to help the model predict GPA.

To do this, I construct a list of all the variables used to train the model.

Daily Gaming Hours
Sleep Hours
Sleep Disruption Frequency
Academic Work Performance
Gaming Addiction Risk Level

I start with 5 variables. work_productivity_score is excluded because many of its values are missing. Some of these variables are categorical and must be converted to a numerical format. We do this through mapping.

			
#Define mappings
#create 3 new encoded columns in the original dataframe
#note: Series.map returns NaN for any category missing from the dictionary
#Convert sleep_disruption_frequency to numeric
#Never = 0; Rarely = 1; Sometimes = 2; Often = 3; Always = 4
sleepmapping = {'Never': 0, 'Rarely': 1, 'Sometimes': 2,
                'Often': 3, 'Always': 4}
#Apply mapping
df['en_sleep_disruption_frequency'] = df['sleep_disruption_frequency'].map(sleepmapping)
#Convert gaming_addiction_risk_level to numeric
#Low = 0; Moderate = 1; High = 2; Severe = 3
gamemapping = {'Low': 0, 'Moderate': 1, 'High': 2, 'Severe': 3}
df['en_gaming_addiction_risk_level'] = df['gaming_addiction_risk_level'].map(gamemapping)
#Convert academic_work_performance to numeric
#Poor = 0; Below Average = 1; Average = 2; Good = 3; Excellent = 4
academicmapping = {'Poor': 0, 'Below Average': 1, 'Average': 2,
                   'Good': 3, 'Excellent': 4}
df['en_academic_work_performance'] = df['academic_work_performance'].map(academicmapping)
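One thing worth checking after a dictionary mapping is whether every category was covered, because `Series.map` silently returns NaN for values that are not keys. The sketch below uses a hypothetical `performance` series that contains a 'Failing' entry. Whether that category actually occurs in the dataset is not shown here, so treat it as an assumption.

```python
import pandas as pd

# Hypothetical column values, including a category absent from the mapping
performance = pd.Series(["Good", "Poor", "Failing", "Average", "Excellent"])

academicmapping = {"Poor": 0, "Below Average": 1, "Average": 2,
                   "Good": 3, "Excellent": 4}

# Categories without a dictionary entry become NaN
encoded = performance.map(academicmapping)

# List the categories that were not covered by the mapping
unmapped = sorted(set(performance.dropna()) - set(academicmapping))
print(unmapped, int(encoded.isna().sum()))
```

If `unmapped` is not empty, either extend the mapping or drop those rows before fitting, since scikit-learn's `LinearRegression` does not accept NaN inputs.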

		

Next, we define the variables used for prediction. All of them must be numerical.

			
#construct the list of predictor variables used to train the model
variables = ['daily_gaming_hours', 'sleep_hours', 'en_sleep_disruption_frequency',
             'en_academic_work_performance', 'en_gaming_addiction_risk_level']

Now we split the data into the ratio of 80% train data to 20% test data and run a regression model.

			
#Train the model
#re-apply the GPA filter so the subset includes the encoded columns added to df above
gpa = df[pd.notna(df['grades_gpa'])]
#assign predictor variables
x = gpa[variables]
#assign the target variable
y = gpa['grades_gpa']
#Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
#Apply Linear Regression Model
model = LinearRegression()
model.fit(x_train, y_train)

		

			
#Test the model
y_pred = model.predict(x_test)
print(r2_score(y_test, y_pred)) #R-Square Value
print( mean_squared_error(y_test, y_pred)) #MSE Value

This gives an R-squared value of -0.0096. A negative R-squared means the model performs worse than simply predicting the mean GPA for every observation. This suggests that the current linear model fails to give meaningful predictions.
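To see why R-squared can be negative, it helps to compute it by hand as 1 - SS_res / SS_tot, where SS_tot is the squared error of always predicting the mean. The numbers below are made up, and `manual_r2` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and two sets of predictions
y_test = np.array([2.0, 3.0, 3.5, 4.0, 2.5])
mean_pred = np.full_like(y_test, y_test.mean())   # always predict the mean
bad_pred = np.array([3.5, 2.0, 2.5, 3.0, 3.5])    # worse than the mean

def manual_r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(manual_r2(y_test, mean_pred))   # baseline: exactly 0
print(manual_r2(y_test, bad_pred))    # below 0: worse than the baseline
print(r2_score(y_test, bad_pred))     # matches the manual value
```

A score of 0 is therefore the mean-prediction baseline, and anything below it means the model's errors are larger than the baseline's.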

Given that earlier visualizations did not suggest a strong linear relationship, this result is not surprising. I therefore explore alternative models that may better capture potential non-linear patterns in the dataset.

Random Forest Model

I run the random forest model on the same train-test split.

			
#Apply Random Forest Model
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42
)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print(r2_score(y_test, y_pred))

		

This gives an R-squared value of -0.19, indicating that performance is worse than a baseline mean prediction. In this case, the Random Forest model performs worse than the linear regression model.

After experimenting with additional feature combinations and alternative model specifications, the highest R² achieved was −0.003. This value remains effectively zero and is not meaningfully different from the linear regression result using daily gaming hours alone.

These results suggest that the available predictors do not contain sufficient explanatory power to accurately predict GPA within this dataset.
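Because a single 80:20 split can give a noisy R-squared, one way to double-check a result like this is k-fold cross-validation. This is not part of the original analysis. The sketch below uses a synthetic target that is pure noise, so both models are expected to score around or below zero. `X` and `y` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic predictors and a target that is unrelated to them
X = rng.normal(size=(300, 5))
y = rng.normal(size=300)

# Five shuffled folds with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Average R^2 across folds for each model
mean_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
print(mean_r2)
```

If the cross-validated scores on the real data also stay near or below zero, that strengthens the conclusion that the predictors carry little information about GPA.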

Work Productivity Score

I apply the same modeling framework to predict work productivity scores, using identical preprocessing steps and an 80:20 train-test split.

The regression and Random Forest models both produce R-Square values close to zero or negative, indicating that the models do not meaningfully outperform a baseline mean prediction.

As with GPA, additional feature combinations and model adjustments do not substantially improve predictive performance. These results suggest that the available predictors lack sufficient explanatory power to accurately predict work productivity within this dataset.

Conclusion

Based on the visualizations, the analysis suggests an association between daily gaming hours and several health problems. However, both the visualizations and the hypothesis tests indicate that the relationship between gaming and academic or work-related outcomes is weak.

Even when incorporating additional variables and testing multiple regression models, predictive performance for GPA and work productivity remains poor.

There are several possible reasons for this, including:

The sample size may be insufficient.
Important external factors, both within and beyond the dataset, may not have been included.
Too few modeling approaches were explored.

Overall, the findings suggest that while gaming behavior is associated with health problems, the data provide no evidence of an impact on academic and work performance.

Disclosures

AI tools were used to assist with outlining, clarification, and editing suggestions.
All ideas, interpretations, and final writing decisions are my own.

References

Data Rockie – Data Science Bootcamp
Gaming and Mental Health Dataset By Shaista Sahid
