QS World University Rankings Analysis Project

Weike ZHANG

Nov 2024

For more details, datasets, and analysis scripts, visit our GitHub repo.

Check out our project hosted on GitHub Pages: Github web

Project Outline¶

Introduction¶

The QS World University Rankings are a globally recognized framework for evaluating higher education institutions. This project will analyze ranking trends from 2022 to 2024 to uncover patterns and determinants of university performance. The findings will serve as an empirical guide for stakeholders in the education sector.

Objectives:¶

  • To identify trends and shifts in university rankings over the specified years.
  • To understand the impact of various performance metrics on the rankings.
  • To provide insights for educational institutions aiming to improve their standings.

Data and Summary Statistics¶

I. Data Sources (Extraction, Transform, and Load)¶

  • Description of the datasets for 2022, 2023, and 2024, including data structure and collection methodology.
  • Data size and completeness, with an emphasis on any data preprocessing conducted.

II. Summary Statistics¶

  • Computation of summary statistics for critical variables to establish a baseline understanding of the dataset's characteristics.

Measure and Variable Definition¶

  • In-depth explanation of QS ranking metrics.
  • Discussion on how each metric is quantified and its presumed influence on the overall rankings.

Exploratory Data Analysis (EDA)¶

I. Ranking Trends¶

  • Tracking shifts in rankings across the years and pinpointing outliers.
  • Identifying institutions with notable improvements or declines.

II. Metric Correlations¶

  • Investigating the interrelationship between ranking metrics using correlation analysis.
  • Visualizations to showcase the strength and direction of these relationships.

III. Geographic Trends¶

  • Geographic analysis of the distribution of top-ranked institutions.
  • Examination of regional performance and disparities.

IV. Internationalization¶

  • Evaluating the influence of international faculty and student presence on ranking outcomes.

Empirical Results¶

I. Regression Analysis¶

  • Linear regression models to estimate the effect of ranking metrics on the overall score.
  • Discussion of the model's assumptions, validations, and any transformations applied to the data.

II. Predictive Modelling¶

  • Developing predictive models to forecast future rankings based on identified trends.
  • Validation of predictive accuracy through back-testing with historical data.

Conclusion and Implications¶

  • Synthesis of key findings and their implications for universities and policymakers.
  • Discussion of the study's limitations and suggestions for further research.

Additional Sections:¶

  • Methodology: Detailed justification of statistical methods used.
  • Ethical Considerations: Reflection on the ethical aspects of ranking interpretations.
  • Peer Review: Strategy for peer review to validate findings.

Appendices:¶

  • Detailed tables, additional analyses, and a glossary of terms used throughout the project.

References:¶

  • Detailed bibliography citing data sources, literature, and methodologies referenced.

Data and Summary Statistics¶

I. Data Sources (Extraction, Transform, and Load)¶

QS World University Rankings The QS World University Rankings provide a comprehensive evaluation of over 1,000 higher education institutions globally. Sourced from Quacquarelli Symonds (QS), these rankings are recognized worldwide for their depth of research and breadth of data regarding university performance. The datasets for 2022, 2023, and 2024, accessible through the QS website, form the primary basis of our analysis. These tables offer detailed insights into various performance metrics such as academic reputation, employer reputation, faculty-student ratio, citations per faculty, international faculty, and international students scores. By analyzing these datasets, we aim to uncover trends, evaluate shifts in rankings, and identify the determinants of university performance across the specified years.

  • View the QS World University Rankings 2022 Report
  • QS World University Rankings 2023 Result Tables - Excel
  • QS World University Rankings 2024 Results Table - Excel

QS World University Rankings Metrics Explained¶

The QS ranking methodology utilizes several metrics to gauge university performance, each capturing a distinct aspect of university excellence:

  • Academic Reputation Score (40% weight): Derived from a global academic survey, this score reflects the perceived research quality and academic standing of an institution.

  • Employer Reputation Score (10% weight): Based on a survey of employers, this score indicates the employability and preparedness of graduates in the workforce.

  • Faculty Student Score (20% weight): This metric measures the faculty-to-student ratio, providing insight into the teaching and learning environment of the university.

  • Citations per Faculty Score (20% weight): A measure of research impact, this score is calculated based on the average citations per faculty member, indicating research influence and quality.

  • International Faculty Score (5% weight): This score assesses the diversity of the faculty by measuring the proportion of international faculty members at the institution.

  • International Students Score (5% weight): Similarly, this score evaluates the diversity of the student body by looking at the percentage of international students.

  • Overall Score: A composite score that combines all individual metrics, representing a summarized assessment of a university's overall ranking performance.

In [ ]:
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
!git clone https://github.com/weike2001/ds
fatal: destination path 'ds' already exists and is not an empty directory.
In [ ]:
import pandas as pd

# Set the paths to the Excel files in the cloned repository
file_path_2022 = '/content/ds/data/2022_QS_World_University_Rankings_Results_public_version.xlsx'
file_path_2023 = '/content/ds/data/2023 QS World University Rankings V2.1 (For qs.com).xlsx'
file_path_2024 = '/content/ds/data/2024 QS World University Rankings 1.2 (For qs.com).xlsx'

# Read the data into pandas DataFrames
df_2022 = pd.read_excel(file_path_2022)
df_2023 = pd.read_excel(file_path_2023)
df_2024 = pd.read_excel(file_path_2024)

# Assuming you want to save these DataFrames as CSV files in the same directory
csv_file_path_2022 = file_path_2022.replace('.xlsx', '.csv')
csv_file_path_2023 = file_path_2023.replace('.xlsx', '.csv')
csv_file_path_2024 = file_path_2024.replace('.xlsx', '.csv')

# Save the DataFrames as CSV files
df_2022.to_csv(csv_file_path_2022, index=False)
df_2023.to_csv(csv_file_path_2023, index=False)
df_2024.to_csv(csv_file_path_2024, index=False)

Adjust columns in each csv form

In [ ]:
import pandas as pd

# Define the new specific column names
specific_column_names_2022 = [
    'National Rank', 'Regional Rank', '2022 Rank', '2021 Rank', 'Institution Name',
    'Location Code', 'Country/Territory', 'Size', 'Focus', 'Research Intensity',
    'Age Band', 'Status', 'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank', 'Faculty Student Score',
    'Faculty Student Rank', 'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank', 'International Students Score',
    'International Students Rank', 'Overall Score'
]

specific_column_names_2023 = [
    '2023 Rank', '2022 Rank', 'Institution Name', 'Location Code', 'Country/Territory',
    'Size', 'Focus', 'Research Intensity', 'Age Band', 'Status',
    'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank',
    'Faculty Student Score', 'Faculty Student Rank',
    'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank',
    'International Students Score', 'International Students Rank',
    'International Research Network Score', 'International Research Network Rank',
    'Employment Outcomes Score', 'Employment Outcomes Rank',
    'Overall Score'
]

specific_column_names_2024 = [
    '2024 Rank', '2023 Rank', 'Institution Name', 'Location Code', 'Country/Territory',
    'Size', 'Focus', 'Research Intensity', 'Status',
    'Academic Reputation Score', 'Academic Reputation Rank',
    'Employer Reputation Score', 'Employer Reputation Rank',
    'Faculty Student Score', 'Faculty Student Rank',
    'Citations per Faculty Score', 'Citations per Faculty Rank',
    'International Faculty Score', 'International Faculty Rank',
    'International Students Score', 'International Students Rank',
    'International Research Network Score', 'International Research Network Rank',
    'Employment Outcomes Score', 'Employment Outcomes Rank',
    'Sustainability Score', 'Sustainability Rank',
    'Overall Score'
]

print(len(specific_column_names_2024))
# Reading the CSV files into Pandas DataFrames
df_2022 = pd.read_csv(csv_file_path_2022, skiprows = 4, names=specific_column_names_2022)
df_2023 = pd.read_csv(csv_file_path_2023, skiprows = 4, names=specific_column_names_2023)
df_2024 = pd.read_csv(csv_file_path_2024, skiprows = 4, names=specific_column_names_2024)

df_2022.head()
28
Out[ ]:
National Rank Regional Rank 2022 Rank 2021 Rank Institution Name Location Code Country/Territory Size Focus Research Intensity ... Employer Reputation Rank Faculty Student Score Faculty Student Rank Citations per Faculty Score Citations per Faculty Rank International Faculty Score International Faculty Rank International Students Score International Students Rank Overall Score
0 1 1 1 1 Massachusetts Institute of Technology (MIT) US United States M CO VH ... 4 100.0 12 100.0 6 100.0 45 91.4 105 100
1 1 1 2 5 University of Oxford UK United Kingdom L FC VH ... 3 100.0 5 96.0 34 99.5 83 98.5 52 99.5
2 2 2 3= 2 Stanford University US United States L FC VH ... 5 100.0 9 99.9 10 99.8 73 67.0 208 98.7
3 2 2 3= 7 University of Cambridge UK United Kingdom L FC VH ... 2 100.0 10 92.1 48 100.0 57 97.7 64 98.7
4 3 3 5 3 Harvard University US United States L FC VH ... 1 99.1 37 100.0 3 84.2 188 70.1 196 98

5 rows × 25 columns

In [ ]:
df_2023.head()
Out[ ]:
2023 Rank 2022 Rank Institution Name Location Code Country/Territory Size Focus Research Intensity Age Band Status ... Citations per Faculty Rank International Faculty Score International Faculty Rank International Students Score International Students Rank International Research Network Score International Research Network Rank Employment Outcomes Score Employment Outcomes Rank Overall Score
0 1 1 Massachusetts Institute of Technology (MIT) US United States M CO VH 5.0 B ... 5 100.0 54 90.0 109 96.1 58 100.0 3 100
1 2 3= University of Cambridge UK United Kingdom L FC VH 5.0 A ... 55 100.0 60 96.3 70 99.5 6 100.0 9 98.8
2 3 3= Stanford University US United States L FC VH 5.0 B ... 9 99.8 74 60.3 235 96.3 55 100.0 2 98.5
3 4 2 University of Oxford UK United Kingdom L FC VH 5.0 A ... 64 98.8 101 98.4 54 99.9 3 100.0 7 98.4
4 5 5 Harvard University US United States L FC VH 5.0 B ... 2 76.9 228 66.9 212 100.0 1 100.0 1 97.6

5 rows × 27 columns

In [ ]:
df_2024.head()
Out[ ]:
2024 Rank 2023 Rank Institution Name Location Code Country/Territory Size Focus Research Intensity Status Academic Reputation Score ... International Faculty Rank International Students Score International Students Rank International Research Network Score International Research Network Rank Employment Outcomes Score Employment Outcomes Rank Sustainability Score Sustainability Rank Overall Score
0 1 1 Massachusetts Institute of Technology (MIT) US United States M CO VH B 100.0 ... 56 88.2 128 94.3 58 100.0 4 95.2 51 100
1 2 2 University of Cambridge UK United Kingdom L FC VH A 100.0 ... 64 95.8 85 99.9 7 100.0 6 97.3 33= 99.2
2 3 4 University of Oxford UK United Kingdom L FC VH A 100.0 ... 110 98.2 60 100.0 1 100.0 3 97.8 26= 98.9
3 4 5 Harvard University US United States L FC VH B 100.0 ... 210 66.8 223 100.0 5 100.0 1 96.7 39 98.3
4 5 3 Stanford University US United States L FC VH B 100.0 ... 78 51.2 284 95.8 44 100.0 2 94.4 63 98.1

5 rows × 28 columns

In this section, we focus on preparing the 'Overall Score' data from the QS World University Rankings for 2022, 2023, and 2024. The preparation involves two key steps:

  1. Replacing Missing Values: We convert missing values, originally represented as hyphens ('-'), to NaN (Not a Number) to standardize the dataset for numerical analysis.
  2. Converting to Numeric: The 'Overall Score' column is converted from string type to floating-point numbers, facilitating statistical operations and analysis.

Objectives:

  • Clean and standardize the data for accurate analysis.
  • Enable computation of descriptive statistics and facilitate trend analysis across years.
  • Assess the completeness of the data to ensure robust analytical outcomes.

This data preparation is essential for analyzing global university ranking trends and setting the stage for further in-depth examination of university performances.

In [ ]:
import pandas as pd
import numpy as np

# Replace hyphens with NaN and convert the column to numeric
df_2022['Overall Score'] = pd.to_numeric(df_2022['Overall Score'].replace('-', np.nan), errors='coerce')
df_2023['Overall Score'] = pd.to_numeric(df_2023['Overall Score'].replace('-', np.nan), errors='coerce')
df_2024['Overall Score'] = pd.to_numeric(df_2024['Overall Score'].replace('-', np.nan), errors='coerce')

# Now, 'Overall Score' will be a float column with NaNs where there were hyphens - .

II. Summary Statistics¶

In our analysis of the QS World University Rankings datasets spanning 2022 to 2024, we direct our attention to a curated selection of metrics that significantly influence a university's prestige and global ranking. The evaluation encompasses:

  • Academic Reputation Score: A gauge of a university's academic eminence as recognized by peers.
  • Employer Reputation Score: A reflection of the institution's graduate employability and readiness for the professional world.
  • Citations per Faculty Score: An index of research influence and scholarly impact.
  • International Faculty Score: A measure of the institution's success in fostering a diverse and global faculty.
  • International Students Score: An indicator of the university's ability to attract a worldwide student body.
  • Overall Score: A comprehensive score that embodies all individual metrics, offering a summarized assessment of a university's worldwide standing and performance.

For these pivotal metrics, we compute the mean, standard deviation, median, minimum, and maximum values to provide a distilled overview of university performance. This analysis will shed light on the average achievements, consistency, and range within these critical areas, offering stakeholders a succinct and strategic insight into the dynamics shaping university rankings.

In [ ]:
import pandas as pd

df_2022.describe()
Out[ ]:
Age Band Academic Reputation Score Employer Reputation Score Faculty Student Score Citations per Faculty Score International Faculty Score International Students Score Overall Score
count 1300.000000 1300.000000 1300.000000 1299.000000 1300.000000 1228.000000 1275.000000 501.000000
mean 4.011538 21.552462 22.193000 31.907313 26.293308 26.503746 28.119059 44.767066
std 0.988318 23.315627 24.535947 28.564402 28.299027 35.429502 31.211629 18.961269
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 24.100000
25% 3.000000 6.200000 5.100000 9.400000 3.400000 1.700000 3.750000 29.600000
50% 4.000000 11.900000 11.950000 20.600000 13.400000 5.400000 13.200000 38.600000
75% 5.000000 25.925000 29.625000 47.950000 43.400000 44.425000 44.450000 55.400000
max 5.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
In [ ]:
df_2023.describe()
Out[ ]:
Age Band Academic Reputation Score Employer Reputation Score Faculty Student Score Citations per Faculty Score International Faculty Score International Students Score International Research Network Score Employment Outcomes Score Overall Score
count 1411.000000 1422.000000 1421.000000 1420.000000 1417.000000 1324.000000 1365.000000 1409.000000 1410.000000 500.000000
mean 4.008505 20.124684 20.657143 29.997113 24.529358 31.659517 26.545348 49.570121 26.186809 44.619400
std 0.965320 22.802706 24.027928 28.172207 27.910952 34.170817 30.896854 30.205439 26.201036 18.655057
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 24.200000
25% 3.000000 5.400000 4.400000 8.200000 3.100000 4.800000 3.300000 21.600000 6.700000 29.800000
50% 4.000000 10.800000 10.300000 18.250000 11.100000 13.750000 10.800000 47.700000 15.500000 38.550000
75% 5.000000 23.775000 27.000000 43.500000 39.400000 55.075000 40.500000 77.600000 36.900000 54.500000
max 5.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
In [ ]:
df_2024.describe()
Out[ ]:
Academic Reputation Score Employer Reputation Score Faculty Student Score Citations per Faculty Score International Faculty Score International Students Score International Research Network Score Employment Outcomes Score Sustainability Score Overall Score
count 1498.000000 1497.000000 1474.000000 1474.000000 1372.000000 1418.000000 1494.000000 1474.000000 1398.000000 602.000000
mean 20.132043 19.806880 28.643894 23.940163 30.948834 25.575035 23.967938 20.016961 25.412017 40.879900
std 22.365895 23.764625 27.843868 28.075573 34.247562 30.867149 30.371277 20.241410 31.010557 19.181335
min 1.600000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 19.800000
25% 6.000000 4.100000 7.500000 2.800000 4.300000 3.000000 1.200000 8.225000 1.400000 25.700000
50% 10.900000 9.500000 16.750000 10.400000 13.050000 9.850000 6.850000 11.700000 8.400000 34.550000
75% 23.100000 25.500000 41.900000 37.900000 52.725000 38.075000 40.375000 22.475000 42.525000 51.300000
max 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000

Measure and Variable Definition¶

This section is dedicated to a comprehensive examination of the QS World University Rankings' metrics. We aim to dissect each component of the ranking system to provide an intricate understanding of how universities are evaluated and ranked on the global stage.

The QS ranking framework employs a set of multifaceted metrics, each designed to quantify distinct aspects of university performance. These metrics are:

  • Academic Reputation Score (40%): Derived from a global survey, reflecting the university's standing in the academic community.
  • Employer Reputation Score (10%): Based on employer surveys, indicating the quality and employability of the institution's graduates.
  • Faculty/Student Ratio (20%): A metric that assesses the faculty-to-student ratio, providing insights into the educational environment.
  • Citations per Faculty (20%): This measures the average number of citations per faculty member, serving as an indicator of research impact.
  • International Faculty Score (5%) and International Students Score (5%): Both these scores evaluate the university's internationalization by measuring the diversity of faculty and student bodies.

The Overall Score represents a consolidated assessment derived from these individual metrics, dictating the university's ranking.

In This Section:¶

  • We will analyze each metric in detail, understanding the data sources, methodology, and computation.
  • Discuss the weight each metric carries and its hypothesized impact on the overall ranking.
  • Conduct a comparative evaluation across various universities to identify strengths and weaknesses relative to these metrics.
  • Reflect on the historical evolution of these metrics and their definitions to appreciate changes in higher education quality assessment.

Through this deep dive into the QS ranking metrics, we seek to elucidate the nuances that underpin university rankings, providing a clear guide for institutions aiming to enhance their global standing.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

qs_metrics_weights = {
    'Academic Reputation Score': {"weight": 0.40},
    'Employer Reputation Score': {"weight": 0.10},
    'Faculty Student Score': {"weight": 0.20},
    'Citations per Faculty Score': {"weight": 0.20},
    'International Faculty Score': {"weight": 0.05},
    'International Students Score': {"weight": 0.05},
}

def create_grid_layout_without_definitions(df, metrics_info, year):
    # Set up the figure with subplots
    fig, axes = plt.subplots(2, 3, figsize=(20, 10))  # Adjust figure size as needed
    axes = axes.ravel()
    palette = sns.color_palette("coolwarm", len(metrics_info))

    # Plot each metric in the grid
    for ax, (metric, info), color in zip(axes, metrics_info.items(), palette):
        weight = info['weight']
        sns.histplot(df[metric], kde=True, ax=ax, color=color, alpha=0.7, linewidth=0.5)
        ax.set_title(f"{metric} ({weight*100}%)", fontsize=10)
        ax.set_xlabel('Score', fontsize=9)

    # Add a main title and adjust layout
    plt.suptitle(f'Distribution of QS Ranking Metrics for {year}', fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust the layout
    plt.show()

# Example usage with the 2022 dataset
create_grid_layout_without_definitions(df_2022, qs_metrics_weights, '2022')
create_grid_layout_without_definitions(df_2023, qs_metrics_weights, '2023')
create_grid_layout_without_definitions(df_2024, qs_metrics_weights, '2024')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

The QS World University Rankings across 2022, 2023, and 2024 highlight a consistent pattern among key metrics that determine institutional prestige. The Academic Reputation Score, as the most weighted metric, displays a persistent skew towards a select echelon of universities, emphasizing the enduring recognition of established institutions. Variability in Employer Reputation and Faculty Student ratios across these years reflects evolving perceptions of graduate quality and educational resource allocation. The metrics for Research Impact and Internationalization, though varied, indicate a continuous commitment to global engagement and scholarly output. Collectively, these trends reaffirm the comprehensive criteria of the QS rankings and the sustained excellence among leading universities on a global scale.

Exploratory Data Analysis (EDA)¶

I. Ranking Trends¶

  • Tracking shifts in rankings across the years and pinpointing outliers.
  • Identifying institutions with notable improvements or declines.

In today's session, we dove into analyzing the QS World University Rankings from 2022 to 2024. We covered data preparation, explored ranking trends, and applied predictive modeling and statistical analysis to understand the factors influencing university rankings. Through hands-on examples, we demonstrated how to use exploratory data analysis and interactive visualizations to uncover insights and trends within the rankings data. This session aimed to equip participants with practical data analysis skills, showcasing how to extract meaningful information from complex datasets in the context of global university rankings.To focus on the top 10 universities based on their 2024 rankings and visualize their trends over the 2022 to 2024 period,

In [ ]:
import pandas as pd
import plotly.graph_objects as go

# Merge the dataframes on 'Institution Name'
df_merged = pd.merge(pd.merge(df_2022[['Institution Name', '2022 Rank']],
                              df_2023[['Institution Name', '2023 Rank']],
                              on='Institution Name'),
                     df_2024[['Institution Name', '2024 Rank']],
                     on='Institution Name')

# Convert '2024 Rank' to numeric for sorting
df_2024['2024 Rank'] = pd.to_numeric(df_2024['2024 Rank'], errors='coerce')

# Get the top 10 universities based on their 2024 rank
top_10_universities_2024 = df_2024.nsmallest(10, '2024 Rank')['Institution Name'].tolist()

# Filter df_merged to only include the top 10 universities of 2024
df_merged_top_10 = df_merged[df_merged['Institution Name'].isin(top_10_universities_2024)]

# Visualization
fig = go.Figure()

for uni in top_10_universities_2024:
    uni_data = df_merged_top_10[df_merged_top_10['Institution Name'] == uni]
    fig.add_trace(go.Scatter(x=['2022', '2023', '2024'],
                             y=[uni_data['2022 Rank'].values[0], uni_data['2023 Rank'].values[0], uni_data['2024 Rank'].values[0]],
                             mode='lines+markers',
                             name=uni))

fig.update_layout(title='Ranking Trends for Top 10 Universities in 2024',
                  xaxis_title='Year',
                  yaxis_title='Rank',
                  yaxis_autorange='reversed')  # Higher ranks (lower numbers) appear at the top

fig.show()
In [ ]:
import plotly.express as px

# Assuming df_2024 is your DataFrame and it has been preprocessed correctly
fig = px.scatter(df_2024.dropna(subset=['Overall Score', 'Academic Reputation Score']),
                 x='Academic Reputation Score',
                 y='Overall Score',
                 hover_name='Institution Name',
                 color='Country/Territory',  # Using 'Country/Territory' for coloring
                 title='Overall Score vs. Academic Reputation Score by Country')

fig.show()
In [ ]:
import plotly.express as px
import numpy as np

# Assuming 'Citations per Faculty Score' is used for the marker size,
# replace NaN values in this column with a default size, e.g., the median size of the non-NaN values
default_size = df_2024['Citations per Faculty Score'].median()
df_2024['Citations per Faculty Score for Size'] = df_2024['Citations per Faculty Score'].fillna(default_size)

fig = px.scatter(df_2024.dropna(subset=['Citations per Faculty Score', 'Country/Territory']),
                 x='Country/Territory',
                 y='Citations per Faculty Score',
                 size='Citations per Faculty Score for Size',  # Use the new column with no NaNs for size
                 hover_name='Institution Name',
                 color='Citations per Faculty Score',
                 title='Citations per Faculty Score by Country')

fig.update_layout(xaxis_title="Country",
                  yaxis_title="Citations per Faculty Score")

fig.show()
In [ ]:
fig = px.scatter(df_2024.dropna(subset=['Faculty Student Score', 'Overall Score']),
                 x='Faculty Student Score',
                 y='Overall Score',
                 size='Citations per Faculty Score',  # This could indicate research strength
                 hover_name='Institution Name',
                 color='Country/Territory',
                 title='Faculty-Student Ratio vs. Overall Score')

fig.update_layout(xaxis_title="Faculty-Student Score",
                  yaxis_title="Overall Score")

fig.show()
In [ ]:
fig = px.bar(df_2024.dropna(subset=['International Students Score']),
             x='Country/Territory',
             y='International Students Score',
             color='International Students Score',
             hover_name='Institution Name',
             title='International Students Score Across Different Countries')

fig.update_layout(xaxis_title="Country",
                  yaxis_title="International Students Score",
                  xaxis={'categoryorder':'total descending'})

fig.show()
In [ ]:
# Calculate average overall score by country
average_scores = df_2024.groupby('Country/Territory')['Overall Score'].mean().sort_values(ascending=False).head(10).reset_index()

fig = px.bar(average_scores,
             x='Country/Territory',
             y='Overall Score',
             color='Overall Score',
             title='Top 10 Countries by Average Overall Score in QS Rankings')

fig.update_layout(xaxis_title="Country",
                  yaxis_title="Average Overall Score")

fig.show()

II. Correlation Analysis of QS Ranking Metrics¶

In this section, we conduct a comprehensive correlation analysis among the key QS World University Rankings metrics for the year 2022. The objective is to understand how different factors, such as academic reputation, employer reputation, faculty/student ratio, citations per faculty, international faculty, and international students, are interrelated and contribute to the overall score of universities. By examining these relationships, we aim to uncover insights into the dynamics that influence university rankings on a global scale.

The process involves two main steps:

  1. Calculating the Correlation Matrix: We compute the Pearson correlation coefficients between the selected metrics. This statistical measure helps us identify the strength and direction of linear relationships between the metrics. A positive correlation indicates that as one metric increases, the other tends to increase as well, whereas a negative correlation suggests an inverse relationship.

  2. Visualizing the Correlations: To make the correlations more accessible and easier to interpret, we utilize a heatmap visualization. This graphical representation uses color coding to reflect the magnitude of the correlations, ranging from strong positive (warm colors) to strong negative (cool colors) relationships.

In [ ]:
# Calculate the correlation matrix
corr = df_2022[['Academic Reputation Score', 'Employer Reputation Score',
                'Faculty Student Score', 'Citations per Faculty Score',
                'International Faculty Score', 'International Students Score',
                'Overall Score']].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title('Correlation Heatmap of QS Ranking Metrics')
plt.show()
No description has been provided for this image

III. Geographic Distribution of QS Ranked Universities¶

To gain a deeper understanding of the global landscape of higher education as reflected in the QS World University Rankings, we employ choropleth maps to visualize the distribution of ranked universities by country for the years 2022, 2023, and 2024. This geographic analysis allows us to observe trends, patterns, and potentially the regional dynamics influencing higher education excellence on a global scale.

The function create_choropleth_map is crafted to:

  1. Count Universities by Country: For each year, it calculates the number of universities within each country that appear in the QS rankings.
  2. Generate a Choropleth Map: Utilizing Plotly Express, it creates an interactive map highlighting countries based on the count of their ranked universities. The intensity of the color corresponds to the number of universities, providing a clear visual representation of higher education hubs worldwide.

Here's a brief overview of the function and its application:

In [ ]:
import pandas as pd
import plotly.express as px
import plotly.io as pio

# Set default renderer to 'notebook_connected' which works well with nbconvert
pio.renderers.default = 'notebook_connected'

def create_choropleth_map(dataframe, column_name, title):

    # Generate a dictionary of value counts for the specified column
    sample_data = dataframe[column_name].value_counts().to_dict()

    # Convert the dictionary into a DataFrame
    df_counts = pd.DataFrame(list(sample_data.items()), columns=['Country', 'University_Count'])
    #print(df_counts)
    # Create the choropleth map
    fig = px.choropleth(df_counts,
                        locations="Country",
                        locationmode='country names',
                        color="University_Count",
                        color_continuous_scale=px.colors.sequential.Reds,  # Reds color scale
                        title=title)

    # Update the layout
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        )
    )
    # Show the figure
    fig.show()

# Use the function with your DataFrame and column
create_choropleth_map(df_2022, 'Country/Territory', 'Number of Universities per Country in 2022')
create_choropleth_map(df_2023, 'Country/Territory', 'Number of Universities per Country in 2023')
create_choropleth_map(df_2024, 'Country/Territory', 'Number of Universities per Country in 2024')

The choropleth maps for the QS World University Rankings from 2022 through 2024 consistently show that North America and Europe maintain a dominant presence with the highest number of globally recognized universities. This steadfast pattern underscores the concentration of academic prestige and resources in these regions. Despite the passage of time, the geographic distribution of leading institutions remains relatively unchanged, highlighting a persistent imbalance in global educational prominence. The continuity of this trend into 2024 further suggests that while there is global progress in higher education, efforts to diversify and enhance representation in the rankings could be strengthened to reflect a more inclusive global academic landscape.

IV. Internationalization¶

The internationalization of universities is a key factor in global higher education, reflecting an institution's ability to attract faculty and students from across the world. This aspect of university performance is crucial in the QS World University Rankings, which consider the presence of international faculty and students as significant metrics. In this analysis, we delve into the impact of internationalization on the overall rankings of universities for the years 2022, 2023, and 2024. By evaluating the scores for international faculty and students, we aim to uncover the extent to which these factors influence the global standing of universities.

The analysis involves:

  1. Data Preparation: Isolating the international faculty and student scores along with the overall ranking from each year's dataset.
  2. Correlation and Regression Analysis: Examining the relationship between internationalization metrics (faculty and students) and overall ranking through statistical techniques to quantify their influence.
  3. Trend Observation: Identifying patterns or changes over the specified years to understand how internationalization trends affect university rankings.
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming df_2022, df_2023, df_2024 have been loaded and cleaned

# Define the region of interest, for example, 'North America'
region_of_interest = 'US'

# Filter the datasets to include only universities from the specified region
df_region_2022 = df_2022[df_2022['Location Code'] == region_of_interest]
df_region_2023 = df_2023[df_2023['Location Code'] == region_of_interest]
df_region_2024 = df_2024[df_2024['Location Code'] == region_of_interest]

def plot_region_universities(df, year):
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    fig.suptitle(f'{region_of_interest} Universities Internationalization Impact in {year}', fontsize=16)

    # Plot International Faculty Score vs Overall Score
    sns.scatterplot(ax=axes[0], x='International Faculty Score', y='Overall Score', data=df)
    axes[0].set_title('International Faculty Score vs Overall Score')
    axes[0].invert_yaxis()  # Higher rankings should appear at the top
    axes[0].set_xlabel('International Faculty Score')
    axes[0].set_ylabel('Overall Score')

    # Plot International Students Score vs Overall Score
    sns.scatterplot(ax=axes[1], x='International Students Score', y='Overall Score', data=df)
    axes[1].set_title('International Students Score vs Overall Score')
    axes[1].invert_yaxis()  # Higher rankings should appear at the top
    axes[1].set_xlabel('International Students Score')
    axes[1].set_ylabel('Overall Score')

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

# Plotting for the selected region across the years
plot_region_universities(df_region_2022, '2022')
plot_region_universities(df_region_2023, '2023')
plot_region_universities(df_region_2024, '2024')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

From 2022 to 2024, the trend across U.S. universities suggests a consistent and positive relationship between internationalization—measured by international faculty and student scores—and overall university rankings. This indicates that a strong international presence on campus may be a key contributor to maintaining and improving a university's competitive edge in the global academic arena.

Empirical Results¶

I. Regression Analysis¶

  • Linear regression models to estimate the effect of ranking metrics on the overall score.
  • Discussion of the model's assumptions, validations, and any transformations applied to the data.

We plan to use various QS ranking metrics such as Academic Reputation Score, Employer Reputation Score, Faculty Student Score, Citations per Faculty Score, International Faculty Score, and International Students Score to predict whether a university will be in the top 500 of the QS World University Rankings for the year 2024.

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Convert 'National Rank' to numeric; use 'coerce' to handle any conversion errors
df_2024['National Rank'] = pd.to_numeric(df_2024['2024 Rank'], errors='coerce')

# Create a binary target variable where 1 indicates ranking in the top 500, and 0 otherwise
df_2024['Top_500'] = (df_2024['2024 Rank'] <= 500).astype(int)

# Select features
features = ['Academic Reputation Score', 'Employer Reputation Score',
            'Faculty Student Score', 'Citations per Faculty Score',
            'International Faculty Score', 'International Students Score']

# Drop rows with NaNs in the features or target
df_2024 = df_2024.dropna(subset=features + ['Top_500'])

X = df_2024[features]
y = df_2024['Top_500']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
log_reg = LogisticRegression()

# Fit the model
log_reg.fit(X_train, y_train)

# Predict on the testing set
y_pred = log_reg.predict(X_test)

# Print classification report and confusion matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plotting using seaborn
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
              precision    recall  f1-score   support

           0       0.91      0.96      0.94       237
           1       0.62      0.41      0.49        37

    accuracy                           0.89       274
   macro avg       0.77      0.68      0.71       274
weighted avg       0.87      0.89      0.88       274

[[228   9]
 [ 22  15]]
No description has been provided for this image

II. Predictive Modelling¶

  • Developing predictive models to forecast future rankings based on identified trends.
  • Validation of predictive accuracy through back-testing with historical data.

We plan to develop a predictive model using the QS World University Rankings data from 2022 and 2023 to forecast whether a university will be ranked in the top 500 for 2024. This model will leverage key performance metrics, including Academic Reputation Score, Employer Reputation Score, Faculty Student Score, Citations per Faculty Score, International Faculty Score, and International Students Score. Through this analysis, we aim to identify the significant predictors of ranking success and understand the dynamics and trends that influence a university's position in the global rankings. This predictive insight will help stakeholders make informed decisions and strategies to improve their institutions' standings in future QS World University Rankings.

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Convert ranking columns to numeric and create a binary target variable for the top 500 ranking
df_2022['2022 Rank'] = pd.to_numeric(df_2022['2022 Rank'], errors='coerce')
df_2023['2023 Rank'] = pd.to_numeric(df_2023['2023 Rank'], errors='coerce')
df_2024['2024 Rank'] = pd.to_numeric(df_2024['2024 Rank'], errors='coerce')

df_2022['Top_500'] = (df_2022['2022 Rank'] <= 500).astype(int)
df_2023['Top_500'] = (df_2023['2023 Rank'] <= 500).astype(int)

# Concatenate the 2022 and 2023 data for training
features = ['Academic Reputation Score', 'Employer Reputation Score', 'Faculty Student Score',
            'Citations per Faculty Score', 'International Faculty Score', 'International Students Score']
X_train = pd.concat([df_2022[features], df_2023[features]], ignore_index=True)
y_train = pd.concat([df_2022['Top_500'], df_2023['Top_500']], ignore_index=True)

X_test = df_2024[features]
y_test = (df_2024['2024 Rank'] <= 500).astype(int)

# Define the base models for the voting classifier
log_clf = LogisticRegression(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
svc_clf = SVC(probability=True, random_state=42)

# Create a voting classifier
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svc_clf)],
    voting='soft'
)

# Create a pipeline with preprocessing and the voting classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', voting_clf)
])

# Set up the grid search for hyperparameter tuning
param_grid = {
    'classifier__lr__C': [0.1, 1, 10],
    'classifier__rf__n_estimators': [50, 100, 200],
    'classifier__svc__C': [0.1, 1, 10]
}

# Execute grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Evaluate on the test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

# Display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Best parameters: {'classifier__lr__C': 0.1, 'classifier__rf__n_estimators': 50, 'classifier__svc__C': 0.1}
Best score: 0.87141594711279
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      1171
           1       0.63      0.43      0.51       199

    accuracy                           0.88      1370
   macro avg       0.77      0.69      0.72      1370
weighted avg       0.87      0.88      0.87      1370

No description has been provided for this image

Reference¶

Our excel files come from links below:

  • View the QS World University Rankings 2022 Report
  • QS World University Rankings 2023 Result Tables - Excel
  • QS World University Rankings 2024 Results Table - Excel