
Rishabh-git10/Course-Quality-Analysis


Automated Course Quality Benchmarking

A comparative analysis of computer science pedagogy across MIT OpenCourseWare, NPTEL, and Independent YouTube creators. This pipeline ingests lecture audio, transcribes it, and extracts linguistic/semantic features to objectively quantify teaching styles, content density, and complexity.

Methodology

  • Ingestion: yt-dlp for audio extraction and directory mapping.
  • Transcription: OpenAI Whisper.
  • NLP Extraction: NLTK, Syllapy, and SentenceTransformer (all-MiniLM-L6-v2) for feature extraction.
  • Validation: Pydantic V2 schemas for strict data serialization.
  • Visualization: Seaborn for benchmark distributions.

Metrics Glossary

  • Fog Index: A readability test estimating the years of formal education required to understand the text on the first reading.
  • Word Count & Complex Word Count: The total volume of words versus the aggregate number of words containing three or more syllables (calculated via syllapy).
  • Average Sentence Length: The mean number of words per parsed sentence, used to gauge pacing and structural delivery.
  • Semantic Blabber Score: A custom density metric derived from the cosine similarity between each spoken sentence's embedding and an embedding of core computer science keywords. A higher score indicates that the speaker deviates further from strictly technical topics (e.g., into classroom logistics or personal anecdotes).
  • Polarity Score: VADER sentiment analysis ranging from -1.0 (highly negative) to 1.0 (highly positive).
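
As an illustration of the Fog Index entry above, here is a minimal sketch of the standard Gunning fog formula, 0.4 × (average sentence length + percentage of complex words). It is not the project's implementation: the vowel-group syllable counter is a rough stand-in for syllapy, the regex sentence splitter is a stand-in for NLTK, and the function names are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption, not syllapy): count vowel groups."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    # Discount a trailing silent 'e' ("code"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fog_index(text: str) -> float:
    """Gunning fog: 0.4 * (words per sentence + 100 * complex words / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" follows the glossary definition: three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)
```

Because the syllable heuristic is approximate, scores from this sketch will not match syllapy-based results exactly; it is only meant to show how sentence length and complex-word share combine into one number.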

Key Analytical Inferences

Processing 100+ transcripts through the pipeline revealed clear, quantifiable distinctions in educational delivery:

  1. The Complexity Paradox: MIT OCW exhibited the lowest median Fog Index (8.68) and the lowest percentage of complex words (6.80%). Notably, despite MIT and Independent videos having nearly identical total Word Counts (~7,800 median words), Independent creators used nearly double the volume of complex, multi-syllabic words (1,027 vs. 541). The data suggests that elite institutional lectures prioritize breaking down complex topics using shorter, simpler sentence structures (median 14.9 words per sentence) rather than relying on heavy technical jargon.

Figure 1: Comparative distributions of academic readability (Fog Index) and jargon density (Complex Word Count).

  2. Content Density: Independent YouTube creators had the lowest "Blabber Score," indicating the highest semantic density. This aligns with platform incentives, where creators strip out conversational filler to maintain strict topical pacing. Institutional lectures (MIT, NPTEL) show higher off-topic deviation, consistent with natural classroom environments that include student Q&A and logistical announcements.

Figure 2: Semantic Blabber Score. Lower scores indicate stricter adherence to core technical keywords.

  3. Delivery Structure: The NPTEL corpus contained significant outliers in Average Sentence Length (mean >36 words). This indicates a continuous, unbroken speaking style with fewer natural pauses, which changes how the material is absorbed compared to the shorter, more deliberate phrasing found in MIT OCW.

Figure 3: Average Sentence Length distributions, highlighting the extreme variance in the NPTEL corpus.
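
To make the Semantic Blabber Score from the glossary more concrete, the sketch below shows one way such a score could be computed. Several things here are assumptions rather than the project's code: the score is taken to be one minus the mean cosine similarity (so that higher means more off-topic, matching the glossary), plain Python lists stand in for all-MiniLM-L6-v2 embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def blabber_score(sentence_vectors, keyword_vector):
    """Assumed definition: 1 - mean similarity to the keyword embedding.

    Higher values mean the sentences sit further from the keyword vector.
    """
    if not sentence_vectors:
        return 0.0
    sims = [cosine_similarity(v, keyword_vector) for v in sentence_vectors]
    return 1.0 - sum(sims) / len(sims)


# Toy 2-D "embeddings": the first axis plays the role of technical content.
keywords = [1.0, 0.0]
on_topic = [[1.0, 0.0], [2.0, 0.0]]
mixed = [[1.0, 0.0], [0.0, 1.0]]
```

With these toy vectors, `blabber_score(on_topic, keywords)` is 0.0 and `blabber_score(mixed, keywords)` is 0.5. In a real run the vectors would come from the sentence encoder rather than being written by hand.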

Directory Structure

Course-Quality-ETL/
├── data/
│   ├── raw_audio/
│   └── transcripts/
├── output/
│   ├── nlp_processing/
│   ├── plots/
│   └── pipeline.log
├── src/
│   ├── config.py
│   ├── download_audio.py
│   ├── transcribe_audio.py
│   ├── nlp.py
│   └── final_comparison.py
└── requirements.txt

Quick Start

  1. Environment Setup: The environment is optimized for uv dependency resolution.
uv venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate
uv pip install -r requirements.txt
  2. Pipeline Execution: The pipeline is fully decoupled, so each module can be executed independently.
# 1. Ingest Raw Audio (Configure source URLs in src/download_audio.py)
python -m src.download_audio

# 2. Execute Whisper Transcription
python -m src.transcribe_audio

# 3. Extract NLP Features
python -m src.nlp

# 4. Generate Seaborn Benchmarks
python -m src.final_comparison
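
If running all four stages in sequence is preferred, a small driver along the following lines could chain them. This is a hypothetical convenience script, not part of the repository: the module names are taken from the commands above, while `build_commands` and `run_pipeline` are illustrative names, and the script is assumed to be run from the project root with the virtual environment active.

```python
import subprocess
import sys

# Module names as listed in the Quick Start commands.
PIPELINE_MODULES = [
    "src.download_audio",
    "src.transcribe_audio",
    "src.nlp",
    "src.final_comparison",
]


def build_commands(modules=PIPELINE_MODULES):
    """Return one `python -m <module>` command per stage, in order."""
    # sys.executable keeps every stage on the currently active interpreter.
    return [[sys.executable, "-m", module] for module in modules]


def run_pipeline(modules=PIPELINE_MODULES):
    """Run each stage in turn; check=True stops at the first failing stage."""
    for command in build_commands(modules):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` runs the full sequence, and passing a shorter list such as `["src.nlp", "src.final_comparison"]` re-runs only the later stages, which fits the decoupled design described above.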
