Workflow
concat.ipynb combines the separately downloaded CSV files.
download_html.ipynb downloads (scrapes) the HTML for every website listed in the CSV files.
dataset_creation.ipynb parses the downloaded HTML files to create a corpus of tokenized text.
experiments.ipynb contains my preliminary experiments.
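
The parsing step in dataset_creation.ipynb can be pictured roughly as follows. This is a minimal sketch that uses only the Python standard library; the parser and tokenizer the notebook actually uses are not stated here, so `TextExtractor`, `html_to_tokens`, and the token pattern are all hypothetical names and choices made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_tokens(html):
    """Return lowercase word tokens from an HTML string (assumed tokenization)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.chunks)
    # Words may contain internal hyphens or apostrophes, e.g. "non-binary".
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
```

A real news page carries navigation, ads, and comments alongside the article body, so a dedicated article extractor would likely give a cleaner corpus than this sketch.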
DATASETS:
These datasets were created from MediaCloud's search.
mc-onlinenews-mediacloud-20250410165020-content.csv
- QUERY: (transgender OR nonbinary OR non-binary) AND (sport OR athlet)
- TIMEFRAME: April 8, 2020 – April 8, 2025
control.csv
- QUERY: (sport OR athlet) -(transgender OR nonbinary OR non-binary)
- TIMEFRAME: April 8, 2020 – April 8, 2025
Created from the following: mc-onlinenews-mediacloud-20250415190547-content.csv, mc-onlinenews-mediacloud-20250415190919-content.csv, mc-onlinenews-mediacloud-20250415191054-content.csv, mc-onlinenews-mediacloud-20250415191227-content.csv, mc-onlinenews-mediacloud-20250415193859-content.csv
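
Combining several MediaCloud exports into one file, as concat.ipynb does, could look something like the sketch below. It assumes every export shares the same header row and does no de-duplication; `concat_csvs` is a hypothetical helper, not the code in the notebook.

```python
import csv


def concat_csvs(paths, out_path):
    """Append the rows of several same-header CSV files into one file.

    Returns the number of data rows written.
    """
    header = None
    writer = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames is None:
                    continue  # empty input file, nothing to copy
                if header is None:
                    # The first non-empty file defines the output header.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `concat_csvs` with the five files listed above and `"control.csv"` as `out_path` would produce a single control file under these assumptions.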