Soju Hokari 9fb6beb765 ngram
2025-05-01 20:25:34 -04:00
data ngram 2025-05-01 20:25:34 -04:00
articles_with_html_test.csv first commit 2025-04-11 15:14:06 -04:00
concat.ipynb more experiments 2025-05-01 09:42:56 -04:00
control.csv control data and more experiments 2025-04-15 23:56:12 -04:00
dataset_creation.ipynb more experiments 2025-05-01 09:42:56 -04:00
download_html.ipynb Add SAGE 2025-04-24 12:15:46 -04:00
experiments.ipynb Add SAGE 2025-04-24 12:15:46 -04:00
experiments2.ipynb ngram 2025-05-01 20:25:34 -04:00
mc-onlinenews-mediacloud-20250410165020-content.csv first commit 2025-04-11 15:14:06 -04:00
mc-onlinenews-mediacloud-20250415190547-content.csv control data and more experiments 2025-04-15 23:56:12 -04:00
mc-onlinenews-mediacloud-20250415190919-content.csv control data and more experiments 2025-04-15 23:56:12 -04:00
mc-onlinenews-mediacloud-20250415191054-content.csv control data and more experiments 2025-04-15 23:56:12 -04:00
mc-onlinenews-mediacloud-20250415191227-content.csv control data and more experiments 2025-04-15 23:56:12 -04:00
mc-onlinenews-mediacloud-20250415193859-content.csv control data and more experiments 2025-04-15 23:56:12 -04:00
preprocess.ipynb ngram 2025-05-01 20:25:34 -04:00
README.md control data and more experiments 2025-04-15 23:56:12 -04:00
test.ipynb first commit 2025-04-11 15:14:06 -04:00
tokenized_articles_test.csv first commit 2025-04-11 15:14:06 -04:00

Workflow

concat.ipynb combines the separately downloaded CSV files into a single file.
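As a rough illustration of this step, the sketch below appends the rows of several CSV files into one output file using only the standard library. It is not the notebook's actual code: the function name `concat_csvs` is hypothetical, and it assumes every input file shares the same header row.

```python
import csv


def concat_csvs(paths, out_path):
    """Append the rows of several CSV files into one file.

    Assumption: all input files have an identical header row.
    Returns the number of data rows written.
    """
    header = None
    writer = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first non-empty file defines the output header.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Raising on a header mismatch is a deliberate choice here: silently merging files with different columns would produce rows that are hard to debug later in the pipeline.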

download_html.ipynb downloads (scrapes) the HTML for every website listed in the CSV files.
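A minimal sketch of what such a scraping loop might look like is shown below. It is an illustration under assumptions, not the notebook's implementation: the names `fetch` and `download_all` are hypothetical, the CSV is assumed to have a `url` column, and pages are assumed to decode as UTF-8.

```python
import csv
import hashlib
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Return the page body as text (assumes UTF-8, replacing bad bytes)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_all(csv_path, out_dir, url_column="url", fetcher=fetch):
    """Save one HTML file per URL in csv_path; return the URLs that failed.

    Assumption: the CSV has a column named by url_column ("url" here).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            url = row[url_column]
            # Hash the URL so the file name is filesystem-safe and stable.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
            target = out_dir / name
            if target.exists():
                continue  # already downloaded on an earlier run
            try:
                target.write_text(fetcher(url), encoding="utf-8")
            except Exception:  # network errors, HTTP errors, timeouts
                failed.append(url)
    return failed
```

Skipping files that already exist makes the loop resumable, which matters when a long scrape is interrupted partway through. The `fetcher` parameter exists so the loop can be exercised without network access.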

dataset_creation.ipynb takes the downloaded HTML files and parses them to create a corpus of tokenized text.
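The parsing and tokenizing step could be sketched roughly as follows, using only the standard library. This is a simplified stand-in rather than the notebook's method: `TextExtractor` and `tokenize_html` are hypothetical names, and the tokenizer (lowercasing plus a regular expression that keeps internal hyphens and apostrophes) is an assumption chosen so that terms like "non-binary" survive as one token.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def tokenize_html(html):
    """Return lowercase word tokens from the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Words made of letters/digits, optionally joined by "-" or "'".
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text)
```

For example, `tokenize_html("<p>Non-binary athletes</p>")` yields `["non-binary", "athletes"]`. Real news pages also contain navigation and ad text, so a production pipeline would likely need a content-extraction step before tokenizing.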

experiments.ipynb contains my preliminary experiments.

Datasets

These datasets were created from MediaCloud search results.

mc-onlinenews-mediacloud-20250410165020-content.csv

  • QUERY: (transgender OR nonbinary OR non-binary) AND (sport OR athlet)
  • TIMEFRAME: April 8, 2020 to April 8, 2025

control.csv

  • QUERY: (sport OR athlet) -(transgender OR nonbinary OR non-binary)
  • TIMEFRAME: April 8, 2020 to April 8, 2025

Created from the following files:

  • mc-onlinenews-mediacloud-20250415190547-content.csv
  • mc-onlinenews-mediacloud-20250415190919-content.csv
  • mc-onlinenews-mediacloud-20250415191054-content.csv
  • mc-onlinenews-mediacloud-20250415191227-content.csv
  • mc-onlinenews-mediacloud-20250415193859-content.csv