
Academic Data Scraper
"Teaching a humanities PhD student to automate the extraction of authenticated news corpora for her academic thesis research."
This project grew out of a mentorship request on Superprof, for which I designed a tailored learning path. Together, we built a resilient suite of Playwright scrapers that handle complex authentication flows (Google Login via Iubenda consent walls) and dynamic DOM pagination across portals such as La Repubblica, Leonardo, and Eni.
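As a rough illustration of the pagination pattern (a minimal sketch, not the project's actual code), the loop below collects article links from a listing and clicks a "next" control until none is left. The function name `collect_article_links`, its selector parameters, and the `max_pages` cap are hypothetical. It assumes a Playwright async `Page` object and uses only `query_selector_all`, `query_selector`, `get_attribute`, `click`, and `wait_for_load_state`.

```python
async def collect_article_links(page, link_selector, next_selector, max_pages=50):
    """Walk a paginated listing and return unique article URLs in first-seen order.

    `page` is assumed to be a Playwright async Page; the selectors are
    site-specific and hypothetical here.
    """
    links = []
    for _ in range(max_pages):  # hard cap so a broken "next" button cannot loop forever
        # Collect the href of every matching anchor on the current listing page.
        for anchor in await page.query_selector_all(link_selector):
            href = await anchor.get_attribute("href")
            if href and href not in links:
                links.append(href)

        # Stop when the listing no longer offers a "next" control.
        next_button = await page.query_selector(next_selector)
        if next_button is None:
            break

        await next_button.click()
        # Wait for the dynamically loaded results before reading the DOM again.
        await page.wait_for_load_state("networkidle")
    return links
```

In a real run, `page` would come from `playwright.async_api` after the login and consent steps have completed, and the two selectors would be tuned per portal.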
Programming from Zero to One
The student had no technical background. Over a series of weekly one-on-one sessions, I walked her through the fundamentals of Python, from data types and loops to asynchronous execution, and then through locating DOM elements with CSS selectors and XPath expressions.
XML Corpus Serialization
Extracting the unstructured data was only half the challenge. I developed serialization utilities, and taught her how they work, to take the thousands of scraped digital and print articles and structure them into well-formed XML trees that feed directly into her academic text analysis software.
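The serialization step can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`corpus`, `article`, `title`, `body`, `p`), the attribute names, and the dictionary keys are assumptions made for illustration; the schema her analysis software actually expects is not described here.

```python
import xml.etree.ElementTree as ET


def build_corpus(articles):
    """Turn a list of article dicts into an XML tree (hypothetical schema)."""
    root = ET.Element("corpus")
    for index, article in enumerate(articles, start=1):
        node = ET.SubElement(
            root,
            "article",
            id=str(index),
            source=article["source"],
            date=article["date"],
        )
        ET.SubElement(node, "title").text = article["title"]
        body = ET.SubElement(node, "body")
        # One <p> element per paragraph keeps the text structure intact.
        for paragraph in article["paragraphs"]:
            ET.SubElement(body, "p").text = paragraph
    return root


def to_xml_string(root):
    """Serialize the tree; ElementTree escapes &, < and > in text for us."""
    ET.indent(root)  # pretty-printing, available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

Building the tree through the API rather than concatenating strings means special characters in headlines and body text are escaped automatically, so the output stays well-formed.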