scraper.py — Summary Map
UK News Scraper · Internal Training Reference · Python Beginners
Section 1
Imports
csv, json, os
Built-in standard library
time, logging, random
Built-in standard library
dataclasses
Data containers with auto-init
datetime / UTC
Timezone-aware timestamps
requests
HTTP web requests (3rd party)
feedparser
RSS/Atom feed parser (3rd party)
BeautifulSoup
HTML parser (3rd party)
dateutil
Flexible date parsing (3rd party)
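The import list above can be sketched as the header of the script. This is a reconstruction from the map, not the real file; the exact grouping and the guarded third-party block are assumptions.

```python
# Sketch of the scraper.py import header (module names taken from the map).
import csv
import json
import os
import time
import logging
import random
from dataclasses import dataclass, fields
from datetime import datetime, timezone

# Third-party packages — install with:
#   pip install requests feedparser beautifulsoup4 python-dateutil
try:
    import requests
    import feedparser
    from bs4 import BeautifulSoup
    from dateutil import parser as dateutil_parser
except ImportError as exc:
    # Guarding the imports is an assumption for this sketch; it lets the
    # header load even when a dependency is missing.
    logging.warning("Missing third-party dependency: %s", exc)
```

The try/except guard is optional; the real script may simply import everything at the top and fail fast.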
Section 2
Configuration Constants
USER_AGENT
Identifies scraper to web servers
REQUEST_TIMEOUT
15s max wait per request
MIN_DELAY / MAX_DELAY
2–5s polite pause range
OUTPUT_DIR/CSV/HTML/JSON/ODT
Output file paths
HEADERS
HTTP headers sent with requests
logging.basicConfig()
Timestamps + severity levels
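A minimal sketch of the constants block, assuming values consistent with the map (the user-agent string, file names, and log format are hypothetical):

```python
import logging
import os

USER_AGENT = "UKNewsScraper/1.0 (internal training)"  # identifies scraper to servers
REQUEST_TIMEOUT = 15           # seconds: max wait per request
MIN_DELAY, MAX_DELAY = 2, 5    # polite pause range between requests (seconds)

OUTPUT_DIR = "output"
OUTPUT_CSV = os.path.join(OUTPUT_DIR, "articles.csv")
OUTPUT_HTML = os.path.join(OUTPUT_DIR, "articles.html")
OUTPUT_JSON = os.path.join(OUTPUT_DIR, "articles.json")
OUTPUT_ODT = os.path.join(OUTPUT_DIR, "articles.odt")

HEADERS = {"User-Agent": USER_AGENT}   # sent with every HTTP request

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # timestamps + severity
)
```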
Section 3
Article Data Model
@dataclass Article
Blueprint for one article
.source
BBC News, Guardian, etc.
.title
Headline text
.url
Link to article
.summary
Lead paragraph
.author
Byline / journalist name
.published_date
ISO 8601 timestamp
CSV_FIELDNAMES
Auto-derived from field names
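The model and the auto-derived column list can be sketched like this (field names come from the map; which fields are optional is an assumption):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Article:
    """Blueprint for one scraped news article."""
    source: str                     # e.g. "BBC News", "The Guardian"
    title: str                      # headline text
    url: str                        # link to the article
    summary: str                    # lead paragraph
    author: Optional[str]           # byline — may be missing
    published_date: Optional[str]   # ISO 8601 timestamp

# Column headers derived automatically from the dataclass fields,
# so the CSV output always stays in sync with the model.
CSV_FIELDNAMES = [f.name for f in fields(Article)]
```

Deriving `CSV_FIELDNAMES` from `dataclasses.fields()` means adding a field to `Article` automatically adds a CSV column.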
Section 4
Utility / Helper Functions
polite_get(url)
Safe HTTP GET with error handling
parse_date(raw)
Normalise any date → ISO 8601
delay()
Random 2–5s polite pause
parse_feed(url)
Fetch and parse RSS/Atom feed
strip_html(raw)
Remove HTML tags from text
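Two of the helpers can be sketched with the standard library alone. Note this `strip_html` uses the stdlib `html.parser` in place of BeautifulSoup so the sketch is self-contained; the real function may differ.

```python
import random
import time
from html.parser import HTMLParser

def delay() -> None:
    """Polite random pause (2-5 s per the map) between requests."""
    time.sleep(random.uniform(2, 5))

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw: str) -> str:
    """Remove HTML tags, returning only the visible text."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    return "".join(extractor.parts).strip()
```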
Sections 5–7
Per-Source Scrapers
scrape_bbc()
RSS + per-page author scrape
_bbc_get_author(url)
CSS selector fallback for byline
scrape_guardian()
Guardian Content JSON API
scrape_independent()
RSS + HTML author fallback
_independent_get_author()
CSS selector + meta tag
scrape_sky_news()
RSS + HTML author fallback
_sky_get_author()
CSS selector + meta tag
Sources: BBC News · The Guardian · Independent · Sky News
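The shared "RSS + author fallback" pattern behind `scrape_independent()` and `scrape_sky_news()` can be sketched as below. The real code uses feedparser; this stand-alone version swaps in the stdlib `xml.etree` to show the same structure, and `get_author` is a hypothetical stand-in for the `_xxx_get_author()` helpers.

```python
import xml.etree.ElementTree as ET

def scrape_rss(feed_xml: str, source: str, get_author=None):
    """Build one article dict per RSS <item>, with a per-page author fallback."""
    articles = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        url = item.findtext("link", default="")
        summary = item.findtext("description", default="")
        author = item.findtext("author")       # feeds often omit the byline…
        if not author and get_author:
            author = get_author(url)           # …so fall back to a page scrape
        articles.append({"source": source, "title": title, "url": url,
                         "summary": summary, "author": author})
    return articles

feed = """<rss><channel>
<item><title>Headline</title><link>https://example.com/a</link>
<description>Lead para</description></item>
</channel></rss>"""
demo = scrape_rss(feed, "Example", get_author=lambda u: "Jane Doe")
```

`scrape_guardian()` differs: it queries a JSON API rather than parsing a feed, so no HTML fallback is needed.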
Section 8
Output Writers
save_to_csv()
DictWriter, column headers
save_to_html()
f-string template, styled table
save_to_json()
json.dump, list comprehension
save_to_odt()
LibreOffice document via odfpy (3rd party)
Formats: .csv · .html · .json · .odt
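The CSV and JSON writers can be sketched as below. Writing to a file-like handle (here a `StringIO`) is a simplification for the sketch; the real functions presumably open the `OUTPUT_*` paths with `with open(...) as f:`.

```python
import csv
import io
import json

CSV_FIELDNAMES = ["source", "title", "url"]   # trimmed field list for the sketch

def save_to_csv(articles, fh):
    """DictWriter emits the column headers, then one row per article dict."""
    writer = csv.DictWriter(fh, fieldnames=CSV_FIELDNAMES)
    writer.writeheader()
    writer.writerows(articles)

def save_to_json(articles, fh):
    """Dump the article dicts as pretty-printed JSON."""
    json.dump(articles, fh, indent=2)

articles = [{"source": "BBC News", "title": "Headline",
             "url": "https://example.com/a"}]
buf = io.StringIO()
save_to_csv(articles, buf)
```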
Section 9
Email HTML Builder
get_html_email_body()
Gmail-safe inline CSS email
defaultdict(list)
Groups articles by source
[:10] slice
Top 10 articles per source
Zebra striping
Alternating row colours
Inline CSS only
No <style> blocks (Gmail strips them)
700px max-width
Email client compatible
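A minimal sketch of the email builder, assuming the behaviour listed above (grouping, top-10 slice, zebra striping, inline CSS, 700px wrapper); the markup details are illustrative, not the script's actual template:

```python
from collections import defaultdict

def get_html_email_body(articles):
    """Gmail-safe email body: inline CSS only, grouped by source."""
    by_source = defaultdict(list)
    for art in articles:
        by_source[art["source"]].append(art)   # group articles by source

    rows = []
    for source, items in by_source.items():
        rows.append(f'<h2 style="margin:8px 0">{source}</h2>')
        for i, art in enumerate(items[:10]):   # top 10 articles per source
            # Zebra striping: alternate row background colours.
            bg = "#ffffff" if i % 2 == 0 else "#f2f2f2"
            rows.append(
                f'<p style="background:{bg};margin:0;padding:4px">'
                f'<a href="{art["url"]}">{art["title"]}</a></p>'
            )

    # Inline CSS only — Gmail strips <style> blocks, so every rule
    # must sit directly on the tag. 700px max-width keeps clients happy.
    return ('<div style="max-width:700px;font-family:Arial,sans-serif">'
            + "".join(rows) + "</div>")
```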
Section 10
main() & Entry Point
scrapers = [...]
List of (name, function) tuples
for name, fn in scrapers
Tuple unpacking
fn()
Call function stored in variable
.extend(results)
Merge articles into master list
try/except
Crash-safe per-source handling
if __name__ == "__main__"
Only runs when executed directly
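The entry-point pattern above can be sketched as follows. The two demo scrapers are hypothetical stand-ins for the real `scrape_*()` functions; they exist only to show the tuple list, the per-source try/except, and `.extend()`.

```python
import logging

def scrape_demo_ok():
    return [{"source": "Demo", "title": "Headline"}]

def scrape_demo_fail():
    raise RuntimeError("site down")

def main():
    all_articles = []
    # List of (name, function) tuples — functions are stored in variables.
    scrapers = [("Demo OK", scrape_demo_ok), ("Demo fail", scrape_demo_fail)]
    for name, fn in scrapers:            # tuple unpacking
        try:
            results = fn()               # call the function held in fn
            all_articles.extend(results) # merge into the master list
        except Exception:
            # Crash-safe: one broken source never stops the others.
            logging.exception("Scraper %s failed; continuing", name)
    return all_articles

if __name__ == "__main__":   # only runs when executed directly, not on import
    main()
```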
Execution Flow
main() (entry point) → scrape_bbc() (RSS + per-page) → scrape_guardian() (JSON API) → scrape_independent() (RSS + fallback) → scrape_sky_news() (RSS + fallback) → save outputs (CSV · HTML · JSON · ODT)
Key Python Concepts Used
@dataclass
try/except
list comprehension
dict.get(key, default)
defaultdict
f-strings
with open() as f:
nested functions
functions in variables
tuple unpacking
list slicing [:10]
ternary expression
Optional[T]
if __name__ == "__main__"
logging
CSS selectors
RSS / feedparser
polite scraping
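A few of the smaller concepts from the list, in three illustrative lines (the data is made up):

```python
entry = {"title": "Headline"}                 # no "author" key present
author = entry.get("author", "Unknown")       # dict.get(key, default)
label = author if author != "Unknown" else "No byline"  # ternary expression
top_three = ["a", "b", "c", "d"][:3]          # list slicing
```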