scraper.py — Summary Map
UK News Scraper · Internal Training Reference · Python Beginners
Section 1
Imports
csv, json, os
Built-in standard library
time, logging, random
Built-in standard library
dataclasses
Data containers with auto-init
datetime / UTC
Timezone-aware timestamps
requests
HTTP web requests (3rd party)
feedparser
RSS/Atom feed parser (3rd party)
BeautifulSoup
HTML parser (3rd party)
dateutil
Flexible date parsing (3rd party)
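The import list above can be sketched as the header of the script. This is a reconstruction from the map, not the real file; the exact grouping and the guarded third-party block are assumptions.

```python
# Sketch of the scraper.py import header (module names taken from the map).
import csv
import json
import os
import time
import logging
import random
from dataclasses import dataclass, fields
from datetime import datetime, timezone

# Third-party packages — install with:
#   pip install requests feedparser beautifulsoup4 python-dateutil
try:
    import requests
    import feedparser
    from bs4 import BeautifulSoup
    from dateutil import parser as dateutil_parser
except ImportError as exc:
    # Guarding the imports is an assumption for this sketch; it lets the
    # header load even when a dependency is missing.
    logging.warning("Missing third-party dependency: %s", exc)
```

The try/except guard is optional; the real script may simply import everything at the top and fail fast.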
Section 2
Configuration Constants
USER_AGENT
Identifies scraper to web servers
REQUEST_TIMEOUT
15s max wait per request
MIN_DELAY / MAX_DELAY
2–5s polite pause range
OUTPUT_DIR/CSV/HTML/JSON/ODT
Output file paths
HEADERS
HTTP headers sent with requests
logging.basicConfig()
Timestamps + severity levels
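A minimal sketch of the constants block, assuming values consistent with the map (the user-agent string, file names, and log format are hypothetical):

```python
import logging
import os

USER_AGENT = "UKNewsScraper/1.0 (internal training)"  # identifies scraper to servers
REQUEST_TIMEOUT = 15           # seconds: max wait per request
MIN_DELAY, MAX_DELAY = 2, 5    # polite pause range between requests (seconds)

OUTPUT_DIR = "output"
OUTPUT_CSV = os.path.join(OUTPUT_DIR, "articles.csv")
OUTPUT_HTML = os.path.join(OUTPUT_DIR, "articles.html")
OUTPUT_JSON = os.path.join(OUTPUT_DIR, "articles.json")
OUTPUT_ODT = os.path.join(OUTPUT_DIR, "articles.odt")

HEADERS = {"User-Agent": USER_AGENT}   # sent with every HTTP request

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # timestamps + severity
)
```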
Section 3
Article Data Model
@dataclass Article
Blueprint for one article
.source
BBC News, Guardian, etc.
.title
Headline text
.url
Link to article
.summary
Lead paragraph
.author
Byline / journalist name
.published_date
ISO 8601 timestamp
CSV_FIELDNAMES
Auto-derived from field names
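The model and the auto-derived column list can be sketched like this (field names come from the map; which fields are optional is an assumption):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Article:
    """Blueprint for one scraped news article."""
    source: str                     # e.g. "BBC News", "The Guardian"
    title: str                      # headline text
    url: str                        # link to the article
    summary: str                    # lead paragraph
    author: Optional[str]           # byline — may be missing
    published_date: Optional[str]   # ISO 8601 timestamp

# Column headers derived automatically from the dataclass fields,
# so the CSV output always stays in sync with the model.
CSV_FIELDNAMES = [f.name for f in fields(Article)]
```

Deriving `CSV_FIELDNAMES` from `dataclasses.fields()` means adding a field to `Article` automatically adds a CSV column.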
Section 4
Utility / Helper Functions
polite_get(url)
Safe HTTP GET with error handling
parse_date(raw)
Normalise any date → ISO 8601
delay()
Random 2–5s polite pause
parse_feed(url)
Fetch and parse RSS/Atom feed
strip_html(raw)
Remove HTML tags from text
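Two of the helpers can be sketched with the standard library alone. Note this `strip_html` uses the stdlib `html.parser` in place of BeautifulSoup so the sketch is self-contained; the real function may differ.

```python
import random
import time
from html.parser import HTMLParser

def delay() -> None:
    """Polite random pause (2-5 s per the map) between requests."""
    time.sleep(random.uniform(2, 5))

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw: str) -> str:
    """Remove HTML tags, returning only the visible text."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    return "".join(extractor.parts).strip()
```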
Sections 5–7
Per-Source Scrapers
scrape_bbc()
RSS + per-page author scrape
_bbc_get_author(url)
CSS selector fallback for byline
scrape_guardian()
Guardian Content JSON API
scrape_independent()
RSS + HTML author fallback
_independent_get_author()
CSS selector + meta tag
scrape_sky_news()
RSS + HTML author fallback
_sky_get_author()
CSS selector + meta tag
Sources: BBC News · The Guardian · Independent · Sky News
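The shared "RSS + author fallback" pattern behind `scrape_independent()` and `scrape_sky_news()` can be sketched as below. The real code uses feedparser; this stand-alone version swaps in the stdlib `xml.etree` to show the same structure, and `get_author` is a hypothetical stand-in for the `_xxx_get_author()` helpers.

```python
import xml.etree.ElementTree as ET

def scrape_rss(feed_xml: str, source: str, get_author=None):
    """Build one article dict per RSS <item>, with a per-page author fallback."""
    articles = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        url = item.findtext("link", default="")
        summary = item.findtext("description", default="")
        author = item.findtext("author")       # feeds often omit the byline…
        if not author and get_author:
            author = get_author(url)           # …so fall back to a page scrape
        articles.append({"source": source, "title": title, "url": url,
                         "summary": summary, "author": author})
    return articles

feed = """<rss><channel>
<item><title>Headline</title><link>https://example.com/a</link>
<description>Lead para</description></item>
</channel></rss>"""
demo = scrape_rss(feed, "Example", get_author=lambda u: "Jane Doe")
```

`scrape_guardian()` differs: it queries a JSON API rather than parsing a feed, so no HTML fallback is needed.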
Section 8
Output Writers
save_to_csv()
DictWriter, column headers
save_to_html()
f-string template, styled table
save_to_json()
json.dump, list comprehension
save_to_odt()
LibreOffice document via odfpy (3rd party)
Formats: .csv · .html · .json · .odt
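The CSV and JSON writers can be sketched as below. Writing to a file-like handle (here a `StringIO`) is a simplification for the sketch; the real functions presumably open the `OUTPUT_*` paths with `with open(...) as f:`.

```python
import csv
import io
import json

CSV_FIELDNAMES = ["source", "title", "url"]   # trimmed field list for the sketch

def save_to_csv(articles, fh):
    """DictWriter emits the column headers, then one row per article dict."""
    writer = csv.DictWriter(fh, fieldnames=CSV_FIELDNAMES)
    writer.writeheader()
    writer.writerows(articles)

def save_to_json(articles, fh):
    """Dump the article dicts as pretty-printed JSON."""
    json.dump(articles, fh, indent=2)

articles = [{"source": "BBC News", "title": "Headline",
             "url": "https://example.com/a"}]
buf = io.StringIO()
save_to_csv(articles, buf)
```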
Section 9
Email HTML Builder
get_html_email_body()
Gmail-safe inline CSS email
defaultdict(list)
Groups articles by source
[:10] slice
Top 10 articles per source
Zebra striping
Alternating row colours
Inline CSS only
No <style> blocks (Gmail strips them)
700px max-width
Email client compatible
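A minimal sketch of the email builder, assuming the behaviour listed above (grouping, top-10 slice, zebra striping, inline CSS, 700px wrapper); the markup details are illustrative, not the script's actual template:

```python
from collections import defaultdict

def get_html_email_body(articles):
    """Gmail-safe email body: inline CSS only, grouped by source."""
    by_source = defaultdict(list)
    for art in articles:
        by_source[art["source"]].append(art)   # group articles by source

    rows = []
    for source, items in by_source.items():
        rows.append(f'<h2 style="margin:8px 0">{source}</h2>')
        for i, art in enumerate(items[:10]):   # top 10 articles per source
            # Zebra striping: alternate row background colours.
            bg = "#ffffff" if i % 2 == 0 else "#f2f2f2"
            rows.append(
                f'<p style="background:{bg};margin:0;padding:4px">'
                f'<a href="{art["url"]}">{art["title"]}</a></p>'
            )

    # Inline CSS only — Gmail strips <style> blocks, so every rule
    # must sit directly on the tag. 700px max-width keeps clients happy.
    return ('<div style="max-width:700px;font-family:Arial,sans-serif">'
            + "".join(rows) + "</div>")
```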
Section 10
main() & Entry Point
scrapers = [...]
List of (name, function) tuples
for name, fn in scrapers
Tuple unpacking
fn()
Call function stored in variable
.extend(results)
Merge articles into master list
try/except
Crash-safe per-source handling
if __name__ == "__main__"
Only runs when executed directly
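The entry-point pattern above can be sketched as follows. The two demo scrapers are hypothetical stand-ins for the real `scrape_*()` functions; they exist only to show the tuple list, the per-source try/except, and `.extend()`.

```python
import logging

def scrape_demo_ok():
    return [{"source": "Demo", "title": "Headline"}]

def scrape_demo_fail():
    raise RuntimeError("site down")

def main():
    all_articles = []
    # List of (name, function) tuples — functions are stored in variables.
    scrapers = [("Demo OK", scrape_demo_ok), ("Demo fail", scrape_demo_fail)]
    for name, fn in scrapers:            # tuple unpacking
        try:
            results = fn()               # call the function held in fn
            all_articles.extend(results) # merge into the master list
        except Exception:
            # Crash-safe: one broken source never stops the others.
            logging.exception("Scraper %s failed; continuing", name)
    return all_articles

if __name__ == "__main__":   # only runs when executed directly, not on import
    main()
```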
Execution Flow
main() (entry point) → scrape_bbc() (RSS + per-page) → scrape_guardian() (JSON API) → scrape_independent() (RSS + fallback) → scrape_sky_news() (RSS + fallback) → save outputs (CSV · HTML · JSON · ODT)
Key Python Concepts Used
@dataclass
try/except
list comprehension
dict.get(key, default)
defaultdict
f-strings
with open() as f:
nested functions
functions in variables
tuple unpacking
list slicing [:10]
ternary expression
Optional[T]
if __name__ == "__main__"
logging
CSS selectors
RSS / feedparser
polite scraping
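A few of the smaller concepts from the list, in three illustrative lines (the data is made up):

```python
entry = {"title": "Headline"}                 # no "author" key present
author = entry.get("author", "Unknown")       # dict.get(key, default)
label = author if author != "Unknown" else "No byline"  # ternary expression
top_three = ["a", "b", "c", "d"][:3]          # list slicing
```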