# Session 2026-06-07 (cont.) — Phase 4: scale-out + scoring

## Direction
Michael flagged this as a long-term scalable project — more boards coming.
Refactored the foundation accordingly so adding sources is cheap.

## What's new

### Scoring + profile
- `profiles` and `resumes` tables added to MySQL
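- `matching.py`: score is a weighted sum of five components (weights total 1.0), then adjusted by two overrides:
  - skills overlap: 0.5
  - must-have hits: 0.2
  - remote alignment: 0.1
  - salary alignment: 0.1
  - recency: 0.1
  - overrides: hard zero if an excluded keyword or company matches; multiplicative penalty when a must-have is missing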
- MCP tools: `set_profile`, `get_profile`, `score_new_jobs`, `top_jobs`, `list_resumes`, `register_resume`, `fetch_details`
- Sample run scored the 19 existing jobs — Netflix / Hired / Palantir Python roles ranked top
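
A minimal sketch of the scoring rule described above, not the actual `matching.py`. The weights come from these notes; the `Profile` and `Job` fields, the 0.5 penalty factor, the 30-day recency window, and the naive substring matching are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Weights as listed in the notes; they sum to 1.0.
WEIGHTS = {"skills": 0.5, "must_have": 0.2, "remote": 0.1, "salary": 0.1, "recency": 0.1}
MISSING_MUST_HAVE_PENALTY = 0.5  # assumed factor; the notes do not state the value
RECENCY_WINDOW_DAYS = 30         # assumed; older postings score 0 on recency


@dataclass
class Profile:  # hypothetical shape, not the real `profiles` table
    skills: set
    must_have: set = field(default_factory=set)
    exclude_keywords: set = field(default_factory=set)
    exclude_companies: set = field(default_factory=set)
    wants_remote: bool = True
    min_salary: int = 0


@dataclass
class Job:  # hypothetical shape
    title: str
    company: str
    description: str
    remote: bool
    salary_max: Optional[int]
    posted_at: datetime


def score(job: Job, profile: Profile, now: Optional[datetime] = None) -> float:
    now = now or datetime.now(timezone.utc)
    text = f"{job.title} {job.description}".lower()

    # Hard zero: excluded company or keyword short-circuits everything else.
    if job.company.lower() in {c.lower() for c in profile.exclude_companies}:
        return 0.0
    if any(k.lower() in text for k in profile.exclude_keywords):
        return 0.0

    # Each component is normalized to the range 0..1 before weighting.
    skill_hits = sum(s.lower() in text for s in profile.skills)
    skills = skill_hits / len(profile.skills) if profile.skills else 0.0
    must_hits = sum(m.lower() in text for m in profile.must_have)
    must = must_hits / len(profile.must_have) if profile.must_have else 1.0
    remote = 1.0 if job.remote == profile.wants_remote else 0.0
    salary = 1.0 if job.salary_max is not None and job.salary_max >= profile.min_salary else 0.0
    age_days = (now - job.posted_at).total_seconds() / 86400
    recency = min(1.0, max(0.0, 1.0 - age_days / RECENCY_WINDOW_DAYS))

    total = (
        WEIGHTS["skills"] * skills
        + WEIGHTS["must_have"] * must
        + WEIGHTS["remote"] * remote
        + WEIGHTS["salary"] * salary
        + WEIGHTS["recency"] * recency
    )
    # Multiplicative penalty when at least one must-have is missing.
    if must_hits < len(profile.must_have):
        total *= MISSING_MUST_HAVE_PENALTY
    return round(total, 4)
```

Substring matching is the simplest thing that works for a sketch; it will over-match short skill names (for example `go`), so a real implementation would likely tokenize first.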

### Per-source rate limiting (was hardcoded LinkedIn-only)
- `AbstractScraper.__init__` now loads throttle config from
  `_DEFAULT_THROTTLES[name]`, then applies any override from `sources.config_json.throttle`
- All 4 existing scrapers refactored to use `self._min_delay`, `self._max_delay`, `self._limiter`, and `self._max_pages`
- Greenhouse/Lever get fast defaults (0.2–0.6s, 1000/hr); LinkedIn keeps cautious 2–5s, 180/hr; Indeed gets 3–7s, 90/hr

### Two more sources
- **USAJobs** (`src/jobscraper/scrapers/usajobs.py`) — federal jobs, free API. Needs a free auth key from webmaster@usajobs.gov. Enabled in DB; will start returning jobs the moment key is added to `config_json.auth_key`
- **Adzuna** (`src/jobscraper/scrapers/adzuna.py`) — aggregates Indeed/Reed legitimately, free 1k/mo. Self-service signup at developer.adzuna.com. Disabled in DB; flip to enabled after pasting `app_id` + `app_key` into `config_json`

### Parallel multi-source search
- `worker/tasks.py:run_scrape_all(query, sources=None)` — fans out across all enabled sources via `asyncio.gather`
- MCP tool: `search_all(keywords, location, remote, posted_within_days, limit, sources)`
- Verified live: a single call ingested jobs from LinkedIn + Greenhouse + Lever in parallel
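
A rough illustration of the fan-out pattern with `asyncio.gather`. This is not the code in `worker/tasks.py`: the `fan_out` helper, the scraper-callable mapping, and the choice of `return_exceptions=True` (so one blocked source does not abort the others) are assumptions, and the two scrapers are stand-ins.

```python
import asyncio


async def fan_out(query, scrapers, sources=None):
    """Run every selected scraper concurrently and split results from failures.

    `scrapers` maps source name -> async callable taking the query.
    `sources=None` means "all of them", mirroring the signature in the notes.
    """
    selected = {
        name: fn for name, fn in scrapers.items()
        if sources is None or name in sources
    }
    # gather preserves argument order, so results line up with `selected`.
    results = await asyncio.gather(
        *(fn(query) for fn in selected.values()),
        return_exceptions=True,
    )
    jobs, errors = {}, {}
    for name, result in zip(selected, results):
        if isinstance(result, BaseException):
            errors[name] = repr(result)
        else:
            jobs[name] = result
    return jobs, errors


# Stand-in scrapers for demonstration only.
async def fake_greenhouse(query):
    await asyncio.sleep(0.01)
    return [f"greenhouse:{query}"]


async def fake_indeed(query):
    raise RuntimeError("403 Forbidden")


if __name__ == "__main__":
    demo = {"greenhouse": fake_greenhouse, "indeed": fake_indeed}
    print(asyncio.run(fan_out("python", demo)))
```

Because each scraper carries its own rate limiter, running them concurrently should not change how hard any single site is hit.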

### Docs
- `docs/ADD_SOURCE.md` — step-by-step "add a new source" guide, including 10+ candidate sources with effort estimates

## DB state
```
linkedin   : 14
greenhouse :  9
lever      : 10
total      : 33
sources enabled: 4 (linkedin, greenhouse, lever, usajobs awaiting key)
sources scaffolded: 6 total (+ indeed + adzuna)
```

## Scrapers registered

| Source | Type | Status |
|---|---|---|
| linkedin | HTML guest endpoint | ✅ live |
| greenhouse | public JSON API | ✅ live |
| lever | public JSON API | ✅ live |
| usajobs | REST API | 🔑 needs free auth key |
| adzuna | REST API (aggregator) | 🔑 needs free app_id+app_key |
| indeed | Playwright | 🛡 blocked by Cloudflare without proxy |

## How to extend
See `docs/ADD_SOURCE.md`. tl;dr: one file in `scrapers/`, one line in
`registry.py`, one line in `_DEFAULT_THROTTLES`, one row in `seed_sources.py`.
Framework handles rate limiting, DB upsert, MCP exposure, parallel runs.

## Open / next
- **API keys to request** (~5 min each; USAJobs is by email, Adzuna is self-service):
  - USAJobs: webmaster@usajobs.gov
  - Adzuna: developer.adzuna.com signup
- **More sources candidate list** (in `docs/ADD_SOURCE.md`): Wellfound, Ashby, SmartRecruiters, Workable, Recruitee, Jobicy, Himalayas, RemoteOK
- **Apply adapters** — Greenhouse + Lever both POST to public application endpoints; design from SPEC with the `confirm_token` rail
- **Profile refinement** — current sample profile is generic Python/SQL; Michael's real preferences haven't been captured yet (call `set_profile` with the right skills/salary)
- **Indeed** — still 403 until proxy
- **Daily scheduled scrapes** — `worker/scheduler.py` empty; add APScheduler jobs once profile is set so it can auto-rank overnight ingests
