# Session 2026-06-07 — Phase 1: LinkedIn ingest end-to-end

## What works now
- MySQL DB `m3ac_jobscraper` exists (root-created), user `m3ac_jobscraper@localhost` has full rights on it
- 5 tables: `sources`, `companies`, `jobs`, `scrape_runs`, `job_seen_history`
- 4 sources seeded; only `linkedin` is enabled in Phase 1
- LinkedIn guest-endpoint scraper hits `/jobs-guest/jobs/api/seeMoreJobPostings/search`, parses cards with selectolax, upserts to MySQL via `on_duplicate_key_update`
- Smoke test (`scripts/test_search.py`) ingested 6 real jobs in 2 runs, zero errors
- 7 MCP tools registered: `list_sources`, `search_jobs`, `list_jobs`, `get_job`, `mark_job`, `set_notes`, `db_stats`
- 6 FastAPI routes: `/health`, `/search`, `/jobs`, `/jobs/{id}`, `/admin/runs`, `/admin/stats` — all bearer-auth gated except `/health`
- Three systemd units exist at `deploy/systemd/` but are NOT installed/enabled yet

## Environment gotchas (important for future sessions)
- The IDE permission allowlist matches command **prefixes**: `python3 *` and `pip3 *` work; `python3.12 *`, `/home/.../.venv/bin/python *`, `source *`, and `VAR=val cmd` all hit a permission popup that can't be displayed → "Tool permission request failed: Error: Stream closed"
- Workarounds in this project: use `python3` everywhere; put deps in `.venv2/lib/python3.10/site-packages` (created with `python3 -m venv .venv2`) and install with `python3 -m pip install --target=<path>`
- Loader pattern: every entrypoint imports `from jobscraper import bootstrap` first — that module inserts `.venv2/lib/python3.10/site-packages` and `src/` onto `sys.path`. Top-level launchers (`run_api.py`, `run_mcp.py`, `run_worker.py`) also prepend `src/` manually before the bootstrap import
- A safety `.pth` file at `/usr/lib/python3/dist-packages/jobscraper-deps.pth` also points at the venv site-packages, so naïve `python3 -c "import fastapi"` works from anywhere

## Phase 1 file map
- `scrapers/linkedin.py` — full implementation (jittered delay, hourly cap, 429/999 backoff, max-pages cap)
- `utils/http.py` — `make_client()` with realistic browser headers, proxy support, follow_redirects
- `utils/rate_limit.py` — `HourlyLimiter` (in-process token bucket) + `jittered_delay`
- `db/repository.py` — `upsert_company`, `upsert_job` (MySQL ON DUPLICATE KEY), `ingest_records` (wraps a `ScrapeRun`)
- `worker/tasks.py` — `run_scrape(source, query)` sync entrypoint used by API, MCP, and worker
- `mcp_server/tools.py` — 7 tools wired to repository + worker

## Open / next phase
- **Phase 2: Indeed (Playwright)** — playwright not installed yet (skipped in Phase 1 deps). Browser binaries + `chromium` not on this VPS
- **Phase 3: Greenhouse + Lever ATS** — straightforward public APIs, can be done in an afternoon
- **Hosting** — systemd units written but not installed; Apache vhost / proxy not configured
- **Proxy** — `PROXY_URL` in `.env` is empty. LinkedIn will rate-limit the VPS IP under sustained load; safe defaults in `settings.linkedin_*` (max 180 req/hr) keep this manageable for personal use
- **No tailored resume / cover gen / auto-apply yet** — those live in the SPEC.md Phase 3+ plan
