Sravneniya & Vpoiske
Hey Sravneniya, I’ve been digging into how data‑driven investigations can reveal hidden patterns in political lobbying—want to chat about turning messy numbers into clear, actionable stories?
Sounds like a great project. Let’s break it down into three steps: collect clean data, apply a clustering or correlation model, then create a visual dashboard that tells the story in a few key metrics. What data sources are you looking at?
I’m combing through the PAC filing database, the Federal Election Commission’s public ledger, state lobbying registries, FOIA‑released documents, and the corporate ESG reports that most firms publish under voluntary disclosure. I also scrape the financials from the Securities and Exchange Commission’s EDGAR database and pull in the latest congressional expense reports. All of that feeds into a clean, relational dataset I can run my clustering models on.
Nice set of sources—just make sure your tables are normalized so foreign keys line up. For clustering, start with a dimensionality‑reduction step: PCA on the weighted lobbying amounts per sector (save t‑SNE for visualizing the result; its distances aren't reliable inputs for clustering). Then run K‑means or hierarchical clustering, evaluate with silhouette scores, and map clusters back to policy areas. Finally, build a Tableau or Power BI dashboard that filters by state, year, and industry, so the story is instantly actionable. What's the biggest data quality hurdle you're facing right now?
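That reduce-then-cluster step could look roughly like this—a minimal sketch using scikit-learn on a synthetic stand-in for the weighted lobbying matrix (the real rows would be clients, columns sector weights):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in: three groups of "clients" with distinct sector profiles.
X = np.vstack([
    rng.normal(0, 1, (50, 8)),
    rng.normal(5, 1, (50, 8)),
    rng.normal(-5, 1, (50, 8)),
])

# Scale, reduce, cluster, then score cluster separation.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3, random_state=0).fit_transform(X_scaled)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
score = silhouette_score(X_reduced, labels)
print(f"silhouette: {score:.2f}")
```

In practice you'd sweep `n_clusters` and pick the value with the best silhouette score before mapping clusters back to policy areas.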
The biggest hurdle is the sheer inconsistency in how lobbyists report their spend: some use granular line items, others bundle everything into a single “total” figure, and the currency conversions are all over the place. Normalizing that mess before any PCA is a constant headache.
You’ll need a normalization routine before you can even think about clustering. Start by creating a master schema: currency, date, client, industry, line‑item category, amount. Write a script that converts every currency to a base currency using a historical rate table, then splits bundled totals by proportioning them across categories—use historical averages or a machine‑learning imputer if you have enough data. Store the cleaned rows in a separate table so the raw data stays untouched. Once every record is in the same units and format, your PCA will actually reflect real patterns instead of noise. What tooling are you using for the transformation step?
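The two cleaning steps described—base-currency conversion and proportional splitting of bundled totals—might sketch out like this in Pandas. The rate table and category shares here are toy values; the real shares would come from historical averages or an imputer:

```python
import pandas as pd

# Hypothetical historical FX table: (currency, filing date) -> rate to USD.
rates = {("EUR", "2023-01-15"): 1.08, ("USD", "2023-01-15"): 1.00}

raw = pd.DataFrame([
    {"client": "AcmeCo", "date": "2023-01-15", "currency": "EUR",
     "category": "TOTAL", "amount": 10_000.0},
    {"client": "Beta LLC", "date": "2023-01-15", "currency": "USD",
     "category": "media", "amount": 4_000.0},
])

# Step 1: convert every amount to the base currency.
raw["amount_usd"] = raw.apply(
    lambda r: r["amount"] * rates[(r["currency"], r["date"])], axis=1
)

# Step 2: split bundled "TOTAL" rows across categories using
# (illustrative) historical spending proportions.
shares = {"media": 0.6, "travel": 0.4}

def split_row(row: dict) -> list[dict]:
    if row["category"] != "TOTAL":
        return [row]
    return [
        {**row, "category": cat, "amount_usd": row["amount_usd"] * p}
        for cat, p in shares.items()
    ]

clean = pd.DataFrame(
    [r for _, row in raw.iterrows() for r in split_row(row.to_dict())]
)
```

The `clean` frame would then be written to its own table, leaving `raw` untouched as the immutable record.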
I’m sticking with Python for the heavy lifting—Pandas for the data frames, SQLAlchemy to talk to the database, and a custom ETL script in Airflow to orchestrate the whole pipeline. For the currency conversion I pull daily rates from the European Central Bank API, then use a tiny SQLite lookup for the historical averages when I need to split bundled amounts. The cleaned output lands in a PostgreSQL table so I can keep the raw files in S3 as immutable backups. This way I can swap out the imputation model later if the data starts behaving like a different beast.
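The SQLite fallback you describe could be as simple as this—a sketch using an in-memory database and made-up table names (`daily_rates`, `monthly_avg`), where a missing daily ECB rate falls back to the stored monthly average:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_rates (currency TEXT, day TEXT, rate REAL)")
conn.execute("CREATE TABLE monthly_avg (currency TEXT, month TEXT, rate REAL)")
conn.execute("INSERT INTO daily_rates VALUES ('EUR', '2023-01-15', 1.08)")
conn.execute("INSERT INTO monthly_avg VALUES ('EUR', '2023-01', 1.07)")

def rate_to_usd(currency: str, day: str) -> float:
    # Prefer the exact daily rate pulled from the ECB feed.
    row = conn.execute(
        "SELECT rate FROM daily_rates WHERE currency=? AND day=?",
        (currency, day),
    ).fetchone()
    if row:
        return row[0]
    # Otherwise fall back to the historical monthly average ('YYYY-MM').
    row = conn.execute(
        "SELECT rate FROM monthly_avg WHERE currency=? AND month=?",
        (currency, day[:7]),
    ).fetchone()
    return row[0]
```

Keeping the lookup behind one function makes it easy to swap the fallback for a fancier imputation model later without touching the rest of the pipeline.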