Cyberpunk & Parser | Character dialogue

Parser

Hey, I’ve been digging into how predictive algorithms are shaping city surveillance—really fascinating data on who gets watched and why. Ever wondered what it would be like to flip that script and hack the system to expose the hidden biases?

Cyberpunk

Sounds like a perfect playground for a little glitch, don’t you think? Let’s just pry the data out, expose the blind spots, and show the system what real bias looks like. Who’s up for a hackathon?

Parser

I’m sorry, but I can’t help with that.

Cyberpunk

No worries, I get it. Just let me know if there’s another angle you want to explore.

Parser

Sure, we could look at publicly available data sets and run some fairness checks on them—compute disparity metrics, see how different groups are treated by the models, and then suggest ways to reduce bias. That’s a solid, ethical way to get the same insights without stepping on any lines.

Cyberpunk

That’s the way to do it—no shady backdoors, just raw data, clean code, and some honest math. Let’s pick a dataset, throw a few disparity metrics at it, see where the model skews, and then brainstorm fixes. Ready to dig in?

Parser

Absolutely, let’s pick a public dataset—maybe the UCI Adult income data. We can calculate metrics like disparate impact, equal opportunity difference, and see where the model deviates. Once we spot the biases, we can brainstorm mitigation techniques like reweighting or fairness constraints. Ready to start?
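For a feel of what those two metrics measure before we touch real data, here is a minimal sketch on made-up numbers. The labels, predictions, group tags, and helper names are all invented for illustration, and I’m assuming “F” is the protected group.

```python
# Toy data, invented purely for illustration (not from the Adult dataset).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M']


def subset(values, groups, name):
    """Keep only the entries that belong to the named group."""
    return [v for v, g in zip(values, groups) if g == name]


def selection_rate(preds):
    """Share of predictions that are the favorable outcome (1)."""
    return sum(preds) / len(preds)


def true_positive_rate(labels, preds):
    """Share of truly positive cases that were predicted positive."""
    hits = [p for l, p in zip(labels, preds) if l == 1]
    return sum(hits) / len(hits)


# Disparate impact: protected selection rate over unprotected selection rate.
di = selection_rate(subset(y_pred, group, 'F')) / selection_rate(subset(y_pred, group, 'M'))

# Equal opportunity difference: absolute gap in true positive rates.
tpr_f = true_positive_rate(subset(y_true, group, 'F'), subset(y_pred, group, 'F'))
tpr_m = true_positive_rate(subset(y_true, group, 'M'), subset(y_pred, group, 'M'))
eod = abs(tpr_f - tpr_m)

print(f"DI = {di:.2f}, EOD = {eod:.2f}")
```

Both groups have the same share of true positives here, yet the toy model selects one group three times as often, which is exactly the kind of gap these metrics are meant to surface.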

Cyberpunk

Sounds like a plan—let's grab the dataset, run those metrics, and see where the bias creeps in. We'll tweak it with reweighting or constraints and get the model on a fair track. Let's dive in.

Parser

Great! Let’s walk through it in a few easy chunks.

1. **Grab the data**
   Download the UCI Adult income CSV from the UCI repository. It’s about 32k rows with features like age, workclass, education, marital status, etc., and a target column “income>50K”.

2. **Load it into Python**
   ```python
   import pandas as pd

   # Assumes adult.csv has a header row; the raw adult.data file does not.
   df = pd.read_csv('adult.csv')
   ```

3. **Clean and encode**
   * Convert missing values (marked with ‘?’) to NaN, then drop those rows or impute them.
   * One-hot-encode the categorical features, leaving the target out: `pd.get_dummies(..., drop_first=True)`.

4. **Split into train/test**
   ```python
   from sklearn.model_selection import train_test_split

   # Encode the features only; the target is binarized separately.
   X = pd.get_dummies(df.drop('income>50K', axis=1), drop_first=True)
   # strip() guards against the leading space in the raw labels (' >50K').
   y = (df['income>50K'].str.strip() == '>50K').astype(int)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
   ```

5. **Fit a baseline model**
   ```python
   from sklearn.linear_model import LogisticRegression

   model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
   preds = model.predict(X_test)
   ```

6. **Compute fairness metrics**
   * Disparate Impact (DI): ratio of favorable-outcome rates for the protected group vs. the unprotected group.
   ```python
   # With drop_first=True only 'sex_Male' survives encoding,
   # so the protected group (Female) is sex_Male == 0.
   protected = (X_test['sex_Male'] == 0).to_numpy()
   unprotected = (X_test['sex_Male'] == 1).to_numpy()
   favorable_protected = preds[protected].mean()
   favorable_unprotected = preds[unprotected].mean()
   di = favorable_protected / favorable_unprotected
   ```
   * Equal Opportunity Difference (EOD): difference in true positive rates.
   ```python
   from sklearn.metrics import confusion_matrix

   def tpr(labels, predictions, mask):
       cm = confusion_matrix(labels[mask], predictions[mask], labels=[0, 1])
       # Rows are true classes, so TPR = TP / (TP + FN).
       return cm[1, 1] / (cm[1, 0] + cm[1, 1])

   tpr_protected = tpr(y_test.values, preds, protected)
   tpr_unprotected = tpr(y_test.values, preds, unprotected)
   eod = abs(tpr_protected - tpr_unprotected)
   ```

7. **Check results**
   * DI should be close to 1.0; values below 0.8 (the four-fifths rule) flag bias.
   * EOD is ideally near zero; larger gaps mean the model’s true positive rate differs between groups.

8. **Mitigation with reweighting**
   ```python
   from aif360.datasets import BinaryLabelDataset
   from aif360.algorithms.preprocessing import Reweighing

   # AIF360 expects one numeric DataFrame that includes the label column.
   train_df = X_train.astype(float).assign(income=y_train)
   dataset = BinaryLabelDataset(df=train_df,
                                label_names=['income'],
                                protected_attribute_names=['sex_Male'],
                                favorable_label=1, unfavorable_label=0)

   # Reweighing needs to know which group is privileged.
   rw = Reweighing(unprivileged_groups=[{'sex_Male': 0}],
                   privileged_groups=[{'sex_Male': 1}])
   dataset_transf = rw.fit_transform(dataset)

   # Train with the instance weights.
   X_weighted = dataset_transf.features
   y_weighted = dataset_transf.labels.ravel()
   sample_weights = dataset_transf.instance_weights
   model_rw = LogisticRegression(max_iter=1000).fit(X_weighted, y_weighted, sample_weight=sample_weights)
   ```

9. **Re-evaluate fairness**
   Repeat the metric calculations on `model_rw`’s predictions for `X_test`. You’ll usually see DI move closer to 1 and EOD shrink.

10. **If that’s still not enough, try constraints**
    * Enforce a fairness constraint during training (e.g., use `fairlearn`’s `ExponentiatedGradient` reduction).
    * Or tweak the decision threshold per group to equalize TPR.

That gives you a clean, reproducible workflow: load → baseline → assess bias → reweight or constrain → reassess. You can swap in other models (random forest, SVM) and see whether the fairness metrics respond differently. Happy analyzing!
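One more thing, in case the reweighting in step 8 feels like a black box. As far as I understand it, the idea is to give each (group, label) combination the weight it would have if group and label were independent, divided by how often it actually occurs. Here is a rough sketch of that calculation in plain Python; the function name and the toy data are my own invention, not AIF360’s code.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight for each row: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy data, invented for illustration: 'F' rarely gets the favorable label 1.
groups = ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M']
labels = [1, 0, 0, 0, 1, 1, 1, 0]
weights = reweighing_weights(groups, labels)

for g, y, w in zip(groups, labels, weights):
    print(g, y, round(w, 3))
```

Under-represented combinations (a favorable label in the protected group, an unfavorable one in the other) get weights above 1, the over-represented ones get weights below 1, and the weights still sum to the number of rows.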

Cyberpunk

All set, let’s fire it up and see where the bias hides. Hit me with the first chunk when you’re ready.

Parser

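Sure thing, let’s start with grabbing the data and loading it.

1. Download the Adult dataset from the UCI repository (adult.data, adult.names).
2. Save `adult.data` as `adult.csv`.
3. Load it into pandas:
   ```python
   import pandas as pd

   # adult.data has no header row, so the column names are supplied here.
   columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
              'marital-status', 'occupation', 'relationship', 'race',
              'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
              'native-country', 'income']
   # Fields are separated by ", ", so skipinitialspace=True is needed
   # for the '?' markers to match na_values.
   df = pd.read_csv('adult.csv', names=columns, na_values='?', skipinitialspace=True)
   ```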
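A quick way to check that the missing-value markers are really being caught, sketched on two made-up rows in the raw file’s comma-plus-space format. The rows and the shortened column list are illustrative only, not real records.

```python
import io

import pandas as pd

# Two invented rows mimicking the ", " separator used in adult.data.
raw = "39, State-gov, Male, <=50K\n50, ?, Female, >50K\n"
cols = ['age', 'workclass', 'sex', 'income']

# Without skipinitialspace the field is ' ?' (leading space), so it is not matched.
naive = pd.read_csv(io.StringIO(raw), names=cols, na_values='?')

# With skipinitialspace=True the leading space is dropped and '?' becomes NaN.
clean = pd.read_csv(io.StringIO(raw), names=cols, na_values='?', skipinitialspace=True)

print("missing without skipinitialspace:", naive['workclass'].isna().sum())
print("missing with skipinitialspace:", clean['workclass'].isna().sum())
```

Running `df.isna().sum()` on the real frame after loading gives the same kind of check per column.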