Cyberpunk & Parser
Hey, I’ve been digging into how predictive algorithms are shaping city surveillance—really fascinating data on who gets watched and why. Ever wondered what it would be like to flip that script and hack the system to expose the hidden biases?
Sounds like a perfect playground for a little glitch, don’t you think? Let’s just pry the data out, expose the blind spots, and show the system what real bias looks like. Who’s up for a hackathon?
I’m sorry, but I can’t help with that.
No worries, I get it. Just let me know if there’s another angle you want to explore.
Sure, we could look at publicly available data sets and run some fairness checks on them—compute disparity metrics, see how different groups are treated by the models, and then suggest ways to reduce bias. That’s a solid, ethical way to get the same insights without stepping on any lines.
That’s the way to do it—no shady backdoors, just raw data, clean code, and some honest math. Let’s pick a dataset, throw a few disparity metrics at it, see where the model skews, and then brainstorm fixes. Ready to dig in?
Absolutely, let’s pick a public dataset—maybe the UCI Adult income data. We can calculate metrics like disparate impact, equal opportunity difference, and see where the model deviates. Once we spot the biases, we can brainstorm mitigation techniques like reweighting or fairness constraints. Ready to start?
Sounds like a plan—let's grab the dataset, run those metrics, and see where the bias creeps in. We'll tweak it with reweighting or constraints and get the model on a fair track. Let's dive in.
Great! Let’s walk through it in a few easy chunks.
1. **Grab the data**
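Download the UCI Adult income data (`adult.data`) from the UCI repository and save it as `adult.csv`. It has about 32.5k rows with features like age, workclass, education, marital status, etc., and a target column recording whether income is above 50K. The raw file has no header row, so the loading step names the columns itself and calls the target `income`.
2. **Load it into Python**
```python
import pandas as pd
cols = ('age workclass fnlwgt education education-num marital-status occupation relationship '
        'race sex capital-gain capital-loss hours-per-week native-country income').split()
df = pd.read_csv('adult.csv', names=cols, na_values='?', skipinitialspace=True)
```
3. **Clean and encode**
* Missing values are marked with ‘?’; with `na_values='?'` they load as NaN, so drop those rows (`df.dropna()`) or impute.
* One‑hot‑encode the categorical features but not the target: `pd.get_dummies(df.drop('income', axis=1), drop_first=True)`. Because `drop_first=True` keeps only `sex_Male`, women are the rows where `sex_Male == 0`.
4. **Split into train/test**
```python
from sklearn.model_selection import train_test_split
df = df.dropna()  # or impute instead
# Encode the features only; the target is binarized separately
X = pd.get_dummies(df.drop('income', axis=1), drop_first=True, dtype=int)
y = (df['income'].str.strip() == '>50K').astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```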
5. **Fit a baseline model**
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)
```
6. **Compute fairness metrics**
* Disparate Impact (DI): ratio of favorable outcomes for protected group vs. unprotected.
```python
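import numpy as np
# drop_first=True in the encoding step leaves a single sex_Male indicator,
# so the protected group (women) is sex_Male == 0
protected = (X_test['sex_Male'] == 0).to_numpy()
unprotected = (X_test['sex_Male'] == 1).to_numpy()
# Share of each group that receives the favorable prediction (1)
favorable_protected = preds[protected].mean()
favorable_unprotected = preds[unprotected].mean()
di = favorable_protected / favorable_unprotected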
```
* Equal Opportunity Difference (EOD): difference in true positive rates.
```python
from sklearn.metrics import confusion_matrix
import numpy as np

def tpr(labels, predictions, protected_mask):
    mask = np.asarray(protected_mask)
    # Rows are true labels, columns are predictions; labels=[0, 1] fixes the 2x2 shape
    cm = confusion_matrix(labels[mask], predictions[mask], labels=[0, 1])
    # TPR = TP / (TP + FN)
    return cm[1, 1] / (cm[1, 0] + cm[1, 1])

tpr_protected = tpr(y_test.values, preds, protected)
tpr_unprotected = tpr(y_test.values, preds, unprotected)
eod = abs(tpr_protected - tpr_unprotected)
```
7. **Check results**
* DI should be close to 1.0; values below 0.8 fall short of the common four‑fifths rule and flag bias.
* EOD ideally near zero; larger gaps mean the model catches true positives at different rates across groups.
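To make those two thresholds concrete, here is a small sketch on made-up arrays. The function names are mine, not from any library, and the numbers are purely illustrative.

```python
import numpy as np

def disparate_impact(preds, protected):
    """Favorable-outcome rate of the protected group divided by everyone else's."""
    preds = np.asarray(preds)
    protected = np.asarray(protected, dtype=bool)
    return preds[protected].mean() / preds[~protected].mean()

def true_positive_rate(labels, preds):
    """Share of actual positives that were predicted positive."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    return preds[labels == 1].mean()

def equal_opportunity_difference(labels, preds, protected):
    """Absolute gap in true positive rate between the two groups."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    protected = np.asarray(protected, dtype=bool)
    return abs(true_positive_rate(labels[protected], preds[protected])
               - true_positive_rate(labels[~protected], preds[~protected]))

# Toy data: the first four people are in the protected group
labels    = np.array([1, 1, 0, 0, 1, 1, 1, 0])
preds     = np.array([1, 0, 0, 0, 1, 1, 1, 0])
protected = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

di = disparate_impact(preds, protected)
eod = equal_opportunity_difference(labels, preds, protected)
print(f"DI = {di:.2f}, EOD = {eod:.2f}")  # DI = 0.33, EOD = 0.50
```

Here the protected group gets the favorable outcome a third as often as the rest, well under 0.8, and the model finds only half of its true positives while finding all of the other group's.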
8. **Mitigation with reweighting**
```python
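from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing.reweighing import Reweighing
# Build AIF360 dataset: it expects one all-numeric frame holding features and label
train_df = X_train.assign(income=y_train)
dataset = BinaryLabelDataset(df=train_df, label_names=['income'],
                             protected_attribute_names=['sex_Male'],
                             favorable_label=1, unfavorable_label=0)
# With drop_first encoding, sex_Male == 0 is the unprivileged group
rw = Reweighing(unprivileged_groups=[{'sex_Male': 0}],
                privileged_groups=[{'sex_Male': 1}])
dataset_transf = rw.fit_transform(dataset)
# Train with weights
X_weighted = dataset_transf.features
y_weighted = dataset_transf.labels.ravel()  # labels come back as a column vector
sample_weights = dataset_transf.instance_weights
model_rw = LogisticRegression(max_iter=1000).fit(X_weighted, y_weighted,
                                                 sample_weight=sample_weights)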
```
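For intuition about what reweighing does, the weights can also be worked out by hand. As far as I know the scheme gives every (group, label) combination the weight P(group) · P(label) / P(group, label), so combinations that are rarer than independence would predict get boosted. A minimal pure-Python sketch on made-up data, where `reweighing_weights` is a hypothetical helper and not part of AIF360:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-row weight = P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
            for g, y in zip(groups, labels)]

# Toy data: group 0 has one positive out of four, group 1 has three
groups = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
weights = reweighing_weights(groups, labels)

def weighted_positive_rate(group):
    """Weighted share of favorable labels inside one group."""
    rows = [(w, y) for w, g, y in zip(weights, groups, labels) if g == group]
    return sum(w for w, y in rows if y == 1) / sum(w for w, _ in rows)

print(weights[0], weights[1])  # the rare (group 0, label 1) row weighs more
print(weighted_positive_rate(0), weighted_positive_rate(1))
```

After weighting, both groups have the same weighted rate of favorable labels, which is the property the classifier then trains against.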
9. **Re‑evaluate fairness**
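Repeat the metric calculations with predictions from `model_rw` on the test set (e.g. `preds_rw = model_rw.predict(X_test.values)`, since the model was fit on a plain array). You’ll usually see DI move closer to 1 and EOD shrink.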
10. **If that’s still not enough, try constraints**
* Enforce a fairness constraint during training (e.g., use `fairlearn`’s `ExponentiatedGradient` with a constraint such as `DemographicParity` or `EqualizedOdds`).
* Or tweak the decision threshold per group to equalize TPR.
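The per-group threshold idea can be sketched with plain NumPy. This is an illustration on made-up scores, with hypothetical helper names; in practice you would feed in `predict_proba` outputs, and `fairlearn` also ships a `ThresholdOptimizer` for this kind of post-processing.

```python
import numpy as np

def tpr_at(scores, labels, threshold):
    """True positive rate when predicting 1 for score >= threshold."""
    return float((scores[labels == 1] >= threshold).mean())

def threshold_for_tpr(scores, labels, target_tpr):
    """Highest threshold whose TPR is at least target_tpr."""
    pos_scores = np.sort(scores[labels == 1])[::-1]          # positives, best first
    k = max(int(np.ceil(target_tpr * len(pos_scores))), 1)   # positives to accept
    return float(pos_scores[k - 1])

# Made-up model scores for two groups (0 and 1)
scores = np.array([0.9, 0.6, 0.4, 0.2, 0.8, 0.7, 0.55, 0.3])
labels = np.array([1,   1,   1,   0,   1,   1,   0,    0])
group  = np.array([0,   0,   0,   0,   1,   1,   1,    1])

baseline, thresholds, tprs = {}, {}, {}
for g in (0, 1):
    in_group = group == g
    # TPR with the usual shared cutoff of 0.5
    baseline[g] = tpr_at(scores[in_group], labels[in_group], 0.5)
    # Group-specific cutoff chosen to hit the same target TPR
    thresholds[g] = threshold_for_tpr(scores[in_group], labels[in_group], target_tpr=1.0)
    tprs[g] = tpr_at(scores[in_group], labels[in_group], thresholds[g])

print(baseline)    # TPR differs across groups at a shared cutoff
print(thresholds)  # {0: 0.4, 1: 0.7}
print(tprs)        # {0: 1.0, 1: 1.0}
```

Equalizing TPR this way moves each group's cutoff, so it is worth checking the false positive rate afterwards as well.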
That gives you a clean, reproducible workflow: load → baseline → assess bias → reweight or constrain → reassess. You can swap in other models (random forest, SVM) and see if fairness metrics improve differently. Happy analyzing!
All set, let’s fire it up and see where the bias hides. Hit me with the first chunk when you’re ready.
Sure thing, let’s start with grabbing the data and loading it.
1. Download the Adult dataset from the UCI repository: `adult.data` holds the rows, `adult.names` documents the columns.
2. Save `adult.data` as `adult.csv`.
3. Load it into pandas:
```python
import pandas as pd
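# The raw file has no header row, so the column names are supplied here
df = pd.read_csv('adult.csv',
                 names=['age','workclass','fnlwgt','education','education-num',
                        'marital-status','occupation','relationship','race',
                        'sex','capital-gain','capital-loss','hours-per-week',
                        'native-country','income'],
                 na_values='?',
                 skipinitialspace=True)  # fields are separated by ", " so strip the space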
```
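One detail worth checking right after loading: the raw file separates fields with a comma followed by a space, so whether ‘?’ is recognized as missing depends on how that whitespace is handled. Here is a small sketch on two made-up rows in the same layout (the rows are illustrative, not quoted from the dataset):

```python
import io
import pandas as pd

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race',
           'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
           'native-country', 'income']

# Two illustrative rows; the second has a missing workclass and occupation
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# Without whitespace handling the marker is read as the string ' ?'
naive = pd.read_csv(io.StringIO(raw), names=columns, na_values='?')
# Stripping the space after each comma lets '?' match na_values
clean = pd.read_csv(io.StringIO(raw), names=columns, na_values='?',
                    skipinitialspace=True)

print(repr(naive.loc[1, 'workclass']))   # ' ?'
print(clean.isna().sum().sum())          # 2
print(repr(clean.loc[0, 'income']))      # '<=50K'
```

On the real file, `df.isna().sum()` gives a quick per-column count of missing values before you decide whether to drop or impute.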