Part 8 — Specialisation · Track D of 4
Track D — Data Scientist / Analyst
Statistical analysis, ML pipelines, and AI-augmented data science
⏱ 2–3 Weeks
🟡 Intermediate–Advanced
🔧 pandas · DuckDB · scikit-learn · SHAP
🎯
Track Overview
Specialisation D
Use AI to dramatically accelerate data science work and build predictive systems on top of LLM infrastructure. This track is for engineers working with structured data who want to augment traditional ML pipelines with LLM capabilities.
Skills You Will Build
- AI-augmented EDA: automated hypothesis generation and insight narration
- DuckDB for blazing-fast analytical SQL on large CSV/Parquet files
- Natural language to SQL: let users query data in plain English
- scikit-learn pipelines with LLM-suggested feature engineering
- SHAP for ML model interpretability reports
- Automated report generation: data in, executive summary out
📊
AI-Augmented EDA
Analysis

```python
import pandas as pd
import anthropic
import json

client = anthropic.Anthropic()

def ai_eda_assistant(df: pd.DataFrame, domain: str = "general") -> dict:
    """LLM-powered EDA: generates hypotheses and an analysis plan."""
    summary = {
        "shape": list(df.shape),
        "columns": list(df.columns),
        "dtypes": {k: str(v) for k, v in df.dtypes.to_dict().items()},
        "null_counts": {k: int(v) for k, v in df.isnull().sum().to_dict().items()},
        "numeric_summary": df.describe().round(3).to_dict(),
        "sample_3_rows": df.head(3).to_dict(orient="records"),
    }
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        messages=[{"role": "user", "content": f"""You are a senior data scientist specialising in {domain}.

Dataset summary:
{json.dumps(summary, indent=2, default=str)}

Respond as JSON with these keys:
- data_quality_issues: list of 3 specific issues (missing data, outliers, skew, etc.)
- analysis_hypotheses: list of 5 testable hypotheses specific to this data
- feature_engineering_ideas: list of 3 derived features worth creating
- viz_code: Python matplotlib/seaborn code for the single most informative plot
- business_questions: list of 3 questions a stakeholder would want answered"""}]
    )
    try:
        result = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Strip markdown fences if present
        text = response.content[0].text.replace("```json", "").replace("```", "").strip()
        result = json.loads(text)
    return result

# Example usage
df = pd.read_csv("network_metrics.csv")
insights = ai_eda_assistant(df, domain="network performance monitoring")
print(insights["analysis_hypotheses"])
```
🗃
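LLM JSON output is not guaranteed to match the requested schema, so it is worth checking the response before trusting it downstream. A minimal sketch: `validate_eda_response` is a hypothetical helper whose key list mirrors the prompt above.

```python
# Hypothetical helper: check that the JSON returned by ai_eda_assistant()
# actually has the keys and types the prompt asked for.
EXPECTED_KEYS = {
    "data_quality_issues": list,
    "analysis_hypotheses": list,
    "feature_engineering_ideas": list,
    "viz_code": str,
    "business_questions": list,
}

def validate_eda_response(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the response is usable."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems
```

If the list is non-empty, retry the call (or fall back to a stricter prompt) rather than crashing mid-analysis.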
Natural Language to SQL with DuckDB
Query Interface

```shell
pip install duckdb
```

```python
import duckdb
import pandas as pd
import anthropic

client = anthropic.Anthropic()

def nl_to_sql(question: str, table_name: str, schema: str) -> tuple[str, pd.DataFrame]:
    """Convert a natural language question to DuckDB SQL and execute it."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        system="Convert questions to DuckDB SQL. Output only the SQL query, no explanation.",
        messages=[{"role": "user", "content":
            f"Table: {table_name}\nSchema: {schema}\nQuestion: {question}"}]
    )
    # Remove markdown fences if the model added them
    sql = response.content[0].text.strip()
    sql = sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()
    print(f"Generated SQL: {sql}")
    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM read_csv_auto('{table_name}.csv')")
    result = conn.execute(sql).df()
    conn.close()
    return sql, result

# FastAPI endpoint: natural language data querying
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    table: str

@app.post("/query")
async def query_data(request: QueryRequest) -> dict:
    # Get the schema from DuckDB without loading any rows
    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE t AS SELECT * FROM read_csv_auto('{request.table}.csv') LIMIT 0")
    schema = conn.execute("DESCRIBE t").df().to_dict(orient="records")
    conn.close()
    schema_str = ", ".join(f"{r['column_name']} {r['column_type']}" for r in schema)
    sql, result = nl_to_sql(request.question, request.table, schema_str)
    return {
        "question": request.question,
        "sql": sql,
        "results": result.to_dict(orient="records"),
        "row_count": len(result),
    }

# Usage examples:
# "What is the average packet latency by port type in the last 7 days?"
# "Which top 5 sources have the highest error rate this month?"
# "Show me all anomalies where throughput dropped more than 20% vs previous hour"
```
🤖
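Executing model-generated SQL verbatim is the main risk in this design: nothing stops the model from emitting a DROP or DELETE. One hedge is a simple allow-list check before execution; `assert_read_only` below is a hypothetical helper, not a substitute for connecting to DuckDB in read-only mode.

```python
# Hypothetical guard: reject any generated statement that is not a single
# read-only SELECT (or WITH ... SELECT) before handing it to DuckDB.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "create",
             "attach", "copy", "pragma"}

def assert_read_only(sql: str) -> str:
    """Raise ValueError unless sql looks like one read-only SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    first_word = stripped.split(None, 1)[0].lower() if stripped else ""
    if first_word not in ("select", "with"):
        raise ValueError(f"only SELECT queries are allowed, got: {first_word!r}")
    tokens = {t.lower() for t in stripped.replace("(", " ").replace(")", " ").split()}
    if tokens & FORBIDDEN:
        raise ValueError("query contains a forbidden keyword")
    return stripped
```

Call it as `conn.execute(assert_read_only(sql))` inside `nl_to_sql`; for file-backed databases, `duckdb.connect(path, read_only=True)` adds a second layer of protection.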
scikit-learn Pipelines with LLM Features
ML

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import anthropic

client = anthropic.Anthropic()

def suggest_features(df: pd.DataFrame, target: str) -> list[str]:
    """LLM suggests feature engineering transformations."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Dataset columns: {list(df.columns)}
Target: {target}
Sample row: {df.head(1).to_dict(orient='records')[0]}
Suggest 5 pandas feature transformations as single-line expressions.
Format: new_col = expression
Only output the 5 lines, nothing else."""}]
    )
    lines = [l.strip() for l in response.content[0].text.strip().split("\n") if "=" in l]
    return lines[:5]

def apply_features(df: pd.DataFrame, feature_code: list[str]) -> pd.DataFrame:
    df = df.copy()
    for line in feature_code:
        try:
            col_name = line.split("=")[0].strip()
            expr = "=".join(line.split("=")[1:]).strip()
            df[col_name] = df.eval(expr)
        except Exception as e:
            print(f"Skipping feature '{line}': {e}")
    return df

def build_and_evaluate(df: pd.DataFrame, target: str) -> dict:
    # Exclude the target from both feature lists
    numeric_cols = df.select_dtypes(include="number").columns.drop(target, errors="ignore").tolist()
    categorical_cols = df.select_dtypes(include="object").columns.drop(target, errors="ignore").tolist()
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ])
    pipeline = Pipeline([
        ("pre", preprocessor),
        ("clf", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ])
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)
    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    return {
        "cv_accuracy_mean": round(cv_scores.mean(), 4),
        "cv_accuracy_std": round(cv_scores.std(), 4),
        "test_accuracy": round(pipeline.score(X_test, y_test), 4),
    }
```
🔍
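`apply_features` leans entirely on `DataFrame.eval`, which accepts column arithmetic but not arbitrary Python, so malformed suggestions fail safely rather than executing untrusted code. A quick standalone check with toy data; the suggestion strings are hand-written stand-ins for `suggest_features` output.

```python
import pandas as pd

df = pd.DataFrame({"bytes_in": [100, 250], "bytes_out": [50, 50], "latency_ms": [10, 40]})

# Hand-written stand-ins for suggest_features() output
suggestions = [
    "total_bytes = bytes_in + bytes_out",   # plain column arithmetic: works
    "io_ratio = bytes_in / bytes_out",      # works
    "bad_col = nonexistent * 2",            # unknown column: raises, gets skipped
]
for line in suggestions:
    col, expr = (part.strip() for part in line.split("=", 1))
    try:
        df[col] = df.eval(expr)
    except Exception as e:
        print(f"Skipping '{line}': {e}")

print(df.columns.tolist())
# bad_col is absent; total_bytes and io_ratio were added
```

Running `build_and_evaluate` on the raw frame and again after `apply_features` makes the CV-score impact of the suggested features directly comparable.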
SHAP + LLM Narrative for Stakeholders
Interpretability

```shell
pip install shap
```

```python
import shap
import numpy as np
import anthropic

client = anthropic.Anthropic()

def explain_and_narrate(pipeline, X_test, y_test, feature_names: list) -> dict:
    """Generate SHAP explanation + plain-English narrative for non-technical stakeholders.

    feature_names must be the post-preprocessing names, e.g.
    pipeline.named_steps["pre"].get_feature_names_out().
    """
    # Compute SHAP values on the classifier (after preprocessing)
    clf = pipeline.named_steps["clf"]
    X_transformed = pipeline.named_steps["pre"].transform(X_test)
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_transformed)
    # For binary classification, shap_values may be a list
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    # Top features by mean absolute SHAP value
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    top_n = min(5, len(feature_names))
    top_indices = mean_abs_shap.argsort()[-top_n:][::-1]
    top_features = [(feature_names[i], round(float(mean_abs_shap[i]), 4)) for i in top_indices]
    accuracy = pipeline.score(X_test, y_test)
    # LLM-generated executive narrative
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Write a 2-paragraph executive summary of this predictive model for a non-technical audience.
Model accuracy: {accuracy:.1%}
Top 5 most predictive features (name, importance score):
{chr(10).join(f'  {f}: {v}' for f, v in top_features)}
Paragraph 1: What the model does and how confident we should be.
Paragraph 2: What the most important factors are (explain in plain language, avoid ML jargon).
Do not use terms like SHAP, feature importance, or model architecture."""}]
    )
    return {
        "accuracy": round(accuracy, 4),
        "top_features": top_features,
        "narrative": response.content[0].text,
    }
```
🛠
Capstone: AI-Augmented Analysis Pipeline
2–3 weeks
Build an end-to-end AI-augmented data analysis pipeline for a real dataset from your professional domain.
Requirements
- Automated EDA: run ai_eda_assistant() and apply its suggested feature engineering
- Natural language query interface: POST /query endpoint that executes NL-to-SQL
- Predictive model: scikit-learn pipeline with LLM-suggested features, cross-validated
- SHAP explanation with LLM narrative — readable by a non-technical manager
- Automated report: function that accepts new data and produces a formatted HTML/PDF report
- Report includes: key metrics, top insights, model prediction with confidence, recommended actions
Use a dataset relevant to your work. The output should be a report you could actually send to your manager.
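For the automated-report requirement, the assembly step can be as simple as stdlib templating once the earlier pieces (EDA insights, model metrics, LLM narrative) are computed. A minimal sketch with a hypothetical `build_report` helper; swap in jinja2 for richer HTML or weasyprint for PDF output.

```python
from string import Template
from datetime import date

REPORT_TEMPLATE = Template("""\
<html><body>
<h1>Analysis Report &mdash; $report_date</h1>
<h2>Key Metrics</h2><ul>$metrics</ul>
<h2>Top Insights</h2><ul>$insights</ul>
<h2>Model</h2><p>Cross-validated accuracy: $accuracy</p>
<h2>Narrative</h2><p>$narrative</p>
</body></html>""")

def build_report(metrics: dict, insights: list[str], accuracy: float, narrative: str) -> str:
    """Assemble an HTML report from already-computed pipeline outputs."""
    return REPORT_TEMPLATE.substitute(
        report_date=date.today().isoformat(),
        metrics="".join(f"<li>{k}: {v}</li>" for k, v in metrics.items()),
        insights="".join(f"<li>{i}</li>" for i in insights),
        accuracy=f"{accuracy:.1%}",
        narrative=narrative,
    )
```

Because every input is regenerated by the earlier functions, rerunning the pipeline on fresh data yields an up-to-date report with no manual editing.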
MASTERY CHECKLIST
- Can use LLMs to generate EDA hypotheses and feature ideas from a dataset summary
- Can use DuckDB to query CSV/Parquet files with analytical SQL
- Can implement NL-to-SQL: user asks question in English, gets back a DataFrame
- Can build a scikit-learn ColumnTransformer + Pipeline for mixed data types
- Can apply LLM-suggested feature engineering and evaluate impact on CV score
- Can compute SHAP values and identify top features by mean absolute importance
- Can use LLMs to write a plain-English model explanation for non-technical stakeholders
- Capstone: full analysis pipeline producing a regenerable report on domain data
When complete: move to Part 9 — Portfolio and Launch.