Part 8 — Specialisation  ·  Track D of 4
Track D — Data Scientist / Analyst
Statistical analysis, ML pipelines, and AI-augmented data science
⏱ 2–3 Weeks 🟡 Intermediate–Advanced 🔧 pandas · DuckDB · scikit-learn · SHAP
🎯 Track Overview

Specialisation D

Use AI to dramatically accelerate data science work and build predictive systems on top of LLM infrastructure. This track is for engineers working with structured data who want to augment traditional ML pipelines with LLM capabilities.

Skills You Will Build

  • AI-augmented EDA: automated hypothesis generation and insight narration
  • DuckDB for blazing-fast analytical SQL on large CSV/Parquet files
  • Natural language to SQL: let users query data in plain English
  • scikit-learn pipelines with LLM-suggested feature engineering
  • SHAP for ML model interpretability reports
  • Automated report generation: data in, executive summary out
📊 AI-Augmented EDA

Analysis
import pandas as pd
import anthropic
import json

client = anthropic.Anthropic()

def ai_eda_assistant(df: pd.DataFrame, domain: str = "general") -> dict:
    """LLM-powered EDA: generates hypotheses and analysis plan."""
    summary = {
        "shape": list(df.shape),
        "columns": list(df.columns),
        "dtypes": {k: str(v) for k, v in df.dtypes.to_dict().items()},
        "null_counts": {k: int(v) for k, v in df.isnull().sum().to_dict().items()},
        "numeric_summary": df.describe().round(3).to_dict(),
        "sample_3_rows": df.head(3).to_dict(orient="records"),
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        messages=[{"role": "user", "content": f"""You are a senior data scientist specialising in {domain}.

Dataset summary:
{json.dumps(summary, indent=2, default=str)}

Respond as JSON with these keys:
- data_quality_issues: list of 3 specific issues (missing data, outliers, skew, etc.)
- analysis_hypotheses: list of 5 testable hypotheses specific to this data
- feature_engineering_ideas: list of 3 derived features worth creating
- viz_code: Python matplotlib/seaborn code for the single most informative plot
- business_questions: list of 3 questions a stakeholder would want answered"""}]
    )

    try:
        result = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Strip markdown fences if present
        text = response.content[0].text.replace("```json", "").replace("```", "").strip()
        result = json.loads(text)

    return result

# Example usage
df = pd.read_csv("network_metrics.csv")
insights = ai_eda_assistant(df, domain="network performance monitoring")
print(insights["analysis_hypotheses"])
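The fence-stripping fallback is brittle when the model wraps the JSON in extra prose. A more defensive extraction sketch (the helper name is mine, not part of the SDK):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object out of an LLM reply that may be
    wrapped in markdown fences or surrounded by commentary."""
    # Prefer a fenced ```json block if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Otherwise take the outermost brace-delimited span
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start : end + 1])

reply = 'Sure! Here it is:\n```json\n{"analysis_hypotheses": ["h1"]}\n```'
print(extract_json(reply)["analysis_hypotheses"])  # ['h1']
```

For production use, consider also asking the model for a structured response (tool use / JSON mode) rather than parsing free text.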
🗃 Natural Language to SQL with DuckDB

Query Interface
pip install duckdb

import duckdb
import pandas as pd
import anthropic

client = anthropic.Anthropic()

def nl_to_sql(question: str, table_name: str, schema: str) -> pd.DataFrame:
    """Convert natural language question to DuckDB SQL and execute."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        system="Convert questions to DuckDB SQL. Output only the SQL query, no explanation.",
        messages=[{"role": "user", "content":
            f"Table: {table_name}\nSchema: {schema}\nQuestion: {question}"}]
    )
    sql = response.content[0].text.strip()
    # str.strip removes a set of characters, not a prefix -- use removeprefix/removesuffix for fences
    sql = sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()
    print(f"Generated SQL: {sql}")

    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM read_csv_auto('{table_name}.csv')")
    result = conn.execute(sql).df()
    conn.close()
    return result

# FastAPI endpoint: natural language data querying
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    table: str

@app.post("/query")
async def query_data(request: QueryRequest) -> dict:
    # Get schema from DuckDB
    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE t AS SELECT * FROM read_csv_auto('{request.table}.csv') LIMIT 0")
    schema = conn.execute("DESCRIBE t").df().to_dict(orient="records")
    conn.close()

    schema_str = ", ".join([f"{r['column_name']} {r['column_type']}" for r in schema])

    result = nl_to_sql(request.question, request.table, schema_str)
    return {
        "question": request.question,
        "sql": "...",  # log separately
        "results": result.to_dict(orient="records"),
        "row_count": len(result)
    }

# Usage examples:
# "What is the average packet latency by port type in the last 7 days?"
# "Which top 5 sources have the highest error rate this month?"
# "Show me all anomalies where throughput dropped more than 20% vs previous hour"
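nl_to_sql executes whatever the model returns. Before exposing an endpoint like /query, a read-only guard is worth adding; a minimal, deliberately conservative sketch (the helper name and keyword list are my own, not exhaustive):

```python
import re

# Statements that should never run against a read-only analytics endpoint
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Allow a single SELECT/WITH statement; reject anything that mutates state."""
    statements = [s.strip() for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False  # block stacked queries like "SELECT 1; DROP TABLE t"
    stmt = statements[0]
    if not re.match(r"(?i)^\s*(select|with)\b", stmt):
        return False
    return not FORBIDDEN.search(stmt)

print(is_safe_select("SELECT * FROM metrics WHERE port = 'eth0'"))  # True
print(is_safe_select("DROP TABLE metrics"))                         # False
```

This keyword filter will reject some legitimate queries (e.g. a string literal containing "copy"); pairing it with a read-only DuckDB connection is the sturdier option.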
🤖 scikit-learn Pipelines with LLM Features

ML
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import anthropic

client = anthropic.Anthropic()

def suggest_features(df: pd.DataFrame, target: str) -> list[str]:
    """LLM suggests feature engineering transformations."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Dataset columns: {list(df.columns)}
Target: {target}
Sample row: {df.head(1).to_dict(orient='records')[0]}

Suggest 5 pandas feature transformations as single-line expressions.
Format: new_col = expression
Only output the 5 lines, nothing else."""}]
    )
    lines = [line.strip() for line in response.content[0].text.strip().split("\n") if "=" in line]
    return lines[:5]

def apply_features(df: pd.DataFrame, feature_code: list[str]) -> pd.DataFrame:
    df = df.copy()
    for line in feature_code:
        try:
            col_name = line.split("=")[0].strip()
            expr = "=".join(line.split("=")[1:]).strip()
            df[col_name] = df.eval(expr)
        except Exception as e:
            print(f"Skipping feature '{line}': {e}")
    return df

def build_and_evaluate(df: pd.DataFrame, target: str) -> dict:
    numeric_cols = df.select_dtypes(include="number").columns.drop(target, errors="ignore").tolist()
    categorical_cols = df.select_dtypes(include="object").columns.drop(target, errors="ignore").tolist()

    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ])

    pipeline = Pipeline([
        ("pre", preprocessor),
        ("clf", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ])

    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)

    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    return {
        "cv_accuracy_mean": round(cv_scores.mean(), 4),
        "cv_accuracy_std":  round(cv_scores.std(), 4),
        "test_accuracy":    round(pipeline.score(X_test, y_test), 4),
    }
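The new_col = expression contract can be exercised without an API call, which makes it easy to unit-test. A self-contained, deterministic sketch of the parse-and-eval step (column names are illustrative):

```python
import pandas as pd

def apply_feature_lines(df: pd.DataFrame, lines: list[str]) -> pd.DataFrame:
    """Apply 'new_col = expression' lines via DataFrame.eval, skipping bad ones."""
    df = df.copy()
    for line in lines:
        name, _, expr = line.partition("=")
        try:
            df[name.strip()] = df.eval(expr.strip())
        except Exception as exc:
            print(f"Skipping {line!r}: {exc}")
    return df

df = pd.DataFrame({"bytes_in": [100, 200], "bytes_out": [50, 50]})
out = apply_feature_lines(df, [
    "total_bytes = bytes_in + bytes_out",
    "ratio = bytes_in / bytes_out",
    "broken = nonexistent_col * 2",   # invalid expression, skipped with a warning
])
print(list(out.columns))  # ['bytes_in', 'bytes_out', 'total_bytes', 'ratio']
```

Testing with hard-coded feature strings like this, before wiring in the LLM, separates "does my eval plumbing work" from "did the model suggest sensible features".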
🔍 SHAP + LLM Narrative for Stakeholders

Interpretability
pip install shap

import shap
import numpy as np
import anthropic

client = anthropic.Anthropic()

def explain_and_narrate(pipeline, X_test, y_test, feature_names: list) -> dict:
    """Generate SHAP explanation + plain-English narrative for non-technical stakeholders."""

    # Compute SHAP values on the classifier (after preprocessing)
    clf = pipeline.named_steps["clf"]
    X_transformed = pipeline.named_steps["pre"].transform(X_test)

    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_transformed)

    # For binary classification, shap_values may be a list
    if isinstance(shap_values, list):
        shap_values = shap_values[1]

    # Top features by mean absolute SHAP value
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    top_n = min(5, len(feature_names))
    top_indices = mean_abs_shap.argsort()[-top_n:][::-1]
    top_features = [(feature_names[i], round(float(mean_abs_shap[i]), 4)) for i in top_indices]

    accuracy = pipeline.score(X_test, y_test)

    # LLM-generated executive narrative
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Write a 2-paragraph executive summary of this predictive model for a non-technical audience.

Model accuracy: {accuracy:.1%}
Top 5 most predictive features (name, importance score):
{chr(10).join(f'  {f}: {v}' for f, v in top_features)}

Paragraph 1: What the model does and how confident we should be.
Paragraph 2: What the most important factors are (explain in plain language, avoid ML jargon).
Do not use terms like SHAP, feature importance, or model architecture."""}]
    )

    return {
        "accuracy":     round(accuracy, 4),
        "top_features": top_features,
        "narrative":    response.content[0].text,
    }
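The top-feature ranking step is easy to isolate and test with plain NumPy; a sketch with a synthetic attribution matrix (the helper name and values are mine):

```python
import numpy as np

def top_shap_features(shap_values: np.ndarray, names: list[str], k: int = 5):
    """Rank features by mean |SHAP| across all rows, highest first."""
    mean_abs = np.abs(shap_values).mean(axis=0)
    order = mean_abs.argsort()[::-1][: min(k, len(names))]
    return [(names[i], round(float(mean_abs[i]), 4)) for i in order]

# Synthetic (n_samples, n_features) attribution matrix
sv = np.array([[0.1, -0.9,  0.0],
               [0.2,  0.8, -0.1]])
print(top_shap_features(sv, ["latency", "errors", "uptime"], k=2))
# [('errors', 0.85), ('latency', 0.15)]
```

Feeding this helper's output into the narrative prompt keeps the LLM call and the numeric logic independently verifiable.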
🛠 Capstone: AI-Augmented Analysis Pipeline 2–3 weeks

Build an end-to-end AI-augmented data analysis pipeline for a real dataset from your professional domain.

Requirements

  • Automated EDA: run ai_eda_assistant() and apply its suggested feature engineering
  • Natural language query interface: POST /query endpoint that executes NL-to-SQL
  • Predictive model: scikit-learn pipeline with LLM-suggested features, cross-validated
  • SHAP explanation with LLM narrative — readable by a non-technical manager
  • Automated report: function that accepts new data and produces a formatted HTML/PDF report
  • Report includes: key metrics, top insights, model prediction with confidence, recommended actions

Use a dataset relevant to your work. The output should be a report you could actually send to your manager.
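For the automated-report requirement, the rendering layer needs nothing heavier than the standard library; a minimal HTML skeleton sketch (template fields and values are illustrative):

```python
from string import Template

REPORT_TEMPLATE = Template("""\
<html><body>
<h1>$title</h1>
<h2>Key metrics</h2>
<p>Model accuracy: $accuracy</p>
<h2>Top insights</h2>
<ul>$insight_items</ul>
<h2>Recommended actions</h2>
<p>$actions</p>
</body></html>""")

def render_report(title: str, accuracy: float, insights: list[str], actions: str) -> str:
    items = "".join(f"<li>{i}</li>" for i in insights)
    return REPORT_TEMPLATE.substitute(
        title=title,
        accuracy=f"{accuracy:.1%}",
        insight_items=items,
        actions=actions,
    )

html = render_report("Weekly Network Report", 0.912,
                     ["Latency spikes correlate with port eth1"],
                     "Investigate the eth1 uplink")
```

A library like Jinja2, or an HTML-to-PDF tool such as WeasyPrint, is the natural upgrade once the report outgrows a single template string.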

MASTERY CHECKLIST

When complete: move to Part 9 — Portfolio and Launch.

← Track C: Automation All Modules Next: Part 9 — Capstone →