Part 8 — Specialisation · Track D of 4
Track D — Data Scientist / Analyst
Statistical analysis, ML pipelines, and AI-augmented data science
⏱ 2–3 Weeks
🟡 Intermediate–Advanced
🔧 pandas · DuckDB · scikit-learn · SHAP
🎯
Track Overview
Specialisation D
Use AI to dramatically accelerate data science work and build predictive systems on top of LLM infrastructure. This track is for engineers working with structured data who want to augment traditional ML pipelines with LLM capabilities.
Skills You Will Build
- AI-augmented EDA: automated hypothesis generation and insight narration
- DuckDB for blazing-fast analytical SQL on large CSV/Parquet files
- Natural language to SQL: let users query data in plain English
- scikit-learn pipelines with LLM-suggested feature engineering
- SHAP for ML model interpretability reports
- Automated report generation: data in, executive summary out
📊
AI-Augmented EDA
Analysis

```python
import pandas as pd
import anthropic
import json

client = anthropic.Anthropic()

def ai_eda_assistant(df: pd.DataFrame, domain: str = "general") -> dict:
    """LLM-powered EDA: generates hypotheses and an analysis plan."""
    summary = {
        "shape": list(df.shape),
        "columns": list(df.columns),
        "dtypes": {k: str(v) for k, v in df.dtypes.to_dict().items()},
        "null_counts": {k: int(v) for k, v in df.isnull().sum().to_dict().items()},
        "numeric_summary": df.describe().round(3).to_dict(),
        "sample_3_rows": df.head(3).to_dict(orient="records"),
    }
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        messages=[{"role": "user", "content": f"""You are a senior data scientist specialising in {domain}.

Dataset summary:
{json.dumps(summary, indent=2, default=str)}

Respond as JSON with these keys:
- data_quality_issues: list of 3 specific issues (missing data, outliers, skew, etc.)
- analysis_hypotheses: list of 5 testable hypotheses specific to this data
- feature_engineering_ideas: list of 3 derived features worth creating
- viz_code: Python matplotlib/seaborn code for the single most informative plot
- business_questions: list of 3 questions a stakeholder would want answered"""}]
    )
    try:
        result = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Strip markdown fences if present
        text = response.content[0].text.replace("```json", "").replace("```", "").strip()
        result = json.loads(text)
    return result

# Example usage
df = pd.read_csv("network_metrics.csv")
insights = ai_eda_assistant(df, domain="network performance monitoring")
print(insights["analysis_hypotheses"])
```
🗃
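LLM JSON output is not guaranteed to match the requested schema, so it is worth checking the response before trusting it downstream. A minimal sketch: `validate_eda_response` is a hypothetical helper whose key list mirrors the prompt above.

```python
# Hypothetical helper: check that the JSON returned by ai_eda_assistant()
# actually has the keys and types the prompt asked for.
EXPECTED_KEYS = {
    "data_quality_issues": list,
    "analysis_hypotheses": list,
    "feature_engineering_ideas": list,
    "viz_code": str,
    "business_questions": list,
}

def validate_eda_response(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the response is usable."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems
```

If the list is non-empty, retry the call (or fall back to a stricter prompt) rather than crashing mid-analysis.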
Natural Language to SQL with DuckDB
Query Interface

```shell
pip install duckdb
```

```python
import duckdb
import pandas as pd
import anthropic

client = anthropic.Anthropic()

def nl_to_sql(question: str, table_name: str, schema: str) -> tuple[str, pd.DataFrame]:
    """Convert a natural language question to DuckDB SQL and execute it."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        system="Convert questions to DuckDB SQL. Output only the SQL query, no explanation.",
        messages=[{"role": "user", "content":
            f"Table: {table_name}\nSchema: {schema}\nQuestion: {question}"}]
    )
    # Remove markdown fences if the model added them
    sql = response.content[0].text.strip()
    sql = sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()
    print(f"Generated SQL: {sql}")
    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM read_csv_auto('{table_name}.csv')")
    result = conn.execute(sql).df()
    conn.close()
    return sql, result

# FastAPI endpoint: natural language data querying
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    table: str

@app.post("/query")
async def query_data(request: QueryRequest) -> dict:
    # Get the schema from DuckDB without loading any rows
    conn = duckdb.connect()
    conn.execute(f"CREATE TABLE t AS SELECT * FROM read_csv_auto('{request.table}.csv') LIMIT 0")
    schema = conn.execute("DESCRIBE t").df().to_dict(orient="records")
    conn.close()
    schema_str = ", ".join(f"{r['column_name']} {r['column_type']}" for r in schema)
    sql, result = nl_to_sql(request.question, request.table, schema_str)
    return {
        "question": request.question,
        "sql": sql,
        "results": result.to_dict(orient="records"),
        "row_count": len(result),
    }

# Usage examples:
# "What is the average packet latency by port type in the last 7 days?"
# "Which top 5 sources have the highest error rate this month?"
# "Show me all anomalies where throughput dropped more than 20% vs previous hour"
```
🤖
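Executing model-generated SQL verbatim is the main risk in this design: nothing stops the model from emitting a DROP or DELETE. One hedge is a simple allow-list check before execution; `assert_read_only` below is a hypothetical helper, not a substitute for connecting to DuckDB in read-only mode.

```python
# Hypothetical guard: reject any generated statement that is not a single
# read-only SELECT (or WITH ... SELECT) before handing it to DuckDB.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "create",
             "attach", "copy", "pragma"}

def assert_read_only(sql: str) -> str:
    """Raise ValueError unless sql looks like one read-only SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    first_word = stripped.split(None, 1)[0].lower() if stripped else ""
    if first_word not in ("select", "with"):
        raise ValueError(f"only SELECT queries are allowed, got: {first_word!r}")
    tokens = {t.lower() for t in stripped.replace("(", " ").replace(")", " ").split()}
    if tokens & FORBIDDEN:
        raise ValueError("query contains a forbidden keyword")
    return stripped
```

Call it as `conn.execute(assert_read_only(sql))` inside `nl_to_sql`; for file-backed databases, `duckdb.connect(path, read_only=True)` adds a second layer of protection.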
scikit-learn Pipelines with LLM Features
ML

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import anthropic

client = anthropic.Anthropic()

def suggest_features(df: pd.DataFrame, target: str) -> list[str]:
    """LLM suggests feature engineering transformations."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Dataset columns: {list(df.columns)}
Target: {target}
Sample row: {df.head(1).to_dict(orient='records')[0]}
Suggest 5 pandas feature transformations as single-line expressions.
Format: new_col = expression
Only output the 5 lines, nothing else."""}]
    )
    lines = [l.strip() for l in response.content[0].text.strip().split("\n") if "=" in l]
    return lines[:5]

def apply_features(df: pd.DataFrame, feature_code: list[str]) -> pd.DataFrame:
    df = df.copy()
    for line in feature_code:
        try:
            col_name = line.split("=")[0].strip()
            expr = "=".join(line.split("=")[1:]).strip()
            df[col_name] = df.eval(expr)
        except Exception as e:
            print(f"Skipping feature '{line}': {e}")
    return df

def build_and_evaluate(df: pd.DataFrame, target: str) -> dict:
    # Exclude the target from both feature lists
    numeric_cols = df.select_dtypes(include="number").columns.drop(target, errors="ignore").tolist()
    categorical_cols = df.select_dtypes(include="object").columns.drop(target, errors="ignore").tolist()
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ])
    pipeline = Pipeline([
        ("pre", preprocessor),
        ("clf", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ])
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)
    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    return {
        "cv_accuracy_mean": round(cv_scores.mean(), 4),
        "cv_accuracy_std": round(cv_scores.std(), 4),
        "test_accuracy": round(pipeline.score(X_test, y_test), 4),
    }
```
🔍
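`apply_features` leans entirely on `DataFrame.eval`, which accepts column arithmetic but not arbitrary Python, so malformed suggestions fail safely rather than executing untrusted code. A quick standalone check with toy data; the suggestion strings are hand-written stand-ins for `suggest_features` output.

```python
import pandas as pd

df = pd.DataFrame({"bytes_in": [100, 250], "bytes_out": [50, 50], "latency_ms": [10, 40]})

# Hand-written stand-ins for suggest_features() output
suggestions = [
    "total_bytes = bytes_in + bytes_out",   # plain column arithmetic: works
    "io_ratio = bytes_in / bytes_out",      # works
    "bad_col = nonexistent * 2",            # unknown column: raises, gets skipped
]
for line in suggestions:
    col, expr = (part.strip() for part in line.split("=", 1))
    try:
        df[col] = df.eval(expr)
    except Exception as e:
        print(f"Skipping '{line}': {e}")

print(df.columns.tolist())
# bad_col is absent; total_bytes and io_ratio were added
```

Running `build_and_evaluate` on the raw frame and again after `apply_features` makes the CV-score impact of the suggested features directly comparable.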
SHAP + LLM Narrative for Stakeholders
Interpretability

```shell
pip install shap
```

```python
import shap
import numpy as np
import anthropic

client = anthropic.Anthropic()

def explain_and_narrate(pipeline, X_test, y_test, feature_names: list) -> dict:
    """Generate SHAP explanation + plain-English narrative for non-technical stakeholders.

    feature_names must be the post-preprocessing names, e.g.
    pipeline.named_steps["pre"].get_feature_names_out().
    """
    # Compute SHAP values on the classifier (after preprocessing)
    clf = pipeline.named_steps["clf"]
    X_transformed = pipeline.named_steps["pre"].transform(X_test)
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_transformed)
    # For binary classification, shap_values may be a list
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    # Top features by mean absolute SHAP value
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    top_n = min(5, len(feature_names))
    top_indices = mean_abs_shap.argsort()[-top_n:][::-1]
    top_features = [(feature_names[i], round(float(mean_abs_shap[i]), 4)) for i in top_indices]
    accuracy = pipeline.score(X_test, y_test)
    # LLM-generated executive narrative
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=400,
        messages=[{"role": "user", "content":
            f"""Write a 2-paragraph executive summary of this predictive model for a non-technical audience.
Model accuracy: {accuracy:.1%}
Top 5 most predictive features (name, importance score):
{chr(10).join(f'  {f}: {v}' for f, v in top_features)}
Paragraph 1: What the model does and how confident we should be.
Paragraph 2: What the most important factors are (explain in plain language, avoid ML jargon).
Do not use terms like SHAP, feature importance, or model architecture."""}]
    )
    return {
        "accuracy": round(accuracy, 4),
        "top_features": top_features,
        "narrative": response.content[0].text,
    }
```
🛠
Capstone: AI-Augmented Analysis Pipeline
2–3 weeks
Build an end-to-end AI-augmented data analysis pipeline for a real dataset from your professional domain.
Requirements
- Automated EDA: run ai_eda_assistant() and apply its suggested feature engineering
- Natural language query interface: POST /query endpoint that executes NL-to-SQL
- Predictive model: scikit-learn pipeline with LLM-suggested features, cross-validated
- SHAP explanation with LLM narrative — readable by a non-technical manager
- Automated report: function that accepts new data and produces a formatted HTML/PDF report
- Report includes: key metrics, top insights, model prediction with confidence, recommended actions
Use a dataset relevant to your work. The output should be a report you could actually send to your manager.
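For the automated-report requirement, the assembly step can be as simple as stdlib templating once the earlier pieces (EDA insights, model metrics, LLM narrative) are computed. A minimal sketch with a hypothetical `build_report` helper; swap in jinja2 for richer HTML or weasyprint for PDF output.

```python
from string import Template
from datetime import date

REPORT_TEMPLATE = Template("""\
<html><body>
<h1>Analysis Report &mdash; $report_date</h1>
<h2>Key Metrics</h2><ul>$metrics</ul>
<h2>Top Insights</h2><ul>$insights</ul>
<h2>Model</h2><p>Cross-validated accuracy: $accuracy</p>
<h2>Narrative</h2><p>$narrative</p>
</body></html>""")

def build_report(metrics: dict, insights: list[str], accuracy: float, narrative: str) -> str:
    """Assemble an HTML report from already-computed pipeline outputs."""
    return REPORT_TEMPLATE.substitute(
        report_date=date.today().isoformat(),
        metrics="".join(f"<li>{k}: {v}</li>" for k, v in metrics.items()),
        insights="".join(f"<li>{i}</li>" for i in insights),
        accuracy=f"{accuracy:.1%}",
        narrative=narrative,
    )
```

Because every input is regenerated by the earlier functions, rerunning the pipeline on fresh data yields an up-to-date report with no manual editing.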
MASTERY CHECKLIST
- Can use LLMs to generate EDA hypotheses and feature ideas from a dataset summary
- Can use DuckDB to query CSV/Parquet files with analytical SQL
- Can implement NL-to-SQL: user asks question in English, gets back a DataFrame
- Can build a scikit-learn ColumnTransformer + Pipeline for mixed data types
- Can apply LLM-suggested feature engineering and evaluate impact on CV score
- Can compute SHAP values and identify top features by mean absolute importance
- Can use LLMs to write a plain-English model explanation for non-technical stakeholders
- Capstone: full analysis pipeline producing a regenerable report on domain data
When complete: move to Part 9 — Portfolio and Launch.