release: trigger release prior to agentic workflow updates

Merge pull request #248 from GokuMohandas/dev
updated cluster env and local catch for efs
2026-03-04 15:44:17 -08:00 · 2023-12-07 11:37:37 -08:00 · 2023-12-07 11:36:49 -08:00 · 2023-12-07 11:34:26 -08:00 · 2023-09-19 13:56:37 -07:00 · 2023-09-19 13:56:16 -07:00
26 changed files with 3538 additions and 2165 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -4,6 +4,7 @@ stores/
 mlflow/
 results/
 workspaces/
+efs/

 # VSCode
 .vscode/
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -2,7 +2,7 @@
 # See https://pre-commit.com/hooks.html for more hooks
 repos:
 -   repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.4.0
+    rev: v4.5.0
    hooks:
    -   id: trailing-whitespace
    -   id: end-of-file-fixer
--- a/1
+++ b/1
@@ -12,6 +12,7 @@ style:
 # Cleaning
 .PHONY: clean
 clean: style
+	python notebooks/clear_cell_nums.py
 	find . -type f -name "*.DS_Store" -ls -delete
 	find . | grep -E "(__pycache__|\.pyc|\.pyo)" | xargs rm -rf
 	find . | grep -E ".pytest_cache" | xargs rm -rf
--- a/README.md
+++ b/README.md
@@ -83,7 +83,7 @@ We'll start by setting up our cluster with the environment and compute configura
  - Project: `madewithml`
  - Cluster environment name: `madewithml-cluster-env`
  # Toggle `Select from saved configurations`
-  - Compute config: `madewithml-cluster-compute`
+  - Compute config: `madewithml-cluster-compute-g5.4xlarge`
  ```

  > Alternatively, we can use the [CLI](https://docs.anyscale.com/reference/anyscale-cli) to create the workspace via `anyscale workspace create ...`
@@ -109,8 +109,19 @@ Now we're ready to clone the repository that has all of our code:

 ```bash
 git clone https://github.com/GokuMohandas/Made-With-ML.git .
-git remote set-url origin https://github.com/GITHUB_USERNAME/Made-With-ML.git  # <-- CHANGE THIS to your username
-git checkout -b dev
+```
+
+### Credentials
+
+```bash
+touch .env
+```
+```bash
+# Inside .env
+GITHUB_USERNAME="CHANGE_THIS_TO_YOUR_USERNAME"  # ← CHANGE THIS
+```
+```bash
+source .env
 ```

 ### Virtual environment
@@ -278,7 +289,6 @@ python madewithml/evaluate.py \

 ### Inference
 ```bash
-# Get run ID
 export EXPERIMENT_NAME="llm"
 export RUN_ID=$(python madewithml/predict.py get-best-run-id --experiment-name $EXPERIMENT_NAME --metric val_loss --mode ASC)
 python madewithml/predict.py predict \
@@ -317,15 +327,7 @@ python madewithml/predict.py predict \
  python madewithml/serve.py --run_id $RUN_ID
  ```

-  While the application is running, we can use it via cURL, Python, etc.:
-
-  ```bash
-  # via cURL
-  curl -X POST -H "Content-Type: application/json" -d '{
-    "title": "Transfer learning with transformers",
-    "description": "Using transformers for transfer learning on text classification tasks."
-  }' http://127.0.0.1:8000/predict
-  ```
+  Once the application is running, we can use it via cURL, Python, etc.:

  ```python
  # via Python
@@ -341,13 +343,6 @@ python madewithml/predict.py predict \
  ray stop  # shutdown
  ```

-```bash
-export HOLDOUT_LOC="https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/holdout.csv"
-curl -X POST -H "Content-Type: application/json" -d '{
-    "dataset_loc": "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/holdout.csv"
-  }' http://127.0.0.1:8000/evaluate
-```
-
 </details>

 <details open>
@@ -362,15 +357,7 @@ curl -X POST -H "Content-Type: application/json" -d '{
  python madewithml/serve.py --run_id $RUN_ID
  ```

-  While the application is running, we can use it via cURL, Python, etc.:
-
-  ```bash
-  # via cURL
-  curl -X POST -H "Content-Type: application/json" -d '{
-    "title": "Transfer learning with transformers",
-    "description": "Using transformers for transfer learning on text classification tasks."
-  }' http://127.0.0.1:8000/predict
-  ```
+  Once the application is running, we can use it via cURL, Python, etc.:

  ```python
  # via Python
@@ -399,7 +386,8 @@ export RUN_ID=$(python madewithml/predict.py get-best-run-id --experiment-name $
 pytest --run-id=$RUN_ID tests/model --verbose --disable-warnings

 # Coverage
-python3 -m pytest --cov madewithml --cov-report html
+python3 -m pytest tests/code --cov madewithml --cov-report html --disable-warnings  # html report
+python3 -m pytest tests/code --cov madewithml --cov-report term --disable-warnings  # terminal report
 ```

 ## Production
@@ -435,7 +423,7 @@ anyscale cluster-env build deploy/cluster_env.yaml --name $CLUSTER_ENV_NAME
 The compute configuration determines **what** resources our workloads will be executes on. We've already created this [compute configuration](./deploy/cluster_compute.yaml) for us but this is how we can create it ourselves.

 ```bash
-export CLUSTER_COMPUTE_NAME="madewithml-cluster-compute"
+export CLUSTER_COMPUTE_NAME="madewithml-cluster-compute-g5.4xlarge"
 anyscale cluster-compute create deploy/cluster_compute.yaml --name $CLUSTER_COMPUTE_NAME
 ```

@@ -497,17 +485,23 @@ We're not going to manually deploy our application every time we make a change.
  <img src="https://madewithml.com/static/images/mlops/cicd/cicd.png">
 </div>

-1. We'll start by adding the necessary credentials to the [`/settings/secrets/actions`](https://github.com/GokuMohandas/Made-With-ML/settings/secrets/actions) page of our GitHub repository.
+1. Create a new github branch to save our changes to and execute CI/CD workloads:
+```bash
+git remote set-url origin https://github.com/$GITHUB_USERNAME/Made-With-ML.git  # <-- CHANGE THIS to your username
+git checkout -b dev
+```
+
+2. We'll start by adding the necessary credentials to the [`/settings/secrets/actions`](https://github.com/GokuMohandas/Made-With-ML/settings/secrets/actions) page of our GitHub repository.

 ``` bash
 export ANYSCALE_HOST=https://console.anyscale.com
 export ANYSCALE_CLI_TOKEN=$YOUR_CLI_TOKEN  # retrieved from https://console.anyscale.com/o/madewithml/credentials
 ```

-2. Now we can make changes to our code (not on `main` branch) and push them to GitHub. But in order to push our code to GitHub, we'll need to first authenticate with our credentials before pushing to our repository:
+3. Now we can make changes to our code (not on `main` branch) and push them to GitHub. But in order to push our code to GitHub, we'll need to first authenticate with our credentials before pushing to our repository:

 ```bash
-git config --global user.name "Your Name"  # <-- CHANGE THIS to your name
+git config --global user.name $GITHUB_USERNAME  # <-- CHANGE THIS to your username
 git config --global user.email you@example.com  # <-- CHANGE THIS to your email
 git add .
 git commit -m ""  # <-- CHANGE THIS to your message
@@ -516,13 +510,13 @@ git push origin dev

 Now you will be prompted to enter your username and password (personal access token). Follow these steps to get personal access token: [New GitHub personal access token](https://github.com/settings/tokens/new) → Add a name → Toggle `repo` and `workflow` → Click `Generate token` (scroll down) → Copy the token and paste it when prompted for your password.

-3. Now we can start a PR from this branch to our `main` branch and this will trigger the [workloads workflow](/.github/workflows/workloads.yaml). If the workflow (Anyscale Jobs) succeeds, this will produce comments with the training and evaluation results directly on the PR.
+4. Now we can start a PR from this branch to our `main` branch and this will trigger the [workloads workflow](/.github/workflows/workloads.yaml). If the workflow (Anyscale Jobs) succeeds, this will produce comments with the training and evaluation results directly on the PR.

 <div align="center">
  <img src="https://madewithml.com/static/images/mlops/cicd/comments.png">
 </div>

-4. If we like the results, we can merge the PR into the `main` branch. This will trigger the [serve workflow](/.github/workflows/serve.yaml) which will rollout our new service to production!
+5. If we like the results, we can merge the PR into the `main` branch. This will trigger the [serve workflow](/.github/workflows/serve.yaml) which will rollout our new service to production!

 ### Continual learning

--- a/deploy/cluster_compute.yaml
+++ b/deploy/cluster_compute.yaml
@@ -1,12 +1,12 @@
-cloud: madewithml-us-east-2
-region: us-east2
+cloud: education-us-west-2
+region: us-west-2
 head_node_type:
  name: head_node_type
-  instance_type: m5.2xlarge  # 8 CPU, 0 GPU, 32 GB RAM
+  instance_type: g5.4xlarge
 worker_node_types:
 - name: gpu_worker
-  instance_type: g4dn.xlarge  # 4 CPU, 1 GPU, 16 GB RAM
-  min_workers: 0
+  instance_type: g5.4xlarge
+  min_workers: 1
  max_workers: 1
  use_spot: False
 aws:
--- a/deploy/cluster_env.yaml
+++ b/deploy/cluster_env.yaml
@@ -1,4 +1,4 @@
-base_image: anyscale/ray:2.6.0-py310-cu118
+base_image: anyscale/ray:2.7.0optimized-py310-cu118
 env_vars: {}
 debian_packages:
  - curl
--- a/deploy/jobs/workloads.sh
+++ b/deploy/jobs/workloads.sh
@@ -1,6 +1,5 @@
 #!/bin/bash
 export PYTHONPATH=$PYTHONPATH:$PWD
-export RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
 mkdir results

 # Test data
--- a/deploy/jobs/workloads.yaml
+++ b/deploy/jobs/workloads.yaml
@@ -1,5 +1,5 @@
 name: workloads
-project_id: prj_v9izs5t1d6b512ism8c5rkq4wm
+project_id: prj_wn6el5cu9dqwktk6t4cv54x8zh
 cluster_env: madewithml-cluster-env
 compute_config: madewithml-cluster-compute
 runtime_env:
--- a/deploy/services/serve_model.yaml
+++ b/deploy/services/serve_model.yaml
@@ -1,5 +1,5 @@
 name: madewithml
-project_id: prj_v9izs5t1d6b512ism8c5rkq4wm
+project_id: prj_wn6el5cu9dqwktk6t4cv54x8zh
 cluster_env: madewithml-cluster-env
 compute_config: madewithml-cluster-compute
 ray_serve_config:
--- a/madewithml/init.py
+++ b/madewithml/init.py
@@ -0,0 +1,3 @@
+from dotenv import load_dotenv
+
+load_dotenv()
--- a/madewithml/config.py
+++ b/madewithml/config.py
@@ -1,18 +1,24 @@
 # config.py
 import logging
+import os
 import sys
 from pathlib import Path

 import mlflow
-import pretty_errors  # NOQA: F401 (imported but unused)

 # Directories
 ROOT_DIR = Path(__file__).parent.parent.absolute()
 LOGS_DIR = Path(ROOT_DIR, "logs")
 LOGS_DIR.mkdir(parents=True, exist_ok=True)
+EFS_DIR = Path(f"/efs/shared_storage/madewithml/{os.environ.get('GITHUB_USERNAME', '')}")
+try:
+    Path(EFS_DIR).mkdir(parents=True, exist_ok=True)
+except OSError:
+    EFS_DIR = Path(ROOT_DIR, "efs")
+    Path(EFS_DIR).mkdir(parents=True, exist_ok=True)

 # Config MLflow
-MODEL_REGISTRY = Path("/tmp/mlflow")
+MODEL_REGISTRY = Path(f"{EFS_DIR}/mlflow")
 Path(MODEL_REGISTRY).mkdir(parents=True, exist_ok=True)
 MLFLOW_TRACKING_URI = "file://" + str(MODEL_REGISTRY.absolute())
 mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
--- a/madewithml/data.py
+++ b/madewithml/data.py
@@ -5,7 +5,6 @@ import numpy as np
 import pandas as pd
 import ray
 from ray.data import Dataset
-from ray.data.preprocessor import Preprocessor
 from sklearn.model_selection import train_test_split
 from transformers import BertTokenizer

@@ -135,13 +134,18 @@ def preprocess(df: pd.DataFrame, class_to_index: Dict) -> Dict:
    return outputs


-class CustomPreprocessor(Preprocessor):
+class CustomPreprocessor:
    """Custom preprocessor class."""

-    def _fit(self, ds):
+    def __init__(self, class_to_index={}):
+        self.class_to_index = class_to_index or {}  # mutable defaults
+        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
+
+    def fit(self, ds):
        tags = ds.unique(column="tag")
        self.class_to_index = {tag: i for i, tag in enumerate(tags)}
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
+        return self

-    def _transform_pandas(self, batch):  # could also do _transform_numpy
-        return preprocess(batch, class_to_index=self.class_to_index)
+    def transform(self, ds):
+        return ds.map_batches(preprocess, fn_kwargs={"class_to_index": self.class_to_index}, batch_format="pandas")
--- a/madewithml/evaluate.py
+++ b/madewithml/evaluate.py
@@ -8,13 +8,13 @@ import ray
 import ray.train.torch  # NOQA: F401 (imported but unused)
 import typer
 from ray.data import Dataset
-from ray.train.torch.torch_predictor import TorchPredictor
 from sklearn.metrics import precision_recall_fscore_support
 from snorkel.slicing import PandasSFApplier, slicing_function
 from typing_extensions import Annotated

 from madewithml import predict, utils
 from madewithml.config import logger
+from madewithml.predict import TorchPredictor

 # Initialize Typer CLI app
 app = typer.Typer()
@@ -133,8 +133,8 @@ def evaluate(
    y_true = np.stack([item["targets"] for item in values])

    # y_pred
-    z = predictor.predict(data=ds.to_pandas())["predictions"]
-    y_pred = np.stack(z).argmax(1)
+    predictions = preprocessed_ds.map_batches(predictor).take_all()
+    y_pred = np.array([d["output"] for d in predictions])

    # Metrics
    metrics = {
--- a/madewithml/models.py
+++ b/madewithml/models.py
@@ -1,13 +1,20 @@
+import json
+import os
+from pathlib import Path
+
 import torch
 import torch.nn as nn
+import torch.nn.functional as F
+from transformers import BertModel


-class FinetunedLLM(nn.Module):  # pragma: no cover, torch model
-    """Model architecture for a Large Language Model (LLM) that we will fine-tune."""
-
+class FinetunedLLM(nn.Module):
    def __init__(self, llm, dropout_p, embedding_dim, num_classes):
        super(FinetunedLLM, self).__init__()
        self.llm = llm
+        self.dropout_p = dropout_p
+        self.embedding_dim = embedding_dim
+        self.num_classes = num_classes
        self.dropout = torch.nn.Dropout(dropout_p)
        self.fc1 = torch.nn.Linear(embedding_dim, num_classes)

@@ -17,3 +24,36 @@ class FinetunedLLM(nn.Module):  # pragma: no cover, torch model
        z = self.dropout(pool)
        z = self.fc1(z)
        return z
+
+    @torch.inference_mode()
+    def predict(self, batch):
+        self.eval()
+        z = self(batch)
+        y_pred = torch.argmax(z, dim=1).cpu().numpy()
+        return y_pred
+
+    @torch.inference_mode()
+    def predict_proba(self, batch):
+        self.eval()
+        z = self(batch)
+        y_probs = F.softmax(z, dim=1).cpu().numpy()
+        return y_probs
+
+    def save(self, dp):
+        with open(Path(dp, "args.json"), "w") as fp:
+            contents = {
+                "dropout_p": self.dropout_p,
+                "embedding_dim": self.embedding_dim,
+                "num_classes": self.num_classes,
+            }
+            json.dump(contents, fp, indent=4, sort_keys=False)
+        torch.save(self.state_dict(), os.path.join(dp, "model.pt"))
+
+    @classmethod
+    def load(cls, args_fp, state_dict_fp):
+        with open(args_fp, "r") as fp:
+            kwargs = json.load(fp=fp)
+        llm = BertModel.from_pretrained("allenai/scibert_scivocab_uncased", return_dict=False)
+        model = cls(llm=llm, **kwargs)
+        model.load_state_dict(torch.load(state_dict_fp, map_location=torch.device("cpu")))
+        return model
--- a/madewithml/predict.py
+++ b/madewithml/predict.py
@@ -1,19 +1,20 @@
 import json
+from pathlib import Path
 from typing import Any, Dict, Iterable, List
 from urllib.parse import urlparse

 import numpy as np
-import pandas as pd
 import ray
-import torch
 import typer
 from numpyencoder import NumpyEncoder
 from ray.air import Result
-from ray.train.torch import TorchPredictor
 from ray.train.torch.torch_checkpoint import TorchCheckpoint
 from typing_extensions import Annotated

 from madewithml.config import logger, mlflow
+from madewithml.data import CustomPreprocessor
+from madewithml.models import FinetunedLLM
+from madewithml.utils import collate_fn

 # Initialize Typer CLI app
 app = typer.Typer()
@@ -48,25 +49,51 @@ def format_prob(prob: Iterable, index_to_class: Dict) -> Dict:
    return d


-def predict_with_proba(
-    df: pd.DataFrame,
-    predictor: ray.train.torch.torch_predictor.TorchPredictor,
+class TorchPredictor:
+    def __init__(self, preprocessor, model):
+        self.preprocessor = preprocessor
+        self.model = model
+        self.model.eval()
+
+    def __call__(self, batch):
+        results = self.model.predict(collate_fn(batch))
+        return {"output": results}
+
+    def predict_proba(self, batch):
+        results = self.model.predict_proba(collate_fn(batch))
+        return {"output": results}
+
+    def get_preprocessor(self):
+        return self.preprocessor
+
+    @classmethod
+    def from_checkpoint(cls, checkpoint):
+        metadata = checkpoint.get_metadata()
+        preprocessor = CustomPreprocessor(class_to_index=metadata["class_to_index"])
+        model = FinetunedLLM.load(Path(checkpoint.path, "args.json"), Path(checkpoint.path, "model.pt"))
+        return cls(preprocessor=preprocessor, model=model)
+
+
+def predict_proba(
+    ds: ray.data.dataset.Dataset,
+    predictor: TorchPredictor,
 ) -> List:  # pragma: no cover, tested with inference workload
    """Predict tags (with probabilities) for input data from a dataframe.

    Args:
        df (pd.DataFrame): dataframe with input features.
-        predictor (ray.train.torch.torch_predictor.TorchPredictor): loaded predictor from a checkpoint.
+        predictor (TorchPredictor): loaded predictor from a checkpoint.

    Returns:
        List: list of predicted labels.
    """
    preprocessor = predictor.get_preprocessor()
-    z = predictor.predict(data=df)["predictions"]
-    y_prob = torch.tensor(np.stack(z)).softmax(dim=1).numpy()
+    preprocessed_ds = preprocessor.transform(ds)
+    outputs = preprocessed_ds.map_batches(predictor.predict_proba)
+    y_prob = np.array([d["output"] for d in outputs.take_all()])
    results = []
    for i, prob in enumerate(y_prob):
-        tag = decode([z[i].argmax()], preprocessor.index_to_class)[0]
+        tag = preprocessor.index_to_class[prob.argmax()]
        results.append({"prediction": tag, "probabilities": format_prob(prob, preprocessor.index_to_class)})
    return results

@@ -125,11 +152,10 @@ def predict(
    # Load components
    best_checkpoint = get_best_checkpoint(run_id=run_id)
    predictor = TorchPredictor.from_checkpoint(best_checkpoint)
-    preprocessor = predictor.get_preprocessor()

    # Predict
-    sample_df = pd.DataFrame([{"title": title, "description": description, "tag": "other"}])
-    results = predict_with_proba(df=sample_df, predictor=predictor)
+    sample_ds = ray.data.from_items([{"title": title, "description": description, "tag": "other"}])
+    results = predict_proba(ds=sample_ds, predictor=predictor)
    logger.info(json.dumps(results, cls=NumpyEncoder, indent=2))
    return results

--- a/madewithml/serve.py
+++ b/madewithml/serve.py
@@ -1,12 +1,11 @@
 import argparse
+import os
 from http import HTTPStatus
 from typing import Dict

-import pandas as pd
 import ray
 from fastapi import FastAPI
 from ray import serve
-from ray.train.torch import TorchPredictor
 from starlette.requests import Request

 from madewithml import evaluate, predict
@@ -20,7 +19,7 @@ app = FastAPI(
 )


-@serve.deployment(route_prefix="/", num_replicas="1", ray_actor_options={"num_cpus": 8, "num_gpus": 0})
+@serve.deployment(num_replicas="1", ray_actor_options={"num_cpus": 8, "num_gpus": 0})
@serve.ingress(app)
 class ModelDeployment:
    def __init__(self, run_id: str, threshold: int = 0.9):
@@ -29,8 +28,7 @@ class ModelDeployment:
        self.threshold = threshold
        mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)  # so workers have access to model registry
        best_checkpoint = predict.get_best_checkpoint(run_id=run_id)
-        self.predictor = TorchPredictor.from_checkpoint(best_checkpoint)
-        self.preprocessor = self.predictor.get_preprocessor()
+        self.predictor = predict.TorchPredictor.from_checkpoint(best_checkpoint)

    @app.get("/")
    def _index(self) -> Dict:
@@ -54,11 +52,10 @@ class ModelDeployment:
        return {"results": results}

    @app.post("/predict/")
-    async def _predict(self, request: Request) -> Dict:
-        # Get prediction
+    async def _predict(self, request: Request):
        data = await request.json()
-        df = pd.DataFrame([{"title": data.get("title", ""), "description": data.get("description", ""), "tag": ""}])
-        results = predict.predict_with_proba(df=df, predictor=self.predictor)
+        sample_ds = ray.data.from_items([{"title": data.get("title", ""), "description": data.get("description", ""), "tag": ""}])
+        results = predict.predict_proba(ds=sample_ds, predictor=self.predictor)

        # Apply custom logic
        for i, result in enumerate(results):
@@ -75,5 +72,5 @@ if __name__ == "__main__":
    parser.add_argument("--run_id", help="run ID to use for serving.")
    parser.add_argument("--threshold", type=float, default=0.9, help="threshold for `other` class.")
    args = parser.parse_args()
-    ray.init()
+    ray.init(runtime_env={"env_vars": {"GITHUB_USERNAME": os.environ["GITHUB_USERNAME"]}})
    serve.run(ModelDeployment.bind(run_id=args.run_id, threshold=args.threshold))
--- a/madewithml/train.py
+++ b/madewithml/train.py
@@ -1,5 +1,7 @@
 import datetime
 import json
+import os
+import tempfile
 from typing import Tuple

 import numpy as np
@@ -9,21 +11,23 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 import typer
-from ray.air import session
-from ray.air.config import (
+from ray.air.integrations.mlflow import MLflowLoggerCallback
+from ray.data import Dataset
+from ray.train import (
+    Checkpoint,
    CheckpointConfig,
-    DatasetConfig,
+    DataConfig,
    RunConfig,
    ScalingConfig,
 )
-from ray.air.integrations.mlflow import MLflowLoggerCallback
-from ray.data import Dataset
-from ray.train.torch import TorchCheckpoint, TorchTrainer
+from ray.train.torch import TorchTrainer
+from torch.nn.parallel.distributed import DistributedDataParallel
 from transformers import BertModel
 from typing_extensions import Annotated

-from madewithml import data, models, utils
-from madewithml.config import MLFLOW_TRACKING_URI, logger
+from madewithml import data, utils
+from madewithml.config import EFS_DIR, MLFLOW_TRACKING_URI, logger
+from madewithml.models import FinetunedLLM

 # Initialize Typer CLI app
 app = typer.Typer()
@@ -105,18 +109,18 @@ def train_loop_per_worker(config: dict) -> None:  # pragma: no cover, tested via
    lr = config["lr"]
    lr_factor = config["lr_factor"]
    lr_patience = config["lr_patience"]
-    batch_size = config["batch_size"]
    num_epochs = config["num_epochs"]
+    batch_size = config["batch_size"]
    num_classes = config["num_classes"]

    # Get datasets
    utils.set_seeds()
-    train_ds = session.get_dataset_shard("train")
-    val_ds = session.get_dataset_shard("val")
+    train_ds = train.get_dataset_shard("train")
+    val_ds = train.get_dataset_shard("val")

    # Model
    llm = BertModel.from_pretrained("allenai/scibert_scivocab_uncased", return_dict=False)
-    model = models.FinetunedLLM(llm=llm, dropout_p=dropout_p, embedding_dim=llm.config.hidden_size, num_classes=num_classes)
+    model = FinetunedLLM(llm=llm, dropout_p=dropout_p, embedding_dim=llm.config.hidden_size, num_classes=num_classes)
    model = train.torch.prepare_model(model)

    # Training components
@@ -125,7 +129,8 @@ def train_loop_per_worker(config: dict) -> None:  # pragma: no cover, tested via
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=lr_factor, patience=lr_patience)

    # Training
-    batch_size_per_worker = batch_size // session.get_world_size()
+    num_workers = train.get_context().get_world_size()
+    batch_size_per_worker = batch_size // num_workers
    for epoch in range(num_epochs):
        # Step
        train_loss = train_step(train_ds, batch_size_per_worker, model, num_classes, loss_fn, optimizer)
@@ -133,9 +138,14 @@ def train_loop_per_worker(config: dict) -> None:  # pragma: no cover, tested via
        scheduler.step(val_loss)

        # Checkpoint
-        metrics = dict(epoch=epoch, lr=optimizer.param_groups[0]["lr"], train_loss=train_loss, val_loss=val_loss)
-        checkpoint = TorchCheckpoint.from_model(model=model)
-        session.report(metrics, checkpoint=checkpoint)
+        with tempfile.TemporaryDirectory() as dp:
+            if isinstance(model, DistributedDataParallel):  # cpu
+                model.module.save(dp=dp)
+            else:
+                model.save(dp=dp)
+            metrics = dict(epoch=epoch, lr=optimizer.param_groups[0]["lr"], train_loss=train_loss, val_loss=val_loss)
+            checkpoint = Checkpoint.from_directory(dp)
+            train.report(metrics, checkpoint=checkpoint)


@app.command()
@@ -182,7 +192,6 @@ def train_model(
        num_workers=num_workers,
        use_gpu=bool(gpu_per_worker),
        resources_per_worker={"CPU": cpu_per_worker, "GPU": gpu_per_worker},
-        _max_cpu_fraction_per_node=0.8,
    )

    # Checkpoint config
@@ -200,10 +209,7 @@ def train_model(
    )

    # Run config
-    run_config = RunConfig(
-        callbacks=[mlflow_callback],
-        checkpoint_config=checkpoint_config,
-    )
+    run_config = RunConfig(callbacks=[mlflow_callback], checkpoint_config=checkpoint_config, storage_path=EFS_DIR, local_dir=EFS_DIR)

    # Dataset
    ds = data.load_data(dataset_loc=dataset_loc, num_samples=train_loop_config["num_samples"])
@@ -212,14 +218,13 @@ def train_model(
    train_loop_config["num_classes"] = len(tags)

    # Dataset config
-    dataset_config = {
-        "train": DatasetConfig(fit=False, transform=False, randomize_block_order=False),
-        "val": DatasetConfig(fit=False, transform=False, randomize_block_order=False),
-    }
+    options = ray.data.ExecutionOptions(preserve_order=True)
+    dataset_config = DataConfig(datasets_to_split=["train"], execution_options=options)

    # Preprocess
    preprocessor = data.CustomPreprocessor()
-    train_ds = preprocessor.fit_transform(train_ds)
+    preprocessor = preprocessor.fit(train_ds)
+    train_ds = preprocessor.transform(train_ds)
    val_ds = preprocessor.transform(val_ds)
    train_ds = train_ds.materialize()
    val_ds = val_ds.materialize()
@@ -232,7 +237,7 @@ def train_model(
        run_config=run_config,
        datasets={"train": train_ds, "val": val_ds},
        dataset_config=dataset_config,
-        preprocessor=preprocessor,
+        metadata={"class_to_index": preprocessor.class_to_index},
    )

    # Train
@@ -252,5 +257,5 @@ def train_model(
 if __name__ == "__main__":  # pragma: no cover, application
    if ray.is_initialized():
        ray.shutdown()
-    ray.init()
+    ray.init(runtime_env={"env_vars": {"GITHUB_USERNAME": os.environ["GITHUB_USERNAME"]}})
    app()
--- a/madewithml/tune.py
+++ b/madewithml/tune.py
@@ -1,5 +1,6 @@
 import datetime
 import json
+import os

 import ray
 import typer
@@ -19,7 +20,7 @@ from ray.tune.search.hyperopt import HyperOptSearch
 from typing_extensions import Annotated

 from madewithml import data, train, utils
-from madewithml.config import MLFLOW_TRACKING_URI, logger
+from madewithml.config import EFS_DIR, MLFLOW_TRACKING_URI, logger

 # Initialize Typer CLI app
 app = typer.Typer()
@@ -72,7 +73,6 @@ def tune_models(
        num_workers=num_workers,
        use_gpu=bool(gpu_per_worker),
        resources_per_worker={"CPU": cpu_per_worker, "GPU": gpu_per_worker},
-        _max_cpu_fraction_per_node=0.8,
    )

    # Dataset
@@ -89,7 +89,8 @@ def tune_models(

    # Preprocess
    preprocessor = data.CustomPreprocessor()
-    train_ds = preprocessor.fit_transform(train_ds)
+    preprocessor = preprocessor.fit(train_ds)
+    train_ds = preprocessor.transform(train_ds)
    val_ds = preprocessor.transform(val_ds)
    train_ds = train_ds.materialize()
    val_ds = val_ds.materialize()
@@ -101,7 +102,7 @@ def tune_models(
        scaling_config=scaling_config,
        datasets={"train": train_ds, "val": val_ds},
        dataset_config=dataset_config,
-        preprocessor=preprocessor,
+        metadata={"class_to_index": preprocessor.class_to_index},
    )

    # Checkpoint configuration
@@ -117,10 +118,7 @@ def tune_models(
        experiment_name=experiment_name,
        save_artifact=True,
    )
-    run_config = RunConfig(
-        callbacks=[mlflow_callback],
-        checkpoint_config=checkpoint_config,
-    )
+    run_config = RunConfig(callbacks=[mlflow_callback], checkpoint_config=checkpoint_config, storage_path=EFS_DIR, local_dir=EFS_DIR)

    # Hyperparameters to start with
    initial_params = json.loads(initial_params)
@@ -178,5 +176,5 @@ def tune_models(
 if __name__ == "__main__":  # pragma: no cover, application
    if ray.is_initialized():
        ray.shutdown()
-    ray.init()
+    ray.init(runtime_env={"env_vars": {"GITHUB_USERNAME": os.environ["GITHUB_USERNAME"]}})
    app()
--- a/notebooks/benchmarks.ipynb
+++ b/notebooks/benchmarks.ipynb
@@ -58,7 +58,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
   "id": "e2c96931-d511-4c6e-b582-87d24455a11e",
   "metadata": {
    "tags": []
@@ -79,7 +79,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
   "id": "953a577e-3cd0-4c6b-81f9-8bc32850214d",
   "metadata": {
    "tags": []
@@ -101,7 +101,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
   "id": "1790e2f5-6b8b-425c-8842-a2b0ea8f3f07",
   "metadata": {
    "tags": []
@@ -113,7 +113,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
   "id": "6b9bfadb-ba49-4f5a-b216-4db14c8888ab",
   "metadata": {
    "tags": []
@@ -208,7 +208,7 @@
       "4  A PyTorch Implementation of \"Watch Your Step: ...            other  "
      ]
     },
-     "execution_count": 4,
+     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -222,7 +222,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
   "id": "aa5b95d5-d61e-48e4-9100-d9d2fc0d53fa",
   "metadata": {
    "tags": []
@@ -234,7 +234,7 @@
       "['computer-vision', 'other', 'natural-language-processing', 'mlops']"
      ]
     },
-     "execution_count": 5,
+     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -247,7 +247,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
   "id": "3c828129-8248-4e38-93a4-cabb097e7ba5",
   "metadata": {
    "tags": []
@@ -279,7 +279,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
   "id": "8e3c3f44-2c19-4c32-9bc5-e9a7a917d19d",
   "metadata": {},
   "outputs": [],
@@ -295,7 +295,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
   "id": "4950bdb4",
   "metadata": {},
   "outputs": [
@@ -337,7 +337,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
   "id": "b2aae14c-9870-4a27-b5ad-90f339686620",
   "metadata": {
    "tags": []
@@ -364,7 +364,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": null,
   "id": "03ee23e5",
   "metadata": {},
   "outputs": [
@@ -401,7 +401,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
   "id": "71c43e8c",
   "metadata": {},
   "outputs": [
@@ -416,7 +416,7 @@
       "  'description': 'A PyTorch implementation of \"Capsule Graph Neural Network\" (ICLR 2019).'}]"
      ]
     },
-     "execution_count": 11,
+     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -429,7 +429,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": null,
   "id": "c9359a91-ac19-48a4-babb-e65d53f39b42",
   "metadata": {
    "tags": []
@@ -462,7 +462,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": null,
   "id": "5fac795e",
   "metadata": {},
   "outputs": [
@@ -486,7 +486,7 @@
       "['other', 'computer-vision', 'computer-vision']"
      ]
     },
-     "execution_count": 13,
+     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -507,7 +507,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": null,
   "id": "e4cb38a8-44cb-4cea-828c-590f223d4063",
   "metadata": {
    "tags": []
@@ -543,7 +543,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": null,
   "id": "de2d0416",
   "metadata": {},
   "outputs": [],
@@ -576,7 +576,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": null,
   "id": "ff3c37fb",
   "metadata": {},
   "outputs": [],
@@ -618,7 +618,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": null,
   "id": "972fee2f-86e2-445e-92d0-923f5690132a",
   "metadata": {},
   "outputs": [],
@@ -647,7 +647,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": null,
   "id": "9ee4e745-ef56-4b76-8230-fcbe56ac46aa",
   "metadata": {
    "tags": []
@@ -663,7 +663,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": null,
   "id": "73780054-afeb-4ce6-8255-51bf91f9f820",
   "metadata": {
    "tags": []
@@ -709,7 +709,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": null,
   "id": "24af6d04-d29e-4adb-a289-4c34c2cc7ec8",
   "metadata": {
    "tags": []
@@ -780,7 +780,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": null,
   "id": "e22ed1e1-b34d-43d1-ae8b-32b1fd5be53d",
   "metadata": {
    "tags": []
@@ -815,7 +815,7 @@
       "  'tag': 'mlops'}]"
      ]
     },
-     "execution_count": 22,
+     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -833,7 +833,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": null,
   "id": "294548a5-9edf-4dea-ab8d-dc7464246810",
   "metadata": {
    "tags": []
@@ -864,7 +864,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": null,
   "id": "29bca273-3ea8-4ce0-9fa9-fe19062b7c5b",
   "metadata": {
    "tags": []
@@ -917,7 +917,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": null,
   "id": "3e59a3b9-69d9-4bb5-8b88-0569fcc72f0c",
   "metadata": {
    "tags": []
@@ -1001,7 +1001,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": null,
   "id": "15ea136e",
   "metadata": {},
   "outputs": [],
@@ -1020,7 +1020,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": null,
   "id": "ec0b498a-97c1-488c-a6b9-dc63a8a9df4d",
   "metadata": {
    "tags": []
@@ -1065,7 +1065,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": null,
   "id": "4cc80311",
   "metadata": {},
   "outputs": [],
@@ -1080,7 +1080,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 30,
+   "execution_count": null,
   "id": "6771b1d2",
   "metadata": {},
   "outputs": [
--- a/notebooks/clear_cell_nums.py
+++ b/notebooks/clear_cell_nums.py
@@ -0,0 +1,23 @@
+from pathlib import Path
+
+import nbformat
+
+
+def clear_execution_numbers(nb_path):
+    with open(nb_path, "r", encoding="utf-8") as f:
+        nb = nbformat.read(f, as_version=4)
+    for cell in nb["cells"]:
+        if cell["cell_type"] == "code":
+            cell["execution_count"] = None
+            for output in cell["outputs"]:
+                if "execution_count" in output:
+                    output["execution_count"] = None
+    with open(nb_path, "w", encoding="utf-8") as f:
+        nbformat.write(nb, f)
+
+
+if __name__ == "__main__":
+    NOTEBOOK_DIR = Path(__file__).parent
+    notebook_fps = list(NOTEBOOK_DIR.glob("**/*.ipynb"))
+    for fp in notebook_fps:
+        clear_execution_numbers(fp)
--- a/notebooks/madewithml.ipynb
+++ b/notebooks/madewithml.ipynb
--- a/requirements.txt
+++ b/requirements.txt
@@ -7,8 +7,8 @@ nltk==3.8.1
 numpy==1.24.3
 numpyencoder==0.3.0
 pandas==2.0.1
-pretty-errors==1.2.25
-ray[air]==2.6.0
+python-dotenv==1.0.0
+ray[air]==2.7.0
 scikit-learn==1.2.2
 snorkel==0.9.9
 SQLAlchemy==1.4.48
--- a/tests/code/test_data.py
+++ b/tests/code/test_data.py
@@ -54,5 +54,7 @@ def test_preprocess(df, class_to_index):

 def test_fit_transform(dataset_loc, preprocessor):
    ds = data.load_data(dataset_loc=dataset_loc)
-    preprocessor.fit_transform(ds)
+    preprocessor = preprocessor.fit(ds)
+    preprocessed_ds = preprocessor.transform(ds)
    assert len(preprocessor.class_to_index) == 4
+    assert ds.count() == preprocessed_ds.count()
--- a/tests/code/test_utils.py
+++ b/tests/code/test_utils.py
@@ -4,6 +4,7 @@ from pathlib import Path
 import numpy as np
 import pytest
 import torch
+from ray.train.torch import get_device

 from madewithml import utils

@@ -42,9 +43,9 @@ def test_collate_fn():
    }
    processed_batch = utils.collate_fn(batch)
    expected_batch = {
-        "ids": torch.tensor([[1, 2, 0], [1, 2, 3]], dtype=torch.int32),
-        "masks": torch.tensor([[1, 1, 0], [1, 1, 1]], dtype=torch.int32),
-        "targets": torch.tensor([3, 1], dtype=torch.int64),
+        "ids": torch.as_tensor([[1, 2, 0], [1, 2, 3]], dtype=torch.int32, device=get_device()),
+        "masks": torch.as_tensor([[1, 1, 0], [1, 1, 1]], dtype=torch.int32, device=get_device()),
+        "targets": torch.as_tensor([3, 1], dtype=torch.int64, device=get_device()),
    }
    for k in batch:
        assert torch.allclose(processed_batch[k], expected_batch[k])
--- a/tests/model/conftest.py
+++ b/tests/model/conftest.py
@@ -1,7 +1,7 @@
 import pytest
-from ray.train.torch.torch_predictor import TorchPredictor

 from madewithml import predict
+from madewithml.predict import TorchPredictor


 def pytest_addoption(parser):
--- a/tests/model/utils.py
+++ b/tests/model/utils.py
@@ -1,12 +1,9 @@
-import numpy as np
-import pandas as pd
+import ray

 from madewithml import predict


 def get_label(text, predictor):
-    df = pd.DataFrame({"title": [text], "description": "", "tag": "other"})
-    z = predictor.predict(data=df)["predictions"]
-    preprocessor = predictor.get_preprocessor()
-    label = predict.decode(np.stack(z).argmax(1), preprocessor.index_to_class)[0]
-    return label
+    sample_ds = ray.data.from_items([{"title": text, "description": "", "tag": "other"}])
+    results = predict.predict_proba(ds=sample_ds, predictor=predictor)
+    return results[0]["prediction"]
Author	SHA1	Message	Date
Goku Mohandas	3361aeb8dd	release: trigger release prior to agentic workflow updates	2026-03-04 15:44:17 -08:00
Goku Mohandas	66e8e43711	Merge pull request #248 from GokuMohandas/dev updated cluster env and local catch for efs	2023-12-07 11:37:37 -08:00
GokuMohandas	be3d2732a8	updated cluser compute, environment and workloads	2023-12-07 11:36:49 -08:00
GokuMohandas	11ab35b251	added local catch for EFS OS/permissions issues	2023-12-07 11:34:26 -08:00
Goku Mohandas	540754392a	Merge pull request #241 from GokuMohandas/dev updating to Ray 2.7	2023-09-19 13:56:37 -07:00
GokuMohandas	f682fbffa4	updating to Ray 2.7	2023-09-19 13:56:16 -07:00
GokuMohandas	14671a438a	updated project id for workloads	2023-09-18 22:08:25 -07:00
GokuMohandas	b98bd5b1ae	updated to Ray 2.7	2023-09-18 22:03:20 -07:00
Goku Mohandas	71b3d50a05	Merge pull request #239 from GokuMohandas/dev adding dotenv for credentials management	2023-09-15 09:46:18 -07:00
Goku Mohandas	097ffaadc8	adding dotenv for credentials management	2023-09-15 09:44:38 -07:00