We moved from a small first-pass set to a larger, deduplicated, caption-complete dataset. The source pool held 577 gallery images; the final round-two archive intentionally keeps only high-quality, unique image-caption pairs.
Final dataset used for training: 465 PNG images + 465 TXT captions. This run increases LoRA capacity (rank 32) and extends the schedule (6500 steps) to improve identity consistency across harder prompts.
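A caption-complete dataset means every image has exactly one matching caption file. A minimal sketch of the check we rely on, assuming the flat png/txt layout described above (the helper name and `expected` default are ours):

```python
from pathlib import Path

def check_pairs(dataset_dir: str, expected: int = 465) -> list[str]:
    """Return a list of problems; an empty list means the dataset is consistent."""
    root = Path(dataset_dir)
    pngs = {p.stem for p in root.glob("*.png")}
    txts = {t.stem for t in root.glob("*.txt")}
    problems = []
    if len(pngs) != expected:
        problems.append(f"expected {expected} png, found {len(pngs)}")
    for stem in sorted(pngs - txts):
        problems.append(f"missing caption: {stem}.txt")  # image without caption
    for stem in sorted(txts - pngs):
        problems.append(f"orphan caption: {stem}.txt")   # caption without image
    return problems
```

Running this before archiving catches the usual failure modes (dropped captions, leftover captions from deduplicated images) in one pass.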
Model + dataset repo: jpfraneto/anky-flux-lora-v2
Final dataset archive:
training-data/final-training-dataset-for-round-two.tar.gz
Training outputs:
training-runs/anky_flux_lora_v2/weights/anky_flux_lora_v2.safetensors,
training-runs/anky_flux_lora_v2/samples/...,
training-runs/anky_flux_lora_v2/meta/....
The run started on a fresh RunPod pod with an RTX PRO 6000 (96 GB VRAM) and 188 GB RAM. Initial setup stalled due to a host-level pod issue and an interrupted virtualenv bootstrap; after moving to a clean pod and repairing the environment, training proceeded normally.
Effective schedule: baseline sample generation, latent caching at the 512/768/1024 resolution buckets, then the full 6500-step optimization with checkpoint/sample saves every 500 steps.
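The save schedule above is just fixed-interval checkpointing; a one-line sketch (helper name is ours, not the trainer's):

```python
def save_steps(total_steps: int = 6500, every: int = 500) -> list[int]:
    """Steps at which a checkpoint + sample batch is written."""
    return list(range(every, total_steps + 1, every))
```

With the run's settings this yields 13 checkpoints, the last one at step 6500, so the final export always coincides with a saved sample band.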
1) CUDA kernel mismatch on Blackwell GPUs. The default torch build (2.5.1+cu124) failed with "no kernel image is available for execution on the device". Fix: reinstall torch/torchvision/torchaudio from cu128 wheels.
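The error means the wheel's compiled kernel list does not include the device's SM architecture. A pure-Python illustration of the check (not torch itself; `torch.cuda.get_arch_list()` returns strings like "sm_90", and the cu124 arch list below is an assumption about a typical build — Blackwell reports capability (12, 0), i.e. "sm_120"):

```python
def wheel_supports_device(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """True if the wheel ships kernels for this compute capability."""
    want = f"sm_{capability[0]}{capability[1]}"  # e.g. (12, 0) -> "sm_120"
    return want in arch_list

# Assumed arch list of a typical torch 2.5.1+cu124 build
CU124_ARCHS = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
```

On the actual pod, comparing `torch.cuda.get_arch_list()` against `torch.cuda.get_device_capability()` confirms the mismatch before any training step is attempted.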
2) Dataset extraction path mismatch. The archive extracted under final-training-dataset-for-round-two/, while scripts expected /workspace/dataset. Fix: move the folder, and auto-detect the dataset location in the bootstrap flow.
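A minimal sketch of the auto-detect half of that fix, assuming the bootstrap only needs to locate whichever extracted directory actually holds the images (function name and search logic are ours):

```python
from pathlib import Path
from typing import Optional

def find_dataset_root(search_dir: str) -> Optional[Path]:
    """Return the first directory under search_dir that contains .png files."""
    for png in sorted(Path(search_dir).rglob("*.png")):
        return png.parent  # parent of the first image found
    return None
```

The bootstrap can then move or symlink the detected folder to the expected /workspace/dataset path instead of hard-coding the archive's top-level directory name.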
3) Broken venv after interrupted setup. /workspace/venv/bin/activate was missing. Fix: recreate the venv before re-running setup.
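The repair can be made idempotent so a re-run of setup heals a half-built environment. A sketch using the stdlib venv module (the /workspace/venv path is from the report; the helper is ours):

```python
import shutil
import venv
from pathlib import Path

def ensure_venv(path: str, with_pip: bool = True) -> None:
    """Recreate the venv at `path` if its activate script is missing."""
    activate = Path(path) / "bin" / "activate"
    if not activate.exists():
        shutil.rmtree(path, ignore_errors=True)  # clear any half-built venv
        venv.create(path, with_pip=with_pip)     # fresh interpreter (+ pip)
```

Calling `ensure_venv("/workspace/venv")` at the top of the bootstrap makes the interrupted-setup failure self-correcting instead of fatal.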
The final weight and recent checkpoints are published. The canonical production weight for inference is:
training-runs/anky_flux_lora_v2/weights/anky_flux_lora_v2.safetensors.
Samples for each checkpoint band are published under:
training-runs/anky_flux_lora_v2/samples/.
Reproducibility logs/config/env snapshots are under:
training-runs/anky_flux_lora_v2/meta/.
The /generate Flux path now prefers the run-002 LoRA (anky_flux_lora_v2.safetensors) on ComfyUI (GPU0), with fallback to the prior LoRA filename if v2 is unavailable. Ollama remains isolated on GPU1.
This keeps inference fast and production-safe while preserving backward compatibility.
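The preference-with-fallback logic can be sketched as a simple ordered lookup (the prior LoRA filename "anky_flux_lora.safetensors" is an assumption, as is the helper name):

```python
from pathlib import Path
from typing import Optional

PREFERRED = "anky_flux_lora_v2.safetensors"
FALLBACK = "anky_flux_lora.safetensors"  # assumed prior filename

def pick_lora(lora_dir: str) -> Optional[str]:
    """Return the first available LoRA filename, preferring the v2 weight."""
    for name in (PREFERRED, FALLBACK):
        if (Path(lora_dir) / name).exists():
            return name
    return None
```

Because the check runs per request against the ComfyUI LoRA directory, dropping the v2 file in (or removing it) switches weights without a code change, which is what preserves backward compatibility.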
Priority upgrades: add a held-out validation prompt set, score identity drift per checkpoint, and gate final export on objective metrics plus human review instead of a fixed step count alone.
Operationally, the run is now one-shot reproducible: dataset URL, bootstrap, upload, and metadata are all documented and scriptable.