We moved from a small first-pass set to a larger, deduplicated, caption-complete dataset. The source pool held 577 gallery images; the final round-two archive intentionally keeps only high-quality, unique image-caption pairs.
Final dataset used for training: 465 PNG images + 465 TXT captions. This run increases LoRA capacity (rank 32) and extends the schedule (6500 steps) to improve identity consistency across harder prompts.
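A caption-complete dataset means every image has exactly one matching caption file. A minimal sketch of the check we rely on, assuming the flat png/txt layout described above (the helper name and `expected` default are ours):

```python
from pathlib import Path

def check_pairs(dataset_dir: str, expected: int = 465) -> list[str]:
    """Return a list of problems; an empty list means the dataset is consistent."""
    root = Path(dataset_dir)
    pngs = {p.stem for p in root.glob("*.png")}
    txts = {t.stem for t in root.glob("*.txt")}
    problems = []
    if len(pngs) != expected:
        problems.append(f"expected {expected} png, found {len(pngs)}")
    for stem in sorted(pngs - txts):
        problems.append(f"missing caption: {stem}.txt")  # image without caption
    for stem in sorted(txts - pngs):
        problems.append(f"orphan caption: {stem}.txt")   # caption without image
    return problems
```

Running this before archiving catches the usual failure modes (dropped captions, leftover captions from deduplicated images) in one pass.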
Model + dataset repo: jpfraneto/anky-flux-lora-v2
Final dataset archive:
training-data/final-training-dataset-for-round-two.tar.gz
Training outputs:
training-runs/anky_flux_lora_v2/weights/anky_flux_lora_v2.safetensors,
training-runs/anky_flux_lora_v2/samples/...,
training-runs/anky_flux_lora_v2/meta/....
The run started on a fresh RunPod pod with an RTX PRO 6000 (96 GB VRAM) and 188 GB RAM. Initial setup stalled due to a host-level pod issue and an interrupted virtualenv bootstrap; after moving to a clean pod and repairing the environment, training proceeded normally.
Effective schedule: baseline sample generation, latent caching at the 512/768/1024 resolution buckets, then the full 6500-step optimization with checkpoint/sample saves every 500 steps.
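The save schedule above is just fixed-interval checkpointing; a one-line sketch (helper name is ours, not the trainer's):

```python
def save_steps(total_steps: int = 6500, every: int = 500) -> list[int]:
    """Steps at which a checkpoint + sample batch is written."""
    return list(range(every, total_steps + 1, every))
```

With the run's settings this yields 13 checkpoints, the last one at step 6500, so the final export always coincides with a saved sample band.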
1) CUDA kernel mismatch on Blackwell GPUs. The default torch build (2.5.1+cu124) failed with "no kernel image is available for execution on the device". Fix: reinstall torch/torchvision/torchaudio from cu128 wheels.
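The error means the wheel's compiled kernel list does not include the device's SM architecture. A pure-Python illustration of the check (not torch itself; `torch.cuda.get_arch_list()` returns strings like "sm_90", and the cu124 arch list below is an assumption about a typical build — Blackwell reports capability (12, 0), i.e. "sm_120"):

```python
def wheel_supports_device(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """True if the wheel ships kernels for this compute capability."""
    want = f"sm_{capability[0]}{capability[1]}"  # e.g. (12, 0) -> "sm_120"
    return want in arch_list

# Assumed arch list of a typical torch 2.5.1+cu124 build
CU124_ARCHS = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
```

On the actual pod, comparing `torch.cuda.get_arch_list()` against `torch.cuda.get_device_capability()` confirms the mismatch before any training step is attempted.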
2) Dataset extraction path mismatch. The archive extracted under final-training-dataset-for-round-two/, while scripts expected /workspace/dataset. Fix: move the folder, and auto-detect the dataset location in the bootstrap flow.
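A minimal sketch of the auto-detect half of that fix, assuming the bootstrap only needs to locate whichever extracted directory actually holds the images (function name and search logic are ours):

```python
from pathlib import Path
from typing import Optional

def find_dataset_root(search_dir: str) -> Optional[Path]:
    """Return the first directory under search_dir that contains .png files."""
    for png in sorted(Path(search_dir).rglob("*.png")):
        return png.parent  # parent of the first image found
    return None
```

The bootstrap can then move or symlink the detected folder to the expected /workspace/dataset path instead of hard-coding the archive's top-level directory name.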
3) Broken venv after interrupted setup. /workspace/venv/bin/activate was missing. Fix: recreate the venv before re-running setup.
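The repair can be made idempotent so a re-run of setup heals a half-built environment. A sketch using the stdlib venv module (the /workspace/venv path is from the report; the helper is ours):

```python
import shutil
import venv
from pathlib import Path

def ensure_venv(path: str, with_pip: bool = True) -> None:
    """Recreate the venv at `path` if its activate script is missing."""
    activate = Path(path) / "bin" / "activate"
    if not activate.exists():
        shutil.rmtree(path, ignore_errors=True)  # clear any half-built venv
        venv.create(path, with_pip=with_pip)     # fresh interpreter (+ pip)
```

Calling `ensure_venv("/workspace/venv")` at the top of the bootstrap makes the interrupted-setup failure self-correcting instead of fatal.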
The final weight and recent checkpoints are published. The canonical production weight for inference is:
training-runs/anky_flux_lora_v2/weights/anky_flux_lora_v2.safetensors.
Samples for each checkpoint band are published under:
training-runs/anky_flux_lora_v2/samples/.
Reproducibility logs/config/env snapshots are under:
training-runs/anky_flux_lora_v2/meta/.
The /generate Flux path now prefers the run-002 LoRA (anky_flux_lora_v2.safetensors) on ComfyUI (GPU0), with fallback to the prior LoRA filename if v2 is unavailable. Ollama remains isolated on GPU1.
This keeps inference fast and production-safe while preserving backward compatibility.
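The preference-with-fallback logic can be sketched as a simple ordered lookup (the prior LoRA filename "anky_flux_lora.safetensors" is an assumption, as is the helper name):

```python
from pathlib import Path
from typing import Optional

PREFERRED = "anky_flux_lora_v2.safetensors"
FALLBACK = "anky_flux_lora.safetensors"  # assumed prior filename

def pick_lora(lora_dir: str) -> Optional[str]:
    """Return the first available LoRA filename, preferring the v2 weight."""
    for name in (PREFERRED, FALLBACK):
        if (Path(lora_dir) / name).exists():
            return name
    return None
```

Because the check runs per request against the ComfyUI LoRA directory, dropping the v2 file in (or removing it) switches weights without a code change, which is what preserves backward compatibility.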
Priority upgrades: add a held-out validation prompt set, score identity drift per checkpoint, and gate final export on objective metrics plus human review instead of a fixed step count alone.
Operationally, the run is now one-shot reproducible: dataset URL, bootstrap, upload, and metadata are all documented and scriptable.