The live recipe for fine-tuning FLUX.1-dev on anky with the current RunPod bootstrap.
Go to /training and use the curation tab to keep/reject images. For round two, the final prepared dataset is already published to HuggingFace and can be consumed directly by the one-shot RunPod command.
Round-two final archive: training-data/final-training-dataset-for-round-two.tar.gz in jpfraneto/anky-flux-lora-v2. Final trainable set: 465 PNG + 465 TXT caption pairs.
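To pull and unpack that archive by hand (outside the one-shot script), a small helper is enough. This is a sketch: the function name and destination directory are mine, and the URL is the round-two one above.

```shell
# Download a dataset tarball and unpack it into a target directory.
# Works with any curl-supported URL (https:// for HuggingFace, file:// locally).
fetch_dataset() {
  local url="$1" dest="$2" tmp
  tmp=$(mktemp)
  mkdir -p "$dest"
  curl -fsSL "$url" -o "$tmp"
  tar -xzf "$tmp" -C "$dest"
  rm -f "$tmp"
  echo "extracted $(find "$dest" -name '*.png' | wc -l) images"
}
```

For example: `fetch_dataset https://huggingface.co/jpfraneto/anky-flux-lora-v2/resolve/main/training-data/final-training-dataset-for-round-two.tar.gz /workspace/dataset`.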
Each caption teaches the model the prompt-to-visual mapping. Missing captions were generated and deduplicated before publishing this archive.
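If you rebuild the archive yourself, re-run the pairing check before publishing. A minimal sketch, assuming the flat PNG+TXT layout described above (the function name is mine):

```shell
# Count PNGs that lack a matching TXT caption in a flat dataset directory.
check_pairs() {
  local dir="$1" missing=0 img
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue                  # glob matched nothing
    if [ ! -f "${img%.png}.txt" ]; then
      echo "missing caption: $img"
      missing=$((missing + 1))
    fi
  done
  echo "unpaired images: $missing"
}
```

A clean round-two set should report `unpaired images: 0` over 465 pairs.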
Go to runpod.io and rent a high-memory GPU pod (A100/H100/RTX PRO 6000 class). Use a PyTorch/CUDA-ready template.
The key requirement is that /workspace is mounted on the large volume disk. Keep container/root disk separate.
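A quick way to confirm the mount before downloading anything; `df --output=source` is GNU coreutils, which standard RunPod images ship:

```shell
# True if the given path lives on a different filesystem than the root disk.
on_own_volume() {
  [ "$(df --output=source "$1" | tail -n 1)" != "$(df --output=source / | tail -n 1)" ]
}

on_own_volume /workspace || echo "WARNING: /workspace shares the root disk"
```

If the warning fires, fix the pod's volume configuration before proceeding; nothing downstream works with /workspace on the container disk.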
Set HF_HOME=/workspace/hf_cache before downloading models; the root disk fills up fast and kills the run. Once the pod is running, open a terminal and run the one-liner:
HF_TOKEN=hf_xxx ANKY_TOKEN=xxx_optional ANKY_DATASET_URL=https://huggingface.co/jpfraneto/anky-flux-lora-v2/resolve/main/training-data/final-training-dataset-for-round-two.tar.gz ANKY_DATASET_MIN_IMAGES=460 TRAIN_NAME=anky_flux_lora_v2 LORA_RANK=32 LORA_ALPHA=16 TRAIN_STEPS=6500 SAVE_EVERY=500 SAMPLE_EVERY=500 bash <(curl -fsSL https://anky.app/static/train_anky_setup.sh)
If HF_TOKEN or ANKY_TOKEN are omitted, the script now prompts you interactively. ANKY_TOKEN is optional (only needed for live dashboard updates).
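The interactive fallback is the standard read-if-unset pattern; roughly what the script does (a bash-only sketch, function name mine):

```shell
# Prompt for a variable only when it wasn't already passed in the environment.
ensure_var() {
  local name="$1" prompt="$2"
  if [ -z "${!name:-}" ]; then
    read -rp "$prompt: " "${name:?}"
  fi
  export "${name:?}"
}
```

For example, `ensure_var HF_TOKEN "Paste your HuggingFace token"` leaves a pre-set HF_TOKEN untouched and only prompts when it is empty.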
The script installs ai-toolkit, creates/fixes venv, installs PyTorch from cu128 wheels (Blackwell-compatible), pulls dataset, writes config, and launches training in tmux.
Wait until setup prints === All launched in tmux session 'anky' ===. Do not disconnect before the tmux session is live.
tmux ls
tmux attach -t anky:training
Detach safely with Ctrl+B, D. Training continues after disconnect.
Watch progress in tmux or tail logs directly:
tmux attach -t anky:training
tail -f /workspace/training.log
With 465 images and 6500 steps, expect several hours depending on the GPU and its clocks. Samples and checkpoints save every 500 steps.
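A back-of-envelope for the wall clock: total time is just steps times seconds-per-step. The per-step figure below is an assumption; read the real one off your training.log and plug it in:

```shell
# Rough ETA from step count and observed seconds per step.
steps=6500
sec_per_step=2                     # assumed; check your own training.log
total=$((steps * sec_per_step))
printf 'estimated %dh %02dm\n' $((total / 3600)) $((total % 3600 / 60))
# → estimated 3h 36m
```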
Once training finishes, upload weights + samples in one shot:
source /workspace/venv/bin/activate && HF_TOKEN=hf_xxx REPO_ID=jpfraneto/anky-flux-lora-v2 RUN_DIR=/workspace/output/anky_flux_lora_v2 bash -lc 'set -euo pipefail; STAGE=/tmp/hf_upload_anky_flux_lora_v2; export STAGE; rm -rf "$STAGE"; mkdir -p "$STAGE/weights" "$STAGE/samples"; cp -av "$RUN_DIR"/*.safetensors "$STAGE/weights/"; cp -av "$RUN_DIR/samples/." "$STAGE/samples/" 2>/dev/null || true; python -c "import os; from huggingface_hub import HfApi; HfApi(token=os.environ[\"HF_TOKEN\"]).upload_folder(folder_path=os.environ[\"STAGE\"], repo_id=os.environ[\"REPO_ID\"], repo_type=\"model\", path_in_repo=\"training-runs/anky_flux_lora_v2\", commit_message=\"Upload round-two final weights and samples\")"'
Then upload reproducibility metadata (config/log/env):
source /workspace/venv/bin/activate && HF_TOKEN=hf_xxx REPO_ID=jpfraneto/anky-flux-lora-v2 RUN=anky_flux_lora_v2 bash -lc 'set -euo pipefail; M=/tmp/${RUN}_meta; export M; rm -rf "$M"; mkdir -p "$M"; cp -f /workspace/train_anky.yaml "$M/"; [ -f /workspace/training.log ] && cp -f /workspace/training.log "$M/" || true; nvidia-smi > "$M/nvidia-smi.txt"; python -V > "$M/python-version.txt"; pip freeze > "$M/pip-freeze.txt"; python -c "import torch; print(\"torch\",torch.__version__); print(\"cuda\",torch.version.cuda); print(\"gpu\",torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"none\")" > "$M/torch-env.txt"; df -h / /workspace > "$M/disk.txt"; python -c "import os; from huggingface_hub import HfApi; HfApi(token=os.environ[\"HF_TOKEN\"]).upload_folder(folder_path=os.environ[\"M\"], repo_id=os.environ[\"REPO_ID\"], repo_type=\"model\", path_in_repo=\"training-runs/{}/meta\".format(os.environ[\"RUN\"]), commit_message=\"Upload run metadata\")"'
Once everything is on HuggingFace, delete the RunPod instance. The weights are safe. The training config is safe. The dataset lives on anky.app.
Cost for a full run (run 001): ~$3–5 for about 2 hours on an A100 80GB.
ModuleNotFoundError: torchaudio
ai-toolkit imports torchaudio at startup even if you don't use audio. The setup script installs it — don't skip that step.
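A fail-fast check you can run before launching, instead of discovering the missing module mid-startup:

```shell
# ai-toolkit imports torchaudio unconditionally, so verify it up front.
if python3 -c "import torchaudio" 2>/dev/null; then
  echo "torchaudio present"
else
  echo "torchaudio missing: pip install torchaudio --index-url https://download.pytorch.org/whl/cu128"
fi
```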
No space left on device
FLUX.1-dev is 23GB. If it downloads to /root/.cache (the small container disk), you run out of space. Always set HF_HOME=/workspace/hf_cache before starting.
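A small guard that sets the cache location and loudly flags a path that would land on the container disk; the function name is mine:

```shell
# Point the HuggingFace cache at the big volume before any model download.
set_hf_home() {
  export HF_HOME="$1"
  mkdir -p "$HF_HOME"
  case "$HF_HOME" in
    /workspace/*) echo "HF cache on volume: $HF_HOME" ;;
    *) echo "WARNING: HF_HOME ($HF_HOME) is not under /workspace" ;;
  esac
}
```

Call `set_hf_home /workspace/hf_cache` in every shell that downloads models; exports don't survive across terminals.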
xet download error
HuggingFace's xet protocol is broken on some pods. Set HF_HUB_DISABLE_XET=1 to fall back to normal HTTPS downloads.
GatedRepoError: 401
You need to accept the FLUX.1-dev license on the HuggingFace website and re-login with huggingface-cli login.
CUDA no kernel image available
This happens on newer Blackwell GPUs with older torch wheels. Use cu128 wheels:
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Broken venv after interrupted setup
If /workspace/venv/bin/activate is missing, rebuild venv:
rm -rf /workspace/venv
python3 -m venv /workspace/venv
Python output not appearing in terminal
Python buffers output when piped. Use python -u (unbuffered) to see logs in real time.
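A tiny demonstration: without -u, prints into a pipe sit in a stdio buffer until it fills or the process exits; with -u each line is flushed as it happens (setting PYTHONUNBUFFERED=1 in the environment does the same):

```shell
# Each print crosses the pipe immediately thanks to -u.
python3 -u -c 'print("step 1"); print("step 2")' | cat
```

The same flag on a long-running training script is what keeps `tail -f` on the log current.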
huggingface-cli: command not found / python -m huggingface_hub fails
Use the direct Python API upload commands above (HfApi().upload_folder) instead of relying on CLI entrypoints.