
how to run a training

the full recipe for fine-tuning FLUX.1-dev on anky. written from memory after run 001.


step 1

curate the dataset

Go to /training and swipe through generated Ankys. Keep the ones that feel true to the character — correct proportions, good lighting, recognizable. Reject blurry, distorted, or off-model images.

Aim for roughly 80–150 approved images. More isn't always better — quality matters more than quantity. Each approved image gets copied to data/training-images/ with a caption .txt file alongside it.

Each caption file contains the prompt that was used to generate its image. Captions teach the model the connection between words and visual features.
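
Before moving on, it's worth confirming every image has its caption and vice versa. A minimal sketch — the data/training-images path is the one used above; the script itself is a hypothetical helper, not part of the real pipeline:

```python
# check_pairs.py — hypothetical sanity check for the dataset folder.
# Flags images without a caption .txt, and orphaned .txt files.
from pathlib import Path

def check_pairs(folder):
    folder = Path(folder)
    images = {p.stem for p in folder.iterdir()
              if p.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}}
    captions = {p.stem for p in folder.glob("*.txt")}
    return sorted(images - captions), sorted(captions - images)

target = Path("data/training-images")
if target.exists():
    missing, orphans = check_pairs(target)
    print("images missing captions:", missing)
    print("captions missing images:", orphans)
```

Both lists should be empty before you upload anything to the pod.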

step 2

spin up a RunPod A100

Go to runpod.io and rent an A100 80GB SXM pod. Use the RunPod PyTorch 2.4.0 template (not the base image — you need CUDA pre-installed).

Set the volume disk to at least 50GB (the FLUX model alone is ~23GB). Container disk can be 20GB.

⚠️ Make sure to set HF_HOME=/workspace/hf_cache before downloading models — the root disk fills up fast and kills the run.

step 3

run the setup script

Once the pod is running, open a terminal and run:

curl -s https://anky.app/static/train_anky_setup.sh | bash

This installs ai-toolkit, all dependencies (including torchaudio — don't skip it), clones the repo, and sets up the environment. Takes about 5–10 minutes.

⚠️ Before running, you need to accept the FLUX.1-dev license at huggingface.co/black-forest-labs/FLUX.1-dev and log in with huggingface-cli login.

step 4

upload the dataset

Copy the training images from anky.app to the RunPod pod. The images are at data/training-images/ on the server. You can tar them and scp, or use rsync. On the pod, put them at /workspace/dataset/.

Each image needs a matching .txt caption file in the same folder. The caption should describe the image and include the trigger word anky.
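
The tar route looks like this — a rehearsal sketch that runs against a scratch copy so you can try it anywhere; on a real run the source is data/training-images/ on the server, the destination is /workspace/dataset/ on the pod, and the scp line (commented out here) uses the IP/port shown in the RunPod console:

```shell
# Rehearse the transfer with a scratch dataset (all paths here are placeholders).
mkdir -p /tmp/anky-demo/src /tmp/anky-demo/dst
touch /tmp/anky-demo/src/img001.png /tmp/anky-demo/src/img001.txt

# Pack images and captions together so pairs can't drift apart.
tar czf /tmp/anky-demo/dataset.tar.gz -C /tmp/anky-demo/src .

# On a real run: scp /tmp/anky-demo/dataset.tar.gz root@<pod-ip>:/workspace/

# On the pod: unpack into /workspace/dataset/ (here: the scratch dst).
tar xzf /tmp/anky-demo/dataset.tar.gz -C /tmp/anky-demo/dst
ls /tmp/anky-demo/dst
```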

step 5

start training in tmux

Always use tmux so training survives if your SSH connection drops:

tmux new -s train
source /workspace/venv/bin/activate
export HF_HOME=/workspace/hf_cache
export HF_HUB_DISABLE_XET=1
cd /workspace/ai-toolkit
python -u run.py /workspace/train_anky.yaml

Detach with Ctrl+B, D. Reattach later with tmux attach -t train.

Training 3000 steps on an A100 80GB takes roughly 1–2 hours. Sample images are generated every 500 steps at /workspace/output/anky_flux_lora/samples/.
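
For orientation, train_anky.yaml follows ai-toolkit's standard FLUX LoRA config shape. The sketch below is reconstructed from ai-toolkit's published example config, not the actual file — only the paths, trigger word, 3000 steps, and sample-every-500 cadence come from this doc; the rank, learning rate, optimizer, and resolutions are assumed defaults to verify against the examples in the ai-toolkit repo:

```yaml
# Hedged sketch of train_anky.yaml, based on ai-toolkit's example FLUX LoRA config.
job: extension
config:
  name: anky_flux_lora
  process:
    - type: sd_trainer
      training_folder: /workspace/output
      trigger_word: anky
      network:
        type: lora
        linear: 16           # LoRA rank — assumption
        linear_alpha: 16
      save:
        dtype: float16
        save_every: 500
      datasets:
        - folder_path: /workspace/dataset
          caption_ext: txt
          resolution: [512, 768, 1024]   # assumption
      train:
        batch_size: 1
        steps: 3000          # matches the 3000-step run above
        gradient_checkpointing: true
        optimizer: adamw8bit # assumption
        lr: 1e-4             # assumption
      model:
        name_or_path: black-forest-labs/FLUX.1-dev
        is_flux: true
        quantize: true
      sample:
        sample_every: 500    # matches the sample cadence above
        prompts:
          - "anky, portrait, soft lighting"   # any prompt with the trigger word
```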

step 6

upload to huggingface

Once training finishes, upload the weights and everything else using these one-liners from the pod terminal:

Upload the LoRA weights:

python3 -c "from huggingface_hub import HfApi; api=HfApi(); api.upload_file(path_or_fileobj='/workspace/output/anky_flux_lora/anky_flux_lora.safetensors', path_in_repo='anky_flux_lora.safetensors', repo_id='jpfraneto/anky-flux-lora-v1', repo_type='model'); print('done')"

Upload the training config:

python3 -c "from huggingface_hub import HfApi; api=HfApi(); api.upload_file(path_or_fileobj='/workspace/train_anky.yaml', path_in_repo='train_anky.yaml', repo_id='jpfraneto/anky-flux-lora-v1', repo_type='model'); print('done')"

Upload the README (managed on anky.app):

curl -s https://anky.app/static/hf/anky-flux-lora-v1-readme.md -o /tmp/readme.md && python3 -c "from huggingface_hub import HfApi; api=HfApi(); api.upload_file(path_or_fileobj='/tmp/readme.md', path_in_repo='README.md', repo_id='jpfraneto/anky-flux-lora-v1', repo_type='model'); print('done')"

Upload checkpoints + samples:

curl -s https://anky.app/static/hf/upload-checkpoints.py | python3
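
The three upload_file one-liners can also be folded into one small script. This is a hypothetical convenience, not part of the real workflow — the DRY_RUN flag is an addition so you can print the plan before committing; the paths and repo id are the ones from the one-liners above:

```python
# upload_all.py — hypothetical consolidation of the three upload_file one-liners.
# Set DRY_RUN = False on the pod after `huggingface-cli login`.
DRY_RUN = True

REPO_ID = "jpfraneto/anky-flux-lora-v1"
FILES = [  # (local path, path in repo) — same pairs as the one-liners above
    ("/workspace/output/anky_flux_lora/anky_flux_lora.safetensors",
     "anky_flux_lora.safetensors"),
    ("/workspace/train_anky.yaml", "train_anky.yaml"),
    ("/tmp/readme.md", "README.md"),
]

plan = [f"{src} -> {REPO_ID}:{dst}" for src, dst in FILES]
for line in plan:
    print(line)

if not DRY_RUN:
    from huggingface_hub import HfApi
    api = HfApi()
    for src, dst in FILES:
        api.upload_file(path_or_fileobj=src, path_in_repo=dst,
                        repo_id=REPO_ID, repo_type="model")
    print("done")
```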

step 7

delete the pod

Once everything is on HuggingFace, delete the RunPod instance. The weights are safe. The training config is safe. The dataset lives on anky.app.

Run 001 cost ~$3–5 total for about 2 hours on an A100 80GB.


known issues

things that went wrong and how to fix them

ModuleNotFoundError: No module named 'torchaudio'
ai-toolkit imports torchaudio at startup even if you don't use audio. The setup script installs it — don't skip that step.

No space left on device
FLUX.1-dev is 23GB. If it downloads to /root/.cache (the small container disk), you run out of space. Always set HF_HOME=/workspace/hf_cache before starting.

xet download error
HuggingFace's xet protocol is broken on some pods. Set HF_HUB_DISABLE_XET=1 to fall back to normal HTTPS downloads.

GatedRepoError: 401
You need to accept the FLUX.1-dev license on the HuggingFace website and re-login with huggingface-cli login.

torch install corrupted after Ctrl+C
If you interrupt a pip install mid-way, torch can be in a broken state. Fix with a forced reinstall as a single line (no backslash continuations):

pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124 --force-reinstall --no-deps

Python output not appearing in terminal
Python buffers output when piped. Use python -u (unbuffered) to see logs in real time.