
Fine-tuning Stable Diffusion XL Natively on MacBook Pro M4 Max

Fine-tuning large models is typically done on CUDA-enabled devices. Whether using consumer-grade GPUs or specialized AI accelerator cards, the cost is often high. These setups also demand substantial power and efficient cooling, which usually requires a large desktop workstation. Alternatively, you can rent cloud computing resources by the hour using platforms like Runpod or Lambda.ai. However, this still incurs significant costs and often requires considerable time to upload data from your local machine to the cloud.

Since Apple introduced its Silicon chip series, PyTorch has added support for the MPS (Metal Performance Shaders) backend on M1 and later devices, significantly improving compute performance on macOS. Thanks to the unified memory architecture of Apple Silicon, it’s possible to load larger models than what most consumer GPUs can handle, reducing the constraints imposed by limited VRAM. This allows developers to fine-tune models locally while still enjoying the portability of a laptop.

Compared to bulky desktop workstations, this offers a more convenient alternative—although it’s worth noting that in terms of raw training speed, CUDA-based setups still have a clear advantage.
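
To see whether a given PyTorch build exposes the MPS backend at all, a quick check looks like this (a minimal sketch; it only assumes PyTorch is installed):

import torch

# Confirm that this PyTorch build was compiled with MPS support and that
# the Metal device is usable at runtime.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Tiny smoke test: allocate a tensor directly on the MPS device.
    x = torch.ones(3, device="mps")
    print(x.device)  # mps:0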

In this experiment, I used the Diffusers library from Hugging Face along with its full-parameter fine-tuning example for Stable Diffusion XL (SDXL).


Device Information

MacBook Pro M4 Max with 128GB RAM

Virtual Environment Setup

Python 3.12


Step 1. Download the Diffusers repository, create a virtual environment, and install the library

git clone https://github.com/huggingface/diffusers
cd diffusers
python3.12 -m venv venv
source ./venv/bin/activate
pip install .
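
Optionally, confirm from inside the activated environment that the library is importable (the version number will vary with the commit you cloned):

python -c "import diffusers; print(diffusers.__version__)"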

Step 2. Navigate to the fine-tuning example code and install its requirements


cd examples/text_to_image
pip install -r requirements_sdxl.txt

Step 3. Logging [Optional]


pip install wandb
wandb login  # Enter your API key

Step 4. Start training using the provided train_text_to_image_sdxl.py script.

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

python train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --report_to="wandb" \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10000 \
  --learning_rate=1e-06 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --validation_prompt="a cute Sundar Pichai creature" \
  --validation_epochs 5 \
  --checkpointing_steps=5000 \
  --output_dir="sdxl-naruto-model"
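
A note on the hyperparameters: with --train_batch_size=1 and --gradient_accumulation_steps=4, each optimizer update effectively sees 1 × 4 = 4 images, and --gradient_checkpointing trades some extra compute for lower memory use. The script is launched with plain python rather than accelerate launch; see the notes further down.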

Training Statistics

  • Dataset size: 1,221 samples (~740 MB download)

  • Model size: approximately 15 GB download

  • Speed: 5–6 seconds per iteration, estimated total time 17–18 hours (a rough cross-check follows this list)

  • Memory usage: 64–70 GB RAM

  • Disk space per checkpoint: around 30 GB
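
As a rough cross-check of the speed figure above: 10,000 training steps at about 6 seconds per iteration comes to roughly 60,000 seconds, or just under 17 hours, which lines up with the 17–18 hour estimate once periodic validation and checkpointing are added on top.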


Note

  1. Validation images are not visible

        If you don’t enable TensorBoard or wandb logging, the validation images are generated but never written to disk. You need to add a small snippet to the training script to save them, as shown below.


# After generating images, add the following code (before `for tracker in accelerator.trackers:`)
import os
validation_images_dir = os.path.join(args.output_dir, "validation_images")
os.makedirs(validation_images_dir, exist_ok=True)
for i, image in enumerate(images):
    image.save(os.path.join(validation_images_dir, f"{args.validation_prompt.replace(' ', '_')}_step_{global_step}_{i}.png"))

    

  1. "accelerate config" and "accelerate launch" commands do not work
  2. "enable_xformers_memory_efficient_attention" does not support xformers

  3. "use_8bit_adam" is not supported because bitsandbytes lacks GPU support

  4. "mixed_precision="fp16"" is not supported, and bf16 is also unsupported



Step 5. Use the trained model for inference


from diffusers import DiffusionPipeline
import torch

MODEL_PATH = "path/your/model"  # e.g. the --output_dir used during training ("sdxl-naruto-model")
pipeline = DiffusionPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to("mps")

prompt = "A naruto with green eyes and red legs."
image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("naruto.png")
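
For reproducible comparisons between checkpoints, the pipeline call also accepts a generator argument; a small variation of the call above (the seed value is arbitrary):

# Fix the random seed so repeated runs produce the same image.
generator = torch.Generator().manual_seed(42)
image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5, generator=generator).images[0]
image.save("naruto_seed42.png")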

For more details, refer to the official Hugging Face Diffusers SDXL training guide.
