
Fine-tuning Stable Diffusion XL Natively on MacBook Pro M4 Max

Fine-tuning large models is typically done on CUDA-enabled devices. Whether using consumer-grade GPUs or specialized AI accelerator cards, the cost is often high. These setups also demand substantial power and efficient cooling, which usually requires a large desktop workstation. Alternatively, you can rent cloud computing resources by the hour using platforms like Runpod or Lambda.ai. However, this still incurs significant costs and often requires considerable time to upload data from your local machine to the cloud.

Since Apple introduced its Silicon chip series, PyTorch has added support for the MPS (Metal Performance Shaders) backend on M1 and later devices, significantly improving compute performance on macOS. Thanks to the unified memory architecture of Apple Silicon, it’s possible to load larger models than what most consumer GPUs can handle, reducing the constraints imposed by limited VRAM. This allows developers to fine-tune models locally while still enjoying the portability of a laptop.

Compared to bulky desktop workstations, this offers a more convenient alternative—although it’s worth noting that in terms of raw training speed, CUDA-based setups still have a clear advantage.
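
To see whether a given PyTorch build exposes the MPS backend at all, a quick check looks like this (a minimal sketch; it only assumes PyTorch is installed):

import torch

# Confirm that this PyTorch build was compiled with MPS support and that
# the Metal device is usable at runtime.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Tiny smoke test: allocate a tensor directly on the MPS device.
    x = torch.ones(3, device="mps")
    print(x.device)  # mps:0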

In this experiment, I used the Diffusers library from Hugging Face along with its full-parameter fine-tuning example for Stable Diffusion XL (SDXL).


Device Information

MacBook Pro M4 Max with 128GB RAM

Virtual Environment Setup

Python 3.12


Step 1. Download the Diffusers repository, create a virtual environment, and install the library

git clone https://github.com/huggingface/diffusers
cd diffusers
python3.12 -m venv venv
source ./venv/bin/activate
pip install .
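
Optionally, confirm from inside the activated environment that the library is importable (the version number will vary with the commit you cloned):

python -c "import diffusers; print(diffusers.__version__)"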

Step 2. Navigate to the fine-tuning example code and install its requirements


cd examples/text_to_image
pip install -r requirements_sdxl.txt

Step 3. Logging [Optional]


pip install wandb
wandb login  # Enter your API key

Step 4. Start training using the provided train_text_to_image_sdxl.py script.

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

python train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --report_to="wandb" \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10000 \
  --learning_rate=1e-06 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --validation_prompt="a cute Sundar Pichai creature" \
  --validation_epochs 5 \
  --checkpointing_steps=5000 \
  --output_dir="sdxl-naruto-model"
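
A note on the hyperparameters: with --train_batch_size=1 and --gradient_accumulation_steps=4, each optimizer update effectively sees 1 × 4 = 4 images, and --gradient_checkpointing trades some extra compute for lower memory use. The script is launched with plain python rather than accelerate launch; see the notes further down.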

Training Statistics

  • Dataset size: 1,221 samples (~740 MB download)

  • Model size: approximately 15 GB download

  • Speed: 5–6 seconds per iteration, estimated total time 17–18 hours (a rough cross-check follows this list)

  • Memory usage: 64–70 GB RAM

  • Disk space per checkpoint: around 30 GB
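
As a rough cross-check of the speed figure above: 10,000 training steps at about 6 seconds per iteration comes to roughly 60,000 seconds, or just under 17 hours, which lines up with the 17–18 hour estimate once periodic validation and checkpointing are added on top.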


Note

  1. Validation images are not visible

        If you don’t enable TensorBoard or wandb logging, the validation images are generated but never written to disk. You need to add a small snippet to the training script to save them, as shown below.


# After generating images, add the following code (before `for tracker in accelerator.trackers:`)
import os
validation_images_dir = os.path.join(args.output_dir, "validation_images")
os.makedirs(validation_images_dir, exist_ok=True)
for i, image in enumerate(images):
    image.save(os.path.join(validation_images_dir, f"{args.validation_prompt.replace(' ', '_')}_step_{global_step}_{i}.png"))

    

  1. "accelerate config" and "accelerate launch" commands do not work
  2. "enable_xformers_memory_efficient_attention" does not support xformers

  3. "use_8bit_adam" is not supported because bitsandbytes lacks GPU support

  4. "mixed_precision="fp16"" is not supported, and bf16 is also unsupported



Step 5. Use the trained model for inference


from diffusers import DiffusionPipeline
import torch

MODEL_PATH = "path/your/model"  # e.g. the --output_dir used during training ("sdxl-naruto-model")
pipeline = DiffusionPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to("mps")

prompt = "A naruto with green eyes and red legs."
image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("naruto.png")
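
For reproducible comparisons between checkpoints, the pipeline call also accepts a generator argument; a small variation of the call above (the seed value is arbitrary):

# Fix the random seed so repeated runs produce the same image.
generator = torch.Generator().manual_seed(42)
image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5, generator=generator).images[0]
image.save("naruto_seed42.png")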

For more details, refer to the official Hugging Face Diffusers SDXL training guide.
