
Performance Comparison of Multiple Image Generation Models on Apple Silicon MacBook Pro

Background

Since the introduction of the Apple Silicon chip series, Apple has consistently highlighted its exceptional capabilities in image processing and AI computation. The unified memory architecture provides significantly higher memory bandwidth, enabling accelerated performance for AI model workloads.

Within the community, while there is extensive discussion around models, workflows, and quantization techniques for acceleration, there is relatively little detailed data or analysis regarding their performance on Mac systems. Some users are curious about how the MacBook Pro compares to systems equipped with NVIDIA RTX discrete GPUs. They seek a balance between the portability and productivity benefits of macOS and the ability to engage in AI-related development and design tasks.

Content

This analysis evaluates the performance of several mainstream image generation models on an Apple Silicon MacBook Pro equipped with the M4 Max chip and 128 GB of unified memory.

The selected models are all post-SDXL image generation models from the open-source community, each garnering significant attention. These include Chroma, Flex.2 Preview, HiDream, SD3.5, Flux.1, and SDXL.

The primary goal of this test is to assess the inference performance of these models on the MacBook Pro with Apple Silicon, focusing on generation time and memory usage.

The testing environment consists of a 14-inch MacBook Pro with the top-tier M4 Max configuration and 128 GB of memory. To simulate a typical office workflow, the device was connected to a 2K external monitor as the primary display, while running Activity Monitor and Notion for data logging alongside the testing program.

Test Methodology

All models except SDXL utilize 8-bit quantized GGUF model versions. Each model generates a 1024x1024 image using the same prompt, the officially recommended sampler, and the default number of generation steps, with a consistent seed value. ComfyUI is employed as the runtime environment.

Prompt:
A young woman embodying classic trad wife style, wearing a modest vintage dress with delicate floral patterns and a fitted waist, soft curls pinned back, pearl earrings, subtle makeup, standing in a sunlit kitchen with lace curtains, but with a slightly undone collar revealing smooth collarbones, a gentle, inviting smile and a playful sparkle in her eyes, holding a fresh bouquet of flowers, soft fabric clinging slightly to her curves, warm golden hour lighting, tasteful and romantic, hint of intimate allure, photorealistic style
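To make the timing procedure concrete, below is a minimal sketch of how a run can be queued and timed through ComfyUI's HTTP API. The workflow file name, node id, and seed value are placeholders for illustration; the actual runs used each model's officially recommended sampler and default step count.

```python
# Timing sketch, assuming a local ComfyUI server on the default port and a
# workflow exported in API format (workflow_api.json). Node ids and input
# field names vary per workflow; "3" below is a hypothetical KSampler id.
import json
import time
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"
SEED = 123456789  # placeholder fixed seed, reused for every model

def queue_prompt(workflow: dict) -> str:
    """Submit a workflow and return the prompt id ComfyUI assigns."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]

def wait_until_done(prompt_id: str, poll_s: float = 1.0) -> None:
    """Poll /history until the queued prompt has an entry (i.e. finished)."""
    while True:
        with urllib.request.urlopen(f"{COMFY_URL}/history/{prompt_id}") as resp:
            history = json.loads(resp.read())
        if prompt_id in history:
            return
        time.sleep(poll_s)

with open("workflow_api.json") as f:
    workflow = json.load(f)

workflow["3"]["inputs"]["seed"] = SEED  # pin the seed for reproducibility

start = time.perf_counter()
wait_until_done(queue_prompt(workflow))
print(f"generation time: {time.perf_counter() - start:.1f}s")
```

Memory usage was read separately from Activity Monitor while each run was in flight.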

Test Results

(Charts of per-model generation time and memory usage appeared here.)
Model Overview

Chroma

Chroma v32

The Chroma model is a deeply modified version of Flux, trained on new datasets to enhance its capabilities. It is licensed for commercial use, making it an attractive option for professional applications. However, Chroma is still under active development, and its ecosystem is not yet fully mature. Features such as ControlNet and cache acceleration are currently unavailable. Some experimental LoRA (Low-Rank Adaptation) models for Chroma are available on Hugging Face, though they remain in early stages. The AI-Toolkit reportedly supports LoRA training for Chroma, but I have not personally verified this functionality.

Flex.2 Preview

Flex.2 Preview

The Flex.2 Preview model is an enhanced version of Flux Schnell, modified and retrained to incorporate advanced features. Notably, Flex.2 Preview integrates certain ControlNet functionalities, such as generating images using depth maps or leveraging Redux for image variation generation. These additions enable more precise control over image composition and structure, making it a versatile tool for creators.

HiDream

HiDream Full

HiDream Dev

HiDream Fast

Currently, HiDream lacks ControlNet support, but an editing model, HiDream-E1, has been introduced for image-to-image tasks. Cache acceleration nodes, such as TeaCache and TaylorSeer, are supported to optimize performance. However, in my experience, these nodes occasionally cause image generation failures. Additionally, HiDream requires a minimum resolution of 1024x1024 for optimal results, as lower resolutions significantly degrade image quality.

Stable Diffusion 3.5

SD 3.5 Large

SD 3.5 Large Turbo

SD 3.5 Medium

Following the remarkable impact of Flux, Stability AI released Stable Diffusion 3.5, aiming to differentiate itself with superior prompt adherence and a permissive commercial license. Despite these efforts, SD3.5 has not generated the same level of excitement in the community. Although it has been available for some time, the model has seen limited LoRA development and community discussion, reflecting a relatively muted response compared to Flux.

Flux.1

Flux.1 Dev

Flux.1 Schnell

Since its release last year, Flux.1 has consistently ranked among the top models for download volume on Hugging Face. The Flux.1-Dev variant is restricted to non-commercial use, while the Flux.1-Schnell variant is licensed for commercial applications, prompting the community to explore modifications based on Schnell to create new models. The ecosystem around Flux.1-Dev is exceptionally robust, with extensive support for quantization techniques and acceleration methods that continuously improve generation speed. Additionally, Flux.1-Dev benefits from ControlNet integration and a wide array of LoRA models, establishing it as a benchmark for image generation quality.

SDXL

SDXL

SDXL serves as the baseline for this performance comparison. It has been widely covered in Apple community discussions and technical articles, particularly regarding its optimization on Apple Silicon through Metal Performance Shaders (MPS) or the Core ML framework. Notably, when using Core ML, SDXL can leverage Apple's Neural Engine (ANE) for significant performance boosts, especially at resolutions below 512x512. However, this requires a model converted with the SPLIT_EINSUM attention implementation for ANE compatibility and optimal performance.
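As an illustration, here is a minimal sketch of loading a converted Core ML model with the Neural Engine enabled via coremltools. The .mlpackage path is a hypothetical placeholder; the conversion itself happens beforehand (the apple/ml-stable-diffusion repository provides conversion scripts with a SPLIT_EINSUM attention option).

```python
# Minimal sketch of steering a Core ML model onto the Neural Engine.
# "sdxl_unet.mlpackage" is a hypothetical path to a UNet converted with
# SPLIT_EINSUM attention (see apple/ml-stable-diffusion).
import coremltools as ct

# CPU_AND_NE asks Core ML to schedule work on the CPU and the ANE;
# ALL would also allow the GPU, CPU_AND_GPU would skip the ANE.
unet = ct.models.MLModel(
    "sdxl_unet.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
print(unet.get_spec().description)  # inspect the compiled model's inputs/outputs
```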

Additional Observations on Flux.1-Dev and ComfyUI

I previously tested the quantized version of the Flux.1-Dev model on a MacBook Pro with an M1 Pro chip and 16 GB of unified memory. The model could run effectively with 4-bit quantization, and 5-bit quantization was also feasible. A key characteristic of unified memory is that it rarely produces hard out-of-memory errors due to model size. However, larger models or higher-bit quantizations can result in extremely slow image generation, rendering the process nearly impractical.
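A back-of-the-envelope calculation shows why the bit width matters so much on a 16 GB machine. The sketch below assumes roughly 12 billion parameters for the Flux.1-Dev transformer and ignores quantization metadata as well as the text encoders and VAE, which add several more GB.

```python
# Approximate weight footprint of the Flux.1-Dev transformer (~12B params)
# at different quantization bit widths; GGUF scale/metadata overhead ignored.
PARAMS = 12e9

for bits in (4, 5, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.1f} GB")
# 4-bit: ~6.0 GB   5-bit: ~7.5 GB   8-bit: ~12.0 GB   16-bit: ~24.0 GB
```

At 5 bits the weights alone already consume roughly half of a 16 GB machine's memory, leaving little headroom for activations and the rest of the system.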

When using ComfyUI, generating images at excessively high resolutions, such as 2048x2048, can lead to crashes, particularly with large-parameter models. For transformer-based models, a 2048x2048 image contains four times as many latent tokens as a 1024x1024 one, and attention cost grows roughly quadratically with token count, so memory demand climbs steeply. Careful configuration of image dimensions is therefore necessary to ensure stable performance.

Conclusion and Future Outlook

This article aims to demonstrate the computational capabilities of the Apple MacBook Pro when running large-scale image generation models, addressing common questions about the hardware configurations required and their performance in real-world scenarios.

Local, on-device AI hardware is poised to become a critical direction for future development, as AI applications demand substantial memory and computational power. Apple's Neural Engine and unified memory architecture represent significant advancements in infrastructure, enabling efficient AI workloads.

The MacBook Pro is an exceptional productivity tool, capable of running these large models locally through quantization techniques. Thanks to its unified memory design, it avoids the out-of-memory issues often encountered with discrete GPUs. Moreover, during model inference or training, everyday applications like document editors and code IDEs continue to run smoothly, ensuring a seamless workflow.

Since the emergence of Flux, Apple has provided limited official guidance on hardware acceleration for newer models. While the MLX framework offers some performance improvements, as validated through personal testing, these gains are not substantial enough to outweigh the flexibility and rich ecosystem of tools like ComfyUI.
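For reference, the kind of spot check behind that observation can be as simple as the matmul micro-benchmark sketched below. The matrix size is arbitrary, a single operation is hardly a rigorous benchmark, and results will vary by chip and software version.

```python
# Rough matmul micro-benchmark contrasting MLX and PyTorch's MPS backend.
import time

import mlx.core as mx
import torch

N = 4096

# MLX is lazy: mx.eval forces computation, so it brackets the timed region.
a = mx.random.normal((N, N))
mx.eval(a @ a)  # warm-up
start = time.perf_counter()
mx.eval(a @ a)
mlx_s = time.perf_counter() - start

b = torch.randn(N, N, device="mps")
_ = b @ b  # warm-up
torch.mps.synchronize()
start = time.perf_counter()
c = b @ b
torch.mps.synchronize()  # wait for the GPU to finish before stopping the clock
mps_s = time.perf_counter() - start

print(f"MLX: {mlx_s * 1e3:.1f} ms, MPS: {mps_s * 1e3:.1f} ms")
```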

Currently, AI model development is progressing at a remarkable pace, but Apple’s chips have yet to match the performance of NVIDIA RTX GPUs. During extensive model training or inference, the Neural Engine is underutilized, with the GPU handling most of the workload. Additionally, many acceleration techniques optimized for CUDA are not compatible with Metal Performance Shaders (MPS). Looking ahead, there is hope that Apple will enhance AI development support and provide greater openness in its developer programs to bridge this gap.
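In practice, running CUDA-centric code on macOS usually starts with a guard like the following: PYTORCH_ENABLE_MPS_FALLBACK routes operators that MPS does not yet implement to the CPU, at a performance cost.

```python
# Common device guard for PyTorch code written with CUDA in mind.
# The fallback variable must be set before torch is imported.
import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"running on: {device}")
```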




