
Numba JIT Tutorial and Adapting PyCUDA to Apple Silicon
Numba JIT on Apple Silicon
This tutorial explores using Numba's @jit(nopython=True) decorator to optimize Python code for faster execution. The decorator compiles a Python function to machine code the first time it is called, bypassing the interpreter entirely, which can dramatically improve performance for numerical tasks.
Basic Example: Sum of Squares
from numba import jit

@jit(nopython=True)
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))  # Output: 285
On Apple Silicon, Numba is well suited to optimizing CPU-bound tasks. Its GPU backend targets NVIDIA CUDA, so it cannot drive the Metal GPU or the Neural Engine, but its compiled (and optionally parallelized) CPU code runs efficiently across the multiple cores of Apple's M1/M2 chips, as the sketch below shows.
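As a minimal sketch of multicore use, Numba's parallel=True option together with prange can spread a loop across the M-series CPU cores (the function name and data here are illustrative):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(arr):
    total = 0.0
    # prange splits loop iterations across the available CPU cores;
    # Numba recognizes total as a reduction variable
    for i in prange(arr.size):
        total += arr[i] * arr[i]
    return total

data = np.arange(1_000_000, dtype=np.float64)
print(parallel_sum_of_squares(data))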
Transition to Metal for GPU
To offload heavy parallel work to the GPU, Apple provides Metal, its API for high-performance graphics and compute on macOS and iOS. Compute kernels are written in the Metal Shading Language (MSL) and fill the role that CUDA C kernels play in environments like PyCUDA.
PyCUDA to Metal for Apple Silicon
PyCUDA is typically used to run GPU kernels on NVIDIA hardware through CUDA, which is not available on Apple Silicon. To port such code, you move the kernels from CUDA C to Metal, which handles parallel GPU workloads on macOS and iOS. For comparison, the sketch below shows the kind of PyCUDA code being ported.
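As a hedged reference point (this assumes an NVIDIA GPU and an installed CUDA toolkit, so it will not run on a Mac), here is the same element-wise addition written with PyCUDA:

import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add_arrays(const float *inA, const float *inB,
                           float *out, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)  // guard threads launched past the end of the array
        out[id] = inA[id] + inB[id];
}
""")
add_arrays = mod.get_function("add_arrays")

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
out = np.empty_like(a)
add_arrays(drv.In(a), drv.In(b), drv.Out(out), np.int32(n),
           block=(32, 1, 1), grid=((n + 31) // 32, 1))
print(np.allclose(out, a + b))  # True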
Metal Shading Language (MSL) for Compute Kernels
Here is an MSL compute kernel that adds two arrays element-wise:
#include <metal_stdlib>
using namespace metal;

kernel void add_arrays(const device float* inA [[buffer(0)]],
                       const device float* inB [[buffer(1)]],
                       device float* out [[buffer(2)]],
                       uint id [[thread_position_in_grid]]) {
    out[id] = inA[id] + inB[id];
}
You can compile and run this code using Swift to dispatch the compute kernel on the GPU.
Running Metal from Swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

// MSL source, compiled at runtime
let shader = """
#include <metal_stdlib>
using namespace metal;
kernel void add_arrays(const device float* inA [[buffer(0)]],
                       const device float* inB [[buffer(1)]],
                       device float* out [[buffer(2)]],
                       uint id [[thread_position_in_grid]]) {
    out[id] = inA[id] + inB[id];
}
"""

// Compile the library and build a pipeline for the kernel
let library = try! device.makeLibrary(source: shader, options: nil)
let addFunction = library.makeFunction(name: "add_arrays")!
let pipelineState = try! device.makeComputePipelineState(function: addFunction)

// Input data and the GPU buffers that back it
let a: [Float] = [1, 2, 3, 4]
let b: [Float] = [10, 20, 30, 40]
let byteCount = a.count * MemoryLayout<Float>.stride
let bufferA = device.makeBuffer(bytes: a, length: byteCount)!
let bufferB = device.makeBuffer(bytes: b, length: byteCount)!
let bufferOut = device.makeBuffer(length: byteCount)!

// Encode and dispatch the kernel
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipelineState)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferOut, offset: 0, index: 2)
encoder.dispatchThreads(MTLSize(width: a.count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read the results back from the shared buffer
let result = bufferOut.contents().bindMemory(to: Float.self, capacity: a.count)
print((0..<a.count).map { result[$0] })  // [11.0, 22.0, 33.0, 44.0]
CoreML for Apple NPU
CoreML is optimized for running machine learning models on Apple Silicon, where it can schedule work on the CPU, GPU, or Neural Engine (NPU). You typically train a model in PyTorch, TensorFlow, or MLX, then convert it with coremltools for inference on iOS and macOS devices.
Example of CoreML Conversion:
import coremltools as ct
import torch

# Load a full PyTorch model object (saved with torch.save(model, 'model.pth'))
model = torch.load('model.pth')
model.eval()

# coremltools converts TorchScript, so trace the model with an example input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

mlmodel = ct.convert(traced_model,
                     inputs=[ct.TensorType(name="input", shape=(1, 3, 224, 224))])
mlmodel.save('model.mlpackage')  # recent coremltools produces an ML Program package
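As a hypothetical smoke test of the converted model (the input name matches the TensorType above; a real model would need properly preprocessed data, and predict runs on macOS only):

import numpy as np
import coremltools as ct

# ComputeUnit.ALL lets CoreML schedule across CPU, GPU, and Neural Engine
loaded = ct.models.MLModel('model.mlpackage', compute_units=ct.ComputeUnit.ALL)
prediction = loaded.predict({"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(prediction.keys())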
Using MLX on Apple Silicon
MLX is Apple's open-source array framework, designed specifically for Apple Silicon. It provides a NumPy-like API for training and deploying machine learning models, executes GPU computation through Metal, and takes advantage of unified memory so arrays can be shared between CPU and GPU without copies. It complements CoreML: MLX targets flexible training and research-style inference, while CoreML targets deployment on the GPU or Neural Engine.
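A minimal sketch of the MLX Python API (assuming MLX is installed, e.g. via pip install mlx):

import mlx.core as mx

a = mx.random.normal((1024,))
b = mx.random.normal((1024,))

# MLX operations are lazy; on Apple Silicon they execute on the GPU by default
c = a + b
mx.eval(c)  # force evaluation of the computation graph
print(c[:4])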
Fullmoon iOS Integration with MLX and Metal
The Fullmoon iOS project by Mainframe Computer demonstrates how to run local large language models like LLaMA on Apple devices using MLX. Inference runs on Apple's GPU through MLX's Metal backend, showing how models optimized for Apple's M1/M2 chips can be deployed with the MLX framework. You can explore more about Fullmoon iOS at its GitHub repository.
Summary
- Numba JIT: Optimize CPU-bound Python code on Apple Silicon.
- Metal: Replace PyCUDA with Metal for GPU programming on macOS and iOS, using MSL for kernel code.
- CoreML: Run machine learning models on the NPU or GPU, converting them from TensorFlow or PyTorch using coremltools.
- MLX: Use MLX to develop and deploy machine learning models efficiently across Apple’s hardware accelerators.
- Fullmoon iOS: Demonstrates the use of MLX and Metal for running large models like LLaMA on Apple Silicon.