Neural Download

NumPy: How Python Gets C Speed

https://www.youtube.com/watch?v=jImHzWSQd5s

One line of Python. One C loop underneath.

Summing 100 million numbers in a Python for-loop takes about 8 seconds. np.arange(100_000_000).sum() does the same work in a tenth of a second. Same Python syntax. 80× faster.

Python didn't suddenly get fast. The loop moved.
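A rough way to see the gap on your own machine. This sketch uses 10 million elements instead of the article's 100 million so it finishes quickly; the exact times vary by hardware, but the ratio stays lopsided:

```python
import time
import numpy as np

n = 10_000_000

# Pure-Python loop: one interpreter round-trip per element.
start = time.perf_counter()
total = 0
for i in range(n):
    total += i
py_time = time.perf_counter() - start

# NumPy: one Python call, one C loop underneath.
start = time.perf_counter()
np_total = int(np.arange(n, dtype=np.int64).sum())
np_time = time.perf_counter() - start

assert total == np_total
print(f"Python loop: {py_time:.3f}s  NumPy: {np_time:.4f}s")
```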

Bytes, not objects

Start with a Python list of the numbers 1, 2, 3.

You'd think those numbers live in the list. They don't. The list holds pointers — little arrows that point somewhere else on the heap. Follow an arrow and you land on a full Python integer object: a reference count, a type tag, and finally the actual digits. Twenty-eight bytes for the number 3, on CPython 3.11+.

A million numbers? A million tiny objects. Scattered across the heap. Every element access is a pointer chase.

The NumPy array doesn't play that game. No pointers. No objects. No type tags. Just bytes.

Python list [1, 2, 3]  →  [ptr][ptr][ptr]  →  heap: [PyLong:1] [PyLong:2] [PyLong:3]
NumPy array  [1, 2, 3]  →  [01][02][03]   ← 24 bytes, contiguous, done

Three eight-byte integers. Twenty-four bytes, end to end. Ask for the tenth element — jump ten slots, read eight bytes, done. Ask for the millionth — still one jump. No chasing.

The array isn't a list of Python things. It's a block of raw memory, with a label on top.
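You can check the numbers directly. `sys.getsizeof` reports the full Python object, while the array's `nbytes` counts only the raw buffer (exact object sizes depend on the CPython version and build):

```python
import sys
import numpy as np

# A full Python int object: refcount + type pointer + digits.
print(sys.getsizeof(3))    # 28 bytes on 64-bit CPython 3.11+

# The NumPy array: raw bytes plus one small header.
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.itemsize)        # 8 bytes per element
print(arr.nbytes)          # 24 bytes total for the data buffer
```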

Ufuncs — one call, one C loop

Now the trick.

a + b, in Python syntax, looks like one operation. It is. But that one operation has to touch a million elements. Or a billion.

A Python for-loop would round-trip through the interpreter once per number. That's why it's slow.

NumPy doesn't do that. The + operator on an ndarray dispatches to a ufunc — a universal function. A ufunc is a compiled C function. It gets handed two things: the byte blocks, and a count.

for (i = 0; i < n; i++)
    out[i] = a[i] + b[i];

That's the loop. It runs in C, for the entire array, in one function call. The Python interpreter sees one operation. The CPU runs a million adds.

And inside that C loop, there's SIMD. NumPy ships hand-tuned vector instructions for every modern CPU family: SSE, AVX, NEON. One cycle, four adds. Sometimes eight. Sometimes sixteen.

That's where the speed lives. Every arithmetic op, every comparison, every math function in NumPy — it's a ufunc. One call, one C loop, done.
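The dispatch is visible from Python: `+` on ndarrays and `np.add` are the same ufunc, one called through operator syntax and one called by name:

```python
import numpy as np

a = np.arange(5)
b = np.arange(5) * 10

# `a + b` on ndarrays dispatches to the np.add ufunc.
c1 = a + b
c2 = np.add(a, b)              # the same ufunc, called explicitly
print(np.array_equal(c1, c2))  # True

# Arithmetic, comparisons, math functions: all ufunc objects.
print(type(np.add))            # <class 'numpy.ufunc'>
print(type(np.sin))            # <class 'numpy.ufunc'>
```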

Strides — slicing without copying

Slice a NumPy array. Take every other element: arr[::2]. What got copied?

Nothing.

What you got back looks like an array. It has a shape. A dtype. But it's pointing at the same bytes. It just reads them differently.

That's what strides are. A stride says: to reach the next element, skip this many bytes. A normal array of eight-byte integers has a stride of 8. Element, element, element. A stride of 16? Skip every other one. Same memory, different walk.
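The every-other-element slice from above can be inspected directly. The sliced view doubles the stride and keeps pointing at the original buffer:

```python
import numpy as np

arr = np.arange(6, dtype=np.int64)  # 8 bytes per element
print(arr.strides)       # (8,)  — step 8 bytes to the next element

view = arr[::2]
print(view.strides)      # (16,) — same buffer, skip every other slot
print(view.base is arr)  # True: a view over arr's bytes, not a copy
```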

Transpose a matrix? Bytes don't move. The strides just swap.

>>> arr = np.array([[1,2,3],[4,5,6]], dtype=np.int64)
>>> arr.strides
(24, 8)
>>> arr.T.strides
(8, 24)     # same bytes. different walk.

Broadcasting uses the same trick. Adding a row to a whole matrix — it looks like the row got copied to every row below. It didn't. Broadcasting sets a stride of zero. Stride zero means: don't advance. Read the same bytes, again and again.
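`np.broadcast_to` exposes that zero stride. The "repeated" rows are one row, read again and again:

```python
import numpy as np

row = np.array([10, 20, 30], dtype=np.int64)

# A read-only view with a zero stride on the new axis: four
# apparent rows, one actual block of 24 bytes.
grid = np.broadcast_to(row, (4, 3))
print(grid.shape)    # (4, 3)
print(grid.strides)  # (0, 8) — "don't advance" between rows
```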

One block of bytes. Many ways to walk it. Still one C loop at the bottom.

The Python mask over C

This is the pattern.

NumPy is a thin Python layer. Syntax for shapes, arithmetic, slicing — all Python-looking. Underneath: a block of bytes, and a table of compiled C functions. You write in Python. The CPU runs in C.

Pandas works the same way — a dataframe is an ndarray with labels on top. Every operation drops into C. PyTorch tensors follow the same playbook: a block of bytes, compiled kernels, C or CUDA underneath. Scikit-learn models wrap NumPy arrays with C kernels on top.

This is why Python won scientific computing. It never had to be fast. The loops Python can't run, Python doesn't run. It hands the bytes to C, and waits.

One line of Python. One C loop underneath. That's the trick.
