Example: pandas

Run the pandas test suite under DAGZ in a Docker image with dagz, Postgres, and MySQL pre-installed. Docker is required on macOS (the pytest plugin is Linux-only) and optional on Linux.

Prerequisites

Install the zb runtime and start the local daemon on the host:

curl -LsSf https://dagz.run/install.sh | bash
zb daemon up --bg

Allow containers to reach the daemon

Enable listen_on_docker_bridge in ~/.dagz/local.env/daemon.yaml, then restart the daemon:

zb daemon up --bg --restart

See the daemon config reference for the listen keys.

Clone pandas

git clone https://github.com/pandas-dev/pandas.git
cd pandas

Parallelizing DB tests

For the -m db slice (MySQL + Postgres), see Parallelizing DB Tests for the rerouting API. Place this file at the pandas repo root (/pandas/dagz_pandas_integ.py), not inside the inner package directory (/pandas/pandas/):

dagz_pandas_integ.py
try:
    from dagz.integ.psycopg import PG_CONFIG
    from dagz.integ.pymysql import MYSQL_CONFIG

except ImportError:
    def setup():
        pass
else:
    def setup_adbc_driver():
        """Custom ADBC driver rerouting using DAGZ rerouting framework."""
        import adbc_driver_postgresql.dbapi
        _orig_adbc_connect = adbc_driver_postgresql.dbapi.connect

        def _override_connect(uri, *args, **kwargs):
            uri = PG_CONFIG.maybe_reroute_uri(uri)
            return _orig_adbc_connect(uri, *args, **kwargs)

        adbc_driver_postgresql.dbapi.connect = _override_connect

    def setup_parallel_db():
        import dagz.integ.psycopg2
        import dagz.integ.pymysql

        PG_CONFIG.configure(
            rewrite_db_name=PG_CONFIG.default_rewrite_db_name,
            should_reroute=PG_CONFIG.default_should_reroute,
            worker_init=dagz.integ.psycopg2.create_worker_init(["pandas"], host="127.0.0.1", port=5432, user="postgres", password="postgres"),
            prepare=None,
        )
        MYSQL_CONFIG.configure(
            rewrite_db_name=MYSQL_CONFIG.default_rewrite_db_name,
            should_reroute=MYSQL_CONFIG.default_should_reroute,
            worker_init=dagz.integ.pymysql.create_worker_init(["pandas"], host="127.0.0.1", port=3306, user="root", password=""),
            prepare=None,
        )


    def setup():
        setup_parallel_db()
        setup_adbc_driver()

pandas is editable-installed via meson-python, whose MetaPathFinder only resolves submodules declared in meson.build. A file dropped into /pandas/pandas/ is not importable as pandas.dagz_pandas_integ. Keep it at the repo root and import it as a top-level module; the image's PYTHONPATH=/pandas makes that work.
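
To confirm the placement, you can ask Python's import system whether it can locate the module. The sketch below uses only the standard library; is_importable is a hypothetical helper name, and the expectation that the top-level name resolves while the pandas-prefixed one does not assumes the image layout described above.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate module_name.

    The target module itself is not executed, but for dotted names
    find_spec imports the parent package first.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing or broken parent package surfaces as ImportError
        # (ModuleNotFoundError); ValueError covers modules with no spec.
        return False


if __name__ == "__main__":
    # Assumption: run inside the container, where PYTHONPATH=/pandas.
    for name in ("dagz_pandas_integ", "pandas.dagz_pandas_integ"):
        status = "importable" if is_importable(name) else "not importable"
        print(f"{name}: {status}")
```

If the top-level name reports "not importable", check that PYTHONPATH includes the repo root before debugging the conftest.py hook.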

Call setup() from a conftest.py:

# in pandas/conftest.py (or a new conftest.py at the repo root)
import dagz_pandas_integ
dagz_pandas_integ.setup()
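
The ADBC override in dagz_pandas_integ.py follows a general wrap-and-delegate pattern: keep a reference to the original connect, rewrite the URI, then call through. The sketch below illustrates that pattern in isolation. It is not DAGZ's implementation: reroute_uri, wrap_connect, fake_connect, and the _w<N> database-name suffix are all hypothetical names and conventions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def reroute_uri(uri: str, worker_id: int) -> str:
    """Hypothetical rewrite: append a per-worker suffix to the database name."""
    parts = urlsplit(uri)
    db_name = parts.path.lstrip("/")
    if not db_name:
        return uri  # no database in the URI, nothing to rewrite
    return urlunsplit(parts._replace(path=f"/{db_name}_w{worker_id}"))


def wrap_connect(connect, worker_id: int):
    """Return a connect() replacement that reroutes the URI, then delegates."""
    def _wrapped(uri, *args, **kwargs):
        return connect(reroute_uri(uri, worker_id), *args, **kwargs)
    return _wrapped


def fake_connect(uri, **kwargs):
    """Stand-in for a real driver's connect(); returns the URI it received."""
    return uri


connect = wrap_connect(fake_connect, worker_id=3)
print(connect("postgresql://postgres:postgres@127.0.0.1:5432/pandas"))
# prints: postgresql://postgres:postgres@127.0.0.1:5432/pandas_w3
```

Because the wrapper leaves positional and keyword arguments untouched, callers do not need to know that rerouting happened.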

Build the image

Save the Dockerfile below in the pandas checkout root:

Dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
    postgresql mysql-server \
    build-essential ninja-build curl git ca-certificates \
    python3 python3-dev python3-venv \
    zsh vim \
    libgl1 libglib2.0-0 libfontconfig1 tzdata tzdata-legacy \
    && rm -rf /var/lib/apt/lists/*

RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended

# Init PostgreSQL (password + database)
RUN pg_ctlcluster 16 main start \
    && su postgres -c "psql -c \"ALTER USER postgres WITH PASSWORD 'postgres'\"" \
    && su postgres -c "createdb pandas" \
    && pg_ctlcluster 16 main stop

# Init MySQL (TCP root access for pymysql + database). Pandas's fixtures
# connect to host=localhost; reverse DNS in the container resolves the
# client IP back to 'localhost', so we grant on both 127.0.0.1 and localhost.
# MySQL 8 defaults root@localhost to auth_socket; drop and recreate with
# mysql_native_password so pymysql/TCP clients can authenticate.
RUN mkdir -p /var/run/mysqld && chown mysql:mysql /var/run/mysqld \
    && mysqld_safe & \
    while ! mysqladmin ping --silent 2>/dev/null; do sleep 0.1; done \
    && mysql -e "\
        DROP USER IF EXISTS 'root'@'localhost'; \
        CREATE USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY ''; \
        CREATE USER IF NOT EXISTS 'root'@'127.0.0.1' IDENTIFIED WITH mysql_native_password BY ''; \
        CREATE USER IF NOT EXISTS 'root'@'%' IDENTIFIED WITH mysql_native_password BY ''; \
        GRANT ALL ON *.* TO 'root'@'127.0.0.1' WITH GRANT OPTION; \
        GRANT ALL ON *.* TO 'root'@'localhost' WITH GRANT OPTION; \
        GRANT ALL ON *.* TO 'root'@'%' WITH GRANT OPTION; \
        CREATE DATABASE IF NOT EXISTS pandas; \
        FLUSH PRIVILEGES" \
    && mysqladmin shutdown

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

COPY --exclude=build \
     --exclude=**/*.so \
     . /pandas
WORKDIR /pandas
RUN uv venv /venv --seed --python 3.12
ENV VIRTUAL_ENV=/venv PATH="/venv/bin:$PATH"

# pandas's clipboard tests use pytest-qt's qapp fixture. Qt tries to talk
# to an active display using the xcb X11 client library. For running
# inside docker, we force the offscreen Qt backend.
ENV QT_QPA_PLATFORM=offscreen

RUN uv pip install -r requirements-dev.txt
RUN uv pip install -v -e . --config-settings=builddir=/build --no-build-isolation
RUN git config --global --add safe.directory /pandas


# Pre-install dagz-pytest's runtime deps so first container start is fast.
RUN uv pip install dagz

# Install the zb CLI (lands in /root/.local/bin, already on PATH).
RUN curl -LsSf https://dagz.run/install.sh | bash

COPY <<'EOF' /entrypoint.sh
#!/bin/bash
set -e

pg_ctlcluster 16 main start
mysqld_safe &
while ! mysqladmin ping --silent 2>/dev/null; do sleep 0.1; done
exec "$@"
EOF
RUN chmod +x /entrypoint.sh

ENV PATH=$PATH:/venv/bin

ENTRYPOINT ["/entrypoint.sh"]
CMD ["zsh"]

Then build the image from the same directory:

docker build -t dagz-pandas-demo:latest .

COPY --exclude requires Docker 23.0+.

Run the container

docker run -it --rm \
    --env DAGZ_URL=http://host.docker.internal:29111 \
    --add-host=host.docker.internal:host-gateway \
    -v "${PWD}:/pandas" \
    dagz-pandas-demo:latest

Run tests with DAGZ

Inside the container, the first run generates a baseline:

pytest --dagz

This will:

  • Create a job you can inspect in real time from the web UI at http://localhost:29111 or the zb CLI.
  • Analyze the pandas code.
  • Start a few parallel workers, based on your hardware.
  • Collect code coverage and other runtime signals.
  • Show testing progress.
  • Create a database per worker and route each worker's DB connections to its assigned database.
  • Record a baseline for ~240k tests.

Subsequent runs select only the tests affected by your changes:

pytest --dagz --dagz-skip-redundant=1

A typical single-module edit selects hundreds to a few thousand tests.
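
To make the idea of change-based selection concrete, here is a toy sketch: a baseline maps each test to the source files it touched, and a later run keeps only the tests whose files intersect the changed set. This is not DAGZ's algorithm (DAGZ tracks finer-grained semantic units, and file-level granularity is an assumption made here for brevity); select_tests and the test IDs and file mappings in baseline are hypothetical.

```python
def select_tests(coverage_map: dict[str, set[str]], changed_files: set[str]) -> list[str]:
    """Return the tests whose recorded files overlap the changed files."""
    return sorted(
        test for test, files in coverage_map.items() if files & changed_files
    )


# Hypothetical baseline: test ID -> source files it executed.
baseline = {
    "tests/test_groupby.py::test_sum": {
        "pandas/core/groupby/ops.py",
        "pandas/core/frame.py",
    },
    "tests/test_io.py::test_read_csv": {"pandas/io/parsers/readers.py"},
    "tests/test_frame.py::test_repr": {"pandas/core/frame.py"},
}

print(select_tests(baseline, {"pandas/core/frame.py"}))
# prints: ['tests/test_frame.py::test_repr', 'tests/test_groupby.py::test_sum']
```

The io test is skipped because nothing it executed was modified; that skipping is where the savings over a full run come from.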

Re-run historical commits with DAGZ

Commit your local integration so it survives git reset --hard:

git checkout -b dagz-integ
git add pandas/conftest.py dagz_pandas_integ.py
git commit -m "dagz integration"

Replay the last 20 commits, cherry-picking that integration onto each:

dagz git-replay -n 20 --run --skip-redundant \
    --branch origin/main \
    --cherry-pick dagz-integ

Note:

  • git reset --hard discards uncommitted host changes. Commit or stash first.
  • Drop --run for a dry-run.

The Jobs view shows one row per commit:

[Figure: Jobs view after a 20-commit pandas replay]

Benchmarks

Head-to-head with pytest-xdist

Pandas suite (-m 'not db'), 6 workers, same machine. --dagz-no-bl disables selection so this isolates parallel runtime. xdist (cov) uses pytest-cov; DAGZ (no cov) uses --dagz-disable-sensors. CPU cycles sum P-core and E-core counters from perf stat. DAGZ rows show delta vs the corresponding xdist row.

| Configuration | Failed | Wall time | CPU time (user+sys) | CPU cycles (× 10¹²) | Peak memory |
|---|---|---|---|---|---|
| xdist (no cov) | 13 | 407.4s | 2232.3s | 9.06 | 14.5 GB |
| xdist (cov) | 3035 | 550.6s | 2829.8s | 9.30 | 14.2 GB |
| DAGZ (no cov) | 13 (0%) | 319.8s (−22%) | 1809.0s (−19%) | 7.13 (−21%) | 12.2 GB (−16%) |
| DAGZ (cov) | 13 (−99.6%) | 334.4s (−39%) | 1901.6s (−33%) | 7.60 (−18%) | 14.0 GB (−1%) |

pytest-cov adds 35% to xdist wall time (407.4s to 550.6s) and raises the failure count from 13 to 3,035. DAGZ instrumentation adds about 5% (319.8s to 334.4s). Selection savings stack on top.
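
The percentages in the table and the summary are plain relative deltas, and you can reproduce them from the wall-time column. The sketch below does that arithmetic; pct_delta is a hypothetical helper name, and the numbers are copied from the table above.

```python
def pct_delta(value: float, baseline: float) -> float:
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Wall times in seconds, copied from the benchmark table.
wall = {
    "xdist (no cov)": 407.4,
    "xdist (cov)": 550.6,
    "DAGZ (no cov)": 319.8,
    "DAGZ (cov)": 334.4,
}

comparisons = [
    ("pytest-cov overhead on xdist", "xdist (cov)", "xdist (no cov)"),
    ("DAGZ instrumentation overhead", "DAGZ (cov)", "DAGZ (no cov)"),
    ("DAGZ vs xdist, no coverage", "DAGZ (no cov)", "xdist (no cov)"),
    ("DAGZ vs xdist, with coverage", "DAGZ (cov)", "xdist (cov)"),
]

for label, value_key, base_key in comparisons:
    print(f"{label}: {pct_delta(wall[value_key], wall[base_key]):+.0f}%")
```

The same function applies to the CPU time, CPU cycles, and peak memory columns when you compare your own measurements.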

Measuring on your machine

For CPU and memory numbers, run pytest under perf stat and a transient systemd scope (Linux host only):

systemd-run --user --scope --unit=pytest-run -p MemoryAccounting=yes -- \
bash -c 'env QT_QPA_PLATFORM=offscreen perf stat \
docker run --rm \
--env DAGZ_URL=http://host.docker.internal:29111 \
--add-host=host.docker.internal:host-gateway \
-v "${PWD}:/pandas" \
dagz-pandas-demo:latest \
pytest --dagz-no-bl --dagz-workers=6 pandas/tests/base ; \
systemctl --user status pytest-run'

QT_QPA_PLATFORM=offscreen keeps Qt-backed tests headless. The trailing systemctl --user status prints peak memory. Swap pandas/tests/base for the slice you want.

Notes

  • pandas Cython files (.pyx/.pxd) are tracked as binary modules; changes trigger re-selection.
  • Tests that require optional dependencies (for example the qapp and httpserver fixtures) error out unless those dependencies are installed. This is unrelated to DAGZ.

Exporting coverage

After any pytest --dagz run, export a coverage report on the host:

zb export-cov --format pycoverage # → .coverage (coverage.py-compatible SQLite)
zb export-cov --format xml # → coverage.xml (Cobertura)

Accuracy:

  • Semantic-unit coverage (DAGZ selection coverage) stays up-to-date even when only a subset ran.
  • Line coverage (zb export-cov output) reflects only the chosen job. For a full-suite report, run with selection disabled and export from that job.

See Coverage: Two Modes of Coverage.
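
Once you have a coverage.xml, a few lines of standard-library Python are enough to pull out the headline numbers. The sketch below assumes the export follows the usual Cobertura layout (a line-rate attribute on the root element and on each class element); the SAMPLE document and its values are made up for illustration, and summarize is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up Cobertura-style document standing in for a real coverage.xml.
SAMPLE = """
<coverage line-rate="0.75" branch-rate="0">
  <packages>
    <package name="pandas.core">
      <classes>
        <class name="frame.py" filename="pandas/core/frame.py" line-rate="0.9"/>
        <class name="series.py" filename="pandas/core/series.py" line-rate="0.6"/>
      </classes>
    </package>
  </packages>
</coverage>
"""


def summarize(xml_text: str) -> tuple[float, dict[str, float]]:
    """Return (overall line rate, per-file line rate) from Cobertura XML."""
    root = ET.fromstring(xml_text.strip())
    overall = float(root.get("line-rate", "0"))
    per_file = {
        cls.get("filename"): float(cls.get("line-rate", "0"))
        for cls in root.iter("class")
    }
    return overall, per_file


overall, per_file = summarize(SAMPLE)
print(f"overall: {overall:.0%}")
for filename, rate in sorted(per_file.items(), key=lambda item: item[1]):
    print(f"{rate:.0%}  {filename}")
```

For a real report, read the file with open("coverage.xml").read() and pass the text to summarize. The per-file numbers reflect only the exported job, as noted above.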

Measuring peak memory

Inside the container (cgroup v2):

before=$(cat /sys/fs/cgroup/memory.peak)
pytest --dagz pandas/tests/
after=$(cat /sys/fs/cgroup/memory.peak)
echo "Peak memory: $(( (after - before) / 1024 / 1024 )) MB"
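
The same measurement can be scripted in Python if you want to log it alongside other results. One caveat: memory.peak is a high-water mark for the cgroup, so the difference between two readings is how far the test run pushed that mark, not an absolute figure for the run. The helper names below (read_peak_bytes, peak_growth_mib, measure) are hypothetical.

```python
import subprocess
from pathlib import Path

# cgroup v2 high-water mark for the container's memory usage, in bytes.
CGROUP_PEAK = Path("/sys/fs/cgroup/memory.peak")


def read_peak_bytes(path: Path = CGROUP_PEAK) -> int:
    """Read the current high-water mark from a memory.peak-style file."""
    return int(path.read_text().strip())


def peak_growth_mib(before: int, after: int) -> float:
    """Convert the growth of the high-water mark from bytes to MiB."""
    return (after - before) / (1024 * 1024)


def measure(command: list[str], path: Path = CGROUP_PEAK) -> float:
    """Run command and return how much it raised the high-water mark, in MiB."""
    before = read_peak_bytes(path)
    subprocess.run(command, check=False)  # test failures should not abort the measurement
    after = read_peak_bytes(path)
    return peak_growth_mib(before, after)
```

Inside the container, measure(["pytest", "--dagz", "pandas/tests/"]) mirrors the shell snippet above. A result of 0 means an earlier process in the same container already used more memory than the test run did.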