scsplice.tl¶

Tools that operate on a populated splicing AnnData. Both functions require layers["M1"] to be present; pseudo_correlation additionally requires valid M2 (uns["scsplice"]["m2_valid"] == True).

tl ¶

scsplice.tl — tools that operate on a populated splicing AnnData.

make_m2 ¶

make_m2(adata: AnnData, *, n_threads: int = 1, copy: bool = False) -> AnnData | None

Build the exclusion matrix M2 and store it in adata.layers['M2'].

For every event i (var) and cell j (obs):

M2[j, i] = sum(M1[j, k] for k in same LJV group as i) - M1[j, i]

Entries that are exactly zero are dropped (no epsilon threshold). M2 has the same shape, sparse format (CSC), and dtype (float64) as M1.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Splicing AnnData built by :func:`scsplice.io.read_starsolo` or equivalent. Must have `layers['M1']` (cells × events, sparse CSC, `float64`) and the var columns required by :func:`scsplice._core._validators.validate_var_schema`.	required
`n_threads`	`int`	OpenMP thread count for the kernel. `n_threads > 1` requires a build with OpenMP enabled (check `scsplice._scsplice_cpp.__openmp__`). The kernel produces byte-identical output regardless of thread count.	`1`
`copy`	`bool`	Mutate `adata` in place and return `None` (default), or operate on a deep copy and return the new AnnData.	`False`

Returns:

Type	Description
``None`` when ``copy=False``; the new AnnData when ``copy=True``.

Notes

Sets adata.uns['scsplice']['m2_valid'] = True and records the call params under adata.uns['scsplice']['params']['make_m2'].

The kernel reads layers['M1'], transposes to events × cells at the C++ boundary (the kernel works in that layout), and transposes the result back to AnnData layout (cells × events) before storing in layers['M2'].

Source code in src/scsplice/tl/_make_m2.py

def make_m2(
    adata: ad.AnnData,
    *,
    n_threads: int = 1,
    copy: bool = False,
) -> ad.AnnData | None:
    """Build the exclusion matrix M2 and store it in ``adata.layers['M2']``.

    For every event ``i`` (var) and cell ``j`` (obs):

        M2[j, i] = sum(M1[j, k] for k in same LJV group as i) - M1[j, i]

    Entries that are exactly zero are dropped (no epsilon threshold). M2 has the
    same shape, sparse format (CSC), and dtype (``float64``) as M1.

    Parameters
    ----------
    adata
        Splicing AnnData built by :func:`scsplice.io.read_starsolo` or equivalent.
        Must have ``layers['M1']`` (cells × events, sparse CSC, ``float64``) and
        the var columns required by :func:`scsplice._core._validators.validate_var_schema`.
    n_threads
        OpenMP thread count for the kernel. ``n_threads > 1`` requires a build
        with OpenMP enabled (check ``scsplice._scsplice_cpp.__openmp__``). The
        kernel produces byte-identical output regardless of thread count.
    copy
        Mutate ``adata`` in place and return ``None`` (default), or operate on a
        deep copy and return the new AnnData.

    Returns
    -------
    ``None`` when ``copy=False``; the new AnnData when ``copy=True``.

    Notes
    -----
    Sets ``adata.uns['scsplice']['m2_valid'] = True`` and records the call params
    under ``adata.uns['scsplice']['params']['make_m2']``.

    The kernel reads ``layers['M1']``, transposes to events × cells at the C++
    boundary (the kernel works in that layout), and transposes the result back
    to AnnData layout (cells × events) before storing in ``layers['M2']``.
    """
    cpp = _import_extension()

    if copy:
        adata = adata.copy()

    validate_m1_layer(adata)
    validate_var_schema(adata)

    M1 = adata.layers["M1"]
    if not isinstance(M1, sp.csc_matrix):
        M1 = sp.csc_matrix(M1)
    if M1.dtype != np.float64:
        M1 = M1.astype(np.float64)

    # AnnData is cells × events; the kernel works in events × cells.
    M1_T = M1.T.tocsc()

    group_ids = np.asarray(adata.var["group_id"])
    if group_ids.shape != (adata.n_vars,):
        raise ValueError(
            f"var['group_id'] shape {group_ids.shape} != (n_vars={adata.n_vars},)"
        )
    group_ids_dense = _validate_and_densify_group_ids(group_ids)

    M2_T = cpp.make_m2(M1_T, group_ids_dense, int(n_threads))

    M2 = sp.csc_matrix(M2_T).T.tocsc()
    if M2.dtype != np.float64:
        M2 = M2.astype(np.float64)
    adata.layers["M2"] = M2

    mark_m2_valid(adata)
    ns = setdefault_scsplice_ns(adata)
    ns.setdefault("params", {})["make_m2"] = {"n_threads": int(n_threads)}

    return adata if copy else None

pseudo_correlation ¶

pseudo_correlation(adata: AnnData, zdb: ndarray, *, metric: Literal['CoxSnell', 'Nagelkerke'] = 'CoxSnell', n_permutations: int = 0, seed: int | None = None, n_threads: int = 1, key_added: str = 'pseudo_correlation', inplace: bool = True) -> AnnData | None

Compute per-event signed pseudo-correlation against zdb.

For each event, fit a binomial GLM (logistic link) via IRLS with design matrix [intercept | zdb[event, valid_cells]] against the M1/(M1+M2) response, where valid_cells are cells with M1+M2 > 0. Return sqrt(R^2) * sign(slope) where R^2 is Cox-Snell or Nagelkerke pseudo-R^2. Events with fewer than two valid cells, all-zero M1 or M2, singular Hessian, or negative R^2 receive NaN.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Splicing AnnData with valid M1 / M2 layers (`uns['scsplice']['m2_valid']`).	required
`zdb`	`ndarray`	Dense `(n_var, n_obs)` numpy array — one predictor value per `(event, cell)`. Note: this is events x cells, NOT `(events x K_latent_dims)`. Callers with a K-dim latent embedding must compute `zdb` per dim separately and call this function once per dim.	required
`metric`	`Literal['CoxSnell', 'Nagelkerke']`	`"CoxSnell"` (default) or `"Nagelkerke"`.	`'CoxSnell'`
`n_permutations`	`int`	If > 0, also compute `n_permutations` null distributions by column-permuting `zdb` (cell axis) and recomputing the kernel. Stored in `adata.varm[key_added + '_null']` of shape `(n_var, n_permutations)`.	`0`
`seed`	`int \| None`	Seed for the column permutation RNG (`numpy.random.default_rng`). Cross-language bit-equivalence with R is impossible (PCG64 vs Mersenne Twister); permutations are reproducible across Python runs given the same seed.	`None`
`n_threads`	`int`	OpenMP thread count for the per-event outer loop. Per-event results are bit-identical regardless of n_threads (disjoint scalar writes).	`1`
`key_added`	`str`	Output column name in `adata.var`.	`'pseudo_correlation'`
`inplace`	`bool`	Mutate `adata` in place and return `None` (default), or operate on a copy and return it.	`True`

Returns:

Type	Description
``None`` when ``inplace=True``; otherwise the modified copy.

Source code in src/scsplice/tl/_pseudo_correlation.py

def pseudo_correlation(
    adata: ad.AnnData,
    zdb: np.ndarray,
    *,
    metric: Literal["CoxSnell", "Nagelkerke"] = "CoxSnell",
    n_permutations: int = 0,
    seed: int | None = None,
    n_threads: int = 1,
    key_added: str = "pseudo_correlation",
    inplace: bool = True,
) -> ad.AnnData | None:
    """Compute per-event signed pseudo-correlation against ``zdb``.

    For each event, fit a binomial GLM (logistic link) via IRLS with design
    matrix ``[intercept | zdb[event, valid_cells]]`` against the M1/(M1+M2)
    response, where ``valid_cells`` are cells with ``M1+M2 > 0``. Return
    ``sqrt(R^2) * sign(slope)`` where ``R^2`` is Cox-Snell or Nagelkerke
    pseudo-R^2. Events with fewer than two valid cells, all-zero M1 or M2,
    singular Hessian, or negative R^2 receive ``NaN``.

    Parameters
    ----------
    adata
        Splicing AnnData with valid M1 / M2 layers (``uns['scsplice']['m2_valid']``).
    zdb
        Dense ``(n_var, n_obs)`` numpy array — one predictor value per
        ``(event, cell)``. Note: this is **events x cells**, NOT
        ``(events x K_latent_dims)``. Callers with a K-dim latent
        embedding must compute ``zdb`` per dim separately and call this
        function once per dim.
    metric
        ``"CoxSnell"`` (default) or ``"Nagelkerke"``.
    n_permutations
        If > 0, also compute ``n_permutations`` null distributions by
        column-permuting ``zdb`` (cell axis) and recomputing the kernel.
        Stored in ``adata.varm[key_added + '_null']`` of shape
        ``(n_var, n_permutations)``.
    seed
        Seed for the column permutation RNG (``numpy.random.default_rng``).
        Cross-language bit-equivalence with R is impossible (PCG64 vs
        Mersenne Twister); permutations are reproducible across Python
        runs given the same seed.
    n_threads
        OpenMP thread count for the per-event outer loop. Per-event
        results are bit-identical regardless of n_threads (disjoint scalar
        writes).
    key_added
        Output column name in ``adata.var``.
    inplace
        Mutate ``adata`` in place and return ``None`` (default), or
        operate on a copy and return it.

    Returns
    -------
    ``None`` when ``inplace=True``; otherwise the modified copy.
    """
    cpp = _import_extension()

    if metric not in ("CoxSnell", "Nagelkerke"):
        raise ValueError(
            f"metric must be 'CoxSnell' or 'Nagelkerke', got {metric!r}"
        )
    zdb = np.asarray(zdb, dtype=np.float64)
    if zdb.ndim != 2:
        raise ValueError(f"zdb must be 2D, got shape {zdb.shape}")
    if zdb.shape != (adata.n_vars, adata.n_obs):
        raise ValueError(
            f"zdb shape {zdb.shape} != (n_var={adata.n_vars}, n_obs={adata.n_obs}); "
            "zdb is events x cells, not events x K. See docstring."
        )
    if n_permutations < 0:
        raise ValueError(f"n_permutations must be non-negative, got {n_permutations}")

    if not inplace:
        adata = adata.copy()

    validate_paired_layers(adata, require_m2_valid=True)
    validate_var_schema(adata)

    M1 = adata.layers["M1"]
    M2 = adata.layers["M2"]
    if not isinstance(M1, sp.csc_matrix):
        M1 = sp.csc_matrix(M1)
    if not isinstance(M2, sp.csc_matrix):
        M2 = sp.csc_matrix(M2)
    if M1.dtype != np.float64:
        M1 = M1.astype(np.float64)
    if M2.dtype != np.float64:
        M2 = M2.astype(np.float64)

    # AnnData layout is cells x events; the kernel works in events x cells.
    M1_T = M1.T.tocsc()
    M2_T = M2.T.tocsc()

    # Eigen MatrixXd is column-major; pybind11 will copy from numpy's C-order.
    # The cost is O(n_events * n_cells * 8) bytes — acceptable for typical sizes.
    Z = np.asfortranarray(zdb)

    point = np.asarray(
        cpp.pseudo_correlation(Z, M1_T, M2_T, str(metric), int(n_threads)),
        dtype=np.float64,
    ).ravel()
    adata.var[key_added] = point

    if n_permutations > 0:
        rng = np.random.default_rng(seed)
        nulls = np.empty((adata.n_vars, int(n_permutations)), dtype=np.float64)
        for k in range(int(n_permutations)):
            perm = rng.permutation(adata.n_obs)
            Z_perm = np.asfortranarray(zdb[:, perm])
            nulls[:, k] = np.asarray(
                cpp.pseudo_correlation(
                    Z_perm, M1_T, M2_T, str(metric), int(n_threads)
                ),
                dtype=np.float64,
            ).ravel()
        adata.varm[key_added + "_null"] = nulls

    ns = setdefault_scsplice_ns(adata)
    ns.setdefault("params", {})["pseudo_correlation"] = {
        "metric": str(metric),
        "n_permutations": int(n_permutations),
        "seed": None if seed is None else int(seed),
        "n_threads": int(n_threads),
        "key_added": str(key_added),
    }

    return adata if not inplace else None