scsplice.io

Input / output functions for STARsolo output. Three readers, one consistent API shape:

Function	STARsolo source	Output
`read_starsolo`	`Solo.out/SJ/`	Splicing AnnData (`layers["M1"]`, `layers["M2"]`)
`read_starsolo_gene`	`Solo.out/Gene/`	Gene-expression AnnData (`X` = raw counts)
`read_starsolo_velocyto`	`Solo.out/Velocyto/`	Velocity AnnData (`layers["spliced"]`, `layers["unspliced"]`, `layers["ambiguous"]`)

All three accept `tissue_positions=` for Visium / spatial samples and populate squidpy-compatible `obsm["spatial"]` and `uns["spatial"]`.

See STARsolo readers and AnnData data layouts for design rationale and the full AnnData schema.
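As a rough orientation, the three readers could be called side by side on one sample. The sketch below is an assumption-laden illustration, not an official recipe: the helper name `read_all_modalities` is hypothetical, the directory layout is assumed to be a stock STARsolo run, and passing the sample root to the Velocyto reader relies on it accepting the same directory forms as `read_starsolo_gene`.

```python
from pathlib import Path


def read_all_modalities(sample_root, sample_id):
    """Hypothetical helper: load splicing, gene, and velocity AnnData for one sample."""
    # Imported lazily so that defining this helper does not require scsplice.
    from scsplice.io import (
        read_starsolo,
        read_starsolo_gene,
        read_starsolo_velocyto,
    )

    root = Path(sample_root)
    # The splicing reader takes the SJ directory itself.
    splicing = read_starsolo(root / "Solo.out" / "SJ", sample_id)
    # The gene reader is documented to accept the sample root directly.
    genes = read_starsolo_gene(root, sample_id, var_names="gene_ids")
    # Assumption: the Velocyto reader resolves the sample root the same way.
    velocity = read_starsolo_velocyto(root, sample_id)
    return splicing, genes, velocity
```

Because `sample_id` becomes the `obs_names` suffix in all three objects, using the same identifier across readers should keep cell names aligned between modalities.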

io

scsplice.io — readers for STARsolo / 10x output.

read_starsolo

read_starsolo(sj_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, keep_multi_mapped: bool = False, min_counts: int = 1, ljv_kind: Literal['start_end', 'start', 'end'] = 'start_end', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo splice-junction output and assemble a splicing AnnData.

Parameters:

Name	Type	Description	Default
`sj_dirs`	`str \| Path \| Sequence[str \| Path]`	One or more `Solo.out/SJ/` (or `Solo.out/SJ/raw/`) directories. A single string or `pathlib.Path` is auto-promoted to a one-element sequence.	required
`sample_ids`	`str \| Sequence[str]`	Per-directory unique sample identifier; appears in `adata.obs["sample_id"]` and as the `-{sample_id}` suffix on `adata.obs_names`. Must have the same length as `sj_dirs`. Must not contain `'-'`: the `{barcode}-{sample_id}` `obs_names` convention becomes unparseable if the sample_id has an embedded hyphen, so the reader rejects such inputs up-front.	required
`barcode_whitelists`	`Sequence[str \| Path \| Sequence[str] \| None] \| None`	Optional per-sample whitelist of raw 16-mer barcodes. Each entry may be a path to a one-barcode-per-line file, a sequence of barcodes, or `None` to defer to `use_internal_whitelist`.	`None`
`use_internal_whitelist`	`bool`	When no explicit whitelist is supplied for a sample, fall back to STARsolo's per-sample whitelist at `../Gene/filtered/barcodes.tsv`. If that file does not exist, a warning is emitted and no whitelist is applied.	`True`
`keep_multi_mapped`	`bool`	Default `False` drops junctions whose `unique_mapped` flag in `SJ.out.tab` is zero (multi-mapped-only junctions). Set `True` to keep all junctions.	`False`
`min_counts`	`int`	Minimum row-sum (across all cells of all samples) required for an event row to survive in the final AnnData. Applied after LJV grouping. Default `1` drops zero-count events.	`1`
`ljv_kind`	`Literal['start_end', 'start', 'end']`	Which LJV-grouped event rows to emit: `"start_end"` (default; matches R splikit), `"start"` (only `_S`), or `"end"` (only `_E`). The single-half modes are advanced overrides for users who only care about alternative 5' or 3' splice sites.	`'start_end'`
`tissue_positions`	`Sequence[str \| Path \| None] \| None`	Per-sample optional path to a Space Ranger `tissue_positions[_list].csv`. When provided for a sample, the file's barcodes act as the whitelist (raw/ source) and populate `obs["in_tissue"]` / `obs["array_row"]` / `obs["array_col"]`, `obsm["spatial"]`, and `uns["spatial"][library_id]` per the squidpy contract. Header (Space Ranger 2.x) and headerless `tissue_positions_list.csv` (v1) are auto-detected.	`None`
`spatial_library_ids`	`Sequence[str \| None] \| None`	Per-sample squidpy library_id key for `uns["spatial"]`. Defaults to `sample_ids[i]` when not given.	`None`
`verbose`	`bool`	Print per-sample progress to stdout.	`False`
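The hyphen restriction on `sample_ids` is easier to see with a small round trip. The sketch below is not the reader's code; `make_obs_name` and `split_obs_name` are hypothetical helpers that mimic the documented `{barcode}-{sample_id}` convention and split on the last hyphen.

```python
def make_obs_name(barcode: str, sample_id: str) -> str:
    """Join a barcode and a sample_id following the documented convention."""
    if "-" in sample_id:
        # Mirrors the documented up-front rejection of hyphenated sample_ids.
        raise ValueError(f"sample_id must not contain '-': {sample_id!r}")
    return f"{barcode}-{sample_id}"


def split_obs_name(obs_name: str) -> tuple[str, str]:
    """Recover (barcode, sample_id) by splitting on the LAST hyphen."""
    barcode, sample_id = obs_name.rsplit("-", 1)
    return barcode, sample_id


name = make_obs_name("AAACCCAAGAAACACT", "S1")
print(name)                  # AAACCCAAGAAACACT-S1
print(split_obs_name(name))  # ('AAACCCAAGAAACACT', 'S1')
```

With a sample_id such as `"S-1"`, the last-hyphen split would return `"1"` as the sample, which is why such identifiers are refused before any file is read.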

Notes

Whitelist precedence per sample:

1. `tissue_positions[i]` (when given)
2. `barcode_whitelists[i]` (when given)
3. `use_internal_whitelist=True` and `Gene/filtered/barcodes.tsv` exists
4. otherwise no whitelist (use all raw barcodes)
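The precedence above can be read as a simple first-match cascade. The following is a minimal sketch under that reading; `whitelist_source` and its string labels are hypothetical and do not correspond to any function in the package.

```python
def whitelist_source(
    tissue_positions,
    barcode_whitelist,
    use_internal_whitelist: bool,
    internal_file_exists: bool,
) -> str:
    """Return a label for the whitelist rule that would apply to one sample."""
    if tissue_positions is not None:
        return "tissue_positions"
    if barcode_whitelist is not None:
        return "barcode_whitelist"
    if use_internal_whitelist and internal_file_exists:
        return "internal"
    # No rule matched: every raw barcode is kept.
    return "none"


print(whitelist_source("tp.csv", ["AAAC"], True, True))  # tissue_positions
print(whitelist_source(None, ["AAAC"], True, True))      # barcode_whitelist
print(whitelist_source(None, None, True, False))         # none
```

Note that an explicit whitelist is ignored for a sample that also supplies `tissue_positions`, since the first rule wins.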

The SJ feature has only raw/ available (no per-feature filtered dir), so the "read raw and intersect" step is the only mode regardless of which precedence rule fires; the rule still affects WHICH barcodes are kept.
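The "read raw and intersect" step amounts to selecting the raw barcode columns that also appear in the chosen whitelist. The sketch below illustrates that idea with plain lists; `intersect_barcodes` is a hypothetical name, and the real reader presumably works on sparse matrix columns rather than Python lists.

```python
def intersect_barcodes(raw_barcodes, whitelist=None):
    """Return indices of raw barcodes to keep, preserving raw order."""
    if whitelist is None:
        # No whitelist applies: keep every raw barcode.
        return list(range(len(raw_barcodes)))
    allowed = set(whitelist)
    return [i for i, bc in enumerate(raw_barcodes) if bc in allowed]


raw = ["AAAC", "CCGT", "GGTA", "TTAG"]
print(intersect_barcodes(raw, ["GGTA", "AAAC", "NNNN"]))  # [0, 2]
print(intersect_barcodes(raw))                            # [0, 1, 2, 3]
```

Whitelist entries that never occur in the raw barcodes (here `"NNNN"`) simply contribute nothing in this sketch.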

Returns:

Type	Description
`AnnData`	Cells × events. `layers["M1"]` is populated; `layers["M2"]` is absent and `uns["scsplice"]["m2_valid"] is False`. Call `scsplice.tl.make_m2` next.

Source code in src/scsplice/io/_starsolo.py

def read_starsolo(
    sj_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    keep_multi_mapped: bool = False,
    min_counts: int = 1,
    ljv_kind: Literal["start_end", "start", "end"] = "start_end",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo splice-junction output and assemble a splicing AnnData.

    Parameters
    ----------
    sj_dirs
        One or more ``Solo.out/SJ/`` (or ``Solo.out/SJ/raw/``) directories.
        A single string or :class:`pathlib.Path` is auto-promoted to a
        one-element sequence.
    sample_ids
        Per-directory unique sample identifier; appears in
        ``adata.obs["sample_id"]`` and as the ``-{sample_id}`` suffix on
        ``adata.obs_names``. Must have the same length as ``sj_dirs``.
        **Must not contain ``'-'``**: the ``{barcode}-{sample_id}``
        ``obs_names`` convention becomes unparseable if the sample_id has
        an embedded hyphen, so the reader rejects such inputs up-front.
    barcode_whitelists
        Optional per-sample whitelist of raw 16-mer barcodes. Each entry
        may be a path to a one-barcode-per-line file, a sequence of
        barcodes, or ``None`` to defer to ``use_internal_whitelist``.
    use_internal_whitelist
        When no explicit whitelist is supplied for a sample, fall back to
        STARsolo's per-sample whitelist at ``../Gene/filtered/barcodes.tsv``.
        If that file does not exist, a warning is emitted and no whitelist
        is applied.
    keep_multi_mapped
        Default ``False`` drops junctions whose ``unique_mapped`` flag in
        ``SJ.out.tab`` is zero (multi-mapped-only junctions). Set ``True``
        to keep all junctions.
    min_counts
        Minimum row-sum (across all cells of all samples) required for an
        event row to survive in the final AnnData. Applied **after** LJV
        grouping. Default ``1`` drops zero-count events.
    ljv_kind
        Which LJV-grouped event rows to emit: ``"start_end"`` (default;
        matches R splikit), ``"start"`` (only ``_S``), or ``"end"`` (only
        ``_E``). The single-half modes are advanced overrides for users who
        only care about alternative 5' or 3' splice sites.
    tissue_positions
        Per-sample optional path to a Space Ranger
        ``tissue_positions[_list].csv``. When provided for a sample, the
        file's barcodes act as the whitelist (raw/ source) and populate
        ``obs["in_tissue"]`` / ``obs["array_row"]`` / ``obs["array_col"]``,
        ``obsm["spatial"]``, and ``uns["spatial"][library_id]`` per the
        squidpy contract. Header (Space Ranger 2.x) and headerless
        ``tissue_positions_list.csv`` (v1) are auto-detected.
    spatial_library_ids
        Per-sample squidpy library_id key for ``uns["spatial"]``. Defaults
        to ``sample_ids[i]`` when not given.
    verbose
        Print per-sample progress to stdout.

    Notes
    -----
    Whitelist precedence per sample:

    1. ``tissue_positions[i]`` (when given)
    2. ``barcode_whitelists[i]`` (when given)
    3. ``use_internal_whitelist=True`` and ``Gene/filtered/barcodes.tsv`` exists
    4. otherwise no whitelist (use all raw barcodes)

    The SJ feature has only ``raw/`` available (no per-feature filtered
    dir), so the "read raw and intersect" step is the only mode regardless
    of which precedence rule fires; the rule still affects WHICH barcodes
    are kept.

    Returns
    -------
    AnnData
        Cells × events. ``layers["M1"]`` populated; ``layers["M2"]`` absent
        and ``uns["scsplice"]["m2_valid"] is False`` — call
        :func:`scsplice.tl.make_m2` next.
    """
    if isinstance(sj_dirs, (str, Path)):
        sj_dirs = [sj_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sj_dirs = list(sj_dirs)
    sample_ids = list(sample_ids)
    if len(sj_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sj_dirs)={len(sj_dirs)} must equal len(sample_ids)={len(sample_ids)}"
        )
    if len(sj_dirs) == 0:
        raise ValueError("At least one sj_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )
    if min_counts < 0:
        raise ValueError(f"min_counts must be non-negative, got {min_counts}")
    if ljv_kind not in ("start_end", "start", "end"):
        raise ValueError(
            f"ljv_kind must be 'start_end', 'start', or 'end'; got {ljv_kind!r}"
        )

    n = len(sj_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_SampleArtifacts] = []
    for sj_dir, sid, bw, tps, lib in zip(
        sj_dirs, sample_ids, bcw, tp, libs, strict=True,
    ):
        artifacts.append(
            _read_one_sample(
                sj_dir, sid,
                keep_multi_mapped=keep_multi_mapped,
                barcode_whitelist=bw,
                tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib,
                verbose=verbose,
            )
        )

    mtx_concat, var_pre, obs, spatial_uns, spatial_arr = _concat_samples(
        artifacts, sample_ids,
    )
    mtx_grouped, var_grouped = _apply_ljv_grouping(mtx_concat, var_pre, ljv_kind)
    mtx_grouped, var_grouped = _filter_min_counts(mtx_grouped, var_grouped, min_counts)

    # Transpose junctions × cells → cells × events for AnnData.
    M1 = mtx_grouped.T.tocsc().astype(np.float64)

    adata = ad.AnnData(
        layers={"M1": M1},
        obs=obs,
        var=var_grouped,
    )
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; readers / validators downstream
    # transparently migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "m2_valid": False,
        "ljv_kind": ljv_kind,
        "source": "starsolo",
        "params": {
            "read_starsolo": {
                "n_samples": len(sj_dirs),
                "keep_multi_mapped": bool(keep_multi_mapped),
                "min_counts": int(min_counts),
                "ljv_kind": ljv_kind,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
            }
        },
    }
    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns
    invalidate_m2(adata)

    # Final sanity check: the schema validators must pass on the output.
    validate_m1_layer(adata)
    validate_var_schema(adata)

    if verbose:
        print(
            f"[scsplice.io] Built AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} events ({len(sj_dirs)} samples, "
            f"M1 nnz={adata.layers['M1'].nnz})"
        )

    return adata

read_starsolo_gene

read_starsolo_gene(sample_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, var_names: Literal['gene_ids', 'gene_symbols'] = 'gene_ids', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo Gene-feature output and assemble a cell × gene AnnData.

Parameters:

Name	Type	Description	Default
`sample_dirs`	`str \| Path \| Sequence[str \| Path]`	One or more sample directories. Each may be the sample root (parent of `Solo.out/`), the `Solo.out/` dir, `Solo.out/Gene/`, or `Solo.out/Gene/raw/` / `.../filtered/` directly.	required
`sample_ids`	`str \| Sequence[str]`	Per-sample unique identifier; appears in `obs["sample_id"]` and as the `-{sample_id}` suffix on `obs_names`. Must not contain `'-'`: the `{barcode}-{sample_id}` `obs_names` convention becomes unparseable if the sample_id has an embedded hyphen.	required
`barcode_whitelists`	`Sequence[str \| Path \| Sequence[str] \| None] \| None`	Per-sample whitelist of barcodes. `None` to fall back to internal whitelist; otherwise a path or sequence. When provided, counts are read from `Gene/raw/` (so barcodes outside `filtered/` are captured) and intersected with the whitelist.	`None`
`use_internal_whitelist`	`bool`	When neither an external whitelist nor `tissue_positions` is given, fall back to `Solo.out/Gene/filtered/barcodes.tsv`. If that file is missing, a warning is emitted.	`True`
`var_names`	`Literal['gene_ids', 'gene_symbols']`	`"gene_ids"` (default) uses Ensembl/NCBI IDs (always unique); `"gene_symbols"` uses gene names. With symbols, the reader calls `var_names_make_unique()` because symbols collide for paralogs (e.g. `IGH@`) — this is only legal here, never on the splicing reader's `_S/_E` axis.	`'gene_ids'`
`tissue_positions`	`Sequence[str \| Path \| None] \| None`	Per-sample optional path to a Space Ranger `tissue_positions[_list].csv`. When provided, the file's barcodes act as the whitelist (raw/ source) AND populate `obs["in_tissue"]`, `obs["array_row"]`, `obs["array_col"]`, `obsm["spatial"]`, and `uns["spatial"][library_id]` per the squidpy contract. Header (Space Ranger 2.x) and headerless v1 `tissue_positions_list.csv` are auto-detected.	`None`
`spatial_library_ids`	`Sequence[str \| None] \| None`	Per-sample squidpy library_id used as the key in `uns["spatial"]`. Defaults to `sample_ids[i]` when not given.	`None`
`verbose`	`bool`	Print per-sample diagnostics to stdout.	`False`
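To illustrate why `var_names="gene_symbols"` needs de-duplication, here is a simplified imitation of suffix-based renaming. It is an assumption-based sketch, not anndata's implementation: `make_unique` is a hypothetical helper, the gene symbols are placeholders, and it does not guard against a generated name colliding with an existing one.

```python
from collections import Counter


def make_unique(names, join="-"):
    """Keep the first occurrence of a name; suffix later repeats with 1, 2, ..."""
    totals = Counter(names)
    seen = Counter()
    out = []
    for name in names:
        if totals[name] > 1 and seen[name] > 0:
            out.append(f"{name}{join}{seen[name]}")
        else:
            out.append(name)
        seen[name] += 1
    return out


print(make_unique(["GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_B"]))
# ['GENE_A', 'GENE_B', 'GENE_B-1', 'GENE_C', 'GENE_B-2']
```

Gene IDs avoid this renaming entirely, which is one reason `"gene_ids"` is the default.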

Returns:

Type	Description
`AnnData`	Cells × genes. `X` is sparse CSC float64 (raw counts). When any sample provides `tissue_positions`, the squidpy-shaped fields are populated; cells from non-spatial samples have `in_tissue=-1` and `obsm["spatial"][i] = (-1.0, -1.0)` as sentinel.
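When spatial and non-spatial samples are mixed, downstream code usually needs to skip the sentinel rows. The sketch below uses plain lists with made-up values to show one way to do that, assuming `in_tissue` follows the Space Ranger meaning of 1 (under tissue) and 0 (not under tissue), with -1 reserved for cells that have no coordinates.

```python
# Illustrative values only: two spatial spots followed by two non-spatial cells.
in_tissue = [1, 0, -1, -1]
spatial = [(120.5, 88.0), (130.0, 90.5), (-1.0, -1.0), (-1.0, -1.0)]

# A cell has real coordinates whenever its in_tissue flag is not the sentinel.
has_coords = [flag != -1 for flag in in_tissue]
spatial_only = [xy for xy, keep in zip(spatial, has_coords) if keep]

print(has_coords)    # [True, True, False, False]
print(spatial_only)  # [(120.5, 88.0), (130.0, 90.5)]
```

Testing the `in_tissue` flag rather than the coordinates keeps the check independent of the coordinate sentinel values.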

Notes

Whitelist precedence per sample:

1. `tissue_positions[i]` (read `raw/`, intersect)
2. `barcode_whitelists[i]` (read `raw/`, intersect)
3. `use_internal_whitelist=True` and `filtered/` exists (read `filtered/`)
4. otherwise `raw/` unfiltered

The reader does not run min_counts filtering — gene expression workflows have their own QC (e.g. scanpy.pp.calculate_qc_metrics).

Validated against STARsolo 2.7.10+ and Cell Ranger 6/7/8 features.tsv (3-col); 2-col v2 features.tsv is supported via the column-count dispatch in _read_features.
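A column-count dispatch for `features.tsv` might look like the sketch below. This is not the package's `_read_features`; `parse_features` is a hypothetical stand-in, and the gene IDs in the example are placeholders. For 2-column files the feature type is left as `None` rather than guessed.

```python
def parse_features(text: str):
    """Parse features.tsv content with either 2 or 3 tab-separated columns."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == 2:
            gene_id, symbol = fields
            feature_type = None  # 2-column files carry no feature type
        elif len(fields) >= 3:
            gene_id, symbol, feature_type = fields[:3]
        else:
            raise ValueError(f"unexpected features.tsv line: {line!r}")
        rows.append((gene_id, symbol, feature_type))
    return rows


print(parse_features("ID0001\tGENE_A\tGene Expression\n"))
# [('ID0001', 'GENE_A', 'Gene Expression')]
print(parse_features("ID0001\tGENE_A\n"))
# [('ID0001', 'GENE_A', None)]
```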

Source code in src/scsplice/io/_starsolo_gene.py

def read_starsolo_gene(
    sample_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    var_names: Literal["gene_ids", "gene_symbols"] = "gene_ids",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo Gene-feature output and assemble a cell × gene AnnData.

    Parameters
    ----------
    sample_dirs
        One or more sample directories. Each may be the sample root (parent
        of ``Solo.out/``), the ``Solo.out/`` dir, ``Solo.out/Gene/``, or
        ``Solo.out/Gene/raw/`` / ``.../filtered/`` directly.
    sample_ids
        Per-sample unique identifier; appears in ``obs["sample_id"]`` and
        as the ``-{sample_id}`` suffix on ``obs_names``. **Must not contain
        ``'-'``**: the ``{barcode}-{sample_id}`` ``obs_names`` convention
        becomes unparseable if the sample_id has an embedded hyphen.
    barcode_whitelists
        Per-sample whitelist of barcodes. ``None`` to fall back to internal
        whitelist; otherwise a path or sequence. When provided, counts are
        read from ``Gene/raw/`` (so barcodes outside ``filtered/`` are
        captured) and intersected with the whitelist.
    use_internal_whitelist
        When neither an external whitelist nor ``tissue_positions`` is
        given, fall back to ``Solo.out/Gene/filtered/barcodes.tsv``. If
        that file is missing, a warning is emitted.
    var_names
        ``"gene_ids"`` (default) uses Ensembl/NCBI IDs (always unique);
        ``"gene_symbols"`` uses gene names. With symbols, the reader
        calls ``var_names_make_unique()`` because symbols collide for
        paralogs (e.g. ``IGH@``) — this is *only* legal here, never on
        the splicing reader's ``_S/_E`` axis.
    tissue_positions
        Per-sample optional path to a Space Ranger
        ``tissue_positions[_list].csv``. When provided, the file's
        barcodes act as the whitelist (raw/ source) AND populate
        ``obs["in_tissue"]``, ``obs["array_row"]``, ``obs["array_col"]``,
        ``obsm["spatial"]``, and ``uns["spatial"][library_id]`` per the
        squidpy contract. Header (Space Ranger 2.x) and headerless v1
        ``tissue_positions_list.csv`` are auto-detected.
    spatial_library_ids
        Per-sample squidpy library_id used as the key in
        ``uns["spatial"]``. Defaults to ``sample_ids[i]`` when not given.
    verbose
        Print per-sample diagnostics to stdout.

    Returns
    -------
    AnnData
        Cells × genes. ``X`` is sparse CSC float64 (raw counts). When any
        sample provides ``tissue_positions``, the squidpy-shaped fields
        are populated; cells from non-spatial samples have ``in_tissue=-1``
        and ``obsm["spatial"][i] = (-1.0, -1.0)`` as sentinel.

    Notes
    -----
    Whitelist precedence per sample:

    1. ``tissue_positions[i]`` (read raw/, intersect)
    2. ``barcode_whitelists[i]`` (read raw/, intersect)
    3. ``use_internal_whitelist=True`` and filtered/ exists (read filtered/)
    4. otherwise raw/ unfiltered

    The reader does not run ``min_counts`` filtering — gene expression
    workflows have their own QC (e.g. ``scanpy.pp.calculate_qc_metrics``).

    Validated against STARsolo 2.7.10+ and Cell Ranger 6/7/8 features.tsv
    (3-col); 2-col v2 features.tsv is supported via the column-count
    dispatch in ``_read_features``.
    """
    if isinstance(sample_dirs, (str, Path)):
        sample_dirs = [sample_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sample_dirs = list(sample_dirs)
    sample_ids = list(sample_ids)

    if len(sample_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sample_dirs)={len(sample_dirs)} must equal "
            f"len(sample_ids)={len(sample_ids)}"
        )
    if len(sample_dirs) == 0:
        raise ValueError("At least one sample_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )

    n = len(sample_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_GeneSampleArtifacts] = []
    for sd, sid, bw, tps, lib in zip(sample_dirs, sample_ids, bcw, tp, libs, strict=True):
        artifacts.append(
            _read_one_gene_sample(
                sd, sid,
                barcode_whitelist=bw,
                tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib,
                verbose=verbose,
            )
        )

    X, var, obs, spatial_uns, spatial_arr = _concat_gene_samples(
        artifacts, sample_ids, var_names=var_names,
    )

    adata = ad.AnnData(X=X, obs=obs, var=var)
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; validators downstream transparently
    # migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "source": "starsolo",
        "params": {
            "read_starsolo_gene": {
                "n_samples": n,
                "var_names": var_names,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
            }
        },
    }
    if var_names == "gene_symbols":
        # Legal here; gene symbols can collide and there is no LJV suffix scheme.
        adata.var_names_make_unique()
    else:
        if not adata.var_names.is_unique:
            raise RuntimeError(
                "var_names not unique with var_names='gene_ids' — duplicate "
                "gene_id seen across samples; this should be impossible."
            )

    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns

    if verbose:
        print(
            f"[scsplice.io] Built Gene AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} genes ({n} samples, X nnz={adata.X.nnz})"
        )
    return adata

read_starsolo_velocyto

read_starsolo_velocyto(sample_dirs: str | Path | Sequence[str | Path], sample_ids: str | Sequence[str], *, barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None, use_internal_whitelist: bool = True, var_names: Literal['gene_ids', 'gene_symbols'] = 'gene_ids', tissue_positions: Sequence[str | Path | None] | None = None, spatial_library_ids: Sequence[str | None] | None = None, verbose: bool = False) -> AnnData

Read STARsolo Velocyto output and assemble a cell × gene AnnData with three layers.

Parameters:

Name	Type	Description	Default
`sample_dirs`	`str \| Path \| Sequence[str \| Path]`	Same shape as `read_starsolo_gene`.	required
`sample_ids`	`str \| Sequence[str]`	Same shape as `read_starsolo_gene`. Must not contain `'-'`: the `{barcode}-{sample_id}` `obs_names` convention becomes unparseable if a sample_id has an embedded hyphen.	required
`barcode_whitelists`	`Sequence[str \| Path \| Sequence[str] \| None] \| None`	Same shape as `read_starsolo_gene`.	`None`
`use_internal_whitelist`	`bool`	Same shape as `read_starsolo_gene`.	`True`
`var_names`	`Literal['gene_ids', 'gene_symbols']`	Same shape as `read_starsolo_gene`.	`'gene_ids'`
`tissue_positions`	`Sequence[str \| Path \| None] \| None`	Same shape as `read_starsolo_gene`.	`None`
`spatial_library_ids`	`Sequence[str \| None] \| None`	Same shape as `read_starsolo_gene`.	`None`
`verbose`	`bool`	Same shape as `read_starsolo_gene`.	`False`

Returns:

Type	Description
`AnnData`	`X` is sparse CSC float64 spliced counts (a copy of `layers["spliced"]`, not a shared reference). Additional layers `unspliced` and `ambiguous` sit alongside on the same gene axis.

Notes

Both wire formats are auto-detected:

- Modern: three separate `spliced.mtx` / `unspliced.mtx` / `ambiguous.mtx` files.
- Legacy: a stacked `matrix.mtx` with rows `[0..n_genes)` spliced, `[n_genes..2*n_genes)` unspliced, `[2*n_genes..3*n_genes)` ambiguous.
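The legacy layout can be unpacked by integer-dividing each row index by `n_genes`. The sketch below shows this on coordinate triples with 0-based row indices, matching the ranges above; `split_stacked` is a hypothetical helper, and a real MatrixMarket file uses 1-based indices that would need shifting first.

```python
def split_stacked(entries, n_genes):
    """Split (row, col, value) triples of a stacked matrix into three layers."""
    names = ("spliced", "unspliced", "ambiguous")
    layers = {name: [] for name in names}
    for row, col, value in entries:
        # block selects the layer; gene is the row index within that layer.
        block, gene = divmod(row, n_genes)
        if not 0 <= block < len(names):
            raise ValueError(f"row {row} is outside the 3 * n_genes stacked range")
        layers[names[block]].append((gene, col, value))
    return layers


entries = [(0, 0, 5), (1, 1, 2), (2, 0, 1), (5, 1, 3)]
result = split_stacked(entries, n_genes=2)
print(result["spliced"])    # [(0, 0, 5), (1, 1, 2)]
print(result["unspliced"])  # [(0, 0, 1)]
print(result["ambiguous"])  # [(1, 1, 3)]
```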

The Velocyto feature has only a raw/ directory in stock STARsolo output, so the filtered/ fallback is not applicable here. Internal whitelist (when enabled) reads from Solo.out/Gene/filtered/barcodes.tsv next door.

The reader does not run velocity smoothing or moments — that's scvelo.pp.filter_and_normalize / scvelo.tl.velocity.

Source code in src/scsplice/io/_starsolo_velocyto.py

def read_starsolo_velocyto(
    sample_dirs: str | Path | Sequence[str | Path],
    sample_ids: str | Sequence[str],
    *,
    barcode_whitelists: Sequence[str | Path | Sequence[str] | None] | None = None,
    use_internal_whitelist: bool = True,
    var_names: Literal["gene_ids", "gene_symbols"] = "gene_ids",
    tissue_positions: Sequence[str | Path | None] | None = None,
    spatial_library_ids: Sequence[str | None] | None = None,
    verbose: bool = False,
) -> ad.AnnData:
    """Read STARsolo Velocyto output and assemble a cell × gene AnnData with three layers.

    Parameters
    ----------
    sample_dirs, sample_ids, barcode_whitelists, use_internal_whitelist,
    var_names, tissue_positions, spatial_library_ids, verbose
        Same shape as :func:`read_starsolo_gene`. ``sample_ids`` **must
        not contain ``'-'``**: the ``{barcode}-{sample_id}`` ``obs_names``
        convention becomes unparseable if a sample_id has an embedded
        hyphen.

    Returns
    -------
    AnnData
        ``X`` is sparse CSC float64 spliced counts (a copy of
        ``layers["spliced"]``, not a shared reference). Additional layers
        ``unspliced`` and ``ambiguous`` sit alongside on the same gene axis.

    Notes
    -----
    Both wire formats are auto-detected:

    - **Modern**: three separate ``spliced.mtx`` / ``unspliced.mtx`` /
      ``ambiguous.mtx`` files.
    - **Legacy**: a stacked ``matrix.mtx`` with rows
      ``[0..n_genes)`` spliced, ``[n_genes..2*n_genes)`` unspliced,
      ``[2*n_genes..3*n_genes)`` ambiguous.

    The Velocyto feature has only a ``raw/`` directory in stock STARsolo
    output, so the ``filtered/`` fallback is not applicable here. Internal
    whitelist (when enabled) reads from
    ``Solo.out/Gene/filtered/barcodes.tsv`` next door.

    The reader does not run velocity smoothing or moments — that's
    ``scvelo.pp.filter_and_normalize`` / ``scvelo.tl.velocity``.
    """
    if isinstance(sample_dirs, (str, Path)):
        sample_dirs = [sample_dirs]
    if isinstance(sample_ids, str):
        sample_ids = [sample_ids]
    sample_dirs = list(sample_dirs)
    sample_ids = list(sample_ids)

    if len(sample_dirs) != len(sample_ids):
        raise ValueError(
            f"len(sample_dirs)={len(sample_dirs)} must equal "
            f"len(sample_ids)={len(sample_ids)}"
        )
    if len(sample_dirs) == 0:
        raise ValueError("At least one sample_dir / sample_id pair is required.")
    if len(set(sample_ids)) != len(sample_ids):
        dup = sorted({s for s, c in Counter(sample_ids).items() if c > 1})
        raise ValueError(f"sample_ids must be unique; duplicates: {dup}")
    bad_sid = [s for s in sample_ids if "-" in s]
    if bad_sid:
        raise ValueError(
            f"sample_ids must not contain '-' (the reader uses 'obs_names = "
            f"{{barcode}}-{{sample_id}}' so embedded hyphens make the suffix "
            f"ambiguous to parse). Offending sample_ids: {bad_sid}"
        )

    n = len(sample_dirs)
    bcw = normalize_per_sample_arg(barcode_whitelists, n, name="barcode_whitelists")
    tp = normalize_per_sample_arg(tissue_positions, n, name="tissue_positions")
    libs = normalize_per_sample_arg(spatial_library_ids, n, name="spatial_library_ids")

    artifacts: list[_VelSampleArtifacts] = []
    for sd, sid, bw, tps, lib in zip(sample_dirs, sample_ids, bcw, tp, libs, strict=True):
        artifacts.append(
            _read_one_velocyto_sample(
                sd, sid,
                barcode_whitelist=bw, tissue_positions=tps,
                use_internal_whitelist=use_internal_whitelist,
                spatial_library_id=lib, verbose=verbose,
            )
        )

    (
        spliced, unspliced, ambiguous, var, obs, spatial_uns, spatial_arr
    ) = _concat_velocyto_samples(
        artifacts, sample_ids, var_names=var_names,
    )

    adata = ad.AnnData(
        X=spliced.copy(),
        layers={"spliced": spliced, "unspliced": unspliced, "ambiguous": ambiguous},
        obs=obs, var=var,
    )
    # Record per-sample wire-format detection.
    wire_formats = {sid: art.wire_format for sid, art in zip(sample_ids, artifacts, strict=True)}
    # Use the canonical scsplice namespace key. Legacy h5ad files written by
    # splikit-py 1.0.0 carry uns['splikit']; validators downstream transparently
    # migrate that via scsplice._core._validators.get_scsplice_ns.
    adata.uns["scsplice"] = {
        "version": 1,
        "source": "starsolo",
        "params": {
            "read_starsolo_velocyto": {
                "n_samples": n,
                "var_names": var_names,
                "use_internal_whitelist": bool(use_internal_whitelist),
                "any_explicit_whitelist": any(b is not None for b in bcw),
                "any_tissue_positions": any(t is not None for t in tp),
                "wire_formats": wire_formats,
            }
        },
    }
    if var_names == "gene_symbols":
        adata.var_names_make_unique()
    else:
        if not adata.var_names.is_unique:
            raise RuntimeError(
                "var_names not unique with var_names='gene_ids' — duplicate "
                "gene_id seen across samples."
            )

    if spatial_arr is not None:
        adata.obsm["spatial"] = spatial_arr.astype(np.float64, copy=False)
    if spatial_uns is not None:
        adata.uns["spatial"] = spatial_uns

    if verbose:
        print(
            f"[scsplice.io] Built Velocyto AnnData: {adata.n_obs} cells × "
            f"{adata.n_vars} genes (spliced nnz={adata.layers['spliced'].nnz}, "
            f"unspliced nnz={adata.layers['unspliced'].nnz}, "
            f"ambiguous nnz={adata.layers['ambiguous'].nnz})"
        )
    return adata