Statement Extractor Documentation
Extract structured subject-predicate-object statements from unstructured text using T5-Gemma 2 and GLiNER2 models with document processing, entity resolution, and taxonomy classification.
Getting Started
Installation
pip install corp-extractor
The GLiNER2 model (205M params) is downloaded automatically on first use.
GPU support: Install PyTorch with CUDA before installing corp-extractor. The library auto-detects GPU availability at runtime.
# Example for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor
Apple Silicon (M1/M2/M3): MPS acceleration is automatically detected. Just install normally:
pip install corp-extractor
Quick Start
Extract structured statements from text in 5 lines:
from statement_extractor import extract_statements
text = "Apple Inc. acquired Beats Electronics for $3 billion in May 2014."
statements = extract_statements(text)
for stmt in statements:
    print(f"{stmt.subject.text} ({stmt.subject.type}) -> {stmt.predicate} -> {stmt.object.text}")
Output:
Apple Inc. (ORG) -> acquired -> Beats Electronics
Apple Inc. (ORG) -> paid -> $3 billion
Beats Electronics (ORG) -> acquisition price -> $3 billion
Each statement includes confidence scores and extraction method:
for stmt in statements:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  method: {stmt.extraction_method}")  # hybrid, gliner, or model
    print(f"  confidence: {stmt.confidence_score:.2f}")
v0.5.0 features: Plugin-based pipeline architecture with entity qualification, labeling, and taxonomy classification. GLiNER2 entity recognition and entity-based scoring.
v0.6.0 features: Entity embedding database with ~100K+ SEC filers, ~3M GLEIF records, ~5M UK organizations for fast entity qualification.
v0.7.0 features: Document processing for files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.
v0.8.0 features: Merged qualification and canonicalization into single stage. EntityType classification for organizations (business, nonprofit, government, etc.).
v0.9.0 features: Person database with Wikidata import for notable people (executives, politicians, athletes, artists). PersonQualifier for canonical person identification with role/org context.
v0.9.1 features: Wikidata dump importer (import-wikidata-dump) for large imports without SPARQL timeouts. Uses aria2c for fast parallel downloads. Extracts people via occupation (P106) and position dates (P580/P582).
v0.9.2 features: Organization canonicalization links equivalent records across sources (GLEIF, SEC, Companies House, Wikidata). People canonicalization with priority-based deduplication. Expanded PersonType classification (executive, politician, government, military, legal, etc.).
v0.9.3 features: SEC Form 4 officers import (import-sec-officers) and Companies House officers import (import-ch-officers). People now sourced from Wikidata, SEC Edgar, and Companies House with cross-source canonicalization.
v0.9.4 features: Database v2 schema with normalized INTEGER foreign keys and enum lookup tables. Scalar (int8) embeddings for 75% storage reduction with ~92% recall. New locations import for countries/states/cities with hierarchy. Migration commands: db migrate-v2, db backfill-scalar. New search commands: db search-roles, db search-locations.
Pipeline Quick Start (v0.5.0)
For full entity resolution with qualification, canonicalization, labeling, and taxonomy classification:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")
# Access fully qualified names (e.g., "Andy Jassy (CEO, Amazon)")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
    # Access labels (sentiment, etc.)
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")
CLI usage:
# Full pipeline
corp-extractor pipeline "Amazon CEO Andy Jassy announced..."
# Run specific stages only
corp-extractor pipeline -f article.txt --stages 1-3
# Process documents and URLs (v0.7.0)
corp-extractor document process article.txt
corp-extractor document process https://example.com/article
corp-extractor document process report.pdf --use-ocr
Using Predicate Taxonomies
Normalize extracted predicates to canonical forms using embedding similarity:
from statement_extractor import extract_statements, PredicateTaxonomy, ExtractionOptions
# Define your domain's canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in",
    "invested_in", "partnered_with", "announced",
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
text = "Google bought YouTube for $1.65 billion in 2006."
result = extract_statements(text, options)
for stmt in result:
    print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
    # Output: bought -> acquired
This maps synonyms like "bought", "purchased", and "acquired" to a single canonical form, making downstream analysis easier.
Requirements
| Dependency | Version | Notes |
|---|---|---|
| Python | 3.10+ | Required |
| PyTorch | 2.0+ | Required |
| transformers | 5.0+ | Required for T5-Gemma2 support |
| Pydantic | 2.0+ | Required |
| sentence-transformers | 2.2+ | Required, for embedding features |
| GLiNER2 | latest | Required, for entity recognition and relation extraction (model auto-downloads) |
Hardware requirements:
- NVIDIA GPU: RTX 4090+ recommended for production. Uses bfloat16 precision for efficiency.
- Apple Silicon: M1/M2/M3 with 16GB+ RAM. MPS acceleration auto-detected.
- CPU: Functional but slower. Use for development or low-volume processing.
- Disk: ~100GB for all models and entity database (10M+ organizations, 40M+ people).
The library runs entirely locally with no external API dependencies. Models use bfloat16 on CUDA and float32 on MPS/CPU.
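The device/precision policy above can be sketched as a small selection function. This is an illustrative sketch, not the library's actual code; the real auto-detection happens at runtime via PyTorch.

```python
# Sketch of the documented policy: bfloat16 on CUDA, float32 on MPS and CPU.
# In practice the availability flags would come from torch.cuda.is_available()
# and torch.backends.mps.is_available().
def pick_device_and_dtype(cuda_available: bool, mps_available: bool) -> tuple[str, str]:
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"

print(pick_device_and_dtype(cuda_available=False, mps_available=True))  # ('mps', 'float32')
```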
Command Line Interface
The corp-extractor CLI provides commands for extraction, document processing, and database management.
Commands Overview
| Command | Description | Use Case |
|---|---|---|
| split | Simple extraction (Stage 1 only) | Fast extraction, basic triples |
| pipeline | Full 5-stage pipeline | Entity resolution, labeling, taxonomy |
| document | Document processing | Files, URLs, PDFs with chunking and deduplication |
| db | Database management | Import, search, upload/download entity database |
| plugins | Plugin management | List and inspect available plugins |
Installation
For best results, install globally:
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"
# Using pipx
pipx install "corp-extractor[embeddings]"
# Using pip
pip install "corp-extractor[embeddings]"
Quick Run with uvx
Run directly without installing using uv:
uvx corp-extractor split "Apple announced a new iPhone."
Note: First run downloads the model (~1.5GB), which may take a few minutes.
Split Command
The split command extracts sub-statements using the T5-Gemma model. It's fast and simple—use pipeline for full entity resolution.
# Extract from text argument
corp-extractor split "Apple Inc. announced the iPhone 15."
# Extract from file
corp-extractor split -f article.txt
# Pipe from stdin
cat article.txt | corp-extractor split -
# Output as JSON
corp-extractor split "Tim Cook is CEO of Apple." --json
# Output as XML
corp-extractor split "Tim Cook is CEO of Apple." --xml
# Verbose output with confidence scores
corp-extractor split -f article.txt --verbose
# Use more beams for better quality
corp-extractor split -f article.txt --beams 8
Split Options
| Option | Description | Default |
|---|---|---|
| -f, --file PATH | Read input from file | — |
| -o, --output | Output format: table, json, xml | table |
| --json / --xml | Output format shortcuts | — |
| -b, --beams | Number of beams for diverse beam search | 4 |
| --diversity | Diversity penalty for beam search | 1.0 |
| --no-gliner | Disable GLiNER2 extraction | — |
| --predicates | Comma-separated predicates for relation extraction | — |
| --predicates-file | Path to custom predicates JSON file | — |
| --device | Device: auto, cuda, mps, cpu | auto |
| -v, --verbose | Show confidence scores and metadata | — |
Pipeline Command
NEW in v0.5.0: The pipeline command runs the full 5-stage extraction pipeline for comprehensive entity resolution and taxonomy classification.
# Run all 5 stages
corp-extractor pipeline "Amazon CEO Andy Jassy announced plans to hire workers."
# Run from file
corp-extractor pipeline -f article.txt
# Run specific stages
corp-extractor pipeline "..." --stages 1-3
corp-extractor pipeline "..." --stages 1,2,5
# Skip specific stages
corp-extractor pipeline "..." --skip-stages 4,5
# Enable specific plugins only
corp-extractor pipeline "..." --plugins gleif,companies_house
# Disable specific plugins
corp-extractor pipeline "..." --disable-plugins sec_edgar
# Output formats
corp-extractor pipeline "..." -o json
corp-extractor pipeline "..." -o yaml
corp-extractor pipeline "..." -o triples
Pipeline Stages
| Stage | Name | Description |
|---|---|---|
| 1 | Splitting | Text → Raw triples (T5-Gemma) |
| 2 | Extraction | Raw triples → Typed statements (GLiNER2) |
| 3 | Entity Qualification | Add identifiers (LEI, CIK, etc.) and canonical names via embedding DB |
| 4 | Labeling | Apply sentiment, relation type, confidence |
| 5 | Taxonomy | Classify against large taxonomies (MNLI/embeddings) |
Pipeline Options
| Option | Description | Example |
|---|---|---|
| --stages | Stages to run | 1-3 or 1,2,5 |
| --skip-stages | Stages to skip | 4,5 |
| --plugins | Enable only these plugins | gleif,person |
| --disable-plugins | Disable these plugins | sec_edgar |
| --predicates-file | Custom predicates JSON file for GLiNER2 | custom.json |
| -o, --output | Output format | table, json, yaml, triples |
Plugins Command
NEW in v0.5.0: The plugins command lists and inspects available pipeline plugins.
# List all plugins
corp-extractor plugins list
# List plugins for a specific stage
corp-extractor plugins list --stage 3
# Get details about a plugin
corp-extractor plugins info gleif_qualifier
corp-extractor plugins info person_qualifier
Example output:
Stage 1: Splitting
----------------------------------------
t5_gemma_splitter [priority: 100]
Stage 2: Extraction
----------------------------------------
gliner2_extractor [priority: 100]
Stage 3: Entity Qualification
----------------------------------------
person_qualifier (PERSON) [priority: 100]
embedding_company_qualifier (ORG) [priority: 5]
Stage 4: Labeling
----------------------------------------
sentiment_labeler [priority: 100]
confidence_labeler [priority: 100]
relation_type_labeler [priority: 100]
Stage 5: Taxonomy
----------------------------------------
embedding_taxonomy_classifier [priority: 100]
Output Formats
Table output (default):
Extracted 2 statement(s):
--------------------------------------------------------------------------------
1. Andy Jassy (CEO, Amazon)
--[announced]-->
plans to hire workers
--------------------------------------------------------------------------------
JSON output:
{
"statement_count": 2,
"labeled_statements": [
{
"subject": {"text": "Andy Jassy", "type": "PERSON", "fqn": "Andy Jassy (CEO, Amazon)"},
"predicate": "announced",
"object": {"text": "plans to hire workers", "type": "EVENT"},
"labels": {"sentiment": "positive"}
}
]
}
Triples output:
Andy Jassy (CEO, Amazon) announced plans to hire workers
Amazon has CEO Andy Jassy (CEO, Amazon)
Shell Integration
Processing multiple files:
# Process all .txt files
for f in *.txt; do
    echo "=== $f ==="
    corp-extractor pipeline -f "$f" -o json > "${f%.txt}.json"
done
Combining with jq:
# Extract just predicates
corp-extractor split "Your text" --json | jq '.statements[].predicate'
# Filter high-confidence statements
corp-extractor split -f article.txt --json | jq '.statements[] | select(.confidence_score > 0.8)'
# Get FQNs from pipeline
corp-extractor pipeline "Your text" -o json | jq '.labeled_statements[].subject.fqn'
Document Command
NEW in v0.7.0: The document command processes files, URLs, and PDFs with automatic chunking and deduplication.
# Process local files
corp-extractor document process article.txt
corp-extractor document process report.txt --title "Annual Report" --year 2024
# Process URLs (web pages and PDFs)
corp-extractor document process https://example.com/article
corp-extractor document process https://example.com/report.pdf --use-ocr
# Configure chunking
corp-extractor document process article.txt --max-tokens 500 --overlap 50
# Preview chunking without extraction
corp-extractor document chunk article.txt --max-tokens 500
# Output formats
corp-extractor document process article.txt -o json
corp-extractor document process article.txt -o triples
Document Options
| Option | Description | Default |
|---|---|---|
| --title | Document title for citations | Filename |
| --max-tokens | Target tokens per chunk | 1000 |
| --overlap | Token overlap between chunks | 100 |
| --use-ocr | Force OCR for PDF parsing | — |
| --no-summary | Skip document summarization | — |
| --no-dedup | Skip cross-chunk deduplication | — |
| --stages | Pipeline stages to run | 1-5 |
Database Commands
UPDATED in v0.9.4: The db command group manages the entity embedding database used for organization, person, role, and location qualification.
# Show database status
corp-extractor db status
# Search for an organization
corp-extractor db search "Microsoft"
corp-extractor db search "Barclays" --source companies_house
# Search for a person (v0.9.0)
corp-extractor db search-people "Tim Cook"
corp-extractor db search-people "Elon Musk" --top-k 5
# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"
corp-extractor db search-roles "Chief Financial Officer"
# Search for locations (v0.9.4)
corp-extractor db search-locations "California"
corp-extractor db search-locations "Germany" --type country
# Import organizations from data sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download # Bulk data (~100K+ filers)
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 50000 # SPARQL-based
# Import notable people (v0.9.0)
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --all --limit 10000 # All person types
# Import from Wikidata dump (v0.9.1) - avoids SPARQL timeouts
corp-extractor db import-wikidata-dump --download --limit 50000
corp-extractor db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs
# Download/upload from HuggingFace Hub
corp-extractor db download # Lite version (default)
corp-extractor db download --full # Full version with metadata
corp-extractor db upload # Upload with all variants
# Migrate from old schema (companies.db → entities.db)
corp-extractor db migrate companies.db --rename-file
# Migrate to v2 normalized schema (v0.9.4)
corp-extractor db migrate-v2 entities.db entities-v2.db
corp-extractor db migrate-v2 entities.db entities-v2.db --resume # Resume interrupted
# Generate int8 scalar embeddings (v0.9.4) - 75% smaller
corp-extractor db backfill-scalar
corp-extractor db backfill-scalar --skip-generate # Only quantize existing
# Local database management
corp-extractor db create-lite entities.db # Create lite version
corp-extractor db compress entities.db    # Compress with gzip
Organization Data Sources
| Source | Command | Records | Identifier |
|---|---|---|---|
| GLEIF | import-gleif --download | ~3.2M | LEI |
| SEC Edgar | import-sec --download | ~100K+ | CIK |
| Companies House | import-companies-house --download | ~5M | Company Number |
| Wikidata (SPARQL) | import-wikidata | Variable | QID |
| Wikidata (Dump) | import-wikidata-dump --download | All with enwiki | QID |
Person Data Sources v0.9.0
| Type | Command | Description |
|---|---|---|
| Executives | import-people --type executive | CEOs, CFOs, board members |
| Politicians | import-people --type politician | Elected officials, diplomats |
| Athletes | import-people --type athlete | Sports figures, coaches |
| Artists | import-people --type artist | Actors, musicians, directors |
| All Types | import-people --all | Run all person type queries |
Person Import Options
| Option | Description |
|---|---|
| --skip-existing | Skip existing records instead of updating them |
| --enrich-dates | Query individual records for start/end dates (slower) |
Wikidata Dump Import v0.9.1
For large imports that avoid SPARQL timeouts, use the Wikidata JSON dump:
# Download and import (~100GB dump file)
corp-extractor db import-wikidata-dump --download --limit 50000
# Import only people
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000
# Import only organizations
corp-extractor db import-wikidata-dump --download --orgs --no-people --limit 100000
# Import only locations (v0.9.4)
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
# Use existing dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2
Fast download with aria2c: Install aria2c for 10-20x faster downloads:
brew install aria2 # macOS
apt install aria2    # Ubuntu/Debian
| Option | Description |
|---|---|
| --download | Download the Wikidata dump (~100GB) |
| --dump PATH | Use existing dump file (.bz2 or .gz) |
| --people/--no-people | Import people (default: yes) |
| --orgs/--no-orgs | Import organizations (default: yes) |
| --locations/--no-locations | Import locations (default: no) v0.9.4 |
| --no-aria2 | Don't use aria2c even if available |
Advantages over SPARQL:
- No timeouts (processes locally)
- Complete coverage (all notable people/orgs with English Wikipedia)
- Captures people via occupation (P106) even if position type is generic
- Extracts role dates from position qualifiers (P580/P582)
- Imports locations with hierarchical parent relationships (v0.9.4)
Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2
Note: Use -v (verbose) to see detailed logs of skipped records during import:
corp-extractor db import-people --type executive -v
People records include from_date and to_date for role tenure. The same person can have multiple records with different role/org combinations (unique on source_id + role + org).
Organizations discovered during people import (employers, affiliated orgs) are automatically inserted into the organizations table if they don't already exist. This creates foreign key links via known_for_org_id.
Database Variants
| File | Description | Use Case |
|---|---|---|
| entities-lite.db | Core fields + embeddings only | Default download, fast searches |
| entities.db | Full database with source metadata | When you need complete record data |
| *.db.gz | Gzip compressed versions | Faster downloads, auto-decompressed |
Database Options
| Option | Description | Default |
|---|---|---|
| --db PATH | Database file path | ~/.cache/corp-extractor/entities.db |
| --limit N | Limit number of records | — |
| --download | Download source data automatically | — |
| --full | Download full version instead of lite | — |
| --no-lite | Skip creating lite version on upload | — |
| --no-compress | Skip creating compressed versions | — |
See COMPANY_DB.md for complete build and publish instructions.
Core Concepts
Corp-extractor is designed to analyze complex text and extract relationship information about people and organizations. It runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies, using multiple fine-tuned small models to transform unstructured text into structured knowledge.
Statement Extraction
Statement extraction is the process of converting unstructured natural language text into structured subject-predicate-object triples. Each triple represents a discrete fact or relationship extracted from the source text.
For example, given the text:
"Apple announced a new iPhone at their Cupertino headquarters."
The extractor produces triples like:
| Subject | Predicate | Object |
|---|---|---|
| Apple (ORG) | announced | iPhone (PRODUCT) |
| Apple (ORG) | has headquarters in | Cupertino (GPE) |
The T5-Gemma 2 Model
Corp-extractor uses a fine-tuned T5-Gemma 2 model with 540 million parameters. This encoder-decoder architecture excels at sequence-to-sequence tasks, making it well-suited for transforming text into structured XML output.
The model processes input text wrapped in <page> tags and generates XML containing <stmt> elements with subject, predicate, object, and source text spans.
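Downstream parsing of that XML can be pictured with a toy example. The exact element layout inside each <stmt> is an assumption for illustration; the text above only says statements carry subject, predicate, object, and source spans.

```python
import xml.etree.ElementTree as ET

# Toy model-style XML output. The element names inside <stmt> are
# hypothetical; only the overall <stmt>-per-statement shape is documented.
xml_output = """
<statements>
  <stmt>
    <subject>Apple Inc.</subject>
    <predicate>acquired</predicate>
    <object>Beats Electronics</object>
    <source>Apple Inc. acquired Beats Electronics for $3 billion in May 2014.</source>
  </stmt>
</statements>
"""

root = ET.fromstring(xml_output)
triples = [
    (s.findtext("subject"), s.findtext("predicate"), s.findtext("object"))
    for s in root.iter("stmt")
]
print(triples)  # [('Apple Inc.', 'acquired', 'Beats Electronics')]
```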
Entity Type Recognition
Each extracted subject and object is classified into one of 12 entity types (plus UNKNOWN):
| Type | Description | Example |
|---|---|---|
| ORG | Organizations, companies | Apple, United Nations |
| PERSON | Named individuals | Tim Cook, Marie Curie |
| GPE | Geopolitical entities | France, New York City |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| PRODUCT | Products, artifacts | iPhone, Model S |
| EVENT | Named events | World War II, Olympics |
| WORK_OF_ART | Creative works | Mona Lisa, Hamlet |
| LAW | Legal documents | GDPR, First Amendment |
| DATE | Temporal expressions | January 2024, last Tuesday |
| MONEY | Monetary values | $50 million, €100 |
| PERCENT | Percentages | 15%, half |
| QUANTITY | Measurements | 500 kilometers, 3 tons |
| UNKNOWN | Unclassified entities | — |
Diverse Beam Search
Corp-extractor uses Diverse Beam Search (Vijayakumar et al., 2016) to generate multiple candidate extractions from the same input text.
Why Diverse Beam Search?
Standard beam search tends to produce similar outputs—slight variations of the same interpretation. Diverse Beam Search introduces a diversity penalty that encourages the model to explore fundamentally different extractions.
This is particularly valuable for statement extraction because:
- A single sentence may contain multiple valid interpretations
- Different phrasings can capture different aspects of the same fact
- Merging diverse outputs produces more comprehensive coverage
How It Works
The model generates multiple beams in parallel, each representing a different extraction path. A diversity penalty is applied during generation to prevent beams from converging on identical outputs.
Default Parameters
| Parameter | Default | Description |
|---|---|---|
| num_beams | 4 | Number of parallel beams to generate |
| diversity_penalty | 1.0 | Strength of diversity encouragement (higher = more diverse) |
from statement_extractor import extract_statements
# Use default beam search settings
result = extract_statements("Apple announced a new iPhone.")
# Customize beam search
result = extract_statements(
    "Apple announced a new iPhone.",
    num_beams=6,
    diversity_penalty=1.5
)
Quality Scoring
UPDATED in v0.4.0: Each extracted statement receives a confidence score between 0 and 1, measuring extraction quality through a weighted combination of semantic and entity-based signals.
Confidence Score
The score combines three components using GLiNER2 for entity recognition:
| Component | Weight | Description |
|---|---|---|
| Semantic similarity | 50% | Cosine similarity between source text and reassembled triple |
| Subject entity score | 25% | How entity-like the subject is (via GLiNER2 NER) |
| Object entity score | 25% | How entity-like the object is (via GLiNER2 NER) |
Higher scores indicate the triple is semantically grounded and contains well-formed entities. Lower scores may suggest hallucination or poorly extracted entities.
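The weighting in the table above can be sketched as a simple weighted sum. The component values here are illustrative numbers, not real model outputs, and the helper name is hypothetical.

```python
# Sketch of the documented weighting: 50% semantic similarity,
# 25% subject entity score, 25% object entity score.
WEIGHTS = {"semantic": 0.50, "subject_entity": 0.25, "object_entity": 0.25}

def combine_confidence(semantic: float, subject_entity: float, object_entity: float) -> float:
    """Weighted sum of the three component scores, clamped to [0, 1]."""
    score = (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["subject_entity"] * subject_entity
        + WEIGHTS["object_entity"] * object_entity
    )
    return max(0.0, min(1.0, score))

confidence = combine_confidence(semantic=0.9, subject_entity=0.8, object_entity=0.6)
print(round(confidence, 3))  # 0.5*0.9 + 0.25*0.8 + 0.25*0.6 = 0.8
```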
Confidence Filtering
Use the min_confidence parameter to filter out low-quality extractions:
from statement_extractor import extract_statements
# Only return statements with confidence >= 0.7
result = extract_statements(
    "Apple CEO Tim Cook announced the iPhone 15.",
    min_confidence=0.7
)
# Access individual scores
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
Beam Merging vs Best Beam Selection
Corp-extractor supports two strategies for combining beam outputs:
| Strategy | Description | Use Case |
|---|---|---|
| merge (default) | Combine unique statements from all beams, deduplicated by content | Maximum coverage |
| best | Return only statements from the highest-scoring beam | Higher precision |
# Merge all beams (default)
result = extract_statements(text, beam_strategy="merge")
# Use only the best beam
result = extract_statements(text, beam_strategy="best")
When using merge, statements are deduplicated based on normalized subject-predicate-object content, and the highest confidence score is retained for duplicates.
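The merge behavior can be sketched as keying statements by normalized subject-predicate-object content and keeping the highest confidence per key. The tuples below are illustrative; the library operates on statement objects, not bare tuples.

```python
# Sketch of merge-style deduplication across beams: normalize the
# (subject, predicate, object) key, keep the highest-confidence duplicate.
def merge_beams(statements):
    merged = {}
    for subj, pred, obj, conf in statements:
        key = (subj.strip().lower(), pred.strip().lower(), obj.strip().lower())
        if key not in merged or conf > merged[key][3]:
            merged[key] = (subj, pred, obj, conf)
    return list(merged.values())

beams = [
    ("Apple Inc.", "acquired", "Beats Electronics", 0.91),
    ("Apple Inc.", "Acquired", "Beats Electronics ", 0.84),  # duplicate after normalization
    ("Apple Inc.", "paid", "$3 billion", 0.77),
]
print(merge_beams(beams))  # two unique statements; the duplicate keeps conf 0.91
```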
GLiNER2 Integration
NEW in v0.4.0: Version 0.4.0 introduces GLiNER2 (205M parameters) for entity recognition and relation extraction, replacing spaCy.
Why GLiNER2?
GLiNER2 is a unified model that handles:
- Named Entity Recognition - identifying entities with types
- Relation Extraction - using 324 default predicates across 21 categories
- Confidence Scoring - real confidence values via include_confidence=True
- Entity Scoring - measuring how "entity-like" subjects and objects are
Default Predicates
GLiNER2 uses 324 predicates organized into 21 categories loaded from default_predicates.json. Categories include:
- ownership_control - acquires, owns, has_subsidiary, etc.
- employment_leadership - employs, is_ceo_of, manages, etc.
- funding_investment - funds, invests_in, sponsors, etc.
- supply_chain - supplies, manufactures, distributes_for, etc.
- legal_regulatory - regulates, violates, complies_with, etc.
Each predicate includes a description for semantic matching and a confidence threshold.
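The exact schema of the predicates file is not shown here; a plausible structure, based on the category, description, and threshold fields mentioned above, might look like the following. Treat the field names as assumptions, not the real default_predicates.json schema.

```python
import json
import os
import tempfile

# Hypothetical predicates-file layout: categories mapping to predicate entries,
# each carrying a description (for semantic matching) and a confidence
# threshold. The real file's schema may differ.
custom_predicates = {
    "ownership_control": [
        {"predicate": "acquires", "description": "One company buys another", "threshold": 0.5},
    ],
    "employment_leadership": [
        {"predicate": "is_ceo_of", "description": "Person leads a company as CEO", "threshold": 0.5},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "custom_predicates.json")
with open(path, "w") as f:
    json.dump(custom_predicates, f, indent=2)
```

A file like this would then be passed via the predicates_file option or the --predicates-file CLI flag described below.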
All Matches Returned
GLiNER2 now returns all matching relations, not just the best one. This allows downstream filtering and selection based on your use case:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")
# All matching relations are returned, sorted by confidence
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
Custom Predicates
You can provide custom predicates via a JSON file:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
config = PipelineConfig(
    extractor_options={"predicates_file": "/path/to/custom_predicates.json"}
)
pipeline = ExtractionPipeline(config)
Or via CLI:
corp-extractor pipeline "..." --predicates-file custom_predicates.json
Entity-Based Scoring
Confidence scores come directly from GLiNER2 with include_confidence=True:
| Source | Description |
|---|---|
| Relation confidence | GLiNER2 confidence in the relation match |
| Entity confidence | GLiNER2 confidence in entity recognition |
Pipeline Architecture
Updated in v0.8.0: Version 0.8.0 uses a 5-stage plugin-based pipeline for comprehensive entity resolution, statement enrichment, and taxonomy classification. Qualification and canonicalization have been merged into a single stage using the embedding database.
The 5 Stages
| Stage | Name | Input | Output | Purpose |
|---|---|---|---|---|
| 1 | Splitting | Text | RawTriple[] | Extract raw subject-predicate-object triples using T5-Gemma2 |
| 2 | Extraction | RawTriple[] | PipelineStatement[] | Refine entities with type recognition using GLiNER2 |
| 3 | Entity Qualification | Entities | CanonicalEntity[] | Add identifiers (LEI, CIK, etc.) and resolve canonical names via embedding database |
| 4 | Labeling | Statements | LabeledStatement[] | Apply sentiment, relation type, confidence labels |
| 5 | Taxonomy | Statements | TaxonomyResult[] | Classify against large taxonomies (ESG topics, etc.) |
Stage 1: Splitting
The splitting stage transforms raw text into RawTriple objects using the T5-Gemma2 model. Each triple contains:
- subject_text: The raw subject text
- predicate_text: The raw predicate/relationship
- object_text: The raw object text
- source_sentence: The sentence this triple was extracted from
- confidence: Extraction confidence score
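The shape of a Stage 1 record can be pictured as a dataclass. This is a sketch mirroring the fields listed above, not statement_extractor's actual RawTriple class.

```python
from dataclasses import dataclass

# Illustrative shape of a Stage 1 output record; field names follow the
# documented list, but the real class may differ.
@dataclass
class RawTriple:
    subject_text: str
    predicate_text: str
    object_text: str
    source_sentence: str
    confidence: float

triple = RawTriple(
    subject_text="Apple Inc.",
    predicate_text="acquired",
    object_text="Beats Electronics",
    source_sentence="Apple Inc. acquired Beats Electronics for $3 billion in May 2014.",
    confidence=0.92,
)
print(triple.subject_text, triple.predicate_text, triple.object_text)
```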
Stage 2: Extraction
The extraction stage uses GLiNER2 to extract relations and assign entity types, producing PipelineStatement objects with:
- subject: ExtractedEntity with text, type, span, and confidence
- object: ExtractedEntity with text, type, span, and confidence
- predicate: Predicate from GLiNER2's 324 default predicates
- predicate_category: Category the predicate belongs to (e.g., "employment_leadership")
- source_text: Source text for this statement
- confidence_score: Real confidence from GLiNER2
Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. This allows downstream stages to filter, deduplicate, or select based on specific criteria. Relations are sorted by confidence (descending).
Stage 3: Entity Qualification
Entity qualification combines what were previously separate qualification and canonicalization stages. It adds context, external identifiers, and canonical names to entities using the embedding database:
- PersonQualifier: Adds role, organization, and canonical ID for PERSON entities (enhanced in v0.9.0)
- Uses LLM (Gemma3) to extract role and organization from context
- Searches person database for notable people (executives, politicians, athletes, etc.)
- Resolves organization mentions against the organization database
- Returns canonical Wikidata IDs for matched people
- EmbeddingCompanyQualifier: Looks up company identifiers (LEI, CIK, UK company numbers) and canonical names using vector similarity search
The output is CanonicalEntity with:
- entity_type: Classification (business, nonprofit, government, etc.)
- canonical_match: Match details (id, name, method, confidence)
- fqn: Fully Qualified Name, e.g., "Tim Cook (CEO, Apple Inc)"
- External identifiers: lei, ch_number, sec_cik, ticker, etc.
- resolved_role: Canonical role information from person database v0.9.0
- resolved_org: Canonical organization information from org database v0.9.0
Note: The embedding-based company qualifier replaces the older API-based qualifiers (GLEIF, Companies House, SEC Edgar APIs) for faster, offline entity resolution.
Stage 4: Labeling
Labeling plugins annotate statements with additional metadata:
- SentimentLabeler: Adds sentiment classification (positive/negative/neutral)
- ConfidenceLabeler: Adds confidence scoring
- RelationTypeLabeler: Classifies relation types
The output is LabeledStatement with:
- Original statement
- Canonicalized subject and object
- List of StatementLabel objects
Stage 5: Taxonomy
Taxonomy classification plugins classify statements against large taxonomies with hundreds of possible values. Multiple labels may match a single statement above the confidence threshold.
- MNLITaxonomyClassifier: Uses MNLI zero-shot classification for accurate taxonomy labeling
- EmbeddingTaxonomyClassifier: Uses embedding similarity for faster classification
The output is a list of TaxonomyResult objects, each with:
- taxonomy_name: Name of the taxonomy (e.g., "esg_topics")
- category: Top-level category (e.g., "environment", "governance")
- label: Specific label within the category
- confidence: Classification confidence score
Both classifiers use hierarchical classification for efficiency: first identify the top-k categories, then return all labels above the threshold within those categories.
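The hierarchical scheme can be sketched with a toy scorer: rank categories first, then score only the labels inside the top-k categories and keep those above the threshold. The taxonomy, score dictionaries, and function names here are illustrative stand-ins for embedding or MNLI scores.

```python
# Toy two-level taxonomy classification: categories first, then labels.
TAXONOMY = {
    "environment": ["emissions", "renewable energy"],
    "governance": ["board composition", "executive pay"],
    "social": ["labor practices", "community impact"],
}

def classify(scores_cat, scores_label, top_k=1, threshold=0.5):
    # scores_cat / scores_label stand in for model similarity scores
    top = sorted(scores_cat, key=scores_cat.get, reverse=True)[:top_k]
    results = []
    for cat in top:
        for label in TAXONOMY[cat]:
            if scores_label.get(label, 0.0) >= threshold:
                results.append((cat, label, scores_label[label]))
    return results

cat_scores = {"environment": 0.2, "governance": 0.9, "social": 0.3}
label_scores = {"executive pay": 0.8, "board composition": 0.4}
print(classify(cat_scores, label_scores))  # [('governance', 'executive pay', 0.8)]
```

Scoring only labels within the winning categories is what keeps large taxonomies (hundreds of labels) tractable.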
Plugin System
Each stage is implemented through plugins registered with PluginRegistry. Plugins can be:
- Enabled/disabled per invocation
- Prioritized for execution order
- Entity-type specific (e.g., PersonQualifier only runs on PERSON entities)
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run with specific plugins disabled
config = PipelineConfig(
    disabled_plugins={"mnli_taxonomy_classifier"}  # Use embedding classifier instead
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process(text)
Document Processing
NEW in v0.7.0: Version 0.7.0 introduces document-level processing for handling files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.
Document Pipeline
The document pipeline:
- Loads content from files, URLs, or PDFs
- Chunks text into optimal-sized segments for the extraction model
- Processes each chunk through the 5-stage extraction pipeline
- Deduplicates statements across chunks
- Generates optional document summary
- Tracks citations back to source chunks
Chunking Strategy
Documents are split into chunks based on token count with configurable overlap:
| Parameter | Default | Description |
|---|---|---|
| target_tokens | 1000 | Target tokens per chunk |
| overlap_tokens | 100 | Token overlap between consecutive chunks |
| respect_sentences | true | Avoid splitting mid-sentence |
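A minimal sketch of the overlap strategy, operating on a plain token list (the real chunker counts model tokens and, with respect_sentences, avoids splitting mid-sentence):

```python
def chunk_tokens(tokens, target_tokens=1000, overlap_tokens=100):
    """Split a token list into overlapping chunks (illustrative sketch only)."""
    if target_tokens <= overlap_tokens:
        raise ValueError("target_tokens must exceed overlap_tokens")
    # Each step forward leaves `overlap_tokens` shared with the previous chunk
    step = target_tokens - overlap_tokens
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target_tokens])
        start += step
    return chunks

tokens = list(range(2500))
chunks = chunk_tokens(tokens, target_tokens=1000, overlap_tokens=100)
# chunks start at offsets 0, 900, 1800; the tail of each chunk repeats
# at the head of the next, so statements near a boundary appear in both
```

The overlap is what lets the deduplication stage reconcile statements extracted from adjacent chunks.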
URL and PDF Support
The document pipeline can fetch and process content from URLs:
- Web pages: HTML content is extracted using Readability-style parsing
- PDFs: Parsed using PyMuPDF with optional OCR for scanned documents
from statement_extractor.document import DocumentPipeline
pipeline = DocumentPipeline()
# Process a web page
ctx = await pipeline.process_url("https://example.com/article")
# Process a PDF with OCR
from statement_extractor.document import URLLoaderConfig
config = URLLoaderConfig(use_ocr=True)
ctx = await pipeline.process_url("https://example.com/report.pdf", config)
Cross-Chunk Deduplication
When processing long documents, the same fact may appear in multiple chunks. The deduplicator uses embedding similarity to identify and merge duplicate statements, keeping the highest-confidence version with proper citation tracking.
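The merge logic can be sketched as follows; Jaccard word overlap stands in here for the embedding similarity the real deduplicator uses, and the tuple layout is an assumption for illustration:

```python
def deduplicate(statements, threshold=0.8):
    """Merge near-duplicate statements, keeping the highest-confidence copy.

    `statements` are (text, confidence, chunk_id) tuples; word-set Jaccard
    overlap is a stand-in for embedding cosine similarity.
    """
    def similarity(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)

    # Visit higher-confidence statements first so survivors absorb duplicates
    kept = []
    for text, confidence, chunk_id in sorted(statements, key=lambda s: -s[1]):
        match = next((k for k in kept if similarity(k["text"], text) >= threshold), None)
        if match:
            match["citations"].append(chunk_id)  # keep citation to every source chunk
        else:
            kept.append({"text": text, "confidence": confidence, "citations": [chunk_id]})
    return kept

stmts = [
    ("Apple acquired Beats Electronics", 0.85, 3),
    ("Apple acquired Beats Electronics", 0.92, 0),
    ("Tesla opened a factory in Berlin", 0.77, 5),
]
unique = deduplicate(stmts)
# the two Apple statements merge into one record citing chunks 0 and 3
```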
Entity Embedding Database
UPDATED in v0.9.0
The entity embedding database provides fast qualification for both organizations and people using vector similarity search.
Organization Data Sources
| Source | Records | Identifier | Date Fields |
|---|---|---|---|
| Companies House | 5.5M | UK Company Number | from_date: Incorporation, to_date: Dissolution |
| GLEIF | 2.6M | LEI (Legal Entity Identifier) | from_date: LEI registration date |
| Wikidata | 1.5M | QID | from_date: Inception (P571), to_date: Dissolution (P576) |
| SEC Edgar | 73K | CIK (Central Index Key) | from_date: First SEC filing date |
Total: 9.6M+ organization records
Person Data Sources UPDATED in v0.9.3
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 27.5M | Person Number | UK company officers and directors |
| Wikidata | 13.4M | QID | Notable people with English Wikipedia articles |
Total: 41M+ people records
Person Types
| PersonType | Description | Example People |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials (presidents, MPs, mayors) | Joe Biden, Angela Merkel |
| government | Civil servants, diplomats, appointed officials | Ambassadors, agency heads |
| military | Military officers, armed forces personnel | Generals, admirals |
| legal | Judges, lawyers, legal professionals | Supreme Court justices |
| professional | Known for profession (doctors, engineers) | Famous surgeons, architects |
| athlete | Sports figures | LeBron James, Lionel Messi |
| artist | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift |
| media | Internet/social media personalities | YouTubers, influencers, podcasters |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| journalist | Reporters, news presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
People are imported from Companies House (UK company officers) and Wikidata (notable people with English Wikipedia articles). Each person record includes:
- name: Display name
- known_for_role: Primary role (e.g., "CEO", "President")
- known_for_org: Primary organization (e.g., "Apple Inc", "Tesla")
- country: Country of citizenship
- person_type: Classification category
- from_date: Role start date (ISO format)
- to_date: Role end date (ISO format)
- birth_date: Date of birth (ISO format) v0.9.2
- death_date: Date of death if deceased (ISO format) v0.9.2
Note: The same person can have multiple records with different role/org combinations (e.g., Tim Cook as "CEO at Apple" and "Board Director at Nike"). The unique constraint is on (source, source_id, known_for_role, known_for_org).
When organizations are discovered during people import (employers, affiliated orgs), they are automatically inserted into the organizations table if not already present. Each person record has a known_for_org_id foreign key linking to the organizations table, enabling efficient joins and lookups.
EntityType Classification
NEW in v0.8.0
Each organization record is classified with an entity_type field to distinguish between businesses, non-profits, government agencies, and other organization types:
| Category | Types | Description |
|---|---|---|
| Business | business, fund, branch | Commercial entities, investment funds, branch offices |
| Non-profit | nonprofit, ngo, foundation, trade_union | Charitable organizations, NGOs, labor unions |
| Government | government, international_org, political_party | Government agencies, UN/WHO/IMF, political parties |
| Education | educational, research | Schools, universities, research institutes |
| Other | healthcare, media, sports, religious | Hospitals, studios, sports clubs, religious orgs |
| Unknown | unknown | Classification not determined |
How It Works
- Embedding Generation: Organization names are embedded using EmbeddingGemma (300M params)
- Vector Search: sqlite-vec enables fast similarity search across millions of records
- Qualification: When an ORG entity is found, the database is searched for matching organizations
- Identifier Resolution: Matched organizations provide LEI, CIK, company numbers, etc.
Other Tables NEW in v0.9.4
| Table | Records | Description |
|---|---|---|
| Roles | 94K+ | Job titles with Wikidata QIDs (CEO, Director, etc.) |
| Locations | 25K+ | Countries, states, and cities with hierarchy |
Database Variants
- entities-lite.db (30.1 GB): Core fields and int8 embeddings only (default download)
- entities.db (32.2 GB): Full database with complete source metadata
- *.db.gz: Gzip compressed versions for faster downloads
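The lite variant's size saving comes from int8 scalar quantization of the embeddings. A generic sketch (not the library's exact scheme) shows why each float32 vector shrinks by 75%:

```python
from array import array

def quantize_int8(embedding):
    """Scalar-quantize float32 values to int8 (illustrative sketch).

    Each 4-byte float becomes a 1-byte signed int; the scale factor is kept
    so similarity scores can be approximated from the quantized vector.
    """
    scale = max(abs(x) for x in embedding) / 127.0 or 1.0
    quantized = array("b", (round(x / scale) for x in embedding))
    return quantized, scale

floats = array("f", [0.12, -0.98, 0.55, 0.0])  # toy stand-in for a 768-dim embedding
q, scale = quantize_int8(floats)
saving = 1 - len(q.tobytes()) / len(floats.tobytes())  # 0.75
```

The same 4:1 ratio is why the scalar embedding tables cut embedding storage by roughly 75%.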
Entity Database
The entity database provides fast lookup and qualification of organizations, people, roles, and locations using vector similarity search. It stores records from authoritative sources with 768-dimensional embeddings for semantic matching.
Quick Start UPDATED in v0.9.4
# Download the pre-built database
corp-extractor db download
# Check what's in it
corp-extractor db status
# Search for organizations
corp-extractor db search "Microsoft"
# Search for people
corp-extractor db search-people "Tim Cook"
# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"
# Search for locations (v0.9.4)
corp-extractor db search-locations "California"
The database is automatically used by the pipeline's qualification stage (Stage 3) to resolve entity names to canonical identifiers.
Getting the Database
Download Pre-built Database
The fastest way to get started is downloading from HuggingFace:
# Download lite version (default, smaller, faster)
corp-extractor db download
# Download full version (includes complete source metadata)
corp-extractor db download --full
Database variants:
| File | Size | Contents |
|---|---|---|
| entities-lite.db | 30.1 GB | Core fields + int8 embeddings only |
| entities.db | 32.2 GB | Full records with source metadata |
Storage location: ~/.cache/corp-extractor/entities-v2.db (v0.9.4+)
HuggingFace repo: Corp-o-Rate-Community/entity-references
Automatic Download
If you use the pipeline without downloading first, the database downloads automatically:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced...")
# Database downloaded automatically if not present
Database Schema
The database uses SQLite with the sqlite-vec extension for vector similarity search.
Schema v2 (Normalized)
v0.9.4
The v2 schema uses INTEGER foreign keys to enum lookup tables instead of TEXT columns:
-- Enum tables: source_types, people_types, organization_types, location_types
-- Organization: source_id (FK), entity_type_id (FK), region_id (FK to locations)
-- People: source_id (FK), person_type_id (FK), country_id (FK), known_for_role_id (FK)
-- Roles: qid, name, source_id (FK), canon_id
-- Locations: qid, name, source_id (FK), location_type_id (FK), parent_ids (hierarchy)
Organizations Table
CREATE TABLE organizations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID as integer (v0.9.4)
name TEXT NOT NULL,
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT NOT NULL, -- LEI, CIK, Company Number
region_id INTEGER, -- FK to locations(id) (v0.9.4)
entity_type_id INTEGER NOT NULL, -- FK to organization_types(id)
from_date TEXT, -- ISO YYYY-MM-DD
to_date TEXT, -- ISO YYYY-MM-DD
record TEXT NOT NULL, -- JSON (empty in lite version)
UNIQUE(source_identifier, source_id)
);
-- Both float32 and int8 embeddings supported (v0.9.4)
CREATE VIRTUAL TABLE organization_embeddings USING vec0(
org_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE organization_embeddings_scalar USING vec0(
org_id INTEGER PRIMARY KEY, embedding int8[768]
);
People Table
CREATE TABLE people (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID as integer (v0.9.4)
name TEXT NOT NULL,
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT NOT NULL, -- QID, Owner CIK, Person number
country_id INTEGER, -- FK to locations(id) (v0.9.4)
person_type_id INTEGER NOT NULL, -- FK to people_types(id)
known_for_role_id INTEGER, -- FK to roles(id) (v0.9.4)
known_for_org TEXT DEFAULT '',
known_for_org_id INTEGER, -- FK to organizations(id)
from_date TEXT, -- Role start date (ISO)
to_date TEXT, -- Role end date (ISO)
birth_date TEXT, -- ISO YYYY-MM-DD
death_date TEXT, -- ISO YYYY-MM-DD
record TEXT NOT NULL,
UNIQUE(source_identifier, source_id, known_for_role_id, known_for_org_id)
);
CREATE VIRTUAL TABLE person_embeddings USING vec0(
person_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE person_embeddings_scalar USING vec0(
person_id INTEGER PRIMARY KEY, embedding int8[768]
);
New Tables (v0.9.4)
-- Roles table for job titles
CREATE TABLE roles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID (e.g., 484876 for CEO)
name TEXT NOT NULL, -- "Chief Executive Officer"
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
canon_id INTEGER DEFAULT NULL,
UNIQUE(name_normalized, source_id)
);
-- Locations table for geopolitical entities
CREATE TABLE locations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID (e.g., 30 for USA)
name TEXT NOT NULL, -- "United States", "California"
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT, -- "US", "CA"
parent_ids TEXT, -- JSON array of parent location IDs
location_type_id INTEGER NOT NULL, -- FK to location_types(id)
UNIQUE(source_identifier, source_id)
);
Entity Types
Organization EntityTypes
| Category | Types |
|---|---|
| Business | business, fund, branch |
| Non-profit | nonprofit, ngo, foundation, trade_union |
| Government | government, international_org, political_party |
| Education | educational, research |
| Other | healthcare, media, sports, religious, unknown |
Person PersonTypes
| Type | Description | Examples |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials | Presidents, MPs, mayors |
| government | Civil servants, diplomats | Agency heads, ambassadors |
| military | Armed forces personnel | Generals, admirals |
| legal | Judges, lawyers | Supreme Court justices |
| professional | Known for profession | Famous surgeons, architects |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| athlete | Sports figures | LeBron James |
| artist | Traditional creatives | Musicians, actors, painters |
| media | Internet personalities | YouTubers, influencers |
| journalist | Reporters, presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
Simplified Location Types
v0.9.4
| Type | Description | Examples |
|---|---|---|
| continent | Continents | Europe, Asia, Africa |
| country | Sovereign states | United States, Germany, Japan |
| subdivision | States, provinces, regions | California, Bavaria, Ontario |
| city | Cities, towns, municipalities | New York, Paris, Tokyo |
| district | Districts, boroughs, neighborhoods | Manhattan, Westminster |
| historic | Former countries, historic territories | Soviet Union, Prussia |
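The parent_ids column makes it cheap to walk a location up its hierarchy. A sketch, assuming rows loaded into a dict (the real column stores a JSON array of parent location IDs):

```python
def ancestor_chain(location_id, locations):
    """Walk parent_ids up the hierarchy, collecting location names.

    `locations` maps id -> {"name": ..., "parent_ids": [...]}; an
    illustrative in-memory stand-in for the locations table.
    """
    chain, current = [], location_id
    while current is not None:
        row = locations[current]
        chain.append(row["name"])
        parents = row["parent_ids"]
        current = parents[0] if parents else None  # follow the first parent
    return chain

locations = {
    1: {"name": "Manhattan", "parent_ids": [2]},
    2: {"name": "New York", "parent_ids": [3]},
    3: {"name": "New York State", "parent_ids": [4]},
    4: {"name": "United States", "parent_ids": []},
}
chain = ancestor_chain(1, locations)  # district -> city -> subdivision -> country
```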
Data Sources
Organizations
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 5.5M | UK Company Number | UK registered companies |
| GLEIF | 2.6M | LEI (Legal Entity Identifier) | Global legal entities |
| Wikidata | 1.5M | QID | Notable organizations |
| SEC Edgar | 73K | CIK (Central Index Key) | US public companies |
Total: 9.6M+ organizations
People
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 27.5M | Person number | UK company officers |
| Wikidata | 13.4M | QID | Notable people worldwide |
Total: 41M+ people
Other Tables
| Table | Records | Description |
|---|---|---|
| Roles | 94K | Job titles with Wikidata QIDs |
| Locations | 25K | Countries, states, cities with hierarchy |
Python API
Search Organizations
from statement_extractor.database import OrganizationDatabase
db = OrganizationDatabase()
# Search by name (hybrid: text + embedding)
matches = db.search_by_name("Microsoft Corporation", top_k=5)
for match in matches:
print(f"{match.company.name} ({match.company.source}:{match.company.source_id})")
print(f" Similarity: {match.similarity_score:.3f}")
print(f" Type: {match.company.entity_type}")
# Search by embedding
from statement_extractor.database import CompanyEmbedder
embedder = CompanyEmbedder()
embedding = embedder.embed("Microsoft")
matches = db.search(embedding, top_k=10, min_similarity=0.7)
Search People
from statement_extractor.database import PersonDatabase
db = PersonDatabase()
# Search by name
matches = db.search_by_name("Tim Cook", top_k=5)
for match in matches:
print(f"{match.person.name} - {match.person.known_for_role} at {match.person.known_for_org}")
print(f" Wikidata: {match.person.source_id}")
print(f" Type: {match.person.person_type}")
Use in Pipeline
The database is automatically used by qualification plugins:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced new AI features.")
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
# e.g., "Satya Nadella (CEO, Microsoft) --[announced]--> new AI features"
Add Custom Records
from statement_extractor.database import OrganizationDatabase, CompanyRecord, EntityType
db = OrganizationDatabase()
record = CompanyRecord(
name="My Company Inc",
source="custom",
source_id="CUSTOM001",
region="US",
entity_type=EntityType.business,
record={"custom_field": "value"},
)
db.add_record(record)
Building Your Own Database
Import Organizations
# Companies House - UK companies (5.5M records)
corp-extractor db import-companies-house --download
# GLEIF - Global LEI data (2.6M records)
corp-extractor db import-gleif --download
corp-extractor db import-gleif /path/to/lei-data.json --limit 50000
# SEC Edgar - US public companies (73K filers)
corp-extractor db import-sec --download
# Wikidata organizations via SPARQL (1.5M records)
corp-extractor db import-wikidata --limit 50000
Import People
# Import by person type
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --type politician --limit 5000
corp-extractor db import-people --type athlete --limit 5000
# Import all person types
corp-extractor db import-people --all --limit 50000
# Skip existing records (faster for incremental updates)
corp-extractor db import-people --type executive --skip-existing
# Fetch role start/end dates (slower, queries per person)
corp-extractor db import-people --type executive --enrich-dates
Wikidata Dump Import
v0.9.4
For large imports without SPARQL query timeouts:
# Download and import from Wikidata dump (~100GB compressed)
corp-extractor db import-wikidata-dump --download --limit 100000
# Import from local dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2
# Import only people (no organizations)
corp-extractor db import-wikidata-dump --dump dump.bz2 --people --no-orgs
# Import only locations (countries, states, cities) - v0.9.4
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
# Resume interrupted import
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
# Skip records already in database
corp-extractor db import-wikidata-dump --dump dump.bz2 --skip-updates
Fast download with aria2c: Install aria2c for 10-20x faster downloads:
brew install aria2 # macOS
apt install aria2 # Ubuntu/Debian
Full Build Process
# 1. Import from all sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 100000
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000
# 2. Link equivalent records
corp-extractor db canonicalize
# 3. Generate scalar embeddings (75% storage reduction)
corp-extractor db backfill-scalar
# 4. Check status
corp-extractor db status
# 5. Upload to HuggingFace
export HF_TOKEN="hf_..."
corp-extractor db upload
Migrate to v2 Schema
v0.9.4
To migrate an existing v1 database to the normalized v2 schema:
# Create new v2 database (preserves original)
corp-extractor db migrate-v2 entities.db entities-v2.db
# Resume interrupted migration
corp-extractor db migrate-v2 entities.db entities-v2.db --resume
The v2 schema provides:
- INTEGER FK columns instead of TEXT enums (better performance)
- New enum lookup tables for type filtering
- New roles and locations tables
- QIDs as integers (Q prefix stripped)
- Human-readable views with JOINs
Canonicalization
Link equivalent records across sources:
corp-extractor db canonicalize
Organizations
Matches organizations by:
- Global identifiers: LEI, CIK, ticker (no region check needed)
- Normalized name + region: Handles suffix variations (Ltd → Limited, Corp → Corporation)
Source priority: gleif > sec_edgar > companies_house > wikipedia
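The suffix handling can be sketched with a small normalization function; the suffix table here is an illustrative subset, not the library's actual mapping:

```python
import re

# Common suffix abbreviations collapsed to one canonical form (illustrative subset)
SUFFIXES = {
    "ltd": "limited",
    "corp": "corporation",
    "inc": "incorporated",
    "co": "company",
}

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and expand abbreviated suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

# Records from two sources now produce the same key and can be linked
normalize_name("Acme Ltd.")     # "acme limited"
normalize_name("ACME Limited")  # "acme limited"
```

Combined with a region check, equal normalized names are treated as the same organization.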
People
v0.9.3
Matches people by:
- Normalized name + same organization: Uses org canonical group to link people across sources
- Normalized name + overlapping date ranges: Links records with matching tenure periods
Source priority: wikidata > sec_edgar > companies_house
Canonicalization enables prominence-based search re-ranking that boosts entities with records from multiple authoritative sources.
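A sketch of prominence-based re-ranking under assumed weights (the tuple layout and boost value are illustrative, not the library's tuned parameters):

```python
def rerank(matches, boost=0.05):
    """Boost candidates whose canonical group spans multiple sources.

    `matches` are (name, similarity, sources) tuples; each additional
    authoritative source adds a small prominence bonus to the raw
    similarity score.
    """
    def score(match):
        name, similarity, sources = match
        return similarity + boost * (len(set(sources)) - 1)
    return sorted(matches, key=score, reverse=True)

candidates = [
    ("Apple Inc (shell company)", 0.93, ["companies_house"]),
    ("Apple Inc", 0.91, ["gleif", "sec_edgar", "wikidata"]),
]
ranked = rerank(candidates)
# the multi-source record overtakes the slightly closer single-source one
```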
Data Models
CompanyRecord
class CompanyRecord(BaseModel):
name: str # Organization name
source: str # 'gleif', 'sec_edgar', 'companies_house', 'wikipedia'
source_id: str # LEI, CIK, UK Company Number, or QID
region: str # Country/region code
entity_type: EntityType # Classification
from_date: Optional[str] # ISO YYYY-MM-DD
to_date: Optional[str] # ISO YYYY-MM-DD
record: dict[str, Any] # Full source record (empty in lite)
@property
def canonical_id(self) -> str:
return f"{self.source}:{self.source_id}"
PersonRecord
class PersonRecord(BaseModel):
name: str # Display name
source: str # 'wikidata'
source_id: str # Wikidata QID
country: str # Country code
person_type: PersonType # Classification
known_for_role: str # Primary role
known_for_org: str # Primary organization name
known_for_org_id: Optional[int] # FK to organizations
from_date: Optional[str] # Role start (ISO)
to_date: Optional[str] # Role end (ISO)
birth_date: Optional[str] # Birth date (ISO)
death_date: Optional[str] # Death date (ISO)
record: dict[str, Any] # Full source record
@property
def is_historic(self) -> bool:
return self.death_date is not None
Match Results
class CompanyMatch(BaseModel):
company: CompanyRecord
similarity_score: float # 0.0 to 1.0
class PersonMatch(BaseModel):
person: PersonRecord
similarity_score: float # 0.0 to 1.0
llm_confirmed: bool # Whether LLM validated match
Embedding Model
Embeddings are generated using google/embeddinggemma-300m:
- Parameters: 300M (lightweight)
- Dimensions: 768
- Optimized for: CPU inference
- Auto-download: Model downloads automatically on first use
from statement_extractor.database import CompanyEmbedder
embedder = CompanyEmbedder()
embedding = embedder.embed("Apple Inc") # Returns 768-dim numpy array
Troubleshooting
Database not found:
Error: Database not found at ~/.cache/corp-extractor/entities.db
Run corp-extractor db download to fetch the pre-built database.
sqlite-vec extension error:
Error: no such module: vec0
The sqlite-vec extension should install automatically. If not: pip install sqlite-vec
Memory issues with large dumps:
# Import in smaller batches
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates
# Then resume for more
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates --resume
Resume interrupted import:
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
Progress is saved to ~/.cache/corp-extractor/wikidata-dump-progress.json.
API Reference
Functions
The library provides convenience functions for quick extraction without managing extractor instances.
| Function | Returns | Description |
|---|---|---|
| extract_statements(text, options?) | ExtractionResult | Main extraction function. Returns structured statements with confidence scores. |
| extract_statements_as_json(text, options?, indent?) | str | Returns extraction result as a JSON string. |
| extract_statements_as_xml(text, options?) | str | Returns raw XML output from the model. |
| extract_statements_as_dict(text, options?) | dict | Returns extraction result as a Python dictionary. |
Function Signatures
def extract_statements(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> ExtractionResult:
"""
Extract structured statements from text.
Args:
text: Input text to extract statements from
options: Extraction options (or pass individual options as kwargs)
**kwargs: Individual option overrides (num_beams, diversity_penalty, etc.)
Returns:
ExtractionResult containing Statement objects
"""
def extract_statements_as_json(
text: str,
options: Optional[ExtractionOptions] = None,
indent: Optional[int] = 2,
**kwargs
) -> str:
"""Returns JSON string representation of the extraction result."""
def extract_statements_as_xml(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> str:
"""Returns XML string with <statements> containing <stmt> elements."""
def extract_statements_as_dict(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> dict:
"""Returns dictionary representation of the extraction result."""
Usage Examples
from statement_extractor import extract_statements, extract_statements_as_json
# Basic extraction
result = extract_statements("Apple acquired Beats for $3 billion.")
for stmt in result:
print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
# With options via kwargs
result = extract_statements(
"Tesla announced new factories.",
num_beams=6,
diversity_penalty=1.5
)
# JSON output
json_str = extract_statements_as_json("OpenAI released GPT-4.", indent=2)
print(json_str)
Classes
StatementExtractor
The main extractor class with full control over device, model loading, and extraction options.
class StatementExtractor:
def __init__(
self,
model_id: str = "Corp-o-Rate-Community/statement-extractor",
device: Optional[str] = None,
torch_dtype: Optional[torch.dtype] = None,
predicate_taxonomy: Optional[PredicateTaxonomy] = None,
predicate_config: Optional[PredicateComparisonConfig] = None,
scoring_config: Optional[ScoringConfig] = None,
):
"""
Initialize the statement extractor.
Args:
model_id: HuggingFace model ID or local path
device: Device to use ('cuda', 'cpu', or None for auto-detect)
torch_dtype: Torch dtype (default: bfloat16 on GPU, float32 on CPU)
predicate_taxonomy: Optional taxonomy for predicate normalization
predicate_config: Configuration for predicate comparison
scoring_config: Configuration for quality scoring
"""
def extract(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> ExtractionResult:
"""Extract statements from text."""
def extract_as_xml(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> str:
"""Extract statements and return raw XML output."""
def extract_as_json(
self,
text: str,
options: Optional[ExtractionOptions] = None,
indent: Optional[int] = 2,
) -> str:
"""Extract statements and return JSON string."""
def extract_as_dict(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> dict:
"""Extract statements and return as dictionary."""
Example: Custom extractor with GPU control
from statement_extractor import StatementExtractor, ExtractionOptions
# Force CPU usage
extractor = StatementExtractor(device="cpu")
# Extract with custom options
options = ExtractionOptions(num_beams=6, diversity_penalty=1.2)
result = extractor.extract("Microsoft partnered with OpenAI.", options)
ExtractionOptions
Configuration for the extraction process.
class ExtractionOptions(BaseModel):
# Beam search parameters
num_beams: int = 4 # 1-16, beams for diverse beam search
diversity_penalty: float = 1.0 # >= 0.0, penalty for beam diversity
max_new_tokens: int = 2048 # 128-8192, max tokens to generate
min_statement_ratio: float = 1.0 # >= 0.0, min statements per sentence
max_attempts: int = 3 # 1-10, extraction retry attempts
deduplicate: bool = True # Remove duplicate statements
# Predicate taxonomy & comparison
predicate_taxonomy: Optional[PredicateTaxonomy] = None
predicate_config: Optional[PredicateComparisonConfig] = None
# Scoring configuration (v0.2.0)
scoring_config: Optional[ScoringConfig] = None
# Pluggable canonicalization
entity_canonicalizer: Optional[Callable[[str], str]] = None
# Mode flags
merge_beams: bool = True # Merge top-N beams vs select best
embedding_dedup: bool = True # Use embedding similarity for dedup
ScoringConfig
Quality scoring parameters for beam selection and triple assessment. Added in v0.2.0.
class ScoringConfig(BaseModel):
quality_weight: float = 1.0 # >= 0.0, weight for confidence scores
coverage_weight: float = 0.5 # >= 0.0, bonus for source text coverage
redundancy_penalty: float = 0.3 # >= 0.0, penalty for duplicate triples
length_penalty: float = 0.1 # >= 0.0, penalty for verbosity
min_confidence: float = 0.0 # 0.0-1.0, minimum confidence threshold
merge_top_n: int = 3 # 1-10, beams to merge when merge_beams=True
Tuning for precision vs recall:
| Use Case | min_confidence | Notes |
|---|---|---|
| High recall | 0.0 | Keep all extractions |
| Balanced | 0.5 | Filter low-confidence triples |
| High precision | 0.8 | Only keep high-confidence triples |
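The min_confidence knob is a simple floor on the per-triple score. A stand-in sketch of the filtering behaviour (tuples replace Statement objects for illustration):

```python
def filter_by_confidence(statements, min_confidence):
    """Drop extracted triples whose confidence is below the floor.

    Mirrors the effect of ScoringConfig(min_confidence=...) inside the
    extractor; `statements` are (triple, confidence) stand-in tuples.
    """
    return [s for s in statements if s[1] >= min_confidence]

extracted = [
    (("Apple", "acquired", "Beats"), 0.92),
    (("Apple", "considered", "a partnership"), 0.41),
    (("Beats", "priced at", "$3 billion"), 0.66),
]
high_recall = filter_by_confidence(extracted, 0.0)     # keeps all three
balanced = filter_by_confidence(extracted, 0.5)        # drops the 0.41 triple
high_precision = filter_by_confidence(extracted, 0.8)  # keeps only the 0.92 triple
```

Raising the floor trades recall for precision, exactly as the table above describes.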
PredicateTaxonomy
A taxonomy of canonical predicates for normalization.
class PredicateTaxonomy(BaseModel):
predicates: list[str] # List of canonical predicate forms
name: Optional[str] = None # Optional taxonomy name
@classmethod
def from_file(cls, path: str | Path) -> "PredicateTaxonomy":
"""Load taxonomy from a file (one predicate per line)."""
@classmethod
def from_list(cls, predicates: list[str], name: Optional[str] = None) -> "PredicateTaxonomy":
"""Create taxonomy from a list of predicates."""
Example:
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements
# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "works_for", "located_in", "partnered_with"
])
# Use in extraction
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements("Google bought YouTube.", options)
# predicate "bought" maps to canonical "acquired"
PredicateComparisonConfig
Configuration for embedding-based predicate comparison.
class PredicateComparisonConfig(BaseModel):
embedding_model: str = "sentence-transformers/paraphrase-MiniLM-L6-v2"
similarity_threshold: float = 0.65 # 0.0-1.0, min similarity for taxonomy match
dedup_threshold: float = 0.65 # 0.0-1.0, min similarity for duplicates
normalize_text: bool = True # Lowercase and strip before embedding
Data Models
All data models use Pydantic for validation and serialization.
Statement
A single extracted subject-predicate-object triple.
class Statement(BaseModel):
subject: Entity # The subject entity
predicate: str # The relationship/predicate
object: Entity # The object entity
source_text: Optional[str] = None # Original text span
# Quality scoring fields (v0.2.0)
confidence_score: Optional[float] = None # 0.0-1.0, quality score (semantic + entity)
evidence_span: Optional[tuple[int, int]] = None # Character offsets in source
canonical_predicate: Optional[str] = None # Canonical form if taxonomy used
def as_triple(self) -> tuple[str, str, str]:
"""Return as (subject, predicate, object) tuple."""
def __str__(self) -> str:
"""Format: 'subject -- predicate --> object'"""
Example:
stmt = result.statements[0]
print(stmt.subject.text) # "Apple Inc."
print(stmt.predicate) # "acquired"
print(stmt.object.text) # "Beats Electronics"
print(stmt.confidence_score) # 0.92
print(stmt.as_triple()) # ("Apple Inc.", "acquired", "Beats Electronics")
Entity
An entity representing a subject or object.
class Entity(BaseModel):
text: str # The entity text
type: EntityType = UNKNOWN # The entity type
def __str__(self) -> str:
"""Format: 'text (TYPE)'"""
EntityType
Enumeration of supported entity types.
class EntityType(str, Enum):
ORG = "ORG" # Organization
PERSON = "PERSON" # Person
GPE = "GPE" # Geopolitical entity (country, city, state)
LOC = "LOC" # Non-GPE location
PRODUCT = "PRODUCT" # Product
EVENT = "EVENT" # Event
WORK_OF_ART = "WORK_OF_ART" # Creative work
LAW = "LAW" # Legal document
DATE = "DATE" # Date or time
MONEY = "MONEY" # Monetary value
PERCENT = "PERCENT" # Percentage
QUANTITY = "QUANTITY" # Quantity or measurement
UNKNOWN = "UNKNOWN" # Unknown type
ExtractionResult
Container for extraction results. Supports iteration and length.
class ExtractionResult(BaseModel):
statements: list[Statement] = [] # List of extracted statements
source_text: Optional[str] = None # Original input text
def __len__(self) -> int:
"""Number of statements."""
def __iter__(self):
"""Iterate over statements."""
def to_triples(self) -> list[tuple[str, str, str]]:
"""Return all statements as (subject, predicate, object) tuples."""
Example:
result = extract_statements(text)
# Iterate directly
for stmt in result:
print(stmt)
# Check count
print(f"Found {len(result)} statements")
# Get as simple tuples
triples = result.to_triples()
PredicateMatch
Result of matching a predicate to a canonical form.
class PredicateMatch(BaseModel):
original: str # The original extracted predicate
canonical: Optional[str] = None # Matched canonical predicate, if any
similarity: float = 0.0 # 0.0-1.0, cosine similarity score
matched: bool = False # Whether a match was found above threshold
Example:
from statement_extractor import PredicateComparer, PredicateTaxonomy
taxonomy = PredicateTaxonomy.from_list(["acquired", "founded", "works_for"])
comparer = PredicateComparer(taxonomy=taxonomy)
match = comparer.match_to_canonical("bought")
print(match.original) # "bought"
print(match.canonical) # "acquired"
print(match.similarity) # ~0.82
print(match.matched) # True
Pipeline API
NEW in v0.5.0
The pipeline API provides comprehensive entity resolution and taxonomy classification through a 5-stage plugin architecture.
ExtractionPipeline
The main orchestrator class that runs all pipeline stages.
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
class ExtractionPipeline:
def __init__(self, config: Optional[PipelineConfig] = None):
"""
Initialize the extraction pipeline.
Args:
config: Pipeline configuration (default: all stages enabled)
"""
def process(self, text: str, metadata: Optional[dict] = None) -> PipelineContext:
"""
Process text through the pipeline stages.
Args:
text: Input text to process
metadata: Optional source metadata (document ID, URL, etc.)
Returns:
PipelineContext with results from all stages
"""
Example:
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans.")
print(f"Statements: {ctx.statement_count}")
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
PipelineConfig
Configuration for stage and plugin selection.
from statement_extractor.pipeline import PipelineConfig
class PipelineConfig(BaseModel):
enabled_stages: set[int] = {1, 2, 3, 4, 5} # Stages to run (1-5)
enabled_plugins: Optional[set[str]] = None # Plugins to enable (None = all)
disabled_plugins: set[str] = set() # Plugins to disable
fail_fast: bool = False # Stop on first error
parallel_processing: bool = False # Enable parallel processing
max_statements: Optional[int] = None # Limit statements processed
# Stage-specific options
splitter_options: dict = {}
extractor_options: dict = {}
qualifier_options: dict = {}
labeler_options: dict = {}
taxonomy_options: dict = {}
@classmethod
def from_stage_string(cls, stages: str, **kwargs) -> "PipelineConfig":
"""Create config from stage string like '1-3' or '1,2,5'."""
@classmethod
def default(cls) -> "PipelineConfig":
"""All stages enabled."""
@classmethod
def minimal(cls) -> "PipelineConfig":
"""Only splitting and extraction (stages 1-2)."""
Example:
# Run only stages 1-3
config = PipelineConfig(enabled_stages={1, 2, 3})
# Disable specific plugins
config = PipelineConfig(disabled_plugins={"sec_edgar_qualifier"})
# From stage string
config = PipelineConfig.from_stage_string("1-3")
PipelineContext
Data container that flows through all pipeline stages.
from statement_extractor.pipeline import PipelineContext
class PipelineContext(BaseModel):
# Input
source_text: str # Original input text
source_metadata: dict = {} # Document metadata
# Stage outputs
raw_triples: list[RawTriple] = [] # Stage 1 output
statements: list[PipelineStatement] = [] # Stage 2 output
canonical_entities: dict[str, CanonicalEntity] = {} # Stage 3 output
labeled_statements: list[LabeledStatement] = [] # Stage 4 output
taxonomy_results: dict[tuple, list[TaxonomyResult]] = {} # Stage 5 output (multiple labels per statement)
# Processing metadata
processing_errors: list[str] = []
processing_warnings: list[str] = []
stage_timings: dict[str, float] = {}
@property
def statement_count(self) -> int:
"""Number of statements in final output."""
@property
def has_errors(self) -> bool:
"""Check if any errors occurred."""
PluginRegistry
Registry for discovering and managing plugins.
from statement_extractor.pipeline import PluginRegistry
class PluginRegistry:
@classmethod
def list_plugins(cls, stage: Optional[int] = None) -> list[dict]:
"""List all registered plugins, optionally filtered by stage."""
@classmethod
def get_plugin(cls, name: str) -> Optional[BasePlugin]:
"""Get a plugin by name."""
Pipeline Data Models
RawTriple
Output of Stage 1 (Splitting).
class RawTriple(BaseModel):
subject_text: str # Raw subject text
predicate_text: str # Raw predicate text
object_text: str # Raw object text
source_sentence: str # Source sentence
confidence: float = 1.0 # Extraction confidence (0-1)
def as_tuple(self) -> tuple[str, str, str]:
"""Return as (subject, predicate, object) tuple."""
PipelineStatement
Output of Stage 2 (Extraction).
class PipelineStatement(BaseModel):
subject: ExtractedEntity # Subject with type, span, confidence
predicate: str # Predicate text
predicate_category: Optional[str] # Predicate category (e.g., "employment_leadership")
object: ExtractedEntity # Object with type, span, confidence
source_text: str # Source text
confidence_score: float = 1.0 # Overall confidence (from GLiNER2)
extraction_method: Optional[str] # Method: gliner_relation
Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. Relations are sorted by confidence (descending).
GLiNER2Extractor
The Stage 2 extractor plugin that uses GLiNER2 for relation extraction.
from statement_extractor.plugins.extractors.gliner2 import GLiNER2Extractor
class GLiNER2Extractor(BaseExtractorPlugin):
def __init__(
self,
predicates: Optional[list[str]] = None,
predicates_file: Optional[str | Path] = None,
entity_types: Optional[list[str]] = None,
use_default_predicates: bool = True,
):
"""
Initialize the GLiNER2 extractor.
Args:
predicates: Custom list of predicate names
predicates_file: Path to custom predicates JSON file
entity_types: Entity types to extract (default: all)
use_default_predicates: Use the 324 built-in predicates when no custom ones are provided
"""
Key behaviors:
- Uses include_confidence=True for real confidence scores from GLiNER2
- Iterates through 21 predicate categories to stay under GLiNER2's ~25 label limit
- Returns all matching relations per source sentence (filtered later)
- Predicates loaded from default_predicates.json (324 predicates)
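The per-category iteration exists because GLiNER2 can only score a limited number of labels (roughly 25) in a single call. The batching idea behind that behavior, as a standalone sketch rather than the library's actual code:

```python
def batch_labels(labels: list[str], max_per_call: int = 25) -> list[list[str]]:
    """Split a flat label list into chunks small enough for one GLiNER2 call."""
    return [labels[i:i + max_per_call] for i in range(0, len(labels), max_per_call)]

# The 324 built-in predicates fit in 13 calls of at most 25 labels each
batches = batch_labels([f"predicate_{i}" for i in range(324)])
```

Grouping predicates by category (as the extractor does) additionally keeps semantically related labels in the same call.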
EntityQualifiers
Qualifiers added in Stage 3.
class EntityQualifiers(BaseModel):
# Semantic qualifiers
org: Optional[str] = None # Organization/employer
role: Optional[str] = None # Job title/position
# Location qualifiers
region: Optional[str] = None # State/province
country: Optional[str] = None # Country
city: Optional[str] = None # City
jurisdiction: Optional[str] = None # Legal jurisdiction
# External identifiers
identifiers: dict[str, str] = {} # lei, ch_number, sec_cik, ticker, etc.
def has_any_qualifier(self) -> bool:
"""Check if any qualifier is set."""
CanonicalMatch
Result of canonical matching in Stage 3.
class CanonicalMatch(BaseModel):
canonical_id: Optional[str] # ID in canonical database
canonical_name: Optional[str] # Canonical name/label
match_method: str # identifier, name_exact, name_fuzzy, embedding
match_confidence: float = 1.0 # Confidence in match (0-1)
match_details: Optional[dict] # Additional match details
CanonicalEntity
Output of Stage 3 (Entity Qualification).
class CanonicalEntity(BaseModel):
entity_ref: str # Reference to original entity
original_text: str # Original entity text
entity_type: EntityType # Entity type
qualifiers: EntityQualifiers # Qualifiers and identifiers
canonical_match: Optional[CanonicalMatch] # Canonical match if found
fqn: str # Fully Qualified Name
qualification_sources: list[str] # Plugins that contributed
StatementLabel
A label applied in Stage 4.
class StatementLabel(BaseModel):
label_type: str # sentiment, relation_type, confidence
label_value: Union[str, float, bool] # The label value
confidence: float = 1.0 # Confidence in label
labeler: Optional[str] # Plugin that produced the label
LabeledStatement
Final output from Stage 4 (Labeling).
class LabeledStatement(BaseModel):
statement: PipelineStatement # Original statement
subject_canonical: CanonicalEntity # Canonicalized subject
object_canonical: CanonicalEntity # Canonicalized object
labels: list[StatementLabel] = [] # Applied labels
@property
def subject_fqn(self) -> str:
"""Subject's fully qualified name."""
@property
def object_fqn(self) -> str:
"""Object's fully qualified name."""
def get_label(self, label_type: str) -> Optional[StatementLabel]:
"""Get label by type."""
def as_dict(self) -> dict:
"""Convert to simplified dictionary."""
Example:
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
# Access labels
sentiment = stmt.get_label("sentiment")
if sentiment:
print(f" Sentiment: {sentiment.label_value}")
# Access qualifiers
subject_quals = stmt.subject_canonical.qualifiers
if subject_quals.role:
print(f" Role: {subject_quals.role}")
TaxonomyResult
Output of Stage 5 (Taxonomy) classification.
class TaxonomyResult(BaseModel):
taxonomy_name: str # e.g., "esg_topics"
category: str # Top-level category
label: str # Specific label
label_id: Optional[int] = None # Numeric ID if available
confidence: float = 1.0 # Classification confidence (0-1)
classifier: Optional[str] = None # Plugin that produced this result
metadata: dict = {} # Additional metadata
@property
def full_label(self) -> str:
"""Return category:label format."""
Example:
# Access taxonomy results from context
# Each statement may have multiple labels above the threshold
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f"Statement: {source_text[:50]}...")
print(f" Taxonomy: {taxonomy_name}")
print(f" Labels ({len(results)}):")
for result in results:
print(f" - {result.full_label} (confidence: {result.confidence:.2f})")
ClassificationSchema
Schema for simple multi-choice classification (2-20 options). Used by labelers that need GLiNER2 to perform classification.
class ClassificationSchema(BaseModel):
label_type: str # e.g., "sentiment"
choices: list[str] # Available choices
description: str = "" # Description for the classifier
scope: str = "statement" # statement or entity
TaxonomySchema
Schema for large taxonomy classification (100+ values). Used by taxonomy plugins.
class TaxonomySchema(BaseModel):
label_type: str # e.g., "taxonomy"
values: list[str] | dict[str, list[str]] # Flat list or category -> labels
description: str = ""
scope: str = "statement"
label_descriptions: Optional[dict[str, str]] = None # Descriptions for labels
Configuration
The statement-extractor library provides fine-grained control over extraction behavior through configuration classes. This section covers all configuration options for tuning precision, recall, and performance.
ExtractionOptions
The primary configuration class for controlling extraction behavior.
| Parameter | Type | Default | Description |
|---|---|---|---|
num_beams | int | 4 | Number of beam search candidates |
diversity_penalty | float | 1.0 | Penalty for beam diversity in diverse beam search |
max_new_tokens | int | 2048 | Maximum generation length in tokens |
deduplicate | bool | True | Remove duplicate statements from output |
merge_beams | bool | True | Merge top beams into single result set (v0.2.0) |
embedding_dedup | bool | True | Use embedding similarity for deduplication (v0.2.0) |
predicates | list[str] | None | Predefined predicates for GLiNER2 relation extraction (v0.4.0) |
all_triples | bool | False | Keep all candidate triples instead of best per source |
predicate_taxonomy | PredicateTaxonomy | None | Taxonomy of canonical predicates |
scoring_config | ScoringConfig | None | Quality scoring configuration |
entity_canonicalizer | Callable | None | Custom function for entity canonicalization |
Basic usage:
from statement_extractor import ExtractionOptions, extract_statements
options = ExtractionOptions(
num_beams=6,
diversity_penalty=1.2,
deduplicate=True
)
result = extract_statements("Apple acquired Beats for $3 billion.", options)
ScoringConfig
Added in v0.2.0
Configuration for quality scoring, filtering, and beam selection. Use this to tune the precision-recall tradeoff.
| Parameter | Type | Default | Description |
|---|---|---|---|
min_confidence | float | 0.0 | Filter threshold (0=recall, 0.7+=precision) |
quality_weight | float | 1.0 | Weight for confidence scores |
coverage_weight | float | 0.5 | Weight for source text coverage |
redundancy_penalty | float | 0.3 | Penalty for duplicate triples |
length_penalty | float | 0.1 | Penalty for verbose predicates/entities |
merge_top_n | int | 3 | Number of beams to merge |
Common configurations:
from statement_extractor import ScoringConfig, ExtractionOptions, extract_statements
# High precision mode - only keep confident extractions
precision_config = ScoringConfig(
min_confidence=0.7,
quality_weight=1.5,
redundancy_penalty=0.5
)
# High recall mode - keep everything
recall_config = ScoringConfig(
min_confidence=0.0,
quality_weight=0.5,
redundancy_penalty=0.1
)
# Use in extraction
options = ExtractionOptions(scoring_config=precision_config)
result = extract_statements(text, options)
Precision vs recall tuning:
| Use Case | min_confidence | quality_weight | Notes |
|---|---|---|---|
| Maximum recall | 0.0 | 0.5 | Keep all extractions |
| Balanced | 0.4 | 1.0 | Good default |
| High precision | 0.7 | 1.5 | Fewer false positives |
| Knowledge base | 0.8 | 2.0 | Very strict |
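If you switch between these modes often, the table can be captured as a small preset helper. The helper and its names are my own illustration, not part of the library:

```python
# Presets mirroring the precision/recall tuning table above
SCORING_PRESETS = {
    "max_recall":     {"min_confidence": 0.0, "quality_weight": 0.5},
    "balanced":       {"min_confidence": 0.4, "quality_weight": 1.0},
    "high_precision": {"min_confidence": 0.7, "quality_weight": 1.5},
    "knowledge_base": {"min_confidence": 0.8, "quality_weight": 2.0},
}

def scoring_kwargs(use_case: str) -> dict:
    """Look up keyword arguments for a named use case."""
    return dict(SCORING_PRESETS[use_case])

# e.g. ScoringConfig(**scoring_kwargs("balanced"))
```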
PredicateComparisonConfig
Added in v0.2.0
Configuration for embedding-based predicate comparison and taxonomy matching. Requires the [embeddings] extra.
| Parameter | Type | Default | Description |
|---|---|---|---|
embedding_model | str | paraphrase-MiniLM-L6-v2 | Model for computing similarity |
similarity_threshold | float | 0.65 | Minimum similarity for taxonomy matching |
dedup_threshold | float | 0.65 | Minimum similarity to consider duplicates |
normalize_text | bool | True | Lowercase/strip predicates before embedding |
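Conceptually, taxonomy matching normalizes the extracted predicate, scores it against each canonical candidate, and accepts the best candidate only if it clears similarity_threshold. A standalone sketch of that decision logic, using a toy token-overlap similarity in place of embedding cosine similarity (this is an illustration, not the library's implementation):

```python
from typing import Callable, Optional

def match_to_canonical(
    predicate: str,
    taxonomy: list[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.65,
    normalize_text: bool = True,
) -> tuple[Optional[str], float]:
    """Return the best taxonomy match above the threshold, plus its score."""
    if normalize_text:
        predicate = predicate.lower().strip()
    best, best_score = None, 0.0
    for candidate in taxonomy:
        score = similarity(predicate, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score

def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap on underscore-separated tokens (stand-in for embeddings)."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)
```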
Custom thresholds:
from statement_extractor import (
PredicateComparisonConfig,
PredicateTaxonomy,
ExtractionOptions,
extract_statements
)
# Stricter matching for precision
config = PredicateComparisonConfig(
similarity_threshold=0.75,
dedup_threshold=0.80,
normalize_text=True
)
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "works_for", "located_in",
"partnered_with", "invested_in", "announced"
])
options = ExtractionOptions(
predicate_taxonomy=taxonomy,
predicate_config=config
)
result = extract_statements("Google bought YouTube in 2006.", options)
PipelineConfig
NEW in v0.5.0
Configuration for the 5-stage extraction pipeline. Controls which stages run, which plugins are enabled, and stage-specific options.
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled_stages | set[int] | {1, 2, 3, 4, 5} | Stages to run (1-5) |
enabled_plugins | set[str] | None | None | Plugins to enable (None = all) |
disabled_plugins | set[str] | set() | Plugins to disable |
fail_fast | bool | False | Stop on first error |
parallel_processing | bool | False | Enable parallel processing |
max_statements | int | None | None | Limit statements processed |
Stage selection examples:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run only splitting and extraction (stages 1-2)
config = PipelineConfig(enabled_stages={1, 2})
# Run stages 1-3 (skip canonicalization and labeling)
config = PipelineConfig(enabled_stages={1, 2, 3})
# From stage string
config = PipelineConfig.from_stage_string("1-3") # {1, 2, 3}
config = PipelineConfig.from_stage_string("1,2,5") # {1, 2, 5}
# Use presets
config = PipelineConfig.default() # All 5 stages
config = PipelineConfig.minimal() # Stages 1-2 only
Plugin selection examples:
# Disable specific plugins
config = PipelineConfig(
disabled_plugins={"sec_edgar_qualifier", "companies_house_qualifier"}
)
# Enable only specific plugins
config = PipelineConfig(
enabled_plugins={"t5_gemma_splitter", "gliner2_extractor", "person_qualifier"}
)
Stage-specific options:
config = PipelineConfig(
splitter_options={
"num_beams": 6,
"diversity_penalty": 1.2,
},
extractor_options={
"predicates_file": "/path/to/custom_predicates.json", # Custom predicate file
},
qualifier_options={
"timeout": 10.0, # API timeout
},
)
GLiNER2 Extractor Options:
| Option | Type | Default | Description |
|---|---|---|---|
predicates_file | str | Path | None | Path to custom predicates JSON file |
predicates | list[str] | None | Custom list of predicate names (overrides file) |
entity_types | list[str] | all types | Entity types to extract |
use_default_predicates | bool | True | Use 324 built-in predicates when no custom ones provided |
Custom Predicates File Format:
{
"category_name": {
"predicate_name": {
"description": "Description for semantic matching",
"threshold": 0.7
}
}
}
Example:
{
"employment": {
"works_for": {"description": "Employment relationship", "threshold": 0.75},
"manages": {"description": "Management relationship", "threshold": 0.7}
},
"ownership": {
"owns": {"description": "Ownership relationship", "threshold": 0.7},
"acquired": {"description": "Acquisition of entity", "threshold": 0.75}
}
}
Stage Combinations
Common stage combinations for different use cases:
| Use Case | Stages | Description |
|---|---|---|
| Fast extraction | {1, 2} | Basic triples with entity types |
| With qualifiers | {1, 2, 3} | Add qualifiers, identifiers, canonical forms, FQNs |
| With labels | {1, 2, 3, 4} | Add sentiment and other statement labels (no taxonomy) |
| Complete pipeline | {1, 2, 3, 4, 5} | All stages including taxonomy |
| Labeling only | {1, 2, 4} | Skip qualification/canonicalization |
# Fast extraction for high-volume processing
fast_config = PipelineConfig.minimal()
# Full resolution for knowledge graph building
full_config = PipelineConfig.default()
# Custom: qualifiers without external APIs
internal_config = PipelineConfig(
enabled_stages={1, 2, 3, 4, 5},
disabled_plugins={"gleif_qualifier", "companies_house_qualifier", "sec_edgar_qualifier"},
)
Entity Types
Corp-extractor classifies extracted subjects and objects into 13 entity types based on common Named Entity Recognition (NER) standards. Understanding these types helps you filter and process extracted statements effectively.
Complete Entity Type Reference
| Type | Description | Examples |
|---|---|---|
ORG | Organizations, companies, agencies | Apple Inc., United Nations, FBI |
PERSON | Individual people | Tim Cook, Elon Musk, Jane Doe |
GPE | Geopolitical entities (countries, cities, states) | United States, California, Paris |
LOC | Non-GPE locations | Pacific Ocean, Mount Everest, Central Park |
PRODUCT | Products and services | iPhone 15, Model S, Gmail |
EVENT | Events and occurrences | CES 2024, Annual Meeting, World Cup |
WORK_OF_ART | Creative works, documents, reports | Sustainability Report, Mona Lisa |
LAW | Legal documents and regulations | GDPR, Clean Air Act, Section 230 |
DATE | Dates and time periods | Q3 2024, January 15, 2030 |
MONEY | Monetary values | $4.7 billion, 100 million euros |
PERCENT | Percentages | 30%, 0.5%, 100% |
QUANTITY | Quantities and measurements | 1,000 employees, 50 megawatts |
UNKNOWN | Unclassified entities (fallback) | (varies) |
Accessing Entity Types in Code
Each extracted statement contains subject and object entities with a type attribute:
from statement_extractor import extract_statements
result = extract_statements("Apple CEO Tim Cook announced the iPhone 15.")
for stmt in result:
print(f"Subject: {stmt.subject.text} ({stmt.subject.type})")
print(f"Object: {stmt.object.text} ({stmt.object.type})")
Output:
Subject: Apple (ORG)
Object: Tim Cook (PERSON)
Subject: Tim Cook (PERSON)
Object: iPhone 15 (PRODUCT)
You can also import the EntityType enum for type checking and comparisons:
from statement_extractor import extract_statements, EntityType
result = extract_statements("Microsoft acquired Activision for $69 billion.")
for stmt in result:
if stmt.subject.type == EntityType.ORG:
print(f"Organization found: {stmt.subject.text}")
if stmt.object.type == EntityType.MONEY:
print(f"Monetary value: {stmt.object.text}")
Filtering by Entity Type
A common use case is extracting only statements involving specific entity types. Here is how to filter statements by subject or object type:
from statement_extractor import extract_statements, EntityType
text = """
Apple announced revenue of $94.8 billion for Q3 2024.
CEO Tim Cook presented at the company's Cupertino headquarters.
The new iPhone 16 features improved battery life of 22 hours.
"""
result = extract_statements(text)
# Filter for statements where subject is an organization
org_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.ORG
]
# Filter for statements involving monetary values
money_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.MONEY or stmt.object.type == EntityType.MONEY
]
# Filter for statements about people
person_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.PERSON or stmt.object.type == EntityType.PERSON
]
print(f"Found {len(org_statements)} statements from organizations")
print(f"Found {len(money_statements)} statements with monetary values")
print(f"Found {len(person_statements)} statements about people")
The UNKNOWN Type
The UNKNOWN entity type is used as a fallback when the model cannot confidently classify an entity into one of the 12 standard categories. This typically occurs with:
- Specialized domain terms: Technical jargon, industry-specific terminology
- Ambiguous entities: Terms that could fit multiple categories depending on context
- Novel entities: New terms or concepts not well-represented in training data
- Abstract concepts: Ideas or qualities that do not fit standard NER categories
from statement_extractor import extract_statements, EntityType
result = extract_statements("The synergy initiative improved operational efficiency.")
for stmt in result:
if stmt.subject.type == EntityType.UNKNOWN:
print(f"Unclassified entity: {stmt.subject.text}")
# Consider manual review or domain-specific handling
When you encounter UNKNOWN entities, consider:
- Manual review: Inspect the entity text to determine appropriate handling
- Domain mapping: Create application-specific mappings for recurring unknown entities
- Context analysis: Use surrounding statements to infer the entity's likely type
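A domain mapping from the second bullet might look like the following. The entity names and target types here are purely illustrative:

```python
# Application-specific fallback types for recurring UNKNOWN entities (illustrative)
DOMAIN_TYPE_MAP = {
    "synergy initiative": "EVENT",
    "carbon intensity": "QUANTITY",
}

def resolve_entity_type(text: str, detected_type: str) -> str:
    """Fall back to a domain mapping when the model returns UNKNOWN."""
    if detected_type != "UNKNOWN":
        return detected_type
    return DOMAIN_TYPE_MAP.get(text.lower().strip(), "UNKNOWN")
```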
Entity Type Standards
Corp-extractor's entity types are based on widely-adopted NER standards, including:
- OntoNotes 5.0: The primary source for entity type definitions
- ACE (Automatic Content Extraction): Influences the GPE vs LOC distinction
- CoNLL-2003: Foundational NER task categories
This alignment with established standards ensures compatibility with other NLP tools and facilitates integration into existing data pipelines.
Examples
This section provides practical examples demonstrating common use cases for the corp-extractor library.
Basic Extraction
Extract statements from text and format the output:
from statement_extractor import extract_statements
text = """
Microsoft announced a partnership with OpenAI in 2019.
The deal was valued at $1 billion and aimed to develop
artificial general intelligence.
"""
result = extract_statements(text)
# Iterate over statements
for stmt in result:
subject = f"{stmt.subject.text} ({stmt.subject.type})"
object_ = f"{stmt.object.text} ({stmt.object.type})"
print(f"{subject} -- {stmt.predicate} --> {object_}")
# Check confidence scores
for stmt in result:
score = stmt.confidence_score or 0.0
print(f"[{score:.2f}] {stmt}")
Output:
Microsoft (ORG) -- partnered with --> OpenAI
Microsoft (ORG) -- announced --> partnership
OpenAI (ORG) -- partnership valued at --> $1 billion
Microsoft (ORG) -- aims to develop --> artificial general intelligence
Batch Processing
Use the StatementExtractor class for processing multiple texts efficiently. The model loads once and is reused for all extractions:
from statement_extractor import StatementExtractor
# Initialize extractor with GPU
extractor = StatementExtractor(device="cuda")
texts = [
"Apple acquired Beats Electronics for $3 billion.",
"Google was founded by Larry Page and Sergey Brin in 1998.",
"Amazon announced a new fulfillment center in Texas."
]
# Process multiple texts
for text in texts:
result = extractor.extract(text)
print(f"Found {len(result)} statements in: {text[:40]}...")
for stmt in result:
print(f" - {stmt}")
print()
For CPU-only environments:
# Force CPU usage
extractor = StatementExtractor(device="cpu")
Confidence Filtering
Added in v0.2.0
Filter statements by confidence score to control precision vs recall:
from statement_extractor import extract_statements, ScoringConfig, ExtractionOptions
text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."
# High precision mode - only high-confidence statements
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)
print("High-confidence statements:")
for stmt in result:
print(f" [{stmt.confidence_score:.2f}] {stmt}")
You can also filter after extraction for more control:
# Extract all statements first
result = extract_statements(text)
# Apply custom thresholds
high_confidence = [s for s in result if (s.confidence_score or 0) >= 0.8]
medium_confidence = [s for s in result if 0.5 <= (s.confidence_score or 0) < 0.8]
low_confidence = [s for s in result if (s.confidence_score or 0) < 0.5]
print(f"High: {len(high_confidence)}, Medium: {len(medium_confidence)}, Low: {len(low_confidence)}")
Predicate Taxonomy
Map extracted predicates to a controlled vocabulary of canonical forms:
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements
# Define your canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
"acquired", "founded", "works_for", "announced",
"invested_in", "partnered_with", "committed_to"
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
text = "Google bought YouTube in 2006. Sequoia Capital backed the video platform."
result = extract_statements(text, options)
# View predicate normalization
for stmt in result:
original = stmt.predicate
canonical = stmt.canonical_predicate
if canonical and canonical != original:
print(f"'{original}' -> '{canonical}'")
print(f" {stmt.subject.text} -- {canonical or original} --> {stmt.object.text}")
Output:
'bought' -> 'acquired'
Google -- acquired --> YouTube
'backed' -> 'invested_in'
Sequoia Capital -- invested_in --> YouTube
Load taxonomy from a file:
# predicates.txt contains one predicate per line
taxonomy = PredicateTaxonomy.from_file("predicates.txt")
Export Formats
Export extraction results in multiple formats for integration with other systems:
from statement_extractor import (
extract_statements,
extract_statements_as_json,
extract_statements_as_xml,
extract_statements_as_dict
)
text = "Netflix acquired Spry Fox, a game development studio, in 2022."
# JSON output (default 2-space indent)
json_str = extract_statements_as_json(text)
print(json_str)
# Compact JSON
json_compact = extract_statements_as_json(text, indent=None)
# XML output (raw model format)
xml_str = extract_statements_as_xml(text)
print(xml_str)
# Dictionary output (for programmatic use)
data = extract_statements_as_dict(text)
for stmt in data["statements"]:
print(f"{stmt['subject']['text']} -> {stmt['predicate']} -> {stmt['object']['text']}")
JSON output format:
{
"statements": [
{
"subject": {"text": "Netflix", "type": "ORG"},
"predicate": "acquired",
"object": {"text": "Spry Fox", "type": "ORG"},
"source_text": "Netflix acquired Spry Fox",
"confidence_score": 0.94
}
],
"source_text": "Netflix acquired Spry Fox, a game development studio, in 2022."
}
Disabling Embeddings
Skip embedding-based features for faster processing when you don't need predicate normalization or semantic deduplication:
from statement_extractor import ExtractionOptions, extract_statements
# Disable embedding-based deduplication
options = ExtractionOptions(
embedding_dedup=False, # Use exact string matching for dedup
predicate_taxonomy=None # No predicate normalization
)
result = extract_statements(text, options)
When to disable embeddings:
| Scenario | Recommendation |
|---|---|
| Speed critical | Disable embeddings |
| No GPU available | Consider disabling for faster CPU processing |
| Need semantic dedup | Keep embeddings enabled |
| Using predicate taxonomy | Keep embeddings enabled |
| Simple text, few duplicates | Disable embeddings |
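With embedding_dedup=False, deduplication falls back to exact string matching, conceptually along these lines (a sketch of the idea, not the library's implementation):

```python
def exact_dedup(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Drop exact duplicates (case/whitespace-insensitive), keeping first occurrences."""
    seen, unique = set(), []
    for triple in triples:
        key = tuple(part.lower().strip() for part in triple)
        if key not in seen:
            seen.add(key)
            unique.append(triple)
    return unique
```

Exact matching is fast but misses paraphrases ("bought" vs "acquired"), which is what the embedding-based pass catches.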
Custom Entity Canonicalization
Provide a custom function to normalize entity names:
from statement_extractor import ExtractionOptions, extract_statements
# Define a canonicalization function
def canonicalize_entity(text: str) -> str:
"""Normalize entity names to canonical forms."""
mappings = {
"apple": "Apple Inc.",
"apple inc": "Apple Inc.",
"apple inc.": "Apple Inc.",
"google": "Alphabet Inc.",
"google llc": "Alphabet Inc.",
"alphabet": "Alphabet Inc.",
"msft": "Microsoft Corporation",
"microsoft": "Microsoft Corporation",
}
return mappings.get(text.lower().strip(), text)
options = ExtractionOptions(entity_canonicalizer=canonicalize_entity)
text = "Apple and Google announced a partnership. Microsoft joined later."
result = extract_statements(text, options)
for stmt in result:
# Entities are now canonicalized
print(f"{stmt.subject.text} -- {stmt.predicate} --> {stmt.object.text}")
Output:
Apple Inc. -- partnered with --> Alphabet Inc.
Microsoft Corporation -- joined --> partnership
Full Pipeline Example
Combining multiple features for production use:
from statement_extractor import (
StatementExtractor,
ExtractionOptions,
ScoringConfig,
PredicateTaxonomy,
PredicateComparisonConfig
)
# Configure scoring for high precision
scoring = ScoringConfig(
min_confidence=0.6,
quality_weight=1.0,
redundancy_penalty=0.5
)
# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "invested_in", "partnered_with",
"announced", "launched", "hired", "appointed"
])
# Configure predicate matching
predicate_config = PredicateComparisonConfig(
similarity_threshold=0.7,
dedup_threshold=0.8
)
# Initialize extractor
extractor = StatementExtractor(
device="cuda",
predicate_taxonomy=taxonomy,
predicate_config=predicate_config,
scoring_config=scoring
)
# Configure extraction options
options = ExtractionOptions(
num_beams=6,
diversity_penalty=1.2,
deduplicate=True,
merge_beams=True
)
# Process text
text = """
Amazon Web Services announced a strategic partnership with Anthropic,
investing up to $4 billion in the AI safety startup. The deal, announced
in September 2023, makes AWS Anthropic's primary cloud provider.
"""
result = extractor.extract(text, options)
print(f"Extracted {len(result)} high-confidence statements:\n")
for stmt in result:
canonical = stmt.canonical_predicate or stmt.predicate
score = stmt.confidence_score or 0.0
print(f"[{score:.2f}] {stmt.subject.text} ({stmt.subject.type})")
print(f" -- {canonical} -->")
print(f" {stmt.object.text} ({stmt.object.type})")
print()
Output:
Extracted 4 high-confidence statements:
[0.92] Amazon Web Services (ORG)
-- partnered_with -->
Anthropic (ORG)
[0.88] Amazon Web Services (ORG)
-- invested_in -->
Anthropic (ORG)
[0.85] Amazon Web Services (ORG)
-- invested_in -->
$4 billion (MONEY)
[0.78] AWS (ORG)
-- is primary cloud provider for -->
Anthropic (ORG)
Pipeline Examples
NEW in v0.5.0
Full Pipeline with Corporate Text
Process corporate announcements with full entity resolution:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
pipeline = ExtractionPipeline()
text = """
Amazon CEO Andy Jassy announced plans to hire 10,000 workers in the UK.
The expansion will focus on Amazon Web Services operations in London.
"""
ctx = pipeline.process(text)
print(f"Extracted {ctx.statement_count} statements\n")
for stmt in ctx.labeled_statements:
# FQN includes role and organization
print(f"Subject: {stmt.subject_fqn}")
print(f"Predicate: {stmt.statement.predicate}")
print(f"Object: {stmt.object_fqn}")
# Access labels
for label in stmt.labels:
print(f" {label.label_type}: {label.label_value}")
# Access qualifiers
subject_quals = stmt.subject_canonical.qualifiers
if subject_quals.role:
print(f" Role: {subject_quals.role}")
if subject_quals.org:
print(f" Organization: {subject_quals.org}")
print("-" * 40)
Output:
Extracted 2 statements
Subject: Andy Jassy (CEO, Amazon)
Predicate: announced
Object: plans to hire 10,000 workers in the UK
sentiment: positive
Role: CEO
Organization: Amazon
----------------------------------------
Subject: Amazon (AMZN)
Predicate: expanding operations in
Object: London (UK)
sentiment: positive
----------------------------------------
Running Specific Stages
Skip qualification and canonicalization for faster processing:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run only stages 1 and 2 (splitting + extraction)
config = PipelineConfig(enabled_stages={1, 2})
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("Tim Cook is CEO of Apple Inc.")
# Access Stage 2 output (PipelineStatement)
for stmt in ctx.statements:
print(f"{stmt.subject.text} ({stmt.subject.type.value})")
print(f" --[{stmt.predicate}]-->")
print(f" {stmt.object.text} ({stmt.object.type.value})")
print(f" Confidence: {stmt.confidence_score:.2f}")Using Specific Plugins
Enable only internal plugins (no external API calls):
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Disable external API plugins
config = PipelineConfig(
disabled_plugins={
"gleif_qualifier",
"companies_house_qualifier",
"sec_edgar_qualifier",
}
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("OpenAI CEO Sam Altman announced GPT-5.")
# Will use person_qualifier (local LLM) but skip external lookups
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")Custom Predicates File
Use a custom predicates JSON file instead of the 324 default predicates:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Use custom predicates file
config = PipelineConfig(
extractor_options={
"predicates_file": "/path/to/my_predicates.json"
}
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("John works for Apple Inc.")
# All matching relations are returned
for stmt in ctx.statements:
print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
print(f" Category: {stmt.predicate_category}")
print(f" Confidence: {stmt.confidence_score:.2f}")Custom predicates file format:
{
"employment": {
"works_for": {
"description": "Employment relationship where person works for organization",
"threshold": 0.75
},
"manages": {
"description": "Management relationship where person manages entity",
"threshold": 0.7
}
},
"ownership": {
"owns": {
"description": "Ownership relationship",
"threshold": 0.7
},
"acquired": {
"description": "Acquisition of one entity by another",
"threshold": 0.75
}
}
}
Each category should have fewer than 25 predicates to stay within GLiNER2's training limit for optimal performance.
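A quick way to enforce the per-category limit is to lint the file before passing it to the pipeline. The helper below is a hypothetical utility, not part of corp-extractor; it only assumes the JSON layout shown above (categories mapping to predicate objects):

```python
import json

# Hypothetical helper (not a corp-extractor API): count predicates per
# category in a custom predicates file and warn when a category exceeds
# the 25-predicate guideline.
def check_predicates_file(path: str, limit: int = 25) -> dict:
    with open(path) as f:
        taxonomy = json.load(f)
    counts = {category: len(predicates) for category, predicates in taxonomy.items()}
    for category, n in counts.items():
        if n >= limit:
            print(f"warning: category '{category}' has {n} predicates (limit is {limit})")
    return counts
```

Running it over the example file above would report 2 predicates each for `employment` and `ownership`, comfortably under the limit.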
Accessing Stage Outputs
Access results from each pipeline stage:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced Azure growth.")
# Stage 1: Raw triples
print("=== Stage 1: Raw Triples ===")
for triple in ctx.raw_triples:
print(f" {triple.subject_text} -> {triple.predicate_text} -> {triple.object_text}")
# Stage 2: Statements with types
print("\n=== Stage 2: Statements ===")
for stmt in ctx.statements:
print(f" {stmt.subject.text} ({stmt.subject.type.value}) -> {stmt.predicate}")
# Stage 3: Qualified entities
print("\n=== Stage 3: Qualified Entities ===")
for ref, qualified in ctx.qualified_entities.items():
quals = qualified.qualifiers
print(f" {qualified.original_text}")
if quals.role:
print(f" Role: {quals.role}")
if quals.org:
print(f" Org: {quals.org}")
for id_type, id_value in quals.identifiers.items():
print(f" {id_type}: {id_value}")
# Stage 4: Canonical entities
print("\n=== Stage 4: Canonical Entities ===")
for ref, canonical in ctx.canonical_entities.items():
print(f" {canonical.fqn}")
if canonical.canonical_match:
print(f" Method: {canonical.canonical_match.match_method}")
print(f" Confidence: {canonical.canonical_match.match_confidence:.2f}")
# Stage 5: Labeled statements
print("\n=== Stage 5: Labeled Statements ===")
for stmt in ctx.labeled_statements:
print(f" {stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
for label in stmt.labels:
print(f" {label.label_type}: {label.label_value}")
# Stage 6: Taxonomy results (multiple labels per statement)
print("\n=== Stage 6: Taxonomy Results ===")
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f" Statement: {source_text[:40]}...")
for result in results:
print(f" {result.full_label} (confidence: {result.confidence:.2f})")
# Timings
print("\n=== Stage Timings ===")
for stage, duration in ctx.stage_timings.items():
print(f" {stage}: {duration:.3f}s")Batch Pipeline Processing
Process multiple documents efficiently:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
# Use minimal stages for speed
config = PipelineConfig.minimal() # Stages 1-2 only
pipeline = ExtractionPipeline(config)
documents = [
"Apple announced a new MacBook Pro.",
"Google acquired Fitbit for $2.1 billion.",
"Tesla CEO Elon Musk unveiled the Cybertruck.",
]
all_statements = []
for doc in documents:
ctx = pipeline.process(doc)
for stmt in ctx.statements:
all_statements.append({
"subject": stmt.subject.text,
"subject_type": stmt.subject.type.value,
"predicate": stmt.predicate,
"object": stmt.object.text,
"object_type": stmt.object.type.value,
"confidence": stmt.confidence_score,
"source": doc,
})
print(f"Extracted {len(all_statements)} statements from {len(documents)} documents")Taxonomy Classification
Stage 6
Classify statements against large taxonomies. Multiple labels may match a single statement above the confidence threshold:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
text = """
Apple announced a commitment to carbon neutrality by 2030.
The company also reported reducing packaging waste by 75%.
"""
ctx = pipeline.process(text)
# Access taxonomy classifications (multiple labels per statement)
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f"Statement: {source_text[:50]}...")
print(f" Taxonomy: {taxonomy_name}")
print(f" Labels:")
for result in results:
print(f" - {result.full_label} (confidence: {result.confidence:.2f})")
print()
Output:
Statement: Apple announced a commitment to carbon neutrality...
Taxonomy: esg_topics
Labels:
- environment:carbon_emissions (confidence: 0.87)
- environment_benefit:emissions_reduction (confidence: 0.72)
- governance:sustainability_commitments (confidence: 0.45)
Statement: The company also reported reducing packaging waste...
Taxonomy: esg_topics
Labels:
- environment:waste_management (confidence: 0.92)
- environment_benefit:waste_reduction (confidence: 0.85)
Pipeline with Error Handling
Handle errors and warnings gracefully:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
config = PipelineConfig(fail_fast=False) # Continue on errors
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("Some text that might cause issues...")
# Check for errors
if ctx.has_errors:
print("Errors occurred:")
for error in ctx.processing_errors:
print(f" - {error}")
# Check for warnings
if ctx.processing_warnings:
print("Warnings:")
for warning in ctx.processing_warnings:
print(f" - {warning}")
# Process results that succeeded
print(f"\nSuccessfully extracted {ctx.statement_count} statements")Deployment
Local Inference
Hardware Requirements:
| Resource | Minimum | Notes |
|---|---|---|
| CPU-only | ~4GB RAM | ~30s per extraction |
| GPU | ~2GB VRAM | ~2s per extraction |
| Disk | ~1.5GB | Model download size |
Setup steps:
# Install the library
pip install corp-extractor[embeddings]
# For GPU support, install PyTorch with CUDA first
pip install torch --index-url https://download.pytorch.org/whl/cu121
Running locally:
from statement_extractor import StatementExtractor
# Auto-detect GPU or fall back to CPU
extractor = StatementExtractor()
# Or explicitly set device
extractor = StatementExtractor(device="cuda") # GPU
extractor = StatementExtractor(device="cpu") # CPUThe model uses bfloat16 precision on GPU for faster inference and lower memory usage, and float32 on CPU.
RunPod Serverless
Why RunPod:
- Pay-per-use: ~$0.0002/sec on average
- Scales to zero: No cost when idle
- No infrastructure: Managed GPU containers
Setup steps:
- Clone the repository and build the Docker image:
cd runpod
docker build --platform linux/amd64 -t your-username/statement-extractor .
- Push to Docker Hub:
docker push your-username/statement-extractor
- Create a RunPod serverless endpoint:
- Go to RunPod Console
- Create new endpoint with your Docker image
- Configure GPU type (RTX 3090 recommended)
- Set Active Workers: 0, Max Workers: 1-3
- Call the API:
curl -X POST https://api.runpod.ai/v2/YOUR_ENDPOINT/runsync \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": {"text": "<page>Your text here</page>"}}'Pricing:
| GPU Type | Cost | Notes |
|---|---|---|
| RTX 3090 | ~$0.00031/sec | Recommended |
| Idle | $0 | Scales to zero |
Typical extraction costs less than $0.001 per request at ~2s processing time.
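As a back-of-envelope check on those figures, multiplying the RTX 3090 rate by the ~2s processing time gives the per-request cost:

```python
# Per-request cost from the table above: RTX 3090 rate times ~2s of
# processing time per extraction.
rate_per_sec = 0.00031       # USD/sec on RTX 3090
seconds_per_request = 2.0
cost = rate_per_sec * seconds_per_request
print(f"${cost:.5f} per request")  # prints $0.00062 per request
```

At roughly $0.00062 per request, this stays under the $0.001 figure quoted above.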