corp-extractor v0.7.0

Statement Extractor Documentation

Extract structured subject-predicate-object statements from unstructured text using T5-Gemma 2 and GLiNER2 models with document processing, entity resolution, and taxonomy classification.

Getting Started

Installation

Bash
pip install corp-extractor

The GLiNER2 model (205M params) is downloaded automatically on first use.

GPU support: Install PyTorch with CUDA before installing corp-extractor. The library auto-detects GPU availability at runtime.

Bash
# Example for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor

Apple Silicon (M1/M2/M3): MPS acceleration is automatically detected. Just install normally:

Bash
pip install corp-extractor

Quick Start

Extract structured statements from text in 5 lines:

Python
from statement_extractor import extract_statements

text = "Apple Inc. acquired Beats Electronics for $3 billion in May 2014."
statements = extract_statements(text)

for stmt in statements:
    print(f"{stmt.subject.text} ({stmt.subject.type}) -> {stmt.predicate} -> {stmt.object.text}")

Output:

Text
Apple Inc. (ORG) -> acquired -> Beats Electronics
Apple Inc. (ORG) -> paid -> $3 billion
Beats Electronics (ORG) -> acquisition price -> $3 billion

Each statement includes confidence scores and extraction method:

Python
for stmt in statements:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  method: {stmt.extraction_method}")  # hybrid, gliner, or model
    print(f"  confidence: {stmt.confidence_score:.2f}")

v0.5.0 features: Plugin-based pipeline architecture with entity qualification, labeling, and taxonomy classification. GLiNER2 entity recognition, entity-based scoring.

v0.6.0 features: Entity embedding database with ~100K+ SEC filers, ~3M GLEIF records, ~5M UK organizations for fast entity qualification.

v0.7.0 features: Document processing for files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.

v0.8.0 features: Merged qualification and canonicalization into single stage. EntityType classification for organizations (business, nonprofit, government, etc.).

v0.9.0 features: Person database with Wikidata import for notable people (executives, politicians, athletes, artists). PersonQualifier for canonical person identification with role/org context.

v0.9.1 features: Wikidata dump importer (import-wikidata-dump) for large imports without SPARQL timeouts. Uses aria2c for fast parallel downloads. Extracts people via occupation (P106) and position dates (P580/P582).

v0.9.2 features: Organization canonicalization links equivalent records across sources (GLEIF, SEC, Companies House, Wikidata). People canonicalization with priority-based deduplication. Expanded PersonType classification (executive, politician, government, military, legal, etc.).

v0.9.3 features: SEC Form 4 officers import (import-sec-officers) and Companies House officers import (import-ch-officers). People now sourced from Wikidata, SEC Edgar, and Companies House with cross-source canonicalization.

v0.9.4 features: Database v2 schema with normalized INTEGER foreign keys and enum lookup tables. Scalar (int8) embeddings for 75% storage reduction with ~92% recall. New locations import for countries/states/cities with hierarchy. Migration commands: db migrate-v2, db backfill-scalar. New search commands: db search-roles, db search-locations.

v0.9.5 features: USearch HNSW indexes for sub-millisecond search on 50M+ vectors. 3-thread parallel Wikidata dump import (reader/embedder/writer). Multi-record person import (one per position+org). Auto-canonicalization after dump import. New commands: db post-import, db build-index. --hybrid flag for text+embeddings search. fast-import extras (orjson, indexed_bzip2). Zstandard (.zst) dump support. New hf_classifier labeler plugin.

v0.9.6 features: Database v3 — lite databases drop all embedding tables (use USearch indexes for search). db download/db upload now include USearch .bin files. New db_info metadata table with schema_version. Global --db-version CLI flag for backwards compatibility. Filenames: entities-v3.db / entities-v3-lite.db.

v0.9.7 features: Persistent local server (corp-extractor serve) keeps models warm in memory for fast repeated CLI use. New --server / --server-url flags and CORP_EXTRACTOR_SERVER env var to delegate processing to a running server instance.

v0.9.8 features: Python API server delegation — pass server_url to extract_statements(), ExtractionPipeline, or DocumentPipeline to delegate processing to a running server from Python code. No local GPU required. All backends now return standardized Pydantic model_dump() JSON.

Pipeline Quick Start (v0.5.0)

For full entity resolution with qualification, canonicalization, labeling, and taxonomy classification:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")

# Access fully qualified names (e.g., "Andy Jassy (CEO, Amazon)")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")

    # Access labels (sentiment, etc.)
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")

CLI usage:

Bash
# Full pipeline
corp-extractor pipeline "Amazon CEO Andy Jassy announced..."

# Run specific stages only
corp-extractor pipeline -f article.txt --stages 1-3

# Process documents, PDFs, and URLs (v0.7.0)
corp-extractor document process article.txt
corp-extractor document process report.pdf
corp-extractor document process report.pdf --pdf-parser glm_ocr_parser
corp-extractor document process https://example.com/article

Using Predicate Taxonomies

Normalize extracted predicates to canonical forms using embedding similarity:

Python
from statement_extractor import extract_statements, PredicateTaxonomy, ExtractionOptions

# Define your domain's canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in",
    "invested_in", "partnered_with", "announced"
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)

text = "Google bought YouTube for $1.65 billion in 2006."
result = extract_statements(text, options)

for stmt in result:
    print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
    # Output: bought -> acquired

This maps synonyms like "bought", "purchased", "acquired" to a single canonical form, making downstream analysis easier.
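The mapping can be sketched with toy data: embed the extracted predicate and assign it to the nearest canonical predicate by cosine similarity. The two-dimensional vectors and the `embed()` stub below are illustrative stand-ins, not the library's real sentence-embedding model.

```python
import math

# Toy sketch of embedding-based predicate normalization. The vectors here
# are hand-picked stand-ins for real sentence embeddings.
CANONICAL = {
    "acquired": [0.9, 0.1],
    "founded": [0.1, 0.9],
}

def embed(word):
    # stand-in for a real embedding model
    toy = {"bought": [0.85, 0.2], "established": [0.15, 0.8]}
    return toy.get(word, [0.5, 0.5])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def canonicalize(predicate):
    # pick the canonical predicate whose embedding is most similar
    vec = embed(predicate)
    return max(CANONICAL, key=lambda c: cosine(vec, CANONICAL[c]))
```

With these toy vectors, "bought" lands on "acquired" and "established" on "founded", mirroring the example above.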

Requirements

Dependency | Version | Notes
Python | 3.10+ | Required
PyTorch | 2.0+ | Required
transformers | 5.0+ | Required for T5-Gemma 2 support
Pydantic | 2.0+ | Required
sentence-transformers | 2.2+ | Required, for embedding features
GLiNER2 | latest | Required, for entity recognition and relation extraction (model auto-downloads)

Hardware requirements:

  • NVIDIA GPU: RTX 4090+ recommended for production. Uses bfloat16 precision for efficiency.
  • Apple Silicon: M1/M2/M3 with 16GB+ RAM. MPS acceleration auto-detected.
  • CPU: Functional but slower. Use for development or low-volume processing.
  • Disk: ~100GB for all models and entity database (9.7M+ organizations, 63M+ people).

The library runs entirely locally with no external API dependencies. Models use bfloat16 on CUDA and float32 on MPS/CPU.

Command Line Interface

The corp-extractor CLI provides commands for extraction, document processing, and database management.

Commands Overview

Command | Description | Use Case
split | Simple extraction (Stage 1 only) | Fast extraction, basic triples
pipeline | Full 5-stage pipeline | Entity resolution, labeling, taxonomy
document | Document processing | Files, URLs, PDFs with chunking and deduplication
db | Database management | Import, search, upload/download entity database
serve | Persistent local server | Keep models warm in memory for fast repeated use
plugins | Plugin management | List and inspect available plugins

Global Options

These options apply to all commands when placed before the command name:

Option | Description | Default
--server | Delegate processing to a running corp-extractor serve instance at the default URL | http://localhost:8111
--server-url URL | Delegate processing to a server at a custom URL |
--db-version N | Database schema version for filenames | latest (3)

The CORP_EXTRACTOR_SERVER environment variable can also be set to a server URL, which acts as a fallback when neither --server nor --server-url is provided.

Bash
# Use the default server URL (http://localhost:8111)
corp-extractor --server pipeline "Apple announced a new iPhone."

# Use a custom server URL
corp-extractor --server-url http://gpu-box:8111 pipeline "Apple announced a new iPhone."

# Or set via environment variable
export CORP_EXTRACTOR_SERVER=http://localhost:8111
corp-extractor pipeline "Apple announced a new iPhone."

Installation

For best results, install globally:

Bash
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"

# Using pipx
pipx install "corp-extractor[embeddings]"

# Using pip
pip install "corp-extractor[embeddings]"

Quick Run with uvx

Run directly without installing using uv:

Bash
uvx corp-extractor split "Apple announced a new iPhone."

Note: First run downloads the model (~1.5GB) which may take a few minutes.


Split Command

The split command extracts sub-statements using the T5-Gemma model. It's fast and simple—use pipeline for full entity resolution.

Bash
# Extract from text argument
corp-extractor split "Apple Inc. announced the iPhone 15."

# Extract from file
corp-extractor split -f article.txt

# Pipe from stdin
cat article.txt | corp-extractor split -

# Output as JSON
corp-extractor split "Tim Cook is CEO of Apple." --json

# Output as XML
corp-extractor split "Tim Cook is CEO of Apple." --xml

# Verbose output with confidence scores
corp-extractor split -f article.txt --verbose

# Use more beams for better quality
corp-extractor split -f article.txt --beams 8

Split Options

Option | Description | Default
-f, --file PATH | Read input from file |
-o, --output | Output format: table, json, xml | table
--json / --xml | Output format shortcuts |
-b, --beams | Number of beams for diverse beam search | 4
--diversity | Diversity penalty for beam search | 1.0
--no-gliner | Disable GLiNER2 extraction |
--predicates | Comma-separated predicates for relation extraction |
--predicates-file | Path to custom predicates JSON file |
--device | Device: auto, cuda, mps, cpu | auto
-v, --verbose | Show confidence scores and metadata |

Pipeline Command

NEW in v0.5.0

The pipeline command runs the full 5-stage extraction pipeline for comprehensive entity resolution and taxonomy classification.

Bash
# Run all 5 stages
corp-extractor pipeline "Amazon CEO Andy Jassy announced plans to hire workers."

# Run from file
corp-extractor pipeline -f article.txt

# Run specific stages
corp-extractor pipeline "..." --stages 1-3
corp-extractor pipeline "..." --stages 1,2,5

# Skip specific stages
corp-extractor pipeline "..." --skip-stages 4,5

# Enable specific plugins only
corp-extractor pipeline "..." --plugins gleif,companies_house

# Disable specific plugins
corp-extractor pipeline "..." --disable-plugins sec_edgar

# Output formats
corp-extractor pipeline "..." -o json
corp-extractor pipeline "..." -o yaml
corp-extractor pipeline "..." -o triples

Pipeline Stages

Stage | Name | Description
1 | Splitting | Text → Raw triples (T5-Gemma 2)
2 | Extraction | Raw triples → Typed statements (GLiNER2)
3 | Entity Qualification | Add identifiers (LEI, CIK, etc.) and canonical names via embedding DB
4 | Labeling | Apply sentiment, relation type, confidence
5 | Taxonomy | Classify against large taxonomies (MNLI/embeddings)

Pipeline Options

Option | Description | Example
--stages | Stages to run | 1-3 or 1,2,5
--skip-stages | Stages to skip | 4,5
--plugins | Enable only these plugins | gleif,person
--disable-plugins | Disable these plugins | sec_edgar
--predicates-file | Custom predicates JSON file for GLiNER2 | custom.json
-o, --output | Output format | table, json, yaml, triples

Plugins Command

NEW in v0.5.0

The plugins command lists and inspects available pipeline plugins.

Bash
# List all plugins
corp-extractor plugins list

# List plugins for a specific stage
corp-extractor plugins list --stage 3

# Get details about a plugin
corp-extractor plugins info gleif_qualifier
corp-extractor plugins info person_qualifier

Example output:

Text
Stage 1: Splitting
----------------------------------------
  t5_gemma_splitter  [priority: 100]

Stage 2: Extraction
----------------------------------------
  gliner2_extractor  [priority: 100]

Stage 3: Entity Qualification
----------------------------------------
  person_qualifier (PERSON)  [priority: 100]
  embedding_company_qualifier (ORG)  [priority: 5]

Stage 4: Labeling
----------------------------------------
  sentiment_labeler  [priority: 100]
  confidence_labeler  [priority: 100]
  relation_type_labeler  [priority: 100]

Stage 5: Taxonomy
----------------------------------------
  embedding_taxonomy_classifier  [priority: 100]

Serve Command

NEW in v0.9.7

The serve command starts a persistent local FastAPI server that keeps all models warm in memory. This eliminates the ~30s startup cost for repeated CLI invocations.

Bash
# Start the server (default: 0.0.0.0:8111)
corp-extractor serve

# Custom host and port
corp-extractor serve --host 127.0.0.1 --port 9000

# Skip model warmup (models load on first request)
corp-extractor serve --no-warmup

# Verbose logging
corp-extractor serve -v

Once the server is running, use --server with any extraction command to delegate processing:

Bash
# In another terminal
corp-extractor --server split "Apple announced a new iPhone."
corp-extractor --server pipeline "Amazon CEO Andy Jassy announced plans."
corp-extractor --server pipeline -f article.txt -o json
corp-extractor --server document process article.txt

For Python API delegation, pass server_url to any extraction function (see Examples):

Python
from statement_extractor import extract_statements
result = extract_statements("text", server_url="http://localhost:8111")

Serve Options

Option | Description | Default
--host | Bind address | 0.0.0.0
--port | Port number | 8111
--no-warmup | Skip eager model loading (models load on first request) |
-v, --verbose | Enable debug logging |

Server Endpoints

Endpoint | Method | Description
/ | GET | Health check: device info, loaded models, registered plugins
/pipeline | POST | Full extraction pipeline; returns PipelineContext model_dump JSON
/split | POST | Stage 1 extraction only; returns ExtractionResult model_dump JSON
/document | POST | Document pipeline; returns DocumentContext model_dump JSON

Output Formats

Table output (default):

Text
Extracted 2 statement(s):

--------------------------------------------------------------------------------
1. Andy Jassy (CEO, Amazon)
   --[announced]-->
   plans to hire workers
--------------------------------------------------------------------------------

JSON output:

JSON
{
  "statement_count": 2,
  "labeled_statements": [
    {
      "subject": {"text": "Andy Jassy", "type": "PERSON", "fqn": "Andy Jassy (CEO, Amazon)"},
      "predicate": "announced",
      "object": {"text": "plans to hire workers", "type": "EVENT"},
      "labels": {"sentiment": "positive"}
    }
  ]
}

Triples output:

Text
Andy Jassy (CEO, Amazon)	announced	plans to hire workers
Amazon	has CEO	Andy Jassy (CEO, Amazon)

Shell Integration

Processing multiple files:

Bash
# Process all .txt files
for f in *.txt; do
  echo "=== $f ==="
  corp-extractor pipeline -f "$f" -o json > "${f%.txt}.json"
done

Combining with jq:

Bash
# Extract just predicates
corp-extractor split "Your text" --json | jq '.statements[].predicate'

# Filter high-confidence statements
corp-extractor split -f article.txt --json | jq '.statements[] | select(.confidence_score > 0.8)'

# Get FQNs from pipeline
corp-extractor pipeline "Your text" -o json | jq '.labeled_statements[].subject.fqn'

Document Command

NEW in v0.7.0

The document command processes files, URLs, and PDFs with automatic chunking and deduplication.

Bash
# Process local files
corp-extractor document process article.txt
corp-extractor document process report.txt --title "Annual Report" --year 2024

# Process local PDFs (auto-detected by .pdf extension)
corp-extractor document process report.pdf
corp-extractor document process report.pdf --pdf-parser glm_ocr_parser

# Process URLs (web pages and PDFs)
corp-extractor document process https://example.com/article
corp-extractor document process https://example.com/report.pdf --use-ocr
corp-extractor document process https://example.com/report.pdf --pdf-parser glm_ocr_parser

# Configure chunking
corp-extractor document process article.txt --max-tokens 500 --overlap 50

# Preview chunking without extraction
corp-extractor document chunk article.txt --max-tokens 500

# Output formats
corp-extractor document process article.txt -o json
corp-extractor document process article.txt -o triples

Document Options

Option | Description | Default
--title | Document title for citations | Filename
--max-tokens | Target tokens per chunk | 1000
--overlap | Token overlap between chunks | 100
--use-ocr | Force OCR for PDF parsing |
--pdf-parser | PDF parser plugin name (e.g., glm_ocr_parser) | Auto (lowest priority)
--no-summary | Skip document summarization |
--no-dedup | Skip cross-chunk deduplication |
--stages | Pipeline stages to run | 1-5

Database Commands

MOVED in v0.10.0

The entity database CLI moved out of corp-extractor into the corp-entity-db project — see that project's docs for search, download, build, and management commands. corp-extractor consumes the database transparently for entity qualification (Stage 3 of the pipeline); you do not need to invoke any db commands to use the extraction pipeline.

See ENTITY_DATABASE.md for the project-level overview.

Core Concepts

Corp-extractor is designed to analyze complex text and extract relationship information about people and organizations. It runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies, using multiple fine-tuned small models to transform unstructured text into structured knowledge.

Statement Extraction

Statement extraction is the process of converting unstructured natural language text into structured subject-predicate-object triples. Each triple represents a discrete fact or relationship extracted from the source text.

For example, given the text:

"Apple announced a new iPhone at their Cupertino headquarters."

The extractor produces triples like:

Subject | Predicate | Object
Apple (ORG) | announced | iPhone (PRODUCT)
Apple (ORG) | has headquarters in | Cupertino (GPE)
The T5-Gemma 2 Model

Corp-extractor uses a fine-tuned T5-Gemma 2 model with 540 million parameters. This encoder-decoder architecture excels at sequence-to-sequence tasks, making it well-suited for transforming text into structured XML output.

The model processes input text wrapped in <page> tags and generates XML containing <stmt> elements with subject, predicate, object, and source text spans.

Entity Type Recognition

Each extracted subject and object is classified into one of 12 entity types (plus UNKNOWN):

Type | Description | Example
ORG | Organizations, companies | Apple, United Nations
PERSON | Named individuals | Tim Cook, Marie Curie
GPE | Geopolitical entities | France, New York City
LOC | Non-GPE locations | Mount Everest, Pacific Ocean
PRODUCT | Products, artifacts | iPhone, Model S
EVENT | Named events | World War II, Olympics
WORK_OF_ART | Creative works | Mona Lisa, Hamlet
LAW | Legal documents | GDPR, First Amendment
DATE | Temporal expressions | January 2024, last Tuesday
MONEY | Monetary values | $50 million, €100
PERCENT | Percentages | 15%, half
QUANTITY | Measurements | 500 kilometers, 3 tons
UNKNOWN | Unclassified entities |

Diverse Beam Search

Corp-extractor uses Diverse Beam Search (Vijayakumar et al., 2016) to generate multiple candidate extractions from the same input text.

Why Diverse Beam Search?

Standard beam search tends to produce similar outputs—slight variations of the same interpretation. Diverse Beam Search introduces a diversity penalty that encourages the model to explore fundamentally different extractions.

This is particularly valuable for statement extraction because:

  • A single sentence may contain multiple valid interpretations
  • Different phrasings can capture different aspects of the same fact
  • Merging diverse outputs produces more comprehensive coverage

How It Works

The model generates multiple beams in parallel, each representing a different extraction path. A diversity penalty is applied during generation to prevent beams from converging on identical outputs.

Default Parameters

Parameter | Default | Description
num_beams | 4 | Number of parallel beams to generate
diversity_penalty | 1.0 | Strength of diversity encouragement (higher = more diverse)

Python
from statement_extractor import extract_statements

# Use default beam search settings
result = extract_statements("Apple announced a new iPhone.")

# Customize beam search
result = extract_statements(
    "Apple announced a new iPhone.",
    num_beams=6,
    diversity_penalty=1.5
)

Quality Scoring

UPDATED in v0.4.0

Each extracted statement receives a confidence score between 0 and 1, measuring extraction quality through a weighted combination of semantic and entity-based signals.

Confidence Score

The score combines three components using GLiNER2 for entity recognition:

Component | Weight | Description
Semantic similarity | 50% | Cosine similarity between source text and reassembled triple
Subject entity score | 25% | How entity-like the subject is (via GLiNER2 NER)
Object entity score | 25% | How entity-like the object is (via GLiNER2 NER)

Higher scores indicate the triple is semantically grounded and contains well-formed entities. Lower scores may suggest hallucination or poorly extracted entities.
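The weighted combination can be written out directly. The weights match the table above; the input values below are illustrative, not real model output.

```python
# Confidence = 50% semantic similarity + 25% subject entity score
# + 25% object entity score (weights as documented above).
def confidence(semantic, subject_entity, object_entity):
    return 0.5 * semantic + 0.25 * subject_entity + 0.25 * object_entity

# Illustrative component scores
score = confidence(semantic=0.9, subject_entity=0.8, object_entity=0.6)
```

A triple with strong semantic grounding but a weak object entity still scores moderately well, which is why filtering on the combined score is useful.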

Confidence Filtering

Use the min_confidence parameter to filter out low-quality extractions:

Python
from statement_extractor import extract_statements

# Only return statements with confidence >= 0.7
result = extract_statements(
    "Apple CEO Tim Cook announced the iPhone 15.",
    min_confidence=0.7
)

# Access individual scores
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Beam Merging vs Best Beam Selection

Corp-extractor supports two strategies for combining beam outputs:

Strategy | Description | Use Case
merge (default) | Combine unique statements from all beams, deduplicated by content | Maximum coverage
best | Return only statements from the highest-scoring beam | Higher precision

Python
# Merge all beams (default)
result = extract_statements(text, beam_strategy="merge")

# Use only the best beam
result = extract_statements(text, beam_strategy="best")

When using merge, statements are deduplicated based on normalized subject-predicate-object content, and the highest confidence score is retained for duplicates.
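The merge behavior can be sketched as follows. For simplicity the sketch treats statements as plain (subject, predicate, object, confidence) tuples rather than the library's Statement objects.

```python
# Sketch of merge-strategy deduplication: triples are keyed on normalized
# subject/predicate/object text, and the highest-confidence copy wins.
def merge_beams(statements):
    merged = {}
    for stmt in statements:  # stmt = (subject, predicate, object, confidence)
        key = tuple(part.strip().lower() for part in stmt[:3])
        if key not in merged or stmt[3] > merged[key][3]:
            merged[key] = stmt
    return list(merged.values())
```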


GLiNER2 Integration

NEW in v0.4.0

Version 0.4.0 introduces GLiNER2 (205M parameters) for entity recognition and relation extraction, replacing spaCy.

Why GLiNER2?

GLiNER2 is a unified model that handles:

  • Named Entity Recognition - identifying entities with types
  • Relation Extraction - using 324 default predicates across 21 categories
  • Confidence Scoring - real confidence values via include_confidence=True
  • Entity Scoring - measuring how "entity-like" subjects and objects are

Default Predicates

GLiNER2 uses 324 predicates organized into 21 categories loaded from default_predicates.json. Categories include:

  • ownership_control - acquires, owns, has_subsidiary, etc.
  • employment_leadership - employs, is_ceo_of, manages, etc.
  • funding_investment - funds, invests_in, sponsors, etc.
  • supply_chain - supplies, manufactures, distributes_for, etc.
  • legal_regulatory - regulates, violates, complies_with, etc.

Each predicate includes a description for semantic matching and a confidence threshold.
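A custom predicates file might be built like this. The structure is a hypothetical sketch based only on the fields described above (category grouping, a description for semantic matching, and a confidence threshold); the actual schema of default_predicates.json may differ.

```python
import json

# Hypothetical predicates-file structure (field names are assumptions,
# not the library's documented schema).
custom = {
    "ownership_control": [
        {
            "predicate": "acquires",
            "description": "one organization purchases another",
            "threshold": 0.5,
        }
    ]
}

with open("custom_predicates.json", "w") as f:
    json.dump(custom, f, indent=2)
```

The resulting file could then be passed via --predicates-file on the CLI or the predicates_file extractor option.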

All Matches Returned

GLiNER2 now returns all matching relations, not just the best one. This allows downstream filtering and selection based on your use case:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")

# All matching relations are returned, sorted by confidence
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Custom Predicates

You can provide custom predicates via a JSON file:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

config = PipelineConfig(
    extractor_options={"predicates_file": "/path/to/custom_predicates.json"}
)
pipeline = ExtractionPipeline(config)

Or via CLI:

Bash
corp-extractor pipeline "..." --predicates-file custom_predicates.json

Entity-Based Scoring

Confidence scores come directly from GLiNER2 with include_confidence=True:

Source | Description
Relation confidence | GLiNER2 confidence in the relation match
Entity confidence | GLiNER2 confidence in entity recognition

Pipeline Architecture

Updated in v0.8.0

Version 0.8.0 uses a 5-stage plugin-based pipeline for comprehensive entity resolution, statement enrichment, and taxonomy classification. Qualification and canonicalization have been merged into a single stage using the embedding database.

The 5 Stages

Stage | Name | Input | Output | Purpose
1 | Splitting | Text | RawTriple[] | Extract raw subject-predicate-object triples using T5-Gemma 2
2 | Extraction | RawTriple[] | PipelineStatement[] | Refine entities with type recognition using GLiNER2
3 | Entity Qualification | Entities | CanonicalEntity[] | Add identifiers (LEI, CIK, etc.) and resolve canonical names via embedding database
4 | Labeling | Statements | LabeledStatement[] | Apply sentiment, relation type, confidence labels
5 | Taxonomy | Statements | TaxonomyResult[] | Classify against large taxonomies (ESG topics, etc.)

Stage 1: Splitting

The splitting stage transforms raw text into RawTriple objects using the T5-Gemma2 model. Each triple contains:

  • subject_text: The raw subject text
  • predicate_text: The raw predicate/relationship
  • object_text: The raw object text
  • source_sentence: The sentence this triple was extracted from
  • confidence: Extraction confidence score

Stage 2: Extraction

The extraction stage uses GLiNER2 to extract relations and assign entity types, producing PipelineStatement objects with:

  • subject: ExtractedEntity with text, type, span, and confidence
  • object: ExtractedEntity with text, type, span, and confidence
  • predicate: Predicate from GLiNER2's 324 default predicates
  • predicate_category: Category the predicate belongs to (e.g., "employment_leadership")
  • source_text: Source text for this statement
  • confidence_score: Real confidence from GLiNER2

Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. This allows downstream stages to filter, deduplicate, or select based on specific criteria. Relations are sorted by confidence (descending).
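One simple downstream selection, sketched here with plain dicts rather than the library's PipelineStatement objects: keep only the highest-confidence predicate per (subject, object) pair, relying on the documented descending confidence order.

```python
# Illustrative filter over relations already sorted by confidence
# (descending): keep the best predicate for each entity pair.
def best_per_pair(relations):
    seen, kept = set(), []
    for rel in relations:
        key = (rel["subject"], rel["object"])
        if key not in seen:  # first occurrence is the most confident
            seen.add(key)
            kept.append(rel)
    return kept
```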

Stage 3: Entity Qualification

Entity qualification combines what were previously separate qualification and canonicalization stages. It adds context, external identifiers, and canonical names to entities using the embedding database:

  • PersonQualifier: Adds role, organization, and canonical ID for PERSON entities (enhanced in v0.9.0)
    • Uses LLM (Gemma3) to extract role and organization from context
    • Searches person database for notable people (executives, politicians, athletes, etc.)
    • Resolves organization mentions against the organization database
    • Returns canonical Wikidata IDs for matched people
  • EmbeddingCompanyQualifier: Looks up company identifiers (LEI, CIK, UK company numbers) and canonical names using vector similarity search

The output is CanonicalEntity with:

  • entity_type: Classification (business, nonprofit, government, etc.)
  • canonical_match: Match details (id, name, method, confidence)
  • fqn: Fully Qualified Name, e.g., "Tim Cook (CEO, Apple Inc)"
  • External identifiers: lei, ch_number, sec_cik, ticker, etc.
  • resolved_role: Canonical role information from person database (v0.9.0)
  • resolved_org: Canonical organization information from org database (v0.9.0)

Note: The embedding-based company qualifier replaces the older API-based qualifiers (GLEIF, Companies House, SEC Edgar APIs) for faster, offline entity resolution.

Stage 4: Labeling

Labeling plugins annotate statements with additional metadata:

  • SentimentLabeler: Adds sentiment classification (positive/negative/neutral)
  • ConfidenceLabeler: Adds confidence scoring
  • RelationTypeLabeler: Classifies relation types

The output is LabeledStatement with:

  • Original statement
  • Canonicalized subject and object
  • List of StatementLabel objects

Stage 5: Taxonomy

Taxonomy classification plugins classify statements against large taxonomies with hundreds of possible values. Multiple labels may match a single statement above the confidence threshold.

  • MNLITaxonomyClassifier: Uses MNLI zero-shot classification for accurate taxonomy labeling
  • EmbeddingTaxonomyClassifier: Uses embedding similarity for faster classification

The output is a list of TaxonomyResult objects, each with:

  • taxonomy_name: Name of the taxonomy (e.g., "esg_topics")
  • category: Top-level category (e.g., "environment", "governance")
  • label: Specific label within the category
  • confidence: Classification confidence score

Both classifiers use hierarchical classification for efficiency: first identify the top-k categories, then return all labels above the threshold within those categories.
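The hierarchical strategy can be sketched with precomputed scores standing in for MNLI or embedding similarity. The taxonomy, score values, top_k, and threshold below are toy data, not the library's defaults.

```python
# Toy taxonomy: category -> {label: precomputed similarity score}.
TAXONOMY = {
    "environment": {"emissions": 0.82, "recycling": 0.35},
    "governance": {"board_changes": 0.74, "audits": 0.20},
    "social": {"labor": 0.10},
}

def classify(category_scores, top_k=2, threshold=0.5):
    # Step 1: keep only the top-k scoring categories.
    top = sorted(category_scores, key=category_scores.get, reverse=True)[:top_k]
    # Step 2: within those categories, return every label above the threshold.
    results = []
    for cat in top:
        for label, score in TAXONOMY[cat].items():
            if score >= threshold:
                results.append((cat, label, score))
    return results
```

Pruning to top-k categories first means label scoring only runs on a small slice of a taxonomy with hundreds of values.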

Plugin System

Each stage is implemented through plugins registered with PluginRegistry. Plugins can be:

  • Enabled/disabled per invocation
  • Prioritized for execution order
  • Entity-type specific (e.g., PersonQualifier only runs on PERSON entities)
Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run with specific plugins disabled
config = PipelineConfig(
    disabled_plugins={"mnli_taxonomy_classifier"}  # Use embedding classifier instead
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process(text)

Document Processing

NEW in v0.7.0

Version 0.7.0 introduces document-level processing for handling files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.

Document Pipeline

The document pipeline:

  1. Loads content from files, URLs, or PDFs
  2. Chunks text into optimal-sized segments for the extraction model
  3. Processes each chunk through the 5-stage extraction pipeline
  4. Deduplicates statements across chunks
  5. Generates optional document summary
  6. Tracks citations back to source chunks

Chunking Strategy

Documents are split into chunks based on token count with configurable overlap:

Parameter | Default | Description
target_tokens | 1000 | Target tokens per chunk
overlap_tokens | 100 | Token overlap between consecutive chunks
respect_sentences | true | Avoid splitting mid-sentence
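A minimal sketch of this strategy, approximating tokens by whitespace-separated words and carrying trailing sentences forward as overlap. The real chunker uses proper tokenization, so this is illustrative only.

```python
def chunk_sentences(sentences, target_tokens=1000, overlap_tokens=100):
    # Greedily pack whole sentences into chunks up to target_tokens,
    # then carry the last few sentences forward as overlap.
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude token count
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            carry, carried = [], 0
            for prev in reversed(current):  # rebuild overlap from the tail
                carry.insert(0, prev)
                carried += len(prev.split())
                if carried >= overlap_tokens:
                    break
            current, count = carry, carried
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are built from whole sentences, no chunk splits mid-sentence, matching the respect_sentences behavior above.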

URL and PDF Support

The document pipeline can fetch and process content from URLs and local PDF files:

  • Web pages: HTML content is extracted using Readability-style parsing
  • PDFs: Two built-in parsers available via --pdf-parser flag:
    • pypdf_parser (default) — PyMuPDF text extraction with Tesseract OCR fallback
    • glm_ocr_parser — GLM-OCR 0.9B VLM for high-quality OCR of scans, tables, and formulas
Python
from statement_extractor.document import DocumentPipeline

pipeline = DocumentPipeline()

# Process a web page
ctx = await pipeline.process_url("https://example.com/article")

# Process a PDF with OCR
from statement_extractor.document import URLLoaderConfig
config = URLLoaderConfig(use_ocr=True)
ctx = await pipeline.process_url("https://example.com/report.pdf", config)

# Process a PDF with GLM-OCR parser
config = URLLoaderConfig(pdf_parser_plugin="glm_ocr_parser")
ctx = await pipeline.process_url("https://example.com/report.pdf", config)
Bash
# CLI: local PDF with default parser
corp-extractor document process report.pdf

# CLI: local PDF with GLM-OCR VLM parser
corp-extractor document process report.pdf --pdf-parser glm_ocr_parser

Cross-Chunk Deduplication

When processing long documents, the same fact may appear in multiple chunks. The deduplicator uses embedding similarity to identify and merge duplicate statements, keeping the highest-confidence version with proper citation tracking.
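The merge step can be illustrated with a minimal greedy sketch. The real deduplicator embeds statement text with a sentence-embedding model; here the embeddings are toy vectors supplied by the caller, and the highest-confidence statement in each similarity cluster survives.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup_statements(statements: list[dict], threshold: float = 0.9) -> list[dict]:
    """Greedy merge: keep the highest-confidence statement per similarity cluster.

    Each statement dict carries an 'embedding' vector and a 'confidence' score;
    real code would embed the statement text before comparing.
    """
    kept: list[dict] = []
    for stmt in sorted(statements, key=lambda s: s["confidence"], reverse=True):
        if all(cosine(stmt["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(stmt)
    return kept
```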


Entity Embedding Database

UPDATED in v0.9.6

The entity embedding database provides fast qualification for both organizations and people using vector similarity search.

Organization Data Sources

| Source | Records | Identifier | Date Fields |
|---|---|---|---|
| Companies House | 5.5M | UK Company Number | from_date: Incorporation, to_date: Dissolution |
| GLEIF | 2.6M | LEI (Legal Entity Identifier) | from_date: LEI registration date |
| Wikidata | 1.5M | QID | from_date: Inception (P571), to_date: Dissolution (P576) |
| SEC Edgar | 73K | CIK (Central Index Key) | from_date: First SEC filing date |

Total: 9.7M+ organization records

Person Data Sources UPDATED in v0.9.3

| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 27.5M | Person Number | UK company officers and directors |
| Wikidata | 36M | QID | Notable people with English Wikipedia articles |

Total: 63M+ people records

Person Types

| PersonType | Description | Example People |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials (presidents, MPs, mayors) | Joe Biden, Angela Merkel |
| government | Civil servants, diplomats, appointed officials | Ambassadors, agency heads |
| military | Military officers, armed forces personnel | Generals, admirals |
| legal | Judges, lawyers, legal professionals | Supreme Court justices |
| professional | Known for profession (doctors, engineers) | Famous surgeons, architects |
| athlete | Sports figures | LeBron James, Lionel Messi |
| artist | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift |
| media | Internet/social media personalities | YouTubers, influencers, podcasters |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| journalist | Reporters, news presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |

People are imported from Companies House (UK company officers) and Wikidata (notable people with English Wikipedia articles). Each person record includes:

  • name: Display name
  • known_for_role: Primary role (e.g., "CEO", "President")
  • known_for_org: Primary organization (e.g., "Apple Inc", "Tesla")
  • country: Country of citizenship
  • person_type: Classification category
  • from_date: Role start date (ISO format)
  • to_date: Role end date (ISO format)
  • birth_date: Date of birth (ISO format) v0.9.2
  • death_date: Date of death if deceased (ISO format) v0.9.2

Note: The same person can have multiple records with different role/org combinations (e.g., Tim Cook as "CEO at Apple" and "Board Director at Nike"). The unique constraint is on (source, source_id, known_for_role, known_for_org).

When organizations are discovered during people import (employers, affiliated orgs), they are automatically inserted into the organizations table if not already present. Each person record has a known_for_org_id foreign key linking to the organizations table, enabling efficient joins and lookups.
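The unique constraint and the known_for_org_id foreign key can be illustrated with a minimal in-memory SQLite schema. This is a simplified sketch: the column set is trimmed down and the QID shown is hypothetical, not the real database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    source TEXT, source_id TEXT,
    name TEXT,
    known_for_role TEXT, known_for_org TEXT,
    known_for_org_id INTEGER REFERENCES organizations(id),
    UNIQUE (source, source_id, known_for_role, known_for_org)
);
""")

def upsert_org(name: str) -> int:
    """Insert the organization if missing, then return its row id."""
    conn.execute("INSERT OR IGNORE INTO organizations(name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT id FROM organizations WHERE name = ?", (name,)
    ).fetchone()[0]

# The same person appears twice with different role/org combinations.
for role, org in [("CEO", "Apple Inc"), ("Board Director", "Nike")]:
    conn.execute(
        "INSERT OR IGNORE INTO people(source, source_id, name, known_for_role, "
        "known_for_org, known_for_org_id) VALUES (?, ?, ?, ?, ?, ?)",
        ("wikidata", "Q12345", "Tim Cook", role, org, upsert_org(org)),
    )
```

Re-inserting the same (source, source_id, role, org) combination is a no-op, while a new role or organization produces a second record for the same person.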

EntityType Classification

NEW in v0.8.0

Each organization record is classified with an entity_type field to distinguish between businesses, non-profits, government agencies, and other organization types:

| Category | Types | Description |
|---|---|---|
| Business | business, fund, branch | Commercial entities, investment funds, branch offices |
| Non-profit | nonprofit, ngo, foundation, trade_union | Charitable organizations, NGOs, labor unions |
| Government | government, international_org, political_party | Government agencies, UN/WHO/IMF, political parties |
| Education | educational, research | Schools, universities, research institutes |
| Other | healthcare, media, sports, religious | Hospitals, studios, sports clubs, religious orgs |
| Unknown | unknown | Classification not determined |

How It Works

  1. Embedding Generation: Organization names are embedded using EmbeddingGemma (300M params)
  2. Vector Search: USearch HNSW indexes enable sub-millisecond approximate nearest neighbor search across 50M+ records
  3. Qualification: When an ORG entity is found, the database is searched for matching organizations
  4. Identifier Resolution: Matched organizations provide LEI, CIK, company numbers, etc.
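Conceptually, qualification is nearest-neighbor search over name embeddings. The sketch below uses brute-force cosine similarity over toy vectors purely for illustration; the real system embeds names with EmbeddingGemma and searches a USearch HNSW index, and the identifiers shown are sample values.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "embedding database": name -> (vector, identifiers).
ORG_DB = {
    "Apple Inc.":    ([0.9, 0.1, 0.0], {"cik": "0000320193"}),
    "Alphabet Inc.": ([0.1, 0.9, 0.1], {"cik": "0001652044"}),
}

def qualify(entity_vector: list[float], min_similarity: float = 0.8):
    """Return (name, identifiers, similarity) for the best match, or None."""
    name, (vec, ids) = max(
        ORG_DB.items(), key=lambda kv: cosine(entity_vector, kv[1][0])
    )
    sim = cosine(entity_vector, vec)
    return (name, ids, sim) if sim >= min_similarity else None
```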

Other Tables NEW in v0.9.4

| Table | Records | Description |
|---|---|---|
| Roles | 139K+ | Job titles with Wikidata QIDs (CEO, Director, etc.) |
| Locations | 25K+ | Countries, states, and cities with hierarchy |

Database Variants

  • entities-v3-lite.db: Core fields only, no embedding tables (default download)
  • entities-v3.db: Full database with all embedding tables and source metadata
  • organizations_usearch.bin: USearch HNSW index for organization search
  • people_usearch.bin: USearch HNSW index for person search

Entity Database

MOVED in v0.10.0

As of corp-extractor v0.10.0 the entity database is a separate project, corp-entity-db. It provides embedding-based search across organizations, people, roles, and locations sourced from GLEIF, SEC Edgar, Companies House, and Wikidata.

The full reference — schema, sizes, source coverage, EntityType / PersonType classifications, search / download / build CLI, and Python API — lives at corp-entity-db.vercel.app.

How corp-extractor uses the database

corp-extractor depends on corp-entity-db>=0.1.0 and consumes it via the qualifier plugins in Stage 3 of the pipeline:

  • embedding_company_qualifier — looks up organizations by embedding similarity, attaching canonical IDs (LEI, CIK, UK CH number, Wikidata QID) to extracted ORG entities.
  • person_qualifier — looks up notable people, optionally using a local LLM (Gemma-3-12B GGUF) to disambiguate when multiple candidates match a given name + role + organization context.

The pipeline doesn't need any explicit database setup — corp-entity-db will download the lite variant on first use. Backwards-compatible re-export shims under statement_extractor.database (OrganizationDatabase, PersonDatabase, get_database, get_organization_resolver) keep older code working unchanged.

Cerebrium deployment shares the volume

The Cerebrium app for corp-extractor deploys into the same Cerebrium project as the corp-entity-db app, so both share /persistent-storage in us-east-1. Database files, USearch indexes, and the embedding model (google/embeddinggemma-300m) are downloaded once by whichever app boots first and reused by the other.

See ENTITY_DATABASE.md for the project-level overview.

API Reference

Functions

The library provides convenience functions for quick extraction without managing extractor instances.

| Function | Returns | Description |
|---|---|---|
| extract_statements(text, options?, server_url?) | ExtractionResult | Main extraction function. Returns structured statements with confidence scores. Pass server_url to delegate to a running server. |
| extract_statements_as_json(text, options?, indent?, server_url?) | str | Returns extraction result as a JSON string. |
| extract_statements_as_xml(text, options?, server_url?) | str | Returns raw XML output from the model. |
| extract_statements_as_dict(text, options?, server_url?) | dict | Returns extraction result as a Python dictionary. |

Function Signatures

Python
def extract_statements(
    text: str,
    options: Optional[ExtractionOptions] = None,
    server_url: Optional[str] = None,
    **kwargs
) -> ExtractionResult:
    """
    Extract structured statements from text.

    Args:
        text: Input text to extract statements from
        options: Extraction options (or pass individual options as kwargs)
        server_url: URL of a running corp-extractor server to delegate to
                    (e.g. "http://localhost:8111"). When provided, no local
                    models are loaded.
        **kwargs: Individual option overrides (num_beams, diversity_penalty, etc.)

    Returns:
        ExtractionResult containing Statement objects
    """
Python
def extract_statements_as_json(
    text: str,
    options: Optional[ExtractionOptions] = None,
    indent: Optional[int] = 2,
    **kwargs
) -> str:
    """Returns JSON string representation of the extraction result."""
Python
def extract_statements_as_xml(
    text: str,
    options: Optional[ExtractionOptions] = None,
    **kwargs
) -> str:
    """Returns XML string with <statements> containing <stmt> elements."""
Python
def extract_statements_as_dict(
    text: str,
    options: Optional[ExtractionOptions] = None,
    **kwargs
) -> dict:
    """Returns dictionary representation of the extraction result."""

Usage Examples

Python
from statement_extractor import extract_statements, extract_statements_as_json

# Basic extraction
result = extract_statements("Apple acquired Beats for $3 billion.")
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")

# With options via kwargs
result = extract_statements(
    "Tesla announced new factories.",
    num_beams=6,
    diversity_penalty=1.5
)

# JSON output
json_str = extract_statements_as_json("OpenAI released GPT-4.", indent=2)
print(json_str)

Classes

StatementExtractor

The main extractor class with full control over device, model loading, and extraction options.

Python
class StatementExtractor:
    def __init__(
        self,
        model_id: str = "Corp-o-Rate-Community/statement-extractor",
        device: Optional[str] = None,
        torch_dtype: Optional[torch.dtype] = None,
        predicate_taxonomy: Optional[PredicateTaxonomy] = None,
        predicate_config: Optional[PredicateComparisonConfig] = None,
        scoring_config: Optional[ScoringConfig] = None,
    ):
        """
        Initialize the statement extractor.

        Args:
            model_id: HuggingFace model ID or local path
            device: Device to use ('cuda', 'cpu', or None for auto-detect)
            torch_dtype: Torch dtype (default: bfloat16 on GPU, float32 on CPU)
            predicate_taxonomy: Optional taxonomy for predicate normalization
            predicate_config: Configuration for predicate comparison
            scoring_config: Configuration for quality scoring
        """

    def extract(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> ExtractionResult:
        """Extract statements from text."""

    def extract_as_xml(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> str:
        """Extract statements and return raw XML output."""

    def extract_as_json(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
        indent: Optional[int] = 2,
    ) -> str:
        """Extract statements and return JSON string."""

    def extract_as_dict(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> dict:
        """Extract statements and return as dictionary."""

Example: Custom extractor with GPU control

Python
from statement_extractor import StatementExtractor, ExtractionOptions

# Force CPU usage
extractor = StatementExtractor(device="cpu")

# Extract with custom options
options = ExtractionOptions(num_beams=6, diversity_penalty=1.2)
result = extractor.extract("Microsoft partnered with OpenAI.", options)

ExtractionOptions

Configuration for the extraction process.

Python
class ExtractionOptions(BaseModel):
    # Beam search parameters
    num_beams: int = 4                    # 1-16, beams for diverse beam search
    diversity_penalty: float = 1.0        # >= 0.0, penalty for beam diversity
    max_new_tokens: int = 2048            # 128-8192, max tokens to generate
    min_statement_ratio: float = 1.0      # >= 0.0, min statements per sentence
    max_attempts: int = 3                 # 1-10, extraction retry attempts
    deduplicate: bool = True              # Remove duplicate statements

    # Predicate taxonomy & comparison
    predicate_taxonomy: Optional[PredicateTaxonomy] = None
    predicate_config: Optional[PredicateComparisonConfig] = None

    # Scoring configuration (v0.2.0)
    scoring_config: Optional[ScoringConfig] = None

    # Pluggable canonicalization
    entity_canonicalizer: Optional[Callable[[str], str]] = None

    # Mode flags
    merge_beams: bool = True              # Merge top-N beams vs select best
    embedding_dedup: bool = True          # Use embedding similarity for dedup

ScoringConfig

Quality scoring parameters for beam selection and triple assessment. Added in v0.2.0.

Python
class ScoringConfig(BaseModel):
    quality_weight: float = 1.0           # >= 0.0, weight for confidence scores
    coverage_weight: float = 0.5          # >= 0.0, bonus for source text coverage
    redundancy_penalty: float = 0.3       # >= 0.0, penalty for duplicate triples
    length_penalty: float = 0.1           # >= 0.0, penalty for verbosity
    min_confidence: float = 0.0           # 0.0-1.0, minimum confidence threshold
    merge_top_n: int = 3                  # 1-10, beams to merge when merge_beams=True

Tuning for precision vs recall:

| Use Case | min_confidence | Notes |
|---|---|---|
| High recall | 0.0 | Keep all extractions |
| Balanced | 0.5 | Filter low-confidence triples |
| High precision | 0.8 | Only keep high-confidence triples |
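Applying one of these profiles means passing a ScoringConfig through ExtractionOptions. A sketch, assuming ScoringConfig is importable from the package root like the other classes shown in this document:

```python
from statement_extractor import extract_statements, ExtractionOptions, ScoringConfig

# High-precision profile: drop triples the scorer is unsure about and
# penalize redundant extractions more heavily than the default.
options = ExtractionOptions(
    scoring_config=ScoringConfig(min_confidence=0.8, redundancy_penalty=0.5)
)
result = extract_statements("Apple acquired Beats for $3 billion.", options)
```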

PredicateTaxonomy

A taxonomy of canonical predicates for normalization.

Python
class PredicateTaxonomy(BaseModel):
    predicates: list[str]                 # List of canonical predicate forms
    name: Optional[str] = None            # Optional taxonomy name

    @classmethod
    def from_file(cls, path: str | Path) -> "PredicateTaxonomy":
        """Load taxonomy from a file (one predicate per line)."""

    @classmethod
    def from_list(cls, predicates: list[str], name: Optional[str] = None) -> "PredicateTaxonomy":
        """Create taxonomy from a list of predicates."""

Example:

Python
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements

# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "works_for", "located_in", "partnered_with"
])

# Use in extraction
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements("Google bought YouTube.", options)
# predicate "bought" maps to canonical "acquired"

PredicateComparisonConfig

Configuration for embedding-based predicate comparison.

Python
class PredicateComparisonConfig(BaseModel):
    embedding_model: str = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    similarity_threshold: float = 0.65    # 0.0-1.0, min similarity for taxonomy match
    dedup_threshold: float = 0.65         # 0.0-1.0, min similarity for duplicates
    normalize_text: bool = True           # Lowercase and strip before embedding
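A configuration sketch combining a taxonomy with a stricter similarity threshold, assuming PredicateComparisonConfig is exported from the package root like PredicateTaxonomy:

```python
from statement_extractor import (
    ExtractionOptions,
    PredicateComparisonConfig,
    PredicateTaxonomy,
)

taxonomy = PredicateTaxonomy.from_list(["acquired", "founded", "works_for"])
options = ExtractionOptions(
    predicate_taxonomy=taxonomy,
    # Require a closer embedding match before mapping to a canonical predicate.
    predicate_config=PredicateComparisonConfig(similarity_threshold=0.8),
)
```

Raising similarity_threshold trades taxonomy coverage for precision: fewer predicates are normalized, but those that are normalized match more closely.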

Data Models

All data models use Pydantic for validation and serialization.

Statement

A single extracted subject-predicate-object triple.

Python
class Statement(BaseModel):
    subject: Entity                              # The subject entity
    predicate: str                               # The relationship/predicate
    object: Entity                               # The object entity
    source_text: Optional[str] = None            # Original text span

    # Quality scoring fields (v0.2.0)
    confidence_score: Optional[float] = None     # 0.0-1.0, quality score (semantic + entity)
    evidence_span: Optional[tuple[int, int]] = None  # Character offsets in source
    canonical_predicate: Optional[str] = None    # Canonical form if taxonomy used

    def as_triple(self) -> tuple[str, str, str]:
        """Return as (subject, predicate, object) tuple."""

    def __str__(self) -> str:
        """Format: 'subject -- predicate --> object'"""

Example:

Python
stmt = result.statements[0]
print(stmt.subject.text)           # "Apple Inc."
print(stmt.predicate)              # "acquired"
print(stmt.object.text)            # "Beats Electronics"
print(stmt.confidence_score)       # 0.92
print(stmt.as_triple())            # ("Apple Inc.", "acquired", "Beats Electronics")

Entity

An entity representing a subject or object.

Python
class Entity(BaseModel):
    text: str                        # The entity text
    type: EntityType = UNKNOWN       # The entity type

    def __str__(self) -> str:
        """Format: 'text (TYPE)'"""

EntityType

Enumeration of supported entity types.

Python
class EntityType(str, Enum):
    ORG = "ORG"                 # Organization
    PERSON = "PERSON"           # Person
    GPE = "GPE"                 # Geopolitical entity (country, city, state)
    LOC = "LOC"                 # Non-GPE location
    PRODUCT = "PRODUCT"         # Product
    EVENT = "EVENT"             # Event
    WORK_OF_ART = "WORK_OF_ART" # Creative work
    LAW = "LAW"                 # Legal document
    DATE = "DATE"               # Date or time
    MONEY = "MONEY"             # Monetary value
    PERCENT = "PERCENT"         # Percentage
    QUANTITY = "QUANTITY"       # Quantity or measurement
    UNKNOWN = "UNKNOWN"         # Unknown type

ExtractionResult

Container for extraction results. Supports iteration and length.

Python
class ExtractionResult(BaseModel):
    statements: list[Statement] = []     # List of extracted statements
    source_text: Optional[str] = None    # Original input text

    def __len__(self) -> int:
        """Number of statements."""

    def __iter__(self):
        """Iterate over statements."""

    def to_triples(self) -> list[tuple[str, str, str]]:
        """Return all statements as (subject, predicate, object) tuples."""

Example:

Python
result = extract_statements(text)

# Iterate directly
for stmt in result:
    print(stmt)

# Check count
print(f"Found {len(result)} statements")

# Get as simple tuples
triples = result.to_triples()

PredicateMatch

Result of matching a predicate to a canonical form.

Python
class PredicateMatch(BaseModel):
    original: str                        # The original extracted predicate
    canonical: Optional[str] = None      # Matched canonical predicate, if any
    similarity: float = 0.0              # 0.0-1.0, cosine similarity score
    matched: bool = False                # Whether a match was found above threshold

Example:

Python
from statement_extractor import PredicateComparer, PredicateTaxonomy

taxonomy = PredicateTaxonomy.from_list(["acquired", "founded", "works_for"])
comparer = PredicateComparer(taxonomy=taxonomy)

match = comparer.match_to_canonical("bought")
print(match.original)     # "bought"
print(match.canonical)    # "acquired"
print(match.similarity)   # ~0.82
print(match.matched)      # True

Pipeline API

NEW in v0.5.0

The pipeline API provides comprehensive entity resolution and taxonomy classification through a 5-stage plugin architecture.

ExtractionPipeline

The main orchestrator class that runs all pipeline stages.

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

class ExtractionPipeline:
    def __init__(self, config: Optional[PipelineConfig] = None, server_url: Optional[str] = None):
        """
        Initialize the extraction pipeline.

        Args:
            config: Pipeline configuration (default: all stages enabled)
            server_url: URL of a running corp-extractor server to delegate to.
                        When provided, process() sends HTTP requests instead
                        of loading models locally.
        """

    def process(self, text: str, metadata: Optional[dict] = None) -> PipelineContext:
        """
        Process text through the pipeline stages.

        Args:
            text: Input text to process
            metadata: Optional source metadata (document ID, URL, etc.)

        Returns:
            PipelineContext with results from all stages
        """

Example:

Python
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans.")

print(f"Statements: {ctx.statement_count}")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

PipelineConfig

Configuration for stage and plugin selection.

Python
from statement_extractor.pipeline import PipelineConfig

class PipelineConfig(BaseModel):
    enabled_stages: set[int] = {1, 2, 3, 4, 5}  # Stages to run (1-5)
    enabled_plugins: Optional[set[str]] = None   # Plugins to enable (None = all)
    disabled_plugins: set[str] = set()           # Plugins to disable
    fail_fast: bool = False                       # Stop on first error
    parallel_processing: bool = False             # Enable parallel processing
    max_statements: Optional[int] = None          # Limit statements processed

    # Stage-specific options
    splitter_options: dict = {}
    extractor_options: dict = {}
    qualifier_options: dict = {}
    labeler_options: dict = {}
    taxonomy_options: dict = {}

    @classmethod
    def from_stage_string(cls, stages: str, **kwargs) -> "PipelineConfig":
        """Create config from stage string like '1-3' or '1,2,5'."""

    @classmethod
    def default(cls) -> "PipelineConfig":
        """All stages enabled."""

    @classmethod
    def minimal(cls) -> "PipelineConfig":
        """Only splitting and extraction (stages 1-2)."""

Example:

Python
# Run only stages 1-3
config = PipelineConfig(enabled_stages={1, 2, 3})

# Disable specific plugins
config = PipelineConfig(disabled_plugins={"sec_edgar_qualifier"})

# From stage string
config = PipelineConfig.from_stage_string("1-3")

PipelineContext

Data container that flows through all pipeline stages.

Python
from statement_extractor.pipeline import PipelineContext

class PipelineContext(BaseModel):
    # Input
    source_text: str                                    # Original input text
    source_metadata: dict = {}                          # Document metadata

    # Stage outputs
    raw_triples: list[RawTriple] = []                   # Stage 1 output
    statements: list[PipelineStatement] = []           # Stage 2 output
    canonical_entities: dict[str, CanonicalEntity] = {} # Stage 3 output
    labeled_statements: list[LabeledStatement] = []    # Stage 4 output
    taxonomy_results: dict[tuple, list[TaxonomyResult]] = {}  # Stage 5 output (multiple labels per statement)

    # Processing metadata
    processing_errors: list[str] = []
    processing_warnings: list[str] = []
    stage_timings: dict[str, float] = {}

    @property
    def statement_count(self) -> int:
        """Number of statements in final output."""

    @property
    def has_errors(self) -> bool:
        """Check if any errors occurred."""

PluginRegistry

Registry for discovering and managing plugins.

Python
from statement_extractor.pipeline import PluginRegistry

class PluginRegistry:
    @classmethod
    def list_plugins(cls, stage: Optional[int] = None) -> list[dict]:
        """List all registered plugins, optionally filtered by stage."""

    @classmethod
    def get_plugin(cls, name: str) -> Optional[BasePlugin]:
        """Get a plugin by name."""

Pipeline Data Models

RawTriple

Output of Stage 1 (Splitting).

Python
class RawTriple(BaseModel):
    subject_text: str                    # Raw subject text
    predicate_text: str                  # Raw predicate text
    object_text: str                     # Raw object text
    source_sentence: str                 # Source sentence
    confidence: float = 1.0              # Extraction confidence (0-1)

    def as_tuple(self) -> tuple[str, str, str]:
        """Return as (subject, predicate, object) tuple."""

PipelineStatement

Output of Stage 2 (Extraction).

Python
class PipelineStatement(BaseModel):
    subject: ExtractedEntity             # Subject with type, span, confidence
    predicate: str                       # Predicate text
    predicate_category: Optional[str]    # Predicate category (e.g., "employment_leadership")
    object: ExtractedEntity              # Object with type, span, confidence
    source_text: str                     # Source text
    confidence_score: float = 1.0        # Overall confidence (from GLiNER2)
    extraction_method: Optional[str]     # Method: gliner_relation

Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. Relations are sorted by confidence (descending).


GLiNER2Extractor

The Stage 2 extractor plugin that uses GLiNER2 for relation extraction.

Python
from statement_extractor.plugins.extractors.gliner2 import GLiNER2Extractor

class GLiNER2Extractor(BaseExtractorPlugin):
    def __init__(
        self,
        predicates: Optional[list[str]] = None,
        predicates_file: Optional[str | Path] = None,
        entity_types: Optional[list[str]] = None,
        use_default_predicates: bool = True,
    ):
        """
        Initialize the GLiNER2 extractor.

        Args:
            predicates: Custom list of predicate names
            predicates_file: Path to custom predicates JSON file
            entity_types: Entity types to extract (default: all)
            use_default_predicates: Use the 324 built-in predicates when no custom list is provided
        """

Key behaviors:

  • Uses include_confidence=True for real confidence scores from GLiNER2
  • Iterates through 21 predicate categories to stay under GLiNER2's ~25 label limit
  • Returns all matching relations per source sentence (filtered later)
  • Predicates loaded from default_predicates.json (324 predicates)
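For example, to restrict Stage 2 to a small custom predicate set instead of the 324 built-in predicates, using the constructor shown above:

```python
from statement_extractor.plugins.extractors.gliner2 import GLiNER2Extractor

# Only these relations will be extracted; the built-in predicate
# categories are skipped entirely.
extractor = GLiNER2Extractor(
    predicates=["acquired", "partnered_with", "appointed"],
    use_default_predicates=False,
)
```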

EntityQualifiers

Qualifiers added in Stage 3.

Python
class EntityQualifiers(BaseModel):
    # Semantic qualifiers
    org: Optional[str] = None            # Organization/employer
    role: Optional[str] = None           # Job title/position

    # Location qualifiers
    region: Optional[str] = None         # State/province
    country: Optional[str] = None        # Country
    city: Optional[str] = None           # City
    jurisdiction: Optional[str] = None   # Legal jurisdiction

    # External identifiers
    identifiers: dict[str, str] = {}     # lei, ch_number, sec_cik, ticker, etc.

    def has_any_qualifier(self) -> bool:
        """Check if any qualifier is set."""

CanonicalMatch

Result of canonical matching in Stage 3.

Python
class CanonicalMatch(BaseModel):
    canonical_id: Optional[str]          # ID in canonical database
    canonical_name: Optional[str]        # Canonical name/label
    match_method: str                    # identifier, name_exact, name_fuzzy, embedding
    match_confidence: float = 1.0        # Confidence in match (0-1)
    match_details: Optional[dict]        # Additional match details

CanonicalEntity

Output of Stage 3 (Entity Qualification).

Python
class CanonicalEntity(BaseModel):
    entity_ref: str                      # Reference to original entity
    original_text: str                   # Original entity text
    entity_type: EntityType              # Entity type
    qualifiers: EntityQualifiers         # Qualifiers and identifiers
    canonical_match: Optional[CanonicalMatch]  # Canonical match if found
    fqn: str                             # Fully Qualified Name
    qualification_sources: list[str]     # Plugins that contributed

StatementLabel

A label applied in Stage 4.

Python
class StatementLabel(BaseModel):
    label_type: str                      # sentiment, relation_type, confidence
    label_value: Union[str, float, bool] # The label value
    confidence: float = 1.0              # Confidence in label
    labeler: Optional[str]               # Plugin that produced the label

LabeledStatement

Final output from Stage 4 (Labeling).

Python
class LabeledStatement(BaseModel):
    statement: PipelineStatement         # Original statement
    subject_canonical: CanonicalEntity   # Canonicalized subject
    object_canonical: CanonicalEntity    # Canonicalized object
    labels: list[StatementLabel] = []    # Applied labels

    @property
    def subject_fqn(self) -> str:
        """Subject's fully qualified name."""

    @property
    def object_fqn(self) -> str:
        """Object's fully qualified name."""

    def get_label(self, label_type: str) -> Optional[StatementLabel]:
        """Get label by type."""

    def as_dict(self) -> dict:
        """Convert to simplified dictionary."""

Example:

Python
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

    # Access labels
    sentiment = stmt.get_label("sentiment")
    if sentiment:
        print(f"  Sentiment: {sentiment.label_value}")

    # Access qualifiers
    subject_quals = stmt.subject_canonical.qualifiers
    if subject_quals.role:
        print(f"  Role: {subject_quals.role}")

TaxonomyResult

Output of Stage 5 (Taxonomy) classification.

Python
class TaxonomyResult(BaseModel):
    taxonomy_name: str                   # e.g., "esg_topics"
    category: str                        # Top-level category
    label: str                           # Specific label
    label_id: Optional[int] = None       # Numeric ID if available
    confidence: float = 1.0              # Classification confidence (0-1)
    classifier: Optional[str] = None     # Plugin that produced this result
    metadata: dict = {}                  # Additional metadata

    @property
    def full_label(self) -> str:
        """Return category:label format."""

Example:

Python
# Access taxonomy results from context
# Each statement may have multiple labels above the threshold
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"Statement: {source_text[:50]}...")
    print(f"  Taxonomy: {taxonomy_name}")
    print(f"  Labels ({len(results)}):")
    for result in results:
        print(f"    - {result.full_label} (confidence: {result.confidence:.2f})")

ClassificationSchema

Schema for simple multi-choice classification (2-20 options). Used by labelers that need GLiNER2 to perform classification.

Python
class ClassificationSchema(BaseModel):
    label_type: str                      # e.g., "sentiment"
    choices: list[str]                   # Available choices
    description: str = ""                # Description for the classifier
    scope: str = "statement"             # statement or entity

TaxonomySchema

Schema for large taxonomy classification (100+ values). Used by taxonomy plugins.

Python
class TaxonomySchema(BaseModel):
    label_type: str                      # e.g., "taxonomy"
    values: list[str] | dict[str, list[str]]  # Flat list or category -> labels
    description: str = ""
    scope: str = "statement"
    label_descriptions: Optional[dict[str, str]] = None  # Descriptions for labels
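A construction sketch for both schema types. The import path is an assumption; adjust it to wherever these models live in your installation.

```python
from statement_extractor.models import ClassificationSchema, TaxonomySchema

# Small multi-choice classification, handled by a labeler
sentiment = ClassificationSchema(
    label_type="sentiment",
    choices=["positive", "neutral", "negative"],
    description="Overall sentiment of the statement",
)

# Large taxonomy with category -> labels structure, handled by a taxonomy plugin
esg = TaxonomySchema(
    label_type="taxonomy",
    values={
        "environment": ["emissions", "energy use"],
        "social": ["labor practices", "community"],
    },
    description="ESG topic classification",
)
```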

Server API

NEW in v0.9.7

The corp-extractor serve command starts a FastAPI server that keeps all models warm in memory. All endpoints return standardized Pydantic model_dump() JSON responses. The server can be called from the CLI (--server flag), the Python API (server_url parameter), or any HTTP client.

GET / — Health Check

Returns server status, device info, loaded models, and registered plugins.

Bash
curl http://localhost:8111/

Response:

JSON
{
  "status": "ok",
  "device": "cuda:0",
  "cuda_available": true,
  "mps_available": false,
  "models_loaded": {"extractor": true, "pipeline": true},
  "plugins": {
    "splitters": ["t5_gemma_splitter"],
    "extractors": ["gliner2_extractor"],
    "qualifiers": ["person_qualifier", "embedding_company_qualifier"],
    "labelers": ["sentiment_labeler", "confidence_labeler", "relation_type_labeler"],
    "taxonomy": ["embedding_taxonomy_classifier"]
  }
}

POST /pipeline — Full Pipeline

Runs the full extraction pipeline. Request body:

Python
class PipelineRequest(BaseModel):
    text: str                              # Input text
    config: dict[str, Any] = {}            # Pipeline configuration

Config supports: enabled_stages (str like "1-3" or list), disabled_plugins (list), enabled_plugins (list), extractor_options, splitter_options, qualifier_options, labeler_options, taxonomy_options.

Response: PipelineContext.model_dump() JSON.

Bash
curl -X POST http://localhost:8111/pipeline \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple CEO Tim Cook announced a new iPhone.", "config": {"enabled_stages": "1-3"}}'

POST /split — Stage 1 Only

Runs T5-Gemma extraction only. Request body:

Python
class SplitRequest(BaseModel):
    text: str                              # Input text
    options: dict[str, Any] = {}           # ExtractionOptions fields

Supported options include: num_beams, diversity_penalty, max_new_tokens, deduplicate, etc.

Response: ExtractionResult.model_dump() JSON.

Bash
curl -X POST http://localhost:8111/split \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple announced a new iPhone.", "options": {"num_beams": 6}}'

POST /document — Document Pipeline

Runs the document pipeline for text input. Request body:

Python
class DocumentRequest(BaseModel):
    text: str                              # Document text
    title: Optional[str] = None            # Document title
    stages: str = "1-6"                    # Pipeline stages
    max_tokens: int = 1000                 # Target tokens per chunk
    overlap: int = 100                     # Token overlap between chunks
    no_summary: bool = False               # Skip summarization
    no_dedup: bool = False                 # Skip cross-chunk deduplication
Bash
curl -X POST http://localhost:8111/document \
  -H "Content-Type: application/json" \
  -d '{"text": "Long document text...", "title": "Annual Report", "stages": "1-3"}'

Configuration

The corp-extractor library provides fine-grained control over extraction behavior through configuration classes. This section covers all configuration options for tuning precision, recall, and performance.


ExtractionOptions

The primary configuration class for controlling extraction behavior.

Parameter | Type | Default | Description
num_beams | int | 4 | Number of beam search candidates
diversity_penalty | float | 1.0 | Penalty for beam diversity in diverse beam search
max_new_tokens | int | 2048 | Maximum generation length in tokens
deduplicate | bool | True | Remove duplicate statements from output
merge_beams | bool | True | Merge top beams into single result set (v0.2.0)
embedding_dedup | bool | True | Use embedding similarity for deduplication (v0.2.0)
predicates | list[str] | None | Predefined predicates for GLiNER2 relation extraction (v0.4.0)
all_triples | bool | False | Keep all candidate triples instead of best per source
predicate_taxonomy | PredicateTaxonomy | None | Taxonomy of canonical predicates
scoring_config | ScoringConfig | None | Quality scoring configuration
entity_canonicalizer | Callable | None | Custom function for entity canonicalization

Basic usage:

Python
from statement_extractor import ExtractionOptions, extract_statements

options = ExtractionOptions(
    num_beams=6,
    diversity_penalty=1.2,
    deduplicate=True
)

result = extract_statements("Apple acquired Beats for $3 billion.", options)

ScoringConfig

Added in v0.2.0

Configuration for quality scoring, filtering, and beam selection. Use this to tune the precision-recall tradeoff.

Parameter | Type | Default | Description
min_confidence | float | 0.0 | Filter threshold (0 = recall, 0.7+ = precision)
quality_weight | float | 1.0 | Weight for confidence scores
coverage_weight | float | 0.5 | Weight for source text coverage
redundancy_penalty | float | 0.3 | Penalty for duplicate triples
length_penalty | float | 0.1 | Penalty for verbose predicates/entities
merge_top_n | int | 3 | Number of beams to merge

Common configurations:

Python
from statement_extractor import ScoringConfig, ExtractionOptions, extract_statements

# High precision mode - only keep confident extractions
precision_config = ScoringConfig(
    min_confidence=0.7,
    quality_weight=1.5,
    redundancy_penalty=0.5
)

# High recall mode - keep everything
recall_config = ScoringConfig(
    min_confidence=0.0,
    quality_weight=0.5,
    redundancy_penalty=0.1
)

# Use in extraction
options = ExtractionOptions(scoring_config=precision_config)
result = extract_statements(text, options)

Precision vs recall tuning:

Use Case | min_confidence | quality_weight | Notes
Maximum recall | 0.0 | 0.5 | Keep all extractions
Balanced | 0.4 | 1.0 | Good default
High precision | 0.7 | 1.5 | Fewer false positives
Knowledge base | 0.8 | 2.0 | Very strict
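The tuning table can be captured as plain keyword-argument presets. A minimal sketch, assuming the preset names are your own and the dicts are simply unpacked into `ScoringConfig(**preset)`:

```python
# Hypothetical presets mirroring the tuning guidance above.
# Only min_confidence and quality_weight are set; other ScoringConfig
# fields keep their defaults.
SCORING_PRESETS = {
    "max_recall": {"min_confidence": 0.0, "quality_weight": 0.5},
    "balanced": {"min_confidence": 0.4, "quality_weight": 1.0},
    "high_precision": {"min_confidence": 0.7, "quality_weight": 1.5},
    "knowledge_base": {"min_confidence": 0.8, "quality_weight": 2.0},
}


def pick_preset(name: str) -> dict:
    """Return a copy of the named preset's kwargs."""
    return dict(SCORING_PRESETS[name])
```

Usage: `ScoringConfig(**pick_preset("high_precision"))`.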

PredicateComparisonConfig

Added in v0.2.0

Configuration for embedding-based predicate comparison and taxonomy matching. Requires the [embeddings] extra.

Parameter | Type | Default | Description
embedding_model | str | paraphrase-MiniLM-L6-v2 | Model for computing similarity
similarity_threshold | float | 0.65 | Minimum similarity for taxonomy matching
dedup_threshold | float | 0.65 | Minimum similarity to consider duplicates
normalize_text | bool | True | Lowercase/strip predicates before embedding

Custom thresholds:

Python
from statement_extractor import (
    PredicateComparisonConfig,
    PredicateTaxonomy,
    ExtractionOptions,
    extract_statements
)

# Stricter matching for precision
config = PredicateComparisonConfig(
    similarity_threshold=0.75,
    dedup_threshold=0.80,
    normalize_text=True
)

taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "works_for", "located_in",
    "partnered_with", "invested_in", "announced"
])

options = ExtractionOptions(
    predicate_taxonomy=taxonomy,
    predicate_config=config
)

result = extract_statements("Google bought YouTube in 2006.", options)

PipelineConfig

NEW in v0.5.0

Configuration for the extraction pipeline (stages 1-6). Controls which stages run, which plugins are enabled, and stage-specific options.

Parameter | Type | Default | Description
enabled_stages | set[int] | {1, 2, 3, 4, 5} | Stages to run (1-6)
enabled_plugins | set[str] or None | None | Plugins to enable (None = all)
disabled_plugins | set[str] | set() | Plugins to disable
fail_fast | bool | False | Stop on first error
parallel_processing | bool | False | Enable parallel processing
max_statements | int or None | None | Limit on statements processed

Stage selection examples:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run only splitting and extraction (stages 1-2)
config = PipelineConfig(enabled_stages={1, 2})

# Run stages 1-3 (skip canonicalization and labeling)
config = PipelineConfig(enabled_stages={1, 2, 3})

# From stage string
config = PipelineConfig.from_stage_string("1-3")  # {1, 2, 3}
config = PipelineConfig.from_stage_string("1,2,5")  # {1, 2, 5}

# Use presets
config = PipelineConfig.default()   # All 5 stages
config = PipelineConfig.minimal()   # Stages 1-2 only

Plugin selection examples:

Python
# Disable specific plugins
config = PipelineConfig(
    disabled_plugins={"sec_edgar_qualifier", "companies_house_qualifier"}
)

# Enable only specific plugins
config = PipelineConfig(
    enabled_plugins={"t5_gemma_splitter", "gliner2_extractor", "person_qualifier"}
)
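The interaction of the two plugin sets can be sketched as a single predicate. This is an assumption about the semantics (enabled_plugins=None means "all", and disabled_plugins wins on conflict), not the library's actual implementation:

```python
def plugin_is_active(name: str, enabled=None, disabled=frozenset()) -> bool:
    """Sketch of the plugin selection semantics described above.

    Assumptions: enabled=None enables every registered plugin, and a name
    in `disabled` is always off, even if also listed in `enabled`.
    """
    if name in disabled:
        return False
    return enabled is None or name in enabled
```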

Stage-specific options:

Python
config = PipelineConfig(
    splitter_options={
        "num_beams": 6,
        "diversity_penalty": 1.2,
    },
    extractor_options={
        "predicates_file": "/path/to/custom_predicates.json",  # Custom predicate file
    },
    qualifier_options={
        "timeout": 10.0,  # API timeout
    },
)

GLiNER2 Extractor Options:

Option | Type | Default | Description
predicates_file | str or Path | None | Path to custom predicates JSON file
predicates | list[str] | None | Custom list of predicate names (overrides file)
entity_types | list[str] | all types | Entity types to extract
use_default_predicates | bool | True | Use 324 built-in predicates when no custom ones provided

Custom Predicates File Format:

JSON
{
  "category_name": {
    "predicate_name": {
      "description": "Description for semantic matching",
      "threshold": 0.7
    }
  }
}

Example:

JSON
{
  "employment": {
    "works_for": {"description": "Employment relationship", "threshold": 0.75},
    "manages": {"description": "Management relationship", "threshold": 0.7}
  },
  "ownership": {
    "owns": {"description": "Ownership relationship", "threshold": 0.7},
    "acquired": {"description": "Acquisition of entity", "threshold": 0.75}
  }
}
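A loader for documents of this shape can flatten the nested categories into rows. A minimal sketch, assuming every predicate entry carries both keys shown above; `load_predicates` and its tuple output are illustrative, not the library's loader:

```python
import json


def load_predicates(json_text: str) -> list[tuple[str, str, str, float]]:
    """Parse a custom predicates document (category -> predicate -> spec)
    into (predicate, category, description, threshold) rows.

    Illustrative sketch; assumes each spec has "description" and "threshold".
    """
    data = json.loads(json_text)
    rows = []
    for category, predicates in data.items():
        for name, spec in predicates.items():
            rows.append((name, category, spec["description"], spec["threshold"]))
    return rows
```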

Stage Combinations

Common stage combinations for different use cases:

Use Case | Stages | Description
Fast extraction | {1, 2} | Basic triples with entity types
With qualifiers | {1, 2, 3} | Add roles, identifiers (no canonicalization)
Full resolution | {1, 2, 3, 4} | Canonical forms, FQNs (no labeling)
Full pipeline | {1, 2, 3, 4, 5} | All stages except taxonomy
Complete pipeline | {1, 2, 3, 4, 5, 6} | All stages including taxonomy
Labeling only | {1, 2, 5} | Skip qualification/canonicalization
Python
# Fast extraction for high-volume processing
fast_config = PipelineConfig.minimal()

# Full resolution for knowledge graph building
full_config = PipelineConfig.default()

# Custom: qualifiers without external APIs
internal_config = PipelineConfig(
    enabled_stages={1, 2, 3, 4, 5},
    disabled_plugins={"gleif_qualifier", "companies_house_qualifier", "sec_edgar_qualifier"},
)

Entity Types

Corp-extractor classifies extracted subjects and objects into 13 entity types based on common Named Entity Recognition (NER) standards. Understanding these types helps you filter and process extracted statements effectively.

Complete Entity Type Reference

Type | Description | Examples
ORG | Organizations, companies, agencies | Apple Inc., United Nations, FBI
PERSON | Individual people | Tim Cook, Elon Musk, Jane Doe
GPE | Geopolitical entities (countries, cities, states) | United States, California, Paris
LOC | Non-GPE locations | Pacific Ocean, Mount Everest, Central Park
PRODUCT | Products and services | iPhone 15, Model S, Gmail
EVENT | Events and occurrences | CES 2024, Annual Meeting, World Cup
WORK_OF_ART | Creative works, documents, reports | Sustainability Report, Mona Lisa
LAW | Legal documents and regulations | GDPR, Clean Air Act, Section 230
DATE | Dates and time periods | Q3 2024, January 15, 2030
MONEY | Monetary values | $4.7 billion, 100 million euros
PERCENT | Percentages | 30%, 0.5%, 100%
QUANTITY | Quantities and measurements | 1,000 employees, 50 megawatts
UNKNOWN | Unclassified entities (fallback) | (varies)

Accessing Entity Types in Code

Each extracted statement contains subject and object entities with a type attribute:

Python
from statement_extractor import extract_statements

result = extract_statements("Apple CEO Tim Cook announced the iPhone 15.")

for stmt in result:
    print(f"Subject: {stmt.subject.text} ({stmt.subject.type})")
    print(f"Object: {stmt.object.text} ({stmt.object.type})")

Output:

Text
Subject: Apple (ORG)
Object: Tim Cook (PERSON)
Subject: Tim Cook (PERSON)
Object: iPhone 15 (PRODUCT)

You can also import the EntityType enum for type checking and comparisons:

Python
from statement_extractor import extract_statements, EntityType

result = extract_statements("Microsoft acquired Activision for $69 billion.")

for stmt in result:
    if stmt.subject.type == EntityType.ORG:
        print(f"Organization found: {stmt.subject.text}")
    if stmt.object.type == EntityType.MONEY:
        print(f"Monetary value: {stmt.object.text}")

Filtering by Entity Type

A common use case is extracting only statements involving specific entity types. Here is how to filter statements by subject or object type:

Python
from statement_extractor import extract_statements, EntityType

text = """
Apple announced revenue of $94.8 billion for Q3 2024.
CEO Tim Cook presented at the company's Cupertino headquarters.
The new iPhone 16 features improved battery life of 22 hours.
"""

result = extract_statements(text)

# Filter for statements where subject is an organization
org_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.ORG
]

# Filter for statements involving monetary values
money_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.MONEY or stmt.object.type == EntityType.MONEY
]

# Filter for statements about people
person_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.PERSON or stmt.object.type == EntityType.PERSON
]

print(f"Found {len(org_statements)} statements from organizations")
print(f"Found {len(money_statements)} statements with monetary values")
print(f"Found {len(person_statements)} statements about people")

The UNKNOWN Type

The UNKNOWN entity type is used as a fallback when the model cannot confidently classify an entity into one of the 12 standard categories. This typically occurs with:

  • Specialized domain terms: Technical jargon, industry-specific terminology
  • Ambiguous entities: Terms that could fit multiple categories depending on context
  • Novel entities: New terms or concepts not well-represented in training data
  • Abstract concepts: Ideas or qualities that do not fit standard NER categories
Python
from statement_extractor import extract_statements, EntityType

result = extract_statements("The synergy initiative improved operational efficiency.")

for stmt in result:
    if stmt.subject.type == EntityType.UNKNOWN:
        print(f"Unclassified entity: {stmt.subject.text}")
        # Consider manual review or domain-specific handling

When you encounter UNKNOWN entities, consider:

  1. Manual review: Inspect the entity text to determine appropriate handling
  2. Domain mapping: Create application-specific mappings for recurring unknown entities
  3. Context analysis: Use surrounding statements to infer the entity's likely type
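The domain-mapping suggestion above can be a simple lookup with a fallback. A minimal sketch; the mapping table, the target labels (PROGRAM, METRIC), and `resolve_unknown` are all hypothetical, application-specific choices:

```python
# Hypothetical application-level mapping for recurring UNKNOWN entities.
# The target labels are your own domain vocabulary, not library entity types.
DOMAIN_TYPE_MAP = {
    "synergy initiative": "PROGRAM",
    "operational efficiency": "METRIC",
}


def resolve_unknown(entity_text: str, fallback: str = "UNKNOWN") -> str:
    """Map a recurring unknown entity to a domain label, else keep the fallback."""
    return DOMAIN_TYPE_MAP.get(entity_text.lower().strip(), fallback)
```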

Entity Type Standards

Corp-extractor's entity types are based on widely-adopted NER standards, including:

  • OntoNotes 5.0: The primary source for entity type definitions
  • ACE (Automatic Content Extraction): Influences the GPE vs LOC distinction
  • CoNLL-2003: Foundational NER task categories

This alignment with established standards ensures compatibility with other NLP tools and facilitates integration into existing data pipelines.

Examples

This section provides practical examples demonstrating common use cases for the corp-extractor library.


Basic Extraction

Extract statements from text and format the output:

Python
from statement_extractor import extract_statements

text = """
Microsoft announced a partnership with OpenAI in 2019.
The deal was valued at $1 billion and aimed to develop
artificial general intelligence.
"""

result = extract_statements(text)

# Iterate over statements
for stmt in result:
    subject = f"{stmt.subject.text} ({stmt.subject.type})"
    object_ = f"{stmt.object.text} ({stmt.object.type})"
    print(f"{subject} -- {stmt.predicate} --> {object_}")

# Check confidence scores
for stmt in result:
    score = stmt.confidence_score or 0.0
    print(f"[{score:.2f}] {stmt}")

Output:

Text
Microsoft (ORG) -- partnered with --> OpenAI
Microsoft (ORG) -- announced --> partnership
OpenAI (ORG) -- partnership valued at --> $1 billion
Microsoft (ORG) -- aims to develop --> artificial general intelligence

Batch Processing

Use the StatementExtractor class for processing multiple texts efficiently. The model loads once and is reused for all extractions:

Python
from statement_extractor import StatementExtractor

# Initialize extractor with GPU
extractor = StatementExtractor(device="cuda")

texts = [
    "Apple acquired Beats Electronics for $3 billion.",
    "Google was founded by Larry Page and Sergey Brin in 1998.",
    "Amazon announced a new fulfillment center in Texas."
]

# Process multiple texts
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements in: {text[:40]}...")
    for stmt in result:
        print(f"  - {stmt}")
    print()

For CPU-only environments:

Python
# Force CPU usage
extractor = StatementExtractor(device="cpu")

Confidence Filtering

v0.2.0

Filter statements by confidence score to control precision vs recall:

Python
from statement_extractor import extract_statements, ScoringConfig, ExtractionOptions

text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."

# High precision mode - only high-confidence statements
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)

print("High-confidence statements:")
for stmt in result:
    print(f"  [{stmt.confidence_score:.2f}] {stmt}")

You can also filter after extraction for more control:

Python
# Extract all statements first
result = extract_statements(text)

# Apply custom thresholds
high_confidence = [s for s in result if (s.confidence_score or 0) >= 0.8]
medium_confidence = [s for s in result if 0.5 <= (s.confidence_score or 0) < 0.8]
low_confidence = [s for s in result if (s.confidence_score or 0) < 0.5]

print(f"High: {len(high_confidence)}, Medium: {len(medium_confidence)}, Low: {len(low_confidence)}")

Predicate Taxonomy

Map extracted predicates to a controlled vocabulary of canonical forms:

Python
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements

# Define your canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "announced",
    "invested_in", "partnered_with", "committed_to"
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)

text = "Google bought YouTube in 2006. Sequoia Capital backed the video platform."
result = extract_statements(text, options)

# View predicate normalization
for stmt in result:
    original = stmt.predicate
    canonical = stmt.canonical_predicate
    if canonical and canonical != original:
        print(f"'{original}' -> '{canonical}'")
    print(f"  {stmt.subject.text} -- {canonical or original} --> {stmt.object.text}")

Output:

Text
'bought' -> 'acquired'
  Google -- acquired --> YouTube
'backed' -> 'invested_in'
  Sequoia Capital -- invested_in --> YouTube

Load taxonomy from a file:

Python
# predicates.txt contains one predicate per line
taxonomy = PredicateTaxonomy.from_file("predicates.txt")

Export Formats

Export extraction results in multiple formats for integration with other systems:

Python
from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict
)

text = "Netflix acquired Spry Fox, a game development studio, in 2022."

# JSON output (default 2-space indent)
json_str = extract_statements_as_json(text)
print(json_str)

# Compact JSON
json_compact = extract_statements_as_json(text, indent=None)

# XML output (raw model format)
xml_str = extract_statements_as_xml(text)
print(xml_str)

# Dictionary output (for programmatic use)
data = extract_statements_as_dict(text)
for stmt in data["statements"]:
    print(f"{stmt['subject']['text']} -> {stmt['predicate']} -> {stmt['object']['text']}")

JSON output format:

JSON
{
  "statements": [
    {
      "subject": {"text": "Netflix", "type": "ORG"},
      "predicate": "acquired",
      "object": {"text": "Spry Fox", "type": "ORG"},
      "source_text": "Netflix acquired Spry Fox",
      "confidence_score": 0.94
    }
  ],
  "source_text": "Netflix acquired Spry Fox, a game development studio, in 2022."
}

Disabling Embeddings

Skip embedding-based features for faster processing when you don't need predicate normalization or semantic deduplication:

Python
from statement_extractor import ExtractionOptions, extract_statements

# Disable embedding-based deduplication
options = ExtractionOptions(
    embedding_dedup=False,      # Use exact string matching for dedup
    predicate_taxonomy=None     # No predicate normalization
)

result = extract_statements(text, options)

When to disable embeddings:

Scenario | Recommendation
Speed critical | Disable embeddings
No GPU available | Consider disabling for faster CPU processing
Need semantic dedup | Keep embeddings enabled
Using predicate taxonomy | Keep embeddings enabled
Simple text, few duplicates | Disable embeddings
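With embedding_dedup=False, deduplication falls back to exact string matching, which behaves roughly like the sketch below (assumed: first occurrence wins after lowercasing and stripping; the library's actual normalization may differ):

```python
def dedup_exact(triples):
    """Exact-string deduplication sketch: keep the first triple for each
    normalized (subject, predicate, object) key.

    Illustrates what disabling embedding dedup implies; not the library's code.
    """
    seen, kept = set(), []
    for s, p, o in triples:
        key = (s.lower().strip(), p.lower().strip(), o.lower().strip())
        if key not in seen:
            seen.add(key)
            kept.append((s, p, o))
    return kept
```

Note that, unlike embedding dedup, this keeps near-duplicates such as "bought"/"acquired" as separate statements.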

Custom Entity Canonicalization

Provide a custom function to normalize entity names:

Python
from statement_extractor import ExtractionOptions, extract_statements

# Define a canonicalization function
def canonicalize_entity(text: str) -> str:
    """Normalize entity names to canonical forms."""
    mappings = {
        "apple": "Apple Inc.",
        "apple inc": "Apple Inc.",
        "apple inc.": "Apple Inc.",
        "google": "Alphabet Inc.",
        "google llc": "Alphabet Inc.",
        "alphabet": "Alphabet Inc.",
        "msft": "Microsoft Corporation",
        "microsoft": "Microsoft Corporation",
    }
    return mappings.get(text.lower().strip(), text)

options = ExtractionOptions(entity_canonicalizer=canonicalize_entity)

text = "Apple and Google announced a partnership. Microsoft joined later."
result = extract_statements(text, options)

for stmt in result:
    # Entities are now canonicalized
    print(f"{stmt.subject.text} -- {stmt.predicate} --> {stmt.object.text}")

Output:

Text
Apple Inc. -- partnered with --> Alphabet Inc.
Microsoft Corporation -- joined --> partnership

Full Pipeline Example

Combining multiple features for production use:

Python
from statement_extractor import (
    StatementExtractor,
    ExtractionOptions,
    ScoringConfig,
    PredicateTaxonomy,
    PredicateComparisonConfig
)

# Configure scoring for high precision
scoring = ScoringConfig(
    min_confidence=0.6,
    quality_weight=1.0,
    redundancy_penalty=0.5
)

# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "invested_in", "partnered_with",
    "announced", "launched", "hired", "appointed"
])

# Configure predicate matching
predicate_config = PredicateComparisonConfig(
    similarity_threshold=0.7,
    dedup_threshold=0.8
)

# Initialize extractor
extractor = StatementExtractor(
    device="cuda",
    predicate_taxonomy=taxonomy,
    predicate_config=predicate_config,
    scoring_config=scoring
)

# Configure extraction options
options = ExtractionOptions(
    num_beams=6,
    diversity_penalty=1.2,
    deduplicate=True,
    merge_beams=True
)

# Process text
text = """
Amazon Web Services announced a strategic partnership with Anthropic,
investing up to $4 billion in the AI safety startup. The deal, announced
in September 2023, makes AWS Anthropic's primary cloud provider.
"""

result = extractor.extract(text, options)

print(f"Extracted {len(result)} high-confidence statements:\n")
for stmt in result:
    canonical = stmt.canonical_predicate or stmt.predicate
    score = stmt.confidence_score or 0.0
    print(f"[{score:.2f}] {stmt.subject.text} ({stmt.subject.type})")
    print(f"       -- {canonical} -->")
    print(f"       {stmt.object.text} ({stmt.object.type})")
    print()

Output:

Text
Extracted 4 high-confidence statements:

[0.92] Amazon Web Services (ORG)
       -- partnered_with -->
       Anthropic (ORG)

[0.88] Amazon Web Services (ORG)
       -- invested_in -->
       Anthropic (ORG)

[0.85] Amazon Web Services (ORG)
       -- invested_in -->
       $4 billion (MONEY)

[0.78] AWS (ORG)
       -- is primary cloud provider for -->
       Anthropic (ORG)

Pipeline Examples

NEW in v0.5.0

Full Pipeline with Corporate Text

Process corporate announcements with full entity resolution:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

pipeline = ExtractionPipeline()

text = """
Amazon CEO Andy Jassy announced plans to hire 10,000 workers in the UK.
The expansion will focus on Amazon Web Services operations in London.
"""

ctx = pipeline.process(text)

print(f"Extracted {ctx.statement_count} statements\n")

for stmt in ctx.labeled_statements:
    # FQN includes role and organization
    print(f"Subject: {stmt.subject_fqn}")
    print(f"Predicate: {stmt.statement.predicate}")
    print(f"Object: {stmt.object_fqn}")

    # Access labels
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")

    # Access qualifiers
    subject_quals = stmt.subject_canonical.qualified_entity.qualifiers
    if subject_quals.role:
        print(f"  Role: {subject_quals.role}")
    if subject_quals.org:
        print(f"  Organization: {subject_quals.org}")

    print("-" * 40)

Output:

Text
Extracted 2 statements

Subject: Andy Jassy (CEO, Amazon)
Predicate: announced
Object: plans to hire 10,000 workers in the UK
  sentiment: positive
  Role: CEO
  Organization: Amazon
----------------------------------------
Subject: Amazon (AMZN)
Predicate: expanding operations in
Object: London (UK)
  sentiment: positive
----------------------------------------

Running Specific Stages

Skip qualification and canonicalization for faster processing:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run only stages 1 and 2 (splitting + extraction)
config = PipelineConfig(enabled_stages={1, 2})
pipeline = ExtractionPipeline(config)

ctx = pipeline.process("Tim Cook is CEO of Apple Inc.")

# Access Stage 2 output (PipelineStatement)
for stmt in ctx.statements:
    print(f"{stmt.subject.text} ({stmt.subject.type.value})")
    print(f"  --[{stmt.predicate}]-->")
    print(f"  {stmt.object.text} ({stmt.object.type.value})")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Using Specific Plugins

Enable only internal plugins (no external API calls):

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Disable external API plugins
config = PipelineConfig(
    disabled_plugins={
        "gleif_qualifier",
        "companies_house_qualifier",
        "sec_edgar_qualifier",
    }
)

pipeline = ExtractionPipeline(config)
ctx = pipeline.process("OpenAI CEO Sam Altman announced GPT-5.")

# Will use person_qualifier (local LLM) but skip external lookups
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

Custom Predicates File

Use a custom predicates JSON file instead of the 324 default predicates:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Use custom predicates file
config = PipelineConfig(
    extractor_options={
        "predicates_file": "/path/to/my_predicates.json"
    }
)

pipeline = ExtractionPipeline(config)
ctx = pipeline.process("John works for Apple Inc.")

# All matching relations are returned
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Custom predicates file format:

JSON
{
  "employment": {
    "works_for": {
      "description": "Employment relationship where person works for organization",
      "threshold": 0.75
    },
    "manages": {
      "description": "Management relationship where person manages entity",
      "threshold": 0.7
    }
  },
  "ownership": {
    "owns": {
      "description": "Ownership relationship",
      "threshold": 0.7
    },
    "acquired": {
      "description": "Acquisition of one entity by another",
      "threshold": 0.75
    }
  }
}

Each category should have fewer than 25 predicates to stay within GLiNER2's training limit for optimal performance.
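A quick lint for this limit can flag oversized categories before you ship a predicates file. A minimal sketch; `oversized_categories` is a hypothetical helper operating on the parsed JSON (category -> predicate dict):

```python
def oversized_categories(predicates: dict, limit: int = 25) -> list[str]:
    """Return category names whose predicate count reaches the limit.

    `predicates` is the parsed custom-predicates JSON, i.e. a mapping of
    category name -> {predicate name -> spec}.
    """
    return [cat for cat, preds in predicates.items() if len(preds) >= limit]
```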


Accessing Stage Outputs

Access results from each pipeline stage:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced Azure growth.")

# Stage 1: Raw triples
print("=== Stage 1: Raw Triples ===")
for triple in ctx.raw_triples:
    print(f"  {triple.subject_text} -> {triple.predicate_text} -> {triple.object_text}")

# Stage 2: Statements with types
print("\n=== Stage 2: Statements ===")
for stmt in ctx.statements:
    print(f"  {stmt.subject.text} ({stmt.subject.type.value}) -> {stmt.predicate}")

# Stage 3: Qualified entities
print("\n=== Stage 3: Qualified Entities ===")
for ref, qualified in ctx.qualified_entities.items():
    quals = qualified.qualifiers
    print(f"  {qualified.original_text}")
    if quals.role:
        print(f"    Role: {quals.role}")
    if quals.org:
        print(f"    Org: {quals.org}")
    for id_type, id_value in quals.identifiers.items():
        print(f"    {id_type}: {id_value}")

# Stage 4: Canonical entities
print("\n=== Stage 4: Canonical Entities ===")
for ref, canonical in ctx.canonical_entities.items():
    print(f"  {canonical.fqn}")
    if canonical.canonical_match:
        print(f"    Method: {canonical.canonical_match.match_method}")
        print(f"    Confidence: {canonical.canonical_match.match_confidence:.2f}")

# Stage 5: Labeled statements
print("\n=== Stage 5: Labeled Statements ===")
for stmt in ctx.labeled_statements:
    print(f"  {stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
    for label in stmt.labels:
        print(f"    {label.label_type}: {label.label_value}")

# Stage 6: Taxonomy results (multiple labels per statement)
print("\n=== Stage 6: Taxonomy Results ===")
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"  Statement: {source_text[:40]}...")
    for result in results:
        print(f"    {result.full_label} (confidence: {result.confidence:.2f})")

# Timings
print("\n=== Stage Timings ===")
for stage, duration in ctx.stage_timings.items():
    print(f"  {stage}: {duration:.3f}s")

Batch Pipeline Processing

Process multiple documents efficiently:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

# Use minimal stages for speed
config = PipelineConfig.minimal()  # Stages 1-2 only
pipeline = ExtractionPipeline(config)

documents = [
    "Apple announced a new MacBook Pro.",
    "Google acquired Fitbit for $2.1 billion.",
    "Tesla CEO Elon Musk unveiled the Cybertruck.",
]

all_statements = []

for doc in documents:
    ctx = pipeline.process(doc)
    for stmt in ctx.statements:
        all_statements.append({
            "subject": stmt.subject.text,
            "subject_type": stmt.subject.type.value,
            "predicate": stmt.predicate,
            "object": stmt.object.text,
            "object_type": stmt.object.type.value,
            "confidence": stmt.confidence_score,
            "source": doc,
        })

print(f"Extracted {len(all_statements)} statements from {len(documents)} documents")

Taxonomy Classification

Stage 6

Classify statements against large taxonomies. Multiple labels may match a single statement above the confidence threshold:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()

text = """
Apple announced a commitment to carbon neutrality by 2030.
The company also reported reducing packaging waste by 75%.
"""

ctx = pipeline.process(text)

# Access taxonomy classifications (multiple labels per statement)
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"Statement: {source_text[:50]}...")
    print(f"  Taxonomy: {taxonomy_name}")
    print(f"  Labels:")
    for result in results:
        print(f"    - {result.full_label} (confidence: {result.confidence:.2f})")
    print()

Output:

Text
Statement: Apple announced a commitment to carbon neutrality...
  Taxonomy: esg_topics
  Labels:
    - environment:carbon_emissions (confidence: 0.87)
    - environment_benefit:emissions_reduction (confidence: 0.72)
    - governance:sustainability_commitments (confidence: 0.45)

Statement: The company also reported reducing packaging waste...
  Taxonomy: esg_topics
  Labels:
    - environment:waste_management (confidence: 0.92)
    - environment_benefit:waste_reduction (confidence: 0.85)

Using the Persistent Server

NEW in v0.9.7

Start a persistent server to avoid reloading models on every invocation:

Bash
# Terminal 1: Start the server
corp-extractor serve

# Terminal 2: Use --server to delegate processing
corp-extractor --server pipeline "Amazon CEO Andy Jassy announced plans."
corp-extractor --server split -f article.txt --json
corp-extractor --server document process article.txt

# Or set the environment variable
export CORP_EXTRACTOR_SERVER=http://localhost:8111
corp-extractor pipeline "text"  # Automatically uses the server

Python API Server Delegation

NEW in v0.9.8

Pass server_url to delegate processing to a running server from Python code. No local models are loaded — full Pydantic objects are reconstructed from JSON responses.

Python
from statement_extractor import extract_statements
from statement_extractor.pipeline import ExtractionPipeline

# Simple extraction via server
result = extract_statements("Apple announced iPhone.", server_url="http://localhost:8111")
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")

# Full pipeline via server
pipeline = ExtractionPipeline(server_url="http://localhost:8111")
ctx = pipeline.process("Apple CEO Tim Cook announced a new iPhone.")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

You can also call the server directly with httpx or curl:

Python
import httpx

# Call the pipeline endpoint
resp = httpx.post("http://localhost:8111/pipeline", json={
    "text": "Apple CEO Tim Cook announced a new iPhone.",
    "config": {"enabled_stages": "1-3"},
}, timeout=300)

# Reconstruct full Pydantic model from response
from statement_extractor.pipeline.context import PipelineContext
ctx = PipelineContext.model_validate(resp.json())
for stmt in ctx.labeled_statements:
    print(f"  {stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

Pipeline with Error Handling

Handle errors and warnings gracefully:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

config = PipelineConfig(fail_fast=False)  # Continue on errors
pipeline = ExtractionPipeline(config)

ctx = pipeline.process("Some text that might cause issues...")

# Check for errors
if ctx.has_errors:
    print("Errors occurred:")
    for error in ctx.processing_errors:
        print(f"  - {error}")

# Check for warnings
if ctx.processing_warnings:
    print("Warnings:")
    for warning in ctx.processing_warnings:
        print(f"  - {warning}")

# Process results that succeeded
print(f"\nSuccessfully extracted {ctx.statement_count} statements")

Deployment

Local Inference

Hardware Requirements:

Resource    Minimum      Notes
CPU-only    ~4GB RAM     ~30s per extraction
GPU         ~2GB VRAM    ~2s per extraction
Disk        ~1.5GB       Model download size

Setup steps:

Bash
# Install the library
pip install corp-extractor[embeddings]

# For GPU support, install PyTorch with CUDA first
pip install torch --index-url https://download.pytorch.org/whl/cu121

Running locally:

Python
from statement_extractor import StatementExtractor

# Auto-detect GPU or fall back to CPU
extractor = StatementExtractor()

# Or explicitly set device
extractor = StatementExtractor(device="cuda")  # GPU
extractor = StatementExtractor(device="cpu")   # CPU

The model uses bfloat16 precision on GPU for faster inference and lower memory usage, and float32 on CPU.
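The selection logic can be sketched as a small pure function. This is a simplified approximation, not the library's actual code; the has_cuda flag stands in for a runtime check such as torch.cuda.is_available().

Python
```python
def select_device_and_dtype(has_cuda: bool) -> tuple[str, str]:
    """Pick a device and matching precision: bfloat16 on GPU for faster
    inference and lower memory usage, float32 on CPU for compatibility."""
    return ("cuda", "bfloat16") if has_cuda else ("cpu", "float32")

print(select_device_and_dtype(has_cuda=True))   # ('cuda', 'bfloat16')
print(select_device_and_dtype(has_cuda=False))  # ('cpu', 'float32')
```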

Persistent Server

NEW in v0.9.7

For repeated extractions, use corp-extractor serve to keep all models warm in memory. This eliminates the ~30s startup cost for each invocation.

Bash
# Start the persistent server
corp-extractor serve

# In another terminal, use --server to delegate to it
corp-extractor --server pipeline "Amazon CEO Andy Jassy announced..."
corp-extractor --server split -f article.txt --json
corp-extractor --server document process article.txt

The server runs on http://localhost:8111 by default and exposes three POST endpoints (/pipeline, /split, /document) plus a health check at GET /. All models (T5-Gemma, GLiNER2, embedding models, USearch indexes) are loaded once at startup and reused across requests.

You can also set the CORP_EXTRACTOR_SERVER environment variable so all CLI commands automatically delegate to the server:

Bash
export CORP_EXTRACTOR_SERVER=http://localhost:8111
corp-extractor pipeline "Your text"  # Automatically uses the server

Python API Server Delegation

NEW in v0.9.8

All Python API functions accept a server_url parameter to delegate processing to a running server. No local models are loaded — requests go over HTTP and full Pydantic objects are reconstructed from the response.

Python
from statement_extractor import extract_statements
from statement_extractor.pipeline import ExtractionPipeline
from statement_extractor.document import DocumentPipeline

# Delegate extraction to server
result = extract_statements("text", server_url="http://localhost:8111")

# Pipeline and document pipeline
pipeline = ExtractionPipeline(server_url="http://localhost:8111")
ctx = pipeline.process("Amazon CEO Andy Jassy announced...")

doc_pipeline = DocumentPipeline(server_url="http://localhost:8111")

Note: server_url is for the Python API only. CLI delegation uses --server / --server-url flags or the CORP_EXTRACTOR_SERVER env var.

Cerebrium Serverless (Production)

Why Cerebrium:

  • Pay-per-use GPU containers; scales to zero when idle.
  • Shared /persistent-storage volume with the corp-entity-db Cerebrium app, so the entity database, USearch indexes, and embedding model weights are downloaded once and reused across both apps.
  • One synchronous request/response — no polling. The frontend API route uses maxDuration=300 and retries once on cold-boot timeout.

Setup steps:

  1. Install the Cerebrium CLI and confirm you are in the same project as the corp-entity-db app (so /persistent-storage is shared):

    Bash
    pip install cerebrium
    cerebrium projects current
  2. Set the HF_TOKEN secret (gated model downloads):

    Bash
    cerebrium secrets set HF_TOKEN <your-token>
  3. Deploy:

    Bash
    cd cerebrium
    cerebrium deploy

    Or push to main; .github/workflows/cerebrium-deploy.yml auto-deploys on changes to cerebrium/**.

  4. Call the API (auth: service-account token or per-app inference key):

    Bash
    curl -X POST \
      -H "Authorization: Bearer $CEREBRIUM_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"text": "<page>Your text here</page>"}' \
      https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/statement-extractor/extract

    Cerebrium returns a {run_id, result, run_time_ms} envelope; the handler payload is in .result.
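Unwrapping the envelope can be done with a small helper. This is a sketch based on the response shape described above; the run_id, result, and run_time_ms field names come from the Cerebrium envelope, and the example payload is illustrative.

Python
```python
def unwrap_cerebrium(envelope: dict) -> dict:
    """Return the handler payload from a Cerebrium response envelope,
    raising if the expected 'result' key is missing."""
    if "result" not in envelope:
        raise ValueError(f"unexpected envelope keys: {sorted(envelope)}")
    return envelope["result"]

# Envelope shaped like the description above (payload is illustrative)
resp = {"run_id": "abc123", "run_time_ms": 2150, "result": {"statements": []}}
print(unwrap_cerebrium(resp))  # {'statements': []}
```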

Hardware: currently ADA_L40 (48 GB, hobby-plan max) — fits T5-Gemma2 in bf16 with comfortable headroom. The Gemma-3-12B GGUF qualifier runs CPU-only via llama-cpp-python.

See cerebrium/README.md for cold-start expectations, GPU alternatives, and troubleshooting.

RunPod Serverless (Legacy)

The original deployment used RunPod serverless GPU containers via runpod/Dockerfile and an async submit-and-poll API surface. It was superseded by Cerebrium for the production demo so that storage can be shared with the corp-entity-db app. The container build is retained at runpod/ for reference; it is no longer the active path.