corp-extractor v0.7.0

Statement Extractor Documentation

Extract structured subject-predicate-object statements from unstructured text using T5-Gemma 2 and GLiNER2 models with document processing, entity resolution, and taxonomy classification.

Getting Started

Installation

Bash
pip install corp-extractor

The GLiNER2 model (205M params) is downloaded automatically on first use.

GPU support: Install PyTorch with CUDA before installing corp-extractor. The library auto-detects GPU availability at runtime.

Bash
# Example for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor

Apple Silicon (M1/M2/M3): MPS acceleration is automatically detected. Just install normally:

Bash
pip install corp-extractor

Quick Start

Extract structured statements from text in 5 lines:

Python
from statement_extractor import extract_statements

text = "Apple Inc. acquired Beats Electronics for $3 billion in May 2014."
statements = extract_statements(text)

for stmt in statements:
    print(f"{stmt.subject.text} ({stmt.subject.type}) -> {stmt.predicate} -> {stmt.object.text}")

Output:

Text
Apple Inc. (ORG) -> acquired -> Beats Electronics
Apple Inc. (ORG) -> paid -> $3 billion
Beats Electronics (ORG) -> acquisition price -> $3 billion

Each statement includes confidence scores and extraction method:

Python
for stmt in statements:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  method: {stmt.extraction_method}")  # hybrid, gliner, or model
    print(f"  confidence: {stmt.confidence_score:.2f}")

v0.5.0 features: Plugin-based pipeline architecture with entity qualification, labeling, and taxonomy classification, plus GLiNER2 entity recognition and entity-based scoring.

v0.6.0 features: Entity embedding database with ~100K+ SEC filers, ~3M GLEIF records, ~5M UK organizations for fast entity qualification.

v0.7.0 features: Document processing for files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.

v0.8.0 features: Merged qualification and canonicalization into single stage. EntityType classification for organizations (business, nonprofit, government, etc.).

v0.9.0 features: Person database with Wikidata import for notable people (executives, politicians, athletes, artists). PersonQualifier for canonical person identification with role/org context.

v0.9.1 features: Wikidata dump importer (import-wikidata-dump) for large imports without SPARQL timeouts. Uses aria2c for fast parallel downloads. Extracts people via occupation (P106) and position dates (P580/P582).

v0.9.2 features: Organization canonicalization links equivalent records across sources (GLEIF, SEC, Companies House, Wikidata). People canonicalization with priority-based deduplication. Expanded PersonType classification (executive, politician, government, military, legal, etc.).

v0.9.3 features: SEC Form 4 officers import (import-sec-officers) and Companies House officers import (import-ch-officers). People now sourced from Wikidata, SEC Edgar, and Companies House with cross-source canonicalization.

v0.9.4 features: Database v2 schema with normalized INTEGER foreign keys and enum lookup tables. Scalar (int8) embeddings for 75% storage reduction with ~92% recall. New locations import for countries/states/cities with hierarchy. Migration commands: db migrate-v2, db backfill-scalar. New search commands: db search-roles, db search-locations.
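The scalar (int8) embeddings added in v0.9.4 trade a little recall for storage: each float32 dimension becomes a single byte, which is where the 75% reduction comes from. A minimal sketch of the technique (an illustration, not the library's actual implementation):

```python
def quantize_int8(vec):
    """Scalar quantization: map floats in [-m, m] onto integers in [-127, 127]."""
    m = max(abs(x) for x in vec) or 1.0
    scale = m / 127.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(qvec, scale):
    """Approximate reconstruction of the original vector."""
    return [q * scale for q in qvec]

q, scale = quantize_int8([0.5, -1.0, 0.25])
print(q)  # [64, -127, 32] -- one byte per dimension instead of four
```

Search then runs over the int8 vectors; the small rounding error is why recall drops to ~92% rather than 100%.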

Pipeline Quick Start (v0.5.0)

For full entity resolution with qualification, canonicalization, labeling, and taxonomy classification:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")

# Access fully qualified names (e.g., "Andy Jassy (CEO, Amazon)")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")

    # Access labels (sentiment, etc.)
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")

CLI usage:

Bash
# Full pipeline
corp-extractor pipeline "Amazon CEO Andy Jassy announced..."

# Run specific stages only
corp-extractor pipeline -f article.txt --stages 1-3

# Process documents and URLs (v0.7.0)
corp-extractor document process article.txt
corp-extractor document process https://example.com/article
corp-extractor document process report.pdf --use-ocr

Using Predicate Taxonomies

Normalize extracted predicates to canonical forms using embedding similarity:

Python
from statement_extractor import extract_statements, PredicateTaxonomy, ExtractionOptions

# Define your domain's canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in",
    "invested_in", "partnered_with", "announced"
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)

text = "Google bought YouTube for $1.65 billion in 2006."
result = extract_statements(text, options)

for stmt in result:
    print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
    # Output: bought -> acquired

This maps synonyms like "bought", "purchased", "acquired" to a single canonical form, making downstream analysis easier.
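The mechanism can be sketched in pure Python: embed each predicate, then pick the canonical predicate with the highest cosine similarity. The toy 3-d vectors below stand in for real sentence embeddings (the library itself uses a sentence-transformers model).

```python
from math import sqrt

# Toy vectors standing in for real sentence embeddings (illustration only).
TOY_EMBEDDINGS = {
    "bought":   [0.95, 0.05, 0.00],
    "acquired": [0.90, 0.10, 0.00],
    "founded":  [0.05, 0.90, 0.05],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def canonicalize(predicate, canonical_predicates):
    """Pick the canonical predicate whose embedding is closest to the raw one."""
    vec = TOY_EMBEDDINGS[predicate]
    return max(canonical_predicates, key=lambda p: cosine(vec, TOY_EMBEDDINGS[p]))

print(canonicalize("bought", ["acquired", "founded"]))  # acquired
```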

Requirements

Dependency | Version | Notes
Python | 3.10+ | Required
PyTorch | 2.0+ | Required
transformers | 5.0+ | Required for T5-Gemma 2 support
Pydantic | 2.0+ | Required
sentence-transformers | 2.2+ | Required, for embedding features
GLiNER2 | latest | Required, for entity recognition and relation extraction (model auto-downloads)

Hardware requirements:

  • NVIDIA GPU: RTX 4090+ recommended for production. Uses bfloat16 precision for efficiency.
  • Apple Silicon: M1/M2/M3 with 16GB+ RAM. MPS acceleration auto-detected.
  • CPU: Functional but slower. Use for development or low-volume processing.
  • Disk: ~100GB for all models and entity database (10M+ organizations, 40M+ people).

The library runs entirely locally with no external API dependencies. Models use bfloat16 on CUDA and float32 on MPS/CPU.
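The device/precision selection described above can be mirrored as a small decision function (a sketch of the documented behavior; `pick_device_and_dtype` is a hypothetical name, and in practice the flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`):

```python
def pick_device_and_dtype(cuda_available: bool, mps_available: bool) -> tuple:
    """Auto-detection order: CUDA first, then MPS, then CPU.
    bfloat16 on CUDA; float32 on MPS and CPU."""
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"

print(pick_device_and_dtype(False, True))  # ('mps', 'float32')
```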

Command Line Interface

The corp-extractor CLI provides commands for extraction, document processing, and database management.

Commands Overview

Command | Description | Use Case
split | Simple extraction (Stage 1 only) | Fast extraction, basic triples
pipeline | Full 5-stage pipeline | Entity resolution, labeling, taxonomy
document | Document processing | Files, URLs, PDFs with chunking and deduplication
db | Database management | Import, search, upload/download entity database
plugins | Plugin management | List and inspect available plugins

Installation

For best results, install globally:

Bash
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"

# Using pipx
pipx install "corp-extractor[embeddings]"

# Using pip
pip install "corp-extractor[embeddings]"

Quick Run with uvx

Run directly without installing using uv:

Bash
uvx corp-extractor split "Apple announced a new iPhone."

Note: First run downloads the model (~1.5GB) which may take a few minutes.


Split Command

The split command extracts sub-statements using the T5-Gemma model. It's fast and simple—use pipeline for full entity resolution.

Bash
# Extract from text argument
corp-extractor split "Apple Inc. announced the iPhone 15."

# Extract from file
corp-extractor split -f article.txt

# Pipe from stdin
cat article.txt | corp-extractor split -

# Output as JSON
corp-extractor split "Tim Cook is CEO of Apple." --json

# Output as XML
corp-extractor split "Tim Cook is CEO of Apple." --xml

# Verbose output with confidence scores
corp-extractor split -f article.txt --verbose

# Use more beams for better quality
corp-extractor split -f article.txt --beams 8

Split Options

Option | Description | Default
-f, --file PATH | Read input from file |
-o, --output | Output format: table, json, xml | table
--json / --xml | Output format shortcuts |
-b, --beams | Number of beams for diverse beam search | 4
--diversity | Diversity penalty for beam search | 1.0
--no-gliner | Disable GLiNER2 extraction |
--predicates | Comma-separated predicates for relation extraction |
--predicates-file | Path to custom predicates JSON file |
--device | Device: auto, cuda, mps, cpu | auto
-v, --verbose | Show confidence scores and metadata |

Pipeline Command

NEW in v0.5.0

The pipeline command runs the full 5-stage extraction pipeline for comprehensive entity resolution and taxonomy classification.

Bash
# Run all 5 stages
corp-extractor pipeline "Amazon CEO Andy Jassy announced plans to hire workers."

# Run from file
corp-extractor pipeline -f article.txt

# Run specific stages
corp-extractor pipeline "..." --stages 1-3
corp-extractor pipeline "..." --stages 1,2,5

# Skip specific stages
corp-extractor pipeline "..." --skip-stages 4,5

# Enable specific plugins only
corp-extractor pipeline "..." --plugins gleif,companies_house

# Disable specific plugins
corp-extractor pipeline "..." --disable-plugins sec_edgar

# Output formats
corp-extractor pipeline "..." -o json
corp-extractor pipeline "..." -o yaml
corp-extractor pipeline "..." -o triples

Pipeline Stages

Stage | Name | Description
1 | Splitting | Text → Raw triples (T5-Gemma)
2 | Extraction | Raw triples → Typed statements (GLiNER2)
3 | Entity Qualification | Add identifiers (LEI, CIK, etc.) and canonical names via embedding DB
4 | Labeling | Apply sentiment, relation type, confidence
5 | Taxonomy | Classify against large taxonomies (MNLI/embeddings)

Pipeline Options

Option | Description | Example
--stages | Stages to run | 1-3 or 1,2,5
--skip-stages | Stages to skip | 4,5
--plugins | Enable only these plugins | gleif,person
--disable-plugins | Disable these plugins | sec_edgar
--predicates-file | Custom predicates JSON file for GLiNER2 | custom.json
-o, --output | Output format | table, json, yaml, triples

Plugins Command

NEW in v0.5.0

The plugins command lists and inspects available pipeline plugins.

Bash
# List all plugins
corp-extractor plugins list

# List plugins for a specific stage
corp-extractor plugins list --stage 3

# Get details about a plugin
corp-extractor plugins info gleif_qualifier
corp-extractor plugins info person_qualifier

Example output:

Text
Stage 1: Splitting
----------------------------------------
  t5_gemma_splitter  [priority: 100]

Stage 2: Extraction
----------------------------------------
  gliner2_extractor  [priority: 100]

Stage 3: Entity Qualification
----------------------------------------
  person_qualifier (PERSON)  [priority: 100]
  embedding_company_qualifier (ORG)  [priority: 5]

Stage 4: Labeling
----------------------------------------
  sentiment_labeler  [priority: 100]
  confidence_labeler  [priority: 100]
  relation_type_labeler  [priority: 100]

Stage 5: Taxonomy
----------------------------------------
  embedding_taxonomy_classifier  [priority: 100]

Output Formats

Table output (default):

Text
Extracted 2 statement(s):

--------------------------------------------------------------------------------
1. Andy Jassy (CEO, Amazon)
   --[announced]-->
   plans to hire workers
--------------------------------------------------------------------------------

JSON output:

JSON
{
  "statement_count": 2,
  "labeled_statements": [
    {
      "subject": {"text": "Andy Jassy", "type": "PERSON", "fqn": "Andy Jassy (CEO, Amazon)"},
      "predicate": "announced",
      "object": {"text": "plans to hire workers", "type": "EVENT"},
      "labels": {"sentiment": "positive"}
    }
  ]
}

Triples output:

Text
Andy Jassy (CEO, Amazon)	announced	plans to hire workers
Amazon	has CEO	Andy Jassy (CEO, Amazon)

Shell Integration

Processing multiple files:

Bash
# Process all .txt files
for f in *.txt; do
  echo "=== $f ==="
  corp-extractor pipeline -f "$f" -o json > "${f%.txt}.json"
done

Combining with jq:

Bash
# Extract just predicates
corp-extractor split "Your text" --json | jq '.statements[].predicate'

# Filter high-confidence statements
corp-extractor split -f article.txt --json | jq '.statements[] | select(.confidence_score > 0.8)'

# Get FQNs from pipeline
corp-extractor pipeline "Your text" -o json | jq '.labeled_statements[].subject.fqn'

Document Command

NEW in v0.7.0

The document command processes files, URLs, and PDFs with automatic chunking and deduplication.

Bash
# Process local files
corp-extractor document process article.txt
corp-extractor document process report.txt --title "Annual Report" --year 2024

# Process URLs (web pages and PDFs)
corp-extractor document process https://example.com/article
corp-extractor document process https://example.com/report.pdf --use-ocr

# Configure chunking
corp-extractor document process article.txt --max-tokens 500 --overlap 50

# Preview chunking without extraction
corp-extractor document chunk article.txt --max-tokens 500

# Output formats
corp-extractor document process article.txt -o json
corp-extractor document process article.txt -o triples

Document Options

Option | Description | Default
--title | Document title for citations | Filename
--max-tokens | Target tokens per chunk | 1000
--overlap | Token overlap between chunks | 100
--use-ocr | Force OCR for PDF parsing |
--no-summary | Skip document summarization |
--no-dedup | Skip cross-chunk deduplication |
--stages | Pipeline stages to run | 1-5

Database Commands

UPDATED in v0.9.4

The db command group manages the entity embedding database used for organization, person, role, and location qualification.

Bash
# Show database status
corp-extractor db status

# Search for an organization
corp-extractor db search "Microsoft"
corp-extractor db search "Barclays" --source companies_house

# Search for a person (v0.9.0)
corp-extractor db search-people "Tim Cook"
corp-extractor db search-people "Elon Musk" --top-k 5

# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"
corp-extractor db search-roles "Chief Financial Officer"

# Search for locations (v0.9.4)
corp-extractor db search-locations "California"
corp-extractor db search-locations "Germany" --type country

# Import organizations from data sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download           # Bulk data (~100K+ filers)
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 50000   # SPARQL-based

# Import notable people (v0.9.0)
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --all --limit 10000  # All person types

# Import from Wikidata dump (v0.9.1) - avoids SPARQL timeouts
corp-extractor db import-wikidata-dump --download --limit 50000
corp-extractor db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs

# Download/upload from HuggingFace Hub
corp-extractor db download                        # Lite version (default)
corp-extractor db download --full                 # Full version with metadata
corp-extractor db upload                          # Upload with all variants

# Migrate from old schema (companies.db → entities.db)
corp-extractor db migrate companies.db --rename-file

# Migrate to v2 normalized schema (v0.9.4)
corp-extractor db migrate-v2 entities.db entities-v2.db
corp-extractor db migrate-v2 entities.db entities-v2.db --resume  # Resume interrupted

# Generate int8 scalar embeddings (v0.9.4) - 75% smaller
corp-extractor db backfill-scalar
corp-extractor db backfill-scalar --skip-generate  # Only quantize existing

# Local database management
corp-extractor db create-lite entities.db         # Create lite version
corp-extractor db compress entities.db            # Compress with gzip

Organization Data Sources

Source | Command | Records | Identifier
GLEIF | import-gleif --download | ~3.2M | LEI
SEC Edgar | import-sec --download | ~100K+ | CIK
Companies House | import-companies-house --download | ~5M | Company Number
Wikidata (SPARQL) | import-wikidata | Variable | QID
Wikidata (Dump) | import-wikidata-dump --download | All with enwiki | QID

Person Data Sources v0.9.0

Type | Command | Description
Executives | import-people --type executive | CEOs, CFOs, board members
Politicians | import-people --type politician | Elected officials, diplomats
Athletes | import-people --type athlete | Sports figures, coaches
Artists | import-people --type artist | Actors, musicians, directors
All Types | import-people --all | Run all person type queries

Person Import Options

Option | Description
--skip-existing | Skip existing records instead of updating them
--enrich-dates | Query individual records for start/end dates (slower)

Wikidata Dump Import v0.9.1

For large imports that avoid SPARQL timeouts, use the Wikidata JSON dump:

Bash
# Download and import (~100GB dump file)
corp-extractor db import-wikidata-dump --download --limit 50000

# Import only people
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000

# Import only organizations
corp-extractor db import-wikidata-dump --download --orgs --no-people --limit 100000

# Import only locations (v0.9.4)
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs

# Use existing dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2

Fast download with aria2c: Install aria2c for 10-20x faster downloads:

Bash
brew install aria2   # macOS
apt install aria2    # Ubuntu/Debian

Option | Description
--download | Download the Wikidata dump (~100GB)
--dump PATH | Use existing dump file (.bz2 or .gz)
--people/--no-people | Import people (default: yes)
--orgs/--no-orgs | Import organizations (default: yes)
--locations/--no-locations | Import locations (default: no) v0.9.4
--no-aria2 | Don't use aria2c even if available

Advantages over SPARQL:

  • No timeouts (processes locally)
  • Complete coverage (all notable people/orgs with English Wikipedia)
  • Captures people via occupation (P106) even if position type is generic
  • Extracts role dates from position qualifiers (P580/P582)
  • Imports locations with hierarchical parent relationships (v0.9.4)

Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2
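The occupation (P106) filtering over the dump can be sketched with the standard library. The dump is one large JSON array with one entity per line, which allows streaming line by line; `iter_people` is a hypothetical helper for illustration, and the tiny fake dump below only demonstrates the parsing shape.

```python
import bz2, json, os, tempfile

def iter_people(dump_path):
    """Stream a Wikidata JSON dump and yield entities with an occupation (P106) claim."""
    with bz2.open(dump_path, "rt") as fh:
        for line in fh:
            line = line.rstrip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the surrounding "[" and "]" lines
            entity = json.loads(line)
            if "P106" in entity.get("claims", {}):
                yield entity

# Build a tiny fake dump to demonstrate; the real dump is ~100GB compressed.
path = os.path.join(tempfile.mkdtemp(), "dump.json.bz2")
with bz2.open(path, "wt") as fh:
    fh.write("[\n")
    fh.write(json.dumps({"id": "Q1", "claims": {"P106": []}}) + ",\n")
    fh.write(json.dumps({"id": "Q2", "claims": {"P31": []}}) + "\n")
    fh.write("]\n")

ids = [e["id"] for e in iter_people(path)]
print(ids)  # ['Q1']
```

Because everything happens locally, there is no query engine to time out, which is the key advantage over SPARQL listed above.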

Note: Use -v (verbose) to see detailed logs of skipped records during import:

Bash
corp-extractor db import-people --type executive -v

People records include from_date and to_date for role tenure. The same person can have multiple records with different role/org combinations (unique on source_id + role + org).

Organizations discovered during people import (employers, affiliated orgs) are automatically inserted into the organizations table if they don't already exist. This creates foreign key links via known_for_org_id.
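A minimal sketch of that insert-if-missing linkage, using the `organizations` table and `known_for_org_id` column named above (every other schema detail here is a simplified assumption, not the library's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy schema: the real database has many more columns.
con.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
con.execute("CREATE TABLE people (name TEXT, known_for_org_id INTEGER REFERENCES organizations(id))")

def org_id(name):
    """Insert the organization if it is not already present, then return its id."""
    con.execute("INSERT OR IGNORE INTO organizations(name) VALUES (?)", (name,))
    return con.execute("SELECT id FROM organizations WHERE name = ?", (name,)).fetchone()[0]

# Importing a person auto-creates the employer record and links it by id.
con.execute("INSERT INTO people VALUES (?, ?)", ("Tim Cook", org_id("Apple Inc")))
print(con.execute("SELECT name FROM organizations").fetchall())  # [('Apple Inc',)]
```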

Database Variants

File | Description | Use Case
entities-lite.db | Core fields + embeddings only | Default download, fast searches
entities.db | Full database with source metadata | When you need complete record data
*.db.gz | Gzip compressed versions | Faster downloads, auto-decompressed

Database Options

Option | Description | Default
--db PATH | Database file path | ~/.cache/corp-extractor/entities.db
--limit N | Limit number of records |
--download | Download source data automatically |
--full | Download full version instead of lite |
--no-lite | Skip creating lite version on upload |
--no-compress | Skip creating compressed versions |

See COMPANY_DB.md for complete build and publish instructions.

Core Concepts

Corp-extractor is designed to analyze complex text and extract relationship information about people and organizations. It runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies, using multiple fine-tuned small models to transform unstructured text into structured knowledge.

Statement Extraction

Statement extraction is the process of converting unstructured natural language text into structured subject-predicate-object triples. Each triple represents a discrete fact or relationship extracted from the source text.

For example, given the text:

"Apple announced a new iPhone at their Cupertino headquarters."

The extractor produces triples like:

Subject | Predicate | Object
Apple (ORG) | announced | iPhone (PRODUCT)
Apple (ORG) | has headquarters in | Cupertino (GPE)

The T5-Gemma 2 Model

Corp-extractor uses a fine-tuned T5-Gemma 2 model with 540 million parameters. This encoder-decoder architecture excels at sequence-to-sequence tasks, making it well-suited for transforming text into structured XML output.

The model processes input text wrapped in <page> tags and generates XML containing <stmt> elements with subject, predicate, object, and source text spans.
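The output shape can be illustrated with stdlib XML parsing. The `<stmt>` element comes from the text above; the child tag names inside it are assumptions for illustration, not the model's documented format.

```python
import xml.etree.ElementTree as ET

# Hypothetical model output; child tag names are illustrative assumptions.
raw = """<statements>
  <stmt><subject>Apple Inc.</subject><predicate>acquired</predicate><object>Beats</object></stmt>
</statements>"""

root = ET.fromstring(raw)
triples = [
    (s.findtext("subject"), s.findtext("predicate"), s.findtext("object"))
    for s in root.iter("stmt")
]
print(triples)  # [('Apple Inc.', 'acquired', 'Beats')]
```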

Entity Type Recognition

Each extracted subject and object is classified into one of 12 entity types (plus UNKNOWN):

Type | Description | Example
ORG | Organizations, companies | Apple, United Nations
PERSON | Named individuals | Tim Cook, Marie Curie
GPE | Geopolitical entities | France, New York City
LOC | Non-GPE locations | Mount Everest, Pacific Ocean
PRODUCT | Products, artifacts | iPhone, Model S
EVENT | Named events | World War II, Olympics
WORK_OF_ART | Creative works | Mona Lisa, Hamlet
LAW | Legal documents | GDPR, First Amendment
DATE | Temporal expressions | January 2024, last Tuesday
MONEY | Monetary values | $50 million, €100
PERCENT | Percentages | 15%, half
QUANTITY | Measurements | 500 kilometers, 3 tons
UNKNOWN | Unclassified entities |

Diverse Beam Search

Corp-extractor uses Diverse Beam Search (Vijayakumar et al., 2016) to generate multiple candidate extractions from the same input text.

Why Diverse Beam Search?

Standard beam search tends to produce similar outputs—slight variations of the same interpretation. Diverse Beam Search introduces a diversity penalty that encourages the model to explore fundamentally different extractions.

This is particularly valuable for statement extraction because:

  • A single sentence may contain multiple valid interpretations
  • Different phrasings can capture different aspects of the same fact
  • Merging diverse outputs produces more comprehensive coverage

How It Works

The model generates multiple beams in parallel, each representing a different extraction path. A diversity penalty is applied during generation to prevent beams from converging on identical outputs.

Default Parameters

Parameter | Default | Description
num_beams | 4 | Number of parallel beams to generate
diversity_penalty | 1.0 | Strength of diversity encouragement (higher = more diverse)

Python
from statement_extractor import extract_statements

# Use default beam search settings
result = extract_statements("Apple announced a new iPhone.")

# Customize beam search
result = extract_statements(
    "Apple announced a new iPhone.",
    num_beams=6,
    diversity_penalty=1.5
)

Quality Scoring

UPDATED in v0.4.0

Each extracted statement receives a confidence score between 0 and 1, measuring extraction quality through a weighted combination of semantic and entity-based signals.

Confidence Score

The score combines three components using GLiNER2 for entity recognition:

Component | Weight | Description
Semantic similarity | 50% | Cosine similarity between source text and reassembled triple
Subject entity score | 25% | How entity-like the subject is (via GLiNER2 NER)
Object entity score | 25% | How entity-like the object is (via GLiNER2 NER)

Higher scores indicate the triple is semantically grounded and contains well-formed entities. Lower scores may suggest hallucination or poorly extracted entities.
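The weighted combination from the table above reduces to a one-line formula (a sketch of the scheme, with hypothetical argument names):

```python
def confidence_score(semantic_sim, subject_score, object_score):
    """Weighted combination: 50% semantic similarity, 25% each entity score."""
    return 0.5 * semantic_sim + 0.25 * subject_score + 0.25 * object_score

# A semantically grounded triple with entity-like subject and object:
print(round(confidence_score(0.9, 0.8, 0.6), 2))  # 0.8
```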

Confidence Filtering

Use the min_confidence parameter to filter out low-quality extractions:

Python
from statement_extractor import extract_statements

# Only return statements with confidence >= 0.7
result = extract_statements(
    "Apple CEO Tim Cook announced the iPhone 15.",
    min_confidence=0.7
)

# Access individual scores
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Beam Merging vs Best Beam Selection

Corp-extractor supports two strategies for combining beam outputs:

Strategy | Description | Use Case
merge (default) | Combine unique statements from all beams, deduplicated by content | Maximum coverage
best | Return only statements from the highest-scoring beam | Higher precision

Python
# Merge all beams (default)
result = extract_statements(text, beam_strategy="merge")

# Use only the best beam
result = extract_statements(text, beam_strategy="best")

When using merge, statements are deduplicated based on normalized subject-predicate-object content, and the highest confidence score is retained for duplicates.
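The merge strategy amounts to keyed deduplication with a max over confidence. A plain-dict sketch (the library uses its own statement types, not these dicts):

```python
def merge_beams(beams):
    """Dedupe on normalized (subject, predicate, object); keep the
    highest-confidence copy of each duplicate."""
    best = {}
    for stmt in (s for beam in beams for s in beam):
        key = tuple(stmt[k].strip().lower() for k in ("subject", "predicate", "object"))
        if key not in best or stmt["confidence"] > best[key]["confidence"]:
            best[key] = stmt
    return list(best.values())

beams = [
    [{"subject": "Apple", "predicate": "acquired", "object": "Beats", "confidence": 0.9}],
    [{"subject": "apple", "predicate": "acquired", "object": "Beats", "confidence": 0.7}],
]
print(len(merge_beams(beams)))  # 1 -- case differences collapse to one statement
```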


GLiNER2 Integration

NEW in v0.4.0

Version 0.4.0 introduces GLiNER2 (205M parameters) for entity recognition and relation extraction, replacing spaCy.

Why GLiNER2?

GLiNER2 is a unified model that handles:

  • Named Entity Recognition - identifying entities with types
  • Relation Extraction - using 324 default predicates across 21 categories
  • Confidence Scoring - real confidence values via include_confidence=True
  • Entity Scoring - measuring how "entity-like" subjects and objects are

Default Predicates

GLiNER2 uses 324 predicates organized into 21 categories loaded from default_predicates.json. Categories include:

  • ownership_control - acquires, owns, has_subsidiary, etc.
  • employment_leadership - employs, is_ceo_of, manages, etc.
  • funding_investment - funds, invests_in, sponsors, etc.
  • supply_chain - supplies, manufactures, distributes_for, etc.
  • legal_regulatory - regulates, violates, complies_with, etc.

Each predicate includes a description for semantic matching and a confidence threshold.
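The documentation does not show the file format, but based on the description above a predicates file might be organized along these lines (a hypothetical shape for illustration only; the field and category names are assumptions, not the library's actual schema):

```json
{
  "ownership_control": {
    "acquires": {
      "description": "One organization buys or takes control of another",
      "threshold": 0.5
    }
  }
}
```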

All Matches Returned

GLiNER2 now returns all matching relations, not just the best one. This allows downstream filtering and selection based on your use case:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")

# All matching relations are returned, sorted by confidence
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Custom Predicates

You can provide custom predicates via a JSON file:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

config = PipelineConfig(
    extractor_options={"predicates_file": "/path/to/custom_predicates.json"}
)
pipeline = ExtractionPipeline(config)

Or via CLI:

Bash
corp-extractor pipeline "..." --predicates-file custom_predicates.json

Entity-Based Scoring

Confidence scores come directly from GLiNER2 with include_confidence=True:

Source | Description
Relation confidence | GLiNER2 confidence in the relation match
Entity confidence | GLiNER2 confidence in entity recognition

Pipeline Architecture

Updated in v0.8.0

Version 0.8.0 uses a 5-stage plugin-based pipeline for comprehensive entity resolution, statement enrichment, and taxonomy classification. Qualification and canonicalization have been merged into a single stage using the embedding database.

The 5 Stages

Stage | Name | Input | Output | Purpose
1 | Splitting | Text | RawTriple[] | Extract raw subject-predicate-object triples using T5-Gemma2
2 | Extraction | RawTriple[] | PipelineStatement[] | Refine entities with type recognition using GLiNER2
3 | Entity Qualification | Entities | CanonicalEntity[] | Add identifiers (LEI, CIK, etc.) and resolve canonical names via embedding database
4 | Labeling | Statements | LabeledStatement[] | Apply sentiment, relation type, confidence labels
5 | Taxonomy | Statements | TaxonomyResult[] | Classify against large taxonomies (ESG topics, etc.)

Stage 1: Splitting

The splitting stage transforms raw text into RawTriple objects using the T5-Gemma2 model. Each triple contains:

  • subject_text: The raw subject text
  • predicate_text: The raw predicate/relationship
  • object_text: The raw object text
  • source_sentence: The sentence this triple was extracted from
  • confidence: Extraction confidence score

Stage 2: Extraction

The extraction stage uses GLiNER2 to extract relations and assign entity types, producing PipelineStatement objects with:

  • subject: ExtractedEntity with text, type, span, and confidence
  • object: ExtractedEntity with text, type, span, and confidence
  • predicate: Predicate from GLiNER2's 324 default predicates
  • predicate_category: Category the predicate belongs to (e.g., "employment_leadership")
  • source_text: Source text for this statement
  • confidence_score: Real confidence from GLiNER2

Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. This allows downstream stages to filter, deduplicate, or select based on specific criteria. Relations are sorted by confidence (descending).

Stage 3: Entity Qualification

Entity qualification combines what were previously separate qualification and canonicalization stages. It adds context, external identifiers, and canonical names to entities using the embedding database:

  • PersonQualifier: Adds role, organization, and canonical ID for PERSON entities Enhanced in v0.9.0
    • Uses LLM (Gemma3) to extract role and organization from context
    • Searches person database for notable people (executives, politicians, athletes, etc.)
    • Resolves organization mentions against the organization database
    • Returns canonical Wikidata IDs for matched people
  • EmbeddingCompanyQualifier: Looks up company identifiers (LEI, CIK, UK company numbers) and canonical names using vector similarity search

The output is CanonicalEntity with:

  • entity_type: Classification (business, nonprofit, government, etc.)
  • canonical_match: Match details (id, name, method, confidence)
  • fqn: Fully Qualified Name, e.g., "Tim Cook (CEO, Apple Inc)"
  • External identifiers: lei, ch_number, sec_cik, ticker, etc.
  • resolved_role: Canonical role information from person database v0.9.0
  • resolved_org: Canonical organization information from org database v0.9.0

Note: The embedding-based company qualifier replaces the older API-based qualifiers (GLEIF, Companies House, SEC Edgar APIs) for faster, offline entity resolution.

Stage 4: Labeling

Labeling plugins annotate statements with additional metadata:

  • SentimentLabeler: Adds sentiment classification (positive/negative/neutral)
  • ConfidenceLabeler: Adds confidence scoring
  • RelationTypeLabeler: Classifies relation types

The output is LabeledStatement with:

  • Original statement
  • Canonicalized subject and object
  • List of StatementLabel objects

Stage 5: Taxonomy

Taxonomy classification plugins classify statements against large taxonomies with hundreds of possible values. Multiple labels may match a single statement above the confidence threshold.

  • MNLITaxonomyClassifier: Uses MNLI zero-shot classification for accurate taxonomy labeling
  • EmbeddingTaxonomyClassifier: Uses embedding similarity for faster classification

The output is a list of TaxonomyResult objects, each with:

  • taxonomy_name: Name of the taxonomy (e.g., "esg_topics")
  • category: Top-level category (e.g., "environment", "governance")
  • label: Specific label within the category
  • confidence: Classification confidence score

Both classifiers use hierarchical classification for efficiency: first identify the top-k categories, then return all labels above the threshold within those categories.
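The two-step scheme can be sketched as follows (an illustration of the approach with toy scores; `hierarchical_classify` is a hypothetical helper, not the library's API):

```python
def hierarchical_classify(label_scores, top_k=2, threshold=0.5):
    """Rank categories by their best label score, then return every label
    above the threshold within the top-k categories."""
    ranked = sorted(label_scores, key=lambda c: max(label_scores[c].values()), reverse=True)
    return [
        (cat, label, score)
        for cat in ranked[:top_k]
        for label, score in label_scores[cat].items()
        if score >= threshold
    ]

scores = {
    "environment": {"emissions": 0.9, "water_use": 0.6},
    "governance": {"board_structure": 0.7},
    "social": {"labor": 0.2},
}
print(hierarchical_classify(scores))
```

The efficiency win is that labels in discarded categories ("social" above) are never scored against the full statement.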

Plugin System

Each stage is implemented through plugins registered with PluginRegistry. Plugins can be:

  • Enabled/disabled per invocation
  • Prioritized for execution order
  • Entity-type specific (e.g., PersonQualifier only runs on PERSON entities)

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run with specific plugins disabled
config = PipelineConfig(
    disabled_plugins={"mnli_taxonomy_classifier"}  # Use embedding classifier instead
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process(text)

Document Processing

NEW in v0.7.0

Version 0.7.0 introduces document-level processing for handling files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.

Document Pipeline

The document pipeline:

  1. Loads content from files, URLs, or PDFs
  2. Chunks text into optimal-sized segments for the extraction model
  3. Processes each chunk through the 5-stage extraction pipeline
  4. Deduplicates statements across chunks
  5. Generates optional document summary
  6. Tracks citations back to source chunks

Chunking Strategy

Documents are split into chunks based on token count with configurable overlap:

Parameter         | Default | Description
target_tokens     | 1000    | Target tokens per chunk
overlap_tokens    | 100     | Token overlap between consecutive chunks
respect_sentences | true    | Avoid splitting mid-sentence
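A minimal sketch of this strategy, using whitespace-split words as a stand-in for the model tokenizer (the library's actual chunker may differ in detail):

```python
import re

def chunk_text(text, target_tokens=1000, overlap_tokens=100, respect_sentences=True):
    """Split text into overlapping chunks of roughly target_tokens words."""
    if respect_sentences:
        units = re.split(r"(?<=[.!?])\s+", text.strip())  # sentence units
    else:
        units = text.split()
    chunks, current, count = [], [], 0
    for unit in units:
        n = len(unit.split())
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing units forward until the overlap budget is met
            tail, tail_count = [], 0
            for u in reversed(current):
                tail_count += len(u.split())
                tail.insert(0, u)
                if tail_count >= overlap_tokens:
                    break
            current, count = tail, tail_count
        current.append(unit)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = " ".join(f"Sentence number {i} ends here." for i in range(40))
chunks = chunk_text(sentences, target_tokens=50, overlap_tokens=10)
```

With five-word sentences, each chunk holds ten sentences and shares its last two with the start of the next chunk.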

URL and PDF Support

The document pipeline can fetch and process content from URLs:

  • Web pages: HTML content is extracted using Readability-style parsing
  • PDFs: Parsed using PyMuPDF with optional OCR for scanned documents
Python
from statement_extractor.document import DocumentPipeline

pipeline = DocumentPipeline()

# Process a web page
ctx = await pipeline.process_url("https://example.com/article")

# Process a PDF with OCR
from statement_extractor.document import URLLoaderConfig
config = URLLoaderConfig(use_ocr=True)
ctx = await pipeline.process_url("https://example.com/report.pdf", config)

Cross-Chunk Deduplication

When processing long documents, the same fact may appear in multiple chunks. The deduplicator uses embedding similarity to identify and merge duplicate statements, keeping the highest-confidence version with proper citation tracking.
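A greedy version of this merge logic might look like the following, with tiny 2-dimensional vectors standing in for real embeddings (field names here are illustrative, not the library's internal schema):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def deduplicate(statements, threshold=0.9):
    """Each statement merges into the first kept statement whose embedding
    is similar enough; the higher-confidence copy survives and the
    citations of both are retained."""
    kept = []
    for stmt in statements:
        for i, other in enumerate(kept):
            if cosine(stmt["embedding"], other["embedding"]) >= threshold:
                winner = dict(max(stmt, other, key=lambda s: s["confidence"]))
                winner["citations"] = sorted(set(stmt["citations"]) | set(other["citations"]))
                kept[i] = winner
                break
        else:
            kept.append(stmt)
    return kept

stmts = [
    {"text": "Apple acquired Beats", "embedding": [1.0, 0.0],   "confidence": 0.8, "citations": [0]},
    {"text": "Apple bought Beats",   "embedding": [0.99, 0.05], "confidence": 0.9, "citations": [3]},
    {"text": "Tesla opened a plant", "embedding": [0.0, 1.0],   "confidence": 0.7, "citations": [1]},
]
merged = deduplicate(stmts)
```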


Entity Embedding Database

UPDATED in v0.9.0

The entity embedding database provides fast qualification for both organizations and people using vector similarity search.

Organization Data Sources

Source          | Records | Identifier                    | Date Fields
Companies House | 5.5M    | UK Company Number             | from_date: Incorporation, to_date: Dissolution
GLEIF           | 2.6M    | LEI (Legal Entity Identifier) | from_date: LEI registration date
Wikidata        | 1.5M    | QID                           | from_date: Inception (P571), to_date: Dissolution (P576)
SEC Edgar       | 73K     | CIK (Central Index Key)       | from_date: First SEC filing date

Total: 9.6M+ organization records

Person Data Sources UPDATED in v0.9.3

Source          | Records | Identifier    | Coverage
Companies House | 27.5M   | Person Number | UK company officers and directors
Wikidata        | 13.4M   | QID           | Notable people with English Wikipedia articles

Total: ~41M people records

Person Types

PersonType   | Description                                         | Example People
executive    | C-suite, board members                              | Tim Cook, Satya Nadella
politician   | Elected officials (presidents, MPs, mayors)         | Joe Biden, Angela Merkel
government   | Civil servants, diplomats, appointed officials      | Ambassadors, agency heads
military     | Military officers, armed forces personnel           | Generals, admirals
legal        | Judges, lawyers, legal professionals                | Supreme Court justices
professional | Known for profession (doctors, engineers)           | Famous surgeons, architects
athlete      | Sports figures                                      | LeBron James, Lionel Messi
artist       | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift
media        | Internet/social media personalities                 | YouTubers, influencers, podcasters
academic     | Professors, researchers                             | Neil deGrasse Tyson
scientist    | Scientists, inventors                               | Research scientists
journalist   | Reporters, news presenters                          | Anderson Cooper
entrepreneur | Founders, business owners                           | Mark Zuckerberg
activist     | Advocates, campaigners                              | Greta Thunberg

People are imported from Companies House (UK company officers) and Wikidata (notable people with English Wikipedia articles). Each person record includes:

  • name: Display name
  • known_for_role: Primary role (e.g., "CEO", "President")
  • known_for_org: Primary organization (e.g., "Apple Inc", "Tesla")
  • country: Country of citizenship
  • person_type: Classification category
  • from_date: Role start date (ISO format)
  • to_date: Role end date (ISO format)
  • birth_date: Date of birth (ISO format) v0.9.2
  • death_date: Date of death if deceased (ISO format) v0.9.2

Note: The same person can have multiple records with different role/org combinations (e.g., Tim Cook as "CEO at Apple" and "Board Director at Nike"). The unique constraint is on (source, source_id, known_for_role, known_for_org).

When organizations are discovered during people import (employers, affiliated orgs), they are automatically inserted into the organizations table if not already present. Each person record has a known_for_org_id foreign key linking to the organizations table, enabling efficient joins and lookups.

EntityType Classification

NEW in v0.8.0

Each organization record is classified with an entity_type field to distinguish between businesses, non-profits, government agencies, and other organization types:

Category   | Types                                          | Description
Business   | business, fund, branch                         | Commercial entities, investment funds, branch offices
Non-profit | nonprofit, ngo, foundation, trade_union        | Charitable organizations, NGOs, labor unions
Government | government, international_org, political_party | Government agencies, UN/WHO/IMF, political parties
Education  | educational, research                          | Schools, universities, research institutes
Other      | healthcare, media, sports, religious           | Hospitals, studios, sports clubs, religious orgs
Unknown    | unknown                                        | Classification not determined

How It Works

  1. Embedding Generation: Organization names are embedded using EmbeddingGemma (300M params)
  2. Vector Search: sqlite-vec enables fast similarity search across millions of records
  3. Qualification: When an ORG entity is found, the database is searched for matching organizations
  4. Identifier Resolution: Matched organizations provide LEI, CIK, company numbers, etc.
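The flow can be illustrated with an in-memory stand-in for the embedding tables. The identifiers and 3-dimensional vectors below are made up; the real pipeline uses 768-dimensional EmbeddingGemma vectors and sqlite-vec for the search:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Dummy rows pairing an organization record with a pretend name embedding
ROWS = [
    ({"name": "Microsoft Corporation", "source": "gleif",     "source_id": "LEI-AAA"},    [0.9, 0.1, 0.0]),
    ({"name": "Microsoft Ireland",     "source": "gleif",     "source_id": "LEI-BBB"},    [0.8, 0.2, 0.1]),
    ({"name": "Apple Inc",             "source": "sec_edgar", "source_id": "CIK-000123"}, [0.0, 0.9, 0.1]),
]

def qualify(entity_embedding, top_k=2, min_similarity=0.7):
    """Steps 2-4 in miniature: similarity search over the rows, then the
    matched records supply their canonical identifiers."""
    scored = [(cosine(entity_embedding, emb), rec) for rec, emb in ROWS]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(round(sim, 3), rec) for sim, rec in scored[:top_k] if sim >= min_similarity]

# An ORG mention whose embedding lands near the Microsoft rows
matches = qualify([0.88, 0.12, 0.02])
```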

Other Tables NEW in v0.9.4

Table     | Records | Description
Roles     | 94K+    | Job titles with Wikidata QIDs (CEO, Director, etc.)
Locations | 25K+    | Countries, states, and cities with hierarchy

Database Variants

  • entities-lite.db (30.1 GB): Core fields and int8 embeddings only (default download)
  • entities.db (32.2 GB): Full database with complete source metadata
  • *.db.gz: Gzip compressed versions for faster downloads
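The int8 variant trades a little precision for a 4x smaller embedding footprint (1 byte per dimension instead of 4 for float32). A sketch of symmetric scalar quantization, one common way to produce such embeddings (the library's actual scheme is not documented here):

```python
def quantize_int8(vec):
    """Map floats in [-m, m] onto int8 values [-127, 127] with one
    per-vector scale factor. Illustrative only."""
    m = max(abs(x) for x in vec) or 1.0
    scale = m / 127.0
    q = [max(-128, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

vec = [0.5, -0.25, 0.125, 0.0]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
# Each reconstructed value is within one quantization step of the original.
```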

Entity Database

The entity database provides fast lookup and qualification of organizations, people, roles, and locations using vector similarity search. It stores records from authoritative sources with 768-dimensional embeddings for semantic matching.

UPDATED in v0.9.4

Quick Start

Bash
# Download the pre-built database
corp-extractor db download

# Check what's in it
corp-extractor db status

# Search for organizations
corp-extractor db search "Microsoft"

# Search for people
corp-extractor db search-people "Tim Cook"

# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"

# Search for locations (v0.9.4)
corp-extractor db search-locations "California"

The database is automatically used by the pipeline's qualification stage (Stage 3) to resolve entity names to canonical identifiers.


Getting the Database

Download Pre-built Database

The fastest way to get started is downloading from HuggingFace:

Bash
# Download lite version (default, smaller, faster)
corp-extractor db download

# Download full version (includes complete source metadata)
corp-extractor db download --full

Database variants:

File             | Size    | Contents
entities-lite.db | 30.1 GB | Core fields + int8 embeddings only
entities.db      | 32.2 GB | Full records with source metadata

Storage location: ~/.cache/corp-extractor/entities-v2.db (v0.9.4+)

HuggingFace repo: Corp-o-Rate-Community/entity-references

Automatic Download

If you use the pipeline without downloading first, the database downloads automatically:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced...")
# Database downloaded automatically if not present

Database Schema

The database uses SQLite with the sqlite-vec extension for vector similarity search.

Schema v2 (Normalized)

v0.9.4

The v2 schema uses INTEGER foreign keys to enum lookup tables instead of TEXT columns:

sql
-- Enum tables: source_types, people_types, organization_types, location_types
-- Organization: source_id (FK), entity_type_id (FK), region_id (FK to locations)
-- People: source_id (FK), person_type_id (FK), country_id (FK), known_for_role_id (FK)
-- Roles: qid, name, source_id (FK), canon_id
-- Locations: qid, name, source_id (FK), location_type_id (FK), parent_ids (hierarchy)

Organizations Table

sql
CREATE TABLE organizations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    qid INTEGER,                 -- Wikidata QID as integer (v0.9.4)
    name TEXT NOT NULL,
    name_normalized TEXT NOT NULL,
    source_id INTEGER NOT NULL,  -- FK to source_types(id)
    source_identifier TEXT NOT NULL,  -- LEI, CIK, Company Number
    region_id INTEGER,           -- FK to locations(id) (v0.9.4)
    entity_type_id INTEGER NOT NULL,  -- FK to organization_types(id)
    from_date TEXT,              -- ISO YYYY-MM-DD
    to_date TEXT,                -- ISO YYYY-MM-DD
    record TEXT NOT NULL,        -- JSON (empty in lite version)
    UNIQUE(source_identifier, source_id)
);

-- Both float32 and int8 embeddings supported (v0.9.4)
CREATE VIRTUAL TABLE organization_embeddings USING vec0(
    org_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE organization_embeddings_scalar USING vec0(
    org_id INTEGER PRIMARY KEY, embedding int8[768]
);

People Table

sql
CREATE TABLE people (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    qid INTEGER,                 -- Wikidata QID as integer (v0.9.4)
    name TEXT NOT NULL,
    name_normalized TEXT NOT NULL,
    source_id INTEGER NOT NULL,  -- FK to source_types(id)
    source_identifier TEXT NOT NULL,  -- QID, Owner CIK, Person number
    country_id INTEGER,          -- FK to locations(id) (v0.9.4)
    person_type_id INTEGER NOT NULL,  -- FK to people_types(id)
    known_for_role_id INTEGER,   -- FK to roles(id) (v0.9.4)
    known_for_org TEXT DEFAULT '',
    known_for_org_id INTEGER,    -- FK to organizations(id)
    from_date TEXT,              -- Role start date (ISO)
    to_date TEXT,                -- Role end date (ISO)
    birth_date TEXT,             -- ISO YYYY-MM-DD
    death_date TEXT,             -- ISO YYYY-MM-DD
    record TEXT NOT NULL,
    UNIQUE(source_identifier, source_id, known_for_role_id, known_for_org_id)
);

CREATE VIRTUAL TABLE person_embeddings USING vec0(
    person_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE person_embeddings_scalar USING vec0(
    person_id INTEGER PRIMARY KEY, embedding int8[768]
);

New Tables (v0.9.4)

sql
-- Roles table for job titles
CREATE TABLE roles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    qid INTEGER,                 -- Wikidata QID (e.g., 484876 for CEO)
    name TEXT NOT NULL,          -- "Chief Executive Officer"
    name_normalized TEXT NOT NULL,
    source_id INTEGER NOT NULL,  -- FK to source_types(id)
    canon_id INTEGER DEFAULT NULL,
    UNIQUE(name_normalized, source_id)
);

-- Locations table for geopolitical entities
CREATE TABLE locations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    qid INTEGER,                 -- Wikidata QID (e.g., 30 for USA)
    name TEXT NOT NULL,          -- "United States", "California"
    name_normalized TEXT NOT NULL,
    source_id INTEGER NOT NULL,  -- FK to source_types(id)
    source_identifier TEXT,      -- "US", "CA"
    parent_ids TEXT,             -- JSON array of parent location IDs
    location_type_id INTEGER NOT NULL,  -- FK to location_types(id)
    UNIQUE(source_identifier, source_id)
);

Entity Types

Organization EntityTypes

Category   | Types
Business   | business, fund, branch
Non-profit | nonprofit, ngo, foundation, trade_union
Government | government, international_org, political_party
Education  | educational, research
Other      | healthcare, media, sports, religious, unknown

Person PersonTypes

Type         | Description               | Examples
executive    | C-suite, board members    | Tim Cook, Satya Nadella
politician   | Elected officials         | Presidents, MPs, mayors
government   | Civil servants, diplomats | Agency heads, ambassadors
military     | Armed forces personnel    | Generals, admirals
legal        | Judges, lawyers           | Supreme Court justices
professional | Known for profession      | Famous surgeons, architects
academic     | Professors, researchers   | Neil deGrasse Tyson
scientist    | Scientists, inventors     | Research scientists
athlete      | Sports figures            | LeBron James
artist       | Traditional creatives     | Musicians, actors, painters
media        | Internet personalities    | YouTubers, influencers
journalist   | Reporters, presenters     | Anderson Cooper
entrepreneur | Founders, business owners | Mark Zuckerberg
activist     | Advocates, campaigners    | Greta Thunberg

Simplified Location Types

v0.9.4
Type        | Description                           | Examples
continent   | Continents                            | Europe, Asia, Africa
country     | Sovereign states                      | United States, Germany, Japan
subdivision | States, provinces, regions            | California, Bavaria, Ontario
city        | Cities, towns, municipalities         | New York, Paris, Tokyo
district    | Districts, boroughs, neighborhoods    | Manhattan, Westminster
historic    | Former countries, historic territories | Soviet Union, Prussia

Data Sources

Organizations

Source          | Records | Identifier                    | Coverage
Companies House | 5.5M    | UK Company Number             | UK registered companies
GLEIF           | 2.6M    | LEI (Legal Entity Identifier) | Global legal entities
Wikidata        | 1.5M    | QID                           | Notable organizations
SEC Edgar       | 73K     | CIK (Central Index Key)       | US public companies

Total: 9.6M+ organizations

People

Source          | Records | Identifier    | Coverage
Companies House | 27.5M   | Person number | UK company officers
Wikidata        | 13.4M   | QID           | Notable people worldwide

Total: ~41M people

Other Tables

Table     | Records | Description
Roles     | 94K     | Job titles with Wikidata QIDs
Locations | 25K     | Countries, states, cities with hierarchy

Python API

Search Organizations

Python
from statement_extractor.database import OrganizationDatabase

db = OrganizationDatabase()

# Search by name (hybrid: text + embedding)
matches = db.search_by_name("Microsoft Corporation", top_k=5)
for match in matches:
    print(f"{match.company.name} ({match.company.source}:{match.company.source_id})")
    print(f"  Similarity: {match.similarity_score:.3f}")
    print(f"  Type: {match.company.entity_type}")

# Search by embedding
from statement_extractor.database import CompanyEmbedder

embedder = CompanyEmbedder()
embedding = embedder.embed("Microsoft")
matches = db.search(embedding, top_k=10, min_similarity=0.7)

Search People

Python
from statement_extractor.database import PersonDatabase

db = PersonDatabase()

# Search by name
matches = db.search_by_name("Tim Cook", top_k=5)
for match in matches:
    print(f"{match.person.name} - {match.person.known_for_role} at {match.person.known_for_org}")
    print(f"  Wikidata: {match.person.source_id}")
    print(f"  Type: {match.person.person_type}")

Use in Pipeline

The database is automatically used by qualification plugins:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced new AI features.")

for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
    # e.g., "Satya Nadella (CEO, Microsoft) --[announced]--> new AI features"

Add Custom Records

Python
from statement_extractor.database import OrganizationDatabase, CompanyRecord, EntityType

db = OrganizationDatabase()

record = CompanyRecord(
    name="My Company Inc",
    source="custom",
    source_id="CUSTOM001",
    region="US",
    entity_type=EntityType.business,
    record={"custom_field": "value"},
)
db.add_record(record)

Building Your Own Database

Import Organizations

Bash
# Companies House - UK companies (5.5M records)
corp-extractor db import-companies-house --download

# GLEIF - Global LEI data (2.6M records)
corp-extractor db import-gleif --download
corp-extractor db import-gleif /path/to/lei-data.json --limit 50000

# SEC Edgar - US public companies (73K filers)
corp-extractor db import-sec --download

# Wikidata organizations via SPARQL (1.5M records)
corp-extractor db import-wikidata --limit 50000

Import People

Bash
# Import by person type
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --type politician --limit 5000
corp-extractor db import-people --type athlete --limit 5000

# Import all person types
corp-extractor db import-people --all --limit 50000

# Skip existing records (faster for incremental updates)
corp-extractor db import-people --type executive --skip-existing

# Fetch role start/end dates (slower, queries per person)
corp-extractor db import-people --type executive --enrich-dates

Wikidata Dump Import

v0.9.4

For large imports without SPARQL query timeouts:

Bash
# Download and import from Wikidata dump (~100GB compressed)
corp-extractor db import-wikidata-dump --download --limit 100000

# Import from local dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2

# Import only people (no organizations)
corp-extractor db import-wikidata-dump --dump dump.bz2 --people --no-orgs

# Import only locations (countries, states, cities) - v0.9.4
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs

# Resume interrupted import
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume

# Skip records already in database
corp-extractor db import-wikidata-dump --dump dump.bz2 --skip-updates

Fast download with aria2c: Install aria2c for 10-20x faster downloads:

Bash
brew install aria2   # macOS
apt install aria2    # Ubuntu/Debian

Full Build Process

Bash
# 1. Import from all sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 100000
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000

# 2. Link equivalent records
corp-extractor db canonicalize

# 3. Generate scalar embeddings (75% storage reduction)
corp-extractor db backfill-scalar

# 4. Check status
corp-extractor db status

# 5. Upload to HuggingFace
export HF_TOKEN="hf_..."
corp-extractor db upload

Migrate to v2 Schema

v0.9.4

To migrate an existing v1 database to the normalized v2 schema:

Bash
# Create new v2 database (preserves original)
corp-extractor db migrate-v2 entities.db entities-v2.db

# Resume interrupted migration
corp-extractor db migrate-v2 entities.db entities-v2.db --resume

The v2 schema provides:

  • INTEGER FK columns instead of TEXT enums (better performance)
  • New enum lookup tables for type filtering
  • New roles and locations tables
  • QIDs as integers (Q prefix stripped)
  • Human-readable views with JOINs

Canonicalization

Link equivalent records across sources:

Bash
corp-extractor db canonicalize

Organizations

Matches organizations by:

  • Global identifiers: LEI, CIK, ticker (no region check needed)
  • Normalized name + region: Handles suffix variations (Ltd → Limited, Corp → Corporation)

Source priority: gleif > sec_edgar > companies_house > wikipedia
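Suffix-aware name normalization can be sketched as follows. The suffix map is an illustrative subset, not the library's actual list:

```python
import re

# Map common suffix variants onto one canonical form (illustrative subset)
SUFFIXES = {
    "ltd": "limited", "ltd.": "limited",
    "corp": "corporation", "corp.": "corporation",
    "inc": "incorporated", "inc.": "incorporated",
    "co": "company", "co.": "company",
}

def normalize_name(name):
    """Lowercase, strip punctuation, and canonicalize a trailing suffix
    so that 'Ltd' and 'Limited' variants compare equal."""
    tokens = re.sub(r"[^\w\s.]", "", name.lower()).split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(t.rstrip(".") for t in tokens)

a = normalize_name("ACME Holdings Ltd")
b = normalize_name("Acme Holdings Limited")
c = normalize_name("Acme Corp.")
```

Two records whose normalized names and regions agree can then be linked into one canonical group.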

People

v0.9.3

Matches people by:

  • Normalized name + same organization: Uses org canonical group to link people across sources
  • Normalized name + overlapping date ranges: Links records with matching tenure periods

Source priority: wikidata > sec_edgar > companies_house

Canonicalization enables prominence-based search re-ranking that boosts entities with records from multiple authoritative sources.
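A simplified illustration of prominence-based re-ranking: matches sharing a canonical group are collapsed, and groups backed by more sources receive a small score boost. The 5% factor and field names are invented for the example:

```python
def rerank(matches):
    """Collapse matches by canonical group, then boost groups whose
    records come from more than one source."""
    groups = {}
    for m in matches:
        groups.setdefault(m["canon_id"], []).append(m)
    reranked = []
    for group in groups.values():
        sources = {m["source"] for m in group}
        best = max(group, key=lambda m: m["similarity"])
        boost = 1.0 + 0.05 * (len(sources) - 1)  # invented boost factor
        reranked.append({**best, "score": best["similarity"] * boost})
    return sorted(reranked, key=lambda m: m["score"], reverse=True)

matches = [
    {"canon_id": 1, "source": "gleif",     "similarity": 0.90},
    {"canon_id": 1, "source": "sec_edgar", "similarity": 0.88},
    {"canon_id": 2, "source": "wikidata",  "similarity": 0.92},
]
top = rerank(matches)
# The multi-source group overtakes the single-source match with the
# higher raw similarity.
```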


Data Models

CompanyRecord

Python
class CompanyRecord(BaseModel):
    name: str                    # Organization name
    source: str                  # 'gleif', 'sec_edgar', 'companies_house', 'wikipedia'
    source_id: str               # LEI, CIK, UK Company Number, or QID
    region: str                  # Country/region code
    entity_type: EntityType      # Classification
    from_date: Optional[str]     # ISO YYYY-MM-DD
    to_date: Optional[str]       # ISO YYYY-MM-DD
    record: dict[str, Any]       # Full source record (empty in lite)

    @property
    def canonical_id(self) -> str:
        return f"{self.source}:{self.source_id}"

PersonRecord

Python
class PersonRecord(BaseModel):
    name: str                    # Display name
    source: str                  # 'wikidata'
    source_id: str               # Wikidata QID
    country: str                 # Country code
    person_type: PersonType      # Classification
    known_for_role: str          # Primary role
    known_for_org: str           # Primary organization name
    known_for_org_id: Optional[int]  # FK to organizations
    from_date: Optional[str]     # Role start (ISO)
    to_date: Optional[str]       # Role end (ISO)
    birth_date: Optional[str]    # Birth date (ISO)
    death_date: Optional[str]    # Death date (ISO)
    record: dict[str, Any]       # Full source record

    @property
    def is_historic(self) -> bool:
        return self.death_date is not None

Match Results

Python
class CompanyMatch(BaseModel):
    company: CompanyRecord
    similarity_score: float      # 0.0 to 1.0

class PersonMatch(BaseModel):
    person: PersonRecord
    similarity_score: float      # 0.0 to 1.0
    llm_confirmed: bool          # Whether LLM validated match

Embedding Model

Embeddings are generated using google/embeddinggemma-300m:

  • Parameters: 300M (lightweight)
  • Dimensions: 768
  • Optimized for: CPU inference
  • Auto-download: Model downloads automatically on first use
Python
from statement_extractor.database import CompanyEmbedder

embedder = CompanyEmbedder()
embedding = embedder.embed("Apple Inc")  # Returns 768-dim numpy array

Troubleshooting

Database not found:

Text
Error: Database not found at ~/.cache/corp-extractor/entities.db

Run corp-extractor db download to fetch the pre-built database.

sqlite-vec extension error:

Text
Error: no such module: vec0

The sqlite-vec extension should install automatically. If not: pip install sqlite-vec

Memory issues with large dumps:

Bash
# Import in smaller batches
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates
# Then resume for more
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates --resume

Resume interrupted import:

Bash
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume

Progress is saved to ~/.cache/corp-extractor/wikidata-dump-progress.json.

API Reference

Functions

The library provides convenience functions for quick extraction without managing extractor instances.

Function                                            | Returns          | Description
extract_statements(text, options?)                  | ExtractionResult | Main extraction function. Returns structured statements with confidence scores.
extract_statements_as_json(text, options?, indent?) | str              | Returns extraction result as a JSON string.
extract_statements_as_xml(text, options?)           | str              | Returns raw XML output from the model.
extract_statements_as_dict(text, options?)          | dict             | Returns extraction result as a Python dictionary.

Function Signatures

Python
def extract_statements(
    text: str,
    options: Optional[ExtractionOptions] = None,
    **kwargs
) -> ExtractionResult:
    """
    Extract structured statements from text.

    Args:
        text: Input text to extract statements from
        options: Extraction options (or pass individual options as kwargs)
        **kwargs: Individual option overrides (num_beams, diversity_penalty, etc.)

    Returns:
        ExtractionResult containing Statement objects
    """
Python
def extract_statements_as_json(
    text: str,
    options: Optional[ExtractionOptions] = None,
    indent: Optional[int] = 2,
    **kwargs
) -> str:
    """Returns JSON string representation of the extraction result."""
Python
def extract_statements_as_xml(
    text: str,
    options: Optional[ExtractionOptions] = None,
    **kwargs
) -> str:
    """Returns XML string with <statements> containing <stmt> elements."""
Python
def extract_statements_as_dict(
    text: str,
    options: Optional[ExtractionOptions] = None,
    **kwargs
) -> dict:
    """Returns dictionary representation of the extraction result."""

Usage Examples

Python
from statement_extractor import extract_statements, extract_statements_as_json

# Basic extraction
result = extract_statements("Apple acquired Beats for $3 billion.")
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")

# With options via kwargs
result = extract_statements(
    "Tesla announced new factories.",
    num_beams=6,
    diversity_penalty=1.5
)

# JSON output
json_str = extract_statements_as_json("OpenAI released GPT-4.", indent=2)
print(json_str)

Classes

StatementExtractor

The main extractor class with full control over device, model loading, and extraction options.

Python
class StatementExtractor:
    def __init__(
        self,
        model_id: str = "Corp-o-Rate-Community/statement-extractor",
        device: Optional[str] = None,
        torch_dtype: Optional[torch.dtype] = None,
        predicate_taxonomy: Optional[PredicateTaxonomy] = None,
        predicate_config: Optional[PredicateComparisonConfig] = None,
        scoring_config: Optional[ScoringConfig] = None,
    ):
        """
        Initialize the statement extractor.

        Args:
            model_id: HuggingFace model ID or local path
            device: Device to use ('cuda', 'cpu', or None for auto-detect)
            torch_dtype: Torch dtype (default: bfloat16 on GPU, float32 on CPU)
            predicate_taxonomy: Optional taxonomy for predicate normalization
            predicate_config: Configuration for predicate comparison
            scoring_config: Configuration for quality scoring
        """

    def extract(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> ExtractionResult:
        """Extract statements from text."""

    def extract_as_xml(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> str:
        """Extract statements and return raw XML output."""

    def extract_as_json(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
        indent: Optional[int] = 2,
    ) -> str:
        """Extract statements and return JSON string."""

    def extract_as_dict(
        self,
        text: str,
        options: Optional[ExtractionOptions] = None,
    ) -> dict:
        """Extract statements and return as dictionary."""

Example: Custom extractor with GPU control

Python
from statement_extractor import StatementExtractor, ExtractionOptions

# Force CPU usage
extractor = StatementExtractor(device="cpu")

# Extract with custom options
options = ExtractionOptions(num_beams=6, diversity_penalty=1.2)
result = extractor.extract("Microsoft partnered with OpenAI.", options)

ExtractionOptions

Configuration for the extraction process.

Python
class ExtractionOptions(BaseModel):
    # Beam search parameters
    num_beams: int = 4                    # 1-16, beams for diverse beam search
    diversity_penalty: float = 1.0        # >= 0.0, penalty for beam diversity
    max_new_tokens: int = 2048            # 128-8192, max tokens to generate
    min_statement_ratio: float = 1.0      # >= 0.0, min statements per sentence
    max_attempts: int = 3                 # 1-10, extraction retry attempts
    deduplicate: bool = True              # Remove duplicate statements

    # Predicate taxonomy & comparison
    predicate_taxonomy: Optional[PredicateTaxonomy] = None
    predicate_config: Optional[PredicateComparisonConfig] = None

    # Scoring configuration (v0.2.0)
    scoring_config: Optional[ScoringConfig] = None

    # Pluggable canonicalization
    entity_canonicalizer: Optional[Callable[[str], str]] = None

    # Mode flags
    merge_beams: bool = True              # Merge top-N beams vs select best
    embedding_dedup: bool = True          # Use embedding similarity for dedup

ScoringConfig

Quality scoring parameters for beam selection and triple assessment. Added in v0.2.0.

Python
class ScoringConfig(BaseModel):
    quality_weight: float = 1.0           # >= 0.0, weight for confidence scores
    coverage_weight: float = 0.5          # >= 0.0, bonus for source text coverage
    redundancy_penalty: float = 0.3       # >= 0.0, penalty for duplicate triples
    length_penalty: float = 0.1           # >= 0.0, penalty for verbosity
    min_confidence: float = 0.0           # 0.0-1.0, minimum confidence threshold
    merge_top_n: int = 3                  # 1-10, beams to merge when merge_beams=True

Tuning for precision vs recall:

Use Case       | min_confidence | Notes
High recall    | 0.0            | Keep all extractions
Balanced       | 0.5            | Filter low-confidence triples
High precision | 0.8            | Only keep high-confidence triples

PredicateTaxonomy

A taxonomy of canonical predicates for normalization.

Python
class PredicateTaxonomy(BaseModel):
    predicates: list[str]                 # List of canonical predicate forms
    name: Optional[str] = None            # Optional taxonomy name

    @classmethod
    def from_file(cls, path: str | Path) -> "PredicateTaxonomy":
        """Load taxonomy from a file (one predicate per line)."""

    @classmethod
    def from_list(cls, predicates: list[str], name: Optional[str] = None) -> "PredicateTaxonomy":
        """Create taxonomy from a list of predicates."""

Example:

Python
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements

# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "works_for", "located_in", "partnered_with"
])

# Use in extraction
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements("Google bought YouTube.", options)
# predicate "bought" maps to canonical "acquired"

PredicateComparisonConfig

Configuration for embedding-based predicate comparison.

Python
class PredicateComparisonConfig(BaseModel):
    embedding_model: str = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    similarity_threshold: float = 0.65    # 0.0-1.0, min similarity for taxonomy match
    dedup_threshold: float = 0.65         # 0.0-1.0, min similarity for duplicates
    normalize_text: bool = True           # Lowercase and strip before embedding

Data Models

All data models use Pydantic for validation and serialization.

Statement

A single extracted subject-predicate-object triple.

Python
class Statement(BaseModel):
    subject: Entity                              # The subject entity
    predicate: str                               # The relationship/predicate
    object: Entity                               # The object entity
    source_text: Optional[str] = None            # Original text span

    # Quality scoring fields (v0.2.0)
    confidence_score: Optional[float] = None     # 0.0-1.0, quality score (semantic + entity)
    evidence_span: Optional[tuple[int, int]] = None  # Character offsets in source
    canonical_predicate: Optional[str] = None    # Canonical form if taxonomy used

    def as_triple(self) -> tuple[str, str, str]:
        """Return as (subject, predicate, object) tuple."""

    def __str__(self) -> str:
        """Format: 'subject -- predicate --> object'"""

Example:

Python
stmt = result.statements[0]
print(stmt.subject.text)           # "Apple Inc."
print(stmt.predicate)              # "acquired"
print(stmt.object.text)            # "Beats Electronics"
print(stmt.confidence_score)       # 0.92
print(stmt.as_triple())            # ("Apple Inc.", "acquired", "Beats Electronics")

Entity

An entity representing a subject or object.

Python
class Entity(BaseModel):
    text: str                        # The entity text
    type: EntityType = EntityType.UNKNOWN  # The entity type

    def __str__(self) -> str:
        """Format: 'text (TYPE)'"""

EntityType

Enumeration of supported entity types.

Python
class EntityType(str, Enum):
    ORG = "ORG"                 # Organization
    PERSON = "PERSON"           # Person
    GPE = "GPE"                 # Geopolitical entity (country, city, state)
    LOC = "LOC"                 # Non-GPE location
    PRODUCT = "PRODUCT"         # Product
    EVENT = "EVENT"             # Event
    WORK_OF_ART = "WORK_OF_ART" # Creative work
    LAW = "LAW"                 # Legal document
    DATE = "DATE"               # Date or time
    MONEY = "MONEY"             # Monetary value
    PERCENT = "PERCENT"         # Percentage
    QUANTITY = "QUANTITY"       # Quantity or measurement
    UNKNOWN = "UNKNOWN"         # Unknown type

ExtractionResult

Container for extraction results. Supports iteration and length.

Python
class ExtractionResult(BaseModel):
    statements: list[Statement] = []     # List of extracted statements
    source_text: Optional[str] = None    # Original input text

    def __len__(self) -> int:
        """Number of statements."""

    def __iter__(self):
        """Iterate over statements."""

    def to_triples(self) -> list[tuple[str, str, str]]:
        """Return all statements as (subject, predicate, object) tuples."""

Example:

Python
result = extract_statements(text)

# Iterate directly
for stmt in result:
    print(stmt)

# Check count
print(f"Found {len(result)} statements")

# Get as simple tuples
triples = result.to_triples()

PredicateMatch

Result of matching a predicate to a canonical form.

Python
class PredicateMatch(BaseModel):
    original: str                        # The original extracted predicate
    canonical: Optional[str] = None      # Matched canonical predicate, if any
    similarity: float = 0.0              # 0.0-1.0, cosine similarity score
    matched: bool = False                # Whether a match was found above threshold

Example:

Python
from statement_extractor import PredicateComparer, PredicateTaxonomy

taxonomy = PredicateTaxonomy.from_list(["acquired", "founded", "works_for"])
comparer = PredicateComparer(taxonomy=taxonomy)

match = comparer.match_to_canonical("bought")
print(match.original)     # "bought"
print(match.canonical)    # "acquired"
print(match.similarity)   # ~0.82
print(match.matched)      # True
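
Conceptually, the comparer embeds both predicates and selects the canonical form with the highest cosine similarity above a threshold. A minimal sketch of that selection step, with hand-made vectors standing in for real sentence embeddings (the function names, vectors, and values here are illustrative, not the library's internals):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def match_to_canonical(pred_vec, canonical_vecs, threshold=0.65):
    """Pick the canonical predicate with the highest similarity; flag if above threshold."""
    best_name, best_sim = None, 0.0
    for name, vec in canonical_vecs.items():
        sim = cosine_similarity(pred_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    matched = best_sim >= threshold
    return (best_name if matched else None), best_sim, matched
```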

Pipeline API

NEW in v0.5.0

The pipeline API provides comprehensive entity resolution and taxonomy classification through a 5-stage plugin architecture.

ExtractionPipeline

The main orchestrator class that runs all pipeline stages.

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

class ExtractionPipeline:
    def __init__(self, config: Optional[PipelineConfig] = None):
        """
        Initialize the extraction pipeline.

        Args:
            config: Pipeline configuration (default: all stages enabled)
        """

    def process(self, text: str, metadata: Optional[dict] = None) -> PipelineContext:
        """
        Process text through the pipeline stages.

        Args:
            text: Input text to process
            metadata: Optional source metadata (document ID, URL, etc.)

        Returns:
            PipelineContext with results from all stages
        """

Example:

Python
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans.")

print(f"Statements: {ctx.statement_count}")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

PipelineConfig

Configuration for stage and plugin selection.

Python
from statement_extractor.pipeline import PipelineConfig

class PipelineConfig(BaseModel):
    enabled_stages: set[int] = {1, 2, 3, 4, 5}  # Stages to run (1-5)
    enabled_plugins: Optional[set[str]] = None   # Plugins to enable (None = all)
    disabled_plugins: set[str] = set()           # Plugins to disable
    fail_fast: bool = False                       # Stop on first error
    parallel_processing: bool = False             # Enable parallel processing
    max_statements: Optional[int] = None          # Limit statements processed

    # Stage-specific options
    splitter_options: dict = {}
    extractor_options: dict = {}
    qualifier_options: dict = {}
    labeler_options: dict = {}
    taxonomy_options: dict = {}

    @classmethod
    def from_stage_string(cls, stages: str, **kwargs) -> "PipelineConfig":
        """Create config from stage string like '1-3' or '1,2,5'."""

    @classmethod
    def default(cls) -> "PipelineConfig":
        """All stages enabled."""

    @classmethod
    def minimal(cls) -> "PipelineConfig":
        """Only splitting and extraction (stages 1-2)."""

Example:

Python
# Run only stages 1-3
config = PipelineConfig(enabled_stages={1, 2, 3})

# Disable specific plugins
config = PipelineConfig(disabled_plugins={"sec_edgar_qualifier"})

# From stage string
config = PipelineConfig.from_stage_string("1-3")
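
The stage string accepts ranges and comma-separated lists. A minimal sketch of the expansion logic, under the assumption that it follows standard range notation (this is an illustration, not the library's actual implementation):

```python
def parse_stage_string(stages: str) -> set[int]:
    """Expand a stage string like '1-3' or '1,2,5' into a set of stage numbers."""
    result: set[int] = set()
    for part in stages.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            result.update(range(int(lo), int(hi) + 1))
        else:
            result.add(int(part))
    return result
```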

PipelineContext

Data container that flows through all pipeline stages.

Python
from statement_extractor.pipeline import PipelineContext

class PipelineContext(BaseModel):
    # Input
    source_text: str                                    # Original input text
    source_metadata: dict = {}                          # Document metadata

    # Stage outputs
    raw_triples: list[RawTriple] = []                   # Stage 1 output
    statements: list[PipelineStatement] = []           # Stage 2 output
    canonical_entities: dict[str, CanonicalEntity] = {} # Stage 3 output
    labeled_statements: list[LabeledStatement] = []    # Stage 4 output
    taxonomy_results: dict[tuple, list[TaxonomyResult]] = {}  # Stage 5 output (multiple labels per statement)

    # Processing metadata
    processing_errors: list[str] = []
    processing_warnings: list[str] = []
    stage_timings: dict[str, float] = {}

    @property
    def statement_count(self) -> int:
        """Number of statements in final output."""

    @property
    def has_errors(self) -> bool:
        """Check if any errors occurred."""

PluginRegistry

Registry for discovering and managing plugins.

Python
from statement_extractor.pipeline import PluginRegistry

class PluginRegistry:
    @classmethod
    def list_plugins(cls, stage: Optional[int] = None) -> list[dict]:
        """List all registered plugins, optionally filtered by stage."""

    @classmethod
    def get_plugin(cls, name: str) -> Optional[BasePlugin]:
        """Get a plugin by name."""

Pipeline Data Models

RawTriple

Output of Stage 1 (Splitting).

Python
class RawTriple(BaseModel):
    subject_text: str                    # Raw subject text
    predicate_text: str                  # Raw predicate text
    object_text: str                     # Raw object text
    source_sentence: str                 # Source sentence
    confidence: float = 1.0              # Extraction confidence (0-1)

    def as_tuple(self) -> tuple[str, str, str]:
        """Return as (subject, predicate, object) tuple."""

PipelineStatement

Output of Stage 2 (Extraction).

Python
class PipelineStatement(BaseModel):
    subject: ExtractedEntity             # Subject with type, span, confidence
    predicate: str                       # Predicate text
    predicate_category: Optional[str]    # Predicate category (e.g., "employment_leadership")
    object: ExtractedEntity              # Object with type, span, confidence
    source_text: str                     # Source text
    confidence_score: float = 1.0        # Overall confidence (from GLiNER2)
    extraction_method: Optional[str]     # Method: gliner_relation

Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. Relations are sorted by confidence (descending).


GLiNER2Extractor

The Stage 2 extractor plugin that uses GLiNER2 for relation extraction.

Python
from statement_extractor.plugins.extractors.gliner2 import GLiNER2Extractor

class GLiNER2Extractor(BaseExtractorPlugin):
    def __init__(
        self,
        predicates: Optional[list[str]] = None,
        predicates_file: Optional[str | Path] = None,
        entity_types: Optional[list[str]] = None,
        use_default_predicates: bool = True,
    ):
        """
        Initialize the GLiNER2 extractor.

        Args:
            predicates: Custom list of predicate names
            predicates_file: Path to custom predicates JSON file
            entity_types: Entity types to extract (default: all)
            use_default_predicates: Use 324 built-in predicates when no custom provided
        """

Key behaviors:

  • Uses include_confidence=True for real confidence scores from GLiNER2
  • Iterates through 21 predicate categories to stay under GLiNER2's ~25 label limit
  • Returns all matching relations per source sentence (filtered later)
  • Predicates loaded from default_predicates.json (324 predicates)
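
The batching idea in the second point can be sketched with a simple chunking helper (illustrative only; the extractor iterates its named predicate categories rather than fixed-size chunks):

```python
def batch_labels(labels: list[str], max_per_call: int = 25) -> list[list[str]]:
    """Split a long label list into chunks that fit under a per-call label limit."""
    return [labels[i:i + max_per_call] for i in range(0, len(labels), max_per_call)]
```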

EntityQualifiers

Qualifiers added in Stage 3.

Python
class EntityQualifiers(BaseModel):
    # Semantic qualifiers
    org: Optional[str] = None            # Organization/employer
    role: Optional[str] = None           # Job title/position

    # Location qualifiers
    region: Optional[str] = None         # State/province
    country: Optional[str] = None        # Country
    city: Optional[str] = None           # City
    jurisdiction: Optional[str] = None   # Legal jurisdiction

    # External identifiers
    identifiers: dict[str, str] = {}     # lei, ch_number, sec_cik, ticker, etc.

    def has_any_qualifier(self) -> bool:
        """Check if any qualifier is set."""

CanonicalMatch

Result of canonical matching in Stage 3.

Python
class CanonicalMatch(BaseModel):
    canonical_id: Optional[str]          # ID in canonical database
    canonical_name: Optional[str]        # Canonical name/label
    match_method: str                    # identifier, name_exact, name_fuzzy, embedding
    match_confidence: float = 1.0        # Confidence in match (0-1)
    match_details: Optional[dict]        # Additional match details

CanonicalEntity

Output of Stage 3 (Entity Qualification).

Python
class CanonicalEntity(BaseModel):
    entity_ref: str                      # Reference to original entity
    original_text: str                   # Original entity text
    entity_type: EntityType              # Entity type
    qualifiers: EntityQualifiers         # Qualifiers and identifiers
    canonical_match: Optional[CanonicalMatch]  # Canonical match if found
    fqn: str                             # Fully Qualified Name
    qualification_sources: list[str]     # Plugins that contributed
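
The `fqn` string combines the canonical name with its qualifiers. A sketch of one plausible flattening (the exact format the library emits may differ; `build_fqn` is our illustrative helper, not part of the API):

```python
def build_fqn(name: str, qualifiers: dict[str, str]) -> str:
    """Flatten an entity name and its qualifiers into one readable string."""
    if not qualifiers:
        return name
    parts = ", ".join(f"{k}={v}" for k, v in sorted(qualifiers.items()))
    return f"{name} [{parts}]"
```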

StatementLabel

A label applied in Stage 4.

Python
class StatementLabel(BaseModel):
    label_type: str                      # sentiment, relation_type, confidence
    label_value: Union[str, float, bool] # The label value
    confidence: float = 1.0              # Confidence in label
    labeler: Optional[str]               # Plugin that produced the label

LabeledStatement

Final output from Stage 4 (Labeling).

Python
class LabeledStatement(BaseModel):
    statement: PipelineStatement         # Original statement
    subject_canonical: CanonicalEntity   # Canonicalized subject
    object_canonical: CanonicalEntity    # Canonicalized object
    labels: list[StatementLabel] = []    # Applied labels

    @property
    def subject_fqn(self) -> str:
        """Subject's fully qualified name."""

    @property
    def object_fqn(self) -> str:
        """Object's fully qualified name."""

    def get_label(self, label_type: str) -> Optional[StatementLabel]:
        """Get label by type."""

    def as_dict(self) -> dict:
        """Convert to simplified dictionary."""

Example:

Python
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

    # Access labels
    sentiment = stmt.get_label("sentiment")
    if sentiment:
        print(f"  Sentiment: {sentiment.label_value}")

    # Access qualifiers
    subject_quals = stmt.subject_canonical.qualifiers
    if subject_quals.role:
        print(f"  Role: {subject_quals.role}")

TaxonomyResult

Output of Stage 5 (Taxonomy) classification.

Python
class TaxonomyResult(BaseModel):
    taxonomy_name: str                   # e.g., "esg_topics"
    category: str                        # Top-level category
    label: str                           # Specific label
    label_id: Optional[int] = None       # Numeric ID if available
    confidence: float = 1.0              # Classification confidence (0-1)
    classifier: Optional[str] = None     # Plugin that produced this result
    metadata: dict = {}                  # Additional metadata

    @property
    def full_label(self) -> str:
        """Return category:label format."""

Example:

Python
# Access taxonomy results from context
# Each statement may have multiple labels above the threshold
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"Statement: {source_text[:50]}...")
    print(f"  Taxonomy: {taxonomy_name}")
    print(f"  Labels ({len(results)}):")
    for result in results:
        print(f"    - {result.full_label} (confidence: {result.confidence:.2f})")

ClassificationSchema

Schema for simple multi-choice classification (2-20 options). Used by labelers that need GLiNER2 to perform classification.

Python
class ClassificationSchema(BaseModel):
    label_type: str                      # e.g., "sentiment"
    choices: list[str]                   # Available choices
    description: str = ""                # Description for the classifier
    scope: str = "statement"             # statement or entity

TaxonomySchema

Schema for large taxonomy classification (100+ values). Used by taxonomy plugins.

Python
class TaxonomySchema(BaseModel):
    label_type: str                      # e.g., "taxonomy"
    values: list[str] | dict[str, list[str]]  # Flat list or category -> labels
    description: str = ""
    scope: str = "statement"
    label_descriptions: Optional[dict[str, str]] = None  # Descriptions for labels

Configuration

The statement-extractor library provides fine-grained control over extraction behavior through configuration classes. This section covers all configuration options for tuning precision, recall, and performance.


ExtractionOptions

The primary configuration class for controlling extraction behavior.

| Parameter | Type | Default | Description |
|---|---|---|---|
| num_beams | int | 4 | Number of beam search candidates |
| diversity_penalty | float | 1.0 | Penalty for beam diversity in diverse beam search |
| max_new_tokens | int | 2048 | Maximum generation length in tokens |
| deduplicate | bool | True | Remove duplicate statements from output |
| merge_beams | bool | True | Merge top beams into single result set (v0.2.0) |
| embedding_dedup | bool | True | Use embedding similarity for deduplication (v0.2.0) |
| predicates | list[str] | None | Predefined predicates for GLiNER2 relation extraction (v0.4.0) |
| all_triples | bool | False | Keep all candidate triples instead of best per source |
| predicate_taxonomy | PredicateTaxonomy | None | Taxonomy of canonical predicates |
| scoring_config | ScoringConfig | None | Quality scoring configuration |
| entity_canonicalizer | Callable | None | Custom function for entity canonicalization |

Basic usage:

Python
from statement_extractor import ExtractionOptions, extract_statements

options = ExtractionOptions(
    num_beams=6,
    diversity_penalty=1.2,
    deduplicate=True
)

result = extract_statements("Apple acquired Beats for $3 billion.", options)

ScoringConfig

Added in v0.2.0

Configuration for quality scoring, filtering, and beam selection. Use this to tune the precision-recall tradeoff.

| Parameter | Type | Default | Description |
|---|---|---|---|
| min_confidence | float | 0.0 | Filter threshold (0 = max recall, 0.7+ = high precision) |
| quality_weight | float | 1.0 | Weight for confidence scores |
| coverage_weight | float | 0.5 | Weight for source text coverage |
| redundancy_penalty | float | 0.3 | Penalty for duplicate triples |
| length_penalty | float | 0.1 | Penalty for verbose predicates/entities |
| merge_top_n | int | 3 | Number of beams to merge |

Common configurations:

Python
from statement_extractor import ScoringConfig, ExtractionOptions, extract_statements

# High precision mode - only keep confident extractions
precision_config = ScoringConfig(
    min_confidence=0.7,
    quality_weight=1.5,
    redundancy_penalty=0.5
)

# High recall mode - keep everything
recall_config = ScoringConfig(
    min_confidence=0.0,
    quality_weight=0.5,
    redundancy_penalty=0.1
)

# Use in extraction
options = ExtractionOptions(scoring_config=precision_config)
result = extract_statements(text, options)

Precision vs recall tuning:

| Use Case | min_confidence | quality_weight | Notes |
|---|---|---|---|
| Maximum recall | 0.0 | 0.5 | Keep all extractions |
| Balanced | 0.4 | 1.0 | Good default |
| High precision | 0.7 | 1.5 | Fewer false positives |
| Knowledge base | 0.8 | 2.0 | Very strict |

PredicateComparisonConfig

Added in v0.2.0

Configuration for embedding-based predicate comparison and taxonomy matching. Requires the [embeddings] extra.

| Parameter | Type | Default | Description |
|---|---|---|---|
| embedding_model | str | paraphrase-MiniLM-L6-v2 | Model for computing similarity |
| similarity_threshold | float | 0.65 | Minimum similarity for taxonomy matching |
| dedup_threshold | float | 0.65 | Minimum similarity to consider duplicates |
| normalize_text | bool | True | Lowercase/strip predicates before embedding |

Custom thresholds:

Python
from statement_extractor import (
    PredicateComparisonConfig,
    PredicateTaxonomy,
    ExtractionOptions,
    extract_statements
)

# Stricter matching for precision
config = PredicateComparisonConfig(
    similarity_threshold=0.75,
    dedup_threshold=0.80,
    normalize_text=True
)

taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "works_for", "located_in",
    "partnered_with", "invested_in", "announced"
])

options = ExtractionOptions(
    predicate_taxonomy=taxonomy,
    predicate_config=config
)

result = extract_statements("Google bought YouTube in 2006.", options)

PipelineConfig

NEW in v0.5.0

Configuration for the 5-stage extraction pipeline. Controls which stages run, which plugins are enabled, and stage-specific options.

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled_stages | set[int] | {1, 2, 3, 4, 5} | Stages to run (1-5) |
| enabled_plugins | Optional[set[str]] | None | Plugins to enable (None = all) |
| disabled_plugins | set[str] | set() | Plugins to disable |
| fail_fast | bool | False | Stop on first error |
| parallel_processing | bool | False | Enable parallel processing |
| max_statements | Optional[int] | None | Limit statements processed |

Stage selection examples:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run only splitting and extraction (stages 1-2)
config = PipelineConfig(enabled_stages={1, 2})

# Run stages 1-3 (skip canonicalization and labeling)
config = PipelineConfig(enabled_stages={1, 2, 3})

# From stage string
config = PipelineConfig.from_stage_string("1-3")  # {1, 2, 3}
config = PipelineConfig.from_stage_string("1,2,5")  # {1, 2, 5}

# Use presets
config = PipelineConfig.default()   # All 5 stages
config = PipelineConfig.minimal()   # Stages 1-2 only

Plugin selection examples:

Python
# Disable specific plugins
config = PipelineConfig(
    disabled_plugins={"sec_edgar_qualifier", "companies_house_qualifier"}
)

# Enable only specific plugins
config = PipelineConfig(
    enabled_plugins={"t5_gemma_splitter", "gliner2_extractor", "person_qualifier"}
)

Stage-specific options:

Python
config = PipelineConfig(
    splitter_options={
        "num_beams": 6,
        "diversity_penalty": 1.2,
    },
    extractor_options={
        "predicates_file": "/path/to/custom_predicates.json",  # Custom predicate file
    },
    qualifier_options={
        "timeout": 10.0,  # API timeout
    },
)

GLiNER2 Extractor Options:

| Option | Type | Default | Description |
|---|---|---|---|
| predicates_file | str or Path | None | Path to custom predicates JSON file |
| predicates | list[str] | None | Custom list of predicate names (overrides file) |
| entity_types | list[str] | all types | Entity types to extract |
| use_default_predicates | bool | True | Use 324 built-in predicates when no custom ones provided |

Custom Predicates File Format:

JSON
{
  "category_name": {
    "predicate_name": {
      "description": "Description for semantic matching",
      "threshold": 0.7
    }
  }
}

Example:

JSON
{
  "employment": {
    "works_for": {"description": "Employment relationship", "threshold": 0.75},
    "manages": {"description": "Management relationship", "threshold": 0.7}
  },
  "ownership": {
    "owns": {"description": "Ownership relationship", "threshold": 0.7},
    "acquired": {"description": "Acquisition of entity", "threshold": 0.75}
  }
}
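
Files in this format can be read with the stdlib json module. A sketch that flattens the nested structure into (category, predicate, threshold) rows (`load_predicates` is our helper for illustration, not part of the library):

```python
import json

def load_predicates(json_text: str) -> list[tuple[str, str, float]]:
    """Flatten a predicates JSON document into (category, predicate, threshold) rows."""
    data = json.loads(json_text)
    rows = []
    for category, predicates in data.items():
        for name, spec in predicates.items():
            # Fall back to a default threshold when the file omits one.
            rows.append((category, name, spec.get("threshold", 0.7)))
    return rows

doc = '{"employment": {"works_for": {"description": "Employment relationship", "threshold": 0.75}}}'
rows = load_predicates(doc)
```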

Stage Combinations

Common stage combinations for different use cases:

| Use Case | Stages | Description |
|---|---|---|
| Fast extraction | {1, 2} | Basic triples with entity types |
| With qualifiers | {1, 2, 3} | Add roles, identifiers, canonical entities |
| Full resolution | {1, 2, 3, 4} | Canonical forms, FQNs, labels (no taxonomy) |
| Complete pipeline | {1, 2, 3, 4, 5} | All stages including taxonomy |
| Taxonomy only | {1, 2, 5} | Skip qualification and labeling |

Python
# Fast extraction for high-volume processing
fast_config = PipelineConfig.minimal()

# Full resolution for knowledge graph building
full_config = PipelineConfig.default()

# Custom: qualifiers without external APIs
internal_config = PipelineConfig(
    enabled_stages={1, 2, 3, 4, 5},
    disabled_plugins={"gleif_qualifier", "companies_house_qualifier", "sec_edgar_qualifier"},
)

Entity Types

Corp-extractor classifies extracted subjects and objects into 13 entity types based on common Named Entity Recognition (NER) standards. Understanding these types helps you filter and process extracted statements effectively.

Complete Entity Type Reference

| Type | Description | Examples |
|---|---|---|
| ORG | Organizations, companies, agencies | Apple Inc., United Nations, FBI |
| PERSON | Individual people | Tim Cook, Elon Musk, Jane Doe |
| GPE | Geopolitical entities (countries, cities, states) | United States, California, Paris |
| LOC | Non-GPE locations | Pacific Ocean, Mount Everest, Central Park |
| PRODUCT | Products and services | iPhone 15, Model S, Gmail |
| EVENT | Events and occurrences | CES 2024, Annual Meeting, World Cup |
| WORK_OF_ART | Creative works, documents, reports | Sustainability Report, Mona Lisa |
| LAW | Legal documents and regulations | GDPR, Clean Air Act, Section 230 |
| DATE | Dates and time periods | Q3 2024, January 15, 2030 |
| MONEY | Monetary values | $4.7 billion, 100 million euros |
| PERCENT | Percentages | 30%, 0.5%, 100% |
| QUANTITY | Quantities and measurements | 1,000 employees, 50 megawatts |
| UNKNOWN | Unclassified entities (fallback) | (varies) |

Accessing Entity Types in Code

Each extracted statement contains subject and object entities with a type attribute:

Python
from statement_extractor import extract_statements

result = extract_statements("Apple CEO Tim Cook announced the iPhone 15.")

for stmt in result:
    print(f"Subject: {stmt.subject.text} ({stmt.subject.type})")
    print(f"Object: {stmt.object.text} ({stmt.object.type})")

Output:

Text
Subject: Apple (ORG)
Object: Tim Cook (PERSON)
Subject: Tim Cook (PERSON)
Object: iPhone 15 (PRODUCT)

You can also import the EntityType enum for type checking and comparisons:

Python
from statement_extractor import extract_statements, EntityType

result = extract_statements("Microsoft acquired Activision for $69 billion.")

for stmt in result:
    if stmt.subject.type == EntityType.ORG:
        print(f"Organization found: {stmt.subject.text}")
    if stmt.object.type == EntityType.MONEY:
        print(f"Monetary value: {stmt.object.text}")

Filtering by Entity Type

A common use case is extracting only statements involving specific entity types. Here is how to filter statements by subject or object type:

Python
from statement_extractor import extract_statements, EntityType

text = """
Apple announced revenue of $94.8 billion for Q3 2024.
CEO Tim Cook presented at the company's Cupertino headquarters.
The new iPhone 16 features improved battery life of 22 hours.
"""

result = extract_statements(text)

# Filter for statements where subject is an organization
org_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.ORG
]

# Filter for statements involving monetary values
money_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.MONEY or stmt.object.type == EntityType.MONEY
]

# Filter for statements about people
person_statements = [
    stmt for stmt in result
    if stmt.subject.type == EntityType.PERSON or stmt.object.type == EntityType.PERSON
]

print(f"Found {len(org_statements)} statements from organizations")
print(f"Found {len(money_statements)} statements with monetary values")
print(f"Found {len(person_statements)} statements about people")

The UNKNOWN Type

The UNKNOWN entity type is used as a fallback when the model cannot confidently classify an entity into one of the 12 standard categories. This typically occurs with:

  • Specialized domain terms: Technical jargon, industry-specific terminology
  • Ambiguous entities: Terms that could fit multiple categories depending on context
  • Novel entities: New terms or concepts not well-represented in training data
  • Abstract concepts: Ideas or qualities that do not fit standard NER categories
Python
from statement_extractor import extract_statements, EntityType

result = extract_statements("The synergy initiative improved operational efficiency.")

for stmt in result:
    if stmt.subject.type == EntityType.UNKNOWN:
        print(f"Unclassified entity: {stmt.subject.text}")
        # Consider manual review or domain-specific handling

When you encounter UNKNOWN entities, consider:

  1. Manual review: Inspect the entity text to determine appropriate handling
  2. Domain mapping: Create application-specific mappings for recurring unknown entities
  3. Context analysis: Use surrounding statements to infer the entity's likely type
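
Option 2 (domain mapping) can be as simple as a lookup table maintained alongside your application. The entries below are illustrative; build yours from manual review of recurring UNKNOWN entities:

```python
# Application-specific fallback mapping for recurring UNKNOWN entities.
# These example entries are illustrative, not shipped with the library.
DOMAIN_TYPE_MAP = {
    "synergy initiative": "EVENT",
    "operational efficiency": "QUANTITY",
}

def resolve_unknown(entity_text: str, fallback: str = "UNKNOWN") -> str:
    """Map a recurring unknown entity to a known type, or keep the fallback."""
    return DOMAIN_TYPE_MAP.get(entity_text.lower().strip(), fallback)
```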

Entity Type Standards

Corp-extractor's entity types are based on widely-adopted NER standards, including:

  • OntoNotes 5.0: The primary source for entity type definitions
  • ACE (Automatic Content Extraction): Influences the GPE vs LOC distinction
  • CoNLL-2003: Foundational NER task categories

This alignment with established standards ensures compatibility with other NLP tools and facilitates integration into existing data pipelines.

Examples

This section provides practical examples demonstrating common use cases for the corp-extractor library.


Basic Extraction

Extract statements from text and format the output:

Python
from statement_extractor import extract_statements

text = """
Microsoft announced a partnership with OpenAI in 2019.
The deal was valued at $1 billion and aimed to develop
artificial general intelligence.
"""

result = extract_statements(text)

# Iterate over statements
for stmt in result:
    subject = f"{stmt.subject.text} ({stmt.subject.type})"
    object_ = f"{stmt.object.text} ({stmt.object.type})"
    print(f"{subject} -- {stmt.predicate} --> {object_}")

# Check confidence scores
for stmt in result:
    score = stmt.confidence_score or 0.0
    print(f"[{score:.2f}] {stmt}")

Output:

Text
Microsoft (ORG) -- partnered with --> OpenAI
Microsoft (ORG) -- announced --> partnership
OpenAI (ORG) -- partnership valued at --> $1 billion
Microsoft (ORG) -- aims to develop --> artificial general intelligence

Batch Processing

Use the StatementExtractor class for processing multiple texts efficiently. The model loads once and is reused for all extractions:

Python
from statement_extractor import StatementExtractor

# Initialize extractor with GPU
extractor = StatementExtractor(device="cuda")

texts = [
    "Apple acquired Beats Electronics for $3 billion.",
    "Google was founded by Larry Page and Sergey Brin in 1998.",
    "Amazon announced a new fulfillment center in Texas."
]

# Process multiple texts
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements in: {text[:40]}...")
    for stmt in result:
        print(f"  - {stmt}")
    print()

For CPU-only environments:

Python
# Force CPU usage
extractor = StatementExtractor(device="cpu")

Confidence Filtering

v0.2.0

Filter statements by confidence score to control precision vs recall:

Python
from statement_extractor import extract_statements, ScoringConfig, ExtractionOptions

text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."

# High precision mode - only high-confidence statements
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)

print("High-confidence statements:")
for stmt in result:
    print(f"  [{stmt.confidence_score:.2f}] {stmt}")

You can also filter after extraction for more control:

Python
# Extract all statements first
result = extract_statements(text)

# Apply custom thresholds
high_confidence = [s for s in result if (s.confidence_score or 0) >= 0.8]
medium_confidence = [s for s in result if 0.5 <= (s.confidence_score or 0) < 0.8]
low_confidence = [s for s in result if (s.confidence_score or 0) < 0.5]

print(f"High: {len(high_confidence)}, Medium: {len(medium_confidence)}, Low: {len(low_confidence)}")

Predicate Taxonomy

Map extracted predicates to a controlled vocabulary of canonical forms:

Python
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements

# Define your canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "announced",
    "invested_in", "partnered_with", "committed_to"
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)

text = "Google bought YouTube in 2006. Sequoia Capital backed the video platform."
result = extract_statements(text, options)

# View predicate normalization
for stmt in result:
    original = stmt.predicate
    canonical = stmt.canonical_predicate
    if canonical and canonical != original:
        print(f"'{original}' -> '{canonical}'")
    print(f"  {stmt.subject.text} -- {canonical or original} --> {stmt.object.text}")

Output:

Text
'bought' -> 'acquired'
  Google -- acquired --> YouTube
'backed' -> 'invested_in'
  Sequoia Capital -- invested_in --> YouTube

Load taxonomy from a file:

Python
# predicates.txt contains one predicate per line
taxonomy = PredicateTaxonomy.from_file("predicates.txt")

Export Formats

Export extraction results in multiple formats for integration with other systems:

Python
from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict
)

text = "Netflix acquired Spry Fox, a game development studio, in 2022."

# JSON output (default 2-space indent)
json_str = extract_statements_as_json(text)
print(json_str)

# Compact JSON
json_compact = extract_statements_as_json(text, indent=None)

# XML output (raw model format)
xml_str = extract_statements_as_xml(text)
print(xml_str)

# Dictionary output (for programmatic use)
data = extract_statements_as_dict(text)
for stmt in data["statements"]:
    print(f"{stmt['subject']['text']} -> {stmt['predicate']} -> {stmt['object']['text']}")

JSON output format:

JSON
{
  "statements": [
    {
      "subject": {"text": "Netflix", "type": "ORG"},
      "predicate": "acquired",
      "object": {"text": "Spry Fox", "type": "ORG"},
      "source_text": "Netflix acquired Spry Fox",
      "confidence_score": 0.94
    }
  ],
  "source_text": "Netflix acquired Spry Fox, a game development studio, in 2022."
}
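
Because the JSON output is plain data, downstream consumers only need the stdlib to work with it. For example, collapsing a payload in this shape back to triples (the payload below is a trimmed version of the format above):

```python
import json

payload = """
{
  "statements": [
    {
      "subject": {"text": "Netflix", "type": "ORG"},
      "predicate": "acquired",
      "object": {"text": "Spry Fox", "type": "ORG"},
      "confidence_score": 0.94
    }
  ]
}
"""

data = json.loads(payload)
triples = [
    (s["subject"]["text"], s["predicate"], s["object"]["text"])
    for s in data["statements"]
]
print(triples)  # [('Netflix', 'acquired', 'Spry Fox')]
```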

Disabling Embeddings

Skip embedding-based features for faster processing when you don't need predicate normalization or semantic deduplication:

Python
from statement_extractor import ExtractionOptions, extract_statements

# Disable embedding-based deduplication
options = ExtractionOptions(
    embedding_dedup=False,      # Use exact string matching for dedup
    predicate_taxonomy=None     # No predicate normalization
)

result = extract_statements(text, options)

When to disable embeddings:

| Scenario | Recommendation |
|---|---|
| Speed critical | Disable embeddings |
| No GPU available | Consider disabling for faster CPU processing |
| Need semantic dedup | Keep embeddings enabled |
| Using predicate taxonomy | Keep embeddings enabled |
| Simple text, few duplicates | Disable embeddings |

Custom Entity Canonicalization

Provide a custom function to normalize entity names:

Python
from statement_extractor import ExtractionOptions, extract_statements

# Define a canonicalization function
def canonicalize_entity(text: str) -> str:
    """Normalize entity names to canonical forms."""
    mappings = {
        "apple": "Apple Inc.",
        "apple inc": "Apple Inc.",
        "apple inc.": "Apple Inc.",
        "google": "Alphabet Inc.",
        "google llc": "Alphabet Inc.",
        "alphabet": "Alphabet Inc.",
        "msft": "Microsoft Corporation",
        "microsoft": "Microsoft Corporation",
    }
    return mappings.get(text.lower().strip(), text)

options = ExtractionOptions(entity_canonicalizer=canonicalize_entity)

text = "Apple and Google announced a partnership. Microsoft joined later."
result = extract_statements(text, options)

for stmt in result:
    # Entities are now canonicalized
    print(f"{stmt.subject.text} -- {stmt.predicate} --> {stmt.object.text}")

Output:

Text
Apple Inc. -- partnered with --> Alphabet Inc.
Microsoft Corporation -- joined --> partnership

Full Pipeline Example

Combining multiple features for production use:

Python
from statement_extractor import (
    StatementExtractor,
    ExtractionOptions,
    ScoringConfig,
    PredicateTaxonomy,
    PredicateComparisonConfig
)

# Configure scoring for high precision
scoring = ScoringConfig(
    min_confidence=0.6,
    quality_weight=1.0,
    redundancy_penalty=0.5
)

# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
    "acquired", "founded", "invested_in", "partnered_with",
    "announced", "launched", "hired", "appointed"
])

# Configure predicate matching
predicate_config = PredicateComparisonConfig(
    similarity_threshold=0.7,
    dedup_threshold=0.8
)

# Initialize extractor
extractor = StatementExtractor(
    device="cuda",
    predicate_taxonomy=taxonomy,
    predicate_config=predicate_config,
    scoring_config=scoring
)

# Configure extraction options
options = ExtractionOptions(
    num_beams=6,
    diversity_penalty=1.2,
    deduplicate=True,
    merge_beams=True
)

# Process text
text = """
Amazon Web Services announced a strategic partnership with Anthropic,
investing up to $4 billion in the AI safety startup. The deal, announced
in September 2023, makes AWS Anthropic's primary cloud provider.
"""

result = extractor.extract(text, options)

print(f"Extracted {len(result)} high-confidence statements:\n")
for stmt in result:
    canonical = stmt.canonical_predicate or stmt.predicate
    score = stmt.confidence_score or 0.0
    print(f"[{score:.2f}] {stmt.subject.text} ({stmt.subject.type})")
    print(f"       -- {canonical} -->")
    print(f"       {stmt.object.text} ({stmt.object.type})")
    print()

Output:

Text
Extracted 4 high-confidence statements:

[0.92] Amazon Web Services (ORG)
       -- partnered_with -->
       Anthropic (ORG)

[0.88] Amazon Web Services (ORG)
       -- invested_in -->
       Anthropic (ORG)

[0.85] Amazon Web Services (ORG)
       -- invested_in -->
       $4 billion (MONEY)

[0.78] AWS (ORG)
       -- is primary cloud provider for -->
       Anthropic (ORG)

Pipeline Examples

NEW in v0.5.0

Full Pipeline with Corporate Text

Process corporate announcements with full entity resolution:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

pipeline = ExtractionPipeline()

text = """
Amazon CEO Andy Jassy announced plans to hire 10,000 workers in the UK.
The expansion will focus on Amazon Web Services operations in London.
"""

ctx = pipeline.process(text)

print(f"Extracted {ctx.statement_count} statements\n")

for stmt in ctx.labeled_statements:
    # FQN includes role and organization
    print(f"Subject: {stmt.subject_fqn}")
    print(f"Predicate: {stmt.statement.predicate}")
    print(f"Object: {stmt.object_fqn}")

    # Access labels
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")

    # Access qualifiers
    subject_quals = stmt.subject_canonical.qualified_entity.qualifiers
    if subject_quals.role:
        print(f"  Role: {subject_quals.role}")
    if subject_quals.org:
        print(f"  Organization: {subject_quals.org}")

    print("-" * 40)

Output:

Text
Extracted 2 statements

Subject: Andy Jassy (CEO, Amazon)
Predicate: announced
Object: plans to hire 10,000 workers in the UK
  sentiment: positive
  Role: CEO
  Organization: Amazon
----------------------------------------
Subject: Amazon (AMZN)
Predicate: expanding operations in
Object: London (UK)
  sentiment: positive
----------------------------------------

Running Specific Stages

Skip qualification and canonicalization for faster processing:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Run only stages 1 and 2 (splitting + extraction)
config = PipelineConfig(enabled_stages={1, 2})
pipeline = ExtractionPipeline(config)

ctx = pipeline.process("Tim Cook is CEO of Apple Inc.")

# Access Stage 2 output (PipelineStatement)
for stmt in ctx.statements:
    print(f"{stmt.subject.text} ({stmt.subject.type.value})")
    print(f"  --[{stmt.predicate}]-->")
    print(f"  {stmt.object.text} ({stmt.object.type.value})")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Using Specific Plugins

Enable only internal plugins (no external API calls):

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Disable external API plugins
config = PipelineConfig(
    disabled_plugins={
        "gleif_qualifier",
        "companies_house_qualifier",
        "sec_edgar_qualifier",
    }
)

pipeline = ExtractionPipeline(config)
ctx = pipeline.process("OpenAI CEO Sam Altman announced GPT-5.")

# Will use person_qualifier (local LLM) but skip external lookups
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")

Custom Predicates File

Use a custom predicates JSON file instead of the 324 default predicates:

Python
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline

# Use custom predicates file
config = PipelineConfig(
    extractor_options={
        "predicates_file": "/path/to/my_predicates.json"
    }
)

pipeline = ExtractionPipeline(config)
ctx = pipeline.process("John works for Apple Inc.")

# All matching relations are returned
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")

Custom predicates file format:

JSON
{
  "employment": {
    "works_for": {
      "description": "Employment relationship where person works for organization",
      "threshold": 0.75
    },
    "manages": {
      "description": "Management relationship where person manages entity",
      "threshold": 0.7
    }
  },
  "ownership": {
    "owns": {
      "description": "Ownership relationship",
      "threshold": 0.7
    },
    "acquired": {
      "description": "Acquisition of one entity by another",
      "threshold": 0.75
    }
  }
}

Each category should have fewer than 25 predicates to stay within GLiNER2's training limit for optimal performance.
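Before pointing the pipeline at a custom file, it can help to sanity-check it against this format and the 25-per-category guideline. The standalone validator below is a sketch (not part of the library); the 0.5 default threshold is an assumption:

```python
def validate_predicates(data: dict, max_per_category: int = 25) -> list[str]:
    """Return a list of problems found in a predicates mapping."""
    problems = []
    for category, predicates in data.items():
        if len(predicates) >= max_per_category:
            problems.append(
                f"{category}: {len(predicates)} predicates (limit {max_per_category})"
            )
        for name, spec in predicates.items():
            if "description" not in spec:
                problems.append(f"{category}.{name}: missing description")
            threshold = spec.get("threshold", 0.5)
            if not 0.0 < threshold <= 1.0:
                problems.append(f"{category}.{name}: threshold {threshold} out of range")
    return problems

# Run it on a parsed file, e.g. validate_predicates(json.load(open(path)))
sample = {
    "employment": {
        "works_for": {"description": "Person works for organization", "threshold": 0.75},
        "manages": {"description": "Person manages entity", "threshold": 1.5},
    }
}
print(validate_predicates(sample))  # flags the out-of-range threshold
```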


Accessing Stage Outputs

Access results from each pipeline stage:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced Azure growth.")

# Stage 1: Raw triples
print("=== Stage 1: Raw Triples ===")
for triple in ctx.raw_triples:
    print(f"  {triple.subject_text} -> {triple.predicate_text} -> {triple.object_text}")

# Stage 2: Statements with types
print("\n=== Stage 2: Statements ===")
for stmt in ctx.statements:
    print(f"  {stmt.subject.text} ({stmt.subject.type.value}) -> {stmt.predicate}")

# Stage 3: Qualified entities
print("\n=== Stage 3: Qualified Entities ===")
for ref, qualified in ctx.qualified_entities.items():
    quals = qualified.qualifiers
    print(f"  {qualified.original_text}")
    if quals.role:
        print(f"    Role: {quals.role}")
    if quals.org:
        print(f"    Org: {quals.org}")
    for id_type, id_value in quals.identifiers.items():
        print(f"    {id_type}: {id_value}")

# Stage 4: Canonical entities
print("\n=== Stage 4: Canonical Entities ===")
for ref, canonical in ctx.canonical_entities.items():
    print(f"  {canonical.fqn}")
    if canonical.canonical_match:
        print(f"    Method: {canonical.canonical_match.match_method}")
        print(f"    Confidence: {canonical.canonical_match.match_confidence:.2f}")

# Stage 5: Labeled statements
print("\n=== Stage 5: Labeled Statements ===")
for stmt in ctx.labeled_statements:
    print(f"  {stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
    for label in stmt.labels:
        print(f"    {label.label_type}: {label.label_value}")

# Stage 6: Taxonomy results (multiple labels per statement)
print("\n=== Stage 6: Taxonomy Results ===")
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"  Statement: {source_text[:40]}...")
    for result in results:
        print(f"    {result.full_label} (confidence: {result.confidence:.2f})")

# Timings
print("\n=== Stage Timings ===")
for stage, duration in ctx.stage_timings.items():
    print(f"  {stage}: {duration:.3f}s")

Batch Pipeline Processing

Process multiple documents efficiently:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

# Use minimal stages for speed
config = PipelineConfig.minimal()  # Stages 1-2 only
pipeline = ExtractionPipeline(config)

documents = [
    "Apple announced a new MacBook Pro.",
    "Google acquired Fitbit for $2.1 billion.",
    "Tesla CEO Elon Musk unveiled the Cybertruck.",
]

all_statements = []

for doc in documents:
    ctx = pipeline.process(doc)
    for stmt in ctx.statements:
        all_statements.append({
            "subject": stmt.subject.text,
            "subject_type": stmt.subject.type.value,
            "predicate": stmt.predicate,
            "object": stmt.object.text,
            "object_type": stmt.object.type.value,
            "confidence": stmt.confidence_score,
            "source": doc,
        })

print(f"Extracted {len(all_statements)} statements from {len(documents)} documents")
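Because each statement is already a plain dict, the batch output serializes directly to JSON Lines, one statement per line, which downstream tools can stream without loading the whole file. A small helper (the file name is illustrative):

```python
import json

def write_jsonl(records: list[dict], path: str) -> int:
    """Write one JSON object per line; return the number of records written."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

# write_jsonl(all_statements, "statements.jsonl")
```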

Taxonomy Classification

Stage 6

Classify statements against large taxonomies. Multiple labels may match a single statement above the confidence threshold:

Python
from statement_extractor.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()

text = """
Apple announced a commitment to carbon neutrality by 2030.
The company also reported reducing packaging waste by 75%.
"""

ctx = pipeline.process(text)

# Access taxonomy classifications (multiple labels per statement)
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
    print(f"Statement: {source_text[:50]}...")
    print(f"  Taxonomy: {taxonomy_name}")
    print(f"  Labels:")
    for result in results:
        print(f"    - {result.full_label} (confidence: {result.confidence:.2f})")
    print()

Output:

Text
Statement: Apple announced a commitment to carbon neutrality...
  Taxonomy: esg_topics
  Labels:
    - environment:carbon_emissions (confidence: 0.87)
    - environment_benefit:emissions_reduction (confidence: 0.72)
    - governance:sustainability_commitments (confidence: 0.45)

Statement: The company also reported reducing packaging waste...
  Taxonomy: esg_topics
  Labels:
    - environment:waste_management (confidence: 0.92)
    - environment_benefit:waste_reduction (confidence: 0.85)

Pipeline with Error Handling

Handle errors and warnings gracefully:

Python
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig

config = PipelineConfig(fail_fast=False)  # Continue on errors
pipeline = ExtractionPipeline(config)

ctx = pipeline.process("Some text that might cause issues...")

# Check for errors
if ctx.has_errors:
    print("Errors occurred:")
    for error in ctx.processing_errors:
        print(f"  - {error}")

# Check for warnings
if ctx.processing_warnings:
    print("Warnings:")
    for warning in ctx.processing_warnings:
        print(f"  - {warning}")

# Process results that succeeded
print(f"\nSuccessfully extracted {ctx.statement_count} statements")

Deployment

Local Inference

Hardware Requirements:

Resource   Minimum     Notes
CPU-only   ~4GB RAM    ~30s per extraction
GPU        ~2GB VRAM   ~2s per extraction
Disk       ~1.5GB      Model download size

Setup steps:

Bash
# Install the library
pip install corp-extractor[embeddings]

# For GPU support, install PyTorch with CUDA first
pip install torch --index-url https://download.pytorch.org/whl/cu121

Running locally:

Python
from statement_extractor import StatementExtractor

# Auto-detect GPU or fall back to CPU
extractor = StatementExtractor()

# Or explicitly set device
extractor = StatementExtractor(device="cuda")  # GPU
extractor = StatementExtractor(device="cpu")   # CPU

The model uses bfloat16 precision on GPU for faster inference and lower memory usage, and float32 on CPU.
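The auto-detection described above can be approximated with the logic below: prefer CUDA, then Apple's MPS backend, then CPU. This is a sketch of the selection order, not the library's actual implementation:

```python
def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

# StatementExtractor(device=pick_device())
```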

RunPod Serverless

Why RunPod:

  • Pay-per-use: ~$0.0002/sec on average
  • Scales to zero: No cost when idle
  • No infrastructure: Managed GPU containers

Setup steps:

  1. Clone the repository and build the Docker image:
Bash
cd runpod
docker build --platform linux/amd64 -t your-username/statement-extractor .
  2. Push to Docker Hub:
Bash
docker push your-username/statement-extractor
  3. Create a RunPod serverless endpoint:

    • Go to RunPod Console
    • Create new endpoint with your Docker image
    • Configure GPU type (RTX 3090 recommended)
    • Set Active Workers: 0, Max Workers: 1-3
  4. Call the API:

Bash
curl -X POST https://api.runpod.ai/v2/YOUR_ENDPOINT/runsync \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"text": "<page>Your text here</page>"}}'

Pricing:

GPU Type   Cost             Notes
RTX 3090   ~$0.00031/sec    Recommended
Idle       $0               Scales to zero

Typical extraction costs less than $0.001 per request at ~2s processing time.
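The same runsync call can be made from Python with only the standard library. The endpoint ID and API key below are placeholders, matching the curl example above:

```python
import json
import urllib.request

def build_request(endpoint_id: str, api_key: str, text: str) -> urllib.request.Request:
    """Build a runsync request equivalent to the curl example."""
    payload = {"input": {"text": f"<page>{text}</page>"}}
    return urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# resp = urllib.request.urlopen(build_request("YOUR_ENDPOINT", "YOUR_API_KEY", "Your text here"))
# result = json.loads(resp.read())
```

Prefer the `/run` endpoint over `/runsync` for long documents if you want to poll asynchronously instead of holding the connection open for the full extraction.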