Statement Extractor Documentation
Extract structured subject-predicate-object statements from unstructured text using T5-Gemma 2 and GLiNER2 models with document processing, entity resolution, and taxonomy classification.
Getting Started
Installation
pip install corp-extractor
The GLiNER2 model (205M params) is downloaded automatically on first use.
GPU support: Install PyTorch with CUDA before installing corp-extractor. The library auto-detects GPU availability at runtime.
# Example for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor
Apple Silicon (M1/M2/M3): MPS acceleration is automatically detected. Just install normally:
pip install corp-extractor
Quick Start
Extract structured statements from text in 5 lines:
from statement_extractor import extract_statements
text = "Apple Inc. acquired Beats Electronics for $3 billion in May 2014."
statements = extract_statements(text)
for stmt in statements:
    print(f"{stmt.subject.text} ({stmt.subject.type}) -> {stmt.predicate} -> {stmt.object.text}")
Output:
Apple Inc. (ORG) -> acquired -> Beats Electronics
Apple Inc. (ORG) -> paid -> $3 billion
Beats Electronics (ORG) -> acquisition price -> $3 billion
Each statement includes confidence scores and extraction method:
for stmt in statements:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  method: {stmt.extraction_method}")  # hybrid, gliner, or model
    print(f"  confidence: {stmt.confidence_score:.2f}")
v0.5.0 features: Plugin-based pipeline architecture with entity qualification, labeling, and taxonomy classification. GLiNER2 entity recognition and entity-based scoring.
v0.6.0 features: Entity embedding database with ~100K+ SEC filers, ~3M GLEIF records, ~5M UK organizations for fast entity qualification.
v0.7.0 features: Document processing for files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.
v0.8.0 features: Merged qualification and canonicalization into single stage. EntityType classification for organizations (business, nonprofit, government, etc.).
v0.9.0 features: Person database with Wikidata import for notable people (executives, politicians, athletes, artists). PersonQualifier for canonical person identification with role/org context.
v0.9.1 features: Wikidata dump importer (import-wikidata-dump) for large imports without SPARQL timeouts. Uses aria2c for fast parallel downloads. Extracts people via occupation (P106) and position dates (P580/P582).
v0.9.2 features: Organization canonicalization links equivalent records across sources (GLEIF, SEC, Companies House, Wikidata). People canonicalization with priority-based deduplication. Expanded PersonType classification (executive, politician, government, military, legal, etc.).
v0.9.3 features: SEC Form 4 officers import (import-sec-officers) and Companies House officers import (import-ch-officers). People now sourced from Wikidata, SEC Edgar, and Companies House with cross-source canonicalization.
v0.9.4 features: Database v2 schema with normalized INTEGER foreign keys and enum lookup tables. Scalar (int8) embeddings for 75% storage reduction with ~92% recall. New locations import for countries/states/cities with hierarchy. Migration commands: db migrate-v2, db backfill-scalar. New search commands: db search-roles, db search-locations.
Pipeline Quick Start (v0.5.0)
For full entity resolution with qualification, canonicalization, labeling, and taxonomy classification:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")
# Access fully qualified names (e.g., "Andy Jassy (CEO, Amazon)")
for stmt in ctx.labeled_statements:
    print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
    # Access labels (sentiment, etc.)
    for label in stmt.labels:
        print(f"  {label.label_type}: {label.label_value}")
CLI usage:
# Full pipeline
corp-extractor pipeline "Amazon CEO Andy Jassy announced..."
# Run specific stages only
corp-extractor pipeline -f article.txt --stages 1-3
# Process documents and URLs (v0.7.0)
corp-extractor document process article.txt
corp-extractor document process https://example.com/article
corp-extractor document process report.pdf --use-ocr
Using Predicate Taxonomies
Normalize extracted predicates to canonical forms using embedding similarity:
from statement_extractor import extract_statements, PredicateTaxonomy, ExtractionOptions
# Define your domain's canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in",
    "invested_in", "partnered_with", "announced",
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
text = "Google bought YouTube for $1.65 billion in 2006."
result = extract_statements(text, options)
for stmt in result:
    print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
    # Output: bought -> acquired
This maps synonyms like "bought", "purchased", and "acquired" to a single canonical form, making downstream analysis easier.
Requirements
| Dependency | Version | Notes |
|---|---|---|
| Python | 3.10+ | Required |
| PyTorch | 2.0+ | Required |
| transformers | 5.0+ | Required for T5-Gemma2 support |
| Pydantic | 2.0+ | Required |
| sentence-transformers | 2.2+ | Required, for embedding features |
| GLiNER2 | latest | Required, for entity recognition and relation extraction (model auto-downloads) |
Hardware requirements:
- NVIDIA GPU: RTX 4090+ recommended for production. Uses bfloat16 precision for efficiency.
- Apple Silicon: M1/M2/M3 with 16GB+ RAM. MPS acceleration auto-detected.
- CPU: Functional but slower. Use for development or low-volume processing.
- Disk: ~100GB for all models and entity database (10M+ organizations, 40M+ people).
The library runs entirely locally with no external API dependencies. Models use bfloat16 on CUDA and float32 on MPS/CPU.
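The device/precision policy above can be sketched as a small selection function. This is an illustrative sketch, not the library's actual code; the real auto-detection happens at runtime via PyTorch.

```python
# Sketch of the documented policy: bfloat16 on CUDA, float32 on MPS and CPU.
# In practice the availability flags would come from torch.cuda.is_available()
# and torch.backends.mps.is_available().
def pick_device_and_dtype(cuda_available: bool, mps_available: bool) -> tuple[str, str]:
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"

print(pick_device_and_dtype(cuda_available=False, mps_available=True))  # ('mps', 'float32')
```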
Command Line Interface
The corp-extractor CLI provides commands for extraction, document processing, and database management.
Commands Overview
| Command | Description | Use Case |
|---|---|---|
| split | Simple extraction (Stage 1 only) | Fast extraction, basic triples |
| pipeline | Full 5-stage pipeline | Entity resolution, labeling, taxonomy |
| document | Document processing | Files, URLs, PDFs with chunking and deduplication |
| db | Database management | Import, search, upload/download entity database |
| plugins | Plugin management | List and inspect available plugins |
Installation
For best results, install globally:
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"
# Using pipx
pipx install "corp-extractor[embeddings]"
# Using pip
pip install "corp-extractor[embeddings]"
Quick Run with uvx
Run directly without installing using uv:
uvx corp-extractor split "Apple announced a new iPhone."
Note: First run downloads the model (~1.5GB), which may take a few minutes.
Split Command
The split command extracts sub-statements using the T5-Gemma model. It's fast and simple—use pipeline for full entity resolution.
# Extract from text argument
corp-extractor split "Apple Inc. announced the iPhone 15."
# Extract from file
corp-extractor split -f article.txt
# Pipe from stdin
cat article.txt | corp-extractor split -
# Output as JSON
corp-extractor split "Tim Cook is CEO of Apple." --json
# Output as XML
corp-extractor split "Tim Cook is CEO of Apple." --xml
# Verbose output with confidence scores
corp-extractor split -f article.txt --verbose
# Use more beams for better quality
corp-extractor split -f article.txt --beams 8
Split Options
| Option | Description | Default |
|---|---|---|
| -f, --file PATH | Read input from file | — |
| -o, --output | Output format: table, json, xml | table |
| --json / --xml | Output format shortcuts | — |
| -b, --beams | Number of beams for diverse beam search | 4 |
| --diversity | Diversity penalty for beam search | 1.0 |
| --no-gliner | Disable GLiNER2 extraction | — |
| --predicates | Comma-separated predicates for relation extraction | — |
| --predicates-file | Path to custom predicates JSON file | — |
| --device | Device: auto, cuda, mps, cpu | auto |
| -v, --verbose | Show confidence scores and metadata | — |
Pipeline Command
NEW in v0.5.0: The pipeline command runs the full 5-stage extraction pipeline for comprehensive entity resolution and taxonomy classification.
# Run all 5 stages
corp-extractor pipeline "Amazon CEO Andy Jassy announced plans to hire workers."
# Run from file
corp-extractor pipeline -f article.txt
# Run specific stages
corp-extractor pipeline "..." --stages 1-3
corp-extractor pipeline "..." --stages 1,2,5
# Skip specific stages
corp-extractor pipeline "..." --skip-stages 4,5
# Enable specific plugins only
corp-extractor pipeline "..." --plugins gleif,companies_house
# Disable specific plugins
corp-extractor pipeline "..." --disable-plugins sec_edgar
# Output formats
corp-extractor pipeline "..." -o json
corp-extractor pipeline "..." -o yaml
corp-extractor pipeline "..." -o triples
Pipeline Stages
| Stage | Name | Description |
|---|---|---|
| 1 | Splitting | Text → Raw triples (T5-Gemma) |
| 2 | Extraction | Raw triples → Typed statements (GLiNER2) |
| 3 | Entity Qualification | Add identifiers (LEI, CIK, etc.) and canonical names via embedding DB |
| 4 | Labeling | Apply sentiment, relation type, confidence |
| 5 | Taxonomy | Classify against large taxonomies (MNLI/embeddings) |
Pipeline Options
| Option | Description | Example |
|---|---|---|
| --stages | Stages to run | 1-3 or 1,2,5 |
| --skip-stages | Stages to skip | 4,5 |
| --plugins | Enable only these plugins | gleif,person |
| --disable-plugins | Disable these plugins | sec_edgar |
| --predicates-file | Custom predicates JSON file for GLiNER2 | custom.json |
| -o, --output | Output format | table, json, yaml, triples |
Plugins Command
NEW in v0.5.0: The plugins command lists and inspects available pipeline plugins.
# List all plugins
corp-extractor plugins list
# List plugins for a specific stage
corp-extractor plugins list --stage 3
# Get details about a plugin
corp-extractor plugins info gleif_qualifier
corp-extractor plugins info person_qualifier
Example output:
Stage 1: Splitting
----------------------------------------
t5_gemma_splitter [priority: 100]
Stage 2: Extraction
----------------------------------------
gliner2_extractor [priority: 100]
Stage 3: Entity Qualification
----------------------------------------
person_qualifier (PERSON) [priority: 100]
embedding_company_qualifier (ORG) [priority: 5]
Stage 4: Labeling
----------------------------------------
sentiment_labeler [priority: 100]
confidence_labeler [priority: 100]
relation_type_labeler [priority: 100]
Stage 5: Taxonomy
----------------------------------------
embedding_taxonomy_classifier [priority: 100]
Output Formats
Table output (default):
Extracted 2 statement(s):
--------------------------------------------------------------------------------
1. Andy Jassy (CEO, Amazon)
--[announced]-->
plans to hire workers
--------------------------------------------------------------------------------
JSON output:
{
"statement_count": 2,
"labeled_statements": [
{
"subject": {"text": "Andy Jassy", "type": "PERSON", "fqn": "Andy Jassy (CEO, Amazon)"},
"predicate": "announced",
"object": {"text": "plans to hire workers", "type": "EVENT"},
"labels": {"sentiment": "positive"}
}
]
}
Triples output:
Andy Jassy (CEO, Amazon) announced plans to hire workers
Amazon has CEO Andy Jassy (CEO, Amazon)
Shell Integration
Processing multiple files:
# Process all .txt files
for f in *.txt; do
    echo "=== $f ==="
    corp-extractor pipeline -f "$f" -o json > "${f%.txt}.json"
done
Combining with jq:
# Extract just predicates
corp-extractor split "Your text" --json | jq '.statements[].predicate'
# Filter high-confidence statements
corp-extractor split -f article.txt --json | jq '.statements[] | select(.confidence_score > 0.8)'
# Get FQNs from pipeline
corp-extractor pipeline "Your text" -o json | jq '.labeled_statements[].subject.fqn'
Document Command
NEW in v0.7.0: The document command processes files, URLs, and PDFs with automatic chunking and deduplication.
# Process local files
corp-extractor document process article.txt
corp-extractor document process report.txt --title "Annual Report" --year 2024
# Process URLs (web pages and PDFs)
corp-extractor document process https://example.com/article
corp-extractor document process https://example.com/report.pdf --use-ocr
# Configure chunking
corp-extractor document process article.txt --max-tokens 500 --overlap 50
# Preview chunking without extraction
corp-extractor document chunk article.txt --max-tokens 500
# Output formats
corp-extractor document process article.txt -o json
corp-extractor document process article.txt -o triples
Document Options
| Option | Description | Default |
|---|---|---|
| --title | Document title for citations | Filename |
| --max-tokens | Target tokens per chunk | 1000 |
| --overlap | Token overlap between chunks | 100 |
| --use-ocr | Force OCR for PDF parsing | — |
| --no-summary | Skip document summarization | — |
| --no-dedup | Skip cross-chunk deduplication | — |
| --stages | Pipeline stages to run | 1-5 |
Database Commands
UPDATED in v0.9.4: The db command group manages the entity embedding database used for organization, person, role, and location qualification.
# Show database status
corp-extractor db status
# Search for an organization
corp-extractor db search "Microsoft"
corp-extractor db search "Barclays" --source companies_house
# Search for a person (v0.9.0)
corp-extractor db search-people "Tim Cook"
corp-extractor db search-people "Elon Musk" --top-k 5
# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"
corp-extractor db search-roles "Chief Financial Officer"
# Search for locations (v0.9.4)
corp-extractor db search-locations "California"
corp-extractor db search-locations "Germany" --type country
# Import organizations from data sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download # Bulk data (~100K+ filers)
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 50000 # SPARQL-based
# Import notable people (v0.9.0)
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --all --limit 10000 # All person types
# Import from Wikidata dump (v0.9.1) - avoids SPARQL timeouts
corp-extractor db import-wikidata-dump --download --limit 50000
corp-extractor db import-wikidata-dump --dump /path/to/dump.bz2 --people --no-orgs
# Download/upload from HuggingFace Hub
corp-extractor db download # Lite version (default)
corp-extractor db download --full # Full version with metadata
corp-extractor db upload # Upload with all variants
# Migrate from old schema (companies.db → entities.db)
corp-extractor db migrate companies.db --rename-file
# Migrate to v2 normalized schema (v0.9.4)
corp-extractor db migrate-v2 entities.db entities-v2.db
corp-extractor db migrate-v2 entities.db entities-v2.db --resume # Resume interrupted
# Generate int8 scalar embeddings (v0.9.4) - 75% smaller
corp-extractor db backfill-scalar
corp-extractor db backfill-scalar --skip-generate # Only quantize existing
# Local database management
corp-extractor db create-lite entities.db # Create lite version
corp-extractor db compress entities.db    # Compress with gzip
Organization Data Sources
| Source | Command | Records | Identifier |
|---|---|---|---|
| GLEIF | import-gleif --download | ~3.2M | LEI |
| SEC Edgar | import-sec --download | ~100K+ | CIK |
| Companies House | import-companies-house --download | ~5M | Company Number |
| Wikidata (SPARQL) | import-wikidata | Variable | QID |
| Wikidata (Dump) | import-wikidata-dump --download | All with enwiki | QID |
Person Data Sources v0.9.0
| Type | Command | Description |
|---|---|---|
| Executives | import-people --type executive | CEOs, CFOs, board members |
| Politicians | import-people --type politician | Elected officials, diplomats |
| Athletes | import-people --type athlete | Sports figures, coaches |
| Artists | import-people --type artist | Actors, musicians, directors |
| All Types | import-people --all | Run all person type queries |
Person Import Options
| Option | Description |
|---|---|
| --skip-existing | Skip existing records instead of updating them |
| --enrich-dates | Query individual records for start/end dates (slower) |
Wikidata Dump Import v0.9.1
For large imports that avoid SPARQL timeouts, use the Wikidata JSON dump:
# Download and import (~100GB dump file)
corp-extractor db import-wikidata-dump --download --limit 50000
# Import only people
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000
# Import only organizations
corp-extractor db import-wikidata-dump --download --orgs --no-people --limit 100000
# Import only locations (v0.9.4)
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
# Use existing dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2
Fast download with aria2c: Install aria2c for 10-20x faster downloads:
brew install aria2 # macOS
apt install aria2    # Ubuntu/Debian
| Option | Description |
|---|---|
| --download | Download the Wikidata dump (~100GB) |
| --dump PATH | Use existing dump file (.bz2 or .gz) |
| --people/--no-people | Import people (default: yes) |
| --orgs/--no-orgs | Import organizations (default: yes) |
| --locations/--no-locations | Import locations (default: no) v0.9.4 |
| --no-aria2 | Don't use aria2c even if available |
Advantages over SPARQL:
- No timeouts (processes locally)
- Complete coverage (all notable people/orgs with English Wikipedia)
- Captures people via occupation (P106) even if position type is generic
- Extracts role dates from position qualifiers (P580/P582)
- Imports locations with hierarchical parent relationships (v0.9.4)
Download location: ~/.cache/corp-extractor/wikidata-latest-all.json.bz2
Note: Use -v (verbose) to see detailed logs of skipped records during import:
corp-extractor db import-people --type executive -v
People records include from_date and to_date for role tenure. The same person can have multiple records with different role/org combinations (unique on source_id + role + org).
Organizations discovered during people import (employers, affiliated orgs) are automatically inserted into the organizations table if they don't already exist. This creates foreign key links via known_for_org_id.
Database Variants
| File | Description | Use Case |
|---|---|---|
| entities-lite.db | Core fields + embeddings only | Default download, fast searches |
| entities.db | Full database with source metadata | When you need complete record data |
| *.db.gz | Gzip compressed versions | Faster downloads, auto-decompressed |
Database Options
| Option | Description | Default |
|---|---|---|
| --db PATH | Database file path | ~/.cache/corp-extractor/entities.db |
| --limit N | Limit number of records | — |
| --download | Download source data automatically | — |
| --full | Download full version instead of lite | — |
| --no-lite | Skip creating lite version on upload | — |
| --no-compress | Skip creating compressed versions | — |
See COMPANY_DB.md for complete build and publish instructions.
Core Concepts
Corp-extractor is designed to analyze complex text and extract relationship information about people and organizations. It runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies, using multiple fine-tuned small models to transform unstructured text into structured knowledge.
Statement Extraction
Statement extraction is the process of converting unstructured natural language text into structured subject-predicate-object triples. Each triple represents a discrete fact or relationship extracted from the source text.
For example, given the text:
"Apple announced a new iPhone at their Cupertino headquarters."
The extractor produces triples like:
| Subject | Predicate | Object |
|---|---|---|
| Apple (ORG) | announced | iPhone (PRODUCT) |
| Apple (ORG) | has headquarters in | Cupertino (GPE) |
The T5-Gemma 2 Model
Corp-extractor uses a fine-tuned T5-Gemma 2 model with 540 million parameters. This encoder-decoder architecture excels at sequence-to-sequence tasks, making it well-suited for transforming text into structured XML output.
The model processes input text wrapped in <page> tags and generates XML containing <stmt> elements with subject, predicate, object, and source text spans.
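Downstream parsing of that XML can be pictured with a toy example. The exact element layout inside each <stmt> is an assumption for illustration; the text above only says statements carry subject, predicate, object, and source spans.

```python
import xml.etree.ElementTree as ET

# Toy model-style XML output. The element names inside <stmt> are
# hypothetical; only the overall <stmt>-per-statement shape is documented.
xml_output = """
<statements>
  <stmt>
    <subject>Apple Inc.</subject>
    <predicate>acquired</predicate>
    <object>Beats Electronics</object>
    <source>Apple Inc. acquired Beats Electronics for $3 billion in May 2014.</source>
  </stmt>
</statements>
"""

root = ET.fromstring(xml_output)
triples = [
    (s.findtext("subject"), s.findtext("predicate"), s.findtext("object"))
    for s in root.iter("stmt")
]
print(triples)  # [('Apple Inc.', 'acquired', 'Beats Electronics')]
```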
Entity Type Recognition
Each extracted subject and object is classified into one of 12 entity types (plus UNKNOWN):
| Type | Description | Example |
|---|---|---|
| ORG | Organizations, companies | Apple, United Nations |
| PERSON | Named individuals | Tim Cook, Marie Curie |
| GPE | Geopolitical entities | France, New York City |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| PRODUCT | Products, artifacts | iPhone, Model S |
| EVENT | Named events | World War II, Olympics |
| WORK_OF_ART | Creative works | Mona Lisa, Hamlet |
| LAW | Legal documents | GDPR, First Amendment |
| DATE | Temporal expressions | January 2024, last Tuesday |
| MONEY | Monetary values | $50 million, €100 |
| PERCENT | Percentages | 15%, half |
| QUANTITY | Measurements | 500 kilometers, 3 tons |
| UNKNOWN | Unclassified entities | — |
Diverse Beam Search
Corp-extractor uses Diverse Beam Search (Vijayakumar et al., 2016) to generate multiple candidate extractions from the same input text.
Why Diverse Beam Search?
Standard beam search tends to produce similar outputs—slight variations of the same interpretation. Diverse Beam Search introduces a diversity penalty that encourages the model to explore fundamentally different extractions.
This is particularly valuable for statement extraction because:
- A single sentence may contain multiple valid interpretations
- Different phrasings can capture different aspects of the same fact
- Merging diverse outputs produces more comprehensive coverage
How It Works
The model generates multiple beams in parallel, each representing a different extraction path. A diversity penalty is applied during generation to prevent beams from converging on identical outputs.
Default Parameters
| Parameter | Default | Description |
|---|---|---|
| num_beams | 4 | Number of parallel beams to generate |
| diversity_penalty | 1.0 | Strength of diversity encouragement (higher = more diverse) |
from statement_extractor import extract_statements
# Use default beam search settings
result = extract_statements("Apple announced a new iPhone.")
# Customize beam search
result = extract_statements(
    "Apple announced a new iPhone.",
    num_beams=6,
    diversity_penalty=1.5
)
Quality Scoring
UPDATED in v0.4.0: Each extracted statement receives a confidence score between 0 and 1, measuring extraction quality through a weighted combination of semantic and entity-based signals.
Confidence Score
The score combines three components using GLiNER2 for entity recognition:
| Component | Weight | Description |
|---|---|---|
| Semantic similarity | 50% | Cosine similarity between source text and reassembled triple |
| Subject entity score | 25% | How entity-like the subject is (via GLiNER2 NER) |
| Object entity score | 25% | How entity-like the object is (via GLiNER2 NER) |
Higher scores indicate the triple is semantically grounded and contains well-formed entities. Lower scores may suggest hallucination or poorly extracted entities.
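The weighting in the table above can be sketched as a simple weighted sum. The component values here are illustrative numbers, not real model outputs, and the helper name is hypothetical.

```python
# Sketch of the documented weighting: 50% semantic similarity,
# 25% subject entity score, 25% object entity score.
WEIGHTS = {"semantic": 0.50, "subject_entity": 0.25, "object_entity": 0.25}

def combine_confidence(semantic: float, subject_entity: float, object_entity: float) -> float:
    """Weighted sum of the three component scores, clamped to [0, 1]."""
    score = (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["subject_entity"] * subject_entity
        + WEIGHTS["object_entity"] * object_entity
    )
    return max(0.0, min(1.0, score))

confidence = combine_confidence(semantic=0.9, subject_entity=0.8, object_entity=0.6)
print(round(confidence, 3))  # 0.5*0.9 + 0.25*0.8 + 0.25*0.6 = 0.8
```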
Confidence Filtering
Use the min_confidence parameter to filter out low-quality extractions:
from statement_extractor import extract_statements
# Only return statements with confidence >= 0.7
result = extract_statements(
    "Apple CEO Tim Cook announced the iPhone 15.",
    min_confidence=0.7
)
# Access individual scores
for stmt in result:
    print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
Beam Merging vs Best Beam Selection
Corp-extractor supports two strategies for combining beam outputs:
| Strategy | Description | Use Case |
|---|---|---|
| merge (default) | Combine unique statements from all beams, deduplicated by content | Maximum coverage |
| best | Return only statements from the highest-scoring beam | Higher precision |
# Merge all beams (default)
result = extract_statements(text, beam_strategy="merge")
# Use only the best beam
result = extract_statements(text, beam_strategy="best")
When using merge, statements are deduplicated based on normalized subject-predicate-object content, and the highest confidence score is retained for duplicates.
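The merge behavior can be sketched as keying statements by normalized subject-predicate-object content and keeping the highest confidence per key. The tuples below are illustrative; the library operates on statement objects, not bare tuples.

```python
# Sketch of merge-style deduplication across beams: normalize the
# (subject, predicate, object) key, keep the highest-confidence duplicate.
def merge_beams(statements):
    merged = {}
    for subj, pred, obj, conf in statements:
        key = (subj.strip().lower(), pred.strip().lower(), obj.strip().lower())
        if key not in merged or conf > merged[key][3]:
            merged[key] = (subj, pred, obj, conf)
    return list(merged.values())

beams = [
    ("Apple Inc.", "acquired", "Beats Electronics", 0.91),
    ("Apple Inc.", "Acquired", "Beats Electronics ", 0.84),  # duplicate after normalization
    ("Apple Inc.", "paid", "$3 billion", 0.77),
]
print(merge_beams(beams))  # two unique statements; the duplicate keeps conf 0.91
```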
GLiNER2 Integration
NEW in v0.4.0: Version 0.4.0 introduces GLiNER2 (205M parameters) for entity recognition and relation extraction, replacing spaCy.
Why GLiNER2?
GLiNER2 is a unified model that handles:
- Named Entity Recognition - identifying entities with types
- Relation Extraction - using 324 default predicates across 21 categories
- Confidence Scoring - real confidence values via include_confidence=True
- Entity Scoring - measuring how "entity-like" subjects and objects are
Default Predicates
GLiNER2 uses 324 predicates organized into 21 categories loaded from default_predicates.json. Categories include:
- ownership_control - acquires, owns, has_subsidiary, etc.
- employment_leadership - employs, is_ceo_of, manages, etc.
- funding_investment - funds, invests_in, sponsors, etc.
- supply_chain - supplies, manufactures, distributes_for, etc.
- legal_regulatory - regulates, violates, complies_with, etc.
Each predicate includes a description for semantic matching and a confidence threshold.
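The exact schema of the predicates file is not shown here; a plausible structure, based on the category, description, and threshold fields mentioned above, might look like the following. Treat the field names as assumptions, not the real default_predicates.json schema.

```python
import json
import os
import tempfile

# Hypothetical predicates-file layout: categories mapping to predicate entries,
# each carrying a description (for semantic matching) and a confidence
# threshold. The real file's schema may differ.
custom_predicates = {
    "ownership_control": [
        {"predicate": "acquires", "description": "One company buys another", "threshold": 0.5},
    ],
    "employment_leadership": [
        {"predicate": "is_ceo_of", "description": "Person leads a company as CEO", "threshold": 0.5},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "custom_predicates.json")
with open(path, "w") as f:
    json.dump(custom_predicates, f, indent=2)
```

A file like this would then be passed via the predicates_file option or the --predicates-file CLI flag described below.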
All Matches Returned
GLiNER2 now returns all matching relations, not just the best one. This allows downstream filtering and selection based on your use case:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans to hire workers.")
# All matching relations are returned, sorted by confidence
for stmt in ctx.statements:
    print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Category: {stmt.predicate_category}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
Custom Predicates
You can provide custom predicates via a JSON file:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
config = PipelineConfig(
    extractor_options={"predicates_file": "/path/to/custom_predicates.json"}
)
pipeline = ExtractionPipeline(config)
Or via CLI:
corp-extractor pipeline "..." --predicates-file custom_predicates.json
Entity-Based Scoring
Confidence scores come directly from GLiNER2 with include_confidence=True:
| Source | Description |
|---|---|
| Relation confidence | GLiNER2 confidence in the relation match |
| Entity confidence | GLiNER2 confidence in entity recognition |
Pipeline Architecture
Updated in v0.8.0: Version 0.8.0 uses a 5-stage plugin-based pipeline for comprehensive entity resolution, statement enrichment, and taxonomy classification. Qualification and canonicalization have been merged into a single stage using the embedding database.
The 5 Stages
| Stage | Name | Input | Output | Purpose |
|---|---|---|---|---|
| 1 | Splitting | Text | RawTriple[] | Extract raw subject-predicate-object triples using T5-Gemma2 |
| 2 | Extraction | RawTriple[] | PipelineStatement[] | Refine entities with type recognition using GLiNER2 |
| 3 | Entity Qualification | Entities | CanonicalEntity[] | Add identifiers (LEI, CIK, etc.) and resolve canonical names via embedding database |
| 4 | Labeling | Statements | LabeledStatement[] | Apply sentiment, relation type, confidence labels |
| 5 | Taxonomy | Statements | TaxonomyResult[] | Classify against large taxonomies (ESG topics, etc.) |
Stage 1: Splitting
The splitting stage transforms raw text into RawTriple objects using the T5-Gemma2 model. Each triple contains:
- subject_text: The raw subject text
- predicate_text: The raw predicate/relationship
- object_text: The raw object text
- source_sentence: The sentence this triple was extracted from
- confidence: Extraction confidence score
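The shape of a Stage 1 record can be pictured as a dataclass. This is a sketch mirroring the fields listed above, not statement_extractor's actual RawTriple class.

```python
from dataclasses import dataclass

# Illustrative shape of a Stage 1 output record; field names follow the
# documented list, but the real class may differ.
@dataclass
class RawTriple:
    subject_text: str
    predicate_text: str
    object_text: str
    source_sentence: str
    confidence: float

triple = RawTriple(
    subject_text="Apple Inc.",
    predicate_text="acquired",
    object_text="Beats Electronics",
    source_sentence="Apple Inc. acquired Beats Electronics for $3 billion in May 2014.",
    confidence=0.92,
)
print(triple.subject_text, triple.predicate_text, triple.object_text)
```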
Stage 2: Extraction
The extraction stage uses GLiNER2 to extract relations and assign entity types, producing PipelineStatement objects with:
- subject: ExtractedEntity with text, type, span, and confidence
- object: ExtractedEntity with text, type, span, and confidence
- predicate: Predicate from GLiNER2's 324 default predicates
- predicate_category: Category the predicate belongs to (e.g., "employment_leadership")
- source_text: Source text for this statement
- confidence_score: Real confidence from GLiNER2
Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. This allows downstream stages to filter, deduplicate, or select based on specific criteria. Relations are sorted by confidence (descending).
Stage 3: Entity Qualification
Entity qualification combines what were previously separate qualification and canonicalization stages. It adds context, external identifiers, and canonical names to entities using the embedding database:
- PersonQualifier: Adds role, organization, and canonical ID for PERSON entities (enhanced in v0.9.0)
- Uses LLM (Gemma3) to extract role and organization from context
- Searches person database for notable people (executives, politicians, athletes, etc.)
- Resolves organization mentions against the organization database
- Returns canonical Wikidata IDs for matched people
- EmbeddingCompanyQualifier: Looks up company identifiers (LEI, CIK, UK company numbers) and canonical names using vector similarity search
The output is CanonicalEntity with:
- entity_type: Classification (business, nonprofit, government, etc.)
- canonical_match: Match details (id, name, method, confidence)
- fqn: Fully Qualified Name, e.g., "Tim Cook (CEO, Apple Inc)"
- External identifiers: lei, ch_number, sec_cik, ticker, etc.
- resolved_role: Canonical role information from person database v0.9.0
- resolved_org: Canonical organization information from org database v0.9.0
Note: The embedding-based company qualifier replaces the older API-based qualifiers (GLEIF, Companies House, SEC Edgar APIs) for faster, offline entity resolution.
Stage 4: Labeling
Labeling plugins annotate statements with additional metadata:
- SentimentLabeler: Adds sentiment classification (positive/negative/neutral)
- ConfidenceLabeler: Adds confidence scoring
- RelationTypeLabeler: Classifies relation types
The output is LabeledStatement with:
- Original statement
- Canonicalized subject and object
- List of StatementLabel objects
Stage 5: Taxonomy
Taxonomy classification plugins classify statements against large taxonomies with hundreds of possible values. Multiple labels may match a single statement above the confidence threshold.
- MNLITaxonomyClassifier: Uses MNLI zero-shot classification for accurate taxonomy labeling
- EmbeddingTaxonomyClassifier: Uses embedding similarity for faster classification
The output is a list of TaxonomyResult objects, each with:
- taxonomy_name: Name of the taxonomy (e.g., "esg_topics")
- category: Top-level category (e.g., "environment", "governance")
- label: Specific label within the category
- confidence: Classification confidence score
Both classifiers use hierarchical classification for efficiency: first identify the top-k categories, then return all labels above the threshold within those categories.
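The hierarchical scheme can be sketched with a toy scorer: rank categories first, then score only the labels inside the top-k categories and keep those above the threshold. The taxonomy, score dictionaries, and function names here are illustrative stand-ins for embedding or MNLI scores.

```python
# Toy two-level taxonomy classification: categories first, then labels.
TAXONOMY = {
    "environment": ["emissions", "renewable energy"],
    "governance": ["board composition", "executive pay"],
    "social": ["labor practices", "community impact"],
}

def classify(scores_cat, scores_label, top_k=1, threshold=0.5):
    # scores_cat / scores_label stand in for model similarity scores
    top = sorted(scores_cat, key=scores_cat.get, reverse=True)[:top_k]
    results = []
    for cat in top:
        for label in TAXONOMY[cat]:
            if scores_label.get(label, 0.0) >= threshold:
                results.append((cat, label, scores_label[label]))
    return results

cat_scores = {"environment": 0.2, "governance": 0.9, "social": 0.3}
label_scores = {"executive pay": 0.8, "board composition": 0.4}
print(classify(cat_scores, label_scores))  # [('governance', 'executive pay', 0.8)]
```

Scoring only labels within the winning categories is what keeps large taxonomies (hundreds of labels) tractable.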
Plugin System
Each stage is implemented through plugins registered with PluginRegistry. Plugins can be:
- Enabled/disabled per invocation
- Prioritized for execution order
- Entity-type specific (e.g., PersonQualifier only runs on PERSON entities)
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run with specific plugins disabled
config = PipelineConfig(
    disabled_plugins={"mnli_taxonomy_classifier"}  # Use embedding classifier instead
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process(text)
Document Processing
NEW in v0.7.0: Version 0.7.0 introduces document-level processing for handling files, URLs, and PDFs with automatic chunking, deduplication, and citation tracking.
Document Pipeline
The document pipeline:
- Loads content from files, URLs, or PDFs
- Chunks text into optimal-sized segments for the extraction model
- Processes each chunk through the 5-stage extraction pipeline
- Deduplicates statements across chunks
- Generates optional document summary
- Tracks citations back to source chunks
Chunking Strategy
Documents are split into chunks based on token count with configurable overlap:
| Parameter | Default | Description |
|---|---|---|
| target_tokens | 1000 | Target tokens per chunk |
| overlap_tokens | 100 | Token overlap between consecutive chunks |
| respect_sentences | true | Avoid splitting mid-sentence |
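A minimal sketch of the overlap strategy, operating on a plain token list (the real chunker counts model tokens and, with respect_sentences, avoids splitting mid-sentence):

```python
def chunk_tokens(tokens, target_tokens=1000, overlap_tokens=100):
    """Split a token list into overlapping chunks (illustrative sketch only)."""
    if target_tokens <= overlap_tokens:
        raise ValueError("target_tokens must exceed overlap_tokens")
    # Each step forward leaves `overlap_tokens` shared with the previous chunk
    step = target_tokens - overlap_tokens
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target_tokens])
        start += step
    return chunks

tokens = list(range(2500))
chunks = chunk_tokens(tokens, target_tokens=1000, overlap_tokens=100)
# chunks start at offsets 0, 900, 1800; the tail of each chunk repeats
# at the head of the next, so statements near a boundary appear in both
```

The overlap is what lets the deduplication stage reconcile statements extracted from adjacent chunks.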
URL and PDF Support
The document pipeline can fetch and process content from URLs:
- Web pages: HTML content is extracted using Readability-style parsing
- PDFs: Parsed using PyMuPDF with optional OCR for scanned documents
from statement_extractor.document import DocumentPipeline
pipeline = DocumentPipeline()
# Process a web page
ctx = await pipeline.process_url("https://example.com/article")
# Process a PDF with OCR
from statement_extractor.document import URLLoaderConfig
config = URLLoaderConfig(use_ocr=True)
ctx = await pipeline.process_url("https://example.com/report.pdf", config)
Cross-Chunk Deduplication
When processing long documents, the same fact may appear in multiple chunks. The deduplicator uses embedding similarity to identify and merge duplicate statements, keeping the highest-confidence version with proper citation tracking.
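The merge logic can be sketched as follows; Jaccard word overlap stands in here for the embedding similarity the real deduplicator uses, and the tuple layout is an assumption for illustration:

```python
def deduplicate(statements, threshold=0.8):
    """Merge near-duplicate statements, keeping the highest-confidence copy.

    `statements` are (text, confidence, chunk_id) tuples; word-set Jaccard
    overlap is a stand-in for embedding cosine similarity.
    """
    def similarity(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)

    # Visit higher-confidence statements first so survivors absorb duplicates
    kept = []
    for text, confidence, chunk_id in sorted(statements, key=lambda s: -s[1]):
        match = next((k for k in kept if similarity(k["text"], text) >= threshold), None)
        if match:
            match["citations"].append(chunk_id)  # keep citation to every source chunk
        else:
            kept.append({"text": text, "confidence": confidence, "citations": [chunk_id]})
    return kept

stmts = [
    ("Apple acquired Beats Electronics", 0.85, 3),
    ("Apple acquired Beats Electronics", 0.92, 0),
    ("Tesla opened a factory in Berlin", 0.77, 5),
]
unique = deduplicate(stmts)
# the two Apple statements merge into one record citing chunks 0 and 3
```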
Entity Embedding Database
UPDATED in v0.9.0
The entity embedding database provides fast qualification for both organizations and people using vector similarity search.
Organization Data Sources
| Source | Records | Identifier | Date Fields |
|---|---|---|---|
| Companies House | 5.5M | UK Company Number | from_date: Incorporation, to_date: Dissolution |
| GLEIF | 2.6M | LEI (Legal Entity Identifier) | from_date: LEI registration date |
| Wikidata | 1.5M | QID | from_date: Inception (P571), to_date: Dissolution (P576) |
| SEC Edgar | 73K | CIK (Central Index Key) | from_date: First SEC filing date |
Total: 9.6M+ organization records
Person Data Sources UPDATED in v0.9.3
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 27.5M | Person Number | UK company officers and directors |
| Wikidata | 13.4M | QID | Notable people with English Wikipedia articles |
Total: 41M+ people records
Person Types
| PersonType | Description | Example People |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials (presidents, MPs, mayors) | Joe Biden, Angela Merkel |
| government | Civil servants, diplomats, appointed officials | Ambassadors, agency heads |
| military | Military officers, armed forces personnel | Generals, admirals |
| legal | Judges, lawyers, legal professionals | Supreme Court justices |
| professional | Known for profession (doctors, engineers) | Famous surgeons, architects |
| athlete | Sports figures | LeBron James, Lionel Messi |
| artist | Traditional creatives (musicians, actors, painters) | Tom Hanks, Taylor Swift |
| media | Internet/social media personalities | YouTubers, influencers, podcasters |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| journalist | Reporters, news presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
People are imported from Companies House (UK company officers) and Wikidata (notable people with English Wikipedia articles). Each person record includes:
- name: Display name
- known_for_role: Primary role (e.g., "CEO", "President")
- known_for_org: Primary organization (e.g., "Apple Inc", "Tesla")
- country: Country of citizenship
- person_type: Classification category
- from_date: Role start date (ISO format)
- to_date: Role end date (ISO format)
- birth_date: Date of birth (ISO format) v0.9.2
- death_date: Date of death if deceased (ISO format) v0.9.2
Note: The same person can have multiple records with different role/org combinations (e.g., Tim Cook as "CEO at Apple" and "Board Director at Nike"). The unique constraint is on (source, source_id, known_for_role, known_for_org).
When organizations are discovered during people import (employers, affiliated orgs), they are automatically inserted into the organizations table if not already present. Each person record has a known_for_org_id foreign key linking to the organizations table, enabling efficient joins and lookups.
EntityType Classification
NEW in v0.8.0
Each organization record is classified with an entity_type field to distinguish between businesses, non-profits, government agencies, and other organization types:
| Category | Types | Description |
|---|---|---|
| Business | business, fund, branch | Commercial entities, investment funds, branch offices |
| Non-profit | nonprofit, ngo, foundation, trade_union | Charitable organizations, NGOs, labor unions |
| Government | government, international_org, political_party | Government agencies, UN/WHO/IMF, political parties |
| Education | educational, research | Schools, universities, research institutes |
| Other | healthcare, media, sports, religious | Hospitals, studios, sports clubs, religious orgs |
| Unknown | unknown | Classification not determined |
How It Works
- Embedding Generation: Organization names are embedded using EmbeddingGemma (300M params)
- Vector Search: sqlite-vec enables fast similarity search across millions of records
- Qualification: When an ORG entity is found, the database is searched for matching organizations
- Identifier Resolution: Matched organizations provide LEI, CIK, company numbers, etc.
Other Tables NEW in v0.9.4
| Table | Records | Description |
|---|---|---|
| Roles | 94K+ | Job titles with Wikidata QIDs (CEO, Director, etc.) |
| Locations | 25K+ | Countries, states, and cities with hierarchy |
Database Variants
- entities-lite.db (30.1 GB): Core fields and int8 embeddings only (default download)
- entities.db (32.2 GB): Full database with complete source metadata
- *.db.gz: Gzip compressed versions for faster downloads
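The lite variant's size saving comes from int8 scalar quantization of the embeddings. A generic sketch (not the library's exact scheme) shows why each float32 vector shrinks by 75%:

```python
from array import array

def quantize_int8(embedding):
    """Scalar-quantize float32 values to int8 (illustrative sketch).

    Each 4-byte float becomes a 1-byte signed int; the scale factor is kept
    so similarity scores can be approximated from the quantized vector.
    """
    scale = max(abs(x) for x in embedding) / 127.0 or 1.0
    quantized = array("b", (round(x / scale) for x in embedding))
    return quantized, scale

floats = array("f", [0.12, -0.98, 0.55, 0.0])  # toy stand-in for a 768-dim embedding
q, scale = quantize_int8(floats)
saving = 1 - len(q.tobytes()) / len(floats.tobytes())  # 0.75
```

The same 4:1 ratio is why the scalar embedding tables cut embedding storage by roughly 75%.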
Entity Database
The entity database provides fast lookup and qualification of organizations, people, roles, and locations using vector similarity search. It stores records from authoritative sources with 768-dimensional embeddings for semantic matching.
Quick Start UPDATED in v0.9.4
# Download the pre-built database
corp-extractor db download
# Check what's in it
corp-extractor db status
# Search for organizations
corp-extractor db search "Microsoft"
# Search for people
corp-extractor db search-people "Tim Cook"
# Search for roles (v0.9.4)
corp-extractor db search-roles "CEO"
# Search for locations (v0.9.4)
corp-extractor db search-locations "California"
The database is automatically used by the pipeline's qualification stage (Stage 3) to resolve entity names to canonical identifiers.
Getting the Database
Download Pre-built Database
The fastest way to get started is downloading from HuggingFace:
# Download lite version (default, smaller, faster)
corp-extractor db download
# Download full version (includes complete source metadata)
corp-extractor db download --full
Database variants:
| File | Size | Contents |
|---|---|---|
| entities-lite.db | 30.1 GB | Core fields + int8 embeddings only |
| entities.db | 32.2 GB | Full records with source metadata |
Storage location: ~/.cache/corp-extractor/entities-v2.db (v0.9.4+)
HuggingFace repo: Corp-o-Rate-Community/entity-references
Automatic Download
If you use the pipeline without downloading first, the database downloads automatically:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced...")
# Database downloaded automatically if not present
Database Schema
The database uses SQLite with the sqlite-vec extension for vector similarity search.
Schema v2 (Normalized)
v0.9.4
The v2 schema uses INTEGER foreign keys to enum lookup tables instead of TEXT columns:
-- Enum tables: source_types, people_types, organization_types, location_types
-- Organization: source_id (FK), entity_type_id (FK), region_id (FK to locations)
-- People: source_id (FK), person_type_id (FK), country_id (FK), known_for_role_id (FK)
-- Roles: qid, name, source_id (FK), canon_id
-- Locations: qid, name, source_id (FK), location_type_id (FK), parent_ids (hierarchy)
Organizations Table
CREATE TABLE organizations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID as integer (v0.9.4)
name TEXT NOT NULL,
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT NOT NULL, -- LEI, CIK, Company Number
region_id INTEGER, -- FK to locations(id) (v0.9.4)
entity_type_id INTEGER NOT NULL, -- FK to organization_types(id)
from_date TEXT, -- ISO YYYY-MM-DD
to_date TEXT, -- ISO YYYY-MM-DD
record TEXT NOT NULL, -- JSON (empty in lite version)
UNIQUE(source_identifier, source_id)
);
-- Both float32 and int8 embeddings supported (v0.9.4)
CREATE VIRTUAL TABLE organization_embeddings USING vec0(
org_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE organization_embeddings_scalar USING vec0(
org_id INTEGER PRIMARY KEY, embedding int8[768]
);
People Table
CREATE TABLE people (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID as integer (v0.9.4)
name TEXT NOT NULL,
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT NOT NULL, -- QID, Owner CIK, Person number
country_id INTEGER, -- FK to locations(id) (v0.9.4)
person_type_id INTEGER NOT NULL, -- FK to people_types(id)
known_for_role_id INTEGER, -- FK to roles(id) (v0.9.4)
known_for_org TEXT DEFAULT '',
known_for_org_id INTEGER, -- FK to organizations(id)
from_date TEXT, -- Role start date (ISO)
to_date TEXT, -- Role end date (ISO)
birth_date TEXT, -- ISO YYYY-MM-DD
death_date TEXT, -- ISO YYYY-MM-DD
record TEXT NOT NULL,
UNIQUE(source_identifier, source_id, known_for_role_id, known_for_org_id)
);
CREATE VIRTUAL TABLE person_embeddings USING vec0(
person_id INTEGER PRIMARY KEY, embedding float[768]
);
CREATE VIRTUAL TABLE person_embeddings_scalar USING vec0(
person_id INTEGER PRIMARY KEY, embedding int8[768]
);
New Tables (v0.9.4)
-- Roles table for job titles
CREATE TABLE roles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID (e.g., 484876 for CEO)
name TEXT NOT NULL, -- "Chief Executive Officer"
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
canon_id INTEGER DEFAULT NULL,
UNIQUE(name_normalized, source_id)
);
-- Locations table for geopolitical entities
CREATE TABLE locations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
qid INTEGER, -- Wikidata QID (e.g., 30 for USA)
name TEXT NOT NULL, -- "United States", "California"
name_normalized TEXT NOT NULL,
source_id INTEGER NOT NULL, -- FK to source_types(id)
source_identifier TEXT, -- "US", "CA"
parent_ids TEXT, -- JSON array of parent location IDs
location_type_id INTEGER NOT NULL, -- FK to location_types(id)
UNIQUE(source_identifier, source_id)
);
Entity Types
Organization EntityTypes
| Category | Types |
|---|---|
| Business | business, fund, branch |
| Non-profit | nonprofit, ngo, foundation, trade_union |
| Government | government, international_org, political_party |
| Education | educational, research |
| Other | healthcare, media, sports, religious, unknown |
Person PersonTypes
| Type | Description | Examples |
|---|---|---|
| executive | C-suite, board members | Tim Cook, Satya Nadella |
| politician | Elected officials | Presidents, MPs, mayors |
| government | Civil servants, diplomats | Agency heads, ambassadors |
| military | Armed forces personnel | Generals, admirals |
| legal | Judges, lawyers | Supreme Court justices |
| professional | Known for profession | Famous surgeons, architects |
| academic | Professors, researchers | Neil deGrasse Tyson |
| scientist | Scientists, inventors | Research scientists |
| athlete | Sports figures | LeBron James |
| artist | Traditional creatives | Musicians, actors, painters |
| media | Internet personalities | YouTubers, influencers |
| journalist | Reporters, presenters | Anderson Cooper |
| entrepreneur | Founders, business owners | Mark Zuckerberg |
| activist | Advocates, campaigners | Greta Thunberg |
Simplified Location Types
v0.9.4
| Type | Description | Examples |
|---|---|---|
| continent | Continents | Europe, Asia, Africa |
| country | Sovereign states | United States, Germany, Japan |
| subdivision | States, provinces, regions | California, Bavaria, Ontario |
| city | Cities, towns, municipalities | New York, Paris, Tokyo |
| district | Districts, boroughs, neighborhoods | Manhattan, Westminster |
| historic | Former countries, historic territories | Soviet Union, Prussia |
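The parent_ids column makes it cheap to walk a location up its hierarchy. A sketch, assuming rows loaded into a dict (the real column stores a JSON array of parent location IDs):

```python
def ancestor_chain(location_id, locations):
    """Walk parent_ids up the hierarchy, collecting location names.

    `locations` maps id -> {"name": ..., "parent_ids": [...]}; an
    illustrative in-memory stand-in for the locations table.
    """
    chain, current = [], location_id
    while current is not None:
        row = locations[current]
        chain.append(row["name"])
        parents = row["parent_ids"]
        current = parents[0] if parents else None  # follow the first parent
    return chain

locations = {
    1: {"name": "Manhattan", "parent_ids": [2]},
    2: {"name": "New York", "parent_ids": [3]},
    3: {"name": "New York State", "parent_ids": [4]},
    4: {"name": "United States", "parent_ids": []},
}
chain = ancestor_chain(1, locations)  # district -> city -> subdivision -> country
```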
Data Sources
Organizations
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 5.5M | UK Company Number | UK registered companies |
| GLEIF | 2.6M | LEI (Legal Entity Identifier) | Global legal entities |
| Wikidata | 1.5M | QID | Notable organizations |
| SEC Edgar | 73K | CIK (Central Index Key) | US public companies |
Total: 9.6M+ organizations
People
| Source | Records | Identifier | Coverage |
|---|---|---|---|
| Companies House | 27.5M | Person number | UK company officers |
| Wikidata | 13.4M | QID | Notable people worldwide |
Total: 41M+ people
Other Tables
| Table | Records | Description |
|---|---|---|
| Roles | 94K | Job titles with Wikidata QIDs |
| Locations | 25K | Countries, states, cities with hierarchy |
Python API
Search Organizations
from statement_extractor.database import OrganizationDatabase
db = OrganizationDatabase()
# Search by name (hybrid: text + embedding)
matches = db.search_by_name("Microsoft Corporation", top_k=5)
for match in matches:
print(f"{match.company.name} ({match.company.source}:{match.company.source_id})")
print(f" Similarity: {match.similarity_score:.3f}")
print(f" Type: {match.company.entity_type}")
# Search by embedding
from statement_extractor.database import CompanyEmbedder
embedder = CompanyEmbedder()
embedding = embedder.embed("Microsoft")
matches = db.search(embedding, top_k=10, min_similarity=0.7)
Search People
from statement_extractor.database import PersonDatabase
db = PersonDatabase()
# Search by name
matches = db.search_by_name("Tim Cook", top_k=5)
for match in matches:
print(f"{match.person.name} - {match.person.known_for_role} at {match.person.known_for_org}")
print(f" Wikidata: {match.person.source_id}")
print(f" Type: {match.person.person_type}")
Use in Pipeline
The database is automatically used by qualification plugins:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced new AI features.")
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} --[{stmt.statement.predicate}]--> {stmt.object_fqn}")
# e.g., "Satya Nadella (CEO, Microsoft) --[announced]--> new AI features"
Add Custom Records
from statement_extractor.database import OrganizationDatabase, CompanyRecord, EntityType
db = OrganizationDatabase()
record = CompanyRecord(
name="My Company Inc",
source="custom",
source_id="CUSTOM001",
region="US",
entity_type=EntityType.business,
record={"custom_field": "value"},
)
db.add_record(record)
Building Your Own Database
Import Organizations
# Companies House - UK companies (5.5M records)
corp-extractor db import-companies-house --download
# GLEIF - Global LEI data (2.6M records)
corp-extractor db import-gleif --download
corp-extractor db import-gleif /path/to/lei-data.json --limit 50000
# SEC Edgar - US public companies (73K filers)
corp-extractor db import-sec --download
# Wikidata organizations via SPARQL (1.5M records)
corp-extractor db import-wikidata --limit 50000
Import People
# Import by person type
corp-extractor db import-people --type executive --limit 5000
corp-extractor db import-people --type politician --limit 5000
corp-extractor db import-people --type athlete --limit 5000
# Import all person types
corp-extractor db import-people --all --limit 50000
# Skip existing records (faster for incremental updates)
corp-extractor db import-people --type executive --skip-existing
# Fetch role start/end dates (slower, queries per person)
corp-extractor db import-people --type executive --enrich-dates
Wikidata Dump Import
v0.9.4
For large imports without SPARQL query timeouts:
# Download and import from Wikidata dump (~100GB compressed)
corp-extractor db import-wikidata-dump --download --limit 100000
# Import from local dump file
corp-extractor db import-wikidata-dump --dump /path/to/latest-all.json.bz2
# Import only people (no organizations)
corp-extractor db import-wikidata-dump --dump dump.bz2 --people --no-orgs
# Import only locations (countries, states, cities) - v0.9.4
corp-extractor db import-wikidata-dump --dump dump.bz2 --locations --no-people --no-orgs
# Resume interrupted import
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
# Skip records already in database
corp-extractor db import-wikidata-dump --dump dump.bz2 --skip-updates
Fast download with aria2c: Install aria2c for 10-20x faster downloads:
brew install aria2 # macOS
apt install aria2 # Ubuntu/Debian
Full Build Process
# 1. Import from all sources
corp-extractor db import-gleif --download
corp-extractor db import-sec --download
corp-extractor db import-companies-house --download
corp-extractor db import-wikidata --limit 100000
corp-extractor db import-wikidata-dump --download --people --no-orgs --limit 100000
# 2. Link equivalent records
corp-extractor db canonicalize
# 3. Generate scalar embeddings (75% storage reduction)
corp-extractor db backfill-scalar
# 4. Check status
corp-extractor db status
# 5. Upload to HuggingFace
export HF_TOKEN="hf_..."
corp-extractor db upload
Migrate to v2 Schema
v0.9.4
To migrate an existing v1 database to the normalized v2 schema:
# Create new v2 database (preserves original)
corp-extractor db migrate-v2 entities.db entities-v2.db
# Resume interrupted migration
corp-extractor db migrate-v2 entities.db entities-v2.db --resume
The v2 schema provides:
- INTEGER FK columns instead of TEXT enums (better performance)
- New enum lookup tables for type filtering
- New roles and locations tables
- QIDs as integers (Q prefix stripped)
- Human-readable views with JOINs
Canonicalization
Link equivalent records across sources:
corp-extractor db canonicalize
Organizations
Matches organizations by:
- Global identifiers: LEI, CIK, ticker (no region check needed)
- Normalized name + region: Handles suffix variations (Ltd → Limited, Corp → Corporation)
Source priority: gleif > sec_edgar > companies_house > wikipedia
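The suffix handling can be sketched with a small normalization function; the suffix table here is an illustrative subset, not the library's actual mapping:

```python
import re

# Common suffix abbreviations collapsed to one canonical form (illustrative subset)
SUFFIXES = {
    "ltd": "limited",
    "corp": "corporation",
    "inc": "incorporated",
    "co": "company",
}

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and expand abbreviated suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

# Records from two sources now produce the same key and can be linked
normalize_name("Acme Ltd.")     # "acme limited"
normalize_name("ACME Limited")  # "acme limited"
```

Combined with a region check, equal normalized names are treated as the same organization.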
People
v0.9.3
Matches people by:
- Normalized name + same organization: Uses org canonical group to link people across sources
- Normalized name + overlapping date ranges: Links records with matching tenure periods
Source priority: wikidata > sec_edgar > companies_house
Canonicalization enables prominence-based search re-ranking that boosts entities with records from multiple authoritative sources.
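A sketch of prominence-based re-ranking under assumed weights (the tuple layout and boost value are illustrative, not the library's tuned parameters):

```python
def rerank(matches, boost=0.05):
    """Boost candidates whose canonical group spans multiple sources.

    `matches` are (name, similarity, sources) tuples; each additional
    authoritative source adds a small prominence bonus to the raw
    similarity score.
    """
    def score(match):
        name, similarity, sources = match
        return similarity + boost * (len(set(sources)) - 1)
    return sorted(matches, key=score, reverse=True)

candidates = [
    ("Apple Inc (shell company)", 0.93, ["companies_house"]),
    ("Apple Inc", 0.91, ["gleif", "sec_edgar", "wikidata"]),
]
ranked = rerank(candidates)
# the multi-source record overtakes the slightly closer single-source one
```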
Data Models
CompanyRecord
class CompanyRecord(BaseModel):
name: str # Organization name
source: str # 'gleif', 'sec_edgar', 'companies_house', 'wikipedia'
source_id: str # LEI, CIK, UK Company Number, or QID
region: str # Country/region code
entity_type: EntityType # Classification
from_date: Optional[str] # ISO YYYY-MM-DD
to_date: Optional[str] # ISO YYYY-MM-DD
record: dict[str, Any] # Full source record (empty in lite)
@property
def canonical_id(self) -> str:
return f"{self.source}:{self.source_id}"
PersonRecord
class PersonRecord(BaseModel):
name: str # Display name
source: str # 'wikidata'
source_id: str # Wikidata QID
country: str # Country code
person_type: PersonType # Classification
known_for_role: str # Primary role
known_for_org: str # Primary organization name
known_for_org_id: Optional[int] # FK to organizations
from_date: Optional[str] # Role start (ISO)
to_date: Optional[str] # Role end (ISO)
birth_date: Optional[str] # Birth date (ISO)
death_date: Optional[str] # Death date (ISO)
record: dict[str, Any] # Full source record
@property
def is_historic(self) -> bool:
return self.death_date is not None
Match Results
class CompanyMatch(BaseModel):
company: CompanyRecord
similarity_score: float # 0.0 to 1.0
class PersonMatch(BaseModel):
person: PersonRecord
similarity_score: float # 0.0 to 1.0
llm_confirmed: bool # Whether LLM validated match
Embedding Model
Embeddings are generated using google/embeddinggemma-300m:
- Parameters: 300M (lightweight)
- Dimensions: 768
- Optimized for: CPU inference
- Auto-download: Model downloads automatically on first use
from statement_extractor.database import CompanyEmbedder
embedder = CompanyEmbedder()
embedding = embedder.embed("Apple Inc") # Returns 768-dim numpy array
Troubleshooting
Database not found:
Error: Database not found at ~/.cache/corp-extractor/entities.db
Run corp-extractor db download to fetch the pre-built database.
sqlite-vec extension error:
Error: no such module: vec0
The sqlite-vec extension should install automatically. If not: pip install sqlite-vec
Memory issues with large dumps:
# Import in smaller batches
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates
# Then resume for more
corp-extractor db import-wikidata-dump --dump dump.bz2 --limit 10000 --skip-updates --resume
Resume interrupted import:
corp-extractor db import-wikidata-dump --dump dump.bz2 --resume
Progress is saved to ~/.cache/corp-extractor/wikidata-dump-progress.json.
API Reference
Functions
The library provides convenience functions for quick extraction without managing extractor instances.
| Function | Returns | Description |
|---|---|---|
| extract_statements(text, options?) | ExtractionResult | Main extraction function. Returns structured statements with confidence scores. |
| extract_statements_as_json(text, options?, indent?) | str | Returns extraction result as a JSON string. |
| extract_statements_as_xml(text, options?) | str | Returns raw XML output from the model. |
| extract_statements_as_dict(text, options?) | dict | Returns extraction result as a Python dictionary. |
Function Signatures
def extract_statements(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> ExtractionResult:
"""
Extract structured statements from text.
Args:
text: Input text to extract statements from
options: Extraction options (or pass individual options as kwargs)
**kwargs: Individual option overrides (num_beams, diversity_penalty, etc.)
Returns:
ExtractionResult containing Statement objects
"""
def extract_statements_as_json(
text: str,
options: Optional[ExtractionOptions] = None,
indent: Optional[int] = 2,
**kwargs
) -> str:
"""Returns JSON string representation of the extraction result."""
def extract_statements_as_xml(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> str:
"""Returns XML string with <statements> containing <stmt> elements."""
def extract_statements_as_dict(
text: str,
options: Optional[ExtractionOptions] = None,
**kwargs
) -> dict:
"""Returns dictionary representation of the extraction result."""
Usage Examples
from statement_extractor import extract_statements, extract_statements_as_json
# Basic extraction
result = extract_statements("Apple acquired Beats for $3 billion.")
for stmt in result:
print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
# With options via kwargs
result = extract_statements(
"Tesla announced new factories.",
num_beams=6,
diversity_penalty=1.5
)
# JSON output
json_str = extract_statements_as_json("OpenAI released GPT-4.", indent=2)
print(json_str)
Classes
StatementExtractor
The main extractor class with full control over device, model loading, and extraction options.
class StatementExtractor:
def __init__(
self,
model_id: str = "Corp-o-Rate-Community/statement-extractor",
device: Optional[str] = None,
torch_dtype: Optional[torch.dtype] = None,
predicate_taxonomy: Optional[PredicateTaxonomy] = None,
predicate_config: Optional[PredicateComparisonConfig] = None,
scoring_config: Optional[ScoringConfig] = None,
):
"""
Initialize the statement extractor.
Args:
model_id: HuggingFace model ID or local path
device: Device to use ('cuda', 'cpu', or None for auto-detect)
torch_dtype: Torch dtype (default: bfloat16 on GPU, float32 on CPU)
predicate_taxonomy: Optional taxonomy for predicate normalization
predicate_config: Configuration for predicate comparison
scoring_config: Configuration for quality scoring
"""
def extract(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> ExtractionResult:
"""Extract statements from text."""
def extract_as_xml(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> str:
"""Extract statements and return raw XML output."""
def extract_as_json(
self,
text: str,
options: Optional[ExtractionOptions] = None,
indent: Optional[int] = 2,
) -> str:
"""Extract statements and return JSON string."""
def extract_as_dict(
self,
text: str,
options: Optional[ExtractionOptions] = None,
) -> dict:
"""Extract statements and return as dictionary."""
Example: Custom extractor with GPU control
from statement_extractor import StatementExtractor, ExtractionOptions
# Force CPU usage
extractor = StatementExtractor(device="cpu")
# Extract with custom options
options = ExtractionOptions(num_beams=6, diversity_penalty=1.2)
result = extractor.extract("Microsoft partnered with OpenAI.", options)
ExtractionOptions
Configuration for the extraction process.
class ExtractionOptions(BaseModel):
# Beam search parameters
num_beams: int = 4 # 1-16, beams for diverse beam search
diversity_penalty: float = 1.0 # >= 0.0, penalty for beam diversity
max_new_tokens: int = 2048 # 128-8192, max tokens to generate
min_statement_ratio: float = 1.0 # >= 0.0, min statements per sentence
max_attempts: int = 3 # 1-10, extraction retry attempts
deduplicate: bool = True # Remove duplicate statements
# Predicate taxonomy & comparison
predicate_taxonomy: Optional[PredicateTaxonomy] = None
predicate_config: Optional[PredicateComparisonConfig] = None
# Scoring configuration (v0.2.0)
scoring_config: Optional[ScoringConfig] = None
# Pluggable canonicalization
entity_canonicalizer: Optional[Callable[[str], str]] = None
# Mode flags
merge_beams: bool = True # Merge top-N beams vs select best
embedding_dedup: bool = True # Use embedding similarity for dedup
ScoringConfig
Quality scoring parameters for beam selection and triple assessment. Added in v0.2.0.
class ScoringConfig(BaseModel):
quality_weight: float = 1.0 # >= 0.0, weight for confidence scores
coverage_weight: float = 0.5 # >= 0.0, bonus for source text coverage
redundancy_penalty: float = 0.3 # >= 0.0, penalty for duplicate triples
length_penalty: float = 0.1 # >= 0.0, penalty for verbosity
min_confidence: float = 0.0 # 0.0-1.0, minimum confidence threshold
merge_top_n: int = 3 # 1-10, beams to merge when merge_beams=True
Tuning for precision vs recall:
| Use Case | min_confidence | Notes |
|---|---|---|
| High recall | 0.0 | Keep all extractions |
| Balanced | 0.5 | Filter low-confidence triples |
| High precision | 0.8 | Only keep high-confidence triples |
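The min_confidence knob is a simple floor on the per-triple score. A stand-in sketch of the filtering behaviour (tuples replace Statement objects for illustration):

```python
def filter_by_confidence(statements, min_confidence):
    """Drop extracted triples whose confidence is below the floor.

    Mirrors the effect of ScoringConfig(min_confidence=...) inside the
    extractor; `statements` are (triple, confidence) stand-in tuples.
    """
    return [s for s in statements if s[1] >= min_confidence]

extracted = [
    (("Apple", "acquired", "Beats"), 0.92),
    (("Apple", "considered", "a partnership"), 0.41),
    (("Beats", "priced at", "$3 billion"), 0.66),
]
high_recall = filter_by_confidence(extracted, 0.0)     # keeps all three
balanced = filter_by_confidence(extracted, 0.5)        # drops the 0.41 triple
high_precision = filter_by_confidence(extracted, 0.8)  # keeps only the 0.92 triple
```

Raising the floor trades recall for precision, exactly as the table above describes.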
PredicateTaxonomy
A taxonomy of canonical predicates for normalization.
class PredicateTaxonomy(BaseModel):
predicates: list[str] # List of canonical predicate forms
name: Optional[str] = None # Optional taxonomy name
@classmethod
def from_file(cls, path: str | Path) -> "PredicateTaxonomy":
"""Load taxonomy from a file (one predicate per line)."""
@classmethod
def from_list(cls, predicates: list[str], name: Optional[str] = None) -> "PredicateTaxonomy":
"""Create taxonomy from a list of predicates."""
Example:
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements
# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "works_for", "located_in", "partnered_with"
])
# Use in extraction
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements("Google bought YouTube.", options)
# predicate "bought" maps to canonical "acquired"
PredicateComparisonConfig
Configuration for embedding-based predicate comparison.
class PredicateComparisonConfig(BaseModel):
embedding_model: str = "sentence-transformers/paraphrase-MiniLM-L6-v2"
similarity_threshold: float = 0.65 # 0.0-1.0, min similarity for taxonomy match
dedup_threshold: float = 0.65 # 0.0-1.0, min similarity for duplicates
normalize_text: bool = True # Lowercase and strip before embedding
Data Models
All data models use Pydantic for validation and serialization.
Statement
A single extracted subject-predicate-object triple.
class Statement(BaseModel):
subject: Entity # The subject entity
predicate: str # The relationship/predicate
object: Entity # The object entity
source_text: Optional[str] = None # Original text span
# Quality scoring fields (v0.2.0)
confidence_score: Optional[float] = None # 0.0-1.0, quality score (semantic + entity)
evidence_span: Optional[tuple[int, int]] = None # Character offsets in source
canonical_predicate: Optional[str] = None # Canonical form if taxonomy used
def as_triple(self) -> tuple[str, str, str]:
"""Return as (subject, predicate, object) tuple."""
def __str__(self) -> str:
"""Format: 'subject -- predicate --> object'"""
Example:
stmt = result.statements[0]
print(stmt.subject.text) # "Apple Inc."
print(stmt.predicate) # "acquired"
print(stmt.object.text) # "Beats Electronics"
print(stmt.confidence_score) # 0.92
print(stmt.as_triple()) # ("Apple Inc.", "acquired", "Beats Electronics")
Entity
An entity representing a subject or object.
class Entity(BaseModel):
text: str # The entity text
type: EntityType = UNKNOWN # The entity type
def __str__(self) -> str:
"""Format: 'text (TYPE)'"""
EntityType
Enumeration of supported entity types.
class EntityType(str, Enum):
ORG = "ORG" # Organization
PERSON = "PERSON" # Person
GPE = "GPE" # Geopolitical entity (country, city, state)
LOC = "LOC" # Non-GPE location
PRODUCT = "PRODUCT" # Product
EVENT = "EVENT" # Event
WORK_OF_ART = "WORK_OF_ART" # Creative work
LAW = "LAW" # Legal document
DATE = "DATE" # Date or time
MONEY = "MONEY" # Monetary value
PERCENT = "PERCENT" # Percentage
QUANTITY = "QUANTITY" # Quantity or measurement
UNKNOWN = "UNKNOWN" # Unknown type
ExtractionResult
Container for extraction results. Supports iteration and length.
class ExtractionResult(BaseModel):
statements: list[Statement] = [] # List of extracted statements
source_text: Optional[str] = None # Original input text
def __len__(self) -> int:
"""Number of statements."""
def __iter__(self):
"""Iterate over statements."""
def to_triples(self) -> list[tuple[str, str, str]]:
"""Return all statements as (subject, predicate, object) tuples."""
Example:
result = extract_statements(text)
# Iterate directly
for stmt in result:
print(stmt)
# Check count
print(f"Found {len(result)} statements")
# Get as simple tuples
triples = result.to_triples()
PredicateMatch
Result of matching a predicate to a canonical form.
class PredicateMatch(BaseModel):
original: str # The original extracted predicate
canonical: Optional[str] = None # Matched canonical predicate, if any
similarity: float = 0.0 # 0.0-1.0, cosine similarity score
matched: bool = False # Whether a match was found above threshold
Example:
from statement_extractor import PredicateComparer, PredicateTaxonomy
taxonomy = PredicateTaxonomy.from_list(["acquired", "founded", "works_for"])
comparer = PredicateComparer(taxonomy=taxonomy)
match = comparer.match_to_canonical("bought")
print(match.original) # "bought"
print(match.canonical) # "acquired"
print(match.similarity) # ~0.82
print(match.matched) # True
Pipeline API
NEW in v0.5.0
The pipeline API provides comprehensive entity resolution and taxonomy classification through a 5-stage plugin architecture.
ExtractionPipeline
The main orchestrator class that runs all pipeline stages.
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
class ExtractionPipeline:
def __init__(self, config: Optional[PipelineConfig] = None):
"""
Initialize the extraction pipeline.
Args:
config: Pipeline configuration (default: all stages enabled)
"""
def process(self, text: str, metadata: Optional[dict] = None) -> PipelineContext:
"""
Process text through the pipeline stages.
Args:
text: Input text to process
metadata: Optional source metadata (document ID, URL, etc.)
Returns:
PipelineContext with results from all stages
"""
Example:
pipeline = ExtractionPipeline()
ctx = pipeline.process("Amazon CEO Andy Jassy announced plans.")
print(f"Statements: {ctx.statement_count}")
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
PipelineConfig
Configuration for stage and plugin selection.
from statement_extractor.pipeline import PipelineConfig
class PipelineConfig(BaseModel):
enabled_stages: set[int] = {1, 2, 3, 4, 5} # Stages to run (1-5)
enabled_plugins: Optional[set[str]] = None # Plugins to enable (None = all)
disabled_plugins: set[str] = set() # Plugins to disable
fail_fast: bool = False # Stop on first error
parallel_processing: bool = False # Enable parallel processing
max_statements: Optional[int] = None # Limit statements processed
# Stage-specific options
splitter_options: dict = {}
extractor_options: dict = {}
qualifier_options: dict = {}
labeler_options: dict = {}
taxonomy_options: dict = {}
@classmethod
def from_stage_string(cls, stages: str, **kwargs) -> "PipelineConfig":
"""Create config from stage string like '1-3' or '1,2,5'."""
@classmethod
def default(cls) -> "PipelineConfig":
"""All stages enabled."""
@classmethod
def minimal(cls) -> "PipelineConfig":
"""Only splitting and extraction (stages 1-2)."""
Example:
# Run only stages 1-3
config = PipelineConfig(enabled_stages={1, 2, 3})
# Disable specific plugins
config = PipelineConfig(disabled_plugins={"sec_edgar_qualifier"})
# From stage string
config = PipelineConfig.from_stage_string("1-3")
PipelineContext
Data container that flows through all pipeline stages.
from statement_extractor.pipeline import PipelineContext
class PipelineContext(BaseModel):
# Input
source_text: str # Original input text
source_metadata: dict = {} # Document metadata
# Stage outputs
raw_triples: list[RawTriple] = [] # Stage 1 output
statements: list[PipelineStatement] = [] # Stage 2 output
canonical_entities: dict[str, CanonicalEntity] = {} # Stage 3 output
labeled_statements: list[LabeledStatement] = [] # Stage 4 output
taxonomy_results: dict[tuple, list[TaxonomyResult]] = {} # Stage 5 output (multiple labels per statement)
# Processing metadata
processing_errors: list[str] = []
processing_warnings: list[str] = []
stage_timings: dict[str, float] = {}
@property
def statement_count(self) -> int:
"""Number of statements in final output."""
@property
def has_errors(self) -> bool:
"""Check if any errors occurred."""
PluginRegistry
Registry for discovering and managing plugins.
from statement_extractor.pipeline import PluginRegistry
class PluginRegistry:
@classmethod
def list_plugins(cls, stage: Optional[int] = None) -> list[dict]:
"""List all registered plugins, optionally filtered by stage."""
@classmethod
def get_plugin(cls, name: str) -> Optional[BasePlugin]:
"""Get a plugin by name."""
Pipeline Data Models
RawTriple
Output of Stage 1 (Splitting).
class RawTriple(BaseModel):
subject_text: str # Raw subject text
predicate_text: str # Raw predicate text
object_text: str # Raw object text
source_sentence: str # Source sentence
confidence: float = 1.0 # Extraction confidence (0-1)
def as_tuple(self) -> tuple[str, str, str]:
"""Return as (subject, predicate, object) tuple."""
PipelineStatement
Output of Stage 2 (Extraction).
class PipelineStatement(BaseModel):
subject: ExtractedEntity # Subject with type, span, confidence
predicate: str # Predicate text
predicate_category: Optional[str] # Predicate category (e.g., "employment_leadership")
object: ExtractedEntity # Object with type, span, confidence
source_text: str # Source text
confidence_score: float = 1.0 # Overall confidence (from GLiNER2)
extraction_method: Optional[str] # Method: gliner_relation
Note: Stage 2 returns all matching relations from GLiNER2, not just the best one. Relations are sorted by confidence (descending).
GLiNER2Extractor
The Stage 2 extractor plugin that uses GLiNER2 for relation extraction.
from statement_extractor.plugins.extractors.gliner2 import GLiNER2Extractor
class GLiNER2Extractor(BaseExtractorPlugin):
def __init__(
self,
predicates: Optional[list[str]] = None,
predicates_file: Optional[str | Path] = None,
entity_types: Optional[list[str]] = None,
use_default_predicates: bool = True,
):
"""
Initialize the GLiNER2 extractor.
Args:
predicates: Custom list of predicate names
predicates_file: Path to custom predicates JSON file
entity_types: Entity types to extract (default: all)
use_default_predicates: Use the 324 built-in predicates when no custom ones are provided
"""
Key behaviors:
- Uses include_confidence=True for real confidence scores from GLiNER2
- Iterates through 21 predicate categories to stay under GLiNER2's ~25 label limit
- Returns all matching relations per source sentence (filtered later)
- Predicates loaded from default_predicates.json (324 predicates)
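The per-category iteration exists because GLiNER2 can only score a limited number of labels (roughly 25) in a single call. The batching idea behind that behavior, as a standalone sketch rather than the library's actual code:

```python
def batch_labels(labels: list[str], max_per_call: int = 25) -> list[list[str]]:
    """Split a flat label list into chunks small enough for one GLiNER2 call."""
    return [labels[i:i + max_per_call] for i in range(0, len(labels), max_per_call)]

# The 324 built-in predicates fit in 13 calls of at most 25 labels each
batches = batch_labels([f"predicate_{i}" for i in range(324)])
```

Grouping predicates by category (as the extractor does) additionally keeps semantically related labels in the same call.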
EntityQualifiers
Qualifiers added in Stage 3.
class EntityQualifiers(BaseModel):
# Semantic qualifiers
org: Optional[str] = None # Organization/employer
role: Optional[str] = None # Job title/position
# Location qualifiers
region: Optional[str] = None # State/province
country: Optional[str] = None # Country
city: Optional[str] = None # City
jurisdiction: Optional[str] = None # Legal jurisdiction
# External identifiers
identifiers: dict[str, str] = {} # lei, ch_number, sec_cik, ticker, etc.
def has_any_qualifier(self) -> bool:
"""Check if any qualifier is set."""
CanonicalMatch
Result of canonical matching in Stage 3.
class CanonicalMatch(BaseModel):
canonical_id: Optional[str] # ID in canonical database
canonical_name: Optional[str] # Canonical name/label
match_method: str # identifier, name_exact, name_fuzzy, embedding
match_confidence: float = 1.0 # Confidence in match (0-1)
match_details: Optional[dict] # Additional match details
CanonicalEntity
Output of Stage 3 (Entity Qualification).
class CanonicalEntity(BaseModel):
entity_ref: str # Reference to original entity
original_text: str # Original entity text
entity_type: EntityType # Entity type
qualifiers: EntityQualifiers # Qualifiers and identifiers
canonical_match: Optional[CanonicalMatch] # Canonical match if found
fqn: str # Fully Qualified Name
qualification_sources: list[str] # Plugins that contributed
StatementLabel
A label applied in Stage 4.
class StatementLabel(BaseModel):
label_type: str # sentiment, relation_type, confidence
label_value: Union[str, float, bool] # The label value
confidence: float = 1.0 # Confidence in label
labeler: Optional[str] # Plugin that produced the label
LabeledStatement
Final output from Stage 4 (Labeling).
class LabeledStatement(BaseModel):
statement: PipelineStatement # Original statement
subject_canonical: CanonicalEntity # Canonicalized subject
object_canonical: CanonicalEntity # Canonicalized object
labels: list[StatementLabel] = [] # Applied labels
@property
def subject_fqn(self) -> str:
"""Subject's fully qualified name."""
@property
def object_fqn(self) -> str:
"""Object's fully qualified name."""
def get_label(self, label_type: str) -> Optional[StatementLabel]:
"""Get label by type."""
def as_dict(self) -> dict:
"""Convert to simplified dictionary."""
Example:
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
# Access labels
sentiment = stmt.get_label("sentiment")
if sentiment:
print(f" Sentiment: {sentiment.label_value}")
# Access qualifiers
subject_quals = stmt.subject_canonical.qualifiers
if subject_quals.role:
print(f" Role: {subject_quals.role}")
TaxonomyResult
Output of Stage 5 (Taxonomy) classification.
class TaxonomyResult(BaseModel):
taxonomy_name: str # e.g., "esg_topics"
category: str # Top-level category
label: str # Specific label
label_id: Optional[int] = None # Numeric ID if available
confidence: float = 1.0 # Classification confidence (0-1)
classifier: Optional[str] = None # Plugin that produced this result
metadata: dict = {} # Additional metadata
@property
def full_label(self) -> str:
"""Return category:label format."""
Example:
# Access taxonomy results from context
# Each statement may have multiple labels above the threshold
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f"Statement: {source_text[:50]}...")
print(f" Taxonomy: {taxonomy_name}")
print(f" Labels ({len(results)}):")
for result in results:
print(f" - {result.full_label} (confidence: {result.confidence:.2f})")
ClassificationSchema
Schema for simple multi-choice classification (2-20 options). Used by labelers that need GLiNER2 to perform classification.
class ClassificationSchema(BaseModel):
label_type: str # e.g., "sentiment"
choices: list[str] # Available choices
description: str = "" # Description for the classifier
scope: str = "statement" # statement or entity
TaxonomySchema
Schema for large taxonomy classification (100+ values). Used by taxonomy plugins.
class TaxonomySchema(BaseModel):
label_type: str # e.g., "taxonomy"
values: list[str] | dict[str, list[str]] # Flat list or category -> labels
description: str = ""
scope: str = "statement"
label_descriptions: Optional[dict[str, str]] = None # Descriptions for labels
Configuration
The statement-extractor library provides fine-grained control over extraction behavior through configuration classes. This section covers all configuration options for tuning precision, recall, and performance.
ExtractionOptions
The primary configuration class for controlling extraction behavior.
| Parameter | Type | Default | Description |
|---|---|---|---|
num_beams | int | 4 | Number of beam search candidates |
diversity_penalty | float | 1.0 | Penalty for beam diversity in diverse beam search |
max_new_tokens | int | 2048 | Maximum generation length in tokens |
deduplicate | bool | True | Remove duplicate statements from output |
merge_beams | bool | True | Merge top beams into single result set (v0.2.0) |
embedding_dedup | bool | True | Use embedding similarity for deduplication (v0.2.0) |
predicates | list[str] | None | Predefined predicates for GLiNER2 relation extraction (v0.4.0) |
all_triples | bool | False | Keep all candidate triples instead of best per source |
predicate_taxonomy | PredicateTaxonomy | None | Taxonomy of canonical predicates |
scoring_config | ScoringConfig | None | Quality scoring configuration |
entity_canonicalizer | Callable | None | Custom function for entity canonicalization |
Basic usage:
from statement_extractor import ExtractionOptions, extract_statements
options = ExtractionOptions(
num_beams=6,
diversity_penalty=1.2,
deduplicate=True
)
result = extract_statements("Apple acquired Beats for $3 billion.", options)
ScoringConfig
Added in v0.2.0
Configuration for quality scoring, filtering, and beam selection. Use this to tune the precision-recall tradeoff.
| Parameter | Type | Default | Description |
|---|---|---|---|
min_confidence | float | 0.0 | Filter threshold (0=recall, 0.7+=precision) |
quality_weight | float | 1.0 | Weight for confidence scores |
coverage_weight | float | 0.5 | Weight for source text coverage |
redundancy_penalty | float | 0.3 | Penalty for duplicate triples |
length_penalty | float | 0.1 | Penalty for verbose predicates/entities |
merge_top_n | int | 3 | Number of beams to merge |
Common configurations:
from statement_extractor import ScoringConfig, ExtractionOptions, extract_statements
# High precision mode - only keep confident extractions
precision_config = ScoringConfig(
min_confidence=0.7,
quality_weight=1.5,
redundancy_penalty=0.5
)
# High recall mode - keep everything
recall_config = ScoringConfig(
min_confidence=0.0,
quality_weight=0.5,
redundancy_penalty=0.1
)
# Use in extraction
options = ExtractionOptions(scoring_config=precision_config)
result = extract_statements(text, options)
Precision vs recall tuning:
| Use Case | min_confidence | quality_weight | Notes |
|---|---|---|---|
| Maximum recall | 0.0 | 0.5 | Keep all extractions |
| Balanced | 0.4 | 1.0 | Good default |
| High precision | 0.7 | 1.5 | Fewer false positives |
| Knowledge base | 0.8 | 2.0 | Very strict |
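If you switch between these modes often, the table can be captured as a small preset helper. The helper and its names are my own illustration, not part of the library:

```python
# Presets mirroring the precision/recall tuning table above
SCORING_PRESETS = {
    "max_recall":     {"min_confidence": 0.0, "quality_weight": 0.5},
    "balanced":       {"min_confidence": 0.4, "quality_weight": 1.0},
    "high_precision": {"min_confidence": 0.7, "quality_weight": 1.5},
    "knowledge_base": {"min_confidence": 0.8, "quality_weight": 2.0},
}

def scoring_kwargs(use_case: str) -> dict:
    """Look up keyword arguments for a named use case."""
    return dict(SCORING_PRESETS[use_case])

# e.g. ScoringConfig(**scoring_kwargs("balanced"))
```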
PredicateComparisonConfig
Added in v0.2.0
Configuration for embedding-based predicate comparison and taxonomy matching. Requires the [embeddings] extra.
| Parameter | Type | Default | Description |
|---|---|---|---|
embedding_model | str | paraphrase-MiniLM-L6-v2 | Model for computing similarity |
similarity_threshold | float | 0.65 | Minimum similarity for taxonomy matching |
dedup_threshold | float | 0.65 | Minimum similarity to consider duplicates |
normalize_text | bool | True | Lowercase/strip predicates before embedding |
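Conceptually, taxonomy matching normalizes the extracted predicate, scores it against each canonical candidate, and accepts the best candidate only if it clears similarity_threshold. A standalone sketch of that decision logic, using a toy token-overlap similarity in place of embedding cosine similarity (this is an illustration, not the library's implementation):

```python
from typing import Callable, Optional

def match_to_canonical(
    predicate: str,
    taxonomy: list[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.65,
    normalize_text: bool = True,
) -> tuple[Optional[str], float]:
    """Return the best taxonomy match above the threshold, plus its score."""
    if normalize_text:
        predicate = predicate.lower().strip()
    best, best_score = None, 0.0
    for candidate in taxonomy:
        score = similarity(predicate, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score

def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap on underscore-separated tokens (stand-in for embeddings)."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)
```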
Custom thresholds:
from statement_extractor import (
PredicateComparisonConfig,
PredicateTaxonomy,
ExtractionOptions,
extract_statements
)
# Stricter matching for precision
config = PredicateComparisonConfig(
similarity_threshold=0.75,
dedup_threshold=0.80,
normalize_text=True
)
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "works_for", "located_in",
"partnered_with", "invested_in", "announced"
])
options = ExtractionOptions(
predicate_taxonomy=taxonomy,
predicate_config=config
)
result = extract_statements("Google bought YouTube in 2006.", options)
PipelineConfig
NEW in v0.5.0
Configuration for the 5-stage extraction pipeline. Controls which stages run, which plugins are enabled, and stage-specific options.
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled_stages | set[int] | {1, 2, 3, 4, 5} | Stages to run (1-5) |
enabled_plugins | set[str] | None | None | Plugins to enable (None = all) |
disabled_plugins | set[str] | set() | Plugins to disable |
fail_fast | bool | False | Stop on first error |
parallel_processing | bool | False | Enable parallel processing |
max_statements | int | None | None | Limit statements processed |
Stage selection examples:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run only splitting and extraction (stages 1-2)
config = PipelineConfig(enabled_stages={1, 2})
# Run stages 1-3 (skip canonicalization and labeling)
config = PipelineConfig(enabled_stages={1, 2, 3})
# From stage string
config = PipelineConfig.from_stage_string("1-3") # {1, 2, 3}
config = PipelineConfig.from_stage_string("1,2,5") # {1, 2, 5}
# Use presets
config = PipelineConfig.default() # All 5 stages
config = PipelineConfig.minimal() # Stages 1-2 only
Plugin selection examples:
# Disable specific plugins
config = PipelineConfig(
disabled_plugins={"sec_edgar_qualifier", "companies_house_qualifier"}
)
# Enable only specific plugins
config = PipelineConfig(
enabled_plugins={"t5_gemma_splitter", "gliner2_extractor", "person_qualifier"}
)
Stage-specific options:
config = PipelineConfig(
splitter_options={
"num_beams": 6,
"diversity_penalty": 1.2,
},
extractor_options={
"predicates_file": "/path/to/custom_predicates.json", # Custom predicate file
},
qualifier_options={
"timeout": 10.0, # API timeout
},
)
GLiNER2 Extractor Options:
| Option | Type | Default | Description |
|---|---|---|---|
predicates_file | str | Path | None | Path to custom predicates JSON file |
predicates | list[str] | None | Custom list of predicate names (overrides file) |
entity_types | list[str] | all types | Entity types to extract |
use_default_predicates | bool | True | Use 324 built-in predicates when no custom ones provided |
Custom Predicates File Format:
{
"category_name": {
"predicate_name": {
"description": "Description for semantic matching",
"threshold": 0.7
}
}
}
Example:
{
"employment": {
"works_for": {"description": "Employment relationship", "threshold": 0.75},
"manages": {"description": "Management relationship", "threshold": 0.7}
},
"ownership": {
"owns": {"description": "Ownership relationship", "threshold": 0.7},
"acquired": {"description": "Acquisition of entity", "threshold": 0.75}
}
}
Stage Combinations
Common stage combinations for different use cases:
| Use Case | Stages | Description |
|---|---|---|
| Fast extraction | {1, 2} | Basic triples with entity types |
| With qualifiers | {1, 2, 3} | Add qualifiers, identifiers, canonical forms, FQNs |
| With labels | {1, 2, 3, 4} | Add sentiment and other statement labels (no taxonomy) |
| Complete pipeline | {1, 2, 3, 4, 5} | All stages including taxonomy |
| Labeling only | {1, 2, 4} | Skip qualification/canonicalization |
# Fast extraction for high-volume processing
fast_config = PipelineConfig.minimal()
# Full resolution for knowledge graph building
full_config = PipelineConfig.default()
# Custom: qualifiers without external APIs
internal_config = PipelineConfig(
enabled_stages={1, 2, 3, 4, 5},
disabled_plugins={"gleif_qualifier", "companies_house_qualifier", "sec_edgar_qualifier"},
)
Entity Types
Corp-extractor classifies extracted subjects and objects into 13 entity types based on common Named Entity Recognition (NER) standards. Understanding these types helps you filter and process extracted statements effectively.
Complete Entity Type Reference
| Type | Description | Examples |
|---|---|---|
ORG | Organizations, companies, agencies | Apple Inc., United Nations, FBI |
PERSON | Individual people | Tim Cook, Elon Musk, Jane Doe |
GPE | Geopolitical entities (countries, cities, states) | United States, California, Paris |
LOC | Non-GPE locations | Pacific Ocean, Mount Everest, Central Park |
PRODUCT | Products and services | iPhone 15, Model S, Gmail |
EVENT | Events and occurrences | CES 2024, Annual Meeting, World Cup |
WORK_OF_ART | Creative works, documents, reports | Sustainability Report, Mona Lisa |
LAW | Legal documents and regulations | GDPR, Clean Air Act, Section 230 |
DATE | Dates and time periods | Q3 2024, January 15, 2030 |
MONEY | Monetary values | $4.7 billion, 100 million euros |
PERCENT | Percentages | 30%, 0.5%, 100% |
QUANTITY | Quantities and measurements | 1,000 employees, 50 megawatts |
UNKNOWN | Unclassified entities (fallback) | (varies) |
Accessing Entity Types in Code
Each extracted statement contains subject and object entities with a type attribute:
from statement_extractor import extract_statements
result = extract_statements("Apple CEO Tim Cook announced the iPhone 15.")
for stmt in result:
print(f"Subject: {stmt.subject.text} ({stmt.subject.type})")
print(f"Object: {stmt.object.text} ({stmt.object.type})")
Output:
Subject: Apple (ORG)
Object: Tim Cook (PERSON)
Subject: Tim Cook (PERSON)
Object: iPhone 15 (PRODUCT)
You can also import the EntityType enum for type checking and comparisons:
from statement_extractor import extract_statements, EntityType
result = extract_statements("Microsoft acquired Activision for $69 billion.")
for stmt in result:
if stmt.subject.type == EntityType.ORG:
print(f"Organization found: {stmt.subject.text}")
if stmt.object.type == EntityType.MONEY:
print(f"Monetary value: {stmt.object.text}")
Filtering by Entity Type
A common use case is extracting only statements involving specific entity types. Here is how to filter statements by subject or object type:
from statement_extractor import extract_statements, EntityType
text = """
Apple announced revenue of $94.8 billion for Q3 2024.
CEO Tim Cook presented at the company's Cupertino headquarters.
The new iPhone 16 features improved battery life of 22 hours.
"""
result = extract_statements(text)
# Filter for statements where subject is an organization
org_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.ORG
]
# Filter for statements involving monetary values
money_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.MONEY or stmt.object.type == EntityType.MONEY
]
# Filter for statements about people
person_statements = [
stmt for stmt in result
if stmt.subject.type == EntityType.PERSON or stmt.object.type == EntityType.PERSON
]
print(f"Found {len(org_statements)} statements from organizations")
print(f"Found {len(money_statements)} statements with monetary values")
print(f"Found {len(person_statements)} statements about people")
The UNKNOWN Type
The UNKNOWN entity type is used as a fallback when the model cannot confidently classify an entity into one of the 12 standard categories. This typically occurs with:
- Specialized domain terms: Technical jargon, industry-specific terminology
- Ambiguous entities: Terms that could fit multiple categories depending on context
- Novel entities: New terms or concepts not well-represented in training data
- Abstract concepts: Ideas or qualities that do not fit standard NER categories
from statement_extractor import extract_statements, EntityType
result = extract_statements("The synergy initiative improved operational efficiency.")
for stmt in result:
if stmt.subject.type == EntityType.UNKNOWN:
print(f"Unclassified entity: {stmt.subject.text}")
# Consider manual review or domain-specific handling
When you encounter UNKNOWN entities, consider:
- Manual review: Inspect the entity text to determine appropriate handling
- Domain mapping: Create application-specific mappings for recurring unknown entities
- Context analysis: Use surrounding statements to infer the entity's likely type
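A domain mapping from the second bullet might look like the following. The entity names and target types here are purely illustrative:

```python
# Application-specific fallback types for recurring UNKNOWN entities (illustrative)
DOMAIN_TYPE_MAP = {
    "synergy initiative": "EVENT",
    "carbon intensity": "QUANTITY",
}

def resolve_entity_type(text: str, detected_type: str) -> str:
    """Fall back to a domain mapping when the model returns UNKNOWN."""
    if detected_type != "UNKNOWN":
        return detected_type
    return DOMAIN_TYPE_MAP.get(text.lower().strip(), "UNKNOWN")
```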
Entity Type Standards
Corp-extractor's entity types are based on widely-adopted NER standards, including:
- OntoNotes 5.0: The primary source for entity type definitions
- ACE (Automatic Content Extraction): Influences the GPE vs LOC distinction
- CoNLL-2003: Foundational NER task categories
This alignment with established standards ensures compatibility with other NLP tools and facilitates integration into existing data pipelines.
Examples
This section provides practical examples demonstrating common use cases for the corp-extractor library.
Basic Extraction
Extract statements from text and format the output:
from statement_extractor import extract_statements
text = """
Microsoft announced a partnership with OpenAI in 2019.
The deal was valued at $1 billion and aimed to develop
artificial general intelligence.
"""
result = extract_statements(text)
# Iterate over statements
for stmt in result:
subject = f"{stmt.subject.text} ({stmt.subject.type})"
object_ = f"{stmt.object.text} ({stmt.object.type})"
print(f"{subject} -- {stmt.predicate} --> {object_}")
# Check confidence scores
for stmt in result:
score = stmt.confidence_score or 0.0
print(f"[{score:.2f}] {stmt}")
Output:
Microsoft (ORG) -- partnered with --> OpenAI
Microsoft (ORG) -- announced --> partnership
OpenAI (ORG) -- partnership valued at --> $1 billion
Microsoft (ORG) -- aims to develop --> artificial general intelligence
Batch Processing
Use the StatementExtractor class for processing multiple texts efficiently. The model loads once and is reused for all extractions:
from statement_extractor import StatementExtractor
# Initialize extractor with GPU
extractor = StatementExtractor(device="cuda")
texts = [
"Apple acquired Beats Electronics for $3 billion.",
"Google was founded by Larry Page and Sergey Brin in 1998.",
"Amazon announced a new fulfillment center in Texas."
]
# Process multiple texts
for text in texts:
result = extractor.extract(text)
print(f"Found {len(result)} statements in: {text[:40]}...")
for stmt in result:
print(f" - {stmt}")
print()
For CPU-only environments:
# Force CPU usage
extractor = StatementExtractor(device="cpu")
Confidence Filtering
Added in v0.2.0
Filter statements by confidence score to control precision vs recall:
from statement_extractor import extract_statements, ScoringConfig, ExtractionOptions
text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."
# High precision mode - only high-confidence statements
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)
print("High-confidence statements:")
for stmt in result:
print(f" [{stmt.confidence_score:.2f}] {stmt}")
You can also filter after extraction for more control:
# Extract all statements first
result = extract_statements(text)
# Apply custom thresholds
high_confidence = [s for s in result if (s.confidence_score or 0) >= 0.8]
medium_confidence = [s for s in result if 0.5 <= (s.confidence_score or 0) < 0.8]
low_confidence = [s for s in result if (s.confidence_score or 0) < 0.5]
print(f"High: {len(high_confidence)}, Medium: {len(medium_confidence)}, Low: {len(low_confidence)}")
Predicate Taxonomy
Map extracted predicates to a controlled vocabulary of canonical forms:
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements
# Define your canonical predicates
taxonomy = PredicateTaxonomy(predicates=[
"acquired", "founded", "works_for", "announced",
"invested_in", "partnered_with", "committed_to"
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
text = "Google bought YouTube in 2006. Sequoia Capital backed the video platform."
result = extract_statements(text, options)
# View predicate normalization
for stmt in result:
original = stmt.predicate
canonical = stmt.canonical_predicate
if canonical and canonical != original:
print(f"'{original}' -> '{canonical}'")
print(f" {stmt.subject.text} -- {canonical or original} --> {stmt.object.text}")
Output:
'bought' -> 'acquired'
Google -- acquired --> YouTube
'backed' -> 'invested_in'
Sequoia Capital -- invested_in --> YouTube
Load taxonomy from a file:
# predicates.txt contains one predicate per line
taxonomy = PredicateTaxonomy.from_file("predicates.txt")
Export Formats
Export extraction results in multiple formats for integration with other systems:
from statement_extractor import (
extract_statements,
extract_statements_as_json,
extract_statements_as_xml,
extract_statements_as_dict
)
text = "Netflix acquired Spry Fox, a game development studio, in 2022."
# JSON output (default 2-space indent)
json_str = extract_statements_as_json(text)
print(json_str)
# Compact JSON
json_compact = extract_statements_as_json(text, indent=None)
# XML output (raw model format)
xml_str = extract_statements_as_xml(text)
print(xml_str)
# Dictionary output (for programmatic use)
data = extract_statements_as_dict(text)
for stmt in data["statements"]:
print(f"{stmt['subject']['text']} -> {stmt['predicate']} -> {stmt['object']['text']}")
JSON output format:
{
"statements": [
{
"subject": {"text": "Netflix", "type": "ORG"},
"predicate": "acquired",
"object": {"text": "Spry Fox", "type": "ORG"},
"source_text": "Netflix acquired Spry Fox",
"confidence_score": 0.94
}
],
"source_text": "Netflix acquired Spry Fox, a game development studio, in 2022."
}
Disabling Embeddings
Skip embedding-based features for faster processing when you don't need predicate normalization or semantic deduplication:
from statement_extractor import ExtractionOptions, extract_statements
# Disable embedding-based deduplication
options = ExtractionOptions(
embedding_dedup=False, # Use exact string matching for dedup
predicate_taxonomy=None # No predicate normalization
)
result = extract_statements(text, options)
When to disable embeddings:
| Scenario | Recommendation |
|---|---|
| Speed critical | Disable embeddings |
| No GPU available | Consider disabling for faster CPU processing |
| Need semantic dedup | Keep embeddings enabled |
| Using predicate taxonomy | Keep embeddings enabled |
| Simple text, few duplicates | Disable embeddings |
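With embedding_dedup=False, deduplication falls back to exact string matching, conceptually along these lines (a sketch of the idea, not the library's implementation):

```python
def exact_dedup(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Drop exact duplicates (case/whitespace-insensitive), keeping first occurrences."""
    seen, unique = set(), []
    for triple in triples:
        key = tuple(part.lower().strip() for part in triple)
        if key not in seen:
            seen.add(key)
            unique.append(triple)
    return unique
```

Exact matching is fast but misses paraphrases ("bought" vs "acquired"), which is what the embedding-based pass catches.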
Custom Entity Canonicalization
Provide a custom function to normalize entity names:
from statement_extractor import ExtractionOptions, extract_statements
# Define a canonicalization function
def canonicalize_entity(text: str) -> str:
"""Normalize entity names to canonical forms."""
mappings = {
"apple": "Apple Inc.",
"apple inc": "Apple Inc.",
"apple inc.": "Apple Inc.",
"google": "Alphabet Inc.",
"google llc": "Alphabet Inc.",
"alphabet": "Alphabet Inc.",
"msft": "Microsoft Corporation",
"microsoft": "Microsoft Corporation",
}
return mappings.get(text.lower().strip(), text)
options = ExtractionOptions(entity_canonicalizer=canonicalize_entity)
text = "Apple and Google announced a partnership. Microsoft joined later."
result = extract_statements(text, options)
for stmt in result:
# Entities are now canonicalized
print(f"{stmt.subject.text} -- {stmt.predicate} --> {stmt.object.text}")
Output:
Apple Inc. -- partnered with --> Alphabet Inc.
Microsoft Corporation -- joined --> partnership
Full Pipeline Example
Combining multiple features for production use:
from statement_extractor import (
StatementExtractor,
ExtractionOptions,
ScoringConfig,
PredicateTaxonomy,
PredicateComparisonConfig
)
# Configure scoring for high precision
scoring = ScoringConfig(
min_confidence=0.6,
quality_weight=1.0,
redundancy_penalty=0.5
)
# Define canonical predicates
taxonomy = PredicateTaxonomy.from_list([
"acquired", "founded", "invested_in", "partnered_with",
"announced", "launched", "hired", "appointed"
])
# Configure predicate matching
predicate_config = PredicateComparisonConfig(
similarity_threshold=0.7,
dedup_threshold=0.8
)
# Initialize extractor
extractor = StatementExtractor(
device="cuda",
predicate_taxonomy=taxonomy,
predicate_config=predicate_config,
scoring_config=scoring
)
# Configure extraction options
options = ExtractionOptions(
num_beams=6,
diversity_penalty=1.2,
deduplicate=True,
merge_beams=True
)
# Process text
text = """
Amazon Web Services announced a strategic partnership with Anthropic,
investing up to $4 billion in the AI safety startup. The deal, announced
in September 2023, makes AWS Anthropic's primary cloud provider.
"""
result = extractor.extract(text, options)
print(f"Extracted {len(result)} high-confidence statements:\n")
for stmt in result:
canonical = stmt.canonical_predicate or stmt.predicate
score = stmt.confidence_score or 0.0
print(f"[{score:.2f}] {stmt.subject.text} ({stmt.subject.type})")
print(f" -- {canonical} -->")
print(f" {stmt.object.text} ({stmt.object.type})")
print()
Output:
Extracted 4 high-confidence statements:
[0.92] Amazon Web Services (ORG)
-- partnered_with -->
Anthropic (ORG)
[0.88] Amazon Web Services (ORG)
-- invested_in -->
Anthropic (ORG)
[0.85] Amazon Web Services (ORG)
-- invested_in -->
$4 billion (MONEY)
[0.78] AWS (ORG)
-- is primary cloud provider for -->
Anthropic (ORG)
Pipeline Examples
NEW in v0.5.0
Full Pipeline with Corporate Text
Process corporate announcements with full entity resolution:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
pipeline = ExtractionPipeline()
text = """
Amazon CEO Andy Jassy announced plans to hire 10,000 workers in the UK.
The expansion will focus on Amazon Web Services operations in London.
"""
ctx = pipeline.process(text)
print(f"Extracted {ctx.statement_count} statements\n")
for stmt in ctx.labeled_statements:
# FQN includes role and organization
print(f"Subject: {stmt.subject_fqn}")
print(f"Predicate: {stmt.statement.predicate}")
print(f"Object: {stmt.object_fqn}")
# Access labels
for label in stmt.labels:
print(f" {label.label_type}: {label.label_value}")
# Access qualifiers
subject_quals = stmt.subject_canonical.qualifiers
if subject_quals.role:
print(f" Role: {subject_quals.role}")
if subject_quals.org:
print(f" Organization: {subject_quals.org}")
print("-" * 40)
Output:
Extracted 2 statements
Subject: Andy Jassy (CEO, Amazon)
Predicate: announced
Object: plans to hire 10,000 workers in the UK
sentiment: positive
Role: CEO
Organization: Amazon
----------------------------------------
Subject: Amazon (AMZN)
Predicate: expanding operations in
Object: London (UK)
sentiment: positive
----------------------------------------
Running Specific Stages
Skip qualification and canonicalization for faster processing:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Run only stages 1 and 2 (splitting + extraction)
config = PipelineConfig(enabled_stages={1, 2})
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("Tim Cook is CEO of Apple Inc.")
# Access Stage 2 output (PipelineStatement)
for stmt in ctx.statements:
print(f"{stmt.subject.text} ({stmt.subject.type.value})")
print(f" --[{stmt.predicate}]-->")
print(f" {stmt.object.text} ({stmt.object.type.value})")
print(f" Confidence: {stmt.confidence_score:.2f}")Using Specific Plugins
Enable only internal plugins (no external API calls):
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Disable external API plugins
config = PipelineConfig(
disabled_plugins={
"gleif_qualifier",
"companies_house_qualifier",
"sec_edgar_qualifier",
}
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("OpenAI CEO Sam Altman announced GPT-5.")
# Will use person_qualifier (local LLM) but skip external lookups
for stmt in ctx.labeled_statements:
print(f"{stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")Custom Predicates File
Use a custom predicates JSON file instead of the 324 default predicates:
from statement_extractor.pipeline import PipelineConfig, ExtractionPipeline
# Use custom predicates file
config = PipelineConfig(
extractor_options={
"predicates_file": "/path/to/my_predicates.json"
}
)
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("John works for Apple Inc.")
# All matching relations are returned
for stmt in ctx.statements:
print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
print(f" Category: {stmt.predicate_category}")
print(f" Confidence: {stmt.confidence_score:.2f}")Custom predicates file format:
{
"employment": {
"works_for": {
"description": "Employment relationship where person works for organization",
"threshold": 0.75
},
"manages": {
"description": "Management relationship where person manages entity",
"threshold": 0.7
}
},
"ownership": {
"owns": {
"description": "Ownership relationship",
"threshold": 0.7
},
"acquired": {
"description": "Acquisition of one entity by another",
"threshold": 0.75
}
}
}
Each category should have fewer than 25 predicates to stay within GLiNER2's training limit for optimal performance.
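A quick way to enforce the per-category limit is to lint the file before passing it to the pipeline. The helper below is a hypothetical utility, not part of corp-extractor; it only assumes the JSON layout shown above (categories mapping to predicate objects):

```python
import json

# Hypothetical helper (not a corp-extractor API): count predicates per
# category in a custom predicates file and warn when a category exceeds
# the 25-predicate guideline.
def check_predicates_file(path: str, limit: int = 25) -> dict:
    with open(path) as f:
        taxonomy = json.load(f)
    counts = {category: len(predicates) for category, predicates in taxonomy.items()}
    for category, n in counts.items():
        if n >= limit:
            print(f"warning: category '{category}' has {n} predicates (limit is {limit})")
    return counts
```

Running it over the example file above would report 2 predicates each for `employment` and `ownership`, comfortably under the limit.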
Accessing Stage Outputs
Access results from each pipeline stage:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
ctx = pipeline.process("Microsoft CEO Satya Nadella announced Azure growth.")
# Stage 1: Raw triples
print("=== Stage 1: Raw Triples ===")
for triple in ctx.raw_triples:
print(f" {triple.subject_text} -> {triple.predicate_text} -> {triple.object_text}")
# Stage 2: Statements with types
print("\n=== Stage 2: Statements ===")
for stmt in ctx.statements:
print(f" {stmt.subject.text} ({stmt.subject.type.value}) -> {stmt.predicate}")
# Stage 3: Qualified entities
print("\n=== Stage 3: Qualified Entities ===")
for ref, qualified in ctx.qualified_entities.items():
quals = qualified.qualifiers
print(f" {qualified.original_text}")
if quals.role:
print(f" Role: {quals.role}")
if quals.org:
print(f" Org: {quals.org}")
for id_type, id_value in quals.identifiers.items():
print(f" {id_type}: {id_value}")
# Stage 4: Canonical entities
print("\n=== Stage 4: Canonical Entities ===")
for ref, canonical in ctx.canonical_entities.items():
print(f" {canonical.fqn}")
if canonical.canonical_match:
print(f" Method: {canonical.canonical_match.match_method}")
print(f" Confidence: {canonical.canonical_match.match_confidence:.2f}")
# Stage 5: Labeled statements
print("\n=== Stage 5: Labeled Statements ===")
for stmt in ctx.labeled_statements:
print(f" {stmt.subject_fqn} -> {stmt.statement.predicate} -> {stmt.object_fqn}")
for label in stmt.labels:
print(f" {label.label_type}: {label.label_value}")
# Stage 6: Taxonomy results (multiple labels per statement)
print("\n=== Stage 6: Taxonomy Results ===")
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f" Statement: {source_text[:40]}...")
for result in results:
print(f" {result.full_label} (confidence: {result.confidence:.2f})")
# Timings
print("\n=== Stage Timings ===")
for stage, duration in ctx.stage_timings.items():
print(f" {stage}: {duration:.3f}s")Batch Pipeline Processing
Process multiple documents efficiently:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
# Use minimal stages for speed
config = PipelineConfig.minimal() # Stages 1-2 only
pipeline = ExtractionPipeline(config)
documents = [
"Apple announced a new MacBook Pro.",
"Google acquired Fitbit for $2.1 billion.",
"Tesla CEO Elon Musk unveiled the Cybertruck.",
]
all_statements = []
for doc in documents:
ctx = pipeline.process(doc)
for stmt in ctx.statements:
all_statements.append({
"subject": stmt.subject.text,
"subject_type": stmt.subject.type.value,
"predicate": stmt.predicate,
"object": stmt.object.text,
"object_type": stmt.object.type.value,
"confidence": stmt.confidence_score,
"source": doc,
})
print(f"Extracted {len(all_statements)} statements from {len(documents)} documents")Taxonomy Classification
Stage 6
Classify statements against large taxonomies. Multiple labels may match a single statement above the confidence threshold:
from statement_extractor.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
text = """
Apple announced a commitment to carbon neutrality by 2030.
The company also reported reducing packaging waste by 75%.
"""
ctx = pipeline.process(text)
# Access taxonomy classifications (multiple labels per statement)
for (source_text, taxonomy_name), results in ctx.taxonomy_results.items():
print(f"Statement: {source_text[:50]}...")
print(f" Taxonomy: {taxonomy_name}")
print(f" Labels:")
for result in results:
print(f" - {result.full_label} (confidence: {result.confidence:.2f})")
print()
Output:
Statement: Apple announced a commitment to carbon neutrality...
Taxonomy: esg_topics
Labels:
- environment:carbon_emissions (confidence: 0.87)
- environment_benefit:emissions_reduction (confidence: 0.72)
- governance:sustainability_commitments (confidence: 0.45)
Statement: The company also reported reducing packaging waste...
Taxonomy: esg_topics
Labels:
- environment:waste_management (confidence: 0.92)
- environment_benefit:waste_reduction (confidence: 0.85)
Pipeline with Error Handling
Handle errors and warnings gracefully:
from statement_extractor.pipeline import ExtractionPipeline, PipelineConfig
config = PipelineConfig(fail_fast=False) # Continue on errors
pipeline = ExtractionPipeline(config)
ctx = pipeline.process("Some text that might cause issues...")
# Check for errors
if ctx.has_errors:
print("Errors occurred:")
for error in ctx.processing_errors:
print(f" - {error}")
# Check for warnings
if ctx.processing_warnings:
print("Warnings:")
for warning in ctx.processing_warnings:
print(f" - {warning}")
# Process results that succeeded
print(f"\nSuccessfully extracted {ctx.statement_count} statements")Deployment
Local Inference
Hardware Requirements:
| Resource | Minimum | Notes |
|---|---|---|
| CPU-only | ~4GB RAM | ~30s per extraction |
| GPU | ~2GB VRAM | ~2s per extraction |
| Disk | ~1.5GB | Model download size |
Setup steps:
# Install the library
pip install corp-extractor[embeddings]
# For GPU support, install PyTorch with CUDA first
pip install torch --index-url https://download.pytorch.org/whl/cu121
Running locally:
from statement_extractor import StatementExtractor
# Auto-detect GPU or fall back to CPU
extractor = StatementExtractor()
# Or explicitly set device
extractor = StatementExtractor(device="cuda") # GPU
extractor = StatementExtractor(device="cpu") # CPUThe model uses bfloat16 precision on GPU for faster inference and lower memory usage, and float32 on CPU.
RunPod Serverless
Why RunPod:
- Pay-per-use: ~$0.0002/sec on average
- Scales to zero: No cost when idle
- No infrastructure: Managed GPU containers
Setup steps:
- Clone the repository and build the Docker image:
cd runpod
docker build --platform linux/amd64 -t your-username/statement-extractor .
- Push to Docker Hub:
docker push your-username/statement-extractor
- Create a RunPod serverless endpoint:
- Go to RunPod Console
- Create new endpoint with your Docker image
- Configure GPU type (RTX 3090 recommended)
- Set Active Workers: 0, Max Workers: 1-3
- Call the API:
curl -X POST https://api.runpod.ai/v2/YOUR_ENDPOINT/runsync \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": {"text": "<page>Your text here</page>"}}'Pricing:
| GPU Type | Cost | Notes |
|---|---|---|
| RTX 3090 | ~$0.00031/sec | Recommended |
| Idle | $0 | Scales to zero |
Typical extraction costs less than $0.001 per request at ~2s processing time.
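As a back-of-envelope check on those figures, multiplying the RTX 3090 rate by the ~2s processing time gives the per-request cost:

```python
# Per-request cost from the table above: RTX 3090 rate times ~2s of
# processing time per extraction.
rate_per_sec = 0.00031       # USD/sec on RTX 3090
seconds_per_request = 2.0
cost = rate_per_sec * seconds_per_request
print(f"${cost:.5f} per request")  # prints $0.00062 per request
```

At roughly $0.00062 per request, this stays under the $0.001 figure quoted above.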