EXTRACT STATEMENTS.
MAP RELATIONSHIPS.
A Python library designed to analyze complex text and extract relationship information about people and organizations. Runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies. Uses fine-tuned T5-Gemma 2 for statement splitting and coreference resolution, plus GLiNER2 for entity extraction. Includes a database of 10M+ organizations and 40M+ people with quantized embeddings for fast entity qualification (~100GB disk for all models and data).
Default Predicates
The extractor uses GLiNER2 relation extraction with these default predicates. Each predicate has a confidence threshold (typically 0.65-0.8) that filters low-confidence matches. You can override these defaults by providing a custom predicates_file parameter to the GLiNER2Extractor.
Statement Taxonomy
Stage 5 of the pipeline classifies statements against this ESG taxonomy using embedding similarity or MNLI inference. Each topic includes descriptions to guide classification. You can provide a custom taxonomy via the taxonomy_file parameter to the taxonomy classifier plugins.
Quick Start
# Command Line Interface (v0.2.4+)
# ============================================
# Install globally (recommended)
# ============================================
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"
# Or using pipx
pipx install "corp-extractor[embeddings]"
# Or using pip
pip install "corp-extractor[embeddings]"
# ============================================
# Quick run with uvx (no install)
# ============================================
# Note: First run downloads the model (~1.5GB)
uvx corp-extractor "Apple announced a new iPhone."
# ============================================
# Usage Examples
# ============================================
# Extract from text argument
corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
# Extract from file
corp-extractor -f article.txt
# Pipe from stdin
cat article.txt | corp-extractor -
# Output as JSON (with full metadata)
corp-extractor "Tim Cook is CEO of Apple." --json
# Output as XML (raw model output)
corp-extractor -f article.txt --xml
# Verbose output with confidence scores
corp-extractor -f article.txt --verbose
# Use more beams for better quality
corp-extractor -f article.txt --beams 8
# Use custom predicate taxonomy
corp-extractor -f article.txt --taxonomy predicates.txt
# Use GPU explicitly
corp-extractor -f article.txt --device cuda
# Filter low-confidence results
corp-extractor -f article.txt --min-confidence 0.7
# ============================================
# All CLI Options
# ============================================
# corp-extractor --help
#
# -f, --file PATH Read input from file
# -o, --output [table|json|xml] Output format (default: table)
# --json Output as JSON (shortcut)
# --xml Output as XML (shortcut)
# -b, --beams INTEGER Number of beams (default: 4)
# --diversity FLOAT Diversity penalty (default: 1.0)
# --max-tokens INTEGER Max tokens to generate (default: 2048)
# --no-dedup Disable deduplication
# --no-embeddings Disable embedding-based dedup (faster)
# --no-merge Disable beam merging
# --predicates PATH Load predicate list for GLiNER2 relation extraction
# --all-triples Keep all candidate triples (default: best per source)
# --dedup-threshold FLOAT Deduplication threshold (default: 0.65)
# --min-confidence FLOAT Min confidence filter (default: 0)
# --taxonomy PATH Load predicate taxonomy from file
# --taxonomy-threshold FLOAT Taxonomy matching threshold (default: 0.5)
# --device [auto|cuda|mps|cpu] Device to use (default: auto)
# -v, --verbose Show confidence scores and metadata
# -q, --quiet Suppress progress messages
# --version                    Show version
For AI Assistants
SKILL.md for AI Assistants
Add to your project's CLAUDE.md or .cursorrules to enable statement extraction
# SKILL: Statement Extraction with corp-extractor
Use the `corp-extractor` Python library to extract structured subject-predicate-object statements from text. Returns Pydantic models with confidence scores.
## Installation
```bash
pip install "corp-extractor[embeddings]"  # Recommended: includes semantic deduplication
```
For GPU support, install PyTorch with CUDA first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor[embeddings]
```
## Quick Usage
```python
from statement_extractor import extract_statements
result = extract_statements("""
Apple Inc. announced the iPhone 15 at their September event.
Tim Cook presented the new features to customers worldwide.
""")
for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type})")
    print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
```
## Output Formats
```python
from statement_extractor import (
    extract_statements,          # Returns ExtractionResult with Statement objects
    extract_statements_as_json,  # Returns JSON string
    extract_statements_as_xml,   # Returns XML string
    extract_statements_as_dict,  # Returns dict
)
```
## Statement Object Structure
Each `Statement` has:
- `subject.text` - Subject entity text
- `subject.type` - Entity type (ORG, PERSON, GPE, etc.)
- `predicate` - The relationship/action
- `object.text` - Object entity text
- `object.type` - Object entity type
- `source_text` - Original sentence
- `confidence_score` - Groundedness score (0-1)
- `canonical_predicate` - Normalized predicate (if taxonomy used)
## Entity Types
ORG, PERSON, GPE (countries/cities), LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, DATE, MONEY, PERCENT, QUANTITY, UNKNOWN
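A minimal sketch of filtering results by entity type, using plain dataclasses as stand-ins for the library's Pydantic models (the statements below are hand-written, not real extraction output):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    type: str  # ORG, PERSON, GPE, ...

@dataclass
class Statement:
    subject: Entity
    predicate: str
    object: Entity
    confidence_score: float

# Toy results standing in for extract_statements() output
statements = [
    Statement(Entity("Apple Inc.", "ORG"), "announced", Entity("iPhone 15", "PRODUCT"), 0.91),
    Statement(Entity("Tim Cook", "PERSON"), "works_for", Entity("Apple Inc.", "ORG"), 0.88),
]

# Keep only statements whose subject is an organization
org_statements = [s for s in statements if s.subject.type == "ORG"]
print([s.predicate for s in org_statements])  # -> ['announced']
```

The same pattern applies to a real `ExtractionResult`, since each statement exposes `subject.type` and `object.type`.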
## Precision Mode (Filter Low-Confidence)
```python
from statement_extractor import ExtractionOptions, ScoringConfig
options = ExtractionOptions(
    scoring_config=ScoringConfig(min_confidence=0.7)
)
result = extract_statements(text, options)
```
## Predicate Taxonomy (Normalize Predicates)
```python
from statement_extractor import PredicateTaxonomy, ExtractionOptions
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in"
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements(text, options)
# "bought" -> "acquired" via semantic similarity
for stmt in result:
    if stmt.canonical_predicate:
        print(f"Normalized: {stmt.predicate} -> {stmt.canonical_predicate}")
```
## Batch Processing
```python
from statement_extractor import StatementExtractor
extractor = StatementExtractor(device="cuda") # or "cpu"
for text in texts:
    result = extractor.extract(text)
```
## Best Practices
1. Use `[embeddings]` extra for semantic deduplication
2. Filter by `confidence_score >= 0.7` for high precision
3. Use predicate taxonomies for consistent knowledge graphs
4. Process large documents in chunks (by paragraph/section)
5. GPU recommended for production (~2GB VRAM needed)
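For point 4, a simple helper (not part of the library; `chunk_paragraphs` is a hypothetical name) that splits a document into paragraph chunks before extraction:

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Split text on blank lines, merging short paragraphs up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "X" * 3000
print(len(chunk_paragraphs(doc)))  # -> 2
```

Each chunk can then be passed to `extractor.extract(chunk)` in a loop.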
## Links
- PyPI: https://pypi.org/project/corp-extractor/
- Docs: https://statement-extractor.corp-o-rate.com/docs
- Model: https://huggingface.co/Corp-o-Rate-Community/statement-extractor
Multiple Models, One Pipeline
Corp-extractor uses multiple fine-tuned small models to transform unstructured text into structured relationship data—all running locally on your hardware with no external services.
Pipeline stages:
- T5-Gemma 2 (540M params) — Splits text into atomic statements and resolves coreferences. Trained on 70,000+ pages of corporate and news documents.
- GLiNER2 (205M params) — Extracts subject/predicate/object with entity types (ORG, PERSON, GPE, etc.) and 324 predefined predicates.
- Entity Database — Qualifies entities against 10M+ organizations and 40M+ people with quantized embeddings for sub-second lookups.
- BERT classifiers — Small models for sentiment labeling and embedding similarity for taxonomy classification.
Hardware: Requires ~100GB disk for all models and database. Runs on RTX 4090+ or Apple M1/M2/M3 with 16GB+ RAM.
How It Works
5-Stage Pipeline Architecture v0.8.0
Text flows through a modular plugin-based pipeline. Each stage transforms the data progressively, from raw text to fully qualified, labeled statements with taxonomy classifications.
Pipeline Stages
| Stage | Name | Purpose | Key Technology |
|---|---|---|---|
| 1 | Splitting | Text → Atomic Statements | T5-Gemma2 (540M params) with Diverse Beam Search |
| 2 | Extraction | Atomic Statements → Typed Triples | GLiNER2 (205M params) entity recognition |
| 3 | Qualification | Entities → Canonical names, identifiers, FQN | Company embedding database (SEC, GLEIF, UK Companies House) |
| 4 | Labeling | Add simple classifications | Multi-choice classifiers (sentiment, relation type) |
| 5 | Taxonomy | Classify against ESG taxonomy | MNLI zero-shot or embedding similarity |
Data Flow
Data is progressively enriched through each stage, from raw text to fully qualified statements with entity types, canonical names, sentiment labels, and taxonomy classifications.
Technical Features
Diverse Beam Search
The T5-Gemma2 model uses Diverse Beam Search (Vijayakumar et al., 2016) to generate 4 diverse candidate outputs, exploring multiple interpretations of the text.
GLiNER2 Entity Extraction
GLiNER2 (205M params) refines entity boundaries and scores how "entity-like" subjects and objects are. Uses 324 default predicates across 21 categories for relation extraction.
Entity Qualification
Company embedding database (~100K+ SEC, ~3M GLEIF, ~5M UK companies) provides fast vector similarity search to resolve entities to canonical names with identifiers (LEI, CIK, company numbers) and FQN.
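The idea behind the lookup can be illustrated with a toy nearest-neighbour search (the real database uses quantized embeddings of company names at scale; the vectors below are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embedding index: canonical name -> (identifier, vector)
index = {
    "Apple Inc.": ("CIK 0000320193", [0.9, 0.1, 0.2]),
    "Alphabet Inc.": ("CIK 0001652044", [0.1, 0.8, 0.3]),
}

def qualify(query_vec):
    """Return the canonical name and identifier of the closest entity."""
    name, (ident, _) = max(index.items(), key=lambda kv: cosine(query_vec, kv[1][1]))
    return name, ident

print(qualify([0.85, 0.15, 0.25]))  # closest to the "Apple Inc." vector
```

In production this search runs over tens of millions of pre-computed vectors, which is why quantization matters for keeping lookups sub-second.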
Taxonomy Classification
Statements are classified against an ESG taxonomy using either MNLI zero-shot classification or embedding similarity, returning multiple labels above confidence thresholds.
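As a mechanism sketch (hand-picked scores, not real model output), multi-label classification with a confidence threshold looks like:

```python
def classify(scores: dict[str, float], threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return all taxonomy labels scoring above the threshold, best first."""
    labels = [(topic, s) for topic, s in scores.items() if s >= threshold]
    return sorted(labels, key=lambda kv: kv[1], reverse=True)

# Hypothetical similarity scores for one statement against ESG topics
scores = {"Emissions": 0.82, "Labor Practices": 0.61, "Governance": 0.18}
print(classify(scores))  # -> [('Emissions', 0.82), ('Labor Practices', 0.61)]
```

Whether the scores come from MNLI entailment probabilities or embedding cosine similarity, the thresholding step is the same.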
Known Limitations
1. Complex sentences: Very long sentences with multiple nested clauses may result in incomplete extraction or incorrect predicate assignment.
2. Implicit relationships: The model works best with explicit statements. Implied or contextual relationships may be missed.
3. Domain specificity: Trained primarily on corporate/news text. Performance may vary on highly technical or specialized content.
4. Coreference limits: While the model resolves many pronouns, complex anaphora chains or ambiguous references may not resolve correctly.
5. Entity type coverage: Some specialized entity types (e.g., scientific terms, technical products) may default to UNKNOWN.
Roadmap & Areas for Improvement
✓ Recently Completed
- 5-Stage Pipeline Architecture (v0.8.0) — Merged qualification + canonicalization into single stage
- Company Embedding Database (v0.8.0) — Fast vector search for ~100K+ SEC, ~3M GLEIF, ~5M UK companies
- Taxonomy Classification (v0.5.0) — MNLI + embedding-based ESG taxonomy classification
- Entity Qualification (v0.5.0) — LEI, ticker, CIK lookups with canonical names and FQN
- Statement Labeling (v0.5.0) — Sentiment analysis and relation type classification
- GLiNER2 Integration (v0.4.0) — 205M param model for entity recognition and relation extraction
Larger Training Dataset
Expanding beyond 77K examples with more diverse sources
Multi-hop Reasoning
Better handling of statements that span multiple sentences
Negation Handling
Better detection of negative statements and contradictions
Knowledge Graph Integration
Link extracted entities to external knowledge bases (Wikidata, etc.)
We Need Your Feedback
This model is actively being improved. If you encounter incorrect extractions, missing statements, or have suggestions for improvement, we'd love to hear from you. Use the "Correct" button above to submit fixes, or reach out directly.
neil@corp-o-rate.com
About Corp-o-Rate
The Glassdoor of ESG
Real corporate intelligence from real people. Track what companies actually do, not what they claim.
Corp-o-Rate is building a community-powered corporate accountability platform. We believe that glossy sustainability reports and PR-polished ESG claims don't tell the full story. Our mission is to surface the truth about corporate behavior through crowdsourced intelligence, AI-powered analysis, and transparent data.
This statement extraction model is one piece of that puzzle — automatically extracting relationships and meaningful statements from research, news, and corporate documents. It's available as the corp-extractor Python library on PyPI. This is the first released piece of our analysis pipeline, and we'll release more reusable components as we progress.
Community-Driven
Powered by employees, consumers, and researchers sharing real knowledge about corporate practices.
AI-Powered
Using NLP and knowledge graphs to structure, connect, and analyze corporate claims at scale.
100% Independent
No corporate sponsors. No conflicts of interest. Just transparent corporate intelligence.
We're Pre-Funding & Running on Fumes
Corp-o-Rate is currently bootstrapped and self-funded. We're building in public, shipping what we can, and working toward our mission one step at a time. If you believe in corporate accountability and transparent business intelligence, we'd love your support.
Help us train better models
Help us scale the platform
Data, research, or distribution
Shop smarter. Invest better. Know which companies match your values.