EXTRACT STATEMENTS.
MAP RELATIONSHIPS.

A Python library designed to analyze complex text and extract relationship information about people and organizations. Runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies. Uses fine-tuned T5-Gemma 2 for statement splitting and coreference resolution, plus GLiNER2 for entity extraction. Includes a database of 10M+ organizations and 40M+ people with quantized embeddings for fast entity qualification (~100GB disk for all models and data).



Default Predicates

The extractor uses GLiNER2 relation extraction with these default predicates. Each predicate has a confidence threshold (typically 0.65-0.8) that filters low-confidence matches. You can override these defaults by providing a custom predicates_file parameter to the GLiNER2Extractor.
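The per-predicate thresholding described above is a simple score filter. A minimal sketch in plain Python (the predicate names and threshold values here are illustrative, not the library's actual defaults):

```python
# Illustrative per-predicate confidence thresholds (not the real defaults).
THRESHOLDS = {"acquired": 0.7, "works_for": 0.65, "headquartered_in": 0.8}

def filter_matches(matches, default_threshold=0.65):
    """Keep only relation matches at or above their predicate's threshold."""
    kept = []
    for predicate, score in matches:
        if score >= THRESHOLDS.get(predicate, default_threshold):
            kept.append((predicate, score))
    return kept

matches = [("acquired", 0.72), ("works_for", 0.50), ("headquartered_in", 0.79)]
print(filter_matches(matches))  # only ("acquired", 0.72) survives
```

A custom predicates_file simply swaps in your own predicate list in place of the defaults.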


Statement Taxonomy

Stage 5 of the pipeline classifies statements against this ESG taxonomy using embedding similarity or MNLI inference. Each topic includes a description to guide classification. You can provide a custom taxonomy via the taxonomy_file parameter to the taxonomy classifier plugins.
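The embedding-similarity route amounts to comparing a statement vector against each topic vector and keeping every topic above the threshold. A toy sketch with made-up 3-d vectors standing in for real sentence embeddings (the topic names are illustrative, not the actual ESG taxonomy):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" standing in for real sentence-embedding vectors.
topic_vectors = {
    "emissions": [1.0, 0.1, 0.0],
    "labour": [0.0, 1.0, 0.2],
}

def classify(statement_vec, threshold=0.5):
    """Return every taxonomy topic whose similarity clears the threshold."""
    return [
        (topic, cosine(statement_vec, vec))
        for topic, vec in topic_vectors.items()
        if cosine(statement_vec, vec) >= threshold
    ]

print(classify([0.9, 0.2, 0.0]))  # only "emissions" clears 0.5
```

The MNLI route replaces the cosine score with a zero-shot entailment score, but the thresholding shape is the same.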


Quick Start

# Command Line Interface (v0.2.4+)

# ============================================
# Install globally (recommended)
# ============================================

# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"

# Or using pipx
pipx install "corp-extractor[embeddings]"

# Or using pip
pip install "corp-extractor[embeddings]"

# ============================================
# Quick run with uvx (no install)
# ============================================
# Note: First run downloads the model (~1.5GB)
uvx corp-extractor "Apple announced a new iPhone."

# ============================================
# Usage Examples
# ============================================

# Extract from text argument
corp-extractor "Apple Inc. announced the iPhone 15 at their September event."

# Extract from file
corp-extractor -f article.txt

# Pipe from stdin
cat article.txt | corp-extractor -

# Output as JSON (with full metadata)
corp-extractor "Tim Cook is CEO of Apple." --json

# Output as XML (raw model output)
corp-extractor -f article.txt --xml

# Verbose output with confidence scores
corp-extractor -f article.txt --verbose

# Use more beams for better quality
corp-extractor -f article.txt --beams 8

# Use custom predicate taxonomy
corp-extractor -f article.txt --taxonomy predicates.txt

# Use GPU explicitly
corp-extractor -f article.txt --device cuda

# Filter low-confidence results
corp-extractor -f article.txt --min-confidence 0.7

# ============================================
# All CLI Options
# ============================================
# corp-extractor --help
#
# -f, --file PATH              Read input from file
# -o, --output [table|json|xml] Output format (default: table)
# --json                       Output as JSON (shortcut)
# --xml                        Output as XML (shortcut)
# -b, --beams INTEGER          Number of beams (default: 4)
# --diversity FLOAT            Diversity penalty (default: 1.0)
# --max-tokens INTEGER         Max tokens to generate (default: 2048)
# --no-dedup                   Disable deduplication
# --no-embeddings              Disable embedding-based dedup (faster)
# --no-merge                   Disable beam merging
# --predicates PATH            Load predicate list for GLiNER2 relation extraction
# --all-triples                Keep all candidate triples (default: best per source)
# --dedup-threshold FLOAT      Deduplication threshold (default: 0.65)
# --min-confidence FLOAT       Min confidence filter (default: 0)
# --taxonomy PATH              Load predicate taxonomy from file
# --taxonomy-threshold FLOAT   Taxonomy matching threshold (default: 0.5)
# --device [auto|cuda|mps|cpu] Device to use (default: auto)
# -v, --verbose                Show confidence scores and metadata
# -q, --quiet                  Suppress progress messages
# --version                    Show version

For AI Assistants

SKILL.md for AI Assistants

Add to your project's CLAUDE.md or .cursorrules to enable statement extraction

# SKILL: Statement Extraction with corp-extractor

Use the `corp-extractor` Python library to extract structured subject-predicate-object statements from text. Returns Pydantic models with confidence scores.

## Installation

```bash
pip install "corp-extractor[embeddings]"  # Recommended: includes semantic deduplication
```

For GPU support, install PyTorch with CUDA first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor[embeddings]
```

## Quick Usage

```python
from statement_extractor import extract_statements

result = extract_statements("""
    Apple Inc. announced the iPhone 15 at their September event.
    Tim Cook presented the new features to customers worldwide.
""")

for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type})")
    print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
```

## Output Formats

```python
from statement_extractor import (
    extract_statements,        # Returns ExtractionResult with Statement objects
    extract_statements_as_json,  # Returns JSON string
    extract_statements_as_xml,   # Returns XML string
    extract_statements_as_dict,  # Returns dict
)
```

## Statement Object Structure

Each `Statement` has:
- `subject.text` - Subject entity text
- `subject.type` - Entity type (ORG, PERSON, GPE, etc.)
- `predicate` - The relationship/action
- `object.text` - Object entity text
- `object.type` - Object entity type
- `source_text` - Original sentence
- `confidence_score` - Groundedness score (0-1)
- `canonical_predicate` - Normalized predicate (if taxonomy used)
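
For reference, an illustrative statement rendered as a plain dict mirroring the fields above (the values are made up; the library returns Pydantic models, not dicts):

```python
# Made-up example values; real statements come back as Pydantic models.
stmt = {
    "subject": {"text": "Tim Cook", "type": "PERSON"},
    "predicate": "is CEO of",
    "object": {"text": "Apple", "type": "ORG"},
    "source_text": "Tim Cook is CEO of Apple.",
    "confidence_score": 0.92,
    "canonical_predicate": "works_for",
}
print(f'{stmt["subject"]["text"]} --[{stmt["predicate"]}]--> {stmt["object"]["text"]}')
```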

## Entity Types

ORG, PERSON, GPE (countries/cities), LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, DATE, MONEY, PERCENT, QUANTITY, UNKNOWN

## Precision Mode (Filter Low-Confidence)

```python
from statement_extractor import ExtractionOptions, ScoringConfig

options = ExtractionOptions(
    scoring_config=ScoringConfig(min_confidence=0.7)
)
result = extract_statements(text, options)
```

## Predicate Taxonomy (Normalize Predicates)

```python
from statement_extractor import PredicateTaxonomy, ExtractionOptions

taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in"
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements(text, options)

# "bought" -> "acquired" via semantic similarity
for stmt in result:
    if stmt.canonical_predicate:
        print(f"Normalized: {stmt.predicate} -> {stmt.canonical_predicate}")
```

## Batch Processing

```python
from statement_extractor import StatementExtractor

extractor = StatementExtractor(device="cuda")  # or "cpu"
for text in texts:
    result = extractor.extract(text)
```

## Best Practices

1. Use `[embeddings]` extra for semantic deduplication
2. Filter by `confidence_score >= 0.7` for high precision
3. Use predicate taxonomies for consistent knowledge graphs
4. Process large documents in chunks (by paragraph/section)
5. GPU recommended for production (~2GB VRAM needed)
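
Best practice 4 can be sketched with a simple paragraph-based chunker (this helper is an assumption for illustration, not part of the library):

```python
def chunk_by_paragraph(text, max_chars=2000):
    """Group paragraphs into chunks no longer than max_chars each."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph."
print(chunk_by_paragraph(doc, max_chars=20))
```

Feed each chunk to the extractor separately and merge the resulting statements.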

## Links

- PyPI: https://pypi.org/project/corp-extractor/
- Docs: https://statement-extractor.corp-o-rate.com/docs
- Model: https://huggingface.co/Corp-o-Rate-Community/statement-extractor

Save as SKILL.md or append to CLAUDE.md in your project root.

Multiple Models, One Pipeline

Corp-extractor uses multiple fine-tuned small models to transform unstructured text into structured relationship data—all running locally on your hardware with no external services.

Pipeline stages:

  • T5-Gemma 2 (540M params) — Splits text into atomic statements and resolves coreferences. Trained on 70,000+ pages of corporate and news documents.
  • GLiNER2 (205M params) — Extracts subject/predicate/object with entity types (ORG, PERSON, GPE, etc.) and 324 predefined predicates.
  • Entity Database — Qualifies entities against 10M+ organizations and 40M+ people with quantized embeddings for sub-second lookups.
  • BERT classifiers — Small models for sentiment labeling and embedding similarity for taxonomy classification.

Hardware: Requires ~100GB disk for all models and database. Runs on RTX 4090+ or Apple M1/M2/M3 with 16GB+ RAM.

How It Works

5-Stage Pipeline Architecture (v0.8.0)

Input (Raw Text) → 1. Splitting (T5-Gemma2) → 2. Extraction (GLiNER2) → 3. Qualification (Embedding DB) → 4. Labeling (Multi-choice) → 5. Taxonomy (MNLI / Embed) → Output (Statements)

Text flows through a modular plugin-based pipeline. Each stage transforms the data progressively, from raw text to fully qualified, labeled statements with taxonomy classifications.

Pipeline Stages

| Stage | Name | Purpose | Key Technology |
|---|---|---|---|
| 1 | Splitting | Text → Atomic Statements | T5-Gemma2 (540M params) with Diverse Beam Search |
| 2 | Extraction | Atomic Statements → Typed Triples | GLiNER2 (205M params) entity recognition |
| 3 | Qualification | Entities → Canonical names, identifiers, FQN | Company embedding database (SEC, GLEIF, UK Companies House) |
| 4 | Labeling | Add simple classifications | Multi-choice classifiers (sentiment, relation type) |
| 5 | Taxonomy | Classify against ESG taxonomy | MNLI zero-shot or embedding similarity |
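
The staged flow can be sketched abstractly: each stage is a function from one representation to the next, and the pipeline is just their composition (the function bodies below are placeholders, not the real plugin implementations):

```python
# Each stage maps one representation to the next; the real pipeline uses
# plugins, these placeholders only show the shape of the data flow.
def splitting(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def extraction(statements):
    return [{"source": s, "triple": None} for s in statements]

def qualification(triples):
    return triples  # would resolve entities to canonical names + identifiers

def labeling(triples):
    return triples  # would add sentiment / relation-type labels

def taxonomy(triples):
    return triples  # would attach ESG topic classifications

PIPELINE = [splitting, extraction, qualification, labeling, taxonomy]

def run(text):
    data = text
    for stage in PIPELINE:
        data = stage(data)
    return data

print(run("Apple announced the iPhone. Tim Cook presented it."))
```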

Data Flow

Text → Atomic Statements → Typed Triples → Canonical Entities → Labeled Statements → Taxonomy Results

Data is progressively enriched through each stage, from raw text to fully qualified statements with entity types, canonical names, sentiment labels, and taxonomy classifications.

Technical Features

Diverse Beam Search

The T5-Gemma2 model uses Diverse Beam Search (Vijayakumar et al., 2016) to generate 4 diverse candidate outputs, exploring multiple interpretations of the text.

GLiNER2 Entity Extraction

GLiNER2 (205M params) refines entity boundaries and scores how "entity-like" subjects and objects are. Uses 324 default predicates across 21 categories for relation extraction.

Entity Qualification

Company embedding database (~100K+ SEC, ~3M GLEIF, ~5M UK companies) provides fast vector similarity search to resolve entities to canonical names with identifiers (LEI, CIK, company numbers) and FQN.
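
Qualification is essentially nearest-neighbour search over name embeddings. A toy sketch with made-up 2-d vectors (the records and similarity threshold are illustrative, not the library's actual data or defaults):

```python
import math

# Illustrative canonical-company records with toy 2-d name vectors.
COMPANIES = [
    {"name": "Apple Inc.", "cik": "0000320193", "vec": [0.9, 0.1]},
    {"name": "Alphabet Inc.", "cik": "0001652044", "vec": [0.1, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def qualify(entity_vec, min_sim=0.8):
    """Return the closest canonical record if it clears the threshold, else None."""
    best = max(COMPANIES, key=lambda c: cosine(entity_vec, c["vec"]))
    return best if cosine(entity_vec, best["vec"]) >= min_sim else None

print(qualify([0.85, 0.15])["name"])
```

The real database quantizes the embeddings so the same lookup stays sub-second across tens of millions of records.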

Taxonomy Classification

Statements are classified against an ESG taxonomy using either MNLI zero-shot classification or embedding similarity, returning multiple labels above confidence thresholds.

Known Limitations

  1. Complex sentences: Very long sentences with multiple nested clauses may result in incomplete extraction or incorrect predicate assignment.
  2. Implicit relationships: The model works best with explicit statements. Implied or contextual relationships may be missed.
  3. Domain specificity: Trained primarily on corporate/news text. Performance may vary on highly technical or specialized content.
  4. Coreference limits: While the model resolves many pronouns, complex anaphora chains or ambiguous references may not resolve correctly.
  5. Entity type coverage: Some specialized entity types (e.g., scientific terms, technical products) may default to UNKNOWN.

Roadmap & Areas for Improvement

Recently Completed

  • 5-Stage Pipeline Architecture (v0.8.0) — Merged qualification + canonicalization into single stage
  • Company Embedding Database (v0.8.0) — Fast vector search for ~100K+ SEC, ~3M GLEIF, ~5M UK companies
  • Taxonomy Classification (v0.5.0) — MNLI + embedding-based ESG taxonomy classification
  • Entity Qualification (v0.5.0) — LEI, ticker, CIK lookups with canonical names and FQN
  • Statement Labeling (v0.5.0) — Sentiment analysis and relation type classification
  • GLiNER2 Integration (v0.4.0) — 205M param model for entity recognition and relation extraction

Larger Training Dataset

Expanding beyond 77K examples with more diverse sources

Multi-hop Reasoning

Better handling of statements that span multiple sentences

Negation Handling

Better detection of negative statements and contradictions

Knowledge Graph Integration

Link extracted entities to external knowledge bases (Wikidata, etc.)

We Need Your Feedback

This model is actively being improved. If you encounter incorrect extractions, missing statements, or have suggestions for improvement, we'd love to hear from you. Use the "Correct" button above to submit fixes, or reach out directly.

neil@corp-o-rate.com

Who We Are

About Corp-o-Rate


The Glassdoor of ESG

Real corporate intelligence from real people. Track what companies actually do, not what they claim.

Corp-o-Rate is building a community-powered corporate accountability platform. We believe that glossy sustainability reports and PR-polished ESG claims don't tell the full story. Our mission is to surface the truth about corporate behavior through crowdsourced intelligence, AI-powered analysis, and transparent data.

This statement extraction model is one piece of that puzzle — automatically extracting relationships and meaningful statements from research, news, and corporate documents. It is available as the corp-extractor Python library on PyPI. This is the first part of our analysis pipeline, and we'll release other reusable components as we progress.

Community-Driven

Powered by employees, consumers, and researchers sharing real knowledge about corporate practices.

AI-Powered

Using NLP and knowledge graphs to structure, connect, and analyze corporate claims at scale.

100% Independent

No corporate sponsors. No conflicts of interest. Just transparent corporate intelligence.

We're Pre-Funding & Running on Fumes

Corp-o-Rate is currently bootstrapped and self-funded. We're building in public, shipping what we can, and working toward our mission one step at a time. If you believe in corporate accountability and transparent business intelligence, we'd love your support.

GPU Credits

Help us train better models

Angel Investment

Help us scale the platform

Partnerships

Data, research, or distribution

Shop smarter. Invest better. Know which companies match your values.