EXTRACT STATEMENTS.
MAP RELATIONSHIPS.
A Python library designed to analyze complex text and extract relationship information about people and organizations. Runs entirely on your hardware (RTX 4090+, Apple M1 16GB+) with no external API dependencies. Uses fine-tuned T5-Gemma 2 for statement splitting and coreference resolution, plus GLiNER2 for entity extraction. Includes a database of 10M+ organizations and 40M+ people with quantized embeddings for fast entity qualification (~100GB disk for all models and data).
Default Predicates
The extractor uses GLiNER2 relation extraction with these default predicates. Each predicate has a confidence threshold (typically 0.65-0.8) that filters low-confidence matches. You can override these defaults by providing a custom predicates_file parameter to the GLiNER2Extractor.
Statement Taxonomy
Stage 5 of the pipeline classifies statements against this ESG taxonomy using embedding similarity or MNLI inference. Each topic includes descriptions to guide classification. You can provide a custom taxonomy via the taxonomy_file parameter to the taxonomy classifier plugins.
Quick Start
# Command Line Interface (v0.2.4+)
# ============================================
# Install globally (recommended)
# ============================================
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"
# Or using pipx
pipx install "corp-extractor[embeddings]"
# Or using pip
pip install "corp-extractor[embeddings]"
# ============================================
# Quick run with uvx (no install)
# ============================================
# Note: First run downloads the model (~1.5GB)
uvx corp-extractor "Apple announced a new iPhone."
# ============================================
# Usage Examples
# ============================================
# Extract from text argument
corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
# Extract from file
corp-extractor -f article.txt
# Pipe from stdin
cat article.txt | corp-extractor -
# Output as JSON (with full metadata)
corp-extractor "Tim Cook is CEO of Apple." --json
# Output as XML (raw model output)
corp-extractor -f article.txt --xml
# Verbose output with confidence scores
corp-extractor -f article.txt --verbose
# Use more beams for better quality
corp-extractor -f article.txt --beams 8
# Use custom predicate taxonomy
corp-extractor -f article.txt --taxonomy predicates.txt
# Use GPU explicitly
corp-extractor -f article.txt --device cuda
# Filter low-confidence results
corp-extractor -f article.txt --min-confidence 0.7
# ============================================
# All CLI Options
# ============================================
# corp-extractor --help
#
# -f, --file PATH Read input from file
# -o, --output [table|json|xml] Output format (default: table)
# --json Output as JSON (shortcut)
# --xml Output as XML (shortcut)
# -b, --beams INTEGER Number of beams (default: 4)
# --diversity FLOAT Diversity penalty (default: 1.0)
# --max-tokens INTEGER Max tokens to generate (default: 2048)
# --no-dedup Disable deduplication
# --no-embeddings Disable embedding-based dedup (faster)
# --no-merge Disable beam merging
# --predicates PATH Load predicate list for GLiNER2 relation extraction
# --all-triples Keep all candidate triples (default: best per source)
# --dedup-threshold FLOAT Deduplication threshold (default: 0.65)
# --min-confidence FLOAT Min confidence filter (default: 0)
# --taxonomy PATH Load predicate taxonomy from file
# --taxonomy-threshold FLOAT Taxonomy matching threshold (default: 0.5)
# --device [auto|cuda|mps|cpu] Device to use (default: auto)
# -v, --verbose Show confidence scores and metadata
# -q, --quiet Suppress progress messages
# --version                    Show version
For AI Assistants
SKILL.md for AI Assistants
Add to your project's CLAUDE.md or .cursorrules to enable statement extraction
# SKILL: Statement Extraction with corp-extractor
Use the `corp-extractor` Python library to extract structured subject-predicate-object statements from text. Returns Pydantic models with confidence scores.
## Installation
```bash
pip install "corp-extractor[embeddings]"  # Recommended: includes semantic deduplication
```
For GPU support, install PyTorch with CUDA first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor[embeddings]
```
## Quick Usage
```python
from statement_extractor import extract_statements
result = extract_statements("""
Apple Inc. announced the iPhone 15 at their September event.
Tim Cook presented the new features to customers worldwide.
""")
for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type})")
    print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")
```
## Output Formats
```python
from statement_extractor import (
    extract_statements,          # Returns ExtractionResult with Statement objects
    extract_statements_as_json,  # Returns JSON string
    extract_statements_as_xml,   # Returns XML string
    extract_statements_as_dict,  # Returns dict
)
```
## Statement Object Structure
Each `Statement` has:
- `subject.text` - Subject entity text
- `subject.type` - Entity type (ORG, PERSON, GPE, etc.)
- `predicate` - The relationship/action
- `object.text` - Object entity text
- `object.type` - Object entity type
- `source_text` - Original sentence
- `confidence_score` - Groundedness score (0-1)
- `canonical_predicate` - Normalized predicate (if taxonomy used)
## Entity Types
ORG, PERSON, GPE (countries/cities), LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, DATE, MONEY, PERCENT, QUANTITY, UNKNOWN
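A minimal sketch of filtering results by entity type, using plain dataclasses as stand-ins for the library's Pydantic models (the statements below are hand-written, not real extraction output):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    type: str  # ORG, PERSON, GPE, ...

@dataclass
class Statement:
    subject: Entity
    predicate: str
    object: Entity
    confidence_score: float

# Toy results standing in for extract_statements() output
statements = [
    Statement(Entity("Apple Inc.", "ORG"), "announced", Entity("iPhone 15", "PRODUCT"), 0.91),
    Statement(Entity("Tim Cook", "PERSON"), "works_for", Entity("Apple Inc.", "ORG"), 0.88),
]

# Keep only statements whose subject is an organization
org_statements = [s for s in statements if s.subject.type == "ORG"]
print([s.predicate for s in org_statements])  # -> ['announced']
```

The same pattern applies to a real `ExtractionResult`, since each statement exposes `subject.type` and `object.type`.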
## Precision Mode (Filter Low-Confidence)
```python
from statement_extractor import ExtractionOptions, ScoringConfig
options = ExtractionOptions(
    scoring_config=ScoringConfig(min_confidence=0.7)
)
result = extract_statements(text, options)
```
## Predicate Taxonomy (Normalize Predicates)
```python
from statement_extractor import PredicateTaxonomy, ExtractionOptions
taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "headquartered_in"
])
options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements(text, options)
# "bought" -> "acquired" via semantic similarity
for stmt in result:
    if stmt.canonical_predicate:
        print(f"Normalized: {stmt.predicate} -> {stmt.canonical_predicate}")
```
## Batch Processing
```python
from statement_extractor import StatementExtractor
extractor = StatementExtractor(device="cuda") # or "cpu"
for text in texts:
    result = extractor.extract(text)
```
## Best Practices
1. Use `[embeddings]` extra for semantic deduplication
2. Filter by `confidence_score >= 0.7` for high precision
3. Use predicate taxonomies for consistent knowledge graphs
4. Process large documents in chunks (by paragraph/section)
5. GPU recommended for production (~2GB VRAM needed)
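For point 4, a simple helper (not part of the library; `chunk_paragraphs` is a hypothetical name) that splits a document into paragraph chunks before extraction:

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Split text on blank lines, merging short paragraphs up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "X" * 3000
print(len(chunk_paragraphs(doc)))  # -> 2
```

Each chunk can then be passed to `extractor.extract(chunk)` in a loop.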
## Links
- PyPI: https://pypi.org/project/corp-extractor/
- Docs: https://statement-extractor.corp-o-rate.com/docs
- Model: https://huggingface.co/Corp-o-Rate-Community/statement-extractor
Multiple Models, One Pipeline
Corp-extractor uses multiple fine-tuned small models to transform unstructured text into structured relationship data—all running locally on your hardware with no external services.
Pipeline stages:
- T5-Gemma 2 (540M params) — Splits text into atomic statements and resolves coreferences. Trained on 70,000+ pages of corporate and news documents.
- GLiNER2 (205M params) — Extracts subject/predicate/object with entity types (ORG, PERSON, GPE, etc.) and 324 predefined predicates.
- Entity Database — Qualifies entities against 10M+ organizations and 40M+ people with quantized embeddings for sub-second lookups.
- BERT classifiers — Small models for sentiment labeling and embedding similarity for taxonomy classification.
Hardware: Requires ~100GB disk for all models and database. Runs on RTX 4090+ or Apple M1/M2/M3 with 16GB+ RAM.
How It Works
5-Stage Pipeline Architecture v0.8.0
Text flows through a modular plugin-based pipeline. Each stage transforms the data progressively, from raw text to fully qualified, labeled statements with taxonomy classifications.
Pipeline Stages
| Stage | Name | Purpose | Key Technology |
|---|---|---|---|
| 1 | Splitting | Text → Atomic Statements | T5-Gemma2 (540M params) with Diverse Beam Search |
| 2 | Extraction | Atomic Statements → Typed Triples | GLiNER2 (205M params) entity recognition |
| 3 | Qualification | Entities → Canonical names, identifiers, FQN | Company embedding database (SEC, GLEIF, UK Companies House) |
| 4 | Labeling | Add simple classifications | Multi-choice classifiers (sentiment, relation type) |
| 5 | Taxonomy | Classify against ESG taxonomy | MNLI zero-shot or embedding similarity |
Data Flow
Data is progressively enriched through each stage, from raw text to fully qualified statements with entity types, canonical names, sentiment labels, and taxonomy classifications.
Technical Features
Diverse Beam Search
The T5-Gemma2 model uses Diverse Beam Search (Vijayakumar et al., 2016) to generate 4 diverse candidate outputs, exploring multiple interpretations of the text.
GLiNER2 Entity Extraction
GLiNER2 (205M params) refines entity boundaries and scores how "entity-like" subjects and objects are. Uses 324 default predicates across 21 categories for relation extraction.
Entity Qualification
Company embedding database (~100K+ SEC, ~3M GLEIF, ~5M UK companies) provides fast vector similarity search to resolve entities to canonical names with identifiers (LEI, CIK, company numbers) and FQN.
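The idea behind the lookup can be illustrated with a toy nearest-neighbour search (the real database uses quantized embeddings of company names at scale; the vectors below are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embedding index: canonical name -> (identifier, vector)
index = {
    "Apple Inc.": ("CIK 0000320193", [0.9, 0.1, 0.2]),
    "Alphabet Inc.": ("CIK 0001652044", [0.1, 0.8, 0.3]),
}

def qualify(query_vec):
    """Return the canonical name and identifier of the closest entity."""
    name, (ident, _) = max(index.items(), key=lambda kv: cosine(query_vec, kv[1][1]))
    return name, ident

print(qualify([0.85, 0.15, 0.25]))  # closest to the "Apple Inc." vector
```

In production this search runs over tens of millions of pre-computed vectors, which is why quantization matters for keeping lookups sub-second.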
Taxonomy Classification
Statements are classified against an ESG taxonomy using either MNLI zero-shot classification or embedding similarity, returning multiple labels above confidence thresholds.
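As a mechanism sketch (hand-picked scores, not real model output), multi-label classification with a confidence threshold looks like:

```python
def classify(scores: dict[str, float], threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return all taxonomy labels scoring above the threshold, best first."""
    labels = [(topic, s) for topic, s in scores.items() if s >= threshold]
    return sorted(labels, key=lambda kv: kv[1], reverse=True)

# Hypothetical similarity scores for one statement against ESG topics
scores = {"Emissions": 0.82, "Labor Practices": 0.61, "Governance": 0.18}
print(classify(scores))  # -> [('Emissions', 0.82), ('Labor Practices', 0.61)]
```

Whether the scores come from MNLI entailment probabilities or embedding cosine similarity, the thresholding step is the same.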
Known Limitations
1. Complex sentences: Very long sentences with multiple nested clauses may result in incomplete extraction or incorrect predicate assignment.
2. Implicit relationships: The model works best with explicit statements. Implied or contextual relationships may be missed.
3. Domain specificity: Trained primarily on corporate/news text. Performance may vary on highly technical or specialized content.
4. Coreference limits: While the model resolves many pronouns, complex anaphora chains or ambiguous references may not resolve correctly.
5. Entity type coverage: Some specialized entity types (e.g., scientific terms, technical products) may default to UNKNOWN.
Roadmap & Areas for Improvement
✓ Recently Completed
- 5-Stage Pipeline Architecture (v0.8.0) — Merged qualification + canonicalization into single stage
- Company Embedding Database (v0.8.0) — Fast vector search for ~100K+ SEC, ~3M GLEIF, ~5M UK companies
- Taxonomy Classification (v0.5.0) — MNLI + embedding-based ESG taxonomy classification
- Entity Qualification (v0.5.0) — LEI, ticker, CIK lookups with canonical names and FQN
- Statement Labeling (v0.5.0) — Sentiment analysis and relation type classification
- GLiNER2 Integration (v0.4.0) — 205M param model for entity recognition and relation extraction
Larger Training Dataset
Expanding beyond 77K examples with more diverse sources
Multi-hop Reasoning
Better handling of statements that span multiple sentences
Negation Handling
Better detection of negative statements and contradictions
Knowledge Graph Integration
Link extracted entities to external knowledge bases (Wikidata, etc.)
We Need Your Feedback
This model is actively being improved. If you encounter incorrect extractions, missing statements, or have suggestions for improvement, we'd love to hear from you. Use the "Correct" button above to submit fixes, or reach out directly.
neil@corp-o-rate.com
About Corp-o-Rate
The Glassdoor of ESG
Real corporate intelligence from real people. Track what companies actually do, not what they claim.
Corp-o-Rate is building a community-powered corporate accountability platform. We believe that glossy sustainability reports and PR-polished ESG claims don't tell the full story. Our mission is to surface the truth about corporate behavior through crowdsourced intelligence, AI-powered analysis, and transparent data.
This statement extraction model is one piece of that puzzle — automatically extracting relationships and meaningful statements from research, news, and corporate documents. It's available as the corp-extractor Python library on PyPI. This is the first released piece of our analysis pipeline, and we'll release more reusable components as we progress.
Community-Driven
Powered by employees, consumers, and researchers sharing real knowledge about corporate practices.
AI-Powered
Using NLP and knowledge graphs to structure, connect, and analyze corporate claims at scale.
100% Independent
No corporate sponsors. No conflicts of interest. Just transparent corporate intelligence.
We're Pre-Funding & Running on Fumes
Corp-o-Rate is currently bootstrapped and self-funded. We're building in public, shipping what we can, and working toward our mission one step at a time. If you believe in corporate accountability and transparent business intelligence, we'd love your support.
Help us train better models
Help us scale the platform
Data, research, or distribution
Shop smarter. Invest better. Know which companies match your values.