Custom Detectors

Custom detectors let you define exactly what to look for — pattern by pattern, label by label — without writing code. Three execution methods cover different use cases. You can combine them or add a structured extraction layer on top of any method.


Methods at a glance

| Method | What it does | When to use |
| --- | --- | --- |
| RULESET | Matches text using regex patterns and keyword lists | Known, deterministic patterns — IBANs, internal codes, compliance phrases |
| CLASSIFIER | Classifies text into your custom labels using a language model | Subjective or semantic categories — tone, risk, promotional content |
| ENTITY | Extracts named entities from text using GLiNER | Domain-specific entities — regulatory IDs, custom person types, product codes |

RULESET

A ruleset detector matches text without any ML model. It runs regex and keyword rules and fires a finding when a match is found. Rules are evaluated on every scanned asset. Confidence is fixed (regex: 0.93, keyword: 0.82), so confidence_threshold acts as a gate — keep it below those values (default 0.7 works fine).

Regex rules

{
  "method": "RULESET",
  "ruleset": {
    "regex_rules": [
      {
        "id": "iban_de",
        "name": "German IBAN",
        "pattern": "\\bDE\\d{20}\\b",
        "flags": "i",
        "severity": "high"
      },
      {
        "id": "sku_internal",
        "name": "Internal SKU",
        "pattern": "\\bINT-[A-Z]{2,4}-\\d{4,6}\\b",
        "severity": "medium"
      }
    ]
  }
}
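
As a quick sanity check before saving a rule, the iban_de pattern behaves like any Python-compatible regex. A sketch using Python's own re module (which follows the same syntax the pattern field assumes):

```python
import re

# The rule's "flags": "i" setting corresponds to re.IGNORECASE here.
pattern = re.compile(r"\bDE\d{20}\b", re.IGNORECASE)

print(bool(pattern.search("Zahlung an DE89370400440532013000 bitte")))  # True
print(bool(pattern.search("DE1234 is too short")))                      # False
```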

Fields:

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Stable identifier — appears in finding metadata and finding_type |
| name | Yes | Human-readable label shown in the UI |
| pattern | Yes | Python-compatible regex. Max 512 chars, no recursive patterns |
| flags | No | i = case-insensitive, m = multiline, s = dotall |
| severity | No | critical, high, medium, low, info — defaults to medium |

When de is in the detector’s languages list, the engine also runs detection on a compound-split variant of the content — long German compound words (16+ chars) are split at known boundaries before matching. This catches cases like Datenschutzbeauftragter → Datenschutz beauftragter.
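
The compound handling can be pictured with a naive sketch. The engine's actual boundary dictionary is internal, so the vocabulary below is purely illustrative:

```python
# Illustrative only: KNOWN_PARTS is a made-up vocabulary,
# not the engine's real boundary dictionary.
KNOWN_PARTS = {"datenschutz", "beauftragter"}

def split_compound(word: str, min_len: int = 16) -> str:
    """Split a long compound at the first boundary where both halves are known."""
    if len(word) < min_len:
        return word
    lower = word.lower()
    for i in range(3, len(word) - 2):
        if lower[:i] in KNOWN_PARTS and lower[i:] in KNOWN_PARTS:
            return word[:i] + " " + word[i:]
    return word

print(split_compound("Datenschutzbeauftragter"))  # Datenschutz beauftragter
print(split_compound("kurz"))                     # kurz (below 16 chars, untouched)
```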

Keyword rules

{
  "method": "RULESET",
  "ruleset": {
    "keyword_rules": [
      {
        "id": "data_protection_terms",
        "name": "Data Protection Terms",
        "keywords": ["Datenschutzbeauftragter", "DSGVO", "Datenschutzfolgeabschätzung"],
        "case_sensitive": false,
        "severity": "low"
      }
    ]
  }
}
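
Keyword rules are literal substring matches, not regexes. A minimal sketch of the case_sensitive: false behavior (illustrative, not the engine's code):

```python
import re

def keyword_hits(text: str, keywords: list[str], case_sensitive: bool = False):
    """Return (keyword, start, end) for each literal occurrence in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for kw in keywords:
        # re.escape makes the match literal, never a pattern
        for m in re.finditer(re.escape(kw), text, flags):
            hits.append((kw, m.start(), m.end()))
    return hits

print(keyword_hits("Laut dsgvo gilt die Meldepflicht.", ["DSGVO"]))
# [('DSGVO', 5, 10)]
```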

Fields:

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Stable identifier |
| name | Yes | Display name |
| keywords | Yes | List of literal strings to match (at least one) |
| case_sensitive | No | Default false |
| severity | No | Defaults to low |

Safety limits

Patterns are validated before compilation. A rule is silently skipped if:

  • Pattern is longer than 512 characters
  • Pattern contains recursive constructs such as (?R), (?0), or (?P>name)
  • Pattern has more than 4 .* sequences
  • Pattern has nested quantifiers that could cause catastrophic backtracking
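
The first three gates reduce to a handful of string checks. A sketch — this is not the engine's actual validator, and the nested-quantifier heuristic is omitted here:

```python
import re

MAX_PATTERN_LEN = 512
RECURSIVE_MARKERS = ("(?R", "(?0", "(?P>")

def is_safe_pattern(pattern: str) -> bool:
    """Approximate the documented validation gates (illustrative)."""
    if len(pattern) > MAX_PATTERN_LEN:
        return False
    if any(marker in pattern for marker in RECURSIVE_MARKERS):
        return False
    if pattern.count(".*") > 4:
        return False
    try:
        re.compile(pattern)  # must at least be a valid Python regex
    except re.error:
        return False
    return True

print(is_safe_pattern(r"\bDE\d{20}\b"))      # True
print(is_safe_pattern(".*a.*b.*c.*d.*e.*"))  # False — six .* sequences
```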

CLASSIFIER

A classifier detector assigns your text to one or more custom labels using a language model. It works in two modes depending on how many training examples you have provided.

How it picks a model (training strategy)

| Condition | Model used | Notes |
| --- | --- | --- |
| All labels have ≥ min_examples_per_label accepted examples | SetFit (paraphrase-multilingual-MiniLM-L12-v2) | Trained once on first scan, cached locally by example fingerprint |
| One or more labels below the threshold | Zero-shot (mDeBERTa-v3-base-mnli-xnli) | Uses hypothesis_template to infer — no training needed |

The default threshold is 8 examples per label. Once you reach it for every label, SetFit is used and the model is retrained whenever the examples change.

Zero-shot first, SetFit later. You can deploy a CLASSIFIER detector immediately with zero examples — it will use the zero-shot model. As you accumulate confirmed findings from scans, run Train to bake them into a SetFit model that is faster and more accurate for your specific domain.

Minimal zero-shot setup

{
  "method": "CLASSIFIER",
  "classifier": {
    "labels": [
      { "id": "risk_term", "name": "Risk Term", "description": "Legal or compliance risk wording" },
      { "id": "neutral",   "name": "Neutral",   "description": "Ordinary business language" }
    ],
    "hypothesis_template": "This text contains {}.",
    "zero_shot_model": "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
  }
}

The {} in hypothesis_template is replaced with the label name at inference time. A good template makes a difference. Compare:

| Template | Label substituted | Effective hypothesis |
| --- | --- | --- |
| "This text contains {}." | "Risk Term" | This text contains Risk Term. |
| "This text is {}." | "promotional content" | This text is promotional content. |
| "This document discusses {}." | "financial advice" | This document discusses financial advice. |

Write the template so the completed sentence is a natural, true description of content you do want to flag.
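
Mechanically, the template is an ordinary format string — each label name is substituted into {} to build the hypothesis the model scores against your content:

```python
labels = ["Risk Term", "promotional content"]
template = "This text contains {}."

# One hypothesis per candidate label, scored by the zero-shot model.
hypotheses = [template.format(name) for name in labels]
print(hypotheses)
# ['This text contains Risk Term.', 'This text contains promotional content.']
```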

Label fields

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Stable snake_case identifier — referenced in training examples and finding metadata |
| name | Yes | Human-readable name — used by the zero-shot model as the candidate label |
| description | No | Longer hint — useful for documenting intent, not used by models |

Training example label values must match the label id, not the name. If you change a label’s id, existing training examples referencing the old ID are no longer counted toward that label.


Training a CLASSIFIER

Training resolves the detector’s strategy to SETFIT or ZERO_SHOT based on how many accepted examples exist per label. It does not run the ML training itself — the actual SetFit training happens inside the CLI scan job on first use. The API training step commits the current example set and computes a deterministic fingerprint used to invalidate the cached model when examples change.

Ways to add training examples

1. From confirmed scan findings — the most natural workflow. Run a scan, review findings, mark correct ones as Resolved and incorrect ones as False Positive. Then click Train in the UI. The API collects all Resolved findings for this detector as positive examples.

2. Upload a file — supported formats: .csv, .tsv, .txt, .json. The file parser auto-detects column headers. A two-column CSV with text,label header is the simplest format:

text,label
"This lease agreement contains force majeure waivers.",risk_term
"Meeting notes from the Q2 planning session.",neutral
"The liability clause contradicts applicable law.",risk_term
"We shipped 50 units to warehouse B.",neutral

3. Paste inline — plain text, one example per line, with the label as the last token:

This agreement limits damages to direct losses only. risk_term
Weekly status update for the engineering team. neutral

How many examples do you need?

| Goal | Minimum | Better |
| --- | --- | --- |
| Switch from zero-shot to SetFit | 8 per label | 20–30 per label |
| Good accuracy on short texts | 15 per label | 40+ per label |
| Good accuracy on long documents | 20 per label | 50+ per label |

The 8-example threshold is the default min_examples_per_label. You can lower it, but SetFit trains better with more examples. If accuracy matters, aim for 20+ per label with varied phrasing.
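
Combining this threshold with the two-label requirement noted under Common pitfalls, the strategy resolution can be sketched as:

```python
def resolve_strategy(examples_per_label: dict[str, int],
                     min_examples_per_label: int = 8) -> str:
    """SETFIT only when at least two labels all meet the threshold (sketch,
    not the product's actual code)."""
    if (len(examples_per_label) >= 2
            and all(n >= min_examples_per_label
                    for n in examples_per_label.values())):
        return "SETFIT"
    return "ZERO_SHOT"

print(resolve_strategy({"risk_term": 12, "neutral": 9}))  # SETFIT
print(resolve_strategy({"risk_term": 12, "neutral": 5}))  # ZERO_SHOT
```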

Training strategy in findings

After running Train, the training run record shows the resolved strategy:

| strategy value | Meaning |
| --- | --- |
| SETFIT | Enough labeled examples — SetFit model will be trained on first scan |
| ZERO_SHOT | Not enough labeled examples — mDeBERTa zero-shot is used at runtime |
| RULESET | Detector uses the ruleset method, no training needed |
| ENTITY | Detector uses the entity method, no training needed |

ENTITY

An entity detector extracts named spans from text using GLiNER — a generalist NER model that handles any label you give it without fine-tuning.

{
  "method": "ENTITY",
  "entity": {
    "entity_labels": [
      "vendor name",
      "contractor",
      "Auftragnehmer",
      "IBAN",
      "regulatory fine amount"
    ],
    "model": "urchade/gliner_multi-v2.1"
  }
}

Entity labels are free-form strings in any language — GLiNER uses the label text itself as the semantic description. More descriptive labels give better results than short ones.

| Good label | Less good label |
| --- | --- |
| "German supervisory authority" | "authority" |
| "regulatory fine amount in euros" | "fine" |
| "IBAN bank account number" | "IBAN" |
| "internal employee ID" | "ID" |

Findings include start/end character offsets in the original content and an entity_label field in the metadata.

No training required. GLiNER generalizes from the label text — just describe what you want to find.


Extractor (structured output)

Any method can have an optional extractor block that runs after detection fires. Instead of just flagging content, the extractor pulls structured fields out of it and attaches them to the finding as extracted_data.

{
  "method": "CLASSIFIER",
  "classifier": { "..." },
  "extractor": {
    "enabled": true,
    "fields": [
      {
        "name": "contract_party",
        "description": "The contracting party mentioned",
        "type": "string",
        "entity_label": "contracting party",
        "min_confidence": 0.4
      },
      {
        "name": "effective_date",
        "description": "Contract effective date",
        "type": "string",
        "regex_pattern": "(?P<value>\\d{1,2}\\.\\d{1,2}\\.\\d{4})",
        "aggregate": "first"
      }
    ],
    "content_limit": 4000
  }
}

Extractor fields

| Field | Description |
| --- | --- |
| name | Output key in extracted_data |
| description | Human hint for the GLiNER extraction |
| type | string, number, boolean, list[string], list[number] |
| entity_label | GLiNER label — used when no regex_pattern is set |
| regex_pattern | Python regex with a (?P<value>...) named capture group |
| regex_flags | i, m, s — defaults to i |
| aggregate | first, last, list, join, count — how multiple matches are combined |
| min_confidence | Minimum GLiNER confidence for this field (0–1, default 0.4) |
| required | If true, the entire extraction is dropped when this field is empty |

content_limit controls how many characters of the original document are passed to the extractor. The classifier matched_content is capped at 320 chars — for deeper extraction you need more context. Default is 4000.
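
A regex field with aggregation reduces to collecting every (?P<value>...) capture and combining them. An illustrative sketch — the join separator below is an assumption, not the documented one:

```python
import re

FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def extract_regex_field(content, pattern, flags="i", aggregate="first"):
    """Collect all (?P<value>...) captures, then combine per `aggregate`."""
    f = 0
    for ch in flags:
        f |= FLAG_MAP[ch]
    values = [m.group("value") for m in re.finditer(pattern, content, f)]
    if not values:
        return None
    return {"first": values[0], "last": values[-1], "list": values,
            "join": ", ".join(values), "count": len(values)}[aggregate]

text = "Vertrag gültig ab 01.04.2024, zuletzt geändert am 15.09.2024."
print(extract_regex_field(text, r"(?P<value>\d{1,2}\.\d{1,2}\.\d{4})"))
# 01.04.2024
```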


Complete example: CLASSIFIER + extractor

{
  "custom_detector_key": "contract_risk_de",
  "name": "DACH Contract Risk",
  "method": "CLASSIFIER",
  "languages": ["de", "en"],
  "confidence_threshold": 0.7,
  "classifier": {
    "labels": [
      { "id": "risk_term",  "name": "Risk Term",  "description": "Legal or compliance risk wording" },
      { "id": "neutral",    "name": "Neutral",     "description": "Ordinary business language" }
    ],
    "hypothesis_template": "This text contains {}.",
    "min_examples_per_label": 8,
    "zero_shot_model": "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
    "setfit_model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "training_examples": []
  },
  "extractor": {
    "enabled": true,
    "fields": [
      {
        "name": "risk_clause",
        "description": "The specific legal clause that triggered detection",
        "type": "string",
        "entity_label": "legal risk clause",
        "min_confidence": 0.35,
        "aggregate": "join"
      }
    ],
    "content_limit": 4000
  }
}

Choosing the right method

Is the pattern deterministic and expressible as a regex or keyword list?
  → RULESET

Is the content subjective, contextual, or semantic?
  → CLASSIFIER

  Do you have labeled examples, or can you collect them through scan feedback?
    → Yes → CLASSIFIER, accumulate examples, then Train for SetFit
    → No  → CLASSIFIER with zero_shot_model and good hypothesis_template

Do you need to extract specific named spans (people, IDs, amounts)?
  → ENTITY

Do you need structured fields pulled out when detection fires?
  → Add an extractor block to any of the above

Common pitfalls

Classifier label IDs in training examples must match exactly. Training examples reference label IDs, not names. A mismatch means those examples are silently ignored when counting coverage.

Zero-shot accuracy depends on label names. The zero-shot model scores each label name against the content via a natural language hypothesis. Vague names like "bad" or "other" perform poorly. Use descriptive names like "passive-aggressive workplace communication".

SetFit requires at least 2 labels with enough examples. A classifier with a single label cannot train SetFit — you need at least two distinct labels with min_examples_per_label examples each.

RULESET confidence is fixed, not learned. Regex matches come in at 0.93 and keyword matches at 0.82. A confidence_threshold above 0.82 filters out all keyword findings; above 0.93, regex findings are filtered out as well.
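
Since those confidences are constants, the gate reduces to a plain comparison (sketch):

```python
FIXED_CONFIDENCE = {"regex": 0.93, "keyword": 0.82}

def passes_gate(rule_kind: str, confidence_threshold: float) -> bool:
    """A ruleset finding survives only if its fixed confidence clears the gate."""
    return FIXED_CONFIDENCE[rule_kind] >= confidence_threshold

print(passes_gate("keyword", 0.7))  # True  — default keeps both kinds
print(passes_gate("keyword", 0.9))  # False — only regex findings survive
print(passes_gate("regex", 0.95))   # False — nothing survives
```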

Entity labels are semantic descriptions, not ontology nodes. GLiNER uses the label string as a natural language prompt. "IBAN bank account number" works better than "IBAN" because it gives the model more context.
