license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
  - PII
  - PHI
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - privacy

Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection

This Gretel GLiNER model is a fine-tuned version of the GLiNER base model knowledgator/gliner-bi-base-v1.0, trained specifically to detect Personally Identifiable Information (PII) and Protected Health Information (PHI). It is intended to support privacy-compliant entity recognition across industries and document types. For more information about the base GLiNER model, including its architecture and general capabilities, refer to the GLiNER Model Card.

The model was fine-tuned on the gretelai/gretel-pii-masking-en-v1 dataset, a diverse collection of synthetic document snippets containing PII and PHI entities. The dataset splits were used as follows:

  1. Training: The training split of the synthetic dataset was used to fine-tune the model.
  2. Validation: Performance on the validation set was monitored to adjust training parameters.
  3. Evaluation: Final performance was assessed on the test set, using the PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the dataset documentation on Hugging Face.

Model Performance

All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
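
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall. Because those inputs are already rounded to two decimals, the recomputed value can differ from the reported F1 in the last digit. The `reported` dictionary and the `f1_score` helper are illustrative names, not part of the gliner package.

```python
# Precision, recall and reported F1, copied from the table above.
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91, 0.94),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92, 0.95),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93, 0.95),
}


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (p, r, f1_reported) in reported.items():
    # Inputs are rounded, so expect agreement only to about +/- 0.01.
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1_reported:.2f}")
```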

Installation & Usage

Ensure you have Python installed. Then, install or update the gliner package:

pip install gliner -U

Load the fine-tuned Gretel GLiNER model using the GLiNER class and the from_pretrained method. Below is an example using the gretelai/gretel-gliner-bi-base-v1.0 model for PII/PHI detection:

from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

Expected Output:

CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
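
Downstream code often needs the detected values grouped by entity type rather than printed one by one. The following is a minimal sketch, assuming each entity is a dictionary with `text`, `label`, and `score` keys as returned by `predict_entities`. The `group_by_label` helper, the `min_score` parameter, and the sample scores are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group entity texts by label, keeping only entities at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Treat a missing score as fully confident (an assumption of this sketch).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for model output; the scores are made up.
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.93},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.81},
    {"text": "Springfield", "label": "city", "score": 0.88},
]

grouped = group_by_label(sample_entities, min_score=0.85)
for label, values in grouped.items():
    print(f"{label}: {values}")
```

Filtering on `score` after prediction is a cheap way to try stricter cutoffs without rerunning the model, as long as the original `threshold` passed to `predict_entities` was no higher than the cutoff being tested.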

Use Cases

Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:

  • Healthcare: Automating the extraction and redaction of patient information from medical records.
  • Finance: Identifying and securing financial data such as account numbers and transaction details.
  • Cybersecurity: Detecting sensitive information in logs and security reports.
  • Legal: Processing contracts and legal documents to protect client information.
  • Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
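
For the redaction use cases above, detected spans can be replaced with placeholders. The sketch below assumes each entity carries `start` and `end` character offsets into the input text (with `end` exclusive) alongside its `label`, which is how `predict_entities` output is commonly consumed. The `redact` and `make_entity` helpers are hypothetical, and overlapping spans are not handled.

```python
def redact(text, entities):
    """Replace each entity span in text with a [LABEL] placeholder."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"] :]
    return redacted


def make_entity(text, value, label):
    """Build a stand-in entity dict by locating value inside text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


# Hand-written stand-in for model output on a short example sentence.
sample_text = "Customer CID-982305 lives in Springfield."
sample_entities = [
    make_entity(sample_text, "CID-982305", "customer_id"),
    make_entity(sample_text, "Springfield", "city"),
]

print(redact(sample_text, sample_entities))
```

In a real pipeline, `sample_entities` would come from `model.predict_entities(text, labels, threshold=...)` on the same `text` that is passed to `redact`, so that the offsets line up.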

Citation and Usage

If you use this model or its training dataset in your research or applications, please cite it as:

@dataset{gretel-pii-masking-en-v1,
  author       = {Gretel AI},
  title        = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year         = {2024},
  month        = {10},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}

For questions, issues, or additional information, please visit our Synthetic Data Discord community or reach out to gretel.ai.