HATCorpus BERT Authorship Classifier

Model Description

This model is a BERT-based binary text classifier trained to distinguish between human-written and AI-generated English text.

  • Base model: bert-base-uncased
  • Task: Binary text classification
  • Labels:
    • 0 → Human-written
    • 1 → AI-generated

The model was trained on HATCorpus, a curated dataset of human and AI-authored text.


Intended Use

This model is intended for:

  • Research on AI-generated text detection
  • Benchmarking authorship classifiers
  • Educational and exploratory use

It is not intended for:

  • Surveillance or enforcement
  • Determining individual authorship
  • High-stakes automated decisions

Training Data

  • Dataset: HATCorpus
  • Sources: Wikipedia, Project Gutenberg, AI-generated text
  • Language: English
  • Split: Train / Validation

For dataset details, see:
https://huggingface.co/datasets/ky1916/HATCorpus


Training Details

  • Architecture: BERT-base
  • Optimizer: AdamW
  • Loss: Cross-entropy
  • Max sequence length: 512
  • Batch size: 8
  • Epochs: 3
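
Because the maximum sequence length is 512 tokens, longer inputs are cut off when truncation is enabled. The model card does not say how long documents were handled, so the following is only a sketch of one common workaround: splitting a list of token ids into overlapping windows that each fit the limit. The function name, the stride of 128, and the allowance of two special tokens ([CLS] and [SEP]) are assumptions made for illustration, not part of the original training setup.

```python
def split_into_windows(token_ids, max_len=512, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit within max_len.

    Hypothetical helper: max_len matches the model's limit, while stride and
    num_special (room for [CLS] and [SEP]) are illustrative assumptions.
    """
    body = max_len - num_special  # tokens available per window
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than the window body")
    if len(token_ids) <= body:
        return [list(token_ids)]

    step = body - stride  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break  # the last window already reaches the end of the input
    return windows


# Example with 1000 dummy token ids
windows = split_into_windows(list(range(1000)))
print(len(windows), [len(w) for w in windows])
```

Each window would then be classified separately; how to combine the per-window predictions (for example by averaging probabilities) is a design choice that this sketch leaves open.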

Evaluation

Metric      Value
Accuracy    93.58%
Precision   88.04%
Recall      96.43%
F1-score    92.05%

(Evaluated on the Kaggle dataset at https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text)
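
As a quick sanity check on how these metrics relate, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts, and recomputes F1 from the reported precision and recall. It assumes the positive class is label 1 (AI-generated), which the card does not state explicitly. The counts in the example are hypothetical and are not the model's actual confusion matrix.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    Assumes the positive class is label 1 (AI-generated).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# F1 recomputed from the reported precision and recall;
# it agrees with the reported 92.05% up to rounding of the inputs.
print(round(f1_from_precision_recall(0.8804, 0.9643), 3))

# Hypothetical counts, for illustration only
print(binary_metrics(tp=8, fp=2, fn=1, tn=9))
```

With recall higher than precision, errors under this reading lean toward human-written text being flagged as AI-generated rather than the reverse.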


Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "ky1916/hatcorpus-bert-authorship"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # make sure dropout is disabled for inference

text = "This is an example sentence."
# Truncate to the 512-token maximum used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()

# Label mapping: 0 = Human-written, 1 = AI-generated
print("AI-generated" if prediction == 1 else "Human-written")
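
The snippet above reports only the most likely label. Since the model is not intended for high-stakes decisions, it can help to look at how confident the prediction is. The following is a minimal sketch, in plain Python, that converts the two logits into softmax probabilities and returns "Uncertain" below a threshold. The helper names and the 0.9 threshold are hypothetical, and softmax scores from a classifier like this are not guaranteed to be calibrated.

```python
import math

# Label mapping from the model description
LABELS = {0: "Human-written", 1: "AI-generated"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, min_confidence=0.9):
    """Return (label, probability); the threshold is an illustrative assumption."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "Uncertain", probs[best]
    return LABELS[best], probs[best]


# Example with made-up logits in the order [human, AI]
print(label_with_confidence([0.0, 3.0]))
print(label_with_confidence([0.2, 0.0]))
```

With the model loaded as shown earlier, the logits for a single input could be passed in as `outputs.logits[0].tolist()`.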

Limitations

  • Performance may degrade on very short text
  • Model may rely on stylistic cues rather than semantic understanding
  • May not generalize to text from LLMs or writing styles that are not represented in the training data

Citation

If you use this model in your research, please cite both the model and the HATCorpus dataset. A BibTeX entry for the model:

@misc{hatcorpus_bert_2025,
  title     = {HATCorpus BERT Authorship Classifier},
  year      = {2025},
  publisher = {Hugging Face},
  note      = {Fine-tuned from bert-base-uncased for human vs AI text classification}
}
