HATCorpus BERT Authorship Classifier

Model Description

This model is a BERT-based binary text classifier trained to distinguish between human-written and AI-generated English text.

Base model: bert-base-uncased
Task: Binary text classification
Labels:
- 0 → Human-written
- 1 → AI-generated

The model was trained on HATCorpus, a curated dataset of human and AI-authored text.

Intended Use

This model is intended for:

- Research on AI-generated text detection
- Benchmarking authorship classifiers
- Educational and exploratory use

It is not intended for:

- Surveillance or enforcement
- Determining individual authorship
- High-stakes automated decisions

Training Data

Dataset: HATCorpus
Sources: Wikipedia, Project Gutenberg, AI-generated text
Language: English
Split: Train / Validation

For dataset details, see:
👉 https://huggingface.co/datasets/ky1916/HATCorpus

Training Details

Architecture: BERT-base
Optimizer: AdamW
Loss: Cross-entropy
Max sequence length: 512
Batch size: 8
Epochs: 3
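
With a maximum sequence length of 512, any tokens beyond that limit are dropped when truncation is enabled. One common workaround, which this card does not describe and which is offered only as a sketch, is to split a long token sequence into overlapping windows and classify each window separately. The helper name `chunk_token_ids` and the overlap of 128 tokens are assumptions made for illustration. The window is 510 tokens because BERT adds [CLS] and [SEP] to every sequence.

```python
from typing import List

MAX_LEN = 512        # max sequence length from the training details
SPECIAL_TOKENS = 2   # [CLS] and [SEP], added by the BERT tokenizer


def chunk_token_ids(token_ids: List[int], overlap: int = 128) -> List[List[int]]:
    """Split token ids into overlapping windows that fit the model.

    `overlap` (hypothetical default) is the number of tokens shared
    between consecutive windows.
    """
    window = MAX_LEN - SPECIAL_TOKENS
    if overlap < 0 or overlap >= window:
        raise ValueError("overlap must be in [0, window)")
    if len(token_ids) <= window:
        return [list(token_ids)]

    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


# Stand-in for real token ids: a 1000-token document
chunks = chunk_token_ids(list(range(1000)))
print([len(c) for c in chunks])
```

Per-window predictions could then be combined, for example by averaging class probabilities, though how well that works for this model has not been evaluated here.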

Evaluation

Metric	Value
Accuracy	93.58%
Precision	88.04%
Recall	96.43%
F1-score	92.05%

(Evaluated on the Kaggle dataset at https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text)
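
As a reading aid for the table, the sketch below shows how these four metrics relate to confusion-matrix counts and recomputes F1 from the reported precision and recall. It assumes the positive class is label 1 (AI-generated), which the card does not state explicitly, and the confusion counts in the second example are hypothetical rather than taken from the evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes the positive class is label 1 (AI-generated).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Recompute F1 from the precision and recall reported in the table
reported_f1_check = f1_score(0.8804, 0.9643)
print(f"F1 from reported precision and recall: {reported_f1_check:.4f}")

# Hypothetical counts, for illustration only
example = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(example)
```

The recomputed F1 matches the reported 92.05% to within the rounding of the precision and recall values. With precision noticeably lower than recall, the model under this assumption flags human-written text as AI-generated more often than it misses AI-generated text.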

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Repository ID of this model on the Hugging Face Hub
model_id = "ky1916/hatcorpus-bert-authorship"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "This is an example sentence."
# Truncate to the 512-token limit used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()

# Label 1 is AI-generated, label 0 is human-written
print("AI-generated" if prediction == 1 else "Human-written")
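
The usage snippet above returns only the argmax label. Given the intended-use caveats, it may be more informative to also report how confident the model is. The sketch below applies a softmax to a pair of logits in plain Python so it runs without downloading the model. The logit values and the helper name `label_with_confidence` are hypothetical; in the real pipeline, `torch.softmax(outputs.logits, dim=1)` would play the same role.

```python
import math
from typing import List, Tuple

# Label mapping from the model description
ID2LABEL = {0: "Human-written", 1: "AI-generated"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits: List[float]) -> Tuple[str, float]:
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best], probs[best]


# Hypothetical logits standing in for outputs.logits[0].tolist()
label, confidence = label_with_confidence([-1.2, 2.3])
print(f"{label} (p = {confidence:.3f})")
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so they are best read as a relative score rather than a true likelihood.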

Limitations

- Performance may degrade on very short text.
- The model may rely on stylistic cues rather than semantic understanding.
- It does not generalize to all LLMs or writing styles.
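
One way to act on these limitations in exploratory use is to let a wrapper abstain instead of forcing a label. The sketch below is not part of the model. The function name `guarded_verdict` and both thresholds (50 tokens, 0.90 confidence) are hypothetical values chosen for illustration and would need tuning on real data.

```python
MIN_TOKENS = 50         # hypothetical cutoff for "very short text"
MIN_CONFIDENCE = 0.90   # hypothetical confidence needed to return a label


def guarded_verdict(
    prob_ai: float,
    n_tokens: int,
    min_tokens: int = MIN_TOKENS,
    min_confidence: float = MIN_CONFIDENCE,
) -> str:
    """Return a label only when the input is long enough and the score is decisive.

    `prob_ai` is the model's probability for label 1 (AI-generated).
    """
    if not 0.0 <= prob_ai <= 1.0:
        raise ValueError("prob_ai must be a probability in [0, 1]")
    if n_tokens < min_tokens:
        return "Inconclusive (text too short)"
    confidence = max(prob_ai, 1.0 - prob_ai)
    if confidence < min_confidence:
        return "Inconclusive (low confidence)"
    return "AI-generated" if prob_ai >= 0.5 else "Human-written"


print(guarded_verdict(prob_ai=0.99, n_tokens=12))
print(guarded_verdict(prob_ai=0.62, n_tokens=300))
print(guarded_verdict(prob_ai=0.97, n_tokens=300))
```

Abstaining does not address the generalization limitation: text from an unfamiliar LLM or writing style can still receive a confident but wrong score.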

Citation

If you use this model in your research, please cite both the model and dataset:

@misc{hatcorpus_bert_2025,
  title     = {HATCorpus BERT Authorship Classifier},
  year      = {2025},
  publisher = {Hugging Face},
  note      = {Fine-tuned from bert-base-uncased for human vs AI text classification}
}
