PETER

Phonesthemes Encoder from Transformer Embedding Representation

PETER is a novel transformer-based architecture designed to process and understand phonesthemes (sound-meaning patterns) in text using a multi-layered embedding approach. It combines phonestheme embeddings with positional encodings and word-level information to create rich text representations.

Architecture Overview

PETER uses four main types of embeddings that are combined into a single input representation (see the sketch after this list):

  1. Word Positional Encoding: Tracks position of words in the sequence
  2. Word Segment Embedding: Vector representation of complete words
  3. Phonestheme Positional Encoding: Tracks position of phonesthemes within words
  4. Phonestheme Embedding: Vector representation of phonesthemes (sound units)
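
A minimal sketch of how these four signals might be combined, assuming BERT-style summation of the embeddings (the vocabulary sizes, tensor shapes, and position indexing below are illustrative, not the repository's actual API):

import torch
import torch.nn as nn

d_model = 150            # embedding width (matches the Usage example below)
phon_vocab_size = 8000   # phonestheme vocabulary size
word_vocab_size = 30000  # hypothetical word vocabulary size
max_len = 512

# One lookup table per signal
phonestheme_emb = nn.Embedding(phon_vocab_size, d_model)
word_emb = nn.Embedding(word_vocab_size, d_model)
word_pos_enc = nn.Embedding(max_len, d_model)         # position of the word in the sequence
phonestheme_pos_enc = nn.Embedding(max_len, d_model)  # position of the phonestheme within its word

# Hypothetical index tensors for one sequence of 10 phonesthemes
phon_ids = torch.randint(0, phon_vocab_size, (1, 10))
word_ids = torch.randint(0, word_vocab_size, (1, 10))  # the word each phonestheme belongs to
word_positions = torch.arange(10).unsqueeze(0)
phon_positions = torch.arange(10).unsqueeze(0)

# Sum the four embeddings into one vector per phonestheme position
x = (phonestheme_emb(phon_ids) + word_emb(word_ids)
     + word_pos_enc(word_positions) + phonestheme_pos_enc(phon_positions))
print(x.shape)  # torch.Size([1, 10, 150])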

The model processes text in the following steps (a preprocessing sketch follows the list):

  • Breaking words into phonesthemes using syllabification
  • Converting text to IPA (International Phonetic Alphabet) representation
  • Applying multiple embedding layers
  • Processing through a transformer encoder
  • Generating predictions at the phonestheme level
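
The first two steps map directly onto the epitran and pylabeador packages listed under Requirements. A rough sketch of that preprocessing, assuming Spanish input (the actual pipeline in src/ may differ):

import epitran
import pylabeador

epi = epitran.Epitran("spa-Latn")  # Spanish in Latin script

text = "hola buenos días"
for word in text.split():
    syllables = pylabeador.syllabify(word)  # list of syllable strings, e.g. ['ho', 'la']
    ipa = epi.transliterate(word)           # IPA transcription of the word
    print(word, syllables, ipa)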

Installation

  1. Clone the repository:
    git clone git@github.com:VerbaNexAI/PETER.git
    cd PETER
    
  2. Create and activate a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
    
  3. Install required packages:
    pip install -r requirements.txt
    

Project Structure

.
├── data/                      # Training and testing data
├── experiments/               # Jupyter notebooks for testing
├── logs/                      # Training logs
├── models/                    # Saved model weights and embeddings
├── src/                       # Source code
│   ├── embeddings/            # Embedding implementations
│   ├── tokenizers/            # Text tokenization modules
│   └── utils/                 # Helper functions

Usage

Basic Example

import torch

from src.peter import TransformerEncoder
from src.tokenizers.phonestheme_tokenizer import PhonesthemeTokenizer
from src.tokenizers.word_tokenizer import WordSegmentEmbedding

# Initialize tokenizers
phonestheme_tokenizer = PhonesthemeTokenizer(
    model="models/phonestheme_embedding_es.model",
    max_length=150
)

word_tokenizer = WordSegmentEmbedding(
    model="models/word_embedding_es.model",
    max_length=512
)

# Initialize model
model = TransformerEncoder(
    phonestheme_tokenizer=phonestheme_tokenizer,
    word_tokenizer=word_tokenizer,
    vocab_size=8000,
    d_model=150,
    nhead=5,
    num_layers=6
)

# Load pretrained weights
model.load_state_dict(torch.load("models/peter_model_trained.pth"))

# Process text
text = "hola buenos días"
phonestheme_tokens = phonestheme_tokenizer.encode(text)
word_tokens = word_tokenizer.encode(text)

output = model(
    input_ids=phonestheme_tokens["input_ids"].unsqueeze(0),
    attention_mask=phonestheme_tokens["attention_mask"].unsqueeze(0),
    word_input_ids=word_tokens.unsqueeze(0)
)
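
The shape of output depends on the encoder configuration; assuming the model returns one d_model-dimensional vector per phonestheme position, a quick sanity check is:

print(output.shape)  # assumed shape: (batch, phonestheme_seq_len, d_model)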

Training

The model can be trained on your own data. The notebooks in the experiments/ directory contain training examples; the sketch below shows the general shape of a training loop.
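
A minimal training-loop sketch, assuming a phonestheme-level classification objective and reusing model from the Basic Example (train_loader, the label layout, and all hyperparameters here are placeholders, not the repository's actual training code):

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):           # placeholder epoch count
    for batch in train_loader:   # hypothetical DataLoader yielding encoded batches
        optimizer.zero_grad()
        logits = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            word_input_ids=batch["word_input_ids"],
        )
        # Flatten (batch, seq_len, n_classes) -> (batch*seq_len, n_classes) for the loss
        loss = criterion(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")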

Requirements

Main dependencies:

  • Python 3.8+
  • PyTorch 2.0+
  • epitran
  • pylabeador
  • gensim
  • num2words

See requirements.txt for the complete list.

License

This project is licensed under the terms of the LICENSE file included in the repository.

Citation

If you use this work, please cite:

@article{PETERpaper,
  title={PETER: Phonesthemes Encoder from Transformer Embedding Representation},
  author={Payares, E. and Puertas, E. and Martinez-Santos, J. C.},
  year={2024}
}

Contact

For questions or issues, please open an issue in the repository.