input_preprocessing ¶
Input preprocessing for LLM prompts.
Best practices: remove noise, normalize whitespace, and control length.
PreprocessingStats dataclass ¶
PreprocessingStats(rows_processed: int, chars_before: int, chars_after: int, truncated_count: int, null_count: int)
Statistics from preprocessing operation.
TextCleaner ¶
UnicodeNormalizer ¶
Normalize Unicode to canonical form (NFC).
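NFC composes decomposed character sequences (e.g. `e` + combining accent) into single canonical code points, so the tokenizer sees one consistent form. A minimal sketch of this step using the standard library (the ondine method name and signature may differ):

```python
import unicodedata

# Minimal sketch, not the ondine source itself.
def normalize_nfc(text: str) -> str:
    """Compose decomposed characters into canonical (NFC) form."""
    return unicodedata.normalize("NFC", text)
```

For example, `"e\u0301"` (two code points) becomes `"\u00e9"` (one code point), so visually identical strings compare and tokenize identically.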
ControlCharRemover ¶
Remove control characters that confuse tokenizers.
clean ¶
Replace control chars with space (preserves word boundaries).
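A sketch of the replace-with-space approach, assuming the usual C0/DEL control range while leaving tab, newline, and carriage return for the whitespace normalizer (the actual ondine character set is an assumption here):

```python
import re

# Assumed control range: C0 controls plus DEL, excluding \t \n \r,
# which the whitespace normalizer handles later in the pipeline.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_control(text: str) -> str:
    """Replace control characters with a space to preserve word boundaries."""
    return _CONTROL_CHARS.sub(" ", text)
```

Replacing rather than deleting matters: deleting `\x00` in `"foo\x00bar"` would fuse two words into `"foobar"`.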
SpecialCharCleaner ¶
Remove noise characters while preserving semantic punctuation.
Source code in ondine/utils/input_preprocessing.py
clean ¶
Remove ®™© and excessive special chars.
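A plausible sketch of this cleaner, assuming the noise set is the ®™© legal symbols plus runs of repeated punctuation (the exact characters ondine targets may differ):

```python
import re

# Hypothetical noise definitions; ondine's actual set may differ.
_TRADEMARKS = re.compile(r"[®™©]")              # legal marks carry no semantic weight
_REPEATS = re.compile(r"([!?*#$%&_~^])\1{2,}")  # collapse runs of 3+ identical marks

def clean_special(text: str) -> str:
    """Drop ®™© and collapse excessive special chars, keeping normal punctuation."""
    text = _TRADEMARKS.sub("", text)
    return _REPEATS.sub(r"\1", text)
```

Sentence-level punctuation (periods, commas, question marks used once) passes through untouched, which is what "preserving semantic punctuation" means in practice.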
WhitespaceNormalizer ¶
Collapse multiple spaces/tabs/newlines.
clean ¶
Replace tabs/newlines with spaces, collapse multiples.
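The collapse step can be sketched with a single regex (a minimal sketch; ondine's implementation may also handle edge trimming differently):

```python
import re

def normalize_ws(text: str) -> str:
    """Replace tabs/newlines with spaces, collapse runs, and trim the edges."""
    return re.sub(r"\s+", " ", text).strip()
```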
TextTruncator ¶
Intelligently truncate at word boundaries.
clean ¶
Truncate respecting word boundaries.
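"Respecting word boundaries" means backing off from the hard cut point to the last space, so a prompt never ends mid-word. A sketch under that assumption (the ondine method signature is not confirmed):

```python
def truncate(text: str, max_length: int = 500) -> str:
    """Cut at max_length, backing off to the last space so no word is split."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    space = cut.rfind(" ")
    # Fall back to a hard cut when there is no space to back off to.
    return cut[:space] if space > 0 else cut
```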
TextPreprocessor ¶
Composable text preprocessor following Chain of Responsibility.
- Single Responsibility: orchestrates the cleaning steps.
- Open/Closed: extensible via the cleaners list.
- Dependency Inversion: depends on a Protocol, not concrete classes.
Initialize with default cleaning pipeline.
process ¶
Apply all cleaners in sequence.
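The chain can be sketched as a loop over objects that satisfy a `clean` Protocol. The class and method names mirror this page, but the constructor and internals are assumptions, not the ondine source:

```python
from typing import Protocol

class Cleaner(Protocol):
    """Any object exposing clean(text) -> str qualifies (Dependency Inversion)."""
    def clean(self, text: str) -> str: ...

class TextPreprocessor:
    """Run each cleaner over the text in order (Chain of Responsibility)."""
    def __init__(self, cleaners: list[Cleaner]):
        self.cleaners = cleaners

    def process(self, text: str) -> str:
        for cleaner in self.cleaners:
            text = cleaner.clean(text)
        return text
```

Because the pipeline only depends on the Protocol, adding a new step is just appending another object with a `clean` method, with no change to `TextPreprocessor` itself.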
preprocess_dataframe ¶
preprocess_dataframe(df: DataFrame, input_columns: list[str], max_length: int = 500) -> tuple[pd.DataFrame, PreprocessingStats]
Preprocess the given input columns of a dataframe, returning the cleaned copy and summary statistics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input dataframe | *required* |
| `input_columns` | `list[str]` | Columns to clean | *required* |
| `max_length` | `int` | Max chars per field | `500` |
Returns:

| Type | Description |
|---|---|
| `tuple[DataFrame, PreprocessingStats]` | `(cleaned_df, stats)` |
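A minimal sketch of how a function with this signature could be wired together, collecting the stats fields shown above. The whitespace-only cleaning and the stats bookkeeping are illustrative assumptions, not the ondine implementation:

```python
import re
from dataclasses import dataclass

import pandas as pd

@dataclass
class PreprocessingStats:
    rows_processed: int
    chars_before: int
    chars_after: int
    truncated_count: int
    null_count: int

def preprocess_dataframe(
    df: pd.DataFrame, input_columns: list[str], max_length: int = 500
) -> tuple[pd.DataFrame, PreprocessingStats]:
    """Clean the given columns on a copy of df and tally what was done."""
    df = df.copy()
    chars_before = chars_after = truncated = nulls = 0
    for col in input_columns:
        for i, value in df[col].items():
            if pd.isna(value):
                nulls += 1       # leave nulls in place, just count them
                continue
            text = str(value)
            chars_before += len(text)
            text = re.sub(r"\s+", " ", text).strip()  # stand-in for the full cleaner chain
            if len(text) > max_length:
                text = text[:max_length]
                truncated += 1
            chars_after += len(text)
            df.at[i, col] = text
    stats = PreprocessingStats(len(df), chars_before, chars_after, truncated, nulls)
    return df, stats
```

Typical usage: `cleaned, stats = preprocess_dataframe(df, ["title", "description"])`, then inspect `stats.truncated_count` to decide whether `max_length` is too aggressive for the dataset.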