Add German/Dutch prefixes and German title/degree suffixes #191
Merged
Closes #18.

Adds prefixes (aan, aen, auf, dem, freiherr, freiherrin, heer, het, op, te, tho, thoe, vande, vd) and titles/suffixes (Dipl.-Ing., FH-Prof., Gräfin, Me., PD, Priv.-Doz., RA, Univ.Prof., WP, ba, bsc, meng, stb, MdB/MdL/MdEP/MdA/MdHB/MdBB) that don't collide with existing English-language parsing.

Also fixes join_on_conjunctions() to register a conjunction-merged piece (e.g. "von" + "und" + "zu") as a prefix too, mirroring the existing title-handling, so multi-word prefix chains like German "von und zu" bridge correctly into the last name instead of getting stranded in the middle name.

Deliberately left out short, high-frequency English words (to, in, an, then, ten) that collide with common Korean/Vietnamese given-name syllables in the middle-token position, and bare "v" as a prefix, which collides with ordinary Western middle initials.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Covers two gaps flagged in review: a merged piece that's registered as
both a title and a prefix ("freiherr"), and a chain with more than one
non-contiguous conjunction bridging prefixes into the last name.
Summary
Closes #18 (Thomas Bachem's 58-test gist of German/Dutch names and international degrees, open for 11 years).
- Adds prefixes: `aan`, `aen`, `auf`, `dem`, `freiherr`, `freiherrin`, `heer`, `het`, `op`, `te`, `tho`, `thoe`, `vande`, `vd`.
- Adds titles/suffixes: `Dipl.-Ing.`, `FH-Prof.`, `Gräfin`, `Me.`, `PD`, `Priv.-Doz.`, `RA`, `Univ.Prof.`, `WP`, `ba`, `bsc`, `meng`, `stb`, `MdB`/`MdL`/`MdEP`/`MdA`/`MdHB`/`MdBB`.
- Fixes `join_on_conjunctions()` to register a conjunction-merged piece (e.g. `von` + `und` + `zu`) as a prefix too, mirroring the existing title-handling, so multi-word prefix chains like German "von und zu" bridge correctly into the last name instead of getting stranded in the middle name.

This takes the gist's suite from 21/68 passing to 46/68.
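The prefix-bridging behaviour can be sketched with a small standalone toy model. This is not the library's code: the word lists, the `split_name` helper, the `register_merged` flag, and the rule "register the merged piece when both neighbours are prefixes" are all assumptions made for illustration. Only the name `join_on_conjunctions` is borrowed from the PR.

```python
# Toy model only: NOT the library's implementation. The word lists and the
# registration rule below are assumptions for illustration.
CONJUNCTIONS = {"und", "and"}
BASE_PREFIXES = {"von", "zu", "van", "de", "auf"}


def join_on_conjunctions(pieces, prefixes, register_merged=True):
    """Merge "left CONJ right" into a single piece, scanning left to right."""
    pieces = list(pieces)
    i = 1
    while i < len(pieces) - 1:
        if pieces[i].lower() in CONJUNCTIONS:
            left, right = pieces[i - 1], pieces[i + 1]
            merged = " ".join((left, pieces[i], right))
            # Assumed rule: the merged piece counts as a prefix when both
            # of its neighbours were prefixes themselves.
            if (register_merged
                    and left.lower() in prefixes
                    and right.lower() in prefixes):
                prefixes.add(merged.lower())
            # Replace the three pieces with the merged one. It now sits at
            # index i - 1, so pieces[i] is the next piece not yet examined.
            pieces[i - 1:i + 2] = [merged]
        else:
            i += 1
    return pieces


def split_name(full_name, register_merged=True):
    """Return (first, middle, last); a prefix starts the last name."""
    prefixes = set(BASE_PREFIXES)  # copy, so registrations stay local
    pieces = join_on_conjunctions(full_name.split(), prefixes, register_merged)
    if not pieces:
        return "", "", ""
    first, rest = pieces[0], pieces[1:]
    # The first prefix found before the final piece opens the last name.
    for idx, piece in enumerate(rest[:-1]):
        if piece.lower() in prefixes:
            return first, " ".join(rest[:idx]), " ".join(rest[idx:])
    middle = " ".join(rest[:-1])
    last = rest[-1] if rest else ""
    return first, middle, last


# With registration, the merged chain bridges into the last name.
print(split_name("Karl von und zu Guttenberg"))
# Without it, the merged piece is stranded in the middle name.
print(split_name("Karl von und zu Guttenberg", register_merged=False))
```

In this toy model the first call yields `("Karl", "", "von und zu Guttenberg")` and the second yields `("Karl", "von und zu", "Guttenberg")`, which mirrors the before/after behaviour the fix describes.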
Deliberately not included, with reasoning verified by test:
- `to`, `in`, `an`, `then`, `ten` as global prefixes: these are common Korean/Vietnamese given-name syllables in the middle-token position (e.g. `Park In Hwan`), and adding them regresses a currently-correct parse for those names, not just an ambiguous case.
- `v` as a prefix (for German "v. Kloppenheim"): collides with ordinary Western middle initials (`John V. Smith` breaks).
- `suffix` for what this library correctly parses as a leading `title` (e.g. `Mag.`, `RA`, `Dipl.-Ing.`): consistent with existing conventions for `Dr.`/`MD`/`PhD`, so not changed.
- `Dr. rer. nat.`, `LL. M.`, `M. Sc.` need new joining logic beyond config additions; out of scope here.

Test plan
- `python -m pytest tests/`: 990 passed, 4 skipped, 22 xfailed (no regressions)
- `tests/test_conjunctions.py` covering the `join_on_conjunctions` prefix-bridging fix
- […] (`vd` already a suffix acronym for a different meaning, `ra` similarly, `freiherr` already a leading title)

🤖 Generated with Claude Code
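The collisions listed under "Deliberately not included" can be illustrated with a minimal standalone sketch. It does not use the library: `split_with_prefixes` and both prefix sets are hypothetical, and the splitting rule (the first prefix found before the final token starts the last name) is an assumption chosen to make the effect visible.

```python
# Hypothetical toy splitter, NOT the library's parser.
def split_with_prefixes(full_name, prefixes):
    """Return (first, middle, last); a prefix token starts the last name."""
    pieces = full_name.split()
    if not pieces:
        return "", "", ""
    first, rest = pieces[0], pieces[1:]
    for idx, piece in enumerate(rest[:-1]):
        # Compare case-insensitively and ignore a trailing period ("V.").
        if piece.lower().rstrip(".") in prefixes:
            return first, " ".join(rest[:idx]), " ".join(rest[idx:])
    return first, " ".join(rest[:-1]), (rest[-1] if rest else "")


# Assumed prefix sets for the demonstration.
SAFE_PREFIXES = {"von", "van", "de"}
RISKY_PREFIXES = SAFE_PREFIXES | {"in", "v"}

for name in ("Park In Hwan", "John V. Smith"):
    print(name, split_with_prefixes(name, SAFE_PREFIXES))
    print(name, split_with_prefixes(name, RISKY_PREFIXES))
```

With the safe set, `In` and `V.` stay in the middle position. With the risky set, both are absorbed into the last name (`"In Hwan"`, `"V. Smith"`), which is the kind of change to an already-correct parse that the exclusions are meant to avoid.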