There are two distinct questions here, to which implementations can provide diff...

lucb1e · 2024-04-26T18:34:45

Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python 3.3 and later pick a one-, two-, or four-byte representation, whichever is the smallest that can represent every character in the string. If even one character in the string needs more than 2 bytes, every character takes 4 bytes in memory, so that you get O(1) lookups at arbitrary offsets. The more you know :)
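If anyone wants to poke at this themselves, here is a rough sketch. It assumes CPython 3.3+ and leans on `sys.getsizeof`, which reports implementation-specific sizes, so treat the measured numbers as an illustration rather than a guarantee. `predicted_width` and `measured_width` are names I made up for this example; the thresholds (U+00FF and U+FFFF) are the ones described in PEP 393.

```python
import sys

def predicted_width(s: str) -> int:
    """Bytes per code point a PEP 393 style layout would need for s."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:       # fits in one byte (Latin-1 range)
        return 1
    if highest <= 0xFFFF:     # fits in two bytes (UCS-2 range)
        return 2
    return 4                  # needs the full four bytes (UCS-4)

def measured_width(ch: str, n: int = 1000) -> int:
    """Estimate bytes per code point on this interpreter (CPython assumed).

    Comparing two strings of different lengths cancels out the fixed
    object header, leaving only the per-character storage cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

if __name__ == "__main__":
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", predicted_width(ch), measured_width(ch))
```

The practical upshot is the one mentioned above: a single astral character, such as an emoji, in an otherwise ASCII string quadruples the per-character storage for the whole string.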

aktiur · 2024-04-27T19:05:08

Before Python 3.3, the format used to represent `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393), with a new way of representing strings: current Python will use Latin-1 (one byte per character, with pure-ASCII strings as a special case) if there is no code point higher than U+00FF, UCS-2 (UTF-16 without surrogate pairs) if there is no code point higher than U+FFFF, and UTF-32 otherwise.

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

samus · 2024-04-26T22:27:18

Since Java 9, the JVM does something similar: if a string contains only characters in ISO-8859-1, it is stored with one byte per character; otherwise the usual storage format (UTF-16, two bytes per code unit) is used.

tialaramex · 2024-04-26T18:52:35

Yeah, I started writing about what you found (the answer to (2) for Python), realised it was a huge rabbit hole, and decided to stop short and post. So, apologies, I guess.