You can verify this in the documentation by switching the version. So ... since 3.2: ...

aktiur · 2024-04-26T16:36:50

> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, which was already UTF-8 in 3.1 (I checked the source code).
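A quick way to check this on whatever interpreter you have at hand (a minimal sketch; the variable names are only for illustration):

```python
import sys

# Encoding used for implicit str/bytes conversions; "utf-8" on Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# str.encode() with no argument also uses UTF-8.
encoded = "é".encode()
print(encoded)
```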

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393), which introduced a new way of representing strings: current CPython uses one byte per character (Latin-1) if no code point is above U+00FF, UCS-2 (UTF-16 without surrogate pairs) if no code point is above U+FFFF, and UTF-32 otherwise. None of that existed in 3.1, where you used either the "narrow" build of Python (UTF-16) or the "wide" build (UTF-32).
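If you want to see the variable-width storage for yourself, here is a rough sketch. It assumes a CPython build with the flexible representation and relies on `sys.getsizeof`, whose numbers are an implementation detail; `bytes_per_char` is a hypothetical helper, not a standard function. Note that characters in the Latin-1 range, such as "é", also fit in one byte.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the sizes cancels out the fixed object header.
    """
    shorter = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // n

widths = {
    "a": bytes_per_char("a"),    # ASCII
    "é": bytes_per_char("é"),    # U+00E9, Latin-1 range
    "€": bytes_per_char("€"),    # U+20AC, fits in 16 bits
    "😀": bytes_per_char("😀"),  # U+1F600, above U+FFFF
}
print(widths)
```

The width is chosen per string from its highest code point, so a single emoji in a long ASCII string makes the whole string use four bytes per character.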

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

_ache_ · 2024-04-26T16:57:42

Thank you! The documentation was misleading about the "default encoding of string".

int_19h · 2024-04-26T21:41:00

The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.
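To illustrate what "sequence of code points" means in practice, here is a small sketch (the sample string is arbitrary). Length and indexing count code points, and byte counts only appear once you pick an encoding:

```python
s = "a€😀"

# len() and indexing work on code points, not bytes or UTF-16 units.
length = len(s)
last = s[2]
code_points = [hex(ord(c)) for c in s]

# Byte sizes only exist once an encoding is chosen explicitly.
utf8_len = len(s.encode("utf-8"))
utf16_len = len(s.encode("utf-16-le"))
utf32_len = len(s.encode("utf-32-le"))

print(length, code_points, utf8_len, utf16_len, utf32_len)
```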

Dylan16807 · 2024-04-27T00:30:49

32 bit specifically?

The most expansive Unicode has ever been was 31 bits, and UTF-8 is also capable of at most 31 bits.

int_19h · 2024-04-27T10:12:14

You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However, I'm not sure that still holds true when third-party native modules are in the picture.
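For reference, a minimal sketch of the limit as seen from pure Python (it says nothing about what native modules can do; the flag variables are only for illustration):

```python
import sys

# Highest code point a str can hold on Python 3.3 and later.
max_code_point = sys.maxunicode
print(hex(max_code_point))

# chr() rejects anything past that limit.
try:
    chr(0x110000)
    chr_rejected = False
except ValueError:
    chr_rejected = True

# A "\U..." escape past the limit fails when the literal is compiled.
try:
    compile(r'"\U00110000"', "<example>", "eval")
    escape_rejected = False
except SyntaxError:
    escape_rejected = True

print(chr_rejected, escape_rejected)
```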