PEP 686 – Make UTF-8 mode default (python.org)
238 points by GalaxySnail 11 days ago | 115 comments





Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.


With system-default code pages on Windows, it's not only platform-dependent, it's also System Locale dependent.

Windows badly dropped the ball here by not providing a simple opt-in way to make all the Ansi functions (TextOutA, etc) use the UTF-8 code page, until many many years later with the manifest file. This should have been a feature introduced in NT4 or Windows 98, not something that's put off until midway through Windows 10's development cycle.


I suspect that is a symptom of Microsoft being an enormously large organization. Coordinating a change like this that cuts across all apps, services and drivers is monumental. Honestly it is quite refreshing to see them do it with Copilot integration across all things MS. I don’t use it though, just admire the valiant effort and focus it takes to pull off something like this.

Of course, it goes without saying that this only works when the directive comes from all the way at the top. Otherwise there will be just too many conflicting incentives for any real change to happen.

While I am on this topic - I want to mention Apple. It is absolutely bonkers how they have done exactly this countless times. Like changing your entire platform architecture! It could have been like opening a can of worms but they knew what they were doing. Kudos to them.

Also... (sorry, this is becoming a long post) civil and industrial engineering firms routinely pull off projects like that. But the point I wanted to emphasize is that it's very uncommon in tech, which prides itself on having decentralized and semi-autonomous teams vs centralized and highly aligned teams.


> While I am on this topic - I want to mention Apple. It is absolutely bonkers how they have done exactly this countless times. Like changing your entire platform architecture! It could have been like opening a can of worms but they knew what they were doing. Kudos to them.

Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary. AFAIK, Apple doesn't care about the possibility to run binaries from the '90s on a modern stack.

Edit: even though it's expensive, it's possible to conduct such ecosystem-wide changes if you hold all cards in your hand. Microsoft was able to reengineer the graphical subsystem somewhere between XP and 8. Doing something like this is magnitudes more difficult on Linux (Wayland says hi). Google could maybe do it within their Android corner, but they generally give a sh*t about backwards compatibility.


> Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary.

I don't think the walled garden makes much of a difference when it comes to compatibility on, say, macOS. They still have to carefully weigh the ecosystem-wide cost of deprecating old APIs against the ecosystem-wide long-term benefits. Yes the decision remains entirely their own, but a lot of stakeholders indirectly weigh on the decision.

GTK and Qt also make backwards-incompatible new versions as they evolve. The biggest difference here is that in theory someone could keep maintaining the old library code if they decided that updating their application code was always going to be harder. How rarely this actually happens gives weight to the argument that developers can accept occasional API overhauls in exchange for staying on the well-supported low-tech-debt path.

So walled or open makes no difference here: even on the open platform, application developers are largely at the mercy of where development effort on libraries and frameworks is going. Nobody can afford to build their own exclusive frameworks to an acceptable standard, and if we want to get away from the technical debt of the 90s then the shared frameworks have to make breaking changes occasionally and strategically.

> AFAIK, Apple doesn't care about the possibilty to run binaries from the '90s on a modern stack.

Definitely, and I don't either. It's kind of a silver lining that Apple wasn't the enterprise heavy-hitter that Microsoft was at the time, because if it had been, its entire culture and landscape would be shaped by it like Microsoft's was. I think we have enough of that in the industry already.

When an old platform is that old, it's really hard to justify making it a seamless subset of the modern platform, and it makes more sense to talk about some form of virtualization. This is where even Windows falls down on both counts. How well modern Windows runs old software is far more variable than people assume until they try it. Anything with 8-bit colors may not work at all.

VirtualBox, qemu, etc. have increasingly poor support for DOS-based Windows (95, 98, ME) because not enough people care about that even in the context of virtualization. After trying every free virtualization option to run some 90s Windows software, I ended up finding that WINE was more compatible with that era than modern Windows is, without any of the jank of running a real Windows in qemu or VirtualBox.

So even with the OS most famous for backwards-compatibility and the enormous technical debt that carries, compatibility has been slowly sliding, even worse than open source projects with no direct lineage to the same platform and no commercial motives.

It's perfectly justifiable to reset technical debt here, whether walled or open. If people have enough need to run old software, there should be a market of solutions to that problem, yet it generally remains niche or hobbyist, and even the big commercial vendors overestimate how well they're doing it.


I still see people getting nailed by CP1251.

Recently, I've gotten bitten by UTF-16 (because somewhere along the line, something on a Windows machine generated a file by piping it in PowerShell).


"UCS-2 is enough for anyone"

UCS-2 is why we have the WTF-8 encoding standard, which allows mismatched UTF-16 surrogate pairs to survive a round-trip through an 8-bit encoding.

https://simonsapin.github.io/wtf-8/


Historically it made sense, when most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but user’s preferred locale-dependent. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2010.


It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.


The BOM cases are at best a consequence of trying to use poor quality Windows software to do stuff it's not suited to. It's true that in terms of Unicode text it's valid for a UTF-8 string to have a BOM, but just because that's true in the text itself doesn't magically change file formats which long pre-dated that.

Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense to have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make that true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939; that's not how Time's Arrow works.


poor quality Windows software to do stuff it's not suited to

Depends how wide your definition of "poor quality" is. All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.


> Depends how wide your definition of "poor quality" is.

This is an example of poor quality software:

> All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

Powershell is not that old. Assuming local encoding is inexcusable here.


Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.


> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.
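
A quick interpreter sketch of the difference (plain utf-8 keeps the BOM as a U+FEFF codepoint, utf-8-sig strips it, and the round-trip problem shows up when the string itself starts with U+FEFF):

    >>> b'\xef\xbb\xbfhello'.decode('utf-8')       # BOM survives as U+FEFF
    '\ufeffhello'
    >>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')   # BOM is stripped
    'hello'
    >>> '\ufeff'.encode('utf-8').decode('utf-8-sig')   # a real U+FEFF is lost on the way back
    ''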


Really? In my experience it's pretty rare for Linux programs not to understand any multibyte utf-8 (which would be anything that isn't ascii). What is somewhat common is failing on code points outside the basic multilingual plane (codepoints that don't fit in 16 bits).

> Not just platform-dependent, but user’s preferred locale-dependent.

Historically it made sense to be locale-dependent, but even then it was annoying to be platform-dependent.

One is not a subset of the other.


Not sure what you mean by that with regard to encodings. The C APIs were explicitly designed to abstract from that, and together with libraries like iconv it was rather straightforward. You only needed to be aware that there is a difference between internal and external encoding, and maybe decide between char and wchar_t.

Not everything is C, and nothing like that saves you when you move your floppy between computers.

> platform-dependent.

It's 2024 and we still can't all agree on line endings. Mac vs Win vs Unix...


Mac OS and Unix agreed about twenty years ago to use the same ending: https://superuser.com/a/439443

By which time XP was already in the middle of releasing, so it was too late to get Windows on board.

It's too bad; with a bit more planning and an earlier realization that Unicode cannot in fact fit into 16 bits, Windows might have used UTF-8 internally.


Unless I’m mistaken, Rhapsody (released 1997) used LF, not CR. At that point it was pretty clear Mac was moving towards Unix through NeXTSTEP, meaning every OS except windows would be using LF. Microsoft would’ve had around 6 years before the release of XP, and probably would’ve had time to start the transition with Win2K at the end of 1999.

Every OS except the one that had 95% market share in the late 90s. Apple was only propped up "Weekend at Bernie's" style to appease regulators.

> and an earlier realization that Unicode cannot in fact fit into 16 bits

The Unicode consortium already realized it when they decided on Han unification; they just didn't accept it yet.


It's 2024; everything but Windows has been UTF-8 and \n for twenty years.

Linux was definitely not uniformly UTF-8 twenty years ago. It was one of the many available locales, but it was still common to use other encodings, and plenty of software didn't handle multibyte well in general.

Emacs was amazing for that; builtin text encoders/decoders/transcoders for everything.

My experience was that brittleness around text encoding in Emacs (versions 22 and 23 or so) was a constant source of annoyance for years.

IIRC, the main way this brittleness bit me was that every time a buffer containing a non-ASCII character was saved, Emacs would engage me in a conversation (which I found tedious and distracting) about what coding system I would like to use to save the file, and I never found a sane way to configure it to avoid such conversations even after spending hours learning about how Emacs does coding systems: I simply had to wait (a year or 3) for a new version of Emacs in which the code for saving buffers worked better.

I think some people like engaging in these conversations with their computers even though the conversations are very boring and repetitive and that such conversation-likers are numerous among Emacs users or at least Emacs maintainers.


TBH gVim and most editors did the same with prompts on saving, but you could certainly configure that under Emacs with M-x customize, and Emacs supported weirdly encoded files on the spot.

Etch came out in 2007, not 2010.

Ah, I had misremembered, and misread https://www.debian.org/releases/etch/.

A different one that just bit me the other day was implicitly changing line endings. Local testing on my corporate laptop all went according to plan. Deploy to linux host and downstream application cannot consume it because it requires CRLF.

Just one of those stupid little things you have to remember from time to time. Although why newly written software requires a specific line terminator is a valid question.


Yeah, this has bitten me several times as soon as a people use the code on Windows.

Not relying on flaky system defaults is a good thing. These things have a way of turning around and being different than what you assume them to be. A few years ago I was dealing with Ubuntu and some init.d scripts. One issue I ran into was that some script we used to launch Java (this was before docker) was running as root (bad, I know) and with a shell that did not set UTF-8 as the default, as would be completely normal for regular users. And of course that revealed some bad APIs that we were using in Java that use the OS default. Most of these things have variants that allow you to set the encoding at this point and a lot of static code checkers will warn you if you use the wrong one. But of course it only takes one place for this to start messing up content.

These days it's less of an issue, but I would simply never rely on the OS to get this right. Most uses of encodings other than UTF-8 are extremely likely to be unintentional at this point. And if it is intentional, you should be very explicit about it and not rely on weird indirect configuration through the OS that may or may not line up.

So, good change. Anything that breaks over this is probably better off with the simple fix added. And it's not worth leaving everything else as broken as it is with content corruption bugs just waiting to happen.


I was using a .gitignore generated by an aliased touch function in PowerShell. Despite my best efforts, I could not get git to respect its gitignore. Figured out the touched text file was UTF-16 and basically not respected at all. Lesson learned: I changed a system default to UTF-8, but just rely on my text editor now.

Global locales were a mistake in general, not just the encoding part. printf("%f", 4.2) should not magically output different strings depending on the environment, that just causes more problems than it solves. Instead you should have to explicitly pass the local information (or relevant parts of it) to functions that you want to make locale-dependent.
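
Python's locale module demonstrates the same hazard; a quick sketch, assuming a de_DE.UTF-8 locale happens to be installed:

    >>> import locale
    >>> locale.format_string('%f', 4.2)                  # default C locale
    '4.200000'
    >>> locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')   # raises locale.Error if not installed
    'de_DE.UTF-8'
    >>> locale.format_string('%f', 4.2)                  # same call, different output
    '4,200000'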

The following heuristic has become increasingly true over the last couple of decades: If you have some kind of "charset" configuration anywhere, and it's not UTF-8, it's wrong.

Python 2 was charset-agnostic, so it always worked, but the change with Python 3 was not only an improvement. How do you tell a Python 3 script from a Python 2 script?

* If it contains the string "utf-8", it's Python3.

* If it only works if your locale is C.UTF-8, it's Python3.

Needless to say, I welcome this change. The way I understand it, it would "repair" Python 3.


I thought it was default since python 3.

You may be thinking of strings, where the u"" prefix was made obsolete in Python 3. Then again, trying on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters, so I don't actually know what that u prefix ever did. But one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings.

This change seems to be about things like open('filename', mode='r') mainly on Windows where the default encoding is not UTF-8 and so you'd have to specify open('filename', mode='r', encoding='UTF-8')
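
A sketch of the difference ('notes.txt' is just a placeholder filename):

    # Today this decodes with locale.getpreferredencoding(),
    # e.g. cp1252 on many Western-European Windows installs:
    text = open('notes.txt').read()

    # Explicit encoding, which is effectively what PEP 686 makes the default:
    text = open('notes.txt', encoding='utf-8').read()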


Python has two types of strings: byte strings (every character is in the range of 0-255) and Unicode strings (every character is a Unicode codepoint). In Python 2.x, "" maps to a byte string and u"" maps to a Unicode string; in Python 3.x, "" maps to a unicode string and b"" maps to a byte string.

If you typed in "éķů" in Python 2.7, what you get is a string consisting of the hex chars 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF, which if you printed it out and displayed it as UTF-8--the default of most terminals--would appear to be éķů. But "éķů"[1] would return a byte string of \xa9 which isn't valid UTF-8 and would likely display as garbage.

If you instead had used u"éķů", you'd instead get a string of three Unicode code points, U+00E9 U+0137 U+016F. And u"éķů"[1] would return u"ķ", which is a valid Unicode character.
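
The same distinction still exists in Python 3, just with the defaults flipped; a quick sketch (note that indexing bytes there gives you an int, not a one-byte string):

    >>> s = "éķů"                 # str: three codepoints
    >>> len(s), s[1]
    (3, 'ķ')
    >>> b = s.encode('utf-8')     # bytes: six octets
    >>> len(b), b
    (6, b'\xc3\xa9\xc4\xb7\xc5\xaf')
    >>> b[1]                      # an int, 0xA9
    169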


> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` is a sequence of Unicode codepoints (not UTF-8, which is a specific encoding for Unicode codepoints), without reference to any encoding. `bytes` is an arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example:

    >>> "évènement".encode("utf-8")
    b'\xc3\xa9v\xc3\xa8nement'
    >>> "évènement".encode("latin-1")
    b'\xe9v\xe8nement'


> `str` is a sequence of Unicode codepoints (not UTF-8, which is a specific encoding for Unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/
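
For what it's worth, CPython already exposes that behaviour per call through the 'surrogatepass' error handler; a sketch of the round-trip the default codec refuses:

    >>> '\udead'.encode('utf-8', 'surrogatepass')
    b'\xed\xba\xad'
    >>> b'\xed\xba\xad'.decode('utf-8', 'surrogatepass')
    '\udead'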


My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.

> `str` is a sequence of Unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!


There are two distinct questions here, to which implementations can provide different answers

1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.

2. Representation: What is actually stored (in memory) ? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.


Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)


Before Python 3.3, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3, with a new way of representing strings: current Python will use ASCII if there's no non-ASCII char, UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise.

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...
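
You can see the flexible representation indirectly through object sizes; a small sketch (exact byte counts vary by CPython version, and strictly the widest codepoint is rounded up to 255, 65535, or 1114111, as another comment here notes):

    import sys

    print(sys.getsizeof('a' * 1000))                # ASCII only: 1 byte per character
    print(sys.getsizeof('\u0137' + 'a' * 999))      # a codepoint above U+00FF: 2 bytes per character
    print(sys.getsizeof('\U0001F600' + 'a' * 999))  # a codepoint above U+FFFF: 4 bytes per character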


Since Java 9, the Java JRE does something similar: if a string contains only characters in ISO-8859-1 then it is stored as such (one byte per char), else the usual UTF-16 storage (two bytes per char) is used.

Yeah, I started writing about what you found (the answer to (2) for Python) and I realised that's a huge rabbit hole I was venturing down and decided to stop short and post, so, apologies I guess.

The Python source code is UTF-8 by default in Python 3. But that says nothing about the character encoding used to save to a file, which is locale-dependent by default.

    # string literals create str objects using utf-8 by default
    Path("filenames use their own encoding").write_text("file content encoding uses yet another encoding")
The corresponding encodings are:

- utf-8 [tokenize.open]

- sys.getfilesystemencoding() [os.fsencode]

- locale.getpreferredencoding() [open]


> And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

Oh, I missed Java moving from UTF-16 to UTF-8.


With Java, the default encoding when converting bytes to strings was originally platform dependent, but now it's UTF-8. UTF-16 and latin-1 encodings are (still*) used internally by the String class, and the JVM uses a modified UTF-8 encoding like it always has.

* The String class originally only used UTF-16 encoding, but since Java 9 it also uses a single-byte-per-character latin-1 encoding when possible.


It seems you are mixing two things: inner string representation and read/write encoding. Java has never used UTF-16 as default for the second.

Or possibly confusing it with JavaScript, which treats strings as sequences of UTF-16 characters?

Not even on Windows?

No, file I/O on Windows in general doesn’t use UTF-16, but the regional code page, or nowadays UTF-8 if the application decides so.

Depends on what you define as "file I/O", though. NTFS filenames are UTF-16 (or rather UCS-2). As far as file contents, there isn't really a standard, but FWIW for a long time most Windows apps - Notepad being the canonical example - when asked to save anything as "Unicode" would save it as UTF-16.

I'm talking about the default behavior of Microsoft's C runtime (MSVCRT.DLL) that everyone is/was using.

UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
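
The two serializations and the BOM are easy to see from Python; a sketch (the unmarked 'utf-16' codec writes the platform's native byte order, little-endian on most machines):

    >>> 'abc'.encode('utf-16-le')
    b'a\x00b\x00c\x00'
    >>> 'abc'.encode('utf-16-be')
    b'\x00a\x00b\x00c'
    >>> 'abc'.encode('utf-16')     # BOM first, then native-order code units
    b'\xff\xfea\x00b\x00c\x00'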


PowerShell used to output UTF-16 by default on Windows. It might still, but it's been a while since I needed to try.

Then you're talking about the C stdlib, which, yeah, is meant to use the locale-specific encoding on any platform, so it's not really a Windows thing specifically. But even then someone could use the CRT but call _wfopen() rather than fopen() etc - this was actually not uncommon for Windows software precisely because it let you handle Unicode without having to work with the Win32 API directly.

Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.


Seems it happened two years ago, with Java 18.

Is the internal encoding in CPython UTF-8 yet?

You can index through Python strings with a subscript, but random access is rare enough that it's probably worthwhile to lazily index a string when needed. If you just need to advance or back up by 1, you don't need an index. So an internal representation of UTF-8 is quite possible.


The PyUnicode object is what represents a str. If the UTF-8 bytes are ever requested, then a bytes object is created on demand and cached as part of the PyUnicode, being freed when the PyUnicode itself is freed.

Separately from that, the codepoints making up the string are stored in a straightforward array allowing random access. The size of each codepoint can be 1, 2, or 4 bytes. When you create a PyUnicode you have to specify the maximum codepoint value, which is rounded up to 127, 255, 65535, or 1,114,111. That determines whether 1, 2, or 4 bytes is used.

If the maximum codepoint value is 127 then that array representation can be used for the UTF-8 directly. So the answer to your question is that many strings are stored as UTF-8 because all the codepoints are <= 127.

Separately from that, advancing through strings should not be done by codepoints anyway. A user-perceived character (aka grapheme cluster) is made up of one or more codepoints. For example, an e with an accent could be the e codepoint followed by a combining accent codepoint. The phoenix emoji is really the bird emoji, a zero-width joiner, and then the fire emoji. Some writing systems used by hundreds of millions of people work similarly, with consonants as base characters and combining marks to represent vowels.

This - a woman-facepalming emoji with a skin-tone modifier - is 5 codepoints. There is a good blog post diving into it and how various languages report its "length": https://hsivonen.fi/string-length/
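
A quick sketch of codepoints vs. user-perceived characters in the interpreter:

    >>> s = 'e\u0301'             # 'e' plus a combining acute accent: one perceived character
    >>> s, len(s)
    ('é', 2)
    >>> len('\U0001F926\U0001F3FC\u200D\u2640\uFE0F')   # facepalm + skin tone + ZWJ + female sign + VS16
    5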

Source: I've just finished implementing Unicode TR29 which covers this for a Python C extension.


Why not utf-8-sig, though? It handles optional BOMs. Had to fix a script last week that choked on it.

At this point nothing ought to be inserting BOMs in utf-8. It's not recommended, and I think choking on it is reasonable behaviour these days.

Why were BOMs ever allowed for UTF-8?

Some editors used them to help detect UTF-8 encoded files. Since they are also valid zero-width no-break space characters, they also served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.

An attempt to store the encoding needed to decode the data with the data, rather than requiring the reader to know it somehow. Your program wouldn't have to care if its source data had been encoded as UTF-8, UTF-16, UTF-32 or some future standard. The usual sort of compromise that comes out of committees, in this case where every committee member wanted to be able to spit their preferred in-memory Unicode string representation to disk with no encoding overhead.

When UTF-8 was still very much not the default encoding for text files it was useful to have a way to signal that a file was UTF-8 and not the local system encoding.

Some algorithms can operate much easier if they can assume that multibyte or variable byte characters don't exist. The BOM means that you don't have to scan the entire document to know if you can do that.

No, a missing BOM does not guarantee that the file doesn't contain multi-byte code points and never has.

That isn't what I said.

I said that if a BOM is present, then you explicitly know that multi-byte characters are possibly present. Therefore, if it's present you know that assuming that the Nth byte is the Nth code point is unsafe.

The opposite is irrelevant. There's never any way to safely determine a text file's encoding if there is no BOM present.


The only reason I used it was to force MSVC to understand my u8"" literals. Should've forced /utf-8 in our build system, in retrospect.

For UTF-16/32, knowing the endianness doesn't seem to be a frivolous feature. And in fact, having to use heuristics-based detection via uchardet is a big mess; some kind of header should have been standardized from the start.


Basically every C# program will insert BOMs into text files by default unless you opt-out.

Where did you get that from?

It's the behavior when using the default `Encoding.UTF8` static. You have to create your own instance as `new UTF8Encoding(false)` if you don't want a BOM.

This is true for `UTF8Encoding` used as an encoder (e.g. within a transcoding stream, not often used today).

Other APIs, however, like File.WriteAllText, do not write BOM unless you explicitly pass encoding that does so (by returning non-empty preamble).


I actually did not know that File.WriteAllText/new StreamWriter defaulted to UTF-8 without BOM if no encoding was specified. I always passed in an encoding to those functions, and "Encoding.UTF8" has a BOM by default. Without specifying any encoding, I just assumed it would pick your system locale, because all the default String <-> Number conversion functions will indeed do that.

There are some coding standards for C# that mandate passing in the maximum number of parameters to a function, and never allow default parameter values to be used. Sometimes this is a big win (it prevents all that Current Culture nonsense when converting between numbers and strings; you need Invariant Culture almost all the time), and other times it introduces bugs (using the wrong value when creating message boxes puts them on the logon desktop instead of the user's screen).


It's a different overload. Encoding is not an optional parameter: https://learn.microsoft.com/en-us/dotnet/api/system.io.file....

Enforcing an overload of the highest arity of arguments sounds like a really terrible rule to have.

Culture sensitivity is strictly different from locale, as it does not act like a C locale (unsound) but simply follows the delimiter/date/currency/etc. formats for parsing and formatting.

It is also in many places considered to be undesirable as it introduces environment-dependent behavior where it is not expected hence the analyzer will either suggest you to specify invariant culture or alternatively you can specify that in the project through InvariantGlobalization prop (to avoid CultureInfo.InvariantCulture spam). This is still orthogonal to text encoding however.


Because changing Python to silently prefix all IO with an invisible BOM isn't a good idea.

The expectation isn't for it to generate BOM in the output, but to handle BOM gracefully when it occurs in the input.

> On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file

https://docs.python.org/3/library/codecs.html

The codec you're imagining would also make reading a file and writing it back change the file if it contains a BOM.


Indeed it would, but since codecs are only used for files that are semantically text, and in such files BOM is basically a legacy no-op marker, it's not actually a problem. Naive code using text I/O APIs would also have this issue with line endings, for example, so there's precedent for not providing the perfect roundtrip experience (that's what bytes I/O is for).

Only one script? Out of how many?

Not the OP, but I see this pop up quite frequently in ETL, usually handling csv files.

On UTF-8, the Linux framebuffer should have had good UTF-8 support (a proper one, not 256/512 glyphs) long ago. Even GNU Hurd has had a better 'terminal console' with UTF-8 support since 2007 or so. It's 2024.

Nice. Now the only thing we need is JS to switch to UTF-8. But of course JS can't improve, because unlike any other programming language, we need to be compatible with code written in 1995.

This is about when you ask Python to open a file "as text", what encoding it will use by default. The internal representation of strings is a different matter and, like JavaScript, Python doesn't "just use UTF-8" for that.

> Additionally, many Python developers using Unix forget that the default encoding is platform dependent. They omit to specify encoding="utf-8" when they read text files encoded in UTF-8

"forget" or possibly simply aren't made well enough aware? I genuinely thought that python would only use UTF-8 for everything unless you explicitly ask it to do otherwise.


It actually depends!

`bytes.decode` (and `str.encode`) have used UTF-8 as a default since at least Python 3.

However, the default encoding used for decoding the names of files is `sys.getfilesystemencoding()`, which is also UTF-8 on Windows and macOS, but will vary with the locale on Linux (specifically with CODESET).

Finally, `open` will directly use `locale.getencoding()`.
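
A small sketch for checking what your own interpreter picks up (values shown are from a Linux box with a UTF-8 locale; the last two vary by platform and locale, and locale.getencoding() is the 3.11+ spelling):

    >>> import sys, locale
    >>> sys.getdefaultencoding()        # str.encode() / bytes.decode() default
    'utf-8'
    >>> sys.getfilesystemencoding()     # used for file names
    'utf-8'
    >>> locale.getpreferredencoding()   # used by open() for file contents, pre-PEP 686
    'UTF-8'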


Make UTF-8 default on Windows

Since Windows version 1903 (May 2019 Update) they have pushed for UTF-8. But Windows is a big pile of compatible legacy.

In addition to ApiFunctionA and ApiFunctionW, introduce ApiFunction8? (times whole API surface)

Introduce a #define UNICODE_NO_REALLY_ALL_UNICODE_WE_MEAN_IT_THIS_TIME ?


ApiFunctionA is UTF-8 capable. Needs a run-time switch too, not just compile-time.

Yes: https://learn.microsoft.com/en-us/windows/win32/sbscs/applic...

> On Windows 10, this element forces a process to use UTF-8 as the process code page. For more information, see Use the UTF-8 code page. On Windows 10, the only valid value for activeCodePage is UTF-8.

> This element was first added in Windows 10 version 1903 (May 2019 Update). You can declare this property and target/run on earlier Windows builds, but you must handle legacy code page detection and conversion as usual. This element has no attributes.


It's now possible, but for years the excuse was that MBCS encodings only supported characters up to 2 bytes.

Only under windows 11, I believe. And that switch is off by default.

You're thinking of the global setting that is enabled by the user and applies to all apps that operate in terms of "current code page" - if enabled, that codepage becomes 65001 (UTF-8).

However, on Win10+, apps themselves can explicitly opt into UTF-8 for all non-widechar Win32 APIs regardless of the current locale/codepage.


That would break so many applications and workflows that it will never happen.

That's exactly what this proposal (which has been accepted) is going to do.

I think they mean that the Windows operating system should default to UTF-8.

Lots of apps can't even handle a non-ASCII username on Windows.

I have seen the worst of it.

Too many companies running franken-software from decades ago.



Hm TIL, I thought that the string encoding argument to .decode() and .encode() was required, but now I see it defaults to "utf-8". Did that change at some point?

> ChatGPT4 says it's always been that way since the beginning of Python3

This is not a reliable way to look up information. It doesn't know when it's wrong.


I was just being conversational.

You can verify on the documentation by switching the version.

So ... since 3.2: https://docs.python.org/3.2/library/stdtypes.html#bytes.deco... In 3.1 it was the default encoding of string (the type str I guess). https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...


> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, which was already UTF-8 in 3.1 (I checked the source code).

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3, with a new way of representing strings: current Python will use ASCII if there's no non-ASCII char, UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise. But that did not exist in 3.1, where you could either use the "narrow" build of Python (which used UTF-16) or the "wide" build (which used UTF-32).

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...


Thank you! The documentation was misleading about "default encoding of string".

The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.

32 bit specifically?

The most expansive Unicode has ever been was 31 bits, and UTF-8 is also capable of at most 31 bits.


You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However I'm not sure that still holds true when third-party native modules are in the picture.
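
A quick sketch of where the standard facilities draw the line:

    >>> import sys
    >>> hex(sys.maxunicode)
    '0x10ffff'
    >>> chr(0x110000)
    Traceback (most recent call last):
      ...
    ValueError: chr() arg not in range(0x110000)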


