Added commentary about the overall design and its limitations.

2026-01-13 06:50:39 -08:00 · 2013-12-05 22:59:23 +02:00 · 2013-12-05 22:59:23 +02:00 · 0cd7a14e57
commit 0cd7a14e57
parent a22205d67c
1 changed files with 92 additions and 0 deletions
--- a/src/w32.c
+++ b/src/w32.c
@ -1290,6 +1290,98 @@ w32_valid_pointer_p (void *p, int size)



+/* Here's an overview of how the Windows build supports file names
+   that cannot be encoded by the current system codepage.
+
+   From the POV of Lisp and layers of C code above the functions here,
+   Emacs on Windows pretends that its file names are encoded in UTF-8;
+   see encode_file and decode_file on coding.c.  Any file name that is
+   passed as a unibyte string to C functions defined here is assumed
+   to be in UTF-8 encoding.  Any file name returned by functions
+   defined here must be in UTF-8 encoding, with only a few exceptions
+   reserved for a couple of special cases.  (Be sure to use
+   MAX_UTF8_PATH for char arrays that store UTF-8 encoded file names,
+   as they can be much longer than MAX_PATH!)
+
+   The UTF-8 encoded file names cannot be passed to system APIs, as
+   Windows does not support that.  Therefore, they are converted
+   either to UTF-16 or to the ANSI codepage, depending on the value of
+   w32-unicode-filenames, before calling any system APIs or CRT library
+   functions.  The default value of that variable is determined by the
+   OS on which Emacs runs: nil on Windows 9X and t otherwise, but the
+   user can change that default (although I don't see why would she
+   want to).
+
+   The 4 functions defined below, filename_to_utf16, filename_to_ansi,
+   filename_from_utf16, and filename_from_ansi, are the workhorses of
+   these conversions.  They rely on Windows native APIs
+   MultiByteToWideChar and WideCharToMultiByte; we cannot use
+   functions from coding.c here, because they allocate memory, which
+   is a bad idea on the level of libc, which is what the functions
+   here emulate.  (If you worry about performance due to constant
+   conversion back and forth from UTF-8 to UTF-16, then don't: first,
+   it was measured to take only a few microseconds on a not-so-fast
+   machine, and second, that's exactly what the ANSI APIs we used
+   before do anyway, because they are just thin wrappers around the
+   Unicode APIs.)
+
+   The variables file-name-coding-system and default-file-name-coding-system
+   still exist, but are actually used only when a file name needs to
+   be converted to the ANSI codepage.  This happens all the time when
+   w32-unicode-filenames is nil, but can also happen from time to time
+   when it is t.  Otherwise, these variables have no effect on file-name
+   encoding when w32-unicode-filenames is t; this is similar to
+   selection-coding-system.
+
+   This arrangement works very well, but it has a few gotchas:
+
+   . Lisp code that encodes or decodes file names manually should
+     normally use 'utf-8' as the coding-system on Windows,
+     disregarding file-name-coding-system.  This is a somewhat
+     unpleasant consequence, but it cannot be avoided.  Fortunately,
+     very few Lisp packages need to do that.
+
+     More generally, passing to library functions (e.g., fopen or
+     opendir) file names already encoded in the ANSI codepage is
+     explictly *verboten*, as all those functions, as shadowed and
+     emulated here, assume they will receive UTF-8 encoded file names.
+
+     For the same reasons, no CRT function or Win32 API can be called
+     directly in Emacs sources, without either converting the file
+     name sfrom UTF-8 to either UTF-16 or ANSI codepage, or going
+     through some shadowing function defined here.
+
+   . File names passed to external libraries, like the image libraries
+     and GnuTLS, need special handling.  These libraries generally
+     don't support UTF-16 or UTF-8 file names, so they must get file
+     names encoded in the ANSI codepage.  To facilitate using these
+     libraries with file names that are not encodable in the ANSI
+     codepage, use the function ansi_encode_filename, which will try
+     to use the short 8+3 alias of a file name if that file name is
+     not encodable in the ANSI codepage.  See image.c and gnutls.c for
+     examples of how this should be done.
+
+   . Running subprocesses in non-ASCII directories and with non-ASCII
+     file arguments is limited to the current codepage (even though
+     Emacs is perfectly capable of finding an executable program file
+     even in a directory whose name cannot be encoded in the curreent
+     codepage).  This is because the command-line arguments are
+     encoded _before_ they get to the w32-specific level, and the
+     encoding is not known in advance (it doesn't have to be the
+     current ANSI codepage), so w32proc.c functions cannot re-encode
+     them in UTF-16.  This should be fixed, but will also require
+     changes in cmdproxy.  The current limitation is not terribly bad
+     anyway, since very few, if any, Windows console programs that are
+     likely to be invoked by Emacs support UTF-16 encoded command
+     lines.
+
+   . For similar reasons, server.el and emacsclient are also limited
+     to the current ANSI codepage for now.
+
+*/
+
+
+
 /* Converting file names from UTF-8 to either UTF-16 or the ANSI
   codepage defined by file-name-coding-system.  */