sh-3.2$ f='Hello world'
sh-3.2$ echo $f
Hello world
sh-3.2$ for i in $f; do echo $i; done
Hello
world
sh-3.2$ f=$'Hello\xC2\xA0world'
sh-3.2$ echo $f
Hello world
sh-3.2$ for i in $f; do echo $i; done
Hello world
My team already uses `make` but there's no reason for me to run it in my Downloads folders. File names in there are sometimes wild. Yet I expect command line tools to work with them. If they will cease to do so, I will have to start using non-POSIX variants of those tools, I guess.
I don't know who "the Austin Group" mentioned in the article are, but how come they "could not find a single use-case for newlines in pathnames besides breaking naive scripts" when legitimate use-cases are so easy to find?
(And if they're that incompetent, why does the article imply they are worth quoting and listening to?)
It is [1] the joint working group that for the last 25+ years has been responsible for both the POSIX standard and the Single Unix Specification. It emerged after the UNIX wars as a consolidation of the various splintered UNIX standardization efforts (POSIX itself, X/OPEN, OSF, etc).
Is that legitimate? A path name is just a unique identifier for a file, IMO it doesn't make sense to put a whole novel in there. If anything, a giant summary like that should be in the meta tags?
In what way is it not legitimate? It's not an accident, bug or data corruption. Someone put it there for a reason, and it benefits their use case. That's as legitimate as it gets.
That's a core part of the problem: a path name is NOT just a unique identifier for a file. Desktop operating systems and their classical utilities conflate the "unique identifier" and whatever "displayed title" of a file though which the end user interacts with the file.
Users care about "titles" or "summaries" of files, not "filesystem identifiers"; as long as the two are conflated, non-technical users will use the identifier to write titles and thus make the file easy to locate in an interactive GUI. Meta tags are not even in the cognitive horizon of most people.
... the use case in the parent comment I was replying to.
And no I'm not going to copy that here for you to quip "that's not a legitimate use case". Make an effort to make a point and support it with better justification than "because I said so".
Right click, rename, enter, enter, enter (until the entire file name is visible on the box)? That's how I did it when I used Windows.
Edit: now I remember the most basic way: open the pdf, select and copy the title, click on rename and paste from clipboard. Works great to get the file name with the newlines exactly as they are on the title!
Yes - I just tested on Win10+11 because I thought "there is no way I didn't accidentally do something like this at some point... and I would have remembered seeing a new line in my file name when I made that mistake."
I just opened a folder in file explorer, clicked 'rename' and then tried the following combinations:
Enter
L Ctrl + Enter
L Alt + Enter
Win + Enter
R Ctrl + Enter
R Alt + Enter
None of them let me put new lines in the filename - it either did nothing, or 'closed' the rename view.
Shrug, I last used windows with Windows 7, so you are probably right. That being said, at least two of the students I am currently tutoring are on XP and one of my colleagues as well :D
Right, I just remembered the main way to create those filenames: open the pdf, select and copy the title, close, rename the file and paste from clipboard.
I am interested in hearing the rationale for downvotes explicitly. I am describing a reality that exists and must be taken into account. Why are you downvoting?
A correct script will have no problems with "-rf" or any other file name. I have (and recommend script writers make their own) a directory hierarchy of "dangerous" file names to test scripts.
For example, it contains a directory where all file and subdirectory names are in unary, consisting only of repetitions of the newline character. A correct script should be able to enumerate, access and modify files in there without issue.
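A quick, hedged sketch of building such a corpus (the directory name and the particular hostile names are just illustrative choices):

mkdir -p hostile-names && cd hostile-names || exit 1
touch -- '-rf' '--help' ' leading space' 'trailing space ' '*glob*' '$(subst)' "it's"
# and one name that is literally two lines long
touch -- 'first line
second line'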
It's a bandaid on a wider problem: the design of Unix shell is bonkers and the whole thing should be deleted. Why? Because I haven't seen any other tool ever have so many pitfalls. Take n random languages and m random developers and tell them to loop over a string array and print its contents, and count how many correct programs you get on average per language. There will be easy languages, then difficult languages, then a huge gap, then Unix shell, because in your random sample you managed to get one guy who has a PhD in bash.
The main problem is using text as a common format between different applications.
First: text is not well defined. Is it ASCII? Is it UTF-8? Some programs can spew UTF-32 with the proper locale configured; it's a mess.
Second: encoding and decoding of objects to text is not defined at all. Those problems with filenames are just one example. Using newline as a separator is a natural thing that is easy to implement, yet it is wrong.
In my opinion two things should be done:
1. Standardise on UTF-8. No other encodings allowed.
2. Standardise on JSON. It is good enough to serve as universal exchange format, tools like `jq` exist for some time now.
So any utility must read and write JSON objects with some standard env set. And shells can be developed with better syntax to deal with JSON. This way you can write something like
`ps aux | while read row; do echo ${row.user} ${row.pid}; done`
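Something close to that is already possible today wherever a tool emits JSON, by leaning on `jq` (a minimal sketch; `ps` itself does not emit JSON, so a hand-written JSON array stands in for its output here):

printf '%s\n' '[{"user":"root","pid":1},{"user":"alice","pid":4242}]' |
  jq -r '.[] | "\(.user) \(.pid)"'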
>It is good enough to serve as universal exchange format, tools like `jq` exist for some time now.
Please don't use that underdefined joke of a spec. Define "PosixJson" and use that instead. Right now it's not even clear what the result of parsing {"a": 1234678901234567890} is. Is this a parse error? A bigint? A float/double? Quiet wraparound? Something else? I've seen all these behaviors in real world JSON implementations across different languages.
> A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.
So, if you have some non-printable characters like BEL/␇/ASCII 0x07, that's still a text file.
(and I believe what bytes count as a valid character depend on your `LC_CTYPE`).
But the moment you have a line longer than {LINE_MAX} bytes (which can depend on which POSIX environment you have), suddenly your text file is now a binary file.
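As a rough illustration, a file can be checked against that definition with ordinary POSIX tools (a sketch; "somefile" is a placeholder, and the +1 accounts for LINE_MAX including the trailing <newline>):

limit=$(getconf LINE_MAX)
nuls=$(tr -cd '\000' < somefile | wc -c)
if [ "$nuls" -eq 0 ] &&
   LC_ALL=C awk -v max="$limit" 'length($0) + 1 > max { bad = 1 } END { exit bad }' somefile
then
    echo 'a text file (by the POSIX definition)'
else
    echo 'not a text file (by the POSIX definition)'
fi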
Kind of a weird definition indeed. One edge case: the definition states the file must contain characters, so presumably zero length files are out. But then how could you have zero lines?
Yes obviously. But the POSIX specification for a "text file" as above is that it contains characters, which an empty file by definition does not. So an empty file cannot be a text file if you read that specification strictly, and therefore you cannot have zero lines in a text file. As soon as you have a single character there is at least one line, and the amount of lines can only stay the same or grow from there.
The definition should read "one or more lines" instead or (probably better) specify that a text file contains "zero or more characters".
What cursed madness have you hit that spits out UTF-32 under normal conditions?! That can only be a bug - UTF-32/UCS-4 never saw external use, and has only ever been used for in-memory fixed-width character representation, e.g. runes in Go.
You never have to worry about whether you're dealing with ASCII vs. UTF-8, but rather if you're dealing with UTF-8 vs. ISO-8859-1, or worse, Shift JIS or similar.
I think a lot of tools should support json as well as plain text. Probably the latter by default, and the former with a "-o json" or similar option. I'm fine with wc giving me `5`, I'd prefer that to `{ "characters": 5 }`.
There are exchange formats that are well-defined enough to be useful to many computers while also being readable enough to be traversed by human eyes. There's no reason to do everything ad hoc; you don't get much by that. You also control the shell itself - there's no reason you can't display object representations in a pretty way.
JSON itself is bad for a streaming interface, as is common with CLI applications. You can't easily consume a JSON array without first reading it in its entirety. JSONL would be a better fit.
But then, how well would it work for ad-hoc usage, which is probably one of the biggest uses of shells?
> I haven't seen any other tool ever have so many pitfalls.
I haven't seen any other tool with so much general utility and availability.
> to loop over a string array and print its contents
Is incredibly easy in bash and bash like shells. As highlighted the issue is that tools like 'ls' don't create "a string array." They create one giant string that has to be parsed. The rules in the shell are different than in other languages but it /will/ do most of the parsing for you, or all of it, if you do it carefully.
This is a fine tradeoff. As evidenced by its wide usage and lack of convincing replacements.
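For instance, the shell's own globbing already hands you real, separate arguments, with no parsing of ls output at all (a minimal sketch):

for f in ./*; do
    [ -e "$f" ] || continue    # the unexpanded pattern itself, if nothing matched
    printf '%s\n' "$f"
done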
Someone needs to come up with an interactive shell first, one that is comparable in usability. Then we can think about replacing the unix shell.
I tried both python and lua interactively, but they are a pain when it comes to handling files. You have to type much more to get the same things done.
The bigger issue is the sheer momentum of Unix shell. Even if you come up with an alternative that is better by every objectively measurable metric, it's still going to be a monumental task to have it packaged with commonly used distros. Kinda like the "why can't the US switch to the metric system" problem.
OK let them add an explicit check to standard tools, and/or to open(), mkdir(), etc. with O_PORTABLECHARS. And an environment option to disable this check.
I'm sure you might get more than 5 people on HN replying to you that they are using fish right now. Say something discrediting about fish and they show up.
Heh, reminds me of how to get help with Linux back in the day. If you directly asked for help, you'd be told to RTFM. If you stated confidently that Windows could do something and that Linux sucks because it can't, you'd get users tripping over themselves with details and instructions, just to prove you wrong.
There's a direct cost in money, time and lives that has come from the US's adherence to their US Customary Units (which are often different to the old imperial units). People have literally died because of the confusion caused by having multiple systems of units in common use with ambiguous names (degrees, gallons, etc). Each year industry worldwide spends an enormous amount of money indirectly precisely because of this problem and it's still incredibly unlikely to be fixed within my lifetime.
Bash-alternatives that are not completely compatible frankly just don't have a chance.
If it isn't distributed out of the box with every nix-like OS, it inherently isn't “better by every objectively measurable metric” - distribution of a common, stable standard is a huge benefit in and of itself.
Python maybe often installed by default but it's definitely not an essential/required package "out of the box" on every install.
Also, in a thread where one topic is how POSIX shell handles whitespace in filenames, it's hilarious (not in a good way) that someone suggests a language that handles whitespace the wrong way in its own code. Yes, significant whitespace is objectively wrong.
What OS/distro is Lua included on out of the box? That doesn't mean "available in a package". I mean literally included in every single install and cannot reasonably be omitted?
Regardless of the availability, the parent comment says
> better by every objectively measurable metric
Neither Python nor Lua are "better" than shell, at the types of things shell is commonly used for - they're objectively worse.
Lua gets onto every other Linux distro as a dependency of some base system component. For example, rpm or pipewire depend on lua. Ubuntu and Debian ship with pipewire per default.
That isn't even close to "installed on every system". Best I can tell from the reverse dependencies, it's required for some Gnome Remote Desktop tool, and best I can tell, it doesn't rely on Lua anyway (at least on Debian).
> You should use the word "objectively" less.
I specifically used the word objectively, because the original comment that I replied to, said this:
> Pipewire being the Pulseaudio replacement from Redhat.
Right, so it's a desktop package that ultimately will be installed on about 1% of all Linux machines because the vast majority are servers without a desktop environment.
Also worth pointing out: liblua on Debian at least, is the shared library. It's not the binary to execute standalone Lua scripts.
Is this like a game where you come up with bullshit and I have to come up with the facts to rectify it? RHEL/CentOS have more than 1% market share alone.
Check your own installs and tell me if you find some that don't have liblua or libluajit.
For the library thing: I said "Python and lua are pretty close to that." earlier. I did not say that they have interpreters ready everywhere. But if the language core is already installed on a large fraction of machines, then adding the interpreter is not a big cost.
> already installed on a large fraction of machines
So far you've presented no evidence of this though, just that it's used by a new desktop-focused package.
All linux desktops over the last 30 years is not even a "large fraction" of total Linux installs, much less the ones that have already migrated to this new audio system.
> adding the interpreter is not a big cost
It's nothing to do with cost. It's about "how do I know this will absolutely 100% run on any POSIX machine I throw it on without any extra steps".
Remember the argument here is about something that is claimed to be "objectively better" than Shell. The ubiquitous nature of POSIX shell is a huge barrier for any possible competitor, and saying "well you just need to install it" just defeats the purpose. You might as well write it in fucking java and say "well you just need to install a JVM".
Edit to Add:
a good number of systems I manage do have liblua installed... because HAProxy requires it, and those systems have HAProxy installed. Not because it was installed as part of the base OS or even a default group of packages.
Incidentally, HAProxy and thus liblua were installed on those systems by infrastructure management that's implemented as shell script. So what kind of chicken and egg argument do we need to have here about how exactly I can run a Lua script to install Lua?
PowerShell's designers could learn from decades of programming language progress and especially shell usage. They could improve many aspects indeed. This doesn't mean that the original design is "bonkers", only that it's not perfect.
The way Powershell works is largely based on what the computing world was doing with shells outside Bell Labs, at IBM, Xerox, and other places, in much the same timeframe as UNIX was happening.
Modern programming language designers have a bad relationship with verbosity. I don't know why they do this.
It's a lang for an interactive shell, typing literally translates to developer speed. I understand the want for clarity and maybe that's nice in large scripts, but the main goal is to be a shell. So, optimize for that. Also, you probably shouldn't be using powershell for large scripts anyway.
The only recent lang I've seen that has a handle on this is Rust. You can tell they put a lot of thought into having keywords be as short as possible while still being descriptive.
Those aliases are, I believe, only defined on Windows PowerShell (the closed-source version 5; not PowerShell 7). I wish those default aliases you mentioned weren’t a thing. Especially `curl` (people should use `iwr` instead), which is an alias of `Invoke-WebRequest`, because it makes the `curl.exe` shipped with Windows nearly undiscoverable.
This should not be as downvoted as it is. In a way shell is broken. The brokenness is in that it requires each command to serialize and deserialize again, considering all the weird things that can happen with the "all is a string" kind of approach, instead of having a proper data interchange format or even sending objects to next steps in the pipeline. This behavior is what necessitates even thinking about the changes listed in the post. We wouldn't even have that problem, if the design of shell was better thought out. Now we are dealing with decades of legacy built on these shaky foundations. I hate to admit it, but seems at least this aspect Powershell got right, whatever one may think about the rest of it.
Dear anal_reactor, what is a "string array"? I have used unix shells for nearly 30 years and never heard of one. And I consider myself a script-fu master!
There are two array-like constructions in the shell: list of words (separated by spaces) and list of lines (separated by newlines). Both cases are implemented as a single string, and the shell makes it trivial to iterate through its components.
That is exactly the problem many people have with it. Encoding „arrays“ this way is foreign to everyone who comes from „normal“ programming languages. Both variants lead to problems because either character can occur in elements, worst case scenario they contain both at the same time. I can see why this leads to confusion and bugs.
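A minimal illustration of the two encodings described above, and of why they are fragile (any element that contains the separator character silently splits):

words='alpha beta gamma'
for w in $words; do printf 'word: %s\n' "$w"; done

lines='first entry
second entry'
old_ifs=$IFS
IFS='
'    # split on newlines only, so spaces inside elements survive
for l in $lines; do printf 'line: %s\n' "$l"; done
IFS=$old_ifs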
It’s like people saying they won’t learn French because it has a different grammatical structure. There’s no “normal” natural language. If you’re used to the C-like syntax, learning C-like language will be easy. But that’s not an argument to say Lisp is confusing.
That's why I put normal in quotes. There is however more to it than having a different grammatical structure: it works differently from many commonly used languages that have actual arrays/lists where elements can contain anything the type allows. If you come from any of the common modern programming languages (let's say Java, Kotlin, C#, JS/TS, Python, Swift, Go, Rust, etc.) and expect something similar (because many of them are very similar) you will be confused. Using spaces or newlines to encode elements in a single string is just not robust and leads to easy-to-make mistakes.
Most of these languages were created long after bash and the other shells. The fact is that shell scripts allow for unquoted strings, and quoting is a specific operation, not syntax. Also, shell scripts were meant for automation, not for writing general programs. The basic units are commands, arguments, input, output, files,… so the design makes these easy to manipulate.
I’m not saying that we can’t improve, but I’m more in favor of making the tool more apt to solve a problem than making it easier to learn. Because the latter often wants to forego the requirement of understanding the problem space.
Yes, these are newer. I mainly wanted to make the point that it is confusing if you are new to bash and come from these newer languages with the wrong expectations. The concise nature and many subtle details make it very difficult for beginners and infrequent users.
Compare this to the newer programming languages where you explicitly call something with descriptive names like .Trim() or .EndsWith(), with support from the compiler and IDE.
In my experience automation and general programs often are the same thing once things get more complicated. Bash scripts usually grow rapidly and are a giant PITA to maintain or refactor. Throw in build systems and helper scripts and you quickly receive a giant pile of spaghetti. Personally I just switch to one the mentioned programming languages once it goes above a simple sequence of operations.
Personally I don't see how to improve it much without becoming a full blown programming language, at which point it would probably make more sense to just release a library for common automation tasks that is also composable. Maybe I'm just not the right target audience.
The issue with your otherwise good reply is that someone is bringing expectations to an expert tool (programming languages, software, OS) and blindly assuming that everything will work the way he thinks it should. Familiarity helps with learning, but shouldn’t replace it. Someone new to bash should probably start with a book.
And for bigger automation projects, there are lots of projects and programming languages that can help.
I agree it is an issue but it is how many people work and think. Most of the time they are not even wrong. "Hey, I have variables and loops, I know that!".
I would even make the case for expert tools being as unsurprising and familiar as possible unless there is a very good reason for them not to. Also they should be robust against misuse and guide the user towards good practices. There are always beginners, people that rarely need to use it, people that do programming as "just a job" and people that make mistakes because they are distracted, tired or just human. Something like "rm -r /" is a good reminder of that for many people.
Plus there are already a lot of tools required. Reading a book about every tool I have to use would be impractical for most projects. Maybe more expert tools should just be tools. The same way I can now just use Ubuntu and get a working desktop system including drivers for most common hardware. If I compare that to the past, where I installed a Linux distribution and then found out I lacked a driver for my network card and had to download it from the internet... I still can modify my system if I need to, but it's nice that I don't have to. I think we can do similar things with many parts of development and free some capacity for other tasks.
Their proposed solution is not compatible with reality though where POSIX does not get to define what kind of files exist on filesystems you need to work with.
All they did is introduce new error cases in C programs while not actually fixing anything for shell scripts.
If anything, it's going to result in more exploits as people write shell scripts with the assumption that newlines cannot appear in filenames.
I do. Single files are handled with quotes around arguments just fine. For lists of files you need to use NUL as a separator. That's not really hard to do once you are aware of the problem but ergonomics could be better - which is something useful that POSIX could change.
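For example, a minimal sketch of the NUL-separated version (safe even for names containing newlines, since NUL is the one byte a pathname cannot contain):

find . -type f -name '*.log' -print0 | xargs -0 rm --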
But they did not make old code correct. Filenames are still allowed to contain newlines. Shell scripts still need to be prepared to deal with that. Nothing really changed, they just added a feel-good half-measure.
It's a step in the right direction. You have to understand that for decades a vocal group of Unix die-hards has opposed any limitations whatsoever on the bytewise content of file names. The newline restriction in this latest version of POSIX may be modest, but it represents a dam breaking. When (obviously) the sky doesn't fall, the next version of POSIX will have a lot more filename cleanup.
In 2024, if you don't get the correct result decoding a text as UTF-8, the bug is the text, not the decoding. And luckily, adoption of UTF-8 in the past 30+ years has gone well enough that you don't need to worry.
Caveats for cursed hardware standards demanding two-byte encodings like USB.
I hope you're happy in your ivory tower, but I personally work with a lot of files in other encodings, most often that weird UTF-16 (Windows), sometimes also legacy files with different ANSI encodings. Declaring "my decoder is fine, it's the text that is buggy" is not going to score a lot of points with my boss and clients.
The only valid reason for still having files stored in legacy ANSI encodings is that their only use is input to software that has not been maintained for ~30 years and cannot be updated. That's fine because they're just binary inputs in a closed ecosystem that no one touches.
But if they are supposed to be treated as text, then yes it's the text that's buggy - they should just be converted to UTF-8 once and have the originals thrown away.
UTF-16 is something that Microsoft has cursed us with by inserting it into specifications (like USB) so that we cannot get rid of it, even if it never made any sense what so ever. But those are in effect explicit protocols with a hard contract, very different from something where you would "assume an encoding".
Filename character set and its interpretation shall be controlled per directory or, at least, per FS. This pertains not only to the permitted set (such as with or without LF), but to collation rules as well (including case insensitivity with cases like the Turkish/Crimean/etc. I/ı and İ/i). This shall also include workarounds for already existing problems: if a directory already contains files I1 and ı1, there shall be a technique to deal with them separately even with a Turkish locale.
But restricting this at the syscall level is definite insanity, along with the excuses for it.
- find(1p) now supports -print0
- xargs(1p) now supports the -0 argument
- newlines in filenames now should throw errors in many utilities
- a compiler implementing the c17 standard is now required
- ulimit is expanded
- renice can use relative values
- a timeout utility has been added
- make adds support for $^ $+ ::= :::= != ?= +=
- logger is improved
- gettext is adopted
- readlink and realpath are adopted
- rm now supports -d to remove empty directories and -v for verbose
- various improvements to printf, sed, test
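A few of those additions in action, assuming a POSIX 2024 environment (`long_running_command` and `emptydir` are placeholders):

timeout 5 long_running_command      # give up after five seconds
rm -d emptydir                      # remove an (empty) directory
realpath ../some/relative/path      # print the resolved absolute pathname
renice -n 5 -p "$$"                 # adjust niceness by a relative amount
find . -name '*.tmp' -print0 | xargs -0 rm --   # NUL-separated, newline-safe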
This adds `set -o pipefail` to POSIX sh, which causes a whole pipeline to fail (non-zero exit code) if one or more of the commands in the pipeline fail.
If you're writing scripts, use that and don't forget -e and -u
-e Exit immediately if a pipeline (which may consist of a single simple command), a list, or a compound command (see SHELL GRAMMAR above), exits with a non-zero status
-u Treat unset variables and parameters other than the special parameters "@" and "*" as an error when performing parameter expansion
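Putting the three together (a minimal sketch; `set -o pipefail` is only now standardized, so an older /bin/sh may reject the option):

#!/bin/sh
set -eu
set -o pipefail
# with pipefail, a failure in any stage (say, an unreadable input file)
# fails the whole pipeline, and -e then aborts the script right here
cut -d: -f1 /etc/passwd | sort | uniq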
> and they still fail to catch even some remarkably simple cases
I totally agree. Although I'd say that there isn't anything "remarkably simple" about writing a bash script. Anything in the shell scripting world that seems remarkably simple is just because one hasn't realised the ghosts and horrors that lurk in the shadows.
But I'll use -e anytime. It feels like having a protective proton pack at least.
Pipefail is useful and very hard to emulate on pure POSIX; you need to create named fifos, break the pipeline into individual redirections and check for error on each line.
And that is fine; but sometimes you want to treat a pipeline as a "single command" and then you can use pipefail to abort the pipeline on error. Then you can handle the error at the granularity of the entire pipeline without caring which part failed.
Lastly, I am confused as to the "silent" failures; maybe you are thinking of combining this with `set -e`? Then yes, that is bad and I recommend against the combination; but then again, I and most advanced scripters recommend against shotgunning `set -e` in the first place. Use it in specific portions of the script when appropriate, and use proper error handling otherwise.
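A hedged sketch of that fifo-based emulation, where `producer` and `consumer` are hypothetical stand-ins for the real pipeline stages:

fifo=${TMPDIR:-/tmp}/pipefail_demo.$$
mkfifo "$fifo" || exit 1
consumer < "$fifo" &            # start the reading side in the background
consumer_pid=$!
producer > "$fifo"              # run the writing side in the foreground
producer_status=$?
wait "$consumer_pid"
consumer_status=$?
rm -f "$fifo"
if [ "$producer_status" -ne 0 ] || [ "$consumer_status" -ne 0 ]; then
    echo "pipeline failed: producer=$producer_status consumer=$consumer_status" >&2
    exit 1
fi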
Gee, imagine if shells with errexit option enabled wrote some diagnostic output to stderr before exiting. "Add your own error checking instead", how do I check which piece of pipeline has failed, exactly? The PIPESTATUS variable is bash-specific and was not standardized.
? Why are you replying to me? My position was pretty clear:
"Pipefail is useful and very hard to emulate on pure POSIX; you need to create named fifos, break the pipeline into individual redirections and check for error on each line.
And that is fine; but sometimes you want to treat a pipeline as a "single command" and then you can use pipefail to abort the pipeline on error. Then you can handle the error at the granularity of the entire pipeline without caring which part failed."
By the way, I never script in Bash; I only script in POSIX primitives using dash as my executable.
The history at the beginning of this is not correct. Two examples: the assertion that there was one compatible UNIX prior to United States v. AT&T, and the statement that GNU and BSD started that same year. Very, very off.
https://en.m.wikipedia.org/wiki/History_of_Unix#/media/File%... is a good visual of (many of, not all) the various versions of UNIX and when they were released. BSD was first released in 1978. United States v. AT&T was implemented in 1984 (judgment 1982). GNU was first created in 1983.
TIL the POSIX standard is still updated. Does it still suffer from the issues that make Linux break POSIX compatibility in some areas because they consider it a flawed standard?
At some point they decided to narrow the change to just ban the newline character.
Which I personally think is a pity. Allowing the escape character in file names is a security risk because it enables you to embed ECMA-48 escape sequences in file names. Secure terminal emulators shouldn’t be made vulnerable by arbitrary escape sequences, but there are “too smart for their own good” terminal emulators out there that have escape sequences that let you do crazy things like run arbitrary executables.
There are many non-UTF-8/16/32 character encodings used in the wild which use these values in multi-byte character sequences.
I think the decision forbidding newlines in pathnames is also wrong. It may break tons of existing code.
I wish Linux/etc had a mount option and/or superblock flag called “allow only sane file names”. And if you had that set, then attempting to create a file whose name wasn’t valid UTF-8, or which contained C0 or C1 controls, would fail. The small minority of people who really need pre-Unicode encodings such as ISO 2022 could just not turn that option on. And the majority who don’t need anything like that could reap the benefits of eliminating a whole category of potential bugs and vulnerabilities.
That's obviously impossible since it would break backward compatibility and the users' existing filesystems (and the Linux kernel will rightly never accept anything like that).
The only reasonable fix is to enhance bash and shell IDEs to track for each variable whether it could possibly include all filename-valid characters (e.g. if it comes from read with no options then it can't contain \n) and warn (off by default unless stderr is a terminal) if they can't and it's used as a filename (conservatively determined when used as arguments to processes), and also warn when using find without -print0, etc. noninteractively and perhaps interactively as well.
Run a program to list a directory. Everything that interfaces with that, will assume newline delimiters. Similar assumptions are baked into a lot of software.
Enforcing that a newline isn't part of a path, ensures the security of those systems that are commonly relied on.
Except no one's enforcing anything yet. Earlier versions of POSIX allowed rejecting filenames containing newlines, the newest version encourages it while mandating features required to handle such filenames safely (find -print0, xargs -0, read -d ''). So nothing's set in stone yet.
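For the curious, the newline-proof read loop looks roughly like this (assuming a shell where `read -d ''` is available, i.e. bash today or a POSIX 2024 sh):

find . -type f -print0 |
while IFS= read -r -d '' f; do
    printf 'found: %s\n' "$f"
done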
> Everything that interfaces with that, will assume newline delimiters.
Well, only badly written programs. nushell handles this fine, as will any program that doesn't try to do everything as plain strings:
~> touch "foo\nbar"
~> ls foo* | print
╭───┬──────┬──────┬──────┬──────────╮
│ # │ name │ type │ size │ modified │
├───┼──────┼──────┼──────┼──────────┤
│ 0 │ foo │ file │ 0 B │ now │
│ │ bar │ │ │ │
╰───┴──────┴──────┴──────┴──────────╯
However after reading it they're only making them illegal for the posix utilities from the 70s that aren't written properly, so I think that makes sense.
> We’ve established that, yes, pathnames can include newlines. We have not established why they can do that. After some deliberation, the Austin Group could not find a single use-case for newlines in pathnames besides breaking naive scripts. Wouldn’t it be nice if the naive scripts were just correct now?
Filenames should be boring printable normalized UTF-8. I have never, not once, seen a good reason that a filename should be able to contain random binary gobbledygook
> Filenames should be boring printable normalized UTF-8. I have never, not once, seen a good reason that a filename should be able to contain random binary gobbledygook
Ensuring normalization is hard. Where should you do it? There's only one good place: in the filesystem. But if you normalize on create then you'd better use the same form that everyone else uses, but, what's that? Input methods generally produce NFC, but there's no guarantee that they will not produce something else. HFS+ normalizes to NFD on create.
ZFS uses form-insensitivity -- much like case-insensitivity, but for form. The reason ZFS went this way was exactly that HFS+ and input methods differ as to forms. I pushed hard for this way back when. IMO form-insensitivity is the best way forward.
But as for guaranteeing that filenames are UTF-8... that's much harder. The best thing to do is to not allow the use of non-UTF-8, non-ASCII, non-C locales -- not a guarantee, but pretty good.
Sure. Form-insensitivity is another good option. I'd actually argue for full case insensitivity too (like macOS), although I realize that it's probably a stretch.
Case-insensitivity is also an option in ZFS, but honestly case-insensitivity drives me nuts, especially if it's not case-preserving. Oh, that reminds me, ZFS is form-insensitive, and form-preserving.
To build an internationalized shell script I'll need to compile multiple .mo language files and distribute them alongside the script itself.
For shell scripts part of a large system, that's probably fine. For small scripts, that's not very practical. You are not only adding a compilation step, you're also requiring distribution of multiple files. That's a pain.
It just kind of kills the convenience of a simple shell script. I would probably end up writing a makefile to manage all of this and at that point I am only a hop skip and jump away from using a compiled language instead of shell.
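For reference, the script side itself stays small (a sketch; "myscript" and the ./locale directory are illustrative names, and the compiled .mo files under locale/<lang>/LC_MESSAGES/ still have to be built and shipped separately):

#!/bin/sh
TEXTDOMAIN=myscript
TEXTDOMAINDIR=$(dirname -- "$0")/locale
export TEXTDOMAIN TEXTDOMAINDIR
printf '%s\n' "$(gettext 'Processing complete')"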
Hopefully nothing, posix is, or at least it should be, a descriptive standard. This is why posix is so terrible, and why posix is so great.
The way I feel posix, and other descriptive standards, work best is when they describe what everyone is already doing. This is opposed to prescriptive standards, which focus on the "correct" way to do something; prescriptive standards tend to be over-engineered and may or may not actually work.
Both prescriptive standards and descriptive standards have their uses. If POSIX is a prescriptive standard, then maybe another standard should exist that is descriptive.
Keep in mind that the Web standard eventually became prescriptive because descriptive standards failed to catch up. Likewise it can be argued that descriptive standards for the common OS interface are no longer usable.
To be crass, description is only useful for existing things and prescription hinders making innovative things. I think social forces make it natural that standards are treated both descriptively and prescriptively, and that too leads to angst. Case in point, POSIX was once more descriptive, but then people wanted backwards compatibility for existing and new OSes, which made it more prescriptive. The takeaway is that ad-hoc things become permanent once they are too difficult to remove, and then people are sad. Nothing is immune, so just make reasonable attempts for the standard and the culture to harmonize for a specific purpose.
is fine in a makefile as far as POSIX is concerned, because:
> Applications shall select target names from the set of characters consisting solely of slashes, hyphens, periods, underscores, digits, and alphabetics from the portable character set
I kind-of would like to see a POSIX-strict profile which incorporates commonsense things (by commonsense I mean avoiding things that repeatedly over many years have tripped up programmers in frustrating ways) like no newlines in file names. Operating systems (or distributions) could opt into this profile, and then someone programming on such an operating system could rely on the constraints of the profile, and additional facilities could be added on that might need to rely on those constraints. Hopefully, gradually the use of the profile would spread.
> future editions will not require c17, but will simply require whatever C specification version is the most modern and already implemented by major toolchains
Is this really good?
If you can't rely on anything concrete being guaranteed, and it is open to interpretation what "modern" or "major toolchains" are, why have a standard?
EILSEQ for \n finally, but why not for unicode confusables? Path names are identifiers, and as such need to be identifiable. Meaning stricter rules than just buffers (not talking about strings).
Since old-POSIX systems will be in use for some time, I wonder how many things will be able to switch to using the new capabilities. And how many OSes already support all of the new changes.
It would yield false-positives with non-UTF-8 encoded text. Big5 <https://en.wikipedia.org/wiki/Big5#Encoding> in particular was notorious for using ASCII values for trailing bytes. I don't know if it's still in use or if there are others.
As someone who truly limits himself to POSIX when he can, I think they needed to push it forward to not become completely obsolete. I'm really sad `mktemp -d` and `set -o nullglob` didn't make the cut, but that's how it is, I guess.
A bespoke `mktempd` script is one of the first things I install in a new system. Fortunately, it is not too hard to make a `mktemp -d` compatible script with POSIX tools. `set -o nullglob` is another story :D
This is correct (though of course a decent `mktempd` script will deal with the listed problems or crash loudly on failure), and there are even more reasons to avoid /tmp.
Unfortunately, it is one of the very few directories that are somewhat POSIX-"guaranteed" writable by a non-root user and the fact that on modern systems it is usually mounted on a tmpfs makes it very attractive for pure POSIX usage without rich array support.
If you have mount permissions, of course, you should tell your `mktempd` to base its directory on a private tmpfs.
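A minimal sketch of such a `mktempd` helper using only POSIX tools (the retry loop and the awk-generated suffix are just illustrative choices, not the only way to do it):

#!/bin/sh
base=${TMPDIR:-/tmp}
tries=0
while [ "$tries" -lt 100 ]; do
    suffix=$(awk 'BEGIN { srand(); printf "%06d", int(rand() * 1000000) }')
    dir=$base/mktempd.$$.$suffix
    if mkdir -m 700 "$dir" 2>/dev/null; then
        printf '%s\n' "$dir"    # success: print the directory and stop
        exit 0
    fi
    tries=$((tries + 1))
done
echo 'mktempd: could not create a temporary directory' >&2
exit 1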
> The problem is that pathnames2 (as per section 3.254 of POSIX 2024) are just strings (meaning they can contain any bytes except the NUL character), [...]
Pathnames can neither contain NUL nor '/'.
Re: `find -print0` / `xargs -0`:
> Previous POSIX releases have considered -print0 before, but never ended up adopting it because using a null terminator meant that any utility that would need to process that output would need to have a new option to parse that type of output.
What nonsense. Just add the `-0` or similar options as needed.
> More precisely, this approach does not resolve our original problem. xargs(1p) can’t sort, and therefore we still have to handle that logic separately, unless sort(1p) also grows this support, even after read(1p). This problem continues with every other type of use-case. Importantly, it breaks the interoperability that POSIX was made to uphold.
More nonsense.
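What that support looks like in practice today, using GNU extensions that are not (yet) POSIX (`sort -z` keeps the list NUL-separated between the stages):

find . -type f -print0 | sort -z | xargs -0 ls -ld --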
> A bunch of C functions3 are now encouraged to report EILSEQ if the last component of a pathname to a file they are to create contains a newline (put differently, they’re to error out instead of creating a filename that contains a newline).
Ok, that's tolerable. Ditto utilities (notice here they were able to make a list of utilities).
> Anyway, POSIX 2024 now requires c17, and does not require c89
I wish it would have been c99. What does c17 add exactly, more C++-esque complexity or not? Why was it not c99 (or perhaps even c11) over c17? Genuine questions.
> What does c17 add exactly, more C++-ish bullshit or not?
Multithreading support and such (atomics, thread-local storage and a guarantee that `errno` is in TLS), explicitly aligned types and allocations, dedicated types for strings known to be Unicode, _Noreturn, _Generic, _Static_assert, anonymous structs and unions in the nested position, quick_exit, timespec, exclusive mode ("x") in f[re]open, CMPLX macros.
I'm not even sure which of these could possibly be C++-ish bullshit, except for about two points:
- Multithreading does seem farfetched for casual users. In fact, I do think it could have been minimized without any actual harm, but multithreading itself needed to be specified because it greatly affects the memory model. (Before C11, C had no thread-aware memory model and different threading implementations were subtly different beyond what the standard stated.) Even JavaScript, originally with no notion of threads, eventually got a thread-aware memory model due to shared web workers. But that never meant JS itself needed multithreading support in its standard library, and C could have done the same.
- `_Generic` is even more debatable, though I believe it was the only way forward when we accept <tgmath.h>, which is known to be a response to Fortran (other responses include `restrict`) and was impossible to implement in the portable manner before C11. As long as it retains its scary underline and title case, I guess it's fine.
Most importantly, posix already has existing multithreading facilities in posix threads, so it is imperative that they are reformulated in terms of the C++11/C11 memory model.
C17 is a bugfix version of C11 (the next major revision would be C23). The exact list of fixes is available in [1]. Mandating C11 instead of C17 when both are available seems not really useful now.
You have the correct insight about errno. The new guarantee only means that other threads cannot mess with your errno, but clearing errno will still be useful within an individual thread.
exit is not guaranteed to work correctly when called simultaneously from multiple threads, while quick_exit will be okay even in that situation. I think this behavior was not even specified before C11, and only specified after observing existing implementations.
It is expected that libc threading routines are thin wrappers around pthread in Linux. That's why I do think it can be minimized; the only actual problem before C11 was the lack of thread-aware memory model. No need to actually be able to create threads from libc to be honest, especially given that each platform now almost always has a single dominant threading implementation like pthread.
Thank you for your answers, it is much appreciated.
I suppose I will not use "quick_exit" either in that case. I have many workers; there is a job queue mutex, along with pthread_cond_wait and pthread_mutex_{lock,unlock}, and when the "job_quit_flag" is set to true, that means all jobs are done and I am ready to return NULL.
7.5 ¶2: [...] and `errno` which expands to a modifiable lvalue that has type `int`, the value of which is set to a positive error number by several library functions. It is unspecified whether `errno` is a macro or an identifier declared with external linkage. If a macro definition is suppressed in order to access an actual object, or a program defines an identifier with the name `errno`, the behavior is undefined.
7.5 ¶3: The value of `errno` is zero at program startup, but is never set to zero by any library function. The value of `errno` may be set to nonzero by a library function call whether or not there is an error, provided the use of `errno` is not documented in the description of the function in this International Standard.
The fact that `errno` can expand to an lvalue does reflect what is required for multithreading implementations among others, but that's about all.
> We’ve established that, yes, pathnames can include newlines. We have not established why they can do that. After some deliberation, the Austin Group could not find a single use-case for newlines in pathnames besides breaking naive scripts. Wouldn’t it be nice if the naive scripts were just correct now? Ok, that might be a bit much all at once. We’re heading there though!
Oh my god. This makes me so happy. This is the most lovely thing I've read in the world of computing since the unix gods decided that newlines were to be a single character.
The philosophy underlying the sentence "Wouldn’t it be nice if the naive scripts were just correct now?" is incredibly positive. We are surrounded by arrogant jerks who break old code by aggressively enforcing stricter compliance with some stupid rules. But here come these posix heroes who do the exact opposite: make old code correct! There is hope in mankind after all.
Rather unfortunately, I happen to have a handful of files on my machine with newlines in them (the filenames were programmatically generated from a summary of their contents). I loathe the possibility that my shell tools are going to suddenly crash when confronted with these weird files, rather than just producing some slightly silly output. I wish we'd standardized the behaviour of just escaping such characters as `\n/\r` or `^J/^M`...
They did the right thing for this: make the tools fail on file creation, but not on existing files.
I guess it's still advisable to rename those files; I don't know how things like cp, mv or rsync will behave when copying such files in the future.
No, they did not do the right thing:
> the following utilities are now either encouraged to error out if they are to create a filename that contains a newline, and/or encouraged to error out if they are *about to print a pathname that contains a newline* in a context where newlines may be used as a separator
It then proceeds to list a bunch of utilities including diff, file, find, grep, head, du, etc., none of which create files directly.
These utilities could be updated to reject newlines in file paths if they're going to print in a "newline delimited" form - but for some of these utilities, that's the only available form.
> error out if they are about to print a pathname that contains a newline in a context where newlines may be used as a separator
But that's already broken. This is a situation where filenames with newlines in them are indistinguishable from two filenames in outputs. So instead of producing subtly broken output, tools are encouraged (not forced) to explicitly fail with a lot of noise.
The "in a context where newlines may be used as a separator" part of this sentence is very important.
IIUC the tools are still allowed to succeed in non-broken situations, for instance when a null separator is used rather than a newline character. And I can't imagine the tools you listed will start breaking in situations that worked (apart from file creation - indeed this will likely start breaking, and newline characters in filenames need to be considered deprecated, and things using them fixed).
This is strictly better IMHO (if one thinks that newlines in filenames are not worth the trouble given how things work in POSIX, especially the part where things are line-based and newline characters have quite some significance).
If your file system allows them, be careful with symlinks though!
Why, specifically?
I'm convinced we will need to be careful with symbolic links related to new line characters in filenames, but I'm curious of which specific aspect you had in mind.
Oh, nothing specific to newlines. Just, when you rename files to fix newlines, you need to check if they break symlinks pointing to them.
For instance, I had project folders for my individual research projects. In order to have a central repository of resources and not have copies of multi-megabyte pdfs in each folder, I put all referenced papers in a single directory and symlinked them for each project that needed them. Later, I wanted to rename the papers to remove newlines. The symlinks complicated this process quite a bit!
Ah, right, indeed :-)
In academia, I get (and used to create) pdfs with names like:
"On the number of
associative foobars
of degree blah -
Johnson and Anderson.pdf"
all the time. It is very convenient for non-technical academics to have a descriptive file name, and in order to see it in its entirety in the navigator they use, they add newlines.
Oh god. I already get upset enough by spaces in a file name, although I realise that fight is basically lost now!
Didn't Windows name "Program Files" with a space to force application developers to handle spaces in paths properly?
For the longest time you could get away with this in cmd:
> dir c:\progra~1
So if forcing people to handle spaces was the goal, it took a long time to force it.
I'm pretty sure that still works. I forgot the exact scenario, but my Windows CI on GitHub Actions output shorte~1 pathna~1 like that in a script just a few months ago. On one hand, the backwa~1 compati~1 is nice. On the other hand, there's just so much depreca~1 cruft that keeps popping up even on contemp~1 systems.
In theory yes; in practice, to this day many people don't bother to learn how to deal with pathnames in a proper way.
Top difficulties in computer science:
1. naming things
2. cache coherency
3. off-by-one errors
???
4. quoting pathnames
I would replace 4 with parameter expansion rules.
Eh, maybe. In practice I usually do all my moderately-heavy filesystem scripting in Python these days, for which pathname quoting is just a complete non-issue. Of course, I still use a shell for quick-and-dirty stuff, but usually only for pretty simple tasks where the simplest quoting setup ("$i") suffices.
Not to mention C:\Program Files (x86)
And C:\Programme and other localized variants to force people to go through the proper APIs instead of hardcoding paths.
I just got used to installing things I need to interact with in a program into a folder named C:\workspace
As a fellow spaces-in-filenames-hater, the fight is not lost. We are on the brink of winning it; it's just a mount option away!
While we cannot avoid that people hit the spacebar when writing a filename on a gui, this does not mean at all that the resulting filename itself need contain a plain space character. Those spaces can and should be transparently translated to non-breaking space characters at some point. Maybe by the gui itself, or more robustly by the filesystem. This would make everybody happy: gui users and naive shell script writers.
>Those spaces can and should be transparently translated to non-breaking space characters at some point
Why? This just introduces more complexity and interoperability headaches for seemingly no reason.
> Why?
In order to preserve the sacrosanct simplicity of naive shell scripts. Seems like a very noble goal to me.
The only unexpected complexity arises when you want to deal with filenames having mixed spaces and nbsps. But I'd say that people who do that had it coming.
If you want simple shell scripts to work, make an actually good shell language without all the footguns.
The filesystem is way more important than /bin/sh, and any complexity added there will trickle down to all programs, not just shell scripts.
It's not worth adding hacks on the FS to patch defects in poorly written shell scripts (which are being replaced en masse with python/nodejs/even weirder yaml files/systemd units/etc... anyways)
Whitespace in filenames in general is difficult to deal with. Many, maybe most, programs get it wrong. It's not just about shell scripts, many GUI programs fail to handle those files properly too.
When GUI programs mishandle filenames with spaces, IME it's usually because they spawn a subshell in a naive way (system("rm " + filename)).
To mishandle spaces you have to split an input w/ filenames by whitespace, which is not that common of an operation outside of a shell.
My favorite file+space issue is spaces at the end of file names, especially when you copy and paste text, or text gets trimmed from an input box, or the person forgets to trim space from an input box...
The vast majority of Windows and MacOS programs get it right.
No, no they don't - you just don't notice when they get it wrong, and you also don't name your files stupid things (I imagine).
If you actually test this, you'll realize a ton of Windows programs get it wrong.
Also, in general this is a poor argument. The goal of Linux isn't to be as much like Windows as possible, because Windows sucks ass. Nobody in their right mind would use Linux if it was just Windows but, presumably, shittier. The entire appeal of Linux is that it isn't Windows, and it isn't MacOS.
Eh? It's really not a bother in pretty much any programming language, and you don't really need to do anything special for it. I don't know any program that has any problems with it.
Even zsh has fixed this. It's just /bin/sh and bash that are annoying.
nushell uses real lists for things which means you don't need to care about seperators except when dealing with external system things
Simplicity doesn't always mean stupidity. A simple but functional shell that correctly handles whitespace without much hassle has been available since the 90s, namely rc, which is also found in Plan 9. Adopting rc's string concatenator `^` in POSIXy shells shouldn't be too hard.
It would be really nice if there was a mount option that would quietly remove spaces in filenames, or convert them to an underscore.
If I had it, I would use it today.
Yep, works today:
Just always quote variable interpolation and you will never have problems.
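For instance (the filename here is made up):

    f='My Summer Photos.jpg'
    ls -l "$f"    # one argument: the whole name
    ls -l $f      # word-split into three arguments: My, Summer, Photos.jpg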
Convince (force?) your team to use make and soon everyone will forget spaces in file names are even a thing!
My team already uses `make` but there's no reason for me to run it in my Downloads folders. File names in there are sometimes wild. Yet I expect command line tools to work with them. If they will cease to do so, I will have to start using non-POSIX variants of those tools, I guess.
I don't know who "the Austin Group" mentioned in the article are, but how come they "could not find a single use-case for newlines in pathnames besides breaking naive scripts" when legitimate use-cases are so easy to find?
(And if they're that incompetent, why does the article imply they are worth quoting and listening to?)
It is [1] the joint working group that for the last 25+ years has been responsible for both the POSIX standard and the Single Unix Specification. It emerged after the UNIX wars as a consolidation of the various splintered UNIX standardization efforts (POSIX itself, X/OPEN, OSF, etc).
[1] https://en.wikipedia.org/wiki/Austin_Group
Is that legitimate? A path name is just a unique identifier for a file, IMO it doesn't make sense to put a whole novel in there. If anything, a giant summary like that should be in the meta tags?
In what way is it not legitimate? It's not an accident, bug or data corruption. Someone put it there for a reason, and it benefits their use case. That's as legitimate as it gets.
That's a core part of the problem: a path name is NOT just a unique identifier for a file. Desktop operating systems and their classical utilities conflate the "unique identifier" and whatever "displayed title" of a file though which the end user interacts with the file.
Users care about "titles" or "summaries" of files, not "filesystem identifiers"; as long as the two are conflated, non-technical users will use the identifier to write titles and thus make the file easy to locate in an interactive GUI. Meta tags are not even in the cognitive horizon of most people.
Name one legit use case.
... the use case in the parent comment I was replying to.
And no I'm not going to copy that here for you to quip "that's not a legitimate use case". Make an effort to make a point and support it with better justification than "because I said so".
How do these non-technical academics even create a PDF file with a name like that?
Right click, rename, enter, enter, enter (until the entire file name is visible on the box)? That's how I did it when I used Windows.
Edit: now I remember the most basic way: open the pdf, select and copy the title, click on rename and paste from clipboard. Works great to get the file name with the newlines exactly as they are on the title!
Doesn't <enter> just confirm the typed input for the filename and finish the renaming? How does that insert newlines?
Yes - I just tested on Win10+11 because I thought "there is no way I didn't accidentally do something like this on accident... and I would have remembered seeing a new line in my file name when I made that mistake."
I just opened a folder in file explorer, clicked 'rename' and then tried the following combinations: Enter, L Ctrl + Enter, L Alt + Enter, Win + Enter, R Ctrl + Enter, R Alt + Enter.
None of them let me put new lines in the filename - it either did nothing, or 'closed' the rename view.
Shrug, I last used windows with Windows 7, so you are probably right. That being said, at least two of the students I am currently tutoring are on XP and one of my colleagues as well :D
No, it was always this way.
Right, I just remembered the main way to create those filenames: open the pdf, select and copy the title, close, rename the file and paste from clipboard.
I don't know if this is a Linux thing, but when renaming a file, when I press enter, I apply the new name, the file manager doesn't add a newline.
I am interested in hearing the rationale for downvotes explicitly. I am describing a reality that exists and must be taken into account. Why are you downvoting?
The thing is, it's hard to predict what would happen to those scripts regardless... E.g. try naming your files "-rf" and see how many things break :)
A correct script will have no problems with "-rf" or any other file name. I have (and recommend script writers make their own) a directory hierarchy of "dangerous" file names to test scripts.
For example, it contains a directory where all file and subdirectory names are in unary, consisting only of repetitions of the newline character. A correct script should be able to enumerate, access and modify files in there without issue.
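A minimal sketch of creating a couple of such names (the `danger` directory is made up for illustration), plus one way to enumerate them that doesn't depend on any output separator:

    mkdir -p danger
    nl=$(printf '\nx'); nl=${nl%x}          # a variable holding one newline
    touch "danger/$nl" "danger/$nl$nl"      # names made of one and two newlines
    # enumerate without relying on newline- or space-separated output:
    find danger -type f -exec ls -ld {} +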
If one really wanted to embrace chaos, introduce this as a new team file naming standard for "risk finding" files ;)
I do enjoy "ls *; touch -- -lisah; ls *" as a fun little brainteaser for those uninitiated to this behavior.
export TMPDIR=" / "
to surprise the next person or script to do "rm -rf $TMPDIR/foo"...
Of course, there is an xkcd for that: https://xkcd.com/1172/
Dude, just fix the filenames.
It's a bandaid on a wider problem: the design of Unix shell is bonkers and the whole thing should be deleted. Why? Because I haven't seen any other tool ever have so many pitfalls. Take n random languages and m random developers and tell them to loop over a string array and print its contents, and count how many correct programs you get on average per language. There will be easy languages, then difficult languages, then a huge gap, then Unix shell because in your random sample you managed to get one guy who has PhD in bash.
The main problem is using text as a common format between different applications.
First: text is not well defined. Is it ASCII? Is it UTF-8? Some programs can spew UTF-32 with proper locale configured, it's a mess.
Second: encoding and decoding of objects to text is not defined at all. Those problems with filenames are just one example. Using newline as a separator is a natural thing that is easy to implement, yet it is wrong.
In my opinion two things should be done:
1. Standardise on UTF-8. No other encodings allowed.
2. Standardise on JSON. It is good enough to serve as universal exchange format, tools like `jq` exist for some time now.
So any utility must read and write JSON objects with some standard env set. And shells can be developed with better syntax to deal with JSON. This way you can write something like
`ps aux | while read row; do echo ${row.user} ${row.pid}; done`
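There is no standard `ps --json` today, so treat that flag as purely hypothetical; the point is just that once output is JSON, field access stops depending on columns and whitespace, and jq can already do the consuming end:

    # hypothetical: a ps that emits one JSON object per process
    ps --json | jq -r '"\(.user) \(.pid)"'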
>It is good enough to serve as universal exchange format, tools like `jq` exist for some time now.
Please don't use that underdefined joke of a spec. Define "PosixJson" and use that instead. Right now it's not even clear what the result of parsing {"a": 1234678901234567890} is. Is this a parse error? A bigint? A float/double? Quiet wraparound? Something else? I've seen all these behaviors in real world JSON implementations across different languages.
POSIX does actually define what a "text file" is, but the definition is a bit unusual:
See https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...
> 3.387 Text File
> A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.
So, if you have some non-printable characters like BEL/␇/ASCII 0x07, that's still a text file.
(and I believe what bytes count as a valid character depend on your `LC_CTYPE`).
But the moment you have a line longer than {LINE_MAX} bytes (which can depend on which POSIX environment you have), suddenly your text file is now a binary file.
Kind of a weird definition indeed. One edge case: the definition states the file must contain characters, so presumably zero length files are out. But then how could you have zero lines?
POSIX defines a line as:
> 3.185 Line
> A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
So a file with some characters but no trailing newline is reported by `wc -l` as having zero lines.
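For example, in a scratch directory:

    printf 'no trailing newline' > f
    wc -l f        # reports 0: nothing was terminated by a newline
    printf 'one line\n' > f
    wc -l f        # reports 1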
An empty file is not hard to make. It's just a matter of creating the file and not writing to it.
Yes obviously. But the POSIX specification for a "text file" as above is that it contains characters, which an empty file by definition does not. So an empty file cannot be a text file if you read that specification strictly, and therefore you cannot have zero lines in a text file. As soon as you have a single character there is at least one line, and the amount of lines can only stay the same or grow from there.
The definition should read "one or more lines" instead or (probably better) specify that a text file contains "zero or more characters".
Ahh I see what you're saying. I misunderstood at first.
What cursed madness have you hit that spits out UTF-32 under normal conditions?! That can only be a bug - UTF-32/UCS-4 never saw external use, and has only ever been used for in-memory fixed-width character representation, e.g. runes in Go.
You never have to worry about whether you're dealing with ASCII vs. UTF-8, but rather if you're dealing with UTF-8 vs. ISO-8859-1, or worse, Shift JIS or similar.
> That can only be a bug - UTF-32/UCS-4 never saw external use
I regularly use `iconv -t utf-32be | hd` to look at what bizarre sequence is denoting yet another weird symbol, like an itchy hedgehog.
And what is a real reason to disallow this?
I think that I hit that with Java:
From quick googling it seems that glibc does not support it, so it should not happen.
> it seems that glibc does not support it
`iconv` does, and that is common enough. Along with tons of eerie EBCDIC/whatever...
Don't even assume UTF-something is the only character encoding. There are so many character encodings that existed before Unicode. They're still widely used.
I think a lot of tools should support json as well as plain text. Probably the latter by default, and the former with a "-o json" or similar option. I'm fine with wc giving me `5`, I'd prefer that to `{ "characters": 5 }`.
True, but this would be immensely difficult to pull off, because how do you convince other people to write programs that produce actual working JSON?
The primary purpose of command line program output is to convey information to a human, not to other programs.
Command line scripting is supposed to be ad hoc and hacky.
There are exchange formats that are well-defined enough to be useful to many computers while also being readable enough to be traversed by human eyes. There's no reason to do everything ad hoc; you don't get much from that. You also control the shell itself - there's no reason you can't display object representations in a pretty way.
I disagree that it is supposed to be ad hoc and hacky. Look at PowerShell.
That was under limited OSes such as DOS. Under Unix, piping has been the philosophy.
JSON itself is bad for a streaming interface, as is common with CLI applications. You can't easily consume a JSON array without first reading it in its entirety. JSONL would be a better fit.
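jq already consumes a stream of JSON values, so a JSONL-style protocol composes with ordinary pipes; a small made-up example:

    printf '{"user":"root","pid":1}\n{"user":"alice","pid":4242}\n' |
        jq -r 'select(.user == "alice") | .pid'      # prints 4242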
But then, how well would it work for ad-hoc usage, which is probably one of the biggest uses of shells?
> The main problem is using text as a common format between different applications.
If you can't get the immensity of the cleverness of Unix foundations, you should not talk about them.
That idea is what made it possible for you to type that sentence in the first place.
> I haven't seen any other tool ever have so many pitfalls.
I haven't seen any other tool with so much general utility and availability.
> to loop over a string array and print its contents
Is incredibly easy in bash and bash-like shells. As highlighted, the issue is that tools like 'ls' don't create "a string array." They create one giant string that has to be parsed. The rules in the shell are different from those in other languages, but it /will/ do most of the parsing for you, or all of it, if you do it carefully.
This is a fine tradeoff, as evidenced by its wide usage and lack of convincing replacements.
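For what it's worth, the careful version is short: let the glob build a real argument list instead of parsing ls, and keep every expansion quoted:

    for f in ./*; do
        [ -e "$f" ] || continue     # nothing matched: the pattern is left as-is
        printf '%s\n' "$f"
    done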
> I haven't seen any other tool with so much general utility and availability.
> availability
That's the real reason why we use Unix shell. It's not good, but it's available. Like a cheap hooker.
> but it /will/ do most of the parsing for you, or all of it, if you do it carefully.
"It mostly works if you're careful" doesn't sound very convincing to me.
> "It mostly works if you're careful" doesn't sound very convincing to me.
Would you rather write your own parser?
> but it's available. Like a cheap hooker.
Username checks out.
Someone needs to come up with a interactive shell first, one that is comparable in usability. Then we can think about replacing the unix shell.
I tried both python and lua interactively, but they are a pain when it comes to handling files. You have to type much more to get the same things done.
The bigger issue is the sheer momentum of Unix shell. Even if you come up with an alternative that is better by every objectively measurable metric, it's still going to be a monumental task to have it packages with commonly used distros. Kinda like the "why can't the US switch to the metric system" problem.
OK let them add an explicit check to standard tools, and/or to open(), mkdir(), etc. with O_PORTABLECHARS. And an environment option to disable this check.
Why do they force the restriction at the syscall level?
People already use different shells, mksh, fish, and so on. With fish there is a non-posix shell in wide use.
>wide use
Five people around the globe isn't wide use.
I'm sure you might get more than 5 people on HN replying to you that they are using fish right now. Say something discrediting about fish and they show up.
Heh, reminds me of how to get help with Linux back in the day. If you directly asked for help, you'd be told to RTFM. If you stated confidently that Windows could do something and that Linux sucks because it can't, you'd get users tripping over themselves with details and instructions, just to prove you wrong.
Human psychology is fascinating!
There's a direct cost in money, time and lives that has come from the US's adherence to their US Customary Units (which are often different to the old imperial units). People have literally died because of the confusion caused by having multiple systems of units in common use with ambiguous names (degrees, gallons, etc). Each year industry worldwide spends an enormous amount of money indirectly precisely because of this problem and it's still incredibly unlikely to be fixed within my lifetime.
Bash-alternatives that are not completely compatible frankly just don't have a chance.
If it isn't distributed out of the box with every nix-like OS, it inherently isn't* “better by every objectively measurable metric" - distribution of a common, stable standard is a huge benefit in and of itself.
> distributed out of the box with every nix-like OS,
Python and lua are pretty close to that.
> Python and lua are pretty close to that.
Python may often be installed by default, but it's definitely not an essential/required package "out of the box" on every install. Also, in a thread where one topic is how POSIX shell handles whitespace in filenames, it's hilarious (not in a good way) that someone suggests a language that handles whitespace the wrong way in its own code. Yes, significant whitespace is objectively wrong.
What OS/distro is Lua included on out of the box? That doesn't mean "available in a package". I mean literally included in every single install and cannot reasonably be omitted?
Regardless of the availability, the parent comment says
> better by every objectively measurable metric
Neither Python nor Lua are "better" than shell, at the types of things shell is commonly used for - they're objectively worse.
Lua gets onto every other Linux distro as dependency of some base system component. For example, rpm or pipewire depend on lua. Ubuntu and Debian ship with pipewire per default.
You should use the word "objectively" less.
> Lua gets onto every other Linux distro
Just FYI, there are UNIX-like, POSIX compatible systems that are not a Linux distro.
> rpm or pipewire depend on lua. Ubuntu and Debian ship with pipewire per default.
Pipewire? Do you mean this? https://packages.debian.org/bookworm/pipewire
That isn't even close to "installed on every system". Best I can tell from the reverse dependencies, it's required for some Gnome Remote Desktop tool, and best I can tell, it doesn't rely on Lua anyway (at least on Debian).
> You should use the word "objectively" less.
I specifically used the word objectively, because the original comment that I replied to, said this:
> better by every objectively measurable metric
pipewire -> wireplumber -> libwireplumber -> liblua
Pipewire being the Pulseaudio replacement from Redhat.
Bookworm is probably the last Debian without :P
> Pipewire being the Pulseaudio replacement from Redhat.
Right, so it's a desktop package that ultimately will be installed on about 1% of all Linux machines because the vast majority are servers without a desktop environment.
Also worth pointing out: liblua on Debian at least, is the shared library. It's not the binary to execute standalone Lua scripts.
Is this like a game where you come up with bullshit and I have to come up with the facts to rectify it? RHEL/CentOS have more than 1% market share alone.
Check your own installs and tell me if you find some that don't have liblua or libluajit.
For the library thing: I said "Python and lua are pretty close to that." earlier. I did not say that they have interpreters ready everywhere. But if the language core is already installed on a large fraction of machines, then adding the interpreter is not a big cost.
> already installed on a large fraction of machines
So far you've presented no evidence of this though, just that it's used by a new desktop-focused package.
All linux desktops over the last 30 years is not even a "large fraction" of total Linux installs, much less the ones that have already migrated to this new audio system.
> adding the interpreter is not a big cost
It's nothing to do with cost. It's about "how do I know this will absolutely 100% run on any POSIX machine I throw it on without any extra steps".
Remember the argument here is about something that is claimed to be "objectively better" than Shell. The ubiquitous nature of POSIX shell is a huge barrier for any possible competitor, and saying "well you just need to install it" just defeats the purpose. You might as well write it in fucking java and say "well you just need to install a JVM".
Edit to Add: a good number of systems I manage do have liblua installed... because HAProxy requires it, and those systems have HAProxy installed. Not because it was installed as part of the base OS or even a default group of packages.
Incidentally, HAProxy and thus liblua were installed on those systems by infrastructure management that's implemented as shell script. So what kind of chicken and egg argument do we need to have here about how exactly I can run a Lua script to install Lua?
> a good number of systems I manage do have liblua installed
/thread
Even outside of distribution, python and lua aren't objectively better. For starters, they're much more verbose.
I just said that, scroll up.
I certainly have my complaints about Powershell, but it's got pretty good coverage, decent documentation, and cross platform support.
if it weren't so irregular, inconsistent, spotty and tasteless, it'd be a great option.
Oil shell?
https://www.oilshell.org/
Compatible with most bash scripts
> the design of Unix shell is bonkers
Compared to what?
Powershell?
PowerShell's designers could learn from decades of programming language progress and especially shell usage. They could improve many aspects indeed. This doesn't mean that the original design is "bonkers", only that it's not perfect.
The way PowerShell works is largely based on what the computing world was doing with shells outside Bell Labs, at IBM, Xerox, and other places, in exactly the same timeframe as UNIX was happening.
Can you give examples of what should be improved in PowerShell?
Verbosity is a huge problem there
Modern programming language designers have a bad relationship with verbosity. I don't know why they do this.
It's a lang for an interactive shell, typing literally translates to developer speed. I understand the want for clarity and maybe that's nice in large scripts, but the main goal is to be a shell. So, optimize for that. Also, you probably shouldn't be using powershell for large scripts anyway.
The only recent lang I've seen that has a handle on this is Rust. You can tell they put a lot of thought into having keywords be as short as possible while still being descriptive.
FoundTheCamelCaseConvert.
My God next you will say getopt() --longform is the bestest
It's been years since I used Powershell, but IIRC there are shortcuts for the common commands, e.g. cat, ls, mv, rm, and such DTRT.
Those aliases are, I believe, only defined on Windows PowerShell (the closed-source version 5; not PowerShell 7). I wish those default aliases you mentioned weren’t a thing. Especially `curl` (people should use `iwr` instead), which is an alias of `Invoke-WebRequest`, because it makes the `curl.exe` shipped with Windows nearly undiscoverable.
Works on my machine!
This should not be as downvoted as it is. In a way shell is broken. The brokenness is in that it requires each command to serialize and deserialize again, considering all the weird things that can happen with the "all is a string" kind of approach, instead of having a proper data interchange format or even sending objects to next steps in the pipeline. This behavior is what necessitates even thinking about the changes listed in the post. We wouldn't even have that problem, if the design of shell was better thought out. Now we are dealing with decades of legacy built on these shaky foundations. I hate to admit it, but seems at least this aspect Powershell got right, whatever one may think about the rest of it.
On my rhel7 system, the Debian dash shell is this large:
I happen to have an old powershell installed:
A strict POSIX shell is always going to be vastly smaller, for many reasons.
I would prefer that the POSIX shell was an LR-parsed language, but you can't have everything.
> loop over a string array
Dear anal_reactor, what is a "string array"? I have used unix shells for nearly 30 years and never heard of them. And I consider myself a script-fu master!
There are two array-like constructions in the shell: list of words (separated by spaces) and list of lines (separated by newlines). Both cases are implemented as a single string, and the shell makes it trivial to iterate through its components.
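A small sketch of both shapes (and of why quoting rules matter so much here):

    words='alpha beta gamma'
    for w in $words; do                    # unquoted on purpose: field splitting
        printf 'word: [%s]\n' "$w"
    done

    printf 'first line\nsecond line\n' |
    while IFS= read -r line; do
        printf 'line: [%s]\n' "$line"
    done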
That is exactly the problem many people have with it. Encoding „arrays“ this way is foreign to everyone who comes from „normal“ programming languages. Both variants lead to problems because either character can occur in elements, worst case scenario they contain both at the same time. I can see why this leads to confusion and bugs.
It’s like people saying they won’t learn French because it has a different grammatical structure. There’s no “normal” natural language. If you’re used to the C-like syntax, learning C-like language will be easy. But that’s not an argument to say Lisp is confusing.
That's why I put normal in quotes. There is however more to it than having a different grammatical structure: it works differently from many commonly used languages that have actual arrays/lists where elements can contain anything the type allows. If you come from any of the common modern programming languages (let's say Java, Kotlin, C#, JS/TS, Python, Swift, Go, Rust, etc.) and expect something similar (because many of them are very similar), you will be confused. Using spaces or newlines to encode elements in a single string is just not robust and leads to easy-to-make mistakes.
Most of these languages were created long after bash and the other shells. The fact is that the shell allows for unquoted strings, and quoting is a specific operation, not syntax. Also, shell scripts were meant for automation, not for writing general programs. The basic units are commands, arguments, input, output, files,… so the design makes these easy to manipulate.
I’m not saying that we can’t improve, but I’m more in favor of making the tool more apt to solve a problem than making it easier to learn. Because the latter often wants to forego the requirement of understanding the problem space.
Yes, these are newer. I mainly wanted to make the point that it is confusing if you are new to bash and come from these newer languages with the wrong expectations. The concise nature and many subtle details make it very difficult for beginners and infrequent users.
Compare this to the newer programming languages where you explicitly call something with speaking names like .Trim(), .EndsWith(), support from compiler and IDE.
In my experience automation and general programs often are the same thing once things get more complicated. Bash scripts usually grow rapidly and are a giant PITA to maintain or refactor. Throw in build systems and helper scripts and you quickly receive a giant pile of spaghetti. Personally I just switch to one the mentioned programming languages once it goes above a simple sequence of operations.
Personally I don't see how to improve it much without becoming a full blown programming language, at which point it would probably make more sense to just release a library for common automation tasks that is also composable. Maybe I'm just not the right target audience.
The issue with your otherwise good reply is that someone is bringing expectations to an expert tool (programming languages, software, OS) and blindly assuming that everything will work as he thinks it should. Familiarity helps with learning, but shouldn't replace it. Someone new to bash should probably start with a book.
And for bigger automation projects, there are lots of projects and programming languages that can help.
I agree it is an issue but it is how many people work and think. Most of the time they are not even wrong. "Hey, I have variables and loops, I know that!".
I would even make the case for expert tools being as unsurprising and familiar as possible unless there is a very good reason for them not to. Also they should be robust against misuse and guide the user towards good practices. There are always beginners, people that rarely need to use it, people that do programming as "just a job" and people that make mistakes because they are distracted, tired or just human. Something like "rm -r /" is a good reminder of that for many people.
Plus there are already a lot of tools required. Reading a book about every tool I have to use would be impractical for most projects. Maybe more expert tools should just be tools. The same way I can now just use Ubuntu and get a working desktop system including drivers for most common hardware. If I compare that to the past, where I installed a Linux distribution and then found out I lacked a driver for my network card and needed to download it from the internet... I still can modify my system if I need to, but it's nice that I don't have to. I think we can do similar things with many parts of development and free some capacity for other tasks.
Their proposed solution is not compatible with reality though where POSIX does not get to define what kind of files exist on filesystems you need to work with.
All they did is introduce new error cases in C programs while not actually fixing anything for shell scripts.
If anything, it's going to result in more exploits as people write shell scripts with the assumption that newlines cannot appear in filenames.
In the real world, nobody writes shell scripts that handle newlines in filenames.
I do. Single files are handled with quotes around arguments just fine. For lists of files you need to use NUL as a separator. That's not really hard to do once you are aware of the problem but ergonomics could be better - which is something useful that POSIX could change.
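A sketch of what that looks like with the features POSIX 2024 now mandates (`find -print0`, `xargs -0`, `read -d ''`), assuming a shell new enough to support the last one (bash already does):

    # batch style:
    find . -type f -print0 | xargs -0 ls -ld

    # per-file loop (note the loop body runs in a subshell here):
    find . -type f -print0 |
    while IFS= read -r -d '' f; do
        printf 'found: %s\n' "$f"
    done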
But they did not make old code correct. Filenames are still allowed to contain newlines. Shell scripts still need to be prepared to deal with that. Nothing really changed, they just added a feel-good half-measure.
It's a step in the right direction. You have to understand that for decades a vocal group of Unix die-hards has opposed any limitations whatsoever on the bytewise content of file names. The newline restriction in this latest version of POSIX may be modest, but it represents a dam breaking. When (obviously) the sky doesn't fall, the next version of POSIX will have a lot more filename cleanup.
Next step is to forbid newlines from file content itself, to fix conformant JSON parsers?
This is pretty standard for a human run system. Gotta make the human feel good about an idea before they can do said idea.
If you’re not familiar with humans, there are several manuals available online.
Don't assume UTF-8 is the only character encoding used in the wild. There are character encoding with leading bytes not easily detectable like UTF-8.
In 2024, if you don't get the correct result decoding a text as UTF-8, the bug is in the text, not the decoding. And luckily, adoption of UTF-8 over the past 30+ years has gone well enough that you don't need to worry.
Caveats for cursed hardware standards demanding two-byte encodings like USB.
I hope you're happy in your ivory tower, but I personally work with a lot of files with other encoding, most often that weird utf16 (Windows), sometimes also legacy files with different ANSI encoding. Declaring "my decoder is fine, it's the text that is buggy" is not going to score a lot of points with my boss and clients.
The only valid reason for still having files stored in legacy ANSI encodings is that their only use is input to software that has not been maintained for ~30 years and cannot be updated. That's fine because they're just binary inputs in a closed ecosystem that no one touches.
But if they are supposed to be treated as text, then yes it's the text that's buggy - they should just be converted to UTF-8 once and have the originals thrown away.
UTF-16 is something that Microsoft has cursed us with by inserting it into specifications (like USB) so that we cannot get rid of it, even if it never made any sense what so ever. But those are in effect explicit protocols with a hard contract, very different from something where you would "assume an encoding".
Shouldn't hurt to tell clients to right their weird proprietary software originated encodings though.
why do people still assume utf8 is the only locale encoding in use?
you're probably guilty of the sin you preach against, and are showing wrongly decoded utf8 without even knowing it.
Now do that with all whitespace!
Filename character set and its interpretation shall be controllable per directory or, at least, per FS. This pertains not only to the permitted set, like with or without LF, but to collation rules as well (including case insensitivity with cases like Turkish/Crimean/etc. I/ı and İ/i). Also, this shall include workarounds for already existing problems: if a directory already contains files I1 and ı1, there shall be a technique to deal with them separately even with a Turkish locale.
But restricting this at the syscall level is definite insanity, along with the excuses for it.
Looks like the BSD-family will have some implementing to do.
I just booted OpenBSD 7.0 (which is a bit dated).
The find utility has -print0, and xargs has -0. Notably, xargs also has -P for running processes in parallel.
rm has both -d and -v.
The renice command appears to be able to use relative adjustments with -n.
There is a timeout command.
There is a readlink command, but no realpath (but a manual page exists for it as a system call).
Strict adherence to POSIX isn't a goal of any of the current BSDs is it?
People get "POSIX compliance" confused with "Unix certification". The first is an API you implement, the second is a rubber stamp.
All active Unix-like operating systems aim to implement the new interfaces as they're defined.
I'm confident they'd accept patches.
macOS
This adds `set -o pipefail` to POSIX sh, which causes a whole pipeline to fail (non-zero exit code) if one or more of the commands in the pipeline fail.
If you're writing scripts, use that and don't forget -e and -u
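A minimal demonstration (using bash, which already implements the option):

    bash -c 'false | cat; echo "status: $?"'                    # status: 0
    bash -c 'set -o pipefail; false | cat; echo "status: $?"'   # status: 1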
For `set -u` I mostly agree. For `set -e` see my comment below and Greg's wiki: http://mywiki.wooledge.org/BashFAQ/105
> and they still fail to catch even some remarkably simple cases
I totally agree. Although I'd say that there isn't anything "remarkably simple" about writing a bash script. Anything in the shell scripting world that seems remarkably simple is just because one hasn't realised the ghosts and horrors that lurk in the shadows.
But I'll use -e anytime. It feels like having a protective proton pack at least.
Does it? It is not mentioned anywhere in the post. Can you post a reference to your source?
The post only has a few highlights. The POSIX specs are only for paying IEEE customers, but https://pubs.opengroup.org/onlinepubs/9799919799/ mentions it.
That is the POSIX spec, no?
It's at: https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V...
(no permalink, search for "pipefail")
Holy balls that's like Christmas!
Really? Won't that break piping grep?
Probably, so don't `set -o pipefail` in scripts that pipe into grep.
Ah ok I read it as 'sets it by default' for some reason.
Sad. Use of that option is almost always a mistake. It only leads to undebuggable silent failures.
I'd rather both have this option and have it work reliably. It's ridiculous that
does not count as a pipefail when cmd1 or cmd2 fail but does, so the "correct" way to set an environment variable from a pipeline's output is actually
Pipefail is useful and very hard to emulate on pure POSIX; you need to create named fifos, break the pipeline into individual redirections and check for error on each line.
And that is fine; but sometimes you want to treat a pipeline as a "single command" and then you can use pipefail to abort the pipeline on error. Then you can handle the error at the granularity of the entire pipeline without caring which part failed.
Lastly, I am confused as to the "silent" failures; maybe you are thinking of combining this with `set -e`? Then yes, that is bad and I recommend against the combination; but then again, I and most advanced scripters recommend against shotgunning `set -e` in the first place. Use it in specific portions of the script when appropriate, and use proper error handling otherwise.
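A sketch of that granularity, with cmd1 and cmd2 as stand-ins for real commands:

    set -o pipefail
    if ! cmd1 | cmd2 > out.txt; then
        echo "pipeline failed" >&2
        exit 1
    fi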
Why does `set -e` make a pipeline fail silently?
`set -e` makes the script abort and is often used in lieu of proper error handling:
Whether the above reports an error or not depends on the command; when you have a pipeline failing in the above way, it is even sneakier: you are reliant on all commands in the pipeline being verbose about failure to signal an error.
None of the above is advisable. The advisable code is
Error handling like that makes sense if you’re writing a program. But if you just want a script for an automation, `set -e` is enough.
It is not; Greg's wiki further explains why, if the silent failure problem above is not enough reason.
Gee, imagine if shells with errexit option enabled wrote some diagnostic output to stderr before exiting. "Add your own error checking instead", how do I check which piece of pipeline has failed, exactly? The PIPESTATUS variable is bash-specific and was not standardized.
? Why are you replying to me? My position was pretty clear:
"Pipefail is useful and very hard to emulate on pure POSIX; you need to create named fifos, break the pipeline into individual redirections and check for error on each line.
And that is fine; but sometimes you want to treat a pipeline as a "single command" and then you can use pipefail to abort the pipeline on error. Then you can handle the error at the granularity of the entire pipeline without caring which part failed."
By the way, I never script in Bash; I only script in POSIX primitives using dash as my executable.
The history at the beginning of this is not correct. Two examples: the assertion that there was one compatible UNIX prior to United States v. AT&T, and the statement that GNU and BSD started that same year. Very, very off.
Okay, but you would add more value if you could also state the correct order of things.
https://en.m.wikipedia.org/wiki/History_of_Unix#/media/File%... is a good visual of (many of, not all) the various versions of UNIX and when they were released. BSD was first released in 1978. United States v. AT&T was implemented in 1984 (judgment 1982) GNU was first created in 1983.
TIL the POSIX standard is still updated. Does it still suffer from the issues that make Linux break POSIX compatibility in some areas because they consider it a flawed standard?
Yes! Finally! Let's treat filenames with new lines as errors! I'm so delighted with this decision.
The original request was to ban all bytes between 1 and 31.
https://www.austingroupbugs.net/view.php?id=251
At some point they decided to narrow the change to just ban the newline character.
Which I personally think is a pity. Allowing escape in file names is a security risk because it enables you to embed ECMA-48 escape sequences in file names. Secure terminal emulators shouldn’t be made vulnerable by arbitrary escape sequences, but there are “too smart for their own good” terminal emulators out there that have escape sequences that let you do crazy things like run arbitrary executables.
There are many non-UTF-8/16/32 character encodings used in the wild whose multi-byte sequences use these byte values. These values are used in the wild.
I think the decision forbidding newlines in pathnames is also wrong. It may break tons of existing code.
I wish Linux/etc had a mount option and/or superblock flag called “allow only sane file names”. And if you had that set, then attempting to create a file whose name wasn’t valid UTF-8, or which contained C0 or C1 controls, would fail. The small minority of people who really need pre-Unicode encodings such as ISO 2022 could just not turn that option on. And the majority who don’t need anything like that could reap the benefits of eliminating a whole category of potential bugs and vulnerabilities.
> There are many non-UTF-8/16/32 character encoding used in the wild which use these value in multi-byte character encoding.
Like what? I am genuinely curious: Shift-JIS, GB2312, Big5, and all of the EUC variants do not use bytes that correspond to C0 characters in ASCII.
That's obviously impossible since it would break backward compatibility and the users' existing filesystems (and the Linux kernel will rightly never accept anything like that).
The only reasonable fix is to enhance bash and shell IDEs to track for each variable whether it could possibly include all filename-valid characters (e.g. if it comes from read with no options then it can't contain \n) and warn (off by default unless stderr is a terminal) if they can't and it's used as a filename (conservatively determined when used as arguments to processes), and also warn when using find without -print0, etc. noninteractively and perhaps interactively as well.
Why is that an issue?
Run a program to list a directory. Everything that interfaces with that, will assume newline delimiters. Similar assumptions are baked into a lot of software.
Enforcing that a newline isn't part of a path, ensures the security of those systems that are commonly relied on.
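A quick demonstration of the failure mode, in a throwaway directory (the filename below is made up):

    nl=$(printf '\nx'); nl=${nl%x}     # one newline character
    touch "evil${nl}passwd"            # a single file whose name contains it
    ls | while IFS= read -r name; do printf 'saw: %s\n' "$name"; done
    # saw: evil
    # saw: passwd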
Except no one's enforcing anything yet. Earlier versions of POSIX allowed rejecting filenames containing newlines, the newest version encourages it while mandating features required to handle such filenames safely (find -print0, xargs -0, read -d ''). So nothing's set in stone yet.
> Everything that interfaces with that, will assume newline delimiters.
Well, only badly written programs. nushell handles this fine, as will any program that doesn't try to do everything as plain strings:
However after reading it they're only making them illegal for the posix utilities from the 70s that aren't written properly, so I think that makes sense.
Next: spaces
Still much better than mojibaked names.
What do you mean?
What is the encoding of the filenames?
I am personally not aware of any MBCS that could have a 0x20 or 0x0D as a valid trailing byte. Are you?
I think my comment correctly contrasted mojibake from new lines or spaces for that reason.
> We’ve established that, yes, pathnames can include newlines. We have not established why they can do that. After some deliberation, the Austin Group could not find a single use-case for newlines in pathnames besides breaking naive scripts. Wouldn’t it be nice if the naive scripts were just correct now?
Finally. Now let's do the rest: https://dwheeler.com/essays/fixing-unix-linux-filenames.html
Filenames should be boring printable normalized UTF-8. I have never, not once, seen a good reason that a filename should be able to contain random binary gobbledygook
> Filenames should be boring printable normalized UTF-8. I have never, not once, seen a good reason that a filename should be able to contain random binary gobbledygook
Ensuring normalization is hard. Where should you do it? There's only one good place: in the filesystem. But if you normalize on create then you'd better use the same form that everyone else uses, but, what's that? Input methods generally produce NFC, but there's no guarantee that they will not produce something else. HFS+ normalizes to NFD on create.
ZFS uses form-insensitivity -- much like case-insensitivity, but for form. The reason ZFS went this way was exactly that HFS+ and input methods differ as to forms. I pushed hard for this way back when. IMO form-insensitivity is the best way forward.
But as for guaranteeing that filenames are UTF-8... that's much harder. The best thing to do is to not allow the use of non-UTF-8, non-ASCII, non-C locales -- not a guarantee, but pretty good.
Sure. Form-insensitivity is another good option. I'd actually argue for full case insensitivity too (like macOS), although I realize that it's probably a stretch.
Case-insensitivity is also an option in ZFS, but honestly case-insensitivity drives me nuts, especially if it's not case-preserving. Oh, that reminds me, ZFS is form-insensitive, and form-preserving.
I really hate to say it, but the fretting about newlines used as delimiters after 50 years of misuse …
… makes PowerShell start to look damn good.
To build an internationalized shell script I'll need to compile multiple .mo language files and distribute them along side the script itself.
For shell scripts part of a large system, that's probably fine. For small scripts, that's not very practical. You are not only adding a compilation step, you're also requiring distribution of multiple files. That's a pain.
It just kind of kills the convenience of a simple shell script. I would probably end up writing a makefile to manage all of this and at that point I am only a hop skip and jump away from using a compiled language instead of shell.
Hopefully nothing, posix is, or at least it should be, a descriptive standard. This is why posix is so terrible, and why posix is so great.
The way I feel, posix and other descriptive standards work best when they describe what everyone is already doing. This is opposed to prescriptive standards, which focus on the "correct" way to do something; prescriptive standards tend to be over-engineered and may or may not actually work.
see also: descriptive and prescriptive dictionaries. http://www.englishplus.com/news/news1100.htm
Both prescriptive standards and descriptive standards have their uses. If POSIX is a prescriptive standard, then maybe another standard should exist that is descriptive.
Keep in mind that the Web standard eventually became prescriptive because descriptive standards failed to catch up. Likewise it can be argued that descriptive standards for the common OS interface are no longer usable.
To be crass, description is only useful for existing things and prescription hinders making innovative things. I think social forces make it natural that standards are treated both descriptively and prescriptively, and that too leads to angst. Case in point, POSIX was once more descriptive, but then people wanted backwards compatibility for existing and new OSes, which made it more prescriptive. The takeaway is that ad-hoc things become permanent once they are too difficult to remove, and then people are sad. Nothing is immune, so just make reasonable attempts for the standard and the culture to harmonize for a specific purpose.
That is also a way to never progress beyond the status quo.
Nitpick re: https://blog.toast.cafe/posix2024-xcu#fn:6
is fine in a makefile as far as POSIX is concerned, because:
> Applications shall select target names from the set of characters consisting solely of slashes, hyphens, periods, underscores, digits, and alphabetics from the portable character set
I kind-of would like to see a POSIX-strict profile which incorporates commonsense restrictions (by commonsense I mean avoiding things that repeatedly, over many years, have tripped up programmers in frustrating ways), like no newlines in file names. Operating systems (or distributions) could opt into this profile, and then someone programming on such an operating system could rely on the constraints of the profile, and additional facilities that might need to rely on those constraints could be added on. Hopefully, the use of the profile would gradually spread.
> future editions will not require c17, but will simply require whatever C specification version is the most modern and already implemented by major toolchains
Is this really good?
If you can't rely on anything concrete being guaranteed, and it is open to interpretation what "modern" or "major toolchains" are, why have a standard?
EILSEQ for \n finally, but why not for unicode confusables? Path names are identifiers, and as such need to be identifiable. Meaning stricter rules than just buffers (not talking about strings).
Since old-POSIX systems will be in use for some time, I wonder how many things will be able to switch to using the new capabilities. And how many OSes already support all of the new changes.
Why was `isascii()` removed?
(Listed in the Sortix article linked in OP.)
It would yield false-positives with non-UTF-8 encoded text. Big5 <https://en.wikipedia.org/wiki/Big5#Encoding> in particular was notorious for using ASCII values for trailing bytes. I don't know if it's still in use or if there are others.
This is a surprisingly greedy POSIX update.
As someone who truly limits himself to POSIX when he can, I think they needed to push it forward to not become completely obsolete. I'm really sad `mktemp -d` and `set -o nullglob` didn't make the cut, but that's how it is, I guess.
A bespoke `mktempd` script is one of the first things I install in a new system. Fortunately, it is not too hard to make a `mktemp -d` compatible script with POSIX tools. `set -o nullglob` is another story :D
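For the record, a minimal, unhardened sketch of such a script (the function name, retry count and name pattern are arbitrary); it leans on mkdir being atomic, so an existing path, including a planted symlink, makes creation fail rather than get followed:

    mktempd() (
        base=${TMPDIR:-/tmp}
        n=0
        while [ "$n" -lt 100 ]; do
            # awk's srand() seeds from the clock; $$ and the counter add variation
            r=$(awk 'BEGIN { srand(); printf "%06d", int(rand() * 1000000) }')
            dir="$base/tmp.$$.$n.$r"
            if mkdir -m 700 "$dir" 2>/dev/null; then
                printf '%s\n' "$dir"
                exit 0
            fi
            n=$((n + 1))
        done
        echo "mktempd: could not create a directory under $base" >&2
        exit 1
    )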
It's quite hard to write mktemp securely[1]. It would be great if POSIX didn't make people attempt to do that error-prone task themselves.
[1]: There's some explanation in this recent post: https://dotat.at/@/2024-10-22-tmp.html
This is correct (though of course a decent `mktempd` script will deal with the listed problems or crash loudly on failure), and there are even more reasons to avoid /tmp.
Unfortunately, it is one of the very few directories that are somewhat POSIX-"guaranteed" writable by a non-root user and the fact that on modern systems it is usually mounted on a tmpfs makes it very attractive for pure POSIX usage without rich array support.
If you have mount permissions, of course, you should tell your `mktempd` to base its directory on a private tmpfs.
File names with / in them
> The problem is that pathnames2 (as per section 3.254 of POSIX 2024) are just strings (meaning they can contain any bytes except the NUL character), [...]
Pathnames can neither contain NUL nor '/'.
Re: `find -print0` / `xargs -0`:
> Previous POSIX releases have considered -print0 before, but never ended up adopting it because using a null terminator meant that any utility that would need to process that output would need to have a new option to parse that type of output.
What nonsense. Just add the `-0` or similar options as needed.
> More precisely, this approach does not resolve our original problem. xargs(1p) can’t sort, and therefore we still have to handle that logic separately, unless sort(1p) also grows this support, even after read(1p). This problem continues with every other type of use-case. Importantly, it breaks the interoperability that POSIX was made to uphold.
More nonsense.
> A bunch of C functions3 are now encouraged to report EILSEQ if the last component of a pathname to a file they are to create contains a newline (put differently, they’re to error out instead of creating a filename that contains a newline).
Ok, that's tolerable. Ditto utilities (notice here they were able to make a list of utilities).
Note that GNU sort has...
-z, --zero-terminated: end lines with 0 byte, not newline
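So a fully NUL-separated version of the pipeline the article worries about already exists with the GNU extensions (not POSIX, but available today):

    find . -type f -print0 | sort -z | xargs -0 ls -ld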
strlcpy()!
> Anyway, POSIX 2024 now requires c17, and does not require c89
I wish it would have been c99. What does c17 add exactly, more C++-esque complexity or not? Why was it not c99 (or perhaps even c11) over c17? Genuine questions.
> What does c17 add exactly, more C++-ish bullshit or not?
Multithreading support and such (atomics, thread-local storage and a guarantee that `errno` is in TLS), explicitly aligned types and allocations, dedicated types for strings known to be Unicode, _Noreturn, _Generic, _Static_assert, anonymous structs and unions in the nested position, quick_exit, timespec, exclusive mode ("x") in f[re]open, CMPLX macros.
I'm not even sure which of these could possibly be C++-ish bullshit, except for about two points:
- Multithreading does seem farfetched for casual users. In fact, I do think it could have been minimized without any actual harm, but multithreading itself needed to be specified because it greatly affects the memory model. (Before C11, C had no thread-aware memory model and different threading implementations were subtly different beyond what the standard stated.) Even JavaScript, originally with no notion of threads, eventually got a thread-aware memory model due to shared web workers. But that never meant JS itself needed multithreading support in its standard library, and C could have done the same.
- `_Generic` is even more debatable, though I believe it was the only way forward when we accept <tgmath.h>, which is known to be a response to Fortran (other responses include `restrict`) and was impossible to implement in the portable manner before C11. As long as it retains its scary underline and title case, I guess it's fine.
Most importantly posix already has existing multithreading facilities in posix threads, so it is imperative that they are reformulated in term of the C++11/C11 memory model.
You quoted me before my edit, but fair enough. I do like the "atomics" support.
> "guarantee that `errno` is in TLS"
I suppose that does not mean that I can just avoid setting errno to 0 before calling a function after which I check for errno, right?
Yeah, I do have an issue with stuff like "_Generic" but I assume I can just simply not use it.
What is "quick_exit" exactly and what does it solve?
As for multithreading, I stick to pthreads. Is any of the new features a replacement for that, or what?
At any rate, why C17 over C11 then?
C17 is a bugfix version of C11 (the next major revision would be C23). The exact list of fixes is available in [1]. Mandating C11 instead of C17 when both are available seems not really useful now.
You have the correct insight about errnos. The new guarantee only means that other threads are not possible to mess with your errnos, but cleaning errnos will be still useful within an individual thread.
exit is not guaranteed to work correctly when called simultaneously from multiple threads, while quick_exit will be okay even in that situation. I think this behavior was not even specified before C11, and only specified after observing existing implementations.
It is expected that libc threading routines are thin wrappers around pthread in Linux. That's why I do think it can be minimized; the only actual problem before C11 was the lack of thread-aware memory model. No need to actually be able to create threads from libc to be honest, especially given that each platform now almost always has a single dominant threading implementation like pthread.
[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2244.htm
My last question would be: is it "OK" to use pthreads in my code, or are there any alternatives (i.e. a "best way") when using C17?
No, just use pthread. There are some useful pthread APIs missing from C17 anyway too.
Thank you for your answers, it is much appreciated.
I suppose I will not use "quick_exit" either in that case. I have many workers, there is a job queue mutex, along with pthread_cond_wait and pthread_mutex_{lock,unlock}, and when the "job_quit_flag" is set to true, that means all jobs are done and I am ready to return NULL.
> guarantee that `errno` is in TLS
I mean, that is already true.
There is no such guarantee in C99:
7.5 ¶2: [...] and `errno` which expands to a modifiable lvalue that has type `int`, the value of which is set to a positive error number by several library functions. It is unspecified whether `errno` is a macro or an identifier declared with external linkage. If a macro definition is suppressed in order to access an actual object, or a program defines an identifier with the name `errno`, the behavior is undefined.
7.5 ¶3: The value of `errno` is zero at program startup, but is never set to zero by any library function. The value of `errno` may be set to nonzero by a library function call whether or not there is an error, provided the use of `errno` is not documented in the description of the function in this International Standard.
The fact that `errno` can expand to an lvalue does reflect what is required for multithreading implementations among others, but that's about all.
Nor is it in POSIX, but it's true of all POSIX-like systems that support threading.