Resolving a mysterious problem with find

(johndcook.com)

33 points | by chmaynard 8 months ago

46 comments

  • shakow 8 months ago

    I'm... not sure that's the kind of Friday-at-2AM embarrassing mistake you want to put on your blog if you sell yourself as an elite consultant?

    Not using `-iname`, using `-print0` and being surprised to see NULs appearing, the weird pipe + xargs instead of just `-exec`, using some hyper-convoluted way of replacing the NULs instead of just man find... that's probably not the best advertisement for “decades of consulting experience”.

    • sirsinsalot 8 months ago

      Being an experienced consultant doesn't mean you know everything about everything.

      Hell, I put myself in that camp and I am perfectly capable of hacks, kludges, and mistakes.

      Be compassionate.

      • cowsandmilk 8 months ago

        Using print0 as an argument and then being surprised at what it does is not great. It speaks to the author never opening the man page for find and just having copied this command a long time ago.

      • usr1106 8 months ago

        But you should know what you don't know. You don't blog about a mystery in the behavior of some code when you have fundamental gaps in your understanding of how that code works. I'd say asking on SO might be better in this case.

      • shakow 8 months ago

        The problem is not that he does not know everything; it is very normal to learn all the time (xkcd 1053, etc.). The issue is being so deep into your own sense of self-importance that you imagine your discovering the basic functions of a decades-old program is worth sharing with the whole world on your professional website.

        I'm pretty sure a lot of people would laugh at a JS “thought leader” who would write a whole post on how he just discovered this weird number i whose square is negative and how smart he was to have found the wikipedia page about it.

    • EdwardCoffin 8 months ago

      To be fair, he is a mathematical consultant who uses computer tools, not a specialist in computer tools.

    • xg15 8 months ago

      > the weird pipe + xargs instead of just `-exec`

      I think that part makes sense though as it's simply the older idiom before -exec existed. (That's the reason why both find and xargs have specific flags related to \0-delimited filenames that are basically counterparts to each other)

      Also, shouldn't the two be roughly equal in efficiency? To my knowledge, xargs (without -i) does the same command aggregation that -exec ... + does.

      So the number of "grep" processes spawned by the two commands should be roughly the same, I think.
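
      A rough way to check this, as a sketch: the temporary directory and the 50 sample files are made up for illustration, `sh -c 'echo batch'` stands in for grep so that each spawned process prints exactly one line, and it assumes a find/xargs pair that supports -print0 and -0.

      ```shell
      tmp=$(mktemp -d)

      # Hypothetical sample tree: 50 small Python files.
      i=1
      while [ "$i" -le 50 ]; do
          echo "hello" > "$tmp/file$i.py"
          i=$((i + 1))
      done

      # Each 'sh -c' invocation prints one line, so the line count equals the
      # number of processes that find or xargs spawned.
      exec_plus=$(find "$tmp" -name '*.py' -exec sh -c 'echo batch' sh {} + | wc -l | tr -d ' ')
      exec_each=$(find "$tmp" -name '*.py' -exec sh -c 'echo batch' sh {} \; | wc -l | tr -d ' ')
      via_xargs=$(find "$tmp" -name '*.py' -print0 | xargs -0 sh -c 'echo batch' sh | wc -l | tr -d ' ')

      printf '%s: %s\n' '-exec ... +' "$exec_plus"
      printf '%s: %s\n' '-exec ... ;' "$exec_each"
      printf '%s: %s\n' 'xargs -0' "$via_xargs"

      rm -rf "$tmp"
      ```

      With a file list this small, both batching forms should need a single process, while the `\;` form spawns one per file.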

    • usr1106 8 months ago

      I thought exactly the same. Those who can do, do. Those who cannot do, teach. Those who cannot teach, consult.

      But instead of dwelling on prejudices I decided to try my own solution. See https://news.ycombinator.com/item?id=42163286

      • Upvoter33 8 months ago

        Why insult teachers?

        • card_zero 8 months ago

          Well,

          This quote is from George Bernard Shaw, but it's from a character from a play he wrote in 1903, "Man and Superman". The character is a descendant of Don Juan, a firebrand, but instead of being crazy about seducing women, he's crazy about revolution and anarchy. GBS explains exactly what he was thinking when he wrote this, in the preface, at extreme length:

          https://www.gutenberg.org/cache/epub/3328/pg3328-images.html

          The preface is actually a letter to the friend who inspired the play. Some quotes:

          > There is a political aspect of this sex question which is too big for my comedy ...

          > When we two were born, this country was still dominated by a selected class bred by political marriages.

          > I do not know whether you have any illusions left on the subject of education, progress, and so forth. I have none. Any pamphleteer can show the way to better things; but when there is no will there is no way.

          > I have only made my Don Juan a political pamphleteer, and given you his pamphlet in full by way of appendix.

          So the quote is from that appendix, Maxims for Revolutionists.

          https://www.gutenberg.org/cache/epub/26107/pg26107-images.ht...

          > He who can, does. He who cannot, teaches.

          There's at least one other well known quote there:

          > The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.

          Did GBS really mean this stuff? Kinda, but he's obviously being playful, because he got an overblown and ridiculous comedy revolutionary character to say it for him, in a work of fiction.

          • usr1106 8 months ago

            I had no idea that it's that old. I had only heard it in the context of consultant bashing.

        • pizzafeelsright 8 months ago

          Teachers at a minimum only need n+1 more knowledge than their students.

    • rustybolt 8 months ago

      I'm very glad you're not reviewing my PRs.

  • NikkiA 8 months ago

    You're using -print0 and surprised that its output has NUL characters between the filenames?

    • xg15 8 months ago

      Was puzzled about that too, especially since his solution "find ... -print0 | strings" undoes the advantage that -print0 gives you, i.e. safe handling of filenames with newlines in them (and his "sed" solution straight-up undoes the -print0 completely).

      So with all due respect to the author, I wonder if he was just using -print0 after rote-learning it as part of the find command (or having had some tutor implore "ALWAYS use -print0"), without knowing what it does.

  • unsnap_biceps 8 months ago

    > There may be better solutions [1], but my solution was to insert a call to strings in the pipeline

    The "right" answer is to switch to using -print rather than -print0

    -print delimits the values with a newline character (\n); -print0 delimits the values with a null character (\0).
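
    To see the two delimiters directly, here is a small sketch that counts delimiter bytes in each form of the output. The temporary directory and file names are invented for the example, and it assumes a find that supports -print0.

    ```shell
    tmp=$(mktemp -d)
    touch "$tmp/a.py" "$tmp/b.py"

    # Keep only the delimiter byte of interest (tr -cd deletes everything else)
    # and count what is left.
    newlines_print=$(find "$tmp" -name '*.py' -print | tr -cd '\n' | wc -c | tr -d ' ')
    nuls_print0=$(find "$tmp" -name '*.py' -print0 | tr -cd '\000' | wc -c | tr -d ' ')
    newlines_print0=$(find "$tmp" -name '*.py' -print0 | tr -cd '\n' | wc -c | tr -d ' ')

    echo "-print:  $newlines_print newline(s)"
    echo "-print0: $nuls_print0 NUL(s), $newlines_print0 newline(s)"

    rm -rf "$tmp"
    ```

    Two files give two newlines with -print, and two NULs with no newlines at all with -print0, which is why a line-oriented tool sees the -print0 output as one long line.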

    • natmaka 8 months ago

      Not always perfectly right because an argument containing a filename containing a space character will be interpreted as 2 arguments.

      • CGamesPlay 8 months ago

        No it won’t, because none of the output is interpreted as an argument. It’s passed as lines to grep. The second invocation correctly uses print0 and pairs with xargs to understand this.

        Now, it does fail with filenames that have newlines in them, but who would do such a thing!

        • natmaka 8 months ago

          I wrote "Not always perfectly right" thinking about all cases, not this particular one: in nearly all cases (bar being absolutely sure there is no blank character anywhere) -print0 (and therefore xargs -0) seems better to me, and it sure saved me on many occasions. Better let "find" do all the work it can, including filtering filenames.

        • usr1106 8 months ago

          If you run this interactively on your own files, saying "who would do such a thing" is fine.

          If your server code runs this on untrusted input (files uploaded by users or whatever), the answer will be: Someone trying to crack your system.
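
          A minimal sketch of that failure mode, with a made-up hostile filename in a temporary directory (assumes a find that supports -print0 and a filesystem that allows newlines in names):

          ```shell
          tmp=$(mktemp -d)
          nl=$(printf '\nx'); nl=${nl%x}   # a literal newline, built portably
          touch "$tmp/plain.py" "$tmp/evil${nl}name.py"

          # Line-based counting is fooled: one filename spans two lines.
          by_lines=$(find "$tmp" -name '*.py' -print | wc -l | tr -d ' ')

          # NUL-based counting sees the real number of files.
          by_nuls=$(find "$tmp" -name '*.py' -print0 | tr -cd '\000' | wc -c | tr -d ' ')

          echo "files according to -print + wc -l: $by_lines"
          echo "files according to -print0:        $by_nuls"

          rm -rf "$tmp"
          ```

          Anything downstream that splits on newlines would treat the second half of the hostile name as a separate path.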

      • unsnap_biceps 8 months ago

        Certainly, which is why I put quotes around right, but for this usage, it's not an issue. Find prints the whole path on a single line (including the spaces) and grep (by default) puts the full matched line, so you'll still get the full file path regardless of how many spaces are in it.

  • cess11 8 months ago

    Pretty convoluted, no?

    I would likely use -exec:

       $ mkdir dir.py
       $ echo blah >> blah.py
       $ find . -type f -name "*.py" -exec grep -i BLAH {} \;
       blah
       $
    
    Edit: Ah, right, he's filtering on filenames. That's what -iname is for. The man page is quite good.
  • linsomniac 8 months ago

    Am I the only one who has gone all in on using "-exec +"?

        find . -name '*.py' -type f -exec grep -il PATTERN {} +
    • nicoburns 8 months ago

      I've switched away from find entirely, and now use "fd" whose exec functionality is quite straightforward to use.

      • linsomniac 8 months ago

        fd looks like a great tool, if you don't have `find` locked in already you probably want to start with `fd` instead. I've already got the basics of `find` in my head so that's the tool I usually reach for rather than switching to `fd`.

    • usr1106 8 months ago

      That solves only the second part of their task. The part which they actually had no problem with. But I agree the exec + solution feels better than the xargs -0 solution.

      • linsomniac 8 months ago

        Agreed. The first part of that task just seemed to be a misunderstanding of what -print0 is, and using `strings` as the fix is weird. I'm surprised they didn't suggest `tr '\0' '\n'`... :-)
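
        For reference, a sketch of what the `tr` version would look like. The file names and the search word are invented for the example, and it gives up the newline safety of -print0 just like `strings` does.

        ```shell
        tmp=$(mktemp -d)
        touch "$tmp/Frodo.py" "$tmp/sam.py"

        # Turn the NUL delimiters back into newlines so grep sees one path per line.
        matches=$(find "$tmp" -name '*.py' -print0 | tr '\000' '\n' | grep -i 'frodo')
        echo "$matches"

        rm -rf "$tmp"
        ```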

  • magicalhippo 8 months ago

    Find is one of those tools I use seldom enough that I completely forget how to use it, but is also complex enough that when I do need it I have to spend way too long studying the man page to figure out the right incantations.

    In almost all cases I just want something simple, like finding a file somewhere on disk based on a partial filename.

    Now there are probably some nice, more modern tools made for this, but usually when I need it it's on some system where they're not present and I can't just install random stuff from the interwebs. So find it is...

    • cowsandmilk 8 months ago

      I feel like locate is commonly installed and is designed for what you describe.

      • magicalhippo 8 months ago

        Often yea, but updatedb requires root IIRC and I don't always have that.

        I just feel it would be nice if finding a file didn't feel like playing Tomb Raider.

  • TheGrassyKnoll 8 months ago

    I recommend giving ripgrep a try. (It's been around a while now.) https://github.com/BurntSushi/ripgrep

    • oguz-ismail 8 months ago

      It's not compatible with grep though. How do you search for a square bracket?

          $ grep '[][]' </dev/null
          $ rg '[][]' </dev/null
          rg: regex parse error:
              (?:[][])
                   ^^
          error: unclosed character class
          $
      
      And why does it search the current directory when its input is redirected from /dev/null? What other surprises are there?
      • blueflow 8 months ago
      • pletnes 8 months ago

        To me, ripgrep is an improvement and the differences are a good thing.

      • nicoburns 8 months ago

        It's compatible or close enough with more modern regex syntaxes, which are probably familiar to a lot more people than grep's. Want to search for square brackets? Then escape them (or do a string literal search with -F).

    • pletnes 8 months ago

      So much faster than grep for these things! Love ripgrep! I also use it to rip apart directories of log files. Super convenient

  • naruhodo 8 months ago

    Instead of `find -name '*.py' | grep -i "$PATTERN"` you can use `find -iname "*${PATTERN}*.py"` for case-insensitive glob-matched filenames, or mess around with regexes on the whole path with `find -iregex "$REGEX"`.

    And yeah, why would you ASCII NUL terminate each filename output by `find` by using `-print0`? I mean, who adds quotes, backslashes or whitespace to their Python source file names?
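
    A small sketch of the -iname form in action. The directory contents and the value of PATTERN are made up for the example; -iname itself is a common extension found in GNU and BSD find.

    ```shell
    tmp=$(mktemp -d)
    touch "$tmp/Frodo.py" "$tmp/frodo_notes.txt" "$tmp/sam.py"

    PATTERN=frodo
    # Case-insensitive glob on the file's base name; no grep needed.
    found=$(find "$tmp" -iname "*${PATTERN}*.py")
    echo "$found"

    rm -rf "$tmp"
    ```

    Only Frodo.py is listed: the .txt file fails the glob, and sam.py does not contain the pattern.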

    • pletnes 8 months ago

      Why not just globstar in the first place? grep foo **/*ham*py

  • howeyc 8 months ago

    if you want file names matching a pattern, no need for grep:

    find . -name '*.py' -iname '*pattern*'

    for filenames that have content matching pattern, no need for find:

    grep -r --include '*.py' -l -i pattern .

    no need for pipes, xargs, etc.
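
    Both commands can be tried on a throwaway tree. This is only a sketch: the files, their contents, and the search words are invented, and `--include` assumes a grep that supports it (GNU and BSD grep do).

    ```shell
    tmp=$(mktemp -d)
    echo 'the Ring' > "$tmp/frodo.py"
    echo 'potatoes' > "$tmp/sam.py"
    echo 'the ring' > "$tmp/notes.txt"

    # Filenames matching a pattern: find alone.
    by_name=$(find "$tmp" -name '*.py' -iname '*FRODO*')

    # Python files whose content matches a pattern: grep alone.
    by_content=$(grep -r --include '*.py' -l -i 'ring' "$tmp")

    echo "by name:    $by_name"
    echo "by content: $by_content"

    rm -rf "$tmp"
    ```

    notes.txt contains the word too, but `--include '*.py'` keeps it out of the results.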

  • gbolcer 8 months ago

    I was going to say, do a file -i to find the encoding of frodo.py. I note the file man page has a -0 or --print0 command that adds a ‘\0’ that can be 'cut', but strings works too.

  • usr1106 8 months ago

    The first line they start with is utter nonsense. find -print0 will not produce lines, but records (or strings) separated by NUL. But grep is a tool working with lines (separated by LF). No mystery that it cannot work.

    Using -print0 is necessary if you have filenames containing LF chars. Otherwise just use -print and grep and everything should be fine.

    Now how do we handle NUL separated records? That required a bit of thinking, the Unix world is based so much on lines. Without extensive testing the following awk program seems to work:

        BEGIN       { RS = "\0" }
        $0 ~ regexp
    
    Call with

        awk -v 'regexp=what I search for'
    
    In their script that would be

        awk -v "regexp=$1"
    
    
    Edit: Credits for s/whitespace/LF chars/ go to user hnfong
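
    Put together as one pipeline, a sketch might look like this. The file names and search word are made up, and whether `RS = "\0"` is honored depends on the awk implementation: gawk handles it, other awks may not, so treat that as an assumption.

    ```shell
    tmp=$(mktemp -d)
    touch "$tmp/Frodo.py" "$tmp/sam.py"

    # find emits NUL-terminated records; awk reads them with RS set to NUL and
    # prints each matching record followed by a newline (the default ORS).
    matches=$(find "$tmp" -name '*.py' -print0 |
        awk -v 'regexp=Frodo' 'BEGIN { RS = "\0" } $0 ~ regexp')
    echo "$matches"

    rm -rf "$tmp"
    ```

    Note that the output is newline-separated again, so this is only safe for display, not for feeding names with embedded newlines to another program.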
    • hnfong 8 months ago

      When grepping for filenames, print0 is needed only when the filenames have newlines in them (which is quite degenerate). grep works fine with spaces and tabs on stdin.

      • usr1106 8 months ago

        Thanks! Updated.

  • creaktive 8 months ago

    You keep using that -print0, I do not think it means what you think it means