Lemmy newb here, not sure if this is right for this /c.

An article I found from someone who hosts their own website and micro-social network, and their experience with web-scraping robots who refuse to respect robots.txt, and how they deal with them.

  • Stubb@lemmy.sdf.org · 2 days ago

    I’ve found that many of these solutions/hacks block legitimate users browsing with the Tor Browser, as well as Internet Archive scrapers. That may be a dealbreaker for some, but it’s probably acceptable to most users and website owners.

  • Jason2357@lemmy.ca · 3 days ago

    This is signal-detection theory combined with an arms race, and that combination keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

    • Tobberone@lemm.ee · 2 days ago

      The internet as we know it is dead; we just need a few more years to realise it. And I’m afraid telecommunications will go the same way, once no-one can trust that anyone is who they say they are anymore.

    • klu9@lemmy.caOP · 1 day ago

      You’re welcome.

      I believe I found it originally via the “distribuverse”… specifically, ZeroNet.

    • splendoruranium@infosec.pub · 3 days ago

      They block VPN exit nodes. Why bother hosting a web site if you don’t want anyone to read your content?

      Fuck that noise. My privacy is more important to me than your blog.

      It’s a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

    • tripflag@lemmy.world · 3 days ago

      and filtering malicious traffic is more important to me than you visiting my services, so I guess that makes us even :-)

        • El Barto@lemmy.world · 2 days ago

          You had me until the “ethically sound position” part.

          You’re saying that Joe Blogger is acting unethically because he doesn’t allow VPN users to visit his site. C’mon, brother.

        • tripflag@lemmy.world · 2 days ago

          Absolutely; if I was a company, or hosting something important, or something that was intended for the general public, then I’d agree.

          But I’m just an idiot hosting whimsical stuff from my basement, and 99% of it is only of interest to my friends. I know ~everyone in my target audience, and I know that none of them use a VPN for general-purpose browsing.

          As it is, I don’t mind keeping the door open to the general public, but nothing of value will be lost if I need to pull the plug on a few more ASNs to preserve my bandwidth. For example, last week a guy hopping through a VPN in Sweden decided to download the same zip file thousands of times, wasting terabytes of traffic over a few hours.
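The repeated-download scenario above can be caught automatically before it costs terabytes. A rough sketch (class name, thresholds, and IPs are illustrative, not tripflag’s actual setup) of a per-IP counter that bans an address after too many fetches of the same path within a time window:

```python
import time
from collections import defaultdict


class DownloadGuard:
    """Ban an IP after it fetches the same path too many times in a window."""

    def __init__(self, limit=50, window=3600.0):
        self.limit = limit            # max fetches of one path per window
        self.window = window          # window length in seconds
        self.hits = defaultdict(list)  # (ip, path) -> request timestamps
        self.banned = set()

    def allow(self, ip, path, now=None):
        """Return True if the request should be served, False if blocked."""
        if ip in self.banned:
            return False
        now = time.time() if now is None else now
        # Keep only timestamps still inside the window, then record this hit.
        stamps = [t for t in self.hits[(ip, path)] if now - t < self.window]
        stamps.append(now)
        self.hits[(ip, path)] = stamps
        if len(stamps) > self.limit:
            self.banned.add(ip)       # the whole IP is cut off from now on
            return False
        return True
```

In practice the `banned` set would feed a firewall rule (or an ASN-wide block) rather than being checked per request in the application.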

  • drkt@lemmy.dbzer0.com · 3 days ago

    I have plenty of spare bandwidth and babysitting-resources so my approach is largely to waste their time. If they poke my honeypot they get poked back and have to escape a tarpit specifically designed to waste their bandwidth above all. It costs me nothing because of my circumstances, but I know it costs them because their connections are metered. I also know it works because they largely stop crawling the domains I employ this on. I am essentially making my domains appear hostile.
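The core of a tarpit like that can be sketched as a generator that drips worthless bytes with a delay between chunks, so a connected scraper burns wall-clock time and metered bandwidth on whitespace. This is a minimal illustration, not drkt’s actual setup; the names and parameters are made up:

```python
import time


def tarpit_response(chunk=b" ", delay=2.0, max_chunks=10_000):
    """Drip-feed meaningless bytes to a client that hit the honeypot.

    A scraper that keeps the connection open spends `delay` seconds
    (and one metered chunk of bandwidth) per iteration, for up to
    `max_chunks` iterations, before the response finally ends.
    """
    for _ in range(max_chunks):
        time.sleep(delay)
        yield chunk
```

Hooked up as a streaming HTTP response body (e.g. a WSGI iterable), this holds the bot’s connection open for hours while sending almost nothing.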

    It does mean that my residential IP ends up on various blocklists but I’m just at a point in my life where I don’t give an unwiped asshole about it. I can’t access your site? I’m not going to your site, then. Fuck you. I’m not even gonna email you about the false-positive.

    It is also fun to keep a log of which IPs that poke the honeypot also have open ports, and to automate siphoning information out of those ports. I’ve been finding a lot of hacked NVRs recently that I think are part of some IoT botnet scraping the internet.
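The “which honeypot-triggering IPs have open ports” log can be built with a plain TCP connect scan. A minimal sketch (the port list and timeout are assumptions; 554 is RTSP, commonly open on NVRs):

```python
import socket


def probe_ports(ip, ports=(80, 443, 554, 8000), timeout=1.0):
    """Return the subset of `ports` accepting TCP connections on `ip`.

    Only probe hosts you are entitled to scan.
    """
    open_ports = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((ip, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass
    return open_ports
```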

    • melroy@kbin.melroy.org · 3 days ago

      I found a very large botnet, mainly in Brazil but also in several other countries, and abuseipdb.com is not marking those IPs as a threat. We need a better solution.

      I think a honeypot is a good way. Another is client-side proof of work. Or we need a better place to share the IPs of all these stupid web-scraping bots.

      • drkt@lemmy.dbzer0.com · 3 days ago

        I wouldn’t even know where to begin, but I also don’t think that what I’m doing is anything special. These NVR IPs are hurling abuse at the whole internet. Anyone listening will have seen them, and anyone paying attention would’ve seen the pattern.

        The NVRs I get the most traffic from have been known hacked IoT devices for a decade, and there’s even a GitHub page explaining how to bypass their authentication and pull out arbitrary files like passwd.

  • F04118F@feddit.nl · 3 days ago

    Interesting approach, but it looks like this ultimately ends up:

    • being a lot of babysitting / manual work
    • blocking a lot of humans
    • not being robust against scrapers

    Anubis seems like a much better option, for those wanting to block bots without relying on Cloudflare:

    https://anubis.techaro.lol/