How to determine which IP addresses are hitting your web site the most

by Ross McKillop on October 7, 2008

This is a brief one. Thanks to The How-To Geek for bringing this command into my troubleshooting repertoire.

A bit of background first. This command helps you determine who is generating the most hits on your web site. On my music blog, I post a fair number of (totally legal) MP3s. Some less-than-ethical people stream those MP3s through their own sites, which slows mine down and inflates my bandwidth bill. Once I figure out the IP address of the site or person “stealing” my bandwidth, I can block that IP from accessing any of my content.

Note: you’ll need shell access to your web server log files.

  1. SSH (or telnet) to your web host. Switch to the directory that stores your web server log files.
  2. Run this command:

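    tail -n 100000 access.log | awk '{print $1}' | sort | uniq -c | sort -n

    where 100000 is the number of lines (counting back from the end of the log file) you want to search, and access.log is the name of your web server access log. The awk step prints the first field of each line, which is typically the client IP address in the common Apache log formats.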

  3. [Screenshot: terminal with the tail, awk, uniq, and sort results]

  4. The result will be a (probably fairly long) list of IPs, sorted from fewest hits to most. The first value in each row is the number of times that IP address (the second value) accessed your site within the number of log lines you specified.

    Use the host command to look up the host name of any IP address that shows up (you’ll probably want to know who the heaviest hitters are). In the screenshot example below, two of the IPs that hit simplehelp.net the most were Googlebot and the Yahoo Site Crawler.

  5. [Screenshot: terminal with tail, awk, uniq, and sort]

  6. If there’s an IP/domain that looks suspicious, you can check which files it was requesting with this command:

    tail -n 1000 access.log | grep -F xx.xx.xx.xx

    In that command, 1000 is the number of lines to check, access.log is the name of your web server access log, and xx.xx.xx.xx is the IP you want to filter by. The -F option makes grep treat the dots in the address literally rather than as wildcards. I’d suggest using a smaller number (1000 vs. the 100000 used in the first command), as you probably don’t need or want to see every file they accessed. If you do, increase the 1000. If very few results show up, they were hitting your site earlier in the log, and you’ll also want to raise 1000 to a higher number.
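
The steps above can be combined into a small script. The following is a minimal sketch, not the author's own tooling: the function names `top_ips` and `hits_for_ip` are made up for illustration, and it assumes the client IP is the first whitespace-separated field of each log line, as in Apache's common and combined log formats. Unlike the command in step 2, it sorts busiest-first and trims the list.

```shell
#!/bin/sh
# Sketch only: top_ips and hits_for_ip are hypothetical helper names.
# Assumes the client IP is the first whitespace-separated field of each line.

# top_ips LOGFILE [LINES] [COUNT]
# Print the COUNT busiest IPs found in the last LINES lines, most hits first.
top_ips() {
    log=$1
    lines=${2:-100000}
    count=${3:-10}
    tail -n "$lines" "$log" | awk '{print $1}' | sort | uniq -c | sort -rn | head -n "$count"
}

# hits_for_ip LOGFILE IP [LINES]
# Print the log entries for IP found in the last LINES lines.
hits_for_ip() {
    log=$1
    ip=$2
    lines=${3:-1000}
    # Compare the first field exactly, so 1.2.3.4 does not also match 11.2.3.40.
    tail -n "$lines" "$log" | awk -v ip="$ip" '$1 == ip'
}
```

For example, `top_ips access.log 100000 5` would list the five busiest addresses, and `hits_for_ip access.log xx.xx.xx.xx 5000` would show what one of them requested. Each address in the list can then be passed to `host` by hand, as in step 4.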

    3 comments

    1 Tony 10.07.08 at 2:58 pm

    The problem is that if someone hotlinks your MP3s (or images, or any other media), it’s still the IPs of their users that will show up in your logs, not the offending web server. That is to say, the hits will likely be spread fairly evenly across users and be indistinguishable from those of your legitimate visitors (unless someone is just continuously refreshing your media content).

    What you want to do instead is check the referrer information for requests to your media files (naturally excluding your own domain from the list).
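
The referrer check described in this comment might look something like the sketch below. It is an illustration built on assumptions rather than anything from the thread: `top_referrers` is a made-up name, the log is assumed to be in Apache's combined format (so the referrer is the second quoted field on each line), and `myblog.example` stands in for your own domain.

```shell
#!/bin/sh
# Sketch only: top_referrers is a hypothetical helper name.
# Assumes Apache combined log format, where splitting a line on double
# quotes gives the request line in field 2 and the referrer in field 4.

# top_referrers LOGFILE OWN_DOMAIN [LINES]
# Count external referrers for MP3 requests in the last LINES lines.
top_referrers() {
    log=$1
    own=$2
    lines=${3:-100000}
    tail -n "$lines" "$log" |
        awk -F'"' -v own="$own" '
            {
                split($2, req, " ")    # req[2] is the requested path
                path = tolower(req[2])
                # Keep MP3 requests whose referrer is present and does not
                # mention our own domain (a crude substring test).
                if (path ~ /\.mp3([?]|$)/ && $4 != "" && $4 != "-" && index($4, own) == 0)
                    print $4
            }' |
        sort | uniq -c | sort -rn
}
```

Running `top_referrers access.log myblog.example` would list the outside pages that most often trigger MP3 downloads. Requests with no referrer at all are skipped here, so a player that strips the header would not show up.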

    2 Ross 10.07.08 at 3:05 pm

    Tony -

    Under normal circumstances, yes, you’re absolutely right. In my specific case, the site was loading the files via a Flash player that they hosted, and all the requests came from that site’s Flash player. Adding the IP to my .htaccess in turn stopped the Flash player from loading the songs for anyone and everyone who tried to play them from the *expletive* site.

    3 miiimooo 11.01.08 at 1:27 pm

    I didn’t understand anything.
