The Berkeley Directory

A little over a year ago I discovered that Berkeley makes available on the Internet (at <http://directory.berkeley.edu>) contact information, including email addresses, for all people associated with Berkeley. This information is available to anyone on the Internet, not just to those inside the UC system or accessing it from outside via a proxy. The email addresses are shown as text rather than graphics and, as I discovered, the system isn't smart enough to monitor traffic and shut off computers that attempt to collect too many addresses. ("Too many" as in "all.") So, if you want a list of 23,864 email addresses to spam - help yourself. Yes, they let you remove your address, but something tells me that the majority of the people who have their addresses on the site don't even know that it exists.

I tried raising those concerns with the administration, but they didn't seem to see a problem. They pointed out that Berkeley provides students with a notice about the directory on page 21 of the printed course schedule. Raise your hand if you've ever seen a _printed_ course schedule at Berkeley.

I then thought that maybe I would get the administration's attention if I actually retrieved the addresses. After writing a Python script and letting it run for a while, I came into possession of 23,864 Berkeley email addresses.

Something like the following but without the hash marks:

    AAB##, ANN# ######@socrates.berkeley.edu
    AAH###, ERI#### #######@berkeley.edu
    AAK##, DAV##### #####@haas.berkeley.edu
    AAK##, JOH############# ######@uclink.berkeley.edu
    AAL###, JES############### ####@uclink.berkeley.edu
    AAL##, ROL# not available
    AAR#, JOH### not available
    AAR##, HOL##### ######@socrates.berkeley.edu
    AAR##, MAR####### not available
    AAR#####, SCO####### ########@cs.berkeley.edu
    AAR##, ASH### not available
    ABA#, CHR######## #####@berkeley.edu
    ABA#, JOS##### not available
    ABA#, RON############ #########@berkeley.edu
    ABA#, STE######### ######@uclink.berkeley.edu
    ABA#, MEB######## not available
    ABA###, IMA## #######@library.berkeley.edu
    ABA###, PAT######### ################@#######.com
    ABA####, REA############### ########@uclink.berkeley.edu
    ABA####, TAW######### ########@berkeley.edu
    ABA#####, KAT######### #######@berkeley.edu
    ABA####, JOY######### not available
    ABA####, VIR##### ################@###.com

It continues like this for another 39,479 lines.

I am not going to include the code here out of concern for those 23,864 inboxes, but the basic idea is simple. You call the following URL once for each two-letter combination, substituting the combination for '%s':

    https://directory.berkeley.edu/cgi-bin/search.cgi?display_type=textonly&search-type=lastfirst&search-base=all&search-term=%s&search.x=14&search.y=12&search=Search

Each query returns a list of names with links to more information. Some queries (e.g., "ad") give you an error: "exceeded the maximum number of results," but it's pretty trivial to overcome this by doing queries for certain three-letter combinations. (You need to search only for those triples "xyz" where both "xy" and "yz" resulted in the error. Those are actually very few.) Thus, it takes a total of about 800 queries to retrieve the full list of people associated with UC Berkeley (about 40,000 people) and then one query per person to get all of their details.

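To make the query-count arithmetic concrete, here is a minimal sketch of just the enumeration step. It makes no network requests. The function names and the sample set of overflowing pairs are hypothetical, chosen only for illustration. The pruning rule presumably works because the search matches substrings: any name containing "xyz" also contains "yz", so if "yz" did not overflow, that name has already been returned.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_terms():
    """All 26 * 26 two-letter search terms, 'aa' through 'zz'."""
    return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


def three_letter_terms(overflowed):
    """Triples 'xyz' worth querying: both 'xy' and 'yz' hit the result cap.

    `overflowed` is the set of two-letter terms that returned the
    "exceeded the maximum number of results" error.
    """
    return sorted(
        xy + z
        for xy in overflowed
        for z in ascii_lowercase
        if xy[1] + z in overflowed
    )


# Hypothetical example: pretend only these three pairs overflowed.
overflowed = {"ad", "da", "an"}
triples = three_letter_terms(overflowed)
total_queries = len(two_letter_terms()) + len(triples)
print(triples, total_queries)
```

The two-letter pass alone accounts for 676 queries, so the "about 800" figure leaves room for roughly a hundred or so follow-up triples.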
Having collected the addresses, I sent an email to University Registrar Ms. Castillo-Robson, who assured me that there really is nothing to worry about. A year later, the directory still functions just like it did last year.

On a lighter note, now that I've got 23,864 addresses, I thought I might as well get some statistics on them. First, here are the most popular domain names for the email addresses:

    berkeley.edu              8600
    uclink.berkeley.edu       7348
    uclink4.berkeley.edu      1806
    socrates.berkeley.edu      920
    haas.berkeley.edu          586
    yahoo.com                  367
    hotmail.com                342
    nature.berkeley.edu        295
    eecs.berkeley.edu          272
    library.berkeley.edu       238
    boalthall.berkeley.edu     187
    lbl.gov                    130
    law.berkeley.edu           129
    math.berkeley.edu          125
    cs.berkeley.edu            121
    aol.com                    114
    me.berkeley.edu            103
    econ.berkeley.edu           94
    ssl.berkeley.edu            91
    ce.berkeley.edu             80
    dev.urel.berkeley.edu       75
    cchem.berkeley.edu          73
    mba.berkeley.edu            65
    unx.berkeley.edu            59
    cp.berkeley.edu             58
    cal.berkeley.edu            57
    uhs.berkeley.edu            56
    stat.berkeley.edu           56
    calmail.berkeley.edu        51
    newton.berkeley.edu         46
    sims.berkeley.edu           44

Or, put differently, with the less common domains grouped by suffix:

    berkeley.edu              8600
    uclink.berkeley.edu       7361
    {etc}.berkeley.edu        2130
    uclink4.berkeley.edu      1823
    haas.berkeley.edu          587
    yahoo.com                  367
    hotmail.com                342
    nature.berkeley.edu        295
    eecs.berkeley.edu          273
    library.berkeley.edu       238
    {etc}.com                  232
    boalthall.berkeley.edu     187
    lbl.gov                    130
    aol.com                    114
    {etc}.net                  112
    {etc}.edu                   88
    {etc}.org                   31
    {something}.{etc}           16
    {etc}.gov                   11
    --------------------------
    total                    23864

Now for the methods of choosing user names. The table shows, for each domain pattern, what percentage of user names fit into specific patterns. (The patterns are illustrated by the hypothetical user names for "john marvin doe": "doe", "jdoe", etc.) If the user name matched only the beginning of the string (e.g. "jlong" instead of "jlonglastname"), I counted it as a match.

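The matching rule just described might look roughly like the following. This is a sketch under stated assumptions, not the script used for the numbers in the table: it covers only the name-based patterns whose meaning is unambiguous, the helper names are made up, and three details are guesses on my part: exact matches are tried before prefix matches, ties go to the earlier pattern in table order, and a prefix must be at least three characters long.

```python
def candidates(first, middle, last):
    """Map each pattern label to the user name it predicts for this person."""
    f, m, l = first.lower(), middle.lower(), last.lower()
    result = {
        "jdoe": f[:1] + l,
        "doe": l,
        "doej": l + f[:1],
        "johnd": f + l[:1],
        "john": f,
    }
    if m:  # middle-initial patterns only make sense with a middle name
        result["jmdoe"] = f[:1] + m[:1] + l
        result["jmd"] = f[:1] + m[:1] + l[:1]
    return result


def classify(username, first, middle, last, min_prefix=3):
    """Return the first matching pattern label, or 'misc' if none fits.

    min_prefix is an assumed cutoff so that very short user names do not
    count as a truncated form of every pattern.
    """
    user = username.lower()
    cands = candidates(first, middle, last)
    # Pass 1: the user name is exactly the predicted string.
    for label, predicted in cands.items():
        if user == predicted:
            return label
    # Pass 2: the user name is a truncated form of the predicted string.
    if len(user) >= min_prefix:
        for label, predicted in cands.items():
            if predicted.startswith(user):
                return label
    return "misc"


print(classify("jdoe", "John", "Marvin", "Doe"))
print(classify("jlong", "John", "", "Longlastname"))
```

Both calls should land in the "jdoe" column, the second one through the prefix rule.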
<table border='1'> <th> <td valign='top'>jdoe</td> <td valign='top'>doe</td> <td valign='top'>doej</td> <td valign='top'>johnd</td> <td valign='top'>john</td> <td valign='top'>john doe</td> <td valign='top'>jmdoe</td> <td valign='top'>jmd</td> <td valign='top'>john .doe</td> <td valign='top'>nums</td> <td valign='top'>_</td> <td valign='top'>misc</td> </th> <tr><td>all</td> <td><font color='black'>24</font></td> <td><font color='gray' size='-1'>9</font></td> <td> </td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>4</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>2</font></td> <td> </td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>5</font></td> <td><font color='black'>23</font></td> </tr> <tr><td>berkeley.edu</td> <td><font color='black'>22</font></td> <td><font color='gray' size='-1'>4</font></td> <td> </td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>7</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td><font color='gray' size='-1'>10</font></td> <td><font color='gray' size='-1'>8</font></td> <td><font color='black'>25</font></td> </tr> <tr><td>uclink.berkeley.edu</td> <td><font color='black'>24</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>7</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td><font color='black'>11</font></td> <td><font color='gray' size='-1'>5</font></td> <td><font color='black'>25</font></td> </tr> <tr><td>{etc}.berkeley.edu</td> <td><font color='black'>25</font></td> <td><font 
color='black'>21</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>4</font></td> <td><font color='black'>14</font></td> <td><font color='gray' size='-1'>2</font></td> <td><font color='gray' size='-1'>4</font></td> <td><font color='gray' size='-1'>8</font></td> <td> </td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='black'>12</font></td> </tr> <tr><td>uclink4.berkeley.edu</td> <td><font color='black'>26</font></td> <td><font color='black'>11</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>8</font></td> <td><font color='gray' size='-1'>2</font></td> <td> </td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='black'>28</font></td> </tr> <tr><td>socrates.berkeley.edu</td> <td><font color='black'>27</font></td> <td><font color='black'>22</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>7</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>8</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td><font color='gray' size='-1'>2</font></td> <td> </td> <td><font color='black'>20</font></td> </tr> <tr><td>haas.berkeley.edu</td> <td><font color='black'>19</font></td> <td><font color='red' size='+1'>68</font></td> <td> </td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>3</font></td> <td> </td> <td><font color='gray' size='-1'>3</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='gray' size='-1'>4</font></td> </tr> <tr><td>yahoo.com</td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td> </td> <td><font color='gray' 
size='-1'>1</font></td> <td> </td> <td><font color='gray' size='-1'>8</font></td> <td><font color='gray' size='-1'>5</font></td> <td> </td> <td> </td> <td><font color='black'>26</font></td> <td><font color='black'>19</font></td> <td><font color='black'>29</font></td> </tr> <tr><td>hotmail.com</td> <td><font color='gray' size='-1'>4</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td><font color='gray' size='-1'>8</font></td> <td><font color='gray' size='-1'>4</font></td> <td> </td> <td> </td> <td><font color='black'>22</font></td> <td><font color='black'>19</font></td> <td><font color='red' size='+1'>35</font></td> </tr> <tr><td>nature.berkeley.edu</td> <td><font color='red' size='+1'>41</font></td> <td><font color='black'>20</font></td> <td> </td> <td><font color='gray' size='-1'>4</font></td> <td><font color='gray' size='-1'>10</font></td> <td> </td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td> </td> <td><font color='black'>12</font></td> </tr> <tr><td>eecs.berkeley.edu</td> <td><font color='black'>17</font></td> <td><font color='black'>26</font></td> <td> </td> <td><font color='gray' size='-1'>7</font></td> <td><font color='black'>18</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='gray' size='-1'>7</font></td> <td><font color='gray' size='-1'>3</font></td> <td> </td> <td> </td> <td> </td> <td><font color='black'>16</font></td> </tr> <tr><td>library.berkeley.edu</td> <td><font color='red' size='+1'>91</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='gray' size='-1'>6</font></td> </tr> <tr><td>{etc}.com</td> <td><font color='black'>19</font></td> <td><font color='gray' 
size='-1'>2</font></td> <td> </td> <td><font color='gray' size='-1'>3</font></td> <td><font color='black'>13</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>3</font></td> <td> </td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>5</font></td> <td><font color='black'>29</font></td> </tr> <tr><td>boalthall.berkeley.edu</td> <td><font color='red' size='+1'>39</font></td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>8</font></td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>5</font></td> <td><font color='gray' size='-1'>2</font></td> <td> </td> <td><font color='gray' size='-1'>4</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='black'>18</font></td> </tr> <tr><td>lbl.gov</td> <td><font color='gray' size='-1'>10</font></td> <td><font color='gray' size='-1'>10</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='red' size='+1'>69</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td> </td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>3</font></td> </tr> <tr><td>aol.com</td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='gray' size='-1'>2</font></td> <td><font color='gray' size='-1'>4</font></td> <td> </td> <td> </td> <td><font color='red' size='+1'>46</font></td> <td> </td> <td><font color='red' size='+1'>43</font></td> </tr> <tr><td>{etc}.net</td> <td><font color='black'>16</font></td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td> </td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>7</font></td> <td> </td> <td><font color='gray' size='-1'>6</font></td> <td><font 
color='black'>15</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='red' size='+1'>33</font></td> </tr>
<tr><td>{etc}.edu</td> <td><font color='black'>22</font></td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>1</font></td> <td> </td> <td><font color='gray' size='-1'>10</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>7</font></td> <td> </td> <td><font color='black'>12</font></td> <td><font color='black'>12</font></td> <td><font color='gray' size='-1'>1</font></td> <td><font color='black'>19</font></td> </tr>
<tr><td>{etc}.org</td> <td><font color='red' size='+1'>32</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td><font color='black'>22</font></td> <td> </td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td><font color='black'>12</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>3</font></td> <td><font color='gray' size='-1'>9</font></td> </tr>
<tr><td>{something}.{etc}</td> <td><font color='black'>18</font></td> <td><font color='gray' size='-1'>6</font></td> <td> </td> <td><font color='gray' size='-1'>6</font></td> <td><font color='black'>12</font></td> <td> </td> <td> </td> <td> </td> <td><font color='gray' size='-1'>6</font></td> <td><font color='gray' size='-1'>6</font></td> <td><font color='black'>12</font></td> <td><font color='red' size='+1'>31</font></td> </tr>
<tr><td>{etc}.gov</td> <td> </td> <td><font color='red' size='+1'>45</font></td> <td> </td> <td> </td> <td> </td> <td> </td> <td><font color='black'>18</font></td> <td><font color='gray' size='-1'>9</font></td> <td><font color='gray' size='-1'>9</font></td> <td> </td> <td> </td> <td><font color='black'>18</font></td> </tr>
</table>

A few random observations:

* "john@\*.berkeley.edu" is more common than "john@berkeley" - that's obvious.
However, "doe@\*.berkeley.edu" is also much more common than "doe@berkeley" - is Berkeley large enough that last-name collisions are common?
* MBAs and .gov people really like their last names. Lawyers, on the other hand, are like the rest of us.
* library.berkeley.edu and lbl.gov must have an explicit policy. What's interesting is that lbl.gov must make exceptions for people without initials and for people without first names. :)