Planet Drupal Association
Drupal booth at Hostingcon, Aug 10-12th, Washington, DC
The Drupal project will have a booth at Hostingcon, August 10-12th, in Washington, DC. This is the second year we will be at Hostingcon, meeting with members of the Drupal community and talking to hosting providers. The booth will be staffed by Eric Mandell, from Blackmesh hosting, members of the DC Drupal community, and Drupal hosting providers.
Hostingcon provides the Drupal community a great opportunity to help educate hosting providers about Drupal's popularity. Exciting web applications like Drupal help drive hosting sales, so providers are eager to learn about what Drupal can do. The event also provides the Drupal community an opportunity to educate system administrators about Drupal hosting best practices such as security, performance tuning, scalability, and high availability.
We will also be doing a special 30-minute presentation in the presentation theater on the expo floor. If you would like to attend that presentation or other Hostingcon sessions, you can get a special discount on early bird registration using the code "Drupal2009". Early bird registration ends tomorrow. You can read about the Keynotes and Sessions here.
If you are interested in helping with the Drupal booth, you can register here.
2009 Association Budget
As the first major budgeting process the Association undertook, it was a bit of a long and mildly painful process. However, we worked out the kinks, set our goals and priorities, and created awesome working teams. Check out the newly published 2009 budget! As always, if you see something interesting and want to get involved, contact us.
Drupal in the .org pavilion at Open Source World
I received word this morning that the Drupal Association has been awarded a booth in the .org pavilion. The Bay Area has a strong Drupal community and we anticipate having a good cross section of support from the local community to help staff the booth. If you are planning to be at Open Source World, or you would like to help staff the booth, please feel free to sign up to staff the Drupal booth.
Drupal.org redesign code sprints: update 3
Ever wanted to see the Media Lab at MIT? An almost mythical place where "...the future is lived, not imagined ... a world where radical technology advances are taken for granted, ..." and technology is designed "... for people to create a better future".
I've visited the Media Lab before and it is highly recommended. This is your chance.
On Friday, June 12th, we're conducting a one-day Drupal.org redesign sprint at the MIT Media Lab. The sprint is meant to bring home the kicking redesign of Drupal.org created by Mark Boulton and Leisa Reichelt. All Drupal designers and developers who would like to contribute to theming our new home at drupal.org are welcome, invited, and encouraged to attend. Attendance is limited, so you must sign up in advance.
Even if you're "just" a newbie, we'll make sure to have you ready for attendance by activating your infrastructure.drupal.org account, by creating your own hosted copy of drupal.org for your localized testing, and by training you in working within the redesign theme issue queue.
If you are interested in helping -- and I hope you are -- please sign up to schedule your training by visiting http://groups.drupal.org/node/22036. Thanks!
Getroffene Hunde bellen
The title is a German proverb which translates as "hit dogs bark" and means that people will react once they feel sufficiently threatened.
The proverb apparently applies to Phorm, a UK company that wants to use Deep Packet Inspection to serve ads. They have been criticized for their plan, and I also wrote them a letter asking them to exempt drupal.org from it (apart from an auto-reply, no answer was received).
This company has recently launched a website that complains loudly about alleged foul play against them.
You can read why DPI & Phorm are evil at Privacy International.
More resources can be found at No DPI.
Apparently the activism against DPI and Phorm's use of it has reached a level that compelled Phorm to launch their own website to counter it. They seem to fear for the success of their business model.
I guess everybody should try to understand both sides' arguments, but for me the use of DPI makes it quite clear that Phorm is not an acceptable solution for serving ads. I don't allow advertisers to look into my letters to decide which ads to put into my letterbox either.
Should the use of DPI spread, it will be necessary to encrypt more traffic in order to protect one's privacy.
Turnover in the Drupal association board and general assembly
A few weeks ago a community member raised concern on the consulting list about turnover in the Drupal association, or rather the lack of it. Since this information is often public, but hard to track down, I thought I'd shed some light.
Ten people have rotated on and off the board of the Drupal association. Only four original board members are still present.
As of 2009, the board has two new members who were not previously permanent members of the association: Tiffany Farriss and Cary Gordon.
The association has 30 permanent members, up from 15 elected permanent members in its first year.
- 2006 - Zack, Moshe, Boris left the board and are still General assembly members. Steven Wittens and Dries Knapen were on the original board and have since resigned from the Drupal association.
- 2007 - Jeff Eaton, Bert Boerland were on the second board but are now members of the General Assembly.
Current directors from the original board: Dries has been re-elected as president. Gerhard has remained manager of infrastructure. Kieran has changed from fund raiser to business development. Angela has changed from director to Secretary. Four of the original nine directors are still on the board.
To learn more about the members of the Drupal association, see our staff page.
Solr And Varnish
I've recently been setting up a new Apache Solr search cluster for drupal.org. Hopefully this is going to be the "final" solution to our search issues. During this project, I noticed that Solr can now set headers for reverse caching proxies. This is really cool! I went off to test this and set up Varnish to proxy my Solr connections. First, I set up Solr to send Cache-Control headers with a 60-second timeout for all requests. I may tweak this later, but this is fine for testing. (Note: Solr can also send Last-Modified headers and answer If-Modified-Since requests.) I then set up Varnish with a very simple configuration.
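To make the 60-second timeout concrete, here is a minimal sketch of the freshness check a reverse proxy performs on a cached response. It is not Varnish's or Solr's actual code; the function names are hypothetical and only the max-age directive is handled.

```python
import re

def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    match = re.search(r"(?:^|,)\s*max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None

def is_fresh(cache_control, age_seconds):
    """A cached response is fresh while its age is below max-age."""
    max_age = max_age_seconds(cache_control)
    return max_age is not None and age_seconds < max_age

# Hypothetical header, matching the 60-second timeout described above.
header = "max-age=60, public"
print(is_fresh(header, 30))  # served from the cache
print(is_fresh(header, 61))  # must be fetched from the backend again
```

With a short lifetime like this, repeated identical searches within a minute never reach Solr, while results are at most a minute stale.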
20 Percent Time And Infrastructure Admins
I'm a pretty big fan of Google's 20% time. To those who don't know, this is a policy where employees are allowed to work on "their own" project for 20% of the time. For some managers, this would seem terrifying or just stupid. However, in reality it allows your employees to work on what they are passionate about and, shocking I know, sometimes they have good ideas and their ideas make you money. At the very least they gain experience. I'd much rather have an employee learning new things on my dime than stagnating on "their own time."
MySQL 5.1 Refresh
I've been following MySQL 5.1 for a while now. It has many interesting/exciting features and improvements, but I have had a long-standing view that it was released too early and is not production ready. With the partial release/sneak-beta of MySQL 5.4, a performance-centric release based on 5.1, I've decided to take a look at the bugs I've been following in 5.1 and re-evaluate its production readiness.
Google to invest 90,000 USD in Drupal
The summer is off to a great start again. Google just announced that they will sponsor 18 Drupal developer stipends in this year's Summer of Code program (SoC). Google provides a stipend of 5,000 USD to each student developer, of which 4,500 USD goes to the student and 500 USD goes to the Drupal Association (or to the mentors). With 18 accepted applications this adds up to a 90,000 USD investment over a three-month period. In addition to Drupal, they are supporting a ton of other Open Source projects, including PHP, which Drupal heavily depends on.
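The arithmetic behind the headline figure can be checked with a few lines. This is only a sketch using the numbers stated above; the variable names are made up for illustration.

```python
STUDENTS = 18
STIPEND_USD = 5000        # per accepted student
STUDENT_SHARE_USD = 4500  # part of the stipend that goes to the student
ORG_SHARE_USD = 500       # part that goes to the Drupal Association (or mentors)

total = STUDENTS * STIPEND_USD
to_students = STUDENTS * STUDENT_SHARE_USD
to_org = STUDENTS * ORG_SHARE_USD

print(total, to_students, to_org)
```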
The accepted students, their projects, and the mentors are listed on the official Drupal.org announcement. Congratulations to all successful applicants, and thanks to the Drupal Summer of Code organizers, the Drupal mentors, and last but not least, Google. Awesome!
Speedy google bot
Googlebot is a frequent visitor(*) of drupal.org, eating over 50GB of traffic in over 11 million requests in March. I didn't know that the postprocessing is also quite fast.
Today, one of our helpful users reported some spam which was created by a new user. The user had been a member for 2 hours and the spam itself was just over 1 hour old before I deleted it.
After I deleted it, I thought that it would have been entertaining to show it to somebody (the subject of the spam was hardware for your shower) and I decided to look in Google. I didn't really expect it to be there, but there it was: "found one hour ago". That means that googlebot picked it up within the first few minutes after its creation and it was in the search results shortly afterwards. Considering the huge number of websites that googlebot looks at, I find this pretty impressive.
(*) It is actually the type of visitor that arrives and never leaves...
Spammers read my blog
Maybe they don't but they have realized that their spam profiles on drupal.org are too short-lived to get them much traffic. As a result of this, the number of new spam profiles seems to be down.
As a side note: In part due to the spam profiles and the Google traffic that they generated, drupal.org served more than 30 million pages in March. This is an increase of about 38% compared to February with 22 million viewed pages.
The bigger part of this surge can probably be attributed to DrupalCon at the beginning of March.
Here's a table of the number of 403 pages by month:
Month                Count
January              66000
February             65000
March                110000
April (until 19th)   1.3 million
Google's webmaster tools also tell me that "naked girls" is currently the 19th most popular term connected with searches and drupal.org (about 1% of results). This should change again once Google has dropped the blocked profile pages from its index. Unfortunately, there is no feasible way to unlist all the pages in their index, as you can only enter URLs one by one in the appropriate form.
Goodbye Phorm
After Amazon and Wikimedia, the Drupal association has decided to opt out of the Phorm web traffic snooping scheme. It is quite scandalous that one has to opt out instead of having the option to opt in, but we did it anyway. We got the same auto-reply that Wikimedia got; let's see if we get any more detailed response later.
I hope the European commission enquiry into the UK's privacy laws will lead to this idea being put where it belongs: the rubbish pile of failed ideas.
Why spam works
I've recently been looking into the spam that hits drupal.org, and yesterday I finally found out why spammers do it and that it actually works. Until I block the accounts, at least.
A blocked account will give any visitor a "403 access denied" message. Drupal logs these incidents. It also logs the referer of these requests, so I am able to see which page the visitor was looking at when he clicked on the link to the blocked account. Most of these pages are search results from Google and other search engines. And of course the visitor was looking for porn of all different flavours.
This is not really something new. What really surprised me was the good google ranking that the drupal.org links had. Even for relatively unspecific two word search phrases we often ranked on the second or third page of search results. For more specific requests we often rank among the top 5 or even right at the top.
This is why spamming drupal.org makes sense for spammers: Our high page rank enables them to target their audience rather efficiently. And since googlebot loves drupal.org (http://association.drupal.org/node/332) their links show up in the search results in no time.
Now that we have looked at the business motivation of the spammers, let's look at the porn seekers.
I've taken a snapshot of our watchdog table. It contains almost 120000 "access denied errors" for user pages where there is a referer and the referer is not from drupal.org itself.
Visual inspection shows me that indeed most of these are from search engines and the search terms are of a sexual nature.
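The visual inspection could be partly automated. The following is a minimal sketch, not what was actually run on drupal.org: it assumes the search phrase sits in a `q` query parameter of the referer (true for Google, not for every engine), and the function name and sample URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def search_terms(referer, own_host="drupal.org"):
    """Return the search phrase from an external referer URL, or None."""
    parsed = urlparse(referer)
    host = parsed.hostname or ""
    # Skip referers from drupal.org itself, as in the snapshot described above.
    if host == own_host or host.endswith("." + own_host):
        return None
    terms = parse_qs(parsed.query).get("q")
    return terms[0] if terms else None

# Hypothetical referers for illustration.
print(search_terms("http://www.google.com/search?q=shower+hardware&hl=en"))
print(search_terms("http://drupal.org/search?q=views"))
```

Counting the returned phrases (for example with collections.Counter) would give the most common search terms without reading the log by eye.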
The snapshot covers the time of 13.5 hours yesterday (10:15 to 23:45 UTC). That means we have almost 9000 requests for porn on drupal.org per hour which remain unsatisfied.
The requests come from 87000 different IPs so we can conclude that most people don't fall for the same trick twice.
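As a quick check of the two figures above, here is the arithmetic written out. The inputs are the rounded numbers from the text, so the results are approximations.

```python
denied_requests = 120_000  # "almost 120000" access denied entries
hours = 13.5               # 10:15 to 23:45 UTC
unique_ips = 87_000

per_hour = denied_requests / hours
per_ip = denied_requests / unique_ips

print(round(per_hour))   # requests per hour
print(round(per_ip, 2))  # average requests per IP address
```

An average of well under two requests per address is what supports the conclusion that most visitors do not come back a second time.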
The geographical distribution of the IPs is as follows:
# of 403s   Country
30181       United States
9246        United Kingdom
7776        Germany
7520        India
4569        Turkey
4416        Canada
4119        France
3835        Italy
2207        Norway
2082        Netherlands
1941        Poland
1853        Pakistan
1771        Australia
1767        Indonesia
1649        Brazil
1624        Spain
1467        Greece
1416        South Africa
1292        China
1171        Egypt
1154        Saudi Arabia
1042        Iran
1000        Romania
(Showing only countries with 1000 or more entries)
The geographical distribution has been calculated using the data from maxmind after importing it into MySQL using this handy How-To.
The results are probably somewhat skewed due to not taking a full 24 hours into account.
One final note: A friend who works for a security firm has informed me that the spammers' business is not only porn. They are more interested in infecting the computers of the porn seekers with malware. So the porn seekers should be glad we blocked these accounts.
Spammer update
Last week I blogged about the spammers on drupal.org and how we remove their accounts. This week I've again looked at the newly created accounts and also added some other domains to the access rules (mainly aliases of mailinator.com).
There is one new player on the mail provider list. Apparently somebody created a domain to use for mail in order to be able to register at sites like drupal.org. And that they did: they created almost 500 accounts on d.o during the last week. They are of course all blocked now.
Strangely enough, the domain is registered to an Algerian company. The domain is hosted in the Netherlands, the spamvertized site is hosted in Moldova, and the registrant of that hides behind an anonymizer service. The only other use of the original spammer domain points to another domain that is suspended.
I've contacted the registrar of the domain and will see if I can get it suspended too.
Update: the spamvertized site is only a redirect to a porn site owned by a Russian and hosted in the US. A truly international cooperation.
Googlebot likes Drupal 6
It is now several weeks after the upgrade of drupal.org to Drupal 6 and I've taken a look at google's crawling statistics for drupal.org.
This is the most interesting graph for me as infrastructure manager: it shows the average time that googlebot needs to download an HTML page from drupal.org. We apparently had a bit of a rough ride in January, but recently this has smoothed out. About 600ms per page seems quite a good value to me.
This second graph shows the bandwidth consumed by googlebot in kB per day. Googlebot consumes about 80GB per month.
The third graph shows the number of crawled pages per day, up to 500k. It is interesting to note that although the number of pages increased considerably after the upgrade, the amount of data transferred did not. I don't think that Drupal 6 has a smaller HTML footprint than Drupal 5. I attribute this to the fact that we removed the long "all modules and themes" pages (for unrelated reasons).
Attachments: chart.png (10.51 KB), chart_002.png (13.19 KB), chart_003.png (12.87 KB)
Spammers on drupal.org
So, after I claimed we'd have fewer spammers than others, I wanted to find out how many spammers we've actually had.
mysql> select EXTRACT(YEAR_MONTH FROM from_unixtime(created)) as yearmonth, count(*) as count from users where status = 0 and login != 0 group by yearmonth order by yearmonth desc ;
Year/Month   # of spammers
2009 / 04    820
2009 / 03    710
2009 / 02    1101
2009 / 01    371
2008 / 12    171
2008 / 11    145
2008 / 10    136
2008 / 09    268
2008 / 08    486
2008 / 07    639
2008 / 06    145
2008 / 05    132
2008 / 04    149
2008 / 03    206
2008 / 02    167
2008 / 01    105
2007 / 12    85
2007 / 11    66
2007 / 10    96
2007 / 09    79
2007 / 08    112
2007 / 07    206
2007 / 06    136
2007 / 05    116
2007 / 04    98
2007 / 03    78
2007 / 02    64
2007 / 01    81
2006 / 12    46
2006 / 11    59
2006 / 10    67
2006 / 09    31
2006 / 08    34
2006 / 07    29
2006 / 06    28
2006 / 05    25
2006 / 04    17
2006 / 03    18
2006 / 02    15
2006 / 01    17
There are two interesting observations to be made:
1) Spammers are most active on drupal.org in summer
2) Spamming on drupal.org is on the rise.
Especially the latter point is of concern.
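Both observations can be checked against the monthly counts listed above. The sketch below copies those counts by hand and computes a total and a peak month per year; the variable names are made up, and 2009 only covers January to April.

```python
# Monthly counts of blocked accounts, January first, copied from the table above.
MONTHLY = {
    2006: [17, 15, 18, 17, 25, 28, 29, 34, 31, 67, 59, 46],
    2007: [81, 64, 78, 98, 116, 136, 206, 112, 79, 96, 66, 85],
    2008: [105, 167, 206, 149, 132, 145, 639, 486, 268, 136, 145, 171],
    2009: [371, 1101, 710, 820],  # January to April only
}

# Total per year: shows whether spamming is on the rise.
yearly_totals = {year: sum(counts) for year, counts in MONTHLY.items()}

# Month (1-12) with the highest count in each year: shows the seasonal peak.
peak_month = {year: counts.index(max(counts)) + 1 for year, counts in MONTHLY.items()}

for year in sorted(MONTHLY):
    print(year, yearly_totals[year], peak_month[year])
```

The first four months of 2009 already exceed the whole of 2008, which backs up the second observation. The July peaks in 2007 and 2008 fit the first one, although the 2006 peak falls in October.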
I know that many readers will think "Why don't they simply deploy mollom on drupal.org?".
Unfortunately, this is not (yet) the answer. Most of these spammers do not create a single post on drupal.org; they merely use the high page rank of the user profiles to redirect the gullible to other sites. Mollom currently does not deal with user profiles at all. I guess that it would be interesting to use it for this purpose once support becomes available. However, I am not sure that the redesign of drupal.org will still use profile.module, so we could maybe use standard node form support afterwards.
For the interested readers, here's the SQL query I use to find profile spammers:
select u.name, u.mail, concat('http://drupal.org/user/', u.uid, '/edit'), substr(p.value, 1, 60) from users u inner join profile_values p on u.uid = p.uid where u.uid > 430000 and p.value != '' and length(p.value) > 50 and u.status != 0;
Once a common pattern has been identified, you can confirm it with e.g.
select u.name, u.mail, concat('http://drupal.org/user/', u.uid, '/edit'), substr(p.value, 1, 60) from users u inner join profile_values p on u.uid = p.uid where u.uid > 430000 and p.value != '' and length(p.value) > 50 and u.status != 0 and p.value like '%spam%';
And then the grand finale:
update users set status = 0 where uid in (select p.uid from profile_values p where p.uid > 430000 and p.value != '' and length(p.value) > 50 and p.value like '%spam%');
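The find, confirm, and block workflow can be tried safely on a throwaway database before touching real data. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with a reduced schema and made-up sample rows; the function names and the pattern are hypothetical.

```python
import sqlite3

def find_profile_spammers(conn, pattern, min_uid=0, min_length=50):
    """List active users whose long profile values match the LIKE pattern."""
    return conn.execute(
        "SELECT u.uid, u.name, substr(p.value, 1, 60)"
        " FROM users u INNER JOIN profile_values p ON u.uid = p.uid"
        " WHERE u.uid > ? AND length(p.value) > ? AND u.status != 0"
        " AND p.value LIKE ?",
        (min_uid, min_length, pattern),
    ).fetchall()

def block_profile_spammers(conn, pattern, min_uid=0, min_length=50):
    """Set status = 0 for every matching user; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE users SET status = 0 WHERE uid IN ("
        " SELECT p.uid FROM profile_values p"
        " WHERE p.uid > ? AND length(p.value) > ? AND p.value LIKE ?)",
        (min_uid, min_length, pattern),
    )
    conn.commit()
    return cur.rowcount

# Throwaway in-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, mail TEXT, status INTEGER);
CREATE TABLE profile_values (uid INTEGER, value TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, 1)", [
    (1, "alice", "alice@example.com"),
    (2, "spambot", "bot@example.com"),
    (3, "bob", "bob@example.com"),
])
conn.executemany("INSERT INTO profile_values VALUES (?, ?)", [
    (1, "I build Drupal sites."),
    (2, "Cheap shower hardware, click here: http://spam.example/shower-hardware-deals-today"),
    (3, "I maintain several contributed modules and help out in the issue queues."),
])

matches = find_profile_spammers(conn, "%spam.example%")  # confirm the pattern first
blocked = block_profile_spammers(conn, "%spam.example%")  # then the grand finale
print(matches, blocked)
```

Running the select with the same pattern first is the important habit: the update has no undo, so the list of matches should be read before anything is blocked.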
Spammers by mailprovider
On drupal.org we have far fewer spammers than other websites. One reason is that we do not allow anonymous users to post anything and that every user needs a valid mail address in order to use his account.
This raises the question: Which email providers do our spammers use?
Luckily, this is rather easy to answer:
mysql> select substring_index(substring_index(init, '@', -1), '.', 1) as provider, count(substring_index(substring_index(init, '@', -1), '.', 1)) as count from users where status = 0 and login != 0 group by provider order by count;
This results in the following top ten:
- Gmail: 3314 (1)
- Yahoo: 896 (1)
- Hotmail: 475
- Spam.la: 199
- Nospamfor.us: 160
- Mytempemail: 72
- Yandex.ru: 52
- Mailinator.com: 47
- Mail.com: 41 (1)
- Suche-project.eu: 39
All domains in this list from 4th rank on are banned from creating any more accounts.
Correction: Yandex.ru and mail.* are not blocked.
(1) Includes relevant subdomains, international domains, and aliases.
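For readers who want to see what the nested substring_index() calls in the provider query do, here is a rough Python equivalent. It is a sketch only: the addresses are made up, and the lowercasing is an assumption that mimics a case-insensitive MySQL collation.

```python
from collections import Counter

def provider(address):
    """Mimic substring_index(substring_index(init, '@', -1), '.', 1)."""
    domain = address.rsplit("@", 1)[-1]    # everything after the last '@'
    return domain.split(".", 1)[0].lower() # everything before the first '.'

# Hypothetical sample of initial mail addresses from blocked accounts.
addresses = [
    "a@gmail.com",
    "b@GMAIL.com",
    "c@yahoo.co.uk",
    "d@mail.yahoo.com",
    "e@mailinator.com",
]

counts = Counter(provider(a) for a in addresses)
print(counts.most_common())
```

Because only the first label of the domain is kept, an address at a subdomain such as mail.yahoo.com is counted under "mail" rather than "yahoo", so counts for providers with subdomains and aliases have to be merged by hand.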
Love Me Some XtraBackup
So, long ago in a galaxy far far away there was a table engine called InnoDB. It was transactional, it was fast...it was briefly not owned by Oracle. Life was pretty awesome. However....backup sucked. Big time.... To do a consistent point in time backup, we needed to do a single transactional dump of the entire database. This means that we need to hold a short-lived lock at the beginning of the dump (which can be problematic) and that it takes _forever_ to dump and restore.
DCDC 2009: Infrastructure Presentation Slides
At Drupalcon DC this year, Kieran and I gave a talk on the current state of drupal.org infrastructure. David Strauss and Derek Wright ended up joining us on stage. It was an interesting chance to get a large part of the drupal.org infra team all on stage together to answer questions.
The video is available here: http://www.archive.org/details/DrupalconDc2009-Drupal.orgInfrastructureS...
The slides are available here: http://nnewton.org/sites/default/files/Infrastructure_team_presentation.ppt