Phase 1

Introduction

The first phase of the project is divided into five steps to help you organize your work. Steps 1 and 2 are purely administrative; the actual project work starts in step 3.

Step 1: Formation of teams

You will normally work in teams of three people. If the number of students is not divisible by three, we will allow a maximum of two teams of two people.

You can choose your teams yourselves. Remember that this freedom comes with responsibility: select your partners with care. Once the teams are chosen, it won't be possible to change their composition. All members of a team will receive the same grade.

Please use the following form for team enrolment:

Team enrolment [only from within EPFL, use VPN from outside]

If you can't find your name in the list, please send an e-mail to Martin containing your team composition.

Deadline for team enrolment: Monday, 2005-03-14, 2400

Step 2: Formation of consortia

Once the teams are chosen each team has to pick a web site as their data source. You will also have to form consortia of three teams. Each of these three teams must pick a different category. In phase 2 you will exchange the data gathered between the teams of a consortium and we want to ensure that you can merge data from three different sources.

Below is a list of categories and web sites that we think present a good choice for the project (the order is purely alphabetical). If you know any other web sites that you want to use, feel free to do so, as long as they fit into one of the three categories. In such a case please let us know so we can add your site to the list below.

Category 1: Music databases

Category 2: e-Shops

Category 3: Digital music stores

While picking a web site you should keep in mind certain criteria:

You must pick a category by Friday of the second week so you can form the consortia. The list on which you can register your consortium will be available during the exercise session on Friday, 18 March, between 1015 and 1200. Note that you do not have to tell us the web site you have chosen at that time. We do, however, advise you to pick a site before Friday the 18th, because this makes it easier to form the consortia.

Deadline for consortium inscription: Friday, 2005-03-18, 1200 (INF1, INF3)

Step 3: Database design

This is the first step where you get in touch with the actual database. You will start by designing an entity/relationship (E/R) model of your database. To do this successfully you have to decide first what kind of information you want to collect from the web site you have chosen.

Try to include as many interesting details from your web site as possible, e.g. genre, labels, release date, etc. You should especially emphasize the features that are unique to your web site category, because these are what make the merging process in phase 2 interesting.

Step 4: Database implementation

Once you have the E/R model you can start translating it into a relational model, i.e. the one you're going to use for the rest of the project. We expect two representations of the relational model: a graphical schema and the SQL script that you are using with your database server.
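As a rough illustration of this translation, the sketch below keeps the SQL for two hypothetical tables in a small Java class. The table and column names (artist, album, release_date, ...) are our assumptions for the example, not a required schema. It shows how a many-to-one relationship from the E/R model typically becomes a foreign key column. The class can print the script or run it over an already opened JDBC connection.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

// Sketch only: table and column names are hypothetical examples.
class SchemaSketch {

    // One statement per table; album.artist_id implements the
    // many-to-one relationship "album is recorded by artist".
    static final List<String> DDL = Arrays.asList(
        "CREATE TABLE artist ("
            + "artist_id SERIAL PRIMARY KEY, "
            + "name VARCHAR(200) NOT NULL)",
        "CREATE TABLE album ("
            + "album_id SERIAL PRIMARY KEY, "
            + "title VARCHAR(200) NOT NULL, "
            + "release_date DATE, "
            + "artist_id INTEGER NOT NULL REFERENCES artist(artist_id))"
    );

    // Returns the statements as one script, each terminated by ";".
    static String script() {
        StringBuilder sb = new StringBuilder();
        for (String statement : DDL) {
            sb.append(statement).append(";\n");
        }
        return sb.toString();
    }

    // Executes the statements in order on an already opened connection.
    static void apply(Connection connection) throws SQLException {
        try (Statement st = connection.createStatement()) {
            for (String statement : DDL) {
                st.executeUpdate(statement);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(script());
    }
}
```

The referenced table (artist) is created first so that the foreign key in album can be resolved.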

You should read the Postgres How-to now!

A few tips:

Step 5: Web crawler

The goal of this step is to populate the database that you have developed in the previous step. To do this you will develop a web crawler ("bot"), i.e. a program that downloads data from the web site you have picked, converts it into a usable form, where necessary, and stores it in your database.

We suggest that you write the bot in Java, although you are entirely free to choose your favorite language. The pillars of your bot are HTTP to download the web pages, regular expressions to extract interesting data and links, and JDBC to interact with your Postgres database. For all of these techniques there are simple examples available that demonstrate possible implementations in Java.
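To give an idea of how these pieces can fit together, here is a minimal sketch of a crawl loop. It is only one possible structure, and all names in it (CrawlLoop, crawl, linksOf, maxPages) are made up for this example. The download and link extraction are passed in as a function, so the loop itself only has to manage a queue of pages to visit and a set of pages already seen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Sketch only: the class and method names are hypothetical.
class CrawlLoop {

    // Visits pages breadth-first starting at startUrl and returns the
    // URLs in the order they were visited. linksOf maps a URL to the
    // links found on that page; maxPages bounds the crawl.
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linksOf,
                              int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            if (!visited.add(url)) {
                continue; // already processed
            }
            for (String link : linksOf.apply(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // A tiny in-memory "web site" stands in for real HTTP requests.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/b", "/"));
        site.put("/b", Arrays.asList("/c"));
        List<String> order = crawl("/",
            url -> site.getOrDefault(url, Collections.<String>emptyList()), 10);
        System.out.println(order);
    }
}
```

In a real bot, linksOf would download the page, store the interesting data in the database and return the links worth following. The page limit is a simple safeguard against crawling far more of a site than you need.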

WebCrawler.java

Despite its name, this sample program is nowhere near a fully functional web crawler, but it shows you how to fetch data via HTTP and how you can extract certain information (in this case: hyperlinks) from the pages downloaded.

The program demonstrates two slightly different approaches. The first one (simpleExtractData) simply downloads a web page without any bells and whistles, while the second one (complexExtractData) uses a slightly more interesting technique (namely, faking the user agent) to request a simple page from Google. The program has to fake the user agent string because Google tries to detect usage that does not conform to its policies (in particular so-called meta searching). By changing the User-Agent field of the HTTP request, we lead Google to think that we are a regular browser and get the desired result page.
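The following sketch shows how a User-Agent header can be set with the standard HttpURLConnection class. It is not the code of WebCrawler.java; the class name, the user agent string, the timeouts and the placeholder address are all assumptions made for this example, and the page is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Sketch only: this is not the course's WebCrawler.java.
class UserAgentFetch {

    // Hypothetical browser-like identification string.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot)";

    // Prepares a GET request with a custom User-Agent header.
    // No network traffic happens until the response is read.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        return conn;
    }

    // Downloads the page body as text, assuming UTF-8 content.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = open(url);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass the page of your chosen site instead.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(fetch(url));
    }
}
```

Keeping the request setup in its own method (open) makes it easy to add further headers later without touching the download code.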

Both methods then use different regular expressions to extract certain data. You should play around a little with different regular expressions to get a feeling for how they work.
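If you want a starting point for your own experiments, the sketch below uses java.util.regex to pull hyperlinks and the page title out of an HTML string. The two patterns are deliberately simplified and are our own, not the ones used in WebCrawler.java; for instance, they only handle double-quoted href values. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: patterns are simplified and not taken from WebCrawler.java.
class RegexExtract {

    // Captures the value of a double-quoted href attribute in an <a> tag.
    static final Pattern HREF = Pattern.compile(
        "<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Captures the text between <title> and </title>, across line breaks.
    static final Pattern TITLE = Pattern.compile(
        "<title>\\s*(.*?)\\s*</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns all link targets in document order.
    static List<String> links(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Returns the page title, or null if the page has none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Demo Shop </title></head><body>"
            + "<a href=\"/cd/1\">One</a> "
            + "<A class=\"x\" HREF=\"/cd/2\">Two</A></body></html>";
        System.out.println(title(html));
        System.out.println(links(html));
    }
}
```

Note the reluctant quantifier (.*?) in the title pattern: a greedy .* would run to the last closing tag in the input instead of the first one.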

HelloPostgres.java

This example demonstrates in a few simple lines of code how you can connect to your Postgres database from Java using JDBC.

The program above uses the actors database created in the Postgres how-to. Don't forget to add the PostgreSQL JDBC driver to the classpath to be able to run this example. More information on how to do this can be found in the corresponding how-to section.
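For storing crawled data you will mostly need INSERT statements. The sketch below shows one way to do this with a PreparedStatement. It is not HelloPostgres.java: the table cd, its columns, and the connection data (host, database name, user, password) are placeholders invented for this example, and the program only runs with the PostgreSQL JDBC driver on the classpath and a matching table in your database.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only: the table "cd" and its columns are hypothetical and are
// not the actors database from the how-to.
class JdbcInsertSketch {

    static final String INSERT_SQL =
        "INSERT INTO cd (title, artist, price) VALUES (?, ?, ?)";

    // Builds a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/db.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    // Inserts one record; the placeholders keep quotes in titles harmless.
    static void insertCd(Connection conn, String title, String artist,
                         BigDecimal price) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, title);
            ps.setString(2, artist);
            ps.setBigDecimal(3, price);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder connection data; replace with your own account.
        String url = jdbcUrl("localhost", 5432, "mydb");
        try (Connection conn =
                 DriverManager.getConnection(url, "user", "password")) {
            insertCd(conn, "Rock 'n' Roll", "Some Artist",
                     new BigDecimal("19.90"));
        }
    }
}
```

Using placeholders instead of building the SQL string by hand matters for crawled data, because titles and names from web pages frequently contain apostrophes.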

To make the second phase of the project interesting, your web crawler must collect at least 200 records that are as complete as possible (counted in the main table). A missing value once in a while is natural and therefore not a problem, but you must make sure that the main features of each record (e.g. the CD title) are present. If you feel that there is no way you can collect this much data, you should come and talk to us. It should not be a problem, however; after all, these sites provide thousands of items.
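A simple way to keep an eye on this requirement is to count how many of your records have all main features filled in. The sketch below does this for records held in memory as maps. The representation and the field names (title, artist) are assumptions for the example; which fields count as main features depends on your own schema.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: field names and the in-memory representation are hypothetical.
class RecordCheck {

    static final int MIN_RECORDS = 200; // minimum stated for phase 1

    // A record is complete if every required field is present and non-blank.
    static boolean isComplete(Map<String, String> record,
                              List<String> required) {
        for (String field : required) {
            String value = record.get(field);
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Counts the records that have all required fields.
    static int countComplete(List<Map<String, String>> records,
                             List<String> required) {
        int count = 0;
        for (Map<String, String> record : records) {
            if (isComplete(record, required)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> ok = new HashMap<>();
        ok.put("title", "Some Album");
        ok.put("artist", "Some Artist");
        Map<String, String> missingTitle = new HashMap<>();
        missingTitle.put("artist", "Another Artist");

        int complete = countComplete(Arrays.asList(ok, missingTitle),
                                     Arrays.asList("title", "artist"));
        System.out.println(complete + " complete, need " + MIN_RECORDS);
    }
}
```

The same check can of course be done directly in the database with a query on the main table once the data has been stored.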

Report

At the end of phase 1 you must hand in a report in the form of a web page. This web page must be included in the archive you hand in. It must also be accessible to everyone shortly after the deadline, because the other teams in your consortium need it for phase 2. Remember not to make it public too soon, to prevent others from copying the contents. If you do not want to make everything public (e.g. your database), you are responsible for proper communication with the rest of your consortium.

If you don't know how to publish your report web site you can find more information in the corresponding how-to.

Format

The report web pages must be valid XHTML according to the XHTML 1.0 standard or higher. The choice of a DTD (Strict, Transitional, Frameset) is up to you. Moreover your web pages must be reasonably viewable in browsers that follow the current standards (that obviously does not include Internet Explorer :-).

Please put all your files in an archive (ZIP, RAR, TAR, ...) called groupXX.{zip,rar,tar.gz,...} before you send them to us. (If you don't know how to do this, try something like tar cfz group87.tar.gz phase1/.)

Content

Your report must document all relevant steps of phase 1 in reasonable detail. You should explain your decisions and highlight the important features of your work. Don't hesitate to mention outstanding features; we will honor additional effort. If you have encountered any particular problems during the process, you should also mention them. Note that we will take into account the fact that some web sites may turn out to be a little more difficult to crawl than others.

You must include the complete source files of your web crawler including links to any external libraries that you may have used. The web crawler must compile, so you might want to double check this before you hand the files in. Also don't forget the database schemas and the complete set of SQL scripts (i.e. schema and data) needed to recreate the database as well as the E/R model.

Shortly after the deadline (Wednesday latest) your report must be available online, so be sure to include the link (it doesn't have to work yet ...) in the report or e-mail you send us!

To be prepared for phase 2, please also include the name of the currency that you use for the prices of your articles.

You should include either a script that compiles your web crawler or at least the corresponding command.

Particularities

While this is not really part of the report, we do require that you change the login password of your account. Not only does this stop other people from stealing or disturbing your work, it should also become a habit for the future. Failure to do so will lead to a deduction from your grade.

Deadline for report hand-in: Friday, 2005-04-22, 2400 (hand-in by e-mail to Martin)

Deadline for database exchange: Wednesday, 2005-04-27, 2400 (see phase 2 for details)

Copyright © Martin Rubli & Patrik Bless – Last change:
This page uses valid XHTML 1.0 Strict and valid Cascading Style Sheets, Level 2.