Phase 1
Introduction
The first phase of the project is divided into five steps to help you organize your work. Steps 1 and 2 are purely administrative; the actual project work starts in step 3.
Step 1: Formation of teams
As a rule, you will work in teams of three people. If the number of students is not divisible by three, we will allow a maximum of two teams of two people.
You can choose your teams yourselves. This freedom comes with responsibility: select your partners with care. Once the teams are chosen, it won't be possible to change their composition. All members of a team will receive the same grade.
Please use the following form for team enrolment:
Team enrolment [only from within EPFL, use VPN from outside]
If you can't find your name in the list, please send an e-mail to Martin containing your team composition.
Deadline for team enrolment: Monday, 2005-03-14, 2400
Step 2: Formation of consortia
Once the teams are chosen each team has to pick a web site as their data source. You will also have to form consortia of three teams. Each of these three teams must pick a different category. In phase 2 you will exchange the data gathered between the teams of a consortium and we want to ensure that you can merge data from three different sources.
Below is a list of categories and web sites that we think present a good choice for the project (the order is purely alphabetical). If you know any other web sites that you want to use, feel free to do so, as long as they fit into one of the three categories. In such a case please let us know so we can add your site to the list below.
Category 1: Music databases
- Metacritic.com Music: Music reviews
- MusicBrainz: Music metadatabase
- WWW Music Database: Album and artist database
- VH1: Artists A–Z: Music artists database
Category 2: e-Shops
- Amazon.com: Also has international sites that you may want to check out.
- Buy.com: Internet superstore
- CeDe.ch: Swiss multimedia shop
Category 3: Digital music stores
- AOL music store: German online music store
- Musicload: German T-Online music store
While picking a web site you should keep in mind certain criteria:
Does it contain useful data? Is there enough to create an interesting database?
Does the site have a browsing facility? (Alphabetical lists or Top 100 lists are a good starting point.)
Is it reasonably simple to write a web crawler for it? (Look out for patterns graspable with regular expressions.)
Does the site force the visitor to use certain software (e.g. Windows with Internet Explorer)? If so, is it possible to fool the browser detection, e.g. by simply changing the 'User-agent' field? Cookies may be trickier to implement but not impossible.
etc.
You must pick a category by Friday of the second week so that you can form the consortia. The list for registering your consortium will be available during the exercise session on 18.3. between 1015 and 1200. Note that you do not have to tell us which web site you have chosen at that time. We do, however, advise you to pick a site before Friday the 18th, since this makes forming the consortia easier.
Deadline for consortium inscription: Friday, 2005-03-18, 1200 (INF1, INF3)
Step 3: Database design
This is the first step where you get in touch with the actual database. You will start by designing an entity/relationship (E/R) model of your database. To do this successfully you have to decide first what kind of information you want to collect from the web site you have chosen.
Try to include as many interesting details from your web site as possible, e.g. genre, labels, release date, etc. You should especially emphasize the features that are unique to your web site category, because these make the merging process of phase 2 interesting.
Step 4: Database implementation
Once you have the E/R model you can start translating it into a relational model, i.e. the one you're going to use for the rest of the project. We expect two representations of the relational model: a graphical schema and the SQL script that you are using with your database server.
You should read the Postgres How-to now!
A few tips:
This is a good time to quickly go over the relevant parts of the PostgreSQL documentation, especially chapter 8 to see what data types you can use.
While designing the relational model it is probably a good idea to continuously test your model for syntax correctness in Postgres.
You can create a separate database just for the design process and, once everything is in place, move it over to your "real" database.
Be sure to always keep an up-to-date copy of the model (the .sql file) elsewhere and not only on the database server, just in case you accidentally delete too much. A good way to do this is by using the pg_dump tool.
Step 5: Web crawler
The goal of this step is to populate the database that you have developed in the previous step. To do this you will develop a web crawler ("bot"), i.e. a program that downloads data from the web site you have picked, converts it into a usable form, where necessary, and stores it in your database.
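The download, convert and store cycle described above can be organized around a queue of pages still to visit. The following is only a minimal sketch under stated assumptions, not a prescribed design: the class name CrawlFrontier and the linksOf callback (which stands in for "download the page, store its data, return the links found on it") are hypothetical, and the example site is an in-memory map so that the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Sketch of a crawl loop: visit pages breadth-first, never twice, up to a limit. */
class CrawlFrontier {

    /**
     * @param start    address of the first page (e.g. an alphabetical index)
     * @param linksOf  hypothetical callback: processes one page and returns its links
     * @param maxPages upper bound on visited pages, to keep the load on the site low
     * @return the addresses in the order in which they were visited
     */
    static List<String> crawl(String start, Function<String, List<String>> linksOf, int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        List<String> visited = new ArrayList<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) { // add() returns false for addresses already queued
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a web site (an assumption made for this sketch).
        Map<String, List<String>> site = Map.of(
                "/index", List.of("/a", "/b"),
                "/a", List.of("/b", "/index"),
                "/b", List.of("/c"),
                "/c", List.of());
        // prints [/index, /a, /b, /c]
        System.out.println(crawl("/index", url -> site.getOrDefault(url, List.of()), 10));
    }
}
```

Keeping track of the pages already seen prevents the bot from looping forever on sites whose pages link back to each other, and the page limit is a simple safeguard during testing.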
We suggest that you write the bot in Java, although you are entirely free to choose your favorite language. The pillars of your bot are HTTP to download the web pages, regular expressions to extract interesting data and links, and JDBC to interact with your Postgres database. For all of these techniques there are simple examples available that demonstrate possible implementations in Java.
Despite its name, this sample program is nowhere near a fully functional web crawler, but it shows you how to fetch data via HTTP and how you can extract certain information (in this case: hyperlinks) from the pages downloaded.
The program demonstrates two slightly different approaches. The first one (simpleExtractData) simply downloads a web page without any bells and whistles, while the second one (complexExtractData) uses a slightly more interesting technique (namely the falsification of the user agent) to request a simple page from Google. The program has to fake the user agent string because Google tries to detect usage that does not conform to its policy (in particular so-called meta searching). By changing the User-agent field of the HTTP request we lead Google to think that we are a regular browser, and we get our desired result page.
Both methods then use different regular expressions to extract certain data. You should play around a little with different regular expressions to get a feeling for how they work.
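To illustrate the user agent technique described above, here is a minimal sketch; it is not the course's sample program. The class name UserAgentFetch, the placeholder address, the User-Agent string and the assumption that the page is encoded in UTF-8 are all illustrative choices. It uses only the standard java.net.HttpURLConnection class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {

    /** Prepares (but does not yet send) a GET request carrying the given User-Agent. */
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the response body as text (UTF-8 is assumed). */
    static String download(String url, String userAgent) {
        HttpURLConnection conn = open(url, userAgent);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Needs network access; the address and the User-Agent string are placeholders.
        String page = download("http://example.com/", "Mozilla/5.0 (compatible; ExampleBot/0.1)");
        System.out.println(page.length() + " characters downloaded");
    }
}
```

The header has to be set before the request is sent, which is why open() configures the connection and only download() actually reads from it.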
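As a starting point for such experiments, the sketch below extracts hyperlink targets with java.util.regex. The pattern and the class name LinkExtractor are assumptions made for illustration; the pattern only handles quoted href attributes and is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Matches <a ... href="..."> or href='...'; group 1 is the quote character,
    // group 2 is the link target. Unquoted attribute values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchor tags, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/artists/a.html\">A</a> "
                + "<A class='x' HREF='top100.html'>Top 100</A> <p>no link here</p>";
        // prints [/artists/a.html, top100.html]
        System.out.println(extractLinks(html));
    }
}
```

The lazy quantifiers (*?) keep each match as short as possible, so that two links on the same line are not merged into one match.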
This example demonstrates in a few simple lines of code how you can connect to your Postgres database from Java using JDBC.
The program above uses the actors database created in the Postgres how-to. Don't forget to add the PostgreSQL JDBC driver to the classpath to be able to run this example. More information on how to do this can be found in the corresponding how-to section.
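For orientation, a JDBC connection to Postgres might look roughly like the sketch below; it is not the course's example. The host, port, user name and password are placeholders, and the table actor with a column name is a hypothetical schema, since the layout of the actors database is defined in the how-to and not on this page. Running main requires a reachable server and the PostgreSQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class JdbcSketch {

    /** Builds a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/database. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder connection data; replace with the values of your own account.
        String url = jdbcUrl("localhost", 5432, "actors");
        try (Connection conn = DriverManager.getConnection(url, "your_user", "your_password");
             // Hypothetical table and column names, used for illustration only.
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT name FROM actor WHERE name LIKE ?")) {
            stmt.setString(1, "A%"); // parameters avoid pasting crawled text into SQL
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```

Using a PreparedStatement with parameters is worth the small extra effort in a crawler, because titles and names taken from web pages often contain quotes that would break a hand-assembled SQL string.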
To make the second phase of the project interesting, your web crawler must collect at least 200 records (counted in the main table), each as complete as possible. A missing value once in a while is natural and therefore not a problem, but you must make sure that the main record features (e.g. the CD title, etc.) are present. If you feel that there is no way you can collect this much data, you should come and talk to us. It should not be a problem, however: after all, these sites provide thousands of items.
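A simple way to keep an eye on this requirement is to count the records whose main features are present before you hand in. The sketch below assumes a hypothetical Album row in which title and artist are the required features and label is optional; adapt the fields to your own main table.

```java
import java.util.List;

class RecordCheck {

    static final int MIN_RECORDS = 200; // minimum stated in the assignment

    /** Hypothetical main-table row; title and artist are treated as required. */
    static final class Album {
        final String title;
        final String artist;
        final String label; // optional: may be missing once in a while

        Album(String title, String artist, String label) {
            this.title = title;
            this.artist = artist;
            this.label = label;
        }
    }

    private static boolean isPresent(String s) {
        return s != null && !s.trim().isEmpty();
    }

    /** A record counts only if all of its main features are present. */
    static boolean hasMainFeatures(Album a) {
        return isPresent(a.title) && isPresent(a.artist);
    }

    static long countComplete(List<Album> albums) {
        return albums.stream().filter(RecordCheck::hasMainFeatures).count();
    }

    static boolean meetsMinimum(List<Album> albums) {
        return countComplete(albums) >= MIN_RECORDS;
    }

    public static void main(String[] args) {
        List<Album> rows = List.of(
                new Album("Example Title", "Example Artist", null), // complete enough
                new Album(" ", "Example Artist", "Example Label"),  // title missing
                new Album(null, null, null));                       // empty row
        // prints 1 of 3 records usable
        System.out.println(countComplete(rows) + " of " + rows.size() + " records usable");
    }
}
```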
Report
At the end of phase 1 you must hand in a report in the form of a web page. This web page must be included in the archive you hand in. It must also be accessible to everyone shortly after the deadline, because the other teams of your consortium need it for phase 2. Remember not to make it public too soon, to prevent others from copying the contents. If you do not want to make everything public (e.g. your database), you are responsible for proper communication with the rest of your consortium.
If you don't know how to publish your report web site you can find more information in the corresponding how-to.
Format
The report web pages must be valid XHTML according to the XHTML 1.0 standard or higher. The choice of a DTD (Strict, Transitional, Frameset) is up to you. Moreover your web pages must be reasonably viewable in browsers that follow the current standards (that obviously does not include Internet Explorer :-).
Please put all your files in an archive (ZIP, RAR, TAR, ...) called groupXX.{zip,rar,tar.gz,...} before you send them to us. (If you don't know how to do this, try something like tar cfz group87.tar.gz phase1/.)
Content
Your report must document all relevant steps of phase 1 in reasonable detail. You should explain your decisions and highlight the important features of your work. Don't hesitate to mention outstanding features; we will honor additional effort. If you have encountered any particular problems during the process, you should also mention them. Note that we will take into account the fact that some web sites may turn out to be a little more difficult to crawl than others.
You must include the complete source files of your web crawler including links to any external libraries that you may have used. The web crawler must compile, so you might want to double check this before you hand the files in. Also don't forget the database schemas and the complete set of SQL scripts (i.e. schema and data) needed to recreate the database as well as the E/R model.
Shortly after the deadline (Wednesday at the latest) your report must be available online, so be sure to include the link (it doesn't have to work yet ...) in the report or in the e-mail you send us!
To be prepared for phase 2, please also include the name of the currency that you use for the prices of your articles.
You should include either a script that compiles your web crawler or at least the corresponding command.
Particularities
While this is not really part of the report, we do require that you change the login password of your account. Not only does this stop other people from stealing or disturbing your work, it should also become a habit for the future. Failure to do so will lead to a deduction from your grade.
Deadline for report hand-in: Friday, 2005-04-22, 2400 (hand-in by e-mail to Martin)
Deadline for database exchange: Wednesday, 2005-04-27, 2400 (see phase 2 for details)
Copyright © Martin Rubli & Patrik Bless
This page uses
valid XHTML 1.0 Strict and
valid Cascading Style Sheets, Level 2.