Phase 2

Back to the main project page

Page outline

Introduction

In the first phase you have gotten a first impression of the problems encountered when trying to extract useful information from formats that were not designed for this purpose. Clearly the standard HTML format is not at all suitable for automatic processing, which is one of the reasons why standards like XHTML have been designed. Of course the fact that many web shops don't want their data to be automatically gathered doesn't help either.

Now that you all have data sets in a "normalized" form (actually three data sets in three different forms), you get to know another major challenge in information systems -- merging inhomogeneous data. As you can imagine this is a very frequent problem in information technology today, just think of corporate mergers, distributed information systems, etc. Whenever data from different sources come together, a way must be found to recognize which records actually describe the same data, but are just in a different format. Unique identifiers like the ISBN for books are unfortunately the exception, so different solutions must be found.

The first part of the second phase of the project is all about this problem. Each team has three different databases at hand, their own and those from the other two teams in the consortium. The goal is to merge these three databases to a single, more comprehensive one in order to have more interesting data for the "web shop" that you are going to design in the second part.

Step 1: Exchanging data

Before you can start with the real work of phase 2 all teams must exchange their databases within their consortia. How you do this is completely up to you. The only rules we make is that you must give your consortium partners access to your original and complete database. This includes the database schema and the actual records in an SQL format suitable for the Postgres servers we are using for the course.

You'll probably want to use the pg_dump tool for this. It lets you extract the entire database in one swoop and save it in a file. Here is an example of how you can dump your database to a file:

pg_dump -Oxd MyDatabase -f mydatabase_dump.sql

For more information you can consult the pg_dump reference or enter pg_dump --help.

Deadline for database exchange: Wednesday, 2005-04-27, 2400

Step 2: Importing the data

Now that you have received the databases dumps from the other two teams you should probably start by reimporting them on your server. The best way to do this is to first create two databases and then importing one dump in each of them.

Step 3: Analyzing the data

Analyze the schemas (possibly using the information found in the reports of the other teams -- see the report list) and decide what information you want to add to your own database. Think ahead what kind of information you're going to need for your "web shop". One obvious example are prices; if you have chosen music databases for your category you probably won't have any pricing information, so you can get this from one of the other databases. Try to identify the fields that can help you link the data from two databases.

It is likely that you have to do certain changes to the databases of your fellow students. You may want to restructure the tables to make them more similar to your own schema.

Step 4: Extending your schema

Obviously the schema developed in phase 1 has primarily been designed for the specific needs of your chosen category. Now that you know what kind of data you want to add you are able to extend your schema to suit the new circumstances. There are several ways to accomplish this, here are just two possibilities:

Clearly mixtures of the two approaches are possible and may make sense in certain cases. The main point of this step (and this is also the criterion we're going to use to grade your work) is that your choice makes sense in the specific circumstances you find yourselves in. Your solution should take into account the basic design principles of relational database design.

A technical note: Once you've analyzed the two foreign databases you might consider reading all three data sets into a single database for the merging process. If you have colliding table names in the three databases you may find Postgres' schema concept useful.

Step 5: Merging the data

There exist many ways to merge the data, once again depending on the specific data sets you have. Here are few general remarks that you may find useful:

As an alternative to preprocessing the strings and then joining them using SQL you might also consider writing a program in Java or any other language you prefer that does the actual merging process. Really there is no one way to do it.

As a general rule of thumb for the merging part there is no perfect solution but it is more of a "best effort" kind of approach. You should really try to get as much data as you can. This makes it much more interesting for the rest of the project, plus it increases your chances in the bidding game at the end. The more articles you can offer the more attractive your site becomes. Note that there is no upper limit to the number of articles that you can offer!

A note on currencies

The articles that you have gathered in phase 1 may have prices in different currencies like CHF, EUR, USD, etc. To simplify things and to make sure that prices remain comparable for the remaining parts of the project you are required to convert all prices from foreign currencies to CHF.

You must use the following exchange rates of 2005-04-18:

Foreign currency CHF
1 EUR 1.5459
1 GBP 2.2602
1 USD 1.1870

If you can't find your currency in this list, please let us know as soon as possible.

While we are using fixed rates for the purpose of this project, take a moment to think about how you would attack such a problem in real life with exchange rates varying from one day to another.

If your team is affected by the currency problem, the way how you solve it will be evaluated. There is no need no implement something very fancy (especially given our fixed rates), nevertheless some solutions are more suitable than others.

If you happen to be in a consortium of teams with equal currencies (or did not encounter the currency problem for some other reason that you should describe) explain in a few sentences how you would have solved the problem.

There is a chance you may not have prices for all articles at the end of the merging process. For example you may have collected much more data than the other two consortia members or you may not be able to match as many records as you would like. If this happens you are encouraged to make up prices where necessary. Try to find a creative way to do this that gives you reasonable prices. The elegance of your solution will be graded, so you should avoid simply replacing all missing values by 10 CHF. Also keep in mind that you can use these prices as a minimum selling price (i.e. the start price of a bid) for the last phase.

Step 6: Creating the Web shop

Once you have a solid database with a number of interesting articles you need to develop a web application that enables the other teams to browse through your offering and search for specific articles. Here are the two functions that your web shop must offer for now:

Of course you are free to add more features to it if you're after bonus points. Here are a few ideas if you're motivated but don't know where to get started:

You will have to add one particular feature later in phase 3, namely the possibility for visitors to bid for articles on the spot. There is nothing you have to do about this during this step, nevertheless you might want to keep it in mind while you're designing your application.

To accomplish this task you will be using the Java application server J2EE and JavaServer Pages. All you need to know on setting up and using the environment can be found in the J2EE how-to which you should read carefully to get the hang of it. Note that the time you invest to learn about J2EE is not only an investment in the future of this project but also in your professional future. First of all, you will need another aspect of J2EE in phase 3 and, second of all, many commercial applications are based on J2EE, so it can never hurt to know it. :-)

A remark about XHTML validity: It can be a little tricky to ensure that all possible dynamic pages of your web shop are 100% valid XHTML, which is why you do not need to do so as a requirement for phase 2. However you will receive a bonus if your web shop is valid XHTML as well! The effort needed for this is relatively small so it might be worth it spending a few thoughts on that. If you do it, please mention it in the report and explain what exactly you did. (Note that this has nothing to do with your report page, which still has to be XHTML valid as a strict requirement. More information below.)

Report

Like for the last phase you must hand in a report at the end of phase 2. This report must be included in the archive you hand-in.

Format

The report web pages must be valid XHTML according to the XHTML 1.0 standard or higher. The choice of a DTD (Strict, Transitional, Frameset) is up to you. Moreover your web pages must be reasonably viewable in browsers that follow the current standards (that obviously does not include Internet Explorer :-).

Please put all your files in an archive (ZIP, RAR, TAR, ...) called groupXX.{zip,rar,tar.gz,...} before you send them to us. (If you don't know how to do this, try something like tar cfz group87.tar.gz phase2/.)

Content

Your report must document all relevant steps of phase 2 in reasonable detail. You should explain your decisions and highlight the important features of your work. The report represents your entire work for phase 2 and your grade will depend on it to a large extent.

Don't hesitate to mention outstanding features, we will honor additional effort. If you have encountered any particular problems during the process you should also mention them. Note that we will take into account the circumstances like the quality of the other two databases of your consortium or general difficulties of the process.

Given the difficulties during the hand-in of phase 1, here's the list of items you must include:

You must put everything listed above in an archive before you hand it in. If the size of your archive exceeds 5 MB, please make it available for download somewhere (e.g. on your web page or directly on the lsir-cis-pc account) and send us only the link/location.

Please note that this time we will not check the archives you hand in for completeness at reception. Instead, if something is missing and we cannot correct certain parts, you will simply receive less or no points at all for the corresponding criteria. So remember to reread this page thoroughly and double-check the archives you hand-in against the above list!

Please also include a link to your web shop in the e-mail you send us so we can add it to the Web shop list.

Particularities

Once again you are required to secure your infrastructure. You must therefore change your J2EE administrator password so that other people cannot login. Failure to do so will lead to a deduction of your grade by 0.5 points. So if you haven't done so by the time you read this paragraph, now is the best time to make sure you don't forget.

Deadline for report hand-in: Thursday, 2005-05-19, 2400 (hand-in by e-mail to Martin)

Copyright © Martin Rubli & Patrik Bless – Last change:
This page uses valid XHTML 1.0 Strict and valid Cascading Style Sheets, Level 2. This page uses valid XHTML 1.0 Strict. This page uses valid Cascading Style Sheets, Level 2.