1. A Software Clustering Platform
Working knowledge of Java.
Desire to work on an exploratory research project and to develop and implement innovative ideas.
Clustering software systems is the process of grouping software artifacts (files, classes, packages, etc.) together to identify subsystem structures and to produce a higher level of abstraction for software systems. It can be considered as an architecture reconstruction task.
This project aims to apply some of the known software clustering algorithms to discover software structures in COBOL and Visual Basic software systems. The work will not start from scratch. Instead, the project aims to integrate existing technologies, build the adapters needed to get the various tools to work together, and complete the missing pieces.
First, the student will research and assess the applicability of clustering techniques in general, and clustering tools in particular, to COBOL (or VB) programs. The student can review the clustering approaches in the literature and assess how well each approach and its corresponding tools apply to COBOL or VB (for instance, how close the results are to the real structure of a COBOL system, what they actually reveal about the code, etc.). The student can also compare the applicability of both techniques and tools to real COBOL systems.
Second, the student will assemble a software clustering platform from existing tools. The tools proposed as a starting point for the project are:
1. ATX L-CARE reengineering platform for reverse engineering and producing higher-level abstractions of software systems.
2. Bunch tool for code clustering, which implements, in Java, some source code clustering algorithms (see references). Bunch takes module dependency files as input, and L-CARE can provide this dependency information. However, L-CARE exports such graphs as XML, while Bunch accepts its own format, so the student will have to convert the L-CARE-generated files to the format accepted by Bunch.
3. An independent graph visualization tool, e.g., Graphviz, to visualize the graphs produced by clustering (see references).
This project is exploratory in nature. The student (in consultation with the supervisor) can switch to better tools or even build his/her own if necessary. So, the specific tools used do not matter; what matters is examining the clustering process.
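The adapter step described above (XML dependency data in, Bunch input and a Graphviz drawing out) might look roughly like the sketch below. The XML layout, element names and attribute names are invented for illustration, since the actual L-CARE export schema is not described here. The Bunch output assumes a plain text file with one "source target weight" dependency per line, which should be checked against the Bunch documentation before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the real L-CARE schema is NOT known here.
SAMPLE_XML = """
<dependencies>
  <edge from="PAYROLL" to="TAXCALC" weight="3"/>
  <edge from="PAYROLL" to="DBACCESS"/>
  <edge from="TAXCALC" to="DBACCESS" weight="1"/>
</dependencies>
"""


def parse_edges(xml_text):
    """Return (source, target, weight) tuples from the assumed XML layout."""
    root = ET.fromstring(xml_text.strip())
    edges = []
    for edge in root.iter("edge"):
        # A missing weight attribute is treated as weight 1.
        weight = int(edge.get("weight", "1"))
        edges.append((edge.get("from"), edge.get("to"), weight))
    return edges


def to_mdg(edges):
    """Render edges as one 'source target weight' line each (assumed Bunch input)."""
    return "\n".join(f"{src} {dst} {w}" for src, dst, w in edges)


def to_dot(edges):
    """Render the same edges as a Graphviz DOT digraph for visualization."""
    lines = ["digraph dependencies {"]
    for src, dst, w in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{w}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    edges = parse_edges(SAMPLE_XML)
    print(to_mdg(edges))
    print(to_dot(edges))
```

Keeping the parsed edge list as the only shared data structure means that switching to a different clustering or visualization tool only requires writing another small output function.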
· The student will learn software clustering and will try out some of its algorithms and tools.
· The student will learn how to set up a real software reverse engineering / reengineering environment composed of various tools integrated together to accomplish a non-trivial task. The student will learn the challenges involved in this process.
· The student will learn about the issues and challenges of exchanging inputs and outputs between different software engineering tools that may use different representation formats for their input and output. The student will develop some adapters to overcome this problem, if needed.
1. A working implementation of a software clustering environment that integrates various tools together.
2. An evaluation report on the application of software clustering technology to cluster COBOL and/or VB programs.
· A survey of some clustering algorithms - Brian Mitchell, Clustering Software Systems to Identify Subsystem Structures
· Bunch tool for software clustering - http://serg.cs.drexel.edu/projects/bunch/
· A paper on the Bunch tool for software clustering - S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen and E. R. Gansner, Using Automatic Clustering to Produce High-Level System Organizations of Source Code
· Graphviz - Graph Visualization Software
· More publications on software clustering - http://serg.cs.drexel.edu/publications/
2. A Survey of Data Mining of Version Control Data
The project aims to produce a survey and analysis of the current literature on applying data mining methods to version control data of CVS and other version control systems. The study is suggested to cover, among other things, the following topics:
o Identification of the key projects, the important publications and the active researchers in this area.
o Analysis and categorization of the types of data stored by version control systems.
o Analysis and categorization of the different users of these systems and their interests.
o Analysis and categorization of the data mining and other methods used in analysing the data stored by version control systems.
o Study and categorization of the available literature on the subject according to the purpose of each study.
o Analysis of the results of these studies with a measure of their success.
o Possibly, a comparison between different methods and their results, either theoretically or practically by implementing some of these methods.
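As one small illustration of what implementing a mining method could involve, the sketch below counts how often pairs of files change together, which is the basic idea behind the logical coupling studies in the reference list. It is a minimal sketch and not any particular author's method. The commit data is invented, and it assumes file changes have already been grouped into commits (CVS logs are kept per file, so this grouping has to be reconstructed first).

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count, for every unordered pair of files, the commits that touch both."""
    counts = Counter()
    for files in commits:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


# Invented example data: each inner list is the set of files in one commit.
COMMITS = [
    ["parser.c", "parser.h"],
    ["parser.c", "parser.h", "main.c"],
    ["main.c", "README"],
]

if __name__ == "__main__":
    for pair, n in co_change_counts(COMMITS).most_common():
        print(pair, n)
```

Pairs with a high count relative to the number of commits touching either file are candidates for hidden dependencies; a practical comparison between methods would also need a support or confidence threshold.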
· A survey dissertation on mining version control systems.
· A possible implementation of some of the version control mining methods.
· Possibly, a technical journal paper on the results of the survey, to be submitted for publication.
This is a long list of possible references on this topic. This does not mean that every single one is to be used. Since it is a survey project, the student should take some time initially to identify the valuable resources and studies that should be the subject of this study and ignore the trivial ones.
Related Workshops
-----------------
MSR 2004: International Workshop on Mining Software Repositories
Proceedings available at: http://plg.uwaterloo.ca/~aeehassa/home/pubs/MSR2004ProceedingsFINAL_IEE_Acrobat4.pdf
Some Related Journal Publications
---------------------------------
Daniel M. German, Using software trails to reconstruct the evolution of software, Journal of Software Maintenance and Evolution (JSME), Special Issue on Evolution of Large-scale Industrial Software Applications, Vol. 16 No. 6, pp. 367-384, 2004.
D. Atkins, T. Ball, T. Graves, and A. Mockus. Using version control data to evaluate the impact of software tools: A case study of the version editor. IEEE Transactions on Software Engineering, 28(7):625-637, July 2002.
Stephen G. Eick, Todd L. Graves, Alan F. Karr, J. S. Marron, and Audris Mockus. Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1):1-12, 2001.
Todd L. Graves, Alan F. Karr, J. S. Marron, Harvey P. Siy: Predicting Fault Incidence Using Software Change History. IEEE Trans. Software Eng. 26(7): 653-661 (2000)
T. Graves and A. Mockus. Identifying productivity drivers by modeling work units using partial data. Technometrics, 43(2):168-179, May 2001.
Ahmed E. Hassan and Richard C. Holt: Studying The Chaos of Code Development, Proceedings of WCRE 2003: Working Conference on Reverse Engineering, Victoria, British Columbia, Canada, November 13-16, 2003. (Paper invited to IEEE Transactions Special Issue for WCRE 2003)
James D. Herbsleb, Audris Mockus: An Empirical Study of Speed and Communication in Globally Distributed Software Development. IEEE Trans. Software Eng. 29(6): 481-494 (2003)
Philip M. Johnson, Carleton A. Moore, Joseph A. Dane, Robert S. Brewer: Empirically Guided Software Effort Guesstimation. IEEE Software 17(6): (2000)
Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue: CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE Trans. Software Eng. 28(7): 654-670 (2002)
Audris Mockus, Roy T. Fielding, James D. Herbsleb: Two case studies of open source software development: Apache and Mozilla. ACM Trans. Softw. Eng. Methodol. 11(3): 309-346 (2002)
Audris Mockus and David M. Weiss. Globalization by chunking: a quantitative approach. IEEE Software, 18(2):30-37, March 2001.
Dewayne E. Perry, Harvey P. Siy, Lawrence G. Votta: Parallel changes in large-scale software development: an observational case study. ACM Trans. Softw. Eng. Methodol. 10(3): 308-337 (2001)
Walt Scacchi. Free/Open Source Software Development Practices in the Computer Game Community, IEEE Software, Special Issue on Open Source Software, (to appear, 2004).
Related Workshop and Conference Publications
--------------------------------------------
T. Zimmermann, P. Weißgerber, S. Diehl, A. Zeller: Mining Version Histories to Guide Software Changes. Proc. 26th International Conference on Software Engineering (ICSE), Edinburgh, UK, May 2004.
T. Zimmermann, S. Diehl, A. Zeller: How History Justifies System Architecture (or not). Proc. International Workshop on Principles of Software Evolution (IWPSE 2003), Helsinki, Finland, September 2003.
Michael Fischer, Martin Pinzger, and Harald Gall. Analyzing and Relating Bug Report Data for Feature Tracking. In Proceedings of the 10th Working Conference on Reverse Engineering (WCRE), Victoria, British Columbia, Canada, IEEE CS Press, November 2003.
Michael Fischer, Martin Pinzger, and Harald Gall. Populating a Release History Database from Version Control and Bug Tracking Systems. In Proceedings of the 2003 International Conference on Software Maintenance (ICSM), Amsterdam, The Netherlands, IEEE CS Press, September 2003.
Harald Gall, Jacek Krajewski, and Mehdi Jazayeri. CVS Release History Data for Detecting Logical Couplings. In Proceedings of the International Workshop on Principles of Software Evolution (IWPSE), Helsinki, Finland, IEEE CS Press, pp. 13-23, September 2003.
H. Gall, M. Jazayeri, and C. Riva. Visualizing software release histories: the use of color and third dimension. In International Conference on Software Maintenance (ICSM '99), pages 99-108, Oxford, England, Aug. 1999. IEEE Computer Society Press.
H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In International Conference on Software Maintenance (ICSM '98), Washington D.C., Nov. 1998. IEEE Computer Society Press.
H. Gall, M. Jazayeri, R. Klösch, and G. Trausmuth. Software evolution observations based on product release history. In M. J. Harrold and G. Visaggio, editors, International Conference on Software Maintenance (ICSM '97), pages 160-166, Bari, Italy, Sep. 1997. IEEE Computer Society Press.
Marek Leszak, Dewayne E. Perry, Dieter Stoll: A case study in root cause defect analysis. ICSE 2000: 428-437
Dewayne E. Perry, Harvey P. Siy, Lawrence G. Votta: Parallel Changes in Large Scale Software Development: An Observational Case Study. ICSE 1998: 251-260
Philip M. Johnson, Hongbing Kou, Joy Agustin, Christopher Chan, Carleton Moore, Jitender Miglani, Shenyan Zhen, William E. J. Doane: Beyond the Personal Software Process: Metrics collection and analysis for the differently disciplined. ICSE 2003: 641-646
Mikio Aoyama, Katsuro Inoue, Vaclav Rajlich: Principles of software evolution: 5th international workshop on principles of software evolution (IWPSE 2002). ICSE 2002: 657-658
Ahmed E. Hassan and Richard C. Holt, The Chaos of Software Development, Proceedings of IWPSE 2003: International Workshop on Principles of Software Evolution, Helsinki, Finland, September 1-2, 2003
Audris Mockus, Roy T. Fielding, James Herbsleb: A case study of open source software development: the Apache server. ICSE 2000: 263-272
James D. Herbsleb, Audris Mockus: Formulation and preliminary test of an empirical theory of coordination in software engineering. ESEC / SIGSOFT FSE 2003: 138-147
Audris Mockus, James D. Herbsleb: Expertise browser: a quantitative approach to identifying expertise. ICSE 2002: 503-512
James D. Herbsleb, Audris Mockus, Thomas A. Finholt, Rebecca E. Grinter: An Empirical Study of Global Software Development: Distance and Speed. ICSE 2001: 81-90
Ivan T. Bowman, Richard C. Holt: Reconstructing Ownership Architectures To Help Understand Software Systems. IWPC 1999: 28-37
Richard C. Holt, J. Y. Pak: GASE: Visualizing Software Evolution-in-the-Large. WCRE 1996: 163-
Y. Liu, E. Stroulia: Reverse Engineering the Process of Small Novice Software Teams, 10th Working Conference on Reverse Engineering, November 13-16, 2003, pp. 102-112, IEEE Press.
Davor Cubranic, Gail C. Murphy: Hipikat: Recommending Pertinent Software Development Artifacts. ICSE 2003: 408-418
J. Shirabad, Timothy Lethbridge, Stan Matwin: Supporting Software Maintenance by Mining Software Update Records. ICSM 2001: 22-31
J. Shirabad, Timothy Lethbridge, Stan Matwin: Mining the Maintenance History of a Legacy Software System. ICSM 2003
Jennifer Bevan, E. James Whitehead, Jr., "Identification of Software Instabilities." In Proceedings of the Tenth Working Conference on Reverse Engineering (WCRE 2003), Victoria, British Columbia, Canada, November 13-16, 2003.
Annie Chen, Eric Chou, Joshua Wong, Andrew Y. Yao, Qing Zhang, Shao Zhang, Amir Michail: "CVSSearch: Searching through Source Code using CVS Comments", International Conference on Software Maintenance, pages 364-373, 2001
John Champaign, Andrew Malton, Xinyi Dong, Stability and Volatility in the Linux Kernel, Proceedings of IWPSE 2003: International Workshop on Principles of Software Evolution, Helsinki, Finland, September 1-2, 2003
Dirk Draheim, Lukasz Pekacki: Process-Centric Analytical Processing of Version Control Data, Proceedings of IWPSE 2003: International Workshop on Principles of Software Evolution, Helsinki, Finland, September 1-2, 2003
3. Software Architecture Recovery: A Case Study
The project aims to study the available technology for software architecture recovery and reconstruction and apply it to a software system. The purpose is to rebuild the architecture of the software and document it so that the system can be maintained and evolved. The activities involved are:
o Review of the existing literature on architecture recovery and reconstruction.
o Preliminary investigation of the target system.
o Choice of suitable methods for the system at hand.
o Application of the chosen methods to the target system to recover and document its architecture.
· A dissertation on software architecture recovery.
· Architecture description and software technical documents for the target system.
· Possibly, a technical journal paper on the results of the study, to be submitted for publication.
· Architecture Reconstruction Guidelines, Third Edition - http://www.sei.cmu.edu/publications/documents/02.reports/02tr034/02tr034.html
· O'Brien, Liam. Experiences in Architecture Reconstruction at Nokia (CMU/SEI-2002-TN-004) - http://www.sei.cmu.edu/publications/documents/02.reports/02tn004.html
· Software Architecture Reconstruction: Practice Needs and Current Approaches - http://www.sei.cmu.edu/pub/documents/02.reports/pdf/02tr024.pdf
· Architecture Recovery Using Conway’s Law - http://plg.uwaterloo.ca/~itbowman/pub/CASCON98.pdf
· Software Architecture Reconstruction - http://adam.wins.uva.nl/~x/sar/sar.pdf
· SEMINAR: Software Architecture: Recovery and Modelling - http://www.bauhaus-stuttgart.de/dagstuhl/#program and http://www.dagstuhl.de/03061/
· Architecture recovery of Apache 1.3 - A case study - http://apache.hpi.uni-potsdam.de/document/documents/architecture_recovery_of_apache.pdf
· Rigi Reverse Engineering Tool - http://www.rigi.csc.uvic.ca/index.html
The project aims to extend previous work on mining recorded traces of interaction between software systems and their users to software systems with Graphical User Interfaces. The process of mining software usage data involves:
o Recording the actions taken by the software system and the user in response to one another during the system-user dialog, i.e., during the usage of the system through its user interface.
o Applying data mining methods (e.g., the IPM2 algorithm; see references below) to discover patterns of interesting user activities in the recorded sequences of actions.
o Analysing the discovered patterns and using them for program understanding, reengineering, user interface personalization, re-documentation, reverse engineering, and similar tasks.
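To give a feel for the pattern discovery step, the sketch below finds fixed-length runs of consecutive actions that occur in at least a minimum number of recorded traces. This is a deliberately simple stand-in and is not the IPM2 algorithm from the references. The action names, traces and thresholds are invented for illustration.

```python
from collections import Counter


def frequent_patterns(traces, length, min_support):
    """Return {pattern: support} for runs of `length` consecutive actions.

    Support is the number of traces containing the pattern at least once.
    """
    support = Counter()
    for trace in traces:
        # Collect each distinct window once per trace.
        windows = {
            tuple(trace[i:i + length])
            for i in range(len(trace) - length + 1)
        }
        support.update(windows)
    return {p: n for p, n in support.items() if n >= min_support}


# Invented example: each list is one user's recorded sequence of GUI actions.
TRACES = [
    ["open", "search", "select", "print", "close"],
    ["open", "search", "select", "save", "close"],
    ["open", "help", "close"],
]

if __name__ == "__main__":
    print(frequent_patterns(TRACES, length=3, min_support=2))
```

Exact contiguous matching is brittle when users insert extra actions mid-task, so a real study would likely need a method that tolerates gaps and noise in the traces.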
This process has been applied successfully to Web-based and legacy systems, as the references below describe. The steps to be followed for this research could be:
o Study of the available literature and available resources (code for IPM2 is available).
o Study of the available GUI recording technology and how it can be utilized or adapted for this application. This should result in selecting a GUI recorder or creating one.
o Installing a GUI recorder on the machines of a number of users of a GUI-based system and collecting traces of their interaction with the system.
o Collecting the recorded traces and applying data mining to them.
o Analysis of the discovered patterns and identification of potential uses for them.
· A dissertation on mining software GUI usage data.
· An implementation of a prototype for collecting and mining software GUI usage data.
· Possibly, a technical journal paper on the results of this study, to be submitted for publication.
· Enhancing A GUI Event Recorder To Support The Creation Of User Documentation - http://www-scf.usc.edu/~shenliu/_private/paper_submission_Liu.pdf
· M. El-Ramly, E. Stroulia, Mining Software Usage Data. Available at: http://plg.uwaterloo.ca/~aeehassa/home/pubs/MSR2004ProceedingsFINAL_IEE_Acrobat4.pdf
· Mohammad El-Ramly and Eleni Stroulia. Analysis of Web-usage Behavior for Focused Web Sites: A Case Study. Journal of Software Maintenance and Evolution, 16(1-2), 129-150, 2004. John Wiley and Sons.
· Eleni Stroulia, Mohammad El-Ramly, Paul Iglinski and Paul Sorenson. User Interface Reverse Engineering in Support of Interface Migration to the Web. Automated Software Engineering, 10(3), 271-301, 2003. Kluwer Academic Publishers.
· Eleni Stroulia, Mohammad El-Ramly and Paul Sorenson. From Legacy to Web through Interaction Modeling. In Proc. of the 18th International Conference on Software Maintenance (ICSM 2002), 3-6 October 2002, Montreal, Quebec, Canada, 320-329. IEEE Computer Society, ISBN 0-7695-1819-2 (2002).
· Mohammad El-Ramly, Eleni Stroulia and Paul Sorenson. From Run-time Behavior to Usage Scenarios: An Interaction-pattern Mining Approach. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), July 23-26, 2002, Edmonton, Canada, 315-324. ACM Press, ISBN 1-58113-567-X (2002).
5. Java to C# Language Transformer
Working knowledge of Java.
Eagerness to read and learn a lot about Java and C# (read "C sharp") and their specifications, similarities and differences. The project also requires investing time and effort in learning TXL and its use in structural transformations.
The project aims to produce a language transformer that takes a Java program and produces a corresponding C# program. The transformer will be written in TXL (see references). TXL is a programming language specifically designed to support computer program analysis and source transformation tasks. It is a hybrid functional / rule-based language. The idea of source code transformation with TXL is to write a grammatical description of the source language and a grammatical description of the target language. Then the transformation rules that take a program in the first language (Java) and transform it into a program in the second language (C#) are written and tested. It is expected that the project duration will allow writing only a subset of the C# grammar and of the Java-to-C# transformation rules. I am also open to other language transformers if the student prefers to write a transformer between a different pair of languages instead of Java-to-C#. The project is considerably non-traditional in the sense that most of the effort will go into understanding the C# specification, translating it into TXL, and inventing the structural transformations necessary to convert Java language constructs to the equivalent C# ones.
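The flavour of such construct-to-construct mappings can be shown with a toy example. The sketch below is not TXL and does not use a grammar: it applies a handful of textual rewrite rules, chosen for illustration, to a one-line Java snippet. Its main purpose is to show why this naive approach is insufficient (for example, it would also rewrite words inside string literals and comments), which is the problem that grammar-based transformation avoids.

```python
import re

# Illustrative rules only: a real transformer would work on parse trees
# built from the Java and C# grammars, not on raw text.
RULES = [
    (re.compile(r"\bSystem\.out\.println\b"), "Console.WriteLine"),
    (re.compile(r"\bboolean\b"), "bool"),
    (re.compile(r"\bString\b"), "string"),
    (re.compile(r"\s+extends\s+"), " : "),
]


def java_to_csharp(source):
    """Apply each textual rewrite rule in order and return the result."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source


JAVA = 'class Dog extends Animal { boolean loud; void bark() { System.out.println("Woof"); } }'

if __name__ == "__main__":
    print(java_to_csharp(JAVA))
```

Constructs with no one-to-one counterpart (checked exceptions, anonymous inner classes, and so on) cannot be handled by substitution at all and are where the structural transformations mentioned above would be needed.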
A working Java-to-C# language transformer.
Phase 1: Research on Java and C# language specifications. Learning TXL. Implementing a C# grammar in TXL (a Java grammar in TXL already exists).
TXL Website. http://www.txl.ca/ (2004).
C# Language Specification. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csspec/html/vclrfcsharpspec_C.asp (2004).
6. Information Gathering Agent using NQL
The student will use Network Query Language (NQL) to build a software agent for information gathering. Ideally, the user will be able to identify one or more domains of interest and give the agent examples of the information of interest on the web sites related to each domain; alternatively, the agent will observe the user’s navigation behaviour in the domain(s) of interest. From these examples, the agent will learn to gather a daily information digest for the user in the domain(s) of interest and present it to the user every morning. Additionally, the system will be able to collect information from legacy information systems and integrate it with information collected from the Web. This definition is broad and flexible. After a period of background research and reading, the student should concretely define the specifications of his/her project.
The end product is an information gathering agent for multiple domains that allows the user to specify the domain of interest (e.g., stocks, news, etc.) and some details of his/her interests within this domain (which may include the information sources of interest and examples of the information of interest, or examples of the user’s navigation of the relevant websites). The agent then provides a daily digest of information tailored to the user’s needs and collected from the relevant web sources.
This project will require NQL software, which can be purchased/obtained from http://www.nqltech.com/nql.asp. It will need some background research on software agents and on programming by demonstration (or by example). Some links for starting this research are provided below.
David Pallmann, Network Query Language (NQL), ISBN: 0-471-20766-7, John Wiley & Sons Inc.
http://www.mobilein.com/NQLWhitepaper.pdf
http://www.nqltech.com/nql.asp
http://www-2.cs.cmu.edu/~softagents/intro.htm
http://www.agentbase.com/survey.html
http://citeseer.nj.nec.com/bauer00programming.html
http://web.media.mit.edu/~lieber/PBE/Your-Wish/#ruvini