sss ssss      rrrrrrrrrrr
                      ssss    ss       rrrr   rrrr
                     sssss     s       rrrr    rrrr
                     ssssss            rrrr    rrrr
                      ssssssss         rrrr   rrrr
                          ssssss       rrrrrrrrr
                    s      ssssss      rrrr  rrrr
                    ss      sssss      rrrr   rrrr
                    sss    sssss       rrrr    rrrr
                    s  sssssss        rrrrr     rrrrr
         +=======    Quality Techniques Newsletter    =======+
         +=======            February 2004            =======+

QTN is distributed to subscribers worldwide to support the Software
Research, Inc. (SR), eValid, and TestWorks user communities, and to
other interested parties, to provide information of general use to
the worldwide internet and software quality and testing community.

Permission to copy and/or re-distribute is granted, and secondary
circulation is encouraged by recipients of QTN, provided that the
entire document/file is kept intact and this complete copyright
notice appears in all copies.  Information on how to subscribe or
unsubscribe is at the end of this issue.  (c) Copyright 2003 by
Software Research, Inc.


                       Contents of This Issue

   o  Risk Management 101, by Boris Beizer

   o  First International Workshop on Quality Assurance and Testing
      of Web Based Applications

   o  eValid: A Quick Summary

   o  TAV-WEB 2004: Workshop on Testing, Analysis and Verification
      of Web Services

   o  Web Caching and Content Distribution (WCW)

   o  Foundations of Computer Security (FCS-2004)

   o  1st International Workshop on Web Services and Formal Methods

   o  QTN Article Submittal, Subscription Information


                        Risk Management 101
                          by Boris Beizer

      Note:  This article is taken from a collection of Dr.
      Boris Beizer's essays "Software Quality Reflections" and
      is reprinted with permission of the author.  We plan to
      include additional items from this collection in future issues.

      Copies of "Software Quality Reflections," "Software
      Testing Techniques (2nd Edition)," and "Software System
      Testing and Quality Assurance," can be purchased
      directly from the author by contacting him at

1. The Quality Movement

1.1. Risk Management or Risk Enhancement?

"Risk Management" is a popular term.  It evokes an image of data
gathering, statistical models, and predictions of important
probabilities to ten significant figures.  It's what we expect out
of the Harvard and Wharton business schools.  It's scientific.  But
could we be kidding ourselves?  Could this activity lead to
increased risk rather than risk reduction?  To risk-knowledge chaos
rather than to risk management?  The answer, as you might expect, is
YES!  There are many unknowns in the use of statistical models to
support risk management for software and if the improper use of such
models gives you a false sense of security, then instead of
improving your situation you have made it worse.

1.2. Are We Part of The Quality Movement?

We believe in the quality movement.  We believe, as an article of
faith, that the philosophy that has worked so well for hardware
manufacturing can be adapted to yield comparable quality (and
therefore, productivity) gains for software.  That's what we'd like
to believe but in practice we fall far short of what has and can be
accomplished for widget manufacturing.  Take a hard look at
ourselves.  I'll even grant you the optimism of considering only you
readers, who represent at most 25% of software developers and who,
by the very fact that you're reading this, are a biased sample of
believers.  Among us believers and practitioners, how many of us
could get better than 500 points out of 1,000 on the Baldrige Award
application?  How about 400?  Do I hear a 300? 200?  Here's the 1999
Baldrige Award score card and what a software developer could get
without trouble:


                               GENERAL      SOFTWARE
                               TOTAL     EASY    TOUGH

1. Leadership                    125      115      10
2. Strategic Quality Planning     85       60      25
3. Customer/Market Focus          85       75      10
4. Information and Analysis       85       55      30
5. Human Resources Utilization    85       85       0
6. Process Management             85       70      15
7. Results                       450      345     105

TOTAL                          1,000      805     195

The GENERAL column gives the scores in the Baldrige Award application
for each of the listed categories.  Under the SOFTWARE heading the
EASY column is an estimate of the point potential in each category
that doesn't have anything special to do with software.  The TOUGH
column lists the points that are probably special to software,
because the methods that work for widgets either clearly do not
apply, are the subject of considerable controversy and/or confusion
over how to apply them, or still need research.

Software development is not yet part of the quality movement.  We're
hungry little kids out in the cold with our faces pressed against
the bakery shop window, looking in at the goodies and enjoying the
aroma.  Are we on the outside because we're dumb?  Because we're more
hide-bound than blue-collar workers?  Because our management is more
reactionary than Detroit car builders?  Because we have less cash
for capital investment?  Some of that's true, but those reasons
aren't basic.  The core reason is that it's harder to achieve
quality software than quality widgets; but if you accept my analysis
of what's tough and easy, software quality is harder to achieve
than widget quality by only about 20%.

1.3. Basic Tenets of the Quality Movement

The Baldrige boils down to simple ideas at the heart of the quality
movement: 1) everybody's involved, 2) continual improvement, 3)
statistical quality control.

1. Everybody's Involved: That means not just software engineering
and marketing, but user involvement from the planning to the
maintenance.  It means user-driven rather than technology- and
schedule-driven products.  It means quality decisions made at all
levels.  Everybody gets into the act. This is an area in which there
are no fundamental differences between software and widgets.  Going
all-out in getting people involved (especially users) gets you a few
hundred Baldrige points and a lot closer to that bakery doorway.  Do
it!  But because it doesn't present any software-specific problems,
I won't dwell on it.

2. Continual  Improvement:  That means learning from your (and
others') mistakes.  Learning how bugs come about and providing the
tools, time, and training to prevent them in the future.  You can
raise the score by a few hundred points (and your software quality
and productivity).  Here again, software isn't fundamentally
different than widgets.  Possibly there is a difference because the
programmers' supposedly higher intelligence (compared to blue-collar
workers, say) gets in the way.  The blue-collar worker is more
honestly humble and therefore more open to change.  At the bottom
line, though, it's not really different. There's no excuse to not do
it.  Do it!

3. Statistical Quality Control: That means using valid statistical
methods to measure quality and defects and to use such statistics to
best allocate our resources toward improving quality -- e.g., risk
minimization, Pareto analysis, and so on.  This is where software is
different -- this is where those missing 200 points are and what most
of this essay is about.

1.4. Objectives of Testing and QC

I've categorized testing objectives into four stages of increasingly
higher goals. These are like Maslow's hierarchy of needs -- food
before freedom, freedom before realizing your potential, etc.  The
objectives of testing are:

1. To convince us that the software works.

2. To demonstrate that it does not work -- that it has bugs.

3. To give management data with which to do rational risk evaluation.

4. To prevent bugs by doing the right kind of designs.

These are increasingly higher goals in the sense that one must
achieve the lower goals before the higher. That is, goal 4 should
presume 3, 2, and 1 and goal 3 presumes 2 and 1.  Each of these
higher goals leads us to unsolved statistical problems.

1. The Software Works. If you want an absolute demonstration that
software works, that's an unsolvable problem.  You can't prove, in a
mathematical sense, that software works.  What does "working" mean?
That every feature works?  That the probability of failure in use is
small enough to warrant use?  That the software has low risk of
causing other software to fail?  Ultimately, we would like to be
convinced by valid statistics, probabilities, risk, mathematical
confidence, etc.

2. The Software Doesn't Work. Given enough time and money we can
always demonstrate that software doesn't work.  The problem with
this as an objective of testing is that it never ends.  The
statistical issue is what models can we, should we, use to tell us
when to stop testing.

3. Risk Management.  Even if management quantified its risk strategy,
we don't yet know what kind of data to give them to plug into such a
hypothetical risk model.  There's a slew of nasty statistical issues
here.

4. Bug Prevention in Design.  This is one we can do something about.
It's done by gathering bug statistics and providing the programmers
with the resources and education needed to prevent those bugs in the
future. There are exciting new research results which promise to
provide formal statistical models for testability, but the utility of
those models will be hampered because we don't have clear answers to
the statistical questions raised by the previous three goals and,
most of all, because we don't have reliable bug statistics.

1.5. Barriers and Cop Outs -- A Prescription for Failure

You have a selling job to do.  You have to sell it to the troops, to
management, to users, and to all the rest.  One of the most common
and difficult questions that comes up is:

              Show me a cost-benefit analysis for all
                    this software quality stuff!

This seemingly reasonable request is a trap.  You must refuse and you
must get (especially) management to buy-in on faith alone.  Pointing
to other software development organizations ahead of you in the
quality game can help build that faith but unbelievers always claim
that "they're different."  Why faith?  Has Beizer found religion in
his dotage?  Will he shout "HEAL THIS SOFTWARE!"?  Faith because
faith is all you've got until your process revolution is over and
your quality goals achieved.  How can you show cost-benefit when you
don't know your end-to-end software development costs?  When you
don't know what your users really think about your software?  When
you have no baseline against which to measure improvement?  When
half the bugs go unreported?  When you can't repeat any test you did
more than a year ago?  How can you predict schedule improvement when
the organization has yet to meet a schedule?

No small part of the quality culture is the building of an
infrastructure that allows all aspects of the process to be
quantified.  Once it's done, you can retrospectively show how the
benefits did accrue -- but until you have that informational
infrastructure you're just guessing, making up numbers, relying on
faith and clothing it in pseudo-science.  That makes your whole
sales pitch vulnerable to destruction by a beady-eyed accountant
with a sharp #2 pencil.  Don't get shot down.  Don't fall into the
premature quantification trap.  Point out what a cop-out premature
cost-benefit analysis is and insist on faith.

2. Why Software Is Different

2.1. Manufacturing Process Versus Engineering Process

Software is different than widgets.  The great improvements in
widget quality have been gained mostly by improving the
manufacturing process.  It is only of late that quality attention
has focused on the engineering process.  While the Japanese
manufacturing process is typically ahead of its Western counterpart,
the quality of the engineering processes (for cars or washing
machines, say) are not that far apart -- East or West.  We don't
know how many prototypes get tossed on the scrap heaps, nor does it
matter when you're planning to manufacture millions of them.

There is no software manufacturing process to speak of -- "only" an
engineering process.  Our manufacturing process consists of
reproducing CD-ROMs, shrink-wrapping them, and putting the lot into
a box. That process, through the use of big cyclic redundancy check
codes, is almost perfect.  The main reason, then, that software
quality is tougher to achieve than widget quality is that we have to
contend with quality control and quality assurance over an
engineering process, where we are just as much pioneers as in any
other kind of engineering.

2.2. The Small Sample Problem

If you're doing statistical quality control for widgets, you may
start out by inspecting every widget you build but soon you
substitute more efficient and less costly sampling methods.  You
don't inspect every widget.  You inspect some fraction of them for
different things.  For some widgets, adequate samples might consist
of only 5% of the production run. Without sampling you run the risk
of improving productivity and therefore reducing manufacturing labor
content only to have your gains obviated by vast increases in
inspection labor content.  The number of samples you take depends on
the production run, defect probability, the probability of not
detecting defects -- a whole lot of stuff that's well understood and
for which there are excellent, reliable models.  The key point,
though, is that the sample size is statistically valid.  That's how
pollsters can contact a few thousand households and get useful
statistics for marketing or politics.
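
The widget-sampling arithmetic is simple enough to sketch.  A minimal
illustration in Python (the 2% defect rate and 95% confidence figures
are invented for the example, not Beizer's):

```python
import math

def sample_size_for_detection(defect_rate, confidence):
    # Smallest sample size n such that a lot with the given defect
    # rate yields at least one defective item in the sample with the
    # given confidence: 1 - (1 - defect_rate)**n >= confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - defect_rate))

# To catch a 2% defect rate with 95% confidence you need only:
print(sample_size_for_detection(0.02, 0.95))   # 149 widgets, not the whole run
```

The same formula says why pollsters get away with a few thousand
households: the required sample depends on the defect (or opinion)
probability, not on the size of the production run.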

What about software?  Because we're trying to control an engineering
process, it's the engineering process that we have to sample -- the
whole project -- not the lines of code.  In other words, an entire
project that took 18 months, consumed 100 work years, and peaked
with a staff of 100, is only one data point.  Things are little
better in maintenance.  Say you do two releases a year and have been
doing it without significant process change for five years.  You
then have 10 data points with 100% sampling.  And if you do the
statistics to determine what you can infer about the next release,
you find by using the appropriate tests of statistical significance
that you're not much better off than faith and gut feel.  And, of
course, you did change your process (I hope) with each release so
that you really can't combine the data without huge leaps of faith.
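
To see how little ten data points buy you, here is a sketch using
invented defect densities for ten releases (the numbers stand in for
five years of twice-yearly data; they are not from any real project):

```python
import statistics

# Hypothetical defect densities (bugs per KLOC) for ten releases.
releases = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 5.2, 4.0, 4.6, 4.3]

mean = statistics.mean(releases)
sem = statistics.stdev(releases) / len(releases) ** 0.5
t_crit = 2.262  # Student's t, 95% two-sided, 9 degrees of freedom

low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean {mean:.2f} bugs/KLOC, 95% CI ({low:.2f}, {high:.2f})")
```

The resulting confidence interval spans most of the observed
release-to-release variation, which is the point: with so few data
points, the "prediction" for the next release is barely better than
gut feel.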

The same problem occurs for supposed statistical evidence in favor
of this or that methodology such as Cleanroom.  We have a handful of
projects, each one of which constitutes only one data point, from
which statistical inferences are impossible. There's no easy way out
of this box.  Few software developing entities can ever accrue
enough projects or releases to permit the kind of statistical
analyses we can do for widget production.  The way out is honest
sharing of project statistics -- even with competitors -- especially
with your competitors because they're statistically the closest to you.

2.3. The Pesticide Paradox

When we have a good process in which QA does its job by feeding back
bug information to the developers, they respond by adopting
procedures that will prevent the same kind of bug.  The reaction to
bugs might be more focus during inspections, changes in languages or
compilers, changes in coding standards, and so on. That's good.  The
flip side of this coin, however, is that the programmer's very action
invalidates our primary quality control method -- testing.  Every test
technique carries a hidden assumption about the nature of the bugs
it's trying to discover.  When the programmers react to QA feedback
by avoiding those bugs, the effectiveness of the tests we've written
is eroded.  I call this the Pesticide Paradox because it is well-
known in agriculture that pests build up a tolerance to pesticides.
Hence the recent resurgence of measles and malaria.  Similarly,
software builds up a tolerance to tests.  This kind of interaction
is unheard of in widget testing.  How would the widget builder react
if you told her that the very act of measuring widget defects
destroyed the effectiveness of the measuring tool?

2.4. We Don't Share Data--Why?

Here's one QA area where the East is far ahead of the West.  They
share software project data; we don't.  Why not?  Here are some
reasons:

1.  Fear of Competition.  Sharing statistics exposes our process to
our competitors.  If we think we're ahead, then we've given them help
that we don't want to give them.  If we think we're behind, we've
given their sales force ammunition with which to sink us. Either
way, we lose.

2.  Fear of Liability and Litigation.  Owning up to bugs exposes us
to product liability suits and litigation.  We can't admit to the
possibility of bugs in our product and therefore we can't publish or
share the statistics lest some unhappy user gets hold of them and
uses them as evidence in a liability suit.

3. Fear of The Stock Market.  The public demands perfection and
severely punishes any vendor who admits to bugs.  The public has to
be educated about what is and isn't theoretically possible and what
is and isn't good software.  Meanwhile, don't share data because it
will hurt the stock price if it gets out.

None of these are special to software.  They're Western cultural
problems.  These fears and the courage to ignore them are part of
what it means to join the quality movement.  That's both hopeful and
depressing.  Hopeful because these barriers to sharing data will
come down as they come down for widgets, cars, and appliances.
Depressing because we aren't making as much progress on these fronts
as we should be making.

3. Risk Squared -- Critique and Warning

3.1. Doing Experiments

If we're going to be rational about risk management, then we have to
do some experiments with our software development process.  You can
dream up elegant models but without supporting empirical evidence
it's just mental gymnastics.  How do we do experiments here? How can
we conduct an experiment in which the subjects don't know that
they're subjects?  In medicine, they've learned to do single-,
double-, and triple-blind experiments.

Single-Blind Experiment.  The subject doesn't know if they're
getting the real medicine or sugar pills.

Double-Blind Experiment.  The subject doesn't know what they're
getting and the person who gives out the pills doesn't know who's
getting real pills or sugar.

Triple-Blind Experiment.  The subject doesn't know what they're getting,
the person handing out the pills doesn't know who got what, and the
person who evaluates the results doesn't know who got what.

We can't prevent programmers from knowing that they're using a new
methodology, so the conditions of a single-blind experiment are
impossible to achieve. The conditions of a double-blind experiment
don't seem to apply.  It's possible to keep the evaluator from
knowing who used what process, but our experimental problems center
around the easiest experimental condition -- single-blind.  Whatever
process variation you use in your hypothetical experiment the
subjects (programmers) can and will consciously or subconsciously
cheat to make the results come out as they wish.  If their job is on
the line and they're nervous, they'll just take the software home,
compile it and test it on their own PC and then bring it in to
follow the "correct" process.  If they want the experiment to fail,
for whatever reason, there's an infinite number of ways to achieve it.

There is a way out of this experimental impasse.  It's to use the
statistical methods that have long been used by epidemiologists.  If
you're trying to find the causes of smallpox, say, you can't very
well infect half of the population with the virus and keep the other
half uninfected as a control.  You have to let nature take its
course and keep copious records of what happened.  Then, if you have
enough data points (usually thousands of individuals) the
statistical methods of epidemiology will enable you to establish
correlation between factors.  If you have enough data points and
recorded enough factors, you may also be able to establish causality.
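
The epidemiological approach boils down to recording factors and
looking for correlations.  A minimal sketch, with invented per-project
numbers (inspection effort versus field defects; real data is exactly
what the text says we lack):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical records: inspection effort (hours/KLOC) versus field
# defects (bugs/KLOC) for six projects.  Purely illustrative numbers.
inspection = [2, 4, 5, 7, 9, 11]
field_bugs = [9, 8, 6, 5, 3, 2]
print(round(pearson(inspection, field_bugs), 3))  # strongly negative
```

With six projects the correlation is suggestive at best; the
epidemiologists' methods need thousands of data points before such a
number means anything.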

Unfortunately, there just aren't enough shared statistics with enough
measured factors around to warrant any kind of statistically valid
conclusions concerning the effectiveness of this or that
methodology.  Giant software developers such as IBM, Microsoft, and
AT&T might be gathering such data, but if they are, they aren't
making it public.

3.2. The Risks and Problems

3.2.1. General

Statistics are seductive.  We have an unhealthy tendency to believe
things when they're presented to several significant digits and
especially if they're the end result of a lot of number crunching.
Here are a few things to think about concerning the possible validity
of an offered model.

3.2.2. The User Profile Problem

A major component of software risk should be the probability that
the software will fail in use.  We simulate the users' behavior by
using a random test case generator and record the failures that
actually occur -- e.g., crashes, instances of data corruption, etc.  The
assumption is that we know or can predict how users will behave.  If
we don't have a statistically valid characterization of the users'
behavior then any prediction of failure probability and therefore
any assessment of risk could be fallacious.
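
Profile-driven random test generation is easy to sketch; the hard
part, as the text says, is knowing the profile.  The weights below
are invented for illustration:

```python
import random

# A hypothetical usage profile: operation -> relative frequency.
profile = {"open": 0.30, "edit": 0.45, "save": 0.20, "export": 0.05}

def generate_test_sequence(profile, length, seed=0):
    # Draw a random operation sequence whose long-run frequencies
    # match the profile.
    rng = random.Random(seed)
    ops, weights = zip(*profile.items())
    return rng.choices(ops, weights=weights, k=length)

print(generate_test_sequence(profile, 12))
```

Any failure probability inferred from such runs is only as valid as
those four weights -- which is precisely the user profile problem.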

Much of the experience with statistical models of software
reliability comes from telephony. Unfortunately, telephony is one of
the few areas in which user behavior is predictable.  Telephone
people have been gathering user behavior statistics since 1904 and
can provide minutely detailed statistics about almost every aspect
of telephone usage.  Most changes made to telephone systems are
invisible to the user.  Those that are visible, such as new
features, are introduced and accepted very slowly, so that the
statistics change slowly -- over decades rather than from release to
release.  Telephony is one of the few areas in which such statistics
are available and valid.  To extrapolate from telephony experience
is dangerous.

In most other software areas, the new features are directly
observable by the user and are used by them as soon as they are
introduced.  No one has been able to predict how new features will
affect user behavior.

Take a list of ten new features: three are considered essential,
five are "nice" but not essential, and two came along for the ride.
You know which ones the users focus on: they rave about the two
"free" features, ignore the "essential" features, and exhibit usage
statistics that no one could have predicted.  Even careful market
research doesn't do the trick because saying that you want something
and then actually using it are two different things.  It boils down
to the fact that no one has been able to predict how the new
release's features will change the users' behavior. Therefore, the
validity of any prediction of failure under use is undermined by
this fundamental uncertainty.

3.2.3. The Problem of Scale

All engineering areas experience the problem of scale.  The scale
problem was first explored by Galileo who observed that because area
goes up as the square of the length and volume (and therefore
weight) goes up as the cube of the length, merely increasing all
dimensions proportionally does not usually work. Thus, an ant can
lift several times its own weight but if the ant were scaled up to
the size of an elephant, unlike the sci-fi movies, it would not be
able to even hold itself up, never mind tossing tanks and buildings
around.  Scale effects are found in every area of human endeavor --
it would be foolish to assume that they don't exist for software.

We know that there are scale effects because we have seen excellent
prototype systems developed by one or two persons turn into hundreds
or thousands of work years when done for a commercial market.  We
all know that small groups appear to be more productive per work-
hour than big groups.  We can identify one probable scale effect --
integration bugs and integration testing.  Integration bugs
arise at the interfaces (usually internal) between components.
Integration testing is aimed at displaying such bugs.  Interfaces do
not usually grow linearly with program size, but vary from n log n
to n^2, where n is the number of components.  Unwanted interactions
between components also seem to follow a similar law.  We also know
that some test techniques that work well for big systems, such as
stress testing, are often useless for small systems.  While we don't
understand these scale effects and have yet to discover them all, we
know that they exist.
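
The growth rates above are easy to tabulate.  A small sketch (the
component counts are arbitrary examples):

```python
import math

# Upper bound on pairwise interfaces among n components: n*(n-1)/2,
# i.e. order n**2; compare with n*log2(n) growth.
for n in (10, 100, 1000):
    pairwise = n * (n - 1) // 2
    print(f"{n:5d} components: {pairwise:7d} possible interfaces "
          f"(n log n ~ {round(n * math.log2(n))})")
```

Going from 100 to 1000 components multiplies the worst-case interface
count by roughly 100, not 10 -- one concrete reason results measured
at one scale don't transfer to another.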

Much of the experience with statistical models of software
reliability has been gained on big to super-scale software (20
million lines of code and more), again in telephony.  Super-scale
software is probably more predictable than medium and small scale
software because the many people involved, the many different
components, all lend to a statistical smearing (e.g., the law of
large numbers) that can make models for such systems robust and
reliable.  But until we can recognize the scale effects that operate
between small, medium, big, and super-scale software, any
statistical model based on one scale cannot be freely used at
another scale without substantial risk.

3.2.4. The Software Infrastructure

Our software is built-up in layers.  Take a PC user who's working
with fixed spreadsheet templates over a spreadsheet program.  The
spreadsheet works under Desk-Top, which in turn works under Windows,
which is under MS-DOS, a BIOS, and then the hardware -- but the
hardware itself is micro-coded.  That's at least six levels of
infrastructure between the user and the hardware that actually
executes the program.  Today, there's often a shell program, a
data-base program, and who knows what else.  To the above, we should
also add the compiler, linker, and loader.  Early software, written
in machine language, didn't even have an assembler as an
intermediary between the writer and the hardware.  In addition to
the vertical layering of software there is increased complexity in
the form of horizontally interacting software, of which TSR programs
are the primary example.  There has been, and continues to be, a
trend toward increasing the size and complexity of the software
infrastructure.

When a system crashes or corrupts data today, it is no longer
possible to point to a specific program and say that that program
was the cause.  More and more, many instances of crashes and
corruption forever remain mysteries.  More and more, the problem is
caused by an unfortunate interaction between otherwise correctly
working programs.  The user is only interested in symptoms, not
causes.  If your program was installed after program X, it will do
you no good to say that the problem was with program X (assuming
that's true) and not your program.  All you can do is wish that they
had installed the software in the opposite order so that X's writer
would be blamed rather than you.

It is not possible to test a piece of software in all possible
environments, with all possible co-resident programs, with all
possible arrangements of the software infrastructure.  It is not
possible to do these tests and therefore it is not possible to
predict the software's failure probability under use. Consequently,
we have yet another barrier to rational risk assessment.  You might
argue that this is not realistic or fair because the user is blaming
us for unpredictable interactions with other software that might
actually be the cause of the problem.  There's justice in that
objection, but the users aren't interested in justice -- just in
getting their work done.

"Risk" ultimately translates into the probability of user
dissatisfaction.  If you get into an argument with the user, they
are perforce dissatisfied.  If, as I predict, the number of such
mystery crashes caused by the infrastructure and environment
increases, so will the uncertainty of any statistical predictions
about our software.

3.2.5. The Issue of Confidence

There are two kinds of confidence to consider.  One is statistical
confidence and the other is human confidence.  Unfortunately, the two
terms don't at all mean the same thing, although there is a
philosophical connection between them.  Human confidence boils down
to a warm feeling in the tummy.  It is essentially a subjective,
non-scientific metric.  Statistical confidence is defined as the
probability that a measured value will lie between two other values.
For example, the mean value of a variable is calculated as 100.  We
might then state that there is a 95% probability that the actual
value will lie between 95 and 105.  The numbers 95 and 105 are then
called the 95% confidence limits for 100.
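
The essay's example can be reproduced directly.  In the sketch below,
the standard error of 2.551 is reverse-engineered so the limits land
at 95 and 105; it is not a figure from the essay:

```python
from statistics import NormalDist

mean, std_err = 100.0, 2.551
z = NormalDist().inv_cdf(0.975)        # about 1.96
low, high = mean - z * std_err, mean + z * std_err
print(f"95% confidence limits: ({low:.1f}, {high:.1f})")
```

The interval says there is a 95% probability that the actual value
lies between the two limits -- a precise, if narrow, statement, quite
unlike a warm feeling in the tummy.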

The notion of statistical confidence is an essential component of
risk assessment.  For example, you use a hypothetical model to
predict the date when the software will be ready for use.  The model
predicts that it will be ready on December 30, 2001 at 3:04 pm.  Now
when you ask for the 95% confidence limits you are told "plus-minus
one century".  If you wanted tighter bounds, you might be told
"plus-minus one week, with a probability of 10^-100".  Without some
notion of confidence limits, any statistical statement is
meaningless and therefore any risk assessment based on such
statistics is dangerous.

Several researchers, such as Hamlet, have pointed out that no amount of
discrete testing (e.g., feature-by-feature, statement coverage,
etc.) gives you the kind of data you need to predict statistical
confidence -- only random testing based on a proper user profile can
do that -- but we've seen the problems that those kinds of tests bring.

3.2.6. All Bugs Aren't Equal

Some bugs are just annoying while others are fatal. But given enough
annoying bugs the cumulative effect is no less fatal from a user's
point of view.  Any useful notion of risk must incorporate not just
the probability of bugs but the probability of bugs of different
types as a step in measuring the probability that the user will not
accept the software as ready. Although most software suppliers use
four or five levels of bug severity, none of the available software
reliability models incorporates the notion of bug severity (or,
rather, of symptom severity).  Bug severity is a subjective
notion because one user can't tolerate any bugs while another will
accept those bugs as long as reasonable work-arounds are provided.

We're really not concerned about bug severity, but symptom severity.
Unfortunately, one bug can exhibit many symptoms and a given symptom
can be caused by many different bugs.  Furthermore, we have no
method for statistically characterizing the relation between bugs
and their symptoms and as a consequence, no way to relate what we've
learned in detailed feature and structural testing to usage risk.
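
One could imagine a severity-weighted defect score as a first step;
the sketch below uses invented weights and counts, and its crudeness
is the point -- no accepted model maps such a score to usage risk:

```python
# Hypothetical severity weights, mirroring the four-or-five severity
# levels most suppliers use, and an invented set of open bug counts.
weights   = {"fatal": 100, "major": 20, "minor": 3, "annoying": 1}
open_bugs = {"fatal": 0,   "major": 2,  "minor": 11, "annoying": 40}

score = sum(weights[sev] * n for sev, n in open_bugs.items())
print(score)   # 0 + 40 + 33 + 40 = 113
```

Note how the forty "annoying" bugs contribute as much as the two
"major" ones: given enough annoying bugs, the cumulative effect is no
less fatal.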

3.3. Status of Present Risk Models

I don't think that we have any risk models today.  At least no risk
models whose usage does not entail substantial risks.  Let's
summarize the construction of a hypothetical risk model, top-down,
to see where the conceptual and statistical gaps are:

1. The risk policy model.  Do we have a quantified, quantifiable
model of what we mean by risk?  Has our risk posture been expressed
in quantitative terms?

2. The Consequential Costs.  Any notion of risk can be boiled down
to dollars and cents.  We're liable for direct damages caused by our
software, we may be liable for consequential damages, and we will
certainly suffer in the marketplace if our software is bad.  Can we
model these costs, and can we get the hard data needed to support
such a model?

3. The Probability of Specific Failure Types.  What is the
probability of various failure types, so that we can find the
probability that the consequential costs of 2 above will occur?
That is, we should multiply the cost of events by the probability of
those events to get an effective exposure.
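The cost-times-probability arithmetic in point 3 can be sketched directly.  The event names, costs, and probabilities below are invented for illustration:

```python
# Hypothetical failure events: (name, consequential cost in dollars,
# estimated probability of occurrence over the release's lifetime).
events = [
    ("data corruption", 250_000.0, 0.002),
    ("one-hour outage",  20_000.0, 0.05),
    ("cosmetic glitch",     500.0, 0.30),
]

def effective_exposure(events):
    """Expected cost: sum over event types of cost x probability."""
    return sum(cost * prob for _name, cost, prob in events)

print(round(effective_exposure(events), 2))  # 1650.0
```

The hard part, of course, is not the arithmetic but obtaining defensible numbers for the cost and probability columns.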

4.  Bug--Symptom Mapping.  Testing tells us about bugs.  Proper
statistical testing based on usage profiles tells us about failure
probabilities (maybe).  How do we convert our testing and debugging
experience into a prediction of failure probabilities?
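A toy model makes the bug-symptom mapping problem concrete.  All numbers below are hypothetical, and bugs are assumed to trigger independently; the essay's point is precisely that we lack the data to justify either:

```python
# Hypothetical per-bug trigger probabilities, P(bug fires in a session).
bug_prob = {"B1": 0.10, "B2": 0.05}

# P(symptom | bug fired): one bug can show many symptoms, and one
# symptom can come from many different bugs.
symptom_given_bug = {
    ("crash", "B1"): 0.50,
    ("crash", "B2"): 0.20,
    ("wrong output", "B1"): 0.30,
}

def symptom_probability(symptom):
    """P(at least one bug produces the symptom), assuming independence."""
    p_none = 1.0
    for bug, p_bug in bug_prob.items():
        p_none *= 1.0 - p_bug * symptom_given_bug.get((symptom, bug), 0.0)
    return 1.0 - p_none

print(round(symptom_probability("crash"), 4))  # 0.0595
```

Filling in a table like `symptom_given_bug` from real test and field data is exactly the statistical characterization we do not yet have.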

5. The User Profile Problem.  What are the users going to do?  How
do we find out?  How do we eliminate the uncertainty associated with
new release usage?  What kind of polling, sampling, etc., should we
use?

6.  The User Bug-Reaction Model.  How will the user react to bugs?
Do we run controlled experiments to find out (e.g., ship buggy
software to a selected number of users)?  How will our competitors
react to these bugs?  All that's part of risk.

7.  Scale Effects.  Size is only one component of scale effects.
There are others: number of components, number of programmers on the
project, percentage of new code, number of data objects, number of
internal interfaces, number of external interfaces; there's probably
a hundred more factors with scale effects.  Have we got these under
control so that we can use data learned on previous releases and
products to predict something about this release and product?

8.  Infrastructure, Environment, and Shared Blame. We are not alone
in this computer.  It doesn't matter if it is fair or not, user
perception is what shapes risk.  What about interactive bugs for
which no one software builder is to blame?  If these dominate the
failures, does our pristine model that ignores the environment, co-
resident programs, and the infrastructure have any meaning at all?

9.  Statistical Validity.  Finally, there's the crucial question of
statistical validity.  Do we have enough data, have we gathered
enough information about enough factors to warrant a statistically
valid inference?  If you think this is easy, think about how
difficult it is for the medical profession to prove causality (e.g.,
factor X causes disease Y) and how often subsequent events show them
to be wrong.

4. To Do or Not to Do

4.1. What Not To Do

Our present state of ignorance is so vast that caution is easier
than action.

1. Don't be blinded by numbers and fancy math. Lots of numbers and
fancy equations don't make ill-conceived and ill-founded models
valid.  Many of us tend to assume that incomprehensible math must be
valid.  Researchers solve the problems they can solve and often
those are far short of reality.  They work with models.  A model
aircraft might teach us something about aerodynamics, but I wouldn't
book a flight to Chicago on one.

2.  Don't use naive statistics.  It's a lot easier to make a
statistical error than it is to do statistics correctly.  Most of
what we do should be backed-up by professional-level statistics.
Some of that is very technical.  It may take several decades of
research and use before software statistics progresses to the point
where the right formulas can be pulled from a handbook.  Right now,
even professional statisticians get confused by the issues; it's no
place for an amateur.

3. Don't make invalid inferences.  Causality is just about the
hardest thing to prove statistically. Just because things correlate
doesn't mean there's causality.  For example, the correlation
between program size and listing weight is nearly perfect, but we
wouldn't conclude that to shrink a program so that it fits into
limited RAM we should switch to a lighter weight paper.
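The trap can be checked numerically with made-up data: program size and listing weight correlate almost perfectly, yet no one would infer causality.  A hand-rolled Pearson correlation (no external libraries) makes it concrete:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

kloc   = [10, 25, 50, 80, 120]        # program size (invented data)
weight = [210, 515, 990, 1610, 2390]  # listing weight in grams

print(pearson(kloc, weight))  # > 0.999, yet lighter paper shrinks nothing
```

Correlation this strong is exactly what makes the invalid inference so tempting.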

4.2. What to Do

1.  Be critical.  It's all models.  Models are based on assumptions.
Understanding the underlying assumptions is more important than
understanding the math or the model.  If the assumptions don't fit
your situation, the model, no matter how erudite, also won't fit.
If the model doesn't fit you're taking great risks if you use it.

2.  Accept our ignorance.  We'd like to know more about how to do
quantitative risk analysis, but we don't.  I wish it were otherwise
but it isn't.  The fact that the problem is important and urgent
doesn't change our ignorance.  Accepting that is a prerequisite to
being critical.

3. Join the research crowd.  Like it or not, this whole subject is a
research area.  If you attempt to construct quantitative risk
models, you're doing research.  If you apply someone else's model,
you're also doing research.  If you change your policy or process
based on such models, you're doing research. Be sure you're willing
to accept the risk associated with any research.

4.  Share Your Experiences.  You'll probably never generate enough
valid data from your own organization to warrant statistically valid
inferences.  Maybe no one can--not even the giant software
producers.  You have to share your experience by sharing data,
publishing statistics, and all the rest.  If you don't, then better
processes and valid risk management methods will come from countries
such as Japan and Taiwan that do share data.

5.  Have Faith.  The whole question of software reliability,
quantitative risk modeling, and all that we've discussed above is
under intensive research. Progress is being made on all fronts.
Although we haven't achieved our goals as yet, there's every reason
to be optimistic about the eventual evolution of valid quantitative
methods for determining risk that are as robust as those that
evolved for widgets many years ago.


         First International Workshop on Quality Assurance
               and Testing of Web-Based Applications
                      held in conjunction with
                            COMPSAC 2004
                  September 28-30, 2004, Hong Kong


The Internet is rapidly expanding into all sectors of our society
and becoming a heterogeneous, distributed, multi-platform,
multilingual, multimedia, autonomous, cooperative wide area network
computing environment.  Web-based applications are complex,
ever-evolving, and rapidly updated software systems.  Testing and
maintaining web-based applications is a nightmare.  Traditional
quality models, testing methods and tools are not adequate for web-
based applications because they do not address problems associated
with the new features of web-based applications. At present, web-
based applications testing and maintenance are still an unexplored
area and rely on ad hoc testing processes. Little has been reported
on systematic testing methods and techniques, quality metrics, and
dependability of web-based applications, to mention just a few.

                       Scope of the Workshop

This workshop seeks position papers that present preliminary ideas,
partially complete research results, or discussion of issues or
concerns on quality assurance and testing of web-based applications.
Papers that describe more mature research results should be
submitted to the main conference. Topics of interest of this
workshop include, but are not limited to, the following:

                     Challenges of Web Testing

Web-based applications exhibit characteristics that are very
different from conventional software systems. This theme invites
papers that investigate testing problems associated with web-based
applications. Topics may include:

  * Analysis of testing problems introduced by web-based applications
  * Analysis of testing problems as challenges to existing testing methods
  * Applicability of existing testing methods to web-based applications
  * Characteristics or properties of new types of testing methods

                   Models, Methods and Techniques

Effective quality assurance and testing of web-based applications
require a systematic approach. This theme seeks proposals of test
models, methods and techniques for web-based applications.  Test
models, methods and techniques deal with the representation of test
artifacts, procedures for the analysis of the component under test
(CUT), and the generation of test cases/test data. Topics include but
are not limited to the following:

  * Test models and metamodels
  * Verification and validation
  * Analysis and testing
  * Test criteria
  * Architecture and framework
  * Reverse engineering
  * Exception handling
  * Testing for security, privacy and trustworthiness
  * Tools and environments

                   Process and Management Issues

A process is a sequence of macro-level activities performed to
accomplish a significant task. This theme deals with processes and
management activities for quality assurance and testing of web-based
applications. Example topics are:

  * Quality management
  * Human factors
  * Web configuration management
  * Metrics and indicators
  * Maintenance and evolution
  * Content management
  * Process improvement
  * Quality of service
  * Security and privacy (as quality metrics)
  * Dependability
  * Fault tolerance and automatic recovery models

               Practical Applications and Experience

Reports on quality assurance and testing of practical web-based
applications or industrial experiences are strongly encouraged.
Topics include but are not limited to:

  * Quality assurance and testing of E-Commerce applications
  * Quality assurance and testing of E-Government applications
  * Quality assurance and testing of E-Science applications
  * Quality assurance and testing of Wireless applications
  * Security and privacy in practice
  * Lessons learned

                         Technology Impact

This theme is concerned with the impact of related technologies on
quality assurance and testing of web-based applications, as well as
the impact of quality assurance and testing of web-based
applications on other technologies.  Example technologies are:

  * Bio-metric technology
  * Data warehouse and data mining
  * Agent technology
  * Autonomic computing
  * Component software engineering
  * Wireless communication
  * Mobile computing
  * Service-oriented computing
  * Ubiquitous/pervasive computing
  * Network centric computing
  * Web services technologies
  * Grid computing
  * Open grid service architectures
  * Grid middleware

Steering Committee Chair
   Stephen Yau, Arizona State Univ., USA

Program Co-Chairs
   David Kung, University of Texas at Arlington, USA
   Hong Zhu, Oxford Brookes University, UK


                      eValid: A Quick Summary

Readers of QTN probably are aware of SR's eValid technology offering
that addresses website quality issues.

Here is a summary of eValid's benefits and advantages.

  o InBrowser(tm) Technology.  All the test functions are built into
    the eValid browser.  eValid offers total accuracy and natural
    access to "all things web."  If you can browse it, you can test
    it.  And, eValid's unique capabilities are used by a growing
    number of firms as the basis for their active services
    monitoring offerings.

  o Functional Testing, Regression Testing.  Easy to use GUI based
    record and playback with full spectrum of validation functions.
    The eVmanage component provides complete, natural test suite
    management.

  o LoadTest Server Loading.  Multiple eValid browsers play back
    multiple independent user sessions -- unparalleled accuracy and
    efficiency.  Plus: No Virtual Users!  Single- and
    multiple-machine usage with consolidated reporting.

  o Mapping and Site Analysis.  The built-in WebSite spider travels
    through your website and applies a variety of checks and filters
    to every accessible page.  All done entirely from the users'
    perspective -- from a browser -- just as your users will see
    your website.

  o Desktop, Enterprise Products.  eValid test and analysis engines
    are delivered at moderate costs for desktop use, and at very
    competitive prices for use throughout your enterprise.

  o Performance Tuning Services.  Outsourcing your server loading
    activity can surely save your budget and might even save your
    neck!  Realistic scenarios, applied from multiple driver
    machines, impose totally realistic -- no virtual user! -- loads
    on your server.

  o Web Services Testing/Validation.  eValid tests of web services
    begin by analyzing the WSDL file and creating a custom
    HTML testbed page for the candidate service.  Special data
    generation and analysis commands thoroughly test the web service
    and automatically identify a range of failures.

  o HealthCheck Subscription.  For websites up to 1000 pages, eValid
    HealthCheck services provide detailed analyses of smaller
    websites in a very economical, very efficient way.

  o eValidation Managed Service.  Being introduced this Fall, the
    eValidation Managed WebSite Quality Service offers comprehensive
    user-oriented detailed quality analysis for any size website,
    including those with 10,000 or more pages.

       Resellers, Consultants, Contractors, OEMers Take Note

We have an active program for product and service resellers.  We'd
like to hear from you if you are interested in joining the growing
eValid "quality website" delivery team.  We also provide OEM
solutions for internal and/or external monitoring, custom-faced
testing browsers, and a range of other possibilities.  Let us hear
from you!


                            TAV-WEB 2004
   Workshop on Testing, Analysis and Verification of Web Services

                   In conjunction with ISSTA 2004
              (ISSTA 2004 is co-located with CAV 2004)
             Boston, Massachusetts, USA, July 11, 2004


Topics of interest include:  Testing, analysis and verification
techniques and tools for web services that address the demands
created by this new domain including
 * XML-based messaging, asynchronous communication,
 * Web service descriptions,
 * Coordination and composition of Web services,
 * Interaction and choreography among Web services,
 * Performance measurement and reasoning about resource usage,
 * Principled techniques for optimizing performance, and
 * Formal models for describing and reasoning about Web services.

 Tevfik Bultan
 Shriram Krishnamurthi


                    9th International Workshop on
              Web Caching and Content Distribution (WCW9)
                           Beijing, China
                   18 October - 20 October, 2004

          Proceedings will be published by Springer-Verlag

The International Workshop on Web Caching and Content Distribution
(WCW) serves as the premiere meeting for researchers and
practitioners to exchange results and visions on all aspects of
content caching, distribution, and delivery. Starting from basic
caching, research in content distribution has broadened its scope to
cover practically all areas related to the intersection of content
and networking, including such areas as data grid computing, peer-
to-peer computing, utility computing, edge computing, application
networking, pervasive networking and content computing. Building on
the success of the previous WCW meetings, WCW9 plans to form a
strong technical program that covers the newest and most interesting
areas relating to content services as they move through the network.

The workshop solicits technical papers related to Internet content
caching and replication, content delivery, and content services
networking. Particular areas of interest include:

- Adaptive/dynamic replication and caching of Web content
- Application/service networking
- Caching and edge services for the wireless Web
- Caching and replication for grid computing
- Caching and replication in peer-to-peer content delivery networks
- Caching and replication for utility computing
- Caching and replication for Web services
- Consistency management
- Content placement and request routing
- Data integrity and content security for the Web
- Dynamic-content caching and edge services
- Empirical studies of deployed content delivery systems
- Geographical influences on caching and replication
- In-stream content modification
- Memory and storage management for content caches
- Overlay networks for content delivery
- Peering and content services internetworking
- Scalability issues in Web caching and replication
- Security and availability of Web services
- Streaming media caching and QoS
- Web workload analysis and characterization
- Wide-area upload and "content gathering"

  Chi-Hung Chi, National University of Singapore
  Lam Kwok Yan, Tsinghua University


              Foundations of Computer Security - FCS'04
               (affiliated with LICS'04 and ICALP'04)

                  Turku, Finland, July 12-13, 2004


Computer security is an established field of Computer Science of
both theoretical and practical significance.  In recent years, there
has been increasing interest in logic-based foundations for various
methods in computer security, including the formal specification,
analysis and design of cryptographic protocols and their
applications; the formal definition of various aspects of security
such as access control mechanisms, mobile code security and
denial-of-service attacks; trust management; and the modeling of
information flow and its application to confidentiality policies,
system composition, and covert channel analysis.

The aim of this workshop is to provide a forum for continued
activity in this area, to bring computer security researchers in
contact with the LICS'04 and ICALP'04 communities, and to give LICS
and ICALP attendees an opportunity to talk to experts in computer
security.


We are interested both in new results in theories of computer
security and also in more exploratory presentations that examine
open questions and raise fundamental concerns about existing
theories.  Possible topics include, but are not limited to:

Composition issues                   Authentication
Formal specification                 Availability and denial of service
Foundations of verification          Covert channels
Information flow analysis            Cryptographic protocols
Language-based security              Confidentiality
Logic-based design          for      Integrity and privacy
Program transformation               Intrusion detection
Security models                      Malicious code
Static analysis                      Mobile code
Statistical methods                  Mutual distrust
Trust management                     Security policies


                   1st International Workshop on
            Web Services and Formal Methods (WS-FM 2004)

                 February 23-24, 2004, Pisa, Italy
   Workshop affiliated to COORDINATION 2004, February 24-27, 2004


The technical program is available at:

Information about registration, travel and accommodation can be
found at the COORDINATION 2004 web site:

    ------------>>> QTN ARTICLE SUBMITTAL POLICY <<<------------

QTN is E-mailed around the middle of each month to over 10,000
subscribers worldwide.  To have your event listed in an upcoming
issue E-mail a complete description and full details of your Call
for Papers or Call for Participation to .

QTN's submittal policy is:

o Submission deadlines indicated in "Calls for Papers" should
  provide at least a 1-month lead time from the QTN issue date.  For
  example, submission deadlines for "Calls for Papers" in the March
  issue of QTN On-Line should be for April and beyond.
o Length of submitted non-calendar items should not exceed 350 lines
  (about four pages).  Longer articles are OK but may be serialized.
o Length of submitted calendar items should not exceed 60 lines.
o Publication of submitted items is determined by Software Research,
  Inc., and may be edited for style and content as necessary.

DISCLAIMER:  Articles and items appearing in QTN represent the
opinions of their authors or submitters; QTN disclaims any
responsibility for their content.

TRADEMARKS:  eValid, HealthCheck, eValidation, TestWorks, STW,
STW/Regression, STW/Coverage, STW/Advisor, TCAT, and the SR, eValid,
and TestWorks logo are trademarks or registered trademarks of
Software Research, Inc. All other systems are either trademarks or
registered trademarks of their respective companies.

        -------->>> QTN SUBSCRIPTION INFORMATION <<<--------

To SUBSCRIBE to QTN, to UNSUBSCRIBE a current subscription, to
CHANGE an address (an UNSUBSCRIBE and a SUBSCRIBE combined) please
use the convenient Subscribe/Unsubscribe facility at:


               Software Research, Inc.
               1663 Mission Street, Suite 400
               San Francisco, CA  94103  USA

               Phone:     +1 (415) 861-2800
               Toll Free: +1 (800) 942-SOFT (USA Only)
               FAX:       +1 (415) 861-9801
               Web:       <>