Friday, February 24, 2023

ChatGPT, COVID, and the Almighty Underwater Chicken

It feels like everyone writes about ChatGPT nowadays, and there are good reasons for it. ChatGPT is an impressive technology. It does not provide answers as sequences of links or other oddities. It writes text - text that is understandable to everyone who can read and write.

Was there ever a new technology that was suddenly so present everywhere? Just open a newspaper and you will find articles about ChatGPT. ChatGPT is a hype. Whether it is worth the hype is a different question.

Whenever a new hype appears - especially one that could have an impact on society - it makes sense to think about technology, evidence, and other related topics. But let's first start with the story of the Almighty Underwater Chicken.


The Almighty Underwater Chicken

There is a great TV show for children I like a lot, because it tells something about the relationship between epistemology and social life in an understandable and entertaining way: Roger and the Rottentrolls - Series 3, Episode 4: The Almighty Underwater Chicken. 

It is about some hand puppet folk called the Rottentrolls. One of them is at a river and suddenly thinks he has seen a chicken in it (the narrator states that the Rottentroll should have noticed that he just saw the picture of a chicken on a plastic bag on the riverbed, but he didn't). When he comes home and tells the others about the chicken, someone replies that it was probably just the picture on a plastic bag. But another Rottentroll appears (one who likes playing tricks on the others) and explains that it is the appearance of the Almighty Underwater Chicken. That chicken appears every now and then and makes everyone's wishes come true (as long as one believes in it). The rather critical Rottentroll has his doubts, but he is told that the Almighty Underwater Chicken appears in many mysterious ways. So, he just joins in: "Hurray the Almighty Underwater Chicken".

Days later (the Rottentrolls already worship the Almighty Underwater Chicken), Princess Kate appears and, after she sees what's going on, she decides to disprove the story of the Almighty Underwater Chicken: she asks one Rottentroll (who wished to be able to juggle) to show the following day whether he is able to juggle.

What happens next is that during the night this Rottentroll becomes aware that he cannot juggle. But since he does not want to disappoint the others (nor does he want the others' dreams to remain unfulfilled), he practices juggling all night - and the next morning he can juggle.

Well, the end of the story is that the Rottentroll who invented the Almighty Underwater Chicken tells everyone that he made it up.

One can easily see what the Almighty Underwater Chicken has to do with epistemology. But let's forget the Almighty Underwater Chicken for a moment. And let's take a look at epistemology and social life from a different angle.


Let's go back a year or two: COVID

Let's go back a year or two. COVID was present everywhere. People died, but there were people who had some ideas about how to reduce the problem. What happened (well, at least in Germany) was that even the public news reported on studies. Vaccines were invented, some of them were more effective than others, some disappeared. People were suddenly aware of the FDA (or rather its European counterpart, the EMA). Suddenly, people spoke about evidence. People were aware that it is not enough that someone just invents a new technology. An invention has to prove its benefit. The risk of the technology has to be studied before it is widely applied. And people were aware that neither anecdotes nor single cases are sufficient to argue for or against a new technology. Even on the bus, people spoke about double-blind studies.

Back then, there were good reasons to assume that the COVID tragedy would have some positive effect on society. It felt like a sudden step towards enlightenment. There were some counter-reactions as well, but it felt like a majority of people suddenly recognized the value of science: the critical (but non-subjective) testing of technology, independent of the beliefs of the experimenters.

Now, two years later, there is not much left of this optimistic impression. It looks like we are falling back into old habits.


The Appearance of ChatGPT

Suddenly, this amazing technology is online. And it is shocking to see how we welcome it with open arms. While there is every now and then a report telling us that ChatGPT might not be the last word in wisdom, we find countless highly educated people who share their positive and euphoric ChatGPT anecdotes. And one finds countless examples where even high school teachers integrate ChatGPT into teaching - with the argument that pupils should learn how to use a technology (often with the additional comment that they should be taught a critical examination of new technology).

It is shocking.

There is a new technology whose effect is not known. We do not know whether its integration into teaching has negative consequences. We do not know whether it is reasonable to make this technology openly available. We do not know how often ChatGPT could have bad or terrible consequences. But there are a number of positive anecdotes about it.

One could argue that the previous thoughts are just the typical reaction of an overcautious person. And one could argue that this kind of argumentation ultimately just destroys innovation. Just to remind ourselves: "it kills innovation" was the argument against the FDA - the organization that stopped innovators from killing people with their products.

There are countless articles about the need for and the success of the FDA, and there is no need to repeat them here. But just to be clear: the huge majority of things that are currently just a few meters away from you have been extensively tested as required by law (not only the food that you eat, but also the machines that you are currently using to read this text, ...).

Let's make our life easy. Let's ask some trivial questions: 
  1. How often do you let your child speak alone with a complete stranger? (I know, it is a typical rhetorical trick to bring in a child, but I assume the argument is clear)
  2. How often do you assume a complete stranger could help when you have questions or problems?
  3. How often do you believe in what a complete stranger says?
One could argue that all the previous questions are stupid: people have developed mechanisms to decide whether or not a stranger can be trusted. This is correct. But none of these mechanisms apply, for the simple reason that a software product is not human - although it might give someone the feeling of interacting with a human, because ChatGPT uses a human communication channel: it formulates text.

And to make one thing clear: we are far away from having any standardized testing procedures for technologies such as ChatGPT. In other words: no one knows today what exactly ChatGPT is able to provide.


Back to the Almighty Underwater Chicken

The story of the Almighty Underwater Chicken sounds typical for a children's story: there is some strange phenomenon, some misunderstanding, a villain, some blind believers, and an unbelievable naivety of some characters.

But whenever I read just another anecdote about ChatGPT, I have this strange feeling that we are much closer to the Rottentrolls than to the educated society we should be. Shouldn't we have learned to be a little more cautious with new technology? Shouldn't we have learned to be a bit more cautious with untested products provided by multi-billion-dollar companies?

We should make it more and more explicit that ChatGPT is far from being a tested technology. We should ask lawmakers to propose testing procedures for this kind of technology before it is provided to countless people (and especially to kids in school). And whenever someone tells another anecdote about ChatGPT, it is our duty to remind people that an anecdote cannot replace knowledge.

Actually, I do believe that ChatGPT is a powerful technology that will massively influence our lives in the future - probably even more than internet search engines did. But before we join in the cheers about the Almighty Underwater Chicken, and before we spread the word that it makes our wishes come true, let's test it first.

We should not and must not stop demanding evidence. For the simple reason that one increasingly gets the impression of being surrounded by people who do not stop saying: "Hail the Almighty Underwater Chicken".

We must be aware that it finally might just turn out to be a plastic bag - and hopefully nothing more dangerous. Hurray the Almighty Underwater Chicken? No, forget it.

Feel free to leave comments.

Saturday, June 11, 2022

Please, Stop Complaining About Missing Generalizability of Code Examples in Experiments

For years, I have heard and read complaints that studies do not generalize. I mostly get such responses from reviewers who argue why they believe that one of my experiments doesn't generalize. Actually, I have heard and seen such complaints not only about my studies but about other experiments as well. As a result, such experiments don't get published.

There is nothing wrong if bad papers do not get published. No one wants to have wrong results in the literature. But it is bad if results don't get published because of someone's belief that the results do not generalize. And it is even more problematic if results do not get published because the doubts about missing generalization are just the consequence of some misunderstandings about experimentation.

To summarize the following text: Please, stop complaining about missing generalizability of code examples in experiments.

Unfortunately, it takes some space to explain this in more detail.

On Controlled Experiments

Controlled experiments are quite simple. Their goal is to measure something in a situation where everything that can be controlled is controlled. And in case there are things that cannot be controlled (so-called confounding factors), experimenters should either try to avoid or to measure them. 

In the simplest case there is one dependent variable (such as time to completion in a programming task) and some independent variables (such as certain techniques that are used, code styles, etc.) -- variables that are intentionally varied by the experimenter. The independent variables are those things that are in the focus of the experimenter, i.e. those things that are studied.

After executing the study, the experimenter checks whether the variations of the independent variables have any effect on the dependent variable, using some statistical procedures. The whole idea behind experiments is quite trivial.

A Simple Example: IfJava vs. IfTrue

Let's consider a possible study of a classical AB experiment (one independent variable with two treatments A and B). An experimenter might think that there are differences between Java's if-statements and some given alternative. I.e., there are two variants:

Treatment A (IfJava):
  if (someCondition)  
    ...
  else
    ...

Treatment B (IfTrue):
  someCondition ifTrue 
    ...
  else
    ...

There is one independent variable (if-style) with two treatments (IfJava, IfTrue). With respect to the dependent variable,  it is quite plausible to measure the time until the if-statement is understood. But we need to speak about confounding factors.

On Confounding Factors

Confounding factors are factors with an undesired effect. Undesired means they influence the dependent variable which should only be influenced by the independent variables. Unfortunately, confounding factors don't just add some constant to the dependent variable. Instead, they come with their own distribution (mean, deviation, etc.). Confounding factors can hide the effect of the independent variable: if a confounding factor is too strong (or its deviation too large), one measures mainly the effect of the confounding factor in an experiment and not the effect of the independent variable. 

In the best case, the effect of confounding factors is small and known and can be extracted from the literature. But taking the current state of our literature into account, we cannot expect any hard numbers from it. So, what does it mean for our experiment?

The goal of the experiment is to measure the difference between IfJava and IfTrue. And we have to use concrete code snippets. But what should such snippets look like? One could have a spontaneous idea: let's just use some arbitrary if-statement that could look like the following.

Treatment A (IfJava):
  if ((myVariable > 23) && isThisRight() && !someOtherCondition())  
    return 1;
  else
    return 2;

Great. We could ask participants "what is the result of the if-statement?" and, in case the statistical analysis finds a difference, the experimenter calls the if-style that requires less time the more readable one.

Unfortunately, we have a confounding factor: the complexity of the code. It is plausible that the more complex the condition, the more time it takes to answer the question. I.e., the dependent variable time is influenced by something that is not in the focus of the study.

We can examine the literature for readability models for Boolean expressions. Additionally, we need statistical information about such models. But such models with associated statistics don't exist. What can we do?

People would say "well, you just have to vary the complexity of the Boolean expression and consider this as a second variable in the experiment". Such a comment is not serious. First, we cannot vary the expression's complexity in a controlled way because there is no known complexity model for Boolean expressions. Second, it completely misses the problem of confounding factors: in case the effect of the Boolean expression is too strong, we could accidentally hide the difference between IfJava and IfTrue (in case it exists). And third, our goal is not to study Boolean expressions. Our goal is to study if-styles. Why should we bother about the complexity of Boolean expressions?

Actually, the last idea -- not to bother about Boolean expressions -- is problem and solution in one. It solves our problem in the study. But it has the problem that most reviewers will then argue that the study's result is not generalizable.

Becoming Aware of How Large the Problem Is

Before coming to the solution, it makes sense to speak about the problem in more detail -- the reason why it does not make much sense to vary the Boolean expressions in our code.

In the previous code example we see that the condition is not a pure Boolean expression. It is an expression in the programming language Java that finally evaluates to a Boolean value. More precisely, it is an expression of type boolean. It is important to understand this difference.

A Boolean expression comes from Boolean algebra. It consists of variables and operators (and some brackets). But the code contains method calls as well. I.e., even if there were a readability model for Boolean expressions, we would have to live with the problem that the method calls somehow play a role as well. And as soon as we are there, we have to acknowledge that names play a role as well. And we have to take Java's semantics into account, such as the short-circuit evaluation of the operator &&: in case the left-hand side of an && already evaluates to false, the right-hand side will not be evaluated (which is important in the presence of side effects, etc.).
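
To make the point about && concrete, here is a small, made-up Java snippet (the class and method names are purely illustrative and not part of any experiment material): whether the side effect happens at all depends on the left-hand side, so the condition describes program behavior, not just a Boolean formula.

  class ShortCircuitExample {
    int counter = 10;

    boolean decrementCounter() {
      counter--;                 // side effect
      return counter > 0;
    }

    boolean check(int myVariable) {
      // if (myVariable > 23) is false, decrementCounter() is never called,
      // so counter keeps its old value
      return (myVariable > 23) && decrementCounter();
    }
  }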

The intention of the previous paragraph is to make explicit that one cannot simply say "let's generate some expressions". A serious scientist will take all these things into account as potential confounding factors. And without knowledge about these factors, one had better get rid of them.

The Solution And The Problem

As already said, there is a simple solution to this problem: don't bother about Boolean expressions. And it simply means that instead of using Boolean expressions in the condition, one just uses a Boolean literal with the following code:

Treatment A (IfJava):
  if (true)  
    return 1;
  else
    return 2;

For a number of people (and unfortunately, for a number of people in the software science community as well) this code looks stupid. And the typical arguments (that one also finds in reviews) are:
  • there is no logic in an if-statement whose condition is a literal, because the result statement is already known upfront, and 
  • this is pure artificial code you will never find in any code repository.
It is completely understandable if someone from industry argues that way, especially someone who is not familiar with experimentation. But a reviewer should be aware of the problem of confounding factors and the reason why one has to adapt the code in order to get rid of such factors. 

On the Introduction of Additional Factors

Unfortunately, the story about missing generalizability is not yet over. But this time, it comes from a different source.

Let's assume (note that we haven't done the experiment) that it takes on average 1.1 seconds to answer the question with IfJava while it takes on average 1.0 seconds to answer the question with IfTrue. A 10% difference sounds like a lot. But experienced experimenters will be alarmed.

Since you measure something on participants, and since there is deviation between participants (as well as deviation within a participant), the mean values are not the only thing that matters. You also need to know something about the deviation. From that you can determine the effect size, such as Cohen's d, and from that you can estimate the required sample size with some statistical tools. Let's assume that the effect size is d = .8 (which assumes that your deviation is really, really small). The resulting sample size will be 42 participants per group, i.e., 84 participants in total. This is a large number of people. At that point, experimenters typically think about alternatives.
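
Just to sketch where such a number can come from, the usual normal approximation is n ≈ 2 * ((z_(1-α/2) + z_(1-β)) / d)^2 per group. The following snippet is only an illustration of that formula; the α of .05 and the power of .95 are my assumptions and are not stated above.

  class SampleSizeSketch {
    // Approximate per-group sample size for a two-group comparison.
    // Assumptions (not taken from the text): two-sided alpha = .05, power = .95.
    static long perGroup(double d, double zAlphaHalf, double zBeta) {
      return (long) Math.ceil(2.0 * Math.pow((zAlphaHalf + zBeta) / d, 2));
    }

    public static void main(String[] args) {
      long n = perGroup(0.8, 1.96, 1.645);   // d = .8, z for alpha/2, z for power
      // prints "41 per group, 82 in total"; tools based on the t-distribution
      // end up at roughly 42 per group
      System.out.println(n + " per group, " + (2 * n) + " in total");
    }
  }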

What experimenters can do is measure more data points per participant. I.e., one would rather design the experiment as a crossover trial or even as an N-of-1 trial; one would give a participant multiple tasks. But such a decision has consequences, and one of them is that you cannot give participants the identical task, because once a participant knows the code, he does not need to think about the code a second or a third time. Hence, there is a need to vary the code.

One could change the Boolean literal. But this does not change much. And it would mean that a participant who receives more than two tasks will get the identical task at least twice. One could vary the body of the if- or the else-branch. But this again introduces complexity from some other source not related to the if-statement.

Fortunately, there is a trick: use the if-statement again in the body. The possible code looks like the following:

Treatment A (IfJava):
  if (true)  
    if (true)  
      return 1;
    else
      return 2;
  else
    return 3;

This kind of code can be varied. You can, for example, consider nesting depth as a parameter, etc. Then, you can give participants some of this code (you just have to think about learning, fatigue, and novelty effects). Note that the additional factor (such as nesting depth) is not inherently interesting. It is the result of a design choice that became necessary because of the missing knowledge in the literature about the complexity of Boolean expressions, the resulting countermeasures to remove confounding factors, and the expected, required sample size.
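
Just to make the idea of nesting depth as a parameter concrete, here is a small sketch of how such task code could be generated for the IfJava treatment. The class and method names are made up for illustration; nothing here is taken from a concrete experiment.

  class IfTaskGenerator {
    // Generates nested IfJava task code; nesting depth is the only varied parameter.
    static String generateIfJavaTask(int depth) {
      StringBuilder code = new StringBuilder();
      String indent = "";
      for (int i = 0; i < depth; i++) {            // open the nested if-statements
        code.append(indent).append("if (true)\n");
        indent += "  ";
      }
      code.append(indent).append("return 1;\n");   // innermost then-branch
      for (int i = depth - 1; i >= 0; i--) {       // close each level with an else-branch
        indent = indent.substring(2);
        code.append(indent).append("else\n");
        code.append(indent).append("  return ").append(depth - i + 1).append(";\n");
      }
      return code.toString();
    }

    public static void main(String[] args) {
      System.out.println(generateIfJavaTask(2));   // reproduces the nested example above
    }
  }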

People might argue that the situation now is the same as before: there is one factor (nesting) whose effect is not known upfront and which is potentially a confounding factor. To a certain extent this statement is right. But it ignores that the resulting code does not contain any other language constructs that are not meant to be studied (except the Boolean literal, the return statement, and the integer literal).

On Generalizability of Code Examples

The previous code examples are probably good choices to study potential differences between IfJava and IfTrue. Still, your study will probably never be published. Again, the main argument against it will be that the experiment code is not real code.

The resulting problem is that the results will become available neither to other researchers nor to other language designers. In case there is a difference between if-styles, the next language designer has no chance to hear about it. And other researchers will not be able to benefit from the measured differences and deviations. And if in some years someone has the same idea about IfTrue, such a person cannot just take a look into the literature to find out what is already known.

Actually, the argument against the code examples reveals a complete misunderstanding of experimentation. Again, the resulting code is the result of controlling factors and reducing confounding factors. It was the goal of the experimenter to find an experiment that gets rid of disturbing factors. One can be relatively sure that the experimenter is aware that the experimental code is not what one finds in reality. But he had damn good reasons to use it anyway.

Starting from complaints about missing generalizability, people will launch into longer speculations about possible effects of other factors that exist in reality, and they will speculate whether the difference in reality is really 10% or not. Again, this is a complete misunderstanding of experimentation.

Again, the goal of experimentation is to measure the effect of something in a controlled environment. The goal is not to test what the effect in reality is. In reality, there are many more factors that have an effect. In order to understand the effect in reality, these different factors and their possible interactions need to be known first.

Telling a software scientist to find more realistic code examples is comparable to telling the experimenter of an Aspirin study not to artificially measure the effect of Aspirin on headaches, but to consider more realistic scenarios in hospitals such as heart attacks or cancer. Of course, Aspirin was studied on headaches, because it was designed to reduce headaches. It was not designed to heal cancer. Studying Aspirin in a more or less arbitrary setting (a more realistic example) will probably not measure anything. Not because Aspirin has no effect (on pain), but because the deviation between different illnesses is too large (where the pain-reducing effect plays too minor a role).

Let's get back to our example. IfTrue was built to have a positive effect on if-statements. It was neither designed to make Boolean expressions easier, nor to make anything else better. Arguing that such a construct should be studied in a more realistic example is simply wrong.

Conclusion

Again, please stop complaining about missing generalizability of code examples in experiments, because it simply does not make any sense. Check what the focus of a study is, check what factors are intended to be studied, and check whether confounding factors were reduced as much as possible.

The whole idea about peer-reviewing is that people should judge whether evidence was collected based on known facts from the field. This implies that personal opinions, estimations, or feelings do not belong to the review process. Our current state of reviewing practice has actually nothing to do with this idea.

And in case you still see the need for generalizability of code examples in experiments, please answer the following questions.

First, what criteria do you apply in order to identify real code?

Second, what evidence do you have that your personal idea of real code is actually real?

Third, how do you think deviation in real code should be considered?

And in case you don't understand the third question, ask yourself whether you should really review any experiments.

Please, feel free to leave comments.

(Actually, complaining about missing generalizability is not only a matter of code examples. But that is something for a different article.)

Saturday, March 19, 2022

Please, Stop Collecting Developer Opinions

Just recently, I was quite enthusiastic about reading a software science paper, because its title sounded promising. I do not want to refer to this specific paper, because the goal is neither to discredit one specific work nor one specific author. It is a set of similar papers that needs to be criticised.

Over the last years, more and more questionnaire studies have appeared at scientific venues in software science. What these papers have in common is that the authors ask developers about their opinions on some topic. Then, responses from a huge number of people are collected. Then, the results are analyzed.

So far, there is no problem.

There is nothing wrong with opinions. It is interesting to know what people's opinions are. It is especially interesting from a marketing perspective, because it says something about the perception of people.

The problem lies in the conclusions.

What a number of works in software science do - and what is fundamentally wrong - is to infer something about the perceived phenomena from subjective perceptions.

Let's assume there is a technology X that tries to make developers' lives easier. Then, someone asks developers whether it makes their lives easier, and let's assume that, with strong evidence, the answer is yes. What can we conclude from it?

We can conclude that developers think that it makes their life easier. It is also possible that developers just pretend that it makes their life easier. But we do not know whether it makes their life easier. Making any claim about how or whether the technology influenced a developer's life is not possible from the evidence gathered so far. 

In order to find out whether the technology X helps, no result from subjective perceptions would bring us closer to an answer. Whatever the result of the questionnaire is, the question whether or not the technology helps is still unanswered.

One could argue that a developer's life becomes better because he thinks that there is a technology that helps him - independent of whether or not it actually helps him. This kind of argument is comparable to a placebo argument that we frequently find in homeopathy. But it should be clear that this argument should not be used in software science, because it is more a meta argument: if something makes people think that it makes their lives better, then it is good. The argument is comparable to the question whether a free beer makes a developer's life better.

Of course, this leads us (as always) to the need for studies. But the argument is not that questionnaires are not studies. They are. The problem with them is that they purely depend on subjective perceptions.

There are good reasons why you find whole textbooks about perception in psychology. Perception is not only subjective in the sense that people can perceive the same phenomenon in different ways (because of differences in physiology, differences in experience, etc.). Perception is influenceable. You can easily find a bunch of studies showing that perception can be influenced, and the Pepsi versus Coke experiment [1] is just one example (again, whole textbooks exist on that topic, so there is no point in giving a longer list here).

So, what is actually the problem? When we study technology, we need to measure interactions with the technology in a non-subjective way. You can still ask developers questions. "In this scenario, what is the outcome?" could be an appropriate question. But it differs from a question "Do you think that technology X helps?".

We need to stop asking for subjective perceptions.

The implications of this statement are much more serious than we think. Community processes that can be found today, for example in programming language design, typically ask people about opinions. But it would lead too far to discuss this issue here.

Please, stop collecting developer opinions.

Opinions are important. But they do not permit drawing any conclusions beyond people's opinions. And should a technical discipline focus on opinions? I think the answer is no.

Feel free to leave a comment.

Thursday, October 15, 2020

What Should Software Science learn from the Corona Crisis?

The corona crisis does not only influence people's daily life, it also influences how people think about science. Suddenly, scientists are present in the news, scientific results influence new laws that appear because of the corona crisis, and the results of scientific studies become part of people's daily conversations. Actually, this new popularity of science is good.

And there are a number of people who doubt scientific results. Actually, this is not that bad. Science requires doubt. Progress happens because some people do not believe in commonly accepted theories and search for alternative explanations or new interpretations of given phenomena.

What's bad is when people ignore results or invent new theories without having any evidence for them. And what's even worse is that there are people who follow such new theories without even demanding evidence. Such people can be fooled too easily, and for other people fraud becomes a profitable business.

The typical reaction to corona skeptics is that education would help. If people were better educated, their knowledge about science would help them to distinguish between serious interpretations of scientific results and rather wild guesses based on personal anecdotes. But while this statement is probably true in general, we cannot assume that every discipline provides such profound knowledge.

Taking into account that the scientific foundation of software science is rather weak, it actually makes sense to think the other way around: what can software science learn from the corona crisis? So, why not try to find some "lessons learned" for software science from the ongoing crisis?

It is the numbers that do matter

It sounds stupid to point this out, but the first thing to be learned from the corona crisis is that it is the numbers that do matter.

The first and rather obvious number that directly comes to mind is the death rate. But other numbers such as infection rates, etc. do matter as well for medicine. For other disciplines such as economics, numbers do matter, too. There, monetary aspects such as the costs of the crisis do matter. In the end, it is not a single number that matters. Each single number plays its role, but a number of different numbers need to be taken into account in order to get the big picture.

But the important insight is not only that numbers matter. The important insight is that hardly anything other than numbers matters. Even if there is a person who has a strong belief in the effectiveness of some medicine, therapy, or vaccine, it does not imply that such a statement should be taken too seriously. Even if someone ignores the huge, negative impact of corona on the economy, it does not imply that this negative impact does not exist. Rhetorical skills might strongly work on some people. But rhetorical skills hardly change reality.

In the end, the effectiveness of some treatment, or the validity of an argument, requires numbers: we want evidence for statements and not only some famous people's beliefs. Software science should demand evidence. Software science should demand numbers - numbers that do matter.

Numbers are dirty

The second lesson to be learned is that numbers are rarely pure. Numbers are dirty. Empiricists are aware that measurements are rarely as pure as people would want them to be. Measurements imply measurement errors, and measurement tools have their drawbacks. Although people want measurement tools to be as precise as possible, we have to accept that every measurement tool has problems.

And empiricists are used to the problem that people who do not accept empirical results discredit the numbers. In the corona crisis, the death rates are discredited. People doubt that the number of deaths is valid - and they have good reasons to doubt the perfection of the reported numbers. Obviously, there is no independent institute that can analyze for every single case whether a person died just with or from corona. We do not necessarily speak about intentional lies. We speak about cases where it is not clear whether there was a causal relationship between the virus and a person's death. And we have to accept that even corona tests can fail. It is the nature of measurements that there are error rates - and it is the goal to reduce such error rates.

We also see that the numbers are attacked on different levels. For example, we find people who doubt whether the reported death cases are actually true, i.e., people argue that there could be additional corona cases that were intentionally not reported. Or some people argue that the reported number of infections is too low, because some governments are not interested in reporting high numbers. And even if numbers are accepted, people who are not willing to accept empirical results start new interpretations. For example, people argue that a high infection rate is not the result of an ongoing pandemic, but rather the result of a high number of tests. Or a high death rate is not the result of failing countermeasures, but rather the result of an extremely aggressive virus.

In the end, we have to accept that all numbers have their problems. This does not mean that we should blindly trust all reported numbers. It is important to see how reality is mapped to numbers and it is important to understand potential problems. But just stating "the reported numbers are wrong" is rarely constructive criticism. There is a need to understand how problematic a number is, how measurements could be improved, etc. And it is always necessary to question the relevance of reported numbers. And it is important to identify people who discredit numbers for rather personal reasons and who thereby hinder the process of knowledge gathering.

For software science, the lesson learned is that we should not be too quick to discredit reported numbers. We need to understand the process of data collection (and interpretation) and we need to understand how large the possible errors of certain measurement techniques are. This means that we finally need to identify relevant measurements for our discipline and we need to define measurement techniques in order to get valid measures. And we should be cautious with people who discredit numbers for the sake of discrediting numbers.

It is not a single study that matters, it is multiple of them

Another lesson learned from the corona crisis should be that scientific knowledge usually does not arise from a single study. 

Up to now, there are hundreds and hundreds of studies on corona from the field of medicine. And our knowledge on corona is the result of a combination of a large number of these studies. This does not mean that each single study is fantastic. There are by now a number of studies which are today considered invalid. And there are studies that just reproduce results that have already been reproduced by others.

But the essential lesson learned is that people in a mature discipline study the same phenomenon over and over again from different perspectives. Different experimental designs, different treatments, different measurements, different measurement methods, etc. -- the knowledge of the field is built from multiple tools and efforts in order to get the big picture.

Such effort is required in software science as well. Instead of celebrating novel ideas in our field, we should appreciate more studies that examine given phenomena in depth. We should collect multiple studies on the same phenomena. We should encourage people to study phenomena even though some studies on those phenomena already exist.

The Need for Education and Demystifying Science

These corona times teach us how necessary it is that people understand non-subjective reasoning and how necessary it is that people distinguish between fact and fiction. Unfortunately, this requires education. It is not enough to argue for or against a statement by adding a phrase such as "scientific studies have shown" to it. Science is not a magical process. Science just means being as non-subjective as possible. Science tries to run, collect, summarize, and interpret studies without any agenda in mind. Education demystifies science, and statements such as "there is a scientific study" start losing their authority -- which is good, because there is a need to understand studies and not only to accept an author's interpretation of a study.

People should doubt the results of studies. But such doubt must not be some naive scepticism. It requires knowledge about the underlying procedures and it requires knowledge about the lines of reasoning built upon collected numbers. The necessary willingness to doubt results also requires knowledge and recognition of valid results. Knowledge about methods teaches us where the limits of doubt are.

Unfortunately, this is probably the biggest issue for software science. Actually, it is not clear whether software science provides its actors with enough knowledge for the mentioned kind of reasoning. There are even reasons to believe that software science education, which is massively influenced by or based upon math, is counterproductive for understanding the results of empirical studies: when you are familiar with proofs by contradiction or with counterexamples that disprove a general statement, it is hard to understand why a single case in an empirical discipline does not destroy a whole theory. When you are used to counterexamples, it is hard to understand why a single person who suffers from COVID-19 for a second time does not automatically falsify an immunity theory.

Summary

There is a lot that can be learned from the corona crisis. Software science can learn a lot from the corona crisis. We as software engineers or software scientists should not just read newspapers today and pretend that the process of knowledge gathering for corona is completely different from what needs to be done in our field.

We should demand numbers. We need to provide such numbers. Our lines of argumentation should rely on numbers. And we need to accept the impurity of numbers - and use education as a weapon against wild speculations and naive scepticism in our field.



Wednesday, September 9, 2020

How much distrust do we need, how much trust can we afford in software science?

While summarizing results of identifier studies for a magazine I had to make a decision: I had to decide how much I trust some experiments. In other words: how much distrust is needed when reading papers, reports, etc.? For example, if someone just wrote "I did an experiment and technique A turned out better than technique B", should I just take these words for granted and assume this is some evidence?

Example: Shneiderman's Experiment on Identifiers

The problem happened to me when I tried to summarize identifier studies. One of the earlier studies on identifiers was mentioned by Shneiderman and Mayer and I asked myself whether I should take the available, very short description of the experiment as a form of evidence into account: 
"Two other experiments, carried out by Ken Yasukawa and Don McKay, sought to measure the effect of commenting and mnemonic variable names on program comprehension in short, 20-50 statement FORTRAN programs. The subjects were first- and second-year computer science students. The programs using comments (28 subjects received the noncommented version, 31 the commented) and the programs using meaningful variable names (29 subjects received the mnemonic form, 26 the nonmnemonic) were statistically significantly easier to comprehend as measured by multiple choice questions." [1, p. 231]
That's it. Almost nothing more is said about the experiment in the given source. I could have said "well, the paper appeared in a peer-reviewed journal, hence this is evidence", but I did not feel that way. The problem was not only that concrete numbers were missing in the description. The problem was also that I did not have a precise idea of what was done in the experiment.

I wanted to give an impression of what was done in the experiment and then report means (and differences in means), and in case there are interaction effects, I wanted to report them as well (not in terms of statistical numbers, but rather in terms of text). I know, means are problematic, but my goal was to summarize results for a magazine - the audience should not be bothered with p-values, etc. But I think it was also important to give people an idea of what exactly participants did in the experiment and how conclusions were drawn from it. I did not know what programs or how many of them were given to the subjects, how exactly the different treatments looked, etc.

Ok, I was not satisfied with the description. But I also did not want to make it too easy for myself and just say "I should ignore the text", because in the end the description still came from a peer-reviewed journal. So, I tried to find out more about the experiment. I dug around on Ben Shneiderman's webpage, but was not successful. But in his book Software Psychology [2] I found some more text:
"One of our experiments, performed by Don McKay, was similar to Newstead's, but the program did not contain comments. Four different FORTRAN programs were ranked by difficulty by experienced programmers. The programs were presented to novices in mnemonic (IDVSR, ISUM, COEF) or nonmnemonic (I1, I2, I3) forms with a comprehension quiz. The mnemonic groups performed signifianctly better (5 percent level) than the nonmnemonik groups for all four programs." [2, pp. 70-71]
After this paragraph, the book contains a figure that illustrates a difference in "mean comprehension scores" for four different programs. The score seems to vary across programs from 3 to 5 for mnemonic and from 2 to 3.5 for non-mnemonic variables.

The second citation in combination with the figure gave me some more trust that the experiment revealed something. But it still puzzled me what the programs that were given to the subjects looked like. I also wanted to see in more detail what variable names were used. And I wanted to know what questions were given to the participants. The first citation mentions that multiple choice questions were used (the second just speaks about a quiz). How many alternative answers did the subjects have? How much time did they have for reading the code? What were the raw measurements, what were the means, the confidence intervals? I just had the figure, which neither shows confidence intervals nor gives a precise understanding of what the mean is. And finally: how was the data analyzed and what were the precise results? Was just a repeated measures ANOVA used? What about the second factor (programs)? Were there interaction effects?

Finally, I spent a lot of time on the experiment (mainly searching for a more detailed experiment description and comparing both descriptions to check whether they match). And I finally decided for myself that I should not take the experiment into account as a form of evidence. In the text for the magazine, I just wrote that "there was once an experiment which is today rather historically interesting, but inappropriate as a form of evidence".

I felt bad. 

I had the feeling that I did not give Ben Shneiderman the appropriate credit for his efforts. But I really felt that it was my duty to be much more sceptical with his text - despite the fact that Ben Shneiderman is one of the leading experimenters in software science.

On Scepticism, Evidence and Trust

As a scientist (well, in fact as an educated person), you should not trust too much, you should not just believe someone (no matter who he is), and you should not stick with your own fantasy or your own personal and subjective impressions. I.e., when you are confronted with a statement, the following should hold: A statement ...
  • ... is not more important because it follows a current hype,
  • ... does not become true just because the author of such a statement is an expert,
  • ... is not more valid because it was articulated by an authority,
  • ... is not more important just because you believe in it or the statement comes from you.
This kind of argumentation is far from being new. For example, Karl Popper wrote:
"Thus I may be utterly convinced of the truth of a statement; certain of the evidence of my perceptions; overwhelmed by the intensity of my experience: every doubt may seem to me absurd. But does this afford the slightest reason for science to accept my statement? Can any statement be justified by the fact that K. R. P. is utterly convinced of its truth? The answer is, ‘No’” [3, p. 24]
Of course, you cannot endlessly play this scepticism game. You cannot ignore everything on this planet and just say that you feel sceptical about it. In the very end, you need to take some evidence into account. This evidence might be damn strong or just weak (and weak evidence does not mean that you heard an anecdote somewhere).

But even strong evidence implies trust to a certain extent. You must trust that a study was executed, you must trust that the resulting numbers were measured, you must trust that no further numbers were measured and withheld by the authors, you must trust in the validity of the analysis, and you must trust in the seriousness of the interpretation. You must trust that the goal of the study's author was to find out something.

Unfortunately, whenever some kind of trust is required, it is a door opener for fraud. Trust can be exploited. People can intentionally lie when trust is required.

Reporting Experiments - Setup, Execution, and Analysis Protocol

Before speaking about the problem with trust, let's take a look at what can be known about an experiment.

In an ideal world, there is a guarantee that a study was executed, that this study follows a well-defined protocol, and that the study was analyzed in a way that matches the study's design. Such a protocol consists of three parts: the setup protocol (which describes what and how something should be done when it is replicated), the execution protocol (which describes the special circumstances under which the experiment was actually executed), and the analysis protocol (which gives the results of the study in statistical terms).

The setup protocol defines the subjects that are permitted to participate (such as "professional software developer with skills X and Y"), the dependent variables (such as "reaction time") and independent variables (such as "programming language"), and the hypotheses that are tested. Furthermore, the protocol contains the experiment layout (such as "AB test", etc.), the measurement techniques (such as "reaction time measurement with stop watch"), and the different treatments given to the subjects (such as Java 1.5, Squeak 5.0). Furthermore, it describes how and under what circumstances the different treatments are given to the subjects (such as the programming tasks given to the subjects, the task descriptions, the used IDE, etc.). In case the measurement techniques require some apparatus (such as some software used for measurements), this is also contained in the setup protocol. And in case the apparatus cannot be delivered as part of the protocol, a precise description of the apparatus is given.

The execution protocol describes the selection process for the subjects, the subjects that were finally tested, and the specific conditions under which they were tested. These special conditions could be the time interval in which participants were tested, the location where the test was executed, the machines used in the experiment, the concrete IDE (incl. version), etc. And finally, the execution protocol contains the raw data.

The analysis protocol describes how the possible effect of the independent variables on the dependent variables is determined. Since some statistics software is probably used for the analysis, this software is mentioned as well. The analysis protocol describes the results of the experiment in terms of statistical values. For the statistical values, corresponding reporting styles should be used, such as APA (although this is very uncommon in software science). Each test comes with a measurement of the evidence (aka the p-value) and an effect size (such as Cohen's d, eta squared, or just the means and the differences in means - the latter are no effect sizes, but it is often more useful to have measurements that mean something to the readers instead of abstract things such as Cohen's d that most readers won't be familiar with).
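
As a small illustration of the kind of values such an analysis protocol reports, the mean difference and Cohen's d (based on the pooled standard deviation) for two independent groups can be computed as in the following sketch. The data and the class name are invented for illustration; this is not a prescribed procedure.

  class EffectSizeSketch {
    static double mean(double[] xs) {
      double sum = 0;
      for (double x : xs) sum += x;
      return sum / xs.length;
    }

    static double variance(double[] xs) {          // sample variance
      double m = mean(xs), sum = 0;
      for (double x : xs) sum += (x - m) * (x - m);
      return sum / (xs.length - 1);
    }

    static double cohensD(double[] a, double[] b) {
      double pooledVar = ((a.length - 1) * variance(a) + (b.length - 1) * variance(b))
          / (a.length + b.length - 2);
      return (mean(a) - mean(b)) / Math.sqrt(pooledVar);
    }

    public static void main(String[] args) {
      double[] groupA = {1.2, 1.0, 1.3, 1.1};      // invented times in seconds
      double[] groupB = {1.0, 0.9, 1.1, 1.0};
      System.out.println("mean difference: " + (mean(groupA) - mean(groupB)));
      System.out.println("Cohen's d: " + cohensD(groupA, groupB));
    }
  }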

On double-checking results

The information above is required because it permits readers to double-check the experiment. It can be checked whether the layout followed a standard layout, whether the measurement technique is state of the art, whether the tasks given to the participants were appropriate, and whether the analysis follows the experimental design. And in case the reader doubts that the statistical results are right, he can recompute them. It even permits him to apply alternative statistical procedures.

In a real scientific world, there would not be the need for all this, because if an experiment was published in a peer-reviewed journal, you can trust that the reviewers did all this for you. Of course, this is no 100% guarantee. Even in disciplines such as medicine that have a very high research standard, studies are retracted in journals (see for example a recent case with a COVID-19 study). But the situation in software science is different. 

Taking the terribly low number of experiments in software science into account, there is good reason to doubt that an average reviewer in software science is able to do a serious review (just because quite few people are familiar with experimental designs and analyses). As a consequence, we cannot assume that a reviewer double-checked an experiment. Hence, it makes sense today not to rely on published experiments in software science, but to double-check them. It does not necessarily mean that authors intentionally lied. They might have just made some errors. Not intentionally, but just accidentally.

Unfortunately, we run into a problem here: double-checking costs time. Even if someone is well-trained in experimental analyses, it takes time to do the recomputation from the raw data (in case the data is available). But stats are just part of the game. There is also the need to check whether the data collection followed the protocols, etc. But we cannot double-check everything. At a certain point we have to stop and say "I just have to trust". But we should make explicit what we need to trust and what was actually double-checked. And we should give readers a fair chance to decide on their own what to trust and what not.

Why not Make Chains of Trust More Explicit?

But what about the average developer who is interested in what evidence actually exists in software science? He is probably not trained enough to double-check experimental results. But that implies that the developer will not take into account any evidence that exists in software science. And that implies that the results of software science will be in vain. This should not be the consequence; otherwise our discipline will never get out of this situation where countless statements without any evaluation exist.

Hence, it makes sense to give developers all essential information about an experiment but also the information about whom and what needs to be trusted. I.e. we should provide developers information such as "I, Stefan Hanenberg, recomputed the results of the experiment X and the results of the analysis are Y. I.e. if you cannot do your analysis on your own, you need to trust me that the computation of the analysis is correct". 

Probably it makes sense to make even more information available such as "I got the measurements and I repeated the analysis, but I was not able to access the tasks given to the subjects. I.e. I only confirm that the results of the experiment match the given data, but I cannot confirm that the data followed an appropriate experiment protocol, hence we need to trust the author of the experiment about that".

And maybe it even makes sense to provide ratings for the resulting chains of trust. An experiment result such as "we need to trust the author of the experiment" seems to be less trustworthy than "we need to trust that a valid setup protocol was followed, but the results match the given raw data", which is less trustworthy than "we received everything about the experiment and we confirm that the experiment followed an appropriate design, was executed in an appropriate way, and the reported results match the raw data". And the best case would probably be: "we received everything needed from the experiment, the results are as described by the author, and the experiment was executed by others and they received comparable results".

Why Could that Help?

Our discipline suffers from the problem that a number of statements ("object-orientation is good", "functional programming is good", "UML improves understandability", etc.) are hardly or badly evaluated. But even if there are experiments available, we should make explicit what parts of the experiments are trustworthy and what parts are not - because in the very end, we want to rely on strongly trustworthy results and not just on "we trust some single person".

By making explicit what results in our field do not just depend on our trust in the authors, we make explicit where we could or need to improve our discipline.

References


  1. Ben Shneiderman, Richard Mayer. Syntactic/semantic interactions in programmer behavior: A model and experimental results. International Journal of Computer and Information Sciences 8, 219–238 (1979). https://doi.org/10.1007/BF00977789
  2. Ben Shneiderman. Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, 1980.
  3. Karl Raimund Popper. The Logic of Scientific Discovery. Routledge, 2002. 1st English edition: 1959.

Tuesday, May 19, 2020

Reporting Standards in Software Science Desperately Needed

If we are really interested in achieving something in software science, there is a need for reporting standards. I really mean this statement. And just recently I experienced how urgently such standards are needed: I summarized an experiment and became aware of how much time it took to extract relevant information from it. Some information was missing, some was confusing, etc. If a standard such as CONSORT had been applied, it probably would have taken me minutes to summarize the experiment - instead of many, many hours, after which I finally even had to contact the author because some information was missing.
  

Recent Experiences While Summarizing Research Results

I recently summarized research results from experiments. The goal was relatively simple: just collect the results from some studies and summarize them in a way that an ordinary software developer is able to understand. I think such work is needed, because, for example, the study by Devanbu et al. has shown that most developers judge the validity of claims in software construction based on their personal experience and not based on independent studies [1]. But taking into account that experience is limited and that subjective experiences are quite error-prone, it makes sense to give developers information about the studies that exist and that give evidence for some claims. And what's even more important: give developers studies that contradict given claims.

The topic of my summary was "identifiers", i.e., I wanted to summarize studies that checked what influence identifiers have on code reading or code understanding. Yes, I know. No big deal. Every one of us knows how important the choice of good identifiers is. But I really wanted to know what was actually measured by researchers. And we should know something about the effect sizes.

Most of the studies were done or at least initiated by Dave Binkley. I had already read most of his papers in the past, and since I am well-trained in reading studies, I assumed it would be no big deal to give a quick summary of some of them. And there was another reason why I focused on his papers: from my experience and in my opinion, his studies are well-conducted and I trust the validity of the results, i.e., I trust that the numbers were collected in the way described in the papers, I trust the analyses of the data, and I trust that the writings do not try to oversell results. I think his research has the goal of finding answers. His papers are not written for the sake of writing papers, but for the sake of improving the knowledge in our field.
 

My goal for the summary

More precisely, I wanted to summarize papers in a way that gives a 1-2 sentence description of the experimental design, another 1-2 sentences about the dependent and independent variables, and a few sentences about the main results. And maybe some more sentences about what can be learned from the study. The goal was not to bother readers with the machinery that is needed for scientific writing, i.e. I wanted to skip information about whether the experiment followed a crossover design, whether e.g. a Latin square was used, or what statistical procedure was applied.

Actually, I think it is necessary that readers who are not deeply into scientific writing get results in an understandable way. I.e. if an AB test has been applied, I think it makes sense not to write about statistical power, p-values, confidence intervals or effect sizes, but just to write that "a difference was detected" (in case a significant result was achieved) and then to report means and mean differences. And in case multiple factors are tested, my goal was not to write about interaction effects, etc., but just to explain interactions in a way that an average person can get the meaning quickly.
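
To make this concrete, here is a minimal sketch (in Python, with made-up numbers and hypothetical variable names) of the kind of plain-language summary I have in mind: take the measured times of the two groups, report the means and the relative difference, and leave the statistical machinery out of the sentence.

    # Sketch: turning AB-test measurements into a plain-language summary.
    # The numbers and names are made up for illustration.
    from statistics import mean

    times_a = [12.1, 10.4, 15.3, 11.8, 13.0]  # task times (minutes), technique A
    times_b = [16.2, 14.9, 18.1, 15.5, 17.3]  # task times (minutes), technique B

    mean_a, mean_b = mean(times_a), mean(times_b)
    diff = mean_b - mean_a
    percent = 100 * diff / mean_a

    # Assuming the statistical test (done elsewhere) detected a difference:
    print(f"A difference was detected: technique A took {mean_a:.1f} minutes on "
          f"average, technique B took {mean_b:.1f} minutes, i.e. B took "
          f"{diff:.1f} minutes ({percent:.0f}%) longer than A.")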
 

Shouldn't someone else do the job?

Quickly is the point here. If we want developers to understand the results of studies, these results have to be communicated efficiently. And the typical research paper at a conference or in a journal does not seem to have the goal of communicating results efficiently. Authors of conference papers are given a certain number of pages they can fill. And authors are actually forced to fill this number of pages. If, for example, a conference such as the International Conference on Software Engineering (ICSE) has a page limit of 12 pages, you will hardly find a paper at that conference that does not have 12 pages. This has something to do with the review process (which should not be discussed here, although there is an urgent need to discuss it). Ok, so you want to communicate scientific results to a broader audience. But how?

Actually, scientific journalism in other disciplines does this job: people who are trained in writing (for a popular market) summarize results in a way that people are able to understand them. This is important, because people should be informed about what knowledge exists - especially taking into account that people pay for the generation of this knowledge (because a lot of scientific work is paid from tax money). But for software science this kind of journalism does not exist. Yes, there are a bunch of magazines that address technical things. You find books on new APIs or new technology that explain how to apply them. But this is something different. These writings explain how industrial products could be used. They do not explain what we actually know about them. It would be great if there were people who summarize research results - but we currently have to live with the fact that this is simply not done in our field.

So, back to the studies.
 

Giving a quick summary took a damn long time

Again, I really love Dave's work. I think his studies are great. His writings are great. But it turned out that just writing a quick summary took much more time than expected. When I now explain what happened to me and why I had trouble summarizing the paper, this should not and must not be understood as criticism of Dave's work. Really not. Dave's work is definitely a shining example of how good science in our field should be done. Dave's paper is just an example of the trouble people can have reading scientific papers. And I assume my own papers suffer from the very same problems.

One of the papers I started with was Identifier length and limited programmer memory [2]. I remembered that this study compared 8 expressions with different lengths and that subjects were asked to write down a part of the expression. So, I wanted to write sentences such as:
 "The experiment gave A subjects B expressions to read for a time C (D subjects were removed for some reasons). Each expression consisted of E parts and the authors used the criterion F to distinguish between short and long expressions. After reading, a part from the expression was removed and subjects had to complete it. The average time for reading short expressions was T1 and for long expressions it was T2, so the (statistical significant) differences was T3, respectively it took people G percent more time to read the long compared to the short expressions."
I am aware that these sentences are quite a simplification of the results. In particular, I do not mention all independent variables and I do not mention the applied statistical method. By reporting only the means, people do not get an idea of the size of the confidence intervals, etc. But, again, the goal was to give a quick (but still informative) and not a complete overview. Why do I think that this kind of summary is informative? Well, I think it contains the most relevant information. The number of subjects gives an idea of how large the experiment was (and people are mad about this idea of "being representative" - that's another point that needs to be discussed, but not here, not now), the dropout rate gives an idea of how much the data says about the relation between the originally addressed sample and the data actually used for the analysis. And the average times give people an idea of how large such differences are. Yes, there are effect size metrics such as Cohen's d or eta squared, but if someone does not know these things, such numbers would rather confuse them.
 

Sample size and dropout rate

Doing the first step (number of subjects) seemed relatively easy, because the number 158 is already mentioned in the abstract, so I directly started searching for the dropout rate. But then it took some time to understand what exactly happened to the data. The paper does not have an explicit section such as "experiment execution" or similar. But there is a section "Data preparation" where I found the following:
"[...] the data for a few subjects was removed. For example, one subject reported writing down each name. A second subject reported being a biology faculty member with little computer science training. Finally, the time spent viewing Screen 1 was examined. It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p. 435]
Ok, but what was the actual data being used? The second sentence seems to describe that the data of a whole subject was removed. But what does "this affected 18 responses" mean? Does it mean that the data of 18 subjects was removed? Or just 18 answers? And what about the other six? Does it mean that 24 answers, i.e. three subjects, were removed? Or was each single response treated individually? I suddenly felt slightly reluctant to write down a sentence such as "158 subjects participated", because I was not able to find out precisely what data was skipped. But, ok, I lived with the problem - and just reported that 158 subjects participated. Actually, this step alone took me quite a bit of time, because I reread the paper more than once, assuming I had missed some relevant information.
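
Just to illustrate why I stumbled here, the following small sketch (in Python with pandas; the data and column names are purely hypothetical) shows the two possible readings: either only the affected responses are dropped, or every subject who produced such a response is dropped entirely - and the two readings lead to different numbers of subjects in the analysis.

    # Two readings of "this affected 18 responses" - hypothetical data.
    import pandas as pd

    df = pd.DataFrame({
        "subject": [1, 1, 2, 2, 3, 3],
        "time_s":  [0.9, 25.0, 30.0, 700.0, 20.0, 22.0],  # seconds per response
    })

    too_fast = df["time_s"] < 1.5
    too_slow = df["time_s"] > 9 * 60

    # Reading 1: only the affected responses are removed.
    by_response = df[~(too_fast | too_slow)]

    # Reading 2: every subject with at least one affected response is removed.
    bad_subjects = df.loc[too_fast | too_slow, "subject"].unique()
    by_subject = df[~df["subject"].isin(bad_subjects)]

    print(len(by_response), by_response["subject"].nunique())  # 4 responses, 3 subjects
    print(len(by_subject), by_subject["subject"].nunique())    # 2 responses, 1 subject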
 

How large is the effect of expression length?

The main reason why I looked into the paper was that I wanted to know whether expression length was a significant factor and, in case it was, how large the effect was. The paper reports a significant average difference of 20.1 seconds in reading time between long and short expressions, i.e. longer expressions took longer. But how much longer did they take in comparison to short expressions?

I started searching either for effect size measures or at least for some descriptive numbers such as means or confidence intervals. I was really convinced that I must have missed them somewhere. So I re-read the paper over and over again - and did not find the numbers. The only thing I had was the following:
"It was decided that responses with times shorter than 1.5 s should be removed because they gave the subject insufficient time to process the code. This affected 18 responses (1.4% of the 1264 responses). In addition, excessively large values were removed. This affected 6 responses (0.5%) each longer than 9 min." [2, p.435]
So, should I just report that the individual reading times were between 1.5 seconds and 9 minutes, which would mean that the 20.1-second difference is anywhere between a factor of roughly 14 (relative to 1.5 seconds) and roughly 4% (relative to 9 minutes)? That does not sound meaningful. Again, searching just for this single number (which I finally did not get) took me some time. The same is true for a second variable: syllables. It is reported that each additional syllable costs the developer 1.8 seconds. But what does the first syllable cost?

In fact, I felt even more uncomfortable with the variable syllables, because this variable has multiple treatments and I would have been much more interested in the precision of the 1.8 seconds.
 

How exactly were the results of the study computed?

What puzzled me as well was the question of how the results were obtained: what statistical procedure was used? And what tool? The paper just says that linear mixed-effects regression models were used. Ok, but with what tool? And what exactly were the input variables for the regression models?

Going back to the question of the effect of the variable length, the paper says that "the initial model includes the explanatory variable Length" [2, p. 437]. Length? In a regression? The paper uses Length as a binary variable (it distinguishes only between short and long), so in principle this is just a simple AB test - or did I miss something? Or was a whole bunch of variables added to the initial model and Length was just the one that turned out to be significant?
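
Just to make the question concrete: a minimal sketch of what such an analysis could look like - written here with Python's statsmodels, although the paper does not say which tool was actually used - is shown below. The column names, the binary coding of Length and the random intercept per subject are my assumptions, not necessarily what the authors did.

    # Sketch: linear mixed-effects model with a binary Length predictor and
    # a random intercept per subject. Column names are hypothetical; this is
    # not necessarily the model the authors actually fitted.
    import pandas as pd
    import statsmodels.formula.api as smf

    data = pd.read_csv("responses.csv")  # hypothetical file: one row per response
    # expected columns: reading_time_s, length ("short"/"long"), subject (id)

    model = smf.mixedlm("reading_time_s ~ length", data, groups=data["subject"])
    result = model.fit()
    print(result.summary())  # the 'length' coefficient is the long-vs-short difference

With only a single binary fixed effect, the fixed-effects part of such a model is indeed nothing more than an AB comparison; the mixed part merely accounts for repeated measurements per subject. Which is exactly why I wondered what else was in the "initial model".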

Actually, it turned out that I had many, many more problems. And it took me quite a lot of time just to figure out which of these problems were real. I should mention that, because I had so much trouble, I contacted Dave, who sent me the raw data set within hours, so I was able to analyze the data on my own in order to get the results from the experiment.
 

Why standards such as CONSORT are urgently needed in software science

Finally, I got the raw measurements, was able to recompute some numbers, and everything was fine. But why did I still feel that something was really problematic?

Again, I am well-trained in reading studies. But if it took me hours to understand what was in the paper, I assume that it takes many more hours for people who are not trained in these things. So, how can we even assume that someone will take studies into account if it takes many hours to read them? The study by Devanbu et al. suggests that we should blame developers for not knowing what is actually known in the field. But if understanding a single study takes many hours, it actually makes sense that people do not read them. Why? Because developers have more to do than spending a whole day on reading a single paper. And in case essential information is missing in the end, the whole day was spent in vain.

So, how come essential information is hard to find in studies, or is even missing? Again, I do not blame the authors of the study mentioned here for forgetting something. But the paper was published in a peer-reviewed journal. How is it possible that it passed the peer-review process while some essential information is missing (again, we need to speak about the review process at some point, but not here)?

I am happy that the paper was published, because otherwise the whole body of knowledge in our field would be even smaller - and it is already unacceptably small (see the study by Ko et al. [3]). But what would have reduced the problem?

Here is where research standards come into the game. If our field were disciplined enough to apply a relatively simple reporting standard such as CONSORT [4], things would be easier. Such a standard effectively gives each paper a structured summary that permits you to find information quite fast. For the review process, it is relatively easy to check whether a paper fulfills the standard, i.e. authors can double-check whether the relevant information is contained and reviewers can do this double-checking as well.

Applying such a standard would have another implication: if, for example, a conference applied such a standard, many papers could be directly rejected because they do not fulfill it. The problem identified by Ko et al. (and there are in fact many, many more authors who documented that evidence is hardly gathered in our field) would vanish: scientific venues would publish only papers that follow the scientific rules. This would reduce the problem that readers are confronted with tons of papers whose content cannot be considered part of our body of knowledge.

Yes, there is another problem which makes it hard to imagine that we will finally get to the point where the software science literature contains scientifically relevant studies: people must be willing to execute (and publish) experiments whose results might conflict with their own position. But this is a different issue that I have discussed elsewhere.

Yes, research standards are urgently needed. At least reporting standards. Urgently.

References

  1. Devanbu, Zimmermann, Bird. Belief & evidence in empirical software engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 108–119, 2016. [https://doi.org/10.1145/2884781.2884812]
     
  2. Binkley, Lawrie, Maex, Morrell, Identifier length and limited programmer memory, Science of Computer Programming 74 (2009) [https://doi.org/10.1016/j.scico.2009.02.006]
     
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
     
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]

Thursday, May 7, 2020

Before Doing Science in Software Construction Something Else is Needed: Critical Thinking

Before Doing Science in Software Construction Something Else is Needed: Critical Thinking


Why is Science Needed in Software Construction?

Software construction is a huge, multi-billion market where new technology appears almost every day (in case you doubt that software is a multi-billion market, just take a look at the 10 most valuable companies on this planet today). Such new technology comes with multiple claims, and the most general one is that the new technology makes software development easier and hence cheaper.

Taking the size of the market into account, there are good reasons to doubt whether all technology on the market exists for a good reason - beyond the reason that new technology increases the income of the companies or consultants who propagate it. There are, in fact, reasons to believe that a lot of technology exists although its promised benefit never existed and never will.

In the end, one has to accept that most claims associated with a certain technology are not the result of non-subjective studies. Instead, they are the result of the subjective perceptions or impressions of people who have strong faith in a new technology, who really hope that the technology improves something, or who just love it ("faith, hope, and love are a developer’s dominant virtues" [1, p. 937]). Finally, some of these claims are just the result of marketing considerations: claims that are made and spread because they increase the probability of success of the new technology, not because they are true.

That non-subjective studies are rather rare exceptions in the field of software construction is a sad, but well-documented phenomenon. For example, Kaijanaho has shown that up to 2012 only 22 randomized controlled trials on programming language features with human participants were published [2, p. 133]. Another example is the paper by Ko et al., who analyzed the literature published at four leading scientific venues in our field. The authors came to the conclusion that "the number of experiments evaluating tool use has ranged from 2 to 9 studies per year in these four venues, for a total of only 44 controlled experiments with human participants over 10 years" [3, p. 137].

So, what's wrong with this situation? The problem is that new technology causes costs: costs for learning the technology, applying it, and maintaining software written with it. And there are additional, hidden costs. First, there are costs because new technology supersedes existing technology. Existing software often gets rewritten, which means that investments made in the past need to be repeated in the future. And in case existing software is not rewritten, there are additional costs for maintaining the old technology. Old technology causes larger costs because once a technology is no longer taught and no longer applied, it becomes more expensive to maintain, simply because there are no longer people on the market who are able to master it. An extreme example of this was the Y2K problem, whose costs were to a certain extent caused by forgetting the old technology COBOL.

But there is another, tragic problem: if a new technology appeared that solved a number of the problems we have today, such technology could not be identified. The claims associated with this new technology would just be lost among all the other claims that exist for today's technology or the claims that will be associated with its competitors.

We must not forget that the goal is not to find excuses to stick to old and inefficient technology. The goal is to make progress. But progress does not mean just to apply new stuff that appeared recently, but to apply technology that improves the field of software construction.

So, what we need are methods to separate good from bad technology. We need to separate knowledge from speculation and marketing claims. And we need to teach such methods to developers to give them the ability to separate knowledge from speculation. This does not mean that we need developers who execute studies. But we need developers who are able to read studies and who are able to distinguish trustworthy studies from bad ones. In the end, we want a discipline that relies on the knowledge of the field as a whole and not on the speculations of individuals.

The Scientific Method

The alternative to subjective experiences and impressions is the application of the scientific method - which is actually the alternative to subjectivity and not just one alternative among others. This does not imply that the term scientific method describes a clear, never-changing and unique process of knowledge gathering. Instead, it is a collection of things that can be done, should be done or must be done. And this collection changes over time, because not only does the knowledge in a discipline change due to the scientific method; the method itself changes as well.

It is not surprising that the scientific method is often critically discussed in the field of software construction - which is more an expression of the immaturity of the field than of the community's willingness to generate and gain non-subjective insights. Just to give an impression: even at international, academic conferences on software construction, there are discussions about whether the scientific method makes any sense at all. At such places, there are discussions about the need for control groups, the validity of statistical methods, or the validity of experimental setups. All these discussions take place despite the fact that there are tons of literature available from other fields on these topics (which give very clear answers). One could argue that this immaturity just exists because the field is quite young. In fact, this statement can easily be rejected: in medicine, which is typically considered one of the old fields, most of the experimental results that we accept today as following valid research methods were produced only within the last 30-40 years.

The fundamental part of the scientific method is that there are people who are willing to test the validity of hypotheses. This implies that they are willing to accept results even when these conflict with their own personal and subjective impressions or attitudes. And it means that they not only accept their own experimental results, but also results from others. Although this seems quite natural, it has one important implication: it means that people have established some common agreement on what a valid research result is and what is not.

Scientific Standards

Let's discuss the very general idea of research standards via an example. Let's assume there are two programming techniques A and B and one would like to test the hypothesis that it takes less time to solve a given problem using technique A than using technique B. So one person tests 20 people: 10 solve a given problem using A, 10 solve it using B. Then the times for both groups are measured and compared. This is a standard AB test whose experimental setup (randomization of participants, etc.) as well as data analysis (t-test, respectively U-test) has been well known for decades. But the general question is whether or not one should accept the result of the experiment as a valid result.
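
As a side remark, the analysis of such an AB test fits into a few lines; a minimal sketch (in Python, with made-up measurements) could look like this:

    # Sketch of a standard AB test: two independent groups, the times are
    # compared with a t-test (or a U-test if normality is in doubt).
    # The measurements are made up for illustration.
    from scipy import stats

    times_a = [31.0, 28.5, 35.2, 30.1, 27.9, 33.4, 29.8, 32.0, 30.6, 28.2]
    times_b = [36.1, 34.0, 39.5, 35.8, 33.2, 38.0, 36.7, 34.9, 37.3, 35.1]

    t_stat, p_t = stats.ttest_ind(times_a, times_b)     # t-test
    u_stat, p_u = stats.mannwhitneyu(times_a, times_b)  # U-test
    print(p_t, p_u)

The point is that neither the setup nor the analysis is in any way exotic - the method has been settled in other disciplines for a long time.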

It turns out that, especially in software construction, people complain a lot about such a standard approach. And in case technique A turns out to be more efficient than B, a larger number of people who prefer technique B will find reasons either to ignore the result or to discredit the experiment. Actually, there are quite plausible arguments against the experiment, and the most general one is the problem of generalizability: one doubts that the number of subjects is "representative" enough to draw any conclusion from the experiment. Another doubt is whether the given programming problems represent "something that can be found in the field" or whether they are "general programming problems" at all.

We should not be so ignorant as to reject such objections outright, because there is some truth in them. But we should also not be so open-minded as to take them too seriously, for the following reason: there is no experiment in the world that is able to solve the problem underlying these objections. No matter how many developers participate as subjects in the experiment, one can always argue that the number is too low. And no matter how many programming problems the techniques are tested on, there are always other programming problems on this planet that were not used in the experiment.

In order to overcome this situation, there is a need for some common understanding of the applied methods: there is a need for community agreements. If people agree on how experimental results are to be gathered, there is no need to doubt results that come from experiments that follow such agreements. Other disciplines identified the problem as well (quite some time ago) and created corresponding scientific standards. Examples of such standards are the CONSORT standard in medicine [4] (which mainly addresses the way experiments are to be reported) and the WWC standard used in education [5] (which not only covers the way experiments are to be executed, but also handles the process of how experiments should be reviewed).

The need for such community agreements is obvious, and we argued already in 2015 that such community agreements are necessary in software construction as well [6]. Today we see movements towards such standards. An example of this is the Dagstuhl seminar "Toward Scientific Evidence Standards in Empirical Computer Science" that takes place in January 2021 [7].

On the Selection of Desired and the Ignoring of Undesired Results

Such movements are good and necessary. However, we should ask ourselves whether the field of software construction is ready for such standards, because the introduction of research standards entails some serious risks that should be taken into account. But before discussing these risks, I would like to start with some examples.

In the last years, one situation has occurred to me over and over again. A colleague contacted me and asked whether there is an experiment available that supports a certain claim. The colleague's motivation is typically that she or he is trying to find a way to argue for the need for some new technology, and from her or his perspective this argument would be stronger if there were some matching experimental results. At that point I usually start a conversation and ask what if there are experimental results that show the opposite. And I usually get the answer that such experiments would be interesting, but wouldn't help in the given situation. In other words: an experimental result (in case it exists) is ignored if it contradicts a personal intention.

Something else happened to me in the last years, related to an experiment I published in 2010; an experiment that did not show a difference between static and dynamic type systems [8]. Today it seems quite clear that the experiment had problems, and it would have been better if it had never been published. In the meantime, other experiments have shown positive effects of static type systems (such as, for example, [9]): taking the sum of the experiments into account, the question of whether or not a static type system helps developers can be considered answered (so far). But what happened is that people, for whom it is helpful that no difference between static and dynamic type systems was detected, tend to refer only to the first study from 2010 and not to the later ones. For example, Gao, Bird and Barr explain the results of the 2010 paper in relative detail, but do not mention the later one [10]. Again, it seems as if only those results are taken into account that match a given intention - and results that contradict this intention are ignored.

Finally, another situation has occurred more than once or twice. A colleague created some new technology and asked me for advice on how to construct an experiment that reveals the benefit of the new technology. After some discussions (which often last for hours) we typically come to the point that the colleague is really convinced about the benefit of the technology in a certain situation, but thinks that in a different situation the technology could even be harmful. Often, this colleague is in the situation that a PhD needs to be finished and "just the last chapter - the evaluation" needs to be done. And what happens next is often that an experiment is created that concentrates only on the probable positive aspects of the new technology - the (possible) negative aspects are not tested.

The commonality of these examples is that people today have the tendency to select only those results that match their own perspectives or attitudes. In other words: even if strong empirical evidence, i.e. a number of experimental results, exists for a given claim, people who do not share this claim still have the tendency to search for singular results that contradict it.

This is comparable to people who advocate homeopathy and select those rare experiments where homeopathy showed a positive effect - and ignore the overwhelming evidence we have against it.

The Required and Currently Missing Foundation is Critical Thinking

Probably there is a reason for such behavior, and I assume that this reason has something to do with people's attitude in our field. In our education, from the very beginning people are involved in ideological warfare: procedural versus functional versus object-oriented programming, Eclipse versus IntelliJ, Git versus Mercurial, JavaScript versus TypeScript, Angular versus React, etc. "Choosing a side" seems to play an essential role in software construction. And it actually makes sense to a certain extent. If I master a technology, it is beneficial for me if this technology becomes the leading technology in the field. If I master a technology that no one uses and that no one is interested in, my technological skills are not and maybe never will be beneficial to me. Consequently, people advocate the technology they use and try to find reasons why this technology should be used by others as well. And in order to achieve this, all kinds of arguments will be applied, and it does not matter whether an argument is actually valid as long as it supports my intentions. This behavior becomes stronger as soon as people start developing their own technology. If someone writes a programming language as part of their PhD, there seems to be a tendency to defend this language.

The urge to defend a self-created technology, or to defend a technology just because one is able to master it, seems quite natural. But this is probably the core of the problem. We need to communicate from the very beginning that the goal is to make progress. And that progress means that we are willing to identify problems. And in case there is strong evidence that a certain technology has serious problems, we must be open-minded enough to take alternative technologies into account. We must be able to accept and apply critical thinking.

Of course, this must not lead to a situation where people switch technology directly after some rumours appear about some better technology - in fact, this would be closer to the situation we have today, where a large number of people accept new technology for the sake of being new. Just throwing everything away in order to apply something new is closer to actionism than to critical thinking. Critical thinking also does not mean that we find ad hoc arguments against some technology. Critical thinking must not mean that we encourage wild speculation. It just means that we are willing to accept different arguments. Critical thinking means that we are willing to collect and accept pros and cons. It means that we are willing to give up our own position.

This willingness is the very foundation we need in our field. It does not matter whether we define research standards and force people to follow them as long as people are not willing to accept results that conflict with their own positions. Otherwise, research results will either just be ignored, or people will generate and publish only those results that match their own attitudes.

Once we have achieved this kind of critical thinking, and once we are able to pass this idea on to students, we can take the next step towards evidence, in order to give people the ability to distinguish between strong, weak and senseless arguments - arguments that are backed up by evidence and those that are not. Then we will have researchers who are willing to define experiments whose results might contradict the experimenters' positions. This would be the moment when science could start in our field. This would be the moment when we are ready to apply the scientific method.

References

  1. Stefan Hanenberg, Faith, Hope, and Love: An essay on software science’s neglect of human factors, OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications, October 2010, pp. 933–946. [https://doi.org/10.1145/1932682.1869536]
  2. Antti-Juhani Kaijanaho, Evidence-Based Programming Language Design A Philosophical and Methodological Exploration, PhD-Thesis, Faculty of Information Technology, University of Jyväskylä, 2015. [https://jyx.jyu.fi/handle/123456789/47698]
  3. Andrew J. Ko, Thomas D. Latoza, and Margaret M. Burnett. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, February 2015. [https://doi.org/10.1007/s10664-013-9279-3]
  4. The CONSORT Group, CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials, 2010. [http://www.consort-statement.org/downloads/consort-statement]
  5. U.S. Department of Education’s Institute of Education Sciences (IES), What Works Clearinghouse Standards Handbook Version 4.1, January 2020. [https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf]
  6. Stefan Hanenberg, Andi Stefik, On the need to define community agreements for controlled experiments with human subjects: a discussion paper, Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, October 2015, pp. 61–67. [https://doi.org/10.1145/2846680.2846692]
  7. Brett A. Becker, Christopher D. Hundhausen, Ciera Jaspan, Andreas Stefik, Thomas Zimmermann (organizers), Toward Scientific Evidence Standards in Empirical Computer Science, Dagstuhl Seminar, 2021 (to appear) [https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21041]
  8. Stefan Hanenberg, An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Reno/Tahoe, Nevada, USA, ACM, 2010, pp. 22–35. [https://doi.org/10.1145/1932682.1869462]
  9. Stefan Endrikat, Stefan Hanenberg, Romain Robbes, Andreas Stefik, How do API documentation and static typing affect API usability?, Proceedings of the 36th International Conference on Software Engineering, May 2014, pp. 632–642. [https://doi.org/10.1145/2568225.2568299]
  10. Zheng Gao, Christian Bird, Earl T. Barr, To Type or Not to Type: Quantifying Detectable Bugs in JavaScript, Proceedings of the 39th International Conference on Software Engineering, 2017, pp. 758–769. [https://doi.org/10.1109/ICSE.2017.75]