Column — The JOT Blog

2015/03/18

Popularity will NOT bring more contributions to your OSS project

Filed under: Column — admin @ 17:34

The vitality and success of Open Source Software (OSS) projects depend on their ability to attract, absorb and retain new developers [1] that decide to commit some of their time to the project. In the last years, new code hosting platforms like GitHub have popped up with the goal of helping in the promotion and collaboration around OSS projects thanks to their integration of social following, team management and issue-tracking features around a pull-based model implementation.

Roughly speaking, GitHub enables a distributed development model based on Git (though with some extensions). In GitHub there are two main development strategies aimed at (1) the project team members and (2) external developers. Team members have direct access to the source code, which they modify by means of pushes. External developers follow a pull-based model, where any developer can work isolately with clones (facilitated by means of forks in GitHub) of the original source code. Later, developers can then send back their changes and request those changes to be integrated in the project codebase. This is what is called to send a pull request. Finally, pull requests are evaluated by project team members, who can either approve the pull request and incoporate the changes, or reject it and propose improvements which can be addressed by the proponent. Beyond the project creator, other developers can be promoted to the status of official project collaborators and get most of the same rights project owners have, so that they can help not only on the development (by means of pushes, as said above) but also with management tasks (e.g., answering issues or providing support to other developers). Issue-tracking support helps both external developers and team members to request new features and report bugs, and therefore fosters the participation in the development process. People interested in the project can also become watchers to follow the project evolution.

What makes some projects more successful than others?

Since there is still very limited understanding of why some projects advance faster than others, we asked ourselves whether projects using all these new collaboration features available in code hosting platforms like GitHub would actually have a positive influence in the advancement of the project. Are popular projects (i.e., projects with more watchers, more issues added, more people trying to become collaborators…) really more successful?

This blog reports on our answer to this question based on our findings after conducting a quantitative analysis considering all the GitHub projects created in the last two and a half years. As metric for project success we chose the number of commits (not necessarily adding code, also removing it). We believe this reflects better than other metrics the fact that the project is alive and improving. Several works have performed qualitative analysis of GitHub samples ([2, 3, 4] among others) but none trying to determine criteria for project success.

Methodology for our quantitative analysis

To perform our study we took all GitHub projects created after 2012 and collected a few relevant attributes for each of them.

GitHub Project Attributes

For each project we were interested in getting insights regarding the following characteristics:

General information. We consider basic project information such as whether the project is a fork of another and the programming language used in its development.
Development. We measure the development status of GitHub projects in terms of commits (totalCommits attribute) since its creation. As GitHub projects can receive commits from pushes (i.e., source code contributions coming from team members) and pull requests (i.e., source code contributions coming from accepted pull requests), we distinguish commitsPush and commitsPR attributes, respectively.
Interest. Being a social coding site, GitHub projects can also be monitored, tracked and forked by users. We therefore focus on two main facilities provided by GitHub: watchers (watchers attribute) and forks (forks attribute). The former is the number of people interested in following the evolution of the project; they are notified when the project status changes (e.g., new releases, new issues, etc.). The latter is the number of people that made a fork. Both attributes can provide good insights on the project popularity [5].
Collaborators. We consider the number of collaborators (attribute) who have joined a project to help in its development.
Contributions. We focus on contributions coming from (1) pull requests (PRs attribute) and (2) issues (issues attribute). In particular, we are interested in collecting the number of pull requests and issues that have been proposed (i.e., opened) for each project.

Mining GitHub

The mining process is illustrated in the following Figure and is composed of three phases: (1) extracting the data, (2) aggregating the data to calculate and import the attribute values for each project into a database, and (3) filtering the database to build the subset of projects used for analysis (see Filter). Next, we will describe each phase of the process.

GitHub mining process

Figure 1: Mining process.

Extractor. GitHub data has been obtained from GitHub Archive which has tracked every public event triggered by GitHub since February 2011. GitHub events describe individual actions performed on GitHub projects, for instance, the creation of a pull request or a push. Events are represented in JSON format. There are 22 types of events but we focus on 7 of them from which we can get the data needed to calculate the project attributes described before. The considered event types are presented in Table 1.

Table 1: Events considered in the GitHub Archive extractor.

Event Type	Triggering condition	Attributes Involved
MemberEvent	A user is added as a collaborator to a repository	collabs
PushEvent	A user performs a push	commitsPush
WatchEvent	A user stars a repository	watchers
PullRequestEvent	A pull request is created, closed, reopened or synchronized	PRs, commitsPR
ForkEvent	A user forks (i.e., clones) a repository	forks
IssuesEvent	An issue is created, closed or reopened	issues

Events are stored in GitHub Archive hourly. Our process collected all the events triggered per day since January 1^st 2012 (starting date for our analyzed period).

Aggregator. This component aggregates the events extracted in the previous step and calculates the attributes for each project.

The resulting dataset contains 7,760,221 projects. This dataset was curated to eliminate projects with missing information or that were former private projects (which would prevent us from getting the full picture of the project). The curated dataset contained 7,365,622 projects.

Filter. This component allows building subsets of the previous dataset in order to perform a more focused analysis. The filter takes as input the dataset from the previous step and creates a new filtered dataset containing only those elements fulfilling a particular condition.

In the context of our study, we built a new filtered dataset including only those projects not being a fork of another and that explicitly mention they were repos with code in a given programming language. GitHub is used for many other tasks beyond software development (i.e., writing books) and we wanted to focus only on original software development projects. The resulting filtered dataset contained 2,126,093 projects and was the one used in all the other analysis presented in this blog post.

First of all, are projects in GitHub really using collaboration features?

Before we try to answer the question of whether using those features help in the project advancement, we should check whehter these features are used at all. To answer this question, we will characterize GitHub projects according to the attributes presented before and specifically study the use of collaboration facilities in them. Table 2 reveals that in fact they are not largely used.

Table 2: Project attributes results of the GitHub dataset.

Development attributes
Attribute	Min.	Q1	Median	Mean	Q3	Max.
totalCommits	0.00	2.00	7.00	43.00	19.00	5545441.00
commitsPush	0.00	2.00	7.00	41.00	19.00	5545441.00
commitsPR	0.00	0.00	0.00	1.31	0.00	38242.00
Interest attributes
Attribute	Min.	Q1	Median	Mean	Q3	Max.
watchers	0.00	0.00	0.00	2.26	1.00	14607.00
forks	0.00	0.00	0.00	0.68	0.00	2913.00
Collaborators and Contribution attributes
Attribute	Min.	Q1	Median	Mean	Q3	Max.
collabs	0.00	0.00	0.00	0.05	0.00	7.00
PRs	0.00	0.00	0.00	0.96	0.00	8337.00
issues	0.00	0.00	0.00	0.29	0.00	1540.00

The results for development attributes such as totalCommits are strongly influenced by the fact that a considerable number of projects have a small number of commits. Thus, 1,259,822 (59.26% of the total number of projects) have between 0-10 commits from pushes (commitsPush) and 2,092,685 (98.47% of the total number of projects) have only between 0-10 commits from pull requests (commitsPR). Figure 2 illustrates this situation by showing the number of projects (vertical axis) per group of commits (horizontal axis).

Comparison between number of projects and number of commits coming from pull requests and pushes

Figure 2: Comparison between number of projects and number of commits coming from pull requests (commitsPR) and pushes (commitsPush).

Regarding the interest attributes, 1,433,042 projects (67.40% of the total number of projects) have 0 watchers and 1,614,556 projects (75.94% of the total number of projects) have never been forked. These results suggest that the use of GitHub is far from what it would be expected as a social coding site.

The results for collaborator and contribution attributes also reveal a very poor usage. Thus, 2,017,911 projects (94.91% of the total number of projects) do not use the collaborator figure; 1,953,977 projects (91.90% of the total number of projects) have never received a pull request; and 1,949,644 projects (91.70% of the total number of projects) have never received an issue.

Therefore we can conclude that most projects do not make any use of GitHub features and use it purely as a kind of backup mechanism. The great majority of projects show a low activity (i.e., totalCommits, commitsPush and commitsPR) and attract low interest (i.e., forks and watchers).

but those that do, do they get any benefits?

If so, this would be a good reason for the other projects to follow suit. Let’s see then if popular projects that attract a lot of interset (plenty of forks and watchers) and manage to involve a large community (that opens issues, becomes collaborators, submits pull requests) end up having more commits in the repository than others.

To answer this question, we will perform a correlation analysis among the involved attributes. More specifically, we resort to the Spearman’s rho (ρ) correlation coefficient to confirm the existence of a correlation. This coefficient is used in statistics as a non-parametric measure of statistical dependence between two variables. The values of ρ are in the range [-1, +1], where a perfect correlation is represented either by a -1 or a +1, meaning that the variables are perfectly monotonically related (either increasing or decreasing relationship, respectively). Thus, the closer to 0 the ρ is, the more independent the variables are.

Table 3 shows the ρ values for each combination of attributes we wanted to evaluate. The first three rows focus on the correlation between the number of collaborators, pull requests and issues and the number of commits of the project. As you can see there is no correlation (except for the somewhat obvious correlation between the number of pull requests and the commits derived from accepting them, as long as they are accepted, but with basically no impact on the global number of commits) among them. The last rows show there is no correlation either between the number of people following the project and the commits.

Table 3: Correlation analysis between the considered attributes.

	Success attributes
	totalCommits	commitsPush	commitsPR
collabs	0.09	0.09	0.06
PRs	0.27	0.25	0.88
issues	0.25	0.25	0.34
watchers	0.11	0.10	0.24
forks	0.08	0.07	0.36

It is important to note that during our study we also calculated the correlation values among all these attributes when grouping the projects according several dimensions, specially based on their size and the language used. None of those groupings revealed different results from those shown above.

Threats to Validity

In this section we describe the threats to validity we have identified in our study.

External Validity. Our study considers a large dataset of GitHub projects, however, it may not represent the universe of all real-world projects. In particular, as GitHub allows users to create open source repositories without any expense, our dataset might include mock or personal projects that are not focused on attracting contributions and they have been open sourced only to avoid paying membership fees to keep them private.

Internal Validity. Our study only considers GitHub data and therefore does not take into account external tools used by some GitHub projects (e.g. to manage the team and issues; for instance people attaching patches to an external Bugzilla bug tracking tool, later manually merged into the project by the project owner) that can lead to bias our study (i.e., in the previous example, that patch would not count as a pull request). Finally, using the language attribute to filter out non-software projects may result in the elimination of relevant projects since some software projects do not set the programming language used.

If popularity is not a good indicator, what determines the success of a project?

Honestly, we think by now it’s clear that we have no idea. We have learnt about quite a few things that do not correlate with success but still have to find one that does. Probably because there is no single reason for that or at least not one that it is simple enough to be easily measured. Still, being able to shed some light on this issue, even if partial, would be very benefitial for the OSS community and thus it’s worth to keep trying.

To try to get some more insights on this we have complemented this quantitative analysis with a more qualitative one where we conducted a manual inspection of the 50 most successful GitHub projects in our dataset (success measured in terms of the number of commits of the project coming from pull requests, i.e., from external contributors). We noticed that 92% of them (i.e., 46 projects) included a description file (i.e. readme), with, often, a link to complementary information in wikis (46%) and/or external websites (50%). A further manual inspection of these three kinds of project information sources revealed that they were not purely “decorative” but that instead included precise information on the process to follow for all those willing to contribute to the project (e.g., how to submit a pull request, the decision process followed to accept a pull request or an issue, etc.). We have compared these numbers with random samples of projects to confirm they are not just average values for the GitHub population.

This hints at the reasonable possibility that having a clear description of the contribution process is a significant factor to attract new contributions. Unfortunately, existing GitHub APIs and services do not provide direct support to automatically check our hypothesis on the whole population of GitHub projects so further research on this requires conducting other kinds of empirical analysis like interviews to contributors and project managers. If this is confirmed, this would open plenty of other interesting questions like whether some kinds of contribution processes (also known as governance rules) attract more contributors than others (e.g. dictatorship approach versus a more open process to accept pull requests). This could help project owners to decide whether to have a more transparent governance process in order to advance faster in the project development. See [6] for a deeper discussion on this.

About the Authors

Javier Luis Cánovas Izquierdo is a postdoctoral fellow at IN3, UOC, Barcelona, Spain.

Valerio Cosentino is a postdoctoral fellow at EMN, Nantes, France.

Jordi Cabot is an ICREA Research Professor at IN3, UOC, Barcelona Spain.

References

[1] C. Bird, A. Gourley, P. Devanbu, U. C. Davis, A. Swaminathan, and G. Hsu. Open Borders ? Immigration in Open Source Projects. In MSR conf., 2007.

[2] J. Choi, J. Moon, J. Hahn, and J. Kim. Herding in open source software development: an exploratory study. In CSCW conf., pages 129–133, 2013.

[3] G. Gousios, M. Pinzger, and A. V. Deursen. An Exploratory Study of the Pull-based Software Development Model. In ICSE conf., 345–355, 2014.

[4] F. Thung, T. F. Bissyande, D. Lo, and L. Jiang. Network Structure of Social Coding in GitHub. In CSMR conf., pages 323–326, 2013.

[5] T. F. Bissyande, D. Lo, L. Jiang, L. Reveillere, J. Klein, and Y. L. Traon. Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub. ISSRE symp., pages 188–197, 2013.

[6] J. Cánovas, J. Cabot. Enabling the Definition and Enforcement of Governance Rules in Open Source Systems. In ICSE – Software Engineering in Society (ICSE-SEIS), to appear.

Comments Off

2013/01/25

Lies, Damned Lies and UML2Java

Filed under: Column — richpaige @ 11:40

We review far too many research papers for journals and conferences. (Admittedly, we probably write too many papers as well, but that’s another story.) We regularly encounter misunderstandings, misconceptions, misrepresentations and plain old-fashioned errors related to Model-Driven Engineering (MDE): what it is, how it works, what it really means, what’s wrong with it, and why it’s yet another overhyped, oversold, overheated idea. Some of these misunderstandings are annoyingly common for us to want to put them down on the digital page and try to address them here. Perhaps this will help improve research papers, or it will make reviewing easier; perhaps it will lead to debate and argument; perhaps this list will be consigned to an e-bin somewhere.

Our modest list of the ten leading misconceptions — which is of course incomplete — is as follows.

1. MDE = UML

At least once a year we read an article or blog post or paper that assumes that MDE is equivalent to using UML for some kind of systems engineering. This is both incorrect and monotonously boring. The reality is that MDE neither depends on, or implies the use of UML: the engineering tasks that you carry out with MDE can be supported by any modelling language that (a) has a metamodel/grammar/well-defined structure; and (b) has automated tools that allow the construction and manipulation of models. Using UML does not mean you are doing MDE — you might be drawing UML diagrams as rough sketches, or to enable simulation/analysis, or for conceptual modelling. Doing MDE does not mean you must be using UML: you could be using your own awesome domain-specific languages, or another general-purpose language that has nothing to do with UML.

We have noticed that this misunderstanding appears less frequently today than it did five years ago; perhaps the message is slowly getting through. The misunderstandings might have started because of the way in which we often introduce MDE to students: conceptual or design modelling with UML is often the first kind of modelling that students see.

So, the good news is that if you’re doing MDE, you don’t have to use UML; and if you are using UML, you don’t have to do MDE. The bad news is that there are many other misconceptions out there waiting to pounce. We are just getting started.

2. MDE = UML2Java

Code generation is often the first use case that’s thought of, mentioned, dissected and criticised in any technical debate about MDE. “You can generate code from your models!” is the cry of the tool vendor. This is usually followed by the even more thrilling: “you can generate Java code from your UML models!” As exciting a prospect as this is, the overemphasis of code generation in discussions of MDE has led to the myth of the UML-to-Java transformation, and that it is the sole way of doing MDE. Without doubt, this is a legitimate MDE scenario that has been applied successfully many times. But as we mentioned earlier, you do not have to use UML to do MDE. Similarly, you don’t have to target Java via code generation to do MDE. Indeed, there is a veritable medley of programming languages you can choose! C#, Objective-C, Delphi, C++, Visual Basic, Cobol, Haskell, Smalltalk. All of these exciting languages can be targeted from your modelling languages using code generators.

It would be much more interesting to read about MDE scenarios that don’t involve the infamous UML2Java transformation — there are undoubtedly countless good examples that are out there. It’s always helpful to have a standard example that everyone can understand, but eventually a field of research has to move beyond the standard, trivial examples to something more sophisticated that pushes the capabilities of the tools and theories.

3. MDE ⇒ code generation

But what if you don’t care about code generation? Clearly you are a twisted individual: if you’re doing MDE you must be generating code, right? Wrong! Code generation — a specific type of model-to-text transformation — from (UML, DSML) models is just another legitimate MDE scenario. Code may not be a desirable visible output from your engineering process. You may be interested in constructing and assessing the models themselves — producing a textual output may not deliver any value to you. You may be interested in generating text from your models, but not executable code (e.g., HTML reports, input to verification tools). You may be interested in serialising your models so as to persist them in a database or repository.

However, if you are generating code from models, you are probably applying a form of MDE (the nuance is really whether your models have a precisely defined structure [metamodel] and whether or not your code generators are externalised — and can be reused).

4. MDE ⇒ transformation.

We’ve established that MDE is more than code generation. MDE is also about more than transformation.

Some problems cannot be easily solved with transformation. As advocates of MDE do we pack our bags and look for furrows that we can plough with model transformation techniques? Or can MDE still be of use?

Supporting decision making — helping stakeholders to reason about trade-offs between competing and equally attractive solutions to a problem — is an area in which models and MDE are increasingly used. (See the wonderful world of enterprise architecture modelling for examples). Code, software or computer systems are not necessarily central to these domains, and transformation does little more for us than produce a nicely formatted report. Instead, we need to consider exploiting other state-of-the-art software engineering techniques alongside typical MDE fare. Perhaps search-based software engineering (i.e. describing what a solution looks like) is preferable to model transformation (i.e. describing how an ideal solution is constructed) in some cases. We have done work in this area at our university [DOI: 10.1007/978-3-642-31491-9_32], and there is growing interest in this topic.

Transformation is powerful. Refactoring, merging, weaving, code generating and many other exciting verb-ings would not be possible without transformation theory and tools. However, models are ripe for other types of analysis and decision support and for these tasks, transformation is often not the right approach. In 2003 model transformation was characterised as the heart-and-soul of MDE. In 2013 we believe that a more well-rounded view is preferable.

5. “The MDE process is inflexible.”

This was an actual quote from a paper we once had to review for a conference. It was both a strange sentence and an interesting one, because we didn’t know what it meant. Just what is “the MDE process”? Did we miss the fanfare associated with its announcement? Arguably “process” and MDE are orthogonal: if you are constructing well-defined models (with metamodels) and using automated tools to manipulate your models (e.g., for code generation) then you are carrying out MDE; the process via which you construct your models and metamodels and manipulate your models is largely independent. You could apply the spiral model, or V-model, or waterfall. You could embed, within one of these processes, the platform-independent/platform-specific style of development inherent in approaches like Model-Driven Architecture (MDA). There is no MDE process, but by carrying out MDE you are likely to follow a process, which may or may not be made explicit.

6. MDE = MOF/Ecore/EMF

You must conform to the Eclipse world. Or the OMG world. You must define your models and metamodels with MOF or Ecore. You will be assimilated.

This is, of course, nonsense. MOF and Ecore are perfectly lovely and useful metamodelling technologies that have served numerous organisations well. But there are other perfectly lovely and useful metamodelling technologies that work equally well, such as GOPRR, or MetaDepth, or even (shock horror) pure XML. Arguably, the humble spreadsheet is the most widely used and most intuitive metamodelling tool in the world.

MDE has nothing to do with how you encode your models and metamodels; it has everything to do with what you do with them (manipulate them using automated tools; build them with stakeholders). Arguably, you should be able to do MDE without worrying about how your models are encoded — a principle that we have taken to heart in the Epsilon toolset that we have developed at our university.

7. Model transformation = Refinement

Refinement is a well-studied notion in formal methods of software engineering: starting from an abstract specification, you successively “transform” your specification into a more concrete one that is still semantics-preserving. In some formal methods, the transformations that you apply are taken from a catalogue of so-called refinement rules (which provably preserve semantics). Their application ultimately results in a specification that is semantically equivalent to an executable program. The refinement process thus produces a program that is “correct-by-construction”.

You can follow the logical (mis-)deduction behind this misconception quite easily:

Refinement rules transform specifications.
Specifications are models (see earlier misconceptions).
Model transformations are a set of transformation rules.
Transformation rules transform models.
Therefore, refinement rules are transformation rules.
Therefore, refinement is transformation.

This is actually OK. Refinement is a perfectly legitimate form of model transformation. The problem is with the reverse inference, i.e., that a transformation rule is a refinement rule. If you assume that transformations must be semantics preserving, then this is not an unreasonable conclusion to draw. But model transformations need not preserve semantics.

Heretical statements like this usually generate one of several possible responses:

“This is crazy: why would I want to transform a model (which I have lovingly crafted and bestowed with valid properties and attributes) into something that is manifestly different, where information is lost?”
“OK, I can see that you might write a transformation that does not preserve semantics, but they must be dangerous, so we just need to be able to identify them and isolate them so that they never get deployed in the wild.”
“I don’t have to preserve semantics? That’s a relief! Semantics preserving transformations are a pain to construct anyway!”

These responses are all variants of misunderstandings we have seen previously: this idea that MDE is equated to a specific scenario or instance of application.

The first misunderstanding is, of course, confusing a specific category of model transformation — those that preserve semantics — with all model transformations. What are some examples of non-semantics preserving transformations? They are legion: measurement applied to UML diagrams is a classic example, where we transform a UML diagram into a number. The transformation process calculates some kind of (probably object-oriented) metric. Another example is from model migration: updating a model because its metamodel has changed. In some scenarios, a metamodel changes by deleting constructs; the model migration transformation likely needs to delete all instances of those constructs. This is clearly not semantics preserving.

The second misunderstanding is the classical “Well, you can do it but don’t expect me to like it” response. Unfortunately, in many real model transformation scenarios, you have to break semantics, and you probably need to enjoy it too. Consider a transformation scenario where we want to transform a very large model (e.g., consisting of several hundred thousand elements) conforming to a very large metamodel (like MARTE, AUTOSAR, SysML etc) into another very large model conforming to a different very large metamodel. Because we are good software engineers, we are likely to want to break this probably very large and complicated transformation problem into a number of smaller ones (see, for example, Jim Cordy’s excellent keynote at GPCE/SLE 2009 in Denver), which then need to be chained together. Each of the individual (smaller) transformations need not preserve semantics — indeed, some of the transformations may be to intermediate convenience languages that exist solely to make complex processing easier.

8. MDE can’t possibly work for real systems engineering because it doesn’t work well in complex domains where there is domain uncertainty.

In systems engineering we often have to cope with domain uncertainty — we don’t fully understand the threats and risks associated with a domain until we have got a certain way along the path towards developing a system. If there is domain uncertainty then the modelling languages that have been chosen, and the operations that we apply to our models (e.g., transformations, model differencing, mergings) are liable to change, and this becomes expensive and time-consuming to deal with. Domain uncertainty is a real problem — for any systems engineering technique, whether it is model-based, code-based or otherwise. Domain uncertainty will always lead to change in systems engineering. The question is: does MDE make handling the change associated with domain uncertainty any worse? Perhaps it does. If you’re using domain-specific modelling languages, then changes will often result in modifications to your modelling languages (and thereafter corresponding changes to your model transformations, constraints etc). If you are using code throughout development, changes due to domain uncertainty will be reflected in changes to your architecture, detailed modular design, protocols, algorithms, etc. Arguably, these are problems of similar conceptual complexity — it’s hard to see how MDE makes things worse, or indeed better: it’s an essentially hard problem of system engineering.

9. Metamodels never change

As we saw in the first misconception, MDE is not only about UML, but also about defining and using other modelling languages. However, when we (or you, or the OMG) design a modelling language, even a small one, we rarely get it right the first time. Or the fifth time. Or the ninth time. Like all forms of domain modelling, constructing a metamodel is difficult and requires consideration of many trade-offs. Language evolution is the norm, not the exception.

Despite this, we often encounter work that:

Does not consider or discuss tradeoffs made in language design. These kinds of papers often leave us wondering why a domain was modelled in a particular way (e.g. “why model X as a class, and Y as an association?” “why model with classes and associations at all?”).
Presents the product of language design, but not the process itself. How was the language designed? Did it arrive fully formed in the brain of a developer, or were their interesting stories and lessons to be learnt about its construction?
Proposes standardisation of domain X because “there is a metamodel.” A metamodel is often necessary for standardisation, it is not sufficient. (For example, does your favourite transformation language implement all of the QVT specification? We bet it doesn’t — and shame on you, of course!)
Contributes extensions to — or changes to — existing languages with little regard for the impact of these changes on models, transformations or other artefacts. Even in UML specifications, the impact of language evolution is not made apparent: there are no clear migration paths from one version to another, as we discovered at the 2010 Transformation Tool Contest (see also the forum discussion on UML migration).

Misconceptions about language evolution might stem from the way in which we typically go about defining a modelling language with contemporary MDE tools. We normally begin by defining a metamodel/grammar, then construct models that use (conform to) that metamodel/grammar, and then write model transformations or other model management operations. The linearity in this workflow is reminiscent of Big Design Up Front, and evokes painful memories of waterfall processes for software development.

However, we have found that designing a modelling language — like many other software engineering activities — is often best achieved in an iterative and incremental manner. We are not alone in this observation. Several recent modelling and MDE workshops (XM, ME, FlexiTools) have included work on inferring metamodels/grammars from example models; relaxing the conformance relationship (typing) of metamodels; and propagating metamodel changes to models automatically and semi-automatically. These are promising first steps towards introducing incrementality and flexibility into our domain-specific modelling tools, but the underlying issue is rather more systematic. As a community, we need to acknowledge that changing metamodels are the norm, and to better prepare to embrace change.

10. Modelling ≠ Programming

There is a tendency in many papers that we read to put a brick wall between modelling and programming — to treat them as conceptually different things that can only be bridged via transformations (created by these magical wizards, or transformation engineers). We’ve seen this type of thing before, in the 1980s, with programming and specification languages in formal methods. Some specification languages like Z were perfectly useful for specifying and reasoning, but were difficult to use for transition to code. Wide-spectrum languages, that unified programs and specifications in one linguistic framework (e.g., Carroll Morgan’s specification statements, Eric Hehner’s predicative programming, Ralph Back’s refinement calculus), did not have these difficulties. Treating models and programs in a unified framework — as artefacts that enable system engineering — would seem to have conceptual and technical benefits, and would allow us to have fewer academic arguments about their differences (and more arguments down at the pub).

Well, we lied when we said there were only ten misconceptions.

11. MDE = MDA

We end with a real chestnut: that MDE is the same thing as MDA.

MDA first appeared via the OMG back in 2001. It is a set of standards — including MOF, CWM and UML — as well as a particular approach to systems development, where business and application logic are separated from platform technology — the infamous PIM/PSM separation. MDE is more general than MDA: it does not require use of MOF, UML or CWM, nor for platform-specific and platform-independent logic and concerns to be kept separate. MDE does require the construction, manipulation and management of well-defined and structured models — but you don’t have to make use of OMG standards, or a particular style of development to do it.

So, for you authors out there: when you say that you have an MDA-based approach, please be sure that you really mean it. Are you using MOF and UML? Are you reliant on a PIM/PSM separation? If so, great! Carry on! If not, please think again, and prevent us from complaining loudly and publicly on Twitter.

The End

We have to stop somewhere. These are just a few of the misconceptions, myths, and misunderstandings related to MDE we’ve encountered. Do send us your own!

About the Authors

Richard Paige is a professor at the University of York, and complains bitterly about everything MDE on Twitter (@richpaige). He also likes really bad films. His website is http://www.cs.york.ac.uk/~paige

Louis Rose (@louismrose) is a lecturer at the University of York. He wrangles Java into the Epsilon MDE platform, tortures undergraduate students with tales of enterprise architecture, and is regularly defeated at chess. His research interests include software evolution, MDE and — in collaboration with Richard — evaluating the effects of caffeine on unsuspecting research students. His website is http://www.cs.york.ac.uk/~louis

Comments (5)

2010/08/01

The Trouble with Configuration Management

Filed under: Column — John McGregor @ 02:19

The trouble with configuration management in some large technical organizations is that it is not just configuration management. An organization often assigns to a single role responsibility for software builds, configuration management (which often includes change management), and releases of products to customers. While these responsibilities are mutually dependent they are distinct roles, have different goals, and require very different skills. When we evaluate a product line organization this role is often identified as a source of problems. One or more of the dimensions is either neglected or incorrectly executed by the personnel assigned to the position.

There are several issues related to this approach. People in this integrated role are often not familiar with how software development will be carried out on a specific project and as a result they will often design a repository structure that does not permit, or at least hinders, parallel development. This has to do with the grain size of each versioned blob and the dependencies among blobs. A poor structure results in the possibility of two people needing to working on the same file at the same time. Developers who are unaware of the structure of the repository or who do not understand how that structure relates to the structure of the products will not be able to write build scripts that are sufficiently modular to be reused at every level up the aggregation hierarchy and may codify dependencies incorrectly resulting in an improperly linked module.

The CM staff will also define one process that is applied to both development activities and delivery activities. While this makes the release part of the CM job simpler, it slows down the development process which needs quick, frequent access to repositories to commit, update, and retrieve program elements. Some companies address this by having separate development and release CM processes without sufficiently differentiating between the processes.

First I will talk about the goals of each role, then the required skills, and finally dependencies among the roles. I will describe a company doing it right and will briefly describe how they perform these three functions.

Goals

Each of these roles has a specific goal that distinguishes it from the other roles.

The goal of the build role is to provide an executable that reflects the current development state of the product. Often legacy modules are not recompiled during a build. This is sometimes because of the time it would take to build the module but more often it is because there is a risk that the module will not build in the current environment. The result of partial recompilation can be that other modules that have changed and should be rebuilt aren’t. Building is a sufficiently complex task that many organizations require an independent auditor to witness the actual building to affirm that the correct modules were used to produce an executable with no errors.

The goal of the (real) configuration management role is to allow controlled changes to the source code by multiple people at the same time while protecting the integrity of previous work. This sounds simple, but in an environment such as a software product line it is not so simple. The structure of the repository of versions of the code facilitates specific development patterns more than others. The CM role must anticipate the eventual complexity of a development effort that will involve multiple players for multiple products. In a software product line the emphasis is on multiple references from multiple products to single copies of assets.

The goal of release management is to deliver to the customer a complete and consistent set of modules and resources that represent a correct product at a specific point in time. The product release and related resources such as the installation mechanism must be tested and the release manager determines whether particular features meet quality standards. Each release represents a build that has a permanent life in the CM system.

Skills

The build role needs an understanding of the product architecture, compile and link time variation points, and, obviously, the build tools. The build manager assures that the final build and installation package includes all the required resources including appropriate licenses. In a software product line, the build script for a product specifies a configuration of the product with compile and link time variations resolved. The script is placed under management because it captures the variant choices. The build role provides build automation to the development and test staff. In many projects every commit is a compile and test before the commit is confirmed. If every developer commits every day the organization has “always running” code. I have developed for many years using this style and it is a very efficient approach, but it is only feasible if the builds and tests are automatic.

The configuration management role requires knowledge of design, variation mechanisms, and the development process. The structure of the repository must support the approach taken to development. Some parallelism in work is required. The CM process should support concurrent work in a baseline. One way this happens is by having assigned ownership of assets so that only one person does work in a specific asset. Or at least having a separate trunk for each asset with only one person allowed in the asset at a time. Newer build tools allow logical references to files removing the need to physically copy code, but many organizations still do copy and are then faced with a diverging code base.

Release management requires an individual who understands the development process, the needs of the customer, the install time variations, and the paperwork required to support a release. In many companies the release manager is responsible for negotiating with the customer and development team for what will be ready for delivery in the next release and by what date. The manager determines that every module in the release has been through the full life cycle and is ready for release. The release manager is also responsible for assembling the installation package. Automation is critical to be able to exactly repeat a build and to quickly get ready for the next build.

Every project should establish and maintain a regular schedule of releases. This steady rhythm will establish a positive expectation on the part of clients. Many large open source projects release nightly builds, which are less well tested than the periodic stable builds. Software product line organizations often make both core asset and product releases on a regular schedule. Eclipse creates a stable, deliverable build for a large percentage of its project twice a year with nightly builds in-between. While the rhythms may vary from one organization to another, this is still a useful approach.

Dependencies

The three roles all manipulate the source code for the products. Essentially CM manages a set of files, some subset of which are selected to be built together, and the result of that build is the key element in a release. While the build process could take place with almost any physical organization of the code, the CM role can facilitate the build process through the structures that are chosen. An organization of the code that reflects some logical structure can reduce the chances of build errors. Clear labeling of versions allows for the automatic generation of new build scripts. The Build role needs to understand the content of a release and know any specific constraints for resolving any compile/link time variations.

Figure 1 Interdependencies

CM needs to know logical dependencies among the source files. The CM role captures a complete snapshot of its database for each release. This does not have to be a separate copy. It can be a versioned script that specifies a specific version for every element in the release. This goes beyond the scope of a “build” which is limited to the elements needed to compile and link the code to produce an executable. The copy includes all tests, test results, the development environment, and more.

The release role needs to know the exact contents of a release, which means the contents of one of many builds and the structure of the repository. If build scripts are managed in the repository by the CM function than the release role can retrieve a specific one at any time.

A Modest Proposal

Much of this confusion can be clarified by automating the build process and treating the build script, which may be more than a single text file, as a first class object. In fact, the build script should be thought of as the product. The script is placed under management and is versioned as changes are made to the contents of the product. The CM role creates a structure for the product that will include the build script and any product unique code. The build role creates the script and submits it to the configuration management system. The release of a product simply requires that the build script be baselined, references into the core asset base are changed from relative to absolute addresses, a separate branch is created to assemble the elements of the product.

Even if the three roles are assigned to a single person, this is a sufficiently simple process that can be handled by a single person. Many tools such as Eclipse and Visual Studio support this approach with integrated build tools. Eclipse has several projects such as Buckminster and b3 that manage the build process (and much more) and consist of artifacts that can be managed in the configuration management system.

Case in Point

My example company will have to remain anonymous but it is a major software development house specializing in a single domain but producing a software product line of consumer and professional products in that domain. This company separates the build, CM, and release roles and is large enough to have several people in each role.

Build – Building a product begins with builds of individual executables. These builds are initiated by development team members. The owner of a module develops a build script for that module. Successive layers of integration also produce build scripts that utilize the lower level scripts to build from the ground up. Finally each product has a script that automates the instantiation of the lower level pieces. Since a fundamental responsibility of the build role is to ensure that the correct code is compiled and linked, this bottom up aggregation of scripts provides a natural traceability mechanism.

CM – The configuration management is largely automated at the start of a project. The structure of the core asset base has been fixed for some time and each new product sets up to utilize that base in their scheme. The developer sets up a directory structure for their code and includes references to each core asset used rather than copying the asset into their project. The CM role establishes and evolves the basic structure and is responsible for “baselining” deliveries, which only requires creating a branch for the build scripts and making variable references in the script to the latest versions of assets into fixed references to specific versions of those assets instead.

Release management – The role of release manager carries both responsibility and authority. Once a delivery date is set and the scope of the release is negotiated with all development streams, the release manager has the authority to remove a feature from a product if including that feature puts the delivery date at risk. The manager also has the responsibility to ensure that any included features have been appropriately tested and have satisfied quality requirements. The release itself is based on the baselined build scripts that have been used from initial development through testing and now to release.

Summary

These infrastructure roles are important to the success of any software development effort but they are particularly critical in a software product line organization. The role definitions and training for each role need to be sharply focused and we need to push for trained personnel in each role. In the event that two or more of these roles are assigned to the same person, controls should be put in place to ensure even-handed and competent treatment in each of the three areas.

Running through this entire discussion is the notion of automation. There are many tools that can be used make building, change management, and product release much more repeatable and robust. Most of these tools support a variety of process structure and styles but most also make the lines between the three roles much more discernable.

Comments (1)