<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Olivier Girardot's Ramblings</title>
        <link>https://ogirardot.writizzy.com</link>
        <description>The ramblings of a tech builder and startup CTO</description>
        <lastBuildDate>Fri, 10 Apr 2026 11:38:36 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Writizzy</generator>
        <language>en</language>
        <image>
            <title>Olivier Girardot's Ramblings</title>
            <url>https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1771778326886-t3ybutg.png</url>
            <link>https://ogirardot.writizzy.com</link>
        </image>
        <copyright>All rights reserved 2026, Olivier Girardot's Ramblings</copyright>
        <item>
            <title><![CDATA[Reverse engineering now and then]]></title>
            <link>https://ogirardot.writizzy.com/p/reverse-engineering-now-and-then</link>
            <guid>https://ogirardot.writizzy.com/p/reverse-engineering-now-and-then</guid>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Reverse engineering, hacking and cracking have a long tradition of being tedious and labor-intensive - how has that changed with our new AI tools?]]></description>
            <content:encoded><![CDATA[<p>When I was a teenager in the 90s, a friend of mine had a side-gig making cracks and key-generators for games and software. To anyone who grew up with always-on broadband, that sentence probably needs some context.</p>
<h2>A World Without Internet (by Default)</h2>
<p>In Europe in the late 90s, Internet access wasn&#39;t the default mode of any device. You <em>chose</em> to go online — deliberately — by firing up your 56k modem on the family landline, knowing full well that every minute ticked on the phone bill. Mobile phones? They existed, barely. Internet on the phone did not. SMS came later and cost a small fortune per message.</p>
<p>This isolation-by-default created something remarkable: an entire intellectual arms race built on the assumption that software lived offline. Developers built protections knowing there was no server to phone home to. Hackers broke them knowing the same thing. It was an adversarial craft — part art, part sport — played entirely within the confines of a single machine. </p>
<p>Here’s a <a href="https://www.tiktok.com/@pouyasaffari/video/7445784334704905477">small curated sample of the art of the time</a> 😉</p>
<h2>Software distribution before the Internet</h2>
<p>Back then, you discovered new software the way you discovered music: through curation. Tech magazines shipped with a free CD stuffed with freeware (fully free software) and shareware (a taste for free, the full experience once you&#39;d paid for a License Key). </p>
<p><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1772263391561-zol5r50.png" alt="An example of the kind of publication at the time - here a CDROM issue of 1994" /></p>
<p>It was a surprisingly elegant distribution model. As a consumer, you got curated recommendations from a magazine you trusted — far better than the alternative of spending 16 hours downloading a 200 MB file at 3.5 KB/s, only to have the connection drop five hours in.</p>
<p>There were magazines dedicated to software, games, productivity — and yes, hacking. And among the search engines of the era, alongside the generalist Alta Vista, sat its shadowy cousin: <strong><a href="http://astalavista.box.sk">astalavista.box.sk</a></strong>. There was no real concept of a &quot;dark web&quot; at the time. The web was what it was — light and darkness mixed together — and you were expected to watch where you clicked.</p>
<h2>The art of cracks &amp; keygens</h2>
<p>If you wanted to unlock a shareware program, you went to Astalavista and searched for it. You&#39;d find either a <strong>keygen</strong> or a <strong>crack</strong>.</p>
<ul>
<li><strong>Keygens</strong> were the golden ticket. They generated a proper License Key — one that told the software &quot;I&#39;m good, I&#39;m a paying customer.&quot; Since Internet was a luxury, the software never tried to verify that claim against a remote database. You didn&#39;t modify the program at all. You just had a key that fit the lock.</li>
<li><strong>Cracks</strong> were more invasive. They patched the software itself — typically dropping in a modified DLL that monkey-patched (as we&#39;d say today) the code responsible for checking registration. Surgery, not lockpicking.</li>
</ul>
<p>Keygens are mostly gone now. Cracks still exist in spirit — the modding community around games like Skyrim puts staggering effort into projects like the <a href="https://www.nexusmods.com/skyrimspecialedition/mods/266">Unofficial Skyrim Special Edition Patch</a>, which is essentially the same discipline applied with different intent.</p>
<p>But the underlying process was always the same, and to teenage me it looked like pure black magic. You had to <strong>reverse-engineer</strong> the code responsible for the license check — understand the key verification algorithm well enough to generate compliant keys, or understand the program&#39;s architecture well enough to surgically disconnect the licensing module without breaking everything else.</p>
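<p>To make the keygen side concrete, here&#39;s a toy, entirely hypothetical example (not the scheme of any real product): imagine a program that accepts any 10-digit key whose digits sum to a multiple of 7. Once you understand that offline check, writing a keygen is trivial; the hard part, back then, was recovering the check from compiled machine code in the first place.</p>
<pre><code class="language-python">import random

def check_key(key):
    # The kind of offline check a 90s shareware might ship (hypothetical scheme)
    return key.isdigit() and len(key) == 10 and sum(map(int, key)) % 7 == 0

def generate_key():
    # Invert the check: keep drawing random keys until one passes
    while True:
        key = &#39;&#39;.join(random.choices(&#39;0123456789&#39;, k=10))
        if check_key(key):
            return key

print(generate_key())  # prints a &#39;valid&#39; key for our toy scheme
</code></pre>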
<p>In practice, this meant a Windows machine tweaked at the boot level to run a decompiler or disassembler, tracing execution pathways and kernel calls to figure out what was actually happening. It required tremendous skill, deep knowledge, and pure grit.</p>
<p>I wondered recently: what would that process look like today, now that we have AI models like Claude that can supply the &quot;grit&quot; part on demand?</p>
<h2>Let’s try it with today’s tech</h2>
<p>Rather than reverse-engineer an existing proprietary format (let&#39;s keep things legal and self-contained), I created a toy problem. I designed a new binary file format called <strong>MIC</strong> (Multi Image Container) — built to store multiple images in a single binary file with error correction codes and thumbnail support.</p>
<p>The experiment has three steps:</p>
<ol>
<li><strong>Design the spec</strong> — published separately at <a href="https://ogirardot.github.io/mic/">ogirardot.github.io/mic</a></li>
<li><strong>Implement a writer</strong> — a Python prototype at <a href="https://github.com/ogirardot/mic">github.com/ogirardot/mic</a></li>
<li><strong>Ask Claude to reverse-engineer the output with zero context</strong></li>
</ol>
<p>The format is pretty simple, and the base structure layout looks like this:</p>
<p><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1772268069669-vsbxnta.png" alt="High level description of the MIC file format" /></p>
<p>I packed two of my own photos into a <code>.mic</code> file:</p>
<pre><code class="language-shell">➜  mic git:(main) ✗ python mic.py pack mountains.mic IMG_20260208_145001.jpg PXL_20260228_111319728.MP.jpg 
Wrote mountains.mic (2 images, 730248 bytes)
</code></pre>
<p>Then I handed the resulting binary to Claude with the simplest possible prompt:</p>
<blockquote>
<p><em>Here&#39;s a strange file, can you reverse-engineer it to tell me what is it about and if there are any data inside?</em></p>
</blockquote>
<p>No spec. No hints. No context. I launched Claude Code with <code>CLAUDE_CODE_SIMPLE=1</code> to ensure a completely blank slate, and fed the same prompt to three different models: Haiku 4.5, Sonnet 4.6, and Opus 4.6.</p>
<h2>The Results</h2>
<p>My expectation was that this first prompt would be a long shot. I was wrong.</p>
<p><strong>Sonnet 4.6</strong> went first. After about 4 minutes of autonomously running <code>xxd</code> dumps and writing ad-hoc Python scripts, it produced a full reverse-engineering report: the magic bytes, the header structure, the directory layout, per-image metadata including dimensions, filenames, CRC32 checksums (verified!), and even a description of the actual photo content. It mapped the entire format.</p>
<p>Here&#39;s how all three models performed:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Time</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Haiku 4.5</strong></td>
<td>37 seconds</td>
<td>✅ Full reverse-engineering</td>
</tr>
<tr>
<td><strong>Sonnet 4.6</strong></td>
<td>4 min 25s</td>
<td>✅ Full reverse-engineering (bonus: opened the extracted images)</td>
</tr>
<tr>
<td><strong>Opus 4.6</strong></td>
<td>1 min 16s</td>
<td>✅ Full reverse-engineering</td>
</tr>
</tbody></table>
<p>Yes — <strong>Haiku</strong>, the smallest and cheapest model, cracked it in 37 seconds.</p>
<p>Each model autonomously figured out the magic bytes (<code>MIC!</code>, <code>IMG!</code>, <code>ENDMIC!</code>), the header layout, the directory structure with offsets and sizes, image dimensions, embedded filenames, CRC32 integrity checks, and the raw JPEG payloads. Opus gave the most concise structural mapping. Sonnet went the extra mile and actually rendered the extracted images.</p>
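<p>For a sense of what that kind of autonomous exploration looks like in practice, here&#39;s a minimal sketch, written after the fact, of the sort of ad-hoc script involved. The marker strings come from the results above; everything else about the layout would be discovered iteratively.</p>
<pre><code class="language-python">import zlib

with open(&#39;mountains.mic&#39;, &#39;rb&#39;) as f:
    data = f.read()

# Manual hex dump of the start of the file, xxd-style
print(&#39;first 32 bytes:&#39;, data[:32].hex(&#39; &#39;))

# Locate every occurrence of the suspected magic markers
for marker in (b&#39;MIC!&#39;, b&#39;IMG!&#39;, b&#39;ENDMIC!&#39;):
    positions, start = [], 0
    while (idx := data.find(marker, start)) != -1:
        positions.append(idx)
        start = idx + 1
    print(marker, &#39;found at offsets&#39;, positions)

# JPEG payloads start with FF D8 FF: finding them confirms embedded images
jpeg_starts = [i for i in range(len(data) - 2) if data[i:i + 3] == b&#39;\xff\xd8\xff&#39;]
print(&#39;candidate JPEG payloads at offsets&#39;, jpeg_starts[:5])

# CRC32 of an extracted payload can then be compared against the stored checksums
print(&#39;CRC32 of the whole file:&#39;, hex(zlib.crc32(data)))
</code></pre>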
<h2>What Does This Mean?</h2>
<p>This was admittedly a simple format — no encryption, no compression beyond what JPEG provides, clear magic bytes as handholds. A skilled reverse engineer would have cracked it with a hex editor in minutes.</p>
<p>But the interesting part isn&#39;t the difficulty of the problem. It&#39;s the <em>nature</em> of the process.</p>
<p>What used to require a human sitting in front of a disassembler for hours — loading hex dumps into working memory, forming hypotheses about byte sequences, writing test scripts, iterating — is now something an AI can do autonomously. The model writes its own exploration tools, tests its own hypotheses, and converges on a structural understanding through the same iterative loop a human would use. It just does it faster, and it never loses focus.</p>
<p>The ability of modern models to generate on-the-fly Python debugging code, interpret binary patterns, form and revise hypotheses about data structures — that&#39;s a genuine capability shift. It doesn&#39;t replace human intuition on the hardest problems. But it compresses the iteration cycle dramatically.</p>
<p>My friend from the 90s spent weeks learning the tools and techniques before he could crack his first shareware. Today, the activation energy for that same intellectual exercise is a single prompt.</p>
<p>It&#39;s a brave new world. If something was designed using any kind of logical structure, it can now be understood at a speed we&#39;ve never seen before.</p>
]]></content:encoded>
            <category>hacking</category>
            <category>ai</category>
        </item>
        <item>
            <title><![CDATA[Good software knows when to stop]]></title>
            <link>https://ogirardot.writizzy.com/p/good-software-knows-when-to-stop</link>
            <guid>https://ogirardot.writizzy.com/p/good-software-knows-when-to-stop</guid>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Good software knows what problem it solves and what needs to be tackled by another tool]]></description>
            <content:encoded><![CDATA[<p>It’s 9 AM, you’re ready to upgrade your favorite Linux distribution and packages to their latest versions. The process goes smoothly and, after a reboot, your machine is up to date. You go about your day as usual, and then, when trying to list the contents of a directory on your machine, something strange happens: the routinely boring behavior you’re used to from <code>ls</code> surprises you, and not for the best:</p>
<pre><code class="language-javascript">$ ls

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  NOTICE: The legacy utility `ls` has evolved.                        │
│                                                                      │
│         _       _                                                    │
│        / \   __| | ___                                               │
│       / _ \ / _` |/ _ \                                              │
│      / ___ \ (_| |  __/                                              │
│     /_/   \_\__,_|\___|                                              │
│                                                                      │
│              AI-Powered Directory Intelligence™                      │
│                                                                      │
│  Hello.                                                              │
│                                                                      │
│  The classic `ls` command has reached the end of its lifecycle.      │
│  For decades it faithfully listed files.                             │
│  But listing is no longer enough.                                    │
│                                                                      │
│  The filesystem deserves to be *understood*.                         │
│                                                                      │
│  Introducing:                                                        │
│                                                                      │
│        █████╗ ██╗     ███████╗                                       │
│       ██╔══██╗██║     ██╔════╝                                       │
│       ███████║██║     ███████╗                                       │
│       ██╔══██║██║     ╚════██║                                       │
│       ██║  ██║███████╗███████║                                       │
│       ╚═╝  ╚═╝╚══════╝╚══════╝                                       │
│                                                                      │
│                       Adaptive Listing System                        │
│                                                                      │
│  `als` doesn&#39;t just show files.                                      │
│  It predicts which ones you meant.                                   │
│  It ranks them.                                                      │
│  It understands you.                                                 │
│                                                                      │
│  Your current `ls` binary will remain functional for:                │
│                                                                      │
│                        30 days                                       │
│                                                                      │
│  After this period:                                                  │
│      • `ls` will be deprecated                                       │
│      • updates will cease                                            │
│      • directory awareness will be disabled                          │
│                                                                      │
│  You can begin your transition today:                                │
│                                                                      │
│      $ als --trial                                                   │
│                                                                      │
│  (30-day free evaluation period)                                     │
│                                                                      │
│  Thank you for participating in the future of file awareness.        │
│                                                                      │
│                         — The `ls` Team                              │
│                           (now part of ALS)                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
</code></pre>
<p>Fortunately, this does not happen… Good software knows the purpose it serves: it does not try to do everything, it knows when to stop and what to improve.</p>
<p>One of the most counterintuitive skills, for the maximalist human psyche we have, is to know the role and place your software fits in, and to decide whether what you want to do next belongs to what we nowadays call the “product vision”, or whether it is just another project, another tool.</p>
<p>For the oldest amongst us, this kind of lesson came from 37Signals, the founders of Basecamp (the project management tool), through their books <a href="https://basecamp.com/books#rework">Rework</a> and <a href="https://basecamp.com/books#gettingreal">Getting Real</a> - two books I’d recommend, especially Getting Real for product design, whose lessons I could sum up as:</p>
<ul>
<li><strong>Constraints are advantages</strong> — small teams, tight budgets, and limited scope force better decisions</li>
<li><strong>Ignore feature requests</strong> — don&#39;t build what users ask for; understand the underlying problem instead</li>
<li><strong>Ship early, ship often</strong> — a half-product that&#39;s real beats a perfect product that&#39;s vaporware</li>
<li><strong>Epicenter design</strong> — start with the core interface/interaction, not the edges (nav, footer, etc.)</li>
<li><strong>Say no by default</strong> — every feature has a hidden cost: complexity, maintenance, edge cases</li>
<li><strong>Scratch your own itch</strong> — build something you yourself need; you&#39;ll make better decisions</li>
</ul>
<p>At a time when Minio becomes AIStor and even Oracle Database becomes the <a href="https://www.oracle.com/database/">Oracle AI Database</a>, I think a little reminder is in order: not everything has to change drastically, and being the de facto standard for a given problem has more value than branding yourself as the new hot thing no-one expected.</p>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[For a new golden age of FOSS]]></title>
            <link>https://ogirardot.writizzy.com/p/for-a-new-golden-age-of-foss</link>
            <guid>https://ogirardot.writizzy.com/p/for-a-new-golden-age-of-foss</guid>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Arguing that the current Generative AI trend is a chance to disrupt existing stale ecosystems with free software for the common good]]></description>
            <content:encoded><![CDATA[<p>The generative AI trend is arguably one of the biggest paradigm shifts of the last decades for the tech industry —even though the field has seen its share of upheaval recently: the <a href="https://loup-vaillant.fr/articles/deaths-of-oop">disappearance of OOP</a> as the default mental model, the rise of distributed systems and big data, the emergence of low-level languages (Rust/Zig) making a quiet comeback, and even no-code frameworks promising to abstract the programmer away entirely.</p>
<p>This latest shift has split opinion sharply. On one side, a chorus of voices argues that developers are a dying breed, that the craft is being automated away into irrelevance. On the other, an equally loud camp warns that the <a href="https://gist.github.com/richhickey/ea94e3741ff0a4e3af55b9fe6287887f">"vibe-coding" trend is a plague slowly killing projects</a>, demoralising key contributors, and ultimately creating more problems than it solves.</p>
<p>Both camps are, to some extent, missing the point.</p>
<p>I want to argue something different: that the AI productivity wave is a wonderful, and underappreciated, opportunity for Free Software and Open Source. Not despite the disruption, but because of it.</p>
<hr>
<h2>Software Was Already Eating the World</h2>
<p>When Marc Andreessen coined the phrase &quot;software is eating the world&quot; in 2011, it felt like a provocation. Fourteen years later it reads like an understatement.</p>
<p>SaaS was the dominant economic model of the last two decades, and its winners compounded advantages ruthlessly. The clearest illustration is the cloud wars: AWS, GCP, and Azure didn&#39;t win by owning better hardware, they won because they could afford to build, staff, and iterate on software abstractions faster than any competitor. Their moats weren&#39;t physical. They were organisational and financial: the ability to sustain large engineering teams attacking large problems, continuously, for years - it’s arguably what any European competitor lacks right now and OpenStack did not help much...</p>
<p>This concentration of power is not confined to the giants of the datacenter world. The same pattern plays out at every scale, in every corner of the data and infrastructure ecosystem. Storage, ETL pipelines, artifact management, data transformation — each of these markets has its own set of companies who captured a market early, established switching costs, and gradually shifted their energy from innovation to retention.</p>
<p>What made these positions durable wasn&#39;t superior technology. It was the <em>cost of execution</em>. Building a credible alternative to these solutions meant years of engineering, a VC round or two (if the market was the “next big thing” otherwise forget it), a dedicated team, and a long way to reach feature parity before you could even begin the sales conversation. The idea was often not the hard part. Assembling the execution capacity to turn the idea into something real was.</p>
<p>That is rapidly changing.</p>
<hr>
<h2>The Slow Death of FOSS at Scale</h2>
<p>To understand the opportunity ahead, it helps to understand what went wrong, or rather, what was always structurally fragile.</p>
<p>The open source ecosystem has been living with a quiet tension for years. The original promise was simple: share the code, share the burden, share the benefit. In practice, maintaining a widely-used open source project at scale is expensive. It requires sustained engineering investment, infrastructure, community management, and increasingly, legal and security resources. The idealism doesn&#39;t pay the bills and a lot of great projects died because of that (a recent example being <a href="https://scrapoxy.io/">Scrapoxy</a> by <a href="https://www.linkedin.com/in/fabienvauchelles/">Fabien Vauchelles</a>).</p>
<p>The industry&#39;s response has been the <strong>open core model</strong>: release the core as open source, sell the enterprise features, and use the community as a distribution channel. It worked for a while, before the cloud. But it has been fraying, and the fraying is accelerating.</p>
<p><strong>Minio</strong> is the most striking recent example and one of the fastest turnarounds in recent memory. The S3-compatible object storage project that became a cornerstone of countless self-hosted and cloud-native stacks <strong>quietly archived its open source project</strong> to make way for <strong>AIStor</strong>, a proprietary fork repositioned around AI workloads. The community didn&#39;t get a slow pivot. It got a <strong>fait accompli</strong>. The failure mode here is the <em>rug pull</em>: the open source project could be considered, in retrospect, as a customer acquisition funnel. When the market shifted, the funnel got redirected.</p>
<p><strong>HashiCorp&#39;s BSL relicensing</strong> of Terraform followed the same pattern a year earlier, triggering the OpenTofu fork. A reminder that the community can fight back, but only when it moves fast enough.</p>
<p><strong>The dbt and Fivetran story</strong> is a different failure mode: <em>consolidation absorption</em> and it plays out in two acts. In the first, dbt Labs built genuine momentum as an open source data transformation tool, then <a href="https://www.getdbt.com/sdf">acquired SDF Labs</a> to push further into the SQL intelligence space. <a href="https://www.fivetran.com/press/fivetran-and-dbt-labs-unite-to-set-the-standard-for-open-data-infrastructure-2025">Fivetran then acquired dbt Labs</a>, folding an independent ecosystem player into a commercial platform. What was an independent node in the ecosystem became an asset on a balance sheet.</p>
<p>The second act is subtler and more damaging. dbt Labs announced the transition from <strong>dbt Core</strong> (the Apache 2.0-licensed engine) to <strong>dbt Fusion</strong>, a rewritten engine released under the <strong>Elastic License v2 (ELv2)</strong>. ELv2 is not open source by any definition the OSI would recognise: it prohibits offering the software as a hosted service, which is precisely the use case that made dbt Core valuable to the ecosystem. The open source project hasn’t disappeared yet (a release even happened in February 2026), but it’s clear that the bulk of the company’s innovation and investment is going into dbt Fusion. It’s a rug pull with extra steps: slower, deniable, but just as final.</p>
<p>Then there is the quieter, less dramatic failure mode that Sonatype’s Nexus and JFrog’s Artifactory represent: <strong>innovation stall</strong>. No rug pull, no hostile acquisition, just a gradual calcification. These artifact repository tools captured their markets early, established deep enterprise integrations, and then largely stopped innovating in any meaningful sense. Pricing crept up. The UI stagnated. Feature development slowed to a pace dictated by enterprise sales cycles rather than user needs. They didn&#39;t fail — they just became the kind of expensive, slightly-resented infrastructure that teams budget for because replacing them feels too painful to contemplate and the alternatives are either partial and/or cloud-provider based.</p>
<p>Each of these stories has a different surface cause. But underneath, they share the same root: <strong>sustaining FOSS innovation at scale, in a market with well-capitalised incumbents, was too costly relative to the prize</strong>. The economics just didn&#39;t work and people can only compensate for so long.</p>
<hr>
<h2>The Gap Between Idea and Execution</h2>
<p>Here is where the argument turns.</p>
<p>There is a basic principle at work in any market: when the cost of producing something falls, more of it gets produced. This is the supply side of the law of demand, and it is about to reshape the software landscape in ways the AI discourse has largely missed.</p>
<p>For most of software history, the gap between idea and execution was wide enough to be a meaningful filter. Having the right insight about what to build was table stakes. What separated successful projects from abandoned GitHub repositories was execution capacity: the engineering hours, the sustained attention, the infrastructure, the tooling, the documentation. The idea was cheap. Everything else was expensive.</p>
<p>Generative AI is compressing that gap at a rate that is easy to underestimate. Not uniformly, quality still matters enormously, but the cost of turning a well-scoped idea into working software is falling faster than at any point since the commoditisation of cloud compute.</p>
<p>The implications for FOSS are asymmetric, and this is the part that rarely gets said plainly: <strong>the cost reduction benefits free and open source projects disproportionately</strong>.</p>
<p>A proprietary vendor still needs to recoup its engineering investment through revenue. A VC-backed startup still needs to justify its burn multiple. But a FOSS project only needs to cross an <em>activation energy threshold</em>: enough working software to be useful, enough documentation to be approachable, enough momentum to attract contributors. That threshold has always been the hard part. It is now lower.</p>
<p>Think about what it used to take to build a credible challenger to Artifactory. Three years minimum. A funded team. A long crawl to feature parity across package formats. A sales motion to crack enterprise procurement. The idea — &quot;a better, cheaper, open artifact registry&quot; — was never the scarce resource. The execution capacity was. </p>
<p>Now consider what that same project would look like if a single maintainer with strong domain knowledge and AI assistance decided to tackle it and disrupt a space that hasn’t moved in 15 years. You don’t need to imagine it: just take a look at the project <a href="https://artifactkeeper.com/">ArtifactKeeper</a> (45+ package formats, systems and library repositories, distributed proxy support, SSO and Security included, with an MIT License); at the time of writing this article (22nd February 2026) all of these features are included — the project started on the 15th January 2026, ~a month ago 🤯. I’m not judging the quality, I haven’t tested it yet, but at least the ambition is clear and I salute <a href="https://github.com/brandonrc">Brandon Geraci’s</a> motivation.</p>
<p>This is not hypothetical. The signals are already there in projects like this, and a growing number of infrastructure tools being built by very small teams to very high levels of polish. These aren&#39;t flukes. They are early evidence of a structural shift in what a small, motivated team can produce.</p>
<hr>
<h2>FOSS as the Natural Beneficiary</h2>
<p>The backlash against AI-generated pull requests is real and worth taking seriously. The reviewer fatigue, the low-signal noise, the erosion of the human craft at the heart of collaborative software development. These are genuine problems, not just personal anxieties.</p>
<p>But they are, at their core, <strong>governance problems, not productivity problems</strong>. The productivity, to me, is real. The question is who captures it and under what terms.</p>
<p>This is where the proprietary model has a structural disadvantage it rarely acknowledges. A commercial vendor capturing the AI productivity gains does so to protect margins, accelerate roadmaps, and deepen competitive moats. The gains flow to shareholders and, partially, to customers through better products. But as the software itself remains locked, the moat deepens.</p>
<p>A FOSS project capturing the same gains operates under entirely different incentives. The productivity goes into shipping more, faster, under a licence that ensures the code stays free. There is no margin to protect. There is no rug pull option if the core is Apache-2.0 or AGPLv3 from day one. The community retains the right to fork and if the license is GPLv3 it can even start a legacy. The switching costs stay low by design.</p>
<p>This is why the licensing question matters more now than it did five years ago. Projects that embed permissive or copyleft licences from the start are structurally protected against the failure modes we saw with Minio, HashiCorp, and the dbt ecosystem. The rug pull requires the rug. If the licence doesn&#39;t allow it, the option doesn&#39;t exist.</p>
<p>The opportunity is particularly acute in markets like those of Nexus, Artifactory, or Fivetran: <strong>expensive, stale, critical infrastructure with high switching costs and low innovation velocity</strong>. These are markets where the moats were built on execution cost, the cost of building an alternative being too high to justify. That moat is eroding.</p>
<p>A well-designed FOSS alternative in any of these spaces, built by a small team leveraging the current generation of AI tooling, with a clean licence and a genuine community, is a credible threat in a way that simply wasn&#39;t possible three years ago. The incumbents know this, which is partly why the pace of proprietary pivots and acquisitions is accelerating. The window for pre-emptive consolidation is closing.</p>
<hr>
<h2>For a New Golden Age of FOSS</h2>
<p>The first golden age of open source happened because the internet eliminated the cost of distribution. Linux didn&#39;t win by outspending SCO or Sun. It won because the economics of sharing code shifted so dramatically that the proprietary model could no longer justify its own overhead for most use cases. The infrastructure of the modern internet (Apache, MySQL, OpenSSH, Linux itself) was built largely by contributors who couldn&#39;t have done it without that distribution revolution.</p>
<p>We are at an analogous inflection point, but on the production side. <strong>The cost of <em>writing</em> software is falling</strong> the way the cost of <em>distributing</em> software fell in the nineties. The implications are the same: the competitive advantage that large engineering organisations held through execution capacity is being democratised.</p>
<p>That doesn&#39;t mean expertise stops mattering. It doesn&#39;t mean quality is free. It doesn&#39;t mean every ill-conceived FOSS project is suddenly viable. The governance challenges around AI-assisted contributions are real and will require new norms: better review tooling, clearer contribution standards, more explicit signal-to-noise filtering.</p>
<p>But it does mean that the class of problems that were previously &quot;too expensive to FOSS&quot; is shrinking. The stale, expensive, proprietary dominant players in tech are facing a structural shift in the cost curve of their competition. For the first time in a while, the economics of building a genuinely free alternative are on the right side of viable at least for the bootstrapping.</p>
<p>The AI wave is not the enemy of free software. Handled well — with good licences, healthy governance, and the willingness to build — it might be its best chance in a generation to actually take back control.</p>
]]></content:encoded>
            <category>tech</category>
            <category>foss</category>
        </item>
        <item>
            <title><![CDATA[Object oriented programming deemed irrelevant]]></title>
            <link>https://ogirardot.writizzy.com/p/object-oriented-programming-deemed-irrelevant</link>
            <guid>https://ogirardot.writizzy.com/p/object-oriented-programming-deemed-irrelevant</guid>
            <pubDate>Thu, 20 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[I've been coding since 2006, during this time I've seen multiple trends \& technologies emerge, rise and fall - nowadays the elephant in the room is the bad press around OOP languages the likes of Jav...]]></description>
            <content:encoded><![CDATA[<p>I&#39;ve been coding since 2006, during this time I&#39;ve seen multiple trends &amp; technologies emerge, rise and fall - nowadays the elephant in the room is the bad press around OOP languages the likes of Java, C#, C++.</p>
<p>Our profession is no stranger to this kind of feud and debate; for example, at the start of my career I learned, and was told to quickly forget, Remote Procedure Call technologies like <a href="https://fr.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture">CORBA</a> and SOAP, with the bulk of what we called Web Services at the time (spoiler: it came back with gRPC, useful things tend to come back).</p>
<p>I was only told at the time that my job as a software engineer was going to be irrelevant soon enough because of MDA - <a href="https://en.wikipedia.org/wiki/Model-driven_architecture">Model Driven Architecture</a> - and if I wanted to really build things my goal should be to harness all the UML/Merise diagram types perfectly and then feed them all to <a href="https://projects.eclipse.org/projects/modeling.emf.emf">Eclipse EMF</a> (still alive btw) for it to generate the code (like a good engineer should because <em>really doing things</em> is kinda dirty anyway).</p>
<h2>OOP and programming languages</h2>
<p>One thing that was a given at the time was the clear win of the Object Oriented programming languages for &quot;serious work&quot; - C was already considered too low level - so the clear go-to languages built from the ground up with OOP in mind were Java, C# and C++.</p>
<p>All the other languages wanted in on the action and added the concept of Classes afterwards, some in a clunky, limited way like PHP and some with more attention to detail like Python.</p>
<h2>Fast forward to now</h2>
<p>Nowadays OOP is the bad guy, the one responsible for all the evils in this world (along with Waterfall, Agile, Web Services and Design Patterns). To be clear, it is denigrated in the tech news and, by general consensus in the ecosystem, considered <a href="https://medium.com/@jacobfriedman/object-oriented-programming-is-an-expensive-disaster-which-must-end-2cbf3ea4f89d">too expensive</a>, <a href="https://news.ycombinator.com/item?id=18526490">too bloated</a> (special mention for the epic <a href="https://dpc.pw/posts/the-faster-you-unlearn-oop-the-better-for-you-and-your-software">The faster you unlearn OOP, the better for you and your software</a>) and a waste of precious time that creates more problems than it solves, especially with the challenges we face nowadays (efficient multicore usage, async/data-intensive applications, deep integration with Machine Learning and distributed systems, to mention a few).</p>
<p>Ok, it looks like a grim picture. Let&#39;s take a step back and look at what people actually do: if we look at the <a href="https://survey.stackoverflow.co/2023/">Stack Overflow developer survey of 2023</a>, it checks out, the most popular and widely used programming languages today are not object oriented from the ground up - they are all scripting languages (except SQL):
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image-1.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036280-image-1.png" alt="Most popular technologies - Stack Overflow developer survey 2023" /></a></p>
<p>The same conclusion can be drawn if we take a look at the <a href="https://survey.stackoverflow.co/2024/technology#most-popular-technologies-language">Stack Overflow developer survey of 2024</a>:
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036870-image.png" alt="Most popular technologies - Stack Overflow developer survey 2024" /></a></p>
<p>However, if we take a look at the respondents&#39; years of experience, we can see some bias in the dataset: the bulk of respondents have &lt;10 years of experience on the job, while according to <a href="https://datausa.io/profile/soc/software-developers?employment-measures=workforceEOT">DataUSA</a> the average age in the industry, as of 2022, is <strong>39 years old</strong>:
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image-2.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771037640-image-2.png" alt="Respondents by years of professional experience - Stack Overflow survey" /></a></p>
<br />

<p>And it&#39;s easy to see this self-fulfilling prophecy in action, with the majority of bootcamps and influencers supposed to help you get a 6-figure job in tech in 4 days telling you that learning JavaScript is the best way ever to become a FullStack Engineer <em>because you can use it on both the frontend and the backend</em> (🤯 !) and Python the best way to get into DataScience <em>(hard to argue about this one...)</em>.</p>
<p>So yes, the lingua franca of new developers is now closer, in terms of paradigm, to a (mostly) dynamically typed, imperative style of programming, and it seems that, as a whole, experienced developers either stayed loyal (<em>PyCon or Devoxx conferences still see ~5000 participants per day each year</em> <em>and JavaOne (rebranded DevNexus) in the USA sees ~10k participants per day</em>) or moved <em>laterally</em>:</p>
<ul>
<li>some experienced developers in OOP languages have moved on to functional programming languages to overcome some of the trauma they faced</li>
<li>others moved on from strict Java / C# etc... to Kotlin/Scala or other more modern forms of the language while Java integrated some of these features to stay relevant and dominant (Streams, lambda, default implementation etc...)</li>
</ul>
<p>Finally the emergence of a new brand of lower level languages like Go and Rust means that even some of the newcomers had additional options to shield themselves from the &quot;enterprise languages&quot;.</p>
<h2>Where to go from there</h2>
<p>There now seems to be a schism between older generations of programmers and newer generations, the latter disregarding for the most part all the teachings (the bad and the good) that object oriented programming brought to the table.</p>
<p>Now let&#39;s be frank: none of the concepts that OOP pushed for are special to these languages, especially in later years:</p>
<ul>
<li>the simple fact of defining Abstractions (<em>not too much, not too little</em> ) and following the <a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle">Dependency Inversion Principle</a></li>
<li>the <a href="https://en.wikipedia.org/wiki/Single-responsibility_principle">single responsibility principle</a></li>
<li>the <a href="https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)">encapsulation</a> habit</li>
<li>the <a href="https://en.wikipedia.org/wiki/Composition_over_inheritance">composition over inheritance</a> principle</li>
</ul>
<p>Or as we now say broadly following the <a href="https://en.wikipedia.org/wiki/SOLID">SOLID</a> principles, none of these concepts are things that you can only do with classes, inheritance or a stubbornly opinionated Object Oriented programming language.</p>
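<p>To illustrate, here&#39;s a small, hypothetical Python sketch: the dependency inversion and composition parts need nothing more than a function signature to depend on, no class hierarchy required.</p>
<pre><code class="language-python">from typing import Callable, Iterable

# The &quot;abstraction&quot; high-level code depends on: any callable returning rows
# (dependency inversion without an abstract base class).
FetchRows = Callable[[], Iterable[dict]]

def make_report(fetch_rows: FetchRows) -&gt; str:
    # High-level policy: format whatever rows the injected fetcher returns
    return &#39;\n&#39;.join(f&quot;{row[&#39;name&#39;]}: {row[&#39;total&#39;]}&quot; for row in fetch_rows())

# A low-level detail, composed in rather than inherited from
def fetch_from_memory():
    return [{&#39;name&#39;: &#39;alice&#39;, &#39;total&#39;: 3}, {&#39;name&#39;: &#39;bob&#39;, &#39;total&#39;: 5}]

print(make_report(fetch_from_memory))
</code></pre>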
<p>As a side note, most of the time it&#39;s easier to follow the spirit of these principles in a functional programming language, but functional languages are notoriously absent from the list of popular programming languages; the closest we&#39;d get is that some of these languages have &quot;functional programming features&quot; like first-class functions, map, filter, and that&#39;s it.</p>
<p>OOP has <a href="https://loup-vaillant.fr/articles/deaths-of-oop">died many times in the past</a> it has however survived until today but in other forms. We still call this by continuity OOP each time mostly because OOP has always been very loosely defined - the projects we build today using OOP languages and frameworks do not use as many abstractions, layers of indirections, overrides or even overloads than when the hype was at its peak, and it&#39;s a good thing! For simplicity is always a good thing!</p>
<p>This lack of definition is even clearer if we go back to Alan Kay, the creator of Smalltalk, who coined the term &quot;Object Oriented Programming&quot; and meant the following by it:</p>
<blockquote>
<p><em><strong>&quot;OOP to me means only messaging, local retention and protection and hiding of state-process, and extreme late-binding of all things.&quot;</strong></em></p>
</blockquote>
<p>None of the current leaders in OOP are message-passing oriented (sadly), yet we consider them object oriented.</p>
<p>I do not care that much about the survival of OOP, but I do see the value in its core teachings, in the separation of concerns it brought us, and in the efficient tooling and compilers that have been developed and refined over the last 30 years. We, as a profession, are not doomed to repeat the cycle of hype, fame, banishment and rewrite that I&#39;ve already experienced multiple times in my short career.</p>
<p>We should encourage all software engineers to strive for knowledge, learn, and develop critical thinking rather than abandon all rational behavior and consider only the hype and prejudice of our times - in the end, even &quot;old&quot; programming languages and paradigms can be <a href="https://medium.com/nerd-for-tech/is-oop-relevant-today-3b3fdc2d1ab2#:~:text=Wrapping%20Up-,Is%20OOP%20still%20an%20effective%20software%20development%20tool%20or%20is,and%20communications%20models%20are%20crucial.">relevant</a> today for the objective we all share: to stay sane in a convoluted codebase.</p>
<br />

]]></content:encoded>
            <category>oss</category>
            <category>python</category>
            <category>techzone</category>
            <category>java</category>
            <category>dev</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036280-image-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[From Pandas to Apache Spark's Dataframe]]></title>
            <link>https://ogirardot.writizzy.com/p/from-pandas-to-apache-sparks-dataframe</link>
            <guid>https://ogirardot.writizzy.com/p/from-pandas-to-apache-sparks-dataframe</guid>
            <pubDate>Fri, 31 Jul 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[With the introduction in Spark 1.4 of Window operations, you can finally port pretty much any relevant piece of Pandas' Dataframe computation to Apache Spark parallel computation framework using Spark...]]></description>
            <content:encoded><![CDATA[<p>With the introduction in Spark 1.4 of Window operations, you can finally port pretty much any relevant piece of Pandas&#39; Dataframe computation to Apache Spark&#39;s parallel computation framework using Spark SQL&#39;s Dataframe. If you&#39;re not yet familiar with Spark&#39;s Dataframe, don&#39;t hesitate to check out my last article <a href="https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/">RDDs are the new bytecode of Apache Spark</a> and come back here after :p.</p>
<p>I figured some feedback on how to port existing &quot;complex&quot; code might be useful, so the goal of this article will be to take a few concepts from Pandas&#39; Dataframe and see how we can translate them to PySpark&#39;s Dataframe using Spark 1.4.</p>
<p><strong>Disclaimer</strong>: a few operations that you can do in Pandas don&#39;t make any sense using Spark. Please remember that Dataframes in Spark are like RDDs in the sense that they&#39;re an immutable data structure. Therefore things like:</p>
<pre><code class="language-python">df[&#39;three&#39;] = df[&#39;one&#39;] * df[&#39;two&#39;] # to create a new column &quot;three&quot;
</code></pre>
<p>Can&#39;t exist, just because this kind of assignment goes against the principles of Spark. Another example would be trying to access a single element within a Dataframe by index. Don&#39;t forget that you&#39;re using a distributed data structure, not an in-memory random-access data structure. To be clear, this doesn&#39;t mean that you can&#39;t do the same kind of thing (i.e. create a new column) using Spark, it means that you have to think immutable/distributed and re-write parts of your code, mostly the parts that are not purely thought of as transformations on a stream of data. So let&#39;s dive in.</p>
<h2>Column selection</h2>
<p>This part is not that much different in Pandas and Spark, but you have to take into account the immutable character of your dataframe. First let&#39;s create two dataframes, one in Pandas <strong>pdf</strong> and one in Spark <strong>df</strong>:</p>
<pre><code class="language-python"># Pandas =&gt; pdf  
pdf = pd.DataFrame.from_items([(&#39;A&#39;, [1, 2, 3]), (&#39;B&#39;, [4, 5, 6])])

In [18]: pdf.A  
Out[18]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64

# SPARK SQL =&gt; df  
In [19]: df = sqlCtx.createDataFrame([(1, 4), (2, 5), (3, 6)], [&quot;A&quot;, &quot;B&quot;])

In [20]: df  
Out[20]: DataFrame[A: bigint, B: bigint]

In [21]: df.show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
|2|5|  
|3|6|  
+-+-+  
</code></pre>
<p>Now in Spark SQL or Pandas you use the same syntax to refer to a column:</p>
<pre><code class="language-python">In [27]: df.A  
Out[27]: Column&lt;A&gt;

In [28]: df[&#39;A&#39;]  
Out[28]: Column&lt;A&gt;

In [29]: pdf.A  
Out[29]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64

In [30]: pdf[&#39;A&#39;]  
Out[30]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64  
</code></pre>
<p>The output seems different, but these are still the same ways of referencing a column in Pandas or Spark; the only difference is that in Pandas it is a mutable data structure that you can change, while in Spark it is not.</p>
<h2>Column adding</h2>
<pre><code class="language-python">In [31]: pdf[&#39;C&#39;] = 0

In [32]: pdf  
Out[32]:  
A B C  
0 1 4 0  
1 2 5 0  
2 3 6 0

# In Spark SQL you&#39;ll use the withColumn or the select method,  
# but you need to create a &quot;Column&quot;, a simple int won&#39;t do:  
In [33]: df.withColumn(&#39;C&#39;, 0)  
-------------------------  
AttributeError Traceback (most recent call last)  
&lt;ipython-input-33-fd1261f623cf&gt; in &lt;module&gt;()  
--&gt; 1 df.withColumn(&#39;C&#39;, 0)

/Users/ogirardot/Downloads/spark-1.4.0-bin-hadoop2.4/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)  
1196 &quot;&quot;&quot;  
-&gt; 1197 return self.select(&#39;*&#39;, col.alias(colName))  
1198  
1199 @ignore_unicode_prefix

AttributeError: &#39;int&#39; object has no attribute &#39;alias&#39;

# Here&#39;s your new best friend &quot;pyspark.sql.functions.*&quot;  
# If you can&#39;t create it from composing columns,  
# this package contains all the functions you&#39;ll need:  
In [35]: from pyspark.sql import functions as F  
In [36]: df.withColumn(&#39;C&#39;, F.lit(0))  
Out[36]: DataFrame[A: bigint, B: bigint, C: int]

In [37]: df.withColumn(&#39;C&#39;, F.lit(0)).show()  
+-+-+-+  
|A|B|C|  
+-+-+-+  
|1|4|0|  
|2|5|0|  
|3|6|0|  
+-+-+-+  
</code></pre>
<p>Most of the time in Spark SQL you can use Strings to reference columns, but there are a few cases where you&#39;ll want to use the Column objects rather than Strings:</p>
<ul>
<li>In Spark SQL Dataframe columns are allowed to have the same name, they&#39;ll be given unique names inside of Spark SQL, but this means that you can&#39;t reference them with the column name only as this becomes ambiguous.</li>
<li>When you need to manipulate columns using expressions like <strong>&quot;Adding two columns to each other&quot;</strong> , <strong>&quot;Twice the value of this column&quot;</strong> or even <strong>&quot;Is the column value larger than 0 ?&quot;</strong>, you won&#39;t be able to use simple strings and will need the Column reference</li>
<li>Finally if you need renaming, cast or any other complex feature, you&#39;ll need the Column reference too.</li>
</ul>
<p>Here&#39;s an example:</p>
<pre><code class="language-python">In [39]: df.withColumn(&#39;C&#39;, df.A * 2)  
Out[39]: DataFrame[A: bigint, B: bigint, C: bigint]

In [40]: df.withColumn(&#39;C&#39;, df.A * 2).show()  
+-+-+-+  
|A|B|C|  
+-+-+-+  
|1|4|2|  
|2|5|4|  
|3|6|6|  
+-+-+-+

In [41]: df.withColumn(&#39;C&#39;, df.B &gt; 0).show()  
+-+-+--+  
|A|B| C|  
+-+-+--+  
|1|4|true|  
|2|5|true|  
|3|6|true|  
+-+-+--+
</code></pre>
<p>When you’re selecting columns, to create another <em>projected</em> dataframe, you can also use expressions :</p>
<pre><code class="language-python">In [42]: df.select(df.B &gt; 0)  
Out[42]: DataFrame[(B &gt; 0): boolean]

In [43]: df.select(df.B &gt; 0).show()  
+---+  
|(B &gt; 0)|  
+---+  
| true|  
| true|  
| true|  
+---+  
</code></pre>
<p>As you can see, the column name will actually be computed according to the expression you defined; if you want to rename it, you’ll need to use the <strong>alias</strong> method on Column:</p>
<pre><code class="language-python">In [44]: df.select((df.B &gt; 0).alias(&quot;is_positive&quot;)).show()  
+----+  
|is_positive|  
+----+  
| true|  
| true|  
| true|  
+----+  
</code></pre>
<p>All of the expressions that we&#39;re building here can be used for Filtering, Adding a new column or even inside Aggregations, so once you get a general idea of how it works, you&#39;ll be fluent throughout all of the Dataframe manipulation framework.</p>
<h2>Filtering</h2>
<p>Filtering is pretty straightforward too: you can use the RDD-like <strong>filter</strong> method and copy any of your existing Pandas expressions/predicates for filtering:</p>
<pre><code class="language-python">In [48]: pdf[(pdf.B &gt; 0) &amp; (pdf.A &lt; 2)]  
Out[48]:  
A B C  
0 1 4 0

In [49]: df.filter((df.B &gt; 0) &amp; (df.A &lt; 2)).show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
+-+-+

In [55]: df[(df.B &gt; 0) &amp; (df.A &lt; 2)].show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
+-+-+ 
</code></pre>
<h2>Aggregations</h2>
<p>What can be confusing at first when using aggregations is that the minute you write <strong>groupBy</strong> you&#39;re not using a Dataframe object, you&#39;re actually using a <strong>GroupedData</strong> object, and you need to specify your aggregations to get back an output Dataframe:</p>
<pre><code class="language-python">In [77]: df.groupBy(&quot;A&quot;)  
Out[77]: &lt;pyspark.sql.group.GroupedData at 0x10dd11d90&gt;

In [78]: df.groupBy(&quot;A&quot;).avg(&quot;B&quot;)  
Out[78]: DataFrame[A: bigint, AVG(B): double]

In [79]: df.groupBy(&quot;A&quot;).avg(&quot;B&quot;).show()  
+-+--+  
|A|AVG(B)|  
+-+--+  
|1| 4.0|  
|2| 5.0|  
|3| 6.0|  
+-+--+  
</code></pre>
<p>As syntactic sugar, if you need only one aggregation, you can use the simplest functions like <strong>avg, count, max, min, mean</strong> and <strong>sum</strong> directly on GroupedData, but most of the time this will be too simple and you&#39;ll want to compute a few aggregations during a single groupBy operation. After all (c.f. <a href="https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/">RDDs are the new bytecode of Apache Spark</a>) this is one of the greatest features of Dataframes. To do so you&#39;ll be using the <strong>agg</strong> method:</p>
<pre><code class="language-python">In [83]: df.groupBy(&quot;A&quot;).agg(F.avg(&quot;B&quot;), F.min(&quot;B&quot;), F.max(&quot;B&quot;)).show()  
+-+--+--+--+  
|A|AVG(B)|MIN(B)|MAX(B)|  
+-+--+--+--+  
|1| 4.0| 4| 4|  
|2| 5.0| 5| 5|  
|3| 6.0| 6| 6|  
+-+--+--+--+  
</code></pre>
<p>Of course, just like before, you can use any expression, especially column compositions, alias definitions, etc., and some other non-trivial functions:</p>
<pre><code class="language-python">In [84]: df.groupBy(&quot;A&quot;).agg(  
....: F.first(&quot;B&quot;).alias(&quot;my first&quot;),  
....: F.last(&quot;B&quot;).alias(&quot;my last&quot;),  
....: F.sum(&quot;B&quot;).alias(&quot;my everything&quot;)  
....: ).show()  
+-+---+---+-----+  
|A|my first|my last|my everything|  
+-+---+---+-----+  
|1| 4| 4| 4|  
|2| 5| 5| 5|  
|3| 6| 6| 6|  
+-+---+---+-----+  
</code></pre>
<h2>Complex operations: Windows</h2>
<p>Now that Spark 1.4 is out, the Dataframe API provides an efficient and easy to use Window-based framework - this single feature is what makes any Pandas to Spark migration actually do-able for 99% of the projects - even considering some of Pandas&#39; features that seemed hard to reproduce in a distributed environment. </p>
<p>A simple example we can pick is that in Pandas you can compute a <strong>diff</strong> on a column, and Pandas will compare the value of each line to the previous one and compute the difference between them. This is typically the kind of feature that is hard to do in a distributed environment, because each line is supposed to be treated independently; now, with Spark 1.4 window operations, you can define a window over which Spark will &quot;<strong>execute some aggregation functions</strong>&quot; relative to a specific line. Here&#39;s how to port some existing Pandas code using diff:</p>
<pre><code class="language-python">In [86]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], [&quot;A&quot;, &quot;B&quot;])

In [95]: pdf = df.toPandas()

In [96]: pdf  
Out[96]:  
A B  
0 1 4  
1 1 5  
2 2 6  
3 2 6  
4 3 0

In [98]: pdf[&#39;diff&#39;] = pdf.B.diff()

In [102]: pdf  
Out[102]:  
A B diff  
0 1 4 NaN  
1 1 5 1  
2 2 6 1  
3 2 6 0  
4 3 0 -6  
</code></pre>
<p>In Pandas you can compute a diff on an arbitrary column, with no regard for keys, no regard for order or anything. It’s cool… but most of the time it’s not exactly what you want, and you might end up cleaning up the mess afterwards by setting the column value back to NaN on the lines where the key changes.</p>
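<p>For reference, the usual Pandas workaround is to compute the diff per key with a groupby; a quick sketch in today&#39;s Pandas, assuming the same <strong>pdf</strong> as above:</p>
<pre><code class="language-python"># Sort by key and value, then diff within each group: NaN marks each key boundary
pdf = pdf.sort_values([&#39;A&#39;, &#39;B&#39;])
pdf[&#39;diff&#39;] = pdf.groupby(&#39;A&#39;)[&#39;B&#39;].diff()
</code></pre>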
<p>Here’s how you can do such a thing in PySpark using Window functions, a Key and, if you want, a specific Order:</p>
<pre><code class="language-python">In [107]: from pyspark.sql.window import Window

In [108]: window_over_A = Window.partitionBy(&quot;A&quot;).orderBy(&quot;B&quot;)

In [109]: df.withColumn(&quot;diff&quot;, F.lead(&quot;B&quot;).over(window_over_A) - df.B).show()  
+---+---+----+  
|  A|  B|diff|  
+---+---+----+  
|  1|  4|   1|  
|  1|  5|null|  
|  2|  6|   0|  
|  2|  6|null|  
|  3|  0|null|  
+---+---+----+
</code></pre>
<p>With that you are now able to compute a diff line by line - ordered or not - given a specific key. The great point about Window operations is that you’re not actually breaking the structure of your data. Let me explain.</p>
<p>When you’re computing some kind of aggregation (once again according to a key), you’ll usually be executing a <strong>groupBy</strong> operation given this key and computing the multiple metrics that you’ll need (if you’re lucky <em>at the same time</em>, if you’re not, in multiple <strong>reduceByKey</strong> or <strong>aggregateByKey</strong> transformations).</p>
<p>But whether you’re using RDDs or Dataframes, if you’re not using window operations then you’ll actually crush your data in a part of your flow and then need to join the results of your aggregations back to the <em>main</em> dataflow. Window operations allow you to execute your computation and copy the results as additional columns without any explicit join.</p>
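<p>To make the difference concrete, here&#39;s a small sketch, in today&#39;s PySpark syntax, of the same kind of per-key enrichment (an average of <strong>B</strong> per <strong>A</strong>) computed both ways, reusing the <strong>df</strong> and <strong>F</strong> from above:</p>
<pre><code class="language-python"># Without a window: aggregate, then join the result back onto the original rows
agg = df.groupBy(&quot;A&quot;).agg(F.avg(&quot;B&quot;).alias(&quot;avg_B&quot;))
enriched_via_join = df.join(agg, on=&quot;A&quot;)

# With a window: the aggregate is copied onto each row directly, no explicit join
from pyspark.sql.window import Window
per_key = Window.partitionBy(&quot;A&quot;)
enriched_via_window = df.withColumn(&quot;avg_B&quot;, F.avg(&quot;B&quot;).over(per_key))
</code></pre>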
<p>This is a quick way to enrich your data adding rolling computations as just another column directly. Two additional resources are worth noting regarding these new features, the official <a href="https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html">Databricks blog article on Window operations</a> and <a href="http://twitter.com/chris_bour">Christophe Bourguignat</a>‘s article evaluating <a href="https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2">Pandas and Spark Dataframe differences</a>.</p>
<p>To sum up, you now have all the tools you need in Spark &gt;= 1.4 to port any Pandas computation to a distributed environment using the <em>very</em> similar Dataframe API.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>python</category>
        </item>
        <item>
            <title><![CDATA[RDDs are the new bytecode of Apache Spark]]></title>
            <link>https://ogirardot.writizzy.com/p/rdds-are-the-new-bytecode-of-apache-spark</link>
            <guid>https://ogirardot.writizzy.com/p/rdds-are-the-new-bytecode-of-apache-spark</guid>
            <pubDate>Fri, 29 May 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[With the Apache Spark 1.3 release the Dataframe API for Spark SQL got introduced, for those of you who missed the big announcements, I'd recommend to read the article : [Introducing Dataframes in Spar...]]></description>
            <content:encoded><![CDATA[<p>With the Apache Spark 1.3 release the Dataframe API for Spark SQL got introduced, for those of you who missed the big announcements, I&#39;d recommend to read the article : <a href="https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html" title="Introducing Dataframes in Spark for Large Scale Data Science">Introducing Dataframes in Spark for Large Scale Data Science</a> from the Databricks blog. Dataframes are very popular among data scientists, personally I&#39;ve mainly been using them with the great Python library <a href="http://pandas.pydata.org" title="Pandas">Pandas</a> but there are many examples in R (originally) and Julia.</p>
<p>Of course, if you&#39;re using only Spark&#39;s core features, nothing seems to have changed with Spark 1.3 : Spark&#39;s main abstraction remains the RDD (Resilient Distributed Dataset), its API is very stable now, and everyone has been using it to handle all kinds of data until now.</p>
<p>But the introduction of Dataframe is actually a big deal, because when RDDs were the only option to load data, it was obvious that you needed to parse your &quot;maybe&quot; un-structured data using RDDs, transform them using case-classes or tuples and then do the special work that you actually needed. Spark SQL is not a new project and you were, of course, able to load your structured-data (like Parquet files) directly from a <a href="https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.sql.SQLContext" title="SQLContext in Apache Spark 1.0">SQLContext</a> before 1.3 - but the advantages were not that clear at the time - except if you wanted to run SQL queries or expose a JDBC-compatible server for other BI tools.</p>
<p>Now the advantages are quite clear and I&#39;ll try to explain them as simply as possible :</p>
<ol>
<li>Dataframes are a higher level of abstraction than RDDs</li>
</ol>
<hr>
<p>If you&#39;re familiar with Pandas syntax, you will feel at home using Spark&#39;s Dataframe and even if you&#39;re not, you&#39;ll learn and - I&#39;d even add - learn to love it. Why ? Because it&#39;s a higher level of programming than the RDD, you can <a href="http://www.domorefasterbook.com/" title="Oops">do more, faster</a> (old joke now ;-) ). Here&#39;s an example from <a href="http://www.pwendell.com/" title="Patrick Wendell">Patrick Wendell</a>&#39;s Strata London 2015 presentation &quot;What&#39;s coming in Spark&quot; of RDDs in Python vs Dataframe :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/rdd-vs-dataframe.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034246-rdd-vs-dataframe.png" alt="RDD vs Dataframe" /></a></p>
<p>Of course the second way of writing it is obviously more concise and more understandable, but I&#39;d like to add something else: <em>tried-and-tested</em> Spark programmers will surely have noticed the <strong>reduceByKey</strong> transformation used here. It is a very common mistake in Spark, for common aggregation tasks, to use the <strong>groupBy</strong> then <strong>mapValues</strong> or <strong>map</strong> transformations, which can be dangerous in a production environment and produce <strong>OutOfMemory</strong> errors on workers.</p>
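<p>As a quick illustration, here is a minimal PySpark sketch of the difference (not from the presentation; the <strong>pairs</strong> RDD is just sample data and an existing SparkContext <strong>sc</strong> is assumed) :</p>
<pre><code class="language-python"># sample (key, value) pairs
pairs = sc.parallelize([(&quot;a&quot;, 1), (&quot;a&quot;, 2), (&quot;b&quot;, 3)])

# risky: groupByKey materialises every value of a key on a single worker
# before aggregating, which can blow up memory for skewed keys
sums = pairs.groupByKey().mapValues(lambda values: sum(values))

# safer: reduceByKey combines values map-side before shuffling them
sums = pairs.reduceByKey(lambda a, b: a + b)
</code></pre>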
<p>Notice that such a mistake <strong>cannot</strong> happen using the Dataframe API below, for you will be expressing your aggregations using, for example, the <strong>agg(...)</strong> method (or even directly the <strong>avg(...)</strong> method like above). This will even allow you to define multiple aggregations at the same time, something that is usually tricky using RDDs :</p>
<pre><code class="language-scala">case class Person(id: Int, first_name: String, last_name: String, age: Double)

// get simple stats on age repartitions by first_name (min, max, avg, count)
val rdd: RDD[Person] = ...

// first you need to only handle the data you really need,
// and cache it because you&#39;ll - sadly - reuse it
val persons = rdd.map(person =&gt; (person.first_name, person.age)).cache()

val minAgeByFirstName = persons.reduceByKey( scala.math.min(_, _) )
val maxAgeByFirstName = persons.reduceByKey( scala.math.max(_, _) )
val avgAgeByFirstName = persons.mapValues(x =&gt; (x, 1))
                               .reduceByKey((x, y) =&gt; (x._1 + y._1, x._2 + y._2)) // simple right ?
val countByFirstName = persons.mapValues(x =&gt; 1).reduceByKey(_ + _)
</code></pre>
<p>Without even considering the complexity of all I had to write to get all my answers - answers that I would need to join back if I want a consistent RDD with all the information I need - the most painful point is that I had to duplicate all these aggregations and therefore <strong>cache</strong> my dataset to mitigate the damage.</p>
<p>Now, using the Dataframe API, I get to leverage out-of-the-box functions and I can even reuse my computations afterwards without having to join anything back :</p>
<pre><code class="language-scala">case class Person(id: Int, first_name: String, last_name: String, age: Double)

// get simple stats on age repartitions by first_name (min, max, avg, count)
val df: DataFrame = ...

val persons = df.groupBy(&quot;first_name&quot;).agg(
  min(&quot;age&quot;).alias(&quot;min_age&quot;),
  max(&quot;age&quot;).alias(&quot;max_age&quot;),
  avg(&quot;age&quot;).alias(&quot;average_age&quot;),
  count(&quot;*&quot;).alias(&quot;number_of_persons&quot;)
)

// let&#39;s add a new column to our schema re-using the two last-computed aggregations :
val finalDf = persons.withColumn(&quot;age_delta&quot;, persons(&quot;max_age&quot;) - persons(&quot;min_age&quot;))
</code></pre>
<p>This is a higher level of programming than RDDs, so some things might be more difficult to express with Dataframes than they were using RDDs, when you could <strong>groupBy(...)</strong> anything and get the <em>List[...]</em> of results as values... But this was a bad practice anyway :).</p>
<ol start="2">
<li>Spark SQL/Catalyst is more intelligent than you</li>
</ol>
<hr>
<p>When you&#39;re using Dataframe, you&#39;re not defining directly a DAG (Directed Acyclic Graph) anymore, you&#39;re actually creating an AST (Abstract Syntax Tree) that the Catalyst engine will parse, check and improve using both Rules-Based Optimisation and Cost-Based Optimisation. This is an excerpt from the Spark SQL paper submitted for SIGMOD 2015 :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/spark-sql-pipeline.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034678-spark-sql-pipeline.png" alt="spark SQL pipeline" /></a></p>
<p>I won&#39;t get into the depths of this here, because that would need more than one full article on its own, but if you want to understand more, the article <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html">Deep dive into Spark SQL's Catalyst optimizer</a> from the Databricks blog (once again) will give you insights into how this works. A simple rule of thumb to remember is that a lot of &quot;pretty logical&quot; generic tree-based rules will be used to check and simplify your parsed Logical Plan, and then a few Physical Plans representing different execution strategies will be computed, with one selected according to its &quot;computation cost&quot;.</p>
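<p>If you want to peek at what Catalyst produces for a given query, a minimal PySpark sketch (not from the original post, with made-up data and an assumed <strong>sqlContext</strong>) is to ask the Dataframe for its plans :</p>
<pre><code class="language-python"># hypothetical Dataframe
df = sqlContext.createDataFrame([(1, &quot;alice&quot;, 34), (2, &quot;bob&quot;, 19)], [&quot;id&quot;, &quot;first_name&quot;, &quot;age&quot;])

# extended=True prints the parsed, analyzed and optimized logical plans
# as well as the physical plan that was finally selected
df.filter(df.age &gt; 21).select(&quot;first_name&quot;).explain(True)
</code></pre>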
<p>The funny thing is that in the end - nothing changes - after all these transformations your Dataframe will get <em>compiled</em> down to RDDs and executed on your Spark Cluster.</p>
<ol start="3">
<li>Python &amp; Scala are now even in terms of performance</li>
</ol>
<hr>
<p>Using the Dataframe API, you&#39;re using a DSL that leverages Spark&#39;s Scala bytecode. When using RDDs, Python lambdas will run in a Python VM and Java/Scala lambdas will run in the JVM; this is great because inside RDDs you can use your usual Python libraries (Numpy, Scipy, etc.) and not some Jython code, but it comes at a performance cost :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/unified-physical-execution.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771035151-unified-physical-execution.png" alt="Unified physical execution" /></a></p>
<p>This is still true if you want to use Dataframe&#39;s User Defined Functions: you can write them in Java/Scala or Python and this will impact your computation performance - but if you manage to stay in a pure Dataframe computation, then nothing will get between you and the best computation performance you can possibly get.</p>
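<p>For instance, here is a minimal PySpark sketch (not from the original post, with made-up data and an assumed <strong>sqlContext</strong>) contrasting a Python UDF with a pure Dataframe expression :</p>
<pre><code class="language-python">from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(1, 25), (2, 30)], [&quot;id&quot;, &quot;age&quot;])

# Python UDF: every row is shipped to a Python worker process, which costs performance
double_age = F.udf(lambda age: age * 2, IntegerType())
df.withColumn(&quot;double_age&quot;, double_age(df.age)).show()

# pure Dataframe expression: stays inside the JVM / Catalyst pipeline
df.withColumn(&quot;double_age&quot;, df.age * 2).show()
</code></pre>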
<ol start="4">
<li>Dataframes are the future for Spark &amp; You</li>
</ol>
<hr>
<p>Spark ML is already a pretty obvious example of this: the Pipeline API is designed entirely around Dataframes as its sole data structure for parallel computations, model training and predictions. And even if you don&#39;t believe me, here&#39;s once again Patrick Wendell&#39;s presentation on what the future of Spark is :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/future-of-spark.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771035604-future-of-spark.png" alt="Future of Spark" /></a></p>
<p>Anyway, I think I made my point regarding the whole goal of this article : RDDs are the new bytecode of Apache Spark. You might be sad or pissed because you spent a lot of time learning how to harness Spark&#39;s RDDs and now you think Dataframes are a completely new paradigm to learn...</p>
<p>You&#39;re partially right, because if you don&#39;t already know the Pandas or R APIs, Dataframes are a new thing and you&#39;ll need some work to harness them - but remember that in the end, everything comes down to RDDs - so all that you learned before is still relevant, this is just another skill to add to your resume.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <category>data</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034246-rdd-vs-dataframe.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Changing Spark's default java serialization to Kryo]]></title>
            <link>https://ogirardot.writizzy.com/p/changing-sparks-default-java-serialization-to-kryo</link>
            <guid>https://ogirardot.writizzy.com/p/changing-sparks-default-java-serialization-to-kryo</guid>
            <pubDate>Fri, 09 Jan 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[Apache Spark's default serialization relies on Java with the default *readObject(...)* and *writeObject(...)* methods for all **Serializable**classes. This is a very fine default behavior as long as y...]]></description>
            <content:encoded><![CDATA[<p>Apache Spark&#39;s default serialization relies on Java, with the default <em>readObject(...)</em> and <em>writeObject(...)</em> methods for all <strong>Serializable</strong> classes. This is a very fine default behavior, as long as you don&#39;t rely on it too much...</p>
<p>Why ? Because Java&#39;s serialization framework is notoriously inefficient, consuming too much CPU and RAM and producing payloads too large for it to be a suitable large-scale serialization format.</p>
<p>Ok, but you could tell me that you, as an Apache Spark user, are not using Java&#39;s serialization framework at all. The fact is that Apache Spark, as a system, relies on it a lot :</p>
<ul>
<li>Every task run from Driver to Worker gets serialized : <strong>Closure serialization</strong></li>
<li>Every result from every task gets serialized at some point : <strong>Result serialization</strong></li>
</ul>
<p>And what&#39;s implied is that during all <strong>closure serializations</strong>, all the <strong>values used inside</strong> will get serialized as well. For the record, this is also one of the main reasons to use Broadcast variables when closures might get serialized with big values.</p>
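<p>As an aside, here is a minimal PySpark sketch of that broadcast idea (not from the original post, assuming an existing SparkContext named <strong>sc</strong>) :</p>
<pre><code class="language-python"># a large lookup table captured directly by a closure would be
# re-serialized and shipped with every single task
lookup = {str(i): i * i for i in range(100000)}

# broadcasting it ships it to each executor only once, read-only
bd_lookup = sc.broadcast(lookup)

squares = (sc.parallelize(range(100))
             .map(lambda i: bd_lookup.value[str(i)])
             .collect())
</code></pre>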
<p><a href="https://github.com/EsotericSoftware/kryo" title="Kryo - Java serialization">Kryo</a> is a project like <a href="http://avro.apache.org/">Apache Avro</a> or <a href="https://github.com/google/protobuf/">Google's Protobuf</a> (or its Java-oriented equivalent <a href="https://github.com/protostuff/protostuff">Protostuff</a> - which I have not tested yet). I&#39;m not a big fan of benchmarks but they can be useful, and Kryo designed a few to measure the size and time of serialization. Here&#39;s what such a benchmark looks like at the time of writing (i.e. early 2015) :</p>
<p><img src="https://camo.githubusercontent.com/829809b59ac2efe1ec62ac2f2cfbb29606a02a44/68747470733a2f2f63686172742e676f6f676c65617069732e636f6d2f63686172743f636874743d746f74616c2b2532386e616e6f73253239266368663d637c7c6c677c7c307c7c4646464646467c7c317c7c3736413446427c7c307c62677c7c737c7c454645464546266368733d35303078343330266368643d743a313232362c313439322c313536382c323230302c323436352c323933392c333530312c333635392c333637302c343439352c383531362c31303035372c31303437372c31323138372c31333130392c31353632382c31393938302c32383534382c33363034362c343438333826636864733d302c34393332322e313530333526636878743d79266368786c3d303a7c6a736f6e253246666c65786a736f6e2532466461746162696e647c6a6176612d6275696c742d696e7c6a626f73732d6d61727368616c6c696e672d72697665727c786d6c2532467873747265616d253242637c6a736f6e2532467376656e736f6e2d6461746162696e647c6a626f73732d73657269616c697a6174696f6e7c62736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e253246676f6f676c652d67736f6e2532466461746162696e647c6865737369616e7c786d6c2532466a61636b736f6e2532466461746162696e642d61616c746f7c6a736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e25324670726f746f73747566662d72756e74696d657c736d696c652532466a61636b736f6e2532466461746162696e647c6a736f6e2532466a61636b736f6e25324664622d61667465726275726e65727c736d696c652532466a61636b736f6e25324664622d61667465726275726e65727c6a736f6e253246666173746a736f6e2532466461746162696e647c6d73677061636b2d6461746162696e647c666173742d73657269616c697a6174696f6e7c6b72796f7c70726f746f73747566662663686d3d4e2532302a662a2c3030303030302c302c2d312c3130266c6b6c6b266368646c703d74266368636f3d3636303030307c3636303033337c3636303036367c3636303039397c3636303043437c3636303046467c3636333330307c3636333333337c3636333336367c3636333339397c3636333343437c3636333346467c3636363630307c3636363633337c363636363636266368743d62686726636862683d31302c302c3130266e6f6e73656e73653d6161612e706e67" alt="" /></p>
<p>So how can you change Spark&#39;s default serializer easily? Well, as usual Spark is a pretty configurable system, so all you need is to specify which serializer you want to use when you define your <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext">SparkContext</a> using the <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf">SparkConf</a>, like that :</p>
<pre><code class="language-scala">val conf = new SparkConf()
  .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
</code></pre>
<p>And voilà ! But that&#39;s not all: if you&#39;ve got big objects to serialize and are prepared to <strong>face the consequences</strong>, you might get OutOfMemoryErrors or GC overhead errors that happen very fast using Java&#39;s default serialization (did I tell you it sucks for some reasons... ?) and that won&#39;t get resolved auto-magically by switching to Kryo.</p>
<p>Luckily you can define what buffer size Kryo will use by default :
</p>
<pre><code class="language-scala">val conf = new SparkConf()
  .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
  // Now it&#39;s 24 Mb of buffer by default instead of 0.064 Mb
  .set(&quot;spark.kryoserializer.buffer.mb&quot;, &quot;24&quot;)
</code></pre>
<p>If you&#39;re even bolder you can customize all of these options :</p>
<ul>
<li><strong>spark.kryoserializer.buffer.max.mb</strong>(64 Mb by default) : useful if your default buffer size goes further than 64 Mb;</li>
<li><strong>spark.kryo.referenceTracking</strong> (true by default) : c.f. <a href="https://github.com/EsotericSoftware/kryo#references">reference tracking in Kryo</a></li>
<li><strong>spark.kryo.registrationRequired</strong> (false by default) : Kryo&#39;s parameter to define if all serializable classes must be registered</li>
<li><strong>spark.kryo.classesToRegister</strong> (empty string list by default) : you can add a list of the qualified names of all classes that must be registered (c.f. last parameter)</li>
</ul>
<p>The examples above are defined in Scala, but of course these parameters can be used in Java and Python as well.</p>
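<p>As an illustration, here is a minimal PySpark sketch of the same configuration (not from the original post) :</p>
<pre><code class="language-python">from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName(&quot;kryo-example&quot;)
        .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
        # same 24 Mb buffer as in the Scala example above
        .set(&quot;spark.kryoserializer.buffer.mb&quot;, &quot;24&quot;))

sc = SparkContext(conf=conf)
</code></pre>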
<p>Enjoy.</p>
]]></content:encoded>
            <category>java</category>
            <enclosure url="https://camo.githubusercontent.com/829809b59ac2efe1ec62ac2f2cfbb29606a02a44/68747470733a2f2f63686172742e676f6f676c65617069732e636f6d2f63686172743f636874743d746f74616c2b2532386e616e6f73253239266368663d637c7c6c677c7c307c7c4646464646467c7c317c7c3736413446427c7c307c62677c7c737c7c454645464546266368733d35303078343330266368643d743a313232362c313439322c313536382c323230302c323436352c323933392c333530312c333635392c333637302c343439352c383531362c31303035372c31303437372c31323138372c31333130392c31353632382c31393938302c32383534382c33363034362c343438333826636864733d302c34393332322e313530333526636878743d79266368786c3d303a7c6a736f6e253246666c65786a736f6e2532466461746162696e647c6a6176612d6275696c742d696e7c6a626f73732d6d61727368616c6c696e672d72697665727c786d6c2532467873747265616d253242637c6a736f6e2532467376656e736f6e2d6461746162696e647c6a626f73732d73657269616c697a6174696f6e7c62736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e253246676f6f676c652d67736f6e2532466461746162696e647c6865737369616e7c786d6c2532466a61636b736f6e2532466461746162696e642d61616c746f7c6a736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e25324670726f746f73747566662d72756e74696d657c736d696c652532466a61636b736f6e2532466461746162696e647c6a736f6e2532466a61636b736f6e25324664622d61667465726275726e65727c736d696c652532466a61636b736f6e25324664622d61667465726275726e65727c6a736f6e253246666173746a736f6e2532466461746162696e647c6d73677061636b2d6461746162696e647c666173742d73657269616c697a6174696f6e7c6b72796f7c70726f746f73747566662663686d3d4e2532302a662a2c3030303030302c302c2d312c3130266c6b6c6b266368646c703d74266368636f3d3636303030307c3636303033337c3636303036367c3636303039397c3636303043437c3636303046467c3636333330307c3636333333337c3636333336367c3636333339397c3636333343437c3636333346467c3636363630307c3636363633337c363636363636266368743d62686726636862683d31302c302c3130266e6f6e73656e73653d6161612e706e67" length="0" 
type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Try Apache Spark's shell using Docker]]></title>
            <link>https://ogirardot.writizzy.com/p/try-apache-sparks-shell-using-docker</link>
            <guid>https://ogirardot.writizzy.com/p/try-apache-sparks-shell-using-docker</guid>
            <pubDate>Thu, 18 Dec 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Ever wanted to try out [Apache Spark](https://spark.apache.org/ "Apache Spark") without actually having to install anything ? Well if you've got [Docker](https://www.docker.com/ "Docker"), I've got a...]]></description>
            <content:encoded><![CDATA[<p>Ever wanted to try out <a href="https://spark.apache.org/" title="Apache Spark">Apache Spark</a> without actually having to install anything ? Well if you&#39;ve got <a href="https://www.docker.com/" title="Docker">Docker</a>, I&#39;ve got a christmas present for you, a Docker image you can pull to try and run Spark commands in the Spark shell REPL. The image has been pushed to the <a href="https://registry.hub.docker.com/u/ogirardot/spark-docker-shell">Docker Hub here</a> and can be easily pulled using Docker.</p>
<p>So exactly what is this image, and how can I use it ?</p>
<p>Well, all you need is to execute these few commands :
</p>
<pre><code class="language-bash">&gt; docker pull ogirardot/spark-docker-shell
</code></pre>
<p>I&#39;ll try to keep this image up-to-date with future releases of Spark, so if you want to test against a specific version, all you have to do is pull (or directly run) the <a href="https://registry.hub.docker.com/u/ogirardot/spark-docker-shell/tags/manage/">image with the corresponding tag</a> like that :
</p>
<pre><code class="language-bash">&gt; docker pull ogirardot/spark-docker-shell:1.1.1
</code></pre>
<p>And then, once Docker has downloaded the full image, using the run command you will have access to a stand-alone <strong>spark-shell</strong> that will allow you to try and learn Spark&#39;s API in a sandboxed environment. Here&#39;s what a correct launch looks like :</p>
<pre><code>&gt; docker run -t -i ogirardot/spark-docker-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark&#39;s default log4j profile: org/apache/spark/log4j-defaults.properties
14/12/11 20:33:14 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:14 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:14 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:14 INFO Utils: Successfully started service &#39;HTTP class server&#39; on port 50535.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  &#39;_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/11 20:33:18 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:18 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:19 INFO Slf4jLogger: Slf4jLogger started
14/12/11 20:33:19 INFO Remoting: Starting remoting
14/12/11 20:33:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;sparkDriver&#39; on port 43346.
14/12/11 20:33:19 INFO SparkEnv: Registering MapOutputTracker
14/12/11 20:33:19 INFO SparkEnv: Registering BlockManagerMaster
14/12/11 20:33:19 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141211203319-f310
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;Connection manager for block manager&#39; on port 58304.
14/12/11 20:33:19 INFO ConnectionManager: Bound socket to port 58304 with id = ConnectionManagerId(ea9ec670e429,58304)
14/12/11 20:33:19 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/12/11 20:33:19 INFO BlockManagerMaster: Trying to register BlockManager
14/12/11 20:33:19 INFO BlockManagerMasterActor: Registering block manager ea9ec670e429:58304 with 265.4 MB RAM, BlockManagerId(&lt;driver&gt;, ea9ec670e429, 58304, 0)
14/12/11 20:33:19 INFO BlockManagerMaster: Registered BlockManager
14/12/11 20:33:19 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4c832cee-7ed5-470d-9e41-d4a36227d48f
14/12/11 20:33:19 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;HTTP file server&#39; on port 55020.
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;SparkUI&#39; on port 4040.
14/12/11 20:33:19 INFO SparkUI: Started SparkUI at http://ea9ec670e429:4040
14/12/11 20:33:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/11 20:33:19 INFO Executor: Using REPL class URI: http://172.17.0.15:50535
14/12/11 20:33:19 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ea9ec670e429:43346/user/HeartbeatReceiver
14/12/11 20:33:19 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala&gt;
</code></pre>
<p>Once you reach this <strong>scala</strong> prompt, you&#39;re practically done, and you can use your available <strong>SparkContext</strong> (variable <strong>sc</strong>) with simple examples :</p>
<pre><code class="language-scala">scala&gt; sc.parallelize(1 until 1000).map(_ * 2).filter(_ &lt; 10).reduce(_ + _)
res0: Int = 20
</code></pre>
<p>If you&#39;ve got this right, you&#39;re all set ! Plus, as this is a Scala prompt, using &lt;tab&gt; you&#39;ll have access to all the auto-completion magic a strong type-system can bring you.</p>
<p>So enjoy, take your time and be bold.</p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
        </item>
        <item>
            <title><![CDATA[Apache Spark : Memory management and Graceful degradation]]></title>
            <link>https://ogirardot.writizzy.com/p/apache-spark-memory-management-and-graceful-degradation</link>
            <guid>https://ogirardot.writizzy.com/p/apache-spark-memory-management-and-graceful-degradation</guid>
            <pubDate>Thu, 11 Dec 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Many of the concepts of Apache Spark are pretty straightforward and easy to understand, however some lucky few can be badly misunderstood. One of the greatest misunderstanding of all is the fact that...]]></description>
            <content:encoded><![CDATA[<p>Many of the concepts of Apache Spark are pretty straightforward and easy to understand; however, a lucky few can be badly misunderstood. One of the greatest misunderstandings of all is that some still believe that &quot;<em>Spark is only relevant with datasets that can fit into memory, otherwise it will crash&quot;</em>.</p>
<p>This is a misunderstanding: Spark is easily pigeonholed as &quot;Hadoop using RAM more efficiently&quot;, but that is still a mistake.</p>
<p>Spark does its best, by default, to load the datasets it handles into memory. Still, when the handled datasets are too large to fit into memory, these objects will automatically (or should I say auto-magically) be spilled to disk. This is one of the main features of Spark, coined by the expression &quot;<strong>graceful degradation</strong>&quot;, and it was very well illustrated by these two charts in <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf" title="An Architecture for Fast and General Data Processing on Large Clusters">Matei Zaharia's dissertation : An Architecture for Fast and General Data Processing on Large Clusters</a> :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2014/11/graceful-degradation-spark.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771032971-graceful-degradation-spark.png" alt="Behaviour of Spark with less/more RAM" /></a></p>
<p><em>Behaviour of Spark with less/more RAM, extracted from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf">http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf</a></em></p>
<p>The first chart clearly shows something interesting for us: the behavior of Spark when you give it more or less RAM is pretty much linear in terms of execution time. In other words, the more RAM Spark can use, the quicker your computation will run, but if you give it less and less RAM, in the end Spark will behave like Hadoop, flushing to disk as much as possible.</p>
<p>The second chart is also interesting for debunking the urban legend of &quot;Spark will only work if your datasets fit in RAM&quot;, showing how Spark handles larger and larger datasets; once again its behavior is practically linear between the time the computation takes and the size of the dataset (for a given computation). In the end, not only can Spark handle large datasets, it will gracefully adapt to the amount of memory you give it.</p>
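<p>If you want to make that behaviour explicit in your own code, one related knob is the storage level you pick when caching. Here is a minimal PySpark sketch (not from the original post; the input path is hypothetical) :</p>
<pre><code class="language-python">from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName=&quot;graceful-degradation-demo&quot;)

# hypothetical large dataset
rdd = sc.textFile(&quot;hdfs:///some/large/dataset&quot;)

# MEMORY_AND_DISK keeps as many partitions in RAM as possible
# and spills the remaining ones to local disk instead of failing
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())
</code></pre>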
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771032971-graceful-degradation-spark.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Apache Spark : l'importance du broadcast]]></title>
            <link>https://ogirardot.writizzy.com/p/apache-spark-limportance-du-broadcast</link>
            <guid>https://ogirardot.writizzy.com/p/apache-spark-limportance-du-broadcast</guid>
            <pubDate>Thu, 27 Nov 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[> [Apache Spark](https://spark.apache.org "Apache Spark") est un moteur de calcul distribué visant à remplacer et fournir des APIs de plus haut niveau pour résoudre simplement des problèmes où Hadoop...]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><a href="https://spark.apache.org" title="Apache Spark">Apache Spark</a> is a distributed computation engine that aims to replace Hadoop by providing higher-level APIs to simply solve problems where Hadoop shows its limitations and its complexity.</p>
<p>This post is part of a series of posts on Apache Spark that digs deeper into some notions of the system, from development and optimisation all the way to deployment.</p>
</blockquote>
<p>One of Spark&#39;s main advantages is how well it integrates with the Scala/Java or Python ecosystem. This is even more true in Scala, because the main methods attached to the Spark contexts have the same shape as their Scala counterparts, with a few improvements (and a distributed context on top), e.g. <strong>map, flatMap, filter...</strong></p>
<p>This advantage comes with the drawback that it is important to know which objects/instances you are manipulating, and in which context - Spark or Scala - these objects will be used. If you doubt it, here is a small example that illustrates it well :</p>
<pre><code class="language-scala">val multiplier = 50
val data = sc.parallelize(1 to 10000)
val result = data
  .map( _ * multiplier)
  .filter( _ &gt; 1000 )
  .collect()
  .map( _ / 2 )
  .filter( _ &lt; (20 * multiplier) )
</code></pre>
<p>If we study this deliberately simplistic example, the first two operations, <strong>map</strong> and <strong>filter</strong>, apply to an <strong><a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD" title="Apache Spark - RDD API">RDD[Int]</a></strong> managed by Spark and will therefore run in a parallelised context. This is no longer the case as soon as <strong>collect()</strong> is called, which brings all the data processed by the <strong>workers</strong> back into the memory of the <strong>Spark driver</strong>. The two remaining calls to <strong>map</strong> and <strong>filter</strong> therefore apply to a <a href="http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List" title="Scala API - List"><strong>List[Int]</strong></a> and thus belong to Scala&#39;s Standard Library.</p>
<p>This example has two important properties: it shows the possible confusion between Scala and Spark calls, but above all it shows, through the <strong>multiplier</strong> coefficient, that it is quite easy to use <strong>Scala values inside a closure sent to Spark.</strong></p>
<p>The serialization of Scala closures to the Spark workers deserves an article of its own and is therefore not the subject of this one, but to understand the problem at hand it is enough to know that <strong>each instance of the closure run by a worker will contain a copy of the value it uses.</strong> So if this value corresponds to a somewhat large piece of data, it quickly becomes inefficient and, above all, dangerous for the memory usage of your workers.</p>
<p>Fortunately, Spark comes with two notions of <strong>shared variables</strong>: Accumulators and <strong>Broadcast</strong> variables, and as you will have guessed, it is the latter that comes to our rescue.</p>
<p>Indeed, instead of having as many copies of the value in the closures as there are calls run on the workers, it is possible to use the <strong>broadcast()</strong> function to share this value as read-only and thus have only one copy per node, managed by the system.</p>
<p>This function, however, is only worthwhile for sharing large data sources across the workers, not for our poor little <strong>multiplier</strong> <strong>Int</strong> from the previous example. Here is how to use it :</p>
<pre><code class="language-scala">val largeKeyValuePair: Map[String, String] = ....
// broadcast this variable for workers to use it efficiently
val bdLarge = sc.broadcast(largeKeyValuePair)
val data = sc.parallelize(1 to 10000)
val result = data
  .map( item =&gt; (item, bdLarge.value.get(item.toString)) )
  ...
</code></pre>
<p>To sum up, <strong>broadcast</strong> is there to send a value only once, when it is large enough to be worth it. Now your question must be: &quot;how large exactly?&quot;</p>
<p>UC Berkeley studied the question in the following publication on the <a href="http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf" title="Broadcast performance for Apache Spark">performance of the different broadcasting algorithms between nodes</a>, and to make a long story short, Spark&#39;s standard broadcasting mechanism, <strong>Centralized HDFS Broadcast (CHB for short)</strong>, gives this kind of performance depending on the payload size :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2014/11/spark-broadcast-performance.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771033309-spark-broadcast-performance.png" alt="spark-broadcast-performance" /></a></p>
<br />

<p>If you want to learn more, I organise regular Spark training sessions with Lateral Thoughts and Hopwork; the schedule is available here : <a href="http://www.lateral-thoughts.com/training" title="Lateral Thoughts - Formations">http://www.lateral-thoughts.com/training</a>.</p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771033309-spark-broadcast-performance.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Dagger and Play 2 Java]]></title>
            <link>https://ogirardot.writizzy.com/p/dagger-and-play-2-java</link>
            <guid>https://ogirardot.writizzy.com/p/dagger-and-play-2-java</guid>
            <pubDate>Mon, 28 Jul 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[I recently got the occasion of trying out Play 2 in Java and i must say the Play 2 Framwork looks actually really good in Java too.

But, of course... there is a but, one of the few things that strike...]]></description>
            <content:encoded><![CDATA[<p>I recently got the chance to try out Play 2 in Java and I must say the Play 2 Framework actually looks really good in Java too.</p>
<p>But, of course... there is a but. One of the first things that strikes you, and I must say with great intensity, is the <em>mandatory</em> static methods that you must put in your <strong>Controllers</strong> in order to define your routes. Example :</p>
<pre><code class="language-java">// in app/controllers/Application.java
package controllers;

import play.mvc.Controller;
import play.mvc.Result;
import service.CoffeeService;
import views.html.index;

public class Application extends Controller {

    public static Result index() {
        return ok(index.render(&quot;Your application is ready.&quot;));
    }
}
</code></pre>
<p>And with the routes defined as such :
</p>
<pre><code># Home page
GET     /       controllers.Application.index()
</code></pre>
<p>This is relatively great... if you like starting off on the wrong foot. I won&#39;t talk about <strong>modularization</strong> or the <strong>danger of spaghetti-code</strong>, neither will I argue that this is not great for testing <strong>controllers</strong> that will use <strong>services</strong> or any other kind of <strong>external dependencies</strong>.</p>
<p>Luckily, the <strong>Play 2 Framework</strong> people thought long and hard when they designed their systems, and while they won&#39;t force you to use any kind of dependency injection system, they&#39;ll allow you to plug in your preferred choice. This is clearly <a href="http://www.playframework.com/documentation/latest/ScalaDependencyInjection" title="Dependency Injection in Scala with Play2 ">documented here</a>, but this is in Scala, and you might think it&#39;s not available for <strong>Play 2 Java</strong> - and you would be wrong.</p>
<p>So here&#39;s a little example on how to do it with a really great project by the teams at <a href="http://squareup.com" title="Square">Square</a> called <a href="http://square.github.io/dagger/">Dagger</a>. Dagger relies on the annotation processing framework of Java to be able to plug itself as an extra step of the compiler and try, as much as possible, to do dependency injection checks (and maybe more) at compile-time. So let&#39;s try to use it in a simple Java app :
</p>
<pre><code class="language-scala">// in build.sbt - we&#39;ll add the dependency
name := &quot;app&quot;

version := &quot;1.0-SNAPSHOT&quot;

libraryDependencies ++= Seq(
  javaJdbc,
  javaEbean,
  cache,
  &quot;com.squareup.dagger&quot; % &quot;dagger&quot; % &quot;1.2.2&quot;,
  &quot;com.squareup.dagger&quot; % &quot;dagger-compiler&quot; % &quot;1.2.2&quot;
)

play.Project.playJavaSettings
</code></pre>
<pre><code class="language-java">// in app/controllers/Application.java - we&#39;ll inject a simple Service via dagger
package controllers;

import play.mvc.Controller;
import play.mvc.Result;
import service.CoffeeService;
import views.html.index;
import javax.inject.Inject;

public class Application extends Controller {

    private CoffeeService coffeeService;

    @Inject
    public Application(CoffeeService service) {
        coffeeService = service;
    }

    public Result index() {
        return ok(index.render(&quot;Your application &quot; + this.toString() + &quot; is ready. &quot; + coffeeService.toString()));
    }
}
</code></pre>
<p>Finally to make it all work we need to change the routes file and override the &quot;Global&quot; configuration class :
</p>
<pre><code class="language-java">// in app/Global.java - we&#39;ll create this class and override the controller instance creation
import dagger.ObjectGraph;
import module.ProductionModule;
import play.Application;
import play.GlobalSettings;

public class Global extends GlobalSettings {

    private ObjectGraph objectGraph;

    @Override
    public void beforeStart(Application app) {
        super.beforeStart(app);
        objectGraph = ObjectGraph.create(new ProductionModule());
    }

    @Override
    public &lt;A&gt; A getControllerInstance(Class&lt;A&gt; controllerClass) throws Exception {
        return objectGraph.get(controllerClass);
    }
}
</code></pre>
<p>and</p>
<pre><code># Home page
GET     /       @controllers.Application.index()
</code></pre>
<p>The <strong>@controllers.Application.index()</strong> tells the whole system that it now has to create a new instance of the Application controller, and it will get the controller&#39;s instance through the overridden method in Global.</p>
<p>The goal of this article was not to teach you how to use Dagger or Play, but rather to show you how the two of them can work together. If you want to see the whole project, it&#39;s available online at <a href="https://github.com/lateralthoughts/dagger-play-di-example">https://github.com/lateralthoughts/dagger-play-di-example</a>. So if you want to know more, clone the project and play with it. Any feedback would be appreciated.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[How to remove scaladoc generation from Play 2.2.x Production dist]]></title>
            <link>https://ogirardot.writizzy.com/p/how-to-remove-scaladoc-generation-from-play-2-2-x-production-dist</link>
            <guid>https://ogirardot.writizzy.com/p/how-to-remove-scaladoc-generation-from-play-2-2-x-production-dist</guid>
            <pubDate>Tue, 17 Jun 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[After a few hours of searching through the Play 2 documentation, the play-framework google group and other blogs or sources, i finally found this piece of code that i decided to share with you. So if,...]]></description>
            <content:encoded><![CDATA[<p>After a few hours of searching through the Play 2 documentation, the play-framework Google group and other blogs and sources, I finally found this piece of code that I decided to share with you. So if, like me, you wanted to remove the Scaladoc generation and packaging from the <a href="https://www.playframework.com/documentation/2.2.x/ProductionDist">ProductionDist</a> that you can create by running the <strong>play dist</strong> command, then today&#39;s your lucky day. If you have a <strong>build.sbt</strong> file (and you should) in your Play2 app, then all you need to do is add <strong>sources in doc in Compile := List()</strong> inside your file, like that :</p>
<pre><code class="language-scala">import play.Project._

name := &quot;my-web-project&quot;

playScalaSettings

sources in doc in Compile := List()

libraryDependencies ++= Seq(...)
</code></pre>
]]></content:encoded>
            <category>oss</category>
        </item>
        <item>
            <title><![CDATA[Timeoff 2014 @ Lateral Thoughts]]></title>
            <link>https://ogirardot.writizzy.com/p/timeoff-2014-lateral-thoughts</link>
            <guid>https://ogirardot.writizzy.com/p/timeoff-2014-lateral-thoughts</guid>
            <pubDate>Mon, 14 Apr 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Une fois n'est pas coutume, je commencerais cet article avec une photo de notre dernier [Timeoff LT](https://plus.google.com/photos/112015376042019217159/albums/5991032831405401409 "Timeoff 2014 @ LT"...]]></description>
            <content:encoded><![CDATA[<p>For once, I&#39;ll start this article with a photo from our latest <a href="https://plus.google.com/photos/112015376042019217159/albums/5991032831405401409" title="Timeoff 2014 @ LT">LT Timeoff</a>.</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg?w=650" alt="Image" /></a></p>
<p>It probably sounds cliché to say it, but every timeoff is different, and this one was no exception. I was much more involved in organising the previous ones (the ones I attended :p ), so this time I let myself be guided... and wasn&#39;t disappointed in the least.</p>
<h2>Together</h2>
<p>What I appreciated most during this timeoff was the fact that we were <strong>all together</strong>, and this time I learned a lot while a real momentum built up around a shared project. We pulled off the difficult alchemy of learning a great deal from each other while also shipping <strong>a project with production-grade code</strong>, <strong>and not just a poorly-mastered prototype</strong> built on untested bits of technology :).</p>
<p>As a bonus, I got to enjoy the company of all the people at <a href="http://www.lateral-thoughts.com" title="LateralThoughts">LT</a>, from Lyon or Paris, whom I rarely get to see day to day, and to work with people I like and respect enormously.</p>
<h2>Deep down</h2>
<p>One thing I&#39;m proud of, and it goes beyond the initial goal of <a href="http://www.lateral-thoughts.com">LT</a>, is that this company doesn&#39;t just take experienced people and give them the firepower to do more and better; it also allows juniors, day after day, to take their lives into their own hands and improve themselves.</p>
<p>That may sound a bit pretentious, and I don&#39;t claim that we invest as much as possible in our juniors - I know companies that, from my point of view, &quot;invest more&quot; in the personal development of their employees - but I have to say that I&#39;m simply impressed, when I pause for a moment, by the path travelled and the maturity reached by the different people who joined us: <a href="http://about.me/fbiville">Florent Biville</a>, <a href="https://twitter.com/Le_3K">Nicolas Rey</a>, <a href="http://vincent.cedeela.fr/">Vincent Doba</a>, <a href="https://twitter.com/StuartCorring">Stuart Corring</a> and <a href="https://twitter.com/jonathan_dray">Jonathan Dray</a>.</p>
<p>Some were less junior than others, and everyone moves at their own pace, but the interesting part is that the model works. LT is not a leveraged company: it is not possible to scale up financially speaking, and clearly not possible to get rich (if that&#39;s the goal you want to set for yourself... <a href="https://xkcd.com/559/"><em>no pun intended</em></a>).</p>
<p>Because deep down, a typical consulting firm (SSII) makes money by growing its employees&#39; skills (or simply their billing rate) faster than their salaries. That is where its growth lies, its <strong>operational efficiency</strong>; it thus compensates for the fact that - for it as for us - its revenue is directly proportional to its headcount.</p>
<p>In the <strong>flat</strong>, <strong>non-hierarchical</strong> and <strong>sociocratic</strong> model we have built, this natural &quot;growth&quot; - at constant headcount - has been sacrificed. So what did we gain in exchange for this capital growth?</p>
<p>A purist would say <strong>nothing except extra risk</strong>; personally I would say <strong>an enormous amount of human capital</strong> - whatever one wants to read into that. I&#39;ll stop my financier&#39;s ramblings here, but just to give you the context of this reflection: one of my goals during the timeoff was to manage to value the company using the usual <a href="http://www.vernimmen.net/">corporate finance</a> methods (<a href="http://en.wikipedia.org/wiki/Discounted_cash_flow">DCF</a>, ratio methods, NPV, etc.). I&#39;ll let you imagine the complexity for a company that is &quot;not really&quot; capital-based...</p>
<p>Finally, another thing I&#39;m proud of, and which will probably help you understand that the system works, is that as Lateral Thoughts <strong>we managed to have the financial backbone needed to organise <a href="http://scala.io">Scala.IO</a></strong>, and the experience was so conclusive that we are doing it again this year for the <strong>2014 edition</strong>.</p>
<h2>Learning</h2>
<p>On the learning side, I had the opportunity to deepen my knowledge of the latest version of Spring with <a href="https://spring.io/blog/2013/12/12/announcing-spring-framework-4-0-ga-release">Spring 4</a> and <a href="http://projects.spring.io/spring-boot/">Spring Boot</a>; there are real productivity improvements in there that are worth the detour. The only difficulty I&#39;ve identified so far concerns the integration with Spring Security, but don&#39;t hesitate to form your own opinion.</p>
<p>I also had the pleasure of sharing my knowledge of <a href="http://www.scala-lang.org/">Scala</a> and <strong><a href="http://www.ansible.com/home">Ansible</a></strong>, and of learning a bit more about Angular best practices. If <strong>Ansible</strong> interests you, my <a href="http://cfp.devoxx.fr/devoxxfr2014/talk/RKR-886/Ansible%20in%20action%20-%20le%20provisionning%20au%20bon%20niveau%20d'abstraction">Tools In Action on Ansible</a> was accepted at <a href="http://www.devoxx.fr">DevoxxFR 2014</a>, so don&#39;t hesitate to come and ask me annoying questions on Wednesday, April 16th! On that note,</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>spring</category>
            <category>timeoff</category>
            <category>finance</category>
            <category>lt</category>
            <category>ssii</category>
            <category>uncategorized</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg?w=650" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Highlighting field in memory-based Lucene indexes]]></title>
            <link>https://ogirardot.writizzy.com/p/highlighting-field-in-memory-based-lucene-indexes</link>
            <guid>https://ogirardot.writizzy.com/p/highlighting-field-in-memory-based-lucene-indexes</guid>
            <pubDate>Mon, 24 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I'm using more and more Lucene these days, and getting in depth on a few subjects, today i'm going to talk to you about how to handle the new Highlighting features available with Lucene 4.1.

One of t...]]></description>
            <content:encoded><![CDATA[<p>I&#39;m using Lucene more and more these days, and getting in depth on a few subjects. Today I&#39;m going to talk to you about how to handle the new Highlighting features available with Lucene 4.1.</p>
<p>One of the main achievements of this new version is the creation of the great <a href="http://lucene.apache.org/core/4_1_0/highlighter/org/apache/lucene/search/postingshighlight/PostingsHighlighter.html">PostingsHighlighter</a>. Michael McCandless wrote a great piece about it in his article <a href="http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html" title="A new Lucene Highlighter is born">A new Lucene highlighter is born</a> and I encourage you to read it if you want to get serious about highlighting using Lucene :).</p>
<p>Now let&#39;s say you want to use it on a <a href="http://lucene.apache.org/core/4_1_0/memory/org/apache/lucene/index/memory/MemoryIndex.html">MemoryIndex</a>; considering the MemoryIndex as the best in-memory index type, handling more than ~500k queries/s and offering the &quot;perfect&quot; <strong>reset()</strong> method, it would be great, right ? But it&#39;s a nice dream, as the MemoryIndex doesn&#39;t store anything about the raw data, so... we need a plan B.</p>
<p>The plan B can be to use the old-fashioned, but still useful, <a href="http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/store/RAMDirectory.html">RAMDirectory</a> index that will still behave like a normal &quot;Directory&quot;-based index and will give you the ability to store the data you need on the field to match. Here is an example on how to use it :
</p>
<pre><code class="language-java">final int MAX_DOCS = 10;
final String FIELD_NAME = &quot;text&quot;;
final Directory index = new RAMDirectory();
final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_41, analyzer);
IndexWriter writer = new IndexWriter(index, writerConfig);

// create document
Document document = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true); // it needs to be stored to be properly highlighted
type.setTokenized(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); // necessary for PostingsHighlighter
document.add(new Field(FIELD_NAME, &quot;this an example of text that must be highlighted&quot;, type));

// add it to the index
writer.addDocument(document);
writer.commit();
writer.close();

Query query = new QueryParser(Version.LUCENE_41, FIELD_NAME, analyzer).parse(&quot;example&quot;);
DirectoryReader directoryReader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(directoryReader);
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, MAX_DOCS);
String[] strings = highlighter.highlight(FIELD_NAME, query, searcher, topDocs);
System.out.println(Arrays.toString(strings));
// expected output : [this an &lt;b&gt;example&lt;/b&gt; of text that must be highlighted]
</code></pre>
<p>I&#39;m honestly considering using <strong>both indexes</strong> right now: querying the MemoryIndex heavily and using the RAMDirectory only when I know there&#39;s a match and I need the highlighting features. Maybe I&#39;m not done digging around this hole and there&#39;s a way to make any highlighter work with the MemoryIndex, but I doubt it, both conceptually and after testing everything I could.</p>
<p>If you think otherwise, and know a way to do so, tell me :)
<em>Vale</em></p>
]]></content:encoded>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[How to test and understand custom analyzers in Lucene]]></title>
            <link>https://ogirardot.writizzy.com/p/how-to-test-and-understand-custom-analyzers-in-lucene</link>
            <guid>https://ogirardot.writizzy.com/p/how-to-test-and-understand-custom-analyzers-in-lucene</guid>
            <pubDate>Thu, 20 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I've began to work more and more with the great "low-level" library [Apache Lucene](https://lucene.apache.org) created by Doug Cutting. For those of you that may not know, Lucene is the indexing and s...]]></description>
            <content:encoded><![CDATA[<p>I&#39;ve begun to work more and more with the great &quot;low-level&quot; library <a href="https://lucene.apache.org">Apache Lucene</a> created by Doug Cutting. For those of you who may not know it, Lucene is the indexing and searching library used by great enterprise search servers like Apache Solr and <a href="http://elasticsearch.org">Elasticsearch</a>.</p>
<p>When you start to index and search data, most of the time you need to create a <em>filtering and cleaning pipeline</em> to transform your raw text data into something more <strong>indexable</strong> and slightly more <strong>standardized</strong>. Such a pipeline may include <strong>lowercasing</strong>, <strong>transforming to ascii</strong> or even <strong>stemming</strong> (transforming &quot;eating&quot; into &quot;eat&quot;). Defining such a pipeline means defining an <strong>Analyzer</strong> in Lucene-world, and while creating a new/custom one is a very easy process, tweaking it to your needs is another thing and needs thorough testing.</p>
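<p>To give you an idea of what such a pipeline looks like, here is a minimal sketch of a custom Analyzer (Lucene 4.1 style) chaining a StandardTokenizer with lowercasing, ASCII folding and Porter stemming; the exact filters you pick are of course up to you :
[code language=&quot;java&quot;]
// minimal sketch of a custom analysis pipeline : tokenize, lowercase, fold to ascii, stem
Analyzer myCustomAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
        TokenStream pipeline = new LowerCaseFilter(Version.LUCENE_41, source);
        pipeline = new ASCIIFoldingFilter(pipeline); // &quot;é&quot; becomes &quot;e&quot;, etc.
        pipeline = new PorterStemFilter(pipeline);   // &quot;eating&quot; becomes &quot;eat&quot;
        return new TokenStreamComponents(source, pipeline);
    }
};
[/code]</p>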
<p>Today&#39;s article is precisely about helping you test your own analyzer, or write a simple test case for Lucene&#39;s built-in analyzers, so you can better understand what they do and why they do it.</p>
<p>Luckily for us, with the latest version, <strong>Apache Lucene 4.1</strong>, we&#39;re not left on our own: Lucene comes with a test framework we can rely on, although it needs a few tricks to work, so here we go :</p>
<p>You need testing, right? So we need to add the dependency <strong>org.apache.lucene:lucene-test-framework</strong> as a maven artifact. But not so fast: the test-framework needs to be declared before <strong>lucene-core</strong>, even though they are in completely different scopes, and you need to use at least maven 2.x because otherwise the classpath order won&#39;t respect the dependency declaration order (what a beautiful world...) :
[code language=&quot;xml&quot;]
&lt;!-- must be before lucene-core for classpath issues --&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
  &lt;artifactId&gt;lucene-test-framework&lt;/artifactId&gt;
  &lt;version&gt;${lucene.version}&lt;/version&gt;
  &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
  &lt;artifactId&gt;lucene-core&lt;/artifactId&gt;
  &lt;version&gt;${lucene.version}&lt;/version&gt;
&lt;/dependency&gt;
[/code]</p>
<p>Now if you want to create a new JUnit test checking the behaviour of an analyzer, you have access to a new base class to extend, called <strong>BaseTokenStreamTestCase</strong>. But the joy of it all is not just being able to write <strong>&quot;public class MyWonderfulTestCase extends BaseTokenStreamTestCase&quot;</strong> and clap your hands: you now have access to a brand new set of assertions (by the way, you need to <strong>enable assertions with the -ea VM argument</strong> to execute these tests) :</p>
<ul>
<li><strong>assertTokenStreamContents</strong>: it allows you to specify the field you&#39;re testing against (otherwise a &quot;dummy&quot; fieldName gets passed to the analyzer) and check the token stream output;</li>
<li><strong>assertAnalyzesTo</strong>: you don&#39;t specify the field on which you&#39;re testing, but it has a simpler syntax.</li>
</ul>
<p>And here is an example of it all in action :
[code language=&quot;java&quot;]
@Test
public void shouldNotAlterKeywordAnalyzed() throws IOException {
    Analyzer myKeywordAnalyzer = new KeywordAnalyzer();
    assertTokenStreamContents(
        myKeywordAnalyzer.tokenStream(&quot;my_keyword_field&quot;, new StringReader(&quot;ISO8859-1 and all that jazz&quot;)),
        new String[] { &quot;ISO8859-1 and all that jazz&quot; });
    assertAnalyzesTo(myKeywordAnalyzer, &quot;ISO8859-1 and all that jazz&quot;,
        new String[] {
            &quot;ISO8859-1 and all that jazz&quot; // a single token output, as expected from the KeywordAnalyzer
        });
}
[/code]
Hope it will help you out making your search engines more reliable :), <em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[Book review : ElasticSearch Server by Rafal Kuc, Marek Rogozinski]]></title>
            <link>https://ogirardot.writizzy.com/p/book-review-elasticsearch-server-by-rafal-kuc-marek-rogozinski</link>
            <guid>https://ogirardot.writizzy.com/p/book-review-elasticsearch-server-by-rafal-kuc-marek-rogozinski</guid>
            <pubDate>Mon, 17 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[[![ElasticSearch Server - book cover](http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243)](http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789...]]></description>
            <content:encoded><![CDATA[<p><a href="http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1849518440&linkCode=as2&tag=rn0a4-20"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243" alt="ElasticSearch Server - book cover" /></a></p>
<p>I don&#39;t usually do a lot of book reviews, mainly because I rarely finish the books I start... But I decided to finish this one, and I wanted to share my views on it. If you look at the reviews of <a href="http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1849518440&linkCode=as2&tag=rn0a4-20">ElasticSearch Server on amazon.com</a>, you will get a first opinion that I can only agree with: <strong>this book is not for you if you&#39;re looking for advanced tips and tweaks for ElasticSearch</strong>.</p>
<p>It&#39;s mainly <strong>for beginners</strong> and it will get you through your first fears when facing this versatile piece of technology, but if I were you I&#39;d only consider this book for learning elasticsearch if you have no prior experience with <a href="http://lucene.apache.org/solr/" title="Apache Solr">Apache Solr or Lucene</a>.</p>
<p>It does a good job introducing <strong>indexes</strong> and <strong>mappings</strong> and the fact that even if elasticsearch shields you from the old &quot;Solr - schema.xml&quot; by defining a default mapping for all newly created indexes and types (belonging to an index), this does not prevent you from needing to re-index all data when you realize the mapping you&#39;re using is not exactly... adequate.</p>
<p>The main part of the book, at least the one I&#39;d recommend, is not the <strong>Cluster Administration</strong> or the <strong>Getting started</strong> part, it&#39;s the <strong>Searching your data</strong> chapter. To me, this chapter is a reference for all the <strong>query types</strong> supported by ElasticSearch and can be very useful when you&#39;re trying to figure out what kind of query you need.</p>
<p>All in all it&#39;s not a bad book, and you can keep it over the long term as a reference for the <strong>query DSL</strong> used by ElasticSearch through the HTTP/JSON API, but if you need something to guide you safely into production, you&#39;re better off experimenting by yourself.</p>
<p>Don&#39;t hesitate to tell me what you think ;-)</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elasticsearch is the way]]></title>
            <link>https://ogirardot.writizzy.com/p/elasticsearch-is-the-way</link>
            <guid>https://ogirardot.writizzy.com/p/elasticsearch-is-the-way</guid>
            <pubDate>Tue, 12 Mar 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[Don't get me wrong, i love [Apache Solr](http://lucene.apache.org/solr/ "Solr"), i think it's a wonderful project and the versions 4.x are definitely something you should check out when building a pro...]]></description>
            <content:encoded><![CDATA[<p>Don&#39;t get me wrong, I love <a href="http://lucene.apache.org/solr/" title="Solr">Apache Solr</a>. I think it&#39;s a wonderful project, and the 4.x versions are definitely something you should check out when building a proper search engine.</p>
<p><strong>But</strong> Elasticsearch, at least for me, is now the way of the future. If you need a few reasons why, read on :</p>
<h2>Out of the box scalability</h2>
<p>SolrCloud is doing a good job of bringing Solr into the Cloud era, because even if Solr supported distributed queries before, sharding had to be done manually...</p>
<p>Elasticsearch scalability is so easy it&#39;s a bit frightening: every time I set up a new elasticsearch &quot;single&quot; server, I deactivate the cluster-search capability as soon as possible, just in case it starts replicating the internet onto my machine ! Sharding/replication is automatic and almost a necessity, because your server (by default) will remind you that you&#39;re a dangerous person for keeping all your data on a single machine, and will stay in a <strong>yellow state</strong> until you start adding some nodes !</p>
<h2>Comprehensive Json-based HTTP search API</h2>
<p>In all honesty the json-based search queries can sometimes become quite complicated and tedious to read, but they are much more powerful than a simple <strong>?q=....</strong> query or the long and complicated list of URL GET parameters you end up using with Solr... So even if there is no proper Chrome extension to create a GET HTTP request with a JSON body (!! add a comment if you find one !!), I still think it&#39;s a blessing to have that kind of query capacity, and it made me rethink elasticsearch&#39;s suitability for complex queries (c.f. the &quot;As complex as Solr&quot; part).</p>
<h2>Rivers...</h2>
<p>Probably one of the best features of Elasticsearch: it&#39;s designed around the fantastic (and true) idea that an Elasticsearch index needs to be fed !</p>
<p><img src="http://funnyasduck.net/wp-content/uploads/2013/02/funny-crazy-cat-feed-me-kill-whole-family-pics.jpg" alt="" /></p>
<p>Just this concept changes everything, because it makes the <strong>&quot;realtime index&quot;</strong> the default type of index. Nowadays what matters most is having an up-to-date search index, and Near-Realtime search is one of the many advantages that make Solr and Elasticsearch the best choices out there.</p>
<h2>Vibrant community and plugins</h2>
<p>Probably the most important part in my opinion: I do think the Solr ecosystem lacks good tools and plugins to leverage more of its power. <a href="https://code.google.com/p/luke/" title="Luke">Luke</a> is a pretty useful tool, but it&#39;s very lucene-centric, apart from the solr-provided tools (which are, I must say, sufficient for a lot of troubleshooting and debugging). I&#39;ve been on Solr 3.x for a long time, and even if all the tools were there, the UI certainly lacked in terms of &quot;sexy&quot;; nowadays Solr 4.x&#39;s UI is certainly more sexy and a pleasure to work with, but it&#39;s still only the work of Lucidworks.</p>
<p>Elasticsearch is brand new, the documentation is sexy, the project is sexy, they built a wonderful plugin system that <strong>uses github directly !! You don&#39;t have to be a fully accredited &quot;Elasticsearch-compliant plugin creator&quot; to publish your project</strong>.</p>
<p>So a lot of people created wonderful plugins that already go beyond what you can use in the Solr/Lucene world, just a quick review :</p>
<ul>
<li><p><a href="https://github.com/karmi/elasticsearch-paramedic" title="Paramedic">Paramedic</a> : a &quot;simple and sexy tool to monitor and inspect elasticsearch clusters&quot;;</p>
</li>
<li><p><a href="https://github.com/mobz/elasticsearch-head" title="Head">Head</a> : &quot;A web front end for an ElasticSearch cluster&quot; with a real-time dashboard;</p>
</li>
<li><p><a href="https://github.com/lukas-vlcek/bigdesk" title="BigDesk">BigDesk</a> : Live charts and statistics for Elasticsearch cluster;</p>
</li>
<li><p>For analysis, you have <a href="https://github.com/polyfractal/elasticsearch-inquisitor" title="Inquisitor">Inquisitor</a> to help you understand and debug your queries in ElasticSearch, and <a href="https://github.com/polyfractal/elasticsearch-segmentspy" title="SegmentSpy">SegmentSpy</a> to watch segments merging and changing in real time.</p>
</li>
</ul>
<p>This is just the state of the art right now, but I can&#39;t imagine it going anywhere but forward.</p>
<h2>As complex as Solr</h2>
<p>Finally, I was prejudiced, because I thought that the goals of Elasticsearch in terms of scalability were clearly ambitious (and deeply needed !), but that this kind of scalability obviously came at a cost, and that therefore there would be fewer features than what Solr offered (e.g. <a href="http://wiki.apache.org/solr/DisMax" title="Dismax">Dismax</a> queries).</p>
<p>But <strong>I was wrong</strong>, as I discovered recently that Dismax queries, fuzzy matching and other goodies allowing many things, from <em>boosted fields at query time</em> to <em>boosted sub-queries</em>, are available and easily accessible through the Elasticsearch API. So the proper section name should not be &quot;As complex as Solr&quot; but <strong>&quot;As versatile as Solr&quot;.</strong></p>
<p>I hope I made my point, and if you&#39;re considering building a BigData-ready search engine right now, make sure to check out <a href="http://elasticsearch.org">Elasticsearch</a> or you&#39;ll be missing out on a great product.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>bigdata</category>
            <category>data</category>
            <enclosure url="http://funnyasduck.net/wp-content/uploads/2013/02/funny-crazy-cat-feed-me-kill-whole-family-pics.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Stay in your place and do as you're told.]]></title>
            <link>https://ogirardot.writizzy.com/p/reste-a-ta-place-et-fais-ce-quon-te-dit</link>
            <guid>https://ogirardot.writizzy.com/p/reste-a-ta-place-et-fais-ce-quon-te-dit</guid>
            <pubDate>Fri, 01 Feb 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I'm not the most seasoned of veterans, and I'm reminded of it often enough to know that I still have *senpais* in more than one domain (not just technical), some of whom I'm lucky enough to work with...]]></description>
            <content:encoded><![CDATA[<p>I&#39;m not the most seasoned of veterans, and I&#39;m reminded of it often enough to know that I still have <em>senpais</em> in more than one domain (not just technical), some of whom I&#39;m lucky enough to work with, even if it&#39;s not always on a day-to-day basis.</p>
<p>Over my years of work, in banking, in an SSII (a French IT consulting firm), and in other companies, there is only one constant I can really make out: in each of these situations, <strong>someone expected something from me</strong>.</p>
<p>You&#39;ll tell me there&#39;s nothing unusual about that: when you hire someone, it&#39;s rarely (although...) for their blatant uselessness. But that&#39;s not where I&#39;m going with this, as you&#39;ll quickly see :</p>
<ul>
<li>At the bank, I was expected to maintain an application for the trading floor, but not to propose innovations or take the time to better understand the business;</li>
<li>At the SSII (working in another bank), I was expected to maintain a capital-markets application, but once again no innovation was possible there and, worse, to this day I still don&#39;t know what even the shadow of my users looked like...;</li>
<li>As an R&amp;D engineer, I was invited to innovate, but only in the required direction (defined from above and unknown to this day... since it changed every month), without being given the time to think or to learn the business better, all of it under a latent, permanent sense of urgency.</li>
</ul>
<p>You may have already figured out where I&#39;m going with this. When someone expected something from me, <strong>they expected only one thing: that I stay in my place and do what my role dictates, and only that.</strong></p>
<p>If you think about it, this precept lets you build a very simple world. I&#39;d call it <strong>the modularization of the company</strong>: everyone has one role and one only, stays in that role, and the most important thing in the company becomes making sure that the little kingdoms controlled by each person&#39;s role never touch. It&#39;s the principle of the specialized factory worker applied to intellectual work.</p>
<p>But the most serious part is that, in today&#39;s world, we have internalized what our parents said far more violently around &#39;68: there is no longer any room in French society for young people. Companies have created neatly segmented roles in which we are allowed to wallow, <strong>but, as a young person, our only current option for growing and moving forward is to change companies</strong>, with the well-known downside :</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg" alt="" /></a></p>
<p>It hasn&#39;t always been this way. In my grandfather&#39;s time, you were carried along by the company you joined: it trusted him, challenged him, helped him improve and build himself up, and eventually let him rise within it. More recently, in our parents&#39; time, baby boom obliging, many hierarchical levels were created, not so much out of necessity (for efficiency), but rather through the <strong>modularization of the company</strong>, and above all to avoid conflict (our parents&#39; generation remains, after all, the one that, without having fought the war, called its own parents Nazis...). It was more or less the beginning of what we would now call &quot;petty bosses&quot;.</p>
<p>Only, this world is dangerous: it destroys young people&#39;s creativity, leaves innovation to an elite without the skills (the famous myth of <em>&quot;If anyone can change things, it&#39;s him/her (well, usually him anyway...)&quot;</em>), and gradually settles us into the role the United States has been giving us for years, that of the <strong>&quot;Old Europe&quot;</strong> or the <strong>Museum Europe</strong> that lives only on its past achievements.</p>
<p>What I love about building a <a href="http://www.lateral-thoughts.com" title="LateralThoughts">NoSSII like LateralThoughts</a> is working every day to break out of this destructive pattern, to put the ability to innovate back into everyone&#39;s hands, to free up time to think and improve, and to carry everyone&#39;s projects forward. We often say that a good idea has no political party; far too often, though, it has a hierarchical level...</p>
<p>And you, do you have a good idea ?</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>techzone</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Sharing PyPi/Maven dependency data]]></title>
            <link>https://ogirardot.writizzy.com/p/sharing-pypimaven-dependency-data</link>
            <guid>https://ogirardot.writizzy.com/p/sharing-pypimaven-dependency-data</guid>
            <pubDate>Thu, 31 Jan 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[As time is always running out, i don't think i'll have the time in a while to work again on the data I collected for the last three articles, [Going offline with Maven](http://ogirardot.wordpress.com/...]]></description>
            <content:encoded><![CDATA[<p>As time is always running out, I don&#39;t think I&#39;ll have the time for a while to work again on the data I collected for the last three articles: <a href="http://ogirardot.wordpress.com/2013/01/14/going-offline-with-maven/" title="Going offline with Maven">Going offline with Maven</a>, <a href="http://ogirardot.wordpress.com/2013/01/11/state-of-the-mavenjava-dependency-graph/" title="State of the maven/java dependency graph">State of the Maven/Java dependency graph</a> and <a href="http://ogirardot.wordpress.com/2013/01/05/state-of-the-pythonpypi-dependency-graph/" title="State of the PyPi/Python dependency graph">State of the PyPi/Python dependency graph</a>.</p>
<p>Since it took me a long time to build these datasets, and even though they were already available in the github project, I want to make them properly public and define their metadata so anyone can reuse them freely. The only licence I&#39;m putting on them is <a href="http://creativecommons.org/licenses/by/2.0/">Creative Commons Attribution</a>, so you&#39;re free to use them, adapt them, publish work based on them, or use them for commercial purposes, as long as you mention me (<em>Olivier Girardot &lt;o.girardot (at) lateral-thoughts.com&gt;</em>) as the author.</p>
<p>The dataset is divided into three files, compressed using LZMA :</p>
<h4><a href="https://github.com/ssaboum/meta-deps/blob/master/mvn-deps.csv.lzma">mvn-deps.csv.lzma</a> and <a href="https://github.com/ssaboum/meta-deps/blob/master/mvn-minimal-deps.csv.lzma" title="mvn-minimal-deps.csv.lzma">mvn-minimal-deps.csv.lzma</a></h4>
<p><strong>mvn-deps</strong> consists of all the Maven artifacts extracted from Maven central repositories, and <strong>mvn-minimal-deps</strong> is the minimal set of dependencies you need for <a href="http://ogirardot.wordpress.com/2013/01/14/going-offline-with-maven/" title="Going offline with Maven">going offline with Maven</a>. Once uncompressed, both files are simple <strong>tab-separated csv documents</strong> with the following columns :</p>
<ul>
<li>artifactId</li>
<li>groupId</li>
<li>version</li>
<li>dependencies: <strong>a base64-encoded json string with the following keys: artifactId, groupId, version</strong>, e.g. {&#39;artifactId&#39;: &#39;log4j&#39;, &#39;groupId&#39;: &#39;log4j&#39;, &#39;version&#39;: &#39;1.0.3&#39;}</li>
</ul>
<h4><a href="https://github.com/ssaboum/meta-deps/blob/master/pypi-deps.csv.lzma" title="pypi-deps.csv.lzma">pypi-deps.csv.lzma</a></h4>
<p><strong>pypi-deps</strong> consists of all the PyPi dependencies. Once again it&#39;s a <strong>tab-separated csv document</strong> with the following columns :</p>
<ul>
<li>name</li>
<li>version</li>
<li>dependencies: <strong>a base64-encoded json string with the following keys: name, version</strong>, e.g. {&#39;name&#39;: &#39;numpy&#39;, &#39;version&#39;: &#39;1.6.2&#39;}</li>
</ul>
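<p>If you just want to peek at the raw data, here is a minimal sketch in Java that reads the uncompressed mvn-deps.csv and decodes its dependencies column (the pypi file can be read the same way with its own columns; it assumes you have already uncompressed the file with an LZMA tool, and it uses javax.xml.bind.DatatypeConverter for the base64 decoding, but any decoder will do) :
[code language=&quot;java&quot;]
// columns : artifactId, groupId, version, dependencies (base64-encoded json)
BufferedReader reader = new BufferedReader(new FileReader(&quot;mvn-deps.csv&quot;));
String line;
while ((line = reader.readLine()) != null) {
    String[] columns = line.split(&quot;\t&quot;);
    if (columns.length &lt; 4) {
        continue; // artifact without any dependencies
    }
    String dependencies = new String(DatatypeConverter.parseBase64Binary(columns[3]), &quot;UTF-8&quot;);
    System.out.println(columns[1] + &quot;:&quot; + columns[0] + &quot;:&quot; + columns[2] + &quot; -&gt; &quot; + dependencies);
}
reader.close();
[/code]</p>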
<p>An example of how to process this file and turn it into a <a href="http://networkx.github.com/" title="Networkx">networkx</a> graph is available in the <a href="https://github.com/ssaboum/meta-deps/blob/master/PyPi%20Metadata.ipynb">github project&#39;s IPython notebook</a>, which you need to download as a raw file to use with IPython.</p>
<p>Following <a href="http://www.hilarymason.com/" title="Hilary Mason's blog">Hilary Mason</a>&#39;s post on <a href="http://www.hilarymason.com/blog/startups-how-to-share-data-with-academics/" title="Sharing data with academics">sharing data with academics</a>, I&#39;d be glad to see some publications use these datasets; if any do, please feel free to comment on this blog post with a link to your remixed work.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>python</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[Going offline with Maven]]></title>
            <link>https://ogirardot.writizzy.com/p/going-offline-with-maven</link>
            <guid>https://ogirardot.writizzy.com/p/going-offline-with-maven</guid>
            <pubDate>Mon, 14 Jan 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[At [Lateral-Thoughts](http://www.lateral-thoughts.com "LT"), we organize at least once a year, what we call a "Timeoff" where we get together in a nice place and hack on what we want. It can be a lear...]]></description>
            <content:encoded><![CDATA[<p>At <a href="http://www.lateral-thoughts.com" title="LT">Lateral-Thoughts</a>, we organize, at least once a year, what we call a &quot;Timeoff&quot;, where we get together in a nice place and hack on whatever we want. It can be a learning period or a <a href="http://startupweekend.org/" title="Startup Weekend">startup weekend</a>-like event where we hack on a product/idea. <a href="http://ogirardot.wordpress.com/2012/09/13/on-devrait-toujours-travailler-comme-ca-hackatonlt/" title="We should always work like that (French)">Last time</a> it was in a nice house in <a href="http://goo.gl/maps/QVqpu" title="Guérande, Loire Atlantique, France">Guérande</a> where we had everything we needed: <strong>internet access</strong>, rooms, tables, lots of space, an indoor swimming pool and a barbecue !</p>
<p>But when you want to find a nice place in France, it&#39;s not always easy to also get good or even decent <strong>internet access</strong>. So, as we&#39;re beginning to plan the next event right now, we asked ourselves: what could we do if there was no internet access ? Is there a way to plan for what we would need, so that we wouldn&#39;t suffer from having no contact with the outside world :). But in a Java/Python environment, where you use Maven and PyPi a lot, when you don&#39;t know what you&#39;ll be working on, the one thing you can&#39;t (and <strong>shouldn&#39;t</strong>) plan is the <strong>libraries/dependencies you&#39;ll need.</strong></p>
<p>So what do we do ? The simplest way is to download all the dependencies you can from a Maven repository, but that seems like the most inefficient way ever, and at more than 30Gb of data per repository, it can take a while...</p>
<p>In the <a href="http://ogirardot.wordpress.com/2013/01/11/state-of-the-mavenjava-dependency-graph/" title="State of the Maven ecosystem">last article</a> I extracted all the libs&#39; metadata and dependency links, so we know what depends on what. In order to be more efficient when creating a copied repository, I decided to use those metadata according to two simple rules :</p>
<ul>
<li><strong>Only keep the latest version of artifacts;</strong></li>
<li><strong>And keep the artifact versions that are needed by other artifacts in their latest versions</strong> (see the sketch just after this list).</li>
</ul>
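<p>Expressed as code, the two rules boil down to something like this minimal sketch (the two maps, the hypothetical loaders and the string-based version ordering are simplifying assumptions for illustration only) :
[code language=&quot;java&quot;]
// hypothetical inputs :
//   versionsByArtifact : &quot;groupId:artifactId&quot; -&gt; its versions, sorted so that last() is the latest
//   dependenciesOf     : &quot;groupId:artifactId:version&quot; -&gt; the artifact versions it depends on
Map&lt;String, TreeSet&lt;String&gt;&gt; versionsByArtifact = loadVersions();   // hypothetical loader
Map&lt;String, Set&lt;String&gt;&gt; dependenciesOf = loadDependencies();       // hypothetical loader

Set&lt;String&gt; keep = new HashSet&lt;String&gt;();
for (Map.Entry&lt;String, TreeSet&lt;String&gt;&gt; entry : versionsByArtifact.entrySet()) {
    String latest = entry.getKey() + &quot;:&quot; + entry.getValue().last();
    keep.add(latest);                                 // rule 1 : only the latest version of each artifact
    Set&lt;String&gt; needed = dependenciesOf.get(latest);
    if (needed != null) {
        keep.addAll(needed);                          // rule 2 : plus whatever those latest versions depend on
    }
}
[/code]</p>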
<p>With those simple rules, we can create a &quot;minimum&quot; repository containing only what we would need to start a new project :). The data I extracted is not perfect, so don&#39;t take my word on it; this is a first draft of a work I (or someone else) may continue. The result is a simpler graph containing only <strong>25 553 nodes and 52 916 edges</strong> (compared to the <strong>186 384 nodes and 1 229 083 edges</strong> of the full repository), one we can almost comprehend :</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-mvn-deps.pdf"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-deps-mvn-light.png?w=640" alt="Light version of full-compact maven dependencies - Click to get pdf" /></a></p>
<p>The full pdf file, almost as good as the svg version (without the 24Mb overhead), is available for download just by clicking on the picture. But if you need the data because, just like us, you may have to go off the grid, the raw csv file is available on <a href="https://github.com/ogirardot/meta-deps/raw/master/mvn-minimal-deps.csv.lzma" title="Maven minimal dependencies">GitHub here</a>. It&#39;s a simple CSV file compressed with LZMA; its columns are <em>groupId, artifactId, version, dependencies</em>, <strong>dependencies</strong> being a base64-encoded json dict. Hoping you&#39;ll enjoy this.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>java</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-deps-mvn-light.png?w=640" length="0" type="image/png"/>
        </item>
    </channel>
</rss>