<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Olivier Girardot's Ramblings</title>
        <link>https://ogirardot.writizzy.com</link>
        <description>The ramblings of a tech builder and startup CTO</description>
        <lastBuildDate>Fri, 10 Apr 2026 11:38:36 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Writizzy</generator>
        <language>en</language>
        <image>
            <title>Olivier Girardot's Ramblings</title>
            <url>https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1771778326886-t3ybutg.png</url>
            <link>https://ogirardot.writizzy.com</link>
        </image>
        <copyright>All rights reserved 2026, Olivier Girardot's Ramblings</copyright>
        <item>
            <title><![CDATA[Reverse engineering now and then]]></title>
            <link>https://ogirardot.writizzy.com/p/reverse-engineering-now-and-then</link>
            <guid>https://ogirardot.writizzy.com/p/reverse-engineering-now-and-then</guid>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Reverse engineering, hacking and cracking have a long tradition of being tedious and labor-intensive - how has that changed with our new AI tools?]]></description>
            <content:encoded><![CDATA[<p>When I was a teenager in the 90s, a friend of mine had a side-gig making cracks and key-generators for games and software. To anyone who grew up with always-on broadband, that sentence probably needs some context.</p>
<h2>A World Without Internet (by Default)</h2>
<p>In Europe in the late 90s, Internet access wasn&#39;t the default mode of any device. You <em>chose</em> to go online — deliberately — by firing up your 56k modem on the family landline, knowing full well that every minute ticked on the phone bill. Mobile phones? They existed, barely. Internet on the phone did not. SMS came later and cost a small fortune per message.</p>
<p>This isolation-by-default created something remarkable: an entire intellectual arms race built on the assumption that software lived offline. Developers built protections knowing there was no server to phone home to. Hackers broke them knowing the same thing. It was an adversarial craft — part art, part sport — played entirely within the confines of a single machine. </p>
<p>Here’s a <a href="https://www.tiktok.com/@pouyasaffari/video/7445784334704905477">small curated sample of the art of the time</a> 😉</p>
<h2>Software distribution before the Internet</h2>
<p>Back then, you discovered new software the way you discovered music: through curation. Tech magazines shipped with a free CD stuffed with freeware (fully free software) and shareware (a taste for free, the full experience once you&#39;d paid for a License Key). </p>
<p><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1772263391561-zol5r50.png" alt="An example of the kind of publication at the time - here a CDROM issue of 1994" /></p>
<p>It was a surprisingly elegant distribution model. As a consumer, you got curated recommendations from a magazine you trusted — far better than the alternative of spending 16 hours downloading a 200 MB file at 3.5 KB/s, only to have the connection drop five hours in.</p>
<p>There were magazines dedicated to software, games, productivity — and yes, hacking. And among the search engines of the era, alongside the generalist Alta Vista, sat its shadowy cousin: <strong><a href="http://astalavista.box.sk">astalavista.box.sk</a></strong>. There was no real concept of a &quot;dark web&quot; at the time. The web was what it was — light and darkness mixed together — and you were expected to watch where you clicked.</p>
<h2>The art of cracks &amp; keygens</h2>
<p>If you wanted to unlock a shareware program, you went to Astalavista and searched for it. You&#39;d find either a <strong>keygen</strong> or a <strong>crack</strong>.</p>
<ul>
<li><strong>Keygens</strong> were the golden ticket. They generated a proper License Key — one that told the software &quot;I&#39;m good, I&#39;m a paying customer.&quot; Since Internet was a luxury, the software never tried to verify that claim against a remote database. You didn&#39;t modify the program at all. You just had a key that fit the lock.</li>
<li><strong>Cracks</strong> were more invasive. They patched the software itself — typically dropping in a modified DLL that monkey-patched (as we&#39;d say today) the code responsible for checking registration. Surgery, not lockpicking.</li>
</ul>
<p>Keygens are mostly gone now. Cracks still exist in spirit — the modding community around games like Skyrim puts staggering effort into projects like the <a href="https://www.nexusmods.com/skyrimspecialedition/mods/266">Unofficial Skyrim Special Edition Patch</a>, which is essentially the same discipline applied with different intent.</p>
<p>But the underlying process was always the same, and to teenage me it looked like pure black magic. You had to <strong>reverse-engineer</strong> the code responsible for the license check — understand the key verification algorithm well enough to generate compliant keys, or understand the program&#39;s architecture well enough to surgically disconnect the licensing module without breaking everything else.</p>
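<p>To make the keygen side concrete, here&#39;s a toy, entirely hypothetical example (not the scheme of any real product): imagine a program that accepts any 10-digit key whose digits sum to a multiple of 7. Once you understand that offline check, writing a keygen is trivial; the hard part, back then, was recovering the check from compiled machine code in the first place.</p>
<pre><code class="language-python">import random

def check_key(key):
    # The kind of offline check a 90s shareware might ship (hypothetical scheme)
    return key.isdigit() and len(key) == 10 and sum(map(int, key)) % 7 == 0

def generate_key():
    # Invert the check: keep drawing random keys until one passes
    while True:
        key = &#39;&#39;.join(random.choices(&#39;0123456789&#39;, k=10))
        if check_key(key):
            return key

print(generate_key())  # prints a &#39;valid&#39; key for our toy scheme
</code></pre>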
<p>In practice, this meant a Windows machine tweaked at the boot level to run a decompiler or disassembler, tracing execution pathways and kernel calls to figure out what was actually happening. It required tremendous skill, deep knowledge, and pure grit.</p>
<p>I wondered recently: what would that process look like today, now that we have AI models like Claude that can supply the &quot;grit&quot; part on demand?</p>
<h2>Let’s try it with today’s tech</h2>
<p>Rather than reverse-engineer an existing proprietary format (let&#39;s keep things legal and self-contained), I created a toy problem. I designed a new binary file format called <strong>MIC</strong> (Multi Image Container) — built to store multiple images in a single binary file with error correction codes and thumbnail support.</p>
<p>The experiment has three steps:</p>
<ol>
<li><strong>Design the spec</strong> — published separately at <a href="https://ogirardot.github.io/mic/">ogirardot.github.io/mic</a></li>
<li><strong>Implement a writer</strong> — a Python prototype at <a href="https://github.com/ogirardot/mic">github.com/ogirardot/mic</a></li>
<li><strong>Ask Claude to reverse-engineer the output with zero context</strong></li>
</ol>
<p>The format is pretty simple, and the base structure layout looks like this:</p>
<p><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/1772268069669-vsbxnta.png" alt="High level description of the MIC file format" /></p>
<p>I packed two of my own photos into a <code>.mic</code> file:</p>
<pre><code class="language-shell">➜  mic git:(main) ✗ python mic.py pack mountains.mic IMG_20260208_145001.jpg PXL_20260228_111319728.MP.jpg 
Wrote mountains.mic (2 images, 730248 bytes)
</code></pre>
<p>Then I handed the resulting binary to Claude with the simplest possible prompt:</p>
<blockquote>
<p><em>Here&#39;s a strange file, can you reverse-engineer it to tell me what is it about and if there are any data inside?</em></p>
</blockquote>
<p>No spec. No hints. No context. I launched Claude Code with <code>CLAUDE_CODE_SIMPLE=1</code> to ensure a completely blank slate, and fed the same prompt to three different models: Haiku 4.5, Sonnet 4.6, and Opus 4.6.</p>
<h2>The Results</h2>
<p>My expectation was that this first prompt would be a long shot. I was wrong.</p>
<p><strong>Sonnet 4.6</strong> went first. After about 4 minutes of autonomously running <code>xxd</code> dumps and writing ad-hoc Python scripts, it produced a full reverse-engineering report: the magic bytes, the header structure, the directory layout, per-image metadata including dimensions, filenames, CRC32 checksums (verified!), and even a description of the actual photo content. It mapped the entire format.</p>
<p>Here&#39;s how all three models performed:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Time</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Haiku 4.5</strong></td>
<td>37 seconds</td>
<td>✅ Full reverse-engineering</td>
</tr>
<tr>
<td><strong>Sonnet 4.6</strong></td>
<td>4 min 25s</td>
<td>✅ Full reverse-engineering (bonus: opened the extracted images)</td>
</tr>
<tr>
<td><strong>Opus 4.6</strong></td>
<td>1 min 16s</td>
<td>✅ Full reverse-engineering</td>
</tr>
</tbody></table>
<p>Yes — <strong>Haiku</strong>, the smallest and cheapest model, cracked it in 37 seconds.</p>
<p>Each model autonomously figured out the magic bytes (<code>MIC!</code>, <code>IMG!</code>, <code>ENDMIC!</code>), the header layout, the directory structure with offsets and sizes, image dimensions, embedded filenames, CRC32 integrity checks, and the raw JPEG payloads. Opus gave the most concise structural mapping. Sonnet went the extra mile and actually rendered the extracted images.</p>
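<p>For a sense of what that kind of autonomous exploration looks like in practice, here&#39;s a minimal sketch, written after the fact, of the sort of ad-hoc script involved. The marker strings come from the results above; everything else about the layout would be discovered iteratively.</p>
<pre><code class="language-python">import zlib

with open(&#39;mountains.mic&#39;, &#39;rb&#39;) as f:
    data = f.read()

# Manual hex dump of the start of the file, xxd-style
print(&#39;first 32 bytes:&#39;, data[:32].hex(&#39; &#39;))

# Locate every occurrence of the suspected magic markers
for marker in (b&#39;MIC!&#39;, b&#39;IMG!&#39;, b&#39;ENDMIC!&#39;):
    positions, start = [], 0
    while (idx := data.find(marker, start)) != -1:
        positions.append(idx)
        start = idx + 1
    print(marker, &#39;found at offsets&#39;, positions)

# JPEG payloads start with FF D8 FF: finding them confirms embedded images
jpeg_starts = [i for i in range(len(data) - 2) if data[i:i + 3] == b&#39;\xff\xd8\xff&#39;]
print(&#39;candidate JPEG payloads at offsets&#39;, jpeg_starts[:5])

# CRC32 of an extracted payload can then be compared against the stored checksums
print(&#39;CRC32 of the whole file:&#39;, hex(zlib.crc32(data)))
</code></pre>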
<h2>What Does This Mean?</h2>
<p>This was admittedly a simple format — no encryption, no compression beyond what JPEG provides, clear magic bytes as handholds. A skilled reverse engineer would have cracked it with a hex editor in minutes.</p>
<p>But the interesting part isn&#39;t the difficulty of the problem. It&#39;s the <em>nature</em> of the process.</p>
<p>What used to require a human sitting in front of a disassembler for hours — loading hex dumps into working memory, forming hypotheses about byte sequences, writing test scripts, iterating — is now something an AI can do autonomously. The model writes its own exploration tools, tests its own hypotheses, and converges on a structural understanding through the same iterative loop a human would use. It just does it faster, and it never loses focus.</p>
<p>The ability of modern models to generate on-the-fly Python debugging code, interpret binary patterns, form and revise hypotheses about data structures — that&#39;s a genuine capability shift. It doesn&#39;t replace human intuition on the hardest problems. But it compresses the iteration cycle dramatically.</p>
<p>My friend from the 90s spent weeks learning the tools and techniques before he could crack his first shareware. Today, the activation energy for that same intellectual exercise is a single prompt.</p>
<p>It&#39;s a brave new world. If something was designed using any kind of logical structure, it can now be understood at a speed we&#39;ve never seen before.</p>
]]></content:encoded>
            <category>hacking</category>
            <category>ai</category>
        </item>
        <item>
            <title><![CDATA[Good software knows when to stop]]></title>
            <link>https://ogirardot.writizzy.com/p/good-software-knows-when-to-stop</link>
            <guid>https://ogirardot.writizzy.com/p/good-software-knows-when-to-stop</guid>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Good software knows what problem it solves and what needs to be tackled by another tool]]></description>
            <content:encoded><![CDATA[<p>It’s 9 AM, you’re ready to upgrade your favorite Linux distribution and packages to their latest versions. The process goes smoothly and, after a reboot, your machine is up to date. You go about your day as usual, and then, when trying to list the contents of a directory on your machine, something strange happens: the routinely boring behavior you’re used to from <code>ls</code> surprises you, and not for the best:</p>
<pre><code class="language-javascript">$ ls

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  NOTICE: The legacy utility `ls` has evolved.                        │
│                                                                      │
│         _       _                                                    │
│        / \   __| | ___                                               │
│       / _ \ / _` |/ _ \                                              │
│      / ___ \ (_| |  __/                                              │
│     /_/   \_\__,_|\___|                                              │
│                                                                      │
│              AI-Powered Directory Intelligence™                      │
│                                                                      │
│  Hello.                                                              │
│                                                                      │
│  The classic `ls` command has reached the end of its lifecycle.      │
│  For decades it faithfully listed files.                             │
│  But listing is no longer enough.                                    │
│                                                                      │
│  The filesystem deserves to be *understood*.                         │
│                                                                      │
│  Introducing:                                                        │
│                                                                      │
│        █████╗ ██╗     ███████╗                                       │
│       ██╔══██╗██║     ██╔════╝                                       │
│       ███████║██║     ███████╗                                       │
│       ██╔══██║██║     ╚════██║                                       │
│       ██║  ██║███████╗███████║                                       │
│       ╚═╝  ╚═╝╚══════╝╚══════╝                                       │
│                                                                      │
│                       Adaptive Listing System                        │
│                                                                      │
│  `als` doesn&#39;t just show files.                                      │
│  It predicts which ones you meant.                                   │
│  It ranks them.                                                      │
│  It understands you.                                                 │
│                                                                      │
│  Your current `ls` binary will remain functional for:                │
│                                                                      │
│                        30 days                                       │
│                                                                      │
│  After this period:                                                  │
│      • `ls` will be deprecated                                       │
│      • updates will cease                                            │
│      • directory awareness will be disabled                          │
│                                                                      │
│  You can begin your transition today:                                │
│                                                                      │
│      $ als --trial                                                   │
│                                                                      │
│  (30-day free evaluation period)                                     │
│                                                                      │
│  Thank you for participating in the future of file awareness.        │
│                                                                      │
│                         — The `ls` Team                              │
│                           (now part of ALS)                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
</code></pre>
<p>Fortunately, this does not happen… Good software knows the purpose it serves: it does not try to do everything, it knows when to stop and what to improve.</p>
<p>One of the most counterintuitive skills, for the maximalist human psyche we have, is to know the role and place your software fits in, and to decide whether what you want to do next belongs to what we nowadays call the “product vision”, or whether it is just another project, another tool.</p>
<p>For the oldest amongst us, this kind of lesson came from 37Signals, the founders of Basecamp (the project management tool), through their books <a href="https://basecamp.com/books#rework">Rework</a> and <a href="https://basecamp.com/books#gettingreal">Getting Real</a> - two books I’d recommend, especially Getting Real for product design, whose lessons I could sum up as:</p>
<ul>
<li><strong>Constraints are advantages</strong> — small teams, tight budgets, and limited scope force better decisions</li>
<li><strong>Ignore feature requests</strong> — don&#39;t build what users ask for; understand the underlying problem instead</li>
<li><strong>Ship early, ship often</strong> — a half-product that&#39;s real beats a perfect product that&#39;s vaporware</li>
<li><strong>Epicenter design</strong> — start with the core interface/interaction, not the edges (nav, footer, etc.)</li>
<li><strong>Say no by default</strong> — every feature has a hidden cost: complexity, maintenance, edge cases</li>
<li><strong>Scratch your own itch</strong> — build something you yourself need; you&#39;ll make better decisions</li>
</ul>
<p>At a time when Minio becomes AIStor and even Oracle Database becomes the <a href="https://www.oracle.com/database/">Oracle AI Database</a>, I think a little reminder is in order: not everything has to change drastically, and being the de facto standard for a given problem has more value than branding yourself as the new hot thing no-one expected.</p>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[For a new golden age of FOSS]]></title>
            <link>https://ogirardot.writizzy.com/p/for-a-new-golden-age-of-foss</link>
            <guid>https://ogirardot.writizzy.com/p/for-a-new-golden-age-of-foss</guid>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Arguing that the current Generative AI trend is a chance to disrupt existing stale ecosystems with free software for the common good]]></description>
            <content:encoded><![CDATA[<p>The generative AI trend is arguably one of the biggest paradigm shifts of the last decades for the tech industry —even though the field has seen its share of upheaval recently: the <a href="https://loup-vaillant.fr/articles/deaths-of-oop">disappearance of OOP</a> as the default mental model, the rise of distributed systems and big data, the emergence of low-level languages (Rust/Zig) making a quiet comeback, and even no-code frameworks promising to abstract the programmer away entirely.</p>
<p>This latest shift has split opinion sharply. On one side, a chorus of voices argues that developers are a dying breed, that the craft is being automated away into irrelevance. On the other, an equally loud camp warns that the <a href="https://gist.github.com/richhickey/ea94e3741ff0a4e3af55b9fe6287887f">"vibe-coding" trend is a plague slowly killing projects</a>, demoralising key contributors, and ultimately creating more problems than it solves.</p>
<p>Both camps are, to some extent, missing the point.</p>
<p>I want to argue something different: that the AI productivity wave is a wonderful, and underappreciated, opportunity for Free Software and Open Source. Not despite the disruption, but because of it.</p>
<hr>
<h2>Software Was Already Eating the World</h2>
<p>When Marc Andreessen coined the phrase &quot;software is eating the world&quot; in 2011, it felt like a provocation. Fourteen years later it reads like an understatement.</p>
<p>SaaS was the dominant economic model of the last two decades, and its winners compounded advantages ruthlessly. The clearest illustration is the cloud wars: AWS, GCP, and Azure didn&#39;t win by owning better hardware, they won because they could afford to build, staff, and iterate on software abstractions faster than any competitor. Their moats weren&#39;t physical. They were organisational and financial: the ability to sustain large engineering teams attacking large problems, continuously, for years - it’s arguably what any European competitor lacks right now and OpenStack did not help much...</p>
<p>This concentration of power is not confined to the giants of the datacenter world. The same pattern plays out at every scale, in every corner of the data and infrastructure ecosystem. Storage, ETL pipelines, artifact management, data transformation — each of these markets has its own set of companies who captured a market early, established switching costs, and gradually shifted their energy from innovation to retention.</p>
<p>What made these positions durable wasn&#39;t superior technology. It was the <em>cost of execution</em>. Building a credible alternative to these solutions meant years of engineering, a VC round or two (if the market was the “next big thing” otherwise forget it), a dedicated team, and a long way to reach feature parity before you could even begin the sales conversation. The idea was often not the hard part. Assembling the execution capacity to turn the idea into something real was.</p>
<p>That is rapidly changing.</p>
<hr>
<h2>The Slow Death of FOSS at Scale</h2>
<p>To understand the opportunity ahead, it helps to understand what went wrong, or rather, what was always structurally fragile.</p>
<p>The open source ecosystem has been living with a quiet tension for years. The original promise was simple: share the code, share the burden, share the benefit. In practice, maintaining a widely-used open source project at scale is expensive. It requires sustained engineering investment, infrastructure, community management, and increasingly, legal and security resources. The idealism doesn&#39;t pay the bills and a lot of great projects died because of that (a recent example being <a href="https://scrapoxy.io/">Scrapoxy</a> by <a href="https://www.linkedin.com/in/fabienvauchelles/">Fabien Vauchelles</a>).</p>
<p>The industry&#39;s response has been the <strong>open core model</strong>: release the core as open source, sell the enterprise features, and use the community as a distribution channel. It worked for a while, before the cloud. But it has been fraying, and the fraying is accelerating.</p>
<p><strong>Minio</strong> is the most striking recent example and one of the fastest turnarounds in recent memory. The S3-compatible object storage project that became a cornerstone of countless self-hosted and cloud-native stacks <strong>quietly archived its open source project</strong> to make way for <strong>AIStor</strong>, a proprietary fork repositioned around AI workloads. The community didn&#39;t get a slow pivot. It got a <strong>fait accompli</strong>. The failure mode here is the <em>rug pull</em>: the open source project could be considered, in retrospect, as a customer acquisition funnel. When the market shifted, the funnel got redirected.</p>
<p><strong>HashiCorp&#39;s BSL relicensing</strong> of Terraform followed the same pattern a year earlier, triggering the OpenTofu fork. A reminder that the community can fight back, but only when it moves fast enough.</p>
<p><strong>The dbt and Fivetran story</strong> is a different failure mode: <em>consolidation absorption</em> and it plays out in two acts. In the first, dbt Labs built genuine momentum as an open source data transformation tool, then <a href="https://www.getdbt.com/sdf">acquired SDF Labs</a> to push further into the SQL intelligence space. <a href="https://www.fivetran.com/press/fivetran-and-dbt-labs-unite-to-set-the-standard-for-open-data-infrastructure-2025">Fivetran then acquired dbt Labs</a>, folding an independent ecosystem player into a commercial platform. What was an independent node in the ecosystem became an asset on a balance sheet.</p>
<p>The second act is subtler and more damaging. dbt Labs announced the transition from <strong>dbt Core</strong> (the Apache 2.0-licensed engine) to <strong>dbt Fusion</strong>, a rewritten engine released under the <strong>Elastic License v2 (ELv2)</strong>. ELv2 is not open source by any definition the OSI would recognise: it prohibits offering the software as a hosted service, which is precisely the use case that made dbt Core valuable to the ecosystem. The open source project hasn’t disappeared yet (a release even happened in February 2026), but it’s clear that the bulk of the company’s innovation and investment is going into dbt Fusion. It’s a rug pull with extra steps: slower, deniable, but just as final.</p>
<p>Then there is the quieter, less dramatic failure mode that Sonatype’s Nexus and JFrog’s Artifactory represent: <strong>innovation stall</strong>. No rug pull, no hostile acquisition, just a gradual calcification. These artifact repository tools captured their markets early, established deep enterprise integrations, and then largely stopped innovating in any meaningful sense. Pricing crept up. The UI stagnated. Feature development slowed to a pace dictated by enterprise sales cycles rather than user needs. They didn&#39;t fail — they just became the kind of expensive, slightly-resented infrastructure that teams budget for because replacing them feels too painful to contemplate and the alternatives are either partial and/or cloud-provider based.</p>
<p>Each of these stories has a different surface cause. But underneath, they share the same root: <strong>sustaining FOSS innovation at scale, in a market with well-capitalised incumbents, was too costly relative to the prize</strong>. The economics just didn&#39;t work and people can only compensate for so long.</p>
<hr>
<h2>The Gap Between Idea and Execution</h2>
<p>Here is where the argument turns.</p>
<p>There is a basic principle at work in any market: when the cost of producing something falls, more of it gets produced. This is the supply side of the law of demand, and it is about to reshape the software landscape in ways the AI discourse has largely missed.</p>
<p>For most of software history, the gap between idea and execution was wide enough to be a meaningful filter. Having the right insight about what to build was table stakes. What separated successful projects from abandoned GitHub repositories was execution capacity: the engineering hours, the sustained attention, the infrastructure, the tooling, the documentation. The idea was cheap. Everything else was expensive.</p>
<p>Generative AI is compressing that gap at a rate that is easy to underestimate. Not uniformly, quality still matters enormously, but the cost of turning a well-scoped idea into working software is falling faster than at any point since the commoditisation of cloud compute.</p>
<p>The implications for FOSS are asymmetric, and this is the part that rarely gets said plainly: <strong>the cost reduction benefits free and open source projects disproportionately</strong>.</p>
<p>A proprietary vendor still needs to recoup its engineering investment through revenue. A VC-backed startup still needs to justify its burn multiple. But a FOSS project only needs to cross an <em>activation energy threshold</em>: enough working software to be useful, enough documentation to be approachable, enough momentum to attract contributors. That threshold has always been the hard part. It is now lower.</p>
<p>Think about what it used to take to build a credible challenger to Artifactory. Three years minimum. A funded team. A long crawl to feature parity across package formats. A sales motion to crack enterprise procurement. The idea — &quot;a better, cheaper, open artifact registry&quot; — was never the scarce resource. The execution capacity was. </p>
<p>Now consider what that same project would look like if a single maintainer with strong domain knowledge and AI assistance decided to tackle it and disrupt a space that hasn’t moved in 15 years. You don’t need to imagine it: just take a look at the project <a href="https://artifactkeeper.com/">ArtifactKeeper</a> (45+ package formats, systems and library repositories, distributed proxy support, SSO and Security included, with an MIT License); at the time of writing this article (22nd February 2026) all of these features are included — the project started on the 15th January 2026, ~a month ago 🤯. I’m not judging the quality, I haven’t tested it yet, but at least the ambition is clear and I salute <a href="https://github.com/brandonrc">Brandon Geraci’s</a> motivation.</p>
<p>This is not hypothetical. The signals are already there in projects like this, and a growing number of infrastructure tools being built by very small teams to very high levels of polish. These aren&#39;t flukes. They are early evidence of a structural shift in what a small, motivated team can produce.</p>
<hr>
<h2>FOSS as the Natural Beneficiary</h2>
<p>The backlash against AI-generated pull requests is real and worth taking seriously. The reviewer fatigue, the low-signal noise, the erosion of the human craft at the heart of collaborative software development. These are genuine problems, not just personal anxieties.</p>
<p>But they are, at their core, <strong>governance problems, not productivity problems</strong>. The productivity, to me, is real. The question is who captures it and under what terms.</p>
<p>This is where the proprietary model has a structural disadvantage it rarely acknowledges. A commercial vendor capturing the AI productivity gains does so to protect margins, accelerate roadmaps, and deepen competitive moats. The gains flow to shareholders and, partially, to customers through better products. But as the software itself remains locked, the moat deepens.</p>
<p>A FOSS project capturing the same gains operates under entirely different incentives. The productivity goes into shipping more, faster, under a licence that ensures the code stays free. There is no margin to protect. There is no rug pull option if the core is Apache-2.0 or AGPLv3 from day one. The community retains the right to fork and if the license is GPLv3 it can even start a legacy. The switching costs stay low by design.</p>
<p>This is why the licensing question matters more now than it did five years ago. Projects that embed permissive or copyleft licences from the start are structurally protected against the failure modes we saw with Minio, HashiCorp, and the dbt ecosystem. The rug pull requires the rug. If the licence doesn&#39;t allow it, the option doesn&#39;t exist.</p>
<p>The opportunity is particularly acute in markets like those of Nexus, Artifactory, or Fivetran: <strong>expensive, stale, critical infrastructure with high switching costs and low innovation velocity</strong>. These are markets where the moats were built on execution cost, the cost of building an alternative being too high to justify. That moat is eroding.</p>
<p>A well-designed FOSS alternative in any of these spaces, built by a small team leveraging the current generation of AI tooling, with a clean licence and a genuine community, is a credible threat in a way that simply wasn&#39;t possible three years ago. The incumbents know this, which is partly why the pace of proprietary pivots and acquisitions is accelerating. The window for pre-emptive consolidation is closing.</p>
<hr>
<h2>For a New Golden Age of FOSS</h2>
<p>The first golden age of open source happened because the internet eliminated the cost of distribution. Linux didn&#39;t win by outspending SCO or Sun. It won because the economics of sharing code shifted so dramatically that the proprietary model could no longer justify its own overhead for most use cases. The infrastructure of the modern internet (Apache, MySQL, OpenSSH, Linux itself) was built largely by contributors who couldn&#39;t have done it without that distribution revolution.</p>
<p>We are at an analogous inflection point, but on the production side. <strong>The cost of <em>writing</em> software is falling</strong> the way the cost of <em>distributing</em> software fell in the nineties. The implications are the same: the competitive advantage that large engineering organisations held through execution capacity is being democratised.</p>
<p>That doesn&#39;t mean expertise stops mattering. It doesn&#39;t mean quality is free. It doesn&#39;t mean every ill-conceived FOSS project is suddenly viable. The governance challenges around AI-assisted contributions are real and will require new norms: better review tooling, clearer contribution standards, more explicit signal-to-noise filtering.</p>
<p>But it does mean that the class of problems that were previously &quot;too expensive to FOSS&quot; is shrinking. The stale, expensive, proprietary dominant players in tech are facing a structural shift in the cost curve of their competition. For the first time in a while, the economics of building a genuinely free alternative are on the right side of viable at least for the bootstrapping.</p>
<p>The AI wave is not the enemy of free software. Handled well — with good licences, healthy governance, and the willingness to build — it might be its best chance in a generation to actually take back control.</p>
]]></content:encoded>
            <category>tech</category>
            <category>foss</category>
        </item>
        <item>
            <title><![CDATA[Object oriented programming deemed irrelevant]]></title>
            <link>https://ogirardot.writizzy.com/p/object-oriented-programming-deemed-irrelevant</link>
            <guid>https://ogirardot.writizzy.com/p/object-oriented-programming-deemed-irrelevant</guid>
            <pubDate>Thu, 20 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[I've been coding since 2006, during this time I've seen multiple trends \& technologies emerge, rise and fall - nowadays the elephant in the room is the bad press around OOP languages the likes of Jav...]]></description>
            <content:encoded><![CDATA[<p>I&#39;ve been coding since 2006, during this time I&#39;ve seen multiple trends &amp; technologies emerge, rise and fall - nowadays the elephant in the room is the bad press around OOP languages the likes of Java, C#, C++.</p>
<p>Our profession is no stranger to this kind of feud and debate; for example, at the start of my career I learned, and was told to quickly forget, Remote Procedure Call technologies like <a href="https://fr.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture">CORBA</a> and SOAP, with the bulk of what we called Web Services at the time (spoiler: it came back with gRPC, useful things tend to come back).</p>
<p>I was only told at the time that my job as a software engineer was going to be irrelevant soon enough because of MDA - <a href="https://en.wikipedia.org/wiki/Model-driven_architecture">Model Driven Architecture</a> - and if I wanted to really build things my goal should be to harness all the UML/Merise diagram types perfectly and then feed them all to <a href="https://projects.eclipse.org/projects/modeling.emf.emf">Eclipse EMF</a> (still alive btw) for it to generate the code (like a good engineer should because <em>really doing things</em> is kinda dirty anyway).</p>
<h2>OOP and programming languages</h2>
<p>One thing that was a given at the time was the clear win of the Object Oriented programming languages for &quot;serious work&quot; - C was already considered too low level - so the clear go-to languages built from the ground up with OOP in mind were Java, C# and C++.</p>
<p>All the other languages wanted in on the action and added the concept of Classes afterwards, some in a clunky, limited way like PHP and some with more attention to detail like Python.</p>
<h2>Fast forward to now</h2>
<p>Nowadays OOP is the bad guy, the one responsible for all the evils in this world (along with Waterfall, Agile, Web Services and Design Patterns). To be clear, it is denigrated in the tech news and, by general consensus in the ecosystem, considered <a href="https://medium.com/@jacobfriedman/object-oriented-programming-is-an-expensive-disaster-which-must-end-2cbf3ea4f89d">too expensive</a>, <a href="https://news.ycombinator.com/item?id=18526490">too bloated</a> (special mention for the epic <a href="https://dpc.pw/posts/the-faster-you-unlearn-oop-the-better-for-you-and-your-software">The faster you unlearn OOP, the better for you and your software</a>) and a waste of precious time that creates more problems than it solves, especially with the challenges we face nowadays (efficient multicore usage, async/data-intensive applications, deep integration with Machine Learning and distributed systems, to mention a few).</p>
<p>Ok, it looks like a grim picture. Let&#39;s take a step back and look at what people actually do: if we look at the <a href="https://survey.stackoverflow.co/2023/">Stack Overflow developer survey of 2023</a>, it checks out, the most popular and widely used programming languages today are not object oriented from the ground up - they are all scripting languages (except SQL):
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image-1.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036280-image-1.png" alt="Most popular technologies - Stack Overflow developer survey 2023" /></a></p>
<p>The same conclusion can be drawn if we take a look at the <a href="https://survey.stackoverflow.co/2024/technology#most-popular-technologies-language">Stack Overflow developer survey of 2024</a>:
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036870-image.png" alt="Most popular technologies - Stack Overflow developer survey 2024" /></a></p>
<p>However, if we take a look at the respondents&#39; years of experience, we can see some bias in the dataset: the bulk of respondents have &lt;10 years of experience on the job, while according to <a href="https://datausa.io/profile/soc/software-developers?employment-measures=workforceEOT">DataUSA</a> the average age in the industry, as of 2022, is <strong>39 years old</strong>:
<a href="https://ogirardot.wordpress.com/wp-content/uploads/2025/02/image-2.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771037640-image-2.png" alt="Respondents by years of professional experience - Stack Overflow survey" /></a></p>
<br />

<p>And it&#39;s easy to see this self-fulfilling prophecy in action, with the majority of bootcamps and influencers supposed to help you get a 6-figure job in tech in 4 days telling you that learning JavaScript is the best way ever to become a FullStack Engineer <em>because you can use it on both the frontend and the backend</em> (🤯 !) and Python the best way to get into DataScience <em>(hard to argue about this one...)</em>.</p>
<p>So yes, the lingua franca of new developers is now closer, in terms of paradigm, to a (mostly) dynamically typed, imperative style of programming, and it seems that, as a whole, experienced developers either stayed loyal (<em>PyCon or Devoxx conferences still see ~5000 participants per day each year</em> <em>and JavaOne (rebranded DevNexus) in the USA sees ~10k participants per day</em>) or moved <em>laterally</em>:</p>
<ul>
<li>some experienced developers in OOP languages have moved on to functional programming languages to overcome some of the trauma they faced</li>
<li>others moved on from strict Java / C# etc... to Kotlin/Scala or other more modern forms of the language while Java integrated some of these features to stay relevant and dominant (Streams, lambda, default implementation etc...)</li>
</ul>
<p>Finally the emergence of a new brand of lower level languages like Go and Rust means that even some of the newcomers had additional options to shield themselves from the &quot;enterprise languages&quot;.</p>
<h2>Where to go from there</h2>
<p>There now seems to be a schism between older generations of programmers and newer generations, the latter disregarding for the most part all the teachings (the bad and the good) that object oriented programming brought to the table.</p>
<p>Now let&#39;s be frank: none of the concepts that OOP pushed for are special to these languages, especially in later years:</p>
<ul>
<li>the simple fact of defining Abstractions (<em>not too much, not too little</em> ) and following the <a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle">Dependency Inversion Principle</a></li>
<li>the <a href="https://en.wikipedia.org/wiki/Single-responsibility_principle">single responsibility principle</a></li>
<li>the <a href="https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)">encapsulation</a> habit</li>
<li>the <a href="https://en.wikipedia.org/wiki/Composition_over_inheritance">composition over inheritance</a> principle</li>
</ul>
<p>Or as we now say broadly following the <a href="https://en.wikipedia.org/wiki/SOLID">SOLID</a> principles, none of these concepts are things that you can only do with classes, inheritance or a stubbornly opinionated Object Oriented programming language.</p>
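<p>To illustrate, here&#39;s a small, hypothetical Python sketch: the dependency inversion and composition parts need nothing more than a function signature to depend on, no class hierarchy required.</p>
<pre><code class="language-python">from typing import Callable, Iterable

# The &quot;abstraction&quot; high-level code depends on: any callable returning rows
# (dependency inversion without an abstract base class).
FetchRows = Callable[[], Iterable[dict]]

def make_report(fetch_rows: FetchRows) -&gt; str:
    # High-level policy: format whatever rows the injected fetcher returns
    return &#39;\n&#39;.join(f&quot;{row[&#39;name&#39;]}: {row[&#39;total&#39;]}&quot; for row in fetch_rows())

# A low-level detail, composed in rather than inherited from
def fetch_from_memory():
    return [{&#39;name&#39;: &#39;alice&#39;, &#39;total&#39;: 3}, {&#39;name&#39;: &#39;bob&#39;, &#39;total&#39;: 5}]

print(make_report(fetch_from_memory))
</code></pre>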
<p>As a side note, most of the time it&#39;s easier to follow the spirit of these principles in a functional programming language, but functional languages are notoriously absent from the list of popular programming languages; the closest we&#39;d get is that some of these languages have &quot;functional programming features&quot; like first-class functions, map, filter, and that&#39;s it.</p>
<p>OOP has <a href="https://loup-vaillant.fr/articles/deaths-of-oop">died many times in the past</a> it has however survived until today but in other forms. We still call this by continuity OOP each time mostly because OOP has always been very loosely defined - the projects we build today using OOP languages and frameworks do not use as many abstractions, layers of indirections, overrides or even overloads than when the hype was at its peak, and it&#39;s a good thing! For simplicity is always a good thing!</p>
<p>This lack of definition is even clearer if we go back to Alan Kay, the creator of Smalltalk, who coined the term &quot;Object Oriented Programming&quot; and meant the following by it:</p>
<blockquote>
<p><em><strong>&quot;OOP to me means only messaging, local retention and protection and hiding of state-process, and extreme late-binding of all things.&quot;</strong></em></p>
</blockquote>
<p>None of the current leaders in OOP are message-passing oriented (sadly), yet we consider them object oriented.</p>
<p>I do not care that much about the survival of OOP, but I do see the value in its core teachings, in the separation of concerns it brought us, and in the efficient tooling and compilers that have been developed and refined over the last 30 years. We, as a profession, are not doomed to repeat the cycle of hype, fame, banishment and rewrite that I&#39;ve already experienced multiple times in my short career.</p>
<p>We should encourage all software engineers to strive for knowledge, learn, and develop critical thinking rather than abandon all rational behavior and consider only the hype and prejudice of our times - in the end, even &quot;old&quot; programming languages and paradigms can be <a href="https://medium.com/nerd-for-tech/is-oop-relevant-today-3b3fdc2d1ab2#:~:text=Wrapping%20Up-,Is%20OOP%20still%20an%20effective%20software%20development%20tool%20or%20is,and%20communications%20models%20are%20crucial.">relevant</a> today for the objective we all share: to stay sane in a convoluted codebase.</p>
<br />

]]></content:encoded>
            <category>oss</category>
            <category>python</category>
            <category>techzone</category>
            <category>java</category>
            <category>dev</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771036280-image-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[From Pandas to Apache Spark's Dataframe]]></title>
            <link>https://ogirardot.writizzy.com/p/from-pandas-to-apache-sparks-dataframe</link>
            <guid>https://ogirardot.writizzy.com/p/from-pandas-to-apache-sparks-dataframe</guid>
            <pubDate>Fri, 31 Jul 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[With the introduction in Spark 1.4 of Window operations, you can finally port pretty much any relevant piece of Pandas' Dataframe computation to Apache Spark parallel computation framework using Spark...]]></description>
            <content:encoded><![CDATA[<p>With the introduction in Spark 1.4 of Window operations, you can finally port pretty much any relevant piece of Pandas&#39; Dataframe computation to Apache Spark&#39;s parallel computation framework using Spark SQL&#39;s Dataframe. If you&#39;re not yet familiar with Spark&#39;s Dataframe, don&#39;t hesitate to check out my last article <a href="https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/">RDDs are the new bytecode of Apache Spark</a> and come back here after :p.</p>
<p>I figured some feedback on how to port existing &quot;complex&quot; code might be useful, so the goal of this article will be to take a few concepts from Pandas&#39; Dataframe and see how we can translate them to PySpark&#39;s Dataframe using Spark 1.4.</p>
<p><strong>Disclaimer</strong>: a few operations that you can do in Pandas don&#39;t make any sense using Spark. Please remember that Dataframes in Spark are like RDDs in the sense that they&#39;re an immutable data structure. Therefore things like:</p>
<pre><code class="language-python">df[&#39;three&#39;] = df[&#39;one&#39;] * df[&#39;two&#39;] # to create a new column &quot;three&quot;
</code></pre>
<p>Can&#39;t exist, just because this kind of assignment goes against the principles of Spark. Another example would be trying to access a single element within a Dataframe by index. Don&#39;t forget that you&#39;re using a distributed data structure, not an in-memory random-access data structure. To be clear, this doesn&#39;t mean that you can&#39;t do the same kind of thing (i.e. create a new column) using Spark, it means that you have to think immutable/distributed and re-write parts of your code, mostly the parts that are not purely thought of as transformations on a stream of data. So let&#39;s dive in.</p>
<h2>Column selection</h2>
<p>This part is not that much different in Pandas and Spark, but you have to take into account the immutable character of your dataframe. First let&#39;s create two dataframes, one in Pandas <strong>pdf</strong> and one in Spark <strong>df</strong>:</p>
<pre><code class="language-python"># Pandas =&gt; pdf  
pdf = pd.DataFrame.from_items([(&#39;A&#39;, [1, 2, 3]), (&#39;B&#39;, [4, 5, 6])])

In [18]: pdf.A  
Out[18]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64

# SPARK SQL =&gt; df  
In [19]: df = sqlCtx.createDataFrame([(1, 4), (2, 5), (3, 6)], [&quot;A&quot;, &quot;B&quot;])

In [20]: df  
Out[20]: DataFrame[A: bigint, B: bigint]

In [21]: df.show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
|2|5|  
|3|6|  
+-+-+  
</code></pre>
<p>Now in Spark SQL or Pandas you use the same syntax to refer to a column:</p>
<pre><code class="language-python">In [27]: df.A  
Out[27]: Column&lt;A&gt;

In [28]: df[&#39;A&#39;]  
Out[28]: Column&lt;A&gt;

In [29]: pdf.A  
Out[29]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64

In [30]: pdf[&#39;A&#39;]  
Out[30]:  
0 1  
1 2  
2 3  
Name: A, dtype: int64  
</code></pre>
<p>The output seems different, but these are still the same ways of referencing a column in Pandas or Spark; the only difference is that in Pandas it is a mutable data structure that you can change, while in Spark it is not.</p>
<h2>Column adding</h2>
<pre><code class="language-python">In [31]: pdf[&#39;C&#39;] = 0

In [32]: pdf  
Out[32]:  
A B C  
0 1 4 0  
1 2 5 0  
2 3 6 0

# In Spark SQL you&#39;ll use the withColumn or the select method,  
# but you need to create a &quot;Column&quot;, a simple int won&#39;t do:  
In [33]: df.withColumn(&#39;C&#39;, 0)  
-------------------------  
AttributeError Traceback (most recent call last)  
&lt;ipython-input-33-fd1261f623cf&gt; in &lt;module&gt;()  
--&gt; 1 df.withColumn(&#39;C&#39;, 0)

/Users/ogirardot/Downloads/spark-1.4.0-bin-hadoop2.4/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)  
1196 &quot;&quot;&quot;  
-&gt; 1197 return self.select(&#39;*&#39;, col.alias(colName))  
1198  
1199 @ignore_unicode_prefix

AttributeError: &#39;int&#39; object has no attribute &#39;alias&#39;

# Here&#39;s your new best friend &quot;pyspark.sql.functions.*&quot;  
# If you can&#39;t create it from composing columns,  
# this package contains all the functions you&#39;ll need:  
In [35]: from pyspark.sql import functions as F  
In [36]: df.withColumn(&#39;C&#39;, F.lit(0))  
Out[36]: DataFrame[A: bigint, B: bigint, C: int]

In [37]: df.withColumn(&#39;C&#39;, F.lit(0)).show()  
+-+-+-+  
|A|B|C|  
+-+-+-+  
|1|4|0|  
|2|5|0|  
|3|6|0|  
+-+-+-+  
</code></pre>
<p>Most of the time in Spark SQL you can use Strings to reference columns, but there are a few cases where you&#39;ll want to use the Column objects rather than Strings:</p>
<ul>
<li>In Spark SQL Dataframe columns are allowed to have the same name, they&#39;ll be given unique names inside of Spark SQL, but this means that you can&#39;t reference them with the column name only as this becomes ambiguous.</li>
<li>When you need to manipulate columns using expressions like <strong>&quot;Adding two columns to each other&quot;</strong> , <strong>&quot;Twice the value of this column&quot;</strong> or even <strong>&quot;Is the column value larger than 0 ?&quot;</strong>, you won&#39;t be able to use simple strings and will need the Column reference</li>
<li>Finally if you need renaming, cast or any other complex feature, you&#39;ll need the Column reference too.</li>
</ul>
<p>Here&#39;s an example:</p>
<pre><code class="language-python">In [39]: df.withColumn(&#39;C&#39;, df.A * 2)  
Out[39]: DataFrame[A: bigint, B: bigint, C: bigint]

In [40]: df.withColumn(&#39;C&#39;, df.A * 2).show()  
+-+-+-+  
|A|B|C|  
+-+-+-+  
|1|4|2|  
|2|5|4|  
|3|6|6|  
+-+-+-+

In [41]: df.withColumn(&#39;C&#39;, df.B &gt; 0).show()  
+-+-+--+  
|A|B| C|  
+-+-+--+  
|1|4|true|  
|2|5|true|  
|3|6|true|  
+-+-+--+
</code></pre>
<p>When you’re selecting columns, to create another <em>projected</em> dataframe, you can also use expressions :</p>
<pre><code class="language-python">In [42]: df.select(df.B &gt; 0)  
Out[42]: DataFrame[(B &gt; 0): boolean]

In [43]: df.select(df.B &gt; 0).show()  
+---+  
|(B &gt; 0)|  
+---+  
| true|  
| true|  
| true|  
+---+  
</code></pre>
<p>As you can see, the column name will actually be computed according to the expression you defined; if you want to rename it, you’ll need to use the <strong>alias</strong> method on Column:</p>
<pre><code class="language-python">In [44]: df.select((df.B &gt; 0).alias(&quot;is_positive&quot;)).show()  
+----+  
|is_positive|  
+----+  
| true|  
| true|  
| true|  
+----+  
</code></pre>
<p>All of the expressions that we&#39;re building here can be used for Filtering, Adding a new column or even inside Aggregations, so once you get a general idea of how it works, you&#39;ll be fluent throughout all of the Dataframe manipulation framework.</p>
<h2>Filtering</h2>
<p>Filtering is pretty straightforward too: you can use the RDD-like <strong>filter</strong> method and copy any of your existing Pandas expressions/predicates for filtering:</p>
<pre><code class="language-python">In [48]: pdf[(pdf.B &gt; 0) &amp; (pdf.A &lt; 2)]  
Out[48]:  
A B C  
0 1 4 0

In [49]: df.filter((df.B &gt; 0) &amp; (df.A &lt; 2)).show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
+-+-+

In [55]: df[(df.B &gt; 0) &amp; (df.A &lt; 2)].show()  
+-+-+  
|A|B|  
+-+-+  
|1|4|  
+-+-+ 
</code></pre>
<h2>Aggregations</h2>
<p>What can be confusing at first when using aggregations is that the minute you write <strong>groupBy</strong> you&#39;re not using a Dataframe object, you&#39;re actually using a <strong>GroupedData</strong> object, and you need to specify your aggregations to get back an output Dataframe:</p>
<pre><code class="language-python">In [77]: df.groupBy(&quot;A&quot;)  
Out[77]: &lt;pyspark.sql.group.GroupedData at 0x10dd11d90&gt;

In [78]: df.groupBy(&quot;A&quot;).avg(&quot;B&quot;)  
Out[78]: DataFrame[A: bigint, AVG(B): double]

In [79]: df.groupBy(&quot;A&quot;).avg(&quot;B&quot;).show()  
+-+--+  
|A|AVG(B)|  
+-+--+  
|1| 4.0|  
|2| 5.0|  
|3| 6.0|  
+-+--+  
</code></pre>
<p>As syntactic sugar, if you need only one aggregation, you can use the simplest functions like <strong>avg, count, max, min, mean</strong> and <strong>sum</strong> directly on GroupedData, but most of the time this will be too simple and you&#39;ll want to compute a few aggregations during a single groupBy operation. After all (c.f. <a href="https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/">RDDs are the new bytecode of Apache Spark</a>) this is one of the greatest features of Dataframes. To do so you&#39;ll be using the <strong>agg</strong> method:</p>
<pre><code class="language-python">In [83]: df.groupBy(&quot;A&quot;).agg(F.avg(&quot;B&quot;), F.min(&quot;B&quot;), F.max(&quot;B&quot;)).show()  
+-+--+--+--+  
|A|AVG(B)|MIN(B)|MAX(B)|  
+-+--+--+--+  
|1| 4.0| 4| 4|  
|2| 5.0| 5| 5|  
|3| 6.0| 6| 6|  
+-+--+--+--+  
</code></pre>
<p>Of course, just like before, you can use any expression, especially column compositions, alias definitions, etc., and some other non-trivial functions:</p>
<pre><code class="language-python">In [84]: df.groupBy(&quot;A&quot;).agg(  
....: F.first(&quot;B&quot;).alias(&quot;my first&quot;),  
....: F.last(&quot;B&quot;).alias(&quot;my last&quot;),  
....: F.sum(&quot;B&quot;).alias(&quot;my everything&quot;)  
....: ).show()  
+-+---+---+-----+  
|A|my first|my last|my everything|  
+-+---+---+-----+  
|1| 4| 4| 4|  
|2| 5| 5| 5|  
|3| 6| 6| 6|  
+-+---+---+-----+  
</code></pre>
<h2>Complex operations: Windows</h2>
<p>Now that Spark 1.4 is out, the Dataframe API provides an efficient and easy to use Window-based framework - this single feature is what makes any Pandas to Spark migration actually do-able for 99% of the projects - even considering some of Pandas&#39; features that seemed hard to reproduce in a distributed environment. </p>
<p>A simple example we can pick is that in Pandas you can compute a <strong>diff</strong> on a column, and Pandas will compare the value of each line to the previous one and compute the difference between them. This is typically the kind of feature that is hard to do in a distributed environment, because each line is supposed to be treated independently; now, with Spark 1.4 window operations, you can define a window over which Spark will &quot;<strong>execute some aggregation functions</strong>&quot; relative to a specific line. Here&#39;s how to port some existing Pandas code using diff:</p>
<pre><code class="language-python">In [86]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], [&quot;A&quot;, &quot;B&quot;])

In [95]: pdf = df.toPandas()

In [96]: pdf  
Out[96]:  
A B  
0 1 4  
1 1 5  
2 2 6  
3 2 6  
4 3 0

In [98]: pdf[&#39;diff&#39;] = pdf.B.diff()

In [102]: pdf  
Out[102]:  
A B diff  
0 1 4 NaN  
1 1 5 1  
2 2 6 1  
3 2 6 0  
4 3 0 -6  
</code></pre>
<p>In Pandas you can compute a diff on an arbitrary column, with no regard for keys, no regard for order or anything. It’s cool… but most of the time it’s not exactly what you want, and you might end up cleaning up the mess afterwards by setting the column value back to NaN on the lines where the key changes.</p>
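<p>For reference, the usual Pandas workaround is to compute the diff per key with a groupby; a quick sketch in today&#39;s Pandas, assuming the same <strong>pdf</strong> as above:</p>
<pre><code class="language-python"># Sort by key and value, then diff within each group: NaN marks each key boundary
pdf = pdf.sort_values([&#39;A&#39;, &#39;B&#39;])
pdf[&#39;diff&#39;] = pdf.groupby(&#39;A&#39;)[&#39;B&#39;].diff()
</code></pre>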
<p>Here’s how you can do such a thing in PySpark using Window functions, a Key and, if you want, a specific Order:</p>
<pre><code class="language-python">In [107]: from pyspark.sql.window import Window

In [108]: window_over_A = Window.partitionBy(&quot;A&quot;).orderBy(&quot;B&quot;)

In [109]: df.withColumn(&quot;diff&quot;, F.lead(&quot;B&quot;).over(window_over_A) - df.B).show()  
+---+---+----+  
|  A|  B|diff|  
+---+---+----+  
|  1|  4|   1|  
|  1|  5|null|  
|  2|  6|   0|  
|  2|  6|null|  
|  3|  0|null|  
+---+---+----+
</code></pre>
<p>With that you are now able to compute a diff line by line - ordered or not - given a specific key. The great point about Window operations is that you’re not actually breaking the structure of your data. Let me explain.</p>
<p>When you’re computing some kind of aggregation (once again according to a key), you’ll usually be executing a <strong>groupBy</strong> operation given this key and computing the multiple metrics that you’ll need (if you’re lucky <em>at the same time</em>, if you’re not, in multiple <strong>reduceByKey</strong> or <strong>aggregateByKey</strong> transformations).</p>
<p>But whether you’re using RDDs or Dataframes, if you’re not using window operations then you’ll actually crush your data in a part of your flow and then need to join the results of your aggregations back to the <em>main</em> dataflow. Window operations allow you to execute your computation and copy the results as additional columns without any explicit join.</p>
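<p>To make the difference concrete, here&#39;s a small sketch, in today&#39;s PySpark syntax, of the same kind of per-key enrichment (an average of <strong>B</strong> per <strong>A</strong>) computed both ways, reusing the <strong>df</strong> and <strong>F</strong> from above:</p>
<pre><code class="language-python"># Without a window: aggregate, then join the result back onto the original rows
agg = df.groupBy(&quot;A&quot;).agg(F.avg(&quot;B&quot;).alias(&quot;avg_B&quot;))
enriched_via_join = df.join(agg, on=&quot;A&quot;)

# With a window: the aggregate is copied onto each row directly, no explicit join
from pyspark.sql.window import Window
per_key = Window.partitionBy(&quot;A&quot;)
enriched_via_window = df.withColumn(&quot;avg_B&quot;, F.avg(&quot;B&quot;).over(per_key))
</code></pre>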
<p>This is a quick way to enrich your data adding rolling computations as just another column directly. Two additional resources are worth noting regarding these new features, the official <a href="https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html">Databricks blog article on Window operations</a> and <a href="http://twitter.com/chris_bour">Christophe Bourguignat</a>‘s article evaluating <a href="https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2">Pandas and Spark Dataframe differences</a>.</p>
<p>To sum up, you now have all the tools you need in Spark &gt;= 1.4 to port any Pandas computation to a distributed environment using the <em>very</em> similar Dataframe API.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>python</category>
        </item>
        <item>
            <title><![CDATA[RDDs are the new bytecode of Apache Spark]]></title>
            <link>https://ogirardot.writizzy.com/p/rdds-are-the-new-bytecode-of-apache-spark</link>
            <guid>https://ogirardot.writizzy.com/p/rdds-are-the-new-bytecode-of-apache-spark</guid>
            <pubDate>Fri, 29 May 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[With the Apache Spark 1.3 release the Dataframe API for Spark SQL got introduced, for those of you who missed the big announcements, I'd recommend to read the article : [Introducing Dataframes in Spar...]]></description>
            <content:encoded><![CDATA[<p>With the Apache Spark 1.3 release the Dataframe API for Spark SQL got introduced, for those of you who missed the big announcements, I&#39;d recommend to read the article : <a href="https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html" title="Introducing Dataframes in Spark for Large Scale Data Science">Introducing Dataframes in Spark for Large Scale Data Science</a> from the Databricks blog. Dataframes are very popular among data scientists, personally I&#39;ve mainly been using them with the great Python library <a href="http://pandas.pydata.org" title="Pandas">Pandas</a> but there are many examples in R (originally) and Julia.</p>
<p>Of course, if you&#39;re using only Spark&#39;s core features, nothing seems to have changed with Spark 1.3 : Spark&#39;s main abstraction remains the RDD (Resilient Distributed Dataset), its API is very stable now, and everyone has been using it to handle all kinds of data until now.</p>
<p>But the introduction of Dataframe is actually a big deal, because when RDDs were the only option to load data, it was obvious that you needed to parse your &quot;maybe&quot; un-structured data using RDDs, transform them using case-classes or tuples and then do the special work that you actually needed. Spark SQL is not a new project and you were, of course, able to load your structured-data (like Parquet files) directly from a <a href="https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.sql.SQLContext" title="SQLContext in Apache Spark 1.0">SQLContext</a> before 1.3 - but the advantages were not that clear at the time - except if you wanted to run SQL queries or expose a JDBC-compatible server for other BI tools.</p>
<p>Now the advantages are quite clear and I&#39;ll try to explain them as simply as possible :</p>
<ol>
<li>Dataframes are a higher level of abstraction than RDDs</li>
</ol>
<hr>
<p>If you&#39;re familiar with Pandas syntax, you will feel at home using Spark&#39;s Dataframe and even if you&#39;re not, you&#39;ll learn and - I&#39;d even add - learn to love it. Why ? Because it&#39;s a higher level of programming than the RDD, you can <a href="http://www.domorefasterbook.com/" title="Oops">do more, faster</a> (old joke now ;-) ). Here&#39;s an example from <a href="http://www.pwendell.com/" title="Patrick Wendell">Patrick Wendell</a>&#39;s Strata London 2015 presentation &quot;What&#39;s coming in Spark&quot; of RDDs in Python vs Dataframe :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/rdd-vs-dataframe.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034246-rdd-vs-dataframe.png" alt="RDD vs Dataframe" /></a></p>
<p>Of course the second way of writing it is obviously more concise and more understandable, but I&#39;d like to add something else: <em>tried-and-tested</em> Spark programmers will surely have noticed the <strong>reduceByKey</strong> transformation used here. It is a very common mistake in Spark, for common aggregation tasks, to use the <strong>groupBy</strong> then <strong>mapValues</strong> or <strong>map</strong> transformations, which can be dangerous in a production environment and produce <strong>OutOfMemory</strong> errors on workers.</p>
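<p>As a quick illustration, here is a minimal PySpark sketch of the difference (not from the presentation; the <strong>pairs</strong> RDD is just sample data and an existing SparkContext <strong>sc</strong> is assumed) :</p>
<pre><code class="language-python"># sample (key, value) pairs
pairs = sc.parallelize([(&quot;a&quot;, 1), (&quot;a&quot;, 2), (&quot;b&quot;, 3)])

# risky: groupByKey materialises every value of a key on a single worker
# before aggregating, which can blow up memory for skewed keys
sums = pairs.groupByKey().mapValues(lambda values: sum(values))

# safer: reduceByKey combines values map-side before shuffling them
sums = pairs.reduceByKey(lambda a, b: a + b)
</code></pre>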
<p>Notice that such a mistake <strong>cannot</strong> happen using the Dataframe API below, for you will be expressing your aggregations using, for example, the <strong>agg(...)</strong> method (or even directly the <strong>avg(...)</strong> method like above). This will even allow you to define multiple aggregations at the same time, something that is usually tricky using RDDs :</p>
<pre><code class="language-scala">case class Person(id: Int, first_name: String, last_name: String, age: Double)

// get simple stats on age repartitions by first_name (min, max, avg, count)
val rdd: RDD[Person] = ...

// first you need to only handle the data you really need,
// and cache it because you&#39;ll - sadly - reuse it
val persons = rdd.map(person =&gt; (person.first_name, person.age)).cache()

val minAgeByFirstName = persons.reduceByKey( scala.math.min(_, _) )
val maxAgeByFirstName = persons.reduceByKey( scala.math.max(_, _) )
val avgAgeByFirstName = persons.mapValues(x =&gt; (x, 1))
                               .reduceByKey((x, y) =&gt; (x._1 + y._1, x._2 + y._2)) // simple right ?
val countByFirstName = persons.mapValues(x =&gt; 1).reduceByKey(_ + _)
</code></pre>
<p>Without even considering the complexity of all I had to write to get all my answers - answers that I would need to join back if I want a consistent RDD with all the information I need - the most painful point is that I had to duplicate all these aggregations and therefore <strong>cache</strong> my dataset to mitigate the damage.</p>
<p>Now, using the Dataframe API, I get to leverage out-of-the-box functions and I can even reuse my computations afterwards without having to join anything back :</p>
<pre><code class="language-scala">case class Person(id: Int, first_name: String, last_name: String, age: Double)

// get simple stats on age repartitions by first_name (min, max, avg, count)
val df: DataFrame = ...

val persons = df.groupBy(&quot;first_name&quot;).agg(
  min(&quot;age&quot;).alias(&quot;min_age&quot;),
  max(&quot;age&quot;).alias(&quot;max_age&quot;),
  avg(&quot;age&quot;).alias(&quot;average_age&quot;),
  count(&quot;*&quot;).alias(&quot;number_of_persons&quot;)
)

// let&#39;s add a new column to our schema re-using the two last-computed aggregations :
val finalDf = persons.withColumn(&quot;age_delta&quot;, persons(&quot;max_age&quot;) - persons(&quot;min_age&quot;))
</code></pre>
<p>This is a higher level of programming than RDDs, so some things might be more difficult to express with Dataframes than they were using RDDs, when you could <strong>groupBy(...)</strong> anything and get the <em>List[...]</em> of results as values... But this was a bad practice anyway :).</p>
<ol start="2">
<li>Spark SQL/Catalyst is more intelligent than you</li>
</ol>
<hr>
<p>When you&#39;re using Dataframe, you&#39;re not defining directly a DAG (Directed Acyclic Graph) anymore, you&#39;re actually creating an AST (Abstract Syntax Tree) that the Catalyst engine will parse, check and improve using both Rules-Based Optimisation and Cost-Based Optimisation. This is an excerpt from the Spark SQL paper submitted for SIGMOD 2015 :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/spark-sql-pipeline.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034678-spark-sql-pipeline.png" alt="spark SQL pipeline" /></a></p>
<p>I won&#39;t get into the depths of this here, because that would need more than one full article on its own, but if you want to understand more, the article <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html">Deep dive into Spark SQL's Catalyst optimizer</a> from the Databricks blog (once again) will give you insights into how this works. A simple rule of thumb to remember is that a lot of &quot;pretty logical&quot; generic tree-based rules will be used to check and simplify your parsed Logical Plan, and then a few Physical Plans representing different execution strategies will be computed, with one selected according to its &quot;computation cost&quot;.</p>
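<p>If you want to peek at what Catalyst produces for a given query, a minimal PySpark sketch (not from the original post, with made-up data and an assumed <strong>sqlContext</strong>) is to ask the Dataframe for its plans :</p>
<pre><code class="language-python"># hypothetical Dataframe
df = sqlContext.createDataFrame([(1, &quot;alice&quot;, 34), (2, &quot;bob&quot;, 19)], [&quot;id&quot;, &quot;first_name&quot;, &quot;age&quot;])

# extended=True prints the parsed, analyzed and optimized logical plans
# as well as the physical plan that was finally selected
df.filter(df.age &gt; 21).select(&quot;first_name&quot;).explain(True)
</code></pre>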
<p>The funny thing is that in the end - nothing changes - after all these transformations your Dataframe will get <em>compiled</em> down to RDDs and executed on your Spark Cluster.</p>
<ol start="3">
<li>Python &amp; Scala are now even in terms of performance</li>
</ol>
<hr>
<p>Using the Dataframe API, you&#39;re using a DSL that leverages Spark&#39;s Scala bytecode. When using RDDs, Python lambdas will run in a Python VM and Java/Scala lambdas will run in the JVM; this is great because inside RDDs you can use your usual Python libraries (Numpy, Scipy, etc.) and not some Jython code, but it comes at a performance cost :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/unified-physical-execution.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771035151-unified-physical-execution.png" alt="Unified physical execution" /></a></p>
<p>This is still true if you want to use Dataframe&#39;s User Defined Functions: you can write them in Java/Scala or Python and this will impact your computation performance - but if you manage to stay in a pure Dataframe computation, then nothing will get between you and the best computation performance you can possibly get.</p>
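<p>For instance, here is a minimal PySpark sketch (not from the original post, with made-up data and an assumed <strong>sqlContext</strong>) contrasting a Python UDF with a pure Dataframe expression :</p>
<pre><code class="language-python">from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(1, 25), (2, 30)], [&quot;id&quot;, &quot;age&quot;])

# Python UDF: every row is shipped to a Python worker process, which costs performance
double_age = F.udf(lambda age: age * 2, IntegerType())
df.withColumn(&quot;double_age&quot;, double_age(df.age)).show()

# pure Dataframe expression: stays inside the JVM / Catalyst pipeline
df.withColumn(&quot;double_age&quot;, df.age * 2).show()
</code></pre>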
<ol start="4">
<li>Dataframes are the future for Spark &amp; You</li>
</ol>
<hr>
<p>Spark ML is already a pretty obvious example of this: the Pipeline API is designed entirely around Dataframes as its sole data structure for parallel computations, model training and predictions. And even if you don&#39;t believe me, here&#39;s once again Patrick Wendell&#39;s presentation on what the future of Spark is :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2015/05/future-of-spark.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771035604-future-of-spark.png" alt="Future of Spark" /></a></p>
<p>Anyway, I think I made my point regarding the whole goal of this article : RDDs are the new bytecode of Apache Spark. You might be sad or pissed because you spent a lot of time learning how to harness Spark&#39;s RDDs and now you think Dataframes are a completely new paradigm to learn...</p>
<p>You&#39;re partially right, because if you don&#39;t already know the Pandas or R APIs, Dataframes are a new thing and you&#39;ll need some work to harness them - but remember that in the end, everything comes down to RDDs - so all that you learned before is still relevant, this is just another skill to add to your resume.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <category>data</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771034246-rdd-vs-dataframe.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Changing Spark's default java serialization to Kryo]]></title>
            <link>https://ogirardot.writizzy.com/p/changing-sparks-default-java-serialization-to-kryo</link>
            <guid>https://ogirardot.writizzy.com/p/changing-sparks-default-java-serialization-to-kryo</guid>
            <pubDate>Fri, 09 Jan 2015 00:00:00 GMT</pubDate>
            <description><![CDATA[Apache Spark's default serialization relies on Java with the default *readObject(...)* and *writeObject(...)* methods for all **Serializable**classes. This is a very fine default behavior as long as y...]]></description>
            <content:encoded><![CDATA[<p>Apache Spark&#39;s default serialization relies on Java, with the default <em>readObject(...)</em> and <em>writeObject(...)</em> methods for all <strong>Serializable</strong> classes. This is a very fine default behavior, as long as you don&#39;t rely on it too much...</p>
<p>Why ? Because Java&#39;s serialization framework is notoriously inefficient, consuming too much CPU and RAM and producing payloads too large for it to be a suitable large-scale serialization format.</p>
<p>Ok, but you could tell me that you, as an Apache Spark user, are not using Java&#39;s serialization framework at all. The fact is that Apache Spark, as a system, relies on it a lot :</p>
<ul>
<li>Every task run from Driver to Worker gets serialized : <strong>Closure serialization</strong></li>
<li>Every result from every task gets serialized at some point : <strong>Result serialization</strong></li>
</ul>
<p>And what&#39;s implied is that during all <strong>closure serializations</strong>, all the <strong>values used inside</strong> will get serialized as well. For the record, this is also one of the main reasons to use Broadcast variables when closures might get serialized with big values.</p>
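<p>As an aside, here is a minimal PySpark sketch of that broadcast idea (not from the original post, assuming an existing SparkContext named <strong>sc</strong>) :</p>
<pre><code class="language-python"># a large lookup table captured directly by a closure would be
# re-serialized and shipped with every single task
lookup = {str(i): i * i for i in range(100000)}

# broadcasting it ships it to each executor only once, read-only
bd_lookup = sc.broadcast(lookup)

squares = (sc.parallelize(range(100))
             .map(lambda i: bd_lookup.value[str(i)])
             .collect())
</code></pre>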
<p><a href="https://github.com/EsotericSoftware/kryo" title="Kryo - Java serialization">Kryo</a> is a project like <a href="http://avro.apache.org/">Apache Avro</a> or <a href="https://github.com/google/protobuf/">Google's Protobuf</a> (or its Java-oriented equivalent <a href="https://github.com/protostuff/protostuff">Protostuff</a> - which I have not tested yet). I&#39;m not a big fan of benchmarks but they can be useful, and Kryo designed a few to measure the size and time of serialization. Here&#39;s what such a benchmark looks like at the time of writing (i.e. early 2015) :</p>
<p><img src="https://camo.githubusercontent.com/829809b59ac2efe1ec62ac2f2cfbb29606a02a44/68747470733a2f2f63686172742e676f6f676c65617069732e636f6d2f63686172743f636874743d746f74616c2b2532386e616e6f73253239266368663d637c7c6c677c7c307c7c4646464646467c7c317c7c3736413446427c7c307c62677c7c737c7c454645464546266368733d35303078343330266368643d743a313232362c313439322c313536382c323230302c323436352c323933392c333530312c333635392c333637302c343439352c383531362c31303035372c31303437372c31323138372c31333130392c31353632382c31393938302c32383534382c33363034362c343438333826636864733d302c34393332322e313530333526636878743d79266368786c3d303a7c6a736f6e253246666c65786a736f6e2532466461746162696e647c6a6176612d6275696c742d696e7c6a626f73732d6d61727368616c6c696e672d72697665727c786d6c2532467873747265616d253242637c6a736f6e2532467376656e736f6e2d6461746162696e647c6a626f73732d73657269616c697a6174696f6e7c62736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e253246676f6f676c652d67736f6e2532466461746162696e647c6865737369616e7c786d6c2532466a61636b736f6e2532466461746162696e642d61616c746f7c6a736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e25324670726f746f73747566662d72756e74696d657c736d696c652532466a61636b736f6e2532466461746162696e647c6a736f6e2532466a61636b736f6e25324664622d61667465726275726e65727c736d696c652532466a61636b736f6e25324664622d61667465726275726e65727c6a736f6e253246666173746a736f6e2532466461746162696e647c6d73677061636b2d6461746162696e647c666173742d73657269616c697a6174696f6e7c6b72796f7c70726f746f73747566662663686d3d4e2532302a662a2c3030303030302c302c2d312c3130266c6b6c6b266368646c703d74266368636f3d3636303030307c3636303033337c3636303036367c3636303039397c3636303043437c3636303046467c3636333330307c3636333333337c3636333336367c3636333339397c3636333343437c3636333346467c3636363630307c3636363633337c363636363636266368743d62686726636862683d31302c302c3130266e6f6e73656e73653d6161612e706e67" alt="" /></p>
<p>So how can you change Spark&#39;s default serializer easily? Well, as usual Spark is a pretty configurable system, so all you need is to specify which serializer you want to use when you define your <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext">SparkContext</a> using the <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf">SparkConf</a>, like that :</p>
<pre><code class="language-scala">val conf = new SparkConf()
  .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
</code></pre>
<p>And voilà ! But that&#39;s not all: if you&#39;ve got big objects to serialize and are prepared to <strong>face the consequences</strong>, you might get OutOfMemoryErrors or GC overhead errors that happen very fast using Java&#39;s default serialization (did I tell you it sucks for some reasons... ?) and that won&#39;t get resolved auto-magically by switching to Kryo.</p>
<p>Luckily you can define what buffer size Kryo will use by default :
</p>
<pre><code class="language-scala">val conf = new SparkConf()
  .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
  // Now it&#39;s 24 Mb of buffer by default instead of 0.064 Mb
  .set(&quot;spark.kryoserializer.buffer.mb&quot;, &quot;24&quot;)
</code></pre>
<p>If you&#39;re even bolder you can customize all of these options :</p>
<ul>
<li><strong>spark.kryoserializer.buffer.max.mb</strong>(64 Mb by default) : useful if your default buffer size goes further than 64 Mb;</li>
<li><strong>spark.kryo.referenceTracking</strong> (true by default) : c.f. <a href="https://github.com/EsotericSoftware/kryo#references">reference tracking in Kryo</a></li>
<li><strong>spark.kryo.registrationRequired</strong> (false by default) : Kryo&#39;s parameter to define if all serializable classes must be registered</li>
<li><strong>spark.kryo.classesToRegister</strong> (empty string list by default) : you can add a list of the qualified names of all classes that must be registered (c.f. last parameter)</li>
</ul>
<p>The examples above are defined in Scala, but of course these parameters can be used in Java and Python as well.</p>
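<p>As an illustration, here is a minimal PySpark sketch of the same configuration (not from the original post) :</p>
<pre><code class="language-python">from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName(&quot;kryo-example&quot;)
        .set(&quot;spark.serializer&quot;, &quot;org.apache.spark.serializer.KryoSerializer&quot;)
        # same 24 Mb buffer as in the Scala example above
        .set(&quot;spark.kryoserializer.buffer.mb&quot;, &quot;24&quot;))

sc = SparkContext(conf=conf)
</code></pre>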
<p>Enjoy.</p>
]]></content:encoded>
            <category>java</category>
            <enclosure url="https://camo.githubusercontent.com/829809b59ac2efe1ec62ac2f2cfbb29606a02a44/68747470733a2f2f63686172742e676f6f676c65617069732e636f6d2f63686172743f636874743d746f74616c2b2532386e616e6f73253239266368663d637c7c6c677c7c307c7c4646464646467c7c317c7c3736413446427c7c307c62677c7c737c7c454645464546266368733d35303078343330266368643d743a313232362c313439322c313536382c323230302c323436352c323933392c333530312c333635392c333637302c343439352c383531362c31303035372c31303437372c31323138372c31333130392c31353632382c31393938302c32383534382c33363034362c343438333826636864733d302c34393332322e313530333526636878743d79266368786c3d303a7c6a736f6e253246666c65786a736f6e2532466461746162696e647c6a6176612d6275696c742d696e7c6a626f73732d6d61727368616c6c696e672d72697665727c786d6c2532467873747265616d253242637c6a736f6e2532467376656e736f6e2d6461746162696e647c6a626f73732d73657269616c697a6174696f6e7c62736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e253246676f6f676c652d67736f6e2532466461746162696e647c6865737369616e7c786d6c2532466a61636b736f6e2532466461746162696e642d61616c746f7c6a736f6e2532466a61636b736f6e2532466461746162696e647c6a736f6e25324670726f746f73747566662d72756e74696d657c736d696c652532466a61636b736f6e2532466461746162696e647c6a736f6e2532466a61636b736f6e25324664622d61667465726275726e65727c736d696c652532466a61636b736f6e25324664622d61667465726275726e65727c6a736f6e253246666173746a736f6e2532466461746162696e647c6d73677061636b2d6461746162696e647c666173742d73657269616c697a6174696f6e7c6b72796f7c70726f746f73747566662663686d3d4e2532302a662a2c3030303030302c302c2d312c3130266c6b6c6b266368646c703d74266368636f3d3636303030307c3636303033337c3636303036367c3636303039397c3636303043437c3636303046467c3636333330307c3636333333337c3636333336367c3636333339397c3636333343437c3636333346467c3636363630307c3636363633337c363636363636266368743d62686726636862683d31302c302c3130266e6f6e73656e73653d6161612e706e67" length="0" 
type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Try Apache Spark's shell using Docker]]></title>
            <link>https://ogirardot.writizzy.com/p/try-apache-sparks-shell-using-docker</link>
            <guid>https://ogirardot.writizzy.com/p/try-apache-sparks-shell-using-docker</guid>
            <pubDate>Thu, 18 Dec 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Ever wanted to try out [Apache Spark](https://spark.apache.org/ "Apache Spark") without actually having to install anything ? Well if you've got [Docker](https://www.docker.com/ "Docker"), I've got a...]]></description>
            <content:encoded><![CDATA[<p>Ever wanted to try out <a href="https://spark.apache.org/" title="Apache Spark">Apache Spark</a> without actually having to install anything ? Well if you&#39;ve got <a href="https://www.docker.com/" title="Docker">Docker</a>, I&#39;ve got a christmas present for you, a Docker image you can pull to try and run Spark commands in the Spark shell REPL. The image has been pushed to the <a href="https://registry.hub.docker.com/u/ogirardot/spark-docker-shell">Docker Hub here</a> and can be easily pulled using Docker.</p>
<p>So exactly what is this image, and how can I use it ?</p>
<p>Well, all you need is to execute these few commands :
</p>
<pre><code class="language-bash">&gt; docker pull ogirardot/spark-docker-shell
</code></pre>
<p>I&#39;ll try to keep this image up-to-date with future releases of Spark, so if you want to test against a specific version, all you have to do is pull (or directly run) the <a href="https://registry.hub.docker.com/u/ogirardot/spark-docker-shell/tags/manage/">image with the corresponding tag</a> like that :
</p>
<pre><code class="language-bash">&gt; docker pull ogirardot/spark-docker-shell:1.1.1
</code></pre>
<p>And then, once Docker has downloaded the full image, using the run command you will have access to a stand-alone <strong>spark-shell</strong> that will allow you to try and learn Spark&#39;s API in a sandboxed environment. Here&#39;s what a correct launch looks like :</p>
<pre><code>&gt; docker run -t -i ogirardot/spark-docker-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark&#39;s default log4j profile: org/apache/spark/log4j-defaults.properties
14/12/11 20:33:14 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:14 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:14 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:14 INFO Utils: Successfully started service &#39;HTTP class server&#39; on port 50535.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  &#39;_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/11 20:33:18 INFO SecurityManager: Changing view acls to: root
14/12/11 20:33:18 INFO SecurityManager: Changing modify acls to: root
14/12/11 20:33:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/12/11 20:33:19 INFO Slf4jLogger: Slf4jLogger started
14/12/11 20:33:19 INFO Remoting: Starting remoting
14/12/11 20:33:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@ea9ec670e429:43346]
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;sparkDriver&#39; on port 43346.
14/12/11 20:33:19 INFO SparkEnv: Registering MapOutputTracker
14/12/11 20:33:19 INFO SparkEnv: Registering BlockManagerMaster
14/12/11 20:33:19 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141211203319-f310
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;Connection manager for block manager&#39; on port 58304.
14/12/11 20:33:19 INFO ConnectionManager: Bound socket to port 58304 with id = ConnectionManagerId(ea9ec670e429,58304)
14/12/11 20:33:19 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/12/11 20:33:19 INFO BlockManagerMaster: Trying to register BlockManager
14/12/11 20:33:19 INFO BlockManagerMasterActor: Registering block manager ea9ec670e429:58304 with 265.4 MB RAM, BlockManagerId(&lt;driver&gt;, ea9ec670e429, 58304, 0)
14/12/11 20:33:19 INFO BlockManagerMaster: Registered BlockManager
14/12/11 20:33:19 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4c832cee-7ed5-470d-9e41-d4a36227d48f
14/12/11 20:33:19 INFO HttpServer: Starting HTTP Server
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;HTTP file server&#39; on port 55020.
14/12/11 20:33:19 INFO Utils: Successfully started service &#39;SparkUI&#39; on port 4040.
14/12/11 20:33:19 INFO SparkUI: Started SparkUI at http://ea9ec670e429:4040
14/12/11 20:33:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/11 20:33:19 INFO Executor: Using REPL class URI: http://172.17.0.15:50535
14/12/11 20:33:19 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ea9ec670e429:43346/user/HeartbeatReceiver
14/12/11 20:33:19 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala&gt;
</code></pre>
<p>Once you reach this <strong>scala</strong> prompt, you&#39;re practically done, and you can use your available <strong>SparkContext</strong> (variable <strong>sc</strong>) with simple examples :</p>
<pre><code class="language-scala">scala&gt; sc.parallelize(1 until 1000).map(_ * 2).filter(_ &lt; 10).reduce(_ + _)
res0: Int = 20
</code></pre>
<p>If you&#39;ve got this right, you&#39;re all set ! Plus, as this is a Scala prompt, using &lt;tab&gt; you&#39;ll have access to all the auto-completion magic a strong type-system can bring you.</p>
<p>So enjoy, take your time and be bold.</p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
        </item>
        <item>
            <title><![CDATA[Apache Spark : Memory management and Graceful degradation]]></title>
            <link>https://ogirardot.writizzy.com/p/apache-spark-memory-management-and-graceful-degradation</link>
            <guid>https://ogirardot.writizzy.com/p/apache-spark-memory-management-and-graceful-degradation</guid>
            <pubDate>Thu, 11 Dec 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Many of the concepts of Apache Spark are pretty straightforward and easy to understand, however some lucky few can be badly misunderstood. One of the greatest misunderstanding of all is the fact that...]]></description>
            <content:encoded><![CDATA[<p>Many of the concepts of Apache Spark are pretty straightforward and easy to understand; however, a lucky few can be badly misunderstood. One of the greatest misunderstandings of all is that some still believe that &quot;<em>Spark is only relevant with datasets that can fit into memory, otherwise it will crash&quot;</em>.</p>
<p>This is a misunderstanding: Spark is easily pigeonholed as &quot;Hadoop using RAM more efficiently&quot;, but that is still a mistake.</p>
<p>Spark does its best, by default, to load the datasets it handles into memory. Still, when the handled datasets are too large to fit into memory, these objects will automatically (or should I say auto-magically) be spilled to disk. This is one of the main features of Spark, coined by the expression &quot;<strong>graceful degradation</strong>&quot;, and it was very well illustrated by these two charts in <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf" title="An Architecture for Fast and General Data Processing on Large Clusters">Matei Zaharia's dissertation : An Architecture for Fast and General Data Processing on Large Clusters</a> :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2014/11/graceful-degradation-spark.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771032971-graceful-degradation-spark.png" alt="Behaviour of Spark with less/more RAM" /></a></p>
<p><em>Behaviour of Spark with less/more RAM, extracted from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf">http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf</a></em></p>
<p>The first chart clearly shows something interesting for us: the behavior of Spark when you give it more or less RAM is pretty much linear in terms of execution time. In other words, the more RAM Spark can use, the quicker your computation will run, but if you give it less and less RAM, in the end Spark will behave like Hadoop, flushing to disk as much as possible.</p>
<p>The second chart is also interesting for debunking the urban legend of &quot;Spark will only work if your datasets fit in RAM&quot;, showing how Spark handles larger and larger datasets; once again its behavior is practically linear between the time the computation takes and the size of the dataset (for a given computation). In the end, not only can Spark handle large datasets, it will gracefully adapt to the amount of memory you give it.</p>
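<p>If you want to make that behaviour explicit in your own code, one related knob is the storage level you pick when caching. Here is a minimal PySpark sketch (not from the original post; the input path is hypothetical) :</p>
<pre><code class="language-python">from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName=&quot;graceful-degradation-demo&quot;)

# hypothetical large dataset
rdd = sc.textFile(&quot;hdfs:///some/large/dataset&quot;)

# MEMORY_AND_DISK keeps as many partitions in RAM as possible
# and spills the remaining ones to local disk instead of failing
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())
</code></pre>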
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771032971-graceful-degradation-spark.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Apache Spark : l'importance du broadcast]]></title>
            <link>https://ogirardot.writizzy.com/p/apache-spark-limportance-du-broadcast</link>
            <guid>https://ogirardot.writizzy.com/p/apache-spark-limportance-du-broadcast</guid>
            <pubDate>Thu, 27 Nov 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[> [Apache Spark](https://spark.apache.org "Apache Spark") est un moteur de calcul distribué visant à remplacer et fournir des APIs de plus haut niveau pour résoudre simplement des problèmes où Hadoop...]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><a href="https://spark.apache.org" title="Apache Spark">Apache Spark</a> is a distributed computation engine that aims to replace Hadoop by providing higher-level APIs to simply solve problems where Hadoop shows its limitations and its complexity.</p>
<p>This post is part of a series of posts on Apache Spark that digs deeper into some notions of the system, from development and optimisation all the way to deployment.</p>
</blockquote>
<p>One of Spark&#39;s main advantages is how well it integrates with the Scala/Java or Python ecosystem. This is even more true in Scala, because the main methods attached to the Spark contexts have the same shape as their Scala counterparts, with a few improvements (and a distributed context on top), e.g. <strong>map, flatMap, filter...</strong></p>
<p>This advantage comes with the drawback that it is important to know which objects/instances you are manipulating, and in which context - Spark or Scala - these objects will be used. If you doubt it, here is a small example that illustrates it well :</p>
<pre><code class="language-scala">val multiplier = 50
val data = sc.parallelize(1 to 10000)
val result = data
  .map( _ * multiplier)
  .filter( _ &gt; 1000 )
  .collect()
  .map( _ / 2 )
  .filter( _ &lt; (20 * multiplier) )
</code></pre>
<p>If we study this deliberately simplistic example, the first two operations, <strong>map</strong> and <strong>filter</strong>, apply to an <strong><a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD" title="Apache Spark - RDD API">RDD[Int]</a></strong> managed by Spark and will therefore run in a parallelised context. This is no longer the case as soon as <strong>collect()</strong> is called, which brings all the data processed by the <strong>workers</strong> back into the memory of the <strong>Spark driver</strong>. The two remaining calls to <strong>map</strong> and <strong>filter</strong> therefore apply to a <a href="http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List" title="Scala API - List"><strong>List[Int]</strong></a> and thus belong to Scala&#39;s Standard Library.</p>
<p>This example has two important properties: it shows the possible confusion between Scala and Spark calls, but above all it shows, through the <strong>multiplier</strong> coefficient, that it is quite easy to use <strong>Scala values inside a closure sent to Spark.</strong></p>
<p>The serialization of Scala closures to the Spark workers deserves an article of its own and is therefore not the subject of this one, but to understand the problem at hand it is enough to know that <strong>each instance of the closure run by a worker will contain a copy of the value it uses.</strong> So if this value corresponds to a somewhat large piece of data, it quickly becomes inefficient and, above all, dangerous for the memory usage of your workers.</p>
<p>Fortunately, Spark comes with two notions of <strong>shared variables</strong>: Accumulators and <strong>Broadcast</strong> variables, and as you will have guessed, it is the latter that comes to our rescue.</p>
<p>Indeed, instead of having as many copies of the value in the closures as there are calls run on the workers, it is possible to use the <strong>broadcast()</strong> function to share this value as read-only and thus have only one copy per node, managed by the system.</p>
<p>This function, however, is only worthwhile for sharing large data sources across the workers, not for our poor little <strong>multiplier</strong> <strong>Int</strong> from the previous example. Here is how to use it :</p>
<pre><code class="language-scala">val largeKeyValuePair: Map[String, String] = ....
// broadcast this variable for workers to use it efficiently
val bdLarge = sc.broadcast(largeKeyValuePair)
val data = sc.parallelize(1 to 10000)
val result = data
  .map( item =&gt; (item, bdLarge.value.get(item.toString)) )
  ...
</code></pre>
<p>To sum up, <strong>broadcast</strong> is there to send a value only once, when it is large enough to be worth it. Now your question must be: &quot;how large exactly?&quot;</p>
<p>UC Berkeley studied the question in the following publication on the <a href="http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf" title="Broadcast performance for Apache Spark">performance of the different broadcasting algorithms between nodes</a>, and to make a long story short, Spark&#39;s standard broadcasting mechanism, <strong>Centralized HDFS Broadcast (CHB for short)</strong>, gives this kind of performance depending on the payload size :</p>
<p><a href="https://ogirardot.wordpress.com/wp-content/uploads/2014/11/spark-broadcast-performance.png"><img src="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771033309-spark-broadcast-performance.png" alt="spark-broadcast-performance" /></a></p>
<br />

<p>If you want to learn more, I organise regular Spark training sessions with Lateral Thoughts and Hopwork; the schedule is available here : <a href="http://www.lateral-thoughts.com/training" title="Lateral Thoughts - Formations">http://www.lateral-thoughts.com/training</a>.</p>
]]></content:encoded>
            <category>apache spark</category>
            <category>bigdata</category>
            <enclosure url="https://writizzy.b-cdn.net/blogs/a62629cd-69f9-4e09-914f-fea827684723/media/1771771033309-spark-broadcast-performance.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Dagger and Play 2 Java]]></title>
            <link>https://ogirardot.writizzy.com/p/dagger-and-play-2-java</link>
            <guid>https://ogirardot.writizzy.com/p/dagger-and-play-2-java</guid>
            <pubDate>Mon, 28 Jul 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[I recently got the occasion of trying out Play 2 in Java and i must say the Play 2 Framwork looks actually really good in Java too.

But, of course... there is a but, one of the few things that strike...]]></description>
            <content:encoded><![CDATA[<p>I recently got the chance to try out Play 2 in Java and I must say the Play 2 Framework actually looks really good in Java too.</p>
<p>But, of course... there is a but. One of the first things that strikes you, and I must say with great intensity, is the <em>mandatory</em> static methods that you must put in your <strong>Controllers</strong> in order to define your routes. Example :</p>
<pre><code class="language-java">// in app/controllers/Application.java
package controllers;

import play.mvc.Controller;
import play.mvc.Result;
import service.CoffeeService;
import views.html.index;

public class Application extends Controller {

    public static Result index() {
        return ok(index.render(&quot;Your application is ready.&quot;));
    }
}
</code></pre>
<p>And with the routes defined as such :
</p>
<pre><code># Home page
GET     /       controllers.Application.index()
</code></pre>
<p>This is relatively great... if you like starting off on the wrong foot. I won&#39;t talk about <strong>modularization</strong> or the <strong>danger of spaghetti-code</strong>, neither will I argue that this is not great for testing <strong>controllers</strong> that will use <strong>services</strong> or any other kind of <strong>external dependencies</strong>.</p>
<p>Luckily, the <strong>Play 2 Framework</strong> people thought long and hard when they designed their systems, and while they won&#39;t force you to use any kind of dependency injection system, they&#39;ll allow you to plug in your preferred choice. This is clearly <a href="http://www.playframework.com/documentation/latest/ScalaDependencyInjection" title="Dependency Injection in Scala with Play2 ">documented here</a>, but this is in Scala, and you might think it&#39;s not available for <strong>Play 2 Java</strong> - and you would be wrong.</p>
<p>So here&#39;s a little example on how to do it with a really great project by the teams at <a href="http://squareup.com" title="Square">Square</a> called <a href="http://square.github.io/dagger/">Dagger</a>. Dagger relies on the annotation processing framework of Java to be able to plug itself as an extra step of the compiler and try, as much as possible, to do dependency injection checks (and maybe more) at compile-time. So let&#39;s try to use it in a simple Java app :
</p>
<pre><code class="language-scala">// in build.sbt - we&#39;ll add the dependency
name := &quot;app&quot;

version := &quot;1.0-SNAPSHOT&quot;

libraryDependencies ++= Seq(
  javaJdbc,
  javaEbean,
  cache,
  &quot;com.squareup.dagger&quot; % &quot;dagger&quot; % &quot;1.2.2&quot;,
  &quot;com.squareup.dagger&quot; % &quot;dagger-compiler&quot; % &quot;1.2.2&quot;
)

play.Project.playJavaSettings
</code></pre>
<pre><code class="language-java">// in app/controllers/Application.java - we&#39;ll inject a simple Service via dagger
package controllers;

import play.mvc.Controller;
import play.mvc.Result;
import service.CoffeeService;
import views.html.index;
import javax.inject.Inject;

public class Application extends Controller {

    private CoffeeService coffeeService;

    @Inject
    public Application(CoffeeService service) {
        coffeeService = service;
    }

    public Result index() {
        return ok(index.render(&quot;Your application &quot; + this.toString() + &quot; is ready. &quot; + coffeeService.toString()));
    }
}
</code></pre>
<p>Finally to make it all work we need to change the routes file and override the &quot;Global&quot; configuration class :
</p>
<pre><code class="language-java">// in app/Global.java - we&#39;ll create this class and override the controller instance creation
import dagger.ObjectGraph;
import module.ProductionModule;
import play.Application;
import play.GlobalSettings;

public class Global extends GlobalSettings {

    private ObjectGraph objectGraph;

    @Override
    public void beforeStart(Application app) {
        super.beforeStart(app);
        objectGraph = ObjectGraph.create(new ProductionModule());
    }

    @Override
    public &lt;A&gt; A getControllerInstance(Class&lt;A&gt; controllerClass) throws Exception {
        return objectGraph.get(controllerClass);
    }
}
</code></pre>
<p>and</p>
<pre><code># Home page
GET     /       @controllers.Application.index()
</code></pre>
<p>The <strong>@controllers.Application.index()</strong> tells the whole system that it now has to create a new instance of the Application controller, and it will get the controller&#39;s instance through the overridden method in Global.</p>
<p>The goal of this article was not to teach you how to use Dagger or Play, but rather to show you how the two of them can work together. If you want to see the whole project, it&#39;s available online at <a href="https://github.com/lateralthoughts/dagger-play-di-example">https://github.com/lateralthoughts/dagger-play-di-example</a>. So if you want to know more, clone the project and play with it. Any feedback would be appreciated.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[How to remove scaladoc generation from Play 2.2.x Production dist]]></title>
            <link>https://ogirardot.writizzy.com/p/how-to-remove-scaladoc-generation-from-play-2-2-x-production-dist</link>
            <guid>https://ogirardot.writizzy.com/p/how-to-remove-scaladoc-generation-from-play-2-2-x-production-dist</guid>
            <pubDate>Tue, 17 Jun 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[After a few hours of searching through the Play 2 documentation, the play-framework google group and other blogs or sources, i finally found this piece of code that i decided to share with you. So if,...]]></description>
            <content:encoded><![CDATA[<p>After a few hours of searching through the Play 2 documentation, the play-framework Google group and other blogs and sources, I finally found this piece of code that I decided to share with you. So if, like me, you wanted to remove the Scaladoc generation and packaging from the <a href="https://www.playframework.com/documentation/2.2.x/ProductionDist">ProductionDist</a> that you can create by running the <strong>play dist</strong> command, then today&#39;s your lucky day. If you have a <strong>build.sbt</strong> file (and you should) in your Play2 app, then all you need to do is add <strong>sources in doc in Compile := List()</strong> inside your file, like that :</p>
<pre><code class="language-scala">import play.Project._

name := &quot;my-web-project&quot;

playScalaSettings

sources in doc in Compile := List()

libraryDependencies ++= Seq(...)
</code></pre>
]]></content:encoded>
            <category>oss</category>
        </item>
        <item>
            <title><![CDATA[Timeoff 2014 @ Lateral Thoughts]]></title>
            <link>https://ogirardot.writizzy.com/p/timeoff-2014-lateral-thoughts</link>
            <guid>https://ogirardot.writizzy.com/p/timeoff-2014-lateral-thoughts</guid>
            <pubDate>Mon, 14 Apr 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[Une fois n'est pas coutume, je commencerais cet article avec une photo de notre dernier [Timeoff LT](https://plus.google.com/photos/112015376042019217159/albums/5991032831405401409 "Timeoff 2014 @ LT"...]]></description>
            <content:encoded><![CDATA[<p>For once, I&#39;ll start this article with a photo from our latest <a href="https://plus.google.com/photos/112015376042019217159/albums/5991032831405401409" title="Timeoff 2014 @ LT">LT Timeoff</a>.</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg?w=650" alt="Image" /></a></p>
<p>It probably sounds cliché to say it, but every timeoff is different, and this one was no exception. I was much more involved in organising the previous ones (the ones I attended :p ), so this time I let myself be guided... and wasn&#39;t disappointed in the least.</p>
<h2>Together</h2>
<p>What I appreciated most during this timeoff was the fact that we were <strong>all together</strong>, and this time I learned a lot while a real momentum built up around a shared project. We pulled off the difficult alchemy of learning a great deal from each other while also shipping <strong>a project with production-grade code</strong>, <strong>and not just a poorly-mastered prototype</strong> built on untested bits of technology :).</p>
<p>As a bonus, I got to enjoy the company of all the people at <a href="http://www.lateral-thoughts.com" title="LateralThoughts">LT</a>, from Lyon or Paris, whom I rarely get to see day to day, and to work with people I like and respect enormously.</p>
<h2>Deep down</h2>
<p>One thing I&#39;m proud of, and it goes beyond the initial goal of <a href="http://www.lateral-thoughts.com">LT</a>, is that this company doesn&#39;t just take experienced people and give them the firepower to do more and better; it also allows juniors, day after day, to take their lives into their own hands and improve themselves.</p>
<p>That may sound a bit pretentious, and I don&#39;t claim that we invest as much as possible in our juniors - I know companies that, from my point of view, &quot;invest more&quot; in the personal development of their employees - but I have to say that I&#39;m simply impressed, when I pause for a moment, by the path travelled and the maturity reached by the different people who joined us: <a href="http://about.me/fbiville">Florent Biville</a>, <a href="https://twitter.com/Le_3K">Nicolas Rey</a>, <a href="http://vincent.cedeela.fr/">Vincent Doba</a>, <a href="https://twitter.com/StuartCorring">Stuart Corring</a> and <a href="https://twitter.com/jonathan_dray">Jonathan Dray</a>.</p>
<p>Some were less junior than others, and everyone moves at their own pace, but the interesting part is that the model works. LT is not a leveraged company: it is not possible to scale up financially speaking, and clearly not possible to get rich (if that&#39;s the goal you want to set for yourself... <a href="https://xkcd.com/559/"><em>no pun intended</em></a>).</p>
<p>Because deep down, a typical consulting firm (SSII) makes money by growing its employees&#39; skills (or simply their billing rate) faster than their salaries. That is where its growth lies, its <strong>operational efficiency</strong>; it thus compensates for the fact that - for it as for us - its revenue is directly proportional to its headcount.</p>
<p>In the <strong>flat</strong>, <strong>non-hierarchical</strong> and <strong>sociocratic</strong> model we have built, this natural &quot;growth&quot; - at constant headcount - has been sacrificed. So what did we gain in exchange for this capital growth?</p>
<p>A purist would say <strong>nothing except extra risk</strong>; personally I would say <strong>an enormous amount of human capital</strong> - whatever one wants to read into that. I&#39;ll stop my financier&#39;s ramblings here, but just to give you the context of this reflection: one of my goals during the timeoff was to manage to value the company using the usual <a href="http://www.vernimmen.net/">corporate finance</a> methods (<a href="http://en.wikipedia.org/wiki/Discounted_cash_flow">DCF</a>, ratio methods, NPV, etc.). I&#39;ll let you imagine the complexity for a company that is &quot;not really&quot; capital-based...</p>
<p>Finally, another thing I&#39;m proud of, and which will probably help you understand that the system works, is that as Lateral Thoughts <strong>we managed to have the financial backbone needed to organise <a href="http://scala.io">Scala.IO</a></strong>, and the experience was so conclusive that we are doing it again this year for the <strong>2014 edition</strong>.</p>
<h2>Learning</h2>
<p>On the learning side, I had the opportunity to deepen my knowledge of the latest version of Spring with <a href="https://spring.io/blog/2013/12/12/announcing-spring-framework-4-0-ga-release">Spring 4</a> and <a href="http://projects.spring.io/spring-boot/">Spring Boot</a>; there are real productivity improvements in there that are worth the detour. The only difficulty I&#39;ve identified so far concerns the integration with Spring Security, but don&#39;t hesitate to form your own opinion.</p>
<p>I also had the pleasure of sharing my knowledge of <a href="http://www.scala-lang.org/">Scala</a> and <strong><a href="http://www.ansible.com/home">Ansible</a></strong>, and of learning a bit more about Angular best practices. If <strong>Ansible</strong> interests you, my <a href="http://cfp.devoxx.fr/devoxxfr2014/talk/RKR-886/Ansible%20in%20action%20-%20le%20provisionning%20au%20bon%20niveau%20d'abstraction">Tools In Action on Ansible</a> was accepted at <a href="http://www.devoxx.fr">DevoxxFR 2014</a>, so don&#39;t hesitate to come and ask me annoying questions on Wednesday, April 16th! On that note,</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>spring</category>
            <category>timeoff</category>
            <category>finance</category>
            <category>lt</category>
            <category>ssii</category>
            <category>uncategorized</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2014/04/dsc_0004-001.jpg?w=650" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Highlighting field in memory-based Lucene indexes]]></title>
            <link>https://ogirardot.writizzy.com/p/highlighting-field-in-memory-based-lucene-indexes</link>
            <guid>https://ogirardot.writizzy.com/p/highlighting-field-in-memory-based-lucene-indexes</guid>
            <pubDate>Mon, 24 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I'm using more and more Lucene these days, and getting in depth on a few subjects, today i'm going to talk to you about how to handle the new Highlighting features available with Lucene 4.1.

One of t...]]></description>
            <content:encoded><![CDATA[<p>I&#39;m using Lucene more and more these days, and getting in depth on a few subjects. Today I&#39;m going to talk to you about how to handle the new Highlighting features available with Lucene 4.1.</p>
<p>One of the main achievements of this new version is the creation of the great <a href="http://lucene.apache.org/core/4_1_0/highlighter/org/apache/lucene/search/postingshighlight/PostingsHighlighter.html">PostingsHighlighter</a>. Michael McCandless wrote a great piece about it in his article <a href="http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html" title="A new Lucene Highlighter is born">A new Lucene highlighter is born</a> and I encourage you to read it if you want to get serious about highlighting using Lucene :).</p>
<p>Now let&#39;s say you want to use it on a <a href="http://lucene.apache.org/core/4_1_0/memory/org/apache/lucene/index/memory/MemoryIndex.html">MemoryIndex</a>; considering the MemoryIndex as the best in-memory index type, handling more than ~500k queries/s and offering the &quot;perfect&quot; <strong>reset()</strong> method, it would be great, right ? But it&#39;s a nice dream, as the MemoryIndex doesn&#39;t store anything about the raw data, so... we need a plan B.</p>
<p>The plan B can be to use the old-fashioned, but still useful, <a href="http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/store/RAMDirectory.html">RAMDirectory</a> index that will still behave like a normal &quot;Directory&quot;-based index and will give you the ability to store the data you need on the field to match. Here is an example on how to use it :
</p>
<pre><code class="language-java">final int MAX_DOCS = 10;
final String FIELD_NAME = &quot;text&quot;;
final Directory index = new RAMDirectory();
final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_41, analyzer);
IndexWriter writer = new IndexWriter(index, writerConfig);

// create document
Document document = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true); // it needs to be stored to be properly highlighted
type.setTokenized(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); // necessary for PostingsHighlighter
document.add(new Field(FIELD_NAME, &quot;this an example of text that must be highlighted&quot;, type));

// add it to the index
writer.addDocument(document);
writer.commit();
writer.close();

Query query = new QueryParser(Version.LUCENE_41, FIELD_NAME, analyzer).parse(&quot;example&quot;);
DirectoryReader directoryReader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(directoryReader);
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, MAX_DOCS);
String[] strings = highlighter.highlight(FIELD_NAME, query, searcher, topDocs);
System.out.println(Arrays.toString(strings));
// expected output : [this an &lt;b&gt;example&lt;/b&gt; of text that must be highlighted]
</code></pre>
<p>I&#39;m honestly considering using <strong>both indexes</strong> right now: querying the MemoryIndex heavily and using the RAMDirectory only when I know there&#39;s a match and I need the highlighting features. Maybe I&#39;m not done digging around this hole and there&#39;s a way to make any highlighter work with the MemoryIndex, but I doubt it, both conceptually and after testing everything I could.</p>
<p>If you think otherwise, and know a way to do so, tell me :)
<em>Vale</em></p>
]]></content:encoded>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[How to test and understand custom analyzers in Lucene]]></title>
            <link>https://ogirardot.writizzy.com/p/how-to-test-and-understand-custom-analyzers-in-lucene</link>
            <guid>https://ogirardot.writizzy.com/p/how-to-test-and-understand-custom-analyzers-in-lucene</guid>
            <pubDate>Thu, 20 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I've began to work more and more with the great "low-level" library [Apache Lucene](https://lucene.apache.org) created by Doug Cutting. For those of you that may not know, Lucene is the indexing and s...]]></description>
            <content:encoded><![CDATA[<p>I&#39;ve begun to work more and more with the great &quot;low-level&quot; library <a href="https://lucene.apache.org">Apache Lucene</a> created by Doug Cutting. For those of you who may not know it, Lucene is the indexing and searching library used by great enterprise search servers like Apache Solr and <a href="http://elasticsearch.org">Elasticsearch</a>.</p>
<p>When you start to index and search data, most of the time you need to create a <em>filtering and cleaning pipeline</em> to transform your raw text data into something more <strong>indexable</strong> and slightly more <strong>standardized</strong>. Such a pipeline may include <strong>lowercasing</strong>, <strong>transforming to ascii</strong> or even <strong>stemming</strong> (transforming &quot;eating&quot; into &quot;eat&quot;). Defining such a pipeline means defining an <strong>Analyzer</strong> in Lucene-world, and while creating a new/custom one is a very easy process, tweaking it to your needs is another thing and needs thorough testing.</p>
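<p>To give you an idea of what such a pipeline looks like, here is a minimal sketch of a custom Analyzer (Lucene 4.1 style) chaining a StandardTokenizer with lowercasing, ASCII folding and Porter stemming; the exact filters you pick are of course up to you :
[code language=&quot;java&quot;]
// minimal sketch of a custom analysis pipeline : tokenize, lowercase, fold to ascii, stem
Analyzer myCustomAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
        TokenStream pipeline = new LowerCaseFilter(Version.LUCENE_41, source);
        pipeline = new ASCIIFoldingFilter(pipeline); // &quot;é&quot; becomes &quot;e&quot;, etc.
        pipeline = new PorterStemFilter(pipeline);   // &quot;eating&quot; becomes &quot;eat&quot;
        return new TokenStreamComponents(source, pipeline);
    }
};
[/code]</p>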
<p>Today&#39;s article is precisely about helping you test your own analyzer, or write a simple test case for Lucene&#39;s built-in analyzers, so you can better understand what they do and why they do it.</p>
<p>Luckily for us, with the latest version, <strong>Apache Lucene 4.1</strong>, we&#39;re not left on our own: Lucene comes with a test framework we can rely on, although it needs a few tricks to work, so here we go :</p>
<p>You need testing, right? So we need to add the dependency <strong>org.apache.lucene:lucene-test-framework</strong> as a maven artifact. But not so fast: the test-framework needs to be declared before <strong>lucene-core</strong>, even though they are in completely different scopes, and you need to use at least maven 2.x because otherwise the classpath order won&#39;t respect the dependency declaration order (what a beautiful world...) :
[code language=&quot;xml&quot;]
&lt;!-- must be before lucene-core for classpath issues --&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
  &lt;artifactId&gt;lucene-test-framework&lt;/artifactId&gt;
  &lt;version&gt;${lucene.version}&lt;/version&gt;
  &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
  &lt;artifactId&gt;lucene-core&lt;/artifactId&gt;
  &lt;version&gt;${lucene.version}&lt;/version&gt;
&lt;/dependency&gt;
[/code]</p>
<p>Now if you want to create a new JUnit test checking the behaviour of an analyzer, you have access to a new base class to extend, called <strong>BaseTokenStreamTestCase</strong>. But the joy of it all is not just being able to write <strong>&quot;public class MyWonderfulTestCase extends BaseTokenStreamTestCase&quot;</strong> and clap your hands: you now have access to a brand new set of assertions (by the way, you need to <strong>enable assertions with the -ea VM argument</strong> to execute these tests) :</p>
<ul>
<li><strong>assertTokenStreamContents</strong>: it allows you to specify the field you&#39;re testing against (otherwise a &quot;dummy&quot; fieldName gets passed to the analyzer) and check the token stream output;</li>
<li><strong>assertAnalyzesTo</strong>: you don&#39;t specify the field on which you&#39;re testing, but it has a simpler syntax.</li>
</ul>
<p>And here is an example of it all in action :
[code language=&quot;java&quot;]
@Test
public void shouldNotAlterKeywordAnalyzed() throws IOException {
    Analyzer myKeywordAnalyzer = new KeywordAnalyzer();
    assertTokenStreamContents(
        myKeywordAnalyzer.tokenStream(&quot;my_keyword_field&quot;, new StringReader(&quot;ISO8859-1 and all that jazz&quot;)),
        new String[] { &quot;ISO8859-1 and all that jazz&quot; });
    assertAnalyzesTo(myKeywordAnalyzer, &quot;ISO8859-1 and all that jazz&quot;,
        new String[] {
            &quot;ISO8859-1 and all that jazz&quot; // a single token output, as expected from the KeywordAnalyzer
        });
}
[/code]
Hope it will help you out making your search engines more reliable :), <em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[Book review : ElasticSearch Server by Rafal Kuc, Marek Rogozinski]]></title>
            <link>https://ogirardot.writizzy.com/p/book-review-elasticsearch-server-by-rafal-kuc-marek-rogozinski</link>
            <guid>https://ogirardot.writizzy.com/p/book-review-elasticsearch-server-by-rafal-kuc-marek-rogozinski</guid>
            <pubDate>Mon, 17 Jun 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[[![ElasticSearch Server - book cover](http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243)](http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789...]]></description>
            <content:encoded><![CDATA[<p><a href="http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1849518440&linkCode=as2&tag=rn0a4-20"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243" alt="ElasticSearch Server - book cover" /></a></p>
<p>I don&#39;t usually do a lot of book reviews, mainly because I rarely finish the books I start... But I decided to finish this one, and I wanted to share my views on it. If you look at the reviews of <a href="http://www.amazon.com/gp/product/1849518440/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1849518440&linkCode=as2&tag=rn0a4-20">ElasticSearch Server on amazon.com</a>, you will get a first opinion that I can only agree with: <strong>this book is not for you if you&#39;re looking for advanced tips and tweaks for ElasticSearch</strong>.</p>
<p>It&#39;s mainly <strong>for beginners</strong> and it will get you through your first fears when facing this versatile piece of technology, but if I were you I&#39;d only consider this book for learning elasticsearch if you have no prior experience with <a href="http://lucene.apache.org/solr/" title="Apache Solr">Apache Solr or Lucene</a>.</p>
<p>It does a good job introducing <strong>indexes</strong> and <strong>mappings</strong> and the fact that even if elasticsearch shields you from the old &quot;Solr - schema.xml&quot; by defining a default mapping for all newly created indexes and types (belonging to an index), this does not prevent you from needing to re-index all data when you realize the mapping you&#39;re using is not exactly... adequate.</p>
<p>The main part of the book, at least the one I&#39;d recommend, is not the <strong>Cluster Administration</strong> or the <strong>Getting started</strong> part, it&#39;s the <strong>Searching your data</strong> chapter. To me, this chapter is a reference for all the <strong>query types</strong> supported by ElasticSearch and can be very useful when you&#39;re trying to figure out what kind of query you need.</p>
<p>All in all it&#39;s not a bad book, and you can keep it over the long term as a reference for the <strong>query DSL</strong> used by ElasticSearch through the HTTP/JSON API, but if you need something to guide you safely into production, you&#39;re better off experimenting by yourself.</p>
<p>Don&#39;t hesitate to tell me what you think ;-)</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/06/8444os.jpg?w=243" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elasticsearch is the way]]></title>
            <link>https://ogirardot.writizzy.com/p/elasticsearch-is-the-way</link>
            <guid>https://ogirardot.writizzy.com/p/elasticsearch-is-the-way</guid>
            <pubDate>Tue, 12 Mar 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[Don't get me wrong, i love [Apache Solr](http://lucene.apache.org/solr/ "Solr"), i think it's a wonderful project and the versions 4.x are definitely something you should check out when building a pro...]]></description>
            <content:encoded><![CDATA[<p>Don&#39;t get me wrong, I love <a href="http://lucene.apache.org/solr/" title="Solr">Apache Solr</a>. I think it&#39;s a wonderful project, and the 4.x versions are definitely something you should check out when building a proper search engine.</p>
<p><strong>But</strong> Elasticsearch, at least for me, is now the way of the future. If you need a few reasons why, read on :</p>
<h2>Out of the box scalability</h2>
<p>SolrCloud is doing a good job of bringing Solr into the Cloud era, because even if Solr supported distributed queries before, sharding had to be done manually...</p>
<p>Elasticsearch scalability is so easy it&#39;s a bit frightening: every time I set up a new elasticsearch &quot;single&quot; server, I deactivate the cluster-search capability as soon as possible, just in case it starts replicating the internet onto my machine ! Sharding/replication is automatic and almost a necessity, because your server (by default) will remind you that you&#39;re a dangerous person for keeping all your data on a single machine, and will stay in a <strong>yellow state</strong> until you start adding some nodes !</p>
<h2>Comprehensive Json-based HTTP search API</h2>
<p>In all honesty the json-based search queries can sometimes become quite complicated and tedious to read, but they are much more powerful than a simple <strong>?q=....</strong> query or the long and complicated list of URL GET parameters you end up using with Solr... So even if there is no proper Chrome extension to create a GET HTTP request with a JSON body (!! add a comment if you find one !!), I still think it&#39;s a blessing to have that kind of query capacity, and it made me rethink elasticsearch&#39;s suitability for complex queries (c.f. the &quot;As complex as Solr&quot; part).</p>
<h2>Rivers...</h2>
<p>Probably one of the best features of Elasticsearch: it&#39;s designed around the fantastic (and true) idea that an Elasticsearch index needs to be fed !</p>
<p><img src="http://funnyasduck.net/wp-content/uploads/2013/02/funny-crazy-cat-feed-me-kill-whole-family-pics.jpg" alt="" /></p>
<p>Just this concept changes everything, because it makes the <strong>&quot;realtime index&quot;</strong> the default type of index. Nowadays what matters most is having an up-to-date search index, and Near-Realtime search is one of the many advantages that make Solr and Elasticsearch the best choices out there.</p>
<h2>Vibrant community and plugins</h2>
<p>Probably the most important part in my opinion: I do think the Solr ecosystem lacks good tools and plugins to leverage more of its power. <a href="https://code.google.com/p/luke/" title="Luke">Luke</a> is a pretty useful tool, but it&#39;s very lucene-centric, apart from the solr-provided tools (which are, I must say, sufficient for a lot of troubleshooting and debugging). I&#39;ve been on Solr 3.x for a long time, and even if all the tools were there, the UI certainly lacked in terms of &quot;sexy&quot;; nowadays Solr 4.x&#39;s UI is certainly more sexy and a pleasure to work with, but it&#39;s still only the work of Lucidworks.</p>
<p>Elasticsearch is brand new, the documentation is sexy, the project is sexy, they built a wonderful plugin system that <strong>uses github directly !! You don&#39;t have to be a fully accredited &quot;Elasticsearch-compliant plugin creator&quot; to publish your project</strong>.</p>
<p>So a lot of people created wonderful plugins that already go beyond what you can use in the Solr/Lucene world, just a quick review :</p>
<ul>
<li><p><a href="https://github.com/karmi/elasticsearch-paramedic" title="Paramedic">Paramedic</a> : a &quot;simple and sexy tool to monitor and inspect elasticsearch clusters&quot;;</p>
</li>
<li><p><a href="https://github.com/mobz/elasticsearch-head" title="Head">Head</a> : &quot;A web front end for an ElasticSearch cluster&quot; with a real-time dashboard;</p>
</li>
<li><p><a href="https://github.com/lukas-vlcek/bigdesk" title="BigDesk">BigDesk</a> : Live charts and statistics for Elasticsearch cluster;</p>
</li>
<li><p>For analysis, you have <a href="https://github.com/polyfractal/elasticsearch-inquisitor" title="Inquisitor">Inquisitor</a> to help you understand and debug your queries in ElasticSearch, and <a href="https://github.com/polyfractal/elasticsearch-segmentspy" title="SegmentSpy">SegmentSpy</a> to watch segments merging and changing in real time.</p>
</li>
</ul>
<p>This is just the state of the art right now, but I can&#39;t imagine it going anywhere but forward.</p>
<h2>As complex as Solr</h2>
<p>Finally, I was prejudiced, because I thought that the goals of Elasticsearch in terms of scalability were clearly ambitious (and deeply needed !), but that this kind of scalability obviously came at a cost, and that therefore there would be fewer features than what Solr offered (e.g. <a href="http://wiki.apache.org/solr/DisMax" title="Dismax">Dismax</a> queries).</p>
<p>But <strong>I was wrong</strong>, as I discovered recently that Dismax queries, fuzzy matching and other goodies allowing many things, from <em>boosted fields at query time</em> to <em>boosted sub-queries</em>, are available and easily accessible through the Elasticsearch API. So the proper section name should not be &quot;As complex as Solr&quot; but <strong>&quot;As versatile as Solr&quot;.</strong></p>
<p>I hope I made my point, and if you&#39;re considering building a BigData-ready search engine right now, make sure to check out <a href="http://elasticsearch.org">Elasticsearch</a> or you&#39;ll be missing out on a great product.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>bigdata</category>
            <category>data</category>
            <enclosure url="http://funnyasduck.net/wp-content/uploads/2013/02/funny-crazy-cat-feed-me-kill-whole-family-pics.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Stay in your place and do as you're told.]]></title>
            <link>https://ogirardot.writizzy.com/p/reste-a-ta-place-et-fais-ce-quon-te-dit</link>
            <guid>https://ogirardot.writizzy.com/p/reste-a-ta-place-et-fais-ce-quon-te-dit</guid>
            <pubDate>Fri, 01 Feb 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[I'm not the most seasoned of veterans, and I'm reminded of it often enough to know that I still have *senpais* in more than one domain (not just technical), some of whom I'm lucky enough to work with...]]></description>
            <content:encoded><![CDATA[<p>I&#39;m not the most seasoned of veterans, and I&#39;m reminded of it often enough to know that I still have <em>senpais</em> in more than one domain (not just technical), some of whom I&#39;m lucky enough to work with, even if it&#39;s not always on a day-to-day basis.</p>
<p>Over my years of work, in banking, in an SSII (a French IT consulting firm), and in other companies, there is only one constant I can really make out: in each of these situations, <strong>someone expected something from me</strong>.</p>
<p>You&#39;ll tell me there&#39;s nothing unusual about that: when you hire someone, it&#39;s rarely (although...) for their blatant uselessness. But that&#39;s not where I&#39;m going with this, as you&#39;ll quickly see :</p>
<ul>
<li>At the bank, I was expected to maintain an application for the trading floor, but not to propose innovations or take the time to better understand the business;</li>
<li>At the SSII (working in another bank), I was expected to maintain a capital-markets application, but once again no innovation was possible there and, worse, to this day I still don&#39;t know what even the shadow of my users looked like...;</li>
<li>As an R&amp;D engineer, I was invited to innovate, but only in the required direction (defined from above and unknown to this day... since it changed every month), without being given the time to think or to learn the business better, all of it under a latent, permanent sense of urgency.</li>
</ul>
<p>You may have already figured out where I&#39;m going with this. When someone expected something from me, <strong>they expected only one thing: that I stay in my place and do what my role dictates, and only that.</strong></p>
<p>If you think about it, this precept lets you build a very simple world. I&#39;d call it <strong>the modularization of the company</strong>: everyone has one role and one only, stays in that role, and the most important thing in the company becomes making sure that the little kingdoms controlled by each person&#39;s role never touch. It&#39;s the principle of the specialized factory worker applied to intellectual work.</p>
<p>But the most serious part is that, in today&#39;s world, we have internalized what our parents said far more violently around &#39;68: there is no longer any room in French society for young people. Companies have created neatly segmented roles in which we are allowed to wallow, <strong>but, as a young person, our only current option for growing and moving forward is to change companies</strong>, with the well-known downside :</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg" alt="" /></a></p>
<p>It hasn&#39;t always been this way. In my grandfather&#39;s time, you were carried along by the company you joined: it trusted him, challenged him, helped him improve and build himself up, and eventually let him rise within it. More recently, in our parents&#39; time, baby boom obliging, many hierarchical levels were created, not so much out of necessity (for efficiency), but rather through the <strong>modularization of the company</strong>, and above all to avoid conflict (our parents&#39; generation remains, after all, the one that, without having fought the war, called its own parents Nazis...). It was more or less the beginning of what we would now call &quot;petty bosses&quot;.</p>
<p>Only, this world is dangerous: it destroys young people&#39;s creativity, leaves innovation to an elite without the skills (the famous myth of <em>&quot;If anyone can change things, it&#39;s him/her (well, usually him anyway...)&quot;</em>), and gradually settles us into the role the United States has been giving us for years, that of the <strong>&quot;Old Europe&quot;</strong> or the <strong>Museum Europe</strong> that lives only on its past achievements.</p>
<p>What I love about building a <a href="http://www.lateral-thoughts.com" title="LateralThoughts">NoSSII like LateralThoughts</a> is working every day to break out of this destructive pattern, to put the ability to innovate back into everyone&#39;s hands, to free up time to think and improve, and to carry everyone&#39;s projects forward. We often say that a good idea has no political party; far too often, though, it has a hierarchical level...</p>
<p>And you, do you have a good idea ?</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>techzone</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/02/jxc4exee0kkvshm4u1avjg2.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Sharing PyPi/Maven dependency data]]></title>
            <link>https://ogirardot.writizzy.com/p/sharing-pypimaven-dependency-data</link>
            <guid>https://ogirardot.writizzy.com/p/sharing-pypimaven-dependency-data</guid>
            <pubDate>Thu, 31 Jan 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[As time is always running out, i don't think i'll have the time in a while to work again on the data I collected for the last three articles, [Going offline with Maven](http://ogirardot.wordpress.com/...]]></description>
            <content:encoded><![CDATA[<p>As time is always running out, I don&#39;t think I&#39;ll have the time for a while to work again on the data I collected for the last three articles: <a href="http://ogirardot.wordpress.com/2013/01/14/going-offline-with-maven/" title="Going offline with Maven">Going offline with Maven</a>, <a href="http://ogirardot.wordpress.com/2013/01/11/state-of-the-mavenjava-dependency-graph/" title="State of the maven/java dependency graph">State of the Maven/Java dependency graph</a> and <a href="http://ogirardot.wordpress.com/2013/01/05/state-of-the-pythonpypi-dependency-graph/" title="State of the PyPi/Python dependency graph">State of the PyPi/Python dependency graph</a>.</p>
<p>Since it took me a long time to build these datasets, and even though they were already available in the github project, I want to make them properly public and define their metadata so anyone can reuse them freely. The only licence I&#39;m putting on them is <a href="http://creativecommons.org/licenses/by/2.0/">Creative Commons Attribution</a>, so you&#39;re free to use them, adapt them, publish work based on them, or use them for commercial purposes, as long as you mention me (<em>Olivier Girardot &lt;o.girardot (at) lateral-thoughts.com&gt;</em>) as the author.</p>
<p>The dataset is divided into three files, compressed using LZMA :</p>
<h4><a href="https://github.com/ssaboum/meta-deps/blob/master/mvn-deps.csv.lzma">mvn-deps.csv.lzma</a> and <a href="https://github.com/ssaboum/meta-deps/blob/master/mvn-minimal-deps.csv.lzma" title="mvn-minimal-deps.csv.lzma">mvn-minimal-deps.csv.lzma</a></h4>
<p><strong>mvn-deps</strong> consists of all the Maven artifacts extracted from Maven central repositories, and <strong>mvn-minimal-deps</strong> is the minimal set of dependencies you need for <a href="http://ogirardot.wordpress.com/2013/01/14/going-offline-with-maven/" title="Going offline with Maven">going offline with Maven</a>. Once uncompressed, both files are simple <strong>tab-separated csv documents</strong> with the following columns :</p>
<ul>
<li>artifactId</li>
<li>groupId</li>
<li>version</li>
<li>dependencies: <strong>a base64-encoded json string with the following keys: artifactId, groupId, version</strong>, e.g. {&#39;artifactId&#39;: &#39;log4j&#39;, &#39;groupId&#39;: &#39;log4j&#39;, &#39;version&#39;: &#39;1.0.3&#39;}</li>
</ul>
<h4><a href="https://github.com/ssaboum/meta-deps/blob/master/pypi-deps.csv.lzma" title="pypi-deps.csv.lzma">pypi-deps.csv.lzma</a></h4>
<p><strong>pypi-deps</strong> consists of all the PyPi dependencies. Once again it&#39;s a <strong>tab-separated csv document</strong> with the following columns :</p>
<ul>
<li>name</li>
<li>version</li>
<li>dependencies: <strong>a base64-encoded json string with the following keys: name, version</strong>, e.g. {&#39;name&#39;: &#39;numpy&#39;, &#39;version&#39;: &#39;1.6.2&#39;}</li>
</ul>
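<p>If you just want to peek at the raw data, here is a minimal sketch in Java that reads the uncompressed mvn-deps.csv and decodes its dependencies column (the pypi file can be read the same way with its own columns; it assumes you have already uncompressed the file with an LZMA tool, and it uses javax.xml.bind.DatatypeConverter for the base64 decoding, but any decoder will do) :
[code language=&quot;java&quot;]
// columns : artifactId, groupId, version, dependencies (base64-encoded json)
BufferedReader reader = new BufferedReader(new FileReader(&quot;mvn-deps.csv&quot;));
String line;
while ((line = reader.readLine()) != null) {
    String[] columns = line.split(&quot;\t&quot;);
    if (columns.length &lt; 4) {
        continue; // artifact without any dependencies
    }
    String dependencies = new String(DatatypeConverter.parseBase64Binary(columns[3]), &quot;UTF-8&quot;);
    System.out.println(columns[1] + &quot;:&quot; + columns[0] + &quot;:&quot; + columns[2] + &quot; -&gt; &quot; + dependencies);
}
reader.close();
[/code]</p>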
<p>An example of how to process this file and turn it into a <a href="http://networkx.github.com/" title="Networkx">networkx</a> graph is available in the <a href="https://github.com/ssaboum/meta-deps/blob/master/PyPi%20Metadata.ipynb">github project&#39;s IPython notebook</a>, which you need to download as a raw file to use with IPython.</p>
<p>Following <a href="http://www.hilarymason.com/" title="Hilary Mason's blog">Hilary Mason</a>&#39;s post on <a href="http://www.hilarymason.com/blog/startups-how-to-share-data-with-academics/" title="Sharing data with academics">sharing data with academics</a>, I&#39;d be glad to see some publications use these datasets; if any do, please feel free to comment on this blog post with a link to your remixed work.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>oss</category>
            <category>python</category>
            <category>java</category>
        </item>
        <item>
            <title><![CDATA[Going offline with Maven]]></title>
            <link>https://ogirardot.writizzy.com/p/going-offline-with-maven</link>
            <guid>https://ogirardot.writizzy.com/p/going-offline-with-maven</guid>
            <pubDate>Mon, 14 Jan 2013 00:00:00 GMT</pubDate>
            <description><![CDATA[At [Lateral-Thoughts](http://www.lateral-thoughts.com "LT"), we organize at least once a year, what we call a "Timeoff" where we get together in a nice place and hack on what we want. It can be a lear...]]></description>
            <content:encoded><![CDATA[<p>At <a href="http://www.lateral-thoughts.com" title="LT">Lateral-Thoughts</a>, we organize, at least once a year, what we call a &quot;Timeoff&quot;, where we get together in a nice place and hack on whatever we want. It can be a learning period or a <a href="http://startupweekend.org/" title="Startup Weekend">startup weekend</a>-like event where we hack on a product/idea. <a href="http://ogirardot.wordpress.com/2012/09/13/on-devrait-toujours-travailler-comme-ca-hackatonlt/" title="We should always work like that (French)">Last time</a> it was in a nice house in <a href="http://goo.gl/maps/QVqpu" title="Guérande, Loire Atlantique, France">Guérande</a> where we had everything we needed: <strong>internet access</strong>, rooms, tables, lots of space, an indoor swimming pool and a barbecue !</p>
<p>But when you want to find a nice place in France, it&#39;s not always easy to also get good or even decent <strong>internet access</strong>. So, as we&#39;re beginning to plan the next event right now, we asked ourselves: what could we do if there was no internet access ? Is there a way to plan for what we would need, so that we wouldn&#39;t suffer from having no contact with the outside world :). But in a Java/Python environment, where you use Maven and PyPi a lot, when you don&#39;t know what you&#39;ll be working on, the one thing you can&#39;t (and <strong>shouldn&#39;t</strong>) plan is the <strong>libraries/dependencies you&#39;ll need.</strong></p>
<p>So what do we do ? The simplest way is to download all the dependencies you can from a Maven repository, but that seems like the most inefficient way ever, and at more than 30Gb of data per repository, it can take a while...</p>
<p>In the <a href="http://ogirardot.wordpress.com/2013/01/11/state-of-the-mavenjava-dependency-graph/" title="State of the Maven ecosystem">last article</a> I extracted all the libs&#39; metadata and dependency links, so we know what depends on what. In order to be more efficient when creating a copied repository, I decided to use those metadata according to two simple rules :</p>
<ul>
<li><strong>Only keep the latest version of artifacts;</strong></li>
<li><strong>And keep the artifact versions that are needed by other artifacts in their latest versions</strong> (see the sketch just after this list).</li>
</ul>
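<p>Expressed as code, the two rules boil down to something like this minimal sketch (the two maps, the hypothetical loaders and the string-based version ordering are simplifying assumptions for illustration only) :
[code language=&quot;java&quot;]
// hypothetical inputs :
//   versionsByArtifact : &quot;groupId:artifactId&quot; -&gt; its versions, sorted so that last() is the latest
//   dependenciesOf     : &quot;groupId:artifactId:version&quot; -&gt; the artifact versions it depends on
Map&lt;String, TreeSet&lt;String&gt;&gt; versionsByArtifact = loadVersions();   // hypothetical loader
Map&lt;String, Set&lt;String&gt;&gt; dependenciesOf = loadDependencies();       // hypothetical loader

Set&lt;String&gt; keep = new HashSet&lt;String&gt;();
for (Map.Entry&lt;String, TreeSet&lt;String&gt;&gt; entry : versionsByArtifact.entrySet()) {
    String latest = entry.getKey() + &quot;:&quot; + entry.getValue().last();
    keep.add(latest);                                 // rule 1 : only the latest version of each artifact
    Set&lt;String&gt; needed = dependenciesOf.get(latest);
    if (needed != null) {
        keep.addAll(needed);                          // rule 2 : plus whatever those latest versions depend on
    }
}
[/code]</p>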
<p>With those simple rules, we can create a &quot;minimum&quot; repository containing only what we would need to start a new project :). The data I extracted is not perfect, so don&#39;t take my word on it; this is a first draft of a work I (or someone else) may continue. The result is a simpler graph containing only <strong>25 553 nodes and 52 916 edges</strong> (compared to the <strong>186 384 nodes and 1 229 083 edges</strong> of the full repository), one we can almost comprehend :</p>
<p><a href="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-mvn-deps.pdf"><img src="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-deps-mvn-light.png?w=640" alt="Light version of full-compact maven dependencies - Click to get pdf" /></a></p>
<p>The full pdf file, almost as good as the svg version (without the 24Mb overhead), is available for download just by clicking on the picture. But if you need the data because, just like us, you may have to go off the grid, the raw csv file is available on <a href="https://github.com/ogirardot/meta-deps/raw/master/mvn-minimal-deps.csv.lzma" title="Maven minimal dependencies">GitHub here</a>. It&#39;s a simple CSV file compressed with LZMA; its columns are <em>groupId, artifactId, version, dependencies</em>, <strong>dependencies</strong> being a base64-encoded json dict. Hoping you&#39;ll enjoy this.</p>
<p><em>Vale</em></p>
]]></content:encoded>
            <category>java</category>
            <enclosure url="http://ogirardot.wordpress.com/wp-content/uploads/2013/01/full-graph-limited-deps-mvn-light.png?w=640" length="0" type="image/png"/>
        </item>
    </channel>
</rss>